All queries used the same production-grade prompt, similar to those in automated insulin delivery apps. The results were staggering: 26,904 answers in total, with no two repeated answers for the same model and image exactly alike, and some with life-threatening swings in carb and insulin estimates.
For instance, on a single paella photo, Gemini 2.5 Pro’s carb guesses ranged from 55g to 484g, a 429g spread, which translates to a 42.9-unit difference in insulin at a standard 1:10 insulin-to-carb ratio. Even the most “stable” model, Claude Sonnet 4.6, still made consistent but wrong guesses on simple foods like a cheese sandwich, underestimating by 12g every time.
AI uncertainty in carbohydrate estimation isn’t just academic; it creates real risk in insulin dosing.
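To make that arithmetic concrete, here is a minimal Python sketch (illustrative only) showing how the paella carb spread above converts into an insulin dose spread at a 1:10 insulin-to-carb ratio:

carb_low_g, carb_high_g = 55, 484     # observed spread for the paella photo (grams)
grams_per_unit = 10                   # standard 1:10 insulin-to-carb ratio
dose_low_u = carb_low_g / grams_per_unit
dose_high_u = carb_high_g / grams_per_unit
print(f"Dose range: {dose_low_u:.1f}U to {dose_high_u:.1f}U "
      f"(spread: {dose_high_u - dose_low_u:.1f}U)")   # 5.5U to 48.4U, a 42.9U spread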
Why AI Variability and Misplaced Confidence Are Dangerous in Healthcare
Why does this happen? Large language models (LLMs) and vision APIs are not deterministic calculators. Their answers depend on internal randomness, context ambiguity, and the stochastic nature of their training. Even with the “temperature” parameter set to its lowest value, these models can produce widely different answers to the exact same input.
But the risk goes deeper. In the carb counting study, all models returned a “confidence score” for every answer, usually in the 0.8-0.9 range. However, these confidence ratings had no correlation with actual correctness; in some cases, higher confidence meant less accuracy. For example, the correlation between Claude Sonnet 4.6’s confidence and its real accuracy was r = -0.01, and Gemini models always reported high confidence regardless of wild answer swings.
AI confidence scores look reassuring, but in practice, high-confidence errors are common and invisible to users.
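For readers who want to run the same kind of calibration check on their own query logs, here is a minimal sketch; the confidence and error values below are hypothetical placeholders, not figures from the study:

import numpy as np

# Hypothetical per-query log: model-reported confidence and absolute carb error (grams)
confidences = np.array([0.85, 0.90, 0.88, 0.82, 0.91, 0.87])
abs_errors_g = np.array([12.0, 40.0, 8.0, 55.0, 3.0, 27.0])

# A well-calibrated model should show a clearly negative correlation
# (higher confidence should mean smaller error); the study found r near zero.
r = np.corrcoef(confidences, abs_errors_g)[0, 1]
print(f"Confidence-error correlation: r = {r:.2f}")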
Food identification was also a weak point. In the benchmark, models misidentified common foods (“Bakewell tart” as “Linzer torte,” or hallucinating non-existent “deli meat”) in over half the images, compounding errors in carb estimation. These inaccuracies are not rare; they are systematic. As noted in the original Diabettech study:
Claude Sonnet 4.6: Consistent but sometimes “precisely wrong”, e.g., always underestimates a cheese sandwich by 12g.
GPT-5.4: Highly variable and prone to overestimation, with a mean error of +1.2 insulin units per meal on reference foods.
Gemini Pro models: Prone to both wide variability and high-confidence misidentification.
Practical Lessons: How (Not) to Use AI for Diabetes and Medical Decisions
For people with diabetes, these findings are not just technical warnings; they are potential life-or-death safety issues. If an AI-powered app delivers a single carb estimate from a photo, users have no way to know whether that answer is typical, an outlier, or just plain wrong. The risk is not only chronic (systematic bias) but also acute: a single unlucky query could result in a severe insulin overdose, risking hypoglycemia.
Best practices, based on the study and clinical guidance, include:
Never trust a single answer from an AI carb counter for insulin dosing.
Query multiple times and look at the spread of answers. If results are inconsistent, treat the model as uncertain, even if it reports high confidence (a minimal aggregation sketch follows below this list).
Always check the ingredients the AI “sees” in your food photo. Misidentification leads to huge errors.
Understand that consistency (the same wrong answer every time) is not the same as safety; accuracy and well-calibrated uncertainty matter more.
The UK Diabetes Technology Network has already formally advised that generic LLMs must never be used as autonomous calculators for insulin delivery (source).
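As a concrete illustration of the "query multiple times" practice above, here is a minimal aggregation sketch; it assumes a carb_estimate() helper like the one shown later in this article, and the 2-unit threshold mirrors the study’s definition of a dangerous dosing error:

import statistics

def aggregate_carb_estimates(image_path, prompt, n_queries=5, grams_per_unit=10):
    # Query the model several times and report the spread, not a single value.
    estimates = [carb_estimate(image_path, prompt)[0] for _ in range(n_queries)]
    median_g = statistics.median(estimates)
    spread_g = max(estimates) - min(estimates)
    # Flag the result if the spread alone could shift the dose by more than 2 units.
    dose_spread_u = spread_g / grams_per_unit
    return {"median_g": median_g, "spread_g": spread_g,
            "dose_spread_u": dose_spread_u, "flag_unreliable": dose_spread_u > 2}

A flagged result should prompt the user to fall back on manual carb counting rather than trust any single number.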
Comparison: AI Carb Counting Model Variability & Risk
| Model | Median Variation (CV) | Median Insulin Swing (U) | Worst-Case Insulin Swing (U) | % of Dangerous Queries† | Source |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 2.4% | 0.9 | 13.6 | 0% | Diabettech 2026 |
| GPT-5.4 | 8.4% | 2.3 | 16.6 | 37% | Diabettech 2026 |
| Gemini 3.1 Pro | 10.3% | 2.9 | 16.2 | 12% | Diabettech 2026 |
| Gemini 2.5 Pro | 11.0% | 4.7 | 42.9 | 12% | Diabettech 2026 |
† “Dangerous” defined as queries producing an insulin dosing error of more than 2U, which can lead to hypoglycemia. See the Diabettech study for full details.
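For reference, the coefficient of variation (CV) reported in the table is simply the standard deviation of the repeated estimates divided by their mean; a minimal sketch with made-up numbers:

import statistics

estimates_g = [38, 41, 36, 44, 39]    # hypothetical repeated carb estimates (grams)
cv = statistics.pstdev(estimates_g) / statistics.mean(estimates_g)
print(f"CV = {cv:.1%}")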
Architecture Diagram: Carb Counting with AI, Where the Errors Happen
To understand how this risk emerges in practice, here’s a real-world data flow:
# Example: querying an AI vision model for carb estimation.
# Assumes the prompt asks the model to reply with JSON such as
# {"carb_estimate": 45, "confidence": 0.85}; the model name follows the study.
import base64
import json
from openai import OpenAI

client = OpenAI()

def carb_estimate(image_path, prompt):
    with open(image_path, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-5.4-vision",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ]}],
    )
    result = json.loads(response.choices[0].message.content)
    return result["carb_estimate"], result["confidence"]

# Note: production use should query multiple times, aggregate the spread, and always
# verify ingredient identification. This code omits logic for handling model
# variability, user safety checks, and food misidentification.
carbs, conf = carb_estimate("cheese_sandwich.jpg",
                            "Estimate carbs for this meal. Return JSON with carb_estimate and confidence.")
print(f"Estimated carbs: {carbs}g, Model confidence: {conf}")
Critical: As shown in the study, a single answer from this code could be dangerously wrong. In production, always aggregate multiple queries and never automate insulin dosing based solely on a single AI estimate.
Key Takeaways
AI models for carb counting produce highly variable and sometimes dangerous results, even for the same input.
Model confidence scores are poorly calibrated and cannot be used to filter out errors.
Food misidentification is common and leads to major estimation errors.
Current LLMs and vision APIs are not suitable for autonomous medical decision-making in regulated domains.
Always use AI as an assistive tool, not as a replacement for human clinical judgment in high-stakes scenarios like diabetes care.
For more production-ready AI architecture patterns that address reliability and safety, see our deep dive on RAG systems for knowledge-aware AI. For further reading on the original study and dataset, visit Diabettech and the open data repository.
As the global AI market surges ahead and new benchmarks are set almost monthly (see our coverage of the US-China AI model race), the lesson for healthcare and other safety-critical domains is clear: accuracy alone is not enough; consistency, transparency, and robust human oversight are essential.
Sources and References
This article was researched using a combination of primary and supplementary sources:
Primary Source
This is the main subject of the article. The post analyzes and explains concepts from this source.