All queries used the same production-grade prompt, similar to those in automated insulin delivery apps. The results were staggering: 26,904 answers in total, with no two repeated answers for the same model and image exactly alike, and some with life-threatening swings in carb and insulin estimates.
For instance, on a single paella photo, Gemini 2.5 Pro’s carb guesses ranged from 55g to 484g, a 429g spread, which translates to a 42.9-unit difference in insulin at a standard 1:10 insulin-to-carb ratio. Even the most “stable” model, Claude Sonnet 4.6, still made consistent but wrong guesses on simple foods like a cheese sandwich, underestimating by 12g every time.
AI uncertainty in carbohydrate estimation isn’t just academic; it creates real risk in insulin dosing.
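To make that arithmetic concrete, here is a minimal Python sketch (illustrative only) showing how the paella carb spread above converts into an insulin dose spread at a 1:10 insulin-to-carb ratio:

carb_low_g, carb_high_g = 55, 484     # observed spread for the paella photo (grams)
grams_per_unit = 10                   # standard 1:10 insulin-to-carb ratio
dose_low_u = carb_low_g / grams_per_unit
dose_high_u = carb_high_g / grams_per_unit
print(f"Dose range: {dose_low_u:.1f}U to {dose_high_u:.1f}U "
      f"(spread: {dose_high_u - dose_low_u:.1f}U)")   # 5.5U to 48.4U, a 42.9U spread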
Why AI Variability and Misplaced Confidence Are Dangerous in Healthcare
Why does this happen? Large language models (LLMs) and vision APIs are not deterministic calculators. Their answers depend on internal randomness, context ambiguity, and the stochastic nature of their training. Even with the “temperature” parameter set to its lowest value, these models can produce widely different answers to the exact same input.
But the risk goes deeper. In the carb counting study, all models returned a “confidence score” for every answer, usually in the 0.8-0.9 range. However, these confidence ratings had no correlation with actual correctness; in some cases, higher confidence meant less accuracy. For example, the correlation between Claude Sonnet 4.6’s confidence and its real accuracy was r = -0.01, and Gemini models always reported high confidence regardless of wild answer swings.
AI confidence scores look reassuring, but in practice, high-confidence errors are common and invisible to users.
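For readers who want to run the same kind of calibration check on their own query logs, here is a minimal sketch; the confidence and error values below are hypothetical placeholders, not figures from the study:

import numpy as np

# Hypothetical per-query log: model-reported confidence and absolute carb error (grams)
confidences = np.array([0.85, 0.90, 0.88, 0.82, 0.91, 0.87])
abs_errors_g = np.array([12.0, 40.0, 8.0, 55.0, 3.0, 27.0])

# A well-calibrated model should show a clearly negative correlation
# (higher confidence should mean smaller error); the study found r near zero.
r = np.corrcoef(confidences, abs_errors_g)[0, 1]
print(f"Confidence-error correlation: r = {r:.2f}")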
Food identification was also a weak point. In the benchmark, models misidentified common foods (“Bakewell tart” as “Linzer torte,” or hallucinating non-existent “deli meat”) in over half the images, compounding errors in carb estimation. These inaccuracies are not rare; they are systematic. As noted in the original Diabettech study:
Claude Sonnet 4.6: Consistent but sometimes “precisely wrong”, e.g., always underestimates a cheese sandwich by 12g.
GPT-5.4: Highly variable and prone to overestimation, with a mean error of +1.2 insulin units per meal on reference foods.
Gemini Pro models: Prone to both wide variability and high-confidence misidentification.
Practical Lessons: How (Not) to Use AI for Diabetes and Medical Decisions
For people with diabetes, these findings are not just technical warnings; they are potential life-or-death safety issues. If an AI-powered app delivers a single carb estimate from a photo, users have no way to know whether that answer is typical, an outlier, or just plain wrong. The risk is not only chronic (systematic bias) but also acute: a single unlucky query could result in a severe insulin overdose, risking hypoglycemia.
Best practices, based on the study and clinical guidance, include:
Never trust a single answer from an AI carb counter for insulin dosing.
Query multiple times and look at the spread of answers. If results are inconsistent, treat the model as uncertain, even if it reports high confidence (a minimal aggregation sketch follows below this list).
Always check the ingredients the AI “sees” in your food photo. Misidentification leads to huge errors.
Understand that consistency (the same wrong answer every time) is not the same as safety; accuracy and well-calibrated uncertainty matter more.
The UK Diabetes Technology Network has already formally advised that generic LLMs must never be used as autonomous calculators for insulin delivery (source).
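As a concrete illustration of the "query multiple times" practice above, here is a minimal aggregation sketch; it assumes a carb_estimate() helper like the one shown later in this article, and the 2-unit threshold mirrors the study’s definition of a dangerous dosing error:

import statistics

def aggregate_carb_estimates(image_path, prompt, n_queries=5, grams_per_unit=10):
    # Query the model several times and report the spread, not a single value.
    estimates = [carb_estimate(image_path, prompt)[0] for _ in range(n_queries)]
    median_g = statistics.median(estimates)
    spread_g = max(estimates) - min(estimates)
    # Flag the result if the spread alone could shift the dose by more than 2 units.
    dose_spread_u = spread_g / grams_per_unit
    return {"median_g": median_g, "spread_g": spread_g,
            "dose_spread_u": dose_spread_u, "flag_unreliable": dose_spread_u > 2}

A flagged result should prompt the user to fall back on manual carb counting rather than trust any single number.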
Comparison: AI Carb Counting Model Variability & Risk
| Model | Median Variation (CV) | Median Insulin Swing (U) | Worst-Case Insulin Swing (U) | % of Dangerous Queries† | Source |
|---|---|---|---|---|---|
| Claude Sonnet 4.6 | 2.4% | 0.9 | 13.6 | 0% | Diabettech 2026 |
| GPT-5.4 | 8.4% | 2.3 | 16.6 | 37% | Diabettech 2026 |
| Gemini 3.1 Pro | 10.3% | 2.9 | 16.2 | 12% | Diabettech 2026 |
| Gemini 2.5 Pro | 11.0% | 4.7 | 42.9 | 12% | Diabettech 2026 |
† “Dangerous” defined as queries producing an insulin dosing error of more than 2U, which can lead to hypoglycemia. See the Diabettech study for full details.
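For reference, the coefficient of variation (CV) reported in the table is simply the standard deviation of the repeated estimates divided by their mean; a minimal sketch with made-up numbers:

import statistics

estimates_g = [38, 41, 36, 44, 39]    # hypothetical repeated carb estimates (grams)
cv = statistics.pstdev(estimates_g) / statistics.mean(estimates_g)
print(f"CV = {cv:.1%}")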
Architecture Diagram: Carb Counting with AI, Where the Errors Happen
To understand how this risk emerges in practice, here’s a real-world data flow:
# Example: querying an AI vision model for carb estimation.
# Assumes the prompt asks the model to reply with JSON such as
# {"carb_estimate": 45, "confidence": 0.85}; the model name follows the study.
import base64
import json
from openai import OpenAI

client = OpenAI()

def carb_estimate(image_path, prompt):
    with open(image_path, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-5.4-vision",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"}},
        ]}],
    )
    result = json.loads(response.choices[0].message.content)
    return result["carb_estimate"], result["confidence"]

# Note: production use should query multiple times, aggregate the spread, and always
# verify ingredient identification. This code omits logic for handling model
# variability, user safety checks, and food misidentification.
carbs, conf = carb_estimate("cheese_sandwich.jpg",
                            "Estimate carbs for this meal. Return JSON with carb_estimate and confidence.")
print(f"Estimated carbs: {carbs}g, Model confidence: {conf}")
Critical: As shown in the study, a single answer from this code could be dangerously wrong. In production, always aggregate multiple queries and never automate insulin dosing based solely on a single AI estimate.
Key Takeaways
AI models for carb counting produce highly variable and sometimes dangerous results, even for the same input.
Model confidence scores are poorly calibrated and cannot be used to filter out errors.
Food misidentification is common and leads to major estimation errors.
Current LLMs and vision APIs are not suitable for autonomous medical decision-making in regulated domains.
Always use AI as an assistive tool, not as a replacement for human clinical judgment in high-stakes scenarios like diabetes care.
For more production-ready AI architecture patterns that address reliability and safety, see our deep dive on RAG systems for knowledge-aware AI. For further reading on the original study and dataset, visit Diabettech and the open data repository.
As the global AI market surges ahead and new benchmarks are set almost monthly (see our coverage of the US-China AI model race), the lesson for healthcare and other safety-critical domains is clear: accuracy alone is not enough; consistency, transparency, and robust human oversight are essential.
Sources and References
This article was researched using a combination of primary and supplementary sources:
Primary Source
This is the main subject of the article. The post analyzes and explains concepts from this source.