Clinical Support · Systematic Review · 2023

Accuracy of AI Systems in Generating Differential Diagnoses

Key Finding

Prospective and retrospective evaluations of diagnostic decision‑support algorithms show top‑3 differential accuracy in the 70–90% range for common presentations, comparable to that of generalist physicians but below that of specialists in complex cases. Performance declines notably for rare diseases and atypical presentations, and AI systems are sensitive to input quality and may amplify existing biases in training data.


Executive Summary

Multiple studies comparing AI‑based diagnostic decision‑support systems to human clinicians report that modern algorithms can generate correct diagnoses within their top‑3 suggestions in roughly 70–90% of test vignettes and real‑world cases for common conditions. Performance typically matches or exceeds that of generalist clinicians in standardized vignette studies but falls short of subspecialists for complex or rare diseases. AI systems tend to do best with well‑structured, complete inputs and clear symptom constellations, and they perform less reliably when key elements are omitted, mislabeled, or described in non‑standard ways.

Clinically, this matters because AI‑supported differentials can reduce cognitive load, highlight alternative diagnoses, and serve as a safety net against common cognitive biases such as premature closure and anchoring. However, these systems are not validated as autonomous diagnosticians: their probability estimates are often poorly calibrated (over‑ or under‑estimating likelihoods), they may hallucinate conditions, and they often lack robust representation of minority populations, multimorbidity, and social context. These limitations underscore the need for careful human oversight and structured integration into existing diagnostic workflows rather than replacement of clinician reasoning.

Detailed Research

Methodology

The evidence base consists of diagnostic‑accuracy studies using standardized clinical vignettes, retrospective chart reviews, and some prospective evaluations of AI tools integrated into clinical workflows. Systematic reviews typically include more than 20 primary studies evaluating a range of algorithms—from traditional probabilistic engines and symptom checkers to deep‑learning models trained on large EHR and claims datasets—against reference standards established by expert panels or final discharge diagnoses.

Outcomes include top‑1 and top‑3 diagnostic accuracy, sensitivity and specificity for selected conditions, calibration of probability estimates, and, in some cases, impact on clinician decision‑making. Many studies are limited by convenience samples, restriction to a narrow set of conditions, or reliance on simulated cases rather than real patients.
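
To make these outcome measures concrete, the following minimal sketch (in Python) shows how top‑k accuracy and a simple calibration measure such as the Brier score are typically computed from a model's ranked differential and probability estimates; the data structures and example cases are hypothetical, not drawn from the cited studies.

```python
# Minimal sketch of common diagnostic-accuracy metrics; all data are
# hypothetical illustrations, not values from the cited studies.
import numpy as np

def top_k_accuracy(ranked_ddx, truth, k):
    """Fraction of cases where the reference diagnosis appears in the
    model's top-k differential."""
    hits = sum(t in ddx[:k] for ddx, t in zip(ranked_ddx, truth))
    return hits / len(truth)

def brier_score(pred_prob, correct):
    """Mean squared error between the predicted probability of the leading
    diagnosis and whether it was correct (1) or not (0); lower is better."""
    return float(np.mean((np.asarray(pred_prob) - np.asarray(correct)) ** 2))

# Hypothetical evaluation set: ranked differentials vs. reference diagnoses.
ranked = [["pneumonia", "bronchitis", "PE"], ["GERD", "ACS", "costochondritis"]]
truth = ["pneumonia", "ACS"]
print(top_k_accuracy(ranked, truth, k=1))   # 0.5 -- top-1 accuracy
print(top_k_accuracy(ranked, truth, k=3))   # 1.0 -- top-3 accuracy
print(brier_score([0.8, 0.6], [1, 0]))      # 0.2 -- calibration error
```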

Key Studies

Systematic Review of Computer‑assisted Diagnosis (2020–2022)

  • Design: Systematic review of AI‑based differential diagnosis tools
  • Sample: Aggregated data from multiple symptom checkers and decision‑support systems
  • Findings: Median top‑3 accuracy around 70–80% across a wide range of systems, with higher performance for common conditions and lower performance for rare diseases. In head‑to‑head comparisons with physicians on standardized vignettes, AI tools often matched generalist accuracy but remained inferior to subspecialists.
  • Clinical Relevance: Establishes baseline performance expectations

Randomized Trials of AI Decision Support in Primary Care

  • Design: Randomized or quasi‑randomized studies of decision‑support tools in EHRs
  • Sample: Primary care encounters
  • Findings: Modest improvements in problem‑list completeness and appropriate consideration of guideline‑recommended diagnoses, but effect sizes on hard outcomes (for example, missed myocardial infarction or delayed cancer diagnosis) are small, and trials are often underpowered to detect them.
  • Clinical Relevance: Shows real-world implementation challenges

Deep‑learning Models for Specific Conditions

  • Design: Specialty‑focused AI models for dermatologic lesions, retinopathy, or radiographic pneumonia
  • Sample: Controlled test sets in specific domains
  • Findings: Area‑under‑the‑curve (AUC) values of 0.90 or higher, matching or exceeding human readers in controlled settings (a brief illustration of how such an AUC is computed follows this list). However, these studies focus on narrow tasks and do not capture the full complexity of multi‑system differential diagnosis.
  • Clinical Relevance: High accuracy in narrow domains, uncertain generalizability
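
As a rough illustration of how a single‑task AUC like those reported above is computed, the sketch below scores a hypothetical binary classifier with scikit‑learn's roc_auc_score; the labels and probabilities are invented for the example.

```python
# Illustration of a narrow-domain AUC computation; labels and scores are
# invented, not taken from any study discussed here.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                      # 1 = disease present per reference standard
y_score = [0.1, 0.4, 0.85, 0.7, 0.95, 0.65, 0.6, 0.2]  # model-predicted probabilities

print(roc_auc_score(y_true, y_score))  # ~0.94 here; the studies above report >=0.90
```

Note that a high AUC on one binary task says nothing about performance on open‑ended, multi‑system differential generation.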

Bias and Equity Analyses

  • Design: Analyses of AI diagnostic model performance across demographic groups
  • Sample: Diverse patient populations
  • Findings: AI diagnostic models may perform less well for under‑represented groups, including racial and ethnic minorities, women, and patients with multimorbidity, due to imbalances in training data (a minimal subgroup‑audit sketch follows this list).
  • Clinical Relevance: Critical equity considerations for implementation
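
A minimal subgroup audit of the kind these analyses describe can be sketched as follows; the column names, groups, and values are hypothetical stand‑ins, not data from the reviewed studies.

```python
# Hedged sketch of an equity audit: stratify a diagnostic-accuracy metric
# by demographic group. All column names and data are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B"],  # e.g., a demographic attribute
    "top3_hit": [1,   1,   0,   1,   0,   0],    # 1 = reference dx in model's top 3
})

# Per-group accuracy with case counts; large gaps flag potential bias.
print(df.groupby("group")["top3_hit"].agg(["mean", "count"]))
# A disparity such as 0.67 vs 0.33 here would prompt review of training-data
# representation before deployment.
```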

Clinical Implications

For osteopathic physicians, AI‑generated differentials can serve as a cognitive forcing function: a quick way to check for missed possibilities and reconsider anchoring diagnoses, especially in busy primary care settings. They may also help trainees and residents structure their reasoning and identify red flags that warrant further workup or referral.

However, these tools should be used as adjuncts, not authorities. DOs remain responsible for integrating structural findings, psychosocial context, and patient values—elements that current AI systems do not reliably capture. Over‑reliance on AI lists without critical appraisal risks "automation bias" and could exacerbate existing inequities.

Limitations & Research Gaps

Most studies are conducted in simulated environments or single institutions, which may not reflect real‑world performance across diverse populations and practice settings. There is limited evidence that AI‑generated differentials improve patient‑centered outcomes such as time to correct diagnosis, reduction in unnecessary testing, or avoidance of adverse events.

Almost no research specifically addresses osteopathic practice patterns, including the integration of structural and somatic findings into differential diagnosis. Future work should evaluate AI systems in DO‑heavy settings, assess impact on diagnostic safety and equity, and explore how to encode osteopathic reasoning (for example, viscerosomatic relationships) into decision‑support tools.

Osteopathic Perspective

Osteopathic medicine emphasizes that structure and function are reciprocally interrelated and that rational treatment is based on understanding the whole person; current AI systems rarely incorporate structural exam findings, posture, or somatic dysfunction into their models. DOs should therefore treat AI differentials as one input among many, not a replacement for hands‑on assessment.

The osteopathic focus on the body's self‑regulatory capacity and on the therapeutic relationship also implies caution about over‑medicalization driven by algorithmic suggestions, which may increase testing and labeling without improving outcomes. Thoughtful governance and transparent communication with patients about how AI is used in diagnosis align with the osteopathic commitment to holistic, person‑centered care.

References (2)

  1. Liu X, Rivera SC, Moher D, et al. Reporting guidelines for clinical trials evaluating artificial intelligence interventions: the CONSORT-AI extension. BMJ. 2020;370:m3164. DOI: 10.1136/bmj.m3164
  2. Grote T, Berens P. On the ethics of algorithmic decision-making in healthcare. J Med Ethics. 2020;46:205-211. DOI: 10.1136/medethics-2019-105586

Related Research

Impact of AI on Diagnostic Errors in Clinical Practice

Randomized and quasi‑experimental studies integrating AI decision support into imaging, dermatology, and selected primary care workflows report relative reductions in specific diagnostic errors on the order of 10–25%, mainly by increasing sensitivity, often at the cost of more false positives. Evidence that broad, general‑purpose AI systems reduce overall diagnostic error rates in real‑world ambulatory care remains limited and inconsistent.

AI‑Enhanced Drug Interaction Checking and Medication Safety

AI‑augmented clinical decision‑support systems can identify potential drug–drug interactions and contraindications with high sensitivity, with some systems detecting 10–20% more clinically relevant interactions than traditional rule‑based checkers, but they also risk overwhelming clinicians with low‑value alerts if not carefully tuned. Evidence linking AI‑based interaction checking to reductions in hard outcomes such as adverse drug events or hospitalizations is suggestive but not yet definitive.

AI Detection of Rare Diseases from Symptom and Multimodal Patterns

Scoping and narrative reviews report that AI methods—particularly few-shot learning, multimodal models, and AI-augmented symptom checkers—can shorten the diagnostic odyssey for rare diseases, with potential reductions in time to diagnosis from the current 4–5 year average, though quantitative effect sizes are not yet well established. Performance remains highly dependent on data quality, representativeness, and clinical integration.