Large language models (LLMs) are increasingly used in medical diagnostics.1 However, their performance in rheumatology remains understudied.2 This gap is particularly pronounced for multimodal capabilities, where visual findings play a crucial role in diagnosis.3 We evaluated 2 state-of-the-art LLMs, GPT-4o (OpenAI) and Claude Sonnet 3.5 (Anthropic), on their ability to diagnose rheumatologic and immunologic conditions using multimodal inputs.
An expert rheumatologist (MEN) selected and validated 64 cases from the New England Journal of Medicine (NEJM) Image Challenge.4 The cases were evenly split: 32 with rheumatologic/immunologic diagnoses and 32 with similar presentations but nonrheumatologic diagnoses. This design aimed to test the models’ diagnostic accuracy and their ability to differentiate between rheumatologic and nonrheumatologic conditions.
We presented each case to the models in 3 formats: text only, image only, and combined text + image. This approach allowed us to assess the models’ performances across different input modalities and their ability to integrate multimodal information. We used official application programming interface (API) libraries to interact with the models, ensuring standardized and reproducible results.
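The letter reports only that the official API libraries were used; the following is a minimal sketch of the combined text + image query in Python, in which the prompt wording, file name, and model identifiers are our assumptions rather than the study's exact configuration.

```python
import base64

import anthropic
from openai import OpenAI

prompt = "What is the most likely diagnosis? Choose one option."  # hypothetical prompt
with open("case_image.jpg", "rb") as f:                           # hypothetical file
    image_b64 = base64.b64encode(f.read()).decode()

# GPT-4o via the OpenAI Chat Completions API
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
gpt_response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
gpt_answer = gpt_response.choices[0].message.content

# Claude Sonnet 3.5 via the Anthropic Messages API
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
claude_response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/jpeg",
                        "data": image_b64}},
            {"type": "text", "text": prompt},
        ],
    }],
)
claude_answer = claude_response.content[0].text
```

Text-only and image-only conditions would omit the image block or the text block, respectively.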
Both models demonstrated comparable overall performance, with no statistically significant between-model differences on rheumatologic/immunologic cases. On nonrheumatologic cases, however, significant differences emerged in the text-only (P = 0.03) and image-only (P = 0.01) formats, highlighting the models' varying strengths across input modalities.
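The test underlying these P values is not named here. Because both models answered the same cases, a paired comparison such as McNemar's test would be a natural choice; the sketch below assumes that test and uses illustrative counts, not the study's data.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired outcomes on the same 32 cases: rows = GPT-4o, columns = Sonnet 3.5.
# These counts are illustrative placeholders, not the study's data.
table = [[20, 7],   # GPT-4o correct: Sonnet correct / Sonnet wrong
         [2, 3]]    # GPT-4o wrong:   Sonnet correct / Sonnet wrong
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"P = {result.pvalue:.3f}")
```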
Sonnet 3.5 outperformed GPT-4o in text + image tasks for both case types (81.2% vs 75% for rheumatologic, 78.1% vs 71.9% for nonrheumatologic). GPT-4o excelled in text-only rheumatologic cases (81.2% vs 75%), whereas Sonnet 3.5 performed better in text-only nonrheumatologic cases (84.4% vs 59.4%; Figure).
Figure. Summary and comparison of the models' performances.
Interestingly, GPT-4o consistently outperformed Sonnet 3.5 in image-only tasks across both case types. Notably, Sonnet 3.5 less often assigned an incorrect rheumatologic/immunologic diagnosis to nonrheumatologic cases (9.4% vs 18.8%), suggesting it may have better specificity in distinguishing rheumatologic from nonrheumatologic conditions, a crucial factor in clinical decision making.
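With 32 cases per group, each reported percentage corresponds to a whole number of cases. The back-calculation below is our reconstruction from the rounded figures, not source data.

```python
# Back-calculated case counts behind the reported percentages (n = 32 per
# group); reconstructed from the rounded figures, not taken from raw data.
n = 32
reported = {
    "Sonnet 3.5, text+image, rheumatologic":    26,  # 81.2%
    "GPT-4o,     text+image, rheumatologic":    24,  # 75.0%
    "Sonnet 3.5, text+image, nonrheumatologic": 25,  # 78.1%
    "GPT-4o,     text+image, nonrheumatologic": 23,  # 71.9%
    "Sonnet 3.5, false rheumatologic labels":    3,  # 9.4%
    "GPT-4o,     false rheumatologic labels":    6,  # 18.8%
}
for label, k in reported.items():
    print(f"{label}: {k}/{n} = {100 * k / n:.1f}%")
```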
Both models significantly outperformed online participants in the NEJM weekly challenge on both rheumatologic and nonrheumatologic cases: across 4,376,164 total responses, participants achieved 51.6% and 43.6% accuracy, respectively (Figure).
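The letter does not state how this comparison was tested. One simple approach, assumed here rather than taken from the study, treats the participants' aggregate accuracy as a fixed benchmark and applies a one-sided exact binomial test to a model's back-calculated score.

```python
from scipy.stats import binomtest

# One-sided exact binomial test against the participants' 51.6% accuracy on
# rheumatologic cases, treated as a fixed benchmark (an assumed analysis;
# the letter does not name its test). 26/32 is Sonnet 3.5's back-calculated
# text + image score on rheumatologic cases.
result = binomtest(26, n=32, p=0.516, alternative="greater")
print(f"P = {result.pvalue:.4f}")
```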
Our results raise intriguing questions about the models' underlying architectures and training methodologies. Sonnet 3.5's superior performance in multimodal tasks might stem from more effective cross-modal attention mechanisms or better integration of visual and textual features. Conversely, GPT-4o's strength in image-only tasks could result from more extensive or diverse visual training data.
The models’ differing performances across case types suggest they may have complementary strengths. This opens possibilities for ensemble approaches in clinical applications, where multiple models could be used in tandem to leverage their individual strengths.5
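As a sketch of what such an ensemble might look like, the routing policy below sends each case to the model that was stronger on its input modality in this evaluation. The query functions are passed in as parameters, and the policy is illustrative, not a validated clinical tool.

```python
from typing import Callable, Optional

# A model takes optional case text and an optional base64 image and returns
# a diagnosis string (e.g., a wrapper around the API calls sketched above).
Model = Callable[[Optional[str], Optional[str]], str]

def route_case(text: Optional[str], image_b64: Optional[str],
               gpt4o: Model, sonnet: Model) -> str:
    """Route each case to the model that led on its input modality."""
    if text and image_b64:
        return sonnet(text, image_b64)  # Sonnet 3.5 led on text + image
    if image_b64:
        return gpt4o(None, image_b64)   # GPT-4o led on image-only input
    # Text-only results differed by case type, which is unknown at inference
    # time; default to one model and flag the case for human review.
    return sonnet(text, None)
```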
Our study has limitations. Some older NEJM cases might overlap with the models’ training data, potentially inflating performance metrics, although we included many recently published cases that were not part of the models’ training data. We cannot determine precisely how the models use visual content in their decision-making process, which remains a “black box.”6 Our sample size, while substantial for a specialized study, is limited when considering the vast array of rheumatologic and immunologic conditions. Additionally, these cases lack the nuanced, real-time clinical context that physicians rely on for diagnosis.
As artificial intelligence tools advance, it is important to view them as augmenting, rather than replacing, clinical expertise in rheumatology. Their integration into clinical practice should be approached cautiously, with ongoing evaluation of their performance, biases, and limitations. The promising results of this study, however, suggest that LLMs could become valuable tools in supporting rheumatologic diagnosis, potentially improving accuracy and efficiency in clinical practice.
Footnotes
CONTRIBUTIONS
MO: conceptualization, methodology, case vignette development, writing - original draft preparation, review, editing; RA: writing - review and editing, validation; EK: writing - review and editing, validation; MEN: writing - review and editing, validation, supervision, case vignette validation.
FUNDING
The authors declare no funding or support for this research.
COMPETING INTERESTS
The authors declare no conflicts of interest relevant to this article.
ETHICS AND PATIENT CONSENT
No patients were involved in this study, and the data used are open access data from the NEJM weekly challenge.
Copyright © 2025 by the Journal of Rheumatology