Can ChatGPT-4o enhance ECG interpretation accuracy compared to cardiologists?
De Roberto A. M.;De Marco F.;Di Biasi L.;Rossi D.;Tortora G.
2024
Abstract
Cardiovascular disease refers to a group of disorders affecting the heart and blood vessels, including conditions such as coronary artery disease, stroke, and heart failure. Arrhythmias are irregularities in heart rhythm in which the heart beats too fast, too slow, or erratically. This study compares ChatGPT-4o with a group of cardiologists in the analysis of electrocardiogram (ECG) images to assist in the diagnosis of cardiovascular conditions. The purpose of this comparison is to evaluate the potential of large language models such as ChatGPT-4o in clinical environments, specifically for interpreting ECG traces. To this end, we designed an experiment in which both the model and a cohort of cardiologists analyzed the same set of ECG images, and their interpretations were compared to assess performance. The evaluation focused on key diagnostic aspects: heart rate determination, rhythm interpretation, and the overall diagnosis of potential cardiovascular abnormalities. Cardiologists provided their expert insights through a structured survey that captured their diagnostic reasoning. ChatGPT-4o, in turn, was given the same set of images and asked to produce diagnostic outputs. Since large language models are not explicitly trained for medical image analysis, the responses were generated from the model's ability to infer from the textual and visual information presented. The model's outputs were processed and evaluated for accuracy against both the cardiologists' responses and the ground-truth labels provided by the dataset. The results revealed notable differences in diagnostic accuracy between ChatGPT-4o's outputs and the cardiologists' assessments. ChatGPT-4o achieved an accuracy of 29.20%, a sensitivity of 29.20%, and an F1-score of 0.29 against the ground-truth labels. In contrast, the cardiologists collectively performed significantly better, achieving an accuracy of 58.70%, a sensitivity of 58.70%, and an F1-score of 0.59.
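As a minimal sketch of the evaluation step described in the abstract (comparing predicted diagnoses against the dataset's ground-truth labels using accuracy, sensitivity, and F1-score), the following Python snippet illustrates one way such metrics could be computed with scikit-learn. The label names and data are hypothetical placeholders, not values or classes from the study, and the averaging choice is an assumption rather than the authors' exact protocol.

```python
# Hedged illustration: scoring a set of predicted ECG diagnoses against
# ground-truth labels. Labels below are hypothetical placeholders.
from sklearn.metrics import accuracy_score, recall_score, f1_score

ground_truth = ["AFIB", "NORM", "STTC", "NORM", "MI"]    # dataset labels (hypothetical)
predictions  = ["AFIB", "NORM", "NORM", "NORM", "STTC"]  # model or expert answers (hypothetical)

accuracy = accuracy_score(ground_truth, predictions)
# Sensitivity corresponds to recall; weighted averaging over classes is one
# common choice for multi-class diagnostic labels (an assumption here).
sensitivity = recall_score(ground_truth, predictions, average="weighted", zero_division=0)
f1 = f1_score(ground_truth, predictions, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.4f}  sensitivity={sensitivity:.4f}  F1={f1:.4f}")
```

The same scoring routine could be applied to both the ChatGPT-4o outputs and the aggregated cardiologist survey responses, yielding directly comparable accuracy, sensitivity, and F1 figures of the kind reported in the abstract.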