Foundational Models for Building Clinical Decision Support Systems: An Open-Source Models Comparison
D'Aniello, Giuseppe (Methodology); Fraenza, Valeria (Software); Ritrovato, Pierluigi (Conceptualization)
2025
Abstract
Large Language Models (LLMs) are driving significant advances across many domains, including healthcare. Recent studies highlight several foundational models capable of processing and interpreting Electronic Health Records (EHRs), underscoring their potential for developing Clinical Decision Support Systems (CDSSs). This paper presents a comparative analysis of nine open-source LLMs organised into three macro categories: (a) general-purpose (3), (b) trained on medical-domain data only (1), and (c) general-purpose fine-tuned on medical-domain datasets (5). The models are evaluated on four medical Question Answering (QA) datasets: two generic, iCliniq and LiveQA, and two specific to chronic diseases, diabetes-QA-dataset-origin for diabetes and Cardio-docs-qa for cardiology. The evaluation involves three tokenizers (GPT2TokenizerFast, LlamaTokenizerFast and PreTrainedTokenizerFast) and three fine-tuning strategies. Moreover, fine-tuning is performed using three specific datasets. Performance is measured using BERTScore. The results show that all models perform similarly across all test datasets, with the best performance achieved by a recent general-purpose model (DeepSeek). Fine-tuning slightly improves performance across all datasets for one model only. Additionally, Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) are confirmed as the most effective fine-tuning techniques.
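The abstract names BERTScore as the evaluation metric. As a rough illustration of the idea behind it, the sketch below computes precision, recall and F1 by greedily matching token embeddings using cosine similarity. It is a minimal sketch under stated assumptions, not the authors' evaluation pipeline. In practice the embeddings come from a pretrained contextual model, whereas here they are hand-written toy vectors. The function names `cosine` and `greedy_bertscore` are hypothetical, and optional refinements such as idf weighting and baseline rescaling are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_bertscore(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy cosine matching.

    Both arguments are non-empty lists of token embedding vectors.
    Assumption: the vectors are toy stand-ins for contextual embeddings.
    """
    # Precision: each candidate token is matched to its most similar
    # reference token, and the best similarities are averaged.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)

    # Recall: each reference token is matched to its most similar
    # candidate token, and the best similarities are averaged.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)

    # F1: harmonic mean of precision and recall (0.0 if both are zero).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the candidate answer covers one of two reference tokens.
candidate = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
precision, recall, f1 = greedy_bertscore(candidate, reference)
print(precision, recall, f1)
```

In the toy example the single candidate token matches a reference token exactly, so precision is high, while the second reference token has no counterpart in the candidate, which lowers recall.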


