Foundational Models for Building Clinical Decision Support Systems: An Open-Source Models Comparison

D'Aniello, Giuseppe (Methodology); Fraenza, Valeria (Software); Ritrovato, Pierluigi (Conceptualization)
2025

Abstract

Large Language Models (LLMs) are driving significant advances across many domains, including healthcare. Recent studies highlight several foundational models capable of processing and interpreting Electronic Health Records (EHRs), underscoring their potential for developing Clinical Decision Support Systems (CDSSs). This paper presents a comparative analysis of nine open-source LLMs organised into three macro-categories: (a) general-purpose (3), (b) trained on medical-domain data only (1), and (c) general-purpose fine-tuned on medical-domain datasets (5). The models are evaluated on four medical Question Answering (QA) datasets: two generic, iCliniq and LiveQA, and two specific to chronic diseases, diabetes-QA-dataset-origin for diabetes and Cardio-docs-qa for cardiology. The evaluation involves three different tokenizers (GPT2TokenizerFast, LlamaTokenizerFast and PreTrainedTokenizerFast) and three different fine-tuning strategies. Moreover, fine-tuning is performed using three specific datasets. Performance metrics are derived using BERTScore. The results show that all models perform similarly across all test datasets, with the best performance achieved by a recent general-purpose model (DeepSeek). Fine-tuning slightly improves performance across all datasets for only one model. Additionally, Low-Rank Adaptation (LoRA) and Quantized LoRA (QLoRA) are confirmed as the most effective fine-tuning techniques.

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4945059