Large Language Model Ability to Translate CT and MRI Free-Text Radiology Reports Into Multiple Languages

Cuocolo R.;
2024-01-01

Abstract

Background: High-quality translations of radiology reports are essential for optimal patient care. Because of limited availability of human translators with medical expertise, large language models (LLMs) are a promising solution, but their ability to translate radiology reports remains largely unexplored.

Purpose: To evaluate the accuracy and quality of various LLMs in translating radiology reports across high-resource languages (English, Italian, French, German, and Chinese) and low-resource languages (Swedish, Turkish, Russian, Greek, and Thai).

Materials and Methods: A dataset of 100 synthetic free-text radiology reports from CT and MRI scans was translated by 18 radiologists between January 14 and May 2, 2024, into nine target languages. Ten LLMs, including GPT-4 (OpenAI), Llama 3 (Meta), and Mixtral models (Mistral AI), were used for automated translation. Translation accuracy and quality were assessed with use of BiLingual Evaluation Understudy (BLEU) score, translation error rate (TER), and CHaRacter-level F-score (chrF++) metrics. Statistical significance was evaluated with use of paired t tests with Holm-Bonferroni corrections. Radiologists also conducted a qualitative evaluation of translations with use of a standardized questionnaire.

Results: GPT-4 demonstrated the best overall translation quality, particularly from English to German (BLEU score: 35.0 ± 16.3 [SD]; TER: 61.7 ± 21.2; chrF++: 70.6 ± 9.4), to Greek (BLEU: 32.6 ± 10.1; TER: 52.4 ± 10.6; chrF++: 62.8 ± 6.4), to Thai (BLEU: 53.2 ± 7.3; TER: 74.3 ± 5.2; chrF++: 48.4 ± 6.6), and to Turkish (BLEU: 35.5 ± 6.6; TER: 52.7 ± 7.4; chrF++: 70.7 ± 3.7). GPT-3.5 showed the highest accuracy in translations from English to French, and Qwen1.5 excelled in English-to-Chinese translations, whereas Mixtral 8x22B performed best in Italian-to-English translations. The qualitative evaluation revealed that LLMs excelled in clarity, readability, and consistency with the original meaning but showed moderate medical terminology accuracy.

Conclusion: LLMs showed high accuracy and quality for translating radiology reports, although results varied by model and language pair.
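
The abstract states that significance was assessed with paired t tests followed by Holm-Bonferroni corrections, but it does not describe the implementation. The following is a minimal sketch of the Holm-Bonferroni step-down adjustment only, not the authors' code. The function name `holm_bonferroni`, the `alpha` default, and the example p values are hypothetical and chosen purely for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return Holm-adjusted p values and reject decisions, in input order.

    Illustrative sketch only; not taken from the study.
    """
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p value (rank = k - 1) is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject


# Hypothetical p values from three pairwise model comparisons (not study data).
example_p = [0.01, 0.04, 0.03]
adjusted_p, rejected = holm_bonferroni(example_p)
print(adjusted_p, rejected)
```

In this sketch the smallest p value is compared against the strictest threshold and each subsequent one against a progressively looser threshold, which is what makes Holm's procedure less conservative than a plain Bonferroni correction while still controlling the family-wise error rate.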

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4896563