Reproducibility of radiomics quality score: an intra- and inter-rater reliability study

Akinci D'Antonoli, Tugba; Cavallo, Armando Ugo; Vernuccio, Federica; Stanzione, Arnaldo; Klontzas, Michail E; Cannella, Roberto; Ugga, Lorenzo; Baran, Agah; Fanni, Salvatore Claudio; Petrash, Ekaterina; Ambrosini, Ilaria; Cappellini, Luca Alessandro; Van Ooijen, Peter; Kotter, Elmar; Pinto Dos Santos, Daniel; Cuocolo, Renato

doi:10.1007/s00330-023-10217-x

Objectives: To investigate the intra- and inter-rater reliability of the total radiomics quality score (RQS) and the reproducibility of individual RQS item scores in a large multireader study.

Methods: Nine raters with different backgrounds were randomly assigned to three groups based on their proficiency with RQS utilization: groups 1 and 2 represented the inter-rater reliability groups with or without prior training in RQS, respectively; group 3 represented the intra-rater reliability group. Thirty-three original research papers on radiomics were evaluated by raters of groups 1 and 2. Of the 33 papers, 17 were evaluated twice with an interval of 1 month by raters of group 3. The intraclass correlation coefficient (ICC) was used for continuous variables, and Fleiss' and Cohen's kappa (κ) statistics were used for categorical variables.

Results: The inter-rater reliability was poor to moderate for total RQS (ICC 0.30-0.55, p < 0.001) and very low to good for item reproducibility (κ = −0.12 to 0.75) within groups 1 and 2, for both inexperienced and experienced raters. The intra-rater reliability for total RQS was moderate for the less experienced rater (ICC 0.522, p = 0.009), whereas experienced raters showed excellent intra-rater reliability (ICC 0.91-0.99, p < 0.001) between the first and second read. Intra-rater reproducibility of RQS item scores was higher, and most of the items had moderate to good intra-rater reliability (κ = 0.40 to 1).

Conclusions: Reproducibility of the total RQS and of the scores of individual RQS items is low. There is a need for a robust and reproducible method to assess the quality of radiomics research.

Clinical relevance statement: There is a need for reproducible scoring systems to improve the quality of radiomics research and, in turn, close the translational gap between research and clinical implementation.

Key points:
• Radiomics quality score has been widely used for the evaluation of radiomics studies.
• Although the intra-rater reliability was moderate to excellent, intra- and inter-rater reliability of total score and point-by-point scores was low with radiomics quality score.
• A robust, easy-to-use scoring system is needed for the evaluation of radiomics research.
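
For readers unfamiliar with the agreement statistics named in the Methods, the following is a minimal sketch of Cohen's kappa for two sets of categorical ratings, such as two reads of the same RQS item by one rater. It is illustrative only and is not the authors' analysis code: the function name `cohen_kappa` and the example scores are assumptions made up for this sketch, and the study's own software and settings are not stated in the abstract.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters (or two reads) over the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(ratings_a)
    # Observed agreement: share of items given the same category.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Made-up scores for one hypothetical RQS item across six papers.
first_read = [1, 0, 1, 1, 0, 1]
second_read = [1, 0, 0, 1, 0, 1]
kappa = cohen_kappa(first_read, second_read)
```

With these made-up scores the two reads agree on five of six papers and chance agreement is 0.5, so kappa is about 0.67. Kappa becomes negative when agreement is worse than chance, which is how a lower bound such as −0.12 can arise.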


2023
