The Importance of the Correlation in Crossover Experiments

Scanniello G.; Gravino C.
2021-01-01

Abstract

Context: In empirical software engineering, crossover designs are popular for experiments comparing software engineering techniques that must be undertaken by human participants. However, their value depends on the correlation (r) between the outcome measures on the same participants. Software engineering theory emphasizes the importance of individual skill differences, so we would expect the values of r to be relatively high. However, few researchers have reported the values of r.

Goal: To investigate the values of r found in software engineering experiments.

Method: We undertook simulation studies to investigate the theoretical and empirical properties of r. Then we investigated the values of r observed in 35 software engineering crossover experiments.

Results: The level of r obtained by analysing our 35 crossover experiments was small. Estimates based on means, medians, and random effect analysis disagreed but were all between 0.2 and 0.3. As expected, our analyses found large variability among the individual r estimates for small sample sizes, but no indication that r estimates were larger for the experiments with larger sample sizes that exhibited smaller variability.

Conclusions: Low observed r values cast doubts on the validity of crossover designs for software engineering experiments. However, if the cause of low r values relates to training limitations or toy tasks, this affects all Software Engineering (SE) experiments involving human participants. For all human-intensive SE experiments, we recommend more intensive training and then tracking the improvement of participants as they practice using specific techniques, before formally testing the effectiveness of the techniques.
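
The point about variability of r estimates at small sample sizes can be illustrated with a small simulation. The following is a minimal sketch and not the authors' simulation study: it assumes bivariate normal outcomes with unit variance, picks a true correlation of 0.25 purely because it lies in the 0.2 to 0.3 range reported above, and uses hypothetical function names (`pearson_r`, `simulate_r_estimates`).

```python
import math
import random
import statistics


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r_estimates(true_r, n_participants, n_experiments, seed=0):
    """Simulate many experiments and return the r estimated in each one.

    Assumption (not from the source): each participant yields two outcome
    measures drawn from a standard bivariate normal with correlation true_r.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - true_r ** 2)
    estimates = []
    for _ in range(n_experiments):
        first, second = [], []
        for _ in range(n_participants):
            x = rng.gauss(0.0, 1.0)
            # Construct a second measure whose correlation with x is true_r.
            y = true_r * x + noise_scale * rng.gauss(0.0, 1.0)
            first.append(x)
            second.append(y)
        estimates.append(pearson_r(first, second))
    return estimates


if __name__ == "__main__":
    # Illustrative sample sizes only; they are not taken from the 35 experiments.
    for n in (10, 20, 80):
        est = simulate_r_estimates(0.25, n, 2000)
        print(n, round(statistics.fmean(est), 3), round(statistics.stdev(est), 3))
```

Under these assumptions, the spread of the estimated r shrinks as the number of participants grows while its average stays near the true value, which is the pattern the abstract describes as expected.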

Use this identifier to cite or link to this document: https://hdl.handle.net/11386/4765014