Impact of different metrics on multi-view clustering
Serra, Angela; Tagliaferri, Roberto
2015
Abstract
Clustering patients makes it possible to find groups of subjects with similar characteristics. This categorization can facilitate diagnosis, treatment decisions and prognosis prediction. Heterogeneous genome-wide data sources capture different biological aspects, which can be integrated to categorize patients more accurately. Clustering methods work by comparing how similar or dissimilar patients are in a suitable similarity space. While several clustering methods have been proposed, there is no systematic comparative study of the impact of similarity metrics on cluster quality. We compared seven popular similarity measures (Pearson, Spearman and Kendall correlations; Euclidean, Canberra, Minkowski and Manhattan distances) in conjunction with two classical single-view clustering algorithms (partitioning around medoids and hierarchical clustering) and a late integration approach based on matrix factorization, on high-dimensional multi-view cancer data from the TCGA repository. Performance was measured against the classification of tumour subcategories. Only the Euclidean and Minkowski distances produced similar results in terms of clustering similarity indexes. In terms of misclassification, on the other hand, no single best similarity measure emerged: the best choice depends strongly on the data.