Clustering of patients allows to find groups of subjects with similar characteristics. This categorization can facilitate diagnosis, treatment decision and prognosis prediction. Heterogeneous genome-wide data sources capture different biological aspects that can be integrated in order to better categorize the patients. Clustering methods work by comparing how patients are similar or dissimilar in a suitable similarity space. While several clustering methods have been proposed, there is no systematic comparative study concerning the impact of similarity metrics on the cluster quality. We compared seven popular similarity measures (Pearson, Spearman and Kendall Correlations; Euclidean, Canberra, Minkowski and Manhattan Distances) in conjunction with two classical single-view clustering algorithms and a late integration approach (partitioning around medoids, hierarchical clustering and matrix factorization approaches), on high dimensional multi-view cancer data coming from the TCGA repository. Performance was measured against tumour subcategories classification. Only Euclidean and Minkowski distances showed similar results in terms of clustering similarity indexes. On the other hand, an absolute best similarity measure did not emerge in terms of misclassification, but it strongly depends on the data.

Impact of different metrics on multi-view clustering

SERRA, ANGELA;TAGLIAFERRI, Roberto
2015-01-01

Abstract

Clustering of patients allows to find groups of subjects with similar characteristics. This categorization can facilitate diagnosis, treatment decision and prognosis prediction. Heterogeneous genome-wide data sources capture different biological aspects that can be integrated in order to better categorize the patients. Clustering methods work by comparing how patients are similar or dissimilar in a suitable similarity space. While several clustering methods have been proposed, there is no systematic comparative study concerning the impact of similarity metrics on the cluster quality. We compared seven popular similarity measures (Pearson, Spearman and Kendall Correlations; Euclidean, Canberra, Minkowski and Manhattan Distances) in conjunction with two classical single-view clustering algorithms and a late integration approach (partitioning around medoids, hierarchical clustering and matrix factorization approaches), on high dimensional multi-view cancer data coming from the TCGA repository. Performance was measured against tumour subcategories classification. Only Euclidean and Minkowski distances showed similar results in terms of clustering similarity indexes. On the other hand, an absolute best similarity measure did not emerge in terms of misclassification, but it strongly depends on the data.
2015
9781479919604
9781479919604
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11386/4667768
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 13
  • ???jsp.display-item.citation.isi??? 6
social impact