Motivation Microarray technology can be used to study the expression of thousands of genes across a number of different experimental conditions, usually hundreds. The underlying principle is that genes sharing similar expression patterns, across different samples, can be part of the same co-expression system, or they may share the same biological functions. Groups of genes are usually identified based on cluster analysis. Clustering methods rely on the similarity matrix between genes. A common choice to measure similarity is to compute the sample correlation matrix. Dimensionality reduction is another popular data analysis task which is also based on covariance/correlation matrix estimates. Unfortunately, covariance/correlation matrix estimation suffers from the intrinsic noise present in high-dimensional data. Sources of noise are: sampling variations, presents of outlying sample units, and the fact that in most cases the number of units is much larger than the number of genes. Results In this paper we propose a robust correlation matrix estimator that is regularized based on adaptive thresholding. The resulting method jointly tames the effects of the high-dimensionality, and data contamination. Computations are easy to implement and do not require hand tunings. Both simulated and real data are analysed. A Monte Carlo experiment shows that the proposed method is capable of remarkable performances. Our correlation metric is more robust to outliers compared with the existing alternatives in two gene expression data sets. It is also shown how the regularization allows to automatically detect and filter spurious correlations. The same regularization is also extended to other less robust correlation measures. Finally, we apply the ARACNE algorithm on the SyNTreN gene expression data. Sensitivity and specificity of the reconstructed network is compared with the gold standard. We show that ARACNE performs better when it takes the proposed correlation matrix estimator as input. Availability The R software is available at https://github.com/angy89/RobustSparseCorrelation. Contact aserra@unisa.it. Supplementary information Supplementary data are available at Bioinformatics online.

Robust and sparse correlation matrix estimation for the analysis of high-dimensional genomics data

Angela Serra;Pietro Coretto;Michele Fratello;Roberto Tagliaferri
2018-01-01

Abstract

Motivation Microarray technology can be used to study the expression of thousands of genes across a number of different experimental conditions, usually hundreds. The underlying principle is that genes sharing similar expression patterns, across different samples, can be part of the same co-expression system, or they may share the same biological functions. Groups of genes are usually identified based on cluster analysis. Clustering methods rely on the similarity matrix between genes. A common choice to measure similarity is to compute the sample correlation matrix. Dimensionality reduction is another popular data analysis task which is also based on covariance/correlation matrix estimates. Unfortunately, covariance/correlation matrix estimation suffers from the intrinsic noise present in high-dimensional data. Sources of noise are: sampling variations, presents of outlying sample units, and the fact that in most cases the number of units is much larger than the number of genes. Results In this paper we propose a robust correlation matrix estimator that is regularized based on adaptive thresholding. The resulting method jointly tames the effects of the high-dimensionality, and data contamination. Computations are easy to implement and do not require hand tunings. Both simulated and real data are analysed. A Monte Carlo experiment shows that the proposed method is capable of remarkable performances. Our correlation metric is more robust to outliers compared with the existing alternatives in two gene expression data sets. It is also shown how the regularization allows to automatically detect and filter spurious correlations. The same regularization is also extended to other less robust correlation measures. Finally, we apply the ARACNE algorithm on the SyNTreN gene expression data. Sensitivity and specificity of the reconstructed network is compared with the gold standard. We show that ARACNE performs better when it takes the proposed correlation matrix estimator as input. Availability The R software is available at https://github.com/angy89/RobustSparseCorrelation. Contact aserra@unisa.it. Supplementary information Supplementary data are available at Bioinformatics online.
File in questo prodotto:
File Dimensione Formato  
2018_serra_coretto_benfratello_tagliaferri_robust_and_sparse_correlation_matrix_estimation_for_the_analysis_of_high_dimensional_genomics_data.pdf

non disponibili

Tipologia: Documento in Post-print (versione successiva alla peer review e accettata per la pubblicazione)
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 3.56 MB
Formato Adobe PDF
3.56 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11386/4702198
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? 4
  • Scopus 18
  • ???jsp.display-item.citation.isi??? 15
social impact