Validation of Gene Expression Profiles in Genomic Data through Complementary Use of Cluster Analysis and PCA-Related Biplots

Authors

  • Niccolò Bassani, Department of Clinical Sciences and Community Health, University of Milan, via Vanzetti 5, 20133 Milano (MI), Italy
  • Federico Ambrogi, Department of Clinical Sciences and Community Health, University of Milan, via Vanzetti 5, 20133 Milano (MI), Italy
  • Danila Coradini, Department of Clinical Sciences and Community Health, University of Milan, via Vanzetti 5, 20133 Milano (MI), Italy
  • Patrizia Boracchi, Department of Clinical Sciences and Community Health, University of Milan, via Vanzetti 5, 20133 Milano (MI), Italy
  • Elia Biganzoli, Department of Clinical Sciences and Community Health, University of Milan, via Vanzetti 5, 20133 Milano (MI), Italy

DOI:

https://doi.org/10.6000/1929-6029.2012.01.02.09

Keywords:

Microarrays, cluster stability, multivariate visualization, Principal Components Analysis, cell polarity

Abstract

High-throughput genomic assays are used in molecular biology to explore patterns of joint expression of thousands of genes.

These methodologies have developed substantially over the last decade, creating a concurrent need for appropriate methods to analyze the massive amounts of data they generate.

Identifying sets of genes and samples characterized by similar expression values, and validating these results, are two critical issues in these investigations because of their clinical implications. From a statistical perspective, unsupervised class discovery methods such as Cluster Analysis are generally adopted.

However, Cluster Analysis in this setting relies mainly on hierarchical techniques, and other methods are rarely considered. This is partly due to software availability and to the ease of representing results through a heatmap, which displays the clustering of genes and samples simultaneously in a single graphic. One drawback of this strategy is that cluster stability is often neglected, leading to over-interpretation of results.
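As an illustration of what assessing cluster stability can look like in practice, the following minimal sketch perturbs the data with Gaussian noise, re-clusters, and measures how often pairs of samples keep the same co-membership (the Rand index). It is not the procedure used in the paper: the abstract does not name the stability indices, so the function names (`kmeans_labels`, `rand_index`, `perturbation_stability`), the noise level, the choice of a simple k-means in place of hierarchical clustering, and the simulated data are all assumptions made for the example.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation (illustrative)."""
    centers = [X[0]]
    for _ in range(1, k):
        # squared distance of every sample to its nearest centre chosen so far
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def rand_index(a, b):
    """Proportion of sample pairs on which two partitions agree about co-membership."""
    upper = np.triu_indices(len(a), k=1)
    same_a = (a[:, None] == a[None, :])[upper]
    same_b = (b[:, None] == b[None, :])[upper]
    return float(np.mean(same_a == same_b))

def perturbation_stability(X, k, noise_sd, n_rep=50, seed=0):
    """Mean Rand index between the original partition and partitions of noisy copies."""
    rng = np.random.default_rng(seed)
    base = kmeans_labels(X, k)
    scores = [
        rand_index(base, kmeans_labels(X + rng.normal(scale=noise_sd, size=X.shape), k))
        for _ in range(n_rep)
    ]
    return float(np.mean(scores))

# Simulated data (an assumption): 23 samples x 76 genes in two well-separated groups.
rng_demo = np.random.default_rng(1)
X_demo = np.vstack([
    rng_demo.normal(0.0, 1.0, size=(12, 76)),
    rng_demo.normal(5.0, 1.0, size=(11, 76)),
])
stable_score = perturbation_stability(X_demo, k=2, noise_sd=0.5)
print(f"stability at k=2: {stable_score:.3f}")
```

A score close to 1 suggests that the partition survives small perturbations; markedly lower values would be a warning against reading much into the clusters shown on a heatmap.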

Moreover, validation of results using external datasets is still a subject of discussion, since it is well known that batch effects may affect gene expression results even after normalization.

In this paper we compare several clustering algorithms (hierarchical, k-means, model-based, Affinity Propagation) and stability indices to discover common patterns of expression and to assess clustering reliability, and we propose a rank-based passive projection of Principal Components for validation purposes.
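The idea of a rank-based passive projection can be sketched as follows, under assumptions that go beyond what the abstract states: expression values are replaced by within-sample ranks, the principal components are computed on the reference dataset only, and the external samples are then projected onto those components without influencing them. The helper names (`rank_rows`, `fit_pca`, `project_passive`) and the simulated data are hypothetical, and the paper's actual ranking and scaling choices may differ.

```python
import numpy as np

def rank_rows(X):
    """Replace each sample's (row's) values by their within-sample ranks 1..p.
    Assumption: simple ordinal ranks, ties broken by position."""
    return np.argsort(np.argsort(X, axis=1), axis=1).astype(float) + 1.0

def fit_pca(R, n_components=2):
    """PCA via SVD of the column-centred matrix; returns gene means and loadings."""
    mean = R.mean(axis=0)
    _, _, vt = np.linalg.svd(R - mean, full_matrices=False)
    return mean, vt[:n_components].T  # loadings: genes x components

def project_passive(R_new, mean, loadings):
    """Project supplementary samples onto fixed components (they do not alter the axes)."""
    return (R_new - mean) @ loadings

# Simulated reference data (an assumption): 23 samples x 76 genes.
rng = np.random.default_rng(0)
reference = rng.normal(size=(23, 76))

# Simulated "external" copies of the first five samples with a monotone
# batch distortion (rescaling and shift), which leaves within-sample ranks unchanged.
external = 3.0 * reference[:5] + 10.0

mean, loadings = fit_pca(rank_rows(reference))
reference_scores = project_passive(rank_rows(reference), mean, loadings)
passive_scores = project_passive(rank_rows(external), mean, loadings)

# With a purely monotone distortion the passive points land on the originals.
print(np.allclose(passive_scores, reference_scores[:5]))
```

The design choice illustrated here is that ranks are invariant to any monotone per-sample distortion, so a batch effect of that kind does not move the projected points. Real batch effects are rarely this well behaved, which is why this is only a sketch of the principle.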

Results are presented from a study involving 23 tumor cell lines and 76 genes related to a specific biological pathway, derived from a publicly available dataset.

Published

2012-12-20

How to Cite

Bassani, N., Ambrogi, F., Coradini, D., Boracchi, P., & Biganzoli, E. (2012). Validation of Gene Expression Profiles in Genomic Data through Complementary Use of Cluster Analysis and PCA-Related Biplots. International Journal of Statistics in Medical Research, 1(2), 162–173. https://doi.org/10.6000/1929-6029.2012.01.02.09

Section

General Articles