  • Review Article

Reuse of public genome-wide gene expression data

Key Points

  • Over the past decade, high-throughput gene expression experiments have generated data from millions of assays. Data sets linked to publications are stored in functional genomics data archives: ArrayExpress at the European Bioinformatics Institute, Gene Expression Omnibus at the US National Center for Biotechnology Information and the Omics Archive at the DNA Data Bank of Japan.

  • Secondary added-value and topical databases process data from the primary archives, adding analysis and annotation to make these data accessible to every biologist by allowing queries such as 'in which tissue is a particular gene expressed?' or 'which genes are differentially expressed between a particular disease and normal samples?'

  • Public gene expression data are commonly reused to study biological questions, both by reanalysis of primary data and by queries to secondary resources. Approximately half of the studies that use public gene expression data rely solely on existing data without adding newly generated data, and the other half use the public data in combination with new data.

  • The reproducibility of published microarray-based studies is limited, mostly owing to insufficient experiment annotation and sometimes to unavailability of the raw or processed data. Stricter enforcement of Minimum Information About a Microarray Experiment (MIAME) requirements and the development of easy-to-use experiment annotation tools are needed to achieve better reproducibility.

  • Although most public gene expression data are still based on microarray experiments, the contribution of high-throughput-sequencing-based expression studies, known as RNA sequencing (RNA-seq), is growing rapidly.

  • Reuse of RNA-seq data can potentially be even more valuable than reuse of microarray data, partly owing to the costs of experiments and data storage but, even more importantly, because of the more quantitative nature of sequencing-based expression data. Community standards such as Minimum Information about Sequencing Experiments (MINSEQE) should be adopted to make RNA-seq data maximally reusable.

  • The bioinformatics resources that store and manage public data are sensitive to short-term funding changes, which complicates the maintenance of important databases. The development of long-term infrastructure in bioinformatics, such as the ELIXIR project in Europe, is needed to ensure the long-term availability of public data.

Abstract

Our understanding of gene expression has changed dramatically over the past decade, largely catalysed by technological developments. High-throughput experiments — microarrays and next-generation sequencing — have generated large amounts of genome-wide gene expression data that are collected in public archives. Added-value databases process, analyse and annotate these data further to make them accessible to every biologist. In this Review, we discuss the utility of the gene expression data that are in the public domain and how researchers are making use of these data. Reuse of public data can be very powerful, but there are many obstacles in data preparation and analysis and in the interpretation of the results. We will discuss these challenges and provide recommendations that we believe can improve the utility of such data.

Acknowledgements

We would like to thank H. Parkinson and U. Sarkans for useful comments and help in analysing ArrayExpress statistics. The work was partly funded by the European Community's FP7 HEALTH grants ENGAGE (grant agreement 201413), SYBARIS (grant agreement 242220) and EurocanPlatform (grant agreement 260791).

Author information

Corresponding author

Correspondence to Alvis Brazma.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Related links

FURTHER INFORMATION

Alvis Brazma's homepage

ArrayExpress

FGED — MINSEQE

Gene Expression Atlas

Gene Expression Omnibus

MIAME

Glossary

Microarray

A solid surface slide to which a collection of microscopic DNA spots representing specific DNA sequences of genomic regions is attached and to which sample DNA fragments can hybridize. Microarrays are used to measure the expression levels of large numbers of genes simultaneously, to genotype multiple regions of a genome or for other high-throughput assays.

Minimum Information About a Microarray Experiment

(MIAME). A guideline for information that is necessary for the unambiguous interpretation of the results of the experiment, potentially allowing the reproduction of the experiment. MIAME postulates that raw and processed data, sample annotation, array feature annotation, relationship between the samples used in the experiment, arrays and data files, the overall description of the experiment and experimental variables must be given in a usable format to make the results of a microarray experiment interpretable.

Gene Expression Omnibus

(GEO). A public functional genomics data repository at the US National Center for Biotechnology Information that supports MIAME-compliant data submissions and accepts array- and sequence-based data.

ArrayExpress

A MIAME-compliant archive of functional genomics data at the European Bioinformatics Institute. It is one of the international public data archives recommended by scientific journals for depositions of microarray or high-throughput sequencing data related to publications.

High-throughput sequencing

DNA sequencing technologies that parallelize the sequencing operations, thus achieving several orders of magnitude higher throughput than the traditional sequencing methods based on processes invented by Fred Sanger.

RNA sequencing

(RNA-seq). The application of high-throughput sequencing technologies to cDNA molecules obtained by reverse transcription from RNA, or to RNA directly, in order to get information about the RNA content of a sample.

Meta-analysis

Refers to methods focused on contrasting and combining results from different studies to identify common patterns and to improve the signal in data by combining multiple studies.
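
As an illustration of combining results across studies, the following is a minimal sketch of Fisher's combined probability test, one common p-value combination approach. It is not necessarily the method used by any study discussed in this Review. The function name `fisher_combined_p` is hypothetical, and the sketch assumes that the per-study p-values for a gene come from independent studies testing the same hypothesis.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent per-study p-values for one gene (Fisher's method).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis, where k is the
    number of studies. For even degrees of freedom the upper tail has a
    closed form, so no external statistics library is needed.
    """
    if not p_values or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # Upper tail of chi-square with 2k d.f.: exp(-h) * sum_{i<k} h^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical example: the same gene is borderline in two separate studies.
combined = fisher_combined_p([0.05, 0.05])
print(f"combined p-value: {combined:.4f}")
```

Two borderline results reinforce each other here, which is the sense in which combining studies can improve the signal. The independence assumption matters: studies that share samples or controls would make the combined value too optimistic.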

Normalization

In relation to microarray and other high-throughput data, normalization usually refers to data transformations that remove systematic noise and that make data combined from several assays mutually comparable.
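
To make the idea of rendering assays mutually comparable concrete, here is a minimal sketch of quantile normalization, a widely used technique for microarray data. It is shown only as one example of such a transformation and is not a procedure prescribed by this Review. The function name `quantile_normalize` and the toy values are hypothetical, and ties are broken by position for simplicity.

```python
from statistics import mean


def quantile_normalize(assays):
    """Force every assay onto the same empirical distribution.

    `assays` is a list of assays, each a list of expression values for the
    same genes in the same order. The k-th smallest value of each assay is
    replaced by the mean of the k-th smallest values across all assays.
    """
    n_genes = len(assays[0])
    if any(len(assay) != n_genes for assay in assays):
        raise ValueError("all assays must cover the same genes")

    # Reference distribution: rank-wise mean over the sorted assays.
    sorted_assays = [sorted(assay) for assay in assays]
    reference = [mean(column) for column in zip(*sorted_assays)]

    normalized = []
    for assay in assays:
        # Gene indices ordered from lowest to highest value in this assay.
        order = sorted(range(n_genes), key=lambda gene: assay[gene])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


# Two toy assays measured on different intensity scales.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

After the transformation each assay contains exactly the same set of values, and only the ranking of genes within an assay is preserved. This removes assay-wide intensity differences, but it also assumes that the overall distribution of expression should be the same in every sample.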

Minimum Information about a Sequencing Experiment

(MINSEQE). A formulation of the information that is necessary to interpret the results of a sequencing experiment unambiguously and potentially to reproduce the experiment. MINSEQE is an adaptation of the Minimum Information About a Microarray Experiment guidelines to RNA sequencing and other high-throughput-sequencing-based functional genomics experiments.

ELIXIR

A life sciences infrastructure project that unites Europe's leading life sciences organizations in managing and safeguarding the massive amounts of data being generated every day by publicly funded research.

About this article

Cite this article

Rung, J., Brazma, A. Reuse of public genome-wide gene expression data. Nat Rev Genet 14, 89–99 (2013). https://doi.org/10.1038/nrg3394

  • DOI: https://doi.org/10.1038/nrg3394
