Gene expression data are becoming available at a rapid rate in today’s “Big Data” world, allowing large-scale integration and comparison of datasets. The aim of this project is to perform a meta-analysis, in the context of cancer, of data generated by the SciLifeLab NGI and data available in the Gene Expression Omnibus. We aim to characterize the cell line populations used in RNA-sequencing experiments and to explore the effect of genetic variation on gene expression.
Analyses performed across laboratories make it possible to compare in-house data with public datasets, allowing researchers to set their results against existing information in a biologically meaningful way and to increase the statistical power of their analyses. A growing number of researchers use publicly available expression data in this way, but the accuracy of such comparisons has yet to be assessed on a large scale. An increasingly apparent problem is whether datasets are comparable at all, both in the equivalence of the biological samples and in the quality of the data generated. To address this problem, we developed a method that performs transcriptome-wide comparison of datasets in a pairwise manner. The method assesses the comparability of datasets by interrogating the equivalence of biological samples with respect to genetic heterogeneity and single nucleotide variations, and by evaluating RNA-seq data quality in terms of sequencing coverage and depth. Currently, we compare RNA-seq datasets obtained from cells cultivated in different laboratories and analyse the impact of genetic variation and data quality on global gene expression and on the expression of cancer driver genes.
Our approach aims to provide a means of assessing the comparability of past, current and future experiments and to facilitate meta-analysis of RNA-seq datasets within systems biology. Furthermore, analysis of the sequence information in these datasets will enable characterization of cell line populations and increase our knowledge of the expressed genetic alterations that are associated with the propagation of cell lines across laboratories.