A framework for terabase metagenomics


SNIC 2017/1-164


SNAC Medium

Principal Investigator:

Erik Kristiansson


Chalmers tekniska högskola

Start Date:


End Date:


Primary Classification:

10203: Bioinformatics (Computational Biology) (applications under 10610)

Secondary Classification:

10606: Microbiology (microbiology in the medical area under 30109)

Tertiary Classification:

30109: Microbiology in the medical area





Over the last decade, metagenomics has been used in many applications to elucidate the functional content of, and the interplay between species in, complex systems that were previously inaccessible through microbial cultivation (Handelsman 2004). Modern metagenomic studies employ sequencing technologies (the latest versions are commonly labeled “next-generation” or “second-generation”) that can produce very large amounts of DNA sequence data. For example, the latest Illumina HiSeq 2500 sequencing machine can produce 600 gigabases (billions of DNA nucleotide bases) of sequence in one run. Although the average environmental bacterial genome is approximately 2 million nucleotides long, the sequence information acquired from these technologies is fragmented, with common fragment lengths of 100-700 nucleotides depending on the sequencing technology used.

Recent high-profile research projects focusing on the microbial inhabitants of our planet are about to create very large public metagenomic datasets. The European MetaHIT consortium, for example, has produced one of the largest public metagenomes to date (Qin et al. 2010), with a size of approximately 500 gigabases. Other projects are on the horizon, such as the Human Microbiome Project (Turnbaugh et al. 2007) and the Earth Microbiome Project (Gilbert et al. 2011), which aim to release 10 terabases of data during the coming years. These projects are the poster children of the approaching era of terabase metagenomics.

The aim of this project is to establish suitable methods for terabyte-scale analysis of DNA sequences in the form of raw high-throughput sequencing output (i.e. “reads”). A distributed framework for terabase metagenomic sequence analysis must be created because no complete off-the-shelf solution exists, even though such a framework is essential for the types of analyses described in this application. The framework will provide easy-to-use, parallelized analysis and gene detection in terabase metagenomics.
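
To give a feel for the scale implied by the figures quoted above, the following is a minimal back-of-envelope sketch in Python. It uses only the numbers stated in the text (600 gigabases per run, reads of 100-700 nucleotides, genomes of about 2 million nucleotides, 10 terabases of planned data). The function names `reads_per_run` and `genome_equivalents` are hypothetical and chosen for illustration only; they are not part of any existing tool or of the proposed framework.

```python
# Back-of-envelope scale estimates from the figures quoted in the text.
# All names here are illustrative assumptions, not part of any real tool.

GIGABASE = 10**9
TERABASE = 10**12


def reads_per_run(run_bases: int, read_length: int) -> int:
    """Number of whole reads of a fixed length that make up run_bases."""
    return run_bases // read_length


def genome_equivalents(total_bases: int, genome_length: int) -> float:
    """How many genomes of genome_length the data volume corresponds to."""
    return total_bases / genome_length


if __name__ == "__main__":
    run_bases = 600 * GIGABASE  # one sequencing run, as stated in the text
    for read_length in (100, 700):  # range of fragment lengths in the text
        print(read_length, reads_per_run(run_bases, read_length))

    # 10 terabases expressed in units of a ~2 million nucleotide genome
    print(genome_equivalents(10 * TERABASE, 2_000_000))
```

Under these assumptions a single run corresponds to somewhere between roughly 0.86 billion and 6 billion reads, and 10 terabases corresponds to about 5 million average-sized bacterial genomes, which is why a distributed, parallelized approach is needed.
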
A special application of a terabase metagenomics framework is the detection of antibiotic resistance genes in the environment. Antibiotics have been used extensively since their discovery in the early 20th century and have greatly improved both human and veterinary medicine; today the modern health-care system is totally dependent on them. Many modern broad-spectrum antibiotics are losing their effect as bacteria improve their ability to resist increased levels of exposure. One hypothesis is that many genetic building blocks for antibiotic resistance are available in the environment, and that the coming terabase metagenomic projects will bring the increased sequencing coverage required to detect fragmented precursors of antibiotic resistance genes there.

References:

Gilbert, J. et al., 2011. The Earth Microbiome Project: The Meeting Report for the 1st International Earth Microbiome Project Conference, Shenzhen, China, June 13th-15th 2011. Standards in Genomic Sciences, 5(2).
Handelsman, J., 2004. Metagenomics: application of genomics to uncultured microorganisms. Microbiology and Molecular Biology Reviews, 68(4).
Qin, J. et al., 2010. A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464(7285).
Turnbaugh, P.J. et al., 2007. The human microbiome project. Nature, 449(7164).