De novo assembly of large and complex plant genomes using short-read data

SNIC 2018/3-450


SNAC Medium

Principal Investigator:

Nick Sirijovski


Lunds universitet

Start Date:


End Date:


Primary Classification:

10610: Bioinformatics and Systems Biology (methods development to be 10203)

Secondary Classification:

10609: Genetics (medical to be 30107 and agricultural to be 40402)

Tertiary Classification:

10602: Biochemistry and Molecular Biology



Note: This is a continuation of SNIC 2018/8-99. We could not finish the pipeline on SNIC resources and had to transfer intermediate results from one part of the pipeline to another server to complete the run. We were in contact with support but could not solve the problem during the project. I now have a PhD student who will try to solve the problem and complete what we set out to do on SNIC resources.
Genome assembly of complex plant genomes is an extremely difficult task due to their large size and repetitive nature. Despite the advances in long-read sequencing technologies, it is of interest to pursue genome assembly using the more accurate Illumina short-read technology. At this stage only one group in the world can assemble complex genomes from Illumina data alone, and they have developed this IP into a business called NRGene. The oat research center 'ScanOats' contracted NRGene to perform de novo assembly of the 12.5 Gbp hexaploid oat genome, which they delivered within 2 months of receiving the data. This was very expensive and only possible thanks to the 100 MSEK SSF grant we received. The assembly statistics were outstanding: an N50 of 17 Mbp, an N90 of 2.8 Mbp, an assembly size of 11 Gbp, and 98.2% complete BUSCOs.
The question is: now that we have the excellent NRGene assembly and the raw data, can we develop our own assembly pipeline to reproduce the oat genome assembly? Why do this? There are many reasons. No one in Sweden can perform de novo assembly of plant genomes with such speed and accuracy. We are interested in creating the oat pangenome, which will require de novo assembly of four additional oat varieties that differ from each other, and we cannot afford to pay NRGene for another assembly. Furthermore, if this pipeline is successful on the Swedish infrastructure, we can perform additional genome assemblies for complex plant genomes under investigation at Lund University, which would open up a whole new area of plant research in Sweden.
How will we do this? In collaboration with our German colleagues from the International Wheat Genome Sequencing Consortium, we would like to evaluate a novel de novo genome assembly pipeline on our existing Illumina dataset. This pipeline has already been evaluated on wheat genome data by our German collaborators, who are leaders in the field of polyploid plant genomics, and they have achieved de novo assembly statistics comparable to the results achieved by NRGene.
Are we ready to go now? Yes. We have all the raw data necessary as input. The raw reads represent 270x coverage of the 12.5 Gbp hexaploid oat genome. The pipeline is extremely CPU-intensive, and a full run would take approximately 2 months. We have asked for a 6-month project duration, which should allow at least two full iterations.
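
For readers unfamiliar with the figures quoted above, the following is a minimal sketch of how N50/N90 and the raw data volume implied by a coverage figure are commonly computed. It assumes the usual definition of Nx (the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least x% of the assembly). The function names `nx` and `raw_bases_from_coverage` and the toy lengths are hypothetical and for illustration only; they are not part of the proposed pipeline and are not oat data.

```python
from typing import Iterable


def nx(lengths: Iterable[int], fraction: float) -> int:
    """Return the Nx statistic (e.g. fraction=0.5 for N50, 0.9 for N90).

    Assumes the common definition: sort sequences from longest to shortest
    and return the length at which the running total first reaches
    `fraction` of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    threshold = fraction * sum(ordered)
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= threshold:
            return length
    return ordered[-1]  # only reachable through floating-point rounding


def raw_bases_from_coverage(coverage: int, genome_size_bp: int) -> int:
    """Total sequenced bases implied by a mean coverage depth."""
    return coverage * genome_size_bp


# Toy scaffold lengths (illustrative only, not from the oat assembly).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
n50 = nx(toy_lengths, 0.5)
n90 = nx(toy_lengths, 0.9)

# Figures from the text: 270x coverage of a 12.5 Gbp genome.
total_raw_bases = raw_bases_from_coverage(270, 12_500_000_000)
print(n50, n90, total_raw_bases)
```

Under these assumptions, 270x coverage of a 12.5 Gbp genome corresponds to roughly 3.4 Tbp of raw read data as input, which gives a sense of why the CPU and storage requirements are high.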