Investigating reference bias

SNIC 2018/8-106


SNAC Small

Principal Investigator:

Torsten Günther


Uppsala universitet

Start Date:


End Date:


Primary Classification:

10610: Bioinformatics and Systems Biology (methods development to be 10203)




High-quality reference genomes are an important resource in genomic research projects. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences in sequencing libraries built from ancient human remains. The linear human reference genome represents a single haploid sequence carrying only one allele at each variant site. As a consequence, DNA fragments carrying the reference allele map disproportionately often, or with higher quality scores, compared with fragments carrying an alternative allele. This reference bias can affect downstream population genomic analyses when heterozygous sites are falsely considered homozygous for the reference allele. Because of poor DNA preservation, ancient DNA studies usually operate at low sequencing coverage, where a variant site is often covered by only a single sequencing read. Furthermore, fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, which reduces the number of mismatches between reference and sequenced read that can be accepted. These properties specific to ancient DNA limit how reliably the allelic state of an individual can be called and may exacerbate the impact of reference bias on downstream analyses. This project will investigate reference bias in published ancient DNA sequence data from prehistoric populations. Comparing different strategies for mapping and data filtering will illustrate how reference bias can influence the interpretation of results from approaches based on allele sharing.
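
The effect described above can be illustrated with a small calculation. The following is a minimal sketch, not the project's method: it assumes that at a truly heterozygous site the reads carry each allele with equal probability before mapping, and that reads carrying the reference or the alternative allele are retained by the mapper at two different rates. The function names and the mapping rates in the usage lines are hypothetical and chosen only for illustration.

```python
def expected_ref_fraction(map_rate_ref: float, map_rate_alt: float) -> float:
    """Expected share of mapped reads carrying the reference allele
    at a heterozygous site, given per-allele mapping success rates.

    Assumption: before mapping, reads carry either allele with
    probability 0.5, so the 0.5 factors cancel in the ratio.
    """
    for rate in (map_rate_ref, map_rate_alt):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("mapping rates must lie in [0, 1]")
    total = map_rate_ref + map_rate_alt
    if total == 0.0:
        raise ValueError("at least one allele must be mappable")
    return map_rate_ref / total


def prob_looks_homozygous_ref(map_rate_ref: float,
                              map_rate_alt: float,
                              depth: int) -> float:
    """Probability that all `depth` mapped reads at a heterozygous site
    carry the reference allele, so the site appears homozygous reference.

    Assumption: reads are sampled independently.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return expected_ref_fraction(map_rate_ref, map_rate_alt) ** depth


# Hypothetical mapping rates, for illustration only.
unbiased = prob_looks_homozygous_ref(1.0, 1.0, depth=1)
biased = prob_looks_homozygous_ref(0.98, 0.90, depth=1)
print(f"single read, no bias:   {unbiased:.3f}")
print(f"single read, with bias: {biased:.3f}")
```

Under these assumptions, any difference in mapping success between the two alleles moves the probability of observing the reference allele away from one half, and at a coverage of a single read this shift carries over directly to the called allele.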