Generative adversarial network for nucleotide sequence simulation
Simulation of DNA/RNA massively parallel sequencing (MPS) reads. Supervised learning is a valuable tool for variant discovery in MPS datasets, but obtaining enough labeled training data is not always possible. Typically one relies on genotyping data or on projects such as the 1000 Genomes Project, which catalog common variation (> 1% allele frequency). These resources do not help with singletons. Simulated data is an option for training machine learning algorithms, but it is often considered undesirable: simulated data rarely captures the full nuance of real data, so algorithms trained on it tend to overfit badly and have limited utility on real data. We propose a method that, given an existing dataset, realistically simulates a new MPS dataset (Illumina and/or other technologies) containing user-specified known variation, using a generative adversarial network (GAN). Although initially framed as a variant discovery project, we intend to extend the project to cover simulation of genomic DNA sequencing data as well as RNA-sequencing data, using the same GAN-based approach.
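To make "user-specified known variation" concrete, here is a minimal sketch of the naive baseline that a GAN-based simulator would aim to improve on: substituting a known single-nucleotide variant directly into existing reads. This is not the proposed GAN method. The function name `spike_in_snv`, the `(start, sequence)` read representation, and the assumption of gapless alignments are all hypothetical choices made for illustration only.

```python
import random


def spike_in_snv(reads, pos, alt, allele_fraction, seed=0):
    """Naively introduce a known SNV into a set of aligned reads.

    Assumptions (illustrative only, not part of the proposal):
      - reads is a list of (start, sequence) tuples, where start is the
        0-based reference coordinate of the first base of the read
      - reads are aligned without indels or clipping
      - alt is a single base; each overlapping read independently carries
        it with probability allele_fraction
    """
    rng = random.Random(seed)  # seeded for reproducibility
    simulated = []
    for start, seq in reads:
        offset = pos - start  # position of the variant within this read
        overlaps = 0 <= offset < len(seq)
        if overlaps and rng.random() < allele_fraction:
            seq = seq[:offset] + alt + seq[offset + 1:]
        simulated.append((start, seq))
    return simulated


if __name__ == "__main__":
    example_reads = [(0, "ACGTACGT"), (4, "ACGTACGT"), (20, "ACGTACGT")]
    # Simulate a heterozygous-like variant at reference position 5.
    for start, seq in spike_in_snv(example_reads, pos=5, alt="T",
                                   allele_fraction=0.5):
        print(start, seq)
```

A direct substitution like this leaves base qualities, error profiles, and alignment context untouched, which is the kind of nuance simulated data tends to miss and the motivation for a learned generator.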