Variant discovery using long-read data across SRA

Background

Single-nucleotide polymorphisms (SNPs) across SARS-CoV-2 genomes are critical to understanding the virus's molecular biology. They are also important for public health interventions against COVID-19, the disease resulting from SARS-CoV-2 infection.

What's the problem?

SARS-CoV-2 variation data can be visualized with existing tools, but these tools are designed mostly for short reads. This limits the amount and kind of information that can be obtained from SARS-CoV-2 genomic data analysis.

Solution to the problem

This project therefore aims to develop a pipeline for the discovery and visualization of SNPs in long-read SARS-CoV-2 sequences obtained from the Sequence Read Archive (SRA). These long reads come primarily from experiments performed using PacBio Single Molecule, Real-Time (SMRT) and Oxford Nanopore sequencing technologies.

Goal

Our main immediate goal is to provide variants, metadata, and annotations in JSON format. The SARS2 Variation viewer can then display specific regions, proteins, etc.

Architecture for the solution

SARS-CoV-2 long reads (SRA) mapped to the SARS-CoV-2 reference genome (RefSeq) +++++> Alignments +++++> Identify SNPs +++++> Visualize SARS-CoV-2 SNPs
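
As a rough illustration of the mapping stage in this architecture, the sketch below builds command lines for aligning one run to the reference and sorting the result into a BAM file. It assumes minimap2 (with its `map-ont` and `map-pb` presets) and samtools are the tools in use, which this README does not confirm. The accession `SRR0000000` and the `fastq/` and `bam/` directories are placeholders; the reference path is the one listed in the Workflow section. Nothing is executed here: the functions only return argument lists.

```python
import shlex
from pathlib import Path
from typing import List

# Reference path as listed in the Workflow section of this README.
REFERENCE = Path("data/ref/sars2_ref_sequence.fasta")

# minimap2 presets for the two sequencing platforms
# (assumption: minimap2 is the aligner).
PRESETS = {"nanopore": "map-ont", "pacbio": "map-pb"}


def align_cmd(fastq: Path, platform: str) -> List[str]:
    """Build a minimap2 command that writes SAM alignments to stdout."""
    return ["minimap2", "-a", "-x", PRESETS[platform], str(REFERENCE), str(fastq)]


def sort_cmd(bam: Path) -> List[str]:
    """Build a samtools command that reads SAM from stdin and writes a sorted BAM."""
    return ["samtools", "sort", "-o", str(bam), "-"]


def shell_pipeline(accession: str, platform: str) -> str:
    """Join both steps into one shell pipeline string for a single run."""
    fastq = Path("fastq") / f"{accession}.fastq"  # placeholder directory layout
    bam = Path("bam") / f"{accession}.bam"        # placeholder directory layout
    return f"{shlex.join(align_cmd(fastq, platform))} | {shlex.join(sort_cmd(bam))}"


if __name__ == "__main__":
    # SRR0000000 is a placeholder accession, not a real run.
    print(shell_pipeline("SRR0000000", "nanopore"))
```

Keeping the commands as argument lists makes it easy to pass them to `subprocess` later or to embed them in a workflow rule, and the platform lookup fails loudly (with a `KeyError`) for a run whose platform is neither of the two named above.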

Workflow

Download Reference SARS-CoV-2 Genome from RefSeq (data/ref/sars2_ref_sequence.fasta)

Query the Sequence Read Archive (SRA) to find long-read datasets for SARS-CoV-2

Generate a TSV file with all accessions returned by the long-read query (18,966 accessions in data/long_reads_SARS2.tsv)

Download some SARS-CoV-2 FASTQ files from the Sequence Read Archive using a few of the accessions above (7 FASTQ files in the test dataset)

Use Minimap to align these FASTQ files to the reference SARS-CoV-2 genome

Use DeepVariant to generate VCFs from the alignments for each of the samples

Assemble the reads with Canu

Analyse the VCFs and look for SNPs based on the alignments

Correlate SNPs with SARS-CoV-2 genome metadata

Visualize SNPs and associated SARS-CoV-2 metadata
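
The step that looks for SNPs in the VCFs can be pictured with the minimal sketch below. It is not the project's code: it assumes plain-text (uncompressed) VCF input, keeps only single-base substitutions, and converts the 1-based VCF position to the 0-based `start` used in the JSON template of this README. The function name `snps_from_vcf` and the two example VCF data lines are made up for illustration.

```python
from typing import Dict, Iterable, Iterator

BASES = set("ACGT")


def snps_from_vcf(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one record per single-base substitution found in VCF text lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if ref not in BASES:
            continue  # reference allele is not a single base, so not a SNP
        start = int(pos) - 1  # VCF POS is 1-based; the template uses 0-based starts
        for allele in alt.split(","):
            if allele in BASES and allele != ref:
                yield {
                    "start": start,
                    "stop": start + 1,
                    "reference_sequence": ref,
                    "allele": allele,
                    "spdi": f"{chrom}:{start}:{ref}:{allele}",
                }


# Illustrative input only: two invented data lines under a standard VCF header.
EXAMPLE_VCF = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "NC_045512.2\t9561\t.\tC\tT\t50\tPASS\t.",    # substitution: kept
    "NC_045512.2\t11074\t.\tCT\tC\t40\tPASS\t.",  # deletion: skipped
]

if __name__ == "__main__":
    for snp in snps_from_vcf(EXAMPLE_VCF):
        print(snp)
```

The metadata fields of the template (host, collection date, collection location) would still have to be joined in from the SRA run metadata, which this sketch does not attempt.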

Note:

Outputs from our pipeline are JSON files similar to the one below.

Template for variant data (JSON)

{
     "start": 9560,
     "stop": 9561,
     "reference_sequence": "C",
     "alleles": [
       {
         "allele": "T",
         "count": 6,
         "spdi": "NC_045512.2:9560:C:T",
         "Host": [
           {
             "value": "Homo sapiens",
             "count": 6
           }
         ],
         "Collection Date": [
           {
             "value": "2020-01-11",
             "count": 1
           },
           ...
         ],
         "Collection Location": [
           {
             "value": "USA: CA/North America",
             "count": 5
           },
           {
             "value": "China/Asia",
             "count": 1
           }
         ],
         "codon": "TTA",
         "amino_acid": "L",
         "protein_variant": "S336L",
         "aa_type": "non_synonymous"
       }
     ],
     "protein_name": "nsp4",
     "protein_accession": "YP_009724389.1",
     "protein_position": 336,
     "offset": 1,
     "codon": "TCA",
     "amino_acid": "S"
   },
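
To show how the fields in this template relate to each other, the sketch below recomputes the SPDI string, the alternate codon, and the protein variant label from the other fields of a record. It uses a cut-down copy of the record above with the metadata lists omitted. Treating `offset` as the 0-based position of the changed base within the codon is an assumption that happens to match this one example; `derive_fields` is a hypothetical helper, not part of the pipeline.

```python
import json

# Cut-down copy of the template record (metadata lists omitted).
RECORD_JSON = """
{
  "start": 9560,
  "stop": 9561,
  "reference_sequence": "C",
  "alleles": [
    {"allele": "T", "count": 6, "spdi": "NC_045512.2:9560:C:T",
     "codon": "TTA", "amino_acid": "L", "protein_variant": "S336L"}
  ],
  "protein_name": "nsp4",
  "protein_position": 336,
  "offset": 1,
  "codon": "TCA",
  "amino_acid": "S"
}
"""


def derive_fields(record, allele, accession="NC_045512.2"):
    """Recompute allele-level fields from the record-level fields."""
    ref_codon = record["codon"]
    offset = record["offset"]  # assumed: 0-based index of the SNP within the codon
    alt_codon = ref_codon[:offset] + allele["allele"] + ref_codon[offset + 1:]
    return {
        "spdi": f'{accession}:{record["start"]}:{record["reference_sequence"]}:{allele["allele"]}',
        "codon": alt_codon,
        "protein_variant": f'{record["amino_acid"]}{record["protein_position"]}{allele["amino_acid"]}',
    }


record = json.loads(RECORD_JSON)
for allele in record["alleles"]:
    derived = derive_fields(record, allele)
    # Compare each recomputed value with the value stored in the record.
    for key, value in derived.items():
        print(key, value, "matches" if allele[key] == value else "differs")
```

A check like this can serve as a cheap consistency test on generated records before they are handed to a viewer.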

Our pipeline produced the following products:

Alignments (BAM files)

Vigor4 annotations (GFF and peptides)

Assemblies (contig fasta)

nsps search results

Some Longread Variant Callers

NanoCaller, DeepVariant, LongShot, Clair, Medaka

Research Questions

Using the data from about 20,000 runs in this project, we want to answer the following research questions:

What are the top 10 hotspots for SNPs in SARS-CoV-2?
What are the most common locations for SNPs among US samples?
Which genes have the least number of SNPs across all samples?
Based on date of sample collection, where are the SNP hotspots?
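
Once the per-position JSON records exist, the first two questions reduce to counting. The sketch below ranks positions by the summed allele `count`, optionally restricted to samples whose `Collection Location` starts with a given prefix. It assumes records shaped like the template above; `top_hotspots` is a hypothetical helper, and the second record in `RECORDS` is toy data, not a result.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def top_hotspots(records: List[Dict], n: int = 10,
                 location_prefix: Optional[str] = None) -> List[Tuple[int, int]]:
    """Return up to n (start, sample_count) pairs, most frequent first."""
    totals: Counter = Counter()
    for record in records:
        for allele in record["alleles"]:
            if location_prefix is None:
                # No filter: every sample carrying this allele counts.
                totals[record["start"]] += allele["count"]
            else:
                # Filter: count only samples from matching locations.
                for loc in allele.get("Collection Location", []):
                    if loc["value"].startswith(location_prefix):
                        totals[record["start"]] += loc["count"]
    return totals.most_common(n)


RECORDS = [
    {   # counts taken from the template record in this README
        "start": 9560,
        "alleles": [{"allele": "T", "count": 6, "Collection Location": [
            {"value": "USA: CA/North America", "count": 5},
            {"value": "China/Asia", "count": 1}]}],
    },
    {   # toy record, invented for this example
        "start": 100,
        "alleles": [{"allele": "A", "count": 2, "Collection Location": [
            {"value": "China/Asia", "count": 2}]}],
    },
]

if __name__ == "__main__":
    print(top_hotspots(RECORDS))                         # all samples
    print(top_hotspots(RECORDS, location_prefix="USA"))  # US samples only
```

The same pattern extends to the date-based question by grouping on the `Collection Date` values instead of the location values.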

Challenges

We didn't have time to sort out the permissions needed to run on the cluster, so we had to stick with our local virtual machines.
fastq-dump took a long time and we were running out of disk space (we needed ~10 TB).
We could not visualize the SNPs because the viewer was not available.
The hardest part of our work was the waiting.
We ran into a few issues with Docker.

Next steps

We intend to continue working on this project after the CSHL codeathon 2020.

People/Team

Vadim Zalunin, NCBI/NIH, Maryland, MD
Vamsi Kodali, NCBI/NIH, Maryland, MD
Olaitan I. Awe, University of Ibadan
Brett Youtsey, Los Alamos National Laboratory
Weizhong Chang, NIH, Maryland, MD
Cory Weller
Xiaoli Jiao
