-
Notifications
You must be signed in to change notification settings - Fork 13
Tutorial
This tutorial assumes you have an Amazon Web Services account registered with the NMDP. This will grant you access to a public machine image with all the data, tools, and compute infrastructure you need to proceed. If you do not have these things, go here first.
If you have a GitHub account with a public key:
git clone [email protected]:nmdp-bioinformatics/pipeline.git
This will allow you to contribute changes, otherwise you may clone and use the pipeline anonymously:
git clone https://github.com/nmdp-bioinformatics/pipeline.git
Will create a local clone (working copy) of the GitHub repository, which contains several shell scripts for parallel execution of pipeline components.
Public sample data from the sequence read archive are provided here:
/opt/data/DRP000941/
Each file (73 total) contains phased NGS data for 6-locus HLA published by Hosomichi et al, 2013. The files must be decompressed from SRA format to FASTQ before processing. SRA provides tools for this purpose. The decompressed data are also provided in the fastq/ directory.
Be Careful, please do not process all of the SRA data on the NMDP-provided AMI, use the following command to process a single sample and then scale up as you wish using your own Amazon account or other high-performance compute environment.
From within your cloned pipeline directory:
./tutorial/prep_data_dir.bash ~/workdir
./splitter.bash ~/workdir
Processing the data will take several minutes, depending on your hardware instance. In the meantime, you may proceed using preprocessed results.
Clinical interpretation of HLA DNA sequence for transplantation is typically confined to the antigen recognition sites (ARS), which correspond to exons 2 and 3 or exon 2 of class I and class II HLA genes, respectively. The NMDP's interpretation service currently requires consensus sequences that are trimmed of sequence representing other structural elements (non-ARS exons, introns, promoters and other untranslated regions). We will use HLA-A results for a single homozygous sample DRR003809 (DKB) for illustrative purposes. The first step is to filter contigs by region, in this case exons 2 and 3 of HLA-A:
groovy -classpath /mnt/scratch/caleb/ngs-feature-1.0-SNAPSHOT/lib/ngs-feature-1.2-SNAPSHOT.jar:/mnt/scratch/caleb/ngs-tools-1.0-SNAPSHOT/lib/biojava.jar:/mnt/scratch/caleb/ngs-tools-1.0-SNAPSHOT/lib/picard-1.102.0.jar:/mnt/scratch/caleb/ngs-tools-1.0-SNAPSHOT/lib/guava-17.0.jar:/mnt/scratch/caleb/ngs-variant-1.0-SNAPSHOT/lib/ngs-variant-1.2-SNAPSHOT.jar /mnt/scratch/caleb/splice-bam.groovy -i /opt/data/DRP000941/results/DRR003809_1.fastq.contigs.bwa.sorted.bam -x /opt/nmdp/regions/clinical-exons/hla-a.txt -g HLA-A -m -c -b 0.5 > DKB.fasta
The parameters are as follows:
-classpath includes required Java dependencies
-i is the input aligned, consensus sequences
-x is the filter regions, in this case exons 2 and 3 of HLA-A
-g is a passthrough parameter that is used for FASTA headers
-m enforces strict merging of consensus contigs
-b is the minimum breadth-of-coverage filter
-c creates phased cDNA from exon-sequences on the same contig
You may validate that the consensus sequences were correctly filtered by uploading to the UCSC genome browser BLAT tool:
Finally, use the NMDP's interpretation service to assign nomenclature on the command-line. A simple web-interface is also available (not shown):
curl -T DKB.fasta http://interp.b12x.org/hla/api/rs/interp | awk '{split($0,a,"|"); print a[1]}'
The awk command limits output to the first assigned allele, resulting in:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 26011 0 25455 100 556 9232 201 0:00:02 0:00:02 --:--:-- 9232
HLA-A*24:02:01:01+HLA-A*24:02:01:01
Consistent with the known typing for cell line DKB.
With this tutorial completed you have some of the basic elements to construct an HML message. The following will demonstrate some additional tools and process that should facilitate creation of fully-compliant HML documents appropriate for submitting genetic data to the NMDP.
The following Groovy code snippet uses BioJava's FASTA reader to validate consensus sequences.
import java.io.BufferedReader
import java.io.File
import org.biojava.bio.BioException;
import org.biojava.bio.seq.io.SeqIOTools
import org.biojava.bio.seq.SequenceIterator
File file = new File("DKB.fasta")
BufferedReader reader = new BufferedReader(new FileReader(file))
for (SequenceIterator sequences = SeqIOTools.readFastaDNA(reader); sequences.hasNext(); ) {
try {
println "<sequence alphabet=\"DNA\">${sequences.nextSequence().seqString()}<\\sequence>"
}
catch(BioException error) {
println "invalid sequence: ${error}"
}
}
For other file formats, such as VCF, the feature parser has corresponding methods to validate DNA fields.
The following Groovy code snippet uses the feature parser to reformat HLA-A clinical exons into proper HML targeted-region tags.
import org.nmdp.ngs.feature.Locus
import org.nmdp.ngs.feature.parser.FeatureParser
def filename = "/opt/nmdp/regions/clinical-exons/hla-a.txt"
new File(filename).each { line ->
def (index, coordinate) = line.split("\t")
def locus = FeatureParser.parseLocus(coordinate)
println "<targeted-region \
\n assembly=\"GRCh38\" \
\n contig=\"${locus.getContig()}\" \
\n start=\"${locus.getMin()}\" \
\n end=\"${locus.getMax()}\" \
\n strand=\"1\" \
\n id=\"file://${filename}\" \
\n description=\"HLA-A exon ${index}\"/>"
}
This is as simple as placing the interpreted allele between tags such as:
<glstring>HLA-A*24:02:01:01+HLA-A*24:02:01:01</glstring>
Alternatively, the GL String may be registered using the GlClient java class from gl-client module
GlClient client = new JsonGlClient(...);
try {
String identifier = client.registerGenotypeList("HLA-A*24:02:01:01+HLA-A*24:02:01:01");
}
catch (GlClientException e) {
// handle error
}
gl-tools command line tools
$ gl-register-genotype-lists -h
usage:
gl-register-genotype-lists --namespace http://localhost:8080/gl/ [-g glstrings.txt] [-i identifiers.txt] [-t bearer-token] [-c json]
arguments:
-h, --help display help message [optional]
-s, --namespace [class java.lang.String] namespace [required]
-g, --glstrings [class java.io.File] glstring input file [optional]
-i, --identifiers [class java.io.File] identifier output file [optional]
-t, --bearer-token [class java.lang.String] OAuth 2.0 bearer token [optional]
-c, --client [class java.lang.String] client implementation, json or xml, default json [optional]
$ echo "HLA-A*24:02:01:01+HLA-A*24:02:01:01" | gl-register-genotype-lists -s https://gl.immunogenomics.org/imgt-hla/3.16.0/
http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2
or directly via HTTP using curl
$ curl --header "content-type: text/plain" \
--data "HLA-A*24:02:01:01+HLA-A*24:02:01:01" \
-X POST https://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list -v && echo
* Adding handle: conn: 0x7ffe22003a00
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7ffe22003a00) send_pipe: 1, recv_pipe: 0
* About to connect() to gl.immunogenomics.org port 443 (#0)
* Trying 54.214.9.114...
* Connected to gl.immunogenomics.org (54.214.9.114) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256
* Server certificate: gl.immunogenomics.org
* Server certificate: GeoTrust DV SSL CA
* Server certificate: GeoTrust Global CA
> POST /imgt-hla/3.16.0/genotype-list HTTP/1.1
> User-Agent: curl/7.30.0
> Host: gl.immunogenomics.org
> Accept: */*
> content-type: text/plain
> Content-Length: 35
>
* upload completely sent off: 35 out of 35 bytes
< HTTP/1.1 201 Created
< Content-Type: text/plain
< Date: Wed, 24 Sep 2014 19:33:35 GMT
< Location: http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Content-Length: 35
< Connection: keep-alive
<
* Connection #0 to host gl.immunogenomics.org left intact
HLA-A*24:02:01:01+HLA-A*24:02:01:01
Then include the returned identifier URI in the glstring element
<glstring uri="http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2"/>
- Home
- DaSH 17 (Prague) 2024
- DaSH 16 (Stanford) 2024
- DaSH 15 (Utrecht) 2024
- DaSH 14 (Oklahoma City) 2024
- DaSH 13 (Rochester) 2023
- DASH VRS (Virtual) 2022
- DASSH3 (Virtual) 2020
- DASH12 (Virtual) 2022
- DASSH4 (Virtual) 2021
- DASH11 (Virtual) 2021
- DASSH3 (Virtual) 2020
- DASH10 (Virtual) 2020
- DASH Validation (Minneapolis) 2020
- DaSSH 2 (Minneapolis) 2019
- DASH9 (Denver) 2019
- DASH8 (Baltimore) 2018
- DASSH FHIR (Minneapolis) 2018
- DASH7 (Utrecht) 2017
- DASH IHIWS (Stanford) 2017
- DASH6 (Heidelberg) 2017
- DASH5 (Berkeley) 2017
- DASH4 (Vienna) 2016
- DASH3 (Minneapolis) 2016
- DASH2 (La Jolla) 2015
- DASH1 (Bethesda) 2014
- Preparing for the Hackathon
- Tool access
- Tools
- Data
- Github help