Tutorial
This tutorial assumes you have an Amazon Web Services account or the administrative privileges necessary to install software within your local environment. The NMDP's public machine image will provide all the data, tools, and compute infrastructure you need to proceed. For more information, go here first.
If you have a GitHub account with a public key and would like to contribute your local changes, we encourage you to do so, but ask that you kindly follow our instructions. Otherwise, you may clone and use the pipeline anonymously:
$ git clone https://github.com/nmdp-bioinformatics/pipeline.git
This will create a local clone (working copy) of the GitHub repository, which contains several shell scripts for parallel execution of pipeline components.
Public sample data are provided within the pipeline parent directory:
tutorial/raw/
Each compressed file (10 total) contains simulated NGS data from a single IMGT-HLA reference. There are two files per homozygous sample (paired reads). The files must be decompressed before processing, for example:
gunzip tutorial/raw/*
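If you want to confirm the extraction before running the pipeline, a quick sanity check can count the decompressed read files: with five homozygous samples and paired reads, you should end up with ten files. The sketch below simulates the layout in a temporary directory because the sample and file names here are illustrative, not the exact names shipped with the tutorial data.

```shell
# Simulate the tutorial layout in a temporary directory:
# 5 samples x 2 paired-read files, gzipped. Names are illustrative.
raw_dir="$(mktemp -d)/raw"
mkdir -p "$raw_dir"
for sample in S1 S2 S3 S4 S5; do
    for read in R1 R2; do
        echo "@read" > "$raw_dir/${sample}_${read}.fq"
        gzip -f "$raw_dir/${sample}_${read}.fq"
    done
done

gunzip "$raw_dir"/*.gz                         # decompress everything in place
count=$(ls "$raw_dir"/*.fq | wc -l | tr -d ' ')
echo "decompressed files: $count"              # expect 10 (2 per homozygous sample)
```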
From within your cloned pipeline directory:
$ ./splitter.bash tutorial
Processing should only take a couple of minutes, depending on your hardware instance and available resources. Upon successful execution there will be several result files in the following directory:
tutorial/final
If not, the pipeline did not execute properly. The most likely failure results from providing an improper path to the NGS tools (or not installing them at all). Here's how to install them.
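A quick way to diagnose the tools problem is to check that each required executable is actually on your PATH. The tool list below is an assumption for illustration (bwa appears in the pipeline's output file names; samtools is a typical companion); adjust it to match the pipeline's actual dependencies.

```shell
# check_tools prints each named executable that is missing from PATH
# and returns the number of missing tools.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Illustrative tool list; edit to match your installation.
check_tools bwa samtools || echo "install the missing tools or fix your PATH"
```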
Clinical interpretation of HLA DNA sequence for transplantation is typically confined to the antigen recognition sites (ARS), which correspond to exons 2 and 3 of class I HLA genes and exon 2 of class II HLA genes. The NMDP's interpretation service currently requires consensus sequences that are trimmed of other structural elements (non-ARS exons, introns, promoters, and other untranslated regions).
More information about filtering and interpreting consensus sequences is here. For a quick test you can run the following script, which will filter and interpret the assembled consensus sequences generated by the pipeline:
./interpret.bash tutorial/verify/expected.txt tutorial/verify/observed.txt
NMDP has developed a simple tool that compares an expected results file with its observed counterpart (generated from the step above) to identify nomenclature-level matches for each sample at a defined resolution (the -r option).
ngs-validate-interpretation -e tutorial/verify/expected.txt -b tutorial/verify/observed.txt -r 2
After running the validation tool you should observe the following result, where 'PASS' indicates that the corresponding expected allele matched the observed allele for each sample. In this case there are two identical alleles per sample (homozygous).
PASS ./tutorial/final/HLA00664_RX.fq.contigs.bwa.sorted.bam HLA-DRB1*01:01:01
PASS ./tutorial/final/HLA00664_RX.fq.contigs.bwa.sorted.bam HLA-DRB1*01:01:01
PASS ./tutorial/final/HLA00401_RX.fq.contigs.bwa.sorted.bam HLA-C*01:02:01
PASS ./tutorial/final/HLA00401_RX.fq.contigs.bwa.sorted.bam HLA-C*01:02:01
PASS ./tutorial/final/HLA00132_RX.fq.contigs.bwa.sorted.bam HLA-B*07:02:01
PASS ./tutorial/final/HLA00132_RX.fq.contigs.bwa.sorted.bam HLA-B*07:02:01
PASS ./tutorial/final/HLA00622_RX.fq.contigs.bwa.sorted.bam HLA-DQB1*02:01:01
PASS ./tutorial/final/HLA00622_RX.fq.contigs.bwa.sorted.bam HLA-DQB1*02:01:01
PASS ./tutorial/final/HLA00001_RX.fq.contigs.bwa.sorted.bam HLA-A*01:01:01:01
PASS ./tutorial/final/HLA00001_RX.fq.contigs.bwa.sorted.bam HLA-A*01:01:01:01
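The resolution comparison can be sketched in plain shell: truncate each allele name to the first r colon-separated fields and compare the truncated strings. This is a simplification of what ngs-validate-interpretation does; the real tool also handles nomenclature-level equivalences.

```shell
# truncate_allele ALLELE R: keep only the first R colon-separated
# fields of an HLA allele name, e.g. HLA-DRB1*01:01:01 at R=2
# becomes HLA-DRB1*01:01.
truncate_allele() {
    echo "$1" | cut -d: -f1-"$2"
}

# Compare two alleles at resolution 2 (hypothetical observed allele
# differing only in the third field).
expected=$(truncate_allele "HLA-DRB1*01:01:01" 2)
observed=$(truncate_allele "HLA-DRB1*01:01:02" 2)
if [ "$expected" = "$observed" ]; then
    echo "PASS $expected"     # prints: PASS HLA-DRB1*01:01
else
    echo "FAIL expected=$expected observed=$observed"
fi
```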
With this tutorial completed, you have some of the basic elements needed to construct an HML message. The following sections demonstrate additional tools and processes that should facilitate the creation of fully compliant HML documents appropriate for submitting genetic data to the NMDP.
There are a few ways you can create a new HML message. Since HML is just XML and is based on an XML schema, you can generate HML directly from the HML schema. For the Hackathon, the public schema is located at http://schemas.nmdp.org/ and is currently at version 0.9.5 (beta). You can point a schema-aware text editor at this location and generate HML. Some examples of tools that do this are XMLSpy, jEdit, and most development IDEs such as Eclipse. For XMLSpy, here is the tutorial for generating an XML message from the HML schema: http://manual.altova.com/xmlspy/spyenterprise/index.html?generatesamplexmlfile.htm.
Auto-generating a message from a schema has a couple of drawbacks. Schemas often contain conditional or 'choice' elements where one element OR the other can be generated, but not both; some generators simply choose the first option, which may not match your data needs. Another drawback is that optional or abstract attributes and elements may or may not be generated, depending on your text editor. To work around these limitations, a command-line tool is included with the open-source NGS toolset built for the Hackathon. Based on simple command-line input, it creates a sample HML structure that matches your business needs.
ngs-tools command line tools
$ ngs-hml-generator
Enter HML version [0.9.5] > 0.9.5
Enter NMDP reporting center code like '567' > 678
Do you have typing data that refers to a typing test list? [Y] > y
Enter the 3-digit NMDP center-code to use for this sample. > 999
=== Select GENE-FAMILY ===
(1) HLA
(2) KIR
----------
1
...
Using this tool, you can create a reference HML message for any combination of typing methods and interpretations.
The following Groovy code snippet uses BioJava's FASTA reader to validate the consensus sequence generated above and reformat it in HML.
import java.io.BufferedReader
import java.io.File
import java.io.FileReader

import org.biojava.bio.BioException
import org.biojava.bio.seq.SequenceIterator
import org.biojava.bio.seq.io.SeqIOTools

File file = new File("DKB.fasta")
BufferedReader reader = new BufferedReader(new FileReader(file))
for (SequenceIterator sequences = SeqIOTools.readFastaDNA(reader); sequences.hasNext(); ) {
    try {
        // emit each validated DNA sequence as an HML sequence element
        println "<sequence alphabet=\"DNA\">${sequences.nextSequence().seqString()}</sequence>"
    }
    catch (BioException error) {
        println "invalid sequence: ${error}"
    }
}
For other file formats, such as VCF, the feature parser has corresponding methods to validate DNA fields.
The following Groovy code snippet uses the feature parser to reformat HLA-A clinical exons into proper HML targeted-region tags.
import org.nmdp.ngs.feature.Locus
import org.nmdp.ngs.feature.parser.FeatureParser

def filename = "/opt/nmdp/regions/clinical-exons/hla-a.txt"
new File(filename).eachLine { line ->
    def (index, coordinate) = line.split("\t")
    def locus = FeatureParser.parseLocus(coordinate)
    println """<targeted-region
    assembly="GRCh38"
    contig="${locus.getContig()}"
    start="${locus.getMin()}"
    end="${locus.getMax()}"
    strand="1"
    id="file://${filename}"
    description="HLA-A exon ${index}"/>"""
}
Including an interpreted genotype is as simple as placing the interpreted alleles between glstring tags, for example:
<glstring>HLA-A*24:02:01:01+HLA-A*24:02:01:01</glstring>
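For context, a glstring element typically sits inside an allele-assignment within a typing block. The surrounding element names below are an assumption based on later HML releases; verify them against the 0.9.5 schema before use.

```xml
<!-- Illustrative fragment only; check element names and required
     attributes against the HML 0.9.5 schema at schemas.nmdp.org. -->
<typing>
  <allele-assignment>
    <glstring>HLA-A*24:02:01:01+HLA-A*24:02:01:01</glstring>
  </allele-assignment>
</typing>
```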
Alternatively (and preferably), the GL String may be registered using the GlClient Java class from the gl-client module:
GlClient client = new JsonGlClient(...);
String identifier = client.registerGenotypeList("HLA-A*24:02:01:01+HLA-A*24:02:01:01");
gl-tools command line tools
$ echo "HLA-A*24:02:01:01+HLA-A*24:02:01:01" |\
gl-register-genotype-lists -s https://gl.immunogenomics.org/imgt-hla/3.16.0/
http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2
or directly via HTTP using curl:
$ curl --header "content-type: text/plain" \
--data "HLA-A*24:02:01:01+HLA-A*24:02:01:01" \
-X POST https://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list \
-v 2>&1 | grep Location
< Location: http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2
Then include the returned identifier URI in the glstring element:
<glstring uri="http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2"/>
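Scripting the registration end to end is mostly a matter of capturing the Location header. The sketch below parses a Location line (shown here with a canned example rather than a live request; the helper name is hypothetical) and emits the corresponding glstring element:

```shell
# location_to_glstring reads "Location: URI" lines on stdin and
# prints a glstring element referencing each URI.
location_to_glstring() {
    sed -n 's/^.*Location: *\(.*\)$/<glstring uri="\1"\/>/p'
}

# Canned header line standing in for the curl -v output above.
echo "< Location: http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2" \
    | location_to_glstring
# prints: <glstring uri="http://gl.immunogenomics.org/imgt-hla/3.16.0/genotype-list/2"/>
```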