2019-2020/02-sequence-alignment/index.rmd

---
title: Sequence Alignment
output:
    ioslides_presentation:
        incremental: true
        transition: faster
        css: ../style.css
author: James Hawley
date: 2019-10-18
autosize: true
widescreen: true
---

```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```

<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
  $('slide:not(.backdrop):not(.title-slide)').append('<div class=\"footnotes\">');

  $('footnote').each(function(index) {
    var text  = $(this).html();
    var fnNum = (index+1).toString();
    $(this).html(fnNum.sup());

    var footnote   = fnNum + '. ' + text + '<br/>';
    var oldContent = $(this).parents('slide').children('div.footnotes').html();
    var newContent = oldContent + footnote;
    $(this).parents('slide').children('div.footnotes').html(newContent);
  });
});
</script>

## Last session

* DNA sequencing technologies
* Phred quality scores
* FASTQ quality control
* Sequencing as a random sampling

## Outline

* Sequencing reads as strings
* Algorithms for finding string matches
* Algorithmic efficiency
* Tutorial: aligning sequencing data

# Exact string matching

## DNA sequences as strings {.build}

```
@reference-seq-name1
ATCTATACTTTATCTTTATCTTTA
+
GFFFFBBBCBCCBBAAA:::;;;;
@reference-seq-name2
ATTTTATCGCGTAGCTAGCTGGCT
+
FFFEEBBBCBCCBBAAA:::;;;;
```

* Bunch of sequences and quality scores
* How do they relate to each other?

<div class="note">
* We have the sequences themselves
* As the title suggests, this general process is called "sequence alignment"
* Broad approach is going to be to line them up (match characters)
* To do this, we're going to have to build up the idea of what it means for 2 strings of letters to be similar
* First is the idea of a "string", how we're going to model the sequences
</div>

## Strings

* String: an ordered sequence of characters belonging to an alphabet
* $S_n = \left\{ s_1, …, s_n | s_i \in \Sigma \right\},$ where $\Sigma$ is the alphabet
* DNA: $\Sigma = \{A, C, G, T\}$
* RNA: $\Sigma = \{A, C, G, U\}$
* Protein: $\Sigma = \{D, E, R, ..., M, W, C\}$

## String properties

* Length of a string: $|S_n| = n$
* $S[i] = (i + 1)th$ character of S
* Start counting $i$ at 0

## String comparisons

* Substring: $S$ is a substring of $T$ if there exists strings $u$, $v$, where $T = uSv$
* Prefix: $S$ is a prefix of $T$ if there exists string $v$, where $T = Sv$
* Suffix: $S$ is a suffix of $T$ if there exists string $u$, where $T = uS$

<div class="note">
* We've built up most of the machinery we need
</div>

## Exact string matching

* Take two strings $S, T$
* Assume $|S| < |T|$
* Find all positions (indices) in $T$ where $S$ matches exactly
* Question: What's an easy way to find all of these positions?

## Exact string matching | Examples

* $S =$ "ATCG", $T =$ "TATCGTGA"
* $S =$ "AAA", $T =$ "AAAAAAAAAAAAAAAA"
* $S =$ "ACGG", $T =$ "GTGTGTGTGTGTGTGTGTGT"

## Exact string matching | Naive algorithm

1. Check first character of $S$ against first character of $T$
2. If match, check second of each
3. Repeat #2 until no match or end of $S$
4. Remove first character of $T$, and repeat #1-3 above
5. Stop when $|T| < |S|$

## Exact string matching | Naive algorithm

* $S =$ "ATCG", $T =$ "TATCGTGA"
* $S =$ "AAA", $T =$ "AAAAAAAAAAAAAAAA"
* $S =$ "ACGG", $T =$ "GTGTGTGTGTGTGTGTGTGT"

* See `naive-alignment.py`.

## Exact string matching | Naive algorithm

* How efficient is this method?
* How many comparisons will we make, in the worst case?
* Computer science: "big-O" notation

* $|S|$ comparisons for each shift of $S$
* $|T| - |S| + 1$ shifts in total
* $O(|S|(|T| - |S| + 1)) = O(|S||T|)$

## Exact string matching | Boyer-Moore algorithm

* Can improvements be made?
* Want to reduce total number of comparisons in our worst case scenarios

1. **Bad character**: When mismatch found, shift $S$ until the mismatch becomes a match
2. **Longer skip**: Try alignments in one direction, compare characters in opposite direction
3. **Good suffix**: Check characters that previously matched prior to shifting $S$ still match after shift

* See `boyer-moore.py`.

## Exact string matching | Boyer-Moore algorithm

* Can show that Boyer-Moore is $O(|T|)$
* Worst case, asymptotically linear with the length of $T$
* Compare with naive: $O(|T| |S|)$

## Summary so far

* Two methods to align strings exactly
* Key ideas:
  * Efficiency of algorithm is critical
  * Preprocess strings to skip alignments
* Questions?

# Break

## Non-exact string matching

* Previous methods only produced exact matches
* DNA contains mutations, insertions, deletions
* Need to do string matching with errors

## Edit distance

* Consider $S, T$ where $|S| < |T|$
* $E(S, T) =$ min number of operations to transform $S$ into $T$ 
* 3 operations on characters:
  * Substitution
  * Insertion
  * Deletion
  
## Edit distance | Examples

* $S =$ AATTC, $T =$ AATTG
* $S =$ AATTC, $T =$ AAGGTGT
* $S =$ ACCGT, $T =$ TCCTAGT
* Can show $E(S, T) \sim O(|S| |T|)$


<div class="note">
* 1
* 4
* 3
</div>

## Non-exact alignment

* Exact alignment: finding positions where edit distance = 0
* Non-exact alignment: finding positions with smallest edit distance

<div class="note">
* Reduced molecular biology problem to a calculation
</div>

## Global vs local alignment

* Global: match entirety of $S$ to entirety of $T$
  * Needleman-Wunsch algorithm
* Local: match entirety of $S$ to a part of $T$
  * Smith-Waterman algorithm

## Modern sequence alignment algorithms

* Preprocessing reference and input
* Edit distance for local alignment
* Different penalties for mismatches
* Including sequence quality scores

# Performing sequence alignment

## Tools

* Anaconda (https://docs.conda.io/en/latest/miniconda.html)
* Bowtie2 (https://github.com/BenLangmead/bowtie2)

## Data

* [https://bit.ly/2BkwQF0](https://utoronto-my.sharepoint.com/:f:/g/personal/james_hawley_mail_utoronto_ca/Ep0Dx9zDRUFDjD_ZkEV_kxQB5m3g61VOOf7Hkl1-q7mfZw?e=1aEFmm)
* Reference (FASTA)
* Bowtie2 index (`.bt2`)
* DNA sequences (FASTQ)

## Alignment process

1. DNA sequencing file (FASTQ)
2. Perform sequence alignment (Bowtie2)
3. Look at file of alignments (SAM/BAM file)

# Summary

## Summary

* DNA sequences as strings of characters
* Naive and Boyer-Moore Algorithms
* Edit distance
* Aligned a FASTQ to a reference genome