-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.rmd
245 lines (179 loc) · 6.26 KB
/
index.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
---
title: Sequence Alignment
output:
ioslides_presentation:
incremental: true
transition: faster
css: ../style.css
author: James Hawley
date: 2019-10-18
autosize: true
widescreen: true
---
```{r setup, include=FALSE}
options(htmltools.dir.version = FALSE)
```
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>
<script>
$(document).ready(function() {
$('slide:not(.backdrop):not(.title-slide)').append('<div class=\"footnotes\">');
$('footnote').each(function(index) {
var text = $(this).html();
var fnNum = (index+1).toString();
$(this).html(fnNum.sup());
var footnote = fnNum + '. ' + text + '<br/>';
var oldContent = $(this).parents('slide').children('div.footnotes').html();
var newContent = oldContent + footnote;
$(this).parents('slide').children('div.footnotes').html(newContent);
});
});
</script>
## Last session
* DNA sequencing technologies
* Phred quality scores
* FASTQ quality control
* Sequencing as a random sampling
## Outline
* Sequencing reads as strings
* Algorithms for finding string matches
* Algorithmic efficiency
* Tutorial: aligning sequencing data
# Exact string matching
## DNA sequences as strings {.build}
```
@reference-seq-name1
ATCTATACTTTATCTTTATCTTTA
+
GFFFFBBBCBCCBBAAA:::;;;;
@reference-seq-name2
ATTTTATCGCGTAGCTAGCTGGCT
+
FFFEEBBBCBCCBBAAA:::;;;;
```
* Bunch of sequences and quality scores
* How do they relate to each other?
<div class="note">
* We have the sequences themselves
* As the title suggests, this general process is called "sequence alignment"
* Broad approach is going to be to line them up (match characters)
* To do this, we're going to have to build up the idea of what it means for 2 strings of letters to be similar
* First is the idea of a "string", how we're going to model the sequences
</div>
## Strings
* String: an ordered sequence of characters belonging to an alphabet
* $S_n = \left\{ s_1, …, s_n | s_i \in \Sigma \right\},$ where $\Sigma$ is the alphabet
* DNA: $\Sigma = \{A, C, G, T\}$
* RNA: $\Sigma = \{A, C, G, U\}$
* Protein: $\Sigma = \{D, E, R, ..., M, W, C\}$
## String properties
* Length of a string: $|S_n| = n$
* $S[i] = (i + 1)th$ character of S
* Start counting $i$ at 0
## String comparisons
* Substring: $S$ is a substring of $T$ if there exists strings $u$, $v$, where $T = uSv$
* Prefix: $S$ is a prefix of $T$ if there exists string $v$, where $T = Sv$
* Suffix: $S$ is a suffix of $T$ if there exists string $u$, where $T = uS$
<div class="note">
* We've built up most of the machinery we need
</div>
## Exact string matching
* Take two strings $S, T$
* Assume $|S| < |T|$
* Find all positions (indices) in $T$ where $S$ matches exactly
* Question: What's an easy way to find all of these positions?
## Exact string matching | Examples
* $S =$ "ATCG", $T =$ "TATCGTGA"
* $S =$ "AAA", $T =$ "AAAAAAAAAAAAAAAA"
* $S =$ "ACGG", $T =$ "GTGTGTGTGTGTGTGTGTGT"
## Exact string matching | Naive algorithm
1. Check first character of $S$ against first character of $T$
2. If match, check second of each
3. Repeat #2 until no match or end of $S$
4. Remove first character of $T$, and repeat #1-3 above
5. Stop when $|T| < |S|$
## Exact string matching | Naive algorithm
* $S =$ "ATCG", $T =$ "TATCGTGA"
* $S =$ "AAA", $T =$ "AAAAAAAAAAAAAAAA"
* $S =$ "ACGG", $T =$ "GTGTGTGTGTGTGTGTGTGT"
* See `naive-alignment.py`.
## Exact string matching | Naive algorithm
* How efficient is this method?
* How many comparisons will we make, in the worst case?
* Computer science: "big-O" notation
* $|S|$ comparisons for each shift of $S$
* $|T| - |S| + 1$ shifts in total
* $O(|S|(|T| - |S| + 1)) = O(|S||T|)$
## Exact string matching | Boyer-Moore algorithm
* Can improvements be made?
* Want to reduce total number of comparisons in our worst case scenarios
1. **Bad character**: When mismatch found, shift $S$ until the mismatch becomes a match
2. **Longer skip**: Try alignments in one direction, compare characters in opposite direction
3. **Good suffix**: Check characters that previously matched prior to shifting $S$ still match after shift
* See `boyer-moore.py`.
## Exact string matching | Boyer-Moore algorithm
* Can show that Boyer-Moore is $O(|T|)$
* Worst case, asymptotically linear with the length of $T$
* Compare with naive: $O(|T| |S|)$
## Summary so far
* Two methods to align strings exactly
* Key ideas:
* Efficiency of algorithm is critical
* Preprocess strings to skip alignments
* Questions?
# Break
## Non-exact string matching
* Previous methods only produced exact matches
* DNA contains mutations, insertions, deletions
* Need to do string matching with errors
## Edit distance
* Consider $S, T$ where $|S| < |T|$
* $E(S, T) =$ min number of operations to transform $S$ into $T$
* 3 operations on characters:
* Substitution
* Insertion
* Deletion
## Edit distance | Examples
* $S =$ AATTC, $T =$ AATTG
* $S =$ AATTC, $T =$ AAGGTGT
* $S =$ ACCGT, $T =$ TCCTAGT
* Can show $E(S, T) \sim O(|S| |T|)$
<div class="note">
* 1
* 4
* 3
</div>
## Non-exact alignment
* Exact alignment: finding positions where edit distance = 0
* Non-exact alignment: finding positions with smallest edit distance
<div class="note">
* Reduced molecular biology problem to a calculation
</div>
## Global vs local alignment
* Global: match entirety of $S$ to entirety of $T$
* Needleman-Wunsch algorithm
* Local: match entirety of $S$ to a part of $T$
* Smith-Waterman algorithm
## Modern sequence alignment algorithms
* Preprocessing reference and input
* Edit distance for local alignment
* Different penalties for mismatches
* Including sequence quality scores
# Performing sequence alignment
## Tools
* Anaconda (https://docs.conda.io/en/latest/miniconda.html)
* Bowtie2 (https://github.com/BenLangmead/bowtie2)
## Data
* [https://bit.ly/2BkwQF0](https://utoronto-my.sharepoint.com/:f:/g/personal/james_hawley_mail_utoronto_ca/Ep0Dx9zDRUFDjD_ZkEV_kxQB5m3g61VOOf7Hkl1-q7mfZw?e=1aEFmm)
* Reference (FASTA)
* Bowtie2 index (`.bt2`)
* DNA sequences (FASTQ)
## Alignment process
1. DNA sequencing file (FASTQ)
2. Perform sequence alignment (Bowtie2)
3. Look at file of alignments (SAM/BAM file)
# Summary
## Summary
* DNA sequences as strings of characters
* Naive and Boyer-Moore Algorithms
* Edit distance
* Aligned a FASTQ to a reference genome