-
Notifications
You must be signed in to change notification settings - Fork 87
Requirements for Binning Directories
Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S = [ [T_0, T_1, ..., T_k-1], [T_1, T_2, ..., T_k], ..., [T_n-k, T_n-k+1, ..., T_n-1] ].
k = 5
T = GACTACGTAGC
S = [ GACTA,
ACTAC,
CTACG,
TACGT,
ACGTA,
CGTAG,
GTAGC ]
To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P|=m shares at least m-(e+1)*k-1 k-mers with T.
Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].
Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - (e+1)*ceil(k/o) many k-mers.
k = 5
o = 3
T = GACTACGTAGC
S = [ GACTA,
TACGT,
GTAGC ]
### Details
When looking for a pattern P in text T, **all** k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o.
- Gapped shapes
- Minimizer shapes