Skip to content

Requirements for Binning Directories

Enrico Seiler edited this page Jan 25, 2019 · 8 revisions

Shapes

Consecutive shapes

Definition

Given a text T with |T| = n and a k-mer size k, the set of consecutive k-mers S = [ [T_0, T_1, ..., T_k-1], [T_1, T_2, ..., T_k], ..., [T_n-k, T_n-k+1, ..., T_n-1] ].

Example

k = 5
T =   GACTACGTAGC
S = [ GACTA,
       ACTAC,
        CTACG,
         TACGT,
          ACGTA,
           CGTAG,
            GTAGC ]

Thresholding

To determine whether a pattern P with |P| = m occurs in the text T with up to e errors, the well known k-mer lemma can be applied: A text T of length n contains n-k+1 k-mers and a occurrence of pattern P with |P|=m shares at least m-(e+1)*k-1 k-mers with T.

Offset shapes

Definition

Given a text T with |T| = n, a k-mer size k and an offset o, the set of offset k-mers S = [ [T_0, T_1, ..., T_k-1], [T_o, T_o+1, ..., T_o+k-1], [T_2o, T_2o+1, ..., T_2o+k-1], ... [T_ceil((n-k+1)/o)-o, ..., T_ceil(ceil((n-k+1)/o)-o+k].

Thresholding

Given a text T of length n, a k-mer size k and an offset o, T contains ok = ceil((n-k+1)/o) many k-mers and a occurrence of pattern |P| with |P|=m shares at least ok - (e+1)*ceil(k/o) many k-mers.

Example

k = 5
o = 3
T =   GACTACGTAGC
S = [ GACTA,
         TACGT,
            GTAGC ]
### Details
When looking for a pattern P in text T, **all** k-mers of T must be compared to the offset k-mers of P, since we don't know if P aligns to T with a multiple of offset o.
- Gapped shapes
- Minimizer shapes

Clone this wiki locally