@@ -4,7 +4,6 @@ Try to test in order, so early tests don't require correct interpretation of
 later tests. This gives an ordering for software development and testing.

 - Data types
-
   - ITF8
   - Strings
   - Arrays
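
Since ITF8 is the first data type in the test ordering, a small reference decoder is useful for checking the early test files by hand. The following is a minimal sketch, assuming the ITF8 layout as I understand it from the CRAM specification (the count of leading 1 bits in the first byte gives the number of extra bytes, and the 5-byte form uses only the low 4 bits of its last byte). The function name `decode_itf8` is hypothetical and not taken from any implementation.

```python
def decode_itf8(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one ITF8 integer from buf starting at pos.

    Returns (value, next_pos). Sketch only: no bounds checking beyond
    what Python's indexing provides.
    """
    b0 = buf[pos]
    if b0 < 0x80:  # 0xxxxxxx: value held in 7 bits
        return b0, pos + 1
    if b0 < 0xC0:  # 10xxxxxx + 1 byte: 14 bits
        return ((b0 & 0x3F) << 8) | buf[pos + 1], pos + 2
    if b0 < 0xE0:  # 110xxxxx + 2 bytes: 21 bits
        val = ((b0 & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        return val, pos + 3
    if b0 < 0xF0:  # 1110xxxx + 3 bytes: 28 bits
        val = (((b0 & 0x0F) << 24) | (buf[pos + 1] << 16)
               | (buf[pos + 2] << 8) | buf[pos + 3])
        return val, pos + 4
    # 1111xxxx + 4 bytes: 32 bits, last byte contributes its low 4 bits only
    val = (((b0 & 0x0F) << 28) | (buf[pos + 1] << 20) | (buf[pos + 2] << 12)
           | (buf[pos + 3] << 4) | (buf[pos + 4] & 0x0F))
    if val >= 1 << 31:  # interpret as a signed 32-bit integer
        val -= 1 << 32
    return val, pos + 5
```

Under these assumptions, `decode_itf8(b"\x80\x80")` yields the value 128 using two bytes, and the five bytes `ff ff ff ff 0f` decode to -1.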
@@ -65,7 +64,6 @@ An empty file to check the file definition can be read. We require a SAM header
 too, but this is also empty (using one block, see below).

 - Empty CRAM file (failed/0000_empty_noref.cram)
-
   - File definition
   - SAM header container (zero content)
   - [ End of file; no EOF block so may emit warning]
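
As an illustration of the first check on the empty file, here is a minimal sketch of reading the file definition. It assumes the 26-byte layout from the CRAM specification as I understand it: the 4-byte magic `CRAM`, one byte each for the major and minor version, and a 20-byte file id. The function name `parse_file_definition` and the returned dictionary keys are hypothetical.

```python
import struct


def parse_file_definition(data: bytes) -> dict:
    """Parse the fixed-size CRAM file definition at the start of data."""
    if len(data) < 26:
        raise ValueError("truncated file definition")
    # 4-byte magic, major version, minor version, 20-byte file id
    magic, major, minor, file_id = struct.unpack("<4sBB20s", data[:26])
    if magic != b"CRAM":
        raise ValueError("bad magic number")
    return {"major": major, "minor": minor, "file_id": file_id}
```

A reader for the empty test file would call this on the first 26 bytes and then continue with the SAM header container.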
@@ -79,7 +77,6 @@ too, but this is also empty (using one block, see below).
   warning or a hard error.)

 - Empty CRAM file with EOF block (0001_empty_eof.cram)
-
   - As above, but with official EOF block.

   This EOF block can be decoded either by checking for a specific series of
@@ -127,23 +124,19 @@ c->num_landmarks=0 set c->curr_slice=0 set c->length=181 c ]
 Files with 1 or more sequences. These are all unmapped with no auxiliary tags.

 - Single read (0300_unmapped.cram)
-
   - Tests decoded data via EXTERNAL, HUFFMAN, BYTE_ARRAY_STOP and BYTE_ARRAY_LEN
     encodings.
   - 4 blocks in slice (CORE - empty, RN, QS, BA).

 - Two unpaired reads, of differing length (0301_unmapped_cram)
-
   - As above, but RL is no longer a constant and is in its own block.

 - Three reads, including a pair (0302_unmapped_cram)
-
   - Also contains BF and MF blocks. All still CF "detached".
   - BF 77 & 141 match the input SAM, but this is redundant as it's
   - also set in MF bit 2.

 - Three reads, including a pair (0303_unmapped_cram)
-
   - As above, but the SAM FLAGs of 77 and 141 are stored as 69 and 133 (clearing
     mate unmapped flag). BF + MF are sufficient to regenerate the correct FLAG
     field.
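
The 0302 and 0303 cases above rely on the SAM FLAG being recoverable from BF plus MF. A minimal sketch of that reconstruction follows, assuming (from the CRAM and SAM specifications as I understand them) that MF value 0x1 means "mate reverse strand", MF value 0x2 means "mate unmapped", and that these map to SAM FLAG bits 0x20 and 0x8. The function and constant names are hypothetical.

```python
# MF (mate flags) values, as assumed from the CRAM specification
MF_MATE_REVERSE = 0x1
MF_MATE_UNMAPPED = 0x2

# Corresponding SAM FLAG bits
SAM_MATE_UNMAPPED = 0x8
SAM_MATE_REVERSE = 0x20


def restore_flag(bf: int, mf: int) -> int:
    """Rebuild the SAM FLAG for a detached record from BF and MF."""
    flag = bf
    if mf & MF_MATE_REVERSE:
        flag |= SAM_MATE_REVERSE
    if mf & MF_MATE_UNMAPPED:
        flag |= SAM_MATE_UNMAPPED
    return flag
```

With MF set to 2, stored BF values of 69 and 133 restore to 77 and 141, and a BF that already holds 77 or 141 is left unchanged, which is the redundancy noted for 0302.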
@@ -154,21 +147,18 @@ Files with 1 or more sequences. These are all unmapped with no auxiliary tags.
 ## Slice basics, mapped reads, no reference

 - Single read (0400_mapped.cram)
-
   - Container ref id, pos and span, number of records and number of bases fields
     are changed.
   - Checks that mapped data can process MD5 0, provided container RR=0.
   - Additional data series in use: FN, FP, FC, MQ.
   - One feature of type 'b', with sequence stored in BB.

 - Paired reads, but detached (0401_mapped.cram)
-
   - RNEXT/PNEXT/TLEN of \*/0/0
   - Explicit TS, NP, NS with constant values as they would disagree with
     auto-computed values.

 - Paired reads, but detached (0402_mapped.cram)
-
   - RNEXT/PNEXT/TLEN filled out.
   - Explicit TS, NP, NS, with non-constant values [ Edit htslib to force
     bam_ins_size check to fail and hence "goto detached". ]
@@ -183,23 +173,20 @@ Testing of the FC (Feature Codes) data series types and their associated
 type-specific data series.

 - External reference, CIGAR ops (0500_mapped.cram)
-
   - No edits: entirely match reference
   - No FP/FC needed (and FN=0).
   - Sequence is implicitly assumed to entirely match reference
   - Header gains an @SQ UR: tag, although note this pathname is local and not
     transferable to other systems.

 - External reference, CIGAR ops (0501_mapped.cram)
-
   - Mismatching first and last base on first seq and first / last 3 bases on
     second seq.
   - Adds use of FC "X" and the BS (base substitution) data series. This tests
     the compression header "SM" preservation map. Note BB data series has an
     encoding in the compression header, but is not used here.

 - As above, but R/Y bases (0502_mapped.cram).
-
   - Test of the BA data series and FC "B" code. BS (base substitution) only
     applies for A, C, G, T, N.

@@ -213,19 +200,16 @@ type-specific data series.
   s->block[12]->data[0].]

 - As above with R/Y bases, using "b" FC (0503_mapped.cram).
-
   - Unlike FC "B", "b" is a string instead of a single character and doesn't
     require storing quality data.

   [ Produced by changing the "if (0 && CRAM_MAJOR_VERS...)" line in
   process_one_read().]

 - Soft/hard clips (0504_mapped.cram)
-
   - FC codes S and H, with associated SC and HC data series.

 - Indels (0505_mapped.cram)
-
   - Tests FC codes and data-series: D (DL), I (IN) and i (BA). The table below
     shows cigar ops, with "m" being lowercase as it's not explicitly stored in
     CRAM. The FC row shows the associated CRAM feature code. REF
@@ -237,7 +221,6 @@ type-specific data series.
     FC D D I i

 - As above, but explicit padding in the 5bp indel (0506_mapped.cram)
-
   - Tests FC code P and data series PD REF
     ATTTTTCGGGTTTTTTGAAATGAATATCGTAGCTACAGAAACGGTTGTGCACTCATCTGAAAGTTTGTTT T
     TCTTGTTTTCTTGCACTTTGTGCAGAATT SEQ ATTTTTCGGGTTTTTTGAAA AT
@@ -405,15 +388,12 @@ examples produced by some current implementations.
 - BETA (already tested in 1101_BETA.cram)

 - SUBEXPONENTIAL
-
   - I have no code to write this data format. Exists in htsjdk though?

 - GAMMA
-
   - I have no code to write this data format. Exists in htsjdk though?

 - GOLOMB (deprecated)
-
   - I have no code to read nor write this data format

 - GOLOMB-RICE (deprecated)
@@ -422,19 +402,16 @@ examples produced by some current implementations.
 ## Index

 - Simple mapped case (1400_index_simple.cram)
-
   - 10bp reads starting one per base. Read name indicates bases covered.
   - 77 reads per container
   - Index query CHROMOSOME_I:333-444 should return 121 records, from s324-333 to
     s444-453

 - Unmapped data (1401_index_unmapped.cram)
-
   - As above, but all data is unmapped
   - Index query for unmapped (eg ref `*`) should return all 1000 records.

 - Multiple references + unmapped (1402_index_3ref.cram)
-
   - 300 for first ref, 10 for second, 300 for third, and 300 unmapped.
   - Only one reference per slice.
   - CHROMOSOME_I:100-200 returns 110 records
@@ -445,7 +422,6 @@ examples produced by some current implementations.
   - `*` (unmapped) returns 300 records

 - Multi-ref mode (1403_index_multiref.cram)
-
   - As above, but containers / slices use the RI data series with multiple
     references per container. The same queries will work as above.
   - Hence index reports reference IDs, but multiple references can occur at the
@@ -459,17 +435,14 @@ examples produced by some current implementations.
     container as the last reads in ref 2.

 - Multi-slice containers (1404_index_multislice.cram)
-
   - As 1402_index_3ref.cram, but 3 slices per container.
   - Same queries will work as above.

 - Multi-slice multi-ref containers (1405_index_multisliceref.cram)
-
   - As above, but with multiple references permitted per slice.
   - Same queries will work as above.

 - Mix of long and short reads
-
   - 10bp reads starting every position
   - 350bp reads starting every 300 positions

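
The expected result for the 1400_index_simple.cram query can be checked with simple interval arithmetic. The sketch below is not an index reader. It only models the test data, assuming 1000 reads of 10bp with one read starting at each of positions 1 to 1000 (the read count is inferred from the "all 1000 records" note for 1401) and read names of the form `s<start>-<end>`. The function name `overlapping_reads` is hypothetical.

```python
def overlapping_reads(q_start: int, q_end: int,
                      n_reads: int = 1000, read_len: int = 10) -> list[str]:
    """Names of modelled reads overlapping the 1-based inclusive query range."""
    names = []
    for start in range(1, n_reads + 1):
        end = start + read_len - 1
        # Two closed intervals overlap when each starts before the other ends
        if start <= q_end and end >= q_start:
            names.append(f"s{start}-{end}")
    return names


hits = overlapping_reads(333, 444)
print(len(hits), hits[0], hits[-1])
```

Under these assumptions the query CHROMOSOME_I:333-444 selects reads starting at positions 324 through 444, which agrees with the 121 records from s324-333 to s444-453 stated for that test.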