Skip to content

Commit 8b3817d

Browse files
add supplementary
1 parent 57f05a5 commit 8b3817d

File tree

1 file changed

+317
-0
lines changed

1 file changed

+317
-0
lines changed

docs/Supplementary/1-supplementary.md

Lines changed: 317 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,317 @@
1+
---
2+
title: File manipulation
3+
---
4+
5+
## File manipulation and more practice with pipes
6+
7+
Let's use the tools we've added to our tool kit so far, along with a few new ones, to example our SRA metadata file. First, let's navigate to the correct directory.
8+
9+
```bash
10+
$ cd
11+
$ cd shell_data/sra_metadata
12+
```
13+
14+
This file contains a lot of information about the samples that we submitted for sequencing. We
15+
took a look at this file in an earlier lesson. Here we're going to use the information in this
16+
file to answer some questions about our samples.
17+
18+
#### How many of the read libraries are paired end?
19+
20+
The samples that we submitted to the sequencing facility were a mix of single and paired end
21+
libraries. We know that we recorded information in our metadata table about which samples used
22+
which library preparation method, but we don't remember exactly where this data is recorded.
23+
Let's start by looking at our column headers to see which column might have this information. Our
24+
column headers are in the first row of our data table, so we can use `head` with a `-n` flag to
25+
look at just the first row of the file.
26+
27+
```bash
28+
$ head -n 1 SraRunTable.txt
29+
```
30+
31+
```output
32+
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s LoadDate_s MBases_l MBytes_l ReleaseDate_s Run_s SRA_Sample_s Sample_Name_s Assay_Type_s AssemblyName_s BioProject_s Center_Name_s Consent_s Organism_Platform_s SRA_Study_s g1k_analysis_group_s g1k_pop_code_s source_s strain_s
33+
```
34+
35+
That is only the first line of our file, but because there are a lot of columns, the output
36+
likely wraps around your terminal window and appears as multiple lines. Once we figure out which
37+
column our data is in, we can use a command called `cut` to extract the column of interest.
38+
39+
Because this is pretty hard to read, we can look at just a few column header names at a time by combining the `|` redirect and `cut`.
40+
41+
```bash
42+
$ head -n 1 SraRunTable.txt | cut -f1-4
43+
```
44+
45+
`cut` takes a `-f` flag, which stands for "field". This flag accepts a list of field numbers,
46+
in our case, column numbers. Here we are extracting the first four column names.
47+
48+
```output
49+
BioSample_s InsertSize_l LibraryLayout_s Library_Name_s
50+
```
51+
52+
The LibraryLayout\_s column looks like it should have the information we want. Let's look at some
53+
of the data from that column. We can use `cut` to extract only the 3rd column from the file and
54+
then use the `|` operator with `head` to look at just the first few lines of data in that column.
55+
56+
```bash
57+
$ cut -f3 SraRunTable.txt | head -n 10
58+
```
59+
60+
```output
61+
LibraryLayout_s
62+
SINGLE
63+
SINGLE
64+
SINGLE
65+
SINGLE
66+
SINGLE
67+
SINGLE
68+
SINGLE
69+
SINGLE
70+
PAIRED
71+
```
72+
73+
We can see that there are (at least) two categories, SINGLE and PAIRED. We want to search all entries in this column
74+
for just PAIRED and count the number of matches. For this, we will use the `|` operator twice
75+
to combine `cut` (to extract the column we want), `grep` (to find matches) and `wc` (to count matches).
76+
77+
```bash
78+
$ cut -f3 SraRunTable.txt | grep PAIRED | wc -l
79+
```
80+
81+
```output
82+
2
83+
```
84+
85+
We can see from this that we have only two paired-end libraries in the samples we submitted for
86+
sequencing.
87+
88+
::::::::::::::::::::::::::::::::::::::: challenge
89+
90+
## Exercise
91+
92+
How many single-end libraries are in our samples?
93+
94+
::::::::::::::: solution
95+
96+
## Solution
97+
98+
```bash
99+
$ cut -f3 SraRunTable.txt | grep SINGLE | wc -l
100+
```
101+
102+
```output
103+
35
104+
```
105+
106+
:::::::::::::::::::::::::
107+
108+
::::::::::::::::::::::::::::::::::::::::::::::::::
109+
110+
#### How many of each class of library layout are there?
111+
112+
We can extract even more information from our metadata table if we add in some new tools: `sort` and `uniq`. The `sort` command will sort the lines of a text file and the `uniq` command will
113+
filter out repeated neighboring lines in a file. You might expect `uniq` to
114+
extract all of the unique lines in a file. This isn't what it does, however, for reasons
115+
involving computer memory and speed. If we want to extract all unique lines, we
116+
can do so by combining `uniq` with `sort`. We'll see how to do this soon.
117+
118+
For example, if we want to know how many samples of each library type are recorded in our table,
119+
we can extract the third column (with `cut`), and pipe that output into `sort`.
120+
121+
```bash
122+
$ cut -f3 SraRunTable.txt | sort
123+
```
124+
125+
If you look closely, you might see that we have one line that reads "LibraryLayout\_s". This is the
126+
header of our column. We can discard this information using the `-v` flag in `grep`, which means
127+
return all the lines that **do not** match the search pattern.
128+
129+
```bash
130+
$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort
131+
```
132+
133+
This command returns a sorted list (too long to show here) of PAIRED and SINGLE values. We can use
134+
the `uniq` command to see a list of all the different categories that are present. If we do this,
135+
we see that the only two types of libraries we have present are labelled PAIRED and SINGLE. There
136+
aren't any other types in our file.
137+
138+
```bash
139+
$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort | uniq
140+
```
141+
142+
```output
143+
PAIRED
144+
SINGLE
145+
```
146+
147+
If we want to count how many of each we have, we can use the `-c` (count) flag for `uniq`.
148+
149+
```bash
150+
$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort | uniq -c
151+
```
152+
153+
```output
154+
2 PAIRED
155+
35 SINGLE
156+
```
157+
158+
::::::::::::::::::::::::::::::::::::::: challenge
159+
160+
## Exercise
161+
162+
1. How many different sample load dates are there?
163+
2. How many samples were loaded on each date?
164+
165+
::::::::::::::: solution
166+
167+
## Solution
168+
169+
1. There are two different sample load dates.
170+
171+
```bash
172+
cut -f5 SraRunTable.txt | grep -v LoadDate_s | sort | uniq
173+
```
174+
175+
```output
176+
25-Jul-12
177+
29-May-14
178+
```
179+
180+
2. Six samples were loaded on one date and 31 were loaded on the other.
181+
182+
```bash
183+
cut -f5 SraRunTable.txt | grep -v LoadDate_s | sort | uniq -c
184+
```
185+
186+
```output
187+
6 25-Jul-12
188+
31 29-May-14
189+
```
190+
191+
:::::::::::::::::::::::::
192+
193+
::::::::::::::::::::::::::::::::::::::::::::::::::
194+
195+
#### Can we sort the file by library layout and save that sorted information to a new file?
196+
197+
We might want to re-order our entire metadata table so that all of the paired-end samples appear
198+
together and all of the single-end samples appear together. We can use the `-k` (key) flag for `sort` to
199+
sort based on a particular column. This is similar to the `-f` flag for `cut`.
200+
201+
Let's sort based on the third column (`-k3`) and redirect our output to a new file.
202+
203+
```bash
204+
$ sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt
205+
```
206+
207+
#### Can we extract only paired-end records into a new file?
208+
209+
We also might want to extract the information for all samples that meet a specific criterion
210+
(for example, are paired-end) and put those lines of our table in a new file. First, we need
211+
to check to make sure that the pattern we're searching for ("PAIRED") only appears in the column
212+
where we expect it to occur (column 3). We know from earlier that there are only two paired-end
213+
samples in the file, so we can `grep` for "PAIRED" and see how many results we get.
214+
215+
```bash
216+
$ grep PAIRED SraRunTable.txt | wc -l
217+
```
218+
219+
```output
220+
2
221+
```
222+
223+
There are only two results, so we can use "PAIRED" as our search term to extract the paired-end
224+
samples to a new file.
225+
226+
```bash
227+
$ grep PAIRED SraRunTable.txt > SraRunTable_only_paired_end.txt
228+
```
229+
230+
::::::::::::::::::::::::::::::::::::::: challenge
231+
232+
## Exercise
233+
234+
Sort samples by load date and export each of those sets to a new file (one new file per
235+
unique load date).
236+
237+
::::::::::::::: solution
238+
239+
## Solution
240+
241+
```bash
242+
$ grep 25-Jul-12 SraRunTable.txt > SraRunTable_25-Jul-12.txt
243+
$ grep 29-May-14 SraRunTable.txt > SraRunTable_29-May-14.txt
244+
```
245+
246+
:::::::::::::::::::::::::
247+
248+
::::::::::::::::::::::::::::::::::::::::::::::::::
249+
250+
## Making code more customizeable using command line arguments
251+
252+
In Lesson 05 (Writing Scripts) we used the `grep` command line tool to look for FASTQ records with
253+
lots of Ns from all the .fastq files in our current folder using the following code:
254+
255+
```bash
256+
$ grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt
257+
```
258+
259+
This is very useful, but could be more customizeable. We may want to be able to run this command
260+
over and over again without needing to copy and paste it and allow the user to specify exactly which
261+
file they want to examine for bad reads.
262+
263+
We can accomplish these goals by including the above command in a script that takes in user input
264+
via a command line argument. We can slightly modify our `bad-reads-script.sh` file to do so. Use
265+
`c` to copy your `bad-reads-script.sh` into a new script called `custom-bad-reads-script.sh`. Make
266+
the following modifications to `custom-bad-reads-script.sh`:
267+
268+
```bash
269+
filename=$1
270+
grep -B1 -A2 -h NNNNNNNNNN $filename | grep -v '^--' > scripted_bad_reads.txt
271+
```
272+
273+
`$1` is our command line argument. The line `filename=$1` tells Bash to take the first thing you type
274+
after the name of the script itself and assign that value to a variable called filename.
275+
276+
For example, this script can be run in the following way to output the bad reads just from one file:
277+
278+
```bash
279+
bash custom-bad-reads-script.sh SRR098026.fastq
280+
```
281+
282+
We can then take a look at what the output file currently contains using `head scripted_bad_reads.txt`:
283+
284+
```output
285+
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
286+
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
287+
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
288+
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
289+
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
290+
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
291+
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
292+
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
293+
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
294+
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
295+
```
296+
297+
This should be same output as using our original code and manually modifying the
298+
original standalone code on the command line to "SRR098026.fastq" on the command line,
299+
which should give us the same output as above:
300+
301+
```bash
302+
$ grep -B1 -A2 -h NNNNNNNNNN SRR098026.fastq | grep -v '^--' > scripted_bad_reads.txt
303+
head scripted_bad_reads.txt
304+
```
305+
306+
```output
307+
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
308+
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
309+
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
310+
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
311+
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
312+
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
313+
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
314+
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
315+
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
316+
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
317+
```

0 commit comments

Comments
 (0)