|
| 1 | +--- |
| 2 | +title: File manipulation |
| 3 | +--- |
| 4 | + |
| 5 | +## File manipulation and more practice with pipes |
| 6 | + |
| 7 | +Let's use the tools we've added to our tool kit so far, along with a few new ones, to example our SRA metadata file. First, let's navigate to the correct directory. |
| 8 | + |
| 9 | +```bash |
| 10 | +$ cd |
| 11 | +$ cd shell_data/sra_metadata |
| 12 | +``` |
| 13 | + |
| 14 | +This file contains a lot of information about the samples that we submitted for sequencing. We |
| 15 | +took a look at this file in an earlier lesson. Here we're going to use the information in this |
| 16 | +file to answer some questions about our samples. |
| 17 | + |
| 18 | +#### How many of the read libraries are paired end? |
| 19 | + |
| 20 | +The samples that we submitted to the sequencing facility were a mix of single and paired end |
| 21 | +libraries. We know that we recorded information in our metadata table about which samples used |
| 22 | +which library preparation method, but we don't remember exactly where this data is recorded. |
| 23 | +Let's start by looking at our column headers to see which column might have this information. Our |
| 24 | +column headers are in the first row of our data table, so we can use `head` with a `-n` flag to |
| 25 | +look at just the first row of the file. |
| 26 | + |
| 27 | +```bash |
| 28 | +$ head -n 1 SraRunTable.txt |
| 29 | +``` |
| 30 | + |
| 31 | +```output |
| 32 | +BioSample_s InsertSize_l LibraryLayout_s Library_Name_s LoadDate_s MBases_l MBytes_l ReleaseDate_s Run_s SRA_Sample_s Sample_Name_s Assay_Type_s AssemblyName_s BioProject_s Center_Name_s Consent_s Organism_Platform_s SRA_Study_s g1k_analysis_group_s g1k_pop_code_s source_s strain_s |
| 33 | +``` |
| 34 | + |
| 35 | +That is only the first line of our file, but because there are a lot of columns, the output |
| 36 | +likely wraps around your terminal window and appears as multiple lines. Once we figure out which |
| 37 | +column our data is in, we can use a command called `cut` to extract the column of interest. |
| 38 | + |
| 39 | +Because this is pretty hard to read, we can look at just a few column header names at a time by combining the `|` redirect and `cut`. |
| 40 | + |
| 41 | +```bash |
| 42 | +$ head -n 1 SraRunTable.txt | cut -f1-4 |
| 43 | +``` |
| 44 | + |
| 45 | +`cut` takes a `-f` flag, which stands for "field". This flag accepts a list of field numbers, |
| 46 | +in our case, column numbers. Here we are extracting the first four column names. |
| 47 | + |
| 48 | +```output |
| 49 | +BioSample_s InsertSize_l LibraryLayout_s Library_Name_s |
| 50 | +``` |
| 51 | + |
| 52 | +The LibraryLayout\_s column looks like it should have the information we want. Let's look at some |
| 53 | +of the data from that column. We can use `cut` to extract only the 3rd column from the file and |
| 54 | +then use the `|` operator with `head` to look at just the first few lines of data in that column. |
| 55 | + |
| 56 | +```bash |
| 57 | +$ cut -f3 SraRunTable.txt | head -n 10 |
| 58 | +``` |
| 59 | + |
| 60 | +```output |
| 61 | +LibraryLayout_s |
| 62 | +SINGLE |
| 63 | +SINGLE |
| 64 | +SINGLE |
| 65 | +SINGLE |
| 66 | +SINGLE |
| 67 | +SINGLE |
| 68 | +SINGLE |
| 69 | +SINGLE |
| 70 | +PAIRED |
| 71 | +``` |
| 72 | + |
| 73 | +We can see that there are (at least) two categories, SINGLE and PAIRED. We want to search all entries in this column |
| 74 | +for just PAIRED and count the number of matches. For this, we will use the `|` operator twice |
| 75 | +to combine `cut` (to extract the column we want), `grep` (to find matches) and `wc` (to count matches). |
| 76 | + |
| 77 | +```bash |
| 78 | +$ cut -f3 SraRunTable.txt | grep PAIRED | wc -l |
| 79 | +``` |
| 80 | + |
| 81 | +```output |
| 82 | +2 |
| 83 | +``` |
| 84 | + |
| 85 | +We can see from this that we have only two paired-end libraries in the samples we submitted for |
| 86 | +sequencing. |
| 87 | + |
| 88 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 89 | + |
| 90 | +## Exercise |
| 91 | + |
| 92 | +How many single-end libraries are in our samples? |
| 93 | + |
| 94 | +::::::::::::::: solution |
| 95 | + |
| 96 | +## Solution |
| 97 | + |
| 98 | +```bash |
| 99 | +$ cut -f3 SraRunTable.txt | grep SINGLE | wc -l |
| 100 | +``` |
| 101 | + |
| 102 | +```output |
| 103 | +35 |
| 104 | +``` |
| 105 | + |
| 106 | +::::::::::::::::::::::::: |
| 107 | + |
| 108 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 109 | + |
| 110 | +#### How many of each class of library layout are there? |
| 111 | + |
| 112 | +We can extract even more information from our metadata table if we add in some new tools: `sort` and `uniq`. The `sort` command will sort the lines of a text file and the `uniq` command will |
| 113 | +filter out repeated neighboring lines in a file. You might expect `uniq` to |
| 114 | +extract all of the unique lines in a file. This isn't what it does, however, for reasons |
| 115 | +involving computer memory and speed. If we want to extract all unique lines, we |
| 116 | +can do so by combining `uniq` with `sort`. We'll see how to do this soon. |
| 117 | + |
| 118 | +For example, if we want to know how many samples of each library type are recorded in our table, |
| 119 | +we can extract the third column (with `cut`), and pipe that output into `sort`. |
| 120 | + |
| 121 | +```bash |
| 122 | +$ cut -f3 SraRunTable.txt | sort |
| 123 | +``` |
| 124 | + |
| 125 | +If you look closely, you might see that we have one line that reads "LibraryLayout\_s". This is the |
| 126 | +header of our column. We can discard this information using the `-v` flag in `grep`, which means |
| 127 | +return all the lines that **do not** match the search pattern. |
| 128 | + |
| 129 | +```bash |
| 130 | +$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort |
| 131 | +``` |
| 132 | + |
| 133 | +This command returns a sorted list (too long to show here) of PAIRED and SINGLE values. We can use |
| 134 | +the `uniq` command to see a list of all the different categories that are present. If we do this, |
| 135 | +we see that the only two types of libraries we have present are labelled PAIRED and SINGLE. There |
| 136 | +aren't any other types in our file. |
| 137 | + |
| 138 | +```bash |
| 139 | +$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort | uniq |
| 140 | +``` |
| 141 | + |
| 142 | +```output |
| 143 | +PAIRED |
| 144 | +SINGLE |
| 145 | +``` |
| 146 | + |
| 147 | +If we want to count how many of each we have, we can use the `-c` (count) flag for `uniq`. |
| 148 | + |
| 149 | +```bash |
| 150 | +$ cut -f3 SraRunTable.txt | grep -v LibraryLayout_s | sort | uniq -c |
| 151 | +``` |
| 152 | + |
| 153 | +```output |
| 154 | +2 PAIRED |
| 155 | +35 SINGLE |
| 156 | +``` |
| 157 | + |
| 158 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 159 | + |
| 160 | +## Exercise |
| 161 | + |
| 162 | +1. How many different sample load dates are there? |
| 163 | +2. How many samples were loaded on each date? |
| 164 | + |
| 165 | +::::::::::::::: solution |
| 166 | + |
| 167 | +## Solution |
| 168 | + |
| 169 | +1. There are two different sample load dates. |
| 170 | + |
| 171 | + ```bash |
| 172 | + cut -f5 SraRunTable.txt | grep -v LoadDate_s | sort | uniq |
| 173 | + ``` |
| 174 | + |
| 175 | + ```output |
| 176 | + 25-Jul-12 |
| 177 | + 29-May-14 |
| 178 | + ``` |
| 179 | + |
| 180 | +2. Six samples were loaded on one date and 31 were loaded on the other. |
| 181 | + |
| 182 | + ```bash |
| 183 | + cut -f5 SraRunTable.txt | grep -v LoadDate_s | sort | uniq -c |
| 184 | + ``` |
| 185 | + |
| 186 | + ```output |
| 187 | + 6 25-Jul-12 |
| 188 | + 31 29-May-14 |
| 189 | + ``` |
| 190 | + |
| 191 | +::::::::::::::::::::::::: |
| 192 | + |
| 193 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 194 | + |
| 195 | +#### Can we sort the file by library layout and save that sorted information to a new file? |
| 196 | + |
| 197 | +We might want to re-order our entire metadata table so that all of the paired-end samples appear |
| 198 | +together and all of the single-end samples appear together. We can use the `-k` (key) flag for `sort` to |
| 199 | +sort based on a particular column. This is similar to the `-f` flag for `cut`. |
| 200 | + |
| 201 | +Let's sort based on the third column (`-k3`) and redirect our output to a new file. |
| 202 | + |
| 203 | +```bash |
| 204 | +$ sort -k3 SraRunTable.txt > SraRunTable_sorted_by_layout.txt |
| 205 | +``` |
| 206 | + |
| 207 | +#### Can we extract only paired-end records into a new file? |
| 208 | + |
| 209 | +We also might want to extract the information for all samples that meet a specific criterion |
| 210 | +(for example, are paired-end) and put those lines of our table in a new file. First, we need |
| 211 | +to check to make sure that the pattern we're searching for ("PAIRED") only appears in the column |
| 212 | +where we expect it to occur (column 3). We know from earlier that there are only two paired-end |
| 213 | +samples in the file, so we can `grep` for "PAIRED" and see how many results we get. |
| 214 | + |
| 215 | +```bash |
| 216 | +$ grep PAIRED SraRunTable.txt | wc -l |
| 217 | +``` |
| 218 | + |
| 219 | +```output |
| 220 | +2 |
| 221 | +``` |
| 222 | + |
| 223 | +There are only two results, so we can use "PAIRED" as our search term to extract the paired-end |
| 224 | +samples to a new file. |
| 225 | + |
| 226 | +```bash |
| 227 | +$ grep PAIRED SraRunTable.txt > SraRunTable_only_paired_end.txt |
| 228 | +``` |
| 229 | + |
| 230 | +::::::::::::::::::::::::::::::::::::::: challenge |
| 231 | + |
| 232 | +## Exercise |
| 233 | + |
| 234 | +Sort samples by load date and export each of those sets to a new file (one new file per |
| 235 | +unique load date). |
| 236 | + |
| 237 | +::::::::::::::: solution |
| 238 | + |
| 239 | +## Solution |
| 240 | + |
| 241 | +```bash |
| 242 | +$ grep 25-Jul-12 SraRunTable.txt > SraRunTable_25-Jul-12.txt |
| 243 | +$ grep 29-May-14 SraRunTable.txt > SraRunTable_29-May-14.txt |
| 244 | +``` |
| 245 | + |
| 246 | +::::::::::::::::::::::::: |
| 247 | + |
| 248 | +:::::::::::::::::::::::::::::::::::::::::::::::::: |
| 249 | + |
| 250 | +## Making code more customizeable using command line arguments |
| 251 | + |
| 252 | +In Lesson 05 (Writing Scripts) we used the `grep` command line tool to look for FASTQ records with |
| 253 | +lots of Ns from all the .fastq files in our current folder using the following code: |
| 254 | + |
| 255 | +```bash |
| 256 | +$ grep -B1 -A2 -h NNNNNNNNNN *.fastq | grep -v '^--' > scripted_bad_reads.txt |
| 257 | +``` |
| 258 | + |
| 259 | +This is very useful, but could be more customizeable. We may want to be able to run this command |
| 260 | +over and over again without needing to copy and paste it and allow the user to specify exactly which |
| 261 | +file they want to examine for bad reads. |
| 262 | + |
| 263 | +We can accomplish these goals by including the above command in a script that takes in user input |
| 264 | +via a command line argument. We can slightly modify our `bad-reads-script.sh` file to do so. Use |
| 265 | +`c` to copy your `bad-reads-script.sh` into a new script called `custom-bad-reads-script.sh`. Make |
| 266 | +the following modifications to `custom-bad-reads-script.sh`: |
| 267 | + |
| 268 | +```bash |
| 269 | +filename=$1 |
| 270 | +grep -B1 -A2 -h NNNNNNNNNN $filename | grep -v '^--' > scripted_bad_reads.txt |
| 271 | +``` |
| 272 | + |
| 273 | +`$1` is our command line argument. The line `filename=$1` tells Bash to take the first thing you type |
| 274 | +after the name of the script itself and assign that value to a variable called filename. |
| 275 | + |
| 276 | +For example, this script can be run in the following way to output the bad reads just from one file: |
| 277 | + |
| 278 | +```bash |
| 279 | +bash custom-bad-reads-script.sh SRR098026.fastq |
| 280 | +``` |
| 281 | + |
| 282 | +We can then take a look at what the output file currently contains using `head scripted_bad_reads.txt`: |
| 283 | + |
| 284 | +```output |
| 285 | +@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35 |
| 286 | +NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN |
| 287 | ++SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35 |
| 288 | +!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!! |
| 289 | +@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35 |
| 290 | +NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN |
| 291 | ++SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35 |
| 292 | +!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!! |
| 293 | +@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35 |
| 294 | +NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN |
| 295 | +``` |
| 296 | + |
| 297 | +This should be same output as using our original code and manually modifying the |
| 298 | +original standalone code on the command line to "SRR098026.fastq" on the command line, |
| 299 | +which should give us the same output as above: |
| 300 | + |
| 301 | +```bash |
| 302 | +$ grep -B1 -A2 -h NNNNNNNNNN SRR098026.fastq | grep -v '^--' > scripted_bad_reads.txt |
| 303 | +head scripted_bad_reads.txt |
| 304 | +``` |
| 305 | + |
| 306 | +```output |
| 307 | +@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35 |
| 308 | +NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN |
| 309 | ++SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35 |
| 310 | +!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!! |
| 311 | +@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35 |
| 312 | +NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN |
| 313 | ++SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35 |
| 314 | +!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!! |
| 315 | +@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35 |
| 316 | +NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN |
| 317 | +``` |
0 commit comments