-
Notifications
You must be signed in to change notification settings - Fork 0
Step 1: Prepare reference database
mealser edited this page Jul 11, 2017
·
6 revisions
- The reference database contains many contigs for each organism. If you are interested in mapping at organism-level rather than contig-level, you need to pre-process the reference database.
- The code below prepares the database (e.g., fungi.fa), such that for each organism, it concatenates all contigs and generates a single fasta output. It also updates the organism name and the length of the concatenated contigs in the header field of each organism.
$ grep '>' fungi.fa | awk -F "|" '{print $2}' | uniq > fungi_RefList.txt
$ python ConcatContigs.py fungi_RefList.txt fungi.fa > fungi_ConcatContigs.fa
To execute the algorithm on UCLA-Huffman2 cluster:
$ cd /u/project/zarlab/malser/MiCoP/Scripts
$ qsub submit-Concat-Contigs.sh
Or execute the following commands:
$ cd /u/project/zarlab/malser/MiCoP/Scripts
$ grep '>' /u/project/zarlab/serghei/eupathdb/eupathdbFasta/ameoba.fa | awk -F "|" '{print $2}' | uniq > /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs/ameoba_RefList.txt
$ module load python/3.6.1
$ python3 ConcatContigs.py /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs/ameoba_RefList.txt /u/project/zarlab/serghei/eupathdb/eupathdbFasta/ameoba.fa > /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs/ameoba_ConcatContigs.fa
The output concatenated contigs of each organism can be found at: /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs
Either compare the total number of bases in both reference databases (before and after concatenation).
$ grep '>' fungi_ConcatContigs.fa | awk -F '|' '{print $3}' | awk -F '=' '{sum += $2} END {print sum}'
$ grep '>' /u/project/zarlab/serghei/eupathdb/eupathdbFasta/fungi.fa | awk -F '|' '{print $4}' | awk -F '=' '{sum += $2} END {print sum}'
Or compare the total number of bases for each organism in both databases.
$ sed -n '2p' fungi_ConcatContigs.fa | wc -m
$ grep 'organism=Aspergillus_aculeatus_ATCC_16872' /u/project/zarlab/serghei/eupathdb/eupathdbFasta/fungi.fa | awk -F '|' '{print $4}' | awk -F '=' '{sum += $2} END {print sum}'