Skip to content

Step 1: Prepare reference database

mealser edited this page Jul 11, 2017 · 6 revisions

Step 1: Prepare reference database:

  1. The reference database contains many contigs for each organism. If you are interested in mapping at organism-level rather than contig-level, you need to pre-process the reference database.
  2. The code below prepares the database (e.g., fungi.fa), such that for each organism, it concatenates all contigs and generates a single fasta output. It also updates the organism name and the length of the concatenated contigs in the header field of each organism.
$ grep '>' fungi.fa | awk -F "|" '{print $2}' | uniq > fungi_RefList.txt
$ python ConcatContigs.py fungi_RefList.txt fungi.fa > fungi_ConcatContigs.fa

To execute the algorithm on UCLA-Huffman2 cluster:

$ cd /u/project/zarlab/malser/MiCoP/Scripts
$ qsub submit-Concat-Contigs.sh

Or execute the following commands:

$ cd /u/project/zarlab/malser/MiCoP/Scripts
$ grep '>' /u/project/zarlab/serghei/eupathdb/eupathdbFasta/ameoba.fa | awk -F "|" '{print $2}' | uniq > /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs/ameoba_RefList.txt
$ module load python/3.6.1
$ python3 ConcatContigs.py /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs/ameoba_RefList.txt /u/project/zarlab/serghei/eupathdb/eupathdbFasta/ameoba.fa > /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs/ameoba_ConcatContigs.fa

The output concatenated contigs of each organism can be found at: /u/project/zarlab/malser/MiCoP/eupathdbFasta_ConcatContigs

Check the correctness of the results:

Either compare the total number of bases in both reference databases (before and after concatenation).

$ grep '>' fungi_ConcatContigs.fa | awk -F '|' '{print $3}' | awk -F '=' '{sum += $2} END {print sum}' 
$ grep '>' /u/project/zarlab/serghei/eupathdb/eupathdbFasta/fungi.fa | awk -F '|' '{print $4}' | awk -F '=' '{sum += $2} END {print sum}'

Or compare the total number of bases for each organism in both databases.

$ sed -n '2p' fungi_ConcatContigs.fa | wc -m
$ grep 'organism=Aspergillus_aculeatus_ATCC_16872' /u/project/zarlab/serghei/eupathdb/eupathdbFasta/fungi.fa | awk -F '|' '{print $4}' | awk -F '=' '{sum += $2} END {print sum}'
Clone this wiki locally