
GSA documentation home

Karen Yook edited this page Mar 1, 2016 · 3 revisions

Welcome to the GSA-pipeline wiki!

Pipeline scripts

Original documentation is here: [GSA wiki on WormBase](http://wiki.wormbase.org/index.php/Caltech_documentation)


The following cron jobs on textpresso-dev.caltech.edu {WHERE??} automate the GSA pipeline.
The location of each script is shown, and the cron schedules show the order in which the scripts run.

WB Scripts

Step 1: download entities
05 06 * * * cd /home/arunr/gsa/worm/scripts; ./01downloadModEntities.pl 2>/dev/null >/dev/null;
25 15 * * * cd /home/arunr/gsa/worm/scripts; ./01create_elegans_gene_list.pl 2>/dev/null >/dev/null;
27 15 * * * cd /home/arunr/gsa/worm/scripts; ./02create_elegans_variation_list.pl 2>/dev/null >/dev/null;
31 17 * * * cd /home/arunr/gsa/worm/scripts; ./03create_elegans_transgene_list.pl 2>/dev/null >/dev/null;
NOTE: You may need to run the following manually each week on textpresso-dev.caltech.edu, using the password for citace:
$ cd /home/arunr/gsa/worm/scripts
$ ./01downloadModEntities.pl

Step 2: form sorted lexicon
20 20 * * 2,4,6 cd /home/arunr/gsa/worm/scripts; ./02formSortedLexicon.pl 2>/dev/null >/dev/null;
Step 3: check for new XML files and run if any new files in incoming_xml/
12,27,43,57 * * * * cd /home/arunr/gsa/worm/scripts; ./03link.pl ../incoming_xml/ ../html/ 2>/dev/null >/dev/null;
Step 3.1: check if WB curator wants to re-run the linking script after adding new entities via the journal first pass form.
13,28,43,58 * * * * cd /home/arunr/gsa/worm/scripts; ./06rerunLinking.pl 2>/dev/null >/dev/null;
Step 4: FTP the linked XML file after the curator submits file for FTP
06,21,36,51 * * * * cd /home/arunr/gsa/worm/scripts; ./05ftpAndEmailDjs.pl

SGD Scripts

Step 1: download entities
00 07 * * 0 cd /home/arunr/gsa/yeast/scripts/; ./01downloadModEntities.pl 2>/dev/null >/dev/null;
Step 2: form sorted lexicon
10 07 * * 0 cd /home/arunr/gsa/yeast/scripts/; ./02formSortedLexicon.pl 2>/dev/null >/dev/null;
Step 3: checks for new XML files and runs if any new files
01,16,31,46 * * * * cd /home/arunr/gsa/yeast/scripts; ./03link.pl ../incoming_xml/ ../html/ 2>/dev/null >/dev/null;
Step 4: FTPs the linked XML file after the curator submits file for FTP
09,29,49 * * * * cd /home/arunr/gsa/yeast/scripts; ./07run04and05.pl 2>/dev/null >/dev/null;

Flybase Scripts

Step 1: download entities
00 08 * * 0 cd /home/arunr/gsa/fly/scripts/; ./01downloadModEntities.pl 2>/dev/null >/dev/null;
Step 2: form sorted lexicon
10 09 * * 0 cd /home/arunr/gsa/fly/scripts/; ./02formSortedLexicon.pl 2>/dev/null >/dev/null;
Step 3: checks for new XML files and runs if any new files
02,17,32,47 * * * * cd /home/arunr/gsa/fly/scripts; ./03link.pl ../incoming_xml/ ../html/ 2>/dev/null >/dev/null;
Step 4: FTPs the linked XML file after the curator submits file for FTP
01,21,41 * * * * cd /home/arunr/gsa/fly/scripts; ./07run04and05.pl 2>/dev/null >/dev/null;
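
Most of the cron entries above send both stdout and stderr to /dev/null, so a failing step leaves nothing to inspect. When running a step by hand to debug it, one option is a small wrapper that captures the output in a log file instead. This is only a sketch: `run_logged` and the log path in the usage comment are hypothetical and not part of the pipeline.

```shell
# run_logged SCRIPT_DIR SCRIPT LOG_FILE [ARGS...]
# Hypothetical helper: run one pipeline script from its own directory (as the
# cron entries do) but keep stdout and stderr in LOG_FILE instead of /dev/null.
# The exit status of the script is passed back to the caller.
run_logged() {
    script_dir=$1
    script=$2
    log=$3
    shift 3
    ( cd "$script_dir" && "./$script" "$@" ) >"$log" 2>&1
}

# Example (the log location is an arbitrary choice):
#   run_logged /home/arunr/gsa/worm/scripts 02formSortedLexicon.pl /tmp/02formSortedLexicon.log
```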

Pipeline Problems/Solutions

  1. A file can get stuck in the pipeline when the cause is a problem in the XML itself. An email will alert the developer to the line that is causing the problem.
    Action: edit the latest XML file in:
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/incoming_xml where <MOD> is fly, worm or yeast.

  2. DJS needs the paper to be redone
    Action: the developer will have to clear the paper from the pipeline. Delete the corresponding file containing the document number in each of:
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/incoming_xml/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/done/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/logs/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/html/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/entity_link_tables/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/first_pass_logs/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/first_pass_entity_link_tables/
    Rerun the script manually by:
    $ cd /home/arunr/gsa/<MOD>/scripts/
    $ ./03link.pl ../incoming_xml/<docid>.xml ../html
    Curators will then be sent the newly linked paper again, with a new entity table link.

  3. QC curators need the paper to be redone
    Action: the developer will have to clear the paper from the pipeline. DO NOT delete the incoming XML file in /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/incoming_xml/. Delete the corresponding file containing the document number in each of:
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/done/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/logs/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/html/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/entity_link_tables/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/first_pass_logs/
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/first_pass_entity_link_tables/
    Rerun the script manually by:
    $ cd /home/arunr/gsa/<MOD>/scripts/
    $ ./03link.pl ../incoming_xml/<docid>.xml ../html
    Curators will receive alerts as before

  4. FTP fails for some reason
    Action: the file will need to be manually FTP'd to DJS at ftp1.dartmouthjournals.com. The file is:
    /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/linked_xml/<docid>.XML
    where <docid> is the document ID and <MOD> is fly, worm or yeast. Ask a curator for the username and password.

  5. Changes in developers, GSA editors, DJS personnel
    Action: email addresses need to be added or removed. The files with the email addresses are located in: /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/<MOD>/emails/
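
For problem 1 above, the file to edit is the latest one in incoming_xml. A minimal sketch for locating it, assuming that "latest" means most recently modified and that incoming files end in .xml (as in the rerun commands on this page). `latest_incoming_xml` is a hypothetical name.

```shell
# latest_incoming_xml DIR
# Hypothetical helper: print the most recently modified *.xml file in DIR,
# or nothing if there is none. ASSUMPTION: "latest" means newest by mtime.
# Parsing ls output is fragile for file names with newlines; names of the
# form <docid>.xml are assumed here.
latest_incoming_xml() {
    dir=$1
    ls -t "$dir"/*.xml 2>/dev/null | head -n 1
}

# Example:
#   latest_incoming_xml /data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa/worm/incoming_xml
```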
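
For problems 2 and 3 above, the same set of directories has to be cleared of one document, with the difference that problem 3 must leave incoming_xml untouched. The sketch below lists the matching files by default and deletes them only when asked. `clear_doc` is a hypothetical helper, and it assumes that "file containing the document number" means the file name contains the document ID. Since a short ID can match longer ones, it is worth reviewing the listing before deleting.

```shell
# Root of the per-MOD pipeline directories, as given on this page.
GSA_ROOT=${GSA_ROOT:-/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa}

# clear_doc MOD DOCID [all|keep-incoming] [list|delete]
# Hypothetical helper. ASSUMPTION: files are matched on their NAME.
#   all            problem 2: include incoming_xml
#   keep-incoming  problem 3: leave incoming_xml alone
clear_doc() {
    mod=$1
    docid=$2
    scope=${3:-all}
    action=${4:-list}
    # Refuse empty arguments: an empty DOCID would match every file.
    if [ -z "$mod" ] || [ -z "$docid" ]; then
        echo "usage: clear_doc MOD DOCID [all|keep-incoming] [list|delete]" >&2
        return 2
    fi
    dirs="done logs html entity_link_tables first_pass_logs first_pass_entity_link_tables"
    if [ "$scope" = all ]; then
        dirs="incoming_xml $dirs"
    fi
    for dir in $dirs; do
        path="$GSA_ROOT/$mod/$dir"
        [ -d "$path" ] || continue
        if [ "$action" = delete ]; then
            find "$path" -maxdepth 1 -type f -name "*${docid}*" -exec rm -f {} +
        else
            find "$path" -maxdepth 1 -type f -name "*${docid}*" -print
        fi
    done
}

# Example: review first, then delete (problem 3, keeping the incoming XML):
#   clear_doc worm <docid> keep-incoming
#   clear_doc worm <docid> keep-incoming delete
```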
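
The manual rerun in problems 2 and 3 above only works if the incoming XML for the document is present in ../incoming_xml/ relative to the scripts directory. A small guard, sketched here under that assumption, checks for the file before calling 03link.pl with the same arguments shown above. `rerun_linking` and `SCRIPTS_ROOT` are hypothetical names.

```shell
# Parent of the per-MOD scripts directories, as given on this page.
SCRIPTS_ROOT=${SCRIPTS_ROOT:-/home/arunr/gsa}

# rerun_linking MOD DOCID
# Hypothetical helper: rerun the linking step for one document, but only if
# its incoming XML is where 03link.pl will look for it.
rerun_linking() {
    mod=$1
    docid=$2
    (
        cd "$SCRIPTS_ROOT/$mod/scripts" || exit 1
        if [ ! -f "../incoming_xml/$docid.xml" ]; then
            echo "incoming XML for $docid not found; not rerunning" >&2
            exit 1
        fi
        ./03link.pl "../incoming_xml/$docid.xml" ../html
    )
}

# Example:
#   rerun_linking yeast <docid>
```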
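
For problem 4 above, one way to do the manual transfer is with curl rather than an interactive FTP client. This is a sketch under several assumptions: that the server accepts plain FTP, that the file belongs in the login's default remote directory, and that curl is installed. None of these is confirmed by this page, and `upload_linked_xml` is a hypothetical name. By default the function only prints the command so it can be reviewed first.

```shell
# Root of the per-MOD pipeline directories, as given on this page.
GSA_ROOT=${GSA_ROOT:-/data2/srv/textpresso-dev.caltech.edu/www/docroot/gsa}

# upload_linked_xml MOD DOCID USER [print|run]
# Hypothetical helper. ASSUMPTIONS: plain FTP, upload into the default
# remote directory. With --user USER and no password, curl prompts for the
# password, so it is not stored in shell history.
upload_linked_xml() {
    mod=$1
    docid=$2
    user=$3
    mode=${4:-print}
    file="$GSA_ROOT/$mod/linked_xml/$docid.XML"
    # The trailing slash makes curl reuse the local file name remotely.
    url="ftp://ftp1.dartmouthjournals.com/"
    if [ "$mode" = run ]; then
        curl --user "$user" -T "$file" "$url"
    else
        echo "curl --user $user -T $file $url"
    fi
}

# Example (username comes from the curator):
#   upload_linked_xml fly <docid> <username>
#   upload_linked_xml fly <docid> <username> run
```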
