- Install dependencies with `pipenv install`. Python 3.9 is required.
- Update the `data/SGS.sequences.fact.csv` spreadsheet with the latest data. Follow "Steps for adding studies" to create the intermediate files.
- Run `make fasta` and wait until the command finishes.
- Run `make sierra` and wait until the command finishes.
- Run `make build stat` and wait until the command finishes.
- Run `git add data/upload`, then commit and push.
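
The three `make` steps have to run in order, each one finishing before the next starts. A minimal Python sketch of that sequencing, assuming the commands are run from the repository root. The helper `run_pipeline` and its injectable `runner` parameter are illustrative names, not part of the repository:

```python
import subprocess

# Commands taken from the steps above, in the order they must run.
PIPELINE = [
    ["make", "fasta"],
    ["make", "sierra"],
    ["make", "build", "stat"],
]


def run_pipeline(commands=PIPELINE, runner=subprocess.run):
    """Run each command in order and stop at the first failure.

    `runner` is injectable so the sequencing can be exercised without
    actually invoking make (hypothetical helper, not a repository script).
    """
    completed = []
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later step never starts after a failed one.
        runner(cmd, check=True)
        completed.append(" ".join(cmd))
    return completed
```

Calling `run_pipeline()` blocks until all three targets have finished. The `git add data/upload`, commit and push steps are left manual so the output can be reviewed first.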
pipenv run python scripts/add_study.py multiple <PMID1> <PMID2> <PMID3> ...
# or
pipenv run python scripts/add_study.py single <PMID> [ACCESSION1, ACCESSION2, ...]
Alternative way:
- Find Genbank IDs for this study.
- In the Mangabey data entry program, check the "No reference" checkbox on the reference entry page, then click "Continue".
- Click "Nucleotide Sequences".
- Click "Add Genbank Sequence(s)", then type the Genbank IDs. Wait until all sequences have loaded.
- "Download" the output file.
- Merge the downloaded TSV file with `data/SGS.sequences.fact.csv`. Remember to delete all non-pol sequences.
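
The merge in the last step could be scripted roughly as follows. This is only a sketch: it assumes the downloaded TSV uses the same column names as the fact file (including `Accession`), which the steps above do not confirm, and `merge_downloaded` and `merge_files` are hypothetical helpers rather than repository scripts.

```python
import csv


def merge_downloaded(fact_rows, downloaded_rows, key="Accession"):
    """Append downloaded rows whose accession is not already present.

    Assumption: both inputs are lists of dicts sharing the `key` column.
    """
    seen = {row[key] for row in fact_rows}
    merged = list(fact_rows)
    for row in downloaded_rows:
        if row[key] not in seen:
            merged.append(row)
            seen.add(row[key])
    return merged


def merge_files(fact_path, tsv_path):
    """Merge the TSV into the fact CSV in place; return rows added."""
    with open(fact_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        fieldnames = reader.fieldnames
        fact_rows = list(reader)
    with open(tsv_path, newline="", encoding="utf-8") as fh:
        downloaded = list(csv.DictReader(fh, delimiter="\t"))
    merged = merge_downloaded(fact_rows, downloaded)
    with open(fact_path, "w", newline="", encoding="utf-8") as fh:
        # extrasaction="ignore" drops TSV columns the fact file lacks;
        # restval="" leaves fact-file columns the TSV lacks empty.
        writer = csv.DictWriter(fh, fieldnames=fieldnames,
                                extrasaction="ignore", restval="")
        writer.writeheader()
        writer.writerows(merged)
    return len(merged) - len(fact_rows)
```

Removing non-pol sequences stays a manual step here, because the sketch has no way to tell which gene a row covers.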
- MedlineID
- Accession
- CollectionDate: Format YYYY-MM-DD
- Source: Plasma, PBMC, etc.
- PtIdentifier: Format '{InternalRefID}-{SourcePtID}'
- CometSubtype: Use COMET to determine subtype; can be empty
- Rx: ART or None
- DateAdded: Format YYYY-MM-DD
- _Include: Always False (only used in research)
- _Reservoir: Was this sample collected from a viral reservoir?
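
The column rules above can be checked mechanically before committing. A minimal sketch, assuming each row is a dict of strings as produced by `csv.DictReader`, so that `None` and `False` appear as the literal text `"None"` and `"False"`. The function names are illustrative, and blank dates are skipped rather than rejected because the list does not say whether they are mandatory.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not DATE_RE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def validate_row(row):
    """Return a list of problems found in one row (empty list if none)."""
    problems = []
    for column in ("CollectionDate", "DateAdded"):
        value = row.get(column, "")
        # Assumption: blank dates are skipped here, not rejected.
        if value and not is_iso_date(value):
            problems.append(f"{column}: expected YYYY-MM-DD, got {value!r}")
    ref_id, sep, pt_id = row.get("PtIdentifier", "").partition("-")
    if not (ref_id and sep and pt_id):
        problems.append("PtIdentifier: expected '{InternalRefID}-{SourcePtID}'")
    if row.get("Rx", "") not in ("ART", "None"):
        problems.append("Rx: expected ART or None")
    if row.get("_Include", "") != "False":
        problems.append("_Include: expected False")
    return problems
```

Running `validate_row` over every row read from `data/SGS.sequences.fact.csv` and printing any non-empty result gives a quick pre-commit check. `CometSubtype` is not validated because it may be empty.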