Ingest: Derive URL column during ingest #80

Merged
merged 1 commit into from
Dec 16, 2024
3 changes: 3 additions & 0 deletions ingest/defaults/config.yaml
@@ -102,6 +102,8 @@ curate:
  output_id_field: "accession"
  # The field in the NDJSON record that contains the actual genomic sequence
  output_sequence_field: "sequence"
  # The field in the NDJSON record that contains the actual GenBank accession
  genbank_accession: 'accession'
  # The list of metadata columns to keep in the final output of the curation pipeline.
  metadata_columns: [
    "accession",
@@ -121,4 +123,5 @@ curate:
    "authors",
    "full_authors",
    "institution",
    "url"
  ]
21 changes: 20 additions & 1 deletion ingest/rules/curate.smk
@@ -116,10 +116,29 @@ rule curate:
--output-seq-field {params.sequence_field} ) 2>> {log}
"""

rule add_metadata_columns:
    """Add columns to metadata
    Notable columns:
    - [NEW] url: URL linking to the NCBI GenBank record ('https://www.ncbi.nlm.nih.gov/nuccore/*').
    """
    input:
        metadata = "data/all_metadata.tsv"
    output:
        metadata = temp("data/all_metadata_added.tsv")
    params:
        accession=config['curate']['genbank_accession']
    shell:
        """
        csvtk mutate2 -t \
          -n url \
          -e '"https://www.ncbi.nlm.nih.gov/nuccore/" + ${params.accession}' \
          {input.metadata} \
        > {output.metadata}
        """

rule subset_metadata:
    input:
-        metadata="data/all_metadata.tsv",
+        metadata="data/all_metadata_added.tsv",
    output:
        subset_metadata="data/subset_metadata.tsv",
    params:
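
Note on the `add_metadata_columns` rule: Snakemake substitutes `{params.accession}` before the shell runs, so `${params.accession}` reaches csvtk as `$accession`, which `csvtk mutate2` reads as a reference to the `accession` column. The resulting `url` value is the NCBI nuccore prefix followed by each row's accession. The effect can be approximated in plain Python for a quick check. This is a minimal sketch and not code from this PR: the names `add_url_column` and `demo` and the sample accession `AB000001` are hypothetical, and it assumes the metadata TSV has a header row that contains the configured accession column and does not already contain a `url` column.

```python
import csv
import io

# Prefix taken from the rule's csvtk expression.
NUCCORE_PREFIX = "https://www.ncbi.nlm.nih.gov/nuccore/"


def add_url_column(tsv_in, tsv_out, accession_field="accession"):
    """Copy a TSV stream and append a `url` column derived from the accession.

    `accession_field` plays the role of config['curate']['genbank_accession'].
    """
    reader = csv.DictReader(tsv_in, delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + ["url"]
    writer = csv.DictWriter(
        tsv_out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Same string concatenation as the csvtk mutate2 expression.
        row["url"] = NUCCORE_PREFIX + row[accession_field]
        writer.writerow(row)


def demo():
    """Run the helper on a one-row, made-up metadata table."""
    src = io.StringIO("accession\tstrain\nAB000001\texample_strain\n")
    dst = io.StringIO()
    add_url_column(src, dst)
    return dst.getvalue()


if __name__ == "__main__":
    print(demo(), end="")
```

Keeping the accession column name as a parameter mirrors the new `genbank_accession` config key, so a workflow whose accession column has a different name only needs a config change.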