Merge pull request #419: Sync vendored scripts

victorlin authored Oct 17, 2023
2 parents 5aed3ed + b38be5c commit 2a73f4d

Showing 20 changed files with 216 additions and 39 deletions.
13 changes: 4 additions & 9 deletions README.md
@@ -153,13 +153,8 @@ aws s3 cp - s3://nextstrain-data/files/ncov/open/nextclade_21L.tsv.zst.renew < /
 
 ## `vendored`
 
-This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of ingest scripts in `vendored`, from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, first install `git subrepo`, then run:
+This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) to manage copies of ingest scripts in [`vendored`](./vendored), from [nextstrain/ingest](https://github.com/nextstrain/ingest). To pull new changes from the central ingest repository, first install `git subrepo`, then run:
 
-```sh
-git subrepo pull vendored
-```
-
-Changes should not be pushed using `git subrepo push`.
-
-1. For pathogen-specific changes, make them in this repository via a pull request.
-2. For pathogen-agnostic changes, make them on [nextstrain/ingest](https://github.com/nextstrain/ingest) via pull request there, then use `git subrepo pull` to add those changes to this repository.
+See [vendored/README.md](vendored/README.md#vendoring) for instructions on how to update
+the vendored scripts. Note that this repo is a special case and does not put vendored
+scripts in an `ingest/` directory. Modify commands accordingly.
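Because this repo vendors at the top level rather than under `ingest/`, commands from the vendored README need their path adjusted. A small sketch of the adjustment (the directory test is illustrative, not part of any script in this commit):

```shell
# Pick the subrepo path: most pathogen repos use ingest/vendored,
# but this repo is a special case and vendors at the top level.
if [ -d ingest/vendored ]; then
  subrepo_path=ingest/vendored
else
  subrepo_path=vendored
fi
echo "git subrepo pull $subrepo_path"
```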
14 changes: 0 additions & 14 deletions bin/csv-to-ndjson

This file was deleted.

3 changes: 3 additions & 0 deletions vendored/.cramrc
@@ -0,0 +1,3 @@
[cram]
shell = /bin/bash
indent = 2
16 changes: 13 additions & 3 deletions vendored/.github/workflows/ci.yaml
@@ -1,13 +1,23 @@
 name: CI
 
 on:
-  - push
-  - pull_request
-  - workflow_dispatch
+  push:
+    branches:
+      - main
+  pull_request:
+  workflow_dispatch:
 
 jobs:
   shellcheck:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v3
       - uses: nextstrain/.github/actions/shellcheck@master
+
+  cram:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+      - run: pip install cram
+      - run: cram tests/
4 changes: 2 additions & 2 deletions vendored/.gitrepo
@@ -6,7 +6,7 @@
 [subrepo]
 	remote = https://github.com/nextstrain/ingest
 	branch = main
-	commit = 1eb8b30428d5f66adac201f0a246a7ab4bdc9792
-	parent = 6fd5a9b1d87e59fab35173dbedf376632154943b
+	commit = 7617c39fae05e5882c5e6c065c5b47d500c998af
+	parent = 6c0a9cc7a1c3cfc6a055707a0eb661af56befeb6
 	method = merge
 	cmdver = 0.4.6
46 changes: 46 additions & 0 deletions vendored/README.md
@@ -25,6 +25,31 @@ Any future updates of ingest scripts can be pulled in with:
git subrepo pull ingest/vendored
```

If you run into merge conflicts and would like to pull in a fresh copy of the
latest ingest scripts, pull with the `--force` flag:

```
git subrepo pull ingest/vendored --force
```

> **Warning**
> Beware of rebasing/dropping the parent commit of a `git subrepo` update!
>
> `git subrepo` relies on metadata in the `ingest/vendored/.gitrepo` file,
> which includes the hash for the parent commit in the pathogen repos.
> If this hash no longer exists in the commit history, there will be errors when
> running future `git subrepo pull` commands.

If you run into an error similar to the following:
```
$ git subrepo pull ingest/vendored
git-subrepo: Command failed: 'git branch subrepo/ingest/vendored '.
fatal: not a valid object name: ''
```
Check the parent commit hash in the `ingest/vendored/.gitrepo` file and make
sure the commit exists in the commit history. Update to the appropriate parent
commit hash if needed.
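That check can be sketched as a short script (the `ingest/vendored/.gitrepo` path and the `parent` key are taken from the file shown above; the script itself is illustrative, not part of this repo):

```shell
# Sketch: confirm the parent commit recorded in .gitrepo still exists
# in history before running `git subrepo pull`.
gitrepo="ingest/vendored/.gitrepo"
parent="$(sed -n 's/^[[:space:]]*parent = //p' "$gitrepo")"

if git cat-file -e "${parent}^{commit}" 2>/dev/null; then
  echo "parent commit $parent exists; safe to pull"
else
  echo "parent commit $parent is missing; fix .gitrepo before pulling" >&2
fi
```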

## History

Much of this tooling originated in
@@ -69,6 +94,13 @@ Scripts for supporting ingest workflow automation that don’t really belong in
- [trigger-on-new-data](trigger-on-new-data) - Triggers downstream GitHub Actions if the provided `upload-to-s3` outputs do not contain the `identical_file_message`
A hacky way to ensure that we only trigger downstream phylogenetic builds if the S3 objects have been updated.

NCBI interaction scripts that are useful for fetching public metadata and sequences.

- [fetch-from-ncbi-entrez](fetch-from-ncbi-entrez) - Fetch metadata and nucleotide sequences from [NCBI Entrez](https://www.ncbi.nlm.nih.gov/books/NBK25501/) and output to a GenBank file.
Useful for pathogens with metadata and annotations in custom fields that are not part of the standard [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/) outputs.

Historically, some pathogen repos used the undocumented NCBI Virus API through [fetch-from-ncbi-virus](https://github.com/nextstrain/ingest/blob/c97df238518171c2b1574bec0349a55855d1e7a7/fetch-from-ncbi-virus) to fetch data. However, we've opted to drop the NCBI Virus scripts due to https://github.com/nextstrain/ingest/issues/18.

Potential Nextstrain CLI scripts

- [sha256sum](sha256sum) - Used to check if files are identical in upload-to-s3 and download-from-s3 scripts.
@@ -89,3 +121,17 @@ Potential augur curate scripts
- [transform-authors](transform-authors) - Abbreviates full author lists to '<first author> et al.'
- [transform-field-names](transform-field-names) - Rename fields of NDJSON records
- [transform-genbank-location](transform-genbank-location) - Parses `location` field with the expected pattern `"<country_value>[:<region>][, <locality>]"` based on [GenBank's country field](https://www.ncbi.nlm.nih.gov/genbank/collab/country/)
- [transform-strain-names](transform-strain-names) - Ordered search for strain names across several fields.
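As an illustration of the GenBank location pattern mentioned above, here is a hypothetical parse of `"<country_value>[:<region>][, <locality>]"` using plain shell parameter expansion. This is a sketch of the pattern only, not the vendored `transform-genbank-location` script:

```shell
# Hypothetical split of a GenBank-style location string following
# "<country_value>[:<region>][, <locality>]".
location='USA: Washington, Seattle'
country="${location%%:*}"   # text before the first ':'
rest="${location#*: }"      # text after ': '
region="${rest%%,*}"        # text before the first ','
locality="${rest#*, }"      # text after ', '
echo "country=$country region=$region locality=$locality"
```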

## Software requirements

Some scripts may require Bash ≥4. If you are running these scripts on macOS, the builtin Bash (`/bin/bash`) does not meet this requirement. You can install [Homebrew's Bash](https://formulae.brew.sh/formula/bash), which is more up to date.
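A quick way to verify is to ask the `bash` on your `PATH` for its major version. This sketch is not part of the vendored scripts and runs under any POSIX shell:

```shell
# Check whether the bash on PATH meets the Bash >= 4 requirement.
# macOS's builtin /bin/bash is 3.2, so Homebrew bash is typically needed there.
major="$(bash -c 'echo "${BASH_VERSINFO[0]}"')"
if [ "$major" -ge 4 ]; then
  echo "bash major version $major: OK"
else
  echo "bash major version $major: install a newer bash (e.g. via Homebrew)" >&2
fi
```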

## Testing

Most scripts are untested within this repo, relying on "testing in production". That is the only practical testing option for some scripts such as the ones interacting with S3 and Slack.

For more locally testable scripts, Cram-style functional tests live in `tests` and are run as part of CI. To run these locally,

1. Download Cram: `pip install cram`
2. Run the tests: `cram tests/`
2 changes: 1 addition & 1 deletion vendored/cloudfront-invalidate
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 # Originally from @tsibley's gist: https://gist.github.com/tsibley/a66262d341dedbea39b02f27e2837ea8
 set -euo pipefail
2 changes: 1 addition & 1 deletion vendored/download-from-s3
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 bin="$(dirname "$0")"
70 changes: 70 additions & 0 deletions vendored/fetch-from-ncbi-entrez
@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Fetch metadata and nucleotide sequences from NCBI Entrez and output to a GenBank file.
"""
import json
import argparse
from Bio import SeqIO, Entrez

# To use the efetch API, the docs indicate only around 10,000 records should be fetched per request
# https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
# However, in my testing with HepB, the max records returned was 9,999
# - Jover, 16 August 2023
BATCH_SIZE = 9999

Entrez.email = "[email protected]"

def parse_args():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('--term', required=True, type=str,
        help='Genbank search term. Replace spaces with "+", e.g. "Hepatitis+B+virus[All+Fields]complete+genome[All+Fields]"')
    parser.add_argument('--output', required=True, type=str, help='Output file (Genbank)')
    return parser.parse_args()


def get_esearch_history(term):
    """
    Search for the provided *term* via ESearch and store the results using the
    Entrez history server.¹

    Returns the total count of returned records, query key, and web env needed
    to access the records from the server.

    ¹ https://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Using_the_Entrez_History_Server
    """
    handle = Entrez.esearch(db="nucleotide", term=term, retmode="json", usehistory="y", retmax=0)
    esearch_result = json.loads(handle.read())['esearchresult']
    print(f"Search term {term!r} returned {esearch_result['count']} IDs.")
    return {
        "count": int(esearch_result["count"]),
        "query_key": esearch_result["querykey"],
        "web_env": esearch_result["webenv"]
    }


def fetch_from_esearch_history(count, query_key, web_env):
    """
    Fetch records in batches from Entrez history server using the provided
    *query_key* and *web_env* and yield them as BioPython SeqRecord iterators.
    """
    print(f"Fetching GenBank records in batches of n={BATCH_SIZE}")

    for start in range(0, count, BATCH_SIZE):
        handle = Entrez.efetch(
            db="nucleotide",
            query_key=query_key,
            webenv=web_env,
            retstart=start,
            retmax=BATCH_SIZE,
            rettype="gb",
            retmode="text")

        yield SeqIO.parse(handle, "genbank")


if __name__ == "__main__":
    args = parse_args()

    with open(args.output, "w") as output_handle:
        for batch_results in fetch_from_esearch_history(**get_esearch_history(args.term)):
            SeqIO.write(batch_results, output_handle, "genbank")
2 changes: 1 addition & 1 deletion vendored/notify-on-diff
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 
 set -euo pipefail
2 changes: 1 addition & 1 deletion vendored/notify-on-job-fail
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion vendored/notify-on-job-start
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion vendored/notify-on-record-change
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion vendored/notify-slack
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 : "${SLACK_TOKEN:?The SLACK_TOKEN environment variable is required.}"
2 changes: 1 addition & 1 deletion vendored/s3-object-exists
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 url="${1#s3://}"
17 changes: 17 additions & 0 deletions vendored/tests/transform-strain-names/transform-strain-names.t
@@ -0,0 +1,17 @@
Look for strain name in "strain" or a list of backup fields.

If strain entry exists, do not do anything.

  $ echo '{"strain": "i/am/a/strain", "strain_s": "other"}' \
  >   | $TESTDIR/../../transform-strain-names \
  >     --strain-regex '^.+$' \
  >     --backup-fields strain_s accession
  {"strain":"i/am/a/strain","strain_s":"other"}

If strain entry does not exist, search the backup fields.

  $ echo '{"strain_s": "other"}' \
  >   | $TESTDIR/../../transform-strain-names \
  >     --strain-regex '^.+$' \
  >     --backup-fields accession strain_s
  {"strain_s":"other","strain":"other"}
50 changes: 50 additions & 0 deletions vendored/transform-strain-names
@@ -0,0 +1,50 @@
#!/usr/bin/env python3
"""
Verifies strain name pattern in the 'strain' field of the NDJSON record from
stdin. Adds a 'strain' field to the record if it does not already exist.
Outputs the modified records to stdout.
"""
import argparse
import json
import re
from sys import stderr, stdin, stdout


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.ArgumentDefaultsHelpFormatter
    )
    parser.add_argument("--strain-regex", default="^.+$",
        help="Regex pattern for strain names. " +
             "Strain names that do not match the pattern will be dropped.")
    parser.add_argument("--backup-fields", nargs="*",
        help="List of backup fields to use as strain name if the value in 'strain' " +
             "does not match the strain regex pattern. " +
             "If multiple fields are provided, will use the first field that has a non-empty string.")

    args = parser.parse_args()

    strain_name_pattern = re.compile(args.strain_regex)

    for index, record in enumerate(stdin):
        record = json.loads(record)

        # Verify strain name matches the strain regex pattern
        if strain_name_pattern.match(record.get('strain', '')) is None:
            # Default to empty string if not matching pattern
            record['strain'] = ''
            # Use non-empty value of backup fields if provided
            if args.backup_fields:
                for field in args.backup_fields:
                    if record.get(field):
                        record['strain'] = str(record[field])
                        break

        if record['strain'] == '':
            print(f"WARNING: Record number {index} has an empty string as the strain name.", file=stderr)

        json.dump(record, stdout, allow_nan=False, indent=None, separators=',:')
        print()
2 changes: 1 addition & 1 deletion vendored/trigger
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 : "${PAT_GITHUB_DISPATCH:=}"
2 changes: 1 addition & 1 deletion vendored/trigger-on-new-data
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 : "${PAT_GITHUB_DISPATCH:?The PAT_GITHUB_DISPATCH environment variable is required.}"
2 changes: 1 addition & 1 deletion vendored/upload-to-s3
@@ -1,4 +1,4 @@
-#!/bin/bash
+#!/usr/bin/env bash
 set -euo pipefail
 
 bin="$(dirname "$0")"
