Add ingest-to-phylogenetic GitHub Action... [#14] #34

Merged 1 commit on Jul 24, 2024
146 changes: 146 additions & 0 deletions .github/workflows/ingest-to-phylogenetic.yaml
@@ -0,0 +1,146 @@
name: Ingest to phylogenetic

defaults:
  run:
    # This is the same as GitHub Actions' `bash` keyword as of 20 June 2023:
    # https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsshell
    #
    # Completely spelling it out here so that GitHub can't change it out from
    # under us and we don't have to refer to the docs to know the expected behavior.
    shell: bash --noprofile --norc -eo pipefail {0}
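    # For example, with `-e -o pipefail` a pipeline step like
    #
    #   curl "$url" | zstd -d > data.tsv
    #
    # aborts the step when the download fails instead of leaving a truncated
    # data.tsv behind. (Illustrative command only, not part of this workflow.)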

on:
  schedule:
    # Note times are in UTC, which is 1 or 2 hours behind CET depending on
    # daylight saving time.
    #
    # Note the actual runs might be late.
    # Numerous people have been confused about that, including me:
    #  - https://github.community/t/scheduled-action-running-consistently-late/138025/11
    #  - https://github.com/github/docs/issues/3059
    #
    # Note, '*' is a special character in YAML, so you have to quote this string.
    #
    # Docs:
    #  - https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows#schedule
    #
    # Tool that deciphers this particular format of crontab string:
    #  - https://crontab.guru/
    #
    # Runs at 5:30pm UTC (1:30pm EDT/10:30am PDT) since curation by NCBI happens
    # on the East Coast. We were running into invalid zip archive errors at 9am
    # PDT, so we're hoping the extra hour and a half will lower the error frequency.
    - cron: '30 17 * * *'

  workflow_dispatch:
    inputs:
      ingest_image:
        description: 'Specific container image to use for ingest workflow (will override the default of "nextstrain build")'
        required: false
      phylogenetic_image:
        description: 'Specific container image to use for phylogenetic workflow (will override the default of "nextstrain build")'
        required: false
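    # For example, a manual run that overrides the ingest image can be
    # dispatched with the GitHub CLI (a sketch; the image tag is hypothetical):
    #
    #   gh workflow run ingest-to-phylogenetic.yaml \
    #     -f ingest_image=nextstrain/base:branch-example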

jobs:
  ingest:
    permissions:
      id-token: write
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    secrets: inherit
    with:
      # Starting with the default docker runtime.
      # We can migrate to AWS Batch if/when we need more resources or if the
      # job runs longer than the GitHub Actions limit of 6 hours.
      runtime: docker
      env: |
        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.ingest_image }}
      run: |
        nextstrain build \
          ingest \
          upload_all \
          --configfile build-configs/nextstrain-automation/config.yaml
      # Specifying artifact name to differentiate the ingest build outputs
      # from the phylogenetic build outputs
      artifact-name: ingest-build-output
      artifact-paths: |
        ingest/results/
        ingest/benchmarks/
        ingest/logs/
        ingest/.snakemake/log/

  # Check if the ingest results include new data by looking for a cache of the
  # file containing the results' Metadata.sha256sum values (which should have
  # been added within upload-to-s3).
  # GitHub removes any cache entries that have not been accessed in over 7 days,
  # so if the workflow has not run in over 7 days, the check will miss and
  # phylogenetic will be triggered.
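  #
  # For example: if all eight sha256sums below are unchanged since the last
  # run, hashFiles() yields the same digest, the cache key already exists,
  # `cache-hit` is 'true', and the phylogenetic job is skipped.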
  check-new-data:
    needs: [ingest]
    runs-on: ubuntu-latest
    outputs:
      cache-hit: ${{ steps.check-cache.outputs.cache-hit }}
    steps:
      - name: Get sha256sum
        id: get-sha256sum
        env:
          AWS_DEFAULT_REGION: ${{ vars.AWS_DEFAULT_REGION }}
        run: |
          s3_urls=(
            "s3://nextstrain-data/files/workflows/seasonal-cov/229e/metadata.tsv.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/229e/sequences.fasta.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/hku1/metadata.tsv.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/hku1/sequences.fasta.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/nl63/metadata.tsv.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/nl63/sequences.fasta.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/oc43/metadata.tsv.zst"
            "s3://nextstrain-data/files/workflows/seasonal-cov/oc43/sequences.fasta.zst"
          )

          # Code below is modified from ingest/upload-to-s3
          # https://github.com/nextstrain/ingest/blob/c0b4c6bb5e6ccbba86374d2c09b42077768aac23/upload-to-s3#L23-L29
          no_hash=0000000000000000000000000000000000000000000000000000000000000000

          for s3_url in "${s3_urls[@]}"; do
            s3path="${s3_url#s3://}"
            bucket="${s3path%%/*}"
            key="${s3path#*/}"

            s3_hash="$(aws s3api head-object --no-sign-request --bucket "$bucket" --key "$key" --query Metadata.sha256sum --output text 2>/dev/null || echo "$no_hash")"
            echo "${s3_hash}" | tee -a ingest-output-sha256sum
          done

      - name: Check cache
        id: check-cache
        uses: actions/cache@v4
        with:
          path: ingest-output-sha256sum
          key: ingest-output-sha256sum-${{ hashFiles('ingest-output-sha256sum') }}
          lookup-only: true

  phylogenetic:
    needs: [check-new-data]
    if: ${{ needs.check-new-data.outputs.cache-hit != 'true' }}
    permissions:
      id-token: write
    uses: nextstrain/.github/.github/workflows/pathogen-repo-build.yaml@master
    secrets: inherit
    with:
      # Starting with the default docker runtime.
      # We can migrate to AWS Batch if/when we need more resources or if the
      # job runs longer than the GitHub Actions limit of 6 hours.
      runtime: docker
      env: |
        NEXTSTRAIN_DOCKER_IMAGE: ${{ inputs.phylogenetic_image }}
      run: |
        nextstrain build \
          phylogenetic \
          deploy_all \
          --configfile build-configs/nextstrain-automation/config.yaml
      # Specifying artifact name to differentiate the phylogenetic build
      # outputs from the ingest build outputs
      artifact-name: phylogenetic-build-output
      artifact-paths: |
        phylogenetic/auspice/
        phylogenetic/results/
        phylogenetic/benchmarks/
        phylogenetic/logs/
        phylogenetic/.snakemake/log/
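The change-detection logic in `check-new-data` can also be exercised locally with a small sketch like the one below (assumes the AWS CLI is installed; the bucket and key are one of the eight pairs checked by the workflow):

#!/usr/bin/env bash
set -eo pipefail

# One of the ingest outputs whose sha256sum metadata upload-to-s3 records.
bucket=nextstrain-data
key=files/workflows/seasonal-cov/229e/metadata.tsv.zst

# Fetch the recorded hash; anonymous access is fine for this public bucket.
current="$(aws s3api head-object --no-sign-request \
  --bucket "$bucket" --key "$key" \
  --query Metadata.sha256sum --output text)"

# Compare against the hash saved by a previous run, if any.
previous="$(cat previous-sha256sum 2>/dev/null || true)"

if [[ "$current" == "$previous" ]]; then
  echo "No new data; phylogenetic would be skipped."
else
  echo "New data; phylogenetic would run."
  echo "$current" > previous-sha256sum
fi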
1 change: 1 addition & 0 deletions .gitignore
@@ -6,6 +6,7 @@ ingest/logs
ingest/results
phylogenetic/auspice
phylogenetic/benchmarks
phylogenetic/data
phylogenetic/logs
phylogenetic/results

7 changes: 7 additions & 0 deletions ingest/Snakefile
@@ -45,3 +45,10 @@ rule clean:
        """
        rm -rfv {params.targets}
        """


# Import custom rules provided via the config.
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file
30 changes: 30 additions & 0 deletions ingest/build-configs/nextstrain-automation/config.yaml
@@ -0,0 +1,30 @@
# This configuration file should contain all required configuration parameters
# for the ingest workflow to run with additional Nextstrain automation rules.

# Custom rules to run as part of the Nextstrain automated workflow
# The paths should be relative to the ingest directory.
custom_rules:
  - build-configs/nextstrain-automation/upload.smk

# Nextstrain CloudFront domain to ensure that we invalidate CloudFront after the S3 uploads
# This is required as long as we are using the AWS CLI for uploads
cloudfront_domain: "data.nextstrain.org"

# Nextstrain AWS S3 bucket with the pathogen prefix
# (the template's <pathogen> placeholder, filled in here with the repo name, seasonal-cov)
s3_dst: "s3://nextstrain-data/files/workflows/seasonal-cov"

# Mapping of files to upload
files_to_upload:
  229e/ncbi.ndjson.zst: data/229e/ncbi.ndjson
  229e/metadata.tsv.zst: results/229e/metadata.tsv
  229e/sequences.fasta.zst: results/229e/sequences.fasta
  hku1/ncbi.ndjson.zst: data/hku1/ncbi.ndjson
  hku1/metadata.tsv.zst: results/hku1/metadata.tsv
  hku1/sequences.fasta.zst: results/hku1/sequences.fasta
  nl63/ncbi.ndjson.zst: data/nl63/ncbi.ndjson
  nl63/metadata.tsv.zst: results/nl63/metadata.tsv
  nl63/sequences.fasta.zst: results/nl63/sequences.fasta
  oc43/ncbi.ndjson.zst: data/oc43/ncbi.ndjson
  oc43/metadata.tsv.zst: results/oc43/metadata.tsv
  oc43/sequences.fasta.zst: results/oc43/sequences.fasta
48 changes: 48 additions & 0 deletions ingest/build-configs/nextstrain-automation/upload.smk
@@ -0,0 +1,48 @@
"""
This part of the workflow handles uploading files to AWS S3.

Files to upload must be defined in the `files_to_upload` config param, where
the keys are the remote files and the values are the local filepaths
relative to the ingest directory.

Produces a single file for each uploaded file:
"results/upload/{remote_file}.upload"

The rule `upload_all` can be used as a target to upload all files.
"""

import os

slack_envvars_defined = "SLACK_CHANNELS" in os.environ and "SLACK_TOKEN" in os.environ
send_notifications = (
    config.get("send_slack_notifications", False) and slack_envvars_defined
)


rule upload_to_s3:
    input:
        file_to_upload=lambda wildcards: config["files_to_upload"][wildcards.remote_file],
    output:
        "results/upload/{remote_file}.upload",
    params:
        quiet="" if send_notifications else "--quiet",
        s3_dst=config["s3_dst"],
        cloudfront_domain=config["cloudfront_domain"],
    shell:
        """
        ./vendored/upload-to-s3 \
            {params.quiet} \
            {input.file_to_upload:q} \
            {params.s3_dst:q}/{wildcards.remote_file:q} \
            {params.cloudfront_domain} 2>&1 | tee {output}
        """


rule upload_all:
    input:
        uploads=[
            f"results/upload/{remote_file}.upload"
            for remote_file in config["files_to_upload"].keys()
        ],
    output:
        touch("results/upload_all.done"),
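As a usage sketch (assuming the Nextstrain CLI is set up and the vendored scripts are in place), a dry run from the repo root should preview the uploads without touching S3; extra arguments after the target are passed through to Snakemake:

nextstrain build ingest upload_all \
  --configfile build-configs/nextstrain-automation/config.yaml \
  --dry-run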
7 changes: 7 additions & 0 deletions phylogenetic/Snakefile
@@ -26,3 +26,10 @@ rule clean:
        """
        rm -rfv {params.targets}
        """


# Import custom rules provided via the config.
if "custom_rules" in config:
    for rule_file in config["custom_rules"]:

        include: rule_file
4 changes: 4 additions & 0 deletions phylogenetic/build-configs/nextstrain-automation/config.yaml
@@ -0,0 +1,4 @@
custom_rules:
  - build-configs/nextstrain-automation/deploy.smk

deploy_url: "s3://nextstrain-data"
19 changes: 19 additions & 0 deletions phylogenetic/build-configs/nextstrain-automation/deploy.smk
@@ -0,0 +1,19 @@
"""
This part of the workflow handles automatic deployments of the
`seasonal-cov` builds. Uploads the build defined as the default output of
the workflow through the `all` rule from Snakefille

"""


rule deploy_all:
    input:
        *rules.all.input,
    output:
        touch("results/deploy_all.done"),
    params:
        deploy_url=config["deploy_url"],
    shell:
        """
        nextstrain remote upload {params.deploy_url} {input}
        """
8 changes: 0 additions & 8 deletions phylogenetic/defaults/config.yaml
@@ -33,9 +33,7 @@ viruses:
  229e:
    reference: "defaults/229e/reference.fasta"
    genemap: "defaults/229e/genemap.gff"
    metadata: "../ingest/results/229e/metadata.tsv"
    prepare_sequences:
      sequences: "../ingest/results/229e/sequences.fasta"
      group_by: "country"
      subsample_max_sequences: 4000
      min_length: 20000
@@ -50,9 +48,7 @@
  nl63:
    reference: "defaults/nl63/reference.fasta"
    genemap: "defaults/nl63/genemap.gff"
    metadata: "../ingest/results/nl63/metadata.tsv"
    prepare_sequences:
      sequences: "../ingest/results/nl63/sequences.fasta"
      group_by: "country"
      subsample_max_sequences: 4000
      min_length: 20000
@@ -67,9 +63,7 @@ nl63:
  oc43:
    reference: "defaults/oc43/reference.fasta"
    genemap: "defaults/oc43/genemap.gff"
    metadata: "../ingest/results/oc43/metadata.tsv"
    prepare_sequences:
      sequences: "../ingest/results/oc43/sequences.fasta"
      group_by: "country"
      subsample_max_sequences: 4000
      min_length: 20000
@@ -84,9 +78,7 @@ oc43:
  hku1:
    reference: "defaults/hku1/reference.fasta"
    genemap: "defaults/hku1/genemap.gff"
    metadata: "../ingest/results/hku1/metadata.tsv"
    prepare_sequences:
      sequences: "../ingest/results/hku1/sequences.fasta"
      group_by: "country"
      subsample_max_sequences: 4000
      min_length: 20000
2 changes: 1 addition & 1 deletion phylogenetic/rules/construct_phylogeny.smk
@@ -30,7 +30,7 @@ rule refine:
    input:
        tree="results/{virus}/tree_raw.nwk",
        alignment="results/{virus}/aligned.fasta",
        metadata=lambda wildcards: config[wildcards.virus]["metadata"],
        metadata="data/{virus}/metadata.tsv",
    output:
        tree="results/{virus}/tree.nwk",
        node_data="results/{virus}/branch_lengths.json",
2 changes: 1 addition & 1 deletion phylogenetic/rules/export.smk
@@ -9,7 +9,7 @@ tree and at least one node data JSON.
rule export:
    input:
        tree="results/{virus}/tree.nwk",
        metadata=lambda wildcards: config[wildcards.virus]["metadata"],
        metadata="data/{virus}/metadata.tsv",
        branch_lengths="results/{virus}/branch_lengths.json",
        nt_muts="results/{virus}/nt_muts.json",
        aa_muts="results/{virus}/aa_muts.json",
32 changes: 30 additions & 2 deletions phylogenetic/rules/prepare_sequences.smk
@@ -11,10 +11,38 @@ and will produce an aligned FASTA file of subsampled sequences as an output.
"""


rule download:
    output:
        sequences="data/{virus}/sequences.fasta.zst",
        metadata="data/{virus}/metadata.tsv.zst",
    params:
        sequences_url="https://data.nextstrain.org/files/workflows/seasonal-cov/{virus}/sequences.fasta.zst",
        metadata_url="https://data.nextstrain.org/files/workflows/seasonal-cov/{virus}/metadata.tsv.zst",
    shell:
        """
        curl -fsSL --compressed {params.sequences_url:q} --output {output.sequences}
        curl -fsSL --compressed {params.metadata_url:q} --output {output.metadata}
        """


rule decompress:
    input:
        sequences="data/{virus}/sequences.fasta.zst",
        metadata="data/{virus}/metadata.tsv.zst",
    output:
        sequences="data/{virus}/sequences.fasta",
        metadata="data/{virus}/metadata.tsv",
    shell:
        """
        zstd -d -c {input.sequences} > {output.sequences}
        zstd -d -c {input.metadata} > {output.metadata}
        """


rule filter:
    input:
        sequences=lambda wildcards: config[wildcards.virus]["prepare_sequences"]["sequences"],
        metadata=lambda wildcards: config[wildcards.virus]["metadata"],
        sequences="data/{virus}/sequences.fasta",
        metadata="data/{virus}/metadata.tsv",
        exclude="defaults/{virus}/dropped_strains.txt",
    output:
        sequences="results/{virus}/filtered.fasta",
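For orientation, the download/decompress pair above is equivalent to running the following for a single virus (a sketch using the public URLs from the rule; run from the phylogenetic directory):

mkdir -p data/229e
curl -fsSL --compressed \
  https://data.nextstrain.org/files/workflows/seasonal-cov/229e/sequences.fasta.zst \
  --output data/229e/sequences.fasta.zst
zstd -d -c data/229e/sequences.fasta.zst > data/229e/sequences.fasta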