Update master #151

Merged
merged 264 commits into from
Oct 28, 2022
264 commits
b512f75
Add prerequisite workflow to download r2dt data
afg1 May 11, 2022
70d10b0
Update with working crontab
afg1 May 11, 2022
d130293
Uncomment workflows in run script, ready to test
afg1 May 11, 2022
a63730d
Consolidate changes in import-data
afg1 May 11, 2022
dd95149
Fix mis-named config file
afg1 May 11, 2022
2beb6c0
Add some more slack notifications in environment preparation workflow
afg1 May 11, 2022
7281aa6
Use right queue for copying r2dt data
afg1 May 18, 2022
4f19c32
Use polars for groupby and sort in xref precompute
afg1 May 18, 2022
7b77b7c
Simplify xref query
afg1 May 18, 2022
43e22d6
Add slack messaging for pipeline status
afg1 May 18, 2022
b3acf81
Need to specify no container for datamover job
afg1 May 18, 2022
36e074a
Fix inputs on one line not working
afg1 May 18, 2022
49bf3ad
Modify pdbe import to allow overriding with rfam hits
afg1 May 19, 2022
65f7737
Tweak memory and cpu requirements in pdbe xref groupby
afg1 May 19, 2022
de3f586
Update script to send job_id only
carlosribas May 28, 2022
1e28d94
Update the pipeline to submit unique ids
carlosribas May 28, 2022
357480b
Add script to filter unique ids
carlosribas May 28, 2022
27a00de
Merge branch 'dev' of https://github.com/RNAcentral/rnacentral-import…
carlosribas May 28, 2022
cdb67b3
Make a single query per database
carlosribas May 30, 2022
84199c8
Update script
carlosribas May 30, 2022
48e2c73
Remove list of dbs
carlosribas May 31, 2022
54d4249
Remove ids that contain only numbers and dash
carlosribas May 31, 2022
b3a3e4f
Ignore backup files
carlosribas May 31, 2022
3286c0b
Bug fix
carlosribas May 31, 2022
e0190a9
First commit of working parser for expression atlas
afg1 May 31, 2022
e562902
Track submit folder with lfs
carlosribas May 31, 2022
787cb33
Add submitted IDs
carlosribas May 31, 2022
2b3db60
Add rust code to main pipeline repo
afg1 May 31, 2022
b34a3b3
Add expression atlas to known databases enumeration
afg1 Jun 1, 2022
1bf51e8
Add docstring for expression atlas parser and tidy up
afg1 Jun 1, 2022
780bb3f
Add Expression Atlas parser to makefile targets
afg1 Jun 1, 2022
6e19583
Add expression atlas nextflow workflow
afg1 Jun 1, 2022
829f8da
Add pipeline that helps in creating the litscan metadata for the RNAc…
carlosribas Jun 7, 2022
9ea468a
Add a query to select weekly updates from xref
afg1 Jun 7, 2022
a538c09
Remove duplicate ids
carlosribas Jun 7, 2022
6cfe103
Run sort/uniq ignoring case
carlosribas Jun 7, 2022
d2303ea
First step in creating a pipeline to extract manually annotated refer…
carlosribas Jun 7, 2022
d4d9300
Avoid lines with no id
carlosribas Jun 8, 2022
24fe24e
Add the necessary filtering operations to mimic those done by Express…
afg1 Jun 13, 2022
6d780bd
Don't remove spaces in column headings
afg1 Jun 14, 2022
d4c2fa2
Fix missing close paren
afg1 Jun 14, 2022
9e0922e
Changes to allow for multiple baseline measures (e.g. E-GEOD-38430) a…
afg1 Jun 14, 2022
972516a
Remove some verbosity from parser
afg1 Jun 14, 2022
064be13
Workaround stack overflow for very wide entries like E-MTAB-2770
afg1 Jun 14, 2022
1141057
Previous split didn't quite work - combine the expression fold with a…
afg1 Jun 14, 2022
1c1d972
Properly propagate nulls through re-parsing of non-numeric columns
afg1 Jun 15, 2022
9e1321e
Update nextflow script for EA to grab the right files
afg1 Jun 15, 2022
4f60fca
Ensure unique gene IDs
afg1 Jun 16, 2022
9c6f9bc
Updated and refactored expression atlas parser
afg1 Jun 30, 2022
caac0b6
Start working on plnc import
afg1 Jun 30, 2022
d7d76a8
Working parser for PLncDB, with some caveats
afg1 Jul 1, 2022
5d1acdf
Interact with the PLncDB FTP web app to get the download urls, then d…
afg1 Jul 1, 2022
8d950d4
Change Expression Atlas parser output to jsonlines
afg1 Jul 4, 2022
2cccb57
Try adding automated build for pipeline docker
afg1 Jul 5, 2022
cfc6fb7
Disable docker build notifications for now
afg1 Jul 5, 2022
9863128
Fix indentation and simplify
afg1 Jul 5, 2022
e6da348
Fix typo in docker command
afg1 Jul 5, 2022
7600ff1
Forgot to send context to docker build
afg1 Jul 5, 2022
438c5a9
Generate descriptions using new phylogeny retrieval
afg1 Jul 5, 2022
07b8bf8
Add pyppeteer dependency to requirements
afg1 Jul 5, 2022
da51b95
Merge branch 'expressionatlas' into plncdb
afg1 Jul 6, 2022
28ccde5
Merge branch 'plncdb' into dev
afg1 Jul 6, 2022
22f5b4c
Add slack notification to github workflow
afg1 Jul 6, 2022
6661849
Bump python version to 3.8
afg1 Jul 7, 2022
5e068f4
Add PLncDB to databases
afg1 Jul 8, 2022
0bd9c31
Add nextflow code for PLncDB
afg1 Jul 8, 2022
d13a825
Install pyppeteer dependency in singularity image
afg1 Jul 12, 2022
f3a0cb1
Remove some debug screenshotting
afg1 Jul 12, 2022
956ad3e
Fix pyppeteer dependencies, change executable location to inside cont…
afg1 Jul 12, 2022
977d2fe
Fix number of species calculation in precompute and add a test for it
afg1 Jul 13, 2022
c6414e2
Split PLncDB url finding and downloading since pyppeteer doesn't work…
afg1 Jul 13, 2022
d7aec82
Add slack notification in PLncDB download step
afg1 Jul 14, 2022
68241d6
Allow use of prefetched data in PLncDB import via param flag
afg1 Jul 14, 2022
aa5591d
Harmonising changes specific to weekly run
afg1 Jul 14, 2022
2a86875
Harmonise weekly run config
afg1 Jul 14, 2022
9fc6b4e
Reduce number of requested cpus in build_urs_table
afg1 Jul 14, 2022
5c5b519
Scheduling optimisation - requesting less memory and a different queue
afg1 Jul 14, 2022
c562f2c
Fix missing feature flag for expression atlas parser
afg1 Jul 14, 2022
3bdaf73
Fix url typo and skip existing downloaded files
afg1 Jul 18, 2022
fa77052
Fix trailing whitespace in file suffixes, and fix species name with _…
afg1 Jul 19, 2022
23f46cf
Stop using flaky ftp for EPMC metadata download. Use datamover queue …
afg1 Jul 19, 2022
987ed42
Merge branches 'master' and 'dev' of github.com:RNAcentral/rnacentral…
blakesweeney Jul 19, 2022
1f534a3
Try auto-building the singularity image and pushing to ghcr
afg1 Jul 22, 2022
388c1e3
Slack is handled differently now
afg1 Jul 22, 2022
34ec7e8
Unifying dev with weekly run on codon
afg1 Jul 22, 2022
98719f5
Switch to use query method for precompute
afg1 Jul 22, 2022
6a3aaa9
Fix typo in workflow step name
afg1 Jul 22, 2022
096b546
Fix wrong container name
afg1 Jul 22, 2022
ab4a360
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
blakesweeney Jul 24, 2022
1c7b56c
Fix typo in singularity push code
afg1 Jul 25, 2022
685c830
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
afg1 Jul 25, 2022
8c45c90
Correct final notification message
afg1 Jul 26, 2022
8f80833
search_export_publication_counts table should not be TEMP
afg1 Jul 26, 2022
45fc36c
Change data_type argument for search-export group to publication-count
afg1 Jul 26, 2022
d361cee
Handle databases having zero entries in stats update
afg1 Jul 28, 2022
7a3d951
Remove pyppeteer dependency
afg1 Jul 28, 2022
926d4ed
PLncDB now parses each species in a separate process
afg1 Jul 28, 2022
05eec00
Create the metadata for a given database
carlosribas Aug 3, 2022
9ccfe90
Create XML files with metadata
carlosribas Aug 3, 2022
d403036
Remove duplicate metadata
carlosribas Aug 3, 2022
421eadf
Uppercase sequence before running through generic parser
afg1 Aug 8, 2022
c2d5ea2
Include final search export changes from codon
afg1 Aug 8, 2022
f03cefa
Stop notifying for every single metadata step
afg1 Aug 8, 2022
266fb6d
Include PLncDB changes from codon
afg1 Aug 8, 2022
4a1d9b8
Change filename to match the actual data
afg1 Aug 8, 2022
55adf58
Fix parser output to use multiple channels
afg1 Aug 9, 2022
a2fd696
LncBook parser sync with codon
afg1 Aug 9, 2022
7ba4404
Fix syntax and enable nextflow DSL2
afg1 Aug 9, 2022
d2a92af
try using the ghcr singularity image
afg1 Aug 9, 2022
8c5ab76
Update SGD remote
afg1 Jul 28, 2022
13ebe0f
Working selection of weekly import targets, based on md5 of import file
afg1 Jul 28, 2022
ab7061f
Add cli command to update tracker table.
afg1 Jul 29, 2022
f1aba52
Add prototype workflow for tracker update
afg1 Aug 8, 2022
b5ac134
Changes for weekly update to use tracking table
afg1 Aug 9, 2022
f605fcb
Hopefully working second stage plnc parser
afg1 Aug 9, 2022
0f8646b
Fix singularity container url to use oras endpoint
afg1 Aug 9, 2022
ac600b4
Split url finder away from cli and parser
afg1 Aug 9, 2022
103dbf8
Split url finder away from cli and parser
afg1 Aug 9, 2022
e06b4dd
Merge branch 'plncdb' of github.com:RNAcentral/rnacentral-import-pipe…
afg1 Aug 9, 2022
83ab43a
Remove pyppeteer dependency completely
afg1 Aug 9, 2022
1d2b8a1
Tidy up cli
afg1 Aug 9, 2022
a38b801
Try to optimise for memory in plncdb parser
afg1 Aug 10, 2022
01379b4
Add dynamic memory request and retry with increasing limit
afg1 Aug 10, 2022
3b5052a
SeqIO index doesn't like pathlib.Path objects
afg1 Aug 10, 2022
5156006
Turn off PLncDB parsing notifications
afg1 Aug 10, 2022
98bb329
Use SeqIO to extract sequence from chromosome
afg1 Aug 10, 2022
77d98eb
Tweak fetch to use datamover queue
afg1 Aug 10, 2022
ab77e6a
Rebase plnc and dev
afg1 Aug 12, 2022
19a6071
Remove pyppeteer dependency completely
afg1 Aug 9, 2022
5fa0742
Try to optimise for memory in plncdb parser
afg1 Aug 10, 2022
567842d
Add dynamic memory request and retry with increasing limit
afg1 Aug 10, 2022
4dd408b
SeqIO index doesn't like pathlib.Path objects
afg1 Aug 10, 2022
a45476c
Turn off PLncDB parsing notifications
afg1 Aug 10, 2022
77aab79
Use SeqIO to extract sequence from chromosome
afg1 Aug 10, 2022
4148278
Merge branch 'plncdb' of github.com:RNAcentral/rnacentral-import-pipe…
afg1 Aug 12, 2022
88bff69
Fix missing query execution in release database stats update
afg1 Aug 12, 2022
093d699
Add branch for not running PLncDB
afg1 Aug 15, 2022
619e289
Changes needed to get CRW import running
afg1 Aug 15, 2022
6ae5241
Tweak fetch to use datamover queue
afg1 Aug 10, 2022
aeb699f
Merge branch 'mala-gene-reimport' of github.com:RNAcentral/rnacentral…
afg1 Aug 15, 2022
85ddfd1
Make the weekly update workflow put the new md5s into the tracker table
afg1 Aug 17, 2022
0d08d28
Fix for unexpected md5 on codon
afg1 Aug 17, 2022
820f1b9
Drop md5 from tracker table before updating
afg1 Aug 17, 2022
a9f4d67
Explicitly uppercase sequence and replace U with T
afg1 Aug 17, 2022
c83529f
Merge branch 'mala-gene-reimport' into dev
afg1 Aug 17, 2022
ebe0fe6
Start prep to run search export on codon
afg1 Aug 19, 2022
b8b8ece
ZFIN remote isn't compressed now
afg1 Aug 19, 2022
813380f
Fix params options need to go at the end of the line now for nextflow
afg1 Aug 19, 2022
cdac8ee
Clear out previous database selection
afg1 Aug 19, 2022
aadf44f
Pombase's HTTPS certificate is broken, fall back to ftp
afg1 Aug 19, 2022
b96ad86
Add import report function to slack notifier
afg1 Aug 19, 2022
1fb03a4
Enable search export and final report in weekly run
afg1 Aug 19, 2022
1ebfabb
Changes to get import working, mostly database memory fixes
afg1 Aug 17, 2022
835cd23
Fix missing comma syntax error
afg1 Aug 17, 2022
3c95470
Modify weekly precompute query to run on all databases
afg1 Aug 18, 2022
5378e76
Add plncdb in the description selection ordering
afg1 Aug 18, 2022
6318c8e
Write species to accession file as well.
afg1 Aug 22, 2022
f520491
Preserve species name in species info
afg1 Aug 22, 2022
93457ea
Only take one value for species name, not the whole column
afg1 Aug 22, 2022
53c6391
Update ENA remote & queues for codon
afg1 Aug 23, 2022
128f748
Bump pre-commit black version to 22.6.0 for compatibility with click
afg1 Aug 23, 2022
98cb445
Fix ENA container options to bind nfs directory properly
afg1 Aug 23, 2022
43d093b
lncipedia remote is being flaky. Use wget instead for retries
afg1 Aug 23, 2022
2ff1adc
Fixes for lncbook remote handling
afg1 Aug 23, 2022
a99c622
Fix handling of ZWD remote
afg1 Aug 23, 2022
365f663
Using mv instead of cp on mac stops the codesigning thing killing pro…
afg1 Aug 23, 2022
da6786b
Update expression atlas rust parser, run cargo fmt
afg1 Aug 23, 2022
7d6beef
Catch exception when no ncRNA is found, continue with empty files
afg1 Aug 24, 2022
af61dce
Force overwrite using mv -f in Makefile
afg1 Aug 24, 2022
733a076
Add sgRNA as an INSDC -> SO mapping
afg1 Aug 24, 2022
d452775
Add circRNA as an INSDC -> SO mapping
afg1 Aug 24, 2022
aa256cb
Revert skipping empty output from ensembl
afg1 Aug 25, 2022
4b8a4eb
Add isort pre-commit hook
blakesweeney Aug 25, 2022
a35564b
Update .editorconfig
blakesweeney Aug 25, 2022
2cde1ee
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
afg1 Aug 25, 2022
9ab56ec
Allow pdb to get molecule name from rfam matches, if present
afg1 Aug 25, 2022
83fec77
SILVA FTP doesn't work, using http + wget workaround
afg1 Aug 25, 2022
0d32916
Add rfam_id to list of retrieved data
afg1 Aug 26, 2022
1636e6d
Add exception for missing type info
afg1 Aug 26, 2022
c5faae6
Handle new exception properly for pdbe
afg1 Aug 26, 2022
7a08728
Add LSU as a description key that gets type as rRNA
afg1 Aug 26, 2022
d040d1a
Update and add some rust checks
blakesweeney Aug 26, 2022
8a942fd
Handle empty parse from ensembl by warning and continuing
afg1 Aug 26, 2022
3a9a576
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
afg1 Aug 26, 2022
733ffdb
Install procps
blakesweeney Aug 30, 2022
ceef7cf
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
blakesweeney Aug 30, 2022
dccf2d1
Add poetry pre-commit actions
blakesweeney Aug 31, 2022
b2e464c
Handle ENA parsing problems and send warning to slack
afg1 Sep 1, 2022
18d0ac2
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
afg1 Sep 1, 2022
e655052
Add poetry files
blakesweeney Sep 1, 2022
2d0fce9
Merge branch 'dev' of github.com:RNAcentral/rnacentral-import-pipelin…
blakesweeney Sep 1, 2022
de171a5
Fix ribotyper path error in slack warning
afg1 Sep 1, 2022
7480735
Modify slack notification to use slack API method
afg1 Sep 6, 2022
232f23c
Add skipping for broken ensembl entries
afg1 Sep 7, 2022
4d21e7a
Workaround for ensembl permissions trouble
afg1 Sep 7, 2022
0d7df5f
Shorten slack warnings for empty parses
afg1 Sep 8, 2022
d1dbdba
Catch utf-8 errors in ensembl parser
afg1 Sep 8, 2022
d451289
Allow multiple genomes per taxid
blakesweeney Sep 15, 2022
b949a28
Changes to make quickgo work with codon
afg1 Sep 16, 2022
732e047
Fix for conditional execution of quickgo workflow
afg1 Sep 16, 2022
f3d4728
Fix missing parens in conditional
afg1 Sep 16, 2022
97fa840
Move decompression to fetch
afg1 Sep 16, 2022
554bc5c
Disable rust pre-commit for now. Accept poetry lock changes
afg1 Sep 20, 2022
5c4113f
Finishing touches on expression atlas stage 1 parser
afg1 Sep 20, 2022
fa3ce34
Add groupby urs_taxid to get right gene -> transcript linkage
afg1 Sep 20, 2022
5d45e1b
Working python side of the expression atlas parsers
afg1 Sep 20, 2022
98fb59c
Add lookup table fetch and modify expression atlas workflow
afg1 Sep 20, 2022
3b27e3d
Fix pdbe parser tests
blakesweeney Sep 22, 2022
31a784b
Correct test expectation
blakesweeney Sep 22, 2022
573b779
Try to fix PDBe import
blakesweeney Sep 22, 2022
4bd55c7
Include expression atlas into parsing, add sql query and conditional …
afg1 Sep 22, 2022
2602b25
Disable poetry-lock
blakesweeney Oct 11, 2022
47fa4e7
Add some type annotations
blakesweeney Oct 11, 2022
0a69456
Ensure all required pdbs/chains are parsed
blakesweeney Oct 11, 2022
ff952d7
Bump cpat base container version to the oldest supported
afg1 Oct 19, 2022
5e067a6
Add publication count in search export rust code
afg1 Oct 19, 2022
49695b1
Rate limit job submission, and reduce precompute chunk size
afg1 Oct 19, 2022
0c17215
Fix expression atlas database ID
afg1 Oct 19, 2022
db2824f
Export gene synonyms from expression atlas
afg1 Oct 19, 2022
4103179
Write gene synonyms as comma separated list, or empty string if none …
afg1 Oct 19, 2022
9783469
Fix xref note splitting for rfam
afg1 Oct 19, 2022
22c38e2
Insert expression atlas into database ranking for precompute descript…
afg1 Oct 19, 2022
8ef0531
Change text mining sql to join on urs_taxid rather than urs
afg1 Oct 19, 2022
23c49a5
Add assembly id for expression atlas lookup dump
afg1 Oct 19, 2022
bf2e882
Update r2dt error strategy to just skip failing layouts. Also moved t…
afg1 Oct 19, 2022
729f22c
Queue optimisation - send this process to short
afg1 Oct 19, 2022
502a8df
Ignore failures to fetch ensembl species data
afg1 Oct 19, 2022
28c9872
Specify no container for atomic_publish
afg1 Oct 19, 2022
d67684a
Expression atlas nextflow tweaks
afg1 Oct 19, 2022
2115dc2
Decompress cms data without leading directory
afg1 Oct 19, 2022
649e6fd
Add clean target for make
afg1 Oct 19, 2022
b6abbd3
Add new RNA types to export schema
afg1 Oct 19, 2022
e68ecda
disable poetry-lock for now
afg1 Oct 20, 2022
4e3cd6e
Remove wget process for getting r2dt data
afg1 Oct 20, 2022
0f4e886
Rework expressionatlas tsv find command
afg1 Oct 20, 2022
5c35978
Split parsing process and parallelise to reduce memory usage
afg1 Oct 20, 2022
2f45925
Exclude circRNA from what we send to ensembl
afg1 Oct 20, 2022
bc36a9c
Add sgRNA to the disallowed types
afg1 Oct 21, 2022
598bad6
Remove grep and add filter operator
afg1 Oct 21, 2022
20cb9e4
Use baseName on path, should allow only specifying the experiment name
afg1 Oct 21, 2022
dbdb8a1
Merge pull request #139 from RNAcentral/r21-postrelease-patching
afg1 Oct 25, 2022
c7c9d83
Merge branch 'dev' into import-more-pdbe
afg1 Oct 25, 2022
3604737
Merge pull request #136 from RNAcentral/import-more-pdbe
afg1 Oct 25, 2022
b42a2f9
Do not use common names from rnc_accessions
blakesweeney Oct 25, 2022
c25ecf2
Add previous-release folder
carlosribas Oct 27, 2022
178f183
Submit new ids only
carlosribas Oct 27, 2022
586fa8c
Bump r2dt version in environment prep
afg1 Oct 26, 2022
3cc7255
Ignore singularity directory
afg1 Oct 28, 2022
0b43482
Revert precompute method for merge into master
afg1 Oct 28, 2022
657f542
Merge pull request #152 from RNAcentral/fix-common-names
afg1 Oct 28, 2022
3 changes: 3 additions & 0 deletions .editorconfig
@@ -34,3 +34,6 @@ indent_size = 2
 [*.yaml]
 indent_style = space
 indent_size = 2
+
+[*.nf]
+indent_size = 2
1 change: 1 addition & 0 deletions .gitattributes
@@ -0,0 +1 @@
+workflows/references/submit/*.txt filter=lfs diff=lfs merge=lfs -text
75 changes: 75 additions & 0 deletions .github/workflows/main.yaml
@@ -0,0 +1,75 @@
+# This workflow will build and push the import pipeline container.
+# the plan later will be to include unit tests as well
+
+
+name: Building Pipeline Containers
+
+on:
+  push:
+    branches:
+      'dev'
+jobs:
+
+  starting-notification:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Initial notification
+        uses: rtCamp/action-slack-notify@v2
+        env:
+          SLACK_MESSAGE: 'Creating new pipeline image in docker hub'
+          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
+          MSG_MINIMAL: true
+
+  create-docker-image:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: docker login
+        env:
+          DOCKER_USER: ${{ secrets.DOCKER_USER }}
+          DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}
+        run: docker login -u $DOCKER_USER -p $DOCKER_PASSWORD
+
+      - name: docker build
+        run: docker build -f Dockerfile -t rnacentral/rnacentral-import-pipeline .
+
+      - name: docker push
+        run: docker push rnacentral/rnacentral-import-pipeline
+
+  finished-notification:
+    needs:
+      - create-docker-image
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Finished notification
+        uses: rtCamp/action-slack-notify@v2
+        env:
+          SLACK_MESSAGE: 'New pipeline image pushed to docker hub'
+          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
+          MSG_MINIMAL: true
+
+  singularity-conversion:
+    needs:
+      - create-docker-image
+    uses: rnacentral/rnacentral-import-pipeline/.github/workflows/singularity.yaml@dev
+    secrets: inherit
+
+
+  finished-singularity:
+    needs:
+      - singularity-conversion
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+
+      - name: Finished notification
+        uses: rtCamp/action-slack-notify@v2
+        env:
+          SLACK_MESSAGE: 'New singularity image pushed to ghcr'
+          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
+          MSG_MINIMAL: true
25 changes: 25 additions & 0 deletions .github/workflows/singularity.yaml
@@ -0,0 +1,25 @@
+# This workflow runs the conversion to singularity and stores the result in the
+# ghcr so we can pull it easier
+
+name: Singularity Build
+on: workflow_call
+
+
+jobs:
+  run_conversion:
+    name: "Pull docker image and convert"
+    runs-on: ubuntu-latest
+
+    container:
+      image: quay.io/singularity/singularity:v3.8.1
+      options: --privileged
+
+    steps:
+      - name: "Pull image"
+        run: |
+          singularity pull --name rnacentral-rnacentral-import-pipeline-latest.sif docker://rnacentral/rnacentral-import-pipeline:latest
+
+      - name: "Push to ghcr"
+        run: |
+          echo ${{ secrets.GITHUB_TOKEN }} | singularity remote login -u ${{ secrets.GHCR_USERNAME }} --password-stdin oras://ghcr.io
+          singularity push rnacentral-rnacentral-import-pipeline-latest.sif oras://ghcr.io/${GITHUB_REPOSITORY}:latest
5 changes: 5 additions & 0 deletions .gitignore
@@ -101,3 +101,8 @@ stubs
 .envrc
 workflows/references/results
 workflows/references/metadata
+workflows/references/backup
+workflows/references/submit/previous-release
+workflows/references/manually_annotated/from*
+workflows/references/manually_annotated/results
+singularity/*
21 changes: 19 additions & 2 deletions .pre-commit-config.yaml
@@ -1,11 +1,28 @@
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev: v3.2.0
+    rev: v4.3.0
     hooks:
       - id: trailing-whitespace
       - id: end-of-file-fixer
       - id: check-yaml
   - repo: https://github.com/psf/black
-    rev: 19.3b0
+    rev: 22.6.0
     hooks:
       - id: black
+  - repo: https://github.com/pycqa/isort
+    rev: 5.10.1
+    hooks:
+      - id: isort
+        args: ["--profile", "black", "--filter-files"]
+        name: isort (python)
+  # - repo: https://github.com/doublify/pre-commit-rust
+  # rev: v1.0
+  # hooks:
+  # - id: fmt
+  # - id: cargo-check
+  # - id: clippy
+  - repo: https://github.com/python-poetry/poetry
+    rev: '1.2.0rc1'
+    hooks:
+      - id: poetry-check
+      # - id: poetry-lock
4 changes: 3 additions & 1 deletion Dockerfile
@@ -1,4 +1,4 @@
-FROM python:3.7-buster
+FROM python:3.8-buster

 ENV RNA /rna

@@ -46,6 +46,7 @@ RUN apt-get install -y \
     unzip \
     wget

+
 # Install Infernal
 RUN \
     cd $RNA/ && \
@@ -94,6 +95,7 @@ RUN pip3 install -r $RNACENTRAL_IMPORT_PIPELINE/requirements.txt

 RUN python3 -m textblob.download_corpora

+
 WORKDIR /

 COPY openssl/openssl.cnf /etc/ssl/
26 changes: 19 additions & 7 deletions Makefile
@@ -13,13 +13,25 @@ requirements-dev.txt: requirements-dev.in

 rust:
 	cargo build --release
-	cp target/release/json2fasta bin
-	cp target/release/split-ena bin
-	cp target/release/expand-urs bin
-	cp target/release/precompute bin
-	cp target/release/search-export bin
-	cp target/release/ftp-export bin
-	cp target/release/json2dfasta bin
+	mv -f target/release/json2fasta bin
+	mv -f target/release/split-ena bin
+	mv -f target/release/expand-urs bin
+	mv -f target/release/precompute bin
+	mv -f target/release/search-export bin
+	mv -f target/release/ftp-export bin
+	mv -f target/release/json2dfasta bin
+	mv -f target/release/expression-parse bin
+
+clean:
+	rm bin/json2fasta
+	rm bin/split-ena
+	rm bin/expand-urs
+	rm bin/precompute
+	rm bin/search-export
+	rm bin/ftp-export
+	rm bin/json2dfasta
+	rm bin/expression-parse
+	cargo clean

 docker: Dockerfile requirements.txt .dockerignore
 	docker build -t "$(docker)" .
16 changes: 16 additions & 0 deletions analyze.nf
@@ -7,13 +7,29 @@ include { genome_mapping } from './workflows/genome-mapping'
 include { r2dt } from './workflows/r2dt'
 include { rfam_scan } from './workflows/rfam-scan'

+include { slack_closure } from './workflows/utils/slack'
+include { slack_message } from './workflows/utils/slack'
+
 workflow analyze {
   take: ready
   emit: done
   main:
+    Channel.of("Starting analyze pipeline") | slack_message
     ready | (genome_mapping & rfam_scan & r2dt & cpat) | mix | collect | set { done }
 }

 workflow {
   analyze(Channel.of('ready'))
 }
+
+
+workflow.onComplete {
+  slack_closure("Analyze workflow completed")
+
+}
+
+workflow.onError {
+
+  slack_closure("Analyze workflow hit an error and crashed")
+
+}
106 changes: 62 additions & 44 deletions bin/check_ids.py
@@ -26,10 +26,11 @@
 words.update(ignore_ids)
 special_char = re.compile('[@!#$%^&()<>?/\[\]\'}{~:]')
 nts = re.compile('^[acgu]+$')
+numbers_and_dash = re.compile('^\d+[\-]\d+$')  # do not use ids like 6-1, 260-1, etc


 def check_id(item):
-    if item.isnumeric() or item.lower() in words:
+    if item.isnumeric() or item.lower() in words or numbers_and_dash.search(item):
         result = None
     elif len(item) > 2 and not special_char.search(item) and not nts.search(item.lower()) and "\\" not in item:
         result = item
@@ -47,55 +48,72 @@ def main(database, filename, output):
     """
     Check ids and create file that will be used by RNAcentral-references.
     """
-    remove_dot = ["ensembl_gene", "ensembl_gencode_gene", "ensembl_metazoa_gene"]
-    split_on_comma = ["flybase_gene_synonym", "pombase_gene_synonym", "refseq_gene_synonym", "hgnc_gene_synonym"]
+    remove_dot = ["ensembl", "ensembl_gencode", "ensembl_metazoa"]
+    split_on_comma = ["flybase", "hgnc", "pombase", "refseq"]
+    rfam_ignore = [
+        "30_255", "30_292", "5S_rRNA", "5_8S_rRNA", "6A", "6S", "7SK", "C4", "CRISPR-DR10", "CRISPR-DR11",
+        "CRISPR-DR12", "CRISPR-DR13", "CRISPR-DR14", "CRISPR-DR15", "CRISPR-DR16", "CRISPR-DR17", "CRISPR-DR18",
+        "CRISPR-DR19", "CRISPR-DR2", "CRISPR-DR20", "CRISPR-DR21", "CRISPR-DR22", "CRISPR-DR23", "CRISPR-DR24",
+        "CRISPR-DR25", "CRISPR-DR26", "CRISPR-DR27", "CRISPR-DR28", "CRISPR-DR29", "CRISPR-DR3", "CRISPR-DR30",
+        "CRISPR-DR31", "CRISPR-DR32", "CRISPR-DR33", "CRISPR-DR34", "CRISPR-DR35", "CRISPR-DR36", "CRISPR-DR37",
+        "CRISPR-DR38", "CRISPR-DR39", "CRISPR-DR4", "CRISPR-DR40", "CRISPR-DR41", "CRISPR-DR42", "CRISPR-DR43",
+        "CRISPR-DR44", "CRISPR-DR45", "CRISPR-DR46", "CRISPR-DR47", "CRISPR-DR48", "CRISPR-DR49", "CRISPR-DR5",
+        "CRISPR-DR50", "CRISPR-DR51", "CRISPR-DR52", "CRISPR-DR53", "CRISPR-DR54", "CRISPR-DR55", "CRISPR-DR56",
+        "CRISPR-DR57", "CRISPR-DR58", "CRISPR-DR6", "CRISPR-DR60", "CRISPR-DR61", "CRISPR-DR62", "CRISPR-DR63",
+        "CRISPR-DR64", "CRISPR-DR65", "CRISPR-DR66", "CRISPR-DR7", "CRISPR-DR8", "CRISPR-DR9", "F6", "Hairpin",
+        "Hairpin-meta1", "Hairpin-meta2", "Hatchet", "P1", "P10", "P11", "P13", "P14", "P15", "P17", "P18", "P2", "P24",
+        "P26", "P27", "P31", "P33", "P34", "P35", "P36", "P37", "P4", "P5", "P6", "P8", "P9", "ROSE", "S35", "S414",
+        "S774", "S808", "SAM", "SL1", "SL2", "U1", "U11", "U12", "U1_yeast", "U2", "U3", "U4", "U4atac", "U5", "U54",
+        "U6", "U6atac", "U7", "U8", "VA", "csRNA", "drum", "g2", "pRNA", "sar", "sul1", "t44", "tRNA", "tRNA-Sec",
+        "tmRNA", "tp2", "tracrRNA"
+    ]

     with open(filename, 'r') as input_file:
         with open(output, 'w') as output_file:
             while line := input_file.readline():
                 line = line.rstrip()
                 line = line.split('|')
-
-                if len(line) == 4:
-                    get_gene = line[0]
-                    get_primary_id = line[1]
-                    urs = line[2]
-                    taxid = line[3]
-
-                    # remove "."
-                    if database in remove_dot and "." in get_gene:
-                        get_gene = get_gene.split('.')[0]
-
-                    # split on ","
-                    gene_results = []
-                    if database in split_on_comma:
-                        gene_list = get_gene.split(',')
-                        for item in gene_list:
-                            item = check_id(item)
-                            if item:
-                                gene_results.append(item)
-
-                    if gene_results:
-                        primary_id = check_id(get_primary_id)
-                        for gene in gene_results:
-                            if gene and primary_id and gene != primary_id:
-                                output_file.write(gene + '|' + primary_id + '|' + urs + '_' + taxid + '\n')
-                    else:
-                        gene = check_id(get_gene)
-                        primary_id = check_id(get_primary_id)
-                        if gene and primary_id and gene != primary_id:
-                            output_file.write(gene + '|' + primary_id + '|' + urs + '_' + taxid + '\n')
-
-                else:
-                    get_primary_id = line[0]
-                    urs = line[1]
-                    taxid = line[2]
-
-                    # check if it is a valid id
-                    primary_id = check_id(get_primary_id)
-
-                    if primary_id:
-                        output_file.write(primary_id + '|' + urs + '_' + taxid + '\n')
+                urs = line[0]
+                taxid = line[1]
+                primary_id = check_id(line[2])
+                if primary_id and database in remove_dot and "." in primary_id:
+                    primary_id = primary_id.split('.')[0]
+
+                if primary_id and line[3:]:
+                    for item in line[3:]:
+                        if item:
+                            get_id = item
+                        else:
+                            continue
+
+                        # ignore some optional_id from Rfam
+                        if database == "rfam" and get_id in rfam_ignore:
+                            output_file.write('|' + primary_id + '|' + urs + '_' + taxid + '\n')
+                            continue
+
+                        # remove "."
+                        if database in remove_dot and "." in get_id:
+                            get_id = get_id.split('.')[0]
+
+                        # split on ","
+                        results = []
+                        if database in split_on_comma:
+                            list_of_ids = get_id.split(',')
+                            for elem in list_of_ids:
+                                elem = check_id(elem)
+                                if elem:
+                                    results.append(elem)
+
+                        if results:
+                            for db_id in results:
+                                if db_id != primary_id:
+                                    output_file.write(db_id + '|' + primary_id + '|' + urs + '_' + taxid + '\n')
+                        else:
+                            db_id = check_id(get_id)
+                            if db_id and db_id != primary_id:
+                                output_file.write(db_id + '|' + primary_id + '|' + urs + '_' + taxid + '\n')
+                elif primary_id:
+                    output_file.write(primary_id + '|' + urs + '_' + taxid + '\n')


 if __name__ == '__main__':
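For reviewers who want to see what the updated check_id filter in bin/check_ids.py accepts and rejects, here is a minimal standalone sketch. The three patterns are taken from the diff (written as raw strings here). The `words` set is a hypothetical stand-in for the word list the real script builds earlier in the file, and the final fall-through to None is an assumption, since that branch is collapsed in the diff view.

```python
import re

# Patterns taken from bin/check_ids.py in this PR, written as raw strings.
special_char = re.compile(r"[@!#$%^&()<>?/\[\]'}{~:]")
nts = re.compile(r"^[acgu]+$")
numbers_and_dash = re.compile(r"^\d+[\-]\d+$")  # ids like 6-1, 260-1

# Hypothetical stand-in: the real script fills `words` from files not shown here.
words = {"the", "and", "gene"}


def check_id(item):
    """Return the id unchanged if it looks usable, otherwise None."""
    if item.isnumeric() or item.lower() in words or numbers_and_dash.search(item):
        return None
    if (
        len(item) > 2
        and not special_char.search(item)
        and not nts.search(item.lower())
        and "\\" not in item
    ):
        return item
    # Assumption: the branch hidden in the collapsed diff also rejects the id.
    return None


if __name__ == "__main__":
    for candidate in ["6-1", "260-1", "12", "acgu", "ab", "mir-21", "MALAT1"]:
        print(candidate, "->", check_id(candidate))
```

With this sketch, "6-1" and "260-1" are rejected by the new numbers_and_dash pattern, while an id such as "mir-21" still passes because the pattern requires digits on both sides of the dash.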