## About

This is the main pipeline that is used internally for loading the data into the
RNAcentral database. [More
information](http://www.ebi.ac.uk/seqdb/confluence/display/RNAC/RNAcentral+data+import+pipeline).
The pipeline is [nextflow](https://www.nextflow.io) based and the main entry
point is `main.nf`.

The pipeline is typically run as:

```sh
nextflow run -profile env -with-singularity pipeline.sif main.nf
```

## Configuring the pipeline

The pipeline requires a `local.config` file to exist and contain some
information. Notably, a `PGDATABASE` environment variable must be defined so
data can be imported or fetched. In addition, to import specific databases
there must be a `params.import_data.databases` dict defined. The keys must be
known database names and the values should be truthy to indicate that those
databases should be imported.

There are some more advanced configuration options available, such as turning
specific parts of the pipeline, like genome mapping or QA, on or off.

## Using with Docker

The pipeline is meant to run in Docker or Singularity. You should build or
fetch a suitable container. Some example commands are below.

* build container
```
docker build -t rnacentral-import-pipeline .
```

* run container
```
docker run -v `pwd`:/rnacentral/rnacentral-import-pipeline -v /path/to/data:/rnacentral/data/ -it rnacentral-import-pipeline bash
```
22- * example luigi command
23-
24- ```
25- python -m luigi --module tasks.release LoadRelease --local-scheduler
26- ```
27-
28- ## Running luigi tasks
29-
30- Several databases are imported using the
31- [ luigi] ( https://github.com/spotify/luigi ) pipeline. The code for the pipeline
32- is stored in ` luigi ` directory. The rfam search task are stored in the ` tasks `
33- subdirectory. These can be run with:
34-
35- ``` sh
36- export PYTHONPATH=$PYTHONPATH :luigi
37- python -m luigi --module tasks < task>
38- ```
39-
40- Sadly, luigi doesn't seem to provide a nice way to inspect the available tasks,
41- so the easiest way to see what is available is to read
42- ` luigi/tasks/__init__.py ` .
43-
44- Some individual examples are:
45-
46- ``` sh
47- python -m luigi --module tasks RfamCSV
48- python -m luigi --module tasks RfamSearches
49- ```
50-
51- For details on each individual part read the documentation for the task you are
52- interested in.
53-
54- There are also several other database, like NONCODE and Greengenes, that aren't
55- yet moved into the tasks directory. These can be found under the ` luigi/ `
56- directory. Running these is similar, some examples are:
57-
58- ``` sh
59- python -m luigi --module json_batch_processor Noncode [options]
60- python -m luigi --module ensembl.species SpeciesImporter [options]
61- ```
62-
63- The pipeline requires the: ` luigi.cfg ` file be filled out, an example file,
64- with comments is in ` luigi.cfg.txt ` . In addition there is documentation about
65- the configuration in ` luigi/tasks/config.py ` .
66-
## Testing

Several tests require fetching some data files prior to testing. The files can
be fetched with:

```sh
./scripts/fetch-test-data.sh
```

The tests can then be run using [py.test](http://pytest.org). For example,
running Ensembl importing tests can be done with:

```sh
py.test tests/databases/ensembl/
```
## Other environment variables

The pipeline requires the `NXF_OPTS` environment variable to be set to
`-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000`; a module for doing this is
in `modules/cluster`. Some configuration settings for efficient usage on
EBI's LSF cluster are in `config/cluster.config`.

## License

See [LICENSE](https://github.com/RNAcentral/rnacentral-import-pipeline/blob/master/LICENSE) for more information.