## About

This is the main pipeline that is used internally for loading the data into the
RNAcentral database.
[More information](http://www.ebi.ac.uk/seqdb/confluence/display/RNAC/RNAcentral+data+import+pipeline).
The pipeline is [nextflow](https://www.nextflow.io) based and the main entry
point is `main.nf`.

The pipeline is typically run as:

```sh
nextflow run -profile env -with-singularity pipeline.sif main.nf
```

The pipeline is meant to run inside a Docker or Singularity container; see the
sections below for container and configuration details.

## Configuring the pipeline

The pipeline requires a `local.config` file to exist and contain some
information. Notably, a `PGDATABASE` environment variable must be defined so
data can be imported or fetched. In addition, to import specific databases
there must be a `params.import_data.databases` dict defined. The keys must be
known database names and the values should be truthy to indicate that those
databases should be imported.

There are some more advanced configuration options available, such as turning
specific parts of the pipeline, like genome mapping or QA, on or off.
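
As a sketch, a minimal `local.config` might look like the following. This is a
hypothetical example: the connection string and the database names shown are
illustrative values, not taken from the repository.

```
// local.config -- hypothetical minimal example
env.PGDATABASE = 'postgresql://user:password@localhost:5432/rnacentral'

params.import_data.databases = [
  'ensembl': true,  // truthy value: import this database
  'rfam': true,
]
```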
## Using with Docker

The pipeline is meant to run in Docker or Singularity. You should build or
fetch a suitable container. Some example commands are below.

* build container

```
docker build -t rnacentral-import-pipeline .
```

* run container

```
docker run -v `pwd`:/rnacentral/rnacentral-import-pipeline -v /path/to/data:/rnacentral/data/ -it rnacentral-import-pipeline bash
```
## Testing

Several tests require fetching some data files prior to testing. The files can
be fetched with:

```sh
./scripts/fetch-test-data.sh
```

The tests can then be run using [py.test](http://pytest.org). For example,
running Ensembl importing tests can be done with:

```sh
py.test tests/databases/ensembl/
```

## Other environment variables

The pipeline requires the `NXF_OPTS` environment variable to be set to
`-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000`; a module for doing this is
in `modules/cluster`. Also, some configuration settings for efficient usage on
EBI's LSF cluster are in `config/cluster.config`.
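
If you are not using the `modules/cluster` module, the same setting can be
exported manually in your shell; the value below is exactly the one stated
above:

```sh
# JVM options required for nextflow's thread pool in this pipeline
export NXF_OPTS='-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000'
```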
## License
See [LICENSE](https://github.com/RNAcentral/rnacentral-import-pipeline/blob/master/LICENSE) for more information.