Commit 2bf6d9d

Update the readme a bit
More up-to-date but not yet great.
1 parent c3b00da commit 2bf6d9d

1 file changed: readme.md (+37, -51 lines)

@@ -2,13 +2,37 @@
 
 ## About
 
-This is the main pipeline that is used internally for loading the data into the RNAcentral database.
-[More information](http://www.ebi.ac.uk/seqdb/confluence/display/RNAC/RNAcentral+data+import+pipeline)
+This is the main pipeline that is used internally for loading the data into the
+RNAcentral database. [More
+information](http://www.ebi.ac.uk/seqdb/confluence/display/RNAC/RNAcentral+data+import+pipeline).
+The pipeline is [nextflow](https://www.nextflow.io) based and the main entry
+point is `main.nf`.
 
-## Installation
+The pipeline is typically run as:
+
+```sh
+nextflow run -profile env -with-singularity pipeline.sif main.nf
+```
+
+The pipeline is meant to run in docker or singularity.
+
+## Configuring the pipeline
+
+The pipeline requires a `local.config` file to exist and contain some
+information. Notably, a `PGDATABASE` environment variable must be defined so
+data can be imported or fetched. In addition, to import specific databases
+there must be a `params.import_data.databases` dict defined. The keys must be
+known database names and the values should be truthy to indicate the databases
+should be imported.
+
+There are some more advanced configuration options available, such as turning on or off
+specific parts of the pipeline like genome mapping, QA, etc.
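
As a rough illustration of the configuration described above, a `local.config` could be written as follows. This is only a sketch under assumptions: the `PGDATABASE` value is a placeholder, the keys `ensembl` and `rfam` are hypothetical examples of known database names, and the use of Nextflow's `env` and `params` scopes is inferred from the wording of the readme rather than taken from the pipeline's own files.

```shell
# Write a minimal local.config in Nextflow configuration syntax.
# Assumptions: placeholder connection value, illustrative database keys.
cat > local.config <<'EOF'
env {
  // Placeholder: replace with the real database connection value.
  PGDATABASE = 'postgres://user:password@host:5432/dbname'
}

params {
  import_data {
    databases {
      // Truthy values mark a database for import (keys are illustrative).
      ensembl = true
      rfam = true
    }
  }
}
EOF
```

Databases that should not be imported can simply be left out of the `databases` block or set to `false`.
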
 
 ## Using with Docker
 
+The pipeline is meant to run in docker or singularity. You should build or
+fetch a suitable container. Some example commands are below.
+
 * build container
 ```
 docker build -t rnacentral-import-pipeline .
@@ -19,55 +43,10 @@ This is the main pipeline that is used internally for loading the data into the
 docker run -v `pwd`:/rnacentral/rnacentral-import-pipeline -v /path/to/data:/rnacentral/data/ -it rnacentral-import-pipeline bash
 ```
 
-* example luigi command
-
-```
-python -m luigi --module tasks.release LoadRelease --local-scheduler
-```
-
-## Running luigi tasks
-
-Several databases are imported using the
-[luigi](https://github.com/spotify/luigi) pipeline. The code for the pipeline
-is stored in `luigi` directory. The rfam search task are stored in the `tasks`
-subdirectory. These can be run with:
-
-```sh
-export PYTHONPATH=$PYTHONPATH:luigi
-python -m luigi --module tasks <task>
-```
-
-Sadly, luigi doesn't seem to provide a nice way to inspect the available tasks,
-so the easiest way to see what is available is to read
-`luigi/tasks/__init__.py`.
-
-Some individual examples are:
-
-```sh
-python -m luigi --module tasks RfamCSV
-python -m luigi --module tasks RfamSearches
-```
-
-For details on each individual part read the documentation for the task you are
-interested in.
-
-There are also several other database, like NONCODE and Greengenes, that aren't
-yet moved into the tasks directory. These can be found under the `luigi/`
-directory. Running these is similar, some examples are:
-
-```sh
-python -m luigi --module json_batch_processor Noncode [options]
-python -m luigi --module ensembl.species SpeciesImporter [options]
-```
-
-The pipeline requires the: `luigi.cfg` file be filled out, an example file,
-with comments is in `luigi.cfg.txt`. In addition there is documentation about
-the configuration in `luigi/tasks/config.py`.
-
 ## Testing
 
-Running tests for ensembl import requires downloading data from Ensembl first.
-This can be done with:
+Several tests require fetching some data files prior to testing. The files can
+be fetched with:
 
 ```sh
 ./scripts/fetch-test-data.sh
@@ -77,9 +56,16 @@ The tests can then be run using [py.test](http://pytest.org). For example,
 running Ensembl importing tests can be done with:
 
 ```sh
-py.test luigi/tests/ensembl_test.py
+py.test tests/databases/ensembl/
 ```
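
The fetch and test steps above can be chained so that the tests never run against missing data. The following is a minimal sketch, assuming it is run from the repository root; the function name `run_ensembl_tests` is hypothetical and is not part of the repository.

```shell
# Hypothetical helper: fetch the test data, then run the Ensembl tests.
run_ensembl_tests() {
  # The fetch script path is relative, so require the repository root.
  if [ ! -x ./scripts/fetch-test-data.sh ]; then
    echo "scripts/fetch-test-data.sh not found; run from the repository root" >&2
    return 1
  fi
  # Only run the tests if fetching the data succeeded.
  ./scripts/fetch-test-data.sh && py.test tests/databases/ensembl/
}
```

After defining the function in a shell session, calling `run_ensembl_tests` performs both steps and returns a non-zero status if either one fails.
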
 
+## Other environment variables
+
+The pipeline requires the `NXF_OPTS` environment variable to be set to
+`-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000`; a module for doing this is
+in `modules/cluster`. Also, some configuration settings for efficient usage on
+EBI's LSF cluster are in `config/cluster.config`.
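
Where the cluster module is not available, the variable can also be set by hand. The snippet below is a minimal sketch that exports `NXF_OPTS` with the value quoted above; whether `modules/cluster` does exactly this is an assumption.

```shell
# Value taken verbatim from the readme text above.
export NXF_OPTS='-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000'

# Then start the pipeline as usual, for example:
# nextflow run -profile env -with-singularity pipeline.sif main.nf
```

The single quotes keep the two `-D` options together as one value, so Nextflow receives both of them.
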
+
 ## License
 
 See [LICENSE](https://github.com/RNAcentral/rnacentral-import-pipeline/blob/master/LICENSE) for more information.
