## About

This is the main pipeline that is used internally for loading the data into the
RNAcentral database.
[More information](http://www.ebi.ac.uk/seqdb/confluence/display/RNAC/RNAcentral+data+import+pipeline).
The pipeline is [nextflow](https://www.nextflow.io) based and the main entry
point is `main.nf`.

The pipeline is typically run as:

```sh
nextflow run -profile env -with-singularity pipeline.sif main.nf
```

The pipeline is meant to run inside a Docker or Singularity container; see the
sections below for container and configuration details.

## Configuring the pipeline

The pipeline requires a `local.config` file to exist and contain some
information. Notably, a `PGDATABASE` environment variable must be defined so
data can be imported or fetched. In addition, to import specific databases
there must be a `params.import_data.databases` dict defined. The keys must be
known database names and the values should be truthy to indicate that those
databases should be imported.

There are some more advanced configuration options available, such as turning
specific parts of the pipeline, like genome mapping or QA, on or off.
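
As a sketch, a minimal `local.config` might look like the following. This is a
hypothetical example: the connection string and the database names shown are
illustrative values, not taken from the repository.

```
// local.config -- hypothetical minimal example
env.PGDATABASE = 'postgresql://user:password@localhost:5432/rnacentral'

params.import_data.databases = [
  'ensembl': true,  // truthy value: import this database
  'rfam': true,
]
```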
## Using with Docker

The pipeline is meant to run in Docker or Singularity. You should build or
fetch a suitable container. Some example commands are below.

* build container

```
docker build -t rnacentral-import-pipeline .
```

* run container

```
docker run -v `pwd`:/rnacentral/rnacentral-import-pipeline -v /path/to/data:/rnacentral/data/ -it rnacentral-import-pipeline bash
```
## Testing

Several tests require fetching some data files prior to testing. The files can
be fetched with:

```sh
./scripts/fetch-test-data.sh
```

The tests can then be run using [py.test](http://pytest.org). For example,
running Ensembl importing tests can be done with:

```sh
py.test tests/databases/ensembl/
```

## Other environment variables

The pipeline requires the `NXF_OPTS` environment variable to be set to
`-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000`; a module for doing this is
in `modules/cluster`. Also, some configuration settings for efficient usage on
EBI's LSF cluster are in `config/cluster.config`.
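
If you are not using the `modules/cluster` module, the same setting can be
exported manually in your shell; the value below is exactly the one stated
above:

```sh
# JVM options required for nextflow's thread pool in this pipeline
export NXF_OPTS='-Dnxf.pool.type=sync -Dnxf.pool.maxThreads=10000'
```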
## License
See [LICENSE](https://github.com/RNAcentral/rnacentral-import-pipeline/blob/master/LICENSE) for more information.