Commit e6402b1

Merge pull request #150 from marbl/make-partition-default
Make partition default
2 parents 925dabb + a46ad91 commit e6402b1

2 files changed, with 51 additions and 50 deletions.

Diff for: README.md (+39 -39)

@@ -1,14 +1,52 @@
 Parsnp is a command-line-tool for efficient microbial core genome alignment and SNP detection. Parsnp was designed to work in tandem with Gingr, a flexible platform for visualizing genome alignments and phylogenetic trees; both Parsnp and Gingr form part of the Harvest suite :


+
 # Installation
 ## From conda
 Parsnp is available on the [Bioconda](https://bioconda.github.io/user/install.html#set-up-channels) channel. This is the recommended method of installation. Once you have [added the Bioconda channel](https://bioconda.github.io/user/install.html#set-up-channels) to your conda environment, `parsnp` can be installed via
 ```
 conda install parsnp
 ```

-## From source
+Instructions for building Parsnp from source are available towards the end of this README.
+
+# Running Parsnp
+Parsnp can be run multiple ways, but the most common is with a set of genomes and a reference.
+```
+parsnp -g <reference_genbank> -d <genomes>
+```
+```
+parsnp -r <reference_fasta> -d <genomes>
+```
+For example,
+```
+parsnp -r examples/mers_virus/ref/England1.fna -d examples/mers_virus/genomes/*.fna -o examples-out
+```
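
For scripted runs, the same invocation can be assembled from Python. The following is a minimal sketch, not part of Parsnp: the helper name `build_parsnp_command` and its parameters are hypothetical, and it uses only flags that appear in this README and diff (`-r`, `-d`, `-o`, `--no-partition`, `--min-partition-size`). It builds the argument list and does not run anything.

```python
from typing import List, Optional, Sequence


def build_parsnp_command(
    reference_fasta: str,
    genomes: Sequence[str],
    output_dir: str,
    min_partition_size: Optional[int] = None,
    no_partition: bool = False,
) -> List[str]:
    """Assemble a parsnp argument list (hypothetical helper, illustration only)."""
    if not genomes:
        raise ValueError("at least one query genome is required")

    # Mirrors: parsnp -r <reference_fasta> -d <genomes> -o <output_dir>
    cmd = ["parsnp", "-r", reference_fasta, "-d", *genomes, "-o", output_dir]

    if no_partition:
        # Run all query genomes in a single alignment.
        cmd.append("--no-partition")
    elif min_partition_size is not None:
        cmd += ["--min-partition-size", str(min_partition_size)]
    return cmd


if __name__ == "__main__":
    example = build_parsnp_command(
        "examples/mers_virus/ref/England1.fna",
        ["a.fna", "b.fna"],  # placeholder file names
        "examples-out",
    )
    print(" ".join(example))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`, assuming `parsnp` is on the `PATH`.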
+
+## Partition mode
+Parsnp 2 will group query genomes up into random partitions of at least `--min-partition-size` genomes each (50 by default). Parsnp is then run independently on each group, and the resulting alignment of each group is merged into a single alignment of all input genomes. This limits the input size for an individual "core" Parsnp step, leading to significantly less memory and CPU usage. We've also shown, on simulated and empirical data, that this partitioning step often leads to increased core-genome size and better phylogenetic signal.
+
+The `--no-partition` flag allows users to run all query genomes at once.
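
The grouping arithmetic can be sketched in Python. This is a minimal sketch that mirrors the size calculation visible in this commit's change to the `parsnp` script; the function name `plan_partitions` and its signature are hypothetical, and any shuffling of the input is left out.

```python
import math
from typing import List, Sequence


def plan_partitions(
    genomes: Sequence[str],
    min_partition_size: int = 50,
    no_partition: bool = False,
) -> List[List[str]]:
    """Split genomes into chunks (hypothetical helper, illustration only)."""
    n = len(genomes)

    # With --no-partition, or with fewer than two minimum-sized groups'
    # worth of genomes, everything is run at once.
    if no_partition or n < 2 * min_partition_size:
        return [list(genomes)]

    # How many groups of the minimum size fit, and how large each group
    # becomes when the genomes are spread across that many groups.
    full_partitions = n // min_partition_size
    effective_size = n // full_partitions

    n_chunks = math.ceil(n / effective_size)
    return [
        list(genomes[i * effective_size:(i + 1) * effective_size])
        for i in range(n_chunks)
    ]


if __name__ == "__main__":
    names = [f"genome_{i}.fna" for i in range(120)]  # placeholder names
    print([len(chunk) for chunk in plan_partitions(names)])
```

For 120 genomes and the default minimum of 50, this sketch yields two chunks of 60. Because the chunk size is rounded down, a short trailing chunk is possible under this arithmetic (for example, 101 genomes give chunks of 50, 50, and 1).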
+
+## Output files
+* `parsnp.xmfa` is the core-genome alignment.
+* `parsnp.ggr` is the compressed representation of the alignment generated by the harvest-toolkit. This file can be used to visualize alignments with Gingr.
+* `parsnp.snps.mblocks` is the core-SNP signature of each sequence in fasta format. This is the file which is used to generate `parsnp.tree`
+* `parsnp.tree` is the resulting phylogeny.
+* If run in partition mode, Parsnp will produce a `partition` folder in the output directory, which contains the output of each of the partitioned runs.
+
+
+### XMFA format
+The output XMFA file contains a header section mapping contig names to indices. Following the header section, the LCBs/clusters are reported in the XMFA format, where the ID for each record in an LCB is formatted as:
+
+```
+[fileidx]:[concat_start]-[concat_end] [strand] cluster[x] s[contig_idx]:p[contig_pos]
+```
+
+The `concat_start` and `concat_end` values are internal to parsnp. The sequence for this record can be found in the file at index `fileidx` (these are declared at the top of the xmfa) on the `contig_idx`th contig starting at position `contig_pos`.
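
A record ID in this layout can be split into its fields with a regular expression. The following is a minimal sketch under stated assumptions: every bracketed field except `strand` is a non-negative integer, `strand` is `+` or `-`, and the ID may be preceded by the `>` that starts XMFA record lines. The names `RecordId` and `parse_record_id` are hypothetical and not part of Parsnp.

```python
import re
from typing import NamedTuple, Optional


class RecordId(NamedTuple):
    fileidx: int
    concat_start: int
    concat_end: int
    strand: str
    cluster: int
    contig_idx: int
    contig_pos: int


# Assumed layout:
# [fileidx]:[concat_start]-[concat_end] [strand] cluster[x] s[contig_idx]:p[contig_pos]
_RECORD_ID = re.compile(
    r"^>?\s*(\d+):(\d+)-(\d+)\s+([+-])\s+cluster(\d+)\s+s(\d+):p(\d+)\s*$"
)


def parse_record_id(line: str) -> Optional[RecordId]:
    """Return the parsed fields, or None if the line does not match."""
    match = _RECORD_ID.match(line.strip())
    if match is None:
        return None
    fileidx, start, end, strand, cluster, contig_idx, contig_pos = match.groups()
    return RecordId(
        int(fileidx), int(start), int(end), strand,
        int(cluster), int(contig_idx), int(contig_pos),
    )


if __name__ == "__main__":
    # Made-up record ID, for illustration only.
    print(parse_record_id(">2:1001-1500 - cluster7 s3:p42"))
```

Returning `None` for non-matching lines lets a caller skip sequence and header lines while scanning a file.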
+
+## Building from source

 To build Parsnp from source, users must have automake 1.15, autoconf, and libtool installed. Parsnp also requires RaxML (or FastTree), Harvest-tools, and numpy. Some additional features require pySPOA, Mash, FastANI, and Phipack. All of these packages are available via Conda (many on the Bioconda channel).

@@ -44,44 +82,6 @@ Note that the `parsnp` executable in `bin/` is not the same as the one in the ro
 ## OSX Users (Catalina)
 Recent OSX have a Gatekeeper, that's designed to ensure that only software from known developers runs on your Mac. Please refer to this link to enable the binaries shipped with Parsnp to run: https://support.apple.com/en-us/HT202491

-# Running Parsnp
-Parsnp can be run multiple ways, but the most common is with a set of genomes and a reference.
-```
-parsnp -g <reference_genbank> -d <genomes>
-```
-```
-parsnp -r <reference_fasta> -d <genomes>
-```
-For example,
-```
-parsnp -r examples/mers_virus/ref/England1.fna -d examples/mers_virus/genomes/*.fna -o examples-out
-```
-
-## Partition mode
-Parsnp 2 includes a new mode which can be activated with `--partition`. This mode randomly splits the input genomes up into groups of *p* genomes each, where *p* defaults to 50 and can be changed with `--partition-size=p`. Parsnp is then run independently on each group, and the resulting alignment of each group is merged into a single alignment of all input genomes. This mode is intended for large datasets, as it reduces the computational requirements.
-
-```
-parsnp -r examples/mers_virus/ref/England1.fna -d examples/mers_virus/genomes/*.fna --partition --partition-size 10 -o examples-out-partitioned
-```
-
-More examples can be found in the [readthedocs tutorial](https://harvest.readthedocs.io/en/latest/content/parsnp/tutorial.html)
-
-## Output files
-* `parsnp.xmfa` is the core-genome alignment.
-* `parsnp.ggr` is the compressed representation of the alignment generated by the harvest-toolkit. This file can be used to visualize alignments with Gingr.
-* `parsnp.snps.mblocks` is the core-SNP signature of each sequence in fasta format. This is the file which is used to generate `parsnp.tree`
-* `parsnp.tree` is the resulting phylogeny.
-* If run in partition mode, Parsnp will produce a `partition` folder in the output directory, which contains the output of each of the partitioned runs.
-
-### XMFA format
-The output XMFA file contains a header section mapping contig names to indices. Following the header section, the LCBs/clusters are reported in the XMFA format, where the ID for each record in an LCB is formatted as:
-
-```
-[fileidx]:[concat_start]-[concat_end] [strand] cluster[x] s[contig_idx]:p[contig_pos]
-```
-
-The `concat_start` and `concat_end` values are internal to parsnp. The sequence for this record can be found in the file at index `fileidx` (these are declared at the top of the xmfa) on the `contig_idx`th contig starting at position `contig_pos`.
-
 ## Misc

 CITATION provides details on how to cite Parsnp.

Diff for: parsnp (+12 -11)

@@ -430,14 +430,14 @@ def parse_args():

     partition_args = parser.add_argument_group("Sequence Partitioning")
     partition_args.add_argument(
-        "--partition",
+        "--no-partition",
         action='store_true',
-        help="Evenly split input sequences across separate runs of parsnp, then merge results")
+        help="Run all query genomes in single parsnp alignment, no partitioning.")
     partition_args.add_argument(
-        "--partition-size",
+        "--min-partition-size",
         type=int,
         default=50,
-        help="Number of sequences in a partitioned run of parsnp")
+        help="Minimum size of a partition. Input genomes will be split evenly across partitions at least this large.")

     extend_args = parser.add_argument_group("LCB Extension")
     extend_args.add_argument(
@@ -1459,7 +1459,9 @@ SETTINGS:
     run_recomb_filter = 0

     #3)run parsnp (cores, grid?)
-    if not args.partition:
+    if args.no_partition or len(finalfiles) < 2*args.min_partition_size:
+        if len(finalfiles) < 2*args.min_partition_size:
+            logger.info(f"Too few genomes to run partitions of size >{args.min_partition_size}. Running all genomes at once.")
         # Editing the ini file to be used for parnsp-aligner (which is different from parsnip as a mumi finder)
         if not inifile_exists:
             write_inifile_2(inifiled, outputDir, unaligned, args, auto_ref, query, finalfiles, ref, args.threads)
@@ -1620,16 +1622,15 @@ SETTINGS:

     else:
         import partition
-
-        if len(finalfiles) % args.partition_size == 1:
-            logger.warning("Incrementing partition size by 1 to avoid having a remainder partition of size 1")
-            args.partition_size += 1
+        full_partitions = len(finalfiles) // args.min_partition_size
+        effective_partition_size = len(finalfiles) // full_partitions
+        logger.info(f"Setting the partition size to {effective_partition_size}")
         partition_output_dir = f"{outputDir}/partition"
         partition_list_dir = f"{partition_output_dir}/input-lists"
         os.makedirs(partition_list_dir, exist_ok=True)
-        for partition_idx in range(math.ceil(len(finalfiles) / args.partition_size)):
+        for partition_idx in range(math.ceil(len(finalfiles) / effective_partition_size)):
             with open(f"{partition_list_dir}/{partition.CHUNK_PREFIX}-{partition_idx:010}.txt", 'w') as part_out:
-                for qf in finalfiles[partition_idx*args.partition_size : (partition_idx+1)*args.partition_size]:
+                for qf in finalfiles[partition_idx*effective_partition_size : (partition_idx+1)*effective_partition_size]:
                     part_out.write(f"{qf}\n")

         chunk_label_parser = re.compile(f'{partition.CHUNK_PREFIX}-(.*).txt')
