Commit eb7b75e

Merge pull request #338 from nickjcroucher/update_rec
v3.2.1
2 parents a9d5dcd + 5f80ed7 commit eb7b75e

56 files changed, +3184 -2464 lines changed

Diff for: CHANGELOG.md

+12 -1

@@ -1,8 +1,19 @@
 # Change Log

-## [v3.1.6](https://github.com/sanger-pathogens/gubbins/tree/v3.1.6) (2021-1-20)
+## [v3.2.1](https://github.com/sanger-pathogens/gubbins/tree/v3.2.1) (2022-5-24)
+[Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.1.6...v3.2.1)
+
+- Fix problem with sequence reconstruction
+- Improve detection of small recombinations by modifying window sizes
+- Enable resumption of stalled analyses
+- Clean C code
+- Fixes to scripts
+- Add CI tests and update expected results
+
+## [v3.1.6](https://github.com/sanger-pathogens/gubbins/tree/v3.1.6) (2022-1-20)
 [Full Changelog](https://github.com/sanger-pathogens/gubbins/compare/v3.1.5...v3.1.6)

+
 - Fix problem with sequence reconstruction
 - Add test for consistency of reconstructions

Diff for: README.md

+1 -1

@@ -105,7 +105,7 @@ chmod +x configure
 make
 sudo make install
 cd python
-python setup.py install
+python3 -m pip install .
 ```

 ### OSX/Linux/Windows - Virtual Machine

Diff for: VERSION

+1 -1

@@ -1 +1 @@
-3.1.6
+3.2.1

Diff for: docs/gubbins_manual.md

+16 -7

@@ -170,7 +170,7 @@ Gubbins was originally designed to use a [joint ancestral state reconstruction](

 ### Recombination detection options

-Recombination is detected using a [spatial scanning statistic](https://link.springer.com/chapter/10.1007/978-1-4612-1578-3_14), which relies on a sliding window. The size of this window may need to be reduced if you apply Gubbins to very small genomes (e.g. viruses).
+Recombination is detected using a [spatial scanning statistic](https://link.springer.com/chapter/10.1007/978-1-4612-1578-3_14), which relies on a sliding window. The size of this window may need to be reduced if you apply Gubbins to very small genomes (e.g. viruses). To increase the sensitivity for detecting recombinations, `--min-snps` can be set at the minimum value of 2; the `--p-value` threshold required to detect recombinations can be increased; the `--trimming-ratio` can be raised above 1.0, to disfavour the trimming of recombination edges; and the `--extensive-search` mode can be used.

 ```
 --min-snps MIN_SNPS, -m MIN_SNPS
@@ -179,19 +179,26 @@ Recombination is detected using a [spatial scanning statistic](https://link.spri
 Minimum window size (default: 100)
 --max-window-size MAX_WINDOW_SIZE, -b MAX_WINDOW_SIZE
 Maximum window size (default: 10000)
+--p-value P_VALUE Uncorrected p value used to identify recombinations (default: 0.05)
+--trimming-ratio TRIMMING_RATIO
+Ratio of log probabilities used to trim recombinations (default: 1.0)
+--extensive-search Undertake slower, more thorough, search for recombination (default: False)
 ```

-### Algorithm stop options
+### Algorithm stop and restart options

-Given the scale of available dataset sizes, and the size of tree space, it is unlikely that any Gubbins analysis will ever converge based on identifying identical trees in subsequent iterations. Note that trees from previous iterations are used as starting trees for inference in subsequent iterations with IQTree and RAxML (although not RAxML-NG). In practice, there is little improvement to the tree after three iterations.
+Given the scale of available dataset sizes, and the size of tree space, it is unlikely that any Gubbins analysis will ever converge based on identifying identical trees in subsequent iterations. Normally the algorithm will stop after reaching the maximum number of iterations. Should the run fail or stall before this point, the analysis can be restarted from the last iteration that successfully completed by providing a tree through the `--resume` flag (all other flags should be kept identical to the original command, including `--iterations`). Note that although only the tree is provided to `--resume`, the corresponding alignment generated at the end of the same iteration also needs to be available within the same directory.

 ```
 --iterations ITERATIONS, -i ITERATIONS
 Maximum No. of iterations (default: 5)
 --converge-method {weighted_robinson_foulds,robinson_foulds,recombination}, -z {weighted_robinson_foulds,robinson_foulds,recombination}
 Criteria to use to know when to halt iterations (default: weighted_robinson_foulds)
+--resume RESUME Intermediate tree from previous run (must include "iteration_X" in file name) (default: None)
 ```

+Note that trees from previous iterations are used as starting trees for inference in subsequent iterations with IQTree and RAxML (although not RAxML-NG).
+
 ## Output files

 A successful Gubbins run will generate files with the suffixes:
@@ -221,13 +228,15 @@ The `.per_branch_statistics.csv` file contains summary statistics for each branc

 * **Node** - Name of the node subtended by the branch. This can either be one of the taxa included in the input alignment, or an internal node, which are numbered
 * **Total SNPs** - Total number of base substitutions reconstructed onto the branch
-* **Num of SNPs inside recombinations** - Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (*r*)
-* **Num of SNPs outside recombinations** - Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (*m*)
-* **Num of Recombination Blocks** - Total number of recombination blocks reconstructed onto the branch
-* **Bases in recombinations** - Total length of all recombination events reconstructed onto the branch
+* **Number of SNPs Inside Recombinations** - Number of base substitutions reconstructed onto the branch that fall within a predicted recombination (*r*)
+* **Number of SNPs Outside Recombinations** - Number of base substitutions reconstructed onto the branch that fall outside of a predicted recombination. i.e. predicted to have arisen by point mutation (*m*)
+* **Number of Recombination Blocks** - Total number of recombination blocks reconstructed onto the branch
+* **Bases in Recombinations** - Total length of all recombination events reconstructed onto the branch
+* **Cumulative Bases in Recombinations** - Total number of bases in the alignment affected by recombination on this branch and its ancestors
 * ***r/m*** - The r/m value for the branch. This value gives a measure of the relative impact of recombination and mutation on the variation accumulated on the branch
 * ***rho/theta*** - The ratio of the number of recombination events to point mutations on a branch; a measure of the relative rates of recombination and point mutation
 * **Genome Length** - The total number of aligned bases between the ancestral and descendent nodes for the branch excluding any missing data or gaps in either
+* **Bases in Clonal Frame** - The number of called bases at the descendant node that have not been affected by recombination on this branch or an ancestor (i.e., the length of sequence that can be used for phylogenetic interpretation)

 Note that all positions in the output files are relative to the input alignment. If you wish to compare the positions of recombinations relative to a reference annotation, their coordinates will need to be adjusted to account for any gaps in the reference sequence introduced when generating the alignment.

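Reviewer note: the manual hunk above defines the two ratio columns in terms of the count columns, so the relationship can be sketched in a few lines. The function name `branch_ratios` is hypothetical, and returning `None` for a branch with no point mutations is an assumption made for this example; it is not taken from the Gubbins source.

```python
from typing import Optional, Tuple


def branch_ratios(snps_inside: int, snps_outside: int,
                  num_blocks: int) -> Tuple[Optional[float], Optional[float]]:
    """Return (r/m, rho/theta) for one branch.

    r/m       = SNPs inside recombinations / SNPs outside recombinations
    rho/theta = recombination blocks       / SNPs outside recombinations
    """
    if snps_outside == 0:
        # Assumption for this sketch: the ratios are undefined without point mutations.
        return None, None
    return snps_inside / snps_outside, num_blocks / snps_outside


# Illustrative counts, not real Gubbins output.
print(branch_ratios(30, 10, 2))  # (3.0, 0.2)
```

A high r/m with a low rho/theta would indicate a few recombination events that each imported many substitutions.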
Diff for: python/gubbins/common.py

+37 -16

@@ -108,11 +108,28 @@ def parse_and_run(input_args, program_description=""):
     gaps_vcf_filename = base_filename + ".gaps.vcf"
     joint_sequences_filename = base_filename + ".seq.joint.aln"

+    # If restarting from a previous run
+    starting_iteration = 1
+    if input_args.resume is not None:
+        search_itr = re.search(r'iteration_(\d+)', input_args.resume)
+        if search_itr is None:
+            sys.stderr.write('Resuming a Gubbins run requires a tree file name containing the phrase "iteration_X"\n')
+            exit(1)
+        else:
+            starting_iteration = int(search_itr.group(1)) + 1
+            if starting_iteration >= input_args.iterations:
+                sys.stderr.write('Run has already reached the number of specified iterations\n')
+                exit(1)
+            else:
+                sys.stderr.write('Resuming Gubbins analysis at iteration ' + str(starting_iteration) + '\n')
+        input_args.starting_tree = input_args.resume
+        current_tree_name = input_args.starting_tree
+
     # Check if intermediate files from a previous run exist
     intermediate_files = [basename + ".iteration_"]
-    if not input_args.no_cleanup:
+    if not input_args.no_cleanup and input_args.resume is None:
         utils.delete_files(".", intermediate_files, "", input_args.verbose)
-    if utils.do_files_exist(".", intermediate_files, "", input_args.verbose):
+    if utils.do_files_exist(".", intermediate_files, "", input_args.verbose) and input_args.resume is None:
         sys.exit("Intermediate files from a previous run exist. Please rerun without the --no_cleanup option "
                  "to automatically delete them or with the --use_time_stamp to add a unique prefix.")

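Reviewer note: the resume check in the hunk above can be exercised in isolation. The sketch below mirrors its logic in a standalone helper. The name `next_iteration`, the use of exceptions instead of `exit(1)`, and the example file names are assumptions for illustration only. It keeps the `>=` comparison from the hunk, which means a tree from iteration 4 is rejected when `--iterations` is 5, even though iteration 5 has not yet run; whether that is intended is not stated in the commit.

```python
import re


def next_iteration(resume_tree: str, max_iterations: int) -> int:
    """Return the iteration at which a resumed run would start."""
    match = re.search(r'iteration_(\d+)', resume_tree)
    if match is None:
        raise ValueError('tree file name must contain "iteration_X"')
    start = int(match.group(1)) + 1
    # Same comparison as the hunk above: >= rather than >.
    if start >= max_iterations:
        raise ValueError('run has already reached the number of specified iterations')
    return start


# Hypothetical file name; only the "iteration_2" part matters.
print(next_iteration("out.iteration_2.tre", 5))  # 3
```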
@@ -176,11 +193,11 @@ def parse_and_run(input_args, program_description=""):
     reconvert_fasta_file(gaps_alignment_filename, base_filename + ".start")
     # Start the main loop
     printer.print("\nEntering the main loop.")
-    for i in range(1, input_args.iterations+1):
+    for i in range(starting_iteration, input_args.iterations+1):
         printer.print("\n*** Iteration " + str(i) + " ***")

         # 1.1. Construct the tree-building command depending on the iteration and employed options
-        if i == 2:
+        if i == 2 or input_args.resume is not None:
             # Select the algorithms used for the subsequent iterations
             current_tree_builder, current_model_fitter, current_model, extra_tree_arguments, extra_model_arguments = return_algorithm_choices(input_args,i)
             # Initialise tree builder
@@ -247,7 +264,7 @@ def parse_and_run(input_args, program_description=""):
             # 3.2a. Joint ancestral reconstruction
             printer.print(["\nReconstructing ancestral sequences with pyjar..."])

-            if i == 1:
+            if i == starting_iteration:

                 # 3.3a. Read alignment and identify unique base patterns in first iteration only

@@ -281,6 +298,7 @@ def parse_and_run(input_args, program_description=""):
                     info_filename = info_filename, # file containing evolutionary model parameters
                     info_filetype = input_args.model_fitter, # model fitter - format of file containing evolutionary model parameters
                     output_prefix = temp_working_dir + "/" + ancestral_sequence_basename, # output prefix
+                    outgroup_name = input_args.outgroup, # outgroup for rooting and reconstruction
                     threads = input_args.threads, # number of cores to use
                     verbose = input_args.verbose,
                     max_pos = max_pos)
@@ -354,7 +372,8 @@ def parse_and_run(input_args, program_description=""):
             shutil.copyfile(current_tree_name_with_internal_nodes, current_tree_name)
         gubbins_command = create_gubbins_command(
             gubbins_exec, gaps_alignment_filename, gaps_vcf_filename, current_tree_name,
-            input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size)
+            input_args.alignment_filename, input_args.min_snps, input_args.min_window_size, input_args.max_window_size,
+            input_args.p_value, input_args.trimming_ratio, input_args.extensive_search)
         printer.print(["\nRunning Gubbins to detect recombinations...", gubbins_command])
         try:
             subprocess.check_call(gubbins_command, shell=True)
@@ -617,13 +636,16 @@ def return_algorithm(algorithm_choice, model, input_args, node_labels = None, ex
     return initialised_algorithm

 def create_gubbins_command(gubbins_exec, alignment_filename, vcf_filename, current_tree_name,
-                           original_alignment_filename, min_snps, min_window_size, max_window_size):
+                           original_alignment_filename, min_snps, min_window_size, max_window_size,
+                           p_value, trimming_ratio, extensive_search):
     command = [gubbins_exec, "-r", "-v", vcf_filename, "-a", str(min_window_size),
                "-b", str(max_window_size), "-f", original_alignment_filename, "-t", current_tree_name,
-               "-m", str(min_snps), alignment_filename]
+               "-m", str(min_snps), "-p", str(p_value), "-i", str(trimming_ratio)]
+    if extensive_search:
+        command.append("-x")
+    command.append(alignment_filename)
     return " ".join(command)

-
 def number_of_sequences_in_alignment(filename):
     return len(get_sequence_names_from_alignment(filename))

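Reviewer note: the revised `create_gubbins_command` appends `-x` only when the extensive search is requested and always places the alignment last. The sketch below reproduces that ordering in a standalone function so it can be checked without running Gubbins. The name `build_gubbins_command` and all file names are hypothetical. Quoting each part with `shlex.quote` is an addition made here, not part of the commit; it is one possible way to keep paths containing spaces intact, given that the command string is later run with `shell=True`.

```python
import shlex


def build_gubbins_command(gubbins_exec, alignment, vcf, tree, original_alignment,
                          min_snps, min_window, max_window,
                          p_value, trimming_ratio, extensive_search):
    """Assemble the command string in the same flag order as the hunk above."""
    command = [gubbins_exec, "-r", "-v", vcf, "-a", str(min_window),
               "-b", str(max_window), "-f", original_alignment, "-t", tree,
               "-m", str(min_snps), "-p", str(p_value), "-i", str(trimming_ratio)]
    if extensive_search:
        command.append("-x")       # optional flag goes before the positional argument
    command.append(alignment)      # alignment stays the final argument
    # Assumption: quote every part (not done in the commit itself).
    return " ".join(shlex.quote(part) for part in command)


# Placeholder executable and file names.
print(build_gubbins_command("gubbins", "aln.snp_sites.aln", "aln.gaps.vcf", "tree.tre",
                            "aln.fasta", 3, 100, 10000, 0.05, 1.0, True))
```

With `extensive_search=False` the `-x` flag is simply absent and the remaining arguments keep the same order.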
@@ -734,14 +756,13 @@ def reroot_tree(tree_name, outgroups):

 def reroot_tree_with_outgroup(tree_name, outgroups):
     clade_outgroups = get_monophyletic_outgroup(tree_name, outgroups)
-    outgroups = [{'name': taxon_name} for taxon_name in clade_outgroups]
-
-    tree = Phylo.read(tree_name, 'newick')
-    tree.root_with_outgroup(*outgroups)
-    Phylo.write(tree, tree_name, 'newick')
-
     tree = dendropy.Tree.get_from_path(tree_name, 'newick', preserve_underscores=True)
-    tree.deroot()
+    outgroup_mrca = tree.mrca(taxon_labels=clade_outgroups)
+    print('Edge length is: ' + str(outgroup_mrca.edge.length))
+    tree.reroot_at_edge(outgroup_mrca.edge,
+                        length1 = outgroup_mrca.edge.length/2,
+                        length2 = outgroup_mrca.edge.length/2,
+                        update_bipartitions=False)
     tree.update_bipartitions()
     output_tree_string = tree_as_string(tree, suppress_internal=False)
     with open(tree_name, 'w+') as output_file:
