Hi,
I am interested in detecting genes under selection and I recently found out about the alignment error masking and the multi-hit options.
My goal is to identify potential genes under selection in a couple of sister species. I would appreciate if you could help with comments and advice on how to conduct the analysis.
First, I identified single-copy orthogroups with Orthofinder from a set of 13 genomes, then my analysis steps are:
- Retrieve the transcripts from a database, and align them with MACSE.
- Filter the alignment with Clipkit using the codon model in default mode, and then trim further using a restrictive gap threshold ( -g 0.2).
- Remove the stop codons with hyphy cln
- Lastly, run HYPHYMPI absrel with --srv Yes
My question is where to do the alignment filtering (BUSTED-E) , and whether the multiple-hit should be done before or after alignment filtering because BUSTED-E can also be done with the --multiple-hits option.
For instance, I ran the multi-hit on some aligments, and I got for instance:
### Summary of rate estimates and significance testing
| Model | Log-L | omega | 2-hit rate | p-value | 3-hit rate | p-value |
|:------------------------------------:|:----------:|:----------:|:----------:|:----------------------------------:|:----------:|:----------------------------------:|
| Standard MG94 |-10977.33...| 0.2316 | N/A | N/A | N/A | N/A |
| Standard MG94 + 2 hits |-10952.29...| 0.1801 | 0.4034 | 0.0000 (2-hit rate = 0) | N/A | N/A |
| Standard MG94 + 2 or 3 hits |-10923.71...| 0.1723 | 0.1899 | 0.0000 (2&3-hit rates = 0) | 1.8212 | 0.0000 (3-hit rate(s) = 0) |
### 5 individual sites which showed sufficiently strong preference for multiple-hit models
| Site | ER (2 vs 1) | ER (3 vs 2) | Substitutions |
|:--------:|:-----------------------:|:-----------------------:|:----------------------------------------------------------:|
| 695 | 63.5193 | 227.6457 | TCA->AGC(1)CTC(1) |
| 698 | 124.6144 | 156.6477 | AGC->CAT(1)TCT(1) |
| 1096 | 4.6150 | 24.3391 | TCG->AGC(1)CAA(1)TCT(1) |
| 1097 | 5.3539 | 9.7650 | CAA->GCC(1)TTG(1) |
| 1106 | 3.1503 | 7.5585 | TCT->AAG(1)CCC(1) |
and the alignments stats are:
>Loaded a multiple sequence alignment with **13** sequences, **1112** codons, and **1** partitions
According to the fmm analysis, the model with the highest logL is the one with either 2 or 3 hits, but out of 1112 codons, only 5 sites have preference for multiple hits. Is this enough evidence (enough sites) to use BUSTED-E or absrel with the 3-hits-multiple-hits model?
For instance, another alignment:
>Loaded a multiple sequence alignment with **13** sequences, **268** codons, and **1** partitions
### Summary of rate estimates and significance testing
| Model | Log-L | omega | 2-hit rate | p-value | 3-hit rate | p-value |
|:------------------------------------:|:----------:|:----------:|:----------:|:----------------------------------:|:----------:|:----------------------------------:|
| Standard MG94 | -3175.17 | 0.4972 | N/A | N/A | N/A | N/A |
| Standard MG94 + 2 hits | -3089.86 | 0.3681 | 1.1937 | 0.0000 (2-hit rate = 0) | N/A | N/A |
| Standard MG94 + 2 or 3 hits | -3051.41 | 0.3651 | 0.7279 | 0.0000 (2&3-hit rates = 0) | 4.2756 | 0.0000 (3-hit rate(s) = 0) |
### 19 individual sites which showed sufficiently strong preference for multiple-hit models
| Site | ER (2 vs 1) | ER (3 vs 2) | Substitutions |
|:--------:|:-----------------------:|:-----------------------:|:----------------------------------------------------------:|
| 4 | 15.0814 | 6.8236 | CAA->AGG(1)CTG(1) |
| 18 | 20.5916 | 0.3047 | ACT->AGG(1)GCA(1), GCT->ACT(1) |
| 21 | 72.7447 | 7.7902 | ACA->CGA(1)GAG(1) |
| 105 | 53.3908 | 0.4668 | CCG->CCC(1)GGG(1), GGG->CCG(1) |
| 214 | 10.0952 | 0.5638 | CCC->CCT(1), CCT->CAG(1) |
| 215 | 23.3547 | 10.7724 | CTC->AGC(1), TCT->CCT(1)CTC(1)TCG(1) |
| 216 | 45.6818 | 1.7204 | CCA->TGG(1), CTG->CCA(1) |
| 217 | 44.1538 | 7.6918 | GCA->TGC(1), GTG->GCA(1)GCG(1) |
| 221 | 9.5131 | 2.3019 | GCA->CGG(1)GCG(1) |
| 223 | 10.7141 | 16.6627 | AGT->GGT(1)GTA(1) |
| 248 | 13.0184 | 13.1496 | GCC->AGA(1) |
| 259 | 8.8395 | 0.5507 | GAA->CAC(1) |
| 263 | 9.7945 | 4.6021 | GCC->CAC(1)GCG(1), GCG->GTG(1), GTG->AGC(1) |
| 268 | 26.8636 | 20.4134 | CTC->ACT(1) |
| 20 | 4.6980 | 13.0971 | GAG->AAG(1)AGA(1)GAA(1) |
| 182 | 5.2119 | 8.2645 | TGC->CAG(1) |
| 251 | 6.3157 | 13.5233 | ACC->ACA(1)AGC(1)GAG(1) |
| 258 | 4.2727 | 8.3152 | CCG->CCA(1)GAA(1), CCT->CCC(1)CCG(1) |
| 267 | 4.9757 | 7.8712 | CTC->AGG(1)CAC(1)CTT(1) |
The results were more convincing about using multi-hits, 19 sites out of 268 codons.
My concern is if multiple-hits sites are found, but then BUSTED-E marks them as erroneous, making the usage of multiple-hit model unnecessary or wrong.
I'd appreciate if you could help me decide whether I should use BUSTED-E + multi-hit detection.
Also, I have Hyphy installed in a server, and I am trying to run GARD to detect recombination and I get an error.
I first ran:
for f in `cat test_ortho.tsv`; do echo -e "\n$f"; hyphy gard --type nucleotide --alignment "${f}.CDS.NT.trimmed.g0.2.STOP.fa" --model JTT --mode Normal --rv None --rate-classes 4 --output ${f}.gard.json ; echo -e "Done $f\n" ; done
and I got the error:
### Fitting the baseline (single-partition; no breakpoints) model
* Log(L) = -8770.42, AIC-c = 17603.40 (31 estimated parameters)
### Performing an exhaustive single breakpoint analysis
[GARD] Breakpoint 6 of 576. Best cAIC = 17603.3967 with no breakpoints.Error:
Internal error
Internal error in ComputeBranchCache (branch gard.tree.part_0.Musnig, eval #3993 ) reversible model cached likelihood = -89.57536821397055, directly computed likelihood = -89.57536805683237. This is most likely because a non-reversible model was incorrectly auto-detected (or specified by the model file in environment variables; for smaller errors, this could be due to numerical instability of calculations for larger alignments). To treat numerical errors as warnings, please specify "ENV=TOLERATE_NUMERICAL_ERRORS=1;" as the command line argument. This often resolves the issue, which is indicative of numerical instability.
Function call stack
1 : [namespace = LYWdsGte] Optimize(mles, ^lf_id);
Keyword arguments:
{
"model":"JTT",
"rate-classes":"4"
}
-------
2 : [namespace = ZDRcMlPV] res=estimators.FitExistingLF(&likelihoodFunction,modelObjects);
Keyword arguments:
{
"model":"JTT",
"rate-classes":"4"
}
-------
3 : [namespace = fE_Rixoy] fit=gard.fitPartitionedModel(breakPoints,model,initialValues,null,FALSE);
Keyword arguments:
{
"model":"JTT",
"rate-classes":"4"
}
-------
then, I tried
for f in `cat test_ortho.tsv`; do echo -e "\n$f"; ENV="TOLERATE_NUMERICAL_ERRORS=1;" hyphy gard --type nucleotide --alignment "${f}.CDS.NT.trimmed.g0.2.STOP.fa" --model JTT --mode Normal --rv None --rate-classes 4 --output ${f}.gard.json ; echo -e "Done $f\n" ; done
and I still got the error.
For example, I ran the multiple-hits with this script without any problem:
for f in `cat test_ortho.tsv`; do echo -e "\n$f"; hyphy fmm --alignment ${f}.CDS.NT.trimmed.g0.2.STOP.fa --tree tree.nwk --srv Yes ; echo -e "Done $f\n" ; done
I would be grateful of any suggestion on how or where to add the code to correct for numerical instability.
Thanks
Hi,
I am interested in detecting genes under selection and I recently found out about the alignment error masking and the multi-hit options.
My goal is to identify potential genes under selection in a couple of sister species. I would appreciate if you could help with comments and advice on how to conduct the analysis.
First, I identified single-copy orthogroups with Orthofinder from a set of 13 genomes, then my analysis steps are:
My question is where to do the alignment filtering (BUSTED-E) , and whether the multiple-hit should be done before or after alignment filtering because BUSTED-E can also be done with the --multiple-hits option.
For instance, I ran the multi-hit on some aligments, and I got for instance:
and the alignments stats are:
According to the fmm analysis, the model with the highest logL is the one with either 2 or 3 hits, but out of 1112 codons, only 5 sites have preference for multiple hits. Is this enough evidence (enough sites) to use BUSTED-E or absrel with the 3-hits-multiple-hits model?
For instance, another alignment:
The results were more convincing about using multi-hits, 19 sites out of 268 codons.
My concern is if multiple-hits sites are found, but then BUSTED-E marks them as erroneous, making the usage of multiple-hit model unnecessary or wrong.
I'd appreciate if you could help me decide whether I should use BUSTED-E + multi-hit detection.
Also, I have Hyphy installed in a server, and I am trying to run GARD to detect recombination and I get an error.
I first ran:
and I got the error:
then, I tried
and I still got the error.
For example, I ran the multiple-hits with this script without any problem:
I would be grateful of any suggestion on how or where to add the code to correct for numerical instability.
Thanks