Skip to content

Suggestion for selection analysis using multiple-hits + alignment errors and troubleshooting GARD #1977

@desmodus1984

Description

@desmodus1984

Hi,

I am interested in detecting genes under selection and I recently found out about the alignment error masking and the multi-hit options.
My goal is to identify potential genes under selection in a couple of sister species. I would appreciate if you could help with comments and advice on how to conduct the analysis.

First, I identified single-copy orthogroups with Orthofinder from a set of 13 genomes, then my analysis steps are:

  1. Retrieve the transcripts from a database, and align them with MACSE.
  2. Filter the alignment with Clipkit using the codon model in default mode, and then trim further using a restrictive gap threshold ( -g 0.2).
  3. Remove the stop codons with hyphy cln
  4. Lastly, run HYPHYMPI absrel with --srv Yes

My question is where to do the alignment filtering (BUSTED-E) , and whether the multiple-hit should be done before or after alignment filtering because BUSTED-E can also be done with the --multiple-hits option.

For instance, I ran the multi-hit on some aligments, and I got for instance:

### Summary of rate estimates and significance testing
|                Model                 |   Log-L    |   omega    | 2-hit rate |              p-value               | 3-hit rate |              p-value               |
|:------------------------------------:|:----------:|:----------:|:----------:|:----------------------------------:|:----------:|:----------------------------------:|
|            Standard MG94             |-10977.33...|    0.2316  |    N/A     |                N/A                 |    N/A     |                N/A                 |
|        Standard MG94 + 2 hits        |-10952.29...|    0.1801  |    0.4034  |       0.0000 (2-hit rate = 0)      |    N/A     |                N/A                 |
|     Standard MG94 + 2 or 3 hits      |-10923.71...|    0.1723  |    0.1899  |      0.0000 (2&3-hit rates = 0)    |    1.8212  |      0.0000 (3-hit rate(s) = 0)    |

### 5 individual sites which showed sufficiently strong preference for multiple-hit models

|   Site   |       ER (2 vs 1)       |       ER (3 vs 2)       |                       Substitutions                        |
|:--------:|:-----------------------:|:-----------------------:|:----------------------------------------------------------:|
|   695    |          63.5193        |         227.6457        |                     TCA->AGC(1)CTC(1)                      |
|   698    |         124.6144        |         156.6477        |                     AGC->CAT(1)TCT(1)                      |
|   1096   |           4.6150        |          24.3391        |                  TCG->AGC(1)CAA(1)TCT(1)                   |
|   1097   |           5.3539        |           9.7650        |                     CAA->GCC(1)TTG(1)                      |
|   1106   |           3.1503        |           7.5585        |                     TCT->AAG(1)CCC(1)                      |

and the alignments stats are:

>Loaded a multiple sequence alignment with **13** sequences, **1112** codons, and **1** partitions

According to the fmm analysis, the model with the highest logL is the one with either 2 or 3 hits, but out of 1112 codons, only 5 sites have preference for multiple hits. Is this enough evidence (enough sites) to use BUSTED-E or absrel with the 3-hits-multiple-hits model?

For instance, another alignment:

>Loaded a multiple sequence alignment with **13** sequences, **268** codons, and **1** partitions
### Summary of rate estimates and significance testing
|                Model                 |   Log-L    |   omega    | 2-hit rate |              p-value               | 3-hit rate |              p-value               |
|:------------------------------------:|:----------:|:----------:|:----------:|:----------------------------------:|:----------:|:----------------------------------:|
|            Standard MG94             |  -3175.17  |    0.4972  |    N/A     |                N/A                 |    N/A     |                N/A                 |
|        Standard MG94 + 2 hits        |  -3089.86  |    0.3681  |    1.1937  |       0.0000 (2-hit rate = 0)      |    N/A     |                N/A                 |
|     Standard MG94 + 2 or 3 hits      |  -3051.41  |    0.3651  |    0.7279  |      0.0000 (2&3-hit rates = 0)    |    4.2756  |      0.0000 (3-hit rate(s) = 0)    |

### 19 individual sites which showed sufficiently strong preference for multiple-hit models

|   Site   |       ER (2 vs 1)       |       ER (3 vs 2)       |                       Substitutions                        |
|:--------:|:-----------------------:|:-----------------------:|:----------------------------------------------------------:|
|    4     |          15.0814        |           6.8236        |                     CAA->AGG(1)CTG(1)                      |
|    18    |          20.5916        |           0.3047        |               ACT->AGG(1)GCA(1), GCT->ACT(1)               |
|    21    |          72.7447        |           7.7902        |                     ACA->CGA(1)GAG(1)                      |
|   105    |          53.3908        |           0.4668        |               CCG->CCC(1)GGG(1), GGG->CCG(1)               |
|   214    |          10.0952        |           0.5638        |                  CCC->CCT(1), CCT->CAG(1)                  |
|   215    |          23.3547        |          10.7724        |            CTC->AGC(1), TCT->CCT(1)CTC(1)TCG(1)            |
|   216    |          45.6818        |           1.7204        |                  CCA->TGG(1), CTG->CCA(1)                  |
|   217    |          44.1538        |           7.6918        |               GCA->TGC(1), GTG->GCA(1)GCG(1)               |
|   221    |           9.5131        |           2.3019        |                     GCA->CGG(1)GCG(1)                      |
|   223    |          10.7141        |          16.6627        |                     AGT->GGT(1)GTA(1)                      |
|   248    |          13.0184        |          13.1496        |                        GCC->AGA(1)                         |
|   259    |           8.8395        |           0.5507        |                        GAA->CAC(1)                         |
|   263    |           9.7945        |           4.6021        |        GCC->CAC(1)GCG(1), GCG->GTG(1), GTG->AGC(1)         |
|   268    |          26.8636        |          20.4134        |                        CTC->ACT(1)                         |
|    20    |           4.6980        |          13.0971        |                  GAG->AAG(1)AGA(1)GAA(1)                   |
|   182    |           5.2119        |           8.2645        |                        TGC->CAG(1)                         |
|   251    |           6.3157        |          13.5233        |                  ACC->ACA(1)AGC(1)GAG(1)                   |
|   258    |           4.2727        |           8.3152        |            CCG->CCA(1)GAA(1), CCT->CCC(1)CCG(1)            |
|   267    |           4.9757        |           7.8712        |                  CTC->AGG(1)CAC(1)CTT(1)                   |

The results were more convincing about using multi-hits, 19 sites out of 268 codons.

My concern is if multiple-hits sites are found, but then BUSTED-E marks them as erroneous, making the usage of multiple-hit model unnecessary or wrong.

I'd appreciate if you could help me decide whether I should use BUSTED-E + multi-hit detection.

Also, I have Hyphy installed in a server, and I am trying to run GARD to detect recombination and I get an error.
I first ran:

for f in `cat test_ortho.tsv`; do echo -e "\n$f"; hyphy gard --type nucleotide --alignment "${f}.CDS.NT.trimmed.g0.2.STOP.fa" --model JTT --mode Normal --rv None --rate-classes 4 --output ${f}.gard.json ; echo -e "Done $f\n" ; done

and I got the error:

### Fitting the baseline (single-partition; no breakpoints) model
* Log(L) = -8770.42, AIC-c = 17603.40 (31 estimated parameters)

### Performing an exhaustive single breakpoint analysis
[GARD] Breakpoint          6 of 576. Best cAIC =   17603.3967 with no breakpoints.Error:
Internal error
Internal error in ComputeBranchCache (branch gard.tree.part_0.Musnig, eval #3993 ) reversible model cached likelihood =   -89.57536821397055, directly computed likelihood =   -89.57536805683237. This is most likely because a non-reversible model was incorrectly auto-detected (or specified by the model file in environment variables; for smaller errors, this could be due to numerical instability of calculations for larger alignments). To treat numerical errors as warnings, please specify "ENV=TOLERATE_NUMERICAL_ERRORS=1;" as the command line argument. This often resolves the issue, which is indicative of numerical instability.

Function call stack
1 :  [namespace = LYWdsGte] Optimize(mles, ^lf_id);

        Keyword arguments:
                {
                 "model":"JTT",
                 "rate-classes":"4"
                }
-------
2 :  [namespace = ZDRcMlPV] res=estimators.FitExistingLF(&likelihoodFunction,modelObjects);

        Keyword arguments:
                {
                 "model":"JTT",
                 "rate-classes":"4"
                }
-------
3 :  [namespace = fE_Rixoy] fit=gard.fitPartitionedModel(breakPoints,model,initialValues,null,FALSE);

        Keyword arguments:
                {
                 "model":"JTT",
                 "rate-classes":"4"
                }
-------

then, I tried

for f in `cat test_ortho.tsv`; do echo -e "\n$f"; ENV="TOLERATE_NUMERICAL_ERRORS=1;" hyphy gard --type nucleotide --alignment "${f}.CDS.NT.trimmed.g0.2.STOP.fa" --model JTT --mode Normal --rv None --rate-classes 4 --output ${f}.gard.json ; echo -e "Done $f\n" ; done

and I still got the error.
For example, I ran the multiple-hits with this script without any problem:

for f in `cat test_ortho.tsv`; do echo -e "\n$f"; hyphy fmm --alignment ${f}.CDS.NT.trimmed.g0.2.STOP.fa --tree tree.nwk --srv Yes ; echo -e "Done $f\n" ; done

I would be grateful of any suggestion on how or where to add the code to correct for numerical instability.

Thanks

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions