Releases: cov-lineages/pangoLEARN
pangoLEARN data release 2021-01-20
Release notes
- Curation of B.1 lineages
- B.5 issue resolved; affected sequences reassigned to B.1
pangoLEARN data release 2021-01-16
Release notes
- Minor release removing classification B.1.1.248
pangoLEARN data release 2021-01-11
Release notes
- Minor update to the training model, incorporating new sequences from lineage B.1.375
pangoLEARN data release 2021-01-06
Release notes
Update to current model and metadata
- Subset of South African sequences reassigned to B.1.1.56 and B.1.1.54
- Brazilian lineage B.1.1.248 reassigned to B.1.1.28
pangoLEARN data release 2020-12-17_2
Release notes
Update to model training to remedy bug from issue #116
pangoLEARN data release 2020-12-17
Release notes
- Update to include lineage B.1.1.7, associated with the N501Y mutation originating in the UK
pangoLEARN data release 2020-11-30_2
Release notes
- Issue identified in the global tree: identical sequences were present in both the B.1.1 base and B.1.374. This patch includes a newly trained model with B.1.374 reassigned to B.1.1 and B.1.374.1 reassigned to B.1.1.316.
pangoLEARN data release 2020-11-30
Release notes
- Lineage curation
The lineage assignments have been fully updated from a tree built using FastTree MP. Previous releases had some circularity, because four sub-trees were built based on pangolin assignments for A, B, B.1 and B.1.1 and then manually assigned. This release resolves that circularity by building a single tree covering all SARS-CoV-2 diversity. The tree was split into 25 roughly equal-sized chunks using jclusterfunk. These chunks were then manually curated for new lineages, and previous lineage definitions were refined or updated.
- There are now 779 lineages defined; full details can be found at cov-lineages.org.
- Ambiguity curation
In this release we are doing more aggressive data curation in training to remove ambiguous sequences and resolve conflicts resulting from ambiguities in the training set.
- Accuracy and precision results are generated using 10-fold cross-validation on the curated dataset, so be aware that query sequences with a lot of ambiguity may have lower assignment accuracy.
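As a rough illustration of what 10-fold cross-validation involves, the sketch below splits a dataset into ten folds and averages the accuracy over the held-out folds. It is not the pangoLEARN evaluation code: `k_fold_indices`, `cross_validate`, and the toy majority-label "model" are hypothetical names introduced here, and the data is made up.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx


def cross_validate(samples, labels, fit, predict, k=10):
    """Return the mean held-out accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical stand-in model: always predicts the most common training label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Made-up toy data, not real sequences or assignments.
toy_samples = ["ACGT"] * 20
toy_labels = ["B.1"] * 20
mean_accuracy = cross_validate(toy_samples, toy_labels, fit_majority, predict_majority)
```

Because every fold is scored on curated data, figures obtained this way describe performance on clean sequences; they say little about queries that are much more ambiguous than the training set.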
- Model description
The current version of pangoLEARN uses a Classification Tree, but the pipeline has been written so that, as more complex models are developed, the user will be able to choose which model to use to assign their lineages. The model was trained using 188,193 SARS-CoV-2 sequences from GISAID, with their lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pangolin. Training takes approximately 2 hours on our hardware (this may change with different hardware). The model was built using the standard scikit-learn implementation of the decision tree learning algorithm. The code for this process is available in the cov-lineages/cov-support repository.
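To give a feel for how a classification tree can separate lineages, here is a toy sketch of the single step such a tree repeats: choosing the (position, base) test that best separates the labels, scored by Gini impurity. It is a pure-Python illustration, not the scikit-learn implementation and not pangoLEARN's feature encoding; `gini`, `best_split`, the sequences, and the labels are all assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means all labels agree)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(seqs, labels):
    """Return (impurity, position, base) for the test 'seq[position] == base'
    that minimises the size-weighted Gini impurity of the two branches."""
    best = None
    n = len(seqs)
    for pos in range(len(seqs[0])):
        for base in sorted({s[pos] for s in seqs}):
            left = [lab for s, lab in zip(seqs, labels) if s[pos] == base]
            right = [lab for s, lab in zip(seqs, labels) if s[pos] != base]
            if not left or not right:
                continue  # this test does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, pos, base)
    return best


# Made-up aligned sequences with made-up lineage labels.
seqs = ["ACGT", "ACGA", "TCGT", "TCGA"]
labels = ["A", "A", "B", "B"]
split = best_split(seqs, labels)  # position 0 separates the two labels cleanly
```

A full tree applies this choice recursively to each branch until the branches are pure or a stopping rule is reached.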
pangoLEARN data release 2020-10-30
pangoLEARN data release 2020-08-29_3
Minor release
- Updated decision tree model, excluding any Ns from training. The reference base at a given position is used as an uninformative alternative to N.
- Updated recall rates for each lineage
- A222V lineage B.1.177 included in the training set
pangoLEARN data release 2020-08-29_3
Minor release
- Updated decision tree model, excluding any Ns from training. The reference base at a given position is used as an uninformative alternative to N.
- Updated recall rates for each lineage
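The substitution of the reference base for N can be sketched as follows. This is a minimal illustration assuming the query is already aligned to the reference and has the same length; `mask_n_with_reference` is a hypothetical helper, not a function from pangoLEARN.

```python
def mask_n_with_reference(seq, reference):
    """Replace each N in an aligned sequence with the reference base at that position.

    Assumes seq and reference are aligned and of equal length.
    """
    if len(seq) != len(reference):
        raise ValueError("sequence and reference must have the same length")
    return "".join(
        ref_base if base.upper() == "N" else base
        for base, ref_base in zip(seq, reference)
    )


# Made-up example: the two Ns are filled in from the reference.
masked = mask_n_with_reference("ACNNT", "ACGTT")
```

Filling an N with the reference base means the position carries no evidence for any non-reference lineage, which is the sense in which it is uninformative.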