Releases: cov-lineages/pangoLEARN
pangoLEARN data release 2020-08-29_2
Release notes
Minor fix:
- B.1.1.25.1 -> D.2
- B.1.1.25.2 -> D.1
pangoLEARN data release 2020-08-29
Release description
The current version of pangoLEARN uses a classification tree, but the pipeline has been written so that, as more complex models are developed, the user will be able to choose which model to use to assign their lineages. The model was trained using approximately 60,000 SARS-CoV-2 sequences from GISAID, with their lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pangolin. Training takes approximately half an hour on our hardware (this will vary with different hardware). This model was built using the standard scikit-learn implementation of the decision tree learning algorithm. The code for this process is available in the cov-lineages/cov-support repository.
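As a rough illustration of the approach described above, the sketch below fits scikit-learn's `DecisionTreeClassifier` to a handful of toy sequences. It is not the pangoLEARN training code (that lives in cov-lineages/cov-support). The sequences, the labels, the `one_hot` helper and the four-letter alphabet are all assumptions made for this example.

```python
from sklearn.tree import DecisionTreeClassifier

BASES = "ACGT"  # assumed alphabet for this sketch


def one_hot(seq):
    """Flatten a sequence into a 0/1 vector with four entries per site."""
    return [1 if base == b else 0 for base in seq.upper() for b in BASES]


# Toy aligned sequences with made-up lineage labels (illustration only).
train_seqs = ["ACGTAC", "ACGTAA", "TCGTAC", "TCGTAA"]
train_labels = ["A", "A", "B", "B"]

X = [one_hot(s) for s in train_seqs]

# Fit a single classification tree; random_state fixes tie-breaking.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, train_labels)

# Assign a lineage to a sequence that was not in the training set.
prediction = clf.predict([one_hot("TCGTAG")])[0]
print(prediction)
```

In this toy data only the first site separates the two labels, so the tree needs a single split. Real training data would produce a much deeper tree.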
pangoLEARN data release 2020-07-20
Release description
pangoLEARN is an alternative algorithm for lineage assignment, implemented as of pangolin 2.0. This new algorithm, which relies on machine learning, offers much faster lineage assignment, as the phylogenetic approach was struggling to scale with the growing number of lineages that need to be represented in the guide tree. This new approach also takes into account all of the diversity present within a lineage rather than just selecting a representative few sequences. As a consequence, recall and precision have improved significantly for large lineages. We are continuing to develop more sophisticated approaches to machine learning for lineage assignment, which we hope will offer further improvements in both speed and accuracy.
The current version of pangoLEARN uses multinomial logistic regression, but the pipeline has been written so that, as more complex models are developed, the user will be able to choose which model to use to assign their lineages.
In standard regression, a line of best fit is found for a set of training data, representing a linear relationship between the variables of interest. A logistic regression instead fits a sigmoid function to the training data in order to tell two different classes apart. A multinomial logistic regression extends standard logistic regression so that it can classify more than two classes. The choice among n classes (i.e. lineages) is modeled as a set of n-1 independent binary choices (sigmoid functions), each made against a reference class.
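One common way to write down the "n-1 binary choices" formulation is to fix the score of a reference class at zero and learn parameters only for the other n-1 classes. The sketch below computes class probabilities under that assumption. The function name, weights and inputs are hypothetical, and this is not necessarily how scikit-learn parameterizes the model internally.

```python
import math


def multinomial_probabilities(x, weights, biases):
    """Return probabilities for n classes from n-1 sets of parameters.

    Each (weights[k], biases[k]) pair gives the log-odds of class k
    against the reference class, whose score is fixed at 0.
    """
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    scores.append(0.0)  # reference class
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three classes need only two parameter sets; the third is the reference.
probs = multinomial_probabilities(
    x=[1.0, 0.0],
    weights=[[2.0, -1.0], [0.5, 0.5]],
    biases=[0.5, 0.0],
)
print(probs)
```

With a single parameter set (n = 2), the first probability reduces to the ordinary sigmoid of the score, which is the two-class logistic regression described above.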
The model was trained using 30,000 SARS-CoV-2 sequences from GISAID (acknowledgements here), with their lineages assigned by manually curating the global ML tree, as is the standard lineages data release procedure for pangolin. Each base of each genome was one-hot encoded. This left us with a large number of parameters to train, which is why training this model takes approximately 14 hours on our hardware (this will vary with different hardware). This model was built using the standard scikit-learn implementation of multinomial logistic regression. The code for this process is available in the cov-lineages/cov-support repository.
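To make the one-hot encoding step concrete, here is a minimal sketch of how a nucleotide string could be turned into a feature vector. The `one_hot_encode` name, the four-letter alphabet and the handling of other characters are assumptions for illustration. The release notes do not say how ambiguous bases or gaps are treated.

```python
BASES = "ACGT"  # assumed alphabet; handling of N and gaps is not specified in the source


def one_hot_encode(sequence):
    """Encode a nucleotide string as a flat list of 0/1 values, four per site.

    Characters outside BASES (for example N or '-') become four zeros,
    which is a choice made for this sketch only.
    """
    encoded = []
    for base in sequence.upper():
        encoded.extend(1 if base == b else 0 for b in BASES)
    return encoded


features = one_hot_encode("ACGTN")
print(features)
print(len(features))  # four values per site
```

With four values per site, a genome of roughly 30,000 bases yields on the order of 120,000 input features. A multinomial model then needs that many weights for each lineage it learns, which fits the long training time reported above.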
Multinomial logistic regression is a very widely used model, as it can simply and intuitively assign probabilities to class assignments. However, it does not incorporate any hierarchical structure, and we are currently developing new models that do. Despite the limitations of this simple model, it has performed surprisingly well with this data. While more complex models may offer improvements in assignment accuracy for smaller lineages, logistic regression has the advantages of being intuitive, easy to implement, and relatively fast to train.
Release notes
This data is now pip-installed as part of the pangolin environment. If you're not using conda, ensure that you install this repo prior to using pangolin 2.0.