ML algorithms identify patterns that explain or fit a given dataset. Every ML algorithm goes through training, where it identifies underlying patterns in a given dataset to create a “trained” algorithm (a model), and testing, where the model applies the identified patterns to unseen data points. Typically, an ML algorithm is provided with: 1. a training dataset, 2. an evaluation dataset, and 3. a held-out validation dataset. These input data can be images, text, numbers, or other data types, which are typically encoded as numerical representations. A training dataset is used by the model to learn underlying patterns from the features present in the data of interest. An evaluation dataset is a smaller, previously unused dataset that is used during the training phase to help the model iteratively update its parameters (i.e., hyperparameter tuning or model tuning). In many cases, a large training set is subdivided to form a smaller training dataset and the evaluation dataset, both of which are used to train the model. In the testing phase, a completely new or unseen test dataset, or held-out validation set, is used to test whether the patterns learned by the model hold true in new data (i.e., whether they are generalizable). While the evaluation dataset helps us refine a model’s fit to patterns in the training data, the held-out validation set helps us test the generalizability of the model.
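A minimal sketch of this three-way split, assuming tabular data stored in NumPy arrays and the scikit-learn library (the variable names and proportions below are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 1,000 samples, each encoded as 20 numerical features.
X = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=1000)  # binary labels

# First, reserve a held-out validation set that is not touched during training or tuning.
X_train_full, X_heldout, y_train_full, y_heldout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then subdivide the remaining data into the training and evaluation datasets,
# both of which are used during the training/tuning phase.
X_train, X_eval, y_train, y_eval = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=0
)
```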
If a model is generalizable, it is able to make accurate predictions on new data. High generalizability of a model on previously unseen data suggests that the model has identified important patterns in the data that are not unique to the data used for training and tuning. Generalizability can be compromised if data leakage occurs during training of the model, i.e., if the model is exposed to the same or related data points in both the training set and the held-out validation set. Ensuring that there is no overlap or relatedness between the data points or samples in the training and evaluation sets is therefore important for avoiding data leakage during model training. In rare genetic diseases specifically, where, for example, many samples may come from related family members or data from the same patient may be collected by multiple specialists at different clinical facilities, special care should be taken when crafting the training and testing sets to ensure that no data leakage occurs and that the trained model generalizes well.
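One way to guard against this kind of leakage is to split the data by a grouping variable, such as a patient or family identifier, rather than by individual samples. The sketch below assumes scikit-learn and a hypothetical `patient_id` array; it illustrates the idea rather than prescribing a pipeline:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(500, 20)
y = np.random.randint(0, 2, size=500)
# Hypothetical grouping variable: several samples may share a patient (or family) ID.
patient_id = np.random.randint(0, 100, size=500)

# GroupShuffleSplit keeps all samples with the same ID in the same partition,
# so related samples never straddle the training and held-out sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, heldout_idx = next(splitter.split(X, y, groups=patient_id))

X_train, y_train = X[train_idx], y[train_idx]
X_heldout, y_heldout = X[heldout_idx], y[heldout_idx]
```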
The implementation of an ML experiment begins with splitting a single dataset of interest such that a large proportion of the data is used for training and the remaining data is used for testing or validation as the held-out validation dataset. The training portion is generally subdivided further into the training dataset and the evaluation dataset. Ideally, a held-out validation dataset is an entirely new study or cohort, as researchers typically aim to build models that generalize to unseen, newly generated data. In rare diseases, where multiple datasets may need to be combined to assemble a large enough training dataset, special care should be taken to standardize the features across the combined datasets. Although ML algorithms generally expect datasets to have uniform features, normalizing training and testing data together may introduce similarities between samples (causing inadvertent data leakage) that undermine the goal of training highly generalizable models.
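As a minimal illustration of this point, assuming scikit-learn's `StandardScaler` (any preprocessing step fit to the data raises the same concern), the normalization should be fit on the training data only and then applied to the held-out data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training and held-out feature matrices.
X_train = np.random.rand(400, 20)
X_heldout = np.random.rand(100, 20)

# Leaky approach (avoid): fitting the scaler on training and held-out data together
# lets statistics from the held-out samples influence the training features.
# leaky_scaler = StandardScaler().fit(np.vstack([X_train, X_heldout]))

# Preferred approach: fit the scaler on the training data only,
# then apply the same transformation to the held-out data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_heldout_scaled = scaler.transform(X_heldout)
```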
During the iterative training phase, the model learns important patterns in the training dataset and then uses the evaluation dataset to measure prediction errors and update its learning parameters (hyperparameter tuning). The process of applying the trained model to the evaluation dataset to measure performance and update the hyperparameters is called cross-validation. There are multiple approaches that can be used to make maximal use of the available data when generating training and evaluation datasets (e.g., leave-p-out cross-validation, leave-one-out cross-validation, k-fold cross-validation, and Monte Carlo random subsampling cross-validation).[@doi:10.1007/978-1-4614-6849-3] In the case of k-fold cross-validation, a given dataset is shuffled randomly and split into k parts. One of the k parts is reserved as the evaluation dataset and the rest are combined and used as the training dataset. In the next iteration, a different part is used as the evaluation dataset, while the rest are used for training. To avoid data leakage, and to promote generalization of models to new studies, researchers can use study-wise cross-validation, in which all samples from a study are placed in the same fold so that no individual study is represented in both the training and evaluation datasets. Once the model has iterated through all k parts of the training and evaluation datasets, it is ready to be tested on the held-out validation dataset (Figure [@fig:1]b).
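A sketch of study-wise cross-validation, assuming scikit-learn and a hypothetical `study_id` array recording which study each sample came from (the classifier choice here is arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold

X = np.random.rand(600, 20)
y = np.random.randint(0, 2, size=600)
# Hypothetical study labels: all samples from a given study share one ID.
study_id = np.random.randint(0, 10, size=600)

# GroupKFold places every sample from a study in exactly one fold, so no study
# contributes to both the training and evaluation portions of any split.
cv = GroupKFold(n_splits=5)
for fold, (train_idx, eval_idx) in enumerate(cv.split(X, y, groups=study_id)):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    accuracy = model.score(X[eval_idx], y[eval_idx])
    print(f"Fold {fold}: evaluation accuracy = {accuracy:.3f}")
```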
The held-out validation dataset is exposed to the model only once, to estimate the model’s accuracy. High accuracy during cross-validation but low accuracy on the held-out validation dataset is a sign that the model has become overfit to the training set and has low generalizability. If there is evidence of overfitting, researchers should revisit the construction of the datasets to make sure they meet the best practices outlined above.
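A brief sketch of this check, again assuming scikit-learn (a large gap between the two numbers would suggest overfitting; the model and data here are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X = np.random.rand(600, 20)
y = np.random.randint(0, 2, size=600)
X_train, X_heldout, y_train, y_heldout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Average accuracy across cross-validation folds of the training data.
cv_accuracy = cross_val_score(model, X_train, y_train, cv=5).mean()

# Accuracy on the held-out validation set, evaluated once.
heldout_accuracy = model.fit(X_train, y_train).score(X_heldout, y_heldout)

print(f"Cross-validation accuracy: {cv_accuracy:.3f}")
print(f"Held-out accuracy: {heldout_accuracy:.3f}")
```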
It is important to note that accuracy alone may not be the best measure of performance in rare disease datasets. Because rare disease samples make up only a small fraction of the data, a model tasked with identifying them can achieve high accuracy simply by labeling every sample as a non-rare disease sample. Measures that account for class imbalance, such as the Kappa statistic or the area under the precision-recall curve [@doi:10.1145/1143844.1143874], are more appropriate metrics of model performance for rare disease.
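A minimal sketch of computing these class-imbalance-aware metrics with scikit-learn (the labels and predictions are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score, cohen_kappa_score

# Illustrative imbalanced labels: roughly 5% of samples belong to the rare disease class (1).
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.05).astype(int)

# A degenerate model that calls every sample "non-rare disease" still looks accurate...
y_pred = np.zeros_like(y_true)
accuracy = (y_pred == y_true).mean()

# ...but the Kappa statistic and the area under the precision-recall curve expose the failure.
kappa = cohen_kappa_score(y_true, y_pred)
auprc = average_precision_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.3f}, Kappa: {kappa:.3f}, AUPRC: {auprc:.3f}")
```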