With the large amount of data about personal activity collected by devices, it is interesting not only to quantify how much of a particular activity a person does, but also how well they do it. In this task, we are given a large amount of labelled data describing accelerometer readings from sensors on the belt, forearm, arm and dumbbell of six participants, together with whether each participant is performing a barbell lift correctly or incorrectly (and, if incorrectly, which mistake they have made). The goal is to train a model that, given new data, automatically tells us how a person is performing the exercise.
In 2013, Velloso et al. identified 17 factors in their work that are most critical for deciding how well a participant performs a barbell lift and which mistake is made in an incorrect performance. These 17 factors, however, are not always available in the raw data (e.g. the range of an accelerometer reading) and require manual feature engineering.
The data in "pml-training" are partitioned based on the values of the variable "classe": 70% of the data are used for training the model and the remaining 30% for cross-validation (see Fig. 1).

Instead of directly using only these 17 factors, I would like to try a more straightforward model: training a random forest on all "meaningful" columns (by "meaningful" I mean that columns such as "user name" or "date and time", and columns with rarely valid data, are eliminated). Although the training process might take long (it eventually took about one hour to train the model on the 70% of the input data that form my training set), this is intuitively the model to choose -- all performance mistakes are related to some combination of angles and velocities. If cross-validation showed that the model did not perform well (probably due to overfitting, which would appear as a big gap between the error on the training data and on the cross-validation data), I would look at the order of variable importance with respect to "classe", remove several of the less important variables, and rebuild the random forest model.

However, the straightforward random forest model pays off: on the training data an accuracy of 100% was observed, and on the cross-validation data an accuracy of more than 99% was observed (see Fig. 2). (This model also turned out to pay off on the 20 test cases, where an accuracy of 100% was observed.)
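The workflow described above can be sketched in R roughly as follows. The file name, the exact bookkeeping column names, the NA encodings, and the 95% missing-value threshold are assumptions not stated in the text; caret's createDataPartition performs the stratified 70/30 split on "classe", and train(..., method = "rf") fits the random forest on all remaining columns.

```r
library(caret)
library(randomForest)

set.seed(1234)

# Load the labelled data; the strings treated as NA are an assumption about
# how missing values are encoded in this data set.
training_raw <- read.csv("pml-training.csv",
                         na.strings = c("NA", "", "#DIV/0!"))
training_raw$classe <- factor(training_raw$classe)

# Drop bookkeeping columns (row id, user name, timestamps, window markers);
# the exact column names here are assumptions.
bookkeeping <- c("X", "user_name", "raw_timestamp_part_1",
                 "raw_timestamp_part_2", "cvtd_timestamp",
                 "new_window", "num_window")
training_raw <- training_raw[, !(names(training_raw) %in% bookkeeping)]

# Drop columns that are almost entirely missing ("rarely valid data").
mostly_na <- sapply(training_raw, function(col) mean(is.na(col)) > 0.95)
training_raw <- training_raw[, !mostly_na]

# 70/30 split stratified on "classe", as described in the text.
in_train  <- createDataPartition(training_raw$classe, p = 0.7, list = FALSE)
train_set <- training_raw[in_train, ]
valid_set <- training_raw[-in_train, ]

# Random forest on all remaining predictors; with caret's default resampling
# this is the step that takes on the order of an hour.
rf_fit <- train(classe ~ ., data = train_set, method = "rf")

# Accuracy on the training set and on the held-out 30%.
confusionMatrix(predict(rf_fit, train_set), train_set$classe)
confusionMatrix(predict(rf_fit, valid_set), valid_set$classe)

# If the held-out accuracy had been poor, variable importance would guide
# which predictors to drop before refitting.
varImp(rf_fit)
```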