Six participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise, while the other four classes correspond to common mistakes. The data were collected from accelerometers on the belt, forearm, arm, and dumbbell of the participants, and compiled into datasets with 160 features.
The goal of this report is to train a model that predicts the manner (class) in which the participants performed the exercise. The model is then used to predict 20 unlabelled test cases.
The report describes how I built a random forest model, how I used cross-validation, what out-of-sample error I expect, and why I made the choices I did.
#The training data for this project are available here:
#https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
#The test data are available here:
#https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
# The data was downloaded to a local folder
setwd("~/Coursea/PracticalMachineLearning")
training <- read.csv("./pml-training.csv")
testing <- read.csv("./pml-testing.csv")
if(!require(caret)) install.packages('caret'); library(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
The data contain a number of derived features (mean, standard deviation, etc.) that are computed over subsets of rows; these features are NA for all but one row in each subset and will be removed, since random forests cannot handle features with NA values. There are also several near-zero-variance columns. The concern here is that such predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples, or that a few samples may have an undue influence on the model, so these near-zero-variance predictors will also be removed. Of the remaining features, only the numeric predictors and the outcome will be kept. These cleaning steps reduce the dataset to 53 features.
# identify near-zero-variance features
nzv <- nearZeroVar(training, saveMetrics = TRUE)
# keep only features without NA values
no_na_cols <- names(training[, colSums(is.na(training)) == 0])
# drop the near-zero-variance features
features <- setdiff(no_na_cols, row.names(nzv)[nzv$nzv])
# drop the first six bookkeeping columns (row index, user name, timestamps,
# window id), keeping only numeric predictors and the classe outcome
features <- features[-c(1:6)]
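As a quick sanity check (assuming the standard pml-training.csv column layout), the cleaned feature set should contain the 52 numeric predictors plus the classe outcome:
# sanity check: 52 numeric predictors plus the classe outcome
length(features)        # expected: 53
"classe" %in% features  # expected: TRUE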
Partition the training data into training (75%) and validation (25%) sets. The validation set is held out while a random forest model is trained. For the random forest, 501 trees are built using 10-fold cross-validation to tune the mtry parameter. Averaging over ten folds reduces the bias of the accuracy estimate at some increased cost in its variance. The 501 trees should allow the model to converge to a stable accuracy. The combination of random forests with cross-validation tends to produce robust models with high accuracy, and performs well in the presence of correlated features in the training data.
set.seed(3234)
#Partition the training data into training and validation data sets
inBuild <- createDataPartition(y=training$classe,p=.75,list=FALSE)
train_data <- training[inBuild,features]
validate_data <- training[-inBuild,features]
# filter the unlabelled test data (drop classe, the 53rd feature, absent from testing)
test_data <- testing[, features[-53]]
# define the training control: 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# train the random forest model (501 trees, proximity computation enabled)
set.seed(235)
rf.model <- train(classe ~ ., data = train_data,
                  trControl = train_control,
                  method = "rf", ntree = 501,
                  prox = TRUE)
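caret stores the per-fold accuracies from the 10-fold cross-validation in the resample element of the fitted object; the short sketch below (using only objects defined above) is one way to inspect their spread:
# spread of the per-fold cross-validated accuracies
summary(rf.model$resample$Accuracy)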
Below are the summary details of the random forest, together with the confusion matrix and out-of-bag (OOB) error for the final model. The final model is the forest trained with the mtry value that yields the highest cross-validated accuracy.
# print the random forest
print(rf.model)
## Random Forest
##
## 14718 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 13246, 13245, 13247, 13247, 13246, 13246, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9928662 0.9909756
## 27 0.9932057 0.9914053
## 52 0.9864113 0.9828101
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# print the final model's OOB error (last row of err.rate, OOB column)
rf.model$finalModel$err.rate[501,1]
## OOB
## 0.006726457
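The same OOB estimate can be read without hardcoding the tree index, which is safer if ntree is changed later; this uses the ntree element that randomForest stores on the final model:
# equivalent OOB lookup that does not hardcode the number of trees
rf.model$finalModel$err.rate[rf.model$finalModel$ntree, "OOB"]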
# print the confusion matrix
confusionMatrix(rf.model)
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.4 0.1 0.0 0.0 0.0
## B 0.0 19.2 0.1 0.0 0.0
## C 0.0 0.1 17.3 0.2 0.0
## D 0.0 0.0 0.1 16.2 0.1
## E 0.0 0.0 0.0 0.0 18.3
##
## Accuracy (average) : 0.9932
In the final model, mtry = 27, meaning 27 randomly chosen features are considered at each split. The confusion matrix shows very high predictive ability, with a cross-validated accuracy of 0.9932057. The estimated out-of-bag (OOB) error is only 0.0067265.
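The selected tuning value can also be confirmed directly from the fitted object, since caret records it in bestTune:
# the mtry value caret selected for the final model
rf.model$bestTune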
Below are plots of the error rate by the number of trees in the model, as well as a plot of the most important features in the final model.
# per-class and OOB error rates as a function of the number of trees
plot(rf.model$finalModel)
# importance of each feature in the final model
varImpPlot(rf.model$finalModel)
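caret's varImp() offers an equivalent, scaled importance ranking with its own plot method; for example, restricted to the top 10 predictors (a sketch, assuming the model trained above):
# scaled variable importance via caret, top 10 predictors only
plot(varImp(rf.model), top = 10)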
Running the held-out validation set through the final model produces the following results.
pred_val_data <- predict(rf.model,newdata=validate_data)
confusionMatrix(pred_val_data,validate_data$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1393 6 0 0 0
## B 2 943 4 0 0
## C 0 0 845 4 0
## D 0 0 6 800 2
## E 0 0 0 0 899
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9927, 0.9969)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9938
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9986 0.9937 0.9883 0.9950 0.9978
## Specificity 0.9983 0.9985 0.9990 0.9980 1.0000
## Pos Pred Value 0.9957 0.9937 0.9953 0.9901 1.0000
## Neg Pred Value 0.9994 0.9985 0.9975 0.9990 0.9995
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2841 0.1923 0.1723 0.1631 0.1833
## Detection Prevalence 0.2853 0.1935 0.1731 0.1648 0.1833
## Balanced Accuracy 0.9984 0.9961 0.9937 0.9965 0.9989
The model predicts with an accuracy of 0.9951 (95% confidence interval 0.9927 to 0.9969) on the validation set. The validation error rate (1 - accuracy) of 0.0049 is the estimate of the out-of-sample error.
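The same out-of-sample error estimate can be computed programmatically from the confusion matrix object rather than read off the printout:
# out-of-sample error on the held-out validation set: 1 - accuracy
cm <- confusionMatrix(pred_val_data, validate_data$classe)
1 - unname(cm$overall["Accuracy"])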
Below are the predicted results from the unlabelled test set.
pred_test_data <- predict(rf.model,newdata=test_data)
pred_test_data
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
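For submission, each prediction can be paired with its test-case identifier. The sketch below assumes pml-testing.csv carries a problem_id column, as in the original dataset:
# pair each prediction with its test-case id (assumes a problem_id column)
data.frame(problem_id = testing$problem_id, prediction = pred_test_data)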