This is a repository for Coursera's Getting and Cleaning Data peer assessment project. Its main purpose is to provide a script (run_analysis.R) that transforms the Human Activity Recognition Using Smartphones data set into a tidy data set by applying the following transformations:
- Merge the training and the test sets to create one data set.
- Extract only the measurements on the mean and standard deviation for each measurement.
- Use descriptive activity names to name the activities in the data set.
- Appropriately label the data set with descriptive variable names.
- Create a second, independent tidy data set with the average of each variable for each activity and each subject.
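As an illustration of the last transformation, here is a minimal sketch of averaging each variable per subject and activity. It uses base R's aggregate() on a made-up toy data frame; the column names and values are assumptions for the example, not the ones produced by run_analysis.R, and the script itself may compute the averages differently.

```r
# Toy data: two subjects, two activities, one measurement column.
# All names and values here are invented for illustration.
toy <- data.frame(
  subject  = c(1, 1, 1, 2, 2, 2),
  activity = c("WALKING", "WALKING", "SITTING",
               "WALKING", "SITTING", "SITTING"),
  tBodyAccMeanX = c(0.2, 0.4, 0.1, 0.3, 0.5, 0.7)
)

# Average every measurement column for each (subject, activity) pair.
# The "." on the left-hand side means "all columns not used for grouping".
averages <- aggregate(. ~ subject + activity, data = toy, FUN = mean)

# Sort for readability: one row per subject/activity combination.
averages <- averages[order(averages$subject, averages$activity), ]
print(averages)
```

The result has one row per subject and activity combination, which is the shape the second tidy data set is meant to have.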
Following the Leek group guide to data sharing, this repository provides:
- The raw data (UCI HAR Dataset folder)
- A tidy data set (dataset.csv.txt)
- A codebook describing each variable and its values in the tidy data set (CodeBook.md).
- An explicit and exact recipe to go from the raw data to the tidy data set and the codebook (run_analysis.R).
- Two additional R files with helper functions for run_analysis.R.
In order to recreate the data set you need to:
- Download this repository using git
- Point your R working directory to the path of the downloaded repository (using setwd() or, in RStudio, by clicking Session -> Set Working Directory -> Choose Directory...).
- Run source("run_analysis.R")
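In an R session, the last two steps could look roughly like the sketch below. The helper recreate_dataset() and the example path are hypothetical, introduced only for illustration; the only names taken from this repository are run_analysis.R, the "UCI HAR Dataset" folder and dataset.csv.txt.

```r
# Hypothetical helper: checks the repository layout, then runs the script
# from inside the repository so that its relative paths resolve.
recreate_dataset <- function(repo_dir) {
  script <- file.path(repo_dir, "run_analysis.R")
  raw    <- file.path(repo_dir, "UCI HAR Dataset")
  if (!file.exists(script)) stop("run_analysis.R not found in ", repo_dir)
  if (!dir.exists(raw)) stop("'UCI HAR Dataset' folder not found in ", repo_dir)

  old <- setwd(repo_dir)   # point the working directory at the repository
  on.exit(setwd(old))      # restore the previous working directory afterwards
  source("run_analysis.R")

  # Return the path where the script is expected to write its output.
  invisible(file.path(repo_dir, "dataset.csv.txt"))
}

# Example call (replace the placeholder with your own clone location):
# recreate_dataset("path/to/cloned/repository")
```

Checking for the raw data folder first gives a clearer error than a failed file read inside the script if the working directory is wrong.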
This will recreate dataset.csv.txt from the raw data files in the "UCI HAR Dataset" directory. The script performs the following steps:
- Loads the training and test features, labels and subjects from the /UCI HAR Dataset/train/ and /UCI HAR Dataset/test/ directories.
- Merges the training and test features into one data frame, the training and test labels into a second, and the training and test subjects into a third.
- Parses the /UCI HAR Dataset/features.txt file and adds the feature names to the features data frame.
- Selects only the features that measure means or standard deviations and creates a new features data frame containing only those features.
- Parses /UCI HAR Dataset/activity_labels.txt and replaces the code values in the labels data frame with their text versions.
- Uses regular expressions to replace the column names of the features data frame with more descriptive ones by expanding abbreviations into complete words.
- Joins the subject and labels data frames into a single data frame and adds the right column names.
- Merges the data frame created in the previous step with the features data frame.
- Writes the tidy data set to the dataset.csv.txt file.
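The steps listed above can be sketched in base R as follows. This is a simplified illustration, not the code in run_analysis.R: the function name build_tidy(), the pattern used to pick mean and standard deviation features, and the particular abbreviations expanded are all assumptions, and the actual script and its helper files may differ.

```r
# Hypothetical sketch of the pipeline; `root` is the raw data folder.
build_tidy <- function(root = "UCI HAR Dataset") {
  # Read one file such as train/X_train.txt or test/subject_test.txt.
  read_set <- function(kind, part) {
    read.table(file.path(root, part, paste0(kind, "_", part, ".txt")))
  }

  # Merge training and test rows for features (X), labels (y) and subjects.
  features <- rbind(read_set("X", "train"), read_set("X", "test"))
  labels   <- rbind(read_set("y", "train"), read_set("y", "test"))
  subjects <- rbind(read_set("subject", "train"), read_set("subject", "test"))

  # Name the feature columns using the second column of features.txt.
  feature_names <- read.table(file.path(root, "features.txt"),
                              stringsAsFactors = FALSE)[[2]]
  names(features) <- feature_names

  # Keep only mean() and std() measurements (assumed selection pattern).
  keep <- grepl("mean\\(\\)|std\\(\\)", feature_names)
  features <- features[, keep, drop = FALSE]

  # Replace numeric activity codes with their text labels.
  activities <- read.table(file.path(root, "activity_labels.txt"),
                           stringsAsFactors = FALSE)
  activity <- activities[[2]][match(labels[[1]], activities[[1]])]

  # Expand a few abbreviations in the column names (assumed set).
  n <- names(features)
  n <- sub("^t", "time", n)
  n <- sub("^f", "frequency", n)
  n <- gsub("Acc", "Accelerometer", n)
  n <- gsub("Gyro", "Gyroscope", n)
  names(features) <- n

  # Join subject and activity, then attach the selected features.
  cbind(data.frame(subject = subjects[[1]], activity = activity), features)
}

# Example call, assuming the working directory is the repository root:
# tidy <- build_tidy()
```

Reading each train/test pair with one small helper keeps the three merges symmetrical, and matching activity codes with match() preserves the original row order, which a merge() would not guarantee.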