Commit 4c903b0

erictleung authored and QuincyLarson committed
[WIP] Add script to clean and combine data, and add data (freeCodeCamp#29)
* Add script to clean and combine data, and add data
  - Update survey data dictionary with left out questions
  - Update survey data dictionary with variable/column names for questions
  - Add script `clean-data.R` to clean and combine the two survey datasets into one for ease of analysis
  - Create the combined survey dataset after running `clean-data.R`
  - Create README.md file to explain cleaned data and the script to produce it
  - Update root README.md file to briefly explain data
  - Change `data/` directory to `raw-data/`
* Move around functions and add more edits
  - Update date
  - Categorize functions into different categories
    - Utility functions
    - Sub-process functions
    - Main process functions
    - Main function
  - Update function descriptions
  - Add function to check survey data uses only one ID from each
* Move cleaning of code events to own function
* Create function to search and add col + formatting
  - Create function to search in a given column for search terms, then creates a new column labeling rows containing search terms
  - Reformat input data comments
  - Reformat NSE functions e.g. mutate_()
* Create temp helper function to look at columns
* Move reading data function to main processes
* Create draft full dataset
* Rename cleaning function and update joining key
  - The cleaning function `clean_part_1` was written for the first dataset. I've changed the function, along with the variables, to attend to the joined dataset.
  - Removing outliers for hours learning per week was simplified
  - Added usage case for `search_and_create()` function
* Add feedback to user on script actions
* Separate other job interests cleaning to function
* Fix inconsistent indenting in helper function
* Move cleaning other podcasts to separate function
* Reorganize sub-cleaning functions to own category
* Update helper function with flexible use
  Allow helper function to either default view the data, print data to console (printYes=1), or to print the number of instances
* Create new columns for significant other podcasts
  - Update description of `clean_podcasts` function
  - Add more variations to “None” response
  - Add feedback to user on start and finish of function
  - Add new columns for podcasts that were mentioned >15 times
* Separate a function for cleaning hours learned
* Add feedback in cleaning code events & exp earning
* Separate function for cleaning months programming
* Separate function cleaning post bootcamp salary
  - Retain previous cleaning
  - Add in same normalizations from expected income
* Separate function for cleaning money for learning
* Add description to entire script
* Floor values and remove outliers in money to learning
* Create function for cleaning age
* Initialize functions for columns needing cleaning
* Create new boolean column for PodcastOther
* Fix feedback message for cleaning hours learning
* Update draft of complete data
* Remove boolean Podcast Other column
* Finish cleaning income and remove extras
  - Finished cleaning income function
  - Removed changing ExpectedEarning to integer
  - Remove unnecessary cleaning
* Remove "Other" from new podcast cols
* Finish cleaning commute times
* Update code events cleaning to make new cols
* Clean other resources
* Update code events threshold to 1.5% frequency
* Update detail on cutoff for other podcasts is 1.5%
* Add Bootcamp Name into joining key
* Add back in podcast and events from 2nd dataset
* Make ages less than 10 to NA
* Convert resources to boolean
* Finish cleaning data with consistency check
  - Check for inconsistencies between job role interests
  - Remove unnecessary columns
* Remove "Other" from new Podcast columns
* Clean student debt owed
* Add CodeEvent column to columns removed
* Write final polish of data
* Fix small spelling mistakes
* Update final dataset
* Remove first dataset
* Update script date
1 parent 97ba361 commit 4c903b0

8 files changed: +17801 -37 lines changed

README.md

+10 -1
@@ -5,9 +5,18 @@ We announced on [March 29th,

Survey development was led by [Quincy Larson](https://twitter.com/ossia) with Free Code Camp and [Saron Yitbarek](https://twitter.com/saronyitbarek) with Code Newbie. For more about why we made this survey: ["How we crafted a survey for thousands of people who are learning to code"](https://medium.freecodecamp.com/we-just-launched-the-biggest-ever-survey-of-people-learning-to-code-cac81dadf1ea#.8g9ts8gm5).

+## Table of Contents
+
+- [About the Data](#about-the-data)
+- [How to Contribute](#how-to-contribute)
+- [Analysis of other relevant recent data](#analysis-of-other-relevant-recent-data)
+- [License](#license)
+
## About the Data

-The survey results are located in the [`data/`](data/) directory, in .csv format.
+The raw survey results are located in the [`raw-data/`](raw-data/) directory, in `.csv` format.
+
+We have cleaned and combined the data for convenience of downstream analyses and visualizations. The cleaned data is located in the [`clean-data/`](clean-data/) directory.

## How to Contribute

clean-data/2016-FCC-New-Coders-Survey-Data.csv

+15,621
Large diffs are not rendered by default.

clean-data/README.md

+65
@@ -0,0 +1,65 @@
+# Cleaning and Combining Free Code Camp Survey Data
+
+## Introduction
+
+The survey data was broken up into two parts, which need to be combined into one
+for ease of future downstream analyses. Additionally, these two data sets need
+to be cleaned up a bit because survey responses, especially free text answers,
+are inconsistent.
+
+## Notable Data Transformations
+
+### Obvious Outliers
+
+In some of the numeric free text answers, values were filtered out if they
+were beyond a reasonable threshold. For example, an answer saying you've coded
+for 100,000 months would be removed.
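
Note on the outlier rule above: a minimal sketch of such a threshold filter, written in Python for illustration and not taken from `clean-data.R` (which is an R script). The function name `drop_outlier` and the cutoff of 744 months are assumptions; the README does not state the actual thresholds. The sketch also assumes that "removed" means the value is treated as missing rather than the whole row being dropped.

```python
from typing import Optional

# Hypothetical cutoff: 744 months (62 years). The real threshold used by
# clean-data.R is not stated in the README.
MAX_MONTHS_PROGRAMMING = 744


def drop_outlier(value: Optional[float],
                 upper: float = MAX_MONTHS_PROGRAMMING) -> Optional[float]:
    """Return the value unchanged, or None (missing) if it is negative or above `upper`."""
    if value is None:
        return None
    if value < 0 or value > upper:
        return None
    return value


answers = [6, 24, 100000, None, 120]
cleaned = [drop_outlier(a) for a in answers]
print(cleaned)  # [6, 24, None, None, 120]
```

Replacing the value with a missing marker keeps the respondent's other answers available for analysis.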
+
+### Numeric Ranges
+
+Some answers were given as ranges. For example, a range of "9-10" months of
+programming might have been given as an answer to a question. The average of
+this range was taken when possible.
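
Note on the range rule above: a minimal Python sketch of averaging a range answer, for illustration only. It is not the code in `clean-data.R`; the name `average_range` and the exact pattern of accepted answers are assumptions.

```python
import re
from typing import Optional

# Matches answers such as "9-10" or "9 - 10": two non-negative numbers
# separated by a hyphen, with optional surrounding whitespace.
RANGE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def average_range(answer: str) -> Optional[float]:
    """Return the midpoint of a "low-high" answer, or None if it is not a range."""
    match = RANGE_PATTERN.match(answer)
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    return (low + high) / 2


print(average_range("9-10"))          # 9.5
print(average_range("about a year"))  # None
```

Answers that do not match the pattern return `None`, which corresponds to the "when possible" qualifier in the README text.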
+
+### Years to Months
+
+Some answers to questions asking about months were given in years. These were
+converted to months where possible.
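
Note on the conversion above: a minimal Python sketch of turning a "years" answer into months, for illustration only. The helper name `years_to_months` and the recognized unit spellings (`year`, `years`, `yr`, `yrs`) are assumptions, not the rules used in `clean-data.R`.

```python
import re
from typing import Optional

# A number followed by a year unit, e.g. "2 years", "1.5 yrs", "1 Year".
YEARS_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:years?|yrs?)\s*$", re.IGNORECASE)


def years_to_months(answer: str) -> Optional[float]:
    """Convert an answer given in years to months; return None if no year unit is found."""
    match = YEARS_PATTERN.match(answer)
    if match is None:
        return None
    return float(match.group(1)) * 12


print(years_to_months("2 years"))  # 24.0
print(years_to_months("18"))       # None: no unit, so it is left for other rules
```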
+
+### Normalization of Answers
+
+Some of the free text answers were very similar to each other, differing only by
+a space or two. These register as different answers unless you look for them.
+Answers like "Cybersecurity" and "Cyber Security" are the same and were changed
+to one consistent spelling. Some variants may have been missed.
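
Note on the normalization above: a minimal Python sketch of one way to merge near-identical free text answers, for illustration only. The approach (compare answers with case and whitespace removed, then keep the most common spelling) and the names `match_key` and `normalize_answers` are assumptions; the README does not say how `clean-data.R` picks the consistent form.

```python
import re
from collections import Counter
from typing import Dict, List


def match_key(answer: str) -> str:
    """Collapse case and whitespace so near-identical answers compare equal."""
    return re.sub(r"\s+", "", answer).lower()


def normalize_answers(answers: List[str]) -> List[str]:
    """Replace each answer with the most common spelling among its near-duplicates."""
    spellings: Dict[str, Counter] = {}
    for answer in answers:
        spellings.setdefault(match_key(answer), Counter())[answer.strip()] += 1
    canonical = {key: counts.most_common(1)[0][0] for key, counts in spellings.items()}
    return [canonical[match_key(answer)] for answer in answers]


raw = ["Cyber Security", "Cybersecurity", "Cyber Security", "Data Science"]
print(normalize_answers(raw))
# ['Cyber Security', 'Cyber Security', 'Cyber Security', 'Data Science']
```

A key-based comparison like this only catches differences in spacing and capitalization; misspellings would still need manual review.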
+
+
+## Prerequisites to Rerun Data Manipulations
+
+- [R][RProj] (>= 3.2.3)
+- [dplyr][dplyrGH] (>= 0.4.3) [CRAN][dplyrCRAN]
+- [Rcpp][RcppGH] (>= 0.12.4) [CRAN][RcppCRAN]
+
+[RProj]: https://www.r-project.org/
+[dplyrGH]: https://github.com/hadley/dplyr
+[RcppGH]: https://github.com/RcppCore/Rcpp
+[dplyrCRAN]: https://cran.r-project.org/web/packages/dplyr/index.html
+[RcppCRAN]: https://cran.r-project.org/web/packages/Rcpp/index.html
+
+
+## Reproduce Cleaning and Combining of Data
+
+Running the following script will create a new file,
+`2016-New-Coders-Survey.csv`, in this directory (`clean-data/`).
+
+```shell
+git clone https://github.com/FreeCodeCamp/2016-new-coder-survey.git
+cd 2016-new-coder-survey/clean-data
+Rscript clean-data.R
+```
+
+
+## Cleaning Pipeline
+
+1. Rename column names
+2. Clean free text fields for the appropriate questions
