Commit 8428cf6

First commit
0 parents  commit 8428cf6

14 files changed

+14682
-0
lines changed

CapitalOne/Part1/codetest_test.txt

+1,001
Large diffs are not rendered by default.

CapitalOne/Part1/codetest_train.txt

+5,001
Large diffs are not rendered by default.

CapitalOne/codetest_instructions.txt

+106
@@ -0,0 +1,106 @@
--------------------------------------------------------------------------------
| Capital One Labs Coding Challenge |
--------------------------------------------------------------------------------
The purpose of this test is to test your ability to write software to collect,
normalize, store, analyze and visualize “real world” data. The test is designed
to take about four hours, but it is not timed. Please try to deliver your
results within 24 hours.

You may also use any tools or software on your computer, or that are freely
available on the Internet. We prefer that you use simpler tools to more complex
ones and that you are “lazy” in the sense of using third party APIs and
libraries as much as possible. (However, use of obscure, undocumented “black
box” libraries is discouraged.)

Do as much as you can, as well as you can. Prefer efficient, elegant solutions.
Prefer scripted analysis to unrepeatable use of GUI tools. For data security and
transfer time reasons, you have been given a relatively small data file. Prefer
solutions that do not require the full data set to be stored in memory.

There is certainly no requirement that you have previous experience working on
these kinds of problems, or that you be able to finish everything. Rather, we are
looking for an ability to research and select the appropriate tools for an
open-ended problem and implement something meaningful. We are also interested in your
ability to work on a team, which means considering how to package and deliver
your results in a way that makes it easy for us to review them. Undocumented
code and data dumps are virtually useless; commented code and a clear writeup
with elegant visuals are ideal. Also consider how asking targeted questions to
members of our team may allow you to get more done in less time.


--------------------------------------------------------------------------------
| Code Test Part 1: Model building on a synthetic dataset |
--------------------------------------------------------------------------------

We have provided two tab-delimited files along with these instructions:

- codetest_train.txt: 5,000 records x 254 features + 1 target (~18.0MB)
- codetest_test.txt : 1,000 records x 254 features (~ 3.6MB)

These two synthetic datasets were generated using the same underlying data
model. Your goal is to build a predictive model using the data in the training
dataset to predict the withheld target values from the test set.

You may use any tools available to you for this task. Ultimately, we will
assess predictive accuracy on the test set using the mean squared error metric.
You should return to us the following:

- A 1,000 x 1 text file containing 1 prediction per line for each record
  in the test dataset.

- A brief writeup describing the techniques you used to generate the
  predictions. Details such as important features and your estimates of
  predictive performance are helpful here, though not strictly
  necessary.

- (Optional) An implementable version of your model. What this would look
  like largely depends on the methods you used, but could include things
  like source code, a pickled Python object, a PMML file, etc. Please
  do not include any compiled executables. If you choose not to submit
  this, please ensure your modeling methods are adequately described
  in the writeup.

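The scoring metric and the requested output layout for Part 1 can be sketched with the standard library alone. This is only an illustration, not the approach taken in this commit: the column name `target`, the helper names, and the sample values are all hypothetical. The sketch also assumes the training file starts with a header row, which the 5,001-line count for 5,000 records suggests but does not prove. It scores a trivial "predict the training mean" baseline, a useful floor that any real model should beat.

```python
import csv
import io


def iter_targets(lines, target_column="target"):
    """Yield the target of each record from tab-delimited text with a header row.

    `target_column` is a hypothetical name; check the real header first.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        yield float(row[target_column])


def mean_squared_error(y_true, y_pred):
    """Mean squared error, accumulated in one pass over two equal-length iterables."""
    total, n = 0.0, 0
    for t, p in zip(y_true, y_pred):
        total += (t - p) ** 2
        n += 1
    if n == 0:
        raise ValueError("no records to score")
    return total / n


def write_predictions(out, predictions):
    """Write one prediction per line, the layout requested for the 1,000 x 1 file."""
    for p in predictions:
        out.write(f"{p}\n")


# Tiny in-memory stand-in for codetest_train.txt (made-up values).
sample = "target\tf_0\tf_1\n1.0\t0.2\ta\n3.0\t0.4\tb\n5.0\t0.6\tc\n"

targets = list(iter_targets(io.StringIO(sample)))
baseline = sum(targets) / len(targets)  # constant prediction: the training mean
baseline_mse = mean_squared_error(targets, [baseline] * len(targets))

buffer = io.StringIO()
write_predictions(buffer, [baseline] * 2)
```

With a real file, `io.StringIO(sample)` would be replaced by an open file handle, so rows are read one at a time rather than loaded all at once.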
--------------------------------------------------------------------------------
| Code Test Part 2: Baby Names! |
--------------------------------------------------------------------------------

In this section, you will acquire and analyze a real dataset on baby name
popularity provided by the Social Security Administration. To warm up, we will
ask you a few simple questions that can be answered by inspecting the data.

A) Descriptive analysis

The data can be downloaded in zip format from:
http://www.ssa.gov/oact/babynames/state/namesbystate.zip

1. Please describe the format of the data files. Can you identify any
   limitations or distortions of the data?
2. What is the most popular name of all time? (Of either gender.)
3. What is the most gender ambiguous name in 2013? 1945?
4. Of the names represented in the data, find the name that has had the largest
   percentage increase in popularity since 1980. Largest decrease?
5. Can you identify names that may have had an even larger increase or decrease
   in popularity?

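Questions 3 and 4 above can be sketched as follows. This is a sketch under stated assumptions, not the analysis in this commit. It assumes each data row has the form `state,sex,year,name,count` with no header, which should be checked against the downloaded files. "Gender ambiguous" is read here as "female share closest to 50%", and "popularity" as the raw number of births; both are interpretations, and a share of all births in the year would be an equally defensible measure. The function names and sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def national_counts(lines):
    """Sum births per (year, name, sex) across all states.

    Assumes rows of the form state,sex,year,name,count with no header.
    """
    totals = defaultdict(int)
    for state, sex, year, name, count in csv.reader(lines):
        totals[(int(year), name, sex)] += int(count)
    return totals


def most_ambiguous(totals, year):
    """Return the name whose female share in `year` is closest to 50%.

    Ties are broken in favour of the name with more births.
    """
    by_name = defaultdict(lambda: {"F": 0, "M": 0})
    for (y, name, sex), n in totals.items():
        if y == year:
            by_name[name][sex] += n

    def key(item):
        _, c = item
        total = c["F"] + c["M"]
        return (abs(c["F"] / total - 0.5), -total)

    candidates = [item for item in by_name.items() if sum(item[1].values()) > 0]
    return min(candidates, key=key)[0]


def percent_change(totals, name, start_year, end_year):
    """Percentage change in a name's births (both sexes) between two years.

    Returns None when the name has no recorded births in `start_year`,
    because the change is then undefined.
    """
    def births(year):
        return totals.get((year, name, "F"), 0) + totals.get((year, name, "M"), 0)

    start, end = births(start_year), births(end_year)
    if start == 0:
        return None
    return 100.0 * (end - start) / start


# Made-up rows in the assumed layout.
sample = """\
CA,F,1980,Riley,10
CA,M,1980,Riley,40
TX,F,2013,Riley,60
TX,M,2013,Riley,60
CA,F,2013,Emma,200
CA,M,2013,Liam,180
"""

totals = national_counts(io.StringIO(sample))
ambiguous_2013 = most_ambiguous(totals, 2013)
riley_change = percent_change(totals, "Riley", 1980, 2013)
emma_change = percent_change(totals, "Emma", 1980, 2013)
```

Names with no recorded births in the start year return `None` rather than a number. Those are the cases question 5 hints at: a name absent from the 1980 data may have grown by more than any name whose increase can actually be computed.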
B) Onward to Insight!

What insight can you extract from this dataset? Feel free to combine the baby
names data with other publicly available datasets or APIs, but be sure to include
code for accessing any alternative data that you use.

This is an open-ended question and you are free to answer as you see fit. In
fact, we would love it if you find an interesting way to look at the data that
we haven't thought of!

Please provide us with both your code and an informative write-up of your
results. The code should be in a runnable form. Do not assume that we have a
copy of the data set or that we are familiar with the build procedures for your
chosen language.

If you do not have time to implement your solution, a detailed, actionable
description of how you would attack the problem would also count in your favor.


Good luck!

CivisAnalytics/CivisTest.docx.docx

8.36 KB
Binary file not shown.

Mattersight/documents-export-2015-11-15/CA_Crimes_Data Dictionary.csv

+1
@@ -0,0 +1 @@
Column,Meaning
Community.Area,A community area in Chicago
Week,Week in the year (starting on Sunday)
Year,Year
Weeek,Week in the year (starting on Sunday)
Crimes,Reported crimes for the community area for a given week in a given year
Crimes.LastWeek,Reported crimes for the community area for the previous week of a given year
Arrests.LastWeek,Number of arrests for the community area for the previous week of a given year
Domestics.LastWeek,Number of domestic crimes reported for the community area for the previous week of a given year
Month,Calendar month
MinDay,"Smallest number calendar day in the specified week (e.g. if week starts on Feb 7 and ends Feb 13, MinDay=7)"
MaxDay,"Largest number calendar day in the specified week (e.g. if week starts on Feb 7 and ends Feb 13, MaxDay=13)"
CommonCrimes.LastWeek,Number of common crimes (i.e. crimes with codes representing greater than 33% of all reported crimes) reported for the community area for the previous week of a given year
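The MinDay and MaxDay definitions can be made concrete with a short sketch. It applies the dictionary's wording literally: take the day-of-month of each of the seven days in the week, then report the smallest and largest. The function name and the example dates are illustrative assumptions. How the actual dataset handles weeks that straddle two months is not stated in the dictionary, so the second example shows only what a literal reading would give.

```python
from datetime import date, timedelta


def min_max_day(week_start):
    """Return (MinDay, MaxDay) for the 7-day week beginning on `week_start`."""
    days = [(week_start + timedelta(days=i)).day for i in range(7)]
    return min(days), max(days)


# Week of Feb 7 to Feb 13, matching the example in the dictionary.
within_month = min_max_day(date(2016, 2, 7))

# Week of Mar 29 to Apr 4: the day numbers wrap around the month boundary.
across_months = min_max_day(date(2015, 3, 29))
```

Under this reading, a week that crosses a month boundary has a MinDay from the new month and a MaxDay from the old one, so the two values no longer mark the start and end of the week.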
@@ -0,0 +1,25 @@
Community.Area,Week,PREDICTED.RANK
1,1,2
2,1,3
3,1,4
4,1,5
5,1,6
6,1,1
1,2,3
2,2,4
3,2,5
4,2,6
5,2,1
6,2,2
1,3,4
2,3,5
3,3,6
4,3,1
5,3,2
6,3,3
1,4,5
2,4,6
3,4,1
4,4,2
5,4,3
6,4,4
Binary file not shown.
