Commit d332f58

author
zoupeicheng
committed
added business part short discussion
1 parent e2c31bc commit d332f58

2 files changed

+85
-2
lines changed

README.md

Lines changed: 5 additions & 2 deletions
@@ -7,9 +7,11 @@ Python Data Science Toolbox
77
This is my personal open notebook for fundamental Data Science in Python in the workplace.
88
Unlike many books, this notebook is quite superficial. It covers the major steps of a simple data analysis and includes many simple code examples.
99
I learnt these tools from many sources; you are welcome to browse and edit this notebook.
10-
This notebook is just a **reminder of what is available out there**. It does not offer any mathematical, technical, or business details.
10+
This notebook is just a **reminder of what is available out there**. It's under development.
1111

1212
You can visit this [handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) for technical details.
13+
For beginners who want to learn the fundamentals of ML, I recommend Andrew Ng's ML courses on Coursera.
14+
1315

1416
## Quick advice
1517

@@ -19,10 +21,11 @@ There are many great sources to learn Data Science, and here are some advice to
1921
2. Get to know the general idea and areas of knowledge about data science.
2022
3. Practice as you go. Google or Bing any term or question you don't understand. Stack Overflow and the documentation for specific packages and functions are your best friends. During this process, do not lose sight of the big picture of data science.
2123
4. Become a better data scientist by doing more projects! (Don't try to memorize these tools, just do data science!)
22-
5. Once you are comfortable with all the materials in this notebook and are engaged in data analysis in business, you will know what skills to pick up next.
24+
2325

2426
## Materials in this notebook
2527

28+
0. [Road Map in Business](roadmap.md)
2629
1. [Environment Configuration](EnvironmentConfiguration.md)
2730
2. [Data Processing](DataProcessing.md)
2831
* Getting Data

roadmap.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
# Data Science for Business
2+
The vast majority of the material here comes from this [book](https://www.amazon.com/Data-Science-Business-Data-Analytic-Thinking-ebook/dp/B00E6EQ3Xs).
3+
4+
## Main structure of data science in the workplace
5+
The Cross-Industry Standard Process for Data Mining (CRISP-DM)
6+
See [wiki](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).
7+
8+
The core of data mining: automated discovery of patterns, knowledge, and regularities.
9+
10+
### Classic Tasks
11+
12+
* Classification
13+
* Regression
14+
* Similarity matching
15+
* Clustering
16+
* Co-occurrence grouping
17+
* Profiling
18+
* Link Prediction
19+
* Data Reduction
20+
* Causal modeling
21+
22+
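Three of the tasks above (similarity matching, classification, and regression) can be sketched with one nearest-neighbour idea. This is only a minimal illustration: the customer features, labels, and revenue figures are made up, and the function names are hypothetical, not taken from the book.

```python
import math
from collections import Counter

# Hypothetical toy data: (monthly_visits, avg_basket_value), label, yearly revenue.
customers = [
    ((2.0, 10.0), "unprofitable", 40.0),
    ((3.0, 12.0), "unprofitable", 55.0),
    ((8.0, 30.0), "profitable", 300.0),
    ((9.0, 35.0), "profitable", 340.0),
    ((10.0, 28.0), "profitable", 310.0),
]


def nearest(query, k=3):
    """Similarity matching: return the k customers closest to `query`."""
    return sorted(customers, key=lambda c: math.dist(c[0], query))[:k]


def classify(query, k=3):
    """Classification: majority label among the k nearest customers."""
    labels = [label for _, label, _ in nearest(query, k)]
    return Counter(labels).most_common(1)[0][0]


def predict_revenue(query, k=3):
    """Regression: average revenue of the k nearest customers."""
    revenues = [revenue for _, _, revenue in nearest(query, k)]
    return sum(revenues) / len(revenues)


new_customer = (8.5, 31.0)
print(classify(new_customer))         # a class label
print(predict_revenue(new_customer))  # a numeric estimate
```

The same neighbours answer both questions: classification returns a category, while regression returns a number.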
23+
### Phases of CRISP-DM
24+
25+
* Business Understanding: Formulate the business problem as one or more unambiguous data mining problems
26+
* What exactly do we want to do?
27+
* How exactly would we do it?
28+
* What parts of this use scenario constitute possible data mining models?
29+
* Data Understanding
30+
* How reliable is the data for our task?
31+
* What is the cost of getting data?
32+
* How does the data affect our approach? Note that superficially similar tasks can call for distinct approaches because different data are available.
33+
* Business understanding + data understanding determines possible solutions.
34+
* Data Preparation
35+
* creative, sensible and business-minded variable crafting
36+
* systematic data processing/cleaning
37+
* Pay Special Attention to Data Leakage
38+
* Modeling
39+
* Most technical and scientific part. The others are arts (joking).
40+
* Evaluation: in business context, not in the lab.
41+
* Quantitative and qualitative assessments
42+
* Stakeholders considerations: pros and cons
43+
* Comprehensibility of the model: how can the model be made more comprehensible?
44+
* Do this *Before the deployment*
45+
* How susceptible is the model to changing behaviour of the data source?
46+
* The deployed model is what developers build, so it is advisable to include developers in data science projects
47+
48+
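One common form of the data leakage mentioned under Data Preparation is letting preprocessing see the evaluation rows. The sketch below is only an illustration under assumed toy values: it compares scaling parameters fitted on all rows (leaky) with parameters fitted on the training rows alone. The helper names `fit_scaler` and `transform` are hypothetical.

```python
import statistics

# Hypothetical feature values, already ordered in time.
values = [10.0, 12.0, 11.0, 13.0, 50.0, 55.0]
train, test = values[:4], values[4:]


def fit_scaler(sample):
    """Learn the mean and population standard deviation from one sample."""
    return statistics.mean(sample), statistics.pstdev(sample)


def transform(sample, mean, std):
    """Standardize a sample with previously learned parameters."""
    return [(x - mean) / std for x in sample]


# Leaky: the scaler has already seen the test rows before evaluation.
leaky_mean, leaky_std = fit_scaler(values)

# Safer: fit on the training rows only, then reuse those parameters on the test rows.
train_mean, train_std = fit_scaler(train)
test_scaled = transform(test, train_mean, train_std)

print(leaky_mean, train_mean)
print(test_scaled)
```

Leakage can also come from a variable that would not be available at decision time; that case needs a review of each column's origin rather than a code fix.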
49+
### Side Remark: Managing a data science team
50+
* Data science tasks are exploratory undertakings by nature and are closer to research and development than to engineering.
51+
* Iterate on approaches and strategy rather than on software designs
52+
* Outcomes are far less certain
53+
* Results of each step change the understanding of the problem
54+
* Do not engineer a solution directly for deployment: most of the effort should go to analytical testing, pilot studies and throwaway prototypes to reduce risk.
55+
* In building a data science team, the most important qualities are:
56+
* Formulate problems well
57+
* Make reasonable assumptions in the face of ill-structured problems
58+
* Prototype solutions quickly
59+
* Design experiments that represent good investments
60+
* Ability to analyze the results
61+
* NOT traditional software engineering expertise
62+
63+
### Related Skills
64+
* Statistics
65+
* Querying Databases
66+
* Data Warehousing
67+
* Machine Learning or Applied Statistics or Pattern Recognition
68+
* Answer Business Questions with These Techniques
69+
* Who are the most profitable customers? Querying DB
70+
* Is there really a difference between the profitable customers and the average customer? Hypothesis Testing
71+
* However, who really are these customers? Can I characterize them? Find patterns that differentiate profitable customers from unprofitable ones.
72+
* Will some particular new customer be profitable? How much revenue should I expect this customer to generate?
73+
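The first two business questions above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `orders` table, the customer names, and the spend figures are invented, and a permutation test is just one possible choice of hypothesis test.

```python
import random
import sqlite3
import statistics

# Querying a database: who are the most profitable customers?
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 120.0), ("ann", 80.0), ("bob", 15.0), ("cat", 60.0), ("bob", 5.0)],
)
top_customers = conn.execute(
    "SELECT customer, SUM(profit) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC LIMIT 2"
).fetchall()
print(top_customers)


# Hypothesis testing: is the difference between two groups real?
def permutation_p_value(group_a, group_b, n_rounds=2000, seed=0):
    """One-sided p-value for mean(group_a) > mean(group_b) under random relabelling."""
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:size_a]) - statistics.mean(pooled[size_a:])
        if diff >= observed:
            hits += 1
    return hits / n_rounds


profitable_spend = [52.0, 61.0, 58.0, 66.0, 59.0]
average_spend = [41.0, 39.0, 45.0, 43.0, 40.0]
p_value = permutation_p_value(profitable_spend, average_spend)
print(p_value)
```

A small p-value suggests the gap between the groups is unlikely to come from random labelling alone. The last two questions (characterizing customers and predicting profitability or revenue) are where pattern finding, classification, and regression come in.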
74+
### Summary
75+
* Several fields of study are closely related to data science; each task type serves a different purpose and has an associated set of solution techniques
76+
* Data scientists combine these components
77+
* A successful data project involves an intelligent compromise between what the data can do and project goals
78+
* Keep in mind how the data mining results will be used, and let that inform the data mining process itself.
79+
80+
