Commit 030d860

committed
Added pandas and sklearn notebooks
1 parent b369db1 commit 030d860

3 files changed

+470
-1337
lines changed

pandas-example.ipynb

+273
@@ -0,0 +1,273 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Operating on Data in Pandas"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Copied from https://github.com/jakevdp/PythonDataScienceHandbook"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).\n",
+    "Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) are key to this.\n",
+    "\n",
+    "Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.\n",
+    "This means that keeping the context of data and combining data from different sources, both potentially error-prone tasks with raw NumPy arrays, become essentially foolproof ones with Pandas.\n",
+    "We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 121,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When two objects are combined, any item for which one or the other does not have an entry is marked with ``NaN``, or \"Not a Number,\" which is how Pandas marks missing data (see further discussion of missing data in [Handling Missing Data](03.04-Missing-Values.ipynb)).\n",
+    "This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 122,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "A = pd.Series([2, 4, 6], index=[0, 1, 2])\n",
+    "B = pd.Series([1, 3, 5], index=[1, 2, 3])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.\n",
+    "For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 123,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "0    2.0\n",
+       "1    5.0\n",
+       "2    9.0\n",
+       "3    5.0\n",
+       "dtype: float64"
+      ]
+     },
+     "execution_count": 123,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "A.add(B, fill_value=0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.\n",
+    "As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.\n",
+    "Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rng = np.random.RandomState(42)\n",
+    "A = pd.DataFrame(rng.randint(0, 20, (2, 2)),\n",
+    "                 columns=list('AB'))\n",
+    "B = pd.DataFrame(rng.randint(0, 10, (3, 3)),\n",
+    "                 columns=list('BAC'))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 127,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>A</th>\n",
+       "      <th>B</th>\n",
+       "      <th>C</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>19.00</td>\n",
+       "      <td>20.00</td>\n",
+       "      <td>16.75</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>8.00</td>\n",
+       "      <td>3.00</td>\n",
+       "      <td>12.75</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>16.75</td>\n",
+       "      <td>10.75</td>\n",
+       "      <td>12.75</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "       A      B      C\n",
+       "0  19.00  20.00  16.75\n",
+       "1   8.00   3.00  12.75\n",
+       "2  16.75  10.75  12.75"
+      ]
+     },
+     "execution_count": 127,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "fill = A.stack().mean()\n",
+    "A.add(B, fill_value=fill)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Ufuncs: Operations Between DataFrame and Series\n",
+    "\n",
+    "When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.\n",
+    "Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.\n",
+    "Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 128,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[1, 5, 5, 9],\n",
+       "       [3, 5, 1, 9],\n",
+       "       [1, 9, 3, 7]])"
+      ]
+     },
+     "execution_count": 128,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "A = rng.randint(10, size=(3, 4))\n",
+    "A"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 129,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "array([[ 0,  0,  0,  0],\n",
+       "       [ 2,  0, -4,  0],\n",
+       "       [ 0,  4, -2, -2]])"
+      ]
+     },
+     "execution_count": 129,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "A - A[0]"
+   ]
+  }
+ ],
+ "metadata": {
+  "anaconda-cloud": {},
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
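
The index alignment that pandas-example.ipynb describes for two `Series` can be checked with a minimal sketch. It reuses the `A` and `B` values from the notebook; the names `plain_sum` and `filled_sum` are introduced here for illustration only and are not part of the commit.

```python
import pandas as pd

# Same two Series as in the notebook: their index labels only partly overlap.
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Operator form: a label present in only one operand yields NaN.
plain_sum = A + B

# Method form: the missing operand is replaced by fill_value before adding.
filled_sum = A.add(B, fill_value=0)

print(plain_sum)   # labels 0 and 3 are NaN, labels 1 and 2 are 5.0 and 9.0
print(filled_sum)  # 2.0, 5.0, 9.0, 5.0 for labels 0..3
```

In both cases the result carries the union of the two indices, and integer inputs come back as `float64`. With the operator form that is because `NaN` needs a floating-point dtype.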

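The `DataFrame` fill and the `DataFrame`/`Series` subtraction from the same notebook can be sketched with fixed values, so the result does not depend on random state. The frame contents, the column names `Q`, `R`, `S`, `T`, and the variable names below are assumptions chosen for illustration. The mean is taken with `to_numpy().mean()`, which should give the same value as the notebook's `A.stack().mean()`.

```python
import numpy as np
import pandas as pd

# Hand-written frames (illustrative values, not the notebook's random data).
A = pd.DataFrame([[1, 11], [5, 1]], columns=list("AB"))
B = pd.DataFrame([[4, 0, 9], [5, 8, 0], [9, 2, 6]], columns=list("BAC"))

# Mean over every value in A, used wherever one frame lacks an entry.
fill = A.to_numpy().mean()          # (1 + 11 + 5 + 1) / 4 = 4.5
result = A.add(B, fill_value=fill)  # columns A, B, C; rows 0, 1, 2
print(result)

# DataFrame minus one of its rows: by default the Series is aligned on
# column labels and subtracted from every row, like arr - arr[0] in NumPy.
arr = np.array([[1, 5, 5, 9], [3, 5, 1, 9], [1, 9, 3, 7]])
df = pd.DataFrame(arr, columns=list("QRST"))
row_diff = df - df.iloc[0]

# To subtract a column from every column instead, align on the row index.
col_diff = df.subtract(df["R"], axis=0)
print(row_diff)
print(col_diff)
```

Column `C` exists only in `B`, so each of its entries is `fill` plus the value from `B`. Row `2` exists only in `B` and is handled the same way.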
sklearn-example.ipynb

+197
Large diffs are not rendered by default.
