Commit 55d2422

committed
modify for OS training
1 parent dec49ab commit 55d2422

File tree

9 files changed

+802
-115
lines changed

Diff for: _layouts/default.html

+5-3
Original file line numberDiff line numberDiff line change
@@ -21,13 +21,15 @@
2121
<li><a href="./strings">Simple Types</a></li>
2222
<li><a href="./lists">Lists</a></li>
2323
<li><a href="./sorting">Sorting</a></li>
24-
<li><a href="./dict-files">Dict and Files</a></li>
25-
<li><a href="./regular-expressions">Regular expressions</a></li>
26-
<li><a href="./utilities">Utilities</a></li>
24+
<li><a href="./dict">Dicts</a></li>
25+
<li><a href="./files">Files</a></li>
26+
<li><a href="./requests">Fetching Data from the Internet</a></li>
27+
<li><a href="./analysis">Simple Data Analysis</a></li>
2728
<li>
2829
Exercises
2930
<ol>
3031
<li><a href="./basic">Basic exercises</a></li>
32+
<li><a href="./names">OS Names API</a></li>
3133
<li><a href="./mortality">Life expectancy tables</a></li>
3234
<li><a href="./copy-special">Copy special</a></li>
3335
<li><a href="./log-puzzle">Log puzzle</a></li>

Diff for: analysis.md

+279
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,279 @@
1+
# Data Analysis in Python
2+
3+
## NumPy
4+
5+
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a
6+
multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment
7+
of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting,
8+
selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random
9+
simulation and much more.
10+
11+
At the core of the NumPy package is the `ndarray` object. This encapsulates n-dimensional arrays of
12+
homogeneous data types, with many operations being performed in compiled code for performance. There are
13+
several important differences between NumPy arrays and the standard Python sequences:
14+
15+
+ NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the
16+
size of an `ndarray` will create a new array and delete the original.
17+
18+
+ The elements in a NumPy array are all required to be of the same data type, and thus will be the same size
19+
in memory.
20+
21+
These restrictions, together with the compiled code used in the NumPy functions, make NumPy arrays much faster than
22+
normal Python lists for many operations.
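
The two restrictions above can be seen in a minimal sketch (the variable names are illustrative only):

```python
import numpy as np

a = np.array([1, 2, 3])

# np.append does not grow `a` in place; it returns a new, larger array.
b = np.append(a, 4)
print(a.size, b.size)   # 3 4

# Mixing ints and floats upcasts every element to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)      # float64

# Every element occupies the same number of bytes.
print(mixed.itemsize)   # 8
```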
23+
24+
## Creating arrays from lists
25+
26+
`np.array([1,2,3,5,5])`
27+
28+
`np.array([range(i, i + 3) for i in [2, 4, 6]])` gives:
29+
30+
~~~python
31+
array([[2, 3, 4], [4, 5, 6], [6, 7, 8]])
32+
~~~
33+
34+
To create an array full of `0`s use `np.zeros(10, dtype=int)`. You can also use `np.arange(0,20,2)` to produce
35+
the even numbers from 0 up to (but not including) 20, and `np.linspace(0,1,5)` to give 5 values evenly spaced between 0 and 1 inclusive.
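
As a quick check of what these three constructors return, here is a short sketch (the variable names are illustrative only). Note that `np.arange` excludes its stop value, while `np.linspace` includes both end points:

```python
import numpy as np

zeros = np.zeros(10, dtype=int)   # ten integer zeros
evens = np.arange(0, 20, 2)       # 0, 2, ..., 18 (the stop value is excluded)
fifths = np.linspace(0, 1, 5)     # 0.0, 0.25, 0.5, 0.75, 1.0 (both ends included)

print(zeros)
print(evens)
print(fifths)
```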
36+
37+
NumPy arrays have attributes:
38+
+ `ndim`: the number of dimensions
39+
+ `shape`: the size of each dimension, as a tuple
40+
+ `size`: the total number of elements in the array
41+
+ `dtype`: the type of the elements of the array (int, float32, etc.)
42+
43+
You can slice arrays as you would lists or strings, including from the end using negative indices.
44+
Multidimensional arrays are accessed as `x[1,2]` for the 3rd element of the 2nd row.
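
The attributes and indexing rules can be tried on a small array. This is a sketch using an arbitrary 3 x 4 array:

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11

print(x.ndim)     # 2
print(x.shape)    # (3, 4)
print(x.size)     # 12
print(x.dtype)    # a platform-dependent integer type

print(x[1, 2])    # 6: row index 1 (2nd row), column index 2 (3rd element)
print(x[-1])      # the last row
print(x[0, -2:])  # the last two elements of the first row
```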
45+
46+
### Fast Functions on Arrays
47+
48+
Computations can use vectorization through the use of `ufuncs`. These are nearly always more efficient than
49+
their counterpart implemented through Python loops, especially as the arrays grow in size. Any time you see
50+
such a loop in a Python script, you should consider whether it can be replaced with a vectorized expression.
51+
52+
Simply apply the `ufunc` to the array, e.g.:
53+
54+
~~~python
55+
import numpy as np
56+
np.random.seed(0)
57+
def compute_reciprocals(values):
58+
    output = np.empty(len(values))
59+
    for i in range(len(values)):
60+
        output[i] = 1.0 / values[i]
61+
    return output
62+
63+
values = np.random.randint(1, 100, size=1000000)
64+
compute_reciprocals(values)
65+
# calculate reciprocal for all values
66+
print(1.0/values)
67+
~~~
68+
69+
The loop-based function takes over 2 seconds, while `1.0/values` takes about 4 ms.
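
One way to reproduce this comparison is with the standard `timeit` module. The sketch below uses a smaller array (10,000 values, an arbitrary choice) so that it finishes quickly; absolute timings will vary by machine:

```python
import timeit
import numpy as np

def compute_reciprocals(values):
    """Loop-based reciprocal, one element at a time."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

np.random.seed(0)
values = np.random.randint(1, 100, size=10_000)  # smaller array to keep the run short

# Time five runs of each approach.
loop_time = timeit.timeit(lambda: compute_reciprocals(values), number=5)
vector_time = timeit.timeit(lambda: 1.0 / values, number=5)
print(f"loop: {loop_time:.4f}s  vectorized: {vector_time:.4f}s")
```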
70+
71+
This also works for array-to-array operations, including on multidimensional arrays.
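
A small sketch of element-wise operations between arrays and on a 2D array (the arrays are arbitrary examples):

```python
import numpy as np

a = np.arange(5)          # [0 1 2 3 4]
b = np.arange(1, 6)       # [1 2 3 4 5]
print(a + b)              # [1 3 5 7 9]
print(a / b)              # element-wise division of two arrays

grid = np.arange(9).reshape(3, 3)
print(2 ** grid)          # the operation is applied to every element of the 2D array
```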
72+
73+
You can also use the `reduce()` method of a `ufunc` to act across an array: `np.add.reduce(x)` gives the sum of the array
74+
elements, while `np.multiply.reduce(x)` gives the product of the array elements.
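
For example, on a small array (a sketch; `accumulate()` is a related `ufunc` method that keeps the intermediate results):

```python
import numpy as np

x = np.arange(1, 6)            # [1 2 3 4 5]
print(np.add.reduce(x))        # 15
print(np.multiply.reduce(x))   # 120
print(np.add.accumulate(x))    # running totals: [ 1  3  6 10 15]
```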
75+
76+
`np.sum()` is a fast version of the Python `sum()` function, and `np.min()` and `np.max()` similarly replace `min()` and `max()`. They all take
77+
an `axis` parameter that selects the dimension to reduce over (rows, columns, etc.).
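
The effect of `axis` is easiest to see on a small 2D array (a sketch with arbitrary values):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))           # 21: all elements
print(np.sum(m, axis=0))   # [5 7 9]: collapses the rows, one sum per column
print(np.sum(m, axis=1))   # [ 6 15]: collapses the columns, one sum per row
print(np.min(m, axis=0))   # [1 2 3]
print(np.max(m, axis=1))   # [3 6]
```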
78+
79+
80+
## Pandas
81+
82+
Pandas is an open source Python package that is most widely used for data science/data analysis and machine
83+
learning tasks. It is built on top of another package named NumPy, which provides support for
84+
multi-dimensional arrays. As one of the most popular data wrangling packages, Pandas works well with many
85+
other data science modules inside the Python ecosystem, and is included in many scientific Python
86+
distributions, such as Anaconda.
87+
88+
### A generalised NumPy array
89+
90+
~~~python
91+
>>> import pandas as pd
92+
>>> import numpy as np
93+
>>> data = pd.Series([0.25, 0.5, 0.75, 1.0])
94+
>>> data
95+
0 0.25
96+
1 0.50
97+
2 0.75
98+
3 1.00
99+
dtype: float64
100+
>>> data.index
101+
RangeIndex(start=0, stop=4, step=1)
102+
>>>
103+
~~~
104+
105+
As this example shows, the main difference between NumPy and Pandas is that the index of the `Series` is explicit in
106+
Pandas while it is implicit in NumPy. There is no need for the index to be a simple range; any list will work.
107+
108+
~~~python
109+
>>> data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
110+
>>> data
111+
a 0.25
112+
b 0.50
113+
c 0.75
114+
d 1.00
115+
dtype: float64
116+
>>> data['b']
117+
0.5
118+
>>> data.index
119+
Index(['a', 'b', 'c', 'd'], dtype='object')
120+
~~~
121+
122+
A `Series` can also be thought of as a specialised `dict`, and can be created from a `dict`, with the keys of the `dict` being
123+
used as the index elements:
124+
125+
~~~py
126+
>>> population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135}
127+
>>> population = pd.Series(population_dict)
128+
>>> population
129+
California 38332521
130+
Texas 26448193
131+
New York 19651127
132+
Florida 19552860
133+
Illinois 12882135
134+
dtype: int64
135+
>>>
136+
~~~
137+
138+
Unlike a `dict`, a `Series` also supports slicing, even on essentially unordered indexes:
139+
140+
~~~py
141+
>>> population['California':'New York']
142+
California 38332521
143+
Texas 26448193
144+
New York 19651127
145+
dtype: int64
146+
>>>
147+
~~~
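
Note that a label slice includes its end label, unlike a positional slice. A short sketch using the explicit `.loc` (label-based) and `.iloc` (position-based) accessors makes the difference visible:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})

by_label = population.loc['California':'New York']   # label slice: end label included
by_position = population.iloc[0:2]                   # positional slice: end excluded

print(list(by_label.index))      # ['California', 'Texas', 'New York']
print(list(by_position.index))   # ['California', 'Texas']
```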
148+
149+
A `DataFrame` is a generalised NumPy multidimensional array that consists of one or more `Series`:
150+
151+
~~~py
152+
>>> area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
153+
>>> area = pd.Series(area_dict)
154+
>>> area
155+
California 423967
156+
Texas 695662
157+
New York 141297
158+
Florida 170312
159+
Illinois 149995
160+
dtype: int64
161+
>>> states = pd.DataFrame({'population': population, 'area': area})
162+
>>> states
163+
population area
164+
California 38332521 423967
165+
Texas 26448193 695662
166+
New York 19651127 141297
167+
Florida 19552860 170312
168+
Illinois 12882135 149995
169+
>>>
170+
~~~
171+
172+
So in addition to the `index` giving access to the rows, we also have `columns`, giving a generalised
173+
2-dimensional array. We can also treat the `DataFrame` as a dictionary of columns, allowing access by column name. Access by
174+
a single row index value is not possible this way, but it does work with a slice or filter expression:
175+
176+
~~~py
177+
>>> states['area']
178+
California 423967
179+
Texas 695662
180+
New York 141297
181+
Florida 170312
182+
Illinois 149995
183+
Name: area, dtype: int64
184+
>>> states['California']
185+
Traceback (most recent call last):
186+
File "/usr/lib/python3/dist-packages/pandas/core/indexes/base.py", line 3361, in get_loc
187+
return self._engine.get_loc(casted_key)
188+
File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
189+
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
190+
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
191+
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
192+
KeyError: 'California'
193+
~~~
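
To reach rows by label, one option is the `.loc` accessor. The sketch below rebuilds the `states` frame from the examples above and shows a single row, a row slice and a boolean filter; the 200000 threshold is an arbitrary value chosen for illustration:

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})

print(states.loc['California'])          # one row, selected by index label
print(states.loc['Texas':'Florida'])     # a label slice of rows (end included)
print(states[states['area'] > 200000])   # a boolean filter selects matching rows
```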
194+
195+
You can construct a `DataFrame` from a `Series`, a `dict` of `Series`, a 2D NumPy array or a structured NumPy
196+
array.
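
A minimal sketch of three of these construction routes (the dict-of-`Series` route is shown above). The column names, labels and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

# From a single named Series: the Series name becomes the column name.
from_series = pd.DataFrame(population.rename('population'))

# From a 2D NumPy array, with explicit column and index labels.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=['foo', 'bar'], index=['a', 'b', 'c'])

# From a structured NumPy array: field names become column names.
records = np.array([(1, 2.0), (3, 4.0)], dtype=[('A', 'i8'), ('B', 'f8')])
from_structured = pd.DataFrame(records)

print(from_series)
print(from_array)
print(from_structured)
```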
197+
198+
#### Data access
199+
200+
You can use a `DataFrame` as a dict, `data['area']`, or via attributes, `data.area`, provided the column name is a string and doesn't match a method of `DataFrame` (e.g. `pop`).
201+

202+
You can create new columns in the frame: `data['density'] = data['pop']/data['area']`.
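
The sketch below puts these access styles together on a small made-up frame. It also shows why a column called `pop` cannot be reached as an attribute:

```python
import pandas as pd

data = pd.DataFrame({'pop': [38332521, 26448193], 'area': [423967, 695662]},
                    index=['California', 'Texas'])

# Dictionary-style and attribute-style access return the same column.
print(data['area'].equals(data.area))   # True

# `data.pop` is the DataFrame.pop method, not the 'pop' column,
# so that column is only reachable with the dictionary style.
print(callable(data.pop))               # True

# Element-wise division of two columns creates the new column.
data['density'] = data['pop'] / data['area']
print(data)
```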
203+
204+
205+
### Reading a file with Pandas
206+
207+
~~~python
208+
import pandas as pd
209+
210+
data = pd.read_csv('datafile.csv')
211+
heights = data['height(cm)'].to_numpy()
212+
~~~
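
To try this without a file on disk, `pd.read_csv` also accepts a file-like object. The sketch below uses an in-memory CSV; the names and heights are made-up sample data:

```python
import io
import pandas as pd

# Made-up sample data standing in for datafile.csv.
csv_text = "name,height(cm)\nA,180\nB,165\nC,172\n"

data = pd.read_csv(io.StringIO(csv_text))
heights = data['height(cm)'].to_numpy()   # the column as a NumPy array

print(heights.mean(), heights.min(), heights.max())
```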
213+
214+
## GeoPandas
215+
216+
GeoPandas, as the name suggests, extends the popular data science library pandas by adding support for
217+
geospatial data. The core data structure in GeoPandas is the `geopandas.GeoDataFrame`, a subclass of
218+
`pandas.DataFrame`, that can store geometry columns and perform spatial operations. The `geopandas.GeoSeries`,
219+
a subclass of `pandas.Series`, handles the geometries. Therefore, your `GeoDataFrame` is a combination of
220+
`pandas.Series`, with traditional data (numerical, boolean, text etc.), and `geopandas.GeoSeries`, with
221+
geometries (points, polygons etc.). You can have as many columns with geometries as you wish.
222+
223+
A `GeoDataFrame` can be constructed from a spatial data file, such as a geopackage or shapefile:
224+
225+
~~~py
226+
>>> import geopandas as gp
227+
>>> gdf = gp.read_file('/data/os-data/national-parks.shp')
228+
>>> gdf
229+
CODE ... geometry
230+
0 5820 ... POLYGON ((317635.000 770725.000, 317078.000 77...
231+
1 5820 ... POLYGON ((308047.000 500000.000, 308050.000 50...
232+
2 5820 ... POLYGON ((259611.000 300000.000, 259763.000 30...
233+
3 5820 ... POLYGON ((209870.000 700000.000, 209940.000 70...
234+
4 5820 ... POLYGON ((400000.000 453080.000, 399940.000 45...
235+
5 5820 ... POLYGON ((550228.000 100000.000, 550138.000 10...
236+
6 5820 ... POLYGON ((500000.000 487140.000, 499650.000 48...
237+
7 5820 ... POLYGON ((400000.000 360330.000, 399930.000 36...
238+
8 5820 ... POLYGON ((300000.000 208070.000, 299817.000 20...
239+
9 5820 ... POLYGON ((400000.000 594660.000, 399230.000 59...
240+
10 5820 ... POLYGON ((249830.000 73470.000, 249590.000 736...
241+
11 5820 ... MULTIPOLYGON (((232234.000 600000.000, 232055....
242+
12 5820 ... POLYGON ((300000.000 132806.000, 299600.000 13...
243+
13 5820 ... POLYGON ((416177.000 99662.000, 416070.000 997...
244+
14 5820 ... POLYGON ((637924.000 300000.000, 637885.000 30...
245+
15 5820 ... MULTIPOLYGON (((200000.000 234385.000, 199944....
246+
16 5820 ... MULTIPOLYGON (((286860.000 760900.000, 287050....
247+
248+
[17 rows x 6 columns]
249+
250+
>>> gdf = gp.read_file('/data/os-data/OS_Open_Zoomstack.gpkg', layer='Airports')
251+
>>> gdf.columns
252+
Index(['name', 'geometry'], dtype='object')
253+
>>> gdf['name']
254+
0 Sumburgh Airport
255+
1 Tingwall Airport
256+
2 Kirkwall Airport
257+
3 Port-Adhair Steòrnabhaigh/Stornoway Airport
258+
4 Wick John O'Groats Airport
259+
5 Port-adhair Bheinn na Faoghla/Benbecula Airport
260+
6 Port-adhair Bharraigh/Barra Airport
261+
7 Inverness Airport
262+
8 Aberdeen International Airport
263+
9 Tiree Airport
264+
~~~
265+
266+
All the expected geometric construction operations, along with intersection and difference methods, are also available.
267+
The `matplotlib` module can also be used to produce maps of your data.
268+
269+
~~~py
270+
>>> import matplotlib.pyplot as plt
271+
>>> gdf.plot()
272+
<AxesSubplot:>
273+
>>> plt.title("GB Airports")
274+
Text(0.5, 1.0, 'GB Airports')
275+
>>> plt.show()
276+
~~~
277+
278+
![A map of GB Airports](images/Figure_1.png "A map of GB Airports")
279+

Diff for: assets/names.zip

2.32 KB
Binary file not shown.
