|
| 1 | +# Data Analysis in Python |
| 2 | + |
| 3 | +## NumPy |
| 4 | + |
| 5 | +NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a |
| 6 | +multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment |
| 7 | +of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, |
| 8 | +selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random |
| 9 | +simulation and much more. |
| 10 | + |
| 11 | +At the core of the NumPy package is the `ndarray` object. This encapsulates n-dimensional arrays of |
| 12 | +homogeneous data types, with many operations being performed in compiled code for performance. There are |
| 13 | +several important differences between NumPy arrays and the standard Python sequences: |
| 14 | + |
| 15 | ++ NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the |
| 16 | + size of an `ndarray` will create a new array and delete the original. |
| 17 | + |
| 18 | ++ The elements in a NumPy array are all required to be of the same data type, and thus will be the same size |
| 19 | + in memory. |
| 20 | + |
| 21 | +These restrictions, together with the compiled code used in NumPy functions, make NumPy arrays much faster |
| 22 | +than ordinary Python lists for many operations. |
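
A minimal sketch of the two restrictions above; the variable names are illustrative only. Mixing integers with a float yields a single `float64` array, and `np.append` returns a new array rather than growing the original in place.

```python
import numpy as np

# A list may mix types; the array upcasts everything to one dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)                 # float64

# "Growing" an array really builds a new one; the original is unchanged.
original = np.array([1, 2, 3])
grown = np.append(original, 4)
print(original.size, grown.size)   # 3 4
```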
| 23 | + |
| 24 | +### Creating arrays from lists |
| 25 | + |
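| 26 | +`np.array([1,2,3,5,5])` creates a one-dimensional array from a Python list. |
| 27 | + |
| 28 | +Nested sequences give a multidimensional array: `np.array([range(i, i + 3) for i in [2, 4, 6]])` gives: |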
| 29 | + |
| 30 | +~~~python |
| 31 | +array([[2, 3, 4], [4, 5, 6], [6, 7, 8]]) |
| 32 | +~~~ |
| 33 | + |
| 34 | +To create an array full of `0`s use `np.zeros(10, dtype=int)`. You can also use `np.arange(0,20,2)` to produce |
| 35 | +the values 0 to 18 in steps of 2 (the stop value is excluded) and `np.linspace(0,1,5)` to give 5 evenly spaced values from 0 to 1 inclusive. |
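
As a quick illustration of the three constructors just mentioned (the variable names are only for this sketch), note that `np.arange` stops before its end value while `np.linspace` includes both endpoints:

```python
import numpy as np

zeros = np.zeros(10, dtype=int)   # ten integer zeros
evens = np.arange(0, 20, 2)       # 0, 2, ..., 18 (stop value excluded)
points = np.linspace(0, 1, 5)     # 0.0, 0.25, 0.5, 0.75, 1.0 (endpoints included)

print(zeros)
print(evens)
print(points)
```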
| 36 | + |
| 37 | +NumPy arrays have attributes: |
| 38 | ++ `ndim`: the number of dimensions |
| 39 | ++ `shape`: the size of each dimension, as a tuple |
| 40 | ++ `size`: the total number of elements in the array |
| 41 | ++ `dtype`: the type of the array's elements (`int`, `float32`, etc.) |
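
The attributes listed above can be inspected directly. This is a small sketch on an arbitrary 3 by 4 array; the exact integer `dtype` depends on the platform.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns

print(x.ndim)    # 2
print(x.shape)   # (3, 4)
print(x.size)    # 12
print(x.dtype)   # a platform-dependent integer type such as int64
```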
| 42 | + |
| 43 | +You can slice arrays as you would lists or strings, including from the end using negative indices. |
| 44 | +Multidimensional arrays are indexed as `x[1,2]`, which gives the 3rd element of the 2nd row. |
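
A short sketch of slicing and multidimensional indexing, using made-up arrays `a` and `x`:

```python
import numpy as np

a = np.arange(10)     # [0 1 2 ... 9]
print(a[2:5])         # [2 3 4]
print(a[-3:])         # [7 8 9], counting from the end
print(a[::2])         # [0 2 4 6 8], every second element

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(x[1, 2])        # 6: row index 1, column index 2
print(x[:, 0])        # [1 4 7], the first column
```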
| 45 | + |
| 46 | +### Fast Functions on Arrays |
| 47 | + |
| 48 | +Computations can use vectorization through the use of `ufuncs`. These are nearly always more efficient than |
| 49 | +their counterparts implemented through Python loops, especially as the arrays grow in size. Any time you see |
| 50 | +such a loop in a Python script, you should consider whether it can be replaced with a vectorized expression. |
| 51 | + |
| 52 | +Simply apply the operation to the whole array. The example below compares a Python loop with the vectorized expression `1.0/values`: |
| 53 | + |
| 54 | +~~~python |
| 55 | +import numpy as np |
| 56 | +np.random.seed(0) |
| 57 | +def compute_reciprocals(values): |
| 58 | +    output = np.empty(len(values)) |
| 59 | +    for i in range(len(values)): |
| 60 | +        output[i] = 1.0 / values[i] |
| 61 | +    return output |
| 62 | + |
| 63 | +values = np.random.randint(1, 100, size=1000000) |
| 64 | +compute_reciprocals(values)  # slow: one Python-level loop iteration per element |
| 65 | +# fast: the vectorized expression computes every reciprocal in compiled code |
| 66 | +print(1.0/values) |
| 67 | +~~~ |
| 68 | + |
| 69 | +The loop-based function takes over 2 seconds, while `1.0/values` takes about 4 ms. |
| 70 | + |
| 71 | +Vectorized operations also work between two arrays, including multidimensional arrays. |
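
For instance, arithmetic between two arrays of the same shape is applied element by element. The arrays below are arbitrary and only meant as a sketch:

```python
import numpy as np

a = np.arange(5)        # [0 1 2 3 4]
b = np.arange(1, 6)     # [1 2 3 4 5]
print(a + b)            # [1 3 5 7 9]
print(a / b)            # element-by-element division

grid = np.arange(9).reshape(3, 3)
print(2 ** grid)        # the operation is applied to every element of the 2-D array
```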
| 72 | + |
| 73 | +You can also use the `reduce()` method of a `ufunc` to act across an array: `np.add.reduce(x)` gives the sum of the array |
| 74 | +elements, while `np.multiply.reduce(x)` gives the product of the array elements. |
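
A minimal check of `reduce()` on a small array; the ufunc for products is `np.multiply`:

```python
import numpy as np

x = np.arange(1, 6)             # [1 2 3 4 5]
total = np.add.reduce(x)        # 1 + 2 + 3 + 4 + 5
product = np.multiply.reduce(x) # 1 * 2 * 3 * 4 * 5
print(total)                    # 15
print(product)                  # 120
```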
| 75 | + |
| 76 | +`np.sum`, `np.min` and `np.max` are fast versions of the Python `sum`, `min` and `max` functions. They all take |
| 77 | +an `axis` parameter that selects the dimension to reduce along (e.g. `axis=0` gives one result per column of a 2D array). |
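
The effect of `axis` is easiest to see on a small 2D array. In this sketch `axis=0` collapses the rows (one result per column) and `axis=1` collapses the columns (one result per row):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(np.sum(m))           # 21, all elements
print(np.sum(m, axis=0))   # [5 7 9], one sum per column
print(np.sum(m, axis=1))   # [ 6 15], one sum per row
print(np.min(m, axis=0))   # [1 2 3]
print(np.max(m, axis=1))   # [3 6]
```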
| 78 | + |
| 79 | + |
| 80 | +## Pandas |
| 81 | + |
| 82 | +Pandas is an open source Python package that is most widely used for data science/data analysis and machine |
| 83 | +learning tasks. It is built on top of another package named NumPy, which provides support for |
| 84 | +multi-dimensional arrays. As one of the most popular data wrangling packages, Pandas works well with many |
| 85 | +other data science modules inside the Python ecosystem, and is included in many scientific Python |
| 86 | +distributions, such as Anaconda. |
| 87 | + |
| 88 | +### A generalised NumPy array |
| 89 | + |
| 90 | +~~~python |
| 91 | +>>> import pandas as pd |
| 92 | +>>> import numpy as np |
| 93 | +>>> data = pd.Series([0.25, 0.5, 0.75, 1.0]) |
| 94 | +>>> data |
| 95 | +0 0.25 |
| 96 | +1 0.50 |
| 97 | +2 0.75 |
| 98 | +3 1.00 |
| 99 | +dtype: float64 |
| 100 | +>>> data.index |
| 101 | +RangeIndex(start=0, stop=4, step=1) |
| 102 | +>>> |
| 103 | +~~~ |
| 104 | + |
| 105 | +As this example shows, the main difference between NumPy and Pandas is that the index of the `Series` is explicit in |
| 106 | +Pandas while it is implicit in NumPy. The index need not be a simple range; any list of labels will work. |
| 107 | + |
| 108 | +~~~python |
| 109 | +>>> data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd']) |
| 110 | +>>> data |
| 111 | +a 0.25 |
| 112 | +b 0.50 |
| 113 | +c 0.75 |
| 114 | +d 1.00 |
| 115 | +dtype: float64 |
| 116 | +>>> data['b'] |
| 117 | +0.5 |
| 118 | +>>> data.index |
| 119 | +Index(['a', 'b', 'c', 'd'], dtype='object') |
| 120 | +~~~ |
| 121 | + |
| 122 | +A `Series` can also be thought of as a specialised `dict`, and can be created from a `dict`, with the keys of the `dict` being |
| 123 | +used as the index elements: |
| 124 | + |
| 125 | +~~~py |
| 126 | +>>> population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135} |
| 127 | +>>> population = pd.Series(population_dict) |
| 128 | +>>> population |
| 129 | +California 38332521 |
| 130 | +Texas 26448193 |
| 131 | +New York 19651127 |
| 132 | +Florida 19552860 |
| 133 | +Illinois 12882135 |
| 134 | +dtype: int64 |
| 135 | +>>> |
| 136 | +~~~ |
| 137 | + |
| 138 | +Unlike a `dict`, a `Series` supports slicing by label, even though the labels are not sorted; the slice includes its end label: |
| 139 | + |
| 140 | +~~~py |
| 141 | +>>> population['California':'New York'] |
| 142 | +California 38332521 |
| 143 | +Texas 26448193 |
| 144 | +New York 19651127 |
| 145 | +dtype: int64 |
| 146 | +>>> |
| 147 | +~~~ |
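
To make the endpoint behaviour concrete, here is a small sketch contrasting a label slice with a positional slice, reusing the lettered `Series` from earlier. Using `.iloc` for the positional case is a choice made for this example.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data['a':'c']        # label slice: the end label 'c' is included
by_position = data.iloc[0:2]    # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```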
| 148 | + |
| 149 | +A `DataFrame` is a generalised two-dimensional NumPy array, made up of one or more `Series` that share an index: |
| 150 | + |
| 151 | +~~~py |
| 152 | +>>> area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995} |
| 153 | +>>> area = pd.Series(area_dict) |
| 154 | +>>> area |
| 155 | +California 423967 |
| 156 | +Texas 695662 |
| 157 | +New York 141297 |
| 158 | +Florida 170312 |
| 159 | +Illinois 149995 |
| 160 | +dtype: int64 |
| 161 | +>>> states = pd.DataFrame({'population': population, 'area': area}) |
| 162 | +>>> states |
| 163 | + population area |
| 164 | +California 38332521 423967 |
| 165 | +Texas 26448193 695662 |
| 166 | +New York 19651127 141297 |
| 167 | +Florida 19552860 170312 |
| 168 | +Illinois 12882135 149995 |
| 169 | +>>> |
| 170 | +~~~ |
| 171 | + |
| 172 | +So in addition to the `index` giving access to the rows, we also have `columns`, giving a generalised |
| 173 | +two-dimensional array. We can also treat the `DataFrame` as a dictionary of columns, allowing access by column name. Indexing with a single |
| 174 | +row label raises a `KeyError`, although slices and filter expressions inside `[]` do select rows: |
| 175 | + |
| 176 | +~~~py |
| 177 | +>>> states['area'] |
| 178 | +California 423967 |
| 179 | +Texas 695662 |
| 180 | +New York 141297 |
| 181 | +Florida 170312 |
| 182 | +Illinois 149995 |
| 183 | +Name: area, dtype: int64 |
| 184 | +>>> states['California'] |
| 185 | +Traceback (most recent call last): |
| 186 | + File "/usr/lib/python3/dist-packages/pandas/core/indexes/base.py", line 3361, in get_loc |
| 187 | + return self._engine.get_loc(casted_key) |
| 188 | + File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc |
| 189 | + File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc |
| 190 | + File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item |
| 191 | + File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item |
| 192 | +KeyError: 'California' |
| 193 | +~~~ |
| 194 | + |
| 195 | +You can construct a `DataFrame` from a `Series`, a `dict` of `Series`, a 2D NumPy array or a structured NumPy |
| 196 | +array. |
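
The construction routes listed above might look like the following sketch. The column names `foo`, `bar`, `A` and `B` are placeholders, and `Series.to_frame` is just one way to build a frame from a single `Series`.

```python
import numpy as np
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})
area = pd.Series({'California': 423967, 'Texas': 695662})

# From a single Series: to_frame() turns it into a one-column DataFrame.
from_series = population.to_frame(name='population')

# From a dict of Series: keys become column names, indexes are aligned.
from_dict = pd.DataFrame({'population': population, 'area': area})

# From a 2-D NumPy array, supplying row and column labels explicitly.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=['foo', 'bar'],
                          index=['a', 'b', 'c'])

# From a structured NumPy array: field names become column names.
structured = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
from_structured = pd.DataFrame(structured)

print(from_dict)
```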
| 197 | + |
| 198 | +#### Data access |
| 199 | + |
| 200 | +You can access a `DataFrame` column dict-style, `data['area']`, or as an attribute, `data.area`, provided the column name is a string and doesn't clash with a method of `DataFrame` (e.g. `pop`). |
| 201 | + |
| 202 | +You can create new columns in the frame by assignment: `data['density'] = data['pop']/data['area']` |
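
A small sketch of both access styles and of column creation. The two-row frame is a hypothetical stand-in for the states data, with a column deliberately named `pop` to show where attribute access breaks down.

```python
import pandas as pd

# Hypothetical frame; 'pop' clashes with the DataFrame.pop method.
data = pd.DataFrame({'pop': [38332521, 26448193],
                     'area': [423967, 695662]},
                    index=['California', 'Texas'])

same_column = data['area'].equals(data.area)   # both styles return the same column
pop_is_method = callable(data.pop)             # data.pop is the method, not the column
print(same_column, pop_is_method)              # True True

# Assigning to a new key adds a column computed element by element.
data['density'] = data['pop'] / data['area']
print(data)
```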
| 203 | + |
| 204 | + |
| 205 | +### Reading a file with Pandas |
| 206 | + |
| 207 | +~~~python |
| 208 | +import numpy as np |
| 209 | +import pandas as pd |
| 210 | +data = pd.read_csv('datafile.csv')  # load the CSV file into a DataFrame |
| 211 | +heights = np.array(data['height(cm)'])  # extract one column as a NumPy array |
| 212 | +~~~ |
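
To try this without a file on disk, the same steps can be run against in-memory text. The CSV contents below are invented for illustration; `io.StringIO` simply lets `read_csv` treat a string as a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical file contents standing in for 'datafile.csv'.
csv_text = "name,height(cm)\nAda,165\nGrace,170\nAlan,178\n"
data = pd.read_csv(io.StringIO(csv_text))

heights = np.array(data['height(cm)'])
print(heights.mean())    # 171.0
```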
| 213 | + |
| 214 | +## GeoPandas |
| 215 | + |
| 216 | +GeoPandas, as the name suggests, extends the popular data science library pandas by adding support for |
| 217 | +geospatial data. The core data structure in GeoPandas is the `geopandas.GeoDataFrame`, a subclass of |
| 218 | +`pandas.DataFrame`, that can store geometry columns and perform spatial operations. The `geopandas.GeoSeries`, |
| 219 | +a subclass of `pandas.Series`, handles the geometries. Therefore, your `GeoDataFrame` is a combination of |
| 220 | +`pandas.Series`, with traditional data (numerical, boolean, text etc.), and `geopandas.GeoSeries`, with |
| 221 | +geometries (points, polygons etc.). You can have as many columns with geometries as you wish. |
| 222 | + |
| 223 | +A `GeoDataFrame` can be constructed from a spatial data file, such as a geopackage or shapefile: |
| 224 | + |
| 225 | +~~~py |
| 226 | +>>> import geopandas as gp |
| 227 | +>>> gdf = gp.read_file('/data/os-data/national-parks.shp') |
| 228 | +>>> gdf |
| 229 | + CODE ... geometry |
| 230 | +0 5820 ... POLYGON ((317635.000 770725.000, 317078.000 77... |
| 231 | +1 5820 ... POLYGON ((308047.000 500000.000, 308050.000 50... |
| 232 | +2 5820 ... POLYGON ((259611.000 300000.000, 259763.000 30... |
| 233 | +3 5820 ... POLYGON ((209870.000 700000.000, 209940.000 70... |
| 234 | +4 5820 ... POLYGON ((400000.000 453080.000, 399940.000 45... |
| 235 | +5 5820 ... POLYGON ((550228.000 100000.000, 550138.000 10... |
| 236 | +6 5820 ... POLYGON ((500000.000 487140.000, 499650.000 48... |
| 237 | +7 5820 ... POLYGON ((400000.000 360330.000, 399930.000 36... |
| 238 | +8 5820 ... POLYGON ((300000.000 208070.000, 299817.000 20... |
| 239 | +9 5820 ... POLYGON ((400000.000 594660.000, 399230.000 59... |
| 240 | +10 5820 ... POLYGON ((249830.000 73470.000, 249590.000 736... |
| 241 | +11 5820 ... MULTIPOLYGON (((232234.000 600000.000, 232055.... |
| 242 | +12 5820 ... POLYGON ((300000.000 132806.000, 299600.000 13... |
| 243 | +13 5820 ... POLYGON ((416177.000 99662.000, 416070.000 997... |
| 244 | +14 5820 ... POLYGON ((637924.000 300000.000, 637885.000 30... |
| 245 | +15 5820 ... MULTIPOLYGON (((200000.000 234385.000, 199944.... |
| 246 | +16 5820 ... MULTIPOLYGON (((286860.000 760900.000, 287050.... |
| 247 | + |
| 248 | +[17 rows x 6 columns] |
| 249 | + |
| 250 | +>>> gdf = gp.read_file('/data/os-data/OS_Open_Zoomstack.gpkg', layer='Airports') |
| 251 | +>>> gdf.columns |
| 252 | +Index(['name', 'geometry'], dtype='object') |
| 253 | +>>> gdf['name'] |
| 254 | +0 Sumburgh Airport |
| 255 | +1 Tingwall Airport |
| 256 | +2 Kirkwall Airport |
| 257 | +3 Port-Adhair Steòrnabhaigh/Stornoway Airport |
| 258 | +4 Wick John O'Groats Airport |
| 259 | +5 Port-adhair Bheinn na Faoghla/Benbecula Airport |
| 260 | +6 Port-adhair Bharraigh/Barra Airport |
| 261 | +7 Inverness Airport |
| 262 | +8 Aberdeen International Airport |
| 263 | +9 Tiree Airport |
| 264 | +~~~ |
| 265 | + |
| 266 | +The usual geometric construction operations, along with intersection and difference methods, are also available. |
| 267 | +`GeoDataFrame.plot()` draws with `matplotlib`, so it can be used to produce maps of your data: |
| 268 | + |
| 269 | +~~~py |
| 270 | +>>> import matplotlib.pyplot as plt |
| 271 | +>>> gdf.plot() |
| 272 | +<AxesSubplot:> |
| 273 | +>>> plt.title("GB Airports") |
| 274 | +Text(0.5, 1.0, 'GB Airports') |
| 275 | +>>> plt.show() |
| 276 | +~~~ |
| 277 | + |
| 278 | + |
| 279 | + |