diff --git a/Python/Module5_OddsAndEnds/WorkingWithFiles.md b/Python/Module5_OddsAndEnds/WorkingWithFiles.md
index e8fce932..28496b63 100644
--- a/Python/Module5_OddsAndEnds/WorkingWithFiles.md
+++ b/Python/Module5_OddsAndEnds/WorkingWithFiles.md
@@ -249,6 +249,119 @@ with open("a_poem.txt", mode="r") as my_open_file:
 ```
+
+## Working with Comma Separated Value Files
+
+Comma Separated Value (CSV) files are commonly used to store data that you might typically find in a table.
+These files can be formatted in many ways, but typically the values within a row are separated by commas and each row sits on its own line.
+Suppose we have the following table of test scores:
+
+| | Exam 1 (%) | Exam 2 (%) |
+| ------------- |:-------------:| -----:|
+| Ashley | $93$ | $95$ |
+| Brad | $84$ | $100$ |
+| Cassie | $99$ | $87$ |
+
+This table depicts the test scores of three students across two exams.
+Here is what the corresponding CSV file might look like:
+
+```python
+name,exam one score,exam two score
+Ashley,93,95
+Brad,84,100
+Cassie,99,87
+```
+The first line typically contains a header that names each column. Values are also allowed to contain spaces, as in `exam one score`.
+
+
+
+**Note**:
+
+It is not guaranteed that all CSV files are actually comma separated.
+Non-standard CSV files will typically come with instructions on how the data is organized.
+In general, it is a good practice to open up the CSV file and look at the first few lines to get a sense of how it is organized (unless the file is too large).
+
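The advice in the note above can be sketched with Python's standard library alone. This is a minimal illustration, not part of the original material: the helper `peek` and the semicolon-separated sample text are hypothetical, and `io.StringIO` stands in for a real file opened with `open(path, mode="r")`. We first look at a couple of lines to spot the delimiter, then pass that delimiter to `csv.reader`.

```python
import csv
import io
from itertools import islice


def peek(file_obj, n=3):
    """Return the first `n` lines of an open text file (fewer if it is shorter)."""
    return [line.rstrip("\n") for line in islice(file_obj, n)]


# Hypothetical stand-in for a non-standard "CSV" that uses semicolons.
# With a real file you would use `open(path, mode="r")` instead of `io.StringIO`.
raw_text = (
    "name;exam one score;exam two score\n"
    "Ashley;93;95\n"
    "Brad;84;100\n"
    "Cassie;99;87\n"
)

# Step 1: look at the first couple of lines to spot the delimiter.
first_lines = peek(io.StringIO(raw_text), n=2)
print(first_lines)

# Step 2: parse the text using the delimiter we observed.
rows = list(csv.reader(io.StringIO(raw_text), delimiter=";"))
header, records = rows[0], rows[1:]
print(header)
print(records[0])
```

Because `islice` reads only the lines it is asked for, peeking this way avoids loading the whole file, which helps when the file is large.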
+
+### How to parse CSVs with NumPy
+
+We will first look into parsing and storing CSV data using our favorite package: `numpy`!
+
+To demonstrate how importing a CSV works, we will import [a coastal waves dataset](https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba/data) from Kaggle.
+After you extract the *.csv* from the *.zip*, rename it to *coastal_dataset.csv*.
+```python
+from numpy import genfromtxt  # genfromtxt() allows for easy parsing of CSVs
+my_data = genfromtxt(r"./Downloads/coastal_dataset.csv", delimiter=',')
+```
+`genfromtxt()` takes the path to a CSV file and a delimiter (the character that separates values, typically a comma for CSV files).
+Let's check out some properties of the data:
+
+```python
+>>> type(my_data)
+numpy.ndarray
+
+>>> my_data.shape
+(43729, 7)
+
+# Let's look at the actual data
+>>> my_data
+array([[ nan, nan, nan, ..., nan, nan, nan],
+       [ nan, -99.9 , -99.9 , ..., -99.9 , -99.9 , -99.9 ],
+       [ nan, 0.875, 1.39 , ..., 4.506, -99.9 , -99.9 ],
+       ...,
+       [ nan, 2.157, 3.43 , ..., 12.89 , 97. , 21.95 ],
+       [ nan, 2.087, 2.84 , ..., 10.963, 92. , 21.95 ],
+       [ nan, 1.926, 2.98 , ..., 12.228, 84. , 21.95 ]])
+```
+You may notice that there are some `nan` values present when we look at this particular set of data.
+By default, `genfromtxt()` tries to read every field as a float, so non-numerical values in the file, such as headers and dates, become `nan` in the resulting array.
+
+### How to parse CSVs with Pandas
+
+A really popular library for parsing CSVs is the [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html "Pandas Documentation") library. Here is a quick way to parse a CSV using Pandas:
+```Python
+import pandas as pd
+my_data = pd.read_csv(r"./Downloads/coastal_dataset.csv", sep=',', header=None)
+```
+That's it!
+The function `read_csv()` parses the CSV and returns its contents, which we store in the variable `my_data`.
+This function has similar input parameters to `genfromtxt()` and many extra optional parameters as well.
+Look at the docstring for more information.
+
+Let's inspect the same [ocean waves CSV](https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba/data) from before, this time parsed with Pandas instead of NumPy:
+
+```Python
+>>> type(my_data)
+pandas.core.frame.DataFrame  # Notice that this is a custom type
+
+>>> my_data.shape
+(43729, 7)
+
+>>> my_data.values  # This is how we access the values as an array
+array([['Date/Time', 'Hs', 'Hmax', ..., 'Tp', 'Peak Direction', 'SST'],
+       ['01/01/2017 00:00', '-99.9', '-99.9', ..., '-99.9', '-99.9',
+        '-99.9'],
+       ['01/01/2017 00:30', '0.875', '1.39', ..., '4.506', '-99.9',
+        '-99.9'],
+       ...,
+       ['30/06/2019 22:30', '2.157', '3.43', ..., '12.89', '97', '21.95'],
+       ['30/06/2019 23:00', '2.087', '2.84', ..., '10.963', '92',
+        '21.95'],
+       ['30/06/2019 23:30', '1.926', '2.98', ..., '12.228', '84',
+        '21.95']], dtype=object)
+```
+One of the nicest features of Pandas is how neatly it displays the parsed CSV data.
+Here is how `my_data` is displayed in a Jupyter Notebook:
+
+```Python
+my_data[0:21]  # Displays the first 21 rows: the header row plus 20 data rows
+```
+![Pandas Parsed Figure](pics/Pandas_CSV.jpg)
+
+One of the main advantages of Pandas is that a DataFrame can hold non-numerical data: here, because we passed `header=None`, the header row is read as data and **every column is stored as strings**, whereas `genfromtxt()` converted each field to a float.
+This allows Pandas to keep information such as headers and dates, which were lost as `nan` in our NumPy array.
+Read the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html "Documentation Link") for more information.
+
+
 ## Globbing for Files
@@ -457,4 +570,3 @@ Write a glob pattern for each of the following prompts
 - Any file with an odd number in its name (answer: `*[13579]*`)
 - All txt files that have the letters 'q' or 'z' in them (answer: `*[qz]*.txt`)
-
diff --git a/Python/Module5_OddsAndEnds/pics/Pandas_CSV.jpg b/Python/Module5_OddsAndEnds/pics/Pandas_CSV.jpg
new file mode 100644
index 00000000..f0f3aeb1
Binary files /dev/null and b/Python/Module5_OddsAndEnds/pics/Pandas_CSV.jpg differ