diff --git a/docs/_quarto.yml b/docs/_quarto.yml
index 794d28f..57d8ae2 100644
--- a/docs/_quarto.yml
+++ b/docs/_quarto.yml
@@ -182,6 +182,8 @@ website:
href: notes/fetching-data/json-data.qmd
- section:
href: notes/fetching-data/csv-data.qmd
+ - section:
+ href: notes/fetching-data/xml.qmd
- section:
href: notes/fetching-data/html-web-scraping.qmd
#text: "HTML Data (Web Scraping)"
@@ -287,6 +289,8 @@ format:
code-fold: false #show
#code-line-numbers: true
toc: true
+ #toc-depth: 3 # specify the number of section levels to include in the table of contents
+ #toc-expand: 3 # specify how much of the table of contents to show initially (defaults to 1 with auto-expansion as the user scrolls)
#toc-location: left
#number-sections: false
#number-depth: 1
diff --git a/docs/notes/fetching-data/html-web-scraping.qmd b/docs/notes/fetching-data/html-web-scraping.qmd
index 8dab422..7ced88e 100644
--- a/docs/notes/fetching-data/html-web-scraping.qmd
+++ b/docs/notes/fetching-data/html-web-scraping.qmd
@@ -2,83 +2,68 @@
format:
html:
code-fold: false
+ #toc: true
+ #toc-depth: 4
+ #toc-expand: 5
jupyter: python3
execute:
cache: true # re-render only when source changes
---
+# Fetching HTML Data (i.e. "Web Scraping")
-# Fetching HTML Data
-
-If the data you want to fetch is in XML or HTML format, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.
-
-## XML
-
-
-Let's consider this example "students.xml" file we have hosted on the Internet:
-
-```xml
-<gradebook>
-    <generated>2018-06-05</generated>
-    <courseId>123</courseId>
-    <students>
-        <student>
-            <studentId>1</studentId>
-            <finalGrade>76.7</finalGrade>
-        </student>
-        <student>
-            <studentId>2</studentId>
-            <finalGrade>85.1</finalGrade>
-        </student>
-        <student>
-            <studentId>3</studentId>
-            <finalGrade>50.3</finalGrade>
-        </student>
-        <student>
-            <studentId>4</studentId>
-            <finalGrade>89.8</finalGrade>
-        </student>
-        <student>
-            <studentId>5</studentId>
-            <finalGrade>97.4</finalGrade>
-        </student>
-        <student>
-            <studentId>6</studentId>
-            <finalGrade>75.5</finalGrade>
-        </student>
-        <student>
-            <studentId>7</studentId>
-            <finalGrade>87.2</finalGrade>
-        </student>
-        <student>
-            <studentId>8</studentId>
-            <finalGrade>88.0</finalGrade>
-        </student>
-        <student>
-            <studentId>9</studentId>
-            <finalGrade>93.9</finalGrade>
-        </student>
-        <student>
-            <studentId>10</studentId>
-            <finalGrade>92.5</finalGrade>
-        </student>
-    </students>
-</gradebook>
-```
+If the data you want to fetch is in HTML format, as is the case for most web pages, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.
+
+Before moving on to processing HTML-formatted data, it will be helpful to first review [Basic HTML](https://www.w3schools.com/html/html_basic.asp), [HTML Lists](https://www.w3schools.com/html/html_lists.asp), and [HTML Tables](https://www.w3schools.com/html/html_tables.asp).
+
+
+## HTML Lists
+
+Let's consider this "my_lists.html" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:
+
+```html
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <title>HTML List Parsing Exercise</title>
+</head>
+<body>
+    <h1>HTML List Parsing Exercise</h1>
+
+    <p>This is an HTML page.</p>
+
+    <h2>Favorite Ice cream Flavors</h2>
+    <ol id="my-fav-flavors">
+        <li>Vanilla Bean</li>
+        <li>Chocolate</li>
+        <li>Strawberry</li>
+    </ol>
+
+    <h2>Skills</h2>
+    <ul>
+        <li class="skill">HTML</li>
+        <li class="skill">CSS</li>
+        <li class="skill">JavaScript</li>
+        <li class="skill">Python</li>
+    </ul>
+</body>
+</html>
```
-First we note the URL of where the data resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):
+First we note the URL of where the data or web page resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):
```{python}
import requests
-# the URL of some CSV data we stored online:
-request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/gradebook.xml"
+# the URL of some HTML data or web page stored online:
+request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_lists.html"
response = requests.get(request_url)
print(type(response))
```
-Then we pass the response text (an HTML or XML formatted string) to the `BeautifulSoup` class constructor.
+Then we pass the response text (an HTML formatted string) to the `BeautifulSoup` class constructor.
```{python}
from bs4 import BeautifulSoup
@@ -87,23 +72,146 @@ soup = BeautifulSoup(response.text)
type(soup)
```
-The soup object is able to intelligently process the data.
+The soup object is able to intelligently process the data. We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes.
+
+### Finding Elements by Identifier
+
+Since the example HTML contains an ordered list (`ol` element) with a unique identifier of "my-fav-flavors", we can use the following code to access it:
+
+
+```{python}
+# get the first <ol> element that has a given identifier of "my-fav-flavors":
+ol = soup.find("ol", id="my-fav-flavors")
+print(type(ol))
+ol
+```
+
+```{python}
+# get all child <li> elements from that list:
+flavors = ol.find_all("li")
+print(type(flavors))
+print(len(flavors))
+flavors
+```
-We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes. For example, finding all the student tags in this structure:
```{python}
-students = soup.find_all("student")
-print(type(students))
-len(students)
+for li in flavors:
+    print("-----------")
+    print(type(li))
+    print(li.text)
```
+### Finding Elements by Class
+
+Since the example HTML contains an unordered list (`ul` element) of skills, where each list item shares the same class of "skill", we can use the following code to access the list items directly:
```{python}
-for student in students:
+# get all <li> elements that have a given class of "skill":
+skills = soup.find_all("li", "skill")
+print(type(skills))
+print(len(skills))
+skills
+```
+
+```{python}
+for li in skills:
print("-----------")
- print(type(student))
- student_id = student.studentid.text
- final_grade = student.finalgrade.text
- print(student_id, final_grade)
+    print(type(li))
+    print(li.text)
+```
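
As an aside, if we just want the text values in a plain Python list, a list comprehension offers a concise alternative to the loop. Here is a small self-contained sketch, using an inline HTML snippet that mirrors the skills list above:

```{python}
from bs4 import BeautifulSoup

# an inline snippet mirroring the skills list from the example page:
html = """
<ul>
    <li class="skill">HTML</li>
    <li class="skill">CSS</li>
    <li class="skill">JavaScript</li>
    <li class="skill">Python</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# collect just the text of each matching <li> element:
skill_names = [li.text for li in soup.find_all("li", "skill")]
print(skill_names)
```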
+
+
+## HTML Tables
+
+Let's consider this "my_tables.html" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:
+
+```html
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <title>HTML Table Parsing Exercise</title>
+</head>
+<body>
+    <h1>HTML Table Parsing Exercise</h1>
+
+    <p>This is an HTML page.</p>
+
+    <h2>Products</h2>
+    <table id="products">
+        <tr>
+            <th>Id</th>
+            <th>Name</th>
+            <th>Price</th>
+        </tr>
+        <tr>
+            <td>1</td>
+            <td>Chocolate Sandwich Cookies</td>
+            <td>3.50</td>
+        </tr>
+        <tr>
+            <td>2</td>
+            <td>All-Seasons Salt</td>
+            <td>4.99</td>
+        </tr>
+        <tr>
+            <td>3</td>
+            <td>Robust Golden Unsweetened Oolong Tea</td>
+            <td>2.49</td>
+        </tr>
+    </table>
+</body>
+</html>
+```
+
+We repeat the process of fetching this data, following the same steps as before:
+
+
+```{python}
+import requests
+from bs4 import BeautifulSoup
+
+# the URL of some HTML data or web page stored online:
+request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_tables.html"
+
+response = requests.get(request_url)
+
+soup = BeautifulSoup(response.text)
+type(soup)
+```
+
+Since the example HTML contains a `table` element with a unique identifier of "products", we can use the following code to access it:
+
+
+```{python}
+# get the first <table> element that has a given identifier of "products":
+table = soup.find("table", id="products")
+print(type(table))
+table
+```
+
+```{python}
+# get all child <tr> elements (rows) from that table:
+rows = table.find_all("tr")
+print(type(rows))
+print(len(rows))
+rows
+```
+
+This gets us a list of rows, the first of which is the header row. We can then loop through the rows, skipping the header:
+
+```{python}
+for tr in rows:
+    cells = tr.find_all("td") # skip header row, which contains <th> elements instead
+    if any(cells):
+        print("-----------")
+        # makes assumptions about the order of the cells:
+        product_id = cells[0].text
+        product_name = cells[1].text
+        product_price = cells[2].text
+        print(product_id, product_name, product_price)
+
```
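
As an alternative for tables specifically, the `pandas` package can parse HTML tables directly into a `DataFrame` via its `read_html` function. This is a sketch, assuming `pandas` and an HTML parser backend such as `lxml` are installed; the inline snippet mirrors the products table above:

```{python}
from io import StringIO
import pandas as pd

# an inline snippet mirroring the products table from the example page:
html = """
<table id="products">
    <tr><th>Id</th><th>Name</th><th>Price</th></tr>
    <tr><td>1</td><td>Chocolate Sandwich Cookies</td><td>3.50</td></tr>
    <tr><td>2</td><td>All-Seasons Salt</td><td>4.99</td></tr>
</table>
"""
# read_html returns a list of DataFrames, one per table found:
products_df = pd.read_html(StringIO(html))[0]
print(products_df)
```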
diff --git a/docs/notes/fetching-data/xml.qmd b/docs/notes/fetching-data/xml.qmd
new file mode 100644
index 0000000..b6f7f37
--- /dev/null
+++ b/docs/notes/fetching-data/xml.qmd
@@ -0,0 +1,113 @@
+---
+format:
+ html:
+ code-fold: false
+jupyter: python3
+execute:
+ cache: true # re-render only when source changes
+---
+
+
+# Fetching XML Data
+
+If the data you want to fetch is in XML format, including in an RSS feed, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.
+
+Let's consider this example "students.xml" file we have hosted on the Internet:
+
+```xml
+<gradebook>
+    <generated>2018-06-05</generated>
+    <courseId>123</courseId>
+    <students>
+        <student>
+            <studentId>1</studentId>
+            <finalGrade>76.7</finalGrade>
+        </student>
+        <student>
+            <studentId>2</studentId>
+            <finalGrade>85.1</finalGrade>
+        </student>
+        <student>
+            <studentId>3</studentId>
+            <finalGrade>50.3</finalGrade>
+        </student>
+        <student>
+            <studentId>4</studentId>
+            <finalGrade>89.8</finalGrade>
+        </student>
+        <student>
+            <studentId>5</studentId>
+            <finalGrade>97.4</finalGrade>
+        </student>
+        <student>
+            <studentId>6</studentId>
+            <finalGrade>75.5</finalGrade>
+        </student>
+        <student>
+            <studentId>7</studentId>
+            <finalGrade>87.2</finalGrade>
+        </student>
+        <student>
+            <studentId>8</studentId>
+            <finalGrade>88.0</finalGrade>
+        </student>
+        <student>
+            <studentId>9</studentId>
+            <finalGrade>93.9</finalGrade>
+        </student>
+        <student>
+            <studentId>10</studentId>
+            <finalGrade>92.5</finalGrade>
+        </student>
+    </students>
+</gradebook>
+```
+
+First we note the URL of where the data resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):
+
+```{python}
+import requests
+
+# the URL of some XML data we stored online:
+request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/gradebook.xml"
+
+response = requests.get(request_url)
+print(type(response))
+```
+
+Then we pass the response text (an XML-formatted string) to the `BeautifulSoup` class constructor.
+
+```{python}
+from bs4 import BeautifulSoup
+
+soup = BeautifulSoup(response.text)
+type(soup)
+```
+
+The soup object is able to intelligently process the data.
+
+
+We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes. For example, finding all the student tags in this structure:
+
+```{python}
+students = soup.find_all("student")
+print(type(students))
+print(len(students))
+```
+
+
+```{python}
+# examining the first item for reference:
+print(type(students[0]))
+students[0]
+```
+
+```{python}
+# looping through all the items:
+for student in students:
+    print("-----------")
+    print(type(student))
+    student_id = student.studentid.text
+    final_grade = student.finalgrade.text
+    print(student_id, final_grade)
+```
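
Since the parsed values are strings, we can convert them to floats to perform calculations, such as averaging the final grades. Here is a small self-contained sketch, using an inline XML snippet that mirrors the gradebook structure (note BeautifulSoup's default parser lowercases tag names, which is why we reference `finalgrade`):

```{python}
from bs4 import BeautifulSoup

# an inline snippet mirroring the structure of the gradebook XML:
xml = """
<gradebook>
    <students>
        <student><studentId>1</studentId><finalGrade>76.7</finalGrade></student>
        <student><studentId>2</studentId><finalGrade>85.1</finalGrade></student>
    </students>
</gradebook>
"""
soup = BeautifulSoup(xml, "html.parser")

# convert the text of each final grade to a float:
final_grades = [float(s.finalgrade.text) for s in soup.find_all("student")]

# compute the average final grade:
avg_grade = sum(final_grades) / len(final_grades)
print(avg_grade)
```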