Commit

Separate XML from HTML
s2t2 committed Jul 22, 2024
1 parent 8844c84 commit bbc43d4
Showing 3 changed files with 295 additions and 70 deletions.
4 changes: 4 additions & 0 deletions docs/_quarto.yml
@@ -182,6 +182,8 @@ website:
href: notes/fetching-data/json-data.qmd
- section:
href: notes/fetching-data/csv-data.qmd
- section:
href: notes/fetching-data/xml.qmd
- section:
href: notes/fetching-data/html-web-scraping.qmd
#text: "HTML Data (Web Scraping)"
@@ -287,6 +289,8 @@ format:
code-fold: false #show
#code-line-numbers: true
toc: true
#toc-depth: 3 # specify the number of section levels to include in the table of contents
#toc-expand: 3 # specify how much of the table of contents to show initially (defaults to 1 with auto-expansion as the user scrolls)
#toc-location: left
#number-sections: false
#number-depth: 1
248 changes: 178 additions & 70 deletions docs/notes/fetching-data/html-web-scraping.qmd
@@ -2,83 +2,68 @@
format:
html:
code-fold: false
#toc: true
#toc-depth: 4
#toc-expand: 5
jupyter: python3
execute:
cache: true # re-render only when source changes
---

# Fetching HTML Data

If the data you want to fetch is in HTML format, like most web pages, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.

Before processing HTML-formatted data, it is important to first review [Basic HTML](https://www.w3schools.com/html/html_basic.asp), [HTML Lists](https://www.w3schools.com/html/html_lists.asp), and [HTML Tables](https://www.w3schools.com/html/html_tables.asp).


## HTML Lists

Let's consider this "my_lists.html" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML List Parsing Exercise</title>
</head>
<body>
<h1>HTML List Parsing Exercise</h1>

<p>This is an HTML page.</p>

<h2>Favorite Ice cream Flavors</h2>
<ol id="my-fav-flavors">
<li>Vanilla Bean</li>
<li>Chocolate</li>
<li>Strawberry</li>
</ol>

<h2>Skills</h2>
<ul id="my-skills">
<li class="skill">HTML</li>
<li class="skill">CSS</li>
<li class="skill">JavaScript</li>
<li class="skill">Python</li>
</ul>
</body>
</html>
```

First we note the URL of where the data or webpage resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):

```{python}
import requests
# the URL of some HTML data or web page stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_lists.html"
response = requests.get(request_url)
print(type(response))
```

Then we pass the response text (an HTML-formatted string) to the `BeautifulSoup` class constructor.

```{python}
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text)
type(soup)
```

The soup object is able to intelligently process the data. We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes.
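For instance, here is a minimal, self-contained sketch (using an inline HTML string rather than a fetched page, with hypothetical element names) of the difference between `find`, which returns the first match, and `find_all`, which returns a list of all matches:

```python
from bs4 import BeautifulSoup

html = """
<p id="intro">Hello</p>
<p class="note">First note</p>
<p class="note">Second note</p>
"""
soup = BeautifulSoup(html, "html.parser")

intro = soup.find("p", id="intro")   # the first <p> element with this identifier
notes = soup.find_all("p", "note")   # a list of all <p> elements with this class

print(intro.text)
print([p.text for p in notes])
```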

### Finding Elements by Identifier

Since the example HTML contains an ordered list (`ol` element) with a unique identifier of "my-fav-flavors", we can use the following code to access it:


```{python}
# get first <ol> element that has a given identifier of "my-fav-flavors":
ul = soup.find("ol", id="my-fav-flavors")
print(type(ul))
ul
```

```{python}
# get all child <li> elements from that list:
flavors = ul.find_all("li")
print(type(flavors))
print(len(flavors))
flavors
```


We can then loop through the list items, and access the text contents of each:

```{python}
for li in flavors:
    print("-----------")
    print(type(li))
    print(li.text)
```

### Finding Elements by Class

Since the example HTML contains an unordered list (`ul` element) of skills, where each list item shares the same class of "skill", we can use the following code to access the list items directly:

```{python}
# get all <li> elements that have a given class of "skill"
skills = soup.find_all("li", "skill")
print(type(skills))
print(len(skills))
skills
```

```{python}
for li in skills:
    print("-----------")
    print(type(li))
    print(li.text)
```
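As an aside, the same elements can also be matched with CSS selectors, via the soup's `select` method. A minimal sketch, using an inline HTML string that mirrors the skills list above:

```python
from bs4 import BeautifulSoup

html = """
<ul id="my-skills">
    <li class="skill">HTML</li>
    <li class="skill">CSS</li>
    <li class="skill">JavaScript</li>
    <li class="skill">Python</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "li.skill" is the CSS selector equivalent of find_all("li", "skill"):
skills = soup.select("li.skill")
print([li.text for li in skills])
```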


## HTML Tables

Let's consider this "my_tables.html" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML Table Parsing Exercise</title>
</head>
<body>
<h1>HTML Table Parsing Exercise</h1>

<p>This is an HTML page.</p>

<h2>Products</h2>

<table id="products">
<tr>
<th>Id</th>
<th>Name</th>
<th>Price</th>
</tr>
<tr>
<td>1</td>
<td>Chocolate Sandwich Cookies</td>
<td>3.50</td>
</tr>
<tr>
<td>2</td>
<td>All-Seasons Salt</td>
<td>4.99</td>
</tr>
<tr>
<td>3</td>
<td>Robust Golden Unsweetened Oolong Tea</td>
<td>2.49</td>
</tr>
</table>
</body>
</html>
```

We repeat the process of fetching this data, as previously exemplified:


```{python}
import requests
from bs4 import BeautifulSoup
# the URL of some HTML data or web page stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_tables.html"
response = requests.get(request_url)
soup = BeautifulSoup(response.text)
type(soup)
```

Since the example HTML contains a `table` element with a unique identifier of "products", we can use the following code to access it:


```{python}
# get first <table> element that has a given identifier of "products":
table = soup.find("table", id="products")
print(type(table))
table
```

```{python}
# get all child <tr> elements from that table:
rows = table.find_all("tr")
print(type(rows))
print(len(rows))
rows
```

This gives us a list of the rows, where the first is the header row. We can then loop through the rows, skipping the header:

```{python}
for tr in rows:
cells = tr.find_all("td") # skip header row, which contains <th> elements instead
if any(cells):
print("-----------")
# makes assumptions about the order of the cells:
product_id = cells[0].text
product_name = cells[1].text
product_price = cells[2].text
print(product_id, product_name, product_price)
```
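The parsed cells can then be collected into a more convenient Python structure, such as a list of dictionaries. A self-contained sketch, using an inline HTML string that mirrors the products table above (and assuming the id and price columns can be converted to numbers):

```python
from bs4 import BeautifulSoup

html = """
<table id="products">
    <tr><th>Id</th><th>Name</th><th>Price</th></tr>
    <tr><td>1</td><td>Chocolate Sandwich Cookies</td><td>3.50</td></tr>
    <tr><td>2</td><td>All-Seasons Salt</td><td>4.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for tr in soup.find("table", id="products").find_all("tr"):
    cells = tr.find_all("td")
    if any(cells):  # skip the header row, which has <th> cells only
        products.append({
            "id": int(cells[0].text),
            "name": cells[1].text,
            "price": float(cells[2].text),
        })

print(products)
```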
