Commit

Separate XML from HTML
s2t2 committed Jul 22, 2024
1 parent 8844c84 commit bbc43d4
Showing 3 changed files with 295 additions and 70 deletions.
4 changes: 4 additions & 0 deletions docs/_quarto.yml
@@ -182,6 +182,8 @@ website:
href: notes/fetching-data/json-data.qmd
- section:
href: notes/fetching-data/csv-data.qmd
- section:
href: notes/fetching-data/xml.qmd
- section:
href: notes/fetching-data/html-web-scraping.qmd
#text: "HTML Data (Web Scraping)"
@@ -287,6 +289,8 @@ format:
code-fold: false #show
#code-line-numbers: true
toc: true
#toc-depth: 3 # specify the number of section levels to include in the table of contents
#toc-expand: 3 # specify how much of the table of contents to show initially (defaults to 1 with auto-expansion as the user scrolls)
#toc-location: left
#number-sections: false
#number-depth: 1
248 changes: 178 additions & 70 deletions docs/notes/fetching-data/html-web-scraping.qmd
@@ -2,83 +2,68 @@
format:
html:
code-fold: false
#toc: true
#toc-depth: 4
#toc-expand: 5
jupyter: python3
execute:
cache: true # re-render only when source changes
---

# Fetching HTML Data

If the data you want to fetch is in HTML format, like most web pages, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.

Before processing HTML-formatted data, it is important to first review [Basic HTML](https://www.w3schools.com/html/html_basic.asp), [HTML Lists](https://www.w3schools.com/html/html_lists.asp), and [HTML Tables](https://www.w3schools.com/html/html_tables.asp).


## HTML Lists

Let's consider this "my_lists.html" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML List Parsing Exercise</title>
</head>
<body>
<h1>HTML List Parsing Exercise</h1>

<p>This is an HTML page.</p>

<h2>Favorite Ice cream Flavors</h2>
<ol id="my-fav-flavors">
<li>Vanilla Bean</li>
<li>Chocolate</li>
<li>Strawberry</li>
</ol>

<h2>Skills</h2>
<ul id="my-skills">
<li class="skill">HTML</li>
<li class="skill">CSS</li>
<li class="skill">JavaScript</li>
<li class="skill">Python</li>
</ul>
</body>
</html>
```

First we note the URL of where the data or webpage resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):

```{python}
import requests
# the URL of some HTML data or web page stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_lists.html"
response = requests.get(request_url)
print(type(response))
```

Then we pass the response text (an HTML-formatted string) to the `BeautifulSoup` class constructor.

```{python}
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text)
type(soup)
```

The soup object is able to intelligently process the data. We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes.
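For instance, here is a minimal, self-contained sketch (using an inline HTML string rather than a fetched page, with hypothetical element names) of the difference between `find`, which returns the first match, and `find_all`, which returns a list of all matches:

```python
from bs4 import BeautifulSoup

html = """
<p id="intro">Hello</p>
<p class="note">First note</p>
<p class="note">Second note</p>
"""
soup = BeautifulSoup(html, "html.parser")

intro = soup.find("p", id="intro")   # the first <p> element with this identifier
notes = soup.find_all("p", "note")   # a list of all <p> elements with this class

print(intro.text)
print([p.text for p in notes])
```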

### Finding Elements by Identifier

Since the example HTML contains an ordered list (`ol` element) with a unique identifier of "my-fav-flavors", we can use the following code to access it:


```{python}
# get first <ol> element that has a given identifier of "my-fav-flavors":
ul = soup.find("ol", id="my-fav-flavors")
print(type(ul))
ul
```

```{python}
# get all child <li> elements from that list:
flavors = ul.find_all("li")
print(type(flavors))
print(len(flavors))
flavors
```


We can then loop through the list items, and access the text contents of each:

```{python}
for li in flavors:
    print("-----------")
    print(type(li))
    print(li.text)
```

### Finding Elements by Class

Since the example HTML contains an unordered list (`ul` element) of skills, where each list item shares the same class of "skill", we can use the following code to access the list items directly:

```{python}
# get all <li> elements that have a given class of "skill"
skills = soup.find_all("li", "skill")
print(type(skills))
print(len(skills))
skills
```

```{python}
for li in skills:
    print("-----------")
    print(type(li))
    print(li.text)
```
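As an aside, the same elements can also be matched with CSS selectors, via the soup's `select` method. A minimal sketch, using an inline HTML string that mirrors the skills list above:

```python
from bs4 import BeautifulSoup

html = """
<ul id="my-skills">
    <li class="skill">HTML</li>
    <li class="skill">CSS</li>
    <li class="skill">JavaScript</li>
    <li class="skill">Python</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "li.skill" is the CSS selector equivalent of find_all("li", "skill"):
skills = soup.select("li.skill")
print([li.text for li in skills])
```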


## HTML Tables

Let's consider this "my_tables.html" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>HTML Table Parsing Exercise</title>
</head>
<body>
<h1>HTML Table Parsing Exercise</h1>

<p>This is an HTML page.</p>

<h2>Products</h2>

<table id="products">
<tr>
<th>Id</th>
<th>Name</th>
<th>Price</th>
</tr>
<tr>
<td>1</td>
<td>Chocolate Sandwich Cookies</td>
<td>3.50</td>
</tr>
<tr>
<td>2</td>
<td>All-Seasons Salt</td>
<td>4.99</td>
</tr>
<tr>
<td>3</td>
<td>Robust Golden Unsweetened Oolong Tea</td>
<td>2.49</td>
</tr>
</table>
</body>
</html>
```

We repeat the process of fetching this data, as previously exemplified:


```{python}
import requests
from bs4 import BeautifulSoup
# the URL of some HTML data or web page stored online:
request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_tables.html"
response = requests.get(request_url)
soup = BeautifulSoup(response.text)
type(soup)
```

Since the example HTML contains a `table` element with a unique identifier of "products", we can use the following code to access it:


```{python}
# get first <table> element that has a given identifier of "products":
table = soup.find("table", id="products")
print(type(table))
table
```

```{python}
# get all child <tr> elements from that table:
rows = table.find_all("tr")
print(type(rows))
print(len(rows))
rows
```

This gives us a list of the rows, where the first is the header row. We can then loop through the rows, skipping the header:

```{python}
for tr in rows:
cells = tr.find_all("td") # skip header row, which contains <th> elements instead
if any(cells):
print("-----------")
# makes assumptions about the order of the cells:
product_id = cells[0].text
product_name = cells[1].text
product_price = cells[2].text
print(product_id, product_name, product_price)
```
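The parsed cells can then be collected into a more convenient Python structure, such as a list of dictionaries. A self-contained sketch, using an inline HTML string that mirrors the products table above (and assuming the id and price columns can be converted to numbers):

```python
from bs4 import BeautifulSoup

html = """
<table id="products">
    <tr><th>Id</th><th>Name</th><th>Price</th></tr>
    <tr><td>1</td><td>Chocolate Sandwich Cookies</td><td>3.50</td></tr>
    <tr><td>2</td><td>All-Seasons Salt</td><td>4.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

products = []
for tr in soup.find("table", id="products").find_all("tr"):
    cells = tr.find_all("td")
    if any(cells):  # skip the header row, which has <th> cells only
        products.append({
            "id": int(cells[0].text),
            "name": cells[1].text,
            "price": float(cells[2].text),
        })

print(products)
```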
