diff --git a/docs/_quarto.yml b/docs/_quarto.yml
index 794d28f..57d8ae2 100644
--- a/docs/_quarto.yml
+++ b/docs/_quarto.yml
@@ -182,6 +182,8 @@ website:
           href: notes/fetching-data/json-data.qmd
         - section:
           href: notes/fetching-data/csv-data.qmd
+        - section:
+          href: notes/fetching-data/xml.qmd
         - section:
           href: notes/fetching-data/html-web-scraping.qmd
           #text: "HTML Data (Web Scraping)"
@@ -287,6 +289,8 @@ format:
     code-fold: false #show
     #code-line-numbers: true
     toc: true
+    #toc-depth: 3 # specify the number of section levels to include in the table of contents
+    #toc-expand: 3 # specify how much of the table of contents to show initially (defaults to 1, with auto-expansion as the user scrolls)
     #toc-location: left
     #number-sections: false
     #number-depth: 1
diff --git a/docs/notes/fetching-data/html-web-scraping.qmd b/docs/notes/fetching-data/html-web-scraping.qmd
index 8dab422..7ced88e 100644
--- a/docs/notes/fetching-data/html-web-scraping.qmd
+++ b/docs/notes/fetching-data/html-web-scraping.qmd
@@ -2,83 +2,68 @@
 format:
   html:
     code-fold: false
+    #toc: true
+    #toc-depth: 4
+    #toc-expand: 5
 jupyter: python3
 execute:
   cache: true # re-render only when source changes
 ---
 
+# Fetching HTML Data (i.e. "Web Scraping")
-# Fetching HTML Data
-
-If the data you want to fetch is in XML or HTML format, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.
-
-## XML
-
-Let's consider this example \"students.xml\" file we have hosted on the Internet:
-
-```xml
-<gradebook>
-    <date>2018-06-05</date>
-    <courseid>123</courseid>
-    <students>
-        <student>
-            <studentid>1</studentid>
-            <finalgrade>76.7</finalgrade>
-        </student>
-        <student>
-            <studentid>2</studentid>
-            <finalgrade>85.1</finalgrade>
-        </student>
-        <student>
-            <studentid>3</studentid>
-            <finalgrade>50.3</finalgrade>
-        </student>
-        <student>
-            <studentid>4</studentid>
-            <finalgrade>89.8</finalgrade>
-        </student>
-        <student>
-            <studentid>5</studentid>
-            <finalgrade>97.4</finalgrade>
-        </student>
-        <student>
-            <studentid>6</studentid>
-            <finalgrade>75.5</finalgrade>
-        </student>
-        <student>
-            <studentid>7</studentid>
-            <finalgrade>87.2</finalgrade>
-        </student>
-        <student>
-            <studentid>8</studentid>
-            <finalgrade>88.0</finalgrade>
-        </student>
-        <student>
-            <studentid>9</studentid>
-            <finalgrade>93.9</finalgrade>
-        </student>
-        <student>
-            <studentid>10</studentid>
-            <finalgrade>92.5</finalgrade>
-        </student>
-    </students>
-</gradebook>
-```
+If the data you want to fetch is in HTML format, like most web pages, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.
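Before pointing these two packages at a live URL, it can help to sanity-check the parsing side in isolation. Here is a minimal sketch (the inline HTML string is made up for illustration, and it assumes the `beautifulsoup4` package is installed):

```python
from bs4 import BeautifulSoup

# parse a tiny inline HTML snippet (no network request needed);
# "html.parser" is the parser that ships with Python, so nothing extra to install:
html = '<p class="greeting">Hello, web!</p>'
soup = BeautifulSoup(html, "html.parser")

# find the first <p> element that has a class of "greeting":
paragraph = soup.find("p", "greeting")
print(paragraph.text) # prints: Hello, web!
```

The same `find` call behaves identically on HTML fetched over the network.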
+
+Before moving on to process HTML formatted data, it will be important to first review [Basic HTML](https://www.w3schools.com/html/html_basic.asp), [HTML Lists](https://www.w3schools.com/html/html_lists.asp), and [HTML Tables](https://www.w3schools.com/html/html_tables.asp).
+
+
+## HTML Lists
+
+Let's consider this \"my_lists.html\" file we have hosted on the Internet, which is a simplified web page containing a few HTML list elements:
+
+```html
+<html>
+    <head>
+        <title>HTML List Parsing Exercise</title>
+    </head>
+    <body>
+        <h1>HTML List Parsing Exercise</h1>
+
+        <p>This is an HTML page.</p>
+
+        <h2>Favorite Ice cream Flavors</h2>
+        <ol id="my-fav-flavors">
+            <li>Vanilla Bean</li>
+            <li>Chocolate</li>
+            <li>Strawberry</li>
+        </ol>
+
+        <h2>Skills</h2>
+        <ul>
+            <li class="skill">Python</li>
+            <li class="skill">SQL</li>
+            <li class="skill">Tableau</li>
+        </ul>
+    </body>
+</html>
+```
 
-First we note the URL of where the data resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):
+First we note the URL of where the data or webpage resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):
 
 ```{python}
 import requests
 
-# the URL of some CSV data we stored online:
-request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/gradebook.xml"
+# the URL of some HTML data or web page stored online:
+request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_lists.html"
 
 response = requests.get(request_url)
 print(type(response))
 ```
 
-Then we pass the response text (an HTML or XML formatted string) to the `BeautifulSoup` class constructor.
+Then we pass the response text (an HTML formatted string) to the `BeautifulSoup` class constructor.
 
 ```{python}
 from bs4 import BeautifulSoup
 
@@ -87,23 +72,146 @@
 soup = BeautifulSoup(response.text)
 type(soup)
 ```
 
-The soup object is able to intelligently process the data.
+The soup object is able to intelligently process the data. We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes.
+
+### Finding Elements by Identifier
+
+Since the example HTML contains an ordered list (`ol` element) with a unique identifier of \"my-fav-flavors\", we can use the following code to access it:
+
+```{python}
+# get the first <ol> element that has a given identifier of "my-fav-flavors":
+ul = soup.find("ol", id="my-fav-flavors")
+print(type(ul))
+ul
+```
+
+```{python}
+# get all child <li> elements from that list:
+flavors = ul.find_all("li")
+print(type(flavors))
+print(len(flavors))
+flavors
+```
+
+```{python}
+for li in flavors:
+    print("-----------")
+    print(type(li))
+    print(li.text)
+```
+
+### Finding Elements by Class
+
+Since the example HTML contains an unordered list (`ul` element) of skills, where each list item shares the same class of \"skill\", we can use the following code to access the list items directly:
+
+```{python}
+# get all <li> elements that have a given class of "skill":
+skills = soup.find_all("li", "skill")
+print(type(skills))
+print(len(skills))
+skills
+```
+
+```{python}
+for li in skills:
+    print("-----------")
+    print(type(li))
+    print(li.text)
+```
+
+
+## HTML Tables
+
+Let's consider this \"my_tables.html\" file we have hosted on the Internet, which is a simplified web page containing an HTML table element:
+
+```html
+<html>
+    <head>
+        <title>HTML Table Parsing Exercise</title>
+    </head>
+    <body>
+        <h1>HTML Table Parsing Exercise</h1>
+
+        <p>This is an HTML page.</p>
+
+        <h2>Products</h2>
+        <table id="products">
+            <tr>
+                <th>Id</th>
+                <th>Name</th>
+                <th>Price</th>
+            </tr>
+            <tr>
+                <td>1</td>
+                <td>Chocolate Sandwich Cookies</td>
+                <td>3.50</td>
+            </tr>
+            <tr>
+                <td>2</td>
+                <td>All-Seasons Salt</td>
+                <td>4.99</td>
+            </tr>
+            <tr>
+                <td>3</td>
+                <td>Robust Golden Unsweetened Oolong Tea</td>
+                <td>2.49</td>
+            </tr>
+        </table>
+    </body>
+</html>
+```
+
+We repeat the process of fetching this data, as previously exemplified:
+
+```{python}
+import requests
+from bs4 import BeautifulSoup
+
+# the URL of some HTML data or web page stored online:
+request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/my_tables.html"
+
+response = requests.get(request_url)
+
+soup = BeautifulSoup(response.text)
+type(soup)
+```
+
+Since the example HTML contains a `table` element with a unique identifier of \"products\", we can use the following code to access it:
+
+```{python}
+# get the first <table> element that has a given identifier of "products":
+table = soup.find("table", id="products")
+print(type(table))
+table
+```
+
+```{python}
+# get all child <tr> elements from that table:
+rows = table.find_all("tr")
+print(type(rows))
+print(len(rows))
+rows
+```
+
+This gets us a list of the rows, where the first is the header row. We can then loop through the rows, ignoring the header row:
+
+```{python}
+for tr in rows:
+    cells = tr.find_all("td") # skip the header row, which contains <th> elements instead
+    if any(cells):
+        print("-----------")
+        # this makes assumptions about the order of the cells:
+        product_id = cells[0].text
+        product_name = cells[1].text
+        product_price = cells[2].text
+        print(product_id, product_name, product_price)
+```
diff --git a/docs/notes/fetching-data/xml.qmd b/docs/notes/fetching-data/xml.qmd
new file mode 100644
index 0000000..b6f7f37
--- /dev/null
+++ b/docs/notes/fetching-data/xml.qmd
@@ -0,0 +1,113 @@
+---
+format:
+  html:
+    code-fold: false
+jupyter: python3
+execute:
+  cache: true # re-render only when source changes
+---
+
+
+# Fetching XML Data
+
+If the data you want to fetch is in XML format, including in an RSS feed, we can use the `requests` package to fetch it, and the `beautifulsoup4` package to process it.
+
+Let's consider this example \"students.xml\" file we have hosted on the Internet:
+
+```xml
+<gradebook>
+    <date>2018-06-05</date>
+    <courseid>123</courseid>
+    <students>
+        <student>
+            <studentid>1</studentid>
+            <finalgrade>76.7</finalgrade>
+        </student>
+        <student>
+            <studentid>2</studentid>
+            <finalgrade>85.1</finalgrade>
+        </student>
+        <student>
+            <studentid>3</studentid>
+            <finalgrade>50.3</finalgrade>
+        </student>
+        <student>
+            <studentid>4</studentid>
+            <finalgrade>89.8</finalgrade>
+        </student>
+        <student>
+            <studentid>5</studentid>
+            <finalgrade>97.4</finalgrade>
+        </student>
+        <student>
+            <studentid>6</studentid>
+            <finalgrade>75.5</finalgrade>
+        </student>
+        <student>
+            <studentid>7</studentid>
+            <finalgrade>87.2</finalgrade>
+        </student>
+        <student>
+            <studentid>8</studentid>
+            <finalgrade>88.0</finalgrade>
+        </student>
+        <student>
+            <studentid>9</studentid>
+            <finalgrade>93.9</finalgrade>
+        </student>
+        <student>
+            <studentid>10</studentid>
+            <finalgrade>92.5</finalgrade>
+        </student>
+    </students>
+</gradebook>
+```
+
+First we note the URL of where the data resides. Then we pass that as a parameter to the `get` function from the `requests` package, to issue an HTTP GET request (as usual):
+
+```{python}
+import requests
+
+# the URL of some XML data we stored online:
+request_url = "https://raw.githubusercontent.com/prof-rossetti/python-for-finance/main/docs/data/gradebook.xml"
+
+response = requests.get(request_url)
+print(type(response))
+```
+
+Then we pass the response text (an XML formatted string) to the `BeautifulSoup` class constructor.
+
+```{python}
+from bs4 import BeautifulSoup
+
+soup = BeautifulSoup(response.text)
+type(soup)
+```
+
+The soup object is able to intelligently process the data.
+
+We can invoke a `find` or `find_all` method on the soup object to find elements or tags based on their names or other attributes.
+For example, finding all the student tags in this structure:
+
+```{python}
+students = soup.find_all("student")
+print(type(students))
+print(len(students))
+```
+
+```{python}
+# examining the first item for reference:
+print(type(students[0]))
+students[0]
+```
+
+```{python}
+# looping through all the items:
+for student in students:
+    print("-----------")
+    print(type(student))
+    student_id = student.studentid.text
+    final_grade = student.finalgrade.text
+    print(student_id, final_grade)
+```
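The steps above fetch the XML over the network. For offline experimentation, the same tag-based access works on an inline string. Here is a minimal sketch using an excerpt of the gradebook data shown above (first three students only, and assuming `beautifulsoup4` is installed), which also averages the grades:

```python
from bs4 import BeautifulSoup

# an inline excerpt of the XML shown above (first three students only):
xml_str = """
<students>
    <student><studentid>1</studentid><finalgrade>76.7</finalgrade></student>
    <student><studentid>2</studentid><finalgrade>85.1</finalgrade></student>
    <student><studentid>3</studentid><finalgrade>50.3</finalgrade></student>
</students>
"""

soup = BeautifulSoup(xml_str, "html.parser") # the built-in parser is fine for lowercase tags

# collect each student's grade, then average them:
grades = [float(student.finalgrade.text) for student in soup.find_all("student")]
avg_grade = sum(grades) / len(grades)
print(round(avg_grade, 2)) # prints: 70.7
```

This confirms the tag names (`student`, `finalgrade`) behave like attributes on each soup element, just as in the loop above.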