MonashDataFluency
diff --git a/docs/search/search_index.json b/docs/search/search_index.json
Lines changed: 1 addition & 1 deletion
diff --git a/docs/section-0-brief-python-refresher.md b/docs/section-0-brief-python-refresher.md
Lines changed: 968 additions & 0 deletions
diff --git a/docs/section-0-brief-python-refresher/index.html b/docs/section-0-brief-python-refresher/index.html
Lines changed: 97 additions & 91 deletions
diff --git a/docs/section-1-intro-to-web-scraping.md b/docs/section-1-intro-to-web-scraping.md
Lines changed: 110 additions & 0 deletions
diff --git a/docs/section-1-intro-to-web-scraping/index.html b/docs/section-1-intro-to-web-scraping/index.html
Lines changed: 100 additions & 20 deletions
@@ -0,0 +1,110 @@
+## Introduction to Web Scraping
+
+### What is web scraping?
+---
+
+Web scraping is a technique for extracting information from websites. This can be done manually but it is usually faster, more efficient and less error-prone to automate the task.
+
+Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.
+
+Scraping is about more than just acquiring data: it can also help you archive data and track changes to data online.
+
+It is closely related to the practice of web indexing, which is what search engines like Google do when mass-analysing the Web to build their indices. But contrary to web indexing, which typically parses the entire content of a web page to make it searchable, web scraping targets specific information on the pages visited.
+
+For example, online stores will often scour the publicly available pages of their competitors, scrape item prices, and then use this information to adjust their own prices. Another common practice is “contact scraping” in which personal information like email addresses or phone numbers is collected for marketing purposes.
+
+### Why do we need it as a skill?
+---
+
+
+Web scraping is increasingly being used by academics and researchers to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis.
+
+### When do we need scraping?
+---
+
+As useful as scraping is, there might be better options for the task. Choose the right (i.e. the easiest) tool for the job.
+
+- Check whether or not you can easily copy and paste data from a site into Excel or Google Sheets. This might be quicker than scraping.
+- Check if the site or service already provides an API to extract structured data. If it does, that will be a much more efficient and effective pathway. Good examples are the Facebook API, the Twitter APIs or the YouTube comments API.
+- For much larger needs, Freedom of Information (FOI) requests can be useful. Be specific about the formats required for the data you want.
+
+### Structured vs unstructured data
+---
+
+When presented with information, human beings are good at quickly categorizing it and extracting the data that they are interested in. For example, when we look at a magazine rack, provided the titles are written in a script that we are able to read, we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, etc. We can probably also easily organize them by topic, recognize those that are aimed at children, or even tell whether they lean toward a particular end of the political spectrum. Computers have a much harder time making sense of such unstructured data unless we specifically tell them what elements the data is made of, for example by adding labels such as “this is the title of this magazine” or “this is a magazine about food”. Data in which individual elements are separated and labelled is said to be structured.
+
+Refer to the file `fortune_500_basic_example.html`.
+
+We see that this data has been structured for display purposes (it is arranged in rows inside a table) but the different elements of information are not clearly labelled.
+
+What if we wanted to download this dataset and, for example, compare the revenues of these companies against each other or across the industries they work in? We could try copy-pasting the entire table into a spreadsheet or even manually copy-pasting the names and websites into another document, but this can quickly become impractical when faced with a large set of data. What if we wanted to collect this information for every company listed?
+
+Fortunately, there are tools to automate at least part of the process. This technique is called web scraping.
+
+> Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. (Source: Wikipedia)
+
+Web scraping typically targets one website at a time to extract unstructured information and put it in a structured form for reuse.
+
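As a rough illustration of turning display-oriented markup into a structured form, the sketch below reads a small HTML table and writes it out as CSV using only the Python standard library. The table contents are placeholders invented for this sketch (they are not taken from `fortune_500_basic_example.html`), and `TableRows` is a hypothetical helper name.

```python
import csv
import io
from html.parser import HTMLParser

# Placeholder table invented for this sketch; it is not the course data file.
TABLE_HTML = """
<table>
  <tr><th>Company</th><th>Website</th></tr>
  <tr><td>Example Co</td><td>example.com</td></tr>
  <tr><td>Sample Ltd</td><td>example.org</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Gather the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per table row
        self.in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self.in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableRows()
parser.feed(TABLE_HTML)
parser.close()

# Write the rows to an in-memory CSV "file" so the sketch touches no disk.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Real pages are rarely this tidy, so a dedicated scraping library is usually a better fit than a hand-written parser; the point here is only the shape of the transformation, from rows of markup to labelled columns.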
+In this lesson, we will continue exploring the examples above and try different techniques to extract the information they contain. But before we launch into web scraping proper, we need to look a bit closer at how information is organized within an HTML document and how to build queries to access a specific subset of that information.
+
+#### What is HTML?
+- HTML stands for **HyperText Markup Language**
+- It is the standard markup language for webpages.
+- An HTML document contains a series of elements that make up a webpage; webpages can link to one another, and a set of linked webpages forms a website.
+- HTML elements are written as tags, which tell the web browser how to display the content.
+
+A sample raw HTML file is shown below:
+
+```html
+<!DOCTYPE html>
+<html>
+
+<head>
+<title>Page Title</title>
+</head>
+
+<body>
+
+<h1>My First Heading</h1>
+<p>My first paragraph.</p>
+
+</body>
+
+</html>
+```
+
+HTML elements correspond to content displayed by the web browser. The following image shows the HTML code alongside the webpage it generates (please refer to `intro_html_example.html`).
+![intro_html_example](../images/html.png)
+
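To make the idea of tags concrete, here is a minimal sketch that reads the sample page above with Python's built-in `html.parser` module and reports which tag each piece of text sits in. `TagTextCollector` is a hypothetical helper written for this sketch, and it assumes every opening tag has a matching closing tag, as in the sample.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
"""


class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.found = []      # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            self.found.append((self.open_tags[-1], text))


collector = TagTextCollector()
collector.feed(SAMPLE_HTML)
collector.close()
for tag, text in collector.found:
    print(f"{tag}: {text}")
```

The tag names are the only labels the page provides: `h1` says "this is a heading", but nothing says what the heading is about.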
+#### What is XML?
+- XML stands for **eXtensible Markup Language**
+- XML is a markup language much like HTML
+- XML was designed to store and transport data
+- XML was designed to be self-descriptive
+
+```xml
+<note>
+  <date>2015-09-01</date>
+  <hour>08:30</hour>
+  <to>Tove</to>
+  <from>Jani</from>
+  <body>Don't forget me this weekend!</body>
+</note>
+```
+
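Because every value in the note above is wrapped in a descriptive tag, a program can pick fields out by name. The following sketch uses the standard library's `xml.etree.ElementTree` to do so; the variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
  <date>2015-09-01</date>
  <hour>08:30</hour>
  <to>Tove</to>
  <from>Jani</from>
  <body>Don't forget me this weekend!</body>
</note>"""

# Parse the text into an element tree and take its root element (<note>).
root = ET.fromstring(NOTE_XML)

# Each child tag name becomes a key, and its text content becomes the value.
note = {child.tag: child.text for child in root}

print(root.tag)
print(note["from"], "->", note["to"])
print(note["body"])
```

This is what "self-descriptive" buys us: no guessing about which value is the sender and which is the recipient.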
+
+
+### DOM (Document Object Model)
+
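+The DOM is the underlying structure of any webpage: a tree that the browser builds from the page's HTML, in which each element, attribute and piece of text becomes a node.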
+
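The tree shape can be previewed outside the browser. The sketch below is an approximation under stated assumptions: it uses `xml.dom.minidom`, which is an XML parser rather than a browser's HTML parser, so it only works because the sample markup is well-formed, and the `<!DOCTYPE html>` line is left out for simplicity. `outline` is a hypothetical helper name.

```python
from xml.dom import minidom

# The sample page from earlier, written on one logical line without the DOCTYPE.
PAGE = (
    "<html>"
    "<head><title>Page Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body>"
    "</html>"
)


def outline(node, depth=0):
    """Return indented lines describing the element tree under node."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        # Text nodes are leaves; show their content in quotes.
        lines.append("  " * depth + repr(node.data.strip()))
    return lines


document = minidom.parseString(PAGE)
tree_lines = outline(document.documentElement)
print("\n".join(tree_lines))
```

The indentation mirrors the nesting that a browser's DOM inspector shows: `title` is a child of `head`, which is a child of `html`.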
+#### DOM inspector : `F12` to the rescue!
+
+
+```python
+
+```
+
+### References
+
+- The image in the “What is HTML?” section is taken from https://www.w3schools.com/html/
@@ -91,7 +91,7 @@
     <input class="md-toggle" data-md-toggle="search" type="checkbox" id="__search" autocomplete="off">
     <label class="md-overlay" data-md-component="overlay" for="__drawer"></label>
 
-      <a href="#what-is-web-scraping" tabindex="0" class="md-skip">
+      <a href="#introduction-to-web-scraping" tabindex="0" class="md-skip">
         Skip to content
       </a>
 
@@ -361,13 +361,20 @@
     <ul class="md-nav__list" data-md-scrollfix>
 
         <li class="md-nav__item">
-  <a href="#what-is-web-scraping" class="md-nav__link">
-    What is web scraping?
+  <a href="#introduction-to-web-scraping" class="md-nav__link">
+    Introduction to Web Scraping
   </a>
 
     <nav class="md-nav">
       <ul class="md-nav__list">
 
+          <li class="md-nav__item">
+  <a href="#what-is-web-scraping" class="md-nav__link">
+    What is web scraping?
+  </a>
+  
+</li>
+        
           <li class="md-nav__item">
   <a href="#why-do-we-need-it-as-a-skill" class="md-nav__link">
     Why do we need it as a skill?
@@ -407,6 +414,33 @@
       </ul>
     </nav>
 
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#dom-document-object-model" class="md-nav__link">
+    DOM (Document Object Model)
+  </a>
+  
+    <nav class="md-nav">
+      <ul class="md-nav__list">
+        
+          <li class="md-nav__item">
+  <a href="#dom-inspector-f12-to-the-rescue" class="md-nav__link">
+    DOM inspector : F12 to the rescue!
+  </a>
+  
+</li>
+        
+      </ul>
+    </nav>
+  
+</li>
+        
+          <li class="md-nav__item">
+  <a href="#references" class="md-nav__link">
+    References
+  </a>
+  
 </li>
 
       </ul>
@@ -433,22 +467,27 @@
 
                   <h1>Section 1 intro to web scraping</h1>
 
-                <h2 id="what-is-web-scraping">What is web scraping?<a class="headerlink" href="#what-is-web-scraping" title="Permanent link">&para;</a></h2>
+                <h2 id="introduction-to-web-scraping">Introduction to Web Scraping<a class="headerlink" href="#introduction-to-web-scraping" title="Permanent link">&para;</a></h2>
+<h3 id="what-is-web-scraping">What is web scraping?<a class="headerlink" href="#what-is-web-scraping" title="Permanent link">&para;</a></h3>
+<hr />
 <p>Web scraping is a technique for extracting information from websites. This can be done manually but it is usually faster, more efficient and less error-prone to automate the task.</p>
 <p>Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.</p>
 <p>Scraping is about more than just acquiring data: it can also help you archive data and track changes to data online.</p>
 <p>It is closely related to the practice of web indexing, which is what search engines like Google do when mass-analysing the Web to build their indices. But contrary to web indexing, which typically parses the entire content of a web page to make it searchable, web scraping targets specific information on the pages visited.</p>
 <p>For example, online stores will often scour the publicly available pages of their competitors, scrape item prices, and then use this information to adjust their own prices. Another common practice is “contact scraping” in which personal information like email addresses or phone numbers is collected for marketing purposes.</p>
 <h3 id="why-do-we-need-it-as-a-skill">Why do we need it as a skill?<a class="headerlink" href="#why-do-we-need-it-as-a-skill" title="Permanent link">&para;</a></h3>
-<p>Web scraping is increasingly being used by scholars to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis.</p>
+<hr />
+<p>Web scraping is increasingly being used by academics and researchers to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis.</p>
 <h3 id="when-do-we-need-scraping">When do we need scraping?<a class="headerlink" href="#when-do-we-need-scraping" title="Permanent link">&para;</a></h3>
+<hr />
 <p>As useful as scraping is, there might be better options for the task. Choose the right (i.e. the easiest) tool for the job.</p>
 <ul>
 <li>Check whether or not you can easily copy and paste data from a site into Excel or Google Sheets. This might be quicker than scraping.</li>
 <li>Check if the site or service already provides an API to extract structured data. If it does, that will be a much more efficient and effective pathway. Good examples are the Facebook API, the Twitter APIs or the YouTube comments API.</li>
 <li>For much larger needs, Freedom of Information (FOI) requests can be useful. Be specific about the formats required for the data you want.</li>
 </ul>
 <h3 id="structured-vs-unstructured-data">Structured vs unstructured data<a class="headerlink" href="#structured-vs-unstructured-data" title="Permanent link">&para;</a></h3>
+<hr />
 <p>When presented with information, human beings are good at quickly categorizing it and extracting the data that they are interested in. For example, when we look at a magazine rack, provided the titles are written in a script that we are able to read, we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, etc. We can probably also easily organize them by topic, recognize those that are aimed at children, or even tell whether they lean toward a particular end of the political spectrum. Computers have a much harder time making sense of such unstructured data unless we specifically tell them what elements the data is made of, for example by adding labels such as “this is the title of this magazine” or “this is a magazine about food”. Data in which individual elements are separated and labelled is said to be structured.</p>
 <p>Refer to the file <code>fortune_500_basic_example.html</code>.</p>
 <p>We see that this data has been structured for display purposes (it is arranged in rows inside a table) but the different elements of information are not clearly labelled.</p>
@@ -459,17 +498,51 @@ <h3 id="structured-vs-unstructured-data">Structured vs unstructured data<a class
 </blockquote>
 <p>Web scraping typically targets one website at a time to extract unstructured information and put it in a structured form for reuse.</p>
 <p>In this lesson, we will continue exploring the examples above and try different techniques to extract the information they contain. But before we launch into web scraping proper, we need to look a bit closer at how information is organized within an HTML document and how to build queries to access a specific subset of that information.</p>
-<p>Look at a basic html file in <code>intro_html_example.html</code>.</p>
-<p><img alt="png" src="wrangling-and-analysis_files/intro_html_structure.png" /></p>
 <h4 id="what-is-html">What is HTML?<a class="headerlink" href="#what-is-html" title="Permanent link">&para;</a></h4>
-<p>HTML - HyperText Markup Language</p>
-<p>HTML is the standard markup language for the webpages which make up the internet. HTML contains a series of elements which make up a webpage which can connect with other webpages altogether forming a website. The HTML elements are represented in tags which tell the web browser how to display the web content.</p>
-<p>Every HTML element corresponds to a display content on the web browser. The following image shows the HTML code and the webpage generated.</p>
-<p><img alt="image.png" src="attachment:image.png" /></p>
-<p>This image has been taken from https://www.w3schools.com/html/</p>
+<ul>
+<li>HTML stands for <strong>HyperText Markup Language</strong></li>
+<li>It is the standard markup language for webpages.</li>
+<li>An HTML document contains a series of elements that make up a webpage; webpages can link to one another, and a set of linked webpages forms a website.</li>
+<li>HTML elements are written as tags, which tell the web browser how to display the content.</li>
+</ul>
+<p>A sample raw HTML file is shown below:</p>
+<table class="codehilitetable"><tr><td class="linenos"><div class="linenodiv"><pre> 1
+ 2
+ 3
+ 4
+ 5
+ 6
+ 7
+ 8
+ 9
+10
+11
+12
+13
+14
+15</pre></div></td><td class="code"><div class="codehilite"><pre><span></span><code><span class="cp">&lt;!DOCTYPE html&gt;</span>
+<span class="p">&lt;</span><span class="nt">html</span><span class="p">&gt;</span>
+
+<span class="p">&lt;</span><span class="nt">head</span><span class="p">&gt;</span>
+<span class="p">&lt;</span><span class="nt">title</span><span class="p">&gt;</span>Page Title<span class="p">&lt;/</span><span class="nt">title</span><span class="p">&gt;</span>
+<span class="p">&lt;/</span><span class="nt">head</span><span class="p">&gt;</span>
+
+<span class="p">&lt;</span><span class="nt">body</span><span class="p">&gt;</span>
+
+<span class="p">&lt;</span><span class="nt">h1</span><span class="p">&gt;</span>My First Heading<span class="p">&lt;/</span><span class="nt">h1</span><span class="p">&gt;</span>
+<span class="p">&lt;</span><span class="nt">p</span><span class="p">&gt;</span>My first paragraph.<span class="p">&lt;/</span><span class="nt">p</span><span class="p">&gt;</span>
+
+<span class="p">&lt;/</span><span class="nt">body</span><span class="p">&gt;</span>
+
+<span class="p">&lt;/</span><span class="nt">html</span><span class="p">&gt;</span>
+</code></pre></div>
+</td></tr></table>
+
+<p>HTML elements correspond to content displayed by the web browser. The following image shows the HTML code alongside the webpage it generates (please refer to <code>intro_html_example.html</code>).
+<img alt="intro_html_example" src="../images/html.png" /></p>
 <h4 id="what-is-xml">What is XML?<a class="headerlink" href="#what-is-xml" title="Permanent link">&para;</a></h4>
 <ul>
-<li>XML stands for eXtensible Markup Language</li>
+<li>XML stands for <strong>eXtensible Markup Language</strong></li>
 <li>XML is a markup language much like HTML</li>
 <li>XML was designed to store and transport data</li>
 <li>XML was designed to be self-descriptive</li>
@@ -480,19 +553,26 @@ <h4 id="what-is-xml">What is XML?<a class="headerlink" href="#what-is-xml" title
 4
 5
 6
-7</pre></div></td><td class="code"><div class="codehilite"><pre><span></span><code><span class="p">&lt;</span><span class="nt">note</span><span class="p">&gt;</span>
-  <span class="p">&lt;</span><span class="nt">date</span><span class="p">&gt;</span>2015-09-01<span class="p">&lt;/</span><span class="nt">date</span><span class="p">&gt;</span>
-  <span class="p">&lt;</span><span class="nt">hour</span><span class="p">&gt;</span>08:30<span class="p">&lt;/</span><span class="nt">hour</span><span class="p">&gt;</span>
-  <span class="p">&lt;</span><span class="nt">to</span><span class="p">&gt;</span>Tove<span class="p">&lt;/</span><span class="nt">to</span><span class="p">&gt;</span>
-  <span class="p">&lt;</span><span class="nt">from</span><span class="p">&gt;</span>Jani<span class="p">&lt;/</span><span class="nt">from</span><span class="p">&gt;</span>
-  <span class="p">&lt;</span><span class="nt">body</span><span class="p">&gt;</span>Don&#39;t forget me this weekend!<span class="p">&lt;/</span><span class="nt">body</span><span class="p">&gt;</span>
-<span class="p">&lt;/</span><span class="nt">note</span><span class="p">&gt;</span>
+7</pre></div></td><td class="code"><div class="codehilite"><pre><span></span><code><span class="nt">&lt;note&gt;</span>
+  <span class="nt">&lt;date&gt;</span>2015-09-01<span class="nt">&lt;/date&gt;</span>
+  <span class="nt">&lt;hour&gt;</span>08:30<span class="nt">&lt;/hour&gt;</span>
+  <span class="nt">&lt;to&gt;</span>Tove<span class="nt">&lt;/to&gt;</span>
+  <span class="nt">&lt;from&gt;</span>Jani<span class="nt">&lt;/from&gt;</span>
+  <span class="nt">&lt;body&gt;</span>Don&#39;t forget me this weekend!<span class="nt">&lt;/body&gt;</span>
+<span class="nt">&lt;/note&gt;</span>
 </code></pre></div>
 </td></tr></table>
 
+<h3 id="dom-document-object-model">DOM (Document Object Model)<a class="headerlink" href="#dom-document-object-model" title="Permanent link">&para;</a></h3>
+<h4 id="dom-inspector-f12-to-the-rescue">DOM inspector : <code>F12</code> to the rescue!<a class="headerlink" href="#dom-inspector-f12-to-the-rescue" title="Permanent link">&para;</a></h4>
 <table class="codehilitetable"><tr><td class="linenos"><div class="linenodiv"><pre>1</pre></div></td><td class="code"><div class="codehilite"><pre><span></span><code>
 </code></pre></div>
 </td></tr></table>
+
+<h3 id="references">References<a class="headerlink" href="#references" title="Permanent link">&para;</a></h3>
+<ul>
+<li>The image in the “What is HTML?” section is taken from https://www.w3schools.com/html/</li>
+</ul>