Commit f562902

added explanations in section 1
1 parent 651db76 commit f562902

8 files changed, +78 -18 lines changed

data/intro_html_example.html

+3-3
@@ -1,12 +1,12 @@
 <!DOCTYPE html>
 <html>
 <head>
-<title>Page Title</title>
+<title>My title</title>
 </head>
 <body>
 
-<h1>My First Heading</h1>
-<p>My first paragraph.</p>
+<h1>A Heading</h1>
+<a href="#">Link text</a>
 
 </body>
 </html>
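
As a rough illustration of what scraping this example page could look like, the sketch below pulls the title, heading, and link out of the post-commit HTML using only Python's standard-library `html.parser`. The `TextByTag` class, the `extract` helper, and the inlined `EXAMPLE_HTML` string are hypothetical names introduced here; the commit itself contains no parsing code.

```python
from html.parser import HTMLParser

# Post-commit content of data/intro_html_example.html, inlined so the sketch
# is self-contained (an assumption; real code would read the file or a URL).
EXAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>My title</title>
</head>
<body>

<h1>A Heading</h1>
<a href="#">Link text</a>

</body>
</html>
"""


class TextByTag(HTMLParser):
    """Collect the text directly inside selected tags, plus link targets."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []   # tags that are currently open
        self.found = {}   # tag name -> list of text snippets
        self.links = []   # href values of <a> tags

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in self.wanted:
            self.found.setdefault(self.stack[-1], []).append(text)


def extract(html, wanted=("title", "h1", "a")):
    """Return (text per tag, list of hrefs) for the given HTML string."""
    parser = TextByTag(wanted)
    parser.feed(html)
    parser.close()
    return parser.found, parser.links


found, links = extract(EXAMPLE_HTML)
print(found)
print(links)
```

This is enough for a well-behaved page like the one above; messy real-world HTML usually calls for a more forgiving third-party parser.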

docs/sitemap.xml.gz

0 Bytes
Binary file not shown.

images/dom.png

38.1 KB

images/dom1.png

74.4 KB

images/f12.png

53.9 KB

images/html.png

5.93 KB

images/web_scraping.jpg

27.5 KB

notebooks/section-1-intro-to-web-scraping.ipynb

+75-15
@@ -7,6 +7,13 @@
 "## Introduction to Web Scraping"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"<img src=\"http://www.price2spy.com/blog/wp-content/uploads/2019/07/web_scraping.jpg\"/>"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -19,7 +26,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Web scraping is a technique for extracting information from websites. This can be done manually but it is usually faster, more efficient and less error-prone to automate the task.\n",
+"Web scraping is a technique for extracting information from websites. This can be done *manually* but it is usually faster, more efficient and less error-prone to automate the task.\n",
 "\n",
 "Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.\n",
 "\n",
@@ -65,6 +72,16 @@
 "- For much larger needs, Freedom of information requests can be useful. Be specific about the formats required for the data you want."
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"> #### Q. If you had to gather data from a website that publishes updated data on an ongoing pandemic every 4 hours, would you:\n",
+"- [ ] Scrape the site directly\n",
+"- [ ] Ask for permission and then scrape the site\n",
+"- [ ] "
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -77,7 +94,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"When presented with information, human beings are good at quickly categorizing it and extracting the data that they are interested in. For example, when we look at a magazine rack, provided the titles are written in a script that we are able to read, we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, etc. and we can probably also easily organize them by topic, recognize those that are aimed at children, or even whether they lean toward a particular end of the political spectrum. Computers have a much harder time making sense of such unstructured data unless we specifically tell them what elements data is made of, for example by adding labels such as this is the title of this magazine or this is a magazine about food. Data in which individual elements are separated and labelled is said to be structured."
+"When presented with information, human beings are good at quickly categorizing it and extracting the data that they are interested in. For example, when we look at a magazine rack, provided the titles are written in a script that we are able to read, we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, etc. and we can probably also easily organize them by topic, recognize those that are aimed at children, or even whether they lean toward a particular end of the political spectrum. \n",
+"\n",
+"Computers have a much harder time making sense of such unstructured data unless we specifically tell them what elements data is made of, for example by adding labels such as this is the title of this magazine or this is a magazine about food. Data in which individual elements are separated and labelled is said to be structured."
 ]
 },
 {
@@ -126,13 +145,13 @@
 "<html>\n",
 "\n",
 "<head>\n",
-"<title>Page Title</title>\n",
+" <title>My Title</title>\n",
 "</head>\n",
 "\n",
 "<body>\n",
 "\n",
-"<h1>My First Heading</h1>\n",
-"<p>My first paragraph.</p>\n",
+" <h1>A Heading</h1>\n",
+" <a href=\"#\">Link text</a>\n",
 "\n",
 "</body>\n",
 "\n",
@@ -144,7 +163,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Every HTML element corresponds to a display content on the web browser. The following image shows the HTML code and the webpage generated (please refer to `intro_html_example.html).\n",
+"A webpage is simply a document. Every HTML element within this document corresponds to specific content displayed in the web browser. The following image shows the HTML code and the webpage generated (please refer to `intro_html_example.html`).\n",
 "![intro_html_example](../images/html.png)"
 ]
 },
@@ -177,43 +196,84 @@
 {
 "cell_type": "markdown",
 "metadata": {},
-"source": []
+"source": [
+"### HTML DOM (or Document Object Model)\n",
+"---"
+]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### DOM (Document Object Model)"
+"> \"*The W3C Document Object Model (DOM) is a platform and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure, and style of a document.*\" -- W3C\n",
+"\n",
+"Every time a web page is loaded, the browser creates a **D**ocument **O**bject **M**odel of the page. It essentially treats the HTML (or XML) document as a tree structure in which the different HTML elements are represented as nodes and objects.\n",
+"\n",
+"More broadly, it is a programming interface for HTML and XML documents and can be considered as the object-oriented representation of a web page which can be modified with a scripting language like JavaScript. \n",
+"\n",
+"It also provides us with a rich visual representation of how the different elements interact and informs us about their relative positions within the tree. This helps us find and target crucial **tags**, **ids** or **classes** within the document and extract their content. To summarize, the DOM is a standard that allows us to:\n",
+"- **get**\n",
+"- **change**\n",
+"- **add**, or \n",
+"- **delete** \n",
+"\n",
+"HTML elements. Here we will be primarily interested in accessing and getting the data as opposed to manipulation of the document itself."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's look at the DOM for the HTML from our previous example below\n",
+"![intro_html_example](../images/dom1.png)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"The next question, then, is: how do we access the source code or DOM of **any** web page on the internet?"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"DOM is the underlying structure of any webpage."
+"#### DOM inspector and `F12` to the rescue!"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"#### DOM inspector : `F12` to the rescue!"
+"To inspect individual elements within a web page, we can simply use the DOM inspector (or its variants) that comes with every browser.\n",
+"\n",
+"- The easiest way to access the source code of any web page is through the developer console, opened by pressing **F12**\n",
+"- Alternatively, we can right-click on a specific element in the webpage and select **inspect** or **inspect element** from the dropdown. This is especially useful in cases where we want to target a specific piece of data present within some HTML element.\n",
+"- It helps highlight different attributes, properties and styles within the HTML\n",
+"- It is known as **DOM inspector** and **Developer Tools** in Firefox and Chrome respectively.\n",
+"\n",
+"> Note: Some webpages disable right-click; in those cases we might have to resort to inspecting the source code via F12."
 ]
 },
 {
-"cell_type": "code",
-"execution_count": null,
+"cell_type": "markdown",
 "metadata": {},
-"outputs": [],
-"source": []
+"source": [
+"A Google Chrome window with the developer console, accessed through **F12** (found under **Developer Tools**), is shown below\n",
+"![intro_html_example](../images/f12.png)"
+]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### References\n",
 "\n",
-"- This image has been taken from https://www.w3schools.com/html/"
+"- https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction\n",
+"- https://www.w3schools.com/html/\n",
+"- https://www.w3schools.com/js/js_htmldom.asp\n",
+"- https://www.price2spy.com/blog/case-study-web-scraping-data-extraction-for-ecommerce/"
 ]
 }
 ],
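
The get / change / add / delete operations described in the new DOM cell can be approximated outside the browser. The sketch below is not the browser DOM API: it uses Python's standard-library `xml.etree.ElementTree` as a stand-in tree model, and it assumes a well-formed, DOCTYPE-free version of the example page. `PAGE` and `outline` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed subset of the example page. The DOCTYPE is omitted because
# ElementTree is an XML parser, not an HTML parser (assumption of this sketch).
PAGE = (
    "<html><head><title>My title</title></head>"
    "<body><h1>A Heading</h1><a href='#'>Link text</a></body></html>"
)


def outline(node, depth=0):
    """Return the element tree as indented lines, one line per element."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(PAGE)
tree_before = outline(root)

# get: read an element's text and an attribute value
heading = root.find("./body/h1").text
link_target = root.find("./body/a").get("href")

# change: overwrite the text of an existing element
root.find("./head/title").text = "New title"

# add: append a new child element to <body>
body = root.find("body")
para = ET.SubElement(body, "p")
para.text = "Added paragraph"

# delete: remove the <a> element from <body>
body.remove(body.find("a"))

tree_after = outline(root)
print("\n".join(tree_before))
print("\n".join(tree_after))
```

For scraping, only the "get" step usually matters; the other three operations are included to mirror the list in the notebook text.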
