Web scraping is a technique for extracting information from websites. It can be done manually, but automating the task is usually faster, more efficient and less error-prone.

Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet.

Scraping is about more than just acquiring data: it can also help you archive data and track changes to data online.

It is closely related to web indexing, which is what search engines like Google do when they mass-analyse the Web to build their indices. Unlike web indexing, which typically parses the entire content of a web page to make it searchable, web scraping targets specific information on the pages visited.

For example, online stores will often scour the publicly available pages of their competitors, scrape item prices, and then use this information to adjust their own prices. Another common practice is “contact scraping”, in which personal information such as email addresses or phone numbers is collected for marketing purposes.

### Why do we need it as a skill?
---

Web scraping is increasingly being used by academics and researchers to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis.

### When do we need scraping?
---

As useful as scraping is, there might be better options for the task. Choose the right (i.e. the easiest) tool for the job.

- Check whether you can easily copy and paste data from a site into Excel or Google Sheets. This might be quicker than scraping.
- Check whether the site or service already provides an API to extract structured data. If it does, that will be a much more efficient and effective pathway. Good examples are the Facebook API, the Twitter APIs or the YouTube comments API.
- For much larger needs, Freedom of Information requests can be useful. Be specific about the formats required for the data you want.
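
To illustrate why an API is usually the easier path, here is a minimal sketch of reading a structured API response. The response body and its field names (`items`, `author`, `text`) are invented for illustration and do not correspond to any real service; a real API documents its own format.

```python
import json

# Hypothetical response body from an imaginary comments API.
# The field names ("items", "author", "text") are assumptions.
response_body = """
{
  "items": [
    {"author": "alice", "text": "Great video!"},
    {"author": "bob", "text": "Thanks for sharing."}
  ]
}
"""


def extract_comments(body):
    """Return (author, text) pairs from a JSON response body."""
    data = json.loads(body)
    return [(item["author"], item["text"]) for item in data["items"]]


comments = extract_comments(response_body)
for author, text in comments:
    print(f"{author}: {text}")
```

Because the data arrives already labelled, no parsing of page layout is needed, which is what makes this route preferable to scraping when it is available.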

### Structured vs unstructured data
---

When presented with information, human beings are good at quickly categorizing it and extracting the data that they are interested in. For example, when we look at a magazine rack, provided the titles are written in a script that we are able to read, we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, etc. We can probably also easily organize them by topic, recognize those that are aimed at children, or even tell whether they lean toward a particular end of the political spectrum.

Computers have a much harder time making sense of such unstructured data unless we specifically tell them what elements the data is made of, for example by adding labels such as "this is the title of this magazine" or "this is a magazine about food". Data in which individual elements are separated and labelled is said to be structured.
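
The difference can be sketched in a few lines of Python. The magazine titles and labels below are made up for illustration; the point is only that labelled elements can be queried directly, while a free-text description cannot.

```python
# Unstructured: easy for a person to read, but a program sees one opaque string.
unstructured = "Good Food Monthly, a magazine about food, written in English"

# Structured: each element is separated and labelled (hypothetical examples).
rack = [
    {"title": "Good Food Monthly", "topic": "food", "language": "English"},
    {"title": "Junior Science", "topic": "science", "language": "English"},
]


def magazines_about(topic, magazines):
    """Return the titles of all magazines labelled with the given topic."""
    return [m["title"] for m in magazines if m["topic"] == topic]


food_titles = magazines_about("food", rack)
print(food_titles)
```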

Refer to the file `fortune_500_basic_example.html`.

We see that this data has been structured for display purposes (it is arranged in rows inside a table), but the different elements of information are not clearly labelled.

What if we wanted to download this dataset and, for example, compare the revenues of these companies against each other or against the industry that they work in? We could try copy-pasting the entire table into a spreadsheet, or even manually copy-pasting the names and websites into another document, but this quickly becomes impractical when faced with a large set of data. What if we wanted to collect this information for all the companies that are listed?
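
As a rough sketch of what automating this could look like, the snippet below turns an HTML table into CSV text using only the Python standard library. The table contents are a made-up stand-in: the actual markup of `fortune_500_basic_example.html` is not reproduced here, and the class and function names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for a table of companies; not the real file contents.
SAMPLE_TABLE = """
<table>
  <tr><th>Rank</th><th>Company</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Example Corp</td><td>100</td></tr>
  <tr><td>2</td><td>Sample Inc</td><td>90</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Parse the table rows out of `html` and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


csv_text = table_to_csv(SAMPLE_TABLE)
print(csv_text)
```

The resulting text can be saved to a `.csv` file and opened in a spreadsheet, which is the "structured format" described earlier.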

Fortunately, there are tools to automate at least part of the process. This technique is called web scraping.

> Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. (Source: Wikipedia)

Web scraping typically targets one website at a time to extract unstructured information and put it in a structured form for reuse.

In this lesson, we will continue exploring the examples above and try different techniques to extract the information they contain. But before we launch into web scraping proper, we need to look a bit closer at how information is organized within an HTML document and how to build queries to access a specific subset of that information.

#### What is HTML?
- HTML stands for **HyperText Markup Language**
- It is the standard markup language for the web pages that make up the Web.
- An HTML document contains a series of elements that make up a web page; pages link to other pages, and together they form a website.
- HTML elements are written as tags, which tell the web browser how to display the content.

A sample raw HTML file is shown below:

```html
<!DOCTYPE html>
<html>

<head>
<title>Page Title</title>
</head>

<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>

</html>
```
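
To see how a program can read these tags, here is a small sketch that pulls the text out of selected elements of the sample above using Python's built-in `html.parser` module. The class name `TextByTag` is a hypothetical helper written for this example, not part of any library.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""


class TextByTag(HTMLParser):
    """Record the text found inside each tag of interest."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._current = None  # the wanted tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self._current = tag

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.found.setdefault(self._current, []).append(data.strip())

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


parser = TextByTag(["title", "h1", "p"])
parser.feed(SAMPLE_HTML)
print(parser.found)
```

This simple version does not handle wanted tags nested inside each other; it is only meant to show that tags give a program something to hold on to.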

Every HTML element corresponds to content displayed by the web browser. The following image shows the HTML code and the webpage it generates (please refer to `intro_html_example.html`).

#### What is XML?
- XML stands for **eXtensible Markup Language**
- XML is a markup language much like HTML
- XML was designed to store and transport data
- XML was designed to be self-descriptive

```xml
<note>
<date>2015-09-01</date>
<hour>08:30</hour>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>
```
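
Because every element in the note is labelled, a program can read it with very little effort. The sketch below uses Python's standard `xml.etree.ElementTree` module on the same note; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<note>
<date>2015-09-01</date>
<hour>08:30</hour>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>"""

# Parse the text into a tree and get its root element (<note>).
root = ET.fromstring(NOTE_XML)

# Map each child element's tag name to its text content.
note = {child.tag: child.text for child in root}

print(root.tag)
print(note["to"], note["from"], note["date"])
```

This is what "self-descriptive" means in practice: the tag names themselves tell the program which value is the date, the sender or the recipient.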

### DOM (Document Object Model)

The DOM is the underlying structure of any webpage: the browser parses the HTML into a tree of nodes, with one node per element.
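
The tree shape can be made visible with a short sketch. It uses Python's `xml.dom.minidom`, an XML DOM implementation, as a stand-in for a browser's HTML parser, so it assumes the markup is well-formed XML; the `outline` function is a hypothetical helper written for this example.

```python
from xml.dom import minidom

# Compact, well-formed version of the sample page (an assumption needed
# because minidom is an XML parser, not a browser-grade HTML parser).
PAGE = (
    "<html><head><title>Page Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)


def outline(node, depth=0):
    """Return one indented line per element node, in document order."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        lines.extend(outline(child, depth))
    return lines


document = minidom.parseString(PAGE)
tree_lines = outline(document)
print("\n".join(tree_lines))
```

The indentation of each printed tag name reflects how deeply that element is nested, which is the same hierarchy a browser's DOM inspector displays.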

#### DOM inspector: `F12` to the rescue!

```python

```

### References

- This image has been taken from https://www.w3schools.com/html/