|
| 1 | +## What is web scraping? |
| 2 | + |
| 3 | +Web scraping is a technique for extracting information from websites. This can be done manually but it is usually faster, more efficient and less error-prone to automate the task. |
| 4 | + |
| 5 | +Web scraping allows you to acquire non-tabular or poorly structured data from websites and convert it into a usable, structured format, such as a .csv file or spreadsheet. |
| 6 | + |
| 7 | +Scraping is about more than just acquiring data: it can also help you archive data and track changes to data online. |
| 8 | + |
| 9 | +It is closely related to the practice of web indexing, which is what search engines like Google do when mass-analysing the Web to build their indices. Unlike web indexing, which typically parses the entire content of a web page to make it searchable, web scraping targets specific information on the pages visited. |
| 10 | + |
| 11 | +For example, online stores will often scour the publicly available pages of their competitors, scrape item prices, and then use this information to adjust their own prices. Another common practice is “contact scraping” in which personal information like email addresses or phone numbers is collected for marketing purposes. |
| 12 | + |
| 13 | +### Why do we need it as a skill? |
| 14 | +Web scraping is increasingly being used by scholars to create data sets for text mining projects; these might be collections of journal articles or digitised texts. The practice of data journalism, in particular, relies on the ability of investigative journalists to harvest data that is not always presented or published in a form that allows analysis. |
| 15 | + |
| 16 | +### When do we need scraping? |
| 17 | + |
| 18 | +As useful as scraping is, there might be better options for the task. Choose the right (i.e. the easiest) tool for the job. |
| 19 | + |
| 20 | +- Check whether or not you can easily copy and paste data from a site into Excel or Google Sheets. This might be quicker than scraping. |
| 21 | +- Check if the site or service already provides an API to extract structured data. If it does, that will be a much more efficient and effective pathway. Good examples are the Facebook API, the Twitter APIs or the YouTube comments API. |
| 22 | +- For much larger needs, freedom of information requests can be useful. Be specific about the formats required for the data you want. |
| 23 | + |
| 24 | + |
| 25 | +### Structured vs unstructured data |
| 26 | + |
| 27 | +When presented with information, human beings are good at quickly categorizing it and extracting the data that they are interested in. For example, when we look at a magazine rack, provided the titles are written in a script that we are able to read, we can rapidly figure out the titles of the magazines, the stories they contain, the language they are written in, etc. We can probably also easily organize them by topic, recognize those that are aimed at children, or even tell whether they lean toward a particular end of the political spectrum. Computers have a much harder time making sense of such unstructured data unless we specifically tell them what elements the data is made of, for example by adding labels such as "this is the title of this magazine" or "this is a magazine about food". Data in which individual elements are separated and labelled is said to be structured. |
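
To make the distinction concrete, the sketch below contrasts a free-text description of a magazine with labelled records. The field names (`title`, `topic`, `language`) and the sample magazines are invented for illustration and do not come from any real dataset.

```python
# Unstructured: a human can read this, but a program sees only one string.
unstructured = "Tasty Bites is a magazine about food, written in English."

# Structured: each element is separated and labelled (hypothetical fields).
magazines = [
    {"title": "Tasty Bites", "topic": "food", "language": "English"},
    {"title": "Petit Explorateur", "topic": "children", "language": "French"},
]

# With labels in place, selecting by topic is a one-line query.
food_titles = [m["title"] for m in magazines if m["topic"] == "food"]
print(food_titles)
```

Answering the same question from the `unstructured` string would require guessing where the title ends and the topic begins, which is the problem scraping has to solve.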
| 28 | + |
| 57 | +```html |
| 58 | +<thead> |
| 59 | + <tr> |
| 60 | + <th>Rank</th> |
| 61 | + <th>Company</th> |
| 62 | + <th>Website</th> |
| 63 | + </tr> |
| 64 | +</thead> |
| 65 | +<tbody> |
| 66 | + <tr> |
| 67 | + <td>1</td> |
| 68 | + <td>Walmart</td> |
| 69 | + <td><a href="http://www.stock.walmart.com">http://www.stock.walmart.com</a></td> |
| 70 | + </tr> |
| 71 | + <tr> |
| 72 | + <td>2</td> |
| 73 | + <td>Exxon Mobil</td> |
| 74 | + <td><a href="http://www.exxonmobil.com">http://www.exxonmobil.com</a></td> |
| 75 | + (...) |
| 76 | + </tr> |
| 77 | + <tr> |
| 78 | + <td>500</td> |
| 79 | + <td>Cintas</td> |
| 80 | + <td><a href="http://www.cintas.com">http://www.cintas.com</a></td> |
| 81 | + </tr> |
| 82 | +</tbody> |
| 83 | +``` |
| 84 | + |
| 85 | +We see that this data has been structured for display purposes (it is arranged in rows inside a table) but the different elements of information are not clearly labelled. |
| 86 | + |
| 87 | +What if we wanted to download this dataset and, for example, compare the revenues of these companies or group them by the industry they work in? We could try copy-pasting the entire table into a spreadsheet, or even manually copy-pasting the company names and websites into another document, but this quickly becomes impractical when faced with a large set of data. What if we wanted to collect this information for all 500 companies in the list? |
| 88 | + |
| 89 | +Fortunately, there are tools to automate at least part of the process. This technique is called web scraping. |
| 90 | + |
| 91 | +> Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. (Source: Wikipedia) |
| 92 | +
|
| 93 | +Web scraping typically targets one website at a time to extract unstructured information and put it in a structured form for reuse. |
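
As a rough illustration of that idea, the sketch below turns an excerpt of the table shown above into CSV text using only Python's standard library. The class and function names (`TableRowParser`, `table_to_csv`) are hypothetical helpers written for this example, and the excerpt keeps only the first and last rows; a real page would need more defensive parsing.

```python
import csv
import io
from html.parser import HTMLParser

# Excerpt of the table shown above (rows 1 and 500 only).
SAMPLE_HTML = """
<tbody>
  <tr>
    <td>1</td>
    <td>Walmart</td>
    <td><a href="http://www.stock.walmart.com">http://www.stock.walmart.com</a></td>
  </tr>
  <tr>
    <td>500</td>
    <td>Cintas</td>
    <td><a href="http://www.cintas.com">http://www.cintas.com</a></td>
  </tr>
</tbody>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read
        self._cell = None   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a <td> cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table cells as CSV text with a labelled header row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Rank", "Company", "Website"])
    writer.writerows(parser.rows)
    return out.getvalue()


print(table_to_csv(SAMPLE_HTML))
```

The header row is where the labelling happens: once each column has a name, the data can be loaded into a spreadsheet or analysis tool without further interpretation.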
| 94 | + |
| 95 | +In this lesson, we will continue exploring the examples above and try different techniques to extract the information they contain. But before we launch into web scraping proper, we need to look a bit closer at how information is organized within an HTML document and how to build queries to access a specific subset of that information. |
| 96 | + |
| 97 | +Create a basic HTML document: |
| 98 | +```html |
| 99 | +<!DOCTYPE html> |
| 100 | +<html> |
| 101 | +<head> |
| 102 | +<title>Page Title</title> |
| 103 | +</head> |
| 104 | +<body> |
| 105 | + |
| 106 | +<h1>My First Heading</h1> |
| 107 | +<p>My first paragraph.</p> |
| 108 | + |
| 109 | +</body> |
| 110 | +</html> |
| 111 | +``` |
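
To see how a program can address individual elements of such a document, here is a minimal sketch that walks the page above with the standard library's `html.parser` and records the text of the `title`, `h1` and `p` elements. The `OutlineParser` class is a hypothetical helper for this example, not part of any library.

```python
from html.parser import HTMLParser

PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record (tag, text) pairs for a small set of elements."""

    WANTED = {"title", "h1", "p"}

    def __init__(self):
        super().__init__()
        self.found = []
        self._current = None  # wanted tag we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._current = tag

    def handle_data(self, data):
        # Ignore whitespace-only text between elements.
        if self._current and data.strip():
            self.found.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None


parser = OutlineParser()
parser.feed(PAGE)
for tag, text in parser.found:
    print(f"{tag}: {text}")
```

The tag names act as the labels here: asking for "the text inside `h1`" is a simple query against the document's structure.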
| 112 | + |
| 113 | + |
| 114 | +```python |
| 115 | +# Select image from https://www.w3schools.com/html/html_intro.asp |
| 116 | +``` |
| 117 | + |
| 118 | + |
| 119 | +```python |
| 120 | +!wget "https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018" |
| 121 | +``` |
| 122 | + |
| 123 | + 'wget' is not recognized as an internal or external command, |
| 124 | + operable program or batch file. |
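
The message above is what the Windows command prompt prints when `wget` is not installed. One portable alternative, sketched here under that assumption, is to fetch the page with Python's standard `urllib.request` module. The function name `download_page` and the output filename `fortune500.html` are hypothetical, and the site may refuse or redirect automated requests.

```python
import urllib.request
from pathlib import Path


def download_page(url, dest):
    """Fetch `url` and save the raw bytes to `dest`; return the byte count."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    Path(dest).write_bytes(data)
    return len(data)


# Example call (commented out so the block does not hit the network when run):
# download_page(
#     "https://www.zyxware.com/articles/5914/list-of-fortune-500-companies-and-their-websites-2018",
#     "fortune500.html",
# )
```

Saving the page to disk first means later parsing experiments can be repeated without requesting the site again.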
| 125 | + |
| 126 | + |
| 127 | +For background on XML, see https://www.w3schools.com/xml/xml_whatis.asp |