
Commit e05fad3

Author: namrun

Added scripts to scrape questions from Project Euler

1 parent 6d59af5 commit e05fad3

File tree

5 files changed: +88 -0 lines changed
Lines changed: 29 additions & 0 deletions
# Project Euler #

![Image](./images/euler_home.PNG)

Project Euler is a series of challenging mathematical/computer programming problems that require more than just mathematical insight to solve.

This Python script scrapes all 700+ questions across 15 archive pages and writes them into a CSV file named Project_Euler.csv.
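
The CSV side is plain `csv.writer` usage; a minimal sketch of the header row the script below writes:

```python
import csv

with open('Project_Euler.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # One header row, then one row per scraped question.
    writer.writerow(["Problem Number", "Name", "Description"])
```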

Beautiful Soup is used to scrape the URL: https://projecteuler.net/archives
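
A minimal sketch of fetching and parsing one archive page (assuming `requests` and `beautifulsoup4` are installed); the full script below loops over all 15 pages the same way:

```python
import requests
from bs4 import BeautifulSoup

# The page number is appended to the archives URL.
page_url = "https://projecteuler.net/archives;page=1"
response = requests.get(page_url)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)
```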

Regular expressions are also used to extract the plain-text description of each question.
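
The tag stripping itself is a one-line substitution; this is the same pattern the script below applies to each piece of a problem description:

```python
import re

html = "<p>Find the sum of all the multiples of 3 or 5 below 1000.</p>"
# Replace every tag with a space, keeping only the text between tags.
text = re.sub(r'<.*?>', ' ', html).strip()
print(text)  # Find the sum of all the multiples of 3 or 5 below 1000.
```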

## Implementation ##

Using **inspect element** (Ctrl+Shift+I), the structure of the page can be examined.

The structure of each page is as shown:

![Image](./images/euler_questions.PNG)

Each `<tr>` element in the problems table holds one question's entry; the link to the question itself sits in an `<a>` tag.
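
Concretely, the script locates the table by its id and walks the `<a>` tags inside it; a trimmed sketch of that lookup:

```python
import requests
from bs4 import BeautifulSoup

url = "https://projecteuler.net/archives"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Each <a> inside the problems table links to one problem page,
# with an href of the form "problem=N".
for link in soup.find('table', attrs={"id": "problems_table"}).find_all('a'):
    print(link['href'].split('=')[-1], link.string)
```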

Each question has the following components:

![Image](./images/question1.PNG)

The contents are parsed and stored using Beautiful Soup, a library built for web scraping.
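
On each question page, the description lives in a `<div class="problem_content">` element; this sketch shows how the script below collects it for a single problem:

```python
import re
import requests
from bs4 import BeautifulSoup

url = "https://projecteuler.net/problem=1"
page = BeautifulSoup(requests.get(url).text, "html.parser")

# Concatenate the tag-stripped text of every child of the content div.
description = ''
for content in page.find("div", attrs={"class": "problem_content"}).children:
    description += re.sub(r'<.*?>', ' ', str(content))
print(description.strip())
```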
Lines changed: 59 additions & 0 deletions
#!/usr/bin/env python3

# Imports and dependencies
import requests
from bs4 import BeautifulSoup
import re
import csv

def Euler():

    # The contents are written into a CSV file.
    # Each question has a serial number, the name of the problem and the description of the problem.

    with open('Project_Euler.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Problem Number", "Name", "Description"])

        # There are 15 pages in all; the page number is appended to the URL.
        start = 1
        pages = 15

        for page in range(start, pages + start):

            # Fetch each archive page, then search it for questions.
            page_url = "https://projecteuler.net/archives;page=" + str(page)
            response = requests.get(page_url)
            soup = BeautifulSoup(response.text, "html.parser")

            # All the questions are located within the <table> tag.
            # This can be confirmed with inspect element, Ctrl+Shift+I.

            for link in soup.find('table', attrs={"id": "problems_table"}).find_all('a'):

                # The link to the question is located in an <a> tag.
                question_url = "https://projecteuler.net/" + link['href']

                # The question number and name are obtained from the link's href and text.
                question_number = link['href'].split('=')[-1]
                question_name = link.string

                ques_response = requests.get(question_url)
                ques_contents = BeautifulSoup(ques_response.text, "html.parser")
                description = ''

                # On each question page, the description sits in the
                # <div class="problem_content"> element.
                for content in ques_contents.find("div", attrs={"class": "problem_content"}).children:

                    # Strip the tags, keeping only the text between them.
                    content = re.sub(r'<.*?>', r' ', str(content))
                    description += content

                # Each entry is written into the file.
                writer.writerow([question_number, question_name, description])

Euler()