
Commit 38b780d

Merge pull request #117 from arjuaman/master
Amazon Reviews Scraper for any product and putting it into a csv file
2 parents 0f38310 + 744547f

File tree

8 files changed, +131 -0 lines changed
@@ -0,0 +1,24 @@
## USE: As part of projects where you need to perform sentiment analysis on customer review data.

Scrapy is a web crawling framework that lets a developer write code defining how a particular site (or a group of websites) will be scraped.

### Steps:

1) Install Scrapy from conda-forge:

>> conda install -c conda-forge scrapy

If you would rather install it system-wide with pip, use:

>> pip install scrapy

2) Start a project:

>> scrapy startproject amazon_reviews_scraping

3) A spider is a chunk of Python code that determines how a web page will be scraped; it is the main component that crawls the page and extracts content from it.

Copy the link of the product reviews you want to scrape and run:

>> scrapy genspider amazon_review <your-link-here>

4) Now you'll need to define a Scrapy parser, which is already done in:

amazon_reviews_scraping/amazon_reviews_scraping/spiders/amazon_reviews.py

5) Run the following to store the results in a CSV file titled "reviews.csv" (rename it as you prefer); a quick way to inspect the output is sketched right after this README:

>> scrapy runspider amazon_reviews_scraping/amazon_reviews_scraping/spiders/amazon_reviews.py -o reviews.csv
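Once reviews.csv exists, a quick sanity check before the sentiment-analysis step is to load it and pull a numeric rating out of the star text. A minimal sketch, assuming pandas is installed, that the columns are the 'stars' and 'comment' keys the spider yields, and that the star text follows Amazon's usual "4.0 out of 5 stars" format (the 'rating' column name is my own):

# Minimal sketch, not part of this commit: inspect the scraped CSV.
# Assumes pandas is installed; 'stars' and 'comment' are the columns
# the spider yields, and the star text looks like "4.0 out of 5 stars".
import pandas as pd

reviews = pd.read_csv("reviews.csv")
print(reviews.shape)  # (number of reviews scraped, 2)

# Extract the leading number from the rating text for numeric analysis.
reviews["rating"] = reviews["stars"].str.extract(r"(\d+\.\d+)", expand=False).astype(float)
print(reviews[["rating", "comment"]].head())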
@@ -0,0 +1,7 @@
import scrapy


class AmazonReviewsScrapingItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
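The item class is left as the empty template in this commit; the spider below yields plain dicts instead. If you wanted typed items, a hypothetical way to fill it in would be to declare the two fields the spider yields (these field names are taken from the spider's output, not from this file):

# Hypothetical sketch, not part of this commit: declare the two fields the
# spider yields, so typed items (rather than plain dicts) can flow through
# pipelines and exporters.
import scrapy


class AmazonReviewsScrapingItem(scrapy.Item):
    stars = scrapy.Field()    # raw rating text, e.g. "4.0 out of 5 stars"
    comment = scrapy.Field()  # full review body text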
@@ -0,0 +1,59 @@
from scrapy import signals

from itemadapter import is_item, ItemAdapter


class AmazonReviewsScrapingSpiderMiddleware:
    # Default Scrapy spider-middleware template; all hooks are left as no-ops.

    @classmethod
    def from_crawler(cls, crawler):
        # Used by Scrapy to create the middleware and hook up signals.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response going through the spider middleware.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the spider.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when the spider or process_spider_input() raises an exception.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider.
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class AmazonReviewsScrapingDownloaderMiddleware:
    # Default Scrapy downloader-middleware template; all hooks are left as no-ops.

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Returning None lets the request continue through the chain.
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
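These template middlewares are inactive unless registered in settings.py, which this commit does not do. A sketch of how they could be enabled (543 is the priority value Scrapy's own generated settings template suggests; it is used here as an assumption):

# Hypothetical settings.py additions, not part of this commit: register the
# template middlewares so Scrapy actually runs them.
SPIDER_MIDDLEWARES = {
    'amazon_reviews_scraping.middlewares.AmazonReviewsScrapingSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'amazon_reviews_scraping.middlewares.AmazonReviewsScrapingDownloaderMiddleware': 543,
}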
@@ -0,0 +1,7 @@
from itemadapter import ItemAdapter


class AmazonReviewsScrapingPipeline:
    def process_item(self, item, spider):
        return item
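The pipeline is a pass-through in this commit. Since ''.join(...extract()) in the spider keeps the whitespace around the review text nodes, a natural extension would be a cleaning step; a hypothetical sketch (the 'stars' and 'comment' keys come from the spider's output, and the pipeline would still need to be registered in ITEM_PIPELINES):

# Hypothetical sketch, not part of this commit: trim the whitespace that
# joining the raw text nodes leaves around the scraped strings.
from itemadapter import ItemAdapter


class ReviewCleaningPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        for field in ('stars', 'comment'):  # keys the spider yields
            if adapter.get(field):
                adapter[field] = adapter[field].strip()
        return item

# Activating it would also require, in settings.py:
# ITEM_PIPELINES = {'amazon_reviews_scraping.pipelines.ReviewCleaningPipeline': 300}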
@@ -0,0 +1,6 @@
BOT_NAME = 'amazon_reviews_scraping'

SPIDER_MODULES = ['amazon_reviews_scraping.spiders']
NEWSPIDER_MODULE = 'amazon_reviews_scraping.spiders'

ROBOTSTXT_OBEY = True
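Since the spider below queues 120 start URLs against one domain, it can be worth throttling the crawl. These options are not in the commit; a sketch using standard Scrapy settings:

# Hypothetical additions, not part of this commit: standard Scrapy
# throttling options to keep the crawl polite.
DOWNLOAD_DELAY = 1                   # seconds between requests to the same domain
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # cap parallel requests per domain
AUTOTHROTTLE_ENABLED = True          # back off automatically if the server slows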
@@ -0,0 +1 @@
# This package will contain the spiders of your Scrapy project
@@ -0,0 +1,27 @@
import scrapy


class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    # Must match the domain in myBaseUrl (the original listed 'amazon.in',
    # which does not match the amazon.com URL below).
    allowed_domains = ['amazon.com']

    # Review-list URL for the product; one start URL is built per page.
    myBaseUrl = "https://www.amazon.com/OnePlus-Interstellar-Unlocked-Android-Smartphone/product-reviews/B0872473BF/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews"
    start_urls = []
    for i in range(1, 121):
        # Appending the bare page number would corrupt the query string;
        # Amazon paginates review lists with the pageNumber parameter
        # (assumed here to be the intended behavior).
        start_urls.append(myBaseUrl + "&pageNumber=" + str(i))

    def parse(self, response):
        # The container that holds the list of reviews on the page.
        data = response.css('#cm_cr-review_list')
        star_rating = data.css('.review-rating')
        comments = data.css('.review-text-content')

        # Pair each star rating with the review text at the same index.
        for count, review in enumerate(star_rating):
            yield {
                'stars': ''.join(review.xpath('.//text()').extract()),
                'comment': ''.join(comments[count].xpath('.//text()').extract()),
            }
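One thing to note about parse: it pairs ratings and comments by list index, which assumes both selector lists are the same length on every page. A hypothetical, functionally equivalent variant (not part of this commit) zips the two lists so a page with fewer comment nodes than rating nodes cannot raise an IndexError:

# Hypothetical variant, not part of this commit: zip stops at the shorter
# list, so mismatched selector lists end the pairing instead of crashing.
def parse(self, response):
    data = response.css('#cm_cr-review_list')
    for rating, comment in zip(data.css('.review-rating'),
                               data.css('.review-text-content')):
        yield {
            'stars': ''.join(rating.xpath('.//text()').extract()),
            'comment': ''.join(comment.xpath('.//text()').extract()),
        }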
