Part 1: Building the basic scraping functionality.

  • Fetch the target website >> fetch('https://www.chocolate.co.uk/collections/all')

  • Check if the request was successful >> response

  • Select all product items on the page >> products = response.css('product-item')

  • Get the first product-item >> response.css('product-item').get()

  • Get the length of the returned products >> len(products)

  • Get the product price from its span, stripping the leftover markup >> product.css('span.price').get().replace('\n Sale price','').replace('</span>','')

  • Note: this string cleanup is needed because .get() returns the element's raw HTML (including its nested tags) rather than just its text.

  • Get the next-page URL for pagination >> response.css('[rel="next"] ::attr(href)').get() (the spider sketch below combines these steps)
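
Pulled together, the shell commands above map onto a small spider. A minimal sketch, assuming the chocolatespider name referenced in Part 2; the exact strings passed to replace() depend on the page's HTML:

```python
import scrapy


class ChocolateSpider(scrapy.Spider):
    name = "chocolatespider"  # name used by the Part 2 crawl command
    start_urls = ["https://www.chocolate.co.uk/collections/all"]

    def parse(self, response):
        # select every product item on the page
        products = response.css("product-item")
        for product in products:
            # .get() returns the span's raw HTML, so the surrounding
            # markup has to be stripped with replace()
            raw_price = product.css("span.price").get()
            if raw_price:
                yield {
                    "price": raw_price.replace("\n Sale price", "").replace("</span>", "")
                }

        # follow the rel="next" link until there are no more pages
        next_page = response.css('[rel="next"] ::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```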

Part 2: Cleaning Dirty Data & Dealing With Edge Cases

  • Save data to an S3 bucket >> scrapy crawl chocolatespider -O s3://aws_key:aws_secret@mybucket/path/to/myscrapeddata.csv:csv (see the settings sketch after this list)

  • Create a requirements file >> pip freeze > requirements.txt
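
The same S3 export can live in the project's settings.py instead of on the command line. A minimal sketch, assuming the placeholder bucket, path, and credentials from the command above (S3 feed exports also require botocore to be installed):

```python
# settings.py: equivalent of the -O s3://... command-line flag
AWS_ACCESS_KEY_ID = "aws_key"          # placeholder credentials from the notes
AWS_SECRET_ACCESS_KEY = "aws_secret"

FEEDS = {
    "s3://mybucket/path/to/myscrapeddata.csv": {
        "format": "csv",
        "overwrite": True,  # -O overwrites the file; -o appends instead
    },
}
```

With this in place, a plain `scrapy crawl chocolatespider` writes the CSV to S3 without any extra flags.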