Web Scraping Made Easy with Scrapy



In the digital age, data is king. Whether you're a market researcher, a data analyst, or simply someone who loves crunching numbers, the ability to collect and analyze data from the web is a valuable skill. This is where web scraping comes into play, and one of the most powerful tools for web scraping in Python is Scrapy.

What is Web Scraping?

Web scraping is like using a robot to collect information from websites. Imagine a robot that can visit web pages and take a snapshot of everything it sees; these snapshots are essentially the HTML content of the pages. The robot then looks through the snapshots and picks out the pieces you care about, like the scores of your favorite games or the weather forecast.

In short, web scraping means using such a program to pull useful information out of websites and put it to work, whether that's checking the weather or finding out what's for sale online.

Scrapy

Scrapy is an open-source web crawling and scraping framework for Python. It is designed to be fast, efficient, and easy to use.

    Prerequisites

    Prior to starting this guide, make sure you have a Python environment configured on your system. If you haven't already done this, visit the official website linked here for installation instructions.

    Throughout this guide, I'll use Visual Studio Code as the integrated development environment (IDE) on a Windows system. However, you are welcome to use your preferred IDE. If you're using VS Code, follow the guidance provided here to set up Python support for it.

    Also make sure pip is already installed.

    Goals

    Let's scrape something step by step. First, we need to decide what to scrape. In this article we will scrape book information from https://books.toscrape.com/

    We will use Scrapy with Python to achieve the following goals:

    • Go to the target URL
    • Then go to the details page of every book
    • Retrieve each book's URL, Title, Category, Price, UPC, Price (excl. tax), Price (incl. tax), Tax, Availability, Number of reviews, and Product Description
    • And finally write the extracted data into a CSV file.

    Project Setup

    👉 Open the VS Code IDE
    👉 Go to File ⇾ Click on Open Folder ⇾ And select your project folder
    👉 Now go to the terminal and install virtualenv using the following command
    >> pip install virtualenv
    

    👉 Then create a virtual environment for your project with the following command and activate it
    >> virtualenv <Your virtual environment name>
    >> <Your virtual environment name>\Scripts\activate
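    (On macOS or Linux, the activation command is source <Your virtual environment name>/bin/activate instead.)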
    👉 Now install scrapy using pip 
    >> pip install scrapy
    

    OK, so we are all set to start our Scrapy project.

    Creating Scrapy Project

    Use the following command to create a new Scrapy project:
    >> scrapy startproject scraper_genie
    

    This will create a directory structure for your project like this
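    (This is the standard layout Scrapy generates; the outer folder name matches the project name we chose.)

    scraper_genie/
        scrapy.cfg            # deploy configuration file
        scraper_genie/        # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider and downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # folder where your spiders live
                __init__.py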

    Now go to the spiders directory and let's define a spider. Spiders are the classes that tell Scrapy how to perform a scraping job.

    >> scrapy genspider <spider name> <scraping website>
    This will create a spider with the given name and the target website preconfigured, like the following:
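    (Assuming the spider name bookspider was used, the generated file will look roughly like this.)

    import scrapy


    class BookspiderSpider(scrapy.Spider):
        name = "bookspider"                          # the name you passed to genspider
        allowed_domains = ["books.toscrape.com"]
        start_urls = ["https://books.toscrape.com/"]

        def parse(self, response):
            pass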
    Now let's modify the spider file to achieve our goal.

    Finding Elements and Extracting Data Using the Scrapy Spider

    First, we need to find all the book links on the listing page and save them into a variable:

    books = response.css('article.product_pod')
    for book in books:
        relative_url = book.css('h3 a ::attr(href)').get()
        if 'catalogue/' in relative_url:
            book_url = 'https://books.toscrape.com/' + relative_url
        else:
            book_url = 'https://books.toscrape.com/catalogue/' + relative_url
        yield response.follow(book_url, callback=self.parse_book_page)

    Here, all books on the page are first collected into the books variable using a CSS selector that selects every <article> element with the class product_pod.



    Then each book's URL is built and requested using Scrapy's response.follow() method. After Scrapy sends the request and receives a response, it calls the self.parse_book_page method to handle and parse that response. This callback method defines how the data on the book's detail page should be extracted and processed.

    This same task needs to be performed for every page of the catalogue, not just the first one. To follow the pagination, we can add the following at the end of the parse method:

    next_page = response.css('li.next a ::attr(href)').get()
    if next_page is not None:
        if 'catalogue/' in next_page:
            next_page_url = 'https://books.toscrape.com/' + next_page
        else:
            next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
        yield response.follow(next_page_url, callback=self.parse)
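    Putting the two snippets together, the complete parse method would look roughly like this (a sketch, placed inside the generated spider class from earlier):

    def parse(self, response):
        # Follow every book listed on the current catalogue page
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a ::attr(href)').get()
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield response.follow(book_url, callback=self.parse_book_page)

        # Then follow the "next" link to the next catalogue page, if there is one
        next_page = response.css('li.next a ::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)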

    Now let's write the parse_book_page method to extract our desired data:

    def parse_book_page(self, response):
        table_rows = response.css('table tr')
        yield {
            'URL': response.url,
            'Title': response.css('.product_main h1::text').get(),
            'Category': response.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get(),
            'Price': response.css('p.price_color ::text').get(),
            'UPC': table_rows[0].css("td ::text").get(),
            'Price (excl. tax)': table_rows[2].css("td ::text").get(),
            'Price (incl. tax)': table_rows[3].css("td ::text").get(),
            'Tax': table_rows[4].css("td ::text").get(),
            'Availability': table_rows[5].css("td ::text").get(),
            'Number of reviews': table_rows[6].css("td ::text").get(),
            'Stars': response.css('p.star-rating').attrib['class'],
            'Description': response.xpath("//div[@id='product_description']/following-sibling::p/text()").get(),
        }

    Here, CSS selectors and XPath expressions are used to extract data from the web page. You can check this tutorial to get more insight into how to find web element locators.
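    If you want to experiment with selectors before putting them into the spider, Scrapy's interactive shell is handy. For example (the URL below is just one book's detail page on the site):

    >> scrapy shell "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
    >>> response.css('.product_main h1::text').get()        # the book title
    >>> response.css('p.price_color ::text').get()          # the price
    >>> response.css('table tr')[0].css('td ::text').get()  # the UPC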

    Now we need to run our spider and save the extracted data into a CSV file. Here's how we can do that with the following command:

    scrapy crawl <your spider's name> -o <CSV file name>.csv
    e.g., scrapy crawl mySpider -o bookdetailsinfo.csv
    (The name passed to scrapy crawl is the spider's name attribute, not its file name.)
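    If you don't want to pass -o on every run, the same export can instead be configured once in the project's settings.py using Scrapy's FEEDS setting. A minimal sketch (the file name here is just an example):

    # settings.py
    FEEDS = {
        "bookdetailsinfo.csv": {
            "format": "csv",        # export format
            "overwrite": True,      # replace the file on each run
        },
    }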

    After the scraping finishes, Scrapy generates a report that looks like this:

    2023-10-01 05:00:27 [scrapy.extensions.feedexport] INFO: Stored csv feed (1000 items) in: bookdetailsinfo.csv
    2023-10-01 05:00:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 355653,
    'downloader/request_count': 1051,
    'downloader/request_method_count/GET': 1051,
    'downloader/response_bytes': 22195383,
    'downloader/response_count': 1051,
    'downloader/response_status_count/200': 1050,
    'downloader/response_status_count/404': 1,
    'elapsed_time_seconds': 59.297681,
    'feedexport/success_count/FileFeedStorage': 1,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2023, 9, 30, 23, 0, 27, 647742, tzinfo=datetime.timezone.utc),
    'item_scraped_count': 1000,
    'log_count/DEBUG': 2054,
    'log_count/ERROR': 1,
    'log_count/INFO': 11,
    'request_depth_max': 50,
    'response_received_count': 1051,
    'robotstxt/request_count': 1,
    'robotstxt/response_count': 1,
    'robotstxt/response_status_count/404': 1,
    'scheduler/dequeued': 1050,
    'scheduler/dequeued/memory': 1050,
    'scheduler/enqueued': 1050,
    'scheduler/enqueued/memory': 1050,
    'spider_exceptions/UnboundLocalError': 1,
    'start_time': datetime.datetime(2023, 9, 30, 22, 59, 28, 350061, tzinfo=datetime.timezone.utc)}

    And here is the CSV file generated by the spider:


    Final thoughts

    Scrapy is a versatile and powerful tool for web scraping in Python. With its rich features and active community, it simplifies the process of extracting valuable data from websites, allowing you to focus on analysis and insights. So, go ahead, give Scrapy a try, and unlock the world of web data!

    If you find this information helpful, please feel free to share it. Additionally, if you come across any errors, notice areas for improvement, or identify potential bad practices in the content provided, do not hesitate to let me know. Your feedback is invaluable and helps in enhancing the quality and accuracy of the information shared.


    Happy scraping!
