Web Scraping Made Easy with Scrapy
In the digital age, data is king. Whether you're a market researcher, a data analyst, or simply someone who loves crunching numbers, the ability to collect and analyze data from the web is a valuable skill. This is where web scraping comes into play, and one of the most powerful tools for web scraping in Python is Scrapy.
What is Web Scraping?
Web scraping is like using a robot to collect information from websites. Imagine a robot that can visit web pages and copy down everything it sees; that copy is the HTML content of the page. The robot then looks through it and picks out the things you care about, like the scores of your favorite games or the weather forecast.
So, web scraping is when you use a special robot to get useful information from websites and use it for different things, like checking the weather or finding out what's for sale online.
Scrapy
Scrapy is an open-source Python framework that handles the crawling, request scheduling, and data export involved in web scraping, so you can focus on the extraction logic.
Prerequisites
You will need Python and pip installed; basic familiarity with Python, HTML, and CSS selectors is helpful.
Goals
Let's scrape something step by step. First, we need a target: in this article we will scrape book information from https://books.toscrape.com/
We will use Scrapy with Python to achieve the following goals:
- Go to the target URL
- Follow the link to each book's details page
- Retrieve the book's URL, Title, Category, Price, UPC, Price (excl. tax), Price (incl. tax), Tax, Availability, Number of reviews, and Product Description
- And finally write the extracted data into a CSV file.
Project Setup
>> pip install virtualenv
>> virtualenv <Your virtual environment name>
>> <Your virtual environment name>\Scripts\activate
>> pip install scrapy
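Note that the activation command above is the Windows form; on macOS or Linux the equivalent (assuming the same environment name) is:
>> source <Your virtual environment name>/bin/activate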
Creating a Scrapy Project
>> scrapy startproject scraper_genie
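Scrapy generates a project skeleton for you. With the project name above, the layout looks roughly like this (file names are Scrapy's defaults, shown here only for orientation):

scraper_genie/
    scrapy.cfg
    scraper_genie/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py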
Now go to the spiders directory and define a spider. Spiders are the classes that tell Scrapy how to perform a scraping job.
>> scrapy genspider <spider name> <scraping website>
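For example, assuming we call the spider bookspider (the name itself is our choice):

>> scrapy genspider bookspider books.toscrape.com

Scrapy then creates a spider file in the spiders directory with a skeleton roughly like the one below (the exact quoting and URL scheme vary slightly between Scrapy versions):

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # The extraction logic from the next section goes inside this method.
        pass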
Finding Elements and Extracting Data Using a Scrapy Spider
books = response.css('article.product_pod')
for book in books:
    relative_url = book.css('h3 a ::attr(href)').get()
    if 'catalogue/' in relative_url:
        book_url = 'https://books.toscrape.com/' + relative_url
    else:
        book_url = 'https://books.toscrape.com/catalogue/' + relative_url
    yield response.follow(book_url, callback=self.parse_book_page)
Here, the CSS selector first collects all the <article> elements with the class product_pod on the page into the books variable.
Then each book's URL is built and requested using Scrapy's response.follow() method. After Scrapy sends the request and receives a response, it calls the self.parse_book_page method to handle and parse that response. This callback method defines how the data on the book's page should be extracted and processed.
The book listing spans multiple pages, so the spider also follows the "next" link, re-using the same parse method on each listing page:

next_page = response.css('li.next a ::attr(href)').get()
if next_page is not None:
    if 'catalogue/' in next_page:
        next_page_url = 'https://books.toscrape.com/' + next_page
    else:
        next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
    yield response.follow(next_page_url, callback=self.parse)
Now let's write the parse_book_page method to extract our desired data:
def parse_book_page(self, response):
    table_rows = response.css('table tr')
    yield {
        'URL': response.url,
        'Title': response.css('.product_main h1::text').get(),
        'Category': response.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get(),
        'Price': response.css('p.price_color ::text').get(),
        'UPC': table_rows[0].css("td ::text").get(),
        'Price (excl. tax)': table_rows[2].css("td ::text").get(),
        'Price (incl. tax)': table_rows[3].css("td ::text").get(),
        'Tax': table_rows[4].css("td ::text").get(),
        'Availability': table_rows[5].css("td ::text").get(),
        'Number of reviews': table_rows[6].css("td ::text").get(),
        'Stars': response.css('p.star-rating').attrib['class'],
        'Description': response.xpath("//div[@id='product_description']/following-sibling::p/text()").get(),
    }
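One field worth a note: the Stars value is captured as the raw class attribute, which on this site looks like "star-rating Three". If you want a numeric rating instead, a small helper like the hypothetical one below (not part of the original spider) could convert it:

def rating_from_class(class_value):
    # class_value looks like "star-rating Three" on books.toscrape.com;
    # map the trailing word to an integer 1-5, or None if it is unrecognised.
    words_to_numbers = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    word = class_value.replace('star-rating', '').strip()
    return words_to_numbers.get(word)

Inside parse_book_page you could then yield rating_from_class(response.css('p.star-rating').attrib['class']) instead of the raw string.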
Here CSS selectors and XPath expressions are used to extract data from the web page. You can check this tutorial for more insight into how to find web element locators.
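If you want to experiment with selectors before putting them in the spider, Scrapy's interactive shell is handy. For example (substitute any book page URL from the site):

>> scrapy shell "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"

Inside the shell you get a response object for that page, so you can try selectors interactively:

>>> response.css('.product_main h1::text').get()
>>> response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()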
Now we need to run our spider and save the extracted data into a CSV file. Here's how to do that with the following command:
>> scrapy crawl <your spider name> -o <CSV file name>.csv
e.g., scrapy crawl mySpider -o bookdetailsinfo.csv
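Two related options you may find useful: in recent Scrapy versions, -o appends to an existing file while -O (capital) overwrites it, and you can also declare the output feed in the project's settings.py instead of on the command line. A minimal sketch, reusing the file name from the example above:

FEEDS = {
    'bookdetailsinfo.csv': {
        'format': 'csv',
        'overwrite': True,  # start a fresh file on every run
    },
}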
After the scraping finishes, Scrapy prints a report that looks like this:
2023-10-01 05:00:27 [scrapy.extensions.feedexport] INFO: Stored csv feed (1000 items) in: bookdetailsinfo.csv
2023-10-01 05:00:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 355653,
 'downloader/request_count': 1051,
 'downloader/request_method_count/GET': 1051,
 'downloader/response_bytes': 22195383,
 'downloader/response_count': 1051,
 'downloader/response_status_count/200': 1050,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 59.297681,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 9, 30, 23, 0, 27, 647742, tzinfo=datetime.timezone.utc),
 'item_scraped_count': 1000,
 'log_count/DEBUG': 2054,
 'log_count/ERROR': 1,
 'log_count/INFO': 11,
 'request_depth_max': 50,
 'response_received_count': 1051,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1050,
 'scheduler/dequeued/memory': 1050,
 'scheduler/enqueued': 1050,
 'scheduler/enqueued/memory': 1050,
 'spider_exceptions/UnboundLocalError': 1,
 'start_time': datetime.datetime(2023, 9, 30, 22, 59, 28, 350061, tzinfo=datetime.timezone.utc)}
Final thoughts
Scrapy is a versatile and powerful tool for web scraping in Python. With its rich features and active community, it simplifies the process of extracting valuable data from websites, allowing you to focus on analysis and insights. So, go ahead, give Scrapy a try, and unlock the world of web data!
If you find this information helpful, please feel free to share it. Additionally, if you come across any errors, notice areas for improvement, or identify potential bad practices in the content provided, do not hesitate to let me know. Your feedback is invaluable and helps in enhancing the quality and accuracy of the information shared.
Happy scraping!