

freeCodeCamp Scrapy Beginners Course Part 5: Advanced Scraper

In Part 5 of the Scrapy Beginner Course, we go through how to create a more advanced Scrapy spider that will crawl the entire BooksToScrape.com website and scrape the data from each individual book page.

We will walk through:

  • Discovering & requesting the individual book pages
  • Using Scrapy Shell to find CSS & XPath selectors
  • Updating our BooksSpider to scrape the detailed book data
  • Testing our Scrapy spider

The code for this part of the course is available on GitHub here!

If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.


Discover & Request Book Pages

When we finished with Part 4, we had created our first Scrapy Spider to scrape the product data from BooksToScrape.

However, this data only contained summary data for each book, whereas if we look at an individual book page we can see there is a lot more information, like:

  • Number of books available
  • UPC of each book
  • Product type
  • Product description
  • Price including & excluding taxes
  • Number of reviews

This data is very useful, so we want to update our scraper to scrape it too.

When we finished with Part 4 our spider looked like this:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }

        next_page = response.css('li.next a ::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

Now, we will start updating our scraper to crawl through each product shelf page (each page containing 20 books), have it discover each individual book page, and request that page so we can scrape the more detailed book data.

In Part 4 we were already extracting this URL, so we just need to create the full product URL and have it trigger a new Scrapy Request. From there, we can extract our target data when it receives a response.


import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a').attrib['href']
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield scrapy.Request(book_url, callback=self.parse_book_page)

        ## Next Page
        next_page = response.css('li.next a ::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        pass

Messy Page Structures

When scraping BooksToScrape you will see that the relative individual book URLs are inconsistent. Some contain /catalogue/ in their path, whereas others don't.

If you send a request without /catalogue/ in the path, it will return a 404 error. As a result, we need to add /catalogue/ to the book URL when it isn't already there.


def parse(self, response):
    books = response.css('article.product_pod')
    for book in books:
        relative_url = book.css('h3 a').attrib['href']
        if 'catalogue/' in relative_url:
            book_url = 'https://books.toscrape.com/' + relative_url
        else:
            book_url = 'https://books.toscrape.com/catalogue/' + relative_url
        yield scrapy.Request(book_url, callback=self.parse_book_page)

With this updated Spider, it will extract the book_url from each book and make a new request:

yield scrapy.Request(book_url, callback=self.parse_book_page)

When Scrapy receives a response to this request, it will trigger the parse_book_page callback, which we can then use to extract the data from the HTML response of the individual book page.

When we run this spider it will now crawl all the shelf pages on BooksToScrape and request each individual book page.
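As a side note, Scrapy can also resolve relative URLs for us. A minimal sketch of the same loop using response.follow() (an alternative approach, not the one used in the rest of this course) would look like this:

def parse(self, response):
    books = response.css('article.product_pod')
    for book in books:
        relative_url = book.css('h3 a').attrib['href']
        # response.follow() resolves the relative href against response.url,
        # so we don't need the 'catalogue/' check in this version
        yield response.follow(relative_url, callback=self.parse_book_page)

Both versions end up requesting the same book pages; the if/else version just makes the URL handling explicit.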

Next, we will create new CSS & XPath parsers to extract the data from the individual book pages.


Using Scrapy Shell To Find CSS & XPath Selectors

Like we did in Part 4 we will use Scrapy Shell to find the XPath or CSS selectors we need to extract our target data from the page.

In this case, we are going to extract the following data from the page:

  • Book URL
  • Title
  • UPC
  • Product Type
  • Price
  • Price Excluding Tax
  • Price Including Tax
  • Tax
  • Availability
  • Number of Reviews
  • Stars
  • Category
  • Description

This will give us good experience extracting data in various situations and with different methods.

To open Scrapy shell, use this command:

scrapy shell

With our Scrapy shell open, you should see something like this:

[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000025111C47948>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x0000025111D17408>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

BooksToScrape Individual Book Page

Next, we will fetch an example books.toscrape.com book page:

fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

We should see a response like this:

In [1]: fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
2021-12-22 13:28:56 [scrapy.core.engine] INFO: Spider opened
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/robots.txt> (referer: None)
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html> (referer: None)

As we can see, we successfully retrieved the page from https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html, and Scrapy shell has automatically saved the HTML response in the response variable.

In [2]: response
Out[2]: <200 https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html>
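As a small shortcut, you can also pass the URL directly when launching Scrapy shell, and it will fetch the page for you:

scrapy shell 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'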

From here we can start building the CSS & XPath selectors we need to extract our target data.


Extracting Simple Data

To start off with, we are going to extract some easy-to-access data like the title and price (the one at the top of the page).

Using inspect element, hover over the product section of the page and look at the IDs and classes on the element that encompasses all the data we want.

In this case, we can see that the book has its own special component, <div class="product_main">. We can just use this to reference our book data.

Now, using our Scrapy shell, we can see if we can extract the product information using this class.

response.css('div.product_main')

We can see that there is only one of these containers on the page:

In [3]: response.css('div.product_main')
Out[3]:
[
<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_main ')]" data='<div class="col-sm-6 product_main">\n ...'>
]

So we will set this as the book variable:

In [4]: book = response.css('div.product_main')[0]

Now we can extract the title and price using the following selectors:

Title:

book.css("h1 ::text").get()

Price:

book.css('p.price_color ::text').get()
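For the example book page we fetched, these selectors should return values along these lines (your In/Out numbering may differ):

In [5]: book.css("h1 ::text").get()
Out[5]: 'A Light in the Attic'

In [6]: book.css('p.price_color ::text').get()
Out[6]: '£51.77'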

Extracting Data Using Siblings

Next, we want to extract the product description and category data from the page. However, there are no id or class attributes we can use to extract these easily, so we will use sibling elements to extract them.

Description: The description is contained in a generic <p> tag. However, this <p> tag always comes directly after the <div id='product_description'>, so we can extract it using the following-sibling functionality.

For this, we will use an XPath selector instead of a CSS selector, as XPath has better support for this:

book.xpath("//div[@id='product_description']/following-sibling::p/text()").get()

Category: The category is contained in the breadcrumb list at the top of the page, in the second-to-last <li> element.

Again, we will use an XPath selector for this, as it has better support for targeting siblings:

book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()

Extracting Data From Tables

Another common scraping scenario is that you want to extract data from tables. In this case, we want to extract the UPC, Product Type, Price, Price Excluding Tax, Price Including Tax, Tax, Availability and Number of Reviews from the table at the bottom of the page.

To do this we will first retrieve the table rows themselves:

table_rows = response.css("table tr")

Now we can get the data from each row's <td> element by targeting the row that contains the data we want. For example:

UPC:

table_rows[0].css("td ::text").get()

Product Type:

table_rows[1].css("td ::text").get()

Price Excluding Tax:

table_rows[2].css("td ::text").get()
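Indexing the rows by position works here because the product information table on BooksToScrape always has the same layout. As a more defensive alternative (a sketch, not part of the course code), you could build a dictionary from each row's <th>/<td> pair so the data no longer depends on row order:

# A sketch: build a {header: value} dict from the product information table
table_data = {}
for row in response.css("table tr"):
    header = row.css("th ::text").get()
    value = row.css("td ::text").get()
    if header is not None:
        table_data[header] = value

# e.g. table_data.get('UPC') - the exact header strings depend on the page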

Extracting Data From Attributes

Finally, we want to extract the number of stars each book received. This data is contained in the class attribute of the <p> element with the star-rating class. For example:


<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

To scrape the number of stars each book received, we need to extract the number from the class attribute.

We can do this like this:

book.css("p.star-rating").attrib['class']

Now that we've found the correct XPath & CSS selectors, we can exit Scrapy shell with the exit() command and update our Spider.


Updating Our BooksSpider

With all the CSS & XPath selectors found we can now update our Spider to extract this data.

Here is the updated spider:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a').attrib['href']
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield scrapy.Request(book_url, callback=self.parse_book_page)

        ## Next Page
        next_page = response.css('li.next a ::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        book = response.css("div.product_main")[0]
        table_rows = response.css("table tr")
        yield {
            'url': response.url,
            'title': book.css("h1 ::text").get(),
            'upc': table_rows[0].css("td ::text").get(),
            'product_type': table_rows[1].css("td ::text").get(),
            'price_excl_tax': table_rows[2].css("td ::text").get(),
            'price_incl_tax': table_rows[3].css("td ::text").get(),
            'tax': table_rows[4].css("td ::text").get(),
            'availability': table_rows[5].css("td ::text").get(),
            'num_reviews': table_rows[6].css("td ::text").get(),
            'stars': book.css("p.star-rating").attrib['class'],
            'category': book.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get(),
            'description': book.xpath("//div[@id='product_description']/following-sibling::p/text()").get(),
            'price': book.css('p.price_color ::text').get(),
        }

This spider will now crawl through every shelf page on BooksToScrape, discover each individual book page, and retrieve the HTML for that page.

When Scrapy receives a response from the individual book page, it will then extract the book data in the parse_book_page function.


Testing Our Scrapy Spider

Now that we have updated our spider, we can run it by going to the top level of our Scrapy project and running the following command:

scrapy crawl bookspider 

It will run, and you should see the logs on your screen. Here are the final stats:

2023-01-31 15:24:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 354623,
'downloader/request_count': 1051,
'downloader/request_method_count/GET': 1051,
'downloader/response_bytes': 22195383,
'downloader/response_count': 1051,
'downloader/response_status_count/200': 1050,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 60.737089,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 31, 8, 24, 17, 764944),
'item_scraped_count': 1000,
'log_count/DEBUG': 2054,
'log_count/INFO': 11,
'request_depth_max': 50,
'response_received_count': 1051,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 1050,
'scheduler/dequeued/memory': 1050,
'scheduler/enqueued': 1050,
'scheduler/enqueued/memory': 1050,
'start_time': datetime.datetime(2023, 1, 31, 8, 23, 17, 27855)}
2023-01-31 15:24:17 [scrapy.core.engine] INFO: Spider closed (finished)

We can see from the above stats that our spider scraped 1000 Items: 'item_scraped_count': 1000.

If we want to save the data to a JSON file, we can use the -O option followed by the name of the file.

scrapy crawl bookspider -O myscrapeddata.json

If we want to save the data to a CSV file we can do so too.

scrapy crawl bookspider -O myscrapeddata.csv
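Note that in recent versions of Scrapy, the uppercase -O option overwrites the output file each time the spider runs, while the lowercase -o option appends to any existing file:

scrapy crawl bookspider -o myscrapeddata.csv   # append to the file
scrapy crawl bookspider -O myscrapeddata.csv   # overwrite the file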

However, when we inspect the scraped data, we will see there are some issues:

  • Prices aren't numbers
  • The stock availability isn't a number
  • Some text contains trailing & leading white spaces


Next Steps

We've just expanded our simple scraper to now discover all the books on BooksToScrape and scrape each individual book page.

In Part 6, we will look at how to use Items and Item Pipelines to better structure and clean our data before saving it into a database.

All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows: