
freeCodeCamp Scrapy Beginners Course Part 4: First Scraper

In Part 4 of the Scrapy Beginner Course, we go through how to create our first Scrapy spider to scrape BooksToScrape.com.

We will walk through:

  • Creating our first Scrapy spider
  • Using Scrapy Shell to find our CSS selectors
  • Adding the CSS selectors to our spider
  • Running our spider and saving the scraped data

The code for this part of the course is available on Github here!

If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.


Creating Our Scrapy Spider

In Part 3, we created our Scrapy project that we will use to scrape BooksToScrape.

Now, we will start creating our first Scrapy Spider.

Scrapy provides a number of different spider types, however, in this course we will cover the most common one, the generic Spider. Here is a quick overview of the main spider types (with a short illustrative sketch after the list):

  • Spider - Takes a list of start_urls and scrapes each one with a parse method.
  • CrawlSpider - Designed to crawl a full website by following any links it finds.
  • SitemapSpider - Designed to extract URLs from a sitemap.
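For context, here is a minimal sketch of what a rule-based CrawlSpider looks like. This is purely illustrative and not part of our project; the class name and the allow pattern are hypothetical:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    # Illustrative only - not part of the course project
    name = 'examplecrawler'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    # Follow every link containing 'catalogue/' and pass each response to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'catalogue/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract whatever data you need from each crawled page
        yield {'url': response.url, 'title': response.css('title::text').get()}

In this course, however, we will stick with the generic Spider and handle pagination ourselves.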

To create a new generic spider, simply run the genspider command:

# syntax is --> scrapy genspider <name_of_spider> <website> 
$ scrapy genspider bookspider books.toscrape.com

A new spider will now have been added to your spiders folder, and it should look like this:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        pass

Here we see that the genspider command has created a template spider for us to use in the form of a Spider class. This spider class contains:

  • name - a class attribute that gives a name to the spider. We will use this when running our spider later: scrapy crawl <spider_name>.
  • allowed_domains - a class attribute that tells Scrapy that it should only ever scrape pages on the books.toscrape.com domain. This prevents the spider from going rogue and scraping lots of other websites. This is optional.
  • start_urls - a class attribute that tells Scrapy the first URL it should scrape. We will be changing this in a bit.
  • parse - the parse function is called after a response has been received from the target website.

To start using this spider, we just need to insert our parsing code into the parse function.

To do this, we need to create our CSS selectors to parse the data we want from the page. We will use Scrapy Shell to find the best CSS selectors.

Using Scrapy Shell To Find Our CSS Selectors

To extract data from an HTML page, we need to use XPath or CSS selectors to tell Scrapy where in the page the data is located. XPath and CSS selectors are like little maps that Scrapy uses to navigate the DOM tree and find the location of the data we require.
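For example, both of the following selectors grab the text of a page's <h1> heading, the first using CSS, the second using the equivalent XPath:

response.css('h1::text').get()
response.xpath('//h1/text()').get()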

In this guide, we're going to use CSS selectors to parse the data from the page. And to help us create these CSS selectors we will use Scrapy Shell.

One of the great features of Scrapy is that it comes with a built-in shell that allows you to quickly test and debug your XPath & CSS selectors. Instead of having to run your full scraper to see if your XPath or CSS selectors are correct, you can enter them directly into your terminal and see the result.

To open Scrapy shell use this command:

scrapy shell

Note: If you would like to use IPython as your Scrapy shell (much more powerful and provides smart auto-completion and colorized output), then make sure you have IPython installed:

pip3 install ipython

And then edit your scrapy.cfg file like so:

## scrapy.cfg
[settings]
default = bookscraper.settings
shell = ipython

With our Scrapy shell open, you should see something like this:

[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000025111C47948>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x0000025111D17408>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:

Fetch The Page

To create our CSS selectors we will be testing them on the following page:

https://books.toscrape.com/

BooksToScrape

The first thing we want to do is fetch the main products page of the books site in our Scrapy shell.

fetch('https://books.toscrape.com/')

We should see a response like this:

In [1]: fetch('https://books.toscrape.com/')
2021-12-22 13:28:56 [scrapy.core.engine] INFO: Spider opened
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/robots.txt> (referer: None)
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/> (referer: None)

As we can see, we successfully retrieved the page from books.toscrape.com, and Scrapy shell has automatically saved the HTML response in the response variable.

In [2]: response
Out[2]: <200 https://books.toscrape.com/>

Find Book CSS Selectors

To find the correct CSS selectors to parse the book details, we will first open the page in our browser's DevTools.

Open the website, then open the developer tools console (right click on the page and click inspect).

BooksToScrape.com Developer Tools

Using the element inspector, hover over a book and look at the IDs and classes on the individual products.

In this case, we can see that each book is wrapped in its own <article class="product_pod"> element. We can use this to reference our books (see the above image).

Now, using our Scrapy shell, we can see if we can extract the book information using this class.

response.css('article.product_pod')

We can see that it has found all the elements that match this selector.

In [3]: response.css('article.product_pod')
Out[3]:
[<Selector xpath="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n    \n ...'>,
 <Selector xpath="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n    \n ...'>,
 <Selector xpath="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n    \n ...'>,
 ...

Get First Book

To just get the first book we use .get() appended to the end of the command.

response.css('article.product_pod').get()

This returns all the HTML in this node of the DOM tree.

In [4]: response.css('article.product_pod').get()
Out[4]: '<article class="product_pod">\n\n    \n        <div class="image_container">\n            \n                \n                <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/...
...

Get All Books

Now that we have found the DOM nodes containing the book items, we will get all of them, save them into a variable, and then loop through the items to extract the data we need.

We can do this with the following command.

books = response.css('article.product_pod')

The books variable is now a list of all the books on the page.

To see how many books were found, we can check the length of the books variable.

len(books) 

Here is the output:

In [6]: len(books)
Out[6]: 20
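As a quick sanity check in the shell (not something we need in the final spider), you can also pull a single field from every book in one go using .getall(), for example all 20 book titles:

response.css('article.product_pod h3 a::text').getall()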

Extract Book Details

Now let's extract the name, price, availability and URL of each book from the list of books.

The books variable is a list of books. When we update our spider code, we will loop through this list. However, to find the correct selectors we will test the CSS selectors on the first element of the list, books[0].

Single Book - Get a single book from the list.

book = books[0]

Name - The book name can be found with:

book.css('h3 a::text').get()
In [5]: book.css('h3 a::text').get()
Out[5]: 'A Light in the ...'

Price - The book price can be found with:

book.css('div.product_price .price_color::text').get()

You can see that the data returned for the price still contains the currency sign (and, depending on your setup, can have encoding issues). We'll deal with this in Part 6: Data Cleaning With Items & Pipelines.

In [6]: book.css('div.product_price .price_color::text').get()
Out[6]: '£51.77'

In Stock - Whether the book is in stock or not can be found with:

book.css('p.availability::text').get()

Again, we can see that the returned availability data contains extra text and whitespace. We'll also deal with this in Part 6: Data Cleaning With Items & Pipelines.

In [7]: book.css('p.availability::text').get()
Out[7]: '\n\n \n In stock\n \n'

Book URL - Next, let's see how we can extract the book URL for each individual book. To do that, we can use the attrib attribute on the selector returned by book.css('h3 a').

book.css('h3 a').attrib['href']
In [8]: book.css('h3 a').attrib['href']
Out[8]: 'catalogue/a-light-in-the-attic_1000/index.html'
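Equivalently, you can grab the same attribute with the ::attr() pseudo-selector, which is the approach we'll use later for the next-page link:

book.css('h3 a::attr(href)').get()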

Adding CSS Selectors To Spider

Now that we've found the correct CSS selectors, let's update our spider. Exit Scrapy shell with the exit() command.

Our updated Spider code should look like this:

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }


Here, our spider does the following steps:

  1. Makes a request to 'https://books.toscrape.com/'.
  2. When it gets a response, it extracts all the books from the page using books = response.css('article.product_pod').
  3. Loops through each book, and extracts the name, price and url using the CSS selectors we created.
  4. Yields (returns) these items so they can be output to the terminal and/or stored in a CSV, JSON, DB, etc.

How to Run Our Scrapy Spider

Now that we have a spider, we can run it by going to the top-level folder of our Scrapy project and running the following command.

scrapy crawl bookspider 

It will run, and you should see the logs on your screen. Here are the final stats:

2021-12-22 14:43:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 707,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 64657,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.794875,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 22, 13, 43, 54, 937791),
'httpcompression/response_bytes': 268118,
'httpcompression/response_count': 2,
'item_scraped_count': 20,
'log_count/DEBUG': 26,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 12, 22, 13, 43, 54, 142916)}
2021-12-22 14:43:54 [scrapy.core.engine] INFO: Spider closed (finished)

We can see from the above stats that our spider scraped 20 items: 'item_scraped_count': 20.

If we want to save the data to a JSON file we can use the -O option, followed by the name of the file.

scrapy crawl bookspider -O myscrapeddata.json

If we want to save the data to a CSV file we can do so too.

scrapy crawl bookspider -O myscrapeddata.csv
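Note that the capital -O option overwrites the output file each run, while the lowercase -o option appends to it. As an alternative to passing the option on the command line every time, you can also configure a feed export in your project's settings.py. Here is a minimal sketch (the filename is just an example):

## settings.py
FEEDS = {
    'bookdata.json': {'format': 'json', 'overwrite': True},
}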

So far the code is working great, but we're only getting the books from the first page of the site, the URL we listed in the start_urls attribute.

The next logical step is to go to the next page, if there is one, and scrape the item data from that too! So here's how we do that.

First, let's open our Scrapy shell again, fetch the page and find the correct selector to get the next page button.

scrapy shell

Then fetch the page again.

fetch('https://books.toscrape.com/')

And then get the href attribute that contains the url to the next page.

response.css('li.next a::attr(href)').get()
In [2]: response.css('li.next a::attr(href)').get()
Out[2]: 'catalogue/page-2.html'

Now, we just need to update our spider to request this page after it has parsed all items from a page.

import scrapy

class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)



Here we see that our spider now finds the URL of the next page and, if it isn't None, appends it to the base URL and makes another request.

To handle the fact that the next page URL sometimes contains catalogue/ and sometimes doesn't, we add an extra if statement that checks whether it is present and builds the correct next_page_url for each case.
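As an aside (not the approach used in this course), response.follow() can also resolve relative URLs against the URL of the current page, so a more compact version of the pagination logic could look like this:

next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    # response.follow() resolves the relative href against the current page's URL,
    # so it works whether or not the href contains 'catalogue/'
    yield response.follow(next_page, callback=self.parse)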

If we run our spider again, in our Scrapy stats we see that we have scraped 50 pages, and extracted 1,000 items:

2021-12-22 15:10:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2497,
'downloader/request_count': 50,
'downloader/request_method_count/GET': 50,
'downloader/response_bytes': 245935,
'downloader/response_count': 50,
'downloader/response_status_count/200': 50,
'elapsed_time_seconds': 2.441196,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 22, 14, 10, 45, 62280),
'httpcompression/response_bytes': 986800,
'httpcompression/response_count': 50,
'item_scraped_count': 1000,
'log_count/DEBUG': 78,
'log_count/INFO': 11,
'request_depth_max': 3,
'response_received_count': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2021, 12, 22, 14, 10, 42, 621084)}
2021-12-22 15:10:45 [scrapy.core.engine] INFO: Spider closed (finished)

Now that we have built our Scrapy spider, we can scrape all the books on the site. However, when we compare the scraped data with the data available on each individual book page, we can see that we are only getting a fraction of the available data.

So in Part 5, we will update our spider to scrape the data from each individual book page.


Next Steps

We've just created our first scraper to scrape all the books from BooksToScrape. In Part 5 we will move on to creating a spider that can crawl the entire website and scrape the full book data from each individual book page.

All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows: