freeCodeCamp Scrapy Beginners Course Part 4: First Scraper
In Part 4 of the Scrapy Beginner Course, we go through how to create our first Scrapy spider to scrape BooksToScrape.com.
We will walk through:
- Creating Our Scrapy Spider
- Using Scrapy Shell To Find Our CSS Selectors
- Adding CSS Selectors To Spider
- How to Run Our Scrapy Spider
- How to Navigate Through Pages
The code for this part of the course is available on GitHub here!
If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.
freeCodeCamp Scrapy Course
This guide is part of the 12 Part freeCodeCamp Scrapy Beginner Course, where we build a Scrapy project end-to-end: from building the scrapers to deploying them on a server and running them every day.
If you would like to skip to another section then use one of the links below:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
The code for this project is available on GitHub here!
Creating Our Scrapy Spider
In Part 3, we created our Scrapy project that we will use to scrape BooksToScrape.
Now, we will start creating our first Scrapy Spider.
Scrapy provides a number of different spider types, however, in this course we will cover the most common one, the generic Spider. Here are some of the most common ones (for illustration, a rough sketch of a CrawlSpider follows this list):
- Spider - Takes a list of start_urls and scrapes each one with a parse method.
- CrawlSpider - Designed to crawl a full website by following any links it finds.
- SitemapSpider - Designed to extract URLs from a sitemap.
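Here is a rough sketch of what a CrawlSpider might look like. We won't use this spider type in this course, and the spider name and the allow pattern below are just hypothetical examples:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookCrawlSpider(CrawlSpider):
    # Hypothetical example spider, not part of this course's project
    name = 'bookcrawl'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    # Follow every link containing 'catalogue/' and send each response
    # to parse_item (a CrawlSpider should not override parse itself)
    rules = (
        Rule(LinkExtractor(allow='catalogue/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Just record the page URL for now - real parsing comes later in the course
        yield {'url': response.url}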
To create a new generic spider, simply run the genspider command:
# syntax is --> scrapy genspider <name_of_spider> <website>
$ scrapy genspider bookspider books.toscrape.com
A new spider will now have been added to your spiders folder, and it should look like this:
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        pass
Here we see that the genspider command has created a template spider for us to use in the form of a Spider class. This spider class contains:
- name - a class attribute that gives a name to the spider. We will use this when running our spider later: scrapy crawl <spider_name>.
- allowed_domains - a class attribute that tells Scrapy that it should only ever scrape pages of the books.toscrape.com domain. This prevents the spider going rogue and scraping lots of websites. This is optional.
- start_urls - a class attribute that tells Scrapy the first URL it should scrape. We will be changing this in a bit.
- parse - the parse function is called after a response has been received from the target website.
To start using this Spider we just need to start inserting our parsing code into the parse function.
To do this, we need to create our CSS selectors to parse the data we want from the page. We will use Scrapy Shell to find the best CSS selectors.
Using Scrapy Shell To Find Our CSS Selectors
To extract data from an HTML page, we need to use XPath or CSS selectors to tell Scrapy where the data is in the page. XPath and CSS selectors are like little maps Scrapy follows to navigate the DOM tree and find the location of the data we require.
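As a quick illustration of the two (this uses a made-up one-line HTML snippet, not the BooksToScrape page), a CSS selector and an XPath expression can point at exactly the same data:

from scrapy.selector import Selector

# A made-up HTML snippet, just to show CSS and XPath finding the same data
html = '<div class="product"><a href="/item-1">Item 1</a></div>'
sel = Selector(text=html)

print(sel.css('div.product a::attr(href)').get())           # -> '/item-1'
print(sel.xpath('//div[@class="product"]//a/@href').get())  # -> '/item-1'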
In this guide, we're going to use CSS selectors to parse the data from the page. And to help us create these CSS selectors we will use Scrapy Shell.
One of the great features of Scrapy is that it comes with a built-in shell that allows you to quickly test and debug your XPath & CSS selectors. Instead of having to run your full scraper to see if your XPath or CSS selectors are correct, you can enter them directly into your terminal and see the result.
To open Scrapy shell use this command:
scrapy shell
Note: If you would like to use IPython as your Scrapy shell (much more powerful and provides smart auto-completion and colorized output), then make sure you have IPython installed:
pip3 install ipython
And then edit your scrapy.cfg file like so:
## scrapy.cfg
[settings]
# "default" should point at your own project's settings module (the project we created in Part 3)
default = bookscraper.settings
shell = ipython
With our Scrapy shell open, you should see something like this:
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000025111C47948>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x0000025111D17408>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
Fetch The Page
To create our CSS selectors we will be testing them on the BooksToScrape home page.
The first thing we want to do is fetch that page in our Scrapy shell.
fetch('https://books.toscrape.com/')
We should see a response like this:
In [1]: fetch('https://books.toscrape.com/')
2021-12-22 13:28:56 [scrapy.core.engine] INFO: Spider opened
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/robots.txt> (referer: None)
2021-12-22 13:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com/> (referer: None)
As we can see, we successfully retrieved the page from books.toscrape.com, and Scrapy shell has automatically saved the HTML response in the response variable.
In [2]: response
Out[2]: <200 https://books.toscrape.com/>
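While we are here, a few quick sanity checks you can run on the response object in the shell (these are standard Scrapy response attributes, nothing specific to this course):

response.url        # 'https://books.toscrape.com/'
response.status     # 200
len(response.text)  # size of the HTML document in characters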
Find Book CSS Selectors
To find the correct CSS selectors to parse the book details, we will first open the page in our browser's DevTools.
Open the website, then open the developer tools console (right click on the page and click inspect).
Using the inspect element, hover over a book and look at the IDs and classes on the individual books.
In this case, we can see that each book is wrapped in its own <article class="product_pod"> element, so we can use this class to reference our books.
Now, using our Scrapy shell, we can check whether we can extract the book information using this class.
response.css('article.product_pod')
We can see that it has found all the elements that match this selector.
In [3]: response.css('article.product_pod')
Out[3]:
[<Selector xpath="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n    \n...'>,
 <Selector xpath="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n    \n...'>,
 <Selector xpath="descendant-or-self::article[@class and contains(concat(' ', normalize-space(@class), ' '), ' product_pod ')]" data='<article class="product_pod">\n    \n...'>,
 ...
Get First Book
To just get the first book, we append .get() to the end of the command.
response.css('article.product_pod').get()
This returns all the HTML in this node of the DOM tree.
In [4]: response.css('article.product_pod').get()
Out[4]: '<article class="product_pod">\n    \n        <div class="image_container">\n            \n                \n                <a href="catalogue/a-light-in-the-attic_1000/index.html"><img src="media/cache/...
...
Get All Books
Now that we have found the DOM node that contains the book items, we will get all of them, save them into a variable, and then loop through the items and extract the data we need.
We can do this with the following command.
books = response.css('article.product_pod')
The books variable is now a list of all the books on the page.
To check how many books there are, we can get the length of the books variable.
len(books)
Here is the output:
In [6]: len(books)
Out[6]: 20
Extract Book Details
Now let's extract the name, price, availability and URL of each book from the list of books.
The books variable is a list of books. When we update our spider code we will loop through this list, however, to find the correct selectors we will test the CSS selectors on the first element of the list, books[0].
Single Book - Get a single book.
book = books[0]
Name - The book name can be found with:
book.css('h3 a::text').get()
In [5]: book.css('h3 a::text').get()
Out[5]: 'A Light in the ...'
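Note that the name shown in the listing is truncated ('A Light in the ...'). On BooksToScrape the listing anchor also appears to carry the full book name in its title attribute, so if you want the untruncated name in the shell you could try:

book.css('h3 a::attr(title)').get()
# 'A Light in the Attic'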
Price - The book price can be found with:
book.css('div.product_price .price_color::text').get()
The price data includes the currency symbol, and depending on your terminal's encoding it can come back with artifacts (for example 'Â£'). We'll deal with this in Part 6: Data Cleaning With Items & Pipelines.
In [6]: book.css('div.product_price .price_color::text').get()
Out[6]: '£51.77'
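If you want to quickly sanity-check the price as a number while in the shell (the proper cleaning happens in Part 6, and this assumes the '£' symbol decoded correctly), you could do something like:

price_text = book.css('div.product_price .price_color::text').get()
float(price_text.replace('£', ''))
# 51.77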
In Stock - Whether the book is in stock or not can be found with:
book.css('p.availability::text').get()
Again, we can see that the data returned for the in-stock status contains extra whitespace. We'll deal with this in Part 6: Data Cleaning With Items & Pipelines.
In [7]: book.css('p.availability::text').get()
Out[7]: '\n\n \n In stock\n \n'
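The real cleaning is done in Part 6, but as a quick preview in the shell you can join all the text nodes and strip the surrounding whitespace:

''.join(book.css('p.availability::text').getall()).strip()
# 'In stock'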
Book URL - Next, let's see how we can extract the book URL for each individual book. To do that we can use the attrib attribute on the end of book.css('h3 a').
book.css('h3 a').attrib['href']
In [8]: book.css('h3 a').attrib['href']
Out[8]: 'catalogue/a-light-in-the-attic_1000/index.html'
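Note that the href is a relative URL ('catalogue/...'). If you want the absolute URL while testing in the shell, Scrapy's response.urljoin() will resolve it against the page we fetched:

relative_url = book.css('h3 a').attrib['href']
response.urljoin(relative_url)
# 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'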
Adding CSS Selectors To Spider
Now that we've found the correct CSS selectors, let's update our spider. Exit Scrapy shell with the exit() command.
Our updated Spider code should look like this:
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }
Here, our spider does the following steps:
- Makes a request to 'https://books.toscrape.com/'.
- When it gets a response, it extracts all the books from the page using books = response.css('article.product_pod').
- Loops through each book, and extracts the name, price and url using the CSS selectors we created.
- Yields (returns) these items so they can be output to the terminal and/or stored in a CSV, JSON, database, etc.
How to Run Our Scrapy Spider
Now that we have a spider we can run it by going to the top level in our scrapy project and running the following command.
scrapy crawl bookspider
It will run, and you should see the logs on your screen. Here are the final stats:
2021-12-22 14:43:54 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 707,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 64657,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.794875,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 22, 13, 43, 54, 937791),
'httpcompression/response_bytes': 268118,
'httpcompression/response_count': 2,
'item_scraped_count': 20,
'log_count/DEBUG': 26,
'log_count/INFO': 10,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 12, 22, 13, 43, 54, 142916)}
2021-12-22 14:43:54 [scrapy.core.engine] INFO: Spider closed (finished)
We can see from the above stats that our spider scraped 20 items: 'item_scraped_count': 20.
If we want to save the data to a JSON file we can use the -O option, followed by the name of the file.
scrapy crawl bookspider -O myscrapeddata.json
If we want to save the data to a CSV file we can do so too.
scrapy crawl bookspider -O myscrapeddata.csv
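As an aside, if you would rather not pass -O on every run, recent versions of Scrapy also let you configure output feeds on the spider itself via the FEEDS setting, for example by adding a custom_settings class attribute to BookspiderSpider. A minimal sketch (the filename is just an example):

custom_settings = {
    # Write the scraped items to a JSON file on every run;
    # 'overwrite': True replaces the file each time the spider runs.
    'FEEDS': {
        'myscrapeddata.json': {'format': 'json', 'overwrite': True},
    }
}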
Navigating to the "Next Page"
So far the code is working great, but we're only getting the books from the first page of the site, the URL we listed in the start_urls attribute.
So the next logical step is to go to the next page, if there is one, and scrape the book data from that too! Here's how we do that.
First, let's open our Scrapy shell again, fetch the page and find the correct selector to get the next page button.
scrapy shell
Then fetch the page again.
fetch('https://books.toscrape.com/')
And then get the href attribute that contains the url to the next page.
response.css('li.next a::attr(href)').get()
In [2]: response.css('li.next a::attr(href)').get()
Out[2]: 'catalogue/page-2.html'
Now, we just need to update our spider to request this page after it has parsed all items from a page.
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = 'bookspider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('div.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)
Here we see that our spider now finds the URL of the next page and, if it isn't None, builds the full URL and makes another request. On the last page there is no li.next element, so the selector returns None and the spider stops following links.
To get around the next page URL sometimes containing catalogue/ and sometimes not, we add an extra if statement that checks whether it is present and builds the correct next_page_url for each case.
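As an alternative sketch (not the approach used above, but handy to know): response.follow() accepts relative URLs and resolves them against the URL of the page the response came from, so on this site it would build the correct absolute URL for both the 'catalogue/page-2.html' and 'page-3.html' style links without the extra if statement:

# inside parse(), after yielding the book items
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    # response.follow() resolves the relative href against response.url,
    # so 'catalogue/page-2.html' and 'page-3.html' both become full URLs
    yield response.follow(next_page, callback=self.parse)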
If we run our spider again, in our Scrapy stats we see that we have scraped 50 pages, and extracted 1,000 items:
2021-12-22 15:10:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2497,
'downloader/request_count': 50,
'downloader/request_method_count/GET': 50,
'downloader/response_bytes': 245935,
'downloader/response_count': 50,
'downloader/response_status_count/200': 50,
'elapsed_time_seconds': 2.441196,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 22, 14, 10, 45, 62280),
'httpcompression/response_bytes': 986800,
'httpcompression/response_count': 50,
'item_scraped_count': 1000,
'log_count/DEBUG': 78,
'log_count/INFO': 11,
'request_depth_max': 3,
'response_received_count': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2021, 12, 22, 14, 10, 42, 621084)}
2021-12-22 15:10:45 [scrapy.core.engine] INFO: Spider closed (finished)
Now that we have made our Scrapy spider, we can scrape all the books listed on the site. However, when we compare the scraped data to the data available on each individual book page, we can see we are only getting a fraction of the available data.
So in Part 5, we will update our spider to scrape the data from each individual book page.
Next Steps
We've just created our first scraper to scrape all the books from BooksToScrape. In Part 5 we will move on to creating a spider that can crawl the entire website and scrape the full book data from each individual book page.
All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud