freeCodeCamp Scrapy Beginners Course Part 3: Creating Scrapy Project
In Part 3 of the Scrapy Beginner Course, we go through how to create a Scrapy project and explain all of its components.
We will walk through:
- How To Create A Scrapy Project
- Overview of The Scrapy Project Structure
- Scrapy Spiders Explained
- Scrapy Items Explained
- Scrapy Item Pipelines Explained
- Scrapy Middleware Explained
- Scrapy Settings
The code for this part of the course is available on Github here!
If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.
freeCodeCamp Scrapy Course
This guide is part of the 12 Part freeCodeCamp Scrapy Beginner Course where we will build a Scrapy project end-to-end from building the scrapers to deploying on a server and run them every day.
If you would like to skip to another section then use one of the links below:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
The code for this project is available on Github here!
How To Create A Scrapy Project
Now that we have our virtual environment setup and Scrapy installed, we can get onto the fun stuff: creating our first Scrapy project.
Our Scrapy project will hold all the code for our scrapers, and is a pre-built template for how we should structure our scrapers when using Scrapy.
To create a scrapy project, we need to use the following command in our command line:
scrapy startproject <project_name>
So in our project's case, as we're going to be scraping the BooksToScrape website, we will call our project bookscraper. But you can use any project name you would like.
scrapy startproject bookscraper
Now if we enter the ls command into the command line we should see the following files/folders:
├── scrapy.cfg
└── bookscraper
Overview of The Scrapy Project Structure
To help us understand what we've just done, and how Scrapy structures its projects, we're going to pause for a second.
First, we're going to look at what the scrapy startproject bookscraper command we just ran actually did. If you open the folder in VS Code or another code editor you should see the full folder structure.
You should see something like this:
├── scrapy.cfg
└── bookscraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
When we ran the scrapy startproject bookscraper command, Scrapy automatically generated a template project for us to use.
This folder structure illustrates the 5 main building blocks of every Scrapy project: Spiders, Items, Middlewares, Pipelines and Settings.
Using these 5 building blocks you can create a scraper to do pretty much anything.
We won't be using all of these files in this beginners project, but we will give a quick explanation of each as each one has a special purpose:
- settings.py is where all your project settings are contained, like activating pipelines, middlewares etc. Here you can change the delays, concurrency, and lots more things (see the example after this list).
- items.py is a model for the extracted data. You can define a custom model (like a ProductItem) that will inherit the Scrapy Item class and contain your scraped data.
- pipelines.py is where the item yielded by the spider gets passed, it’s mostly used to clean the text and connect to file outputs or databases (CSV, JSON, SQL, etc).
- middlewares.py is useful when you want to modify how requests are made and how Scrapy handles responses.
- scrapy.cfg is a configuration file to change some deployment settings, etc.
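For example, here is a minimal sketch of the kind of thing you would change in settings.py. All of these are standard Scrapy settings, but the pipeline path is just a placeholder for a class you would define yourself in pipelines.py later:
# settings.py
BOT_NAME = 'bookscraper'

## Activate an Item Pipeline (placeholder class path, defined later in pipelines.py)
ITEM_PIPELINES = {
    'bookscraper.pipelines.BookscraperPipeline': 300,
}

## Throttle how hard the scraper hits the website
DOWNLOAD_DELAY = 1          # wait 1 second between requests
CONCURRENT_REQUESTS = 8     # cap the number of requests in flight at once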
Of these five building blocks, the most fundamental are Spiders.
More Complex Explanations
Below, we explain the 5 main building blocks of every Scrapy project (Spiders, Items, Middlewares, Pipelines and Settings) in more detail. If you are new to Python and/or Scrapy this might be too much information too fast, so feel free to skip this section as we will be covering each one in more detail later in the course.
However, if you would like a high level overview of Spiders, Items, Middlewares, Pipelines and Settings then check out the following sections.
Scrapy Spiders Explained
Scrapy spiders are where the magic happens. "Spiders" is the Scrapy name for the main Python class that extracts the data you need from a website.
In your Scrapy project, you can have multiple Spiders all scraping the same or different websites and storing the data in different places.
Anything you could do with a Python Requests/BeautifulSoup scraper you can do with a Scrapy Spider.
import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        item = {}
        product = response.css("div.product_main")
        item["title"] = product.css("h1 ::text").extract_first()
        item['category'] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).extract_first()
        item['description'] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).extract_first()
        item['price'] = response.css('p.price_color ::text').extract_first()
        yield item
To run this Spider, you simply need to run:
scrapy crawl books
When the above Spider is run, it will send a request to https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html and once it has responded it will scrape all book data from the page.
There are a couple things to point out here:
- Asynchronous - As Scrapy is built using the Twisted framework, when you send a request to a website it isn't blocking. Scrapy will send the request, and once it has retrieved a successful response it will trigger the parse method using the callback defined in the original Scrapy Request: yield scrapy.Request(url, callback=self.parse).
- Spider Name - Every spider in your Scrapy project must have a unique name so that Scrapy can identify it. You set this using the name = 'books' attribute.
- Start Requests - You define the starting points for your spider using the start_requests() method. Subsequent requests can be generated successively from these initial requests (see the sketch after this list).
- Parse - You use the parse() method to process the response from the website and extract the data you need. After extraction this data is sent to the Item Pipelines using the yield command.
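To make the start requests and callback flow more concrete, here is a hedged sketch of a spider whose parse() method yields follow-up requests for each next page it finds. The spider name and the CSS selectors are illustrative assumptions; we cover crawling multiple pages properly in Part 5:
import scrapy

class PagingBooksSpider(scrapy.Spider):
    name = 'paging_books'   # hypothetical spider, for illustration only
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        ## Yield one item per book listed on the page
        for book_url in response.css('article.product_pod h3 a::attr(href)').extract():
            yield {'url': response.urljoin(book_url)}

        ## Queue a follow-up request for the next page, if there is one
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)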
Although a Scrapy spider like the BooksSpider above is a bit more structured than your typical Python Requests/BeautifulSoup scraper, it accomplishes the same things.
However, it is with Scrapy Items, Middlewares, Pipelines and Settings that Scrapy really stands out versus Python Requests/BeautifulSoup.
Scrapy Items Explained
Scrapy Items are how we store and process our scraped data. They provide a structured container for the data we scrape so that we can clean, validate and store it easily with Scrapy ItemLoaders, Item Pipelines, and Feed Exporters.
Using Scrapy Items has a number of advantages:
- Structures your data and gives it a clear schema.
- Enables you to easily clean and process your scraped data.
- Enables you to validate, deduplicate and monitor your data feeds.
- Enables you to easily store and export your data with Scrapy Feed Exports.
- Makes it easier to use Scrapy Item Pipelines & Item Loaders.
We typically define our Items in our items.py file.
# items.py
from scrapy.item import Item, Field

class BookItem(Item):
    title = Field()
    category = Field()
    description = Field()
    price = Field()
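To see what that schema gives us, here is a small sketch of how a BookItem behaves (the values are made up). Items support dict-style access, but only for the fields you declared, so typos and unexpected fields get caught immediately:
from bookscraper.items import BookItem

book_item = BookItem()
book_item['title'] = 'A Light in the Attic'   # declared field, works fine
book_item['price'] = '£51.77'

print(book_item['title'])   # A Light in the Attic
print(dict(book_item))      # convert to a plain dict if you need one

## Assigning to a field that wasn't declared on the Item raises a KeyError,
## e.g. book_item['rating'] = 'Three' -> KeyError: 'BookItem does not support field: rating'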
Then inside your spider, instead of yielding a dictionary you would create a new Item with the scraped data before yielding it.
import scrapy
from bookscraper.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        book_item = BookItem()
        product = response.css("div.product_main")
        book_item["title"] = product.css("h1 ::text").extract_first()
        book_item['category'] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).extract_first()
        book_item['description'] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).extract_first()
        book_item['price'] = response.css('p.price_color ::text').extract_first()
        yield book_item
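Items also pair nicely with the Scrapy ItemLoaders mentioned above, which move selector logic and basic cleaning out of parse(). Here is a hedged sketch of the same spider written with an ItemLoader, assuming a recent Scrapy version where the processors live in the itemloaders package; the BookLoader class and its processors are illustrative choices, not requirements:
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader
from bookscraper.items import BookItem

class BookLoader(ItemLoader):
    ## Keep only the first match per field, and strip whitespace on the way in
    default_output_processor = TakeFirst()
    default_input_processor = MapCompose(str.strip)

class BooksSpider(scrapy.Spider):
    name = 'books'

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        loader = BookLoader(item=BookItem(), response=response)
        loader.add_css('title', 'div.product_main h1 ::text')
        loader.add_xpath('category', "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()")
        loader.add_xpath('description', "//div[@id='product_description']/following-sibling::p/text()")
        loader.add_css('price', 'p.price_color ::text')
        yield loader.load_item()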
Scrapy Item Pipelines Explained
Item Pipelines are the data processors of Scrapy, which all our scraped Items will pass through and from where we can clean, process, validate, and store our data.
Using Scrapy Pipelines we can:
- Clean our data (ex. remove currency signs from prices)
- Format our data (ex. convert strings to ints)
- Enrich our data (ex. convert relative links to absolute links)
- Validate our data (ex. make sure the price scraped is a viable price)
- Store our data in databases, queues, files or object storage buckets.
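For instance, a minimal cleaning and validation pipeline might look something like the sketch below (the cleaning rules are illustrative assumptions; Part 6 covers this properly):
# pipelines.py
from scrapy.exceptions import DropItem

class PriceCleaningPipeline:

    def process_item(self, item, spider):
        price = item.get('price')
        if not price:
            ## Validation: drop any item that was scraped without a price
            raise DropItem("Missing price in item")

        ## Cleaning/formatting: strip the currency sign and convert to a float
        item['price'] = float(price.replace('£', '').strip())
        return item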
And here is an example of an Item Pipeline that stores our scraped data in a Postgres database:
# pipelines.py
import psycopg2

class PostgresDemoPipeline:

    def __init__(self):
        ## Connection Details
        hostname = 'localhost'
        username = 'postgres'
        password = '******' # your password
        database = 'books'

        ## Create/Connect to database
        self.connection = psycopg2.connect(host=hostname, user=username, password=password, dbname=database)

        ## Create cursor, used to execute commands
        self.cur = self.connection.cursor()

        ## Create books table if none exists
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS books(
            id serial PRIMARY KEY,
            title text,
            category text,
            description text
        )
        """)

    def process_item(self, item, spider):
        ## Define insert statement
        self.cur.execute(""" insert into books (title, category, description) values (%s,%s,%s)""", (
            item["title"],
            str(item["category"]),
            item["description"]
        ))

        ## Execute insert of data into database
        self.connection.commit()
        return item

    def close_spider(self, spider):
        ## Close cursor & connection to database
        self.cur.close()
        self.connection.close()
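For a pipeline to actually run, it also needs to be activated in settings.py. A minimal sketch, assuming the class above lives in bookscraper/pipelines.py (the number controls the order pipelines run in; lower numbers run first):
# settings.py
ITEM_PIPELINES = {
    'bookscraper.pipelines.PostgresDemoPipeline': 300,
}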
Scrapy Middlewares Explained
As we've discussed, Scrapy is a complete web scraping framework that manages a lot of the complexity of scraping at scale for you behind the scenes without you having to configure anything.
Most of this functionality is contained within Middlewares in the form of Downloader Middlewares and Spider Middlewares.
Downloader Middlewares
Downloader middlewares are specific hooks that sit between the Scrapy Engine and the Downloader, which process requests as they pass from the Engine to the Downloader, and responses as they pass from Downloader to the Engine.
By default Scrapy has the following downloader middlewares enabled:
# settings.py
DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}
These middlewares control everything from:
- Timing out requests
- What headers to send with your requests
- What user agents to use with your requests
- Retrying failed requests
- Managing cookies, caches and response compression
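Much of this behaviour can be tuned straight from settings.py without touching the middlewares themselves. Here is a sketch with illustrative values; all of the setting names are standard Scrapy settings:
# settings.py
DOWNLOAD_TIMEOUT = 30       # used by DownloadTimeoutMiddleware
USER_AGENT = 'bookscraper (+https://www.example.com)'   # used by UserAgentMiddleware
RETRY_ENABLED = True        # used by RetryMiddleware
RETRY_TIMES = 2             # retry failed requests up to 2 extra times
COOKIES_ENABLED = True      # used by CookiesMiddleware
HTTPCACHE_ENABLED = False   # used by HttpCacheMiddleware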
You can disable any of these default middlewares by setting it to None in your settings.py file. Here is an example of disabling the RobotsTxtMiddleware:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}
You can also override existing middlewares, or insert your own completely new middlewares if you want to:
- alter a request just before it is sent to the website (change the proxy, user-agent, etc.)
- change received response before passing it to a spider
- retry a request if the response doesn't contain the correct data instead of passing received response to a spider
- pass response to a spider without fetching a web page
- silently drop some requests
Here is an example of inserting our own middleware to use a proxy with all of our requests. We will create this in our middlewares.py file:
## middlewares.py
import base64

class MyProxyMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def __init__(self, settings):
        self.user = settings.get('PROXY_USER')
        self.password = settings.get('PROXY_PASSWORD')
        self.endpoint = settings.get('PROXY_ENDPOINT')
        self.port = settings.get('PROXY_PORT')

    def process_request(self, request, spider):
        user_credentials = '{user}:{passw}'.format(user=self.user, passw=self.password)
        basic_authentication = 'Basic ' + base64.b64encode(user_credentials.encode()).decode()
        host = 'http://{endpoint}:{port}'.format(endpoint=self.endpoint, port=self.port)
        request.meta['proxy'] = host
        request.headers['Proxy-Authorization'] = basic_authentication
We would then enable it in our settings.py file, and fill in our proxy connection details:
## settings.py
PROXY_USER = 'username'
PROXY_PASSWORD = 'password'
PROXY_ENDPOINT = 'proxy.proxyprovider.com'
PROXY_PORT = '8000'

DOWNLOADER_MIDDLEWARES = {
    'bookscraper.middlewares.MyProxyMiddleware': 350,
}
Spider Middlewares
Spider middlewares are specific hooks that sit between the Scrapy Engine and the Spiders, and which process spider input (responses) and output (items and requests).
By default Scrapy has the following spider middlewares enabled:
# settings.py
SPIDER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,
    # Spider side
}
Spider middlewares are used to (a custom example is sketched at the end of this section):
- post-process output of spider callbacks - change/add/remove requests or items
- post-process start_requests
- handle spider exceptions
- call errback instead of callback for some of the requests based on response content
Like Downloader middlewares, you can disable any of these default Spider middlewares by setting it to None in your settings.py file. Here is an example of disabling the RefererMiddleware:
# settings.py
SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.referer.RefererMiddleware': None,
}
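And here is a hedged sketch of a custom spider middleware that post-processes spider output, dropping any scraped item that is missing a title. The class name and the rule itself are just examples:
## middlewares.py
import scrapy

class DropIncompleteItemsMiddleware:

    def process_spider_output(self, response, result, spider):
        ## result is everything the spider callback yielded: requests and items
        for item_or_request in result:
            if isinstance(item_or_request, scrapy.Request):
                yield item_or_request   # pass follow-up requests through untouched
            elif item_or_request.get('title'):
                yield item_or_request   # keep items that have a title
            else:
                spider.logger.warning("Dropping item with no title: %r", item_or_request)
Like the Downloader middleware example above, it would then be enabled via the SPIDER_MIDDLEWARES setting in settings.py.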
Scrapy Settings Explained
The settings.py file is the central control panel for your Scrapy project. You can enable/disable default functionality or integrate your own custom middlewares and extensions.
You can change settings on a project-wide basis by updating the settings.py file, or on an individual Spider basis by adding custom_settings to each spider.
In the following example, we add custom settings to our spider so that the scraped data will be saved to a data.csv file using the custom_settings attribute.
import scrapy
from bookscraper.items import BookItem

class BooksSpider(scrapy.Spider):
    name = 'books'

    custom_settings = {
        'FEEDS': { 'data.csv': { 'format': 'csv',}}
    }

    def start_requests(self):
        url = 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        book_item = BookItem()
        product = response.css("div.product_main")
        book_item["title"] = product.css("h1 ::text").extract_first()
        book_item['category'] = response.xpath(
            "//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()"
        ).extract_first()
        book_item['description'] = response.xpath(
            "//div[@id='product_description']/following-sibling::p/text()"
        ).extract_first()
        book_item['price'] = response.css('p.price_color ::text').extract_first()
        yield book_item
There is a huge range of settings you can configure in Scrapy, so if you'd like to explore them all, here is a complete list of the default settings Scrapy provides.
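One more detail worth knowing: spiders can read the active settings at runtime through self.settings, which is useful when your code should adapt to whatever is currently configured. A small sketch (the spider and log message are purely illustrative):
import scrapy

class SettingsAwareSpider(scrapy.Spider):
    name = 'settings_aware'   # hypothetical spider, for illustration only
    start_urls = ['https://books.toscrape.com/']

    def parse(self, response):
        ## Read any active setting through the Settings API
        bot_name = self.settings.get('BOT_NAME')
        delay = self.settings.getfloat('DOWNLOAD_DELAY')
        self.logger.info("Running %s with a %ss download delay", bot_name, delay)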
Next Steps
Now that we have our Scrapy project set up, we will move onto creating our first Scrapy spider.
All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud