freeCodeCamp Scrapy Beginners Course Part 1: Scrapy Overview
Scrapy is a hugely popular Python framework specifically designed for web scraping and crawling. It has a huge amount of functionality ready to use out of the box and can be easily extended with open-source Scrapy extensions and middlewares.
This makes Scrapy a great option for anyone who wants to build production-ready scrapers that can scrape the web at scale.
To help you get started, in this free 12 Part Beginner Course we will build a Scrapy project end-to-end, from building the scrapers to deploying them on a server and running them every day.
freeCodeCamp Scrapy Beginners Course
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
The code for this project is available on Github here!
If you prefer video tutorials, then check out the full video version of this article on the freeCodeCamp channel here.
What Is Scrapy?
Developed by the co-founders of Zyte, Pablo Hoffman and Shane Evans, Scrapy is a Python framework specifically designed for web scraping.
Using Scrapy you can easily build highly scalable scrapers that will retrieve a page's HTML, parse and process the data, and store it in the file format and location of your choice.
Why & When Should You Use Scrapy?
Scrapy is a Python framework designed specifically for web scraping. Built using Twisted, an event-driven networking engine, Scrapy uses an asynchronous architecture to crawl & scrape websites quickly and at scale.
With Scrapy you write Spiders to retrieve HTML pages from websites and scrape the data you want, clean and validate it, and store it in the data format you want.
Here is an example Spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # extract the text, author and tags from every quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # go to next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Scrapy is a highly customizable web scraping framework that also has a large community of developers building extensions & plugins for it.
So if the vanilla version of Scrapy doesn't have everything you want, then it is easily customized with open-source Scrapy extensions or your own middlewares/extensions.
Why Choose Scrapy?
There are other Python libraries that are also used for web scraping:
- Python Requests/BeautifulSoup: Good for small-scale web scraping where the data is returned in the HTML response. You would need to build your own spider management functionality to handle concurrency, retries, data cleaning and data storage.
- Python Requests-HTML: Combining Python Requests with a parsing library, Requests-HTML is a middle-ground between the Python Requests/BeautifulSoup combo and Scrapy.
- Python Selenium: Use this if you are scraping a site that only returns the target data after the JavaScript has rendered, or if you need to interact with page elements to get the data.
However, Python Scrapy has a lot more functionality and is great for large-scale scraping right out of the box:
- CSS Selector & XPath Expressions Parsing
- Data formatting (CSV, JSON, XML) and Storage (FTP, S3, local filesystem)
- Robust Encoding Support
- Concurrency Management
- Automatic Retries
- Cookies and Session Handling
- Crawl Spiders & In-Built Pagination Support
You just need to customize it in your settings file or add in one of the many Scrapy extensions and middlewares that developers have open sourced.
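For example, much of the behaviour above is controlled with a handful of settings. Here is a minimal sketch of what tweaking a project's settings.py might look like (the values are purely illustrative, not recommendations):

    # settings.py (illustrative values only)
    CONCURRENT_REQUESTS = 32    # how many requests Scrapy makes in parallel
    DOWNLOAD_DELAY = 0.25       # delay (in seconds) between requests to the same domain
    RETRY_ENABLED = True        # automatically retry failed requests
    RETRY_TIMES = 3             # how many times to retry before giving up
    COOKIES_ENABLED = True      # handle cookies & sessions automatically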
The learning curve is initially steeper than with the Python Requests/BeautifulSoup combo; however, it will save you a lot of time in the long run when deploying production scrapers and scraping at scale.
Part 2: Setting Up Environment & Scrapy
In Part 2: Setting Up Environment & Scrapy we go through how to set up your Python environment and install Scrapy.
We will walk through:
- How To Install Python
- Setting Up Your Python Virtual Environment On Linux/MacOS
- Setting Up Your Python Virtual Environment On Windows
- How To Install Scrapy
Part 3: Creating Scrapy Project
In Part 3: Creating Scrapy Project we go through how to create a Scrapy project and explain all of its components.
We will walk through:
- How To Create A Scrapy Project
- Overview of The Scrapy Project Structure
- Scrapy Spiders Explained
- Scrapy Items Explained
- Scrapy Item Pipelines Explained
- Scrapy Middleware Explained
- Scrapy Settings
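To give you a feel for what to expect, this is roughly the layout the scrapy startproject command generates (the project name bookscraper is just an example):

    bookscraper/
        scrapy.cfg            # deploy configuration file
        bookscraper/          # the project's Python module
            __init__.py
            items.py          # item definitions (the structure of your scraped data)
            middlewares.py    # spider & downloader middlewares
            pipelines.py      # item pipelines (cleaning & storing data)
            settings.py       # project settings
            spiders/          # folder where your spiders live
                __init__.py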
Part 4: First Scrapy Spider
In Part 4: First Scrapy Spider we go through how to create our first Scrapy spider to scrape BooksToScrape.com.
We will walk through:
- Creating Our Scrapy Spider
- Using Scrapy Shell To Find Our CSS Selectors
- Adding CSS Selectors To Spider
- How to Run Our Scrapy Spider
- How to Navigate Through Pages
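As a taster, a first pass at that spider could look something like the sketch below. The CSS selectors are assumptions about the BooksToScrape.com page structure, which we verify with the Scrapy shell in that part:

    import scrapy

    class BooksSpider(scrapy.Spider):
        name = 'books'
        start_urls = ['https://books.toscrape.com/']

        def parse(self, response):
            # each book on the listing page sits inside an <article class="product_pod"> tag
            for book in response.css('article.product_pod'):
                yield {
                    'title': book.css('h3 a::attr(title)').get(),
                    'price': book.css('p.price_color::text').get(),
                    'url': book.css('h3 a::attr(href)').get(),
                }

            # follow the pagination link until there are no more pages
            next_page = response.css('li.next a::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)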
Part 5: Crawling With Scrapy
In Part 5: Crawling With Scrapy, we go through how to create a more advanced Scrapy spider that will crawl the entire BooksToScrape.com website and scrape the data from each individual book page.
We will walk through:
- Discover & Request Book Pages
- Using Scrapy Shell To Find CSS & XPath Selectors
- Updating Our BooksSpider
- Testing Our Scrapy Spider
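The core pattern in that part is having the listing parser hand every book link off to a second callback that scrapes the individual book page. A rough sketch of the idea (the selectors are again illustrative assumptions):

    # inside our BooksSpider class from Part 4
    def parse(self, response):
        # request every individual book page found on the listing page
        for link in response.css('article.product_pod h3 a::attr(href)').getall():
            yield response.follow(link, callback=self.parse_book_page)

        # and keep paginating through the listing pages
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book_page(self, response):
        # scrape the extra fields only available on the book's own page
        yield {
            'title': response.css('div.product_main h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'description': response.css('#product_description + p::text').get(),
        }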
Part 6: Cleaning Data With Item Pipelines
In Part 6: Cleaning Data With Item Pipelines we go through how to use Scrapy Items & Item Pipelines to structure and clean your scraped data. Scraped data can be very messy and unstructured; for example, it might:
- Be in the wrong format (text instead of a number)
- Contain additional unnecessary data
- Use the wrong encoding
We will walk through:
- What Are Scrapy Items?
- Using Scrapy Items To Structure Our Data
- What Are Scrapy Pipelines?
- Cleaning Our Scraped Data With Item Pipelines
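As a tiny sketch of the idea, an Item defines the structure of each record and a pipeline cleans every item the spider yields (the field names and the price-cleaning rule here are illustrative, not the final course code):

    # items.py
    import scrapy

    class BookItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()

    # pipelines.py
    class PriceToFloatPipeline:
        def process_item(self, item, spider):
            # convert a price string like '£51.77' into a float
            item['price'] = float(str(item['price']).replace('£', '').strip())
            return item

The pipeline is then switched on in settings.py via the ITEM_PIPELINES setting.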
Part 7: Storing Data In CSVs & Databases
In Part 7: Storing Data In CSVs & Databases, we go through how to save our scraped data to CSV files and MySQL & Postgres databases.
We will walk through:
- Scrapy Feed Exporters
- Saving Data To CSVs
- Saving Data To JSON Files
- Saving Data To MySQL Databases
- Saving Data To Postgres Databases
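For files, Scrapy's Feed Exporters can do the work from settings alone. A minimal sketch (the file names are just examples):

    # settings.py -- write every scraped item to both a CSV and a JSON file
    FEEDS = {
        'books.csv': {'format': 'csv'},
        'books.json': {'format': 'json'},
    }

Saving to MySQL or Postgres databases is handled with custom Item Pipelines instead, which is what this part walks through.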
Part 8: Faking Scrapy Headers & User-Agents
In Part 8: Faking Scrapy Headers & User-Agents, we go through how to use fake headers and user-agents to help prevent your scrapers from getting blocked.
We will walk through:
- Getting Blocked Whilst Web Scraping
- What Are User-Agents & Why Do We Need To Manage Them?
- How To Set A Fake User-Agent In Scrapy
- How To Rotate User Agents
- ScrapeOps Fake User-Agent API
- Fake Browser Headers vs Fake User-Agents
- Using Fake Browser Headers With Scrapy
- ScrapeOps Fake Browser Header API
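The simplest version of the idea is a downloader middleware that stamps a random user-agent onto every outgoing request. A rough sketch (the user-agent strings are shortened placeholders, and the middleware still needs to be enabled via the DOWNLOADER_MIDDLEWARES setting):

    # middlewares.py
    import random

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',          # placeholders, use real UA strings
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
    ]

    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            # overwrite the User-Agent header before the request is sent
            request.headers['User-Agent'] = random.choice(USER_AGENTS)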
Part 9: Using Proxies With Scrapy Spiders
In Part 9: Using Proxies With Scrapy Spiders, we go through how you can use rotating proxy pools to hide your IP address and scrape at scale without getting blocked.
We will walk through:
- What Are Proxies & Why Do We Need Them?
- The 3 Most Popular Proxy Integration Methods
- How To Integrate & Rotate Proxy Lists
- How To Use Rotating/Backconnect Proxies
- How To Use Proxy APIs
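At its simplest, Scrapy will route a request through whatever proxy you set in that request's meta dict. A sketch of rotating through a proxy list (the proxy addresses are placeholders):

    # middlewares.py
    import random

    PROXIES = [
        'http://proxy1.example.com:8000',   # placeholder proxy endpoints
        'http://proxy2.example.com:8000',
    ]

    class RotatingProxyMiddleware:
        def process_request(self, request, spider):
            # Scrapy's built-in HttpProxyMiddleware picks this up and routes the request through it
            request.meta['proxy'] = random.choice(PROXIES)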
Part 10: Deploying & Scheduling Spiders With Scrapyd
In Part 10: Deploying & Scheduling Spiders With Scrapyd, we go through how you can deploy and run your spiders in the cloud with Scrapyd.
We will walk through:
- What Is Scrapyd?
- How to Setup Scrapyd
- Controlling Spiders With Scrapyd
- Scrapyd Dashboards
- Integrating Scrapyd with ScrapeOps
- Integrating Scrapyd with ScrapydWeb
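Once Scrapyd is running (it listens on port 6800 by default), spiders are controlled through its JSON API. For example, scheduling a job looks roughly like this (the project and spider names are examples):

    import requests

    # ask the Scrapyd daemon to schedule a run of the 'books' spider
    response = requests.post(
        'http://localhost:6800/schedule.json',
        data={'project': 'bookscraper', 'spider': 'books'},
    )
    print(response.json())   # e.g. {'status': 'ok', 'jobid': '...'}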
Part 11: Deploying & Scheduling Spiders With ScrapeOps
In Part 11: Deploying & Scheduling Spiders With ScrapeOps, we go through how you can deploy, schedule and run your spiders on any server with ScrapeOps.
We will walk through:
- What Are The ScrapeOps Job Scheduler & Monitor?
- ScrapeOps Scrapy Monitor
- Setting Up The ScrapeOps Scrapy Monitor
- ScrapeOps Server Manager & Scheduler
- Connecting ScrapeOps To Your Server
- Deploying Code From Github To Server With ScrapeOps
- Scheduling & Running Spiders In Cloud With ScrapeOps
Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
In Part 12: Deploying & Scheduling Spiders With Scrapy Cloud, we go through how you can deploy, schedule and run your spiders in the cloud with Scrapy Cloud.
We will walk through:
- What Is Scrapy Cloud?
- Get Started With Scrapy Cloud
- Deploy Your Spiders To Scrapy Cloud From Command Line
- Deploy Your Spiders To Scrapy Cloud via GitHub
- Run Spiders On Scrapy Cloud
- Schedule Jobs on Scrapy Cloud
Next Steps
With the above overview, we hope you now have enough of the basics to get up and running scraping a simple ecommerce site.
If you would like the code from this example, please check it out on Github here!
In Part 2 of the course, we go through how to set up your Python environment and install Scrapy.
All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud