freeCodeCamp Scrapy Beginners Course Part 1: Scrapy Overview
Scrapy is a hugely popular Python framework specifically designed for web scraping and crawling. It has a huge amount of functionality ready to use out of the box and can be easily extended with open-source Scrapy extensions and middlewares.
This makes Scrapy a great option for anyone who wants to build production-ready scrapers that can scrape the web at scale.
To help you get started, in this free 12 Part Beginner Course we will build a Scrapy project end-to-end, from building the scrapers to deploying them on a server and running them every day.
freeCodeCamp Scrapy Beginners Course
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
The code for this project is available on Github here!
If you prefer video tutorials, then check out the full video version of this article on the freeCodeCamp channel here.
What Is Scrapy?
Developed by the co-founders of Zyte, Pablo Hoffman and Shane Evans, Scrapy is a Python framework specifically designed for web scraping.
Using Scrapy you can easily build highly scalable scrapers that will retrieve a page's HTML, parse and process the data, and store it in the file format and location of your choice.
Why & When Should You Use Scrapy?
Scrapy is a Python framework designed specifically for web scraping. Built using Twisted, an event-driven networking engine, Scrapy uses an asynchronous architecture to crawl & scrape websites quickly and at scale.
With Scrapy you write Spiders to retrieve HTML pages from websites and scrape the data you want, clean and validate it, and store it in the data format you want.
Here is an example Spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # extract the text, author and tags from every quote on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # go to next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Scrapy is a highly customizable web scraping framework that also has a large community of developers building extensions & plugins for it.
So if the vanilla version of Scrapy doesn't have everything you want, then it is easily customized with open-source Scrapy extensions or your own middlewares/extensions.
Why Choose Scrapy?
There are other Python libraries that are also used for web scraping:
- Python Requests/BeautifulSoup: Good for small-scale web scraping where the data is returned in the HTML response. You would need to build your own spider management functionality to handle concurrency, retries, data cleaning and data storage.
- Python Requests-HTML: Combining Python Requests with a parsing library, Requests-HTML is a middle-ground between the Python Requests/BeautifulSoup combo and Scrapy.
- Python Selenium: Use this if you are scraping a site that only returns the target data after the JavaScript has rendered, or if you need to interact with page elements to get the data.
However, Python Scrapy has a lot more functionality and is great for large-scale scraping right out of the box:
- CSS Selector & XPath Expressions Parsing
- Data formatting (CSV, JSON, XML) and Storage (FTP, S3, local filesystem)
- Robust Encoding Support
- Concurrency Management
- Automatic Retries
- Cookies and Session Handling
- Crawl Spiders & In-Built Pagination Support
You just need to customize it in your settings file or add in one of the many Scrapy extensions and middlewares that developers have open sourced.
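For example, much of the behaviour above is controlled with a handful of settings. Here is a minimal sketch of what tweaking a project's settings.py might look like (the values are purely illustrative, not recommendations):

    # settings.py (illustrative values only)
    CONCURRENT_REQUESTS = 32    # how many requests Scrapy makes in parallel
    DOWNLOAD_DELAY = 0.25       # delay (in seconds) between requests to the same domain
    RETRY_ENABLED = True        # automatically retry failed requests
    RETRY_TIMES = 3             # how many times to retry before giving up
    COOKIES_ENABLED = True      # handle cookies & sessions automatically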
The learning curve is initially steeper than with the Python Requests/BeautifulSoup combo; however, it will save you a lot of time in the long run when deploying production scrapers and scraping at scale.
Part 2: Setting Up Environment & Scrapy
In Part 2: Setting Up Environment & Scrapy we go through how to set up your Python environment and install Scrapy.
We will walk through:
- How To Install Python
- Setting Up Your Python Virtual Environment On Linux/MacOS
- Setting Up Your Python Virtual Environment On Windows
- How To Install Scrapy
Part 3: Creating Scrapy Project
In Part 3: Creating Scrapy Project we go through how to create a Scrapy project and explain all of its components.
We will walk through:
- How To Create A Scrapy Project
- Overview of The Scrapy Project Structure
- Scrapy Spiders Explained
- Scrapy Items Explained
- Scrapy Item Pipelines Explained
- Scrapy Middleware Explained
- Scrapy Settings
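To give you a feel for what to expect, this is roughly the layout the scrapy startproject command generates (the project name bookscraper is just an example):

    bookscraper/
        scrapy.cfg            # deploy configuration file
        bookscraper/          # the project's Python module
            __init__.py
            items.py          # item definitions (the structure of your scraped data)
            middlewares.py    # spider & downloader middlewares
            pipelines.py      # item pipelines (cleaning & storing data)
            settings.py       # project settings
            spiders/          # folder where your spiders live
                __init__.py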
Part 4: First Scrapy Spider
In Part 4: First Scrapy Spider we go through how to create our first Scrapy spider to scrape BooksToScrape.com.
We will walk through:
- Creating Our Scrapy Spider
- Using Scrapy Shell To Find Our CSS Selectors
- Adding CSS Selectors To Spider
- How to Run Our Scrapy Spider
- How to Navigate Through Pages
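As a taster, a first pass at that spider could look something like the sketch below. The CSS selectors are assumptions about the BooksToScrape.com page structure, which we verify with the Scrapy shell in that part:

    import scrapy

    class BooksSpider(scrapy.Spider):
        name = 'books'
        start_urls = ['https://books.toscrape.com/']

        def parse(self, response):
            # each book on the listing page sits inside an <article class="product_pod"> tag
            for book in response.css('article.product_pod'):
                yield {
                    'title': book.css('h3 a::attr(title)').get(),
                    'price': book.css('p.price_color::text').get(),
                    'url': book.css('h3 a::attr(href)').get(),
                }

            # follow the pagination link until there are no more pages
            next_page = response.css('li.next a::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)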
Part 5: Crawling With Scrapy
In Part 5: Crawling With Scrapy, we go through how to create a more advanced Scrapy spider that will crawl the entire BooksToScrape.com website and scrape the data from each individual book page.
We will walk through:
- Discover & Request Book Pages
- Using Scrapy Shell To Find CSS & XPath Selectors
- Updating Our BooksSpider
- Testing Our Scrapy Spider
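The core pattern in that part is having the listing parser hand every book link off to a second callback that scrapes the individual book page. A rough sketch of the idea (the selectors are again illustrative assumptions):

    # inside our BooksSpider class from Part 4
    def parse(self, response):
        # request every individual book page found on the listing page
        for link in response.css('article.product_pod h3 a::attr(href)').getall():
            yield response.follow(link, callback=self.parse_book_page)

        # and keep paginating through the listing pages
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book_page(self, response):
        # scrape the extra fields only available on the book's own page
        yield {
            'title': response.css('div.product_main h1::text').get(),
            'price': response.css('p.price_color::text').get(),
            'description': response.css('#product_description + p::text').get(),
        }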
Part 6: Cleaning Data With Item Pipelines
In Part 6: Cleaning Data With Item Pipelines we go through how to use Scrapy Items & Item Pipelines to structure and clean your scraped data. Scraped data can be very messy and unstructured; for example, it might:
- Be in the wrong format (text instead of a number)
- Contain additional unnecessary data
- Use the wrong encoding
We will walk through:
- What Are Scrapy Items?
- Using Scrapy Items To Structure Our Data
- What Are Scrapy Pipelines?
- Cleaning Our Scraped Data With Item Pipelines
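As a tiny sketch of the idea, an Item defines the structure of each record and a pipeline cleans every item the spider yields (the field names and the price-cleaning rule here are illustrative, not the final course code):

    # items.py
    import scrapy

    class BookItem(scrapy.Item):
        title = scrapy.Field()
        price = scrapy.Field()

    # pipelines.py
    class PriceToFloatPipeline:
        def process_item(self, item, spider):
            # convert a price string like '£51.77' into a float
            item['price'] = float(str(item['price']).replace('£', '').strip())
            return item

The pipeline is then switched on in settings.py via the ITEM_PIPELINES setting.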
Part 7: Storing Data In CSVs & Databases
In Part 7: Storing Data In CSVs & Databases, we go through how to save our scraped data to CSV files and MySQL & Postgres databases.
We will walk through:
- Scrapy Feed Exporters
- Saving Data To CSVs
- Saving Data To JSON Files
- Saving Data To MySQL Databases
- Saving Data To Postgres Databases
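For files, Scrapy's Feed Exporters can do the work from settings alone. A minimal sketch (the file names are just examples):

    # settings.py -- write every scraped item to both a CSV and a JSON file
    FEEDS = {
        'books.csv': {'format': 'csv'},
        'books.json': {'format': 'json'},
    }

Saving to MySQL or Postgres databases is handled with custom Item Pipelines instead, which is what this part walks through.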
Part 8: Faking Scrapy Headers & User-Agents
In Part 8: Faking Scrapy Headers & User-Agents, we go through how to use fake headers and user-agents to help prevent your scrapers from getting blocked.
We will walk through:
- Getting Blocked Whilst Web Scraping
- What Are User-Agents & Why Do We Need To Manage Them?
- How To Set A Fake User-Agent In Scrapy
- How To Rotate User Agents
- ScrapeOps Fake User-Agent API
- Fake Browser Headers vs Fake User-Agents
- Using Fake Browser Headers With Scrapy
- ScrapeOps Fake Browser Header API
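The simplest version of the idea is a downloader middleware that stamps a random user-agent onto every outgoing request. A rough sketch (the user-agent strings are shortened placeholders, and the middleware still needs to be enabled via the DOWNLOADER_MIDDLEWARES setting):

    # middlewares.py
    import random

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',          # placeholders, use real UA strings
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...',
    ]

    class RandomUserAgentMiddleware:
        def process_request(self, request, spider):
            # overwrite the User-Agent header before the request is sent
            request.headers['User-Agent'] = random.choice(USER_AGENTS)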
Part 9: Using Proxies With Scrapy Spiders
In Part 9: Using Proxies With Scrapy Spiders, we go through how you can use rotating proxy pools to hide your IP address and scrape at scale without getting blocked.
We will walk through:
- What Are Proxies & Why Do We Need Them?
- The 3 Most Popular Proxy Integration Methods
- How To Integrate & Rotate Proxy Lists
- How To Use Rotating/Backconnect Proxies
- How To Use Proxy APIs
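At its simplest, Scrapy will route a request through whatever proxy you set in that request's meta dict. A sketch of rotating through a proxy list (the proxy addresses are placeholders):

    # middlewares.py
    import random

    PROXIES = [
        'http://proxy1.example.com:8000',   # placeholder proxy endpoints
        'http://proxy2.example.com:8000',
    ]

    class RotatingProxyMiddleware:
        def process_request(self, request, spider):
            # Scrapy's built-in HttpProxyMiddleware picks this up and routes the request through it
            request.meta['proxy'] = random.choice(PROXIES)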
Part 10: Deploying & Scheduling Spiders With Scrapyd
In Part 10: Deploying & Scheduling Spiders With Scrapyd, we go through how you can deploy and run your spiders in the cloud with Scrapyd.
We will walk through:
- What Is Scrapyd?
- How to Setup Scrapyd
- Controlling Spiders With Scrapyd
- Scrapyd Dashboards
- Integrating Scrapyd with ScrapeOps
- Integrating Scrapyd with ScrapydWeb
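Once Scrapyd is running (it listens on port 6800 by default), spiders are controlled through its JSON API. For example, scheduling a job looks roughly like this (the project and spider names are examples):

    import requests

    # ask the Scrapyd daemon to schedule a run of the 'books' spider
    response = requests.post(
        'http://localhost:6800/schedule.json',
        data={'project': 'bookscraper', 'spider': 'books'},
    )
    print(response.json())   # e.g. {'status': 'ok', 'jobid': '...'}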
Part 11: Deploying & Scheduling Spiders With ScrapeOps
In Part 11: Deploying & Scheduling Spiders With ScrapeOps, we go through how you can deploy, schedule and run your spiders on any server with ScrapeOps.
We will walk through:
- What Are The ScrapeOps Job Scheduler & Monitor?
- ScrapeOps Scrapy Monitor
- Setting Up The ScrapeOps Scrapy Monitor
- ScrapeOps Server Manager & Scheduler
- Connecting ScrapeOps To Your Server
- Deploying Code From Github To Server With ScrapeOps
- Scheduling & Running Spiders In Cloud With ScrapeOps
Part 12: Deploying & Scheduling Spiders With Scrapy Cloud
In Part 12: Deploying & Scheduling Spiders With Scrapy Cloud, we go through how you can deploy, schedule and run your spiders in the cloud with Scrapy Cloud.
We will walk through:
- What Is Scrapy Cloud?
- Get Started With Scrapy Cloud
- Deploy Your Spiders To Scrapy Cloud From Command Line
- Deploy Your Spiders To Scrapy Cloud via GitHub
- Run Spiders On Scrapy Cloud
- Schedule Jobs on Scrapy Cloud
Next Steps
With the above overview, we hope you now have enough of the basics to get up and running scraping a simple ecommerce site.
If you would like the code from this example, please check it out on Github here!
In Part 2 of the course, we go through how to set up your Python environment and install Scrapy.
All parts of the 12 Part freeCodeCamp Scrapy Beginner Course are as follows:
- Part 1: Course & Scrapy Overview
- Part 2: Setting Up Environment & Scrapy
- Part 3: Creating Scrapy Project
- Part 4: First Scrapy Spider
- Part 5: Crawling With Scrapy
- Part 6: Cleaning Data With Item Pipelines
- Part 7: Storing Data In CSVs & Databases
- Part 8: Faking Scrapy Headers & User-Agents
- Part 9: Using Proxies With Scrapy Spiders
- Part 10: Deploying & Scheduling Spiders With Scrapyd
- Part 11: Deploying & Scheduling Spiders With ScrapeOps
- Part 12: Deploying & Scheduling Spiders With Scrapy Cloud