
freeCodeCamp Scrapy Beginners Course Part 11: Deploying & Scheduling Spiders With ScrapeOps

In Part 11 of the Scrapy Beginner Course, we go through how you can deploy, schedule and run your spiders on any server with ScrapeOps.

There are several ways to run and deploy your scrapers to the cloud, which we cover across this course. In this part, we will show you how to deploy, schedule and run your spiders in the cloud with ScrapeOps.

The code for this part of the course is available on Github here!

If you prefer video tutorials, then check out the video version of this course on the freeCodeCamp channel here.


What Are The ScrapeOps Job Scheduler & Monitor?

The ScrapeOps Job Scheduler & Monitor is a free-to-use web scraper monitoring and job scheduling tool that you can easily integrate with any Scrapy spider to monitor your spiders and run them remotely in the cloud.

It comprises two parts:

  • The ScrapeOps Scrapy Monitor
  • The ScrapeOps Server Manager & Scheduler

You can use one without the other, but for the best experience you should use both, as together they allow you to deploy, schedule, run and monitor your web scraping jobs in production.


ScrapeOps Scrapy Monitor

The ScrapeOps Monitor is a free monitoring and alerting tool dedicated to web scraping. With a simple 30-second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.

Live demo here: ScrapeOps Demo


The primary goal of the ScrapeOps Monitor is to give every developer the same level of scraping monitoring capabilities as the most sophisticated web scrapers, without any of the hassle of setting up your own custom solution.

The ScrapeOps Monitor is a full end-to-end monitoring and management tool dedicated to web scraping that automatically sets up all the monitors, health checks and alerts for you. If you have an issue with integrating ScrapeOps or need advice on setting up your scrapers, they have a support team on hand to assist you.

Features

Once you have completed the simple install (3 lines in your scraper), ScrapeOps will:

  • 🕵️‍♂️ Monitor - Automatically monitor all your scrapers.
  • 📈 Dashboards - Visualise your job data in dashboards, so you see real-time & historical stats.
  • 💯 Data Quality - Validate the field coverage in each of your jobs, so broken parsers can be detected straight away.
  • 📉 Auto Health Checks - Automatically check every job's performance data versus its 7-day moving average to see if it's healthy or not.
  • ✔️ Custom Health Checks - Check each job with any custom health checks you have enabled for it.
  • Alerts - Alert you via email, Slack, etc. if any of your jobs are unhealthy.
  • 📑 Reports - Generate daily (periodic) reports that check all jobs against your criteria and let you know if everything is healthy or not.

Job stats tracked include:

  • ✅ Pages Scraped & Missed
  • ✅ Items Parsed & Missed
  • ✅ Item Field Coverage
  • ✅ Runtimes
  • ✅ Response Status Codes
  • ✅ Success Rates
  • ✅ Latencies
  • ✅ Errors & Warnings
  • ✅ Bandwidth

Setting Up The ScrapeOps Scrapy Monitor

Getting set up with the ScrapeOps logger is simple. Just install the Python package:


pip install scrapeops-scrapy

And add 3 lines to your settings.py file:

## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

From there, your scraping stats will be automatically logged and shipped to your dashboard.
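With the monitor installed, you just run your spiders as normal and the extension takes care of the rest. For example, a minimal run, assuming the bookspider built in earlier parts of this course:

## Run your spider from the project directory as normal.
## The ScrapeOps extension logs the job's stats and ships them to your dashboard.
scrapy crawl bookspider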

ScrapeOps Dashboard Demo


ScrapeOps Server Manager & Scheduler

The ScrapeOps Server Manager & Scheduler allows you to deploy, schedule & run any type of scraper (Python, NodeJs, Scrapy, etc.) on any server that is accessible via SSH, which includes most major server & VM providers (Digital Ocean, Vultr, AWS, etc.):

  • 🔗 Integrate with any SSH-capable server.
  • 🕷️ Deploy scrapers directly from GitHub to your servers.
  • ⏰ Schedule periodic jobs.

For this course, we will look at how to integrate the ScrapeOps server manager and scheduler with a Digital Ocean server. However, there is also a guide to integrating it with an AWS server.

Get $100 Free Credits with Vultr using this link, and $100 Free Credits with Digital Ocean using this link.


Connecting ScrapeOps To Your Server
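The connection itself is set up through the ScrapeOps dashboard, but ScrapeOps talks to your server over SSH, so first make sure you can reach the server yourself. A minimal sketch, assuming a fresh Digital Ocean droplet with root access (YOUR_SERVER_IP is a placeholder for your server's IP address):

## Check the server is reachable over SSH - ScrapeOps connects the same way
ssh root@YOUR_SERVER_IP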


Deploying Code From Github To Server With ScrapeOps
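ScrapeOps pulls your code onto the server straight from GitHub, so your Scrapy project needs to be pushed to a repository first. A sketch of the layout this assumes, with scrapy.cfg at the repo root and a requirements.txt listing the project's dependencies (names are illustrative, based on the project built in this course):

bookscraper/
├── scrapy.cfg
├── requirements.txt        ## e.g. scrapy, scrapeops-scrapy
└── bookscraper/
    ├── settings.py
    ├── items.py
    ├── pipelines.py
    └── spiders/
        └── bookspider.py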


Scheduling & Running Spiders In Cloud With ScrapeOps
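Schedules are configured in the ScrapeOps dashboard, and a periodic job boils down to a cron-style schedule plus the command to run. As a hedged illustration, the crontab equivalent of running the bookspider every day at 6am, assuming the project was deployed to /root/bookscraper on the server:

## Equivalent crontab entry: run the bookspider daily at 06:00 server time
## (assumes scrapy is on the PATH of the cron environment)
0 6 * * * cd /root/bookscraper && scrapy crawl bookspider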


Next Steps

In this part, we looked at how you can use the free ScrapeOps server manager & job scheduler to deploy, schedule and run your Scrapy spiders in the cloud, and how to monitor them with the ScrapeOps Scrapy Monitor.

In Part 12, we will look at how you can deploy, schedule and run your spiders in the cloud with Scrapy Cloud.

All parts of the 12-Part freeCodeCamp Scrapy Beginners Course are as follows: