Scrapy User Agents: How to Manage User Agents When Scraping

Once you've learned the basics of web scraping (how to send requests, crawl websites and parse data from the page), one of the main challenges you will face is keeping your requests from getting blocked.

The two main ways we can achieve this are using proxies and managing the user-agents we send to the website we are scraping.

In this guide, we will go through:

What Are User-Agents & Why Do We Need To Manage Them?
How To Set A User Agent In Scrapy
How To Rotate User Agents
How To Manage Thousands of User Agents

First, let's quickly go over some of the very basics.

What Are User-Agents & Why Do We Need To Manage Them?

User Agents are strings that let the website you are scraping identify the application, operating system (OSX/Windows/Linux), browser (Chrome/Firefox/Internet Explorer), etc. of the user sending a request to their website. They are sent to the server as part of the request headers.

Here is an example User agent sent when you visit a website with a Chrome browser:

user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36
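
To make the "part of the request headers" point concrete, here is a minimal sketch that uses only Python's standard library rather than Scrapy. It builds a request object without sending it and reads back the user-agent header the server would receive. The URL is a placeholder chosen for illustration.

```python
from urllib.request import Request

CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"
)

# Build the request only; nothing is sent over the network here.
request = Request("https://example.com/", headers={"User-Agent": CHROME_UA})

# urllib stores header names capitalised as "User-agent" internally.
sent_user_agent = request.get_header("User-agent")
print(sent_user_agent)
```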

When scraping a website, you also need to set user-agents on every request as otherwise the website may block your requests because it knows you aren't a real user.

When you use Scrapy with the default settings, the user-agent your spider sends is the following:

Scrapy/VERSION (+https://scrapy.org)

This user agent will clearly identify your requests as coming from a web scraper, so the website can easily block you from scraping the site.
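
To see why that default is so easy to block, here is a deliberately simplified sketch of the kind of check a website could run on incoming requests. The function name and the list of markers are assumptions made up for this example, and real anti-bot systems typically look at many more signals than the user-agent alone.

```python
# Hypothetical substrings that reveal common scraping and HTTP tools.
BOT_MARKERS = ("scrapy", "python-requests", "curl")


def looks_like_scraper(user_agent):
    """Return True when the user-agent is missing or names a known tool."""
    ua = (user_agent or "").lower()
    return not ua or any(marker in ua for marker in BOT_MARKERS)


default_scrapy_ua = "Scrapy/VERSION (+https://scrapy.org)"
chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"
)

print(looks_like_scraper(default_scrapy_ua))
print(looks_like_scraper(chrome_ua))
```

A check this simple is enough to flag the default Scrapy user-agent, while the browser string passes through.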

That is why we need to manage the user-agents Scrapy sends with our requests.

How To Set A User-Agent In Scrapy

There are a couple of ways to set a new user agent for your spiders to use.

1. Set New Default User-Agent

The easiest way to change the default Scrapy user-agent is to set a default user-agent in your settings.py file.

Simply uncomment the USER_AGENT value in the settings.py file and add a new user agent:

## settings.py

USER_AGENT = 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'

2. Add A User-Agent To Every Request

Another option is to set a user-agent on every request your spider makes by defining a user-agent in the headers of your request:

## myspider.py

from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # yield (not return) so every URL in start_urls gets requested
        yield Request(url=url, callback=self.parse,
                      headers={"User-Agent": "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"})

Both of these options work. However, you will send the same user-agent with every single request, which the target website might pick up on and block you for. That is why we need to keep a list of user-agents and select a random one for every request.

How To Rotate User Agents

Rotating through user-agents is also pretty straightforward. We need a list of user-agents in our spider, and we use a random one with every request we make, following a similar approach to option #2 above.

## myspider.py

import random

from scrapy import Request

user_agent_list = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1',
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
]

def start_requests(self):
    for url in self.start_urls:
        # yield (not return) so every URL is requested, each with a random user-agent
        yield Request(url=url, callback=self.parse,
                      headers={"User-Agent": random.choice(user_agent_list)})

This works, but it has two drawbacks:

We need to manage a list of user-agents ourselves.
We would need to implement this into every spider, which isn't ideal.
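
The second drawback can be softened a little by keeping the list and the selection logic in one shared module that every spider imports. The sketch below assumes a hypothetical helper called random_user_agent_headers; the name and the module layout are illustrative and not part of Scrapy.

```python
import random

# Shared list, maintained in one place instead of in every spider.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_4_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75",
]


def random_user_agent_headers(user_agents=USER_AGENTS):
    """Return a headers dict carrying one randomly chosen user-agent."""
    return {"User-Agent": random.choice(user_agents)}


headers = random_user_agent_headers()
print(headers["User-Agent"])
```

A spider would then pass headers=random_user_agent_headers() when it creates each Request. We still have to maintain the list ourselves, so the first drawback remains.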

A better approach would be to use a Scrapy middleware to manage our user agents for us.

How To Manage Thousands of User Agents

The best approach to managing user-agents in Scrapy is to build or use a custom Scrapy middleware that manages the user agents for you.

You could build a custom middleware yourself if your project has specific requirements, such as needing to use specific user-agents with specific sites. However, in most cases an off-the-shelf user-agent middleware is enough.
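
If you do go the custom route, a downloader middleware can be quite small. The following is a minimal sketch, not a production implementation: the class name and the USER_AGENT_LIST and USER_AGENT_BY_DOMAIN setting names are hypothetical and chosen for this example. It relies on Scrapy's standard from_crawler and process_request middleware hooks, picks a random user-agent per request, and lets specific domains use a fixed one.

```python
import random
from urllib.parse import urlparse


class RotatingUserAgentMiddleware:
    """Sketch of a downloader middleware that sets a user-agent per request."""

    def __init__(self, user_agents, per_domain=None):
        if not user_agents:
            raise ValueError("user_agents must not be empty")
        self.user_agents = list(user_agents)
        # Optional mapping of hostname -> fixed user-agent for that site.
        self.per_domain = dict(per_domain or {})

    @classmethod
    def from_crawler(cls, crawler):
        # Both setting names below are assumptions for this example.
        return cls(
            user_agents=crawler.settings.getlist("USER_AGENT_LIST"),
            per_domain=crawler.settings.getdict("USER_AGENT_BY_DOMAIN"),
        )

    def process_request(self, request, spider):
        host = urlparse(request.url).hostname or ""
        # Prefer a site-specific user-agent, otherwise pick one at random.
        user_agent = self.per_domain.get(host) or random.choice(self.user_agents)
        request.headers["User-Agent"] = user_agent
        return None  # returning None lets Scrapy keep processing the request
```

To use something like this, you would list the class in DOWNLOADER_MIDDLEWARES under your own project path and define the two settings in settings.py.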

Developers have open sourced a number of user-agent middlewares for Scrapy. For this guide we will use scrapy-fake-useragent, as it is one of the best available.

Scrapy-Fake-Useragent

Getting scrapy-fake-useragent set up is simple. First, install the Python package:

pip install scrapy-fake-useragent

Then, in your settings.py file, you need to turn off the built-in UserAgentMiddleware and RetryMiddleware, and enable scrapy-fake-useragent's RandomUserAgentMiddleware and RetryUserAgentMiddleware.

In Scrapy >=1.0:

## settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

In Scrapy <1.0:

## settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}

And then enable the Fake User-Agent Providers by adding them to your settings.py file.

## settings.py

FAKEUSERAGENT_PROVIDERS = [
    'scrapy_fake_useragent.providers.FakeUserAgentProvider',  # This is the first provider we'll try
    'scrapy_fake_useragent.providers.FakerProvider',  # If FakeUserAgentProvider fails, we'll use faker to generate a user-agent string for us
    'scrapy_fake_useragent.providers.FixedUserAgentProvider',  # Fall back to USER_AGENT value
]

## Set Fallback User-Agent
USER_AGENT = '<your user agent string which you will fall back to if all other providers fail>'

When activated, scrapy-fake-useragent will download a list of the most common user-agents from useragentstring.com and use a random one with every request, so you don't need to create your own list.

You can also add your own user-agent string providers, or configure it to generate new user-agent strings as a backup using Faker.
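
The provider list is tried in order, with each entry acting as a fallback for the one before it. The sketch below only illustrates that general idea in plain Python; it is not scrapy-fake-useragent's internal code, and the function and provider names are hypothetical.

```python
def first_working_user_agent(providers, fallback):
    """Try each provider callable in order and return the first usable result."""
    for provider in providers:
        try:
            user_agent = provider()
        except Exception:
            continue  # this provider failed, move on to the next one
        if user_agent:
            return user_agent
    return fallback  # every provider failed or returned nothing


def broken_provider():
    # Stands in for a provider whose data source is unavailable.
    raise RuntimeError("source unavailable")


def static_provider():
    return "Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148"


chosen = first_working_user_agent([broken_provider, static_provider], fallback="fallback-agent")
print(chosen)
```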

To see all the configuration options, check out the scrapy-fake-useragent documentation.
