Scrapy Splash Guide: A JS Rendering Service For Web Scraping
Developed by Zyte (formerly Scrapinghub), the creators of Scrapy, Scrapy Splash is a lightweight browser with an HTTP API that you can use to scrape web pages that render data using JavaScript or AJAX calls.
Although it is a bit outdated, it is the only headless browser that was specifically designed for web scraping, and it has been heavily battle-tested by developers.
In this guide we're going to walk through how to set up and use Scrapy Splash, including:
- How To Install Docker
- Install & Run Scrapy Splash
- Use Scrapy Splash With Our Spiders
- Controlling Scrapy Splash
- Running JavaScript Scripts
- Using Proxies With Scrapy Splash
How To Install Docker
As Scrapy Splash comes in the form of a Docker image, to install and use Scrapy Splash we first need to have Docker installed on our machine. So if you don't have Docker installed already, then use one of the following links to install Docker:
Download the Docker installation package, and follow the instructions. Your computer may need to restart after installation.
After installation, if Docker isn't running then click the Docker Desktop icon to start it. You can check that Docker is working by running this command in your command line:
docker
If it is recognized then you should be good to go.
Install & Run Scrapy Splash
Next we need to get Scrapy Splash up and running.
1. Download Scrapy Splash
First we need to download the Scrapy Splash Docker image, which we can do by running the following command on Windows or macOS:
docker pull scrapinghub/splash
Or on a Linux machine:
sudo docker pull scrapinghub/splash
If everything has worked correctly, when you open your Docker Desktop on the Images tab you should see the scrapinghub/splash image.
2. Run Scrapy Splash
To run Scrapy Splash, we again need to run a command in our command line.
For Windows and macOS:
docker run -it -p 8050:8050 --rm scrapinghub/splash
For Linux:
sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
To check that Splash is running correctly, go to http://localhost:8050/
and you should see the Splash web interface.
If you do, then Scrapy Splash is up and running correctly.
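You can also check this programmatically. Here is a minimal sketch (assuming Splash is listening on the default localhost:8050 address and that you have the requests library installed) that simply confirms the server responds:
# check_splash.py
# Minimal sketch: confirm the Splash server is reachable on localhost:8050.
import requests

response = requests.get('http://localhost:8050')
if response.status_code == 200:
    print("Splash is up and running")
else:
    print(f"Unexpected status code: {response.status_code}")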
Use Scrapy Splash With Our Spiders
When running, Splash provides a simple HTTP server that we can send the URLs we want to scrape to. Splash will then fetch the page, fully render it, and return the rendered page to our spider.
You can send requests directly to it using its API endpoint like this:
curl 'http://localhost:8050/render.html?url=https://quotes.toscrape.com/js/'
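If you prefer Python over curl, here is a minimal sketch of the same request using the requests library (the render.html endpoint and its url and wait parameters are part of Splash's HTTP API; the example URL is the one used throughout this guide):
# splash_render.py
# Minimal sketch: fetch a JavaScript-rendered page directly from Splash's render.html endpoint.
import requests

response = requests.get(
    'http://localhost:8050/render.html',
    params={
        'url': 'https://quotes.toscrape.com/js/',  # page to render
        'wait': 0.5,                               # seconds to wait after the page loads
    },
)
print(response.text[:500])  # first 500 characters of the rendered HTML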
However, when using Scrapy we can use the scrapy-splash downloader integration.
1. Set Up Scrapy Splash Integration
To use Scrapy Splash in our project, we first need to install the scrapy-splash plugin.
pip install scrapy-splash
Then we need to add the required Splash settings to our Scrapy project's settings.py file.
# settings.py

# Splash Server Endpoint (where the Splash Docker container is running)
SPLASH_URL = 'http://localhost:8050'

# Enable Splash downloader middleware and change HttpCompressionMiddleware priority
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable Splash Deduplicate Args Filter
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Define the Splash DupeFilter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
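If your project also uses Scrapy's built-in HTTP cache, the scrapy-splash documentation additionally recommends a Splash-aware cache storage backend (this setting is optional and only matters if HTTPCACHE_ENABLED is turned on):
# settings.py
# Optional: only needed if you use Scrapy's HTTP cache (HTTPCACHE_ENABLED = True)
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'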
2. Use Scrapy Splash In Spiders
To actually use Scrapy Splash in our spiders to render the pages we want to scrape, we need to change the default Request to SplashRequest in our spiders.
# spiders/quotes.py
import scrapy
from demo.items import QuoteItem
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(url, callback=self.parse)

    def parse(self, response):
        quote_item = QuoteItem()
        for quote in response.css('div.quote'):
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
Now all our requests will be made through our Splash server and any JavaScript on the page will be rendered.
Controlling Scrapy Splash
Like other headless browsers you can tell Scrapy Splash to do certain actions before returning the HTML response to your spider.
Splash can:
- Wait for page elements to load
- Scroll the page
- Click on page elements
- Take screenshots
- Turn off images or use Adblock rules to make rendering faster
You can configure Splash to do these actions through a combination of passing it arguments or using Lua Scripts.
We will go over the most common actions here but the Splash documentation has a scripting tutorial on how to write them.
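For the simpler options, such as waiting or disabling images, you only need the args dictionary on a SplashRequest. Here is a minimal sketch (wait, timeout and images are standard Splash rendering arguments; the values shown are only illustrative):
# Sketch: passing common Splash rendering options via SplashRequest args
# (this goes inside your spider's start_requests method)
yield SplashRequest(
    url,
    callback=self.parse,
    args={
        'wait': 1,        # wait 1 second after the page loads
        'timeout': 60,    # give Splash up to 60 seconds to render the page
        'images': 0,      # skip downloading images to speed up rendering
    },
)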
1. Wait For Time
You can tell Splash to wait X number of seconds for updates after the initial page has loaded, to make sure you get all the data you need, by adding a wait argument to your request:
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        ...
2. Wait For Page Element
You can use a Lua script to wait for a specific element to appear on the page before it returns a response.
First, you create a Lua script:
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('div.quote') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
Then add this script as an argument in your Splash request by converting it to a string, and send the request to Splash's execute endpoint by adding endpoint='execute' to the request:
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('div.quote') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
"""

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='execute',
            args={'wait': 0.5, 'lua_source': lua_script},
        )

    def parse(self, response):
        ...
3. Scrolling The Page
To scroll the page down when dealing with infinite scrolling pages you can use a Lua script like this:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )

    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
And add it to our SplashRequest like we did above:
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
<Your_Lua_Script>
"""

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='execute',
            args={'wait': 2, 'lua_source': lua_script},
        )

    def parse(self, response):
        ...
4. Click Page Elements
To click on a button or page element you can use a Lua script like this:
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    -- select_all returns a Lua table, which is 1-indexed
    local btn = splash:select_all('button')[1]
    btn:mouse_click()

    splash:wait(splash.args.wait)
    return splash:html()
end
And include it in your Spider as we did before.
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    -- select_all returns a Lua table, which is 1-indexed
    local btn = splash:select_all('button')[1]
    btn:mouse_click()
    splash:wait(splash.args.wait)
    return splash:html()
end
"""

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='execute',
            args={'wait': 2, 'lua_source': lua_script},
        )

    def parse(self, response):
        ...
5. Take Screenshot
You can take a screenshot of the fully rendered page, using Splash's screenshot functionality.
# spiders/quotes_screenshot.py
import scrapy
import base64
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes_screenshot'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='render.json',
            args={
                'html': 1,
                'png': 1,
                'width': 1000,
                'render_all': 1,
            },
        )

    def parse(self, response):
        imgdata = base64.b64decode(response.data['png'])
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
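Because html: 1 is also passed, the rendered HTML comes back in the same render.json response, so you can parse the page in the same callback that saves the screenshot. As a rough sketch (when the html key is present, scrapy-splash also exposes the rendered HTML through the normal selectors):
# Sketch: inside the same parse callback, after saving the screenshot
# The raw rendered HTML is available in the JSON payload returned by render.json
html = response.data['html']

# scrapy-splash exposes it through the usual selectors as well
for quote in response.css('div.quote'):
    yield {'text': quote.css('span.text::text').get()}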
Running JavaScript Scripts
You can configure Splash to run JavaScript code on the rendered page by using the js_source parameter.
For example, this JavaScript snippet would modify the page title:
javascript_script = """
document.title = "My Title";
"""

yield SplashRequest(
    url,
    callback=self.parse,
    endpoint='render.html',
    args={
        'js_source': javascript_script,
    },
)
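Put in the context of a full spider, a minimal sketch might look like the following (the spider name, URL, and title check are only illustrative; the snippet runs in the page context, so the HTML returned to the spider reflects its changes):
# spiders/quotes_js.py
# Sketch: running a custom JavaScript snippet on the rendered page.
import scrapy
from scrapy_splash import SplashRequest

javascript_script = """
document.title = "My Title";
"""

class QuotesJsSpider(scrapy.Spider):
    name = 'quotes_js'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='render.html',
            args={
                'wait': 0.5,
                'js_source': javascript_script,
            },
        )

    def parse(self, response):
        # The returned HTML reflects the changes made by the JavaScript snippet
        self.logger.info('Page title: %s', response.css('title::text').get())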
Using Proxies With Scrapy Splash
If you need to use proxies when scraping you can configure Splash to use your proxy:
lua_script = """
function main(splash)
    -- Route every request made while rendering the page through the proxy
    splash:on_request(function(request)
        request:set_proxy{
            host = "us-ny.proxymesh.com",
            port = 31280,
            username = "username",
            password = "secretpass",
        }
    end)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    return splash:html()
end
"""
yield SplashRequest(
    url,
    callback=self.parse,
    endpoint='execute',
    args={
        'lua_source': lua_script,
    },
)
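Alternatively, if you don't need per-request logic in Lua, Splash also accepts a proxy argument directly on the request. A sketch of that simpler approach (proxy_ip and proxy_port are placeholders for your own proxy details) would be:
# Sketch: passing the proxy directly as a Splash argument instead of via a Lua script
yield SplashRequest(
    url,
    callback=self.parse,
    args={
        'wait': 0.5,
        'proxy': 'http://username:secretpass@proxy_ip:proxy_port',
    },
)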
More Scrapy Tutorials
In this guide we've introduced you to the fundamental functionality of Scrapy Splash and how to use it in your own projects.
However, if you would like to learn more about Scrapy Splash then check out the official documentation here.
If you would like to learn more about different JavaScript rendering options for Scrapy, then be sure to check out our other guides.
If you would like to learn more about Scrapy in general, then be sure to check out The Scrapy Playbook.