Scrapy Splash Guide: A JS Rendering Service For Web Scraping
Developed by Zyte (formerly Scrapinghub), the creators of Scrapy, Scrapy Splash is a lightweight browser with an HTTP API that you can use to scrape web pages that render data using JavaScript or AJAX calls.
Although it is a bit outdated, it is the only headless browser that was specifically designed for web scraping, and it has been heavily battle-tested by developers.
In this guide we're going to walk through how to set up and use Scrapy Splash, including:
- How To Install Docker
- Install & Run Scrapy Splash
- Use Scrapy Splash With Our Spiders
- Controlling Scrapy Splash
- Running JavaScript Scripts
- Using Proxies With Scrapy Splash
How To Install Docker
As Scrapy Splash comes in the form of a Docker image, to install and use Scrapy Splash we first need to have Docker installed on our machine. So if you don't have Docker installed already, then use one of the following links to install Docker:
Download the Docker installation package, and follow the instructions. Your computer may need to restart after installation.
After installation, if Docker isn't running then click the Docker Desktop icon to start it. You can check that Docker is working by running this command in your command line:
docker
If it is recognized then you should be good to go.
Install & Run Scrapy Splash
Next we need to get Scrapy Splash up and running.
1. Download Scrapy Splash
First we need to download the Scrapy Splash Docker image, which we can do by running the following command on Windows or macOS:
docker pull scrapinghub/splash
Or on a Linux machine:
sudo docker pull scrapinghub/splash
If everything has worked correctly, when you open your Docker Desktop on the Images tab you should see the scrapinghub/splash image.
2. Run Scrapy Splash
To run Scrapy Splash, we again need to run a command in our command line.
For Windows and macOS:
docker run -it -p 8050:8050 --rm scrapinghub/splash
For Linux:
sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
To check that Splash is running correctly, go to http://localhost:8050/
and you should see the Splash web interface.
If you do, then Scrapy Splash is up and running correctly.
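You can also check this programmatically. Here is a minimal sketch (assuming Splash is listening on the default localhost:8050 address and that you have the requests library installed) that simply confirms the server responds:
# check_splash.py
# Minimal sketch: confirm the Splash server is reachable on localhost:8050.
import requests

response = requests.get('http://localhost:8050')
if response.status_code == 200:
    print("Splash is up and running")
else:
    print(f"Unexpected status code: {response.status_code}")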
Use Scrapy Splash With Our Spiders
When running, Splash provides a simple HTTP server that we can send the URLs we want to scrape to. Splash will then fetch the page, fully render it, and return the rendered page to our spider.
You can send requests directly to it using its API endpoint like this:
curl 'http://localhost:8050/render.html?url=https://quotes.toscrape.com/js/'
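If you prefer Python over curl, here is a minimal sketch of the same request using the requests library (the render.html endpoint and its url and wait parameters are part of Splash's HTTP API; the example URL is the one used throughout this guide):
# splash_render.py
# Minimal sketch: fetch a JavaScript-rendered page directly from Splash's render.html endpoint.
import requests

response = requests.get(
    'http://localhost:8050/render.html',
    params={
        'url': 'https://quotes.toscrape.com/js/',  # page to render
        'wait': 0.5,                               # seconds to wait after the page loads
    },
)
print(response.text[:500])  # first 500 characters of the rendered HTML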
However, when using Scrapy we can use the scrapy-splash downloader integration.
1. Set Up Scrapy Splash Integration
To use Scrapy Splash in our project, we first need to install the scrapy-splash plugin.
pip install scrapy-splash
Then we need to add the required Splash settings to our Scrapy project's settings.py file.
# settings.py

# Splash Server Endpoint (where the Splash Docker container is running)
SPLASH_URL = 'http://localhost:8050'

# Enable Splash downloader middleware and change HttpCompressionMiddleware priority
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable Splash Deduplicate Args Filter
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Define the Splash DupeFilter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
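If your project also uses Scrapy's built-in HTTP cache, the scrapy-splash documentation additionally recommends a Splash-aware cache storage backend (this setting is optional and only matters if HTTPCACHE_ENABLED is turned on):
# settings.py
# Optional: only needed if you use Scrapy's HTTP cache (HTTPCACHE_ENABLED = True)
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'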
2. Use Scrapy Splash In Spiders
To actually use Scrapy Splash in our spiders to render the pages we want to scrape, we need to change the default Request to SplashRequest in our spiders.
# spiders/quotes.py
import scrapy
from demo.items import QuoteItem
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(url, callback=self.parse)

    def parse(self, response):
        quote_item = QuoteItem()
        for quote in response.css('div.quote'):
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
Now all our requests will be made through our Splash server and any JavaScript on the page will be rendered.
Controlling Scrapy Splash
Like other headless browsers you can tell Scrapy Splash to do certain actions before returning the HTML response to your spider.
Splash can:
- Wait for page elements to load
- Scroll the page
- Click on page elements
- Take screenshots
- Turn off images or use Adblock rules to make rendering faster
You can configure Splash to do these actions through a combination of passing it arguments or using Lua Scripts.
We will go over the most common actions here but the Splash documentation has a scripting tutorial on how to write them.
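For the simpler options, such as waiting or disabling images, you only need the args dictionary on a SplashRequest. Here is a minimal sketch (wait, timeout and images are standard Splash rendering arguments; the values shown are only illustrative):
# Sketch: passing common Splash rendering options via SplashRequest args
# (this goes inside your spider's start_requests method)
yield SplashRequest(
    url,
    callback=self.parse,
    args={
        'wait': 1,        # wait 1 second after the page loads
        'timeout': 60,    # give Splash up to 60 seconds to render the page
        'images': 0,      # skip downloading images to speed up rendering
    },
)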
1. Wait For Time
You can tell Splash to wait X number of seconds for updates after the initial page has loaded, to make sure you get all the data you need, by adding a wait argument to your request:
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(url, callback=self.parse, args={'wait': 0.5})

    def parse(self, response):
        ...
2. Wait For Page Element
You can use a Lua script to wait for a specific element to appear on the page before it returns a response.
First, you create a Lua script:
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('div.quote') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
Then add this script as an argument in your Splash request by converting it to a string, and send the request to Splash's execute endpoint by adding endpoint='execute' to the request:
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    while not splash:select('div.quote') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
"""

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='execute',
            args={'wait': 0.5, 'lua_source': lua_script},
        )

    def parse(self, response):
        ...
3. Scrolling The Page
To scroll the page down when dealing with infinite scrolling pages you can use a Lua script like this:
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )

    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
And add it to our SplashRequest like we did above:
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
<Your_Lua_Script>
"""

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='execute',
            args={'wait': 2, 'lua_source': lua_script},
        )

    def parse(self, response):
        ...
4. Click Page Elements
To click on a button or page element you can use a Lua script like this:
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    -- select_all returns a Lua table, which is 1-indexed
    local btn = splash:select_all('button')[1]
    btn:mouse_click()

    splash:wait(splash.args.wait)
    return splash:html()
end
And include it in your Spider as we did before.
# spiders/quotes.py
import scrapy
from scrapy_splash import SplashRequest

lua_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    -- select_all returns a Lua table, which is 1-indexed
    local btn = splash:select_all('button')[1]
    btn:mouse_click()
    splash:wait(splash.args.wait)
    return splash:html()
end
"""

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='execute',
            args={'wait': 2, 'lua_source': lua_script},
        )

    def parse(self, response):
        ...
5. Take Screenshot
You can take a screenshot of the fully rendered page, using Splash's screenshot functionality.
# spiders/quotes_screenshot.py
import scrapy
import base64
from scrapy_splash import SplashRequest

class QuotesSpider(scrapy.Spider):
    name = 'quotes_screenshot'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='render.json',
            args={
                'html': 1,
                'png': 1,
                'width': 1000,
                'render_all': 1,
            },
        )

    def parse(self, response):
        imgdata = base64.b64decode(response.data['png'])
        filename = 'some_image.png'
        with open(filename, 'wb') as f:
            f.write(imgdata)
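Because html: 1 is also passed, the rendered HTML comes back in the same render.json response, so you can parse the page in the same callback that saves the screenshot. As a rough sketch (when the html key is present, scrapy-splash also exposes the rendered HTML through the normal selectors):
# Sketch: inside the same parse callback, after saving the screenshot
# The raw rendered HTML is available in the JSON payload returned by render.json
html = response.data['html']

# scrapy-splash exposes it through the usual selectors as well
for quote in response.css('div.quote'):
    yield {'text': quote.css('span.text::text').get()}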
Running JavaScript Scripts
You can configure Splash to run JavaScript code on the rendered page by using the js_source parameter.
For example, this JavaScript snippet would modify the page title:
javascript_script = """
document.title = "My Title";
"""

yield SplashRequest(
    url,
    callback=self.parse,
    endpoint='render.html',
    args={
        'js_source': javascript_script,
    },
)
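Put in the context of a full spider, a minimal sketch might look like the following (the spider name, URL, and title check are only illustrative; the snippet runs in the page context, so the HTML returned to the spider reflects its changes):
# spiders/quotes_js.py
# Sketch: running a custom JavaScript snippet on the rendered page.
import scrapy
from scrapy_splash import SplashRequest

javascript_script = """
document.title = "My Title";
"""

class QuotesJsSpider(scrapy.Spider):
    name = 'quotes_js'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield SplashRequest(
            url,
            callback=self.parse,
            endpoint='render.html',
            args={
                'wait': 0.5,
                'js_source': javascript_script,
            },
        )

    def parse(self, response):
        # The returned HTML reflects the changes made by the JavaScript snippet
        self.logger.info('Page title: %s', response.css('title::text').get())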
Using Proxies With Scrapy Splash
If you need to use proxies when scraping you can configure Splash to use your proxy:
lua_script = """
function main(splash)
    -- Route every request made while rendering the page through the proxy
    splash:on_request(function(request)
        request:set_proxy{
            host = "us-ny.proxymesh.com",
            port = 31280,
            username = "username",
            password = "secretpass",
        }
    end)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    return splash:html()
end
"""
yield SplashRequest(
    url,
    callback=self.parse,
    endpoint='execute',
    args={
        'lua_source': lua_script,
    },
)
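Alternatively, if you don't need per-request logic in Lua, Splash also accepts a proxy argument directly on the request. A sketch of that simpler approach (proxy_ip and proxy_port are placeholders for your own proxy details) would be:
# Sketch: passing the proxy directly as a Splash argument instead of via a Lua script
yield SplashRequest(
    url,
    callback=self.parse,
    args={
        'wait': 0.5,
        'proxy': 'http://username:secretpass@proxy_ip:proxy_port',
    },
)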
More Scrapy Tutorials
In this guide we've introduced you to the fundamental functionality of Scrapy Splash and how to use it in your own projects.
However, if you would like to learn more about Scrapy Splash then check out the official documentation here.
If you would like to learn more about different JavaScript rendering options for Scrapy, then be sure to check out our other guides.
If you would like to learn more about Scrapy in general, then be sure to check out The Scrapy Playbook.