Scrapy Playwright Guide: Render & Scrape JS Heavy Websites
Released by Microsoft in 2020, Playwright is quickly becoming the most popular headless browser library for browser automation and web scraping thanks to its cross-browser support (it can drive Chromium, WebKit, and Firefox, whilst Puppeteer only drives Chromium) and its developer experience improvements over Puppeteer.
So it is great to see that a number of the core Scrapy maintainers developed a Playwright integration for Scrapy: scrapy-playwright.
Scrapy Playwright is one of the best headless browser options you can use with Scrapy, so in this guide we will go through:
- How To Install Scrapy Playwright
- How To Use Scrapy Playwright In Your Spiders
- How To Click On Page Elements With Scrapy Playwright
- How To Scroll The Page Elements With Scrapy Playwright
- Using Proxies With Scrapy Playwright
Note: As of writing this guide, Scrapy Playwright doesn't work with Windows. However, it is possible to run it with WSL (Windows Subsystem for Linux).
How To Install Scrapy Playwright
Installing scrapy-playwright into your Scrapy projects is very straightforward.
First, you need to install scrapy-playwright itself:
pip install scrapy-playwright
Then, if you haven't already installed Playwright itself, you will need to install it using the following command in your command line:
playwright install
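By default, playwright install downloads the browser binaries for all three browsers Playwright supports (Chromium, Firefox, and WebKit). If you only plan on scraping with one of them, you can limit the install to that browser:
playwright install chromium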
Next, we will need to update our Scrapy project's settings to activate scrapy-playwright in the project:
# settings.py
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
The ScrapyPlaywrightDownloadHandler class inherits from Scrapy's default http/https handler, so unless you explicitly activate scrapy-playwright in a Scrapy Request, that request will be processed by the regular Scrapy download handler.
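Optionally, you can also control which browser scrapy-playwright launches, and how it is launched, with its PLAYWRIGHT_BROWSER_TYPE and PLAYWRIGHT_LAUNCH_OPTIONS settings. The values below are just illustrative:
# settings.py (optional)
PLAYWRIGHT_BROWSER_TYPE = "chromium"  # can also be "firefox" or "webkit"
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,      # run the browser without a visible window
    "timeout": 20 * 1000,  # launch timeout in milliseconds
}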
How To Use Scrapy Playwright In Your Spiders
Now, let's integrate scrapy-playwright into a Scrapy spider so all our requests will be JS rendered.
To route our requests through scrapy-playwright we just need to enable it in the Request meta dictionary by setting meta={'playwright': True}.
# spiders/quotes.py
import scrapy
from scrapy_playwright_demo.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        # playwright=True routes this request through the browser
        yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()  # create a fresh item for each quote
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item
The response will now contain the rendered page as seen by the browser. However, sometimes Playwright will stop rendering before the entire page has loaded, which we can solve using Playwright PageMethods.
Interacting With The Page Using Playwright PageMethods
To interact with the page using scrapy-playwright we need to use the PageMethod class. PageMethods allow us to do a lot of different things on the page, including:
- Wait for elements to load before returning response
- Scrolling the page
- Clicking on page elements
- Taking a screenshot of the page
- Creating PDFs of the page
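Under the hood, each PageMethod simply names a method on Playwright's Page object, followed by the positional and keyword arguments to call it with; scrapy-playwright runs them in order before returning the response. As a quick sketch:
from scrapy_playwright.page import PageMethod

page_methods = [
    # runs as: await page.wait_for_selector("div.quote")
    PageMethod("wait_for_selector", "div.quote"),
    # runs as: await page.screenshot(path="quotes.png", full_page=True)
    PageMethod("screenshot", path="quotes.png", full_page=True),
]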
First, to use the PageMethod functionality in your spider you will need to set playwright_include_page equal to True so we can access the Playwright Page object, and also define any callbacks (e.g. def parse) as coroutine functions (async def) in order to await the provided Page object.
# spiders/quotes.py
import scrapy
from scrapy_playwright_demo.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield scrapy.Request(url, meta=dict(
            playwright=True,
            playwright_include_page=True,
        ))

    async def parse(self, response):
        ...
Note: When setting 'playwright_include_page': True it is also recommended that you set a Request errback (passed as an argument to scrapy.Request, not inside meta) to make sure pages are closed even if a request fails. When playwright_include_page is False or unset, pages are automatically closed when an exception is raised.
# spiders/quotes.py
import scrapy
from scrapy_playwright_demo.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = 'https://quotes.toscrape.com/js/'
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
            ),
            errback=self.errback,  # ensure the page is closed on failure
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()  # close the page once we have the response
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
1. Waiting For Page Elements
To wait for a specific page element before stopping the JavaScript rendering and returning a response to our scraper, we just need to add a PageMethod to the playwright_page_methods key in our Playwright settings and define a wait_for_selector.
Now, when we run the spider, scrapy-playwright will render the page until a div with the class quote appears on the page.
# spiders/quotes.py
import scrapy
from scrapy_playwright_demo.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                # keep rendering until a div with the class "quote" appears
                playwright_page_methods=[PageMethod('wait_for_selector', 'div.quote')],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
2. Scroll Down Infinite Scroll Pages
We can also configure scrapy-playwright to scroll down a page when a website uses an infinite scroll to load in data.
In this example, Playwright will wait for div.quote to appear before scrolling down the page until it reaches the 10th quote.
# spiders/quotes.py
import scrapy
from scrapy_playwright_demo.items import QuoteItem
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/scroll/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.quote"),
                    # scroll to the bottom to trigger the next batch of quotes
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_selector", "div.quote:nth-child(11)"),  # 10 per page
                ],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()
        for quote in response.css('div.quote'):
            quote_item = QuoteItem()
            quote_item['text'] = quote.css('span.text::text').get()
            quote_item['author'] = quote.css('small.author::text').get()
            quote_item['tags'] = quote.css('div.tags a.tag::text').getall()
            yield quote_item

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
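If you don't know in advance how many items will be loaded, an alternative approach is to drive the scrolling yourself from the Playwright Page object. Below is a rough sketch (not part of the original spider) of a drop-in replacement for the parse callback above, plus one extra import; it keeps scrolling until the page height stops growing and assumes the request was made with playwright_include_page=True:
from scrapy.selector import Selector

async def parse(self, response):
    page = response.meta["playwright_page"]
    previous_height = None
    while True:
        current_height = await page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            break  # no new content was loaded by the last scroll
        previous_height = current_height
        await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
        await page.wait_for_timeout(1000)  # give the new quotes time to load
    html = await page.content()  # the fully scrolled page
    await page.close()
    for quote in Selector(text=html).css("div.quote"):
        yield {"text": quote.css("span.text::text").get()}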
3. Take Screenshot Of Page
Taking screenshots of the page is simple too. Here we wait for Playwright to see the selector div.quote, then take a screenshot of the page.
# spiders/quotes.py
import scrapy
from scrapy_playwright.page import PageMethod

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def start_requests(self):
        url = "https://quotes.toscrape.com/js/"
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod("wait_for_selector", "div.quote"),
                ],
            ),
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        screenshot = await page.screenshot(path="example.png", full_page=True)
        # screenshot contains the image's bytes
        await page.close()

    async def errback(self, failure):
        page = failure.request.meta["playwright_page"]
        await page.close()
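Creating a PDF of the page (the last item on the PageMethods list above) works in much the same way, although Playwright's page.pdf() is only supported when running Chromium. A minimal sketch of a parse callback:
async def parse(self, response):
    page = response.meta["playwright_page"]
    pdf = await page.pdf(path="example.pdf")  # only supported in Chromium
    # pdf contains the PDF document's bytes
    await page.close()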
Using Proxies With Scrapy Playwright
In Scrapy Playwright, proxies can be configured at the Browser level by specifying the proxy
key in the PLAYWRIGHT_LAUNCH_OPTIONS
setting:
# spiders/quotes.py
from scrapy import Spider, Request

class ProxySpider(Spider):
    name = "proxy"
    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://myproxy.com:3128",
                "username": "user",
                "password": "pass",
            },
        }
    }

    def start_requests(self):
        yield Request("http://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print(response.text)
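According to the scrapy-playwright documentation, proxies can also be set per browser context with the PLAYWRIGHT_CONTEXTS setting, letting different requests use different proxies by naming a context in the playwright_context meta key. A sketch with placeholder proxy details:
# settings.py
PLAYWRIGHT_CONTEXTS = {
    "default": {
        "proxy": {
            "server": "http://default-proxy.com:3128",
            "username": "user1",
            "password": "pass1",
        },
    },
    "alternative": {
        "proxy": {
            "server": "http://alternative-proxy.com:3128",
            "username": "user2",
            "password": "pass2",
        },
    },
}

# in the spider: route this request through the "alternative" context
# yield Request(url, meta={"playwright": True, "playwright_context": "alternative"})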
More Functionality
Scrapy Playwright has a huge amount of functionality and is highly customisable, so much so that it is hard to cover everything properly in a single guide.
So if you would like to learn more about Scrapy Playwright then check out the official documentation here.
More Scrapy Tutorials
In this guide we've introduced you to the fundamental functionality of Scrapy Playwright and how to use it in your own projects.
If you would like to learn more about different JavaScript rendering options for Scrapy, then be sure to check out our other guides:
If you would like to learn more about Scrapy in general, then be sure to check out The Scrapy Playbook.