Python Scrapy: Build A LinkedIn Company Profile Scraper [2023]
In this guide for our "How To Scrape X With Python Scrapy" series, we're going to look at how to build a Python Scrapy spider that will scrape LinkedIn.com public company profiles.
LinkedIn is one of the most up-to-date and extensive sources of professional and company profiles on the internet. As a result, it is a very popular web scraping target for recruiting, HR and lead generation companies.
In this article we will focus on building a production LinkedIn spider using Python Scrapy that will scrape LinkedIn Public Company Profiles.
In this guide we will go through:
- How To Build a LinkedIn Company Scraper
- Bypassing LinkedIn's Anti-Bot Protection
- Monitoring Our LinkedIn Company Scraper
- Scheduling & Running Our Scraper In The Cloud
GitHub Code
The full code for this LinkedIn Company Spider is available on GitHub here.
If you prefer to follow along with a video then check out the video tutorial version here:
How To Build a LinkedIn Company Profile Scraper
Scraping LinkedIn Company Profiles is pretty straightforward, once you have the HTML response.
We just need a list of LinkedIn company profile URLs, and then we send requests to LinkedIn to get the data from those profiles. Let's check out what a company profile page looks like by going to the link below:
'https://www.linkedin.com/company/usebraintrust/'
It should look something like this:
We just need to create a Scrapy spider that will parse the profile data from the page.
The following is a simple Scrapy spider that will request the company profile page for every URL in the company_pages list, and then parse the company profile from the response.
import scrapy


class LinkedCompanySpider(scrapy.Spider):
    name = "linkedin_company_profile"

    # add your own list of company urls here
    company_pages = [
        'https://www.linkedin.com/company/usebraintrust?trk=public_jobs_jserp-result_job-search-card-subtitle',
        'https://www.linkedin.com/company/centraprise?trk=public_jobs_jserp-result_job-search-card-subtitle'
    ]

    def start_requests(self):
        # request the first company page; each response then queues the next one
        company_index_tracker = 0
        first_url = self.company_pages[company_index_tracker]
        yield scrapy.Request(url=first_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker})

    def parse_response(self, response):
        company_index_tracker = response.meta['company_index_tracker']
        print('***************')
        print('****** Scraping page ' + str(company_index_tracker+1) + ' of ' + str(len(self.company_pages)))
        print('***************')

        company_item = {}
        company_item['name'] = response.css('.top-card-layout__entity-info h1::text').get(default='not-found').strip()
        company_item['summary'] = response.css('.top-card-layout__entity-info h4 span::text').get(default='not-found').strip()

        try:
            ## all company details
            company_details = response.css('.core-section-container__content .mb-2')

            # industry line
            company_industry_line = company_details[1].css('.text-md::text').getall()
            company_item['industry'] = company_industry_line[1].strip()

            # company size line
            company_size_line = company_details[2].css('.text-md::text').getall()
            company_item['size'] = company_size_line[1].strip()

            # company founded line
            company_founded_line = company_details[5].css('.text-md::text').getall()
            company_item['founded'] = company_founded_line[1].strip()
        except IndexError:
            # the item is still yielded below, just without the missing fields
            print("Error: Some company details missing - yielding partial item")

        yield company_item

        # move on to the next company page, if there is one
        company_index_tracker = company_index_tracker + 1
        if company_index_tracker <= (len(self.company_pages)-1):
            next_url = self.company_pages[company_index_tracker]
            yield scrapy.Request(url=next_url, callback=self.parse_response, meta={'company_index_tracker': company_index_tracker})
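The example URLs in company_pages carry a ?trk=... tracking parameter, so the same company can show up under several different URLs. Below is a minimal sketch of one way to tidy a URL list before pasting it into the spider. It assumes the query string is not needed to load the public page, and clean_company_urls is a hypothetical helper name, not part of the original spider:

```python
from urllib.parse import urlsplit, urlunsplit


def clean_company_urls(urls):
    """Strip query strings/fragments and drop duplicate company URLs, keeping order."""
    seen = set()
    cleaned = []
    for url in urls:
        parts = urlsplit(url.strip())
        # Keep scheme, host and path; discard ?trk=... and any #fragment.
        bare = urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip('/'), '', ''))
        if bare not in seen:
            seen.add(bare)
            cleaned.append(bare)
    return cleaned


company_pages = clean_company_urls([
    'https://www.linkedin.com/company/usebraintrust?trk=public_jobs_jserp-result_job-search-card-subtitle',
    'https://www.linkedin.com/company/usebraintrust/',
    'https://www.linkedin.com/company/centraprise?trk=public_jobs_jserp-result_job-search-card-subtitle',
])
print(company_pages)
```

The first two inputs collapse into one entry, so the spider would request each company only once.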
Now when we run our scraper:
scrapy crawl linkedin_company_profile
The output of this code will look like this:
{
    'name': 'Braintrust',
    'summary': "Braintrust is the first decentralized Web3 talent network that connects tech freelancers with the world's leading brands",
    'industry': 'Software Development',
    'size': '11-50 employees',
    'founded': '2018'
}
This spider scrapes the following data from the LinkedIn profile page:
- Name
- Summary
- Industry
- Company size
- Year Founded
You can expand this spider to scrape other details by simply using the response.css function to get more data from the page in the parse_response method.
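One thing to watch when adding fields: the spider wraps all of its detail lookups in a single try block, so one missing row stops every later field from being filled. Here is a minimal sketch of a per-field fallback instead. pick_detail is a hypothetical helper, and the plain lists below stand in for what .css('.text-md::text').getall() would return on a real page:

```python
def pick_detail(lines, index=1, default='not-found'):
    """Return the stripped text at `index`, or `default` if the row is too short."""
    try:
        return lines[index].strip()
    except IndexError:
        return default


# Plain lists standing in for the output of .getall() on each detail row.
industry_line = ['\n', '  Software Development \n']
size_line = ['\n']  # value missing on this hypothetical page

company_item = {
    'industry': pick_detail(industry_line),
    'size': pick_detail(size_line),
}
print(company_item)
```

With this approach a missing company size no longer prevents the industry (or any later field) from being captured. The company_details[n] lookups would need a similar guard, since indexing past the end of the selector list also raises IndexError.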
Bypassing LinkedIn's Anti-Bot Protection
LinkedIn has one of the most aggressive anti-scraping systems on the internet, making it very hard to scrape.
It uses a combination of IP address, headers, browser & TCP fingerprinting to detect scrapers and block them.
As you might have seen already, if you run the above code LinkedIn is likely blocking your requests and returning their login page like this:
Public LinkedIn Company Profiles
This Scrapy spider is only designed to scrape public LinkedIn company profiles that don't require you to log in to view them. Scraping behind LinkedIn's login is significantly harder and exposes you to much higher legal risk.
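When LinkedIn returns its login page instead of the profile, the spider above will quietly yield items full of 'not-found' values. A rough sketch of a check you could run on the final URL of each response follows. The path markers are assumptions about where a blocked request gets redirected, not something documented by LinkedIn, and looks_like_login_wall is a hypothetical helper:

```python
from urllib.parse import urlsplit

# Assumed markers: URL paths that suggest a redirect to a login page.
LOGIN_PATH_MARKERS = ('/authwall', '/login', '/uas/login')


def looks_like_login_wall(final_url):
    """Guess whether a request ended up on a login page rather than a profile."""
    path = urlsplit(final_url).path.lower()
    return any(path.startswith(marker) for marker in LOGIN_PATH_MARKERS)


print(looks_like_login_wall('https://www.linkedin.com/authwall?trk=bf&sessionRedirect=x'))
print(looks_like_login_wall('https://www.linkedin.com/company/usebraintrust/'))
```

Inside parse_response this could be called with response.url, so that blocked pages are logged and skipped instead of being saved as empty items.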
To bypass LinkedIn's anti-scraping system, you will need to use very high quality rotating residential/mobile proxies, browser profiles and a fortified headless browser.
We have written guides about how to do this here:
- Guide to Web Scraping Without Getting Blocked
- Scrapy Proxy Guide: How to Integrate & Rotate Proxies With Scrapy
- Scrapy User Agents: How to Manage User Agents When Scraping
- Scrapy Proxy Waterfalling: How to Waterfall Requests Over Multiple Proxy Providers
However, if you don't want to implement all this anti-bot bypassing logic yourself, the easier option is to use a smart proxy solution like ScrapeOps Proxy Aggregator, which integrates with over 20 proxy providers and finds the proxy solution that works best for LinkedIn for you.
The ScrapeOps Proxy Aggregator is a smart proxy that handles everything for you:
- Proxy rotation & selection
- Rotating user-agents & browser headers
- Ban detection & CAPTCHA bypassing
- Country IP geotargeting
- Javascript rendering with headless browsers
You can get a ScrapeOps API key with 1,000 free API credits by signing up here.
To use the ScrapeOps Proxy Aggregator with our LinkedIn Scrapy Spider, we just need to send the URL we want to scrape to the Proxy API instead of making the request directly ourselves. You can test it out with curl using the command below:
curl 'https://proxy.scrapeops.io/v1/?api_key=YOUR_API_KEY&url=https://www.linkedin.com/company/usebraintrust/'
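The same request URL can be built in Python. This is a minimal sketch that reuses the endpoint and the api_key and url parameters from the curl command above; get_proxy_url is a hypothetical helper name, and YOUR_API_KEY is a placeholder. Unlike the curl example, it percent-encodes the target URL so that any query string on it is not mistaken for proxy parameters:

```python
from urllib.parse import urlencode

API_KEY = 'YOUR_API_KEY'  # placeholder, replace with your own key


def get_proxy_url(url, api_key=API_KEY):
    """Build a ScrapeOps proxy URL with the target URL percent-encoded."""
    payload = {'api_key': api_key, 'url': url}
    return 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)


print(get_proxy_url('https://www.linkedin.com/company/usebraintrust/'))
```

You would only need this if you call the proxy endpoint directly; the SDK described next handles the routing for you.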
We can integrate the proxy easily into our Scrapy project by installing the ScrapeOps Scrapy Proxy SDK, a downloader middleware. We can quickly install it into our project using the following command:
pip install scrapeops-scrapy-proxy-sdk
And then enable it in your project in the settings.py file.
## settings.py
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
Now when we make requests with our Scrapy spider they will be routed through the proxy, and LinkedIn should be far less likely to block them.
Full documentation on how to integrate the ScrapeOps Proxy here.
Monitoring Your LinkedIn Company Profile Scraper
When scraping in production it is vital that you can see how your scrapers are doing so you can fix problems early.
You could see if your jobs are running correctly by checking the output in your file or database, but the easier way to do it would be to install the ScrapeOps Monitor.
ScrapeOps gives you a simple to use, yet powerful way to see how your jobs are doing, run your jobs, schedule recurring jobs, set up alerts and more. All for free!
Live demo here: ScrapeOps Demo
You can create a free ScrapeOps API key here.
We'll just need to run the following to install the ScrapeOps Scrapy Extension:
pip install scrapeops-scrapy
Once that is installed, you need to add the following to your Scrapy project's settings.py file if you want to be able to see your logs in ScrapeOps:
# Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

# Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

# Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
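If you also enabled the proxy SDK earlier, keep in mind that settings.py is ordinary Python: assigning DOWNLOADER_MIDDLEWARES a second time replaces the first dict rather than adding to it. Here is a sketch of a single combined settings.py that uses only the entries and priorities already shown in this guide:

```python
## settings.py

SCRAPEOPS_API_KEY = 'YOUR_API_KEY'
SCRAPEOPS_PROXY_ENABLED = True

EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

# One dict holding both the monitoring and the proxy middlewares.
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapeops_scrapy_proxy_sdk.scrapeops_scrapy_proxy_sdk.ScrapeOpsScrapyProxySdk': 725,
}
```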
Now, every time we run our LinkedIn Company Profile spider (scrapy crawl linkedin_company_profile), the ScrapeOps SDK will monitor the performance and send the data to the ScrapeOps dashboard.
Full documentation on how to integrate the ScrapeOps Monitoring here.
Scheduling & Running Our Scraper In The Cloud
Lastly, we will want to deploy our LinkedIn Company Profile scraper to a server so that we can schedule it to run every day, week, etc.
To do this you have a couple of options.
However, one of the easiest ways is via ScrapeOps Job Scheduler. Plus it is free!
Here is a video guide on how to connect a DigitalOcean server to ScrapeOps and schedule your jobs to run.
You could also connect ScrapeOps to any other server, such as one from Vultr or Amazon Web Services (AWS).
More Web Scraping Guides
In this edition of our "How To Scrape X" series, we went through how you can scrape LinkedIn.com including how to bypass its anti-bot protection.
The full code for this LinkedIn Company Profile Spider is available on Github here.
If you would like to learn how to scrape other popular websites then check out our other How To Scrape With Scrapy Guides here:
- How To Scrape Amazon Products
- How To Scrape Amazon Product Reviews
- How To Scrape Walmart.com
- How To Scrape Indeed.com
Or if you would like to learn more about web scraping in general, then be sure to check out The Web Scraping Playbook, or check out one of our more in-depth guides: