How to Integrate Firecrawl with LangChain for Web Scraping

Sebastian Petrus
5 min read · Sep 4, 2024


Firecrawl and LangChain

Before we get started, if you want to manage all your AI models in one place, I strongly suggest you take a look at Anakin AI, where you can use virtually any AI model without the pain of managing 10+ subscriptions.

How to Integrate Firecrawl with LangChain for Web Scraping: Prerequisites

Before diving into how to integrate Firecrawl with LangChain for web scraping, ensure you have the following prerequisites set up:

  1. Python: Make sure you have Python 3.x installed on your system.
  2. Firecrawl: Install Firecrawl, a web scraping framework that lets you extract data efficiently.
  3. LangChain: Install LangChain, a framework for building applications with language models.

To install Firecrawl and LangChain, you can use pip as follows:

pip install firecrawl-py langchain

Hey, if you are working with AI APIs, Apidog is here to make your life easier. It’s an all-in-one API development tool that streamlines the entire process — from design and documentation to testing and debugging.

How to Integrate Firecrawl with LangChain for Web Scraping: Setting Up Your Project

To begin, we need to set up a project structure where we can implement the scraping functionality using Firecrawl and integrate it with LangChain.

  1. Create a New Project Directory:
mkdir firecrawl_langchain_integration
cd firecrawl_langchain_integration
  2. Create a Python Script:

Create a file named scraper.py:

touch scraper.py
  3. Import Necessary Libraries:

Open scraper.py and import the required libraries:

from firecrawl import Firecrawl
from langchain.llms import OpenAI  # used later to summarize the scraped content

How to Integrate Firecrawl with LangChain for Web Scraping: Configuring Firecrawl

We need to configure Firecrawl to define the parameters for web scraping. This involves specifying the target URL and identifying the data we want to extract.

  1. Define Your Scraper:

Add the following code to scraper.py to set up a basic scraper:

class MyCustomScraper(Firecrawl):
    def start_requests(self):
        # Seed the scraper with the URL(s) to crawl
        urls = ['https://example.com']
        for url in urls:
            yield self.make_request(url)

    def parse(self, response):
        # Pull the page title out of the response
        title = response.xpath('//title/text()').get()
        yield {'Title': title}
  2. Run Your Scraper:

To test your scraper, add the following code to run the scraping process.

if __name__ == '__main__':
    scraper = MyCustomScraper()
    scraper.run()

When you run this script (python scraper.py), it will fetch the title from the specified URL.

How to Integrate Firecrawl with LangChain for Web Scraping: Extracting Data with Firecrawl

Once the basic setup is done, let’s extract specific data from a webpage. We’ll further enhance our parse method to collect more data fields.

  1. Expand the Parsing Logic:

Update the parse method within MyCustomScraper:

def parse(self, response):
    title = response.xpath('//title/text()').get()
    paragraphs = response.xpath('//p/text()').getall()  # Get all paragraph text
    yield {
        'Title': title,
        'Content': paragraphs
    }

This modification collects both the title and the content of all paragraphs from the webpage.
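In practice, //p/text() often returns stray whitespace and empty strings. If you want cleaner output, a small helper like the one below (plain Python; the clean_paragraphs name is just an illustration, not part of any library) can tidy the list before you yield it:

def clean_paragraphs(paragraphs):
    """Strip whitespace and drop empty entries from a list of scraped paragraphs."""
    stripped = [p.strip() for p in paragraphs]
    return [p for p in stripped if p]

# Usage inside parse:
# paragraphs = clean_paragraphs(response.xpath('//p/text()').getall())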

How to Integrate Firecrawl with LangChain for Web Scraping: Processing Data with LangChain

Now, let’s implement LangChain to analyze and process the scraped data. This example will show you how to generate a summary from the scraped content.

  1. Initialize LangChain:

At the beginning of your scraper.py, you need to import additional modules and set up LangChain:

from langchain.llms import OpenAI  # Or your preferred LLM

# Initialize your model (text-davinci-003 has been retired;
# gpt-3.5-turbo-instruct is its completion-style replacement)
model = OpenAI(model_name='gpt-3.5-turbo-instruct', openai_api_key='YOUR_API_KEY')

Make sure you replace 'YOUR_API_KEY' with your actual OpenAI API key.
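Better still, avoid hard-coding the key at all. A safer and very common pattern is to read it from an environment variable with Python's standard os module; here is a minimal sketch (the OPENAI_API_KEY variable name is a convention, not a requirement):

import os
from langchain.llms import OpenAI

# Read the key from the environment instead of embedding it in the script
api_key = os.getenv('OPENAI_API_KEY')
if api_key is None:
    raise RuntimeError('Set the OPENAI_API_KEY environment variable first')

model = OpenAI(model_name='gpt-3.5-turbo-instruct', openai_api_key=api_key)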

  2. Summarizing the Content:

Add a function within your scraper to utilize LangChain for summarizing the scraped content:

def summarize_content(content):
    # predict() sends a single prompt string and returns the model's text reply
    summarized = model.predict(f"Please summarize the following content: {content}")
    return summarized
  3. Integrate Summarization with the Parser:

Modify the parse method to also summarize the content after scraping:

def parse(self, response):
    title = response.xpath('//title/text()').get()
    paragraphs = response.xpath('//p/text()').getall()

    content = ' '.join(paragraphs)  # Join the paragraphs into a single string
    summary = summarize_content(content)  # Get the summary

    yield {
        'Title': title,
        'Content': content,
        'Summary': summary
    }

This way, after scraping the data, you now also get a summary of the page’s content.
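One caveat: a long page can exceed the model's context window. If that happens, a common workaround is map-reduce style summarization, summarizing the text in chunks and then summarizing the partial summaries. Here is a rough sketch that reuses summarize_content from above (the 3,000-character chunk size is an arbitrary illustration, not a tuned value):

def summarize_long_content(content, chunk_size=3000):
    """Summarize text that may not fit in a single model call."""
    # Split into fixed-size character chunks (crude, but simple)
    chunks = [content[i:i + chunk_size] for i in range(0, len(content), chunk_size)]
    partial_summaries = [summarize_content(chunk) for chunk in chunks]
    if len(partial_summaries) == 1:
        return partial_summaries[0]
    # Combine the partial summaries into one final summary
    return summarize_content(' '.join(partial_summaries))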

How to Integrate Firecrawl with LangChain for Web Scraping: Handling Errors and Exceptions

When integrating web scraping with LangChain, error handling is essential to make your application robust. Here, we will incorporate error handling within our scraper.

  1. Error Handling Logic:

You can enhance your scraping method by adding try-except blocks:

def parse(self, response):
    try:
        title = response.xpath('//title/text()').get()
        paragraphs = response.xpath('//p/text()').getall()

        content = ' '.join(paragraphs)
        summary = summarize_content(content)

        yield {
            'Title': title,
            'Content': content,
            'Summary': summary
        }
    except Exception as e:
        self.logger.error(f"Error parsing the page: {e}")

This method logs any errors encountered during parsing, enabling easier debugging.
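Network errors deserve similar care. When a request fails transiently, retrying with exponential backoff often succeeds; here is a generic helper (the fetch callable and the retry counts are illustrative, not part of Firecrawl's API):

import time

def fetch_with_retries(fetch, url, max_retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as e:
            wait = base_delay * (2 ** attempt)  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")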

How to Integrate Firecrawl with LangChain for Web Scraping: Scaling Your Web Scraper

If you’re planning to scrape multiple URLs, you’ll want to scale your scraper effectively. Follow these steps to enhance your web scraper for multiple pages.

  1. Using Multiple URLs:

Update the start_requests method with a list of URLs:

def start_requests(self):
    urls = [
        'https://example1.com',
        'https://example2.com',
        'https://example3.com',
        # Add more URLs as needed
    ]
    for url in urls:
        yield self.make_request(url)
  2. Managing Rate Limiting and Delays:

It’s important to implement delays between requests to avoid getting blocked. You can use time.sleep() for this purpose:

import time

def start_requests(self):
    urls = [...] # List of URLs
    for url in urls:
        yield self.make_request(url)
        time.sleep(2)  # Pause for 2 seconds between requests

By following these steps, you can efficiently scrape multiple pages while respecting the website’s rate limits.
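Beyond delays, it is also good practice to honor each site's robots.txt. Python's standard library includes urllib.robotparser for exactly this; a minimal check (the user agent string is an arbitrary example) might look like:

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent='MyCustomScraper'):
    """Return True if the site's robots.txt permits user_agent to fetch url."""
    parts = urlparse(url)
    root = f"{parts.scheme}://{parts.netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, '/robots.txt'))
    parser.read()  # Fetch and parse the robots.txt file
    return parser.can_fetch(user_agent, url)

You could call is_allowed(url) inside start_requests and simply skip any URL it rejects.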

How to Integrate Firecrawl with LangChain for Web Scraping: Saving Data to a Database

Lastly, after scraping and processing your data, you may want to save it to a database. Here’s how to store scraped data in a SQLite database using SQLAlchemy.

  1. Install SQLAlchemy:

If you haven’t installed SQLAlchemy yet, it can be done via pip:

pip install SQLAlchemy
  2. Define Your Database Model:

Create a new file called models.py to define your database tables:

from sqlalchemy import create_engine, Column, String, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class ScrapedData(Base):
    __tablename__ = 'scraped_data'
    title = Column(String, primary_key=True)
    content = Column(Text)
    summary = Column(Text)

engine = create_engine('sqlite:///scraped_data.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
  3. Save the Scraped Data:

Now, within your scraper.py, import your models and save the data within parse:

from models import ScrapedData, Session

def parse(self, response):
    title = response.xpath('//title/text()').get()
    paragraphs = response.xpath('//p/text()').getall()

    content = ' '.join(paragraphs)
    summary = summarize_content(content)

    # Save to database
    session = Session()
    scraped_item = ScrapedData(title=title, content=content, summary=summary)
    session.add(scraped_item)
    session.commit()
    session.close()

    yield {
        'Title': title,
        'Content': content,
        'Summary': summary
    }

Now your scraper is not only capable of gathering and summarizing data but also storing it persistently in an SQLite database.
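To verify the whole pipeline, you can read the stored rows back through the same SQLAlchemy session, for example:

from models import ScrapedData, Session

session = Session()
for row in session.query(ScrapedData).all():
    # Show each stored title with the first 80 characters of its summary
    print(row.title, '-', (row.summary or '')[:80])
session.close()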
