AI PDF Summarization and Sentiment Analysis with Langchain + Ollama: Python Tutorial
Introduction
In this tutorial, we will walk through the process of setting up a Python environment to summarize PDF documents and analyze their sentiment using Langchain and Ollama. You'll learn how to extract text from PDFs and employ AI models to generate structured outputs that can be utilized in various applications.
Environment Setup
Install UV: Ensure you have UV installed for better performance. You can quickly install it via pip:
pip install uv
Initialize Project: Once UV is installed, initialize your project with the uv init command. This creates the necessary files for your project.
Install Dependencies: Next, install the core libraries, mainly Langchain and pypdf for text extraction from PDF files:
pip install langchain pypdf
Switch to Virtual Environment: Make sure you are working inside your project's virtual environment (see the consolidated command sketch after this list).
Install Ollama: If you haven't done so already, install Ollama to run models locally. Start the server with:
ollama serve
List Available Models: After starting the server, list all available models with:
ollama list
We will be using the Llama 3.1 model. If you don't have it installed, you can quickly download it with:
ollama pull llama3.1
Install Langchain Community Package: Don't forget to install the Langchain Community package (along with the langchain-ollama integration), since the PDF loader and the Ollama chat model used in the code below live in these add-on packages.
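Putting these steps together, a minimal end-to-end setup might look like the sketch below. It assumes uv creates a .venv directory in the project and that the langchain-community and langchain-ollama packages provide the PDF loader and Ollama chat model used in the code that follows; adjust package names and paths to your environment.
pip install uv                          # install the uv tool
uv init                                 # create project files in the current directory
uv venv                                 # create a virtual environment in .venv
source .venv/bin/activate               # activate it (Windows: .venv\Scripts\activate)
pip install langchain pypdf langchain-community langchain-ollama
ollama serve                            # run in a separate terminal
ollama pull llama3.1                    # download the Llama 3.1 model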
Code Implementation
Step 1: Import Required Modules
Import necessary modules at the beginning of your script:
from langchain_community.document_loaders import PyPDFLoader  # PDF loader built on pypdf
from langchain_ollama import ChatOllama  # chat model wrapper for the local Ollama server
Step 2: Initialize the Language Model
Create an instance of the ChatOllama chat model, pointing it at the local Llama 3.1 model:
llm = ChatOllama(model="llama3.1", temperature=0)  # Temperature set to 0 for deterministic output
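As a quick optional check (not part of the original steps), you can confirm the model is reachable before loading any PDF; this assumes ollama serve is running and llama3.1 has been pulled:
test_reply = llm.invoke("Reply with one word: ready")  # simple round-trip test against the local server
print(test_reply.content)  # prints the model's short reply if everything is wired up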
Step 3: Load the PDF
To load your PDF, provide the file path:
loader = PyPDFLoader("IBM_annual_report.pdf")  # Ensure the file is in the same directory
docs = loader.load()  # Extracts text from the PDF, one Document per page
print(len(docs))  # Display the number of pages
Now, let's extract text from just the first 10 pages for processing:
pages = "\n".join([doc.text for doc in docs[:10]]) # Combine text of first 10 pages
print(pages)
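Each item returned by the loader is a LangChain Document, so you can also peek at its metadata (source file and page number). A small optional check:
print(docs[0].metadata)  # e.g. {'source': 'IBM_annual_report.pdf', 'page': 0}
print(docs[0].page_content[:200])  # preview the first 200 characters of the first page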
Step 4: Generate a Summary
Set up messages for the model to summarize the document content:
messages = [
    ("system", "Summarize the following text."),
    ("human", pages),
]
response = llm.invoke(messages)
print(response.content)  # Output the summary
Step 5: Analyze Sentiment
For structured sentiment analysis, use the Pydantic library to define the shape of the expected output.
from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    sentiment: str = Field(..., description="The overall sentiment of the report")
    score: int = Field(..., description="Sentiment score indicating a bullish or bearish evaluation")
llm_with_structured_output = llm.with_structured_output(SentimentResult) # Enable structured output
structured_response = llm_with_structured_output.invoke(messages)
print(structured_response) # Print structured sentiment response
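Because structured_response is a SentimentResult instance rather than free text, its fields can be read directly. A short usage sketch based on the schema above (the cutoff is a hypothetical example, not something the model defines):
print(structured_response.sentiment)  # e.g. "positive"
print(structured_response.score)  # integer score produced by the model
if structured_response.score > 0:  # hypothetical cutoff for a bullish reading
    print("The report reads bullish overall.")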
Conclusion
You've successfully created a script that can summarize PDF documents and analyze their sentiment in a structured way! This can be integrated into applications where compliance or sentiment insights from documents are valuable.
Keywords
- AI
- Summarization
- Sentiment Analysis
- Langchain
- Ollama
- Python
- Pydantic
- Machine Learning
FAQ
Q1: What is Langchain?
A1: Langchain is a framework designed for building applications with language models, primarily focusing on natural language processing tasks.
Q2: How do I install Ollama?
A2: You can install Ollama by following the installation instructions in its official repository or website, and then start the server with ollama serve.
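For example, on Linux the official install script can be used; this is a sketch based on Ollama's published instructions (macOS and Windows use a desktop installer), so check the official site for the current steps:
curl -fsSL https://ollama.com/install.sh | sh  # Linux install script
ollama --version  # verify the installation
ollama serve  # start the local server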
Q3: What is the benefit of using a structured output?
A3: Structured output allows you to receive consistently formatted responses, making it easier to process and integrate results into applications.
Q4: Why set the temperature to zero for the language model?
A4: Setting the temperature to zero ensures that the model generates deterministic and consistent outputs, which is essential for tasks that require reliability.
Q5: Where can I find documentation for the libraries used?
A5: You can find the documentation for Langchain, Ollama, and Pydantic on their respective official websites or repositories.