AI PDF Summarization and Sentiment Analysis with LangChain + Ollama: Python Tutorial

Introduction

In this tutorial, we will set up a Python environment to summarize PDF documents and analyze their sentiment using LangChain and Ollama. You'll learn how to extract text from PDFs and use a locally running model to generate structured outputs that can feed into other applications.

Environment Setup

Install uv: uv is a fast Python package and project manager. You can install it via pip:
```
pip install uv
```
Initialize Project: Once UV is installed, initialize your project using the uv init command. This will create the necessary files for your project.
Install Dependencies: Next, install the main libraries: LangChain, and pypdf for text extraction from PDF files. Use the following command:
```
pip install langchain pypdf
```
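Switch to Virtual Environment: Make sure you are working within your project's virtual environment (for example, one created with uv venv and activated with source .venv/bin/activate on macOS or Linux), so the packages install into the project rather than system-wide.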
Install Ollama: If you haven't done so already, install Ollama to interact with models. Start the server using the command:
```
ollama serve
```
List Available Models: After starting the server, list the models that are installed locally using:
```
ollama list
```
We will be using the Llama 3.1 model. If you don’t have it installed, you can quickly download it with:
```
ollama pull llama3.1
```
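Install LangChain Community Package: Don't forget to install the langchain-community package, as you'll be working with it. The Ollama chat model integration ships separately as langchain-ollama; both install with pip install langchain-community langchain-ollama.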
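
Once the server is running and the model is pulled, you may want to confirm from Python that the model is actually available before running the rest of the script. The sketch below is one way to do that using only the standard library. It assumes the Ollama server is listening on its default address (http://localhost:11434) and that the /api/tags endpoint returns a JSON object with a "models" list; the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # assumed default local address


def parse_model_names(payload: dict) -> list[str]:
    """Pull the model names out of the JSON body returned by /api/tags."""
    return [model["name"] for model in payload.get("models", [])]


def is_model_installed(names: list[str], wanted: str) -> bool:
    """True if `wanted` is installed, either exactly or with any ':tag' suffix."""
    return any(name == wanted or name.startswith(wanted + ":") for name in names)


def fetch_local_models(url: str = OLLAMA_TAGS_URL, timeout: float = 2.0):
    """Return the installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return None
```

Calling fetch_local_models() and passing the result to is_model_installed(names, "llama3.1") tells you whether you still need to run ollama pull. A return value of None suggests the server is not running.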

Code Implementation

Step 1: Import Required Modules

Import necessary modules at the beginning of your script:

from langchain_ollama import ChatOllama  # provided by the langchain-ollama package
from pypdf import PdfReader

Step 2: Initialize the Language Model

Create an instance of the chat model, backed by the local Llama 3.1 model:

llm = ChatOllama(model="llama3.1", temperature=0)  # Temperature 0 keeps output as repeatable as possible
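
To get a feel for why a low temperature makes output more repeatable, here is a toy sketch of temperature-scaled softmax over a few candidate tokens. The logits are made-up numbers and the function is only an illustration of the general idea, not Ollama's actual sampler.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability goes to the top score.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass lands on the highest-scoring token, so sampling picks the same token more often.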

Step 3: Load the PDF

To load your PDF, provide the file path:

reader = PdfReader("IBM_annual_report.pdf")  # Path is relative to the working directory
docs = list(reader.pages)  # One page object per PDF page
print(len(docs))  # Display the number of pages

Now, let's extract text from just the first 10 pages for processing:

pages = "\n".join([doc.extract_text() or "" for doc in docs[:10]])  # Combine text of first 10 pages
print(pages)
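
Text extraction can come back empty for scanned or image-only pages, and local models have a limited context window, so it can help to clean up and cap the combined text before sending it to the model. Below is a minimal sketch that works on a plain list of page strings. The helper name combine_pages and the max_chars parameter are hypothetical, and a character cap is only a rough stand-in for a real token limit.

```python
def combine_pages(page_texts, max_pages=10, max_chars=None):
    """Join the first `max_pages` page texts, skipping empty or missing pages."""
    selected = [t.strip() for t in page_texts[:max_pages] if t and t.strip()]
    combined = "\n".join(selected)
    if max_chars is not None and len(combined) > max_chars:
        # Hard cut; a real pipeline might split on sentence boundaries instead.
        combined = combined[:max_chars]
    return combined


sample_pages = ["Letter to shareholders ...", "", "   ", "Financial highlights ..."]
print(combine_pages(sample_pages, max_pages=4))
```

With pypdf you could build the input list as [page.extract_text() for page in reader.pages] and pass it straight in.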

Step 4: Generate a Summary

Set up messages for the model to summarize the document content:

messages = [
    ("system", "Summarize the following text:"),  # Instruction for the model
    ("human", pages),  # The extracted PDF text
]

response = llm.invoke(messages)
print(response.content)  # Output the summary
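
LangChain chat models accept messages as (role, content) tuples, so the prompt can be assembled by a small helper and reused for different documents. The sketch below shows one way to do that. The function name, the wording of the system prompt, and the max_sentences parameter are all assumptions made for illustration.

```python
def build_summary_messages(text: str, max_sentences: int = 5) -> list[tuple[str, str]]:
    """Return a system instruction plus the document text as chat messages."""
    system_prompt = (
        "You are a careful analyst. Summarize the user's text "
        f"in at most {max_sentences} sentences."
    )
    return [("system", system_prompt), ("human", text)]


messages = build_summary_messages("Revenue grew while costs stayed flat.", max_sentences=2)
for role, content in messages:
    print(f"{role}: {content}")
```

In the script, the result would go to the model with llm.invoke(build_summary_messages(pages)).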

Step 5: Analyze Sentiment

For structured sentiment analysis, use the Pydantic library to structure the expected output.

from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    sentiment: str = Field(..., description="The overall sentiment of the report")
    score: int = Field(..., description="Sentiment score indicating bullish or bearish evaluation")

llm_with_structured_output = llm.with_structured_output(SentimentResult)  # Enable structured output

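sentiment_messages = [
    ("system", "Analyze the overall sentiment of the following text:"),
    ("human", pages),
]
structured_response = llm_with_structured_output.invoke(sentiment_messages)
print(structured_response)  # Print structured sentiment response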
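
Structured output saves you from hand-parsing the model's text. For comparison, here is a sketch of the manual checks you would otherwise write yourself, using only the standard library. The ParsedSentiment class mirrors the two fields of SentimentResult, and the JSON string is made-up sample data rather than real model output.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedSentiment:
    sentiment: str
    score: int


def parse_sentiment(raw: str) -> ParsedSentiment:
    """Parse a JSON string and check that both expected fields are usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    sentiment = data.get("sentiment")
    score = data.get("score")
    if not isinstance(sentiment, str) or not sentiment.strip():
        raise ValueError("'sentiment' must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError("'score' must be an integer")
    return ParsedSentiment(sentiment=sentiment.strip(), score=score)


sample = '{"sentiment": "positive", "score": 7}'  # made-up example payload
print(parse_sentiment(sample))
```

With with_structured_output and a Pydantic model, this kind of validation is handled for you, and the result arrives as an object with typed attributes.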

Conclusion

You've successfully created a script that can summarize PDF documents and analyze their sentiment in a structured way! This can be integrated into applications where compliance or sentiment insights from documents are valuable.

Keywords

AI
PDF
Summarization
Sentiment Analysis
Langchain
Ollama
Python
Pydantic
Machine Learning

FAQ

Q1: What is Langchain?
A1: LangChain is an open-source framework for building applications on top of language models. It provides common interfaces for models, prompts, and document loading so the pieces can be combined in a few lines of code.

Q2: How do I install Ollama?
A2: You can install Ollama by following the installation instructions on their official repository and then start the server using ollama serve.

Q3: What is the benefit of using a structured output?
A3: Structured output allows you to receive consistently formatted responses, making it easier to process and integrate results into applications.

Q4: Why set the temperature to zero for the language model?
A4: Setting the temperature to zero makes the model favor the most likely token at each step, so outputs are far more consistent from run to run. This matters for tasks that require reliability, although small variations can still occur.

Q5: Where can I find documentation for the libraries used?
A5: You can find the documentation for Langchain, Ollama, and Pydantic on their respective official websites or repositories.