Build a Local Ollama OCR Application Powered By Llama 3.2-Vision

Sebastian Petrus
4 min read · Nov 19, 2024


Optical Character Recognition (OCR) has become an essential tool for digitizing printed text and extracting information from images. Multimodal models such as Llama 3.2-Vision can read text directly from an image, which makes them a practical option for OCR tasks. In this article, we will guide you through building your own OCR application in Python using Meta's Llama 3.2-Vision model, served locally through Ollama.

Prerequisites

Before we begin, ensure you have the following prerequisites:

  • A laptop or desktop computer running Windows, macOS, or Linux.
  • A stable internet connection to download necessary packages and models.
  • Basic familiarity with Python programming.
  • Python installed on your system (preferably version 3.7 or higher).

Step 1: Install Ollama

Ollama is a tool for running large language models, including multimodal ones, locally on your own machine. To install Ollama, follow these steps:

  1. Download Ollama: Visit the official Ollama website and download the installation package suitable for your operating system.
  2. Install Ollama: Follow the installation prompts to complete the setup.
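
After installation, Ollama runs a local HTTP server. The sketch below is one way to confirm from Python that it is reachable before going further. It assumes the default address http://localhost:11434; the function name ollama_is_up is a hypothetical helper, not part of any Ollama library, and the sketch uses only the standard library so that it runs before any packages are installed.

```python
import urllib.request

# Assumed default address of a local Ollama install; adjust if yours differs.
OLLAMA_BASE_URL = "http://localhost:11434"


def ollama_is_up(base_url=OLLAMA_BASE_URL, timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return False
```

Calling ollama_is_up() returns False when nothing is listening, which usually means the Ollama application has not been started yet.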

Step 2: Install Llama 3.2-Vision Model

Once you have Ollama installed, you can install the Llama 3.2-Vision model by executing the following command in your terminal:

ollama run llama3.2-vision

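This command downloads the model the first time it is run (the download is several gigabytes) and then opens an interactive prompt; type /bye to leave it. If you only want to download the model without starting a session, run ollama pull llama3.2-vision instead.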
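
To confirm that the model is available, you can ask the local server for its list of downloaded models. The following is a minimal sketch that queries Ollama's /api/tags endpoint and assumes the default address http://localhost:11434. The helper names has_model and fetch_tags are hypothetical and exist only for this illustration.

```python
import json
import urllib.request

# Assumed default address of a local Ollama install; adjust if yours differs.
OLLAMA_BASE_URL = "http://localhost:11434"


def has_model(tags_response, name):
    """Check a parsed /api/tags response for a model, ignoring the tag suffix."""
    for entry in tags_response.get("models", []):
        installed = entry.get("name", "")
        # Installed names look like "llama3.2-vision:latest".
        if installed == name or installed.split(":")[0] == name:
            return True
    return False


def fetch_tags(base_url=OLLAMA_BASE_URL, timeout=5.0):
    """Ask the local Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, has_model(fetch_tags(), "llama3.2-vision") should be True once the download has finished.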

Step 3: Set Up Your Python Environment

Now that you have everything installed, let’s set up a Python environment for our OCR project:

  1. Create a new directory for your project:
mkdir llama-ocr && cd llama-ocr
  2. Create a virtual environment (optional but recommended):
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
  3. Install Required Libraries: The script uses requests to talk to the Ollama server. Pillow is optional but useful if you later want to inspect or preprocess images. Base64 encoding is handled by Python's standard library and needs no installation. Install the packages using pip:
pip install requests Pillow
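
If you want to double-check the environment before writing the script, a quick sanity check along these lines may help. It is only a sketch: in_virtualenv and missing_packages are hypothetical helper names, and the check relies on the fact that Pillow is imported under the name PIL.

```python
import importlib.util
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def missing_packages(import_names):
    """Return the top-level import names that cannot be found."""
    return [n for n in import_names if importlib.util.find_spec(n) is None]


# Pillow is imported as "PIL", so that is the name to look for.
REQUIRED = ["requests", "PIL"]
```

Running missing_packages(REQUIRED) should return an empty list once the pip command above has completed inside the active environment.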

Step 4: Write Your OCR Script

Now it’s time to write the Python script that will perform OCR using Llama 3.2-Vision. Create a new file named ollama_ocr.py and add the following code:

import base64

import requests

# Default address of the local Ollama chat endpoint; change it if yours differs.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

SYSTEM_PROMPT = """Act as an OCR assistant. Analyze the provided image and:
1. Recognize all visible text in the image as accurately as possible.
2. Maintain the original structure and formatting of the text.
3. If any words or phrases are unclear, indicate this with [unclear] in your transcription.
Provide only the transcription without any additional comments."""


def encode_image_to_base64(image_path):
    """Convert an image file to a base64 encoded string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def perform_ocr(image_path):
    """Perform OCR on the given image using Llama 3.2-Vision."""
    base64_image = encode_image_to_base64(image_path)
    response = requests.post(
        OLLAMA_CHAT_URL,
        json={
            "model": "llama3.2-vision",
            "stream": False,  # Return one JSON object instead of a stream of chunks
            "messages": [
                {
                    "role": "user",
                    "content": SYSTEM_PROMPT,
                    "images": [base64_image],
                },
            ],
        },
        timeout=300,  # Vision models can be slow, especially without a GPU
    )
    if response.status_code == 200:
        return response.json().get("message", {}).get("content", "")
    else:
        print("Error:", response.status_code, response.text)
        return None


if __name__ == "__main__":
    image_path = "path/to/your/image.jpg"  # Replace with your image path
    result = perform_ocr(image_path)
    if result:
        print("OCR Recognition Result:")
        print(result)

Explanation of the Code

  1. Base64 Encoding: The function encode_image_to_base64 reads an image file and converts it into a base64 string, which is how the Ollama API expects images to be embedded in a JSON request.
  2. Performing OCR: The perform_ocr function sends a POST request to the local Ollama service containing the prompt (passed as the user message) and the base64-encoded image.
  3. Handling Response: The script checks whether the request succeeded (HTTP status 200) and retrieves the recognized text from the message.content field of the JSON response.
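
One detail worth knowing about the response: unless the request body sets "stream" to false, Ollama's chat endpoint sends its answer as a series of newline-delimited JSON chunks rather than a single JSON object, and calling response.json() on such a reply fails. The sketch below shows one way to assemble those chunks into the final text. The function name join_stream is hypothetical, and the chunk layout (a message.content fragment plus a done flag) is assumed from Ollama's API documentation.

```python
import json


def join_stream(lines):
    """Concatenate message.content from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line:
            continue  # Skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # The final chunk is marked with "done": true
    return "".join(parts)
```

With requests, this could be used as join_stream(response.iter_lines()) after posting with stream=True. For a short OCR script, a single non-streamed response is usually simpler.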

Step 5: Run Your OCR Script

To run your script, replace "path/to/your/image.jpg" with the actual path of an image file you want to analyze. Then execute the script in your terminal:

python ollama_ocr.py

You should see output similar to this:

OCR Recognition Result:
The text recognized from your image will be displayed here.
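
Editing the script each time you want to process a different image gets tedious. A possible refinement, sketched below, is to take the image paths from the command line. The names parse_args, select_images, and SUPPORTED_SUFFIXES are hypothetical, and the suffix list is only an illustrative filter, not a statement of what the model accepts.

```python
import argparse
from pathlib import Path

# Illustrative filter only; not an official list of supported formats.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def parse_args(argv=None):
    """Parse image paths and an optional model name from the command line."""
    parser = argparse.ArgumentParser(description="Run OCR on one or more images.")
    parser.add_argument("images", nargs="+", type=Path, help="image files to transcribe")
    parser.add_argument("--model", default="llama3.2-vision", help="Ollama model name")
    return parser.parse_args(argv)


def select_images(paths):
    """Keep only paths whose suffix looks like an image file."""
    return [p for p in paths if p.suffix.lower() in SUPPORTED_SUFFIXES]
```

In the main block of the script you could then loop over select_images(parse_args().images) and call perform_ocr(str(path)) for each one, so that the script is invoked as python ollama_ocr.py scan1.png scan2.jpg.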

Step 6: Optimizing Results

If you find that the OCR results are not satisfactory, consider adjusting the SYSTEM_PROMPT variable in your script to better suit your specific use case or improve clarity in instructions provided to Llama 3.2-Vision.
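
As an illustration of this kind of tuning, the sketch below keeps a few prompt variants side by side and builds the request body from whichever one is selected. The prompt wording and the names PROMPTS and build_payload are hypothetical. Setting temperature to 0 through the request's "options" field is an assumption that more deterministic output suits transcription; it is worth testing against your own images.

```python
# Hypothetical prompt variants for different kinds of documents.
PROMPTS = {
    "plain": (
        "Transcribe all text in the image exactly as written. "
        "Output only the transcription."
    ),
    "table": (
        "Transcribe the table in the image as rows of pipe-separated cells. "
        "Output only the table."
    ),
}


def build_payload(image_b64, prompt_key="plain", temperature=0.0,
                  model="llama3.2-vision"):
    """Build a chat request body for the chosen prompt variant."""
    return {
        "model": model,
        "stream": False,
        "options": {"temperature": temperature},  # Lower values reduce randomness
        "messages": [
            {
                "role": "user",
                "content": PROMPTS[prompt_key],
                "images": [image_b64],
            },
        ],
    }
```

The returned dictionary can be passed as the json argument of requests.post, which makes it easy to run the same image through several prompts and compare the transcriptions.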

Conclusion

Building an OCR application using Llama 3.2-Vision with Ollama is straightforward and powerful due to its multimodal capabilities. By following these steps, you can create a functional OCR tool on your laptop that leverages advanced AI technology for text recognition tasks.

Feel free to experiment with different images and prompts to explore the full potential of this model! As AI continues to evolve, tools like Llama 3.2-Vision will only get better at understanding and processing visual information efficiently.

Written by Sebastian Petrus

Assistant Prof @ U of Waterloo, AI/ML, e/acc
