How to Run Mistral NeMo 12B Locally (Step by Step)
It Might Be the Best Local LLM Right Now
Mistral NeMo 12B is a state-of-the-art language model developed by Mistral AI in collaboration with NVIDIA. This powerful model offers a large context window of up to 128k tokens and excels in reasoning, world knowledge, and coding accuracy. In this guide, we’ll walk you through the process of running Mistral NeMo 12B locally on your machine, providing step-by-step instructions and sample code to get you started.
Before we get started: if you want to manage all your AI models in one place like I do, I strongly suggest you take a look at Anakin AI, where you can use virtually any AI model without the pain of managing 10+ subscriptions.
Prerequisites
Before we begin, ensure you have the following:
- A compatible GPU with at least 24GB VRAM (e.g., NVIDIA RTX 3090, 4090, or A5000)
- CUDA toolkit installed (version 11.7 or later)
- Python 3.8 or later
- Git (Obviously, right?)
Hey, if you are working with AI APIs, Apidog is here to make your life easier. It’s an all-in-one API development tool that streamlines the entire process, from design and documentation to testing and debugging.
Step 1: Set Up the Environment
First, let’s create a virtual environment and install the necessary dependencies:

```bash
python -m venv mistral_nemo_env
source mistral_nemo_env/bin/activate  # On Windows, use: mistral_nemo_env\Scripts\activate

# Install PyTorch with CUDA 11.8 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install the Hugging Face libraries; the second line upgrades Transformers to the
# latest development version, which includes support for Mistral NeMo
pip install transformers accelerate
pip install git+https://github.com/huggingface/transformers.git
```
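If you want to confirm that PyTorch can actually see your GPU before downloading a 12B-parameter model, a quick sanity check like the following works (a minimal sketch that only assumes the packages installed above):

```python
import torch

# Quick sanity check before downloading a large model
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1e9:.1f} GB VRAM")
```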
Step 2: Download the Model
Mistral NeMo 12B is available on the Hugging Face Hub. We’ll use the huggingface_hub library to download the model:
```python
from huggingface_hub import snapshot_download
from pathlib import Path

model_path = Path.home().joinpath('mistral_models', 'Nemo-Instruct')
model_path.mkdir(parents=True, exist_ok=True)

# Download the full repository so the Transformers-format files (config.json,
# tokenizer files, sharded safetensors) are available locally. Restricting the
# download to params.json / consolidated.safetensors / tekken.json only fetches
# the mistral-inference format, which AutoModelForCausalLM cannot load.
snapshot_download(
    repo_id="mistralai/Mistral-Nemo-Instruct-2407",
    local_dir=model_path
)
```
This will download the necessary files to your home directory under mistral_models/Nemo-Instruct.
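To confirm the download landed where you expect, you can list the directory using the model_path defined above (a small optional check):

```python
# List the downloaded files and their sizes
for f in sorted(model_path.iterdir()):
    print(f"{f.name}: {f.stat().st_size / 1e9:.2f} GB")
```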
Step 3: Load the Model
Now that we have the model files, let’s load the model using the Transformers library:
```python
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = Path.home().joinpath('mistral_models', 'Nemo-Instruct')

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bf16 halves memory compared to fp32
    device_map="auto"            # place layers on the available GPU(s)
)
```
This code loads the tokenizer and model, automatically utilizing your GPU for inference.
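If you want to see how much memory the loaded weights actually occupy, Transformers exposes get_memory_footprint() on the model; a quick check might look like this (a small sketch using the model loaded above):

```python
# Report the in-memory size of the loaded weights (roughly 2 bytes per parameter in bf16)
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
print("First parameter lives on:", next(model.parameters()).device)
```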
Step 4: Generate Text
With the model loaded, we can now generate text. Here’s a sample function to interact with the model:
```python
def generate_response(prompt, max_new_tokens=256, temperature=0.3):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # cap the number of newly generated tokens
        temperature=temperature,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        num_return_sequences=1
    )
    # Decode only the newly generated tokens so the prompt isn't echoed back
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Example usage
prompt = "Explain the concept of quantum entanglement in simple terms."
response = generate_response(prompt)
print(response)
```
Note that we’ve set the default temperature to 0.3 in this example. Unlike previous Mistral models, Mistral NeMo works best with smaller temperatures; 0.3 is the recommended value for optimal results.
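To see the effect, you can pass a different temperature through the same generate_response function; the prompts and the 0.7 value below are purely illustrative:

```python
# Lower temperature -> more focused, deterministic output (recommended for NeMo)
factual = generate_response("List three key facts about the Moon.", temperature=0.3)

# Higher temperature -> more varied output, but quality can degrade
creative = generate_response("Write a two-line poem about the Moon.", temperature=0.7)

print(factual)
print(creative)
```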
Step 5: Fine-tuning for Specific Tasks
If you want to fine-tune Mistral NeMo 12B for specific tasks, you can use the Transformers library’s training capabilities. Here’s a basic example of how to set up fine-tuning:
```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("your_dataset_name")

# The collator needs a padding token; fall back to EOS if the tokenizer has none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the text field so the Trainer receives input_ids
# (assumes the dataset has a "text" column; adjust for your own schema)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# Causal-LM collator: pads each batch and uses the input_ids as labels
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
)

# Start fine-tuning
trainer.train()
```
Keep in mind that full fine-tuning of a 12B-parameter model needs far more GPU memory than inference, so adjust the batch size, sequence length, and other training parameters to match your specific use case and available hardware.
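If memory is tight, TrainingArguments exposes a few standard knobs worth trying first; the values below are a sketch for illustration, not tuned recommendations:

```python
# Memory-conscious variant of the training arguments above (illustrative values)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,    # smallest possible micro-batch
    gradient_accumulation_steps=16,   # keep the effective batch size at 16
    gradient_checkpointing=True,      # trade compute for activation memory
    bf16=True,                        # train in bfloat16 on supported GPUs
    logging_dir="./logs",
)
```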
Step 6: Optimizing for Inference
To optimize Mistral NeMo 12B for faster inference, you can use techniques like quantization. Here’s an example of how to apply 8-bit quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package: pip install bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
```
This loads the model weights in 8-bit precision, roughly halving memory usage compared to bf16. Generation with 8-bit weights is not always faster than bf16, so treat quantization primarily as a way to fit the model into less VRAM.
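If 8-bit is still too large for your card, bitsandbytes also supports 4-bit loading through the same config object. A minimal sketch, using common defaults rather than benchmark-backed settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
```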
Step 7: Handling Long Contexts
Mistral NeMo 12B supports a context length of up to 128k tokens. To utilize this capability, you may need to adjust your tokenizer and model settings:
```python
# The model supports a 128k-token context window; make sure the tokenizer
# doesn't truncate long inputs below that
tokenizer.model_max_length = 128000
model.config.max_position_embeddings = 128000

# Example of using a long context
long_prompt = "Your very long text here..." * 1000  # Repeat to create a long input
response = generate_response(long_prompt, max_new_tokens=2048)
print(response)
```
Be aware that processing very long contexts may require significant computational resources and time.
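Before sending a very long prompt, it helps to check how many tokens it actually contains so you stay within the 128k window:

```python
# Count prompt tokens before generation
token_count = len(tokenizer(long_prompt)["input_ids"])
print(f"Prompt length: {token_count} tokens")
```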
Step 8: Leveraging Multi-GPU Setups
If you have multiple GPUs, you can distribute the model across them for improved performance:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"   # shards the layers across all visible GPUs
)
```
The device_map="auto" parameter will automatically distribute the model across available GPUs.
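To see how the layers were actually split, Accelerate records the placement on the model as hf_device_map; inspecting it is a one-liner:

```python
# Show which GPU (or CPU) each block of the model was assigned to
print(model.hf_device_map)
```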
Step 9: Implementing a Simple Chat Interface
To make interacting with Mistral NeMo 12B more user-friendly, you can create a simple chat interface:
```python
def chat_with_model():
    print("Chat with Mistral NeMo 12B (type 'exit' to end the conversation)")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            break
        response = generate_response(user_input)
        print("Mistral NeMo 12B:", response)

chat_with_model()
```
This function creates a loop that takes user input, generates a response using the model, and prints it, providing a simple chat-like experience.
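Note that this loop treats every message independently, so the model has no memory of earlier turns, and it passes the raw input without the instruct chat template. A slightly extended sketch that keeps a running history and formats it with tokenizer.apply_chat_template might look like this (chat_with_history is a hypothetical name; the generation settings mirror the defaults used earlier):

```python
def chat_with_history():
    print("Chat with Mistral NeMo 12B (type 'exit' to end the conversation)")
    history = []
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            break
        history.append({"role": "user", "content": user_input})

        # Build the full conversation with the model's chat template
        input_ids = tokenizer.apply_chat_template(
            history,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)

        outputs = model.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.3,
            top_p=0.95,
        )
        # Decode only the new tokens and remember them for the next turn
        reply = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
        history.append({"role": "assistant", "content": reply})
        print("Mistral NeMo 12B:", reply)

chat_with_history()
```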