How to Run Mistral NeMo 12B Locally (Step by Step)
It Might Be the Best Local LLM Right Now
Mistral NeMo 12B is a state-of-the-art language model developed by Mistral AI in collaboration with NVIDIA. This powerful model offers a large context window of up to 128k tokens and excels in reasoning, world knowledge, and coding accuracy. In this guide, we’ll walk you through the process of running Mistral NeMo 12B locally on your machine, providing step-by-step instructions and sample code to get you started.
Before we get started: if you want to manage all your AI models in one place like I do, I strongly suggest you take a look at Anakin AI, where you can use virtually any AI model without the pain of managing 10+ subscriptions.
Prerequisites
Before we begin, ensure you have the following:
- A compatible GPU with at least 24GB VRAM (e.g., NVIDIA RTX 3090, 4090, or A5000)
- CUDA toolkit installed (version 11.7 or later)
- Python 3.8 or later
- Git (Obviously, right?)
Hey, if you are working with AI APIs, Apidog is here to make your life easier. It’s an all-in-one API development tool that streamlines the entire process, from design and documentation to testing and debugging.
Step 1: Set Up the Environment
First, let’s create a virtual environment and install the necessary dependencies:

```bash
python -m venv mistral_nemo_env
source mistral_nemo_env/bin/activate  # On Windows, use: mistral_nemo_env\Scripts\activate

# Install PyTorch with CUDA 11.8 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install the Hugging Face libraries; the second line upgrades Transformers to the
# latest development version, which includes support for Mistral NeMo
pip install transformers accelerate
pip install git+https://github.com/huggingface/transformers.git
```
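If you want to confirm that PyTorch can actually see your GPU before downloading a 12B-parameter model, a quick sanity check like the following works (a minimal sketch that only assumes the packages installed above):

```python
import torch

# Quick sanity check before downloading a large model
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1e9:.1f} GB VRAM")
```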
Step 2: Download the Model
Mistral NeMo 12B is available on the Hugging Face Hub. We’ll use the huggingface_hub library to download the model:
```python
from huggingface_hub import snapshot_download
from pathlib import Path

model_path = Path.home().joinpath('mistral_models', 'Nemo-Instruct')
model_path.mkdir(parents=True, exist_ok=True)

# Download the full repository so the Transformers-format files (config.json,
# tokenizer files, sharded safetensors) are available locally. Restricting the
# download to params.json / consolidated.safetensors / tekken.json only fetches
# the mistral-inference format, which AutoModelForCausalLM cannot load.
snapshot_download(
    repo_id="mistralai/Mistral-Nemo-Instruct-2407",
    local_dir=model_path
)
```
This will download the necessary files to your home directory under mistral_models/Nemo-Instruct.
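To confirm the download landed where you expect, you can list the directory using the model_path defined above (a small optional check):

```python
# List the downloaded files and their sizes
for f in sorted(model_path.iterdir()):
    print(f"{f.name}: {f.stat().st_size / 1e9:.2f} GB")
```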
Step 3: Load the Model
Now that we have the model files, let’s load the model using the Transformers library:
```python
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = Path.home().joinpath('mistral_models', 'Nemo-Instruct')

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # bf16 halves memory compared to fp32
    device_map="auto"            # place layers on the available GPU(s)
)
```
This code loads the tokenizer and model, automatically utilizing your GPU for inference.
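If you want to see how much memory the loaded weights actually occupy, Transformers exposes get_memory_footprint() on the model; a quick check might look like this (a small sketch using the model loaded above):

```python
# Report the in-memory size of the loaded weights (roughly 2 bytes per parameter in bf16)
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
print("First parameter lives on:", next(model.parameters()).device)
```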
Step 4: Generate Text
With the model loaded, we can now generate text. Here’s a sample function to interact with the model:
```python
def generate_response(prompt, max_new_tokens=256, temperature=0.3):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # cap the number of newly generated tokens
        temperature=temperature,
        do_sample=True,
        top_p=0.95,
        top_k=40,
        num_return_sequences=1
    )
    # Decode only the newly generated tokens so the prompt isn't echoed back
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

# Example usage
prompt = "Explain the concept of quantum entanglement in simple terms."
response = generate_response(prompt)
print(response)
```
Note that we’ve set the default temperature to 0.3 in this example. Unlike previous Mistral models, Mistral NeMo works best with smaller temperatures; 0.3 is the recommended value for optimal results.
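To see the effect, you can pass a different temperature through the same generate_response function; the prompts and the 0.7 value below are purely illustrative:

```python
# Lower temperature -> more focused, deterministic output (recommended for NeMo)
factual = generate_response("List three key facts about the Moon.", temperature=0.3)

# Higher temperature -> more varied output, but quality can degrade
creative = generate_response("Write a two-line poem about the Moon.", temperature=0.7)

print(factual)
print(creative)
```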
Step 5: Fine-tuning for Specific Tasks
If you want to fine-tune Mistral NeMo 12B for specific tasks, you can use the Transformers library’s training capabilities. Here’s a basic example of how to set up fine-tuning:
```python
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset

# Load your dataset
dataset = load_dataset("your_dataset_name")

# The collator needs a padding token; fall back to EOS if the tokenizer has none
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Tokenize the text field so the Trainer receives input_ids
# (assumes the dataset has a "text" column; adjust for your own schema)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized_dataset = dataset.map(tokenize, batched=True, remove_columns=dataset["train"].column_names)

# Causal-LM collator: pads each batch and uses the input_ids as labels
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
)

# Start fine-tuning
trainer.train()
```
Keep in mind that full fine-tuning of a 12B-parameter model needs far more GPU memory than inference, so adjust the batch size, sequence length, and other training parameters to match your specific use case and available hardware.
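If memory is tight, TrainingArguments exposes a few standard knobs worth trying first; the values below are a sketch for illustration, not tuned recommendations:

```python
# Memory-conscious variant of the training arguments above (illustrative values)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=1,    # smallest possible micro-batch
    gradient_accumulation_steps=16,   # keep the effective batch size at 16
    gradient_checkpointing=True,      # trade compute for activation memory
    bf16=True,                        # train in bfloat16 on supported GPUs
    logging_dir="./logs",
)
```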
Step 6: Optimizing for Inference
To optimize Mistral NeMo 12B for faster inference, you can use techniques like quantization. Here’s an example of how to apply 8-bit quantization:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Requires the bitsandbytes package: pip install bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
```
This loads the model weights in 8-bit precision, roughly halving memory usage compared to bf16. Generation with 8-bit weights is not always faster than bf16, so treat quantization primarily as a way to fit the model into less VRAM.
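If 8-bit is still too large for your card, bitsandbytes also supports 4-bit loading through the same config object. A minimal sketch, using common defaults rather than benchmark-backed settings:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    device_map="auto"
)
```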
Step 7: Handling Long Contexts
Mistral NeMo 12B supports a context length of up to 128k tokens. To utilize this capability, you may need to adjust your tokenizer and model settings:
```python
# The model supports a 128k-token context window; make sure the tokenizer
# doesn't truncate long inputs below that
tokenizer.model_max_length = 128000
model.config.max_position_embeddings = 128000

# Example of using a long context
long_prompt = "Your very long text here..." * 1000  # Repeat to create a long input
response = generate_response(long_prompt, max_new_tokens=2048)
print(response)
```
Be aware that processing very long contexts may require significant computational resources and time.
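Before sending a very long prompt, it helps to check how many tokens it actually contains so you stay within the 128k window:

```python
# Count prompt tokens before generation
token_count = len(tokenizer(long_prompt)["input_ids"])
print(f"Prompt length: {token_count} tokens")
```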
Step 8: Leveraging Multi-GPU Setups
If you have multiple GPUs, you can distribute the model across them for improved performance:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"   # shards the layers across all visible GPUs
)
```
The device_map="auto" parameter will automatically distribute the model across available GPUs.
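To see how the layers were actually split, Accelerate records the placement on the model as hf_device_map; inspecting it is a one-liner:

```python
# Show which GPU (or CPU) each block of the model was assigned to
print(model.hf_device_map)
```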
Step 9: Implementing a Simple Chat Interface
To make interacting with Mistral NeMo 12B more user-friendly, you can create a simple chat interface:
```python
def chat_with_model():
    print("Chat with Mistral NeMo 12B (type 'exit' to end the conversation)")
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            break
        response = generate_response(user_input)
        print("Mistral NeMo 12B:", response)

chat_with_model()
```
This function creates a loop that takes user input, generates a response using the model, and prints it, providing a simple chat-like experience.
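Note that this loop treats every message independently, so the model has no memory of earlier turns, and it passes the raw input without the instruct chat template. A slightly extended sketch that keeps a running history and formats it with tokenizer.apply_chat_template might look like this (chat_with_history is a hypothetical name; the generation settings mirror the defaults used earlier):

```python
def chat_with_history():
    print("Chat with Mistral NeMo 12B (type 'exit' to end the conversation)")
    history = []
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'exit':
            break
        history.append({"role": "user", "content": user_input})

        # Build the full conversation with the model's chat template
        input_ids = tokenizer.apply_chat_template(
            history,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to(model.device)

        outputs = model.generate(
            input_ids,
            max_new_tokens=256,
            do_sample=True,
            temperature=0.3,
            top_p=0.95,
        )
        # Decode only the new tokens and remember them for the next turn
        reply = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
        history.append({"role": "assistant", "content": reply})
        print("Mistral NeMo 12B:", reply)

chat_with_history()
```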