Top 10 Open Weights LLMs in 2024 You Cannot Miss

Sebastian Petrus
10 min read · Sep 5, 2024


2024 has witnessed a significant shift towards more accessible and powerful AI models. At the forefront of this revolution is Anakin AI, an all-in-one AI platform that has emerged as a game-changer in the industry. Anakin AI offers users the unique opportunity to experiment with and leverage a wide array of cutting-edge AI models, including many of the open weights models featured in this comprehensive ranking.

Before we get started, let’s talk about Anakin AI for a bit.

  • Anakin AI’s platform stands out for its user-friendly interface and the sheer diversity of models it makes available.
  • Whether you’re a seasoned AI researcher, a developer looking to integrate AI into your applications, or a curious enthusiast eager to explore the capabilities of different models, Anakin AI provides a centralized hub for accessing and comparing various AI technologies.

You can also connect to Anakin AI's API for programmatic access. Read this doc to learn more.
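
If you want to go that route, the sketch below shows the general shape of such a call using Python's requests library. The endpoint URL, payload fields, and model name are placeholders, not Anakin AI's documented API, so check the official docs for the real parameter names before using it.

```python
# Hypothetical sketch of calling an Anakin AI-style HTTP API.
# The URL, payload fields, and model name below are placeholders; consult
# the official Anakin AI documentation for the actual endpoint and schema.
import os
import requests

API_KEY = os.environ["ANAKIN_API_KEY"]  # assumed environment variable

response = requests.post(
    "https://api.anakin.ai/v1/chat",  # placeholder URL
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.1-405b",  # placeholder model name
        "messages": [
            {"role": "user", "content": "Summarize open weights LLMs in one paragraph."}
        ],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```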

  • No need to manage multiple AI subscriptions: Anakin AI gives you access to all the major LLMs as well as AI image generation models such as FLUX.

Now, let's get into a detailed examination of the top open weights models of 2024.

Why You Want an Open Weights Model Instead of a Closed One

Pretty obvious, right? Do you really prefer "Open" AI?

Open Weights LLMs: Better than “Open” AI

The prominence of open weights models in this ranking underscores a significant trend in the AI industry towards greater transparency and accessibility. These models offer several advantages:

  1. Customization: Researchers and developers can fine-tune and adapt open weights models to specific use cases (a minimal LoRA sketch follows this list).
  2. Transparency: The ability to inspect and understand model architectures promotes trust and enables better debugging.
  3. Community-driven improvement: Open-source models benefit from collective efforts to enhance performance and address limitations.
  4. Cost-effectiveness: Organizations can deploy and scale these models without the ongoing costs associated with proprietary API usage.
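
To make the customization point concrete, here is a bare-bones LoRA fine-tuning sketch using the Hugging Face transformers, peft, and datasets libraries. The base model (google/gemma-2-2b, which is gated behind a license acceptance), the dataset, and the hyperparameters are illustrative choices; any open weights checkpoint you have access to follows the same pattern.

```python
# Minimal LoRA fine-tuning sketch for an open weights model.
# Model, dataset, and hyperparameters here are illustrative, not prescriptive.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "google/gemma-2-2b"  # small open checkpoint used for the example
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with low-rank adapters; only the adapters are trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Tiny instruction-style dataset, tokenized into short sequences.
data = load_dataset("yahma/alpaca-cleaned", split="train[:1%]")
def tokenize(batch):
    return tokenizer(batch["output"], truncation=True, max_length=512)
data = data.map(tokenize, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gemma2-lora", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("gemma2-lora-adapter")  # saves only the small adapter weights
```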

Hey, if you are working with AI APIs, Apidog is here to make your life easier. It’s an all-in-one API development tool that streamlines the entire process — from design and documentation to testing and debugging.

Best Open Weights Models Overall: A Closer Look

1. Gemma-2 27B (Google)

Gemma-2 27B is Google’s latest open-weights model, building upon the success of their Gemini series. This model represents a significant advancement in the field of open-source AI, offering performance that rivals much larger proprietary models.

Technical Specifications:

  • Parameters: 27 billion
  • Architecture: Transformer-based with optimized attention mechanisms
  • Context window: 8,192 tokens
  • Training data: Curated web content, books, and code repositories
  • Multi-task learning capabilities
  • Advanced few-shot and zero-shot performance
  • Optimized for efficient inference on various hardware configurations

Benchmarks:

  • MMLU (5-shot): 76.2%
  • HumanEval (0-shot): 67.8%
  • GSM8K (8-shot): 84.5%
  • TruthfulQA: 62.3%

Gemma-2 27B demonstrates exceptional performance across a wide range of tasks, particularly excelling in reasoning and knowledge-intensive benchmarks. Its MMLU score of 76.2% places it among the top-performing open-source models, rivaling some much larger proprietary alternatives.
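
If you want to try it locally, the instruction-tuned checkpoint is published on Hugging Face as google/gemma-2-27b-it (gated behind a license acceptance). A quick generation sketch, assuming you have enough GPU memory or let device_map shard the weights across several GPUs:

```python
# Minimal local inference sketch for Gemma-2 27B (instruction-tuned variant).
# Assumes the Gemma license has been accepted on Hugging Face and that enough
# GPU memory is available; device_map="auto" shards the model across GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Explain what an open weights model is in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```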

2. Command R+ (Cohere)

Command R+ is Cohere’s flagship open-weights model, designed to excel in enterprise and research applications. It offers a balance of performance and efficiency, making it suitable for a variety of deployment scenarios.

Technical Specifications:

  • Parameters: 104 billion
  • Architecture: Enhanced transformer with proprietary optimizations
  • Context window: 128,000 tokens
  • Training data: High-quality web content, academic papers, and specialized datasets
  • Advanced instruction-following capabilities
  • Robust performance in multi-turn conversations
  • Specialized modules for document analysis and summarization

Benchmarks:

  • MMLU: Not publicly disclosed
  • HumanEval: 78.3%
  • GSM8K: 89.2%
  • TruthfulQA: 71.5%

While Cohere has not released comprehensive benchmark results for Command R+, independent evaluations have shown its strong performance in coding tasks (HumanEval) and mathematical reasoning (GSM8K). Its TruthfulQA score also indicates a high degree of factual accuracy and resistance to hallucination.
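
The weights are on Hugging Face as CohereForAI/c4ai-command-r-plus, but the quickest way to probe its multi-turn behavior is Cohere's hosted API. Here is a short sketch using the cohere Python SDK's v1-style chat interface; newer SDK releases expose a revised client, so treat the exact call shape as an assumption and check the current docs.

```python
# Sketch: multi-turn chat against Cohere's hosted Command R+.
# Uses the v1-style cohere SDK interface; newer SDK versions expose a
# ClientV2 with a slightly different shape.
import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

history = [
    {"role": "USER", "message": "Give me three open weights LLMs released in 2024."},
    {"role": "CHATBOT", "message": "Gemma-2 27B, Mistral Large 2, and LLaMA 3.1 405B."},
]

response = co.chat(
    model="command-r-plus",
    chat_history=history,
    message="Which of those has the longest context window?",
)
print(response.text)
```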

3. Grok-1 (xAI)

Grok-1, developed by Elon Musk’s xAI, is an open-weights model that aims to push the boundaries of AI capabilities while maintaining transparency and accessibility.

Technical Specifications:

  • Parameters: 314 billion
  • Architecture: Mixture-of-Experts transformer (8 experts, 2 active per token)
  • Context window: 8,192 tokens
  • Training data: Diverse web content, with emphasis on real-time information and current events
  • Real-time knowledge integration
  • Advanced conversational abilities
  • Specialized modules for scientific and technical reasoning

Benchmarks:

  • MMLU: 73.8%
  • HumanEval: 67.2%
  • GSM8K: 82.6%
  • TruthfulQA: 69.1%

Grok-1 demonstrates strong performance across various benchmarks, with particularly impressive results in tasks requiring up-to-date knowledge and complex reasoning. Its large parameter count contributes to its ability to handle a wide range of tasks effectively.
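
The released checkpoint is enormous (the quantized weights are on the order of 300 GB), so in practice most people only download it to inspect or convert it. A short download sketch with huggingface_hub, assuming the xai-org/grok-1 repository and plenty of disk space; actually running the model also needs the loader code from xAI's grok-1 GitHub repository.

```python
# Sketch: pull the Grok-1 release from Hugging Face.
# The checkpoint is roughly 300 GB, so this is only practical on a machine
# with large, fast storage; inference requires xAI's own loader code.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="xai-org/grok-1",
    local_dir="./grok-1",
)
print("Checkpoint downloaded to", local_dir)
```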

4. Mistral Large 2 (Mistral AI)

Mistral Large 2 represents the latest advancement from Mistral AI, known for their efficient and powerful language models.

Technical Specifications:

  • Parameters: 123 billion
  • Architecture: Dense transformer
  • Context window: 128,000 tokens
  • Training data: High-quality web content, academic literature, and code repositories
  • State-of-the-art performance in multilingual tasks
  • Advanced code generation and analysis capabilities
  • Designed for efficient single-node inference

Benchmarks:

  • MMLU: 84.0%
  • HumanEval: 76.5%
  • GSM8K: 91.3%
  • TruthfulQA: 73.2%

Mistral Large 2 showcases exceptional performance across all major benchmarks, with its MMLU score of 84.0% placing it at the forefront of open-source models. Its architecture allows for efficient processing of large context windows while maintaining high accuracy.
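
At 123 billion parameters the instruct checkpoint (mistralai/Mistral-Large-Instruct-2407 on Hugging Face, gated behind a license) is still a multi-GPU deployment, and a common route is vLLM with tensor parallelism. A sketch under that assumption; the GPU count and context cap are placeholders you would tune to your hardware:

```python
# Sketch: serving Mistral Large 2 locally with vLLM.
# Assumes access to the gated mistralai/Mistral-Large-Instruct-2407 repo and
# a node with several large GPUs (tensor_parallel_size splits the weights).
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Large-Instruct-2407",
    tensor_parallel_size=8,   # adjust to the number of GPUs available
    max_model_len=32768,      # cap the context to fit the memory budget
)

params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["Write a Python function that checks whether a string is a palindrome."],
    params,
)
print(outputs[0].outputs[0].text)
```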

5. LLaMA 3.1 405B (Meta AI)

LLaMA 3.1 405B is Meta AI’s latest and largest open-source language model, pushing the boundaries of what’s possible with publicly available AI.

Technical Specifications:

  • Parameters: 405 billion
  • Architecture: Enhanced transformer with optimized self-attention
  • Context window: 128,000 tokens
  • Training data: Diverse multilingual web content, books, and scientific papers
  • Exceptional few-shot learning capabilities
  • Advanced multilingual understanding and generation
  • Robust performance in long-context tasks

Benchmarks:

  • MMLU: 88.6%
  • HumanEval: 73.3%
  • GSM8K: 96.8%
  • TruthfulQA: 73.8%

LLaMA 3.1 405B sets new standards for open-source model performance, with its MMLU score of 88.6% rivaling or surpassing many proprietary models. Its massive parameter count allows for nuanced understanding and generation across a wide range of tasks and domains.
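
The 405B variant realistically needs a multi-node setup, so for local experimentation most people start with the smaller 3.1 checkpoints plus quantization. A 4-bit loading sketch with bitsandbytes, using the 8B instruct model as a stand-in; the same API applies to the larger variants given enough hardware, and the repo is gated behind Meta's license.

```python
# Sketch: 4-bit quantized loading of a LLaMA 3.1 checkpoint.
# Uses the 8B instruct model as a stand-in; the 405B model follows the same
# API but needs far more GPU memory even when quantized.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # gated: requires accepting Meta's license
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "List three use cases for a 128K-token context window."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(prompt, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```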

6. DeepSeek Coder V2

DeepSeek Coder V2 is a specialized large language model focused on programming and code-related tasks, developed by DeepSeek AI.

Key Features:

  • Available in two sizes: 236B parameters (full model) and 16B parameters (lite version)
  • Trained on 6 trillion tokens of high-quality, multi-source code corpus
  • Supports 338 programming languages, a significant increase from its predecessor
  • Extended context length of 128K tokens, up from 16K in the previous version

Performance:

  • Claimed to achieve performance comparable to GPT-4-Turbo on code-specific tasks
  • Significant improvements in various aspects of code-related tasks, reasoning, and general capabilities compared to DeepSeek Coder V1

Benchmarks:

  • Specific scores not provided, but reported to be competitive with or surpassing top closed-source models in coding benchmarks

Key Capabilities:

  • Advanced code generation and completion
  • Bug detection and fixing
  • Code refactoring and optimization
  • Technical documentation generation
  • Handling complex programming concepts across multiple languages

Advantages:

  • Open-source nature allows for customization and fine-tuning
  • Extensive language support makes it versatile for various development environments
  • Long context window enables working with larger codebases and more complex problems

Considerations:

  • The full 236B model requires significant computational resources for deployment
  • While open-source, the model’s size may limit its use in resource-constrained environments

DeepSeek Coder V2 represents a significant advancement in open-source coding-focused language models, potentially rivaling proprietary solutions in capability while offering the benefits of transparency and customization inherent to open-source projects.
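
The 16B lite variant is the practical entry point, published on Hugging Face as deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct. A short bug-fixing sketch, assuming a recent transformers release in which the text-generation pipeline accepts chat-style message lists directly:

```python
# Sketch: bug-fixing with DeepSeek Coder V2 Lite (16B) via the pipeline API.
# trust_remote_code is needed because the repo ships custom model code.
import torch
from transformers import pipeline

coder = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

buggy = "def mean(xs):\n    return sum(xs) / len(xs) + 1   # off-by-one bug\n"
messages = [{"role": "user", "content": f"Find and fix the bug in this function:\n\n{buggy}"}]

result = coder(messages, max_new_tokens=256)
# The pipeline returns the conversation with the assistant's reply appended.
print(result[0]["generated_text"][-1]["content"])
```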

7. Nemotron-4 340B (NVIDIA)

Nemotron-4 340B represents NVIDIA’s most ambitious foray into the open-source large language model landscape. Released in June 2024, this model family showcases NVIDIA’s expertise in both hardware optimization and AI model development.

Technical Specifications:

  • Parameters: 340 billion
  • Architecture: Enhanced transformer with custom NVIDIA optimizations
  • Context window: 32,768 tokens
  • Training data: Diverse multilingual corpus, including web content, books, and specialized datasets

Key Features:

  1. Multilingual Proficiency: Trained on over 50 languages, Nemotron-4 340B demonstrates exceptional cross-lingual transfer abilities.
  2. Programming Language Support: With training data covering 40+ programming languages, it excels in code-related tasks.
  3. Hardware Optimization: Specifically designed to leverage NVIDIA’s GPU architecture, ensuring optimal performance on NVIDIA hardware.
  4. NeMo Framework Integration: Seamless integration with NVIDIA’s NeMo framework for easy fine-tuning and deployment.
  5. TensorRT-LLM Compatibility: Optimized for inference using NVIDIA’s TensorRT-LLM library, enabling high-performance, low-latency deployment.
  6. Open Model License: Released under a permissive license, encouraging widespread adoption and modification.

Performance Benchmarks:

  • MMLU: 78%
  • ARC-Challenge: Competitive with Llama-3 70B
  • BigBench Hard: On par with Mixtral 8x22B
  • RewardBench: Nemotron-4-340B-Reward variant achieves top accuracy, surpassing some proprietary models

Use Cases:

  1. Synthetic Data Generation: Specifically designed for creating high-quality synthetic datasets across various domains (see the sketch after this list).
  2. Code Generation and Analysis: Strong performance in programming-related tasks.
  3. Multilingual Applications: Ideal for building applications that require understanding and generation across multiple languages.
  4. Research and Development: The open nature of the model makes it valuable for advancing AI research.
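
Because NVIDIA positions the instruct and reward variants around synthetic data pipelines, a typical pattern is to drive the instruct model through an OpenAI-compatible endpoint (for example, one you host yourself via TensorRT-LLM or NVIDIA NIM) and collect the outputs. In the sketch below, the base_url and model name are assumptions about your own deployment, not fixed values.

```python
# Sketch: generating synthetic instruction-response pairs with Nemotron-4 340B
# through an OpenAI-compatible endpoint. The base_url and model name are
# placeholders for however the model is deployed locally.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # placeholder endpoint

topics = ["unit testing in Python", "gradient descent", "HTTP caching"]
pairs = []
for topic in topics:
    prompt = f"Write one challenging interview question about {topic}, then answer it."
    reply = client.chat.completions.create(
        model="nemotron-4-340b-instruct",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
        max_tokens=512,
    )
    pairs.append({"topic": topic, "synthetic_sample": reply.choices[0].message.content})

# Write the collected samples as JSON lines for later filtering or reward scoring.
with open("synthetic_pairs.jsonl", "w") as f:
    for row in pairs:
        f.write(json.dumps(row) + "\n")
```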

Challenges and Considerations:

  • Requires significant computational resources for deployment and fine-tuning.
  • While open-source, it’s optimized for NVIDIA hardware, which may limit its accessibility for some users.

8. GPT-2 (OpenAI)

Although released in 2019, GPT-2 remains a significant milestone in the development of large language models. Its impact on the field and continued relevance make it worth discussing in the context of 2024’s open-source LLM landscape.

Technical Specifications:

  • Parameters: 1.5 billion (largest variant)
  • Architecture: Transformer-based, causal language model
  • Context window: 1,024 tokens
  • Training data: 8 million web pages from outbound Reddit links

Key Features:

  1. Zero-shot Task Performance: Demonstrated ability to perform various tasks without specific fine-tuning.
  2. Scalable Architecture: Offered in multiple sizes (124M, 355M, 774M, 1.5B parameters), showcasing the benefits of scale.
  3. Unsupervised Pretraining: Trained on a diverse range of internet text, enabling broad knowledge acquisition.
  4. Open-source Release: Gradual release of model sizes, with the full 1.5B model eventually made public.

Performance Benchmarks (at time of release):

  • Children’s Book Test: 93.3% accuracy
  • LAMBADA: 63.2% accuracy
  • Winograd Schema Challenge: 70.7% accuracy

Historical Significance:

  1. Ethical Considerations: Sparked debates about the potential misuse of powerful language models.
  2. Advancement in Text Generation: Set new standards for coherent and contextually relevant text generation.
  3. Foundation for Future Models: Laid groundwork for subsequent models like GPT-3 and beyond.

Current Relevance (2024):

  • Still used as a baseline in many NLP research papers (see the sketch after this list).
  • Serves as an educational tool for understanding transformer architectures.
  • Lightweight enough for deployment in resource-constrained environments.
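
Its small size is exactly why it survives as a baseline: the 124M "gpt2" checkpoint loads in seconds and runs comfortably on a laptop CPU. A minimal sketch with the transformers pipeline:

```python
# Sketch: GPT-2 as a quick text-generation baseline.
# The 124M "gpt2" checkpoint runs comfortably on CPU.
from transformers import pipeline, set_seed

set_seed(42)  # make the samples reproducible
generator = pipeline("text-generation", model="gpt2")
samples = generator(
    "Open weights language models are useful because",
    max_new_tokens=40,
    do_sample=True,
    num_return_sequences=2,
)
for s in samples:
    print(s["generated_text"])
```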

Limitations:

  • Outdated knowledge cutoff (2019)
  • Limited context window compared to modern models
  • Less capable in specialized tasks compared to newer, task-specific models

9. Phi-3 Medium (Microsoft)

Phi-3 Medium, part of Microsoft’s Phi model series, represents a significant advancement in efficient large language model design. Released in early 2024, it aims to provide strong performance with a relatively small parameter count.

Technical Specifications:

  • Parameters: 14 billion
  • Architecture: Enhanced transformer with Microsoft’s proprietary optimizations
  • Context window: 128,000 tokens
  • Training data: Curated dataset focusing on high-quality web content, academic papers, and code repositories

Key Features:

  1. Extended Context Window: 128K token context allows for processing of very long documents or conversations.
  2. Instruction Optimization: Specifically tuned for following complex, multi-step instructions.
  3. Efficient Architecture: Designed to provide strong performance with lower computational requirements than larger models.
  4. Multilingual Capabilities: Supports a wide range of languages, though with a focus on major world languages.
  5. Code Understanding: Incorporates specialized training for programming tasks.

Performance Benchmarks:

  • MMLU: 76.5%
  • HumanEval: 63.2%
  • GSM8K: 79.8%
  • TruthfulQA: 68.7%

Use Cases:

  1. Resource-Constrained Environments: Ideal for deployment on edge devices or in scenarios with limited computational power.
  2. Long-Form Content Analysis: The extended context window makes it suitable for tasks involving lengthy documents.
  3. Code Assistance: Strong performance in programming-related tasks makes it useful for developer tools.
  4. Education and Tutoring: Well-suited for creating interactive educational experiences.

Advantages:

  • Balances performance and efficiency, making it accessible to a wider range of users and applications.
  • The large context window sets it apart from many other models in its size class.
  • Microsoft’s backing ensures ongoing support and potential integration with popular development tools.

Limitations:

  • May struggle with highly specialized tasks compared to larger, domain-specific models.
  • While efficient, still requires significant resources compared to traditional NLP approaches.

In conclusion, Phi-3 Medium represents an important trend in LLM development: the pursuit of efficiency without sacrificing too much capability. Its design choices make it a versatile tool for a wide range of applications, particularly where deployment constraints are a concern.
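
For the long-document use case in particular, here is a short sketch using the microsoft/Phi-3-medium-128k-instruct checkpoint on Hugging Face. The input file name is a placeholder, and it assumes enough GPU memory for the 14B weights plus a long prompt.

```python
# Sketch: summarizing a long document in one pass with Phi-3 Medium (128K context).
# Assumes enough GPU memory for the 14B model plus the long prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-medium-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

with open("annual_report.txt") as f:  # placeholder document
    document = f.read()

messages = [{
    "role": "user",
    "content": f"Summarize the key findings of this report in five bullet points:\n\n{document}",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
print(f"Prompt length: {inputs.shape[-1]} tokens")  # sanity check against the 128K window

outputs = model.generate(inputs, max_new_tokens=400)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```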

10. OpenLM 7B (Apple)

OpenLM 7B represents Apple’s entry into the open-source large language model arena, showcasing the tech giant’s commitment to advancing AI technologies while prioritizing on-device performance and privacy.

Key Features:

  • 7 billion parameters, striking a balance between capability and efficiency
  • Optimized for deployment on Apple devices, leveraging Apple’s custom silicon
  • Designed with privacy in mind, focusing on on-device processing
  • Supports a wide range of natural language processing tasks

Performance:
Specific benchmark scores have not been published for this model, but OpenLM 7B is likely optimized for:

  • Efficient inference on Apple devices (iPhones, iPads, Macs)
  • Low-latency responses for real-time applications
  • Reduced power consumption compared to cloud-based alternatives

Potential Use Cases:

  • On-device virtual assistants
  • Natural language interfaces for Apple apps and services
  • Text generation and summarization tasks
  • Language translation and localization support

Advantages:

  • Tight integration with Apple’s hardware and software ecosystem
  • Potential for enhanced privacy through on-device processing (see the sketch at the end of this section)
  • Reduced reliance on cloud connectivity for AI features

Considerations:

  • May have limited performance compared to larger, cloud-based models
  • Primarily optimized for Apple’s ecosystem, potentially limiting broader adoption
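
A common way to experiment with open checkpoints on Apple silicon today is the open-source mlx-lm package. The sketch below assumes a hypothetical MLX-converted copy of the model at a local path; this is not an official Apple workflow for this specific model, just the usual on-device pattern.

```python
# Sketch: on-device inference on Apple silicon with mlx-lm.
# "path/to/openlm-7b-mlx" is a placeholder for an MLX-converted checkpoint;
# mlx_lm's conversion tooling can produce one from a Hugging Face-format model.
from mlx_lm import load, generate

model, tokenizer = load("path/to/openlm-7b-mlx")  # placeholder local path
reply = generate(
    model,
    tokenizer,
    prompt="Draft a two-sentence notification summary for a calendar conflict.",
    max_tokens=100,
)
print(reply)
```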

Emerging Trends and Future Outlook of Open Weights LLMs

As we analyze the landscape of AI models in 2024, several trends become apparent:

  1. Multimodal capabilities: Models like Gemini 1.5 Pro are pushing the boundaries of integrating various input types.
  2. Ethical considerations: Claude 3.5 Sonnet’s focus on ethical reasoning reflects a growing emphasis on responsible AI development.
  3. Efficiency and scalability: Models like Mistral Large 2 demonstrate that powerful AI can be deployed with reasonable computational resources.
  4. Specialized models: The diversity of models reflects a trend towards AI solutions tailored for specific industries or tasks.

Looking ahead, we can anticipate further advancements in areas such as:

  • Enhanced few-shot and zero-shot learning capabilities
  • Improved long-term memory and contextual understanding
  • More sophisticated multimodal integration, potentially incorporating video and tactile inputs
  • Advancements in AI interpretability and explainability

Conclusion

The 2024 open weights landscape, as represented by this ranking, showcases the rapid progress and diversification of AI capabilities. From Google’s Gemma-2 27B and Mistral Large 2 to Meta’s LLaMA 3.1 405B, the field offers a rich array of options for researchers, developers, and organizations.
