In today’s world of artificial intelligence, language models have become an essential part of many applications. Large Language Models (LLMs) have revolutionized natural language processing, enabling machines to understand and generate text that closely mirrors human language. From producing coherent long-form text to assisting with a wide range of language tasks, these models have made significant progress over time.
Last year, we noticed a significant increase in the usage of open-source AI. Meta’s July 2023 release of Llama 2 generated a lot of interest in the open-source LLM community. This year, we expect a growing number of high-performance open-source large language models (LLMs) designed for both research and commercial applications.
In this article, we will explore the top 10 open-source LLMs that have been making waves in the AI community and are worth keeping an eye on in 2024.
A Large Language Model (LLM) is a type of artificial intelligence model designed to understand and generate human-like text. These models are built using deep learning techniques and are trained on massive text datasets, mostly sourced from the internet.
There are two types of LLMs: proprietary and open source. Proprietary LLMs are owned by companies that control their usage, such as OpenAI’s GPT-4. Open-source LLMs, such as Llama 2 and Falcon, are freely available for anyone to access, and developers and researchers can use, improve, or otherwise modify them. Although proprietary LLMs tend to be larger than open-source models, bigger isn’t necessarily better.
Many organizations, including NASA and healthcare organizations, use open-source LLMs for various applications.
There are many benefits to choosing open-source LLMs over proprietary LLMs, both in the short and long term, including transparency, lower cost, and the freedom to fine-tune and self-host the model.
Numerous open-source large language models (LLMs) are currently available, and the landscape keeps evolving as new models are released. Hugging Face’s Open LLM Leaderboard makes it easier to identify the most capable options: it evaluates and ranks open-source LLMs and chatbots against a set of standard benchmarks.
The composite score used in this leaderboard is based on several benchmarks, including ARC for reasoning ability, HellaSwag for common-sense inference, MMLU for multitask knowledge, and TruthfulQA for answer veracity.
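As a rough illustration of how such a composite score works, the sketch below simply averages placeholder benchmark numbers; the live leaderboard also folds in additional benchmarks such as Winogrande and GSM8K, so treat this as a conceptual example only.

```python
# Illustrative only: approximate how a composite leaderboard score is formed.
# The numbers below are placeholders, not real leaderboard entries, and the
# live Open LLM Leaderboard averages over more benchmarks than listed here.
benchmarks = {
    "ARC": 66.0,         # reasoning ability
    "HellaSwag": 86.0,   # common-sense inference
    "MMLU": 72.0,        # multitask knowledge
    "TruthfulQA": 47.0,  # answer veracity
}

composite = sum(benchmarks.values()) / len(benchmarks)
print(f"Composite score: {composite:.2f}")
```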
With this information, we have created a list of the top 10 open-source LLMs currently available.
Llama 2 is a family of pre-trained and fine-tuned generative text models introduced by Meta AI, in collaboration with Microsoft, on July 18, 2023. It represents a significant advancement in open-source large language models and narrows the gap between open-source models and proprietary systems such as GPT-4.
Llama 2 was trained on a cluster of Nvidia GPUs using a 2-trillion-token dataset, 40% more data than Llama 1. It comes in two flavors and three sizes: a base model and a fine-tuned chat model specialized for dialogue, each available with 7 billion, 13 billion, and 70 billion parameters. All sizes have a context length of 4,096 tokens, double that of Llama 1.
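If you want to try the model yourself, a minimal sketch using Hugging Face transformers looks like the following, assuming you have accepted Meta’s license for the gated meta-llama/Llama-2-7b-chat-hf checkpoint and have a GPU with enough memory for float16 weights.

```python
# Minimal sketch: load the Llama 2 7B chat model with Hugging Face transformers.
# Assumes access has been granted to the gated "meta-llama/Llama-2-7b-chat-hf"
# repository and that a suitable GPU is available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between the base and chat variants of Llama 2."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```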
Llama 2 is open source for both research and commercial purposes, with some limitations. It has been released under a commercial license, so users can download and use it immediately. Companies with fewer than 700 million monthly active users can self-host Llama 2 for commercial use; products with more than 700 million users require Meta’s permission, a clause designed to keep direct competitors from using the model.
Llama 2 70B has gained immense popularity in the AI community thanks to its strong performance and broad language understanding. With 70 billion parameters trained on a 2-trillion-token dataset, the model excels at tasks such as language translation, sentiment analysis, and text completion.
However, Llama 2 is not as sophisticated as GPT-4, particularly for poetry and complex programming tasks. There is still a significant performance gap between Llama 2 and the frontier models (GPT-4 and PaLM 2), and its coding ability in particular is limited compared to GPT-4.
Safety is a top priority for Llama 2, with a significant focus on safety guardrails, red teaming, and evaluations aimed at making the model less likely to produce harmful outputs. These safety measures are a large part of why Llama 2 remains a popular base for fine-tuned models and applications.
Overall, Meta’s reward model approach balances safety and helpfulness, giving you near GPT-4 capabilities at a fraction of the cost.
Falcon-180B is a decoder-only model with 180 billion parameters, developed by the Technology Innovation Institute (TII) in the UAE. It is one of the largest and most powerful open-source pre-trained language models available today, with an architecture optimized for inference through multi-query attention. The model is available under a permissive license that allows both research and commercial use.
Falcon-180B was trained on a massive dataset of 3.5 trillion tokens using AWS SageMaker, with up to 4,096 A100 40GB GPUs running simultaneously in P4d instances. About 85% of the training data comes from TII’s RefinedWeb dataset; the remainder includes conversations, technical papers, Wikipedia, and a small fraction of code.
It was designed to excel at natural language tasks and, as of January 2024, ranks highly on the Hugging Face Open LLM Leaderboard for pre-trained language models, with an average score of 67.85 (69.45 on ARC, 88.86 on HellaSwag, 70.5 on MMLU, and 45.47 on TruthfulQA).
Falcon-180B is roughly 2.5 times larger than Llama 2 and was trained with four times more compute. It outperforms many other open models, including Llama 2, StableLM, RedPajama, and MPT, on various benchmarks, and beats OpenAI’s GPT-3.5 on MMLU. It also performs on par with Google’s PaLM 2 Large on several other benchmarks.
On evaluation benchmarks, the model typically sits somewhere between GPT-3.5 and GPT-4. Falcon-180B is considered a significant advancement for the open-source world, opening up more opportunities for innovation and growth.
If you are looking for a version better suited to taking generic instructions in a chat format, we recommend taking a look at Falcon-180B-Chat.
Yi-34B is a new language model created by 01.AI, a company based in China, and it currently ranks first on the Hugging Face Open LLM Leaderboard. The company aims to build bilingual models that handle both English and Chinese. The model was initially trained with a 4K-token context window, which can be extended to 32K tokens, and the company has since released an impressive 200,000-token context version of the 34B model. These models are available for research purposes and can be licensed for commercial use.
The 34B model has been trained on 3 trillion tokens and performs well in areas like math and code. The company has published benchmarks for both the base models and the supervised fine-tuned chat models, and offers various 4-bit and 8-bit quantized versions to choose from.
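As an illustration of running one of those quantized variants locally, here is a minimal sketch using transformers with bitsandbytes 4-bit loading; the 01-ai/Yi-34B repo id and loading arguments are assumptions to check against the model card for your chosen variant.

```python
# Sketch: load a 4-bit quantized Yi-34B with transformers + bitsandbytes so it
# fits on a single high-memory GPU. Repo id and loading arguments should be
# verified against the published model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "01-ai/Yi-34B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```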
Currently, Yi-34B is the top-ranked pre-trained model on the Hugging Face Open LLM Leaderboard, with an average score of 70.81 (65.36 on ARC, 85.58 on HellaSwag, 76.06 on MMLU, and 53.64 on TruthfulQA).
The 34B chat model beats the 70B Llama 2 chat model on MMLU, and across the board Yi-34B outperforms much larger models such as Llama 2 70B and Falcon-180B.
The responses generated by the model are very good. Two chat variants were evaluated, one with a system prompt and one without, with the generation code adapted to the model’s ChatML-style prompt format. Responses are produced in Markdown, tend to be verbose, lean heavily on bullet points, and go into more detail than most other fine-tuned models, offering detailed facts and summaries. Overall, the answers have a noticeably different feel from other models.
The model’s creative story writing is very good, and the system-prompt version performs even better. It handles certain tests well, such as grade-school math problems, understanding and calculating correctly in simple scenarios. However, it struggles with more complex questions, such as calculating the number of deaths over multiple years, and its performance may be influenced by direct fine-tuning on specific examples.
Overall, the Yi 34B is the best open-source model that you can try right now.
Mixtral 8x7B is a foundation model introduced by Mistral AI in December 2023. It is a high-quality sparse mixture-of-experts (SMoE) model with open weights, and it surpasses Llama 2 and GPT-3.5 on various benchmarks despite its smaller parameter count. Licensed under Apache 2.0, it can be modified and commercialized with minimal restrictions.
Mixtral is a sparse mixture-of-experts network that is a decoder-only model. The feedforward block picks from a set of 8 distinct groups of parameters. At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively. This technique increases the number of parameters of a model while controlling cost and latency. Mixtral has 46.7B total parameters but only uses 12.9B parameters per token. This means that it processes input and generates output at the same speed and for the same cost as a 12.9B model.
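To make the routing mechanism concrete, here is a toy sketch of top-2 expert routing in PyTorch. The dimensions, expert design, and loop-based dispatch are purely illustrative and much simpler than Mixtral’s actual configuration; the point is that each token’s output is a weighted sum of two experts chosen by a small router.

```python
# Conceptual sketch of top-2 mixture-of-experts routing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: (num_tokens, d_model)
        scores, idx = self.router(x).topk(2, dim=-1)     # choose 2 experts per token
        weights = F.softmax(scores, dim=-1)              # renormalise their scores
        out = torch.zeros_like(x)
        for slot in range(2):                            # combine expert outputs additively
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(Top2MoE()(tokens).shape)   # torch.Size([4, 512])
```

Because only two of the eight experts run per token, the compute per token stays close to that of a much smaller dense model even though all experts’ parameters exist in memory.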
Currently, the Hugging Face Open LLM Leaderboard ranks Mixtral 8x7B among the top 10 LLMs, with an average score of 68.42 (66.04 on ARC, 86.49 on HellaSwag, 71.82 on MMLU, and 46.78 on TruthfulQA).
Although not at the level of GPT-4, Mixtral matches or outperforms GPT-3.5 on most standard benchmarks and beats Llama 2 70B on most benchmarks with 6x faster inference. Compared to Llama 2, it shows less bias on the BBQ benchmark and more positive sentiment on BOLD, with similar variance within each dimension. Against GPT-3.5, Mixtral wins on almost everything except the MT-Bench score.
Mixtral shows strong performance in code generation and science. It can be fine-tuned into an instruction-following model that achieves a score of 8.3 on MT-Bench. Additionally, Mixtral is proficient in multiple languages including English, French, Italian, German, and Spanish, making it a versatile model for various tasks and users. Mistral AI is continuously working on adding more languages to the model.
However, Mixtral is highly censored and aligned out of the box. This is a good thing if you are building a customer-facing product, but it may not be practical for overthrowing the shape-shifting lizard overlords of the New World Order.
Mixtral is the best model considering cost/performance trade-offs.
In December 2023, Microsoft introduced its newest language model, Phi-2. With only 2.7 billion parameters, it is one of the smallest yet most capable language models available. The model was trained on 1.4 trillion tokens using 96 A100 80GB GPUs, and its context length is limited to 2,048 tokens.
Currently, the Hugging Face Open LLM Leaderboard ranks Phi-2 among the top 20 LLMs, with an average score of 61.33 (61.09 on ARC, 75.11 on HellaSwag, 58.11 on MMLU, and 44.47 on TruthfulQA).
Phi-2 delivers remarkable performance for its size, outperforming Google’s Gemini Nano and Mistral’s 7-billion-parameter model, and in some cases matching models up to 25 times larger. It achieves state-of-the-art results among language models with fewer than 13 billion parameters.
Phi-2 performs well on multi-step tasks such as coding and mathematics. It is also efficient enough to run on mobile devices, and its development demonstrates the potential of small models.
Phi-2 can work through complex problems, such as a physics question that involves multiple mathematical steps. Despite being a smaller model, it produces the correct square-root calculation and the right steps to solve the problem, an impressive showing for a model with only 2.7 billion parameters.
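A quick way to try this yourself is through the transformers pipeline API; a minimal sketch, assuming the published microsoft/phi-2 checkpoint (on older transformers versions you may also need trust_remote_code=True), is shown below. The prompt is just an illustrative multi-step question, not one of Microsoft’s evaluation examples.

```python
# Sketch: try Phi-2 on a multi-step arithmetic question via the pipeline API.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/phi-2", device_map="auto")
prompt = (
    "Question: A square garden has an area of 225 square metres. "
    "How long is one side?\nAnswer:"
)
print(generator(prompt, max_new_tokens=80)[0]["generated_text"])
```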
In general, it shows that Phi-2, despite its small parameter size, is an effective and efficient model that anyone can use.
Mistral 7B is one of the earliest and most popular models from Mistral AI, released last year. It is a powerful, fully uncensored language model that outperforms larger Llama 2 models on various benchmarks. Despite its smaller size, Mistral 7B performs exceptionally well across components and benchmarks, and its compact size makes it versatile and easy to customize.
Mistral 7B excels across a wide range of benchmarks and metrics. The Hugging Face Open LLM Leaderboard currently gives it an average score of 60.97 (59.98 on ARC, 83.31 on HellaSwag, 64.16 on MMLU, and 42.15 on TruthfulQA).
Mistral 7B outperforms the 13-billion-parameter Llama 2 model on all benchmarks, and even beats the 34-billion-parameter Llama 1 model on many of them, underscoring its efficient use of parameters. It also approaches the performance of the 7-billion-parameter Code Llama on code-related tasks such as code generation and debugging, while performing well on everyday tasks like writing a Python script and on general natural language understanding and generation.
The Mistral 7B model is released under the Apache 2.0 license and can be downloaded and deployed in various environments, from local setups to popular cloud platforms. There are multiple ways to use it: through chatbots that host the model, such as Poe and HuggingChat; locally on a desktop via Text Generation Web UI or Pinokio; or through online services for users whose hardware cannot run it.
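For local use, a minimal sketch with transformers and the tokenizer’s built-in chat template might look like this, assuming the mistralai/Mistral-7B-Instruct-v0.1 checkpoint and roughly 16 GB of GPU memory in float16.

```python
# Sketch: run Mistral-7B-Instruct locally using the tokenizer's chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python script that reverses a string."}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```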
This 7 billion parameter Mistral model delivers high performance while saving memory.
Orca 2 is another open-source language model from Microsoft, released in November 2023 with up to 13 billion parameters. It builds on the success of Orca 1 and improves how smaller language models are trained. Orca 1 learned from rich signals such as explanation traces, allowing it to outperform conventional instruction-tuned models.
With Orca 2, Microsoft extends this explanation-tuning strategy, training the model on answers with detailed explanations, and the result outperforms larger models, including some with 70 billion parameters. Orca 2 comes in two sizes, 7 billion and 13 billion parameters, both created by fine-tuning on tailored, high-quality synthetic data.
Orca 2 learns from synthetic training data generated by GPT-4, enabling it to handle and solve complex scenarios, and it incorporates techniques and strategies that strengthen its problem-solving and reasoning abilities. It is not just a scaled-down imitation of GPT-4, but a sophisticated and effective model in its own right.
Orca 2 aims to teach smaller models a range of reasoning techniques, such as step-by-step reasoning, recall-then-generate, and extract-then-generate, and it carefully tailors the reasoning strategy to the task at hand using intricate prompts designed to elicit specific behaviors. A technique called prompt erasure is used to teach the smaller model to reason: the larger teacher model is shown the intricate strategy prompt, while the student model sees only the task and the resulting behavior, without visibility into the original prompt. This is what makes Orca 2 a cautious reasoner.
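To make the prompt-erasure idea concrete, here is a minimal conceptual sketch. The function name, the teacher call, and the example task are all placeholders rather than Microsoft’s actual training pipeline; the point is that the strategy prompt shown to the teacher never appears in the student’s training example.

```python
# Conceptual sketch of prompt erasure: the teacher sees the strategy prompt,
# but the student's training example keeps only the bare task and the answer.
def build_student_example(task: str, strategy_prompt: str, teacher_generate) -> dict:
    # Teacher sees the full strategic instructions plus the task...
    teacher_answer = teacher_generate(f"{strategy_prompt}\n\nTask: {task}")
    # ...but the strategy prompt is "erased" from what the student learns from.
    return {"prompt": task, "completion": teacher_answer}

example = build_student_example(
    task="A train travels 120 km in 1.5 hours. What is its average speed?",
    strategy_prompt="Solve the problem step by step, explaining each calculation.",
    teacher_generate=lambda p: "Step 1: speed = distance / time = 120 / 1.5 = 80 km/h.",
)
print(example)
```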
Orca 2’s 7-billion and 13-billion parameter models match or exceed the performance of models 5 to 10 times their size. The model can answer complex questions from a given context, and on some complex question types it is even reported to outperform GPT-4; the 13-billion chat model also compares favorably with Llama 2’s 13-billion chat model and GPT-3.5 Turbo. Despite its small parameter count, Orca 2 can reason and provide logical answers in complex scenarios.
Orca 2 is not limited to a specific type of task; it can handle a wide range of domains. It is also easier to train and set up, requiring fewer resources than larger models, and it can be used in combination with other models, highlighting its flexibility and collaboration potential.
SOLAR-10.7B is an advanced large language model with 10.7 billion parameters, designed by Upstage AI, and it has demonstrated superior performance across various natural language processing (NLP) tasks. Despite its compact size, it delivers state-of-the-art performance among models with fewer than 30 billion parameters, and many of the top entries on the Hugging Face Open LLM Leaderboard are fine-tunes built on the SOLAR-10.7B architecture.
SOLAR-10.7B achieves its performance through a technique called “depth up-scaling” (DUS), which scales up a smaller base model by merging modified copies of it rather than relying on a mixture-of-experts architecture.
The depth up-scaling process starts with a base model, a 32-layer Llama 2 architecture initialized with pre-trained weights from Mistral 7B. Two copies of this base are made; the last eight layers are removed from one and the first eight from the other, and the two resulting 24-layer stacks are concatenated into the final 48-layer SOLAR-10.7B model. The model then underwent continued pre-training on additional data, followed by two stages of fine-tuning, instruction fine-tuning and alignment tuning, with care taken to minimize contamination between the benchmark datasets and the instruction-tuning data.
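A toy illustration of that layer arithmetic is shown below, with plain integers standing in for transformer blocks; a real implementation would copy whole blocks and their weights, not indices.

```python
# Toy illustration of depth up-scaling: duplicate a 32-layer base, trim 8
# layers from the end of one copy and the start of the other, then stack the
# two 24-layer halves into a 48-layer model.
n_layers, overlap = 32, 8

base = list(range(n_layers))            # layers 0..31 of the base model
top_half = base[: n_layers - overlap]   # layers 0..23 (last 8 removed)
bottom_half = base[overlap:]            # layers 8..31 (first 8 removed)

upscaled = top_half + bottom_half       # 48 layers in total
print(len(upscaled))                    # 48
```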
On the Open LLM Leaderboard, SOLAR-10.7B achieved state-of-the-art results, outperforming much larger models, including the recent Mixtral 8x7B, and even the base version performed well. Its evaluation results demonstrate strong performance compared to other open-source models across various benchmarks, including models with significantly larger parameter counts.
Currently, the Hugging Face Open LLM Leaderboard gives SOLAR-10.7B an average score of 66.04 (61.95 on ARC, 84.6 on HellaSwag, 65.48 on MMLU, and 45.04 on TruthfulQA).
The SOLAR-10.7B model exhibits impressive capabilities: it can provide logical answers, offer perspective on moral issues, generate creative responses, and produce correct solutions to programming tasks. However, it can struggle with harder logical-reasoning challenges and give incorrect responses.
In summary, if you are looking for a computationally efficient small-size model for NLP tasks like summarization and searching, the Solar 10.7B model is currently the best option.
OpenChat is an innovative library of open-source language models fine-tuned with C-RLFT, a strategy inspired by offline reinforcement learning, delivering performance on par with ChatGPT even at the 7B scale.
The C-RLFT training method enables the models to learn from mixed-quality data, even when some of the data lacks clear preference labels, and this is what sets OpenChat apart from other models.
OpenChat 3.5 is one of the most capable generalist open models available. It performs well across coding, logic and reasoning, mathematics, and scientific reasoning, outperforming models like xAI’s Grok in most respects, and it even beats GPT-3.5 and 70-billion-parameter Llama models.
OpenChat 3.5 was recently upgraded to improve its coding ability, delivering a large jump on the HumanEval coding benchmark while maintaining, and in places improving, its performance on other benchmarks.
To access the upgraded model, users can use Together AI’s optimized inference API or visit the OpenChat playground and live demo. It can also be run locally with LM Studio, or downloaded directly so you can select the model and start chatting with OpenChat on your own desktop.
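OpenChat also provides an OpenAI-compatible API server for local deployment; assuming such a server is running (the port, endpoint path, and model name below are assumptions to check against the OpenChat documentation for your setup), a client call can look like this.

```python
# Sketch: query a locally hosted OpenChat server through an OpenAI-compatible
# endpoint. Port 18888 and the model name are assumptions; adjust to match
# your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18888/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="openchat_3.5",
    messages=[{"role": "user", "content": "Summarize what C-RLFT training is."}],
)
print(response.choices[0].message.content)
```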
DeepSeek Coder is an open-source coding model renowned for being the best in its class. True to its “let the code write itself” tagline, it was trained on two trillion tokens spanning more than 80 programming languages. It is considered superior to GPT-3.5 and performs at a level approaching GPT-4.
The model comes in various sizes, all pre-trained on 2 trillion tokens. It can serve as a replacement for GitHub Copilot and can run locally without internet access. DeepSeek Coder’s sizes range from the 1.3-billion-parameter base model up to the 33-billion-parameter instruct model; the larger the model, the more RAM it needs (the 33-billion-parameter model needs about 30 GB). It supports project-level code completion and infilling with a 16K context window.
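For a sense of how local code completion works in practice, here is a minimal sketch using the smallest variant via Hugging Face transformers; the deepseek-ai/deepseek-coder-1.3b-base repo id reflects the published checkpoint at the time of writing, and larger variants follow the same pattern with more memory.

```python
# Sketch: local code completion with a small DeepSeek Coder checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "# Python function that checks whether a number is prime\ndef is_prime(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```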
DeepSeek Coder was trained on a mix of 87% code and 13% natural language in English and Chinese, with model sizes ranging from 1.3 billion to 33 billion parameters, and it performs better than any other open-source code model.
DeepSeek Coder has undergone detailed evaluation and surpasses existing open-source code models: its 33-billion-parameter model beats the 34-billion-parameter Code Llama by 7.9% on HumanEval Python and by 9.3% on the multilingual evaluation, with full comparisons across metrics and programming languages detailed in its GitHub repository. The DeepSeek Coder base model outperforms other open-source coding models as well as GPT-3.5 Turbo, and comes very close to the performance of GPT-4.
In summary, if you’re looking for the best code generation model, we highly recommend using DeepSeek Coder.
These top 10 open-source large language models represent the latest advancements in natural language processing as of 2024. They include popular models like Llama 2 70B and the powerful coding model DeepSeek Coder, which offer exceptional capabilities across a wide range of language tasks. While these open-source LLMs cannot yet match the performance of proprietary models such as GPT-4, they are a viable alternative to models like GPT-3.5. As the field of natural language processing continues to evolve rapidly, these models are valuable resources for developers, researchers, and enthusiasts, pushing the boundaries of what is possible with AI-driven language understanding and generation. In the near future, we can expect open-source LLMs powerful enough to surpass models like GPT-4.
OpenAI’s GPT-4 is the latest and most advanced language model in the GPT series. It is rumored to be a mixture of eight expert models with roughly 220 billion parameters each, for a total of about 1.76 trillion parameters. GPT-4 is capable of complex reasoning and multitasking, and some have claimed it approaches artificial general intelligence (AGI). It can also process images as well as text, making it a significant upgrade from previous models.
Mistral 7B is a powerful language model that outperforms larger models like Llama 2 on many benchmarks despite its smaller parameter count, often succeeding where larger models fall short. Its compact size also makes it versatile and customizable.
LLaMA 2 is an open-source model that can be used for both research and commercial purposes, subject to certain conditions. If you have less than 700 million monthly active users, you can self-host the LLaMA 2 model for commercial use. However, if you plan to use it on products with over 700 million users, you need to obtain Meta’s permission to prevent competitors from misusing it.