Recent advancements in large language models (LLMs) have primarily focused on scaling up model size. Numerous studies have showcased the emergent abilities of models with very large parameter counts, driving a trend toward ever-bigger language models. As a result, the potential of training smaller models on larger datasets has remained largely unexplored.
In recent months, a group of researchers developed ‘TinyLlama’, a compact 1.1 billion parameter model. Despite its small size, TinyLlama delivers high-quality responses with fast inference, competing with, and in some cases surpassing, larger open-source LLMs.
TinyLlama: An Overview
TinyLlama is a decoder-only Transformer model that uses the architecture and tokenizer of Llama 2 in a scaled-down form. It has 1.1 billion parameters and was trained on a massive dataset of roughly 3 trillion tokens, the first time a model of around 1 billion parameters has been trained on a dataset of that scale.
Despite its smaller size, TinyLlama demonstrates impressive performance across various tasks, surpassing comparable open-source models. Because it shares Llama 2's architecture and tokenizer, it integrates seamlessly into the many open-source projects built on Llama. And as a 1.1 billion parameter model pre-trained on 3 trillion tokens, it can run on local computers and even edge devices, making it versatile for a range of applications.
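Because TinyLlama shares Llama 2's architecture and tokenizer, it loads like any other Llama-family model. Here is a minimal sketch using the Hugging Face transformers library; the checkpoint id shown is assumed to be the chat model published under the TinyLlama organization on the Hub, and any other TinyLlama checkpoint can be substituted.

```python
# Minimal sketch: loading TinyLlama with the Hugging Face transformers library.
# The checkpoint id below is an assumption (the chat model on the Hub);
# substitute whichever TinyLlama checkpoint you prefer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(model_id)  # same tokenizer as Llama 2
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype).to(device)

prompt = "Explain in one sentence what TinyLlama is."
inputs = tokenizer(prompt, return_tensors="pt").to(device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the model only has 1.1 billion parameters, this runs comfortably on a single consumer GPU or, more slowly, on a CPU.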
Architecture & Training Methodology
The TinyLlama team has released a full paper detailing how the compact 1.1 billion parameter language model was built and pre-trained. The model was trained on approximately 3 trillion tokens in about 90 days using 16 A100-40G GPUs. TinyLlama borrows techniques from existing models, primarily Llama, and adopts the architecture and tokenizer of Llama 2, which lets it slot easily into the many open-source projects built on Llama.
TinyLlama's pre-training mixture of natural language and code was drawn from two sources: SlimPajama and StarCoderData. SlimPajama is a large open-source corpus derived from RedPajama for training language models, while StarCoderData is the dataset collected to train StarCoder; it comprises approximately 250 billion tokens across 86 programming languages and also includes GitHub issues and text-code pairs involving natural language.
Combined, the two sources provide roughly 950 billion unique tokens, and the model was trained on this mixture for approximately three epochs, which is how the total reaches about 3 trillion training tokens. Prior work has shown that repeating data for up to four epochs results in minimal performance degradation compared to using only unique data.
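As a quick sanity check on those numbers, the arithmetic below shows how roughly 950 billion unique tokens repeated for about three epochs yields the roughly 3 trillion training tokens quoted above; the figures are the approximate ones from the text, not exact dataset counts.

```python
# Back-of-the-envelope check of the training token budget described above,
# using the approximate figures quoted in the text.
unique_tokens = 950e9   # combined SlimPajama + StarCoderData mixture (approx.)
epochs = 3              # number of passes over the mixture (approx.)

total_training_tokens = unique_tokens * epochs
print(f"~{total_training_tokens / 1e12:.2f} trillion training tokens")  # ~2.85T, i.e. roughly 3T
```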
In terms of optimization, TinyLlama employs a range of advanced techniques to make training faster and more efficient. It uses rotary positional embedding (RoPE), which encodes absolute position while incorporating explicit relative position dependencies. It also applies RMSNorm for pre-normalization, normalizing the input to each Transformer sub-layer to improve training efficiency. Finally, TinyLlama incorporates FlashAttention, achieving a throughput of 24,000 tokens per second per A100-40G GPU, which translates to 56% model FLOPs utilization without activation checkpointing.
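To make those components concrete, here is a simplified PyTorch sketch of RMSNorm and of applying rotary positional embeddings. It illustrates the ideas under simplifying assumptions (no per-head splitting, no cached rotation angles) and is not the exact implementation used in Llama 2 or TinyLlama.

```python
# Simplified sketch of two components named above: RMSNorm and rotary
# positional embeddings (RoPE). Illustrative only, not the production code.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization, used for pre-normalization in Llama-style models."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the inverse RMS of the features; unlike LayerNorm, no mean
        # is subtracted, which makes the operation cheaper.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * inv_rms)


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to a (batch, seq_len, dim) tensor.

    Each pair of channels is rotated by an angle proportional to the token's
    position, encoding absolute position while making attention scores depend
    on relative offsets.
    """
    _, seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE expects an even feature dimension"
    half = dim // 2

    # Per-pair rotation frequencies and per-position angles.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()  # each of shape (seq_len, half)

    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


# Example: normalize and position-encode a dummy batch of hidden states.
hidden = torch.randn(2, 8, 64)   # (batch, seq_len, dim)
hidden = RMSNorm(64)(hidden)
hidden = apply_rope(hidden)
print(hidden.shape)              # torch.Size([2, 8, 64])
```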
TinyLlama vs. Other Models
TinyLlama demonstrates competitive performance compared to existing open-source language models of similar size, outperforming both OPT-1.3B and Pythia-1.4B on various downstream tasks.
Although a 1.1 billion parameter model cannot be expected to surpass far larger models such as ChatGPT or the bigger Llama variants, TinyLlama's average scores are competitive with most open-source models in its class. It performs well across most categories, reaching an average score of around 60 on commonsense reasoning tasks, which is impressive for a model of this size.
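For a rough, hands-on comparison (not a reproduction of the paper's benchmark suite), the sketch below computes perplexity on the same snippet of text for TinyLlama and Pythia-1.4B. The checkpoint ids are assumed Hugging Face Hub names, and any comparable 1B-class model can be swapped in.

```python
# Illustrative comparison only, not the paper's benchmark suite: compute
# perplexity of two small causal LMs on the same text. The checkpoint ids are
# assumed Hub names; swap in any comparable 1B-class model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean cross-entropy loss;
        # exponentiating it gives perplexity.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))


sample = (
    "Large language models can run on modest hardware "
    "when they are small enough."
)
for model_id in ("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "EleutherAI/pythia-1.4b"):
    print(model_id, round(perplexity(model_id, sample), 2))
```

Lower perplexity on held-out text is only a crude proxy for downstream quality, but it is an easy way to poke at small models side by side on a laptop.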
Benefits of TinyLlama
Given TinyLlama's small size and its training on 3 trillion tokens, it offers some specific benefits compared to both similarly sized and larger LLMs:
- Resource Efficiency: TinyLlama's small footprint allows it to be used in applications with limited compute and memory, making it a practical option for users without high-end GPUs (see the rough memory estimate after this list).
- Mobile Application Enablement: With its compact architecture and promising performance, TinyLlama can power end-user applications on mobile devices, offering a lightweight platform for testing innovative ideas related to language models.
- Seamless Integration: Adopting the same architecture and tokenizer as Llama 2 enables easy integration into various open-source projects built upon Llama.
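To put the resource-efficiency point in numbers, the estimate below shows roughly how much memory the weights of a 1.1 billion parameter model need at different precisions; activations and the KV cache add to this, so treat it as a lower bound.

```python
# Rough weight-memory estimate for a 1.1B-parameter model at different precisions.
# Activations, the KV cache, and framework overhead are not included, so these
# figures are a lower bound on real memory use.
params = 1.1e9
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{dtype:>9}: ~{gib:.1f} GiB of weights")

# At fp16 the weights come to roughly 2 GiB, which fits comfortably on
# consumer GPUs and many mobile-class devices.
```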
The model can handle basic tasks such as real-time machine translation and dialogue generation with reasonable accuracy, without requiring an internet connection. It also runs on a local computer and, thanks to its small size, offers fast inference.
The creation of TinyLlama represents a significant breakthrough in language model research. With its compact architecture, promising performance, and versatility, TinyLlama can enable a wide range of applications across devices. As the focus shifts toward smaller models, TinyLlama has the potential to steer future language model research toward more efficient and practical solutions, and expectations are high for the impact it will have in shaping the landscape of future LLMs.