The Race for Open Source Generative AI Models
The race to release open source generative AI models is heating up, and Salesforce has now entered the fray with its latest offering: XGen-7B. This new large language model (LLM) is designed to support longer context windows than most currently available open source LLMs.
What sets XGen-7B apart starts with its size. The ‘7B’ in XGen-7B refers to the model’s 7 billion parameters. Broadly speaking, more parameters mean a bigger and, all else being equal, more capable model. However, it is important to note that larger models demand high-end CPUs, GPUs, RAM, and storage: the hardware has to hold and compute over billions of weights, not just during training but every time the model runs.
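As a rough back-of-the-envelope illustration (these are not official Salesforce figures), the memory needed just to hold the weights scales directly with the parameter count and the numeric precision used:

```python
# Rough estimate of the memory needed just to store XGen-7B's weights.
# Illustrative numbers, not official Salesforce requirements.
params = 7_000_000_000  # 7 billion parameters

for precision, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB")

# fp32: ~26.1 GiB, fp16/bf16: ~13.0 GiB, int8: ~6.5 GiB
# Actual runtime usage is higher: activations, the KV cache, and
# framework overhead all add to this baseline.
```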
One of XGen-7B’s key differentiators is its 8K context window. A larger context window permits a longer prompt and leaves room for longer, better-grounded responses. Note that the 8K figure refers to the cumulative size of the input and output text combined, measured in tokens (explained in the next section). In practice, this means users can supply far more context to the model and still receive detailed, comprehensive responses.
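To make the shared budget concrete, here is a minimal arithmetic sketch; the prompt size used is illustrative:

```python
# The 8K context window is a shared budget: prompt tokens plus generated
# tokens must together stay within the limit (illustrative arithmetic).
CONTEXT_WINDOW = 8192   # XGen-7B-8K's total budget for input + output

prompt_tokens = 6500    # e.g. a long document pasted in as context
max_new_tokens = CONTEXT_WINDOW - prompt_tokens
print(max_new_tokens)   # 1692 tokens remain for the model's response
```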
XGen-7B Tokens and Tokenization
Before diving deeper into XGen-7B’s features, it is worth understanding what tokens are. Machine learning models operate on numbers, not characters, so each word (or part of one) is converted into a token. Tokens are a way of encoding text, loosely analogous to ASCII or Unicode. XGen-7B uses OpenAI’s tiktoken tokenizer, the same tokenization system used by popular models such as GPT-3 and GPT-4, to turn words into tokens.
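The snippet below illustrates the idea with OpenAI’s tiktoken library; the specific encoding chosen here is for illustration, since XGen-7B bundles its own tiktoken-based tokenizer configuration:

```python
import tiktoken  # OpenAI's tokenizer library: pip install tiktoken

# Illustrative encoding choice; XGen-7B ships its own tiktoken-based
# tokenizer, but the encode/decode mechanics are the same.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("XGen-7B turns words into tokens.")
print(tokens)              # a list of integer token IDs
print(len(tokens))         # how many tokens the sentence consumed
print(enc.decode(tokens))  # decodes back to the original string
```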
XGen-7B: An Alternative to Open Source LLMs
XGen-7B emerges as a compelling alternative to other open source LLMs such as MPT, Falcon, and LLaMA. Salesforce claims that XGen-7B achieves results comparable to, or even better than, the current state-of-the-art language models of similar size.
Salesforce has released three variants of XGen-7B. The first, XGen-7B-4K-base, supports a 4K context window. The second, XGen-7B-8K-base, is trained on additional data and supports an 8K context length. Both variants are released under the Apache 2.0 open source license, which permits commercial use.
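The base checkpoints are published on the Hugging Face Hub, and loading them follows the usual transformers pattern. The sketch below is a minimal example, assuming the transformers and torch packages are installed:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is required because XGen ships a custom
# tiktoken-based tokenizer alongside the checkpoint.
model_id = "Salesforce/xgen-7b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Salesforce XGen-7B is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```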
The third variant, XGen-7B-{4K,8K}-inst, is fine-tuned on instructional data and is available for research purposes only. The tuning datasets include databricks-dolly-15k, oasst1, Baize, and GPT-related datasets. The ‘inst’ suffix indicates that the model has been trained to follow instructions by fine-tuning on these datasets. An instruction-tuned language model like XGen-7B-inst can be used to build chatbots similar to ChatGPT, as the sketch below illustrates.
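Note that the ‘### Human / ### Assistant’ prompt template used here is an assumption based on the dialogue-style tuning data; the model card should be consulted for the exact format the checkpoint expects:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Salesforce/xgen-7b-8k-inst"  # research-only license
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The dialogue template below is an assumed format, modeled on the
# Baize/OASST-style data the checkpoint was tuned on.
prompt = (
    "### Human: Explain what a context window is in two sentences.\n"
    "### Assistant:"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```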
Training and Multilingual Capabilities
Salesforce trained the XGen-7B LLM on a mix of datasets, including RedPajama and Wikipedia, along with Starcoder, a dataset of source code. The training cost is estimated at $150K for 1T tokens, based on Google Cloud pricing for TPU-v4. The training data also spans 22 different languages, making the model multilingual.
XGen-7B: Multitask Language Understanding
An impressive aspect of Salesforce’s XGen-7B is its performance on Massive Multitask Language Understanding (MMLU), a benchmark of multiple-choice questions drawn from domains such as the humanities, STEM, and the social sciences. Salesforce reports that XGen-7B outperforms comparably sized open source models in this category.
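For a sense of what the benchmark looks like, here is a hypothetical MMLU-style item; the question and phrasing are invented for illustration:

```python
# Hypothetical MMLU-style multiple-choice item (not from the benchmark).
question = "Which planet completes an orbit of the Sun the fastest?"
choices = ["A. Earth", "B. Mercury", "C. Jupiter", "D. Mars"]

prompt = (
    "The following is a multiple-choice question. "
    "Answer with A, B, C, or D.\n\n"
    f"{question}\n" + "\n".join(choices) + "\nAnswer:"
)
print(prompt)
# The model is scored on whether its next token is the correct
# letter (here: B), so no free-form grading is needed.
```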
Other Capabilities and Limitations
Beyond multitask language understanding, XGen-7B also performs well in areas such as conversation, long-form Q&A, and summarization. That said, Salesforce acknowledges that its LLM is subject to the same limitations as other LLMs, including bias, toxicity, and hallucinations.
In Conclusion
Salesforce’s XGen-7B marks a notable step forward for open source generative AI models. With its larger context window, broad mix of training datasets, and multilingual capabilities, XGen-7B holds real promise for natural language processing and conversational applications.