Databricks launches Dolly 2.0, an open-source, instruction-following LLM

Ali Ghodsi, CEO, Databricks

Databricks, the lakehouse company, announced the release of Dolly 2.0, the world’s first open-source, instruction-following large language model (LLM), fine-tuned on a human-generated instruction dataset licensed for commercial use.

This follows the March 2023 release of the original Dolly, an LLM trained for less than USD$30 to exhibit ChatGPT-like human interactivity (also known as instruction-following).

Dolly 2.0 is a 12B-parameter language model based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset crowdsourced among Databricks employees.

Databricks is open sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This enables any organisation to create, own, and customise powerful LLMs that can talk to people without paying for API access or sharing data with third parties.

“Dolly 2.0 is a game changer as it enables all organisations around the world to build their own bespoke models for their particular use cases to automate things and make processes much more productive in the field they’re in. With Dolly 2.0, any organisation can create, own, and customise a powerful LLM to create a competitive advantage for their business,” said Ali Ghodsi, CEO, Databricks.

Creating the databricks-dolly-15k dataset

databricks-dolly-15k contains 15,000 high-quality, human-generated prompt/response pairs specifically designed for instruction tuning large language models. Under the licensing terms for databricks-dolly-15k (Creative Commons Attribution-ShareAlike 3.0 Unported License), anyone can use, modify, or extend this dataset for any purpose, including commercial applications.
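To make the idea of a prompt and response pair concrete, here is a minimal Python sketch of how one such record might be represented and flattened into a single training string for instruction tuning. The class name, the field names (instruction, context, response), the section markers and the helper function are all hypothetical, chosen for illustration; they are not taken from the announcement and should not be read as the published dataset schema or as the format Databricks used for training.

```python
from dataclasses import dataclass


@dataclass
class InstructionRecord:
    """One prompt/response pair. Field names are illustrative assumptions."""
    instruction: str      # the prompt written by a person
    response: str         # the answer written by a person
    context: str = ""     # optional reference text the prompt refers to


def format_for_tuning(record: InstructionRecord) -> str:
    """Join a record into one text block (hypothetical template)."""
    parts = ["### Instruction:", record.instruction.strip()]
    # Only include a context section when the record actually has one.
    if record.context.strip():
        parts += ["", "### Context:", record.context.strip()]
    parts += ["", "### Response:", record.response.strip()]
    return "\n".join(parts)


if __name__ == "__main__":
    example = InstructionRecord(
        instruction="Suggest three names for a bakery.",
        response="Rise and Shine, The Daily Loaf, Crumb and Co.",
    )
    print(format_for_tuning(example))
```

A model fine-tuned on many strings of this kind learns to continue the text after the response marker, which is one common way instruction-following behaviour is taught.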

This dataset was created to address the limitations of existing well-known instruction-following models that prohibit commercial use due to their training data. It is the world’s first open-source, human-generated instruction dataset specifically designed to make large language models exhibit the magical interactivity of ChatGPT.

databricks-dolly-15k was authored by over 5,000 Databricks employees during March and April 2023. These training records are natural, expressive and designed to represent a wide range of behaviours, from brainstorming and content generation to information extraction and summarisation.

Press release received by email