dolmino-mix-1124

High-quality dataset mix for the second stage of OLMo2 training.

Contains data from multiple sources, including DCLM, Flan, peS2o, and Wiki.
Organizes the data into categories such as HQ Web Pages, STEM Papers, and Encyclopedia.
Supports multiple natural language processing tasks, especially text generation.
Used for training and optimizing large language models such as OLMo2.
Contains a large volume of text, making it suitable for large-scale machine learning training.
Released under an open data license, so researchers and developers may use it subject to the license terms.

Product Details

dolmino-mix-1124 is the DOLMino dataset mix for OLMo2 stage 2 annealing training. It combines multiple high-quality datasets for the second stage of OLMo2 model training. The mix contains several types of data, including web pages, STEM papers, and encyclopedia articles, and aims to improve the model's performance on text generation tasks. Its value lies in providing a large, curated training resource for developing more accurate natural language processing models.
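
A mix like this is easier to reason about when its composition is measured. The sketch below is one way to estimate each category's share of a sample. It assumes records are dictionaries with hypothetical "source" and "text" fields, and it counts whitespace-separated words as a rough stand-in for tokens. The source-to-category mapping is an illustrative assumption built from the names listed above, not the dataset's actual schema.

```python
from collections import Counter

# Assumed mapping from source name to category, for illustration only.
# The names come from the description above; the field values are hypothetical.
SOURCE_TO_CATEGORY = {
    "dclm": "HQ Web Pages",
    "flan": "HQ Web Pages",
    "pes2o": "STEM Papers",
    "wiki": "Encyclopedia",
}


def category_share(records):
    """Return each category's fraction of the total word count."""
    totals = Counter()
    for record in records:
        # Sources missing from the mapping are grouped under "Other".
        category = SOURCE_TO_CATEGORY.get(record["source"], "Other")
        # Whitespace word count is a crude proxy for a real tokenizer.
        totals[category] += len(record["text"].split())
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: count / grand_total for category, count in totals.items()}


# Tiny in-memory sample standing in for real records.
sample = [
    {"source": "dclm", "text": "a b c d"},
    {"source": "pes2o", "text": "x y"},
    {"source": "wiki", "text": "p q"},
]
shares = category_share(sample)
```

For real data, the same function could be applied to any iterable of records, with a proper tokenizer swapped in for the word count if token-level proportions matter.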
