
DCLM-baseline
A high-performance dataset for language model benchmarking
- High-performance dataset for benchmarking language model training
- Contains 4T tokens and 3B documents, suitable for large-scale training
- Cleaned, filtered, and deduplicated to ensure data quality
- Provides a benchmark for studying language model performance
- Not suitable for model training in production environments or specialized domains
- Helps researchers understand the impact of data curation on model performance
- Promotes the research and development of efficient language models
Product Details
DCLM-baseline is a pretraining dataset for language model benchmarking, consisting of 4T tokens and 3B documents. It was extracted from the Common Crawl corpus through carefully designed data cleaning, filtering, and deduplication steps, and aims to demonstrate the importance of data curation in training efficient language models. The dataset is intended for research purposes only and is not suitable for model training in production environments or in specialized domains such as code and mathematics.
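Because a 4T-token corpus is far too large to download casually, streaming is the practical way to inspect it. Below is a minimal sketch using the Hugging Face datasets library; the repository id "mlfoundations/dclm-baseline-1.0" and the "text" field name are assumptions not stated above, so verify them against the dataset card.

```python
from datasets import load_dataset

# Stream the dataset shard-by-shard instead of materializing ~4T tokens on disk.
# Repo id is an assumption; check the Hugging Face dataset card for the exact name.
ds = load_dataset(
    "mlfoundations/dclm-baseline-1.0",
    split="train",
    streaming=True,
)

# Peek at a few documents to check fields and content quality.
for i, doc in enumerate(ds):
    print(doc.get("text", "")[:200])  # "text" field name is an assumption
    if i >= 2:
        break
```

Streaming mode iterates over remote shards lazily, which suits exploratory quality checks on a corpus this size; for actual pretraining runs, shards would typically be downloaded and sharded across workers instead.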