
DCLM-7B
A 7 billion parameter language model that demonstrates the effectiveness of systematic data curation techniques.
- Uses a decoder-only Transformer architecture for autoregressive text generation.
- Supports English (primarily).
- Trained with the AdamW optimizer at a peak learning rate of 2e-3.
- Trained on the DCLM-Baseline dataset combined with StarCoder and ProofPile2, a data pool of 4.1T tokens.
- Evaluated on a broad suite of tasks, including MMLU, HellaSwag, and Jeopardy.
- Provides detailed training and evaluation results so users can assess model performance; a minimal loading sketch follows this list.
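As a rough illustration of how the released weights might be loaded for inference, the sketch below assumes the checkpoint is published on the Hugging Face Hub and is compatible with the transformers AutoModel interface; the repository ID apple/DCLM-7B and the sampling settings are illustrative assumptions rather than details taken from this page, so check the official model card for exact loading instructions.

```python
# Minimal inference sketch (assumption: the checkpoint is hosted on the
# Hugging Face Hub under "apple/DCLM-7B" and loads via AutoModelForCausalLM).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("apple/DCLM-7B")
model = AutoModelForCausalLM.from_pretrained("apple/DCLM-7B")

# Generate a short English continuation (decoder-only, autoregressive).
inputs = tokenizer("Data curation matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```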
Product Details
DCLM-Baseline-7B is a 7 billion parameter language model developed by the DataComp for Language Models (DCLM) team and trained primarily on English text. The model aims to show how systematic data curation improves language model performance. Training used the PyTorch and OpenLM frameworks with the AdamW optimizer, a peak learning rate of 2e-3, weight decay of 0.05, a batch size of 2048 sequences, a sequence length of 2048 tokens, and a total budget of 2.5T training tokens. Training was performed on H100 GPUs.
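To make the quoted hyperparameters concrete, the following is a minimal PyTorch sketch of the optimizer setup and the step count implied by the token budget. The placeholder module and the cosine schedule are assumptions for illustration only; the actual run used the OpenLM training stack, and its exact warmup and decay settings are not stated here.

```python
import torch

# Hyperparameters stated in the product details above.
PEAK_LR = 2e-3
WEIGHT_DECAY = 0.05
BATCH_SIZE = 2048      # sequences per global batch
SEQ_LEN = 2048         # tokens per sequence
TOTAL_TOKENS = 2.5e12  # total training token budget

# Placeholder module standing in for the 7B-parameter transformer (assumption).
model = torch.nn.Linear(10, 10)

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

# Number of optimizer steps implied by the token budget and batch shape:
# 2.5e12 / (2048 * 2048) is roughly 596k steps.
steps = int(TOTAL_TOKENS / (BATCH_SIZE * SEQ_LEN))
print(f"~{steps:,} training steps")

# A cosine decay from the peak learning rate is a common choice for this kind
# of run, but the exact schedule used for DCLM-7B is an assumption here.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=steps)
```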