
MInference 1.0
Accelerating the pre-filling stage of long-context large language models
- A dynamic sparse attention method accelerates the pre-filling stage of long-context LLMs, delivering up to a 10x speedup.
- Dynamic sparse attention heads are classified into three patterns: A-shape, Vertical-Slash, and Block-Sparse, and a Kernel-Aware Sparse Pattern Search algorithm finds the optimal pattern for each head (see the sketch after this list).
- Online approximation methods and optimized GPU kernels accelerate LLM inference with minimal overhead.
- An optimized inference codebase enables 1M-token pre-filling inference with a LLaMA-style model on a single A100.
- MInference is evaluated on multiple benchmarks, including InfiniteBench, RULER, PG-19, and Needle In A Haystack, to assess LLMs' actual long-context processing capability.
- Micro-benchmarks demonstrate the performance of the three proposed attention patterns and compare them against FlashAttention.
- MInference is tested across different models and methods, including evaluation of retrieval performance at varying key-information positions and context-window lengths in the Needle In A Haystack task.
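To make the Vertical-Slash pattern concrete, here is a minimal PyTorch sketch of the online approximation step: it probes attention with only the last few queries to choose the dominant vertical columns and diagonal "slashes" for one head. The function name and the `last_q`, `top_v`, and `top_s` parameters are illustrative assumptions, not MInference's actual API; the real implementation builds these sparse indices for fused Triton/CUDA kernels rather than computing dense attention.

```python
import torch

def estimate_vertical_slash(q, k, last_q=64, top_v=128, top_s=64):
    """Hypothetical sketch: pick dominant vertical columns and diagonal
    'slash' lines for one attention head, probing with the last `last_q`
    queries only (assumes seq_len >= last_q and >= top_v, top_s).

    q, k: [seq_len, head_dim] query/key tensors for a single head.
    Returns (vertical_idx, slash_idx): key-column indices and diagonal
    offsets carrying the most attention mass; only these positions would
    be computed by the subsequent sparse attention kernel.
    """
    seq_len, head_dim = q.shape
    probe = q[-last_q:]                                    # [last_q, d]
    scores = probe @ k.T / head_dim ** 0.5                 # [last_q, seq_len]

    # Causal mask for the probe block.
    rows = torch.arange(seq_len - last_q, seq_len).unsqueeze(1)
    cols = torch.arange(seq_len).unsqueeze(0)
    scores = scores.masked_fill(cols > rows, float("-inf"))
    attn = scores.softmax(dim=-1)                          # [last_q, seq_len]

    # Vertical pattern: key positions with the largest total mass.
    vertical_idx = attn.sum(dim=0).topk(top_v).indices

    # Slash pattern: diagonals (fixed query-key offsets) with largest mass;
    # masked entries contribute zero, so clamping them to offset 0 is safe.
    offsets = (rows - cols).clamp(min=0)                   # [last_q, seq_len]
    slash_mass = torch.zeros(seq_len, dtype=attn.dtype).scatter_add_(
        0, offsets.flatten(), attn.flatten())
    slash_idx = slash_mass.topk(top_s).indices             # diagonal offsets
    return vertical_idx, slash_idx
```

Because the probe uses only `last_q` queries, index estimation costs O(last_q x seq_len) instead of the O(seq_len^2) of dense attention, which is what keeps the pattern-search overhead minimal.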
Product Details
MInference 1.0 is a sparse computing method designed to accelerate the pre-filling stage of long-sequence processing. By identifying three distinct patterns in long-context attention matrices, it applies dynamic sparse attention to long-context large language models (LLMs), accelerating the pre-filling stage for 1M-token prompts while preserving the models' capabilities, particularly retrieval ability.
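In practice, MInference is applied by patching an existing Hugging Face model. The snippet below follows the usage pattern shown in the project's README; the model name is illustrative and the current API should be confirmed against the repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from minference import MInference

# Any LLaMA-style long-context checkpoint; this name is illustrative.
model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Patch the model's attention layers with MInference's dynamic
# sparse attention kernels before running long-prompt inference.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)

inputs = tokenizer("<your long prompt>", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
```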