
ReDrafter
Innovative technology for accelerating LLM inference on NVIDIA GPUs
- Speculative decoding: uses an RNN draft model and a dynamic tree attention mechanism to accelerate LLM token generation (a minimal sketch of the approach follows this list).
- Performance improvement: on open-source models, ReDrafter can generate up to 3.5 tokens per generation step, compared with one token per step for standard auto-regressive decoding.
- TensorRT-LLM integration: Apple collaborated with NVIDIA to integrate ReDrafter into the TensorRT-LLM framework, extending its support for complex models and decoding methods.
- Lower latency: by improving inference efficiency, ReDrafter significantly reduces the latency users experience when interacting with LLMs.
- Lower cost: reduced GPU usage and energy consumption translate into lower computing costs.
- Open-source model support: ReDrafter works with multiple open-source LLMs, broadening the technology's reach and range of applications.
- Easy to deploy: ML developers can readily apply ReDrafter to production LLM applications and benefit from the acceleration.
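The loop below is a minimal, self-contained sketch of the greedy speculative-decoding idea behind ReDrafter: a cheap draft model proposes several tokens, and the target LLM verifies them, so output quality is unchanged while several tokens can be accepted per step. The functions `draft_next` and `target_next` are toy stand-ins invented for this illustration, not Apple's RNN draft head or a real LLM.

```python
# Toy sketch of greedy speculative decoding (illustrative only; not Apple's code).

VOCAB_SIZE = 100   # toy vocabulary of 100 token ids
DRAFT_LEN = 4      # draft tokens proposed per generation step


def target_next(context):
    """Stand-in for the target LLM: the expensive model whose output must be matched."""
    return (sum(context) * 31 + len(context)) % VOCAB_SIZE


def draft_next(context):
    """Stand-in for the RNN draft model: cheap and mostly agrees with the target."""
    guess = target_next(context)
    return (guess + 1) % VOCAB_SIZE if len(context) % 5 == 0 else guess


def speculative_step(context):
    """One generation step: draft several tokens, then verify them with the target.

    Every accepted token equals what the target model would have produced, so
    quality is preserved; the speedup comes from verifying a whole draft in
    what would be a single batched target-model pass in a real system.
    """
    # 1. The draft model proposes DRAFT_LEN tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(DRAFT_LEN):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. Accept the longest prefix the target agrees with; on the first
    #    mismatch, take the target's own token instead and stop.
    accepted, ctx = [], list(context)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token after a fully accepted draft
    return accepted


if __name__ == "__main__":
    context = [1, 2, 3]
    for step in range(3):
        new_tokens = speculative_step(context)
        context.extend(new_tokens)
        print(f"step {step}: accepted {len(new_tokens)} tokens -> {new_tokens}")
```

Each step thus emits between one and DRAFT_LEN + 1 tokens, which is where figures such as "up to 3.5 tokens per generation step" come from.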
Product Details
ReDrafter is a novel speculative decoding method that combines an RNN draft model with dynamic tree attention to significantly improve the inference speed of large language models (LLMs) on NVIDIA GPUs. By accelerating token generation, it reduces the latency users experience while also cutting GPU usage and energy consumption. ReDrafter was developed by Apple's machine learning research team and, in collaboration with NVIDIA, has been integrated into the NVIDIA TensorRT-LLM inference acceleration framework, giving machine learning developers on NVIDIA GPUs faster token generation.
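To make the dynamic tree attention component more concrete, here is an illustrative sketch (not Apple's or TensorRT-LLM's implementation): when the draft model proposes several candidate continuations that share prefixes, they can be merged into a token tree and verified in a single target-model pass, using an attention mask in which each node attends only to itself and its ancestors. The helpers `build_token_tree` and `tree_attention_mask` are hypothetical names chosen for this example.

```python
# Illustrative token-tree construction and tree attention mask (not production code).

def build_token_tree(candidates):
    """Merge candidate sequences (lists of token ids) into a prefix tree.

    Returns a flat list of (token, parent_index) nodes; parent_index -1
    denotes the root, i.e. the current context.
    """
    nodes = []            # (token, parent_index)
    children = {-1: {}}   # parent_index -> {token: node_index}
    for seq in candidates:
        parent = -1
        for tok in seq:
            if tok not in children[parent]:
                nodes.append((tok, parent))
                idx = len(nodes) - 1
                children[parent][tok] = idx
                children[idx] = {}
            parent = children[parent][tok]
    return nodes


def tree_attention_mask(nodes):
    """mask[i][j] is True iff node i may attend to node j (itself or an ancestor)."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j][1]
    return mask


if __name__ == "__main__":
    # Three drafted continuations sharing the prefix [7, 3]:
    candidates = [[7, 3, 9], [7, 3, 4], [7, 8]]
    nodes = build_token_tree(candidates)
    print(nodes)  # 5 unique tree nodes instead of 8 drafted tokens
    for row in tree_attention_mask(nodes):
        print(["x" if v else "." for v in row])
```

Deduplicating shared prefixes means fewer positions to score, and the ancestor-only mask keeps each candidate branch causally consistent while all branches are verified together.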