Aqua's documentation

Aqua improves the responsiveness of LLM serving engines such as vLLM during traffic bursts. Such bursts are usually handled by autoscaling the number of replicas serving the LLM and redirecting the excess traffic to the new replicas. Because spinning up new VMs or containers takes several minutes, every user whose prompt arrived in the burst stares at an empty screen with no response until autoscaling completes. Aqua addresses this problem by designing and implementing the first Completely Fair Scheduler for LLM serving, which efficiently time-shares the GPUs across all outstanding prompts so that every user experiences a responsive service.
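The core of completely fair scheduling is picking, at each step, the prompt that has received the least GPU time so far. The sketch below illustrates that idea with hypothetical names (`FairScheduler`, `pick_next`, `charge` are illustrative, not Aqua's actual API); Aqua's real scheduler is described in the paper.

```python
import heapq

class FairScheduler:
    """Minimal completely-fair scheduling sketch: the runnable prompt with
    the smallest accumulated GPU time (its virtual runtime) runs next.
    Names and structure are illustrative, not Aqua's actual implementation."""

    def __init__(self):
        self._heap = []      # (vruntime, prompt_id), ordered by vruntime
        self._vruntime = {}  # prompt_id -> accumulated GPU time

    def add_prompt(self, prompt_id):
        # New prompts start at the current minimum vruntime so they are
        # served promptly without starving prompts already in flight.
        start = min(self._vruntime.values(), default=0.0)
        self._vruntime[prompt_id] = start
        heapq.heappush(self._heap, (start, prompt_id))

    def pick_next(self):
        # Pop the prompt that has received the least GPU time so far.
        _, prompt_id = heapq.heappop(self._heap)
        return prompt_id

    def charge(self, prompt_id, gpu_time):
        # Account the time slice the prompt just consumed and requeue it.
        self._vruntime[prompt_id] += gpu_time
        heapq.heappush(self._heap, (self._vruntime[prompt_id], prompt_id))
```

Under this policy a prompt that has consumed a large slice of GPU time automatically yields to prompts that have received less, which is what keeps every user's stream moving during a burst.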

Preemption is a fundamental mechanism required to implement CFS. Aqua implements fast preemption by identifying unused GPU HBM within a scale-up domain, such as an NVLink fabric, and using it as the preemption destination, reducing preemption overheads by more than 5x! For more details, please refer to our paper here - Paper
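The destination-selection idea above can be sketched as a simple policy: prefer spare HBM on a peer GPU reachable over the fast scale-up fabric, and fall back to host memory only when no peer has room. This is a hypothetical illustration (the function name and fallback behavior are assumptions, not Aqua's actual code).

```python
def choose_preemption_target(kv_bytes, peer_free_hbm):
    """Illustrative policy sketch, not Aqua's implementation: spill a
    preempted prompt's state to unused HBM on a peer GPU in the same
    scale-up (e.g. NVLink) domain when possible, since the fabric is far
    faster than the PCIe path to host memory.

    kv_bytes: size of the state to evict, in bytes.
    peer_free_hbm: dict mapping peer GPU id -> free HBM bytes.
    Returns ("peer_hbm", gpu_id) or ("host_memory", None).
    """
    # Try the peer with the most free HBM first to spread evictions out.
    for gpu, free in sorted(peer_free_hbm.items(), key=lambda kv: -kv[1]):
        if free >= kv_bytes:
            return ("peer_hbm", gpu)
    # No peer has room: fall back to the slow path through host memory.
    return ("host_memory", None)
```

Keeping evicted state inside the scale-up domain is what makes preemption cheap enough to run a fair scheduler at fine granularity.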

