Aqua provides a full implementation of CFS in vLLM's core scheduler. You can install this version of vLLM by following the instructions at https://github.com/aquaml/vllm.
You can run vLLM with just Aqua's CFS, or additionally enable the fast preemption provided by Aqua-lib. Follow the instructions below for the former; visit the experiments section for the latter. The --enable-cfs
flag switches vLLM's scheduler to CFS. Note that you need to size --swap-space according to the request burst you want to absorb.
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--swap-space 20 \
--disable-log-requests \
--enforce-eager \
--enable-chunked-prefill \
--max-num-batched-tokens 512 \
--max-num-seqs 512 \
--disable-sliding-window \
--load-format dummy \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--enable-cfs
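Once the server is up, you can exercise it through vLLM's OpenAI-compatible completions endpoint. Below is a minimal sketch using curl, assuming the server listens on the default port 8000; note that because the command above uses --load-format dummy, the model weights are randomly initialized, so responses are only useful for observing scheduler behavior, not for their content.

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "San Francisco is a",
        "max_tokens": 32
      }'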