Getting vLLM with CFS

Aqua provides a full implementation of CFS in vLLM's core scheduler. You can install this version of vLLM by following the instructions at https://github.com/aquaml/vllm.
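The repository's README is the authoritative reference for installation. As a rough sketch only, assuming the fork is installed from source the same way upstream vLLM is (these steps are not taken from this page):

# Hypothetical source install of Aqua's vLLM fork; defer to the repository's own instructions.
git clone https://github.com/aquaml/vllm.git
cd vllm
pip install -e .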

You can run vLLM with just Aqua's CFS, or additionally enable the fast preemption provided by Aqua-lib. Follow the instructions below for the former, and see the Experiments page for the latter. The --enable-cfs flag switches vLLM's scheduler to CFS. Note that you need to adjust --swap-space based on the size of the burst you need to absorb; a rough sizing sketch follows.
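As a rough guide to sizing --swap-space, you can estimate the KV-cache footprint of a preempted sequence. The model dimensions below (32 layers, 8 KV heads, head dimension 128, fp16) come from the public Llama-3.1-8B config rather than from this page, so treat the numbers as illustrative:

# KV-cache bytes per token = 2 (K and V) * layers * KV heads * head dim * bytes per element
echo $((2 * 32 * 8 * 128 * 2))          # 131072 bytes, i.e. 128 KiB per token
echo $((2 * 32 * 8 * 128 * 2 * 4096))   # ~0.5 GiB per full 4096-token sequence

With --swap-space 20 (GiB per GPU) this leaves room to swap out roughly 40 fully grown 4096-token sequences; scale it to match the burst you expect.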

CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --swap-space 20 \
  --disable-log-requests \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-num-batched-tokens 512 \
  --max-num-seqs 512 \
  --disable-sliding-window \
  --load-format dummy \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --enable-cfs
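Once the server is up, you can check that it is serving requests through the OpenAI-compatible API (port 8000 is vLLM's default; note that --load-format dummy loads random weights, so the generated text itself will be meaningless):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'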
