Producers and Consumers

Producers: Workloads with surplus GPU memory available for allocation to other workloads.

Consumers: Workloads that need additional GPU memory and offload excess data to producer GPUs.


Informing the Coordinator

After identifying producer and consumer GPUs, the coordinator must be notified of the mapping via its REST API. First, start the coordinator and navigate to http://127.0.0.1:8080/docs to access the Swagger UI. The /mapping endpoint supports both addition and deletion of producer-consumer pairings. All GPU identifiers supplied to these APIs must correspond to the physical IDs of the GPUs.


Examples

CFS Consumer

In this scenario, GPUs 2 and 3 serve as consumers, while GPUs 0 and 1 act as producers. The following JSON payloads register these mappings with the coordinator:

{ "producer": 0, "consumer": 2 }
{ "producer": 1, "consumer": 3 }

The vLLM server with CFS can be launched using:

CUDA_VISIBLE_DEVICES=2,3,0,1 \
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --swap-space 20 \
  --disable-log-requests \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-num-batched-tokens 512 \
  --max-num-seqs 512 \
  --disable-sliding-window \
  --load-format dummy \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --enable-cfs \
  --enable-aqua-swap \
  --port 9000
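
Once the server is up, a quick request against its OpenAI-compatible completions endpoint is a convenient sanity check (with --load-format dummy the weights are random, so the generated text is meaningless); for example:

curl http://127.0.0.1:9000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{ "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 8 }'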

The benchmark script, located in the benchmarks directory of vLLM, can be executed as follows:

python3 benchmark_cfs.py \
  --backend openai \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --request-rate 0.25 \
  --num-prompts 100 \
  --dataset-name dummy \
  --long-prompts 0 \
  --long-prompt-len 32000 \
  --port 9000

Note that CUDA_VISIBLE_DEVICES=2,3,0,1 lists the consumer GPUs (2 and 3) first, so vLLM assigns the tensor-parallel ranks to them, while also exposing the producer GPUs (0 and 1) to the consumer process. Producer servers are started with:

CUDA_VISIBLE_DEVICES=0 \
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --swap-space 32 \
  --disable-log-requests \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-num-batched-tokens 512 \
  --max-num-seqs 512 \
  --disable-sliding-window \
  --load-format dummy \
  --gpu-memory-utilization 0.9 \
  --max-model-len 64000 \
  --be-producer \
  --producer-req-gb 30 \
  --port 9001
CUDA_VISIBLE_DEVICES=1 \
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --swap-space 32 \
  --disable-log-requests \
  --enforce-eager \
  --enable-chunked-prefill \
  --max-num-batched-tokens 512 \
  --max-num-seqs 512 \
  --disable-sliding-window \
  --load-format dummy \
  --gpu-memory-utilization 0.9 \
  --max-model-len 64000 \
  --be-producer \
  --producer-req-gb 30 \
  --port 9002

FlexGen Consumer

To run FlexGen on GPU 1 as a consumer and Stable Diffusion on GPU 0 as a producer, restart the coordinator and reconfigure the mappings via the /mapping API.
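
Following the same payload format as in the previous example, this scenario needs a single pairing; a minimal sketch, again assuming a POST to the coordinator's /mapping endpoint:

curl -X POST http://127.0.0.1:8080/mapping \
  -H 'Content-Type: application/json' \
  -d '{ "producer": 0, "consumer": 1 }'

After cloning the FlexGen repository with Aqua support (https://github.com/aquaml/flexgen), execute: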

CUDA_VISIBLE_DEVICES=1,0 numactl --membind=0 \
python3 -m flexgen.flex_opt \
  --model facebook/opt-30b \
  --path _DUMMY_ \
  --prompt-len 8192 \
  --gen-len 64 \
  --percent 100 0 50 50 100 0 \
  --gpu-batch-size 1 \
  --num-gpu-batches 1
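
For reference, the six values passed to FlexGen's --percent flag are the GPU and CPU percentages for the model weights, the attention (KV) cache, and the activations, in that order; in this command the weights and activations stay entirely on the GPU while the cache is split evenly between the GPU and offload memory.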

A vision-model producer can be launched on GPU 0 using the Aqua vision code (https://github.com/aquaml/vision), after installing diffusers:

CUDA_VISIBLE_DEVICES=0 \
python3 generate.py \
  -m sd \
  -d 10 \
  -b 16
