Producers and Consumers
Producers: Workloads with surplus GPU memory available for allocation.
Consumers: Workloads that require additional GPU memory and offload excess data.
After identifying the producer and consumer GPUs, notify the coordinator of the mapping via its REST API. First, start the coordinator and navigate to http://127.0.0.1:8080/docs to access the Swagger UI. The /mapping endpoint supports both adding and deleting producer-consumer pairings. All GPU identifiers supplied to these APIs must be the physical IDs of the GPUs.
In this scenario, GPUs 2 and 3 serve as consumers, while GPUs 0 and 1 act as producers. The following JSON payloads register these mappings with the coordinator:
{ "producer": 0, "consumer": 2 }
{ "producer": 1, "consumer": 3 }
The vLLM server with CFS can be launched using:
CUDA_VISIBLE_DEVICES=2,3,0,1 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--swap-space 20 \
--disable-log-requests \
--enforce-eager \
--enable-chunked-prefill \
--max-num-batched-tokens 512 \
--max-num-seqs 512 \
--disable-sliding-window \
--load-format dummy \
--gpu-memory-utilization 0.9 \
--max-model-len 4096 \
--enable-cfs \
--enable-aqua-swap \
--port 9000
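Once the server is up, it can be sanity-checked through vLLM's OpenAI-compatible API; note that responses are meaningless here because --load-format dummy skips loading real weights:

curl http://127.0.0.1:9000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{ "model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "prompt": "Hello", "max_tokens": 16 }'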
The benchmark script, located in the benchmarks directory of vLLM, can be executed as follows:
python3 benchmark_cfs.py \
--backend openai \
--model meta-llama/Meta-Llama-3.1-70B-Instruct \
--request-rate 0.25 \
--num-prompts 100 \
--dataset-name dummy \
--long-prompts 0 \
--long-prompt-len 32000 \
--port 9000
Note that CUDA_VISIBLE_DEVICES=2,3,0,1 exposes GPUs 0 and 1 (configured as producers) alongside the tensor-parallel consumers on GPUs 2 and 3. Because the consumers are listed first, they map to logical devices 0 and 1 inside the vLLM process, while the producer GPUs remain visible for memory sharing. Producer servers are started with:
CUDA_VISIBLE_DEVICES=0 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--swap-space 32 \
--disable-log-requests \
--enforce-eager \
--enable-chunked-prefill \
--max-num-batched-tokens 512 \
--max-num-seqs 512 \
--disable-sliding-window \
--load-format dummy \
--gpu-memory-utilization 0.9 \
--max-model-len 64000 \
--be-producer \
--producer-req-gb 30 \
--port 9001
CUDA_VISIBLE_DEVICES=1 \
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--tensor-parallel-size 1 \
--swap-space 32 \
--disable-log-requests \
--enforce-eager \
--enable-chunked-prefill \
--max-num-batched-tokens 512 \
--max-num-seqs 512 \
--disable-sliding-window \
--load-format dummy \
--gpu-memory-utilization 0.9 \
--max-model-len 64000 \
--be-producer \
--producer-req-gb 30 \
--port 9002
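With both producers running, vLLM's standard /health route can be used to confirm each server is serving:

curl http://127.0.0.1:9001/health
curl http://127.0.0.1:9002/health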
To run FlexGen on GPU 1 as a consumer and Stable Diffusion on GPU 0 as a producer, restart the coordinator and reconfigure the mappings via the /mapping API.
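Following the payload format shown earlier, the new pairing would be:

{ "producer": 0, "consumer": 1 }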
After cloning the FlexGen repository with Aqua support (https://github.com/aquaml/flexgen), execute:
CUDA_VISIBLE_DEVICES=1,0 numactl --membind=0 \
python3 -m flexgen.flex_opt \
--model facebook/opt-30b \
--path _DUMMY_ \
--prompt-len 8192 \
--gen-len 64 \
--percent 100 0 50 50 100 0 \
--gpu-batch-size 1 \
--num-gpu-batches 1
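For reference, FlexGen's --percent flag takes six integers: the GPU and CPU percentages for weights, KV cache, and activations, in that order. The values above keep weights and activations entirely on the GPU while splitting the KV cache evenly between GPU and CPU.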
A vision-model producer can be launched on GPU 0 using the Aqua vision code (https://github.com/aquaml/vision), after installing diffusers:
CUDA_VISIBLE_DEVICES=0 \
python3 generate.py \
-m sd \
-d 10 \
-b 16