Description
Component
Configuration Explorer
Desired use case or feature
In vLLM, there are various choices for controlling the precision of the model. To start off, consider the three arguments: `--dtype`, `--kv-cache-dtype`, and `--quantization`.
`--dtype`: https://docs.vllm.ai/en/latest/configuration/engine_args.html#-dtype
- Controls the precision used for loading model weights and activations.
- `auto`: uses FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models.
- The default is determined from `torch_dtype` in `config.json`.
`--quantization`: https://docs.vllm.ai/en/latest/configuration/engine_args.html#-quantization-q
- This probably takes precedence over `--dtype`.
- Options include `bitsandbytes`, `fp8`, and `quark`. It looks like the supported methods are model- and hardware-specific.
`--kv-cache-dtype`: the KV cache can be quantized to reduce its memory footprint.
- Options include `auto` (unquantized, defaults to the model data type), `fp8`, `fp8_e4m3`, and `fp8_e5m2`.
- Given this, the KV cache dtype appears to be independent of the weight dtype: one can quantize the KV cache to fp8 while the model weights remain unquantized.
Proposed solution
The Capacity Planner should accept a quantization setting for model weights and activations. Currently, the model-weight calculation uses the dtype exposed by the safetensors metadata, which can differ across parameters; when loaded into vLLM, however, the weights all use the same precision (whatever `--dtype` resolves to). Furthermore, allow users to configure `--kv-cache-dtype`. Finally, update the generated `vllm serve` command accordingly. A sketch of the calculation is given below.
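A minimal sketch of how the planner might account for both settings. The byte widths, function names, and example model shape here are assumptions for illustration, not the Capacity Planner's actual API:

```python
# Hypothetical helpers: estimate memory from the chosen --dtype and --kv-cache-dtype,
# rather than trusting per-tensor dtypes reported by the safetensors metadata.
BYTES_PER_DTYPE = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "fp8": 1.0,
    "fp8_e4m3": 1.0,
    "fp8_e5m2": 1.0,
}

def estimate_weight_bytes(num_params: int, dtype: str) -> float:
    """Weights are loaded at a single precision determined by --dtype / --quantization."""
    return num_params * BYTES_PER_DTYPE[dtype]

def estimate_kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_seq_len: int,
    max_batch_size: int,
    kv_cache_dtype: str,
) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes/element."""
    bytes_per_elem = BYTES_PER_DTYPE[kv_cache_dtype]
    return 2 * num_layers * num_kv_heads * head_dim * max_seq_len * max_batch_size * bytes_per_elem

# Example: an 8B-parameter model served with bf16 weights and an fp8 KV cache.
weights = estimate_weight_bytes(8_000_000_000, "bfloat16")
kv = estimate_kv_cache_bytes(32, 8, 128, 8192, 8, "fp8")
print(f"weights ~= {weights / 1e9:.1f} GB, kv cache ~= {kv / 1e9:.1f} GB")
```

The point of the sketch is that weight memory and KV cache memory are computed from two independent dtype choices, matching how `--dtype` and `--kv-cache-dtype` behave in vLLM.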
Alternatives
No response
Additional context or screenshots
No response