Config Explorer should consider --dtype and --kv-cache-dtype #420

@jgchn

Description

Component

Configuration Explorer

Desired use case or feature

In vLLM, there are various choices for controlling the precision of the model. To start off, consider the three arguments: --dtype, --kv-cache-dtype, and --quantization.

--dtype: https://docs.vllm.ai/en/latest/configuration/engine_args.html#-dtype

  • Controls the precision used for loading model weights and activations
  • auto: will use FP16 precision for FP32 and FP16 models, and BF16 precision for BF16 models
  • The default is auto; the precision is then determined from torch_dtype in config.json (see the sketch after this list)
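
For illustration, a minimal sketch of the auto resolution rule described above (hypothetical helper, not vLLM's internal API):

```python
# Sketch of how `--dtype auto` could resolve the effective precision from the
# torch_dtype recorded in config.json. Hypothetical helper, not vLLM code.
def resolve_dtype(cli_dtype: str, config_torch_dtype: str) -> str:
    if cli_dtype != "auto":
        return cli_dtype                      # an explicit --dtype wins
    if config_torch_dtype == "bfloat16":
        return "bfloat16"                     # BF16 checkpoints stay BF16
    return "float16"                          # FP32 and FP16 checkpoints run in FP16

print(resolve_dtype("auto", "float32"))       # float16
print(resolve_dtype("bfloat16", "float32"))   # bfloat16
```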

--quantization: https://docs.vllm.ai/en/latest/configuration/engine_args.html#-quantization-q

  • This probably takes precedence over --dtype.
  • Options include bitsandbytes, fp8, and quark. Availability appears to be model- and hardware-specific.

--kv-cache-dtype: KV cache can be quantized to reduce memory footprint.

  • Options include auto (unquantized; defaults to the model data type), fp8, fp8_e4m3, and fp8_e5m2
  • The KV cache dtype appears to be independent of the weight dtype: one can use an fp8 KV cache while the model weights remain unquantized (see the sketch after this list)
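
To make that independence concrete, here is a rough sketch (hypothetical helper, illustrative model shape) of per-token KV cache size as a function of --kv-cache-dtype alone:

```python
# Per-token KV cache footprint depends only on the KV cache dtype, not on the
# weight dtype. Hypothetical helper with an illustrative 8B-class model shape.
KV_BYTES_PER_ELEMENT = {"float16": 2, "bfloat16": 2,
                        "fp8": 1, "fp8_e4m3": 1, "fp8_e5m2": 1}

def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, kv_cache_dtype: str) -> int:
    # 2x for the key and value tensors kept at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * KV_BYTES_PER_ELEMENT[kv_cache_dtype]

# Same model, two KV cache dtypes: BF16 weights can pair with an fp8 KV cache.
print(kv_cache_bytes_per_token(32, 8, 128, "bfloat16"))  # 131072 bytes/token
print(kv_cache_bytes_per_token(32, 8, 128, "fp8"))       # 65536 bytes/token
```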

Proposed solution

The Capacity Planner should accept a quantization/precision setting for model weights and activations. Currently, the model-weight calculation uses the dtype exposed by the safetensors metadata, which can differ from parameter to parameter. When loaded into vLLM, however, the weights all use a single precision (whatever --dtype resolves to). The planner should also let users configure --kv-cache-dtype. Finally, the generated vllm serve command should include these arguments.
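
A rough sketch of what this could look like (all names below are hypothetical, not the Capacity Planner's existing API): size the weights from one uniform --dtype instead of per-parameter safetensors dtypes, size the KV cache from --kv-cache-dtype, and surface both flags in the generated command:

```python
# Hypothetical sketch of the proposed Capacity Planner inputs and outputs;
# none of these names come from the existing codebase.
from dataclasses import dataclass

BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2, "fp8": 1}

@dataclass
class PrecisionConfig:
    dtype: str = "bfloat16"       # maps to vLLM's --dtype
    kv_cache_dtype: str = "auto"  # maps to vLLM's --kv-cache-dtype

def weight_bytes(total_params: int, cfg: PrecisionConfig) -> int:
    # One uniform precision for every weight, rather than summing the
    # per-parameter dtypes reported by safetensors metadata.
    return total_params * BYTES_PER_ELEMENT[cfg.dtype]

def serve_command(model: str, cfg: PrecisionConfig) -> str:
    cmd = f"vllm serve {model} --dtype {cfg.dtype}"
    if cfg.kv_cache_dtype != "auto":
        cmd += f" --kv-cache-dtype {cfg.kv_cache_dtype}"
    return cmd

cfg = PrecisionConfig(dtype="bfloat16", kv_cache_dtype="fp8")
print(weight_bytes(8_000_000_000, cfg))  # 16000000000 bytes, roughly 16 GB of weights
print(serve_command("meta-llama/Llama-3.1-8B-Instruct", cfg))
# vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16 --kv-cache-dtype fp8
```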

Alternatives

No response

Additional context or screenshots

No response

Metadata

Labels

help wanted (Extra attention is needed)