NOTE:

  • For gated repos, you will need to provide your HF token in the box below. You can generate a new one at https://huggingface.co/settings/tokens. The token won't be stored (you can check app.py).
  • We don't take KV cache savings from sliding-window attention into account (most serving frameworks don't optimize for this anyway).
  • For Multi-head Latent Attention (MLA), used in DeepSeek-V2/V3, we calculate the compressed KV cache as intended by MLA. This might not be supported on certain framework+hardware combinations (e.g. llama.cpp, MLX), which fall back to standard Multi-head Attention (MHA). A rough sketch of the difference is shown below this list.
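
For reference, here is a minimal sketch of how the MHA/GQA and MLA cases above differ. It is not the app's actual code: it assumes the usual Hugging Face config field names (`num_hidden_layers`, `num_key_value_heads`, `hidden_size`, and `kv_lora_rank`/`qk_rope_head_dim` for MLA models), and the function name and `dtype_bytes` argument are illustrative.

```python
def kv_cache_bytes(config: dict, context_len: int, dtype_bytes: int = 2) -> int:
    """Estimate the KV cache size in bytes for one sequence of `context_len` tokens."""
    n_layers = config["num_hidden_layers"]

    if config.get("kv_lora_rank"):
        # MLA (DeepSeek-V2/V3): per token and layer, cache the compressed KV latent
        # plus the decoupled RoPE key, instead of full per-head K and V.
        per_token_per_layer = config["kv_lora_rank"] + config["qk_rope_head_dim"]
    else:
        # MHA/GQA: per token and layer, cache full K and V for every KV head.
        n_kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
        head_dim = config.get("head_dim") or config["hidden_size"] // config["num_attention_heads"]
        per_token_per_layer = 2 * n_kv_heads * head_dim

    # No sliding-window savings are applied, matching the note above.
    return n_layers * context_len * per_token_per_layer * dtype_bytes


# Example: a Llama-3-8B-like config at 8k context in fp16 -> ~1.07 GB
llama_like = {"num_hidden_layers": 32, "num_attention_heads": 32,
              "num_key_value_heads": 8, "hidden_size": 4096}
print(kv_cache_bytes(llama_like, context_len=8192) / 1e9)
```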
KV cache dtype

Model config