NOTE:

  • For gated repos, you will need to provide your HF token in the box below. You can generate a new one at https://huggingface.co/settings/tokens. The token won't be stored (you can check app.py).
  • We don't take KV cache savings from sliding-window attention into account (most serving frameworks don't optimize for this anyway).
  • For Multi-head Latent Attention (MLA), used in DeepSeek-V2/V3, we calculate the compressed KV cache as intended by MLA. This might not be supported on certain framework+hardware combinations (e.g. llama.cpp, MLX), which fall back to standard Multi-head Attention (MHA). A rough sketch of the difference is shown below this list.
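
For reference, here is a minimal sketch of how the MHA/GQA and MLA cases above differ. It is not the app's actual code: it assumes the usual Hugging Face config field names (`num_hidden_layers`, `num_key_value_heads`, `hidden_size`, and `kv_lora_rank`/`qk_rope_head_dim` for MLA models), and the function name and `dtype_bytes` argument are illustrative.

```python
def kv_cache_bytes(config: dict, context_len: int, dtype_bytes: int = 2) -> int:
    """Estimate the KV cache size in bytes for one sequence of `context_len` tokens."""
    n_layers = config["num_hidden_layers"]

    if config.get("kv_lora_rank"):
        # MLA (DeepSeek-V2/V3): per token and layer, cache the compressed KV latent
        # plus the decoupled RoPE key, instead of full per-head K and V.
        per_token_per_layer = config["kv_lora_rank"] + config["qk_rope_head_dim"]
    else:
        # MHA/GQA: per token and layer, cache full K and V for every KV head.
        n_kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
        head_dim = config.get("head_dim") or config["hidden_size"] // config["num_attention_heads"]
        per_token_per_layer = 2 * n_kv_heads * head_dim

    # No sliding-window savings are applied, matching the note above.
    return n_layers * context_len * per_token_per_layer * dtype_bytes


# Example: a Llama-3-8B-like config at 8k context in fp16 -> ~1.07 GB
llama_like = {"num_hidden_layers": 32, "num_attention_heads": 32,
              "num_key_value_heads": 8, "hidden_size": 4096}
print(kv_cache_bytes(llama_like, context_len=8192) / 1e9)
```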
KV cache dtype

Model config