SwiftKV

SwiftKV is an inference optimization technique designed to reduce compute overhead during the prefill phase of large language model (LLM) inference, particularly when processing long input prompts. It introduces a method called SingleInputKV, which allows later transformer layers to reuse key-value (KV) pairs computed by earlier layers, eliminating redundant computation.

This technique improves throughput and reduces latency without modifying model weights. In benchmarks with models like Llama 3.1 70B, SwiftKV reduced prefill computation by up to 50%, offering a practical performance gain for serving LLMs efficiently.

You can read more about SwiftKV in the Snowflake blog post and the arXiv paper.

Usage with Arctic Inference

To use SwiftKV with Arctic Inference, select a SwiftKV model that has been fine-tuned with SwiftKV in ArcticTraining. We have publically released SwiftKV models for Meta’s Llama-3 series of models on Hugging Face.

Loading one of these models will automatically enable the SwiftKV optimization. For example, to load the meta-llama/Llama-3.3-70B-Instruct model with SwiftKV, you would select the Snowflake/Llama-3.3-SwiftKV-70B-Instruct model:

export ARCTIC_INFERENCE_ENABLED=1

python -m vllm.entrypoints.openai.api_server \
    --model Snowflake/Llama-3.3-SwiftKV-70B-Instruct \
    --tensor-parallel-size 8

Training SwiftKV-Compatible Models

If your favorite model is not already available as a SwiftKV-compatible model, you can fine-tune it with ArcticTraining to make it compatible with SwiftKV.

To get started, refer to our provided examples for how we fine-tuned the Llama-3 and Qwen-2.5 models with SwiftKV