Arctic Inference documentation
Arctic Inference is a new library from Snowflake AI Research that contains current and future LLM inference optimizations developed at Snowflake. It is integrated with vLLM v0.8.4 using vLLM’s custom plugin feature, allowing us to develop and integrate inference optimizations quickly into vLLM and make them available to the community.
Once installed, Arctic Inference automatically patches vLLM to use Arctic Ulysses and other optimizations implemented in Arctic Inference, and users can continue to use their familiar vLLM APIs and CLI. It’s easy to get started!
Key Features
Optimized Generative AI
- 🚀 Shift Parallelism:
Dynamically switches between tensor and sequence parallelism at runtime to optimize latency, throughput, and cost, all in one deployment
- 🚀 Arctic Ulysses:
Improves long-context inference latency and throughput via sequence parallelism across GPUs
- 🚀 Speculative & Suffix Decoding:
Boosts generation speed by drafting tokens with a small model and verifying them in bulk
- 🚀 SwiftKV:
Reduces compute during prefill by reusing key-value pairs across transformer layers
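To make the draft-and-verify idea behind speculative decoding concrete, here is a minimal toy sketch in plain Python. It is not the Arctic Inference or vLLM implementation: the two "models" are hypothetical next-token functions invented for illustration, verification is greedy (exact match), and the target model is called in a loop where a real engine would verify all drafted positions in one batched forward pass. Suffix decoding is not covered.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def toy_target_next(context: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft_next(context: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """Draft k tokens, keep those the target agrees with, add one target token."""
    # 1. Draft k tokens autoregressively with the cheap model.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    # 2. Verify drafted tokens against the target model, left to right.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: replace it with the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft was accepted: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new_tokens: int) -> List[Token]:
    """Repeat speculative steps until max_new_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    print(generate([1], toy_draft_next, toy_target_next, k=4, max_new_tokens=6))
```

With greedy verification the output matches what the target model would produce on its own; the benefit is that each step can emit several tokens for a single round of target-model verification.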
Optimized Embeddings
- 🚀 Optimized Embeddings:
Accelerates embedding performance with parallel tokenization, byte outputs, and GPU load-balanced replicas
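As a rough illustration of why byte outputs can help, the sketch below packs an embedding vector into raw little-endian float32 bytes instead of a text list of numbers. This is an assumption about what "byte outputs" refers to, not the Arctic Inference wire format; the helper names `embedding_to_bytes` and `embedding_from_bytes` are hypothetical.

```python
import struct
from typing import List


def embedding_to_bytes(vec: List[float]) -> bytes:
    """Pack a vector as little-endian float32 values (4 bytes per dimension)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def embedding_from_bytes(buf: bytes) -> List[float]:
    """Inverse of embedding_to_bytes; assumes len(buf) is a multiple of 4."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}f", buf))


if __name__ == "__main__":
    vec = [0.5, -1.25, 3.0]  # values chosen to be exactly representable in float32
    payload = embedding_to_bytes(vec)
    print(len(payload), embedding_from_bytes(payload))
```

A fixed 4 bytes per dimension avoids formatting and parsing decimal strings on both ends, which is the general motivation for returning embeddings as bytes.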
Quick Start
To get started with Arctic Inference, check out the quick start guide.
Table of Contents
Optimized Generative AI
Optimized Embeddings