Arctic Inference Documentation

Arctic Inference is an open-source vLLM plugin that brings Snowflake’s inference innovations to the community, delivering the fastest and most cost-effective open-source inference for LLMs and Embeddings.

Once installed, Arctic Inference automatically patches vLLM at launch to enable its optimizations, so users can keep using the familiar vLLM APIs and CLI. It’s easy to get started!

Key Features

Arctic Inference achieves high throughput and low latency through a holistic set of inference optimizations.

Advanced Parallelism

🚀 Shift Parallelism [blog]:

Dynamically switches between tensor and sequence parallelism at runtime to optimize latency, throughput, and cost — all in one deployment.

🚀 Arctic Ulysses [blog]:

Improve long-context inference latency and throughput via sequence parallelism across GPUs.

Speculative Decoding

🚀 Arctic Speculator [blog]:

Lightweight yet effective draft models based on MLP and LSTM architectures, complete with training pipelines.

🚀 Suffix Decoding [paper, blog]:

Rapid speculation for long repeated sequences, effective for coding, agents and other agentic applications.
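To make the idea of speculating on repeated sequences concrete, here is a toy sketch. It is not Arctic Inference's implementation (see the paper and blog for the actual method); it only illustrates the general principle of matching the most recent tokens against earlier text and proposing whatever followed last time. The function name `propose_draft` and its parameters are hypothetical.

```python
def propose_draft(tokens, max_suffix_len=4, num_draft_tokens=3):
    """Propose draft tokens by matching the current suffix against earlier text.

    Toy illustration only: tries the longest suffix first and, on a match,
    returns the tokens that followed the most recent earlier occurrence.
    Returns an empty list when nothing matches.
    """
    n = len(tokens)
    for suffix_len in range(min(max_suffix_len, n - 1), 0, -1):
        suffix = tokens[n - suffix_len:]
        # Scan earlier start positions, most recent first.
        for start in range(n - suffix_len - 1, -1, -1):
            if tokens[start:start + suffix_len] == suffix:
                follow = start + suffix_len
                return tokens[follow:follow + num_draft_tokens]
    return []


# The sequence "def foo" appeared before, so its old continuation is proposed.
history = ["def", "foo", "(", ")", ":", "def", "foo"]
draft = propose_draft(history)
```

In a real speculative decoding loop, the target model would then verify the proposed tokens and keep only the ones it agrees with; that verification step is omitted here.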

Model Optimization

🚀 SwiftKV [paper]:

Reduce prefill compute by early-exiting prompt tokens and reusing KV across transformer layers.

Other Optimizations

🚀 Optimized Embeddings [blog]:

Accelerate embedding performance with parallel tokenization, byte outputs, and GPU load-balanced replicas.

Getting Started

Installation

To install Arctic Inference from PyPI, use the following command:

pip install arctic-inference[vllm]

This will install the latest Arctic Inference release and a compatible vLLM version.

Alternatively, clone the Arctic Inference repository and build and install it from source:

git clone https://github.com/snowflakedb/ArcticInference.git && pip install ./ArcticInference
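After either installation route, a quick way to check what ended up in the environment is to query the installed package metadata. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical, and the distribution names `arctic-inference` and `vllm` are assumed from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Assumed distribution names, taken from the pip command above.
for name in ("arctic-inference", "vllm"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```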

Serving

The examples below enable Shift Parallelism, Speculative Decoding, and SwiftKV all at once!

vllm serve Snowflake/Llama-3.1-SwiftKV-8B-Instruct \
   --quantization "fp8" \
   --tensor-parallel-size 1 \
   --ulysses-sequence-parallel-size 2 \
   --enable-shift-parallel \
   --speculative-config '{
      "method": "arctic",
      "model":"Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
      "num_speculative_tokens": 3,
      "enable_suffix_decoding": true,
      "disable_by_batch_size": 64
   }'
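Once the server is up, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server runs locally on vLLM's default port 8000; the helper names `build_chat_request` and `chat` are hypothetical, so adjust `BASE_URL` if you pass `--host` or `--port`.

```python
import json
import urllib.request

# Assumption: the server started above is reachable locally on the default port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "Snowflake/Llama-3.1-SwiftKV-8B-Instruct"


def build_chat_request(prompt, max_tokens=800, temperature=0.0):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Write an essay about the importance of higher education."))
```

Because the optimizations are applied on the server side, the client needs no Arctic Inference-specific options.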

Offline

import vllm
from vllm import LLM, SamplingParams

# Load vLLM plugins so that Arctic Inference can patch vLLM before the engine is created.
vllm.plugins.load_general_plugins()

llm = LLM(
    model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct",
    quantization="fp8",
    tensor_parallel_size=1,
    ulysses_sequence_parallel_size=2,
    enable_shift_parallel=True,
    speculative_config={
        "method": "arctic",
        "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct",
        "num_speculative_tokens": 3,
        "enable_suffix_decoding": True,
        "disable_by_batch_size": 64,
    },
)

conversation = [
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

sampling_params = SamplingParams(temperature=0.0, max_tokens=800)

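outputs = llm.chat(conversation, sampling_params=sampling_params)

# Print the generated text for the single conversation submitted above.
print(outputs[0].outputs[0].text)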