Arctic Ulysses
Arctic Ulysses is a sequence parallelism technique designed to improve inference performance for large language models (LLMs) on long-context inputs. Unlike traditional tensor parallelism (TP), which partitions model computation across GPUs and incurs significant inter-GPU communication overhead, Arctic Ulysses partitions the input sequence itself. This approach reduces time-to-first-token (TTFT) latency and enhances throughput efficiency, particularly for tasks like retrieval-augmented generation (RAG), summarization, and code generation.
By leveraging all-to-all communication for attention computation, Arctic Ulysses minimizes communication overhead and maintains a favorable communication-to-compute ratio as the number of GPUs scales. In evaluations, Arctic Ulysses achieved up to 6.8x lower latency and 1.5x higher throughput compared to TP-only configurations, without requiring multiple specialized deployments.
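The all-to-all step mentioned above can be illustrated with a small, dependency-free simulation. This is only a sketch under stated assumptions, not the Arctic Inference implementation: it assumes the usual Ulysses-style layout change, in which each rank starts with a contiguous slice of the sequence for all attention heads and ends with the full sequence for a slice of the heads. The function names `make_shards` and `all_to_all_seq_to_head` are hypothetical, and the "tensors" are nested lists of `(token, head)` labels so the data movement is easy to follow.

```python
# Toy simulation of a sequence-to-head all-to-all (assumed Ulysses-style layout).
# All names here are illustrative; no real communication library is used.

def make_shards(seq_len, num_heads, world_size):
    """Split a sequence across ranks: shards[rank][local_token][head]."""
    assert seq_len % world_size == 0, "sequence must divide evenly across ranks"
    per_rank = seq_len // world_size
    return [
        [[(t, h) for h in range(num_heads)]
         for t in range(r * per_rank, (r + 1) * per_rank)]
        for r in range(world_size)
    ]

def all_to_all_seq_to_head(shards):
    """Re-shard from 'slice of tokens, all heads' to 'all tokens, slice of heads'."""
    world_size = len(shards)
    num_heads = len(shards[0][0])
    assert num_heads % world_size == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world_size
    out = []
    for dst in range(world_size):
        lo, hi = dst * heads_per_rank, (dst + 1) * heads_per_rank
        full_sequence = []
        # Visiting source ranks in order restores the original token order.
        for src in range(world_size):
            for token in shards[src]:
                full_sequence.append(token[lo:hi])
        out.append(full_sequence)
    return out

shards = make_shards(seq_len=4, num_heads=4, world_size=2)
resharded = all_to_all_seq_to_head(shards)
for rank, data in enumerate(resharded):
    print(f"rank {rank}: {len(data)} tokens x {len(data[0])} heads")
```

After the exchange, every rank can run attention over the whole sequence for its own heads without further communication; a second all-to-all in the opposite direction would restore the sequence-sharded layout.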
For more details, refer to the Snowflake blog post.
Usage with Arctic Inference
When launching vLLM, specifying both --tensor-parallel-size and
--ulysses-sequence-parallel-size automatically enables the Arctic Ulysses
optimization. The following example serves the
meta-llama/Llama-3.3-70B-Instruct model across 8 GPUs, combining a
tensor-parallel size of 4 with a sequence-parallel size of 2 (4 × 2 = 8):
export ARCTIC_INFERENCE_ENABLED=1
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --ulysses-sequence-parallel-size 2
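When the parallel sizes are chosen by a script rather than typed by hand, a small helper can check the GPU budget before launching. The sketch below is illustrative only: `build_launch_command` and `launch_env` are hypothetical helpers, and the rule that the job needs tensor-parallel size × sequence-parallel size GPUs is an assumption drawn from the 8-GPU example above. It uses only the flags and the environment variable shown in that command.

```python
import os
import shlex

def build_launch_command(model, tensor_parallel, sequence_parallel, gpus_available):
    """Return the argv list for the server, or raise if the GPU budget is too small."""
    # Assumption: GPUs required = TP size * SP size, as in the 4 x 2 = 8 example.
    needed = tensor_parallel * sequence_parallel
    if needed > gpus_available:
        raise ValueError(
            f"TP={tensor_parallel} x SP={sequence_parallel} needs {needed} GPUs, "
            f"but only {gpus_available} are available"
        )
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--ulysses-sequence-parallel-size", str(sequence_parallel),
    ]

def launch_env():
    """Copy the current environment and switch Arctic Inference on."""
    env = dict(os.environ)
    env["ARCTIC_INFERENCE_ENABLED"] = "1"
    return env

cmd = build_launch_command(
    "meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel=4,
    sequence_parallel=2,
    gpus_available=8,
)
print(shlex.join(cmd))  # printable form of the command, for logging or copy-paste
```

To actually start the server from Python, the list and environment could be passed to `subprocess.run(cmd, env=launch_env())`.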