Transformers documentation

Continuous batching


Continuous batching maximizes GPU utilization by dynamically rescheduling the batch at every generation step. As requests finish, new ones join immediately instead of waiting for the whole batch to complete. The GPU stays full and throughput stays high.
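The scheduling effect can be seen in a toy simulation (illustrative only, not the library's actual scheduler): with a batch capacity of 2 and requests needing different numbers of decode steps, static batching waits for the slowest request in each batch, while continuous batching refills a freed slot on the very next step.

```python
def static_batching_steps(lengths, batch_size):
    """Each static batch runs until its slowest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """A slot freed by a finished request is refilled on the next step."""
    pending = list(lengths)
    active = []  # remaining decode steps per in-flight request
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [512, 32, 32]  # decode steps each request needs
print(static_batching_steps(lengths, batch_size=2))      # 544: short requests hold a batch open
print(continuous_batching_steps(lengths, batch_size=2))  # 512: bounded by the longest request
```

With continuous batching, total time is bounded by the longest request instead of the sum of each batch's slowest member.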

For production deployments, use transformers serve. It builds on ContinuousBatchingManager and exposes an OpenAI-compatible HTTP endpoint.

generate_batch

Continuous batching is supported through generate_batch(). Pass a list of tokenized prompts and get back results for all of them when they’re done. generate_batch handles scheduling internally and blocks until all requests are complete.

For serving and streaming use cases, use ContinuousBatchingManager directly to manage requests.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import ContinuousBatchingConfig, GenerationConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    attn_implementation="flash_attention_2",
    device_map="cuda",
    dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

prompts = [
    "What's up?",
    "Name a cat breed.",
    "Write a detailed history of quantum mechanics.",
]
inputs = [tokenizer.encode(p) for p in prompts]

generation_config = GenerationConfig(
    max_new_tokens=64,
    eos_token_id=tokenizer.eos_token_id,
)

outputs = model.generate_batch(inputs=inputs, generation_config=generation_config)

for request_id, output in outputs.items():
    text = tokenizer.decode(output.generated_tokens, skip_special_tokens=True)
    print(f"[{request_id}] {text}")

ContinuousBatchingManager

ContinuousBatchingManager runs a background thread and lets you submit requests and retrieve results independently. Every generation step, it checks for finished requests and schedules new ones to join the batch. This is useful for streaming, real-time serving, or submitting requests as they arrive.

Use continuous_batching_context_manager() to start and stop the manager safely. The example below uses variable-length inputs. As soon as the shortest prompt completes, it leaves the batch while the longer prompts continue generating. With static batching, you'd have to pad them all to the same length and wait for the longest one. Continuous batching frees the finished request's slot so the next prompt can start immediately.

with model.continuous_batching_context_manager(generation_config=generation_config) as manager:
    manager.add_request(
        input_ids=tokenizer.encode("Write a detailed history of quantum mechanics."),
        request_id="long",
        max_new_tokens=512,
    )
    manager.add_request(
        input_ids=tokenizer.encode("What's up?"),
        request_id="short_0",
        max_new_tokens=32,
    )
    manager.add_request(
        input_ids=tokenizer.encode("Name a cat breed."),
        request_id="short_1",
        max_new_tokens=32,
    )

    for result in manager:
        text = tokenizer.decode(result.generated_tokens, skip_special_tokens=True)
        print(f"[{result.request_id}] {text}")

You could also call init_continuous_batching() to manage the lifecycle yourself.

manager = model.init_continuous_batching(generation_config=generation_config)
manager.start()

# submit and retrieve requests...

Call ContinuousBatchingManager.stop() to terminate the manager.

manager.stop()

Adding requests

add_request() submits a single request. Provide a request_id or let the manager generate one automatically.

manager.add_request(input_ids=input_ids, request_id="my_request")

add_requests() submits a batch at once. It sorts inputs automatically to maximize prefix cache hits when block sharing is enabled.

manager.add_requests(inputs=inputs)
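The idea behind the sorting is that requests sharing a prefix should sit next to each other so consecutive prefills can reuse the same KV cache blocks. A plain lexicographic sort on the token ids achieves this grouping; the sketch below illustrates the idea and is not the manager's internal heuristic.

```python
def shared_prefix_len(a, b):
    """Number of leading token ids two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


inputs = [
    [1, 2, 3, 99],  # shares [1, 2, 3] with the third request
    [7, 8],
    [1, 2, 3, 42],
]

# Lexicographic sort places sequences with a common prefix adjacently.
ordered = sorted(inputs)
print(ordered)  # [[1, 2, 3, 42], [1, 2, 3, 99], [7, 8]]
print(shared_prefix_len(ordered[0], ordered[1]))  # 3 tokens of reusable prefix
```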

Cancel a request with cancel_request().

manager.cancel_request(request_id="my_request")

Retrieving results

Iterate over the manager to receive results as they arrive.

for result in manager:
    print(tokenizer.decode(result.generated_tokens, skip_special_tokens=True))

get_result() fetches the next result from the output queue. Pass a request_id to filter for a specific request. If the next result in the queue doesn’t match, it’s requeued and the method returns None.

# next available result
result = manager.get_result()

# filter for a specific request
result = manager.get_result(request_id="my_request")
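The filter-and-requeue semantics can be mimicked with a plain queue.Queue. This is a sketch of the behavior described above, not the manager's actual implementation:

```python
import queue


def get_next_result(output_queue, request_id=None):
    """Return the next result, or only one matching request_id.

    A non-matching result is put back on the queue and None is returned,
    so results for other requests are not lost.
    """
    try:
        result = output_queue.get_nowait()
    except queue.Empty:
        return None
    if request_id is not None and result["request_id"] != request_id:
        output_queue.put(result)  # requeue for a later consumer
        return None
    return result


q = queue.Queue()
q.put({"request_id": "a", "text": "hello"})
print(get_next_result(q, request_id="b"))  # None: result for "a" is requeued
print(get_next_result(q))                  # {'request_id': 'a', 'text': 'hello'}
```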

Streaming

Set streaming=True on a request, then use request_id_iter() to iterate over partial outputs as tokens are generated.

from transformers.generation.continuous_batching import RequestStatus

manager.add_request(input_ids=input_ids, request_id="streamed", streaming=True)

for chunk in manager.request_id_iter(request_id="streamed"):
    token = tokenizer.decode(chunk.generated_tokens[-1:], skip_special_tokens=True)
    print(token, end="", flush=True)
    if chunk.status == RequestStatus.FINISHED:
        break

ContinuousBatchingConfig

ContinuousBatchingConfig controls the KV cache, scheduling, CUDA graphs, memory usage, and more. Pass it alongside GenerationConfig to customize continuous batching.

By default, num_blocks and max_batch_tokens are inferred automatically from available GPU memory. Use the table below to help you pick the appropriate features.

| Feature | Memory | Throughput | Latency |
|---|---|---|---|
| `max_memory_percent` / `block_size` | ✓ controls KV budget | | |
| Scheduler | | ✓ scheduling policy | ✓ TTFT |
| CUDA graphs | ↑ graph storage | | ✓ less dispatch overhead |
| Async batching | ↑ ~2× I/O buffers | ✓ overlaps CPU/GPU | |
| Prefix caching | ↓ shared KV blocks | ✓ skips redundant prefill | ✓ TTFT |
| Paged attention | ↓ no fragmentation | ✓ dynamic batch membership | |
| Sliding window | ↓ bounded KV per layer | | |

from transformers.generation import ContinuousBatchingConfig

cb_config = ContinuousBatchingConfig(
    max_memory_percent=0.8,   # fraction of free GPU memory to use for the KV cache
    block_size=256,           # KV cache block size in tokens
    scheduler_type="fifo",    # "fifo" or "prefill_first"
)

outputs = model.generate_batch(
    inputs=inputs,
    generation_config=generation_config,
    continuous_batching_config=cb_config,
)

Log probabilities

Generation returns each generated token's log probability when return_logprobs=True is set on ContinuousBatchingConfig. This is useful for reinforcement learning, where logprobs are an input to some training loops.

cb_config = ContinuousBatchingConfig(return_logprobs=True)

outputs = model.generate_batch(
    inputs=inputs,
    generation_config=generation_config,
    continuous_batching_config=cb_config,
)

for request_id, output in outputs.items():
    for token_id, log_prob in zip(output.generated_tokens, output.logprobs):
        token = tokenizer.decode([token_id])
        print(f"{token} | logprob: {log_prob}")

CUDA graphs

CUDA graphs eliminate CPU dispatch overhead by recording the GPU execution graph once and replaying it for batches with matching shapes. Enable them explicitly with use_cuda_graph=True.

cb_config = ContinuousBatchingConfig(use_cuda_graph=True)

When active, the manager pads query and KV lengths to fixed intervals so shapes repeat and recorded graphs can be reused. Smaller values of q_padding_interval_size and kv_padding_interval_size waste less compute on padding, but they produce more unique shapes for the graph to record and store, which costs more memory.

cb_config = ContinuousBatchingConfig(
    use_cuda_graph=True,
    q_padding_interval_size=64,
    kv_padding_interval_size=16384,
    max_cached_graphs=32,
)
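The padding rule itself is simple rounding: each length is rounded up to the next multiple of its interval, collapsing many raw lengths onto a few recorded shapes. A sketch with made-up query lengths:

```python
import math


def pad_to_interval(length, interval):
    """Round a sequence length up to the next multiple of `interval`."""
    return math.ceil(length / interval) * interval


# With an interval of 64, these raw query lengths...
raw_q_lengths = [3, 17, 64, 65, 100, 128]
padded = [pad_to_interval(n, 64) for n in raw_q_lengths]

print(padded)            # [64, 64, 64, 128, 128, 128]
print(len(set(padded)))  # 2 unique shapes -> only 2 graphs to record
```

A smaller interval would pad less (less wasted compute) but yield more unique shapes, and therefore more graphs to record and store.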

Async batching

Async batching overlaps CPU scheduling of the next batch with GPU computation of the current one. It requires CUDA graphs and roughly doubles the VRAM used for input tensors.

cb_config = ContinuousBatchingConfig(
    use_cuda_graph=True,
    use_async_batching=True,
)

Prefix caching

When multiple requests share a common prefix, like a system prompt, the manager reuses their KV cache blocks instead of recomputing them. This is enabled by default and requires all model layers to use full attention (it’s automatically disabled for sliding window models).

cb_config = ContinuousBatchingConfig(
    allow_block_sharing=True,  # default
)
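Block-level sharing can be pictured by hashing each complete block of the prompt together with everything before it, so a matching hash implies the entire prefix matches. This is only an illustration of the idea; the real implementation shares actual KV cache blocks.

```python
def block_hashes(token_ids, block_size=4):
    """Hash each *complete* block cumulatively with its whole prefix."""
    hashes = []
    for end in range(block_size, len(token_ids) + 1, block_size):
        hashes.append(hash(tuple(token_ids[:end])))
    return hashes


system = [10, 11, 12, 13, 14, 15, 16, 17]  # shared system prompt (2 blocks)
req_a = system + [1, 2, 3, 4]
req_b = system + [5, 6, 7, 8]

# Cache the blocks computed for the first request...
cache = {h: f"block_{i}" for i, h in enumerate(block_hashes(req_a))}

# ...and count how many the second request can reuse.
hits = sum(h in cache for h in block_hashes(req_b))
print(hits)  # 2: both system-prompt blocks are reused, only the last differs
```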

Paged attention

Continuous batching requires a paged attention backend. Set attn_implementation when loading the model. If you load a model with a non-paged backend ("flash_attention_2"), the "paged|" prefix is added automatically when continuous batching starts.

| Backend | `attn_implementation` | Requirements |
|---|---|---|
| FlashAttention | `paged\|flash_attention_2` | `flash-attn` package |
| SDPA (PyTorch native) | `paged\|sdpa` | None |
| Eager | `paged\|eager` | None |

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    attn_implementation="paged|flash_attention_2",
    device_map="cuda",
    dtype=torch.bfloat16,
)
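Why paged attention avoids fragmentation is easiest to see in a toy block allocator: each request's KV cache is a list of fixed-size blocks drawn from a shared free pool, so blocks released by a finished request are immediately reusable by any newcomer, with no need for contiguous memory. This is a sketch of the concept, not the library's allocator.

```python
class ToyBlockAllocator:
    """Fixed-size KV blocks from a shared pool; no contiguous layout needed."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # request_id -> list of block indices

    def allocate(self, request_id, num_tokens, block_size=16):
        needed = -(-num_tokens // block_size)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("out of KV blocks")
        self.tables[request_id] = [self.free.pop() for _ in range(needed)]

    def release(self, request_id):
        # Freed blocks return to the pool, usable by any future request.
        self.free.extend(self.tables.pop(request_id))


alloc = ToyBlockAllocator(num_blocks=8)
alloc.allocate("long", num_tokens=100)  # needs 7 blocks of 16 tokens
alloc.allocate("short", num_tokens=10)  # needs 1 block; pool is now empty
alloc.release("short")                  # its block is reusable at once
alloc.allocate("next", num_tokens=16)   # fits immediately, no fragmentation
print(len(alloc.free))  # 0
```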

Sliding window attention

Models with sliding window attention (Mistral, Gemma 2) work with continuous batching. To manually configure a sliding window for fine-tuning or custom experiments, set it in the model config before loading.

from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("google/gemma-2-2b")
config.sliding_window = 4096

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    config=config,
    attn_implementation="paged|sdpa",
    device_map="cuda",
    dtype=torch.bfloat16,
)

Prefix caching is disabled automatically when sliding window attention is active.
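The memory effect of a sliding window is easy to quantify: a sliding-window layer only keeps KV entries for the most recent sliding_window tokens, while a full-attention layer's cache grows with the sequence. A sketch with made-up numbers:

```python
def kv_tokens_cached(seq_len, sliding_window=None):
    """Tokens a layer must keep KV entries for."""
    if sliding_window is None:
        return seq_len  # full attention: grows with the sequence
    return min(seq_len, sliding_window)  # sliding window: bounded


print(kv_tokens_cached(100_000))                       # 100000
print(kv_tokens_cached(100_000, sliding_window=4096))  # 4096
```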
