Skip to main content

Experimental: Parallel ACompletions

parallel_acompletions lets you fire multiple Router.acompletion requests concurrently with a simple list, handling retries/fallbacks per underlying call exactly as the normal router would. A sister iterator interface streams results as soon as each finishes.

Statusโ€‹

  • Flag gated (OFF by default).
  • API surface, naming, and result shape may change.
  • Enable explicitly before relying in production.

Enablingโ€‹

export LITELLM_ENABLE_PARALLEL_ACOMPLETIONS=1

Basic Usageโ€‹

import os, asyncio
os.environ["LITELLM_ENABLE_PARALLEL_ACOMPLETIONS"] = "1"

from litellm import Router
from litellm.router_utils.parallel_acompletion import RouterParallelRequest

router = Router(model_list=[
{
"model_name": "gpt35",
"litellm_params": {"model": "gpt-3.5-turbo", "api_key": "..."}
}
])

requests = [
RouterParallelRequest(model="gpt35", messages=[{"role":"user","content":"Hello"}]),
RouterParallelRequest(model="gpt35", messages=[{"role":"user","content":"Tell me a joke"}]),
]

async def main():
results = await router.parallel_acompletions(requests, concurrency=4, preserve_order=True)
for r in results:
if r.exception:
print("ERR:", r.exception)
else:
print(r.response)

asyncio.run(main())

Iterator Form (completion order)โ€‹

async for result in router.iter_parallel_acompletions(requests, concurrency=4):
...

Behavior & Error Semanticsโ€‹

  • return_exceptions=True (default)

    • Each item in the results has either response or exception set.
    • Iterator form yields all results; you handle result.exception per item.
  • return_exceptions=False (fail-fast)

    • parallel_acompletions(...) raises on the first error and cancels remaining tasks.
    • iter_parallel_acompletions(...) raises on the first error and stops iteration; any outstanding tasks are cancelled.
  • Concurrency

    • A bounded semaphore limits in-flight calls at the orchestration layer.
    • For very large request lists, tasks are scheduled but concurrency limits actual in-flight calls.

Flag Gating Notesโ€‹

  • The feature is controlled by LITELLM_ENABLE_PARALLEL_ACOMPLETIONS and evaluated on import.
  • If toggling the env var at runtime, restart the process (or reload modules) to apply.

Argumentsโ€‹

ParamDescription
requestsList of RouterParallelRequest (model, messages, optional kwargs)
concurrencyMax in-flight tasks at orchestration layer (default 8)
return_exceptionsIf True (default) errors captured per-result; if False first error cancels all
preserve_orderFor parallel_acompletions: if True, final list matches input order

Result Objectโ€‹

Each result is a RouterParallelResult:

{
"index": 0,
"request": RouterParallelRequest(...),
"response": <ModelResponse or custom>,
"exception": Optional[Exception]
}

Error Handlingโ€‹

  • With return_exceptions=True you get a result entry containing exception.
  • With False, the first exception aborts remaining tasks (standard asyncio.gather propagation).

Why Flag Gated?โ€‹

To allow fast iteration on:

  • Naming (parallel_acompletions, iter_parallel_acompletions)
  • Result shape (object vs tuple)
  • Performance tuning (fair queueing, cancellation semantics)

Provide feedback in the GitHub issue / PR.

Roadmap Ideasโ€‹

  • Per-request timeouts.
  • Integrated progress callbacks.
  • Batch adaptive concurrency.
  • Streaming passthrough merge (aggregate token usage).