Multi-Key Load Balancing in LiteLLM: A Complete Guide with Google Gemini, OpenRouter, Groq, and Cerebras

Why Multi-Key Load Balancing Matters

If you’re running LiteLLM as an AI proxy with more than one API key ‖ whether it’s from Google Gemini, OpenRouter, Groq, or Cerebras ‖ you’ve probably run into rate limits. One key hits its quota, requests fail, and your AI service goes down. The solution is elegant: register the same model multiple times with different keys, and LiteLLM handles the rest.

At Jerith AI, we run 23 models across 9+ providers via LiteLLM proxy. Multiple keys per model mean zero downtime from rate limits and better throughput. This guide walks you through exactly how multi-key load balancing works in LiteLLM setup, with real configuration examples you can adapt for your own setup.

How LiteLLM Multi-Key Load Balancing Actually Works

The concept is deceptively simple. In your proxy-config.yaml, you define the same model multiple times under the same model_name but with a different api_key for each entry. LiteLLM treats them as identical backends and automatically round-robins requests across all available keys.

When one key hits a rate limit or returns an error, LiteLLM seamlessly retries with the next key. No manual intervention, no dropped requests ‖ your client applications never see the failover happen.

Here’s what LiteLLM does behind the scenes:

Round-robin distribution: Evenly spreads requests across all registered keys for a given model alias
Automatic failover: If a key returns a 429 or 5xx error, LiteLLM instantly retries with the next available key
Cooldown tracking: Temporarily backs off from keys that are rate-limited until the cooldown window expires
Sticky sessions: Optional ‖ you can pin a client session to a specific key for consistency (useful for streaming)

Environment Variables: Organizing Multiple Keys

Before we get into the model list, you need to declare your API keys. The cleanest approach is using environment variables in the top section of your config file. This keeps secrets organized and makes key rotation straightforward.

Here’s how we organize keys at Jerith AI using Google Gemini as the primary example:

environment_variables:
  # Google Gemini ‖ 2 keys
  GEMINI_API_KEY: AIzaSy...
  GEMINI_API_KEY_2: AIzaSy...

  # OpenRouter ‖ 3 keys
  OPENROUTER_API_KEY: sk-or-v1-...
  OPENROUTER_API_KEY_2: sk-or-v1-...
  OPENROUTER_API_KEY_3: sk-or-v1-...

  # Groq ‖ 2 keys
  GROQ_API_KEY: gsk_...
  GROQ_API_KEY_2: gsk_...

  # Cerebras Cloud ‖ 3 keys
  CEREBRAS_API_KEY: csk-...
  CEREBRAS_API_KEY_2: csk-...
  CEREBRAS_API_KEY_3: csk...

Notice the naming convention: PROVIDER_API_KEY for the first key, PROVIDER_API_KEY_2, PROVIDER_API_KEY_3, and so on. This keeps everything readable and makes it trivial to add or remove keys later.

The Model List: Where the Magic Happens

The model_list section is where you tell LiteLLM which providers and models to expose. For multi-key setups, you register the same model_name multiple times ‖ once per API key.

Let’s use Google Gemini’s models as a concrete example. We’ll walk through setting up Gemini 3.1 Flash Lite and Gemini 2.5 Flash with two keys each:

model_list:
# ===== Gemini 3.1 Flash Lite (2-key load balanced) =====
# gemini-3.1-flash-lite - Key 1
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY
    model: gemini/gemini-3.1-flash-lite
  model_name: gemini-3.1-flash-lite

# gemini-3.1-flash-lite - Key 2
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY_2
    model: gemini/gemini-3.1-flash-lite
  model_name: gemini-3.1-flash-lite

# ===== Gemini 2.5 Flash (2-key load balanced) =====
# gemini-2.5-flash - Key 1
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY
    model: gemini/gemini-2.5-flash
  model_name: gemini-2.5-flash

# gemini-2.5-flash - Key 2
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY_2
    model: gemini/gemini-2.5-flash
  model_name: gemini-2.5-flash

# ===== Gemini 2.5 Flash Lite (2-key) =====
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY
    model: gemini/gemini-2.5-flash-lite
  model_name: gemini-2.5-flash-lite

- litellm_params:
    api_key: os.environ/GEMINI_API_KEY_2
    model: gemini/gemini-2.5-flash-lite
  model_name: gemini-2.5-flash-lite

# ===== Gemini 2.5 Pro (2-key) =====
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY
    model: gemini/gemini-2.5-pro
  model_name: gemini-2.5-pro

- litellm_params:
    api_key: os.environ/GEMINI_API_KEY_2
    model: gemini/gemini-2.5-pro
  model_name: gemini-2.5-pro

The Key Principles

Two rules that make multi-key work:

Same model_name, different api_key ‖ This tells LiteLLM “these are the same logical model, just use different authentication.” The alias your clients request stays the same regardless of which key handles the actual request.
Same model (provider path) ‖ The full provider/model identifier in litellm_params.model must be identical across all entries. Only the api_key changes.

Extending to Other Providers: A Complete Multi-Provider Example

Let’s expand beyond Gemini. Here’s how we configure multi-key load balancing across our full provider stack ‖ OpenRouter, Groq, and Cerebras alongside Google:

OpenRouter Models

# ===== OpenRouter: Owl Alpha (2 keys) =====
# Key 1
- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY
    model: openrouter/openrouter/owl-alpha
  model_name: owl-alpha

# Key 2
- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY_2
    model: openrouter/openrouter/owl-alpha
  model_name: owl-alpha

# ===== OpenRouter: Claude Sonnet 4 (2 keys) =====
- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY
    model: openrouter/anthropic/claude-sonnet-4.6
  model_name: claude-sonnet-4

- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY_2
    model: openrouter/anthropic/claude-sonnet-4.6
  model_name: claude-sonnet-4

# ===== OpenRouter: Claude Opus 4 (2 keys) =====
- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY
    model: openrouter/anthropic/claude-opus-4.7
  model_name: claude-opus-4

- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY_2
    model: openrouter/anthropic/claude-opus-4.7
  model_name: claude-opus-4

# ===== OpenRouter: GPT-4o (2 keys) =====
- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY
    model: openrouter/openai/gpt-4o
  model_name: gpt-4o

- litellm_params:
    api_key: os.environ/OPENROUTER_API_KEY_2
    model: openrouter/openai/gpt-4o
  model_name: gpt-4o

Groq Models

# ===== Groq: Llama 3.1 8B Instant (2 keys) =====
- litellm_params:
    api_key: os.environ/GROQ_API_KEY
    model: groq/llama-3.1-8b-instant
  model_name: groq-llama-3.1-8b

- litellm_params:
    api_key: os.environ/GROQ_API_KEY_2
    model: groq/llama-3.1-8b-instant
  model_name: groq-llama-3.1-8b

# ===== Groq: Llama 3.3 70B Versatile (2 keys) =====
- litellm_params:
    api_key: os.environ/GROQ_API_KEY
    model: groq/llama-3.3-70b-versatile
  model_name: groq-llama-3.3-70b

- litellm_params:
    api_key: os.environ/GROQ_API_KEY_2
    model: groq/llama-3.3-70b-versatile
  model_name: groq-llama-3.3-70b

Cerebras Cloud Models

# ===== Cerebras: Llama 3.1 8B (3 keys!) =====
- litellm_params:
    api_key: os.environ/CEREBRAS_API_KEY
    model: cerebras/llama-3.1-8b
  model_name: cerebras-llama-3.1-8b

- litellm_params:
    api_key: os.environ/CEREBRAS_API_KEY_2
    model: cerebras/llama-3.1-8b
  model_name: cerebras-llama-3.1-8b

- litellm_params:
    api_key: os.environ/CEREBRAS_API_KEY_3
    model: cerebras/llama-3.1-8b
  model_name: cerebras-llama-3.1-8b

Note how Cerebras gets three keys. More keys means more headroom ‖ LiteLLM will distribute across all three and cycle back to the first when the third is rate-limited. There’s no hard limit on how many keys you can stack per model.

LiteLLM Settings for Better Reliability

Beyond the model list, a few settings in litellm_settings make multi-key setups far more resilient:

litellm_settings:
  num_retries: 3
  request_timeout: 300
  drop_params: true

num_retries: 3 ‖ How many times LiteLLM retries a failed request. With multi-key, each retry can try a different key automatically, so 3 retries with 2 keys means up to 6 total attempts before giving up.
request_timeout: 300 ‖ Seconds before a request is considered hung. 300 seconds (5 minutes) is generous enough for large model responses but will eventually fail over if a provider is truly stuck.
drop_params: true ‖ Silently drops unsupported parameters instead of rejecting the request. Critical when clients send parameters that your LiteLLM-configured models don’t support ‖ prevents unnecessary failures.

Gemma Models: Open Weights via Google

Google’s Gemma models are also available through the Gemini API and work with the same multi-key pattern. Here’s how to add Gemma 4 26B:

# ===== Gemma 4 26B (2 keys) =====
- litellm_params:
    api_key: os.environ/GEMINI_API_KEY
    model: gemini/gemma-4-26b-a4b-it
  model_name: gemma-4-26b

- litellm_params:
    api_key: os.environ/GEMINI_API_KEY_2
    model: gemini/gemma-4-26b-a4b-it
  model_name: gemma-4-26b

Key Rotation and Addition Strategies

Here are practical tips for managing keys over time:

Add keys before you need them. Don’t wait until you’re hitting limits. Register 2―3 keys from the start and you’ll rarely notice rate limiting.
Use the naming convention _2, _3, etc. It’s predictable and keeps your config readable. Avoid _backup or _alt ‖ those create psychological friction when you need to remember which is “primary.”
Rotate keys monthly. Most providers (Google, OpenRouter, Groq) let you generate new keys without disabling old ones. Add the new key, verify it works, then retire the old one. With multi-key setup, you can do this zero-downtime ‖ just register the new key and remove the old one from the model list.
Monitor with LiteLLM’s built-in dashboard. The proxy includes a UI at /ui that shows per-key usage, error rates, and cooldown status. Use it to spot which keys are getting hammered.
Use cheaper keys for lighter models. If you have API keys with different rate limits (e.g., Google’s free tier vs. paid API keys), assign the free tier keys to lighter models like Flash Lite, and reserve paid keys for Pro models.

Complete Example: A Minimal Functional Config

Here’s a stripped-down but fully functional proxy-config.yaml that works. Save this, fill in your keys, and you’re running:

environment_variables:
  GEMINI_API_KEY: your-gemini-key-here
  GEMINI_API_KEY_2: your-second-gemini-key

litellm_settings:
  num_retries: 3
  request_timeout: 300
  drop_params: true

model_list:
  # Gemini 3.1 Flash Lite (2-key)
  - litellm_params:
      api_key: os.environ/GEMINI_API_KEY
      model: gemini/gemini-3.1-flash-lite
    model_name: gemini-3.1-flash-lite

  - litellm_params:
      api_key: os.environ/GEMINI_API_KEY_2
      model: gemini/gemini-3.1-flash-lite
    model_name: gemini-3.1-flash-lite

  # Gemini 2.5 Flash (2-key)
  - litellm_params:
      api_key: os.environ/GEMINI_API_KEY
      model: gemini/gemini-2.5-flash
    model_name: gemini-2.5-flash

  - litellm_params:
      api_key: os.environ/GEMINI_API_KEY_2
      model: gemini/gemini-2.5-flash
    model_name: gemini-2.5-flash

  # Gemini 2.5 Pro (2-key)
  - litellm_params:
      api_key: os.environ/GEMINI_API_KEY
      model: gemini/gemini-2.5-pro
    model_name: gemini-2.5-pro

  - litellm_params:
      api_key: os.environ/GEMINI_API_KEY_2
      model: gemini/gemini-2.5-pro
    model_name: gemini-2.5-pro

Start with this. Add providers as you go. The pattern repeats identically for OpenRouter, Groq, Cerebras, Mistral, HuggingFace, or any other supported provider.

Troubleshooting

Q: How do I know which key handled a request?

Check LiteLLM’s request logs. Each log entry shows which API key was used. The dashboard at /ui also shows per-key statistics.

Q: What if all my keys are rate-limited?

LiteLLM will retry up to num_retries times across all registered keys, then return a rate limit error to the client. To handle this gracefully, configure a fallback model with fallback_models in the model entry.

Q: Can I use different providers as fallbacks?

Absolutely. Beyond multi-key within one provider, LiteLLM supports fallback_models at the model level. Example: fall back from Gemini to OpenRouter if both Gemini keys are exhausted:

- litellm_params:
    api_key: os.environ/GEMINI_API_KEY
    model: gemini/gemini-3.1-flash-lite
  model_name: gemini-3.1-flash-lite
  fallbacks: [owl-alpha]  # Falls back to OpenRouter if Gemini keys fail

Final Thoughts

Multi-key load balancing in LiteLLM is one of those features that’s almost too simple to document ‖ just register the same model alias multiple times with different keys. But the reliability gains are massive. At Jerith AI, we serve 23 models across 9 providers with zero downtime by simply having 2―3 keys per model.

Start with two keys. Add providers as needed. Let LiteLLM handle the routing. Your users won’t know the difference, and you’ll stop losing sleep over rate limit errors at 2 AM.

This guide reflects our real-world setup running on an RTX 4060 host with LiteLLM proxy on port 8000. Questions? Find us on Discord through Jerith AI 🐼

Share this post:

X (Twitter) Facebook Pinterest LinkedIn Reddit WhatsApp Telegram Bluesky Pocket