From Naive API Client to Production-Ready: Adding Circuit Breakers and Retry Logic

The Scenario

You ask your AI to write a Python script that fetches customer data from an external REST API and writes it to a database. The model produces a clean, readable script — but it only handles the happy path. The moment the API is slow, rate-limited, or down, the script crashes with an unhandled exception and takes your entire pipeline with it.

The Raw AI Draft

Here is what a model like GPT-4 or Claude typically generates on the first attempt. It works, it reads well, and it will destroy your production system.

Before — The Naive AI Draft
import requests
import sqlite3

def fetch_customers():
  response = requests.get("https://api.example.com/customers",
                          headers={"Authorization": "Bearer sk-1234567890abcdef"})
  data = response.json()

  conn = sqlite3.connect("customers.db")
  cursor = conn.cursor()
  cursor.execute("CREATE TABLE IF NOT EXISTS customers (id TEXT, name TEXT, email TEXT)")

  for customer in data:
      cursor.execute("INSERT INTO customers VALUES (?, ?, ?)",
                     (customer["id"], customer["name"], customer["email"]))

  conn.commit()
  conn.close()
  print(f"Imported {len(data)} customers")

fetch_customers()

The Code Smells

⚠️ Code Smells — What's Wrong Here?
  • Hardcoded API key in source code — "Bearer sk-1234567890abcdef" is committed to version control. Anyone with repo access has your API key. This is a security incident waiting to happen.
  • No error handling around the HTTP request — A network timeout, DNS failure, or 500 response will raise an unhandled exception and crash the script with a stack trace.
  • No retry logic for transient failures — If the API returns a 429 (rate limit) or 503 (temporarily unavailable), the script dies instead of waiting and retrying.
  • No request timeout configured — requests.get() with no timeout can hang indefinitely, blocking your entire pipeline on a single slow response.
  • Blind INSERT without conflict handling — Running the script twice duplicates all rows. There is no idempotency — every execution corrupts the data further.
  • Database connection never closed on error — If the INSERT loop throws an exception, conn.close() is never called, leaking the database connection.
  • No logging — print() with no timestamp, level, or structure. In production, you need to know when something failed and how badly, not just that it printed a line.
  • No circuit breaker — If the API is down, the script hammers it repeatedly on every scheduled run, making the outage worse and potentially getting your IP banned.

The Best Practices

Environment Variables for Secrets. API keys, database paths, and endpoint URLs must come from environment variables or a secrets manager — never from source code. This prevents accidental exposure via version control and enables different configurations per environment (dev/staging/prod).
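A minimal sketch of the fail-fast pattern: a small helper raises a readable error at startup if a required variable is missing, instead of failing mysteriously mid-run. The variable names in the comment are simply the ones this article's example uses.

```python
import os

def require_env(name: str) -> str:
    """Read a required setting from the environment, failing fast with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Usage (variable names from this article's example):
# API_URL = require_env("CUSTOMER_API_URL")
# API_KEY = require_env("CUSTOMER_API_KEY")
```

Failing at import time with a named variable is far easier to debug than a KeyError deep inside a scheduled job.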

Exponential Backoff with Jitter. When a request fails with a transient error (429, 503, timeout), wait before retrying — and double the wait time with each attempt. This gives the remote service time to recover instead of flooding it with retry storms. Adding random jitter prevents the "thundering herd" problem when multiple clients retry simultaneously.
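The delay calculation fits in a few lines. This sketch uses the "full jitter" variant, where the actual wait is drawn uniformly between zero and the exponential ceiling; the 30-second cap is an arbitrary choice for illustration.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter.

    The ceiling doubles with each attempt (base, 2*base, 4*base, ...),
    capped at `cap`. Drawing the wait uniformly from [0, ceiling] spreads
    out retries so simultaneous clients don't all wake at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

In the refactored code below, the retry waits are deterministic (`backoff_base * 2 ** attempt`); swapping in a function like this adds the jitter.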

The Circuit Breaker Pattern. Track consecutive failures to a service. After a threshold (e.g., 5 failures), "open" the circuit — refuse all requests immediately instead of wasting time on calls that will fail. After a cooldown period, allow one test request through ("half-open"). If it succeeds, close the circuit and resume normal operation. This protects your system from cascading failures.

Idempotent Database Operations. Use INSERT OR REPLACE (SQLite) or ON CONFLICT ... DO UPDATE (PostgreSQL) so that re-running the script produces the same result. Scripts in production will be re-run — after crashes, during recovery, or on schedule. Idempotency must be the default.
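For SQLite specifically, there is a subtle difference worth knowing: INSERT OR REPLACE deletes the conflicting row and re-inserts it (resetting any columns not listed), while ON CONFLICT ... DO UPDATE (supported since SQLite 3.24) updates the existing row in place. A minimal demonstration of the upsert form, using an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT, email TEXT)")

row = ("c1", "Ada", "ada@example.com")
for _ in range(2):  # running the same insert twice leaves exactly one row
    conn.execute(
        """INSERT INTO customers (id, name, email) VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET name = excluded.name, email = excluded.email""",
        row,
    )
conn.commit()
count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]  # 1, not 2
```

Either form makes re-runs safe; prefer DO UPDATE when the table has columns the import does not touch.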

Structured Logging. Replace print() with Python's logging module. Include timestamps, log levels (INFO, WARNING, ERROR), and contextual data. This makes production log analysis possible with tools like Datadog, ELK, or even basic grep.
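A minimal setup, using deferred %-style formatting so the values are passed as arguments rather than baked into the string (the logger name here is a hypothetical choice):

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
logger = logging.getLogger("customer_import")

# Values go in as arguments: formatting is deferred until the record
# is actually emitted, and the message template stays greppable.
logger.info("Imported %d customers in %.1fs", 42, 1.7)
logger.warning("Rate limited, backing off for %ds", 4)
```

Named loggers also let you raise or lower verbosity per component without touching the code.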

Connection Lifecycle Management. Use context managers (with statements) or try/finally blocks to guarantee that database connections, HTTP clients, and file handles are closed — even when exceptions occur.
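Worth noting for sqlite3 in particular: using the connection itself as a context manager only scopes a transaction; it does not close the connection. Wrapping it in contextlib.closing gives the actual cleanup guarantee. A small sketch:

```python
import sqlite3
from contextlib import closing

# closing() guarantees conn.close() runs even if an exception escapes.
with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # transaction scope: commits on success, rolls back on error
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
    value = conn.execute("SELECT x FROM t").fetchone()[0]
# conn is closed here, whether or not the body raised
```

The refactored code below uses an explicit try/finally for the same guarantee, which reads more naturally when the cleanup needs logging or other steps.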

The Refactored Code

After — Production-Ready
import httpx
import sqlite3
import os
import time
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

# Configuration from environment — never hardcode secrets
API_URL = os.environ["CUSTOMER_API_URL"]
API_KEY = os.environ["CUSTOMER_API_KEY"]
DB_PATH = os.getenv("DB_PATH", "customers.db")

# Circuit breaker state
@dataclass
class CircuitBreaker:
  """Prevents cascading failures by stopping calls to a failing service."""
  failure_count: int = 0
  threshold: int = 5
  reset_timeout: float = 60.0
  last_failure_time: float = 0.0
  state: str = "closed"  # closed = normal, open = blocking, half-open = testing

  def record_failure(self) -> None:
      self.failure_count += 1
      self.last_failure_time = time.time()
      if self.failure_count >= self.threshold:
          self.state = "open"
          logger.warning("Circuit breaker OPEN — too many failures, blocking requests")

  def record_success(self) -> None:
      self.failure_count = 0
      self.state = "closed"

  def can_execute(self) -> bool:
      if self.state == "closed":
          return True
      if self.state == "open":
          # Check if enough time has passed to try again
          if time.time() - self.last_failure_time > self.reset_timeout:
              self.state = "half-open"
              logger.info("Circuit breaker HALF-OPEN — testing with one request")
              return True
          return False
      return True  # half-open: allow one test request


def fetch_with_retry(client: httpx.Client, url: str, breaker: CircuitBreaker,
                   max_retries: int = 3, backoff_base: float = 1.0) -> dict:
  """Fetch URL with exponential backoff and circuit breaker protection."""
  if not breaker.can_execute():
      raise RuntimeError("Circuit breaker is OPEN — refusing request to protect system")

  for attempt in range(max_retries):
      try:
          response = client.get(url)
          response.raise_for_status()

          # Success: reset circuit breaker
          breaker.record_success()
          return response.json()

      except httpx.HTTPStatusError as e:
          if e.response.status_code == 429:
              # Rate limited: respect Retry-After header
              wait = int(e.response.headers.get("Retry-After", backoff_base * (2 ** attempt)))
              logger.warning(f"Rate limited. Waiting {wait}s (attempt {attempt + 1}/{max_retries})")
              time.sleep(wait)
          elif e.response.status_code >= 500:
              # Server error: retry with backoff
              wait = backoff_base * (2 ** attempt)
              logger.warning(f"Server error {e.response.status_code}. Retrying in {wait}s")
              breaker.record_failure()
              time.sleep(wait)
          else:
              # Client error (4xx): do not retry, this is our fault
              logger.error(f"Client error {e.response.status_code}: {e.response.text}")
              raise

      except httpx.TransportError as e:
          # Network-level failure: timeout, DNS, connection refused
          wait = backoff_base * (2 ** attempt)
          logger.warning(f"Transport error: {e}. Retrying in {wait}s")
          breaker.record_failure()
          time.sleep(wait)

  raise RuntimeError(f"All {max_retries} retry attempts exhausted for {url}")


def import_customers() -> int:
  """Fetch customers from API and upsert into local database."""
  breaker = CircuitBreaker()

  with httpx.Client(
      headers={"Authorization": f"Bearer {API_KEY}"},
      timeout=httpx.Timeout(10.0, connect=5.0),
  ) as client:
      data = fetch_with_retry(client, f"{API_URL}/customers", breaker)

  conn = sqlite3.connect(DB_PATH)
  try:
      cursor = conn.cursor()
      cursor.execute("""
          CREATE TABLE IF NOT EXISTS customers (
              id TEXT PRIMARY KEY,
              name TEXT NOT NULL,
              email TEXT NOT NULL
          )
      """)
      # Upsert instead of blind insert — idempotent on re-runs
      for customer in data:
          cursor.execute(
              "INSERT OR REPLACE INTO customers (id, name, email) VALUES (?, ?, ?)",
              (customer["id"], customer["name"], customer["email"]),
          )
      conn.commit()
      count = len(data)
      logger.info(f"Successfully imported {count} customers")
      return count
  finally:
      conn.close()


if __name__ == "__main__":
  import_customers()

The Benchmarks

📊 Production Resilience Metrics

Metric                      | Before              | After                      | Improvement
Survives API timeout        | No — crashes        | Yes — retries with backoff | ∞ (from broken to working)
Handles rate limiting (429) | Crashes             | Waits and retries          | 100% recovery
Idempotent re-runs          | Duplicates all rows | Upserts cleanly            | Zero data corruption
Secret exposure risk        | Key in source code  | Environment variables      | Eliminated
Connection leak risk        | Leaks on error      | Guaranteed cleanup         | Eliminated

The Prompt Tip

💡 Prompt Tip — Feed This to Your AI

Write a Python script that fetches data from a REST API and stores it in a SQLite database. Requirements: use httpx instead of requests. Read the API URL and API key from environment variables. Add a circuit breaker class that tracks failures, opens after 5 consecutive failures, and resets after 60 seconds. Implement retry logic with exponential backoff that handles 429 rate limits (respecting the Retry-After header), 5xx server errors, and network timeouts. Set explicit request timeouts. Use INSERT OR REPLACE for idempotent database writes. Use Python's logging module instead of print. Manage all connections with context managers or try/finally blocks. Add type hints and docstrings to all functions.