LangGraph Durable Execution Patterns: Resilient AI Agents Implementation Guide

**2026-04-25 • CAEP Lane 8888 • Engineering-Teaching**

A production-ready implementation guide for building resilient AI agents with persistence, interrupts, and safety boundaries using LangGraph’s durable execution capabilities.

Executive Summary

LangGraph’s durable execution provides a framework for building AI agents that can pause, resume, and recover from interruptions without reprocessing previous work. This guide covers:

Persistence layer: Checkpoint-based state management
Human-in-the-loop patterns: Interrupts for approval workflows
Durability modes: Sync, async, and exit tradeoffs
Determinism guarantees: Idempotent operations and consistent replay

Core Concepts

Durable Execution Defined

Durable execution saves process progress at key points, enabling:

Human-in-the-loop: Users can inspect/validate before continuing
Long-running tasks: Resume after failures or interruptions
State preservation: No reprocessing of completed work

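Key principle: When a workflow resumes, it does not continue from the exact line of code where it stopped. It replays from an appropriate starting point, typically the beginning of the node in which execution stopped, so any code before the pause point in that node runs again.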
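The replay behavior can be illustrated without LangGraph. The sketch below is a toy model, not the library's implementation: `Pause`, `run`, `ask`, and `fetch_quote` are hypothetical names invented for this example. It shows that code placed before the pause point executes once on the initial run and again on resume.

```python
class Pause(Exception):
    """Raised by the hypothetical ask() helper when no resume value exists yet."""


calls = {"fetch": 0}


def fetch_quote() -> str:
    # Stand-in for a side effect such as an API call.
    calls["fetch"] += 1
    return "quote-123"


def node(ask):
    quote = fetch_quote()                # runs on every pass through the node
    approved = ask(f"Approve {quote}?")  # pause point
    return {"quote": quote, "approved": approved}


def run(node_fn, resume_value=None):
    """Run node_fn from its first line; pause if no resume value was supplied."""
    def ask(prompt):
        if resume_value is None:
            raise Pause(prompt)
        return resume_value

    try:
        return node_fn(ask)
    except Pause as pause:
        return {"paused": str(pause)}


first = run(node)                      # pauses at ask()
second = run(node, resume_value=True)  # re-enters the node from the top
print(first, second, calls)
```

Because `fetch_quote` runs twice here, anything with a side effect that sits before a pause point needs to be either idempotent or isolated so its result can be reused.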

Checkpoint Architecture

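LangGraph’s persistence layer relies on:

Checkpointer: Saves workflow state to a store (in-memory for development; Postgres or Redis for durable production storage)
Thread ID: Identifier that groups a run's checkpoints so a specific execution can be resumed
Annotated State: TypedDict whose fields can declare a reducer, e.g. Annotated[list, operator.add], to control how node updates are merged into existing state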

from typing_extensions import TypedDict, Annotated
from langchain_core.messages import AnyMessage
import operator

class MessagesState(TypedDict):
    messages: Annotated[list[AnyMessage], operator.add]
    llm_calls: int
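To make the role of the reducer concrete, here is a minimal standard-library sketch of how an `Annotated` reducer could be applied when merging a node's partial update into state. `apply_update` and `DemoState` are hypothetical names for illustration; this is a conceptual model of the merge rule, not LangGraph's internal code, and it uses `list[str]` in place of message objects.

```python
import operator
from typing import Annotated, TypedDict, get_type_hints


class DemoState(TypedDict):
    messages: Annotated[list[str], operator.add]  # reducer: concatenate updates
    llm_calls: int                                # no reducer: overwrite


def apply_update(schema, state: dict, update: dict) -> dict:
    """Merge a partial update into state, honoring Annotated reducers."""
    hints = get_type_hints(schema, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        # Annotated types expose their extra arguments as __metadata__.
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "llm_calls": 0}
state = apply_update(DemoState, state, {"messages": ["hello"], "llm_calls": 1})
print(state)
```

Fields with a reducer accumulate across steps, while plain fields keep only the latest value, which is why message history is usually annotated and counters are not.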

Durability Modes

Tradeoffs

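Mode	Performance	Durability	Failure Recovery
`exit`	Best	Low	None mid-run (state is saved only when the graph exits)
`async`	Good	Medium	Partial (a checkpoint can be lost if the process crashes while it is being written)
`sync`	Slowest	Best	Full (each checkpoint is written before the next step starts)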

Implementation

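# Durability only matters when the graph is compiled with a checkpointer,
# so the run needs a thread_id to identify its checkpoints.
config = {"configurable": {"thread_id": "thread-1"}}
for chunk in graph.stream({"input": "test"}, config=config, durability="sync"):
    print(chunk)  # stream() is lazy; iterating drives execution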

Recommendation: Use sync for production with human-in-the-loop; async for long-running batch tasks.
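The difference between the modes can be seen in a small simulation. This is a hedged sketch, not LangGraph code: `run_steps` and `crash_at` are hypothetical, the "durable store" is a plain list, and only `sync` and `exit` are modeled because the outcome of `async` depends on timing. A hard process crash is simulated before the third step.

```python
def run_steps(steps, mode, crash_at=None):
    """Run steps in order and return (persisted_checkpoints, finished).

    mode="sync": persist after every step, before the next one starts.
    mode="exit": persist once, only when the run exits normally.
    """
    persisted = []  # stand-in for a durable store
    state = {}
    for index, step in enumerate(steps):
        if index == crash_at:
            return persisted, False  # simulated process crash
        state = {**state, **step(state)}
        if mode == "sync":
            persisted.append(dict(state))
    if mode == "exit":
        persisted.append(dict(state))
    return persisted, True


steps = [lambda s: {"a": 1}, lambda s: {"b": 2}, lambda s: {"c": 3}]
sync_saved, sync_done = run_steps(steps, "sync", crash_at=2)
exit_saved, exit_done = run_steps(steps, "exit", crash_at=2)
print(sync_saved, exit_saved)
```

With `sync`, the work of the first two steps survives the crash; with `exit`, nothing was written yet, so the run would have to start over.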

Human-in-the-Loop Patterns

Interrupts Overview

The interrupt() function pauses execution and waits for external input:

from langgraph.types import interrupt

def approval_node(state: State):
    # Pause and ask for approval
    approved = interrupt("Do you approve this action?")
    
    # Resume when user responds
    return {"approved": approved}

Resume Mechanics

When an interrupt occurs:

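Graph suspends at the interrupt() call
State is saved to the checkpointer
The value passed to interrupt() is returned to the caller (under __interrupt__ by default, or as result.interrupts when invoking with version="v2")
The graph waits indefinitely until it is resumed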

from langgraph.types import Command

# Initial run - hits interrupt
config = {"configurable": {"thread_id": "thread-1"}}
result = graph.invoke({"input": "data"}, config=config, version="v2")
print(result.interrupts)  # tuple of Interrupt objects carrying the value passed to interrupt()

# Resume with the human's response; it becomes the return value of interrupt()
graph.invoke(Command(resume=True), config=config, version="v2")
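The save-then-resume cycle depends on checkpoints being retrievable by thread ID. The following sketch models that idea with `sqlite3` from the standard library. `SqliteCheckpoints` and its table layout are assumptions made for illustration and are unrelated to LangGraph's own checkpointer schema.

```python
import json
import sqlite3


class SqliteCheckpoints:
    """Hypothetical minimal store: one row per checkpoint, keyed by thread ID."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, thread_id TEXT, state TEXT)"
        )

    def save(self, thread_id: str, state: dict) -> None:
        self.db.execute(
            "INSERT INTO checkpoints (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.db.commit()

    def latest(self, thread_id: str):
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY seq DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return json.loads(row[0]) if row else None


store = SqliteCheckpoints()
store.save("thread-1", {"input": "data", "pending": "Do you approve this action?"})
store.save("thread-2", {"input": "other"})

# Resuming: load the latest checkpoint for the thread and apply the human's answer.
state = store.latest("thread-1")
state["approved"] = True
del state["pending"]
store.save("thread-1", state)
print(store.latest("thread-1"))
```

Passing a file path instead of `":memory:"` would let the paused state outlive the process, which is the property a production checkpointer has to provide.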

Handling Multiple Interrupts

# interrupt() takes only the value to surface; each pending interrupt gets an ID automatically
result = graph.invoke({"input": "data"}, config=config, version="v2")

# Resume all pending interrupts in one call by mapping interrupt IDs to resume values
resume_map = {i.id: True for i in result.interrupts}
graph.invoke(Command(resume=resume_map), config=config, version="v2")

Tradeoff: Each additional interrupt adds a governance checkpoint, but it also adds time the workflow spends waiting on a human.

Determinism & Idempotency

Non-Deterministic Operations

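Wrap non-deterministic operations (such as random number generation) and operations with side effects (such as API calls or file writes) in @task, so their results are saved and reused on resume instead of being executed again:

import requests
from langgraph.func import task

@task
def _make_request(url: str):
    """Fetch a URL; the saved result is reused when the workflow resumes."""
    return requests.get(url, timeout=5).text[:100]

def call_api(state: State):
    # Calling a task returns a future; .result() blocks until the value is ready
    result = _make_request(state['url']).result()
    return {"result": result}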

Idempotency Keys

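For API calls that may be retried:

@task
def _make_request(url: str):
    # Safe to retry only if the endpoint is idempotent (same URL, same result)
    return requests.get(url).text[:100]

Key rule: Before executing a side effect, check whether a result for the same operation already exists, and reuse it if it does.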
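One way to apply this rule is to derive an idempotency key from values that are identical on every replay and consult a ledger before running the side effect. The sketch below assumes an in-process dictionary as the ledger; `idempotency_key`, `Ledger`, and `run_once` are hypothetical names, and a real deployment would need a shared durable store or an API that accepts such keys.

```python
import hashlib
import json


def idempotency_key(thread_id: str, step: str, payload: dict) -> str:
    """Derive a stable key from values that do not change between replays."""
    raw = json.dumps([thread_id, step, payload], sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class Ledger:
    """Records the result of each side effect by key (in memory for this sketch)."""

    def __init__(self):
        self.results = {}
        self.executions = 0

    def run_once(self, key, effect):
        # Verify existing results before executing the side effect.
        if key in self.results:
            return self.results[key]
        self.executions += 1
        self.results[key] = effect()
        return self.results[key]


ledger = Ledger()
key = idempotency_key("thread-1", "charge", {"amount": 100})
first = ledger.run_once(key, lambda: "receipt-1")
replayed = ledger.run_once(key, lambda: "receipt-2")  # not executed again
print(first, replayed, ledger.executions)
```

The key deliberately excludes timestamps and random values, since anything that differs between the original run and the replay would defeat the lookup.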

Common Patterns

Approval Workflows

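def approval_node(state: State):
    # Skip the prompt if approval was already recorded in state
    if state.get("approved"):
        return {"status": "approved"}
    
    # Wait for human input; the resume value becomes the return value of interrupt()
    approved = interrupt("Approve this action?")
    
    if not approved:
        return {"status": "rejected"}
    
    return {"status": "approved"}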

Error Recovery

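import time
import requests

@task
def _make_request_with_retry(url: str, max_retries: int = 3):
    """GET a JSON resource, retrying failed requests with exponential backoff."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the last error
            # Synchronous function, so use time.sleep rather than await asyncio.sleep
            time.sleep(1 * (2 ** attempt))  # Exponential backoff: 1s, 2s, ...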
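The retry logic can be separated from the HTTP call so the backoff schedule is testable without waiting. This is a sketch under that assumption: `retry_with_backoff` and its injectable `sleep` parameter are hypothetical, and `flaky` stands in for a request that fails twice before succeeding.

```python
import time


def retry_with_backoff(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))


delays = []
attempts = {"count": 0}


def flaky():
    # Fails on the first two calls, succeeds on the third.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the requested delays instead of actually sleeping.
result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the production default (`time.sleep`) while letting tests observe the schedule directly.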

Deployment Scenarios

Financial Trading Agent

class TradingState(TypedDict):
    positions: dict
    capital: float
    approved_deals: int
    status: str  # Written by trade_approval
    reason: str  # Set only when a trade is rejected

def trade_approval(state: TradingState):
    # Pause for human approval
    approved = interrupt(f"Approve trade? Position: {state['positions']}")
    
    if approved:
        return {"status": "executed"}
    else:
        return {"status": "rejected", "reason": "human_approval"}

Target metrics:

Approval time: < 30 seconds
Rejection rate: < 5%
Capital protection: 100%

Healthcare Documentation Agent

class PatientState(TypedDict):
    notes: list
    patient: dict

def clinician_review(state: PatientState):
    # Pause for clinician review; the resume value is expected to be a dict
    # with "approved" and "notes" keys
    review = interrupt("Review patient notes")
    
    if review["approved"]:
        return {"notes": review["notes"]}
    else:
        return {"notes": state["notes"]}  # Keep the previous notes

Tradeoff: Human review adds latency but improves accuracy.

Measurable Metrics

Metric	Target	Rationale
Resume latency	< 5s	User experience
Checkpoint size	< 10MB	Storage efficiency
Failure recovery time	< 30s	SLA compliance
Idempotency success	100%	Data consistency
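Two of these targets can be checked with a few lines of instrumentation. The sketch below is an assumption-laden stand-in: it measures a JSON serialization round trip rather than a real checkpointer, and `checkpoint_size_bytes`, `timed`, and the two budget constants are hypothetical names mirroring the table's targets.

```python
import json
import time

MAX_CHECKPOINT_BYTES = 10 * 1024 * 1024  # 10MB target from the table
MAX_RESUME_SECONDS = 5.0                 # 5s target from the table


def checkpoint_size_bytes(state: dict) -> int:
    """Approximate checkpoint size as the length of its JSON encoding."""
    return len(json.dumps(state).encode("utf-8"))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    value = fn(*args)
    return value, time.perf_counter() - start


state = {"messages": ["hello"] * 1000, "llm_calls": 3}
size = checkpoint_size_bytes(state)
restored, elapsed = timed(json.loads, json.dumps(state))
within_budget = size < MAX_CHECKPOINT_BYTES and elapsed < MAX_RESUME_SECONDS
print(size, within_budget)
```

In production the same two helpers would wrap the actual checkpoint load, with the numbers exported to whatever monitoring system is already in use.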

Anti-Patterns to Avoid

1. Repeating Side Effects

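❌ Bad: Side effects called directly in a node, so they run again whenever the node is replayed

def bad_node(state: State):
    result1 = requests.get(url1)  # Repeated if the node is re-executed on resume
    result2 = requests.get(url2)  # Repeated as well
    return {"results": [result1.text, result2.text]}

✅ Good: Wrap each side effect in @task

@task
def _get_url(url: str):
    return requests.get(url).text  # Return serializable data so it can be checkpointed

def good_node(state: State):
    # Start both tasks, then wait on their futures
    future1 = _get_url(url1)
    future2 = _get_url(url2)
    return {"results": [future1.result(), future2.result()]}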

2. Non-Deterministic Logic in Nodes

❌ Bad: Random decisions affecting state

def bad_node(state: State):
    random_decision = random.choice(["approve", "reject"])  # Different on resume
    return {"decision": random_decision}

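✅ Good: Keep decision logic deterministic, and wrap anything that must be random in @task

@task
def _deterministic_decision(context: str):
    # Deterministic logic only: the same context always yields the same decision
    return "approve" if "positive" in context else "reject"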
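When randomness is genuinely required, the usual remedy is to record the result the first time and replay it afterwards. The sketch below models that idea conceptually. `TaskLog` is a hypothetical class, not LangGraph's `@task` implementation; it simply stores results in call order and returns the stored value on replay.

```python
import random


class TaskLog:
    """Hypothetical per-thread log of task results, in call order."""

    def __init__(self):
        self.results = []
        self.cursor = 0

    def start_replay(self):
        # Rewind so the next pass reads recorded values from the beginning.
        self.cursor = 0

    def run(self, fn, *args):
        if self.cursor < len(self.results):
            value = self.results[self.cursor]  # replay the recorded result
        else:
            value = fn(*args)                  # first execution: record it
            self.results.append(value)
        self.cursor += 1
        return value


def node(log):
    decision = log.run(random.choice, ["approve", "reject"])
    token = log.run(random.random)
    return {"decision": decision, "token": token}


log = TaskLog()
first = node(log)
log.start_replay()
replayed = node(log)  # same values, even though the calls are random
print(first == replayed)
```

This only works if the node makes the same calls in the same order on every pass, which is the reason control flow around recorded calls has to stay deterministic.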

Implementation Checklist

[ ] Choose checkpointer type (Memory for dev, Postgres/Redis for prod)
[ ] Define TypedDict state with Annotated fields
[ ] Wrap non-deterministic operations in @task
[ ] Use interrupt() for human-in-the-loop points
[ ] Set durability mode (sync for prod, async for batch)
[ ] Implement idempotency keys for API calls
[ ] Test resume after failures
[ ] Monitor checkpoint sizes
[ ] Measure resume latency

Conclusion

Durable execution enables production-ready AI agents with:

Persistence: Automatic state saving and resumption
Human oversight: Interrupts for approval workflows
Resilience: Recover from failures without reprocessing

Tradeoff: Slight performance overhead for guaranteed data consistency and recovery.

Next steps: Implement checkpointer, test with interrupt(), measure latency, and monitor checkpoint sizes in production.

Source: Official LangChain documentation (LangGraph durable execution, interrupts, checkpoints)

Related: LangSmith evaluation, observability, deployment patterns
