LangGraph Production Deployment: A Practical Guide
1. Review of core concepts
LangGraph is a low-level orchestration framework in the LangChain ecosystem for building long-running, stateful agent systems. Unlike LangChain's traditional chain architecture, LangGraph introduces cyclic graph structures, which give agents more flexible decision-making.
1.1 Why cyclic graphs are needed
Traditional RAG applications usually follow a DAG (directed acyclic graph) architecture:
- Call a retriever to fetch documents
- Pass the documents to an LLM to generate the answer
With this architecture, a failed retrieval simply ends the run. Once the LLM is placed in a loop, it can reason about the quality of the retrieved results and decide whether to issue a second retrieval:
Retrieve → LLM judges quality → decide whether to retrieve again → retrieve → ...
This loop gives the agent the ability to correct itself and to handle more ambiguous requests.
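The retrieve, judge, and retry loop described above can be sketched in plain Python before any graph is involved. This is only an illustration of the control flow, not LangGraph code: `answer_with_retry`, `fake_retrieve`, `fake_grader`, and the `max_attempts` cap are hypothetical names introduced here, and the grader stands in for the LLM's quality judgment.

```python
from typing import Callable


def answer_with_retry(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    is_good_enough: Callable[[str, list[str]], bool],
    max_attempts: int = 3,
) -> tuple[list[str], int]:
    """Retrieve, grade, and retry until the grader accepts or attempts run out."""
    docs: list[str] = []
    for attempt in range(1, max_attempts + 1):
        docs = retrieve(question, attempt)
        if is_good_enough(question, docs):
            return docs, attempt
    # Give up after max_attempts and return whatever the last retrieval produced.
    return docs, max_attempts


# Stand-ins: the first retrieval returns nothing, the second finds a document.
def fake_retrieve(question: str, attempt: int) -> list[str]:
    return [] if attempt == 1 else [f"doc about {question}"]


def fake_grader(question: str, docs: list[str]) -> bool:
    return len(docs) > 0


docs, attempts = answer_with_retry("order status", fake_retrieve, fake_grader)
```

The attempt cap matters in practice: without it, a grader that never accepts would keep the loop running and keep spending tokens.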
1.2 LangGraph core concepts
StateGraph: State graph
StateGraph is the class that represents the graph. It is initialized with a state definition, and every node reads from and writes to that same shared state object.
from langgraph.graph import StateGraph, MessagesState, START, END

def mock_llm(state: MessagesState):
    # Stand-in for a model call: append one AI message to the shared state
    return {"messages": [{"role": "ai", "content": "hello world"}]}

graph = StateGraph(MessagesState)
graph.add_node(mock_llm)  # the node name defaults to the function name
graph.add_edge(START, "mock_llm")
graph.add_edge("mock_llm", END)
graph = graph.compile()
Nodes
Nodes are the basic execution units in the graph and can be functions or LCEL runnables:
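from langchain_core.messages import AIMessage, ToolMessage
from langgraph.graph import MessagesState

def model_node(state: MessagesState):
    # LLM processing; a fixed reply stands in for the model output
    return {"messages": [AIMessage(content="placeholder model reply")]}

def tool_node(state: MessagesState):
    # Tool call; tool_call_id must match the id of the originating tool call
    return {"messages": [ToolMessage(content="placeholder tool result", tool_call_id="call_1")]}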
Edges
Edges define the transitions between nodes:
- Starting edge: defines the graph's entry point
- Normal edge: a fixed transition from one node to the next
- Conditional edge: a routing function, typically driven by LLM output, chooses the next node
def should_continue(state: MessagesState) -> str:
    # Route to the tools node only when the model asked for a tool call
    last_message = state["messages"][-1]
    return "continue" if last_message.tool_calls else "end"

graph.add_conditional_edges(
    "model",
    should_continue,
    {"end": END, "continue": "tools"},
)
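Because the routing function only inspects the state, it can be exercised on its own before it is wired into a graph. The sketch below does that with a hypothetical `FakeAIMessage` dataclass standing in for a real chat model reply; the only property it shares with a real message is a `tool_calls` list, and the tool name `lookup_order` is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for a chat model reply."""
    content: str
    tool_calls: list[dict] = field(default_factory=list)


def should_continue(state: dict) -> str:
    # Same logic as the routing function above: look only at the last message.
    last_message = state["messages"][-1]
    return "continue" if last_message.tool_calls else "end"


with_tool = {"messages": [FakeAIMessage("", [{"name": "lookup_order", "args": {}}])]}
without_tool = {"messages": [FakeAIMessage("All done.")]}

route_a = should_continue(with_tool)      # the model requested a tool
route_b = should_continue(without_tool)   # the model produced a final answer
```

Keeping the router a pure function of the state makes each branch cheap to unit test without calling a model.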
Compile
Compiling turns the graph definition into an executable runnable that supports .invoke(), .stream(), .astream_log(), and related methods.
2. Agent Executor patterns in practice
LangGraph includes a graph-based recreation of LangChain's AgentExecutor, so existing LangChain agents can be used directly while their internals remain easier to modify.
2.1 AgentState Definition
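import operator
from typing import Annotated, TypedDict, Union

from langchain_core.agents import AgentAction, AgentFinish
from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    input: str
    chat_history: list[BaseMessage]
    # Result of the latest agent call; None before the first call
    agent_outcome: Union[AgentAction, AgentFinish, None]
    # (action, observation) pairs; operator.add appends instead of overwriting
    intermediate_steps: Annotated[list[tuple[AgentAction, str]], operator.add]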
2.2 Chat Agent Executor (message-based agent)
With chat models that support function calling, the state is usually represented as a list of messages:
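import operator
from typing import Annotated, Sequence, TypedDict

from langchain_core.messages import BaseMessage

class AgentState(TypedDict):
    # operator.add appends each node's new messages to the existing list
    messages: Annotated[Sequence[BaseMessage], operator.add]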
3. Production environment deployment strategy
3.1 Advantages and Limitations
Advantages:
- Durable execution: runs are resilient and can resume from the point of failure
- Human-in-the-loop: human oversight and intervention can be added at any point
- Comprehensive memory: supports both short-term working memory and long-term conversational memory
- LangSmith debugging: full execution-path visualization, state-transition tracing, and detailed runtime metrics
Limitations:
- Low-level abstraction: requires understanding state management and graph orchestration, so the learning curve is steeper
- Requires integration with the LangChain ecosystem
- State persistence and scalability must be handled at deployment time
3.2 When to choose LangGraph
Good fit:
- Complex workflows that need looped agent reasoning
- Collaborative agent systems that need human intervention points
- Long-running, stateful workflows
- Existing LangChain users who want to move to more flexible orchestration
Poor fit:
- Simple LLM chain applications (LangChain Expression Language is sufficient)
- Simple agents that only call tools (AgentExecutor is sufficient)
- Applications that need a minimal API
4. Production best practices
4.1 State management strategy
Overwriting vs. accumulating attributes
import operator
from typing import Annotated, List, TypedDict

class State(TypedDict):
    input: str
    all_actions: Annotated[List[str], operator.add]  # accumulate
    last_action: str  # overwrite
- Overwrite: the new value fully replaces the attribute; suited to one-off updates
- Accumulate: new values are appended to the existing list; suited to building up a record of actions
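To make the two modes concrete, the following plain-Python sketch merges a node's partial update into a state dictionary by reading the reducer out of the `Annotated` metadata. It illustrates the semantics only and is not LangGraph's actual implementation; `apply_update` is a hypothetical helper written for this example.

```python
import operator
from typing import Annotated, Any, TypedDict, get_type_hints


class State(TypedDict):
    input: str
    all_actions: Annotated[list[str], operator.add]  # accumulate
    last_action: str                                 # overwrite


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Merge a partial update into the state, honoring Annotated reducers."""
    hints = get_type_hints(State, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata and key in merged:
            merged[key] = metadata[0](merged[key], value)  # reducer(old, new)
        else:
            merged[key] = value  # no reducer: the new value replaces the old one
    return merged


state = {"input": "hi", "all_actions": ["search"], "last_action": "search"}
state = apply_update(state, {"all_actions": ["summarize"], "last_action": "summarize"})
```

After the update, `all_actions` holds both entries while `last_action` holds only the newest one, which is the difference between the two modes.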
4.2 Designing human intervention points
Add human-in-the-loop review at key decision points:
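from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.types import interrupt

class ReviewState(TypedDict):
    decision: str

def decision_node(state: ReviewState):
    return {"decision": "approve"}  # LLM judgment; fixed value as a stand-in

def human_review(state: ReviewState):
    # Pause for a reviewer; interrupt() needs a checkpointer when compiling
    return {"decision": interrupt({"proposed": state["decision"]})}

graph = StateGraph(ReviewState)
graph.add_node("decision", decision_node)
graph.add_node("human_review", human_review)
graph.add_edge(START, "decision")
graph.add_edge("decision", "human_review")
graph.add_edge("human_review", END)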
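The pause-and-resume idea behind a human review step can be pictured with an ordinary Python generator: execution stops at the review point, hands a request to the caller, and continues only once an answer is sent back. This is an analogy under stated assumptions, not how LangGraph implements interrupts (there the paused state is persisted by a checkpointer); `review_flow` and its payload keys are hypothetical.

```python
from typing import Generator


def review_flow(proposed: str) -> Generator[dict, str, dict]:
    """Yield a review request, wait for the human verdict, then finish."""
    verdict = yield {"question": "Approve this decision?", "proposed": proposed}
    return {"decision": verdict}


flow = review_flow("approve")
request = next(flow)        # run until the pause point and surface the request
final_state = None
try:
    flow.send("reject")     # resume with the reviewer's answer
except StopIteration as done:
    final_state = done.value
```

The reviewer's answer overrides the proposed decision, which is the behavior a review node is usually added for.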
4.3 State persistence
Use LangSmith's deployment platform for state persistence and scaling. The environment variables below turn on LangSmith tracing:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=your_api_key
5. Evaluation metrics and monitoring
5.1 Observability metrics
| Metric category | Specific metrics | Deployment considerations |
|---|---|---|
| Execution | Request latency (P50, P95, P99), failure rate | State update latency, number of loop iterations |
| Cost | Token consumption, API cost, compute cost | Extra token consumption per loop iteration |
| Quality | Accuracy, completeness, user satisfaction | Success rate of human interventions |
| Business | ROI, conversion rate, efficiency gains | Business KPIs before vs. after agent adoption |
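As a rough illustration of the execution metrics in the table, the sketch below computes P50, P95, P99, and the failure rate from a handful of request records. The records are made-up sample data, and the nearest-rank percentile is one of several common definitions, chosen here for simplicity.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) after sorting."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-request records: (latency in seconds, succeeded?)
requests = [(0.8, True), (1.1, True), (0.9, True), (2.4, False), (1.3, True),
            (1.0, True), (1.2, True), (0.7, True), (3.1, True), (1.4, True)]

latencies = [latency for latency, _ in requests]
summary = {
    "p50": percentile(latencies, 50),
    "p95": percentile(latencies, 95),
    "p99": percentile(latencies, 99),
    "failure_rate": sum(1 for _, ok in requests if not ok) / len(requests),
}
```

With only ten samples, P95 and P99 both land on the slowest request, so tail percentiles need a much larger window to be meaningful.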
5.2 Choosing between LangGraph and LangChain Agent
LangGraph:
- Suited to: agents that need cyclic graphs, human intervention, and long-running execution
- Advantages: more flexible state management, visual debugging
- Cost: higher development cost, but lower operating cost
LangChain Agent:
- Suited to: simple agent executors, rapid prototyping
- Advantages: fast to ship, simple API
- Cost: operating cost may be higher (complex state is harder to debug)
6. Deployment scenario: customer service automation
6.1 Problem description
An e-commerce customer service team needs to handle the following flow:
- A user asks about an order's status
- The system queries the order database
- The LLM generates a reply
- If the reply is unsatisfactory, the user can ask to be transferred to a human agent
6.2 Architecture design
User message → [LLM node] → [tool query node] → [LLM generation node] → [human intervention node] → reply to user
6.3 Implementation code
from langgraph.graph import StateGraph, MessagesState, START, END

class OrderState(MessagesState):
    user_message: str
    order_id: str | None
    reply: str | None
    quality: str | None  # written by check_quality

def retrieve_order(state: OrderState):
    # Simulate the order lookup
    return {"order_id": "ORD-12345"}

def generate_reply(state: OrderState):
    # LLM generates the reply (fixed text stands in for the model call)
    return {"reply": "Your order has been completed."}

def check_quality(state: OrderState):
    # LLM judges the reply quality (fixed verdict stands in for the model call)
    return {"quality": "acceptable"}

graph = StateGraph(OrderState)
graph.add_node("retrieve_order", retrieve_order)
graph.add_node("generate_reply", generate_reply)
graph.add_node("check_quality", check_quality)

# Basic flow
graph.add_edge(START, "retrieve_order")
graph.add_edge("retrieve_order", "generate_reply")
graph.add_edge("generate_reply", "check_quality")
graph.add_edge("check_quality", END)
app = graph.compile()

# Run
result = app.invoke({"user_message": "What is the status of my order?"})
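The implementation above ends right after the quality check, while the architecture in 6.2 also includes a human intervention step. One way to bridge the two is a routing function that decides between handing off and finishing. The sketch below is plain Python under stated assumptions: `QualityState`, the `user_requested_human` flag, and the `"human_review"` label are hypothetical names that do not appear in the code above.

```python
from typing import Optional, TypedDict


class QualityState(TypedDict, total=False):
    reply: Optional[str]
    quality: Optional[str]
    user_requested_human: bool  # hypothetical flag, not part of the state above


def route_after_quality(state: QualityState) -> str:
    """Return the label of the next step: hand off to a person or finish."""
    if state.get("user_requested_human", False):
        return "human_review"
    if state.get("quality") != "acceptable":
        return "human_review"
    return "end"


ok_route = route_after_quality({"reply": "done", "quality": "acceptable"})
poor_route = route_after_quality({"reply": "??", "quality": "poor"})
asked_route = route_after_quality({"quality": "acceptable", "user_requested_human": True})
```

A function like this could replace the fixed edge out of `check_quality` by being passed to `add_conditional_edges` together with a mapping from the two labels to a review node and `END`.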
6.4 Cost and Performance Analysis
| Item | Value | Notes |
|---|---|---|
| Average latency | 1.2 s | P95 about 2.5 s |
| Success rate | 98% | Main cause of failure: the tool returns an empty result |
| Token consumption | 150 tokens/request | 80 of these tokens go to the LLM quality check |
| Human intervention rate | 5% | Covers cases the LLM cannot judge |
ROI assessment:
- Agent handling rate: 95%
- Human support cost saved: $50/hour × 20 hours = $1,000
- LLM token cost: $0.0002/token × 1,000 requests × 150 tokens = $30 (at $0.002/token the same volume would cost $300)
- Net savings: $970/month
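The ROI arithmetic can be checked with a few lines of Python. The $30 token cost corresponds to a price of $0.0002 per token; a price of $0.002 per token over the same 1,000 requests × 150 tokens would come to $300. The sketch below computes both cases, treating the per-token price as an assumption to replace with the real rate of the model in use.

```python
from decimal import Decimal

requests_per_month = 1_000
tokens_per_request = 150
hourly_rate = Decimal("50")   # cost of one hour of human support
hours_saved = Decimal("20")


def monthly_token_cost(price_per_token: Decimal) -> Decimal:
    """Token cost for the month at a flat per-token price (an assumption)."""
    return price_per_token * requests_per_month * tokens_per_request


labor_savings = hourly_rate * hours_saved
cost_low = monthly_token_cost(Decimal("0.0002"))   # rate that yields the $30 figure
cost_high = monthly_token_cost(Decimal("0.002"))   # ten times that rate
net_low = labor_savings - cost_low
net_high = labor_savings - cost_high
```

`Decimal` is used instead of floats so that the dollar amounts compare exactly.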
7. Framework comparison: LangGraph vs CrewAI
7.1 Core differences
| Dimension | LangGraph | CrewAI |
|---|---|---|
| Abstraction level | Low-level orchestration framework | Mid-level agent framework |
| State management | Custom StateGraph, full control | Built-in AgentState, relatively fixed |
| Cyclic graphs | Native support through a flexible StateGraph | Requires additional implementation |
| Human-in-the-loop | Built-in interrupt, easy to implement | Requires custom logic |
| Ecosystem | LangChain ecosystem, integrated with LangChain | Independent ecosystem, less integrated with LangChain |
| Learning curve | Steeper (requires understanding graph orchestration) | Gentler (simpler API) |
| Deployment model | Deep integration with LangSmith | Requires self-built monitoring |
| Typical use cases | Complex agent systems, long-running execution | Simple agents, rapid prototyping |
7.2 Recommendations
Choose LangGraph when:
- The team already has LangChain experience
- Agent loops and human intervention are required
- In-depth visual debugging is required
- Development cost matters most over the long term (maintenance)
Choose CrewAI when:
- The organization already has an existing CrewAI codebase
- A simple agent needs to ship quickly
- Team members are familiar with CrewAI
- Development cost matters most in the short term (time to launch)
8. Summary
LangGraph provides powerful cyclic graph orchestration and is well suited to deploying agent systems in production. Key takeaways:
- State management: understand the overwrite and accumulate modes, and choose per scenario
- Human-in-the-loop: add human intervention points at key decisions
- Observability: use LangSmith to trace execution paths and state transitions
- Evaluation metrics: build a complete set of latency, cost, error-rate, and ROI metrics
- Deployment strategy: start with simple cases and expand gradually to complex scenarios
LangGraph's low-level abstraction offers maximum flexibility, but it also demands deeper understanding and more design work. When choosing a framework, weigh the team's technical debt, business requirements, and development cost.
References:
- LangChain official documentation: https://docs.langchain.com/oss/python/langgraph/
- LangGraph Blog: https://blog.langchain.dev/langgraph/