
AI agent
AlphaForge
Quant Strategy: Two Claude models in a 24/7 plan-execute loop. Canary tests guard the engine.
XJB
AlphaForge is an autonomous quant strategy research loop where two Claude models collaborate to iterate — one plans, one executes — running 24/7 with canary tests guarding the engine.
Why two models
Splitting plan and execute across two models removes the context-switching cost of single-model loops:
- Opus = planner. Reads recent results, decides the next hypothesis, picks parameters to sweep. No code. Pure judgment.
- Sonnet = executor. Takes the plan, modifies the research code, runs the backtest, commits if green / rolls back if red. No planning. Pure execution.
Each model stays inside its own context window. Both are "thinking deeply" about different layers of the same problem.
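A minimal sketch of the split, assuming raw Anthropic API calls (the model IDs, prompts, and function names here are illustrative, not the production setup):

```python
# Sketch of the plan/execute split. Model IDs, prompts, and helper names
# are illustrative assumptions, not the production configuration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def plan(recent_results: str, research_log: str) -> str:
    """Opus: pure judgment. Emits the next hypothesis and sweep, never code."""
    msg = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=1024,
        system=("You are the planner. Read the results and log, then output "
                "the next hypothesis, parameter ranges to sweep, and the "
                "expected outcome. Do not write code."),
        messages=[{"role": "user",
                   "content": f"Results:\n{recent_results}\n\nLog:\n{research_log}"}],
    )
    return msg.content[0].text

def execute(plan_text: str) -> str:
    """Sonnet: pure execution. Turns the plan into code changes and a backtest."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=("You are the executor. Implement the plan against the research "
                "code and run the backtest. Do not second-guess the plan."),
        messages=[{"role": "user", "content": plan_text}],
    )
    return msg.content[0].text
```

Each call starts from a fresh, small context: the planner only ever sees results and the log, the executor only ever sees the plan.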
The /auto loop
A scheduler triggers every 20 minutes. Each tick:
- Opus reads the latest backtest output + research log
- Opus emits a plan (next hypothesis, parameter ranges, expected outcome)
- Sonnet picks up the plan, writes the code, runs the backtest
- Result is judged against the canary tests
- All canaries green → commit. Any red → rollback + log why.
Hard cap of 8 steps per batch. Auto-locks during execution.
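Roughly what a tick looks like as code, with hypothetical paths and a pytest canary suite standing in for the real plumbing (`plan()` / `execute()` as sketched above):

```python
# One /auto tick, sketched. File paths, the tests/canaries layout, and
# run_backtest.py are assumptions standing in for the real plumbing.
import subprocess
import time
from pathlib import Path

MAX_STEPS = 8                # hard cap per batch
TICK_SECONDS = 20 * 60       # scheduler cadence
LOCK = Path(".auto.lock")    # auto-lock: no overlapping batches

def run_canaries() -> bool:
    """Every Layer 1 canary must pass before anything ships."""
    return subprocess.run(["pytest", "tests/canaries", "-q"]).returncode == 0

def tick() -> None:
    if LOCK.exists():        # previous batch still executing
        return
    LOCK.touch()
    try:
        for _ in range(MAX_STEPS):
            results = Path("backtest/latest.txt").read_text()
            log = Path("research_log.md").read_text()
            plan_text = plan(results, log)              # Opus decides
            execute(plan_text)                          # Sonnet edits the code
            subprocess.run(["python", "run_backtest.py"], check=True)
            if run_canaries():                          # all green -> commit
                subprocess.run(["git", "commit", "-am", plan_text[:72]], check=True)
            else:                                       # any red -> rollback
                subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
                with Path("research_log.md").open("a") as f:
                    f.write(f"\nROLLBACK: {plan_text}\n")
                break
    finally:
        LOCK.unlink()

while True:
    tick()
    time.sleep(TICK_SECONDS)
```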
Canary tests
The biggest risk in autonomous research is silent regression — a refactor that passes type checks but breaks a core invariant nobody notices. AlphaForge has 14 canaries covering Layer 1 (factor library, feature engine, meta-labeling). Any Layer 1 change must pass all 14 before it ships.
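For flavor, here is what one such canary could look like: a hypothetical no-lookahead check on a toy factor (the factor and test are made up, not AlphaForge's actual suite):

```python
# Hypothetical Layer 1 canary: a factor computed on data up to day t must
# not change when later data arrives (no lookahead). Names are illustrative.
import numpy as np
import pandas as pd

def momentum_factor(prices: pd.Series, window: int = 20) -> pd.Series:
    """Toy stand-in for a factor-library function."""
    return prices.pct_change(window)

def test_factor_has_no_lookahead():
    rng = np.random.default_rng(0)
    prices = pd.Series(rng.lognormal(0.0, 0.01, 300).cumprod())
    full = momentum_factor(prices)
    truncated = momentum_factor(prices.iloc[:200])
    # Values up to the cutoff must be identical with or without future data.
    pd.testing.assert_series_equal(full.iloc[:200], truncated)
```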
That gate is the difference between an agent that writes code and an agent that ships safely.
Stack
Python + Claude Code + Anthropic API (planner + executor calls). Walk-forward validation + bootstrap resampling. Postgres for run history.
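A sketch of the validation pair over a daily-returns array (the window sizes, resample count, and plain i.i.d. bootstrap are illustrative choices, not necessarily the project's):

```python
# Walk-forward splits plus a bootstrap CI on annualized Sharpe.
# Window sizes and the plain i.i.d. bootstrap are illustrative choices.
import numpy as np

def walk_forward_splits(n: int, train: int = 504, test: int = 63):
    """Yield (train_idx, test_idx) pairs that only ever roll forward in time."""
    start = 0
    while start + train + test <= n:
        yield (np.arange(start, start + train),
               np.arange(start + train, start + train + test))
        start += test  # advance by one test window

def bootstrap_sharpe_ci(returns: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """95% CI on annualized Sharpe via resampling daily returns with replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(returns), size=(n_boot, len(returns)))
    samples = returns[idx]
    sharpes = samples.mean(axis=1) / samples.std(axis=1) * np.sqrt(252)
    return np.percentile(sharpes, [2.5, 97.5])
```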
What I learned
- Plan-execute splits beat single-model loops on long-horizon research
- Canaries are the only thing between "autonomous" and "autonomous and broken"
- 20-minute cadence > continuous (model fatigue is real)
- The framework outlasted any single hypothesis I tested with it