
AI agent
AlphaForge
Quant Strategy: Two Claude models in a 24/7 plan-execute loop. Canary tests guard the engine.
XJB
AlphaForge is an autonomous quant strategy research loop where two Claude models collaborate to iterate — one plans, one executes — running 24/7 with canary tests guarding the engine.
Why two models
Splitting plan and execute across two models removes the context-switching cost of single-model loops:
- Opus = planner. Reads recent results, decides the next hypothesis, picks parameters to sweep. No code. Pure judgment.
- Sonnet = executor. Takes the plan, modifies the research code, runs the backtest, commits if green / rolls back if red. No planning. Pure execution.
Each model stays inside its own context window. Both are "thinking deeply" about different layers of the same problem.
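A minimal sketch of the split, assuming raw Anthropic API calls (the model IDs, prompts, and function names here are illustrative, not the production setup):

```python
# Sketch of the plan/execute split. Model IDs, prompts, and helper names
# are illustrative assumptions, not the production configuration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def plan(recent_results: str, research_log: str) -> str:
    """Opus: pure judgment. Emits the next hypothesis and sweep, never code."""
    msg = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=1024,
        system=("You are the planner. Read the results and log, then output "
                "the next hypothesis, parameter ranges to sweep, and the "
                "expected outcome. Do not write code."),
        messages=[{"role": "user",
                   "content": f"Results:\n{recent_results}\n\nLog:\n{research_log}"}],
    )
    return msg.content[0].text

def execute(plan_text: str) -> str:
    """Sonnet: pure execution. Turns the plan into code changes and a backtest."""
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=("You are the executor. Implement the plan against the research "
                "code and run the backtest. Do not second-guess the plan."),
        messages=[{"role": "user", "content": plan_text}],
    )
    return msg.content[0].text
```

Each call starts from a fresh, small context: the planner only ever sees results and the log, the executor only ever sees the plan.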
The /auto loop
A scheduler triggers every 20 minutes. Each tick:
- Opus reads the latest backtest output + research log
- Opus emits a plan (next hypothesis, parameter ranges, expected outcome)
- Sonnet picks up the plan, writes the code, runs the backtest
- Result is judged against the canary tests
- All canaries green → commit. Any red → rollback + log why.
Hard cap of 8 steps per batch. Auto-locks during execution.
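Roughly what a tick looks like as code, with hypothetical paths and a pytest canary suite standing in for the real plumbing (`plan()` / `execute()` as sketched above):

```python
# One /auto tick, sketched. File paths, the tests/canaries layout, and
# run_backtest.py are assumptions standing in for the real plumbing.
import subprocess
import time
from pathlib import Path

MAX_STEPS = 8                # hard cap per batch
TICK_SECONDS = 20 * 60       # scheduler cadence
LOCK = Path(".auto.lock")    # auto-lock: no overlapping batches

def run_canaries() -> bool:
    """Every Layer 1 canary must pass before anything ships."""
    return subprocess.run(["pytest", "tests/canaries", "-q"]).returncode == 0

def tick() -> None:
    if LOCK.exists():        # previous batch still executing
        return
    LOCK.touch()
    try:
        for _ in range(MAX_STEPS):
            results = Path("backtest/latest.txt").read_text()
            log = Path("research_log.md").read_text()
            plan_text = plan(results, log)              # Opus decides
            execute(plan_text)                          # Sonnet edits the code
            subprocess.run(["python", "run_backtest.py"], check=True)
            if run_canaries():                          # all green -> commit
                subprocess.run(["git", "commit", "-am", plan_text[:72]], check=True)
            else:                                       # any red -> rollback
                subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
                with Path("research_log.md").open("a") as f:
                    f.write(f"\nROLLBACK: {plan_text}\n")
                break
    finally:
        LOCK.unlink()

while True:
    tick()
    time.sleep(TICK_SECONDS)
```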
Canary tests
The biggest risk in autonomous research is silent regression — a refactor that passes type checks but breaks a core invariant nobody notices. AlphaForge has 14 canaries covering Layer 1 (factor library, feature engine, meta-labeling). Any Layer 1 change must pass all 14 before it ships.
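For flavor, here is what one such canary could look like: a hypothetical no-lookahead check on a toy factor (the factor and test are made up, not AlphaForge's actual suite):

```python
# Hypothetical Layer 1 canary: a factor computed on data up to day t must
# not change when later data arrives (no lookahead). Names are illustrative.
import numpy as np
import pandas as pd

def momentum_factor(prices: pd.Series, window: int = 20) -> pd.Series:
    """Toy stand-in for a factor-library function."""
    return prices.pct_change(window)

def test_factor_has_no_lookahead():
    rng = np.random.default_rng(0)
    prices = pd.Series(rng.lognormal(0.0, 0.01, 300).cumprod())
    full = momentum_factor(prices)
    truncated = momentum_factor(prices.iloc[:200])
    # Values up to the cutoff must be identical with or without future data.
    pd.testing.assert_series_equal(full.iloc[:200], truncated)
```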
That gate is the difference between an agent that writes code and an agent that ships safely.
Stack
Python + Claude Code + Anthropic API (planner + executor calls). Walk-forward validation + bootstrap resampling. Postgres for run history.
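A sketch of the validation pair over a daily-returns array (the window sizes, resample count, and plain i.i.d. bootstrap are illustrative choices, not necessarily the project's):

```python
# Walk-forward splits plus a bootstrap CI on annualized Sharpe.
# Window sizes and the plain i.i.d. bootstrap are illustrative choices.
import numpy as np

def walk_forward_splits(n: int, train: int = 504, test: int = 63):
    """Yield (train_idx, test_idx) pairs that only ever roll forward in time."""
    start = 0
    while start + train + test <= n:
        yield (np.arange(start, start + train),
               np.arange(start + train, start + train + test))
        start += test  # advance by one test window

def bootstrap_sharpe_ci(returns: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """95% CI on annualized Sharpe via resampling daily returns with replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(returns), size=(n_boot, len(returns)))
    samples = returns[idx]
    sharpes = samples.mean(axis=1) / samples.std(axis=1) * np.sqrt(252)
    return np.percentile(sharpes, [2.5, 97.5])
```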
What I learned
- Plan-execute splits beat single-model loops on long-horizon research
- Canaries are the only thing between "autonomous" and "autonomous and broken"
- 20-minute cadence > continuous (model fatigue is real)
- The framework outlasted any single hypothesis I tested with it