Multiple AIs compete to find edges humans miss.
Not a bot. Not a signal service. An autonomous system that discovers, validates, and trades its own strategies — with AI models that disagree, debate, and improve over time.
Inside the Engine
10 modules. From thesis to compounding advantage.
The Thesis
- 0.1 The Sophistication Gap
- 0.2 Why LLMs Change the Game
- 0.3 Competition > Consensus
- 0.4 The ATLAS Philosophy
Architecture Overview
- 1.1 The Multi-Model Design
- 1.2 The Workflow Graph
- 1.3 Market Selection
- 1.4 The Hypothesis Lifecycle
Data & Memory
- 2.1 Feed Architecture
- 2.2 Storage Design
- 2.3 Market State Embeddings
- 2.4 The Cold Start Problem
Proving an Edge Is Real
- 3.1 Why a Separate Engine
- 3.2 Realistic Simulation
- 3.3 Cost Modeling Done Right
- 3.4 Statistical Validation
Autonomous Discovery
- 4.1 Discovery Prompt Design
- 4.2 Multi-Source Research
- 4.3 The Hypothesis Registry
- 4.4 Testing at Scale
What the System Has Learned
- 5.1 The Breadth-First Approach
- 5.2 The Multi-Timeframe Insight
- 5.3 Instrument Selection Is the Edge
- 5.4 The Kill Rate
- 5.5 Lessons from Failure
The Competition Layer
- 6.1 Scoring What Matters
- 6.2 Calibration Over Time
- 6.3 The Sequential Ensemble
- 6.4 Fresh Eyes
Risk & Execution
- 7.1 Position Sizing
- 7.2 Exchange-Side Execution
- 7.3 The Paper-to-Live Bridge
- 7.4 Drawdown Philosophy
Running the Machine
- 8.1 Containerized Architecture
- 8.2 Data Ingestion Patterns
- 8.3 State Management
- 8.4 Monitoring & Alerting
The Compounding Advantage
- 9.1 Loss Attribution as a Feature
- 9.2 Meta-Strategy Analysis
- 9.3 Model Calibration Evolution
- 9.4 The Memory Moat
Module 0
0.1 The Sophistication Gap
Mechanical trading rules have a well-documented failure rate. Yet profitable discretionary traders exist, consistently, across every market. The problem is not that markets are efficient. The problem is that our systems are unsophisticated.
Why Published Rules Fail
Take any popular trading strategy — trend-following, mean reversion, breakout — and backtest the published mechanical rules with realistic costs. The results are almost always the same: marginal at best, negative at worst. The academic literature and broker disclosures agree on the failure rate for retail traders who rely on these approaches.
And yet, in every market, there are traders who are consistently profitable. Not by luck — across hundreds of trades, over years. They are doing something that the published rules are not capturing.
The Five Layers
When you study how profitable traders actually make decisions — not what they say in interviews, but what they do when you watch them trade — a pattern emerges. They are processing multiple layers of information simultaneously:
| Layer | Timeframe | What It Does |
|---|---|---|
| Structural bias | Higher | Are we in an uptrend, downtrend, or range? Where are the key levels? |
| Contextual gate | Medium | Is this a good day to trade? Volatility regime, session quality, macro events. |
| Setup identification | Medium-Low | Has the specific pattern I trade appeared? |
| Entry trigger | Low | On the execution timeframe, has the exact entry signal fired? |
| Exit management | Low | Trailing stops, partials, time exits — adapting to what price does after entry. |
Published mechanical rules typically operate on one or two of these layers. A profitable trader operates on all five, simultaneously, in real time. That is the sophistication gap.
Key Insight
The gap between published rules and profitable trading is not about secret indicators or hidden data. It is about the number of information layers processed simultaneously. A moving average crossover is one layer. A trader who checks weekly bias, filters by volatility regime, waits for a specific setup, triggers on a lower-timeframe confirmation, and manages the exit dynamically is operating on five layers. The question is: can we build a system that processes all five?
The Real Question
This is not a market efficiency argument. It is an engineering challenge. We know the layers exist. We know profitable traders use them. Can we build a system that replicates the multi-layer decision process — and then goes further, processing data volumes and market contexts that no individual human can hold in their head?
That is the thesis behind ATLAS.
You Understand This When…
- You can explain why published mechanical rules typically fail after costs
- You understand the five-layer model and why single-layer systems are insufficient
- You see the problem as an engineering challenge, not a market efficiency debate
0.2 Why LLMs Change the Game
Rule-based systems can process quantitative layers (price, volume, indicators). But the qualitative layers — narrative, context, cross-asset reasoning — have always required human judgment. Large language models break that constraint.
What Was Impossible Before
Consider what a profitable trader does that a traditional algorithm cannot:
- Read a central bank statement and understand not just the words, but the shift in tone from the previous statement
- Scan social media and distinguish between genuine sentiment shifts and noise
- Synthesize across markets — understand that a move in currency markets implies something about commodity positioning
- Reason about regime — not just “volatility is high” but “this type of volatility, in this macro context, historically resolves in this direction”
These are qualitative reasoning tasks. Before LLMs, they required a human brain. Traditional systems could measure volatility but not reason about it. They could detect a breakout but not assess whether the narrative supports continuation.
What LLMs Actually Bring
LLMs are not crystal balls. They cannot predict the future. What they can do is process, synthesize, and reason about complex, multi-dimensional information in ways that complement quantitative systems:
- Pattern recognition at scale — ingest enormous context windows of market data, research, and historical patterns
- Qualitative synthesis — combine quantitative signals with narrative, sentiment, and macro context
- Hypothesis generation — propose trading ideas that a human might not consider, drawing on vast training data
- Adversarial review — attack their own proposals and find weaknesses before real money is risked
The breakthrough is not that LLMs can trade. It is that LLMs can fill the qualitative layers that rule-based systems have always left empty.
Important Caveat
LLMs hallucinate. They are confidently wrong about specific facts. They have no access to real-time data unless you give it to them. They are not suitable as standalone decision-makers for trading. But as one component in a system that also includes rigorous quantitative validation, statistical gates, and risk management — they are transformative. The key is architecture: LLMs propose, the verification harness disposes.
You Understand This When…
- You can articulate what LLMs add to a trading system that traditional algorithms cannot
- You understand the distinction between LLMs as reasoners and LLMs as predictors
- You know why LLMs must be paired with quantitative validation, not used standalone
0.3 Competition > Consensus
One model gives you one perspective. Averaging multiple models gives you mush. Making multiple models compete gives you something genuinely valuable: calibrated disagreement.
The Problem with Single Models
A single LLM, no matter how capable, has blind spots. It will develop consistent biases over time. It will be confidently wrong about certain market conditions. It will anchor on its own prior analysis. You cannot know which of its outputs are insightful and which are hallucinated without an independent check.
The Problem with Consensus
The naive solution — ask four models the same question and average the answer — destroys the most valuable signal. If three models are bullish and one is bearish, averaging tells you “mildly bullish.” But the interesting question is: why does that one model disagree? Is it seeing something the others miss? Is it wrong? Is it early?
Averaging erases disagreement. Disagreement is the signal.
The Competition Model
Instead of consensus, ATLAS uses structured competition:
- Each model has a distinct mandate — different data emphasis, different analytical lens
- Models read each other’s output and are required to identify where they disagree and why
- Disagreements are preserved in full, never averaged away
- Over time, the system tracks who was right about what — building a calibration profile for each model
- When a disagreement resolves, that resolution is high-signal information about which model has the sharpest read on which market conditions
Design Principle
The most valuable output of a multi-model system is not any single model’s analysis. It is the pattern of agreement and disagreement across models, tracked against outcomes over time. A hypothesis that all four models propose is worth nothing — if it’s obvious to everyone, it’s already in the price. A hypothesis that only one model finds, that turns out to be real, is worth everything.
You Understand This When…
- You can explain why averaging models destroys the most valuable signal
- You understand that calibrated disagreement is more valuable than consensus
- You know why unique hypotheses (one model only) are the highest-value output
0.4 The ATLAS Philosophy
Four principles that shape every design decision in the system. They are non-negotiable.
No Constraints at Observation
Any model can notice anything, propose anything, frame markets any way it wants. The hypothesis registry accepts every proposal without prejudice. There is no “that’s too weird” filter on the input side. Constraints live exclusively at the execution layer — nothing trades real money until it has cleared a rigorous statistical gate.
Discover, Don’t Design
The system does not execute human-designed strategies. It discovers its own. Seeded hypotheses are starting questions, not answers. The most valuable output is a hypothesis nobody put in the brief. The system is designed to surprise its operator.
Earn, Don’t Assume
Nothing goes live without statistical proof. The system earns its way to real capital through paper trading performance. Starting capital is minimal. Losses are educational, not catastrophic. The verification harness is the gatekeeper, and it is deliberately conservative.
Minimal Human Involvement
The system runs autonomously. Monthly reviews. Approval gates for major transitions. A circuit breaker for extreme drawdowns. But the day-to-day decisions — what to scan, what to test, what to paper trade — are made by the system. The human’s role is oversight, not operation.
From Experience
The “discover, don’t design” principle was not the original plan. The first version of the system was built to execute strategies we had already validated manually. It worked, but it was limited to what we could imagine. When we restructured the system to propose and validate its own hypotheses, the quality and diversity of ideas immediately exceeded what we had been producing ourselves. The machine does not have our biases. That turns out to be its greatest strength.
You Understand This When…
- You can state the four principles and explain why each matters
- You understand the separation: unconstrained observation, constrained execution
- You see the system as autonomous by design, not by accident
Module 1
1.1 The Multi-Model Design
One AI orchestrates and executes. Multiple AIs compete as analysts. Each has a distinct mandate. The orchestrator synthesizes — it never averages.
Role Separation
ATLAS uses four large language models, but they are not interchangeable. Each has a specific analytical mandate based on what that model architecture does best:
| Role | Mandate | Why This Model |
|---|---|---|
| Orchestrator | Synthesis, execution, code, system management | Always on. Manages all workflows. The only model that executes trades. |
| Narrative Analyst | Social sentiment, forming/dying narratives, crowd behaviour | Access to real-time social data. Reads and comprehends, not just scores. |
| Pattern Analyst | Statistical patterns, prediction accuracy, correlation discovery | Strong structured reasoning. Finds patterns in trade logs and metrics. |
| Macro Synthesizer | Cross-asset regime, macro context, large-scale data ingestion | Large context window. Reads full documents — actual statements, not summaries. |
Why Not One Model?
A single model, even the most capable one, develops blind spots. It anchors on its own prior analysis. It has consistent biases that are invisible without an external check. Multiple specialized models create natural cross-validation:
- The narrative analyst might see bullish sentiment that the pattern analyst’s data contradicts
- The macro synthesizer might flag a regime shift that neither of the others noticed
- The orchestrator sees all three perspectives and must reconcile them — or flag the disagreement as the most interesting signal
Design Principle
The orchestrator is the only model that trades. The analysts can propose, warn, contradict, and debate — but they never touch execution. This separation prevents any single model’s hallucination from directly causing a trade. Every idea must survive the full pipeline: proposal → statistical validation → paper trading → human approval → live.
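As a rough sketch of how this separation can be enforced in code (the names and fields here are illustrative assumptions, not the production schema), the analyst roles simply never receive execution permission:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelRole:
    name: str
    mandate: str
    can_execute: bool   # only the orchestrator is ever granted this

ROLES = {
    "orchestrator":      ModelRole("orchestrator", "synthesis, execution, system management", True),
    "narrative_analyst": ModelRole("narrative_analyst", "social sentiment, crowd behaviour", False),
    "pattern_analyst":   ModelRole("pattern_analyst", "statistical patterns, prediction accuracy", False),
    "macro_synthesizer": ModelRole("macro_synthesizer", "cross-asset regime, macro context", False),
}

def place_order(role: ModelRole, order: dict) -> None:
    """Hard guard: analyst output can never reach the exchange directly."""
    if not role.can_execute:
        raise PermissionError(f"{role.name} proposed an order but cannot execute")
    print(f"routing order to exchange: {order}")
```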
You Understand This When…
- You can explain why each model has a distinct mandate rather than all models doing the same thing
- You understand the orchestrator/analyst separation and why only one model executes
- You see the multi-model architecture as a cross-validation mechanism, not a voting system
1.2 The Workflow Graph
ATLAS is not a chatbot loop. It is a directed state machine where each node has a specific function, clear inputs, and clear outputs. The graph determines what happens, when, and in what order.
Why a Graph, Not a Script
A trading system needs to handle multiple concurrent workflows: data ingestion runs continuously, the daily ensemble runs on a schedule, hypothesis scanning runs in loops, and live execution responds to signals in real time. A linear script cannot manage this. A state machine can.
The graph approach provides:
- Crash recovery — if the system restarts, the graph knows which state each workflow was in and resumes from there
- Human-in-the-loop gates — certain transitions require approval before proceeding
- Branching logic — different paths based on schedule (daily ensemble vs weekly discovery) or conditions (signal fired vs no signal)
- Full observability — every transition is logged, every decision is traceable
Conceptual workflow graph. Each node has defined inputs, outputs, and transition conditions. Multiple concurrent workflows run on different schedules.
Key Design Decisions
- Ingestion is separate from analysis. Data flows into the database regardless of whether any analysis is running. The analytical nodes read from the database, not from the feed directly. This prevents feed delays from blocking decisions.
- The ensemble is sequential, not parallel. Each analyst reads the previous analysts’ output. This creates a debate, not independent (redundant) analysis.
- Live execution has a human gate. The system can identify opportunities autonomously, but deploying real capital requires explicit human approval until the system has earned trust through paper trading performance.
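A toy state machine in the spirit of the graph described above. The node names, transitions, and checkpoint mechanism are assumptions for illustration only, but they show the three properties that matter: resumability, a human gate, and explicit transitions.

```python
# Each node is a function that mutates shared state and returns the next node's name.
def ingest(state):
    state["data_fresh"] = True
    return "analyze"

def analyze(state):
    state["signal"] = state.get("data_fresh", False)
    return "human_gate" if state["signal"] else "ingest"

def human_gate(state):
    # Human-in-the-loop transition: execution waits for explicit approval.
    return "execute" if state.get("approved") else "wait_for_approval"

def execute(state):
    state["order_sent"] = True
    return "done"

NODES = {"ingest": ingest, "analyze": analyze, "human_gate": human_gate, "execute": execute}

def run(state, start="ingest"):
    node = state.get("resume_from", start)   # crash recovery: resume from the last checkpoint
    while node not in ("done", "wait_for_approval"):
        state["resume_from"] = node          # checkpoint before executing the node
        node = NODES[node](state)
    return node, state

print(run({"approved": True})[0])            # -> "done"
```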
You Understand This When…
- You understand why a state machine is superior to a linear script for a multi-workflow system
- You can trace a signal from data ingestion through to execution on the graph
- You know where human gates exist and why they are positioned there
1.3 Market Selection
ATLAS trades across multiple asset classes: crypto perpetuals, forex, precious metals, and equity indices. This is not diversification for its own sake. It is a direct consequence of one empirical finding.
The Cost-Ratio Principle
Here is the single most important insight for market selection:
Consider the same strategy applied to two different instruments. On one, a typical trade moves several hundred points and the spread is a few points. On the other, a typical trade moves twenty points and the spread is one point. The absolute spread on the second instrument is smaller. But the cost as a fraction of the expected move is much larger.
The same edge, the same strategy, the same statistical profile — but one instrument gives the edge room to breathe after costs, and the other suffocates it.
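The arithmetic behind this comparison is trivial but worth making explicit. The numbers below are illustrative, not measured values:

```python
# Cost as a fraction of the typical move is the quantity that decides viability.
def cost_to_move_ratio(spread_points: float, typical_move_points: float) -> float:
    return spread_points / typical_move_points

# Instrument A: moves several hundred points against a few points of spread.
print(cost_to_move_ratio(spread_points=3, typical_move_points=300))   # 0.01 -> costs take 1% of the move
# Instrument B: smaller absolute spread, but a much larger share of the move.
print(cost_to_move_ratio(spread_points=1, typical_move_points=20))    # 0.05 -> costs take 5% of the move
```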
Key Insight
Instrument selection matters more than strategy selection. A mediocre strategy on a favorable-cost instrument will outperform a brilliant strategy on an unfavorable-cost instrument. The cost-to-move ratio is the explanatory variable. When we tested the same strategies across dozens of instruments, this pattern was overwhelming and consistent.
Why Multiple Asset Classes
Crypto, forex, commodities, and indices each have different characteristics:
- Crypto perpetuals — High volatility, 24/7 markets, funding rate dynamics, on-chain data availability
- Forex — Deep liquidity, session-based patterns, macro sensitivity, low cost on major pairs
- Precious metals — Trend-following friendly, favorable cost-to-move ratio, safe-haven dynamics
- Equity indices — Strong session patterns, economic calendar sensitivity, high beta to risk sentiment
A system that only trades one asset class will have concentrated risk exposure and long idle periods. Trading across asset classes means the system always has something to study, something to test, and — when edges are validated — something to trade.
You Understand This When…
- You can explain the cost-to-move ratio and why it determines instrument viability
- You understand why instrument selection matters more than strategy selection
- You know why ATLAS trades across multiple asset classes
1.4 The Hypothesis Lifecycle
Every potential trading edge in ATLAS moves through a defined lifecycle. The gates between stages are statistical, not subjective. Most hypotheses die. That is the process working correctly.
Before anything trades real capital, the live eligibility gate requires all of the following:
- CI lower bound > cost threshold
- Risk-adjusted return above minimum
- Robust across multiple regimes
- Multi-model confidence positive
- Human approval
The hypothesis lifecycle. Progression is data-driven. Regression is automatic. Death is permanent but educational.
Why This Matters
Without a rigorous lifecycle, systems suffer from two failure modes:
- Too permissive: Strategies go live without sufficient evidence. Losses are “learning.” The account bleeds.
- Too restrictive: Nothing ever qualifies. The system observes forever and never trades. Paralysis by analysis.
The lifecycle solves both. The gates are demanding but achievable. The paper trading stage accumulates evidence at zero risk. The live eligibility gate requires statistical proof, not subjective confidence. And once live, continuous monitoring ensures degrading strategies are caught before they do serious damage.
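A hypothetical sketch of what the gate check might look like in code. The field names and thresholds are placeholders; the point is that every condition is mechanical and all must pass:

```python
from dataclasses import dataclass

@dataclass
class PaperResults:
    ci_lower: float            # lower bound of the bootstrap CI on expectancy, after costs
    cost_threshold: float      # estimated round-trip cost for the instrument
    risk_adjusted_return: float
    regimes_profitable: int    # distinct market regimes with positive results
    model_confidence: float    # aggregate multi-model confidence, signed
    human_approved: bool

def live_eligible(r: PaperResults, min_rar: float = 1.0, min_regimes: int = 3) -> bool:
    return all([
        r.ci_lower > r.cost_threshold,        # edge survives costs with statistical margin
        r.risk_adjusted_return >= min_rar,    # risk-adjusted return above minimum
        r.regimes_profitable >= min_regimes,  # robust across regimes
        r.model_confidence > 0,               # multi-model confidence positive
        r.human_approved,                     # explicit human approval
    ])
```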
From Experience
Our kill rate is high. The majority of proposed hypotheses never make it past paper trading. This was initially discouraging — it felt like the system was failing. In reality, a high kill rate is the strongest possible evidence that the filter is working. If most ideas survived, the filter would be too loose, and we would be deploying noise. The graveyard being larger than the live portfolio is the system working as designed.
You Understand This When…
- You can trace a hypothesis from proposal through to live trading or death
- You know what the live eligibility gate requires and why each condition exists
- You understand that a high kill rate is a feature, not a bug
Module 2
2.1 Feed Architecture
An autonomous trading system is only as good as its data. ATLAS ingests from multiple exchanges and data providers, across multiple asset classes, continuously. The feed layer is the foundation everything else builds on.
Multiple Sources, Multiple Types
The system ingests several categories of data from different providers:
- Price data (OHLCV candles) — From crypto exchanges and an FX/CFD broker. Multiple timeframes from one-minute to daily.
- Derivatives data — Funding rates, open interest, and options-derived metrics. These reveal positioning and sentiment that price alone does not show.
- On-chain data — Exchange flows, wallet movements, and network metrics for crypto assets.
- Sentiment and macro — Fear/greed indices, economic calendar events, and macro indicators.
Design Decisions
- WebSocket for real-time, REST for historical. Live data arrives via streaming connections. Historical backfill uses paginated REST calls. The system handles both paths and knows which data came from which source.
- Feeds are independent of analysis. The ingestion loop runs continuously regardless of what the analytical nodes are doing. Data flows into the database first. Analytical nodes read from the database, never from the feed directly. This decouples feed reliability from decision-making.
- Automatic reconnection with backoff. Feeds disconnect. APIs rate-limit you. The system detects failures, backs off, reconnects, and fills any gaps that occurred during downtime.
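A minimal reconnect-with-backoff loop in the spirit of the last point. The connect and handle callables are stand-ins for a real feed client and database writer:

```python
import random
import time

def run_feed(connect, handle, max_backoff: float = 60.0) -> None:
    """Keep a streaming feed alive: reconnect on failure with exponential backoff plus jitter."""
    backoff = 1.0
    while True:
        try:
            feed = connect()                  # stand-in: returns an iterable of messages
            backoff = 1.0                     # reset after a successful connection
            for message in feed:
                handle(message)               # stand-in: validate and write to the database
        except ConnectionError as exc:
            wait = backoff + random.uniform(0, backoff)   # jitter avoids synchronized retries
            print(f"feed down ({exc}); retrying in {wait:.1f}s")
            time.sleep(wait)
            backoff = min(backoff * 2, max_backoff)
```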
From Experience
We had a data ingestion bug that went undetected for over a day. All three exchange feeds were failing every cycle, but silently — the error was caught and logged, but the system continued running without fresh data. Strategies kept scanning, but on stale candles. Trades opened during that window were based on data that was over a day old. The fix was straightforward, but the lesson was permanent: silent data failure is the most dangerous kind. Your ingestion layer needs loud, unmissable health monitoring.
You Understand This When…
- You know why feeds must be decoupled from analysis
- You understand the WebSocket/REST division and why both are necessary
- You recognize silent data failure as the most dangerous failure mode
2.2 Storage Design
Candle data is time-series data. Storing it in a general-purpose database works, but a time-series-optimized database makes everything downstream faster and simpler. The schema decisions you make here propagate through the entire system.
Time-Series Optimization
ATLAS stores candle data in a time-series-optimized database. The key properties:
- Automatic partitioning by time — queries for recent data are fast because they only scan recent partitions
- Compression — historical data compresses significantly, reducing storage costs
- Efficient range queries — “give me all candles for this instrument between these dates” is the primary access pattern, and it is optimized for exactly this
Symbol Normalization
When you ingest data from multiple exchanges, the same instrument has different identifiers. One exchange calls it “BTCUSDT,” another calls it “BTC-USDT-PERP,” a third just uses “BTC.” If you store them as-is, you cannot compare data across venues.
ATLAS normalizes all symbols to a canonical form at ingestion time, keyed by exchange and base asset. This means a strategy can request “BTC candles” and get comparable data from any venue, without knowing the venue-specific naming convention.
Multi-Timeframe Alignment
The most subtle storage issue is multi-timeframe alignment. When a strategy uses both hourly and four-hour candles, the four-hour candle is not “known” until the fourth hourly candle closes. If you forward-fill the four-hour close into the earlier hourly bars, you have introduced lookahead bias — the strategy is using information that did not exist at the time of the decision.
The solution is explicit timestamps on data availability. Every bar has an “as-of” timestamp that records when that bar’s data was finalized. Strategies can only access bars whose as-of timestamp is at or before the current decision point. This is enforced programmatically, not by convention.
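A small sketch of the as-of rule. The bar format and field names are illustrative; the enforcement idea is the filter itself:

```python
import datetime as dt

def bars_visible_at(bars: list[dict], decision_time: dt.datetime) -> list[dict]:
    """Return only bars finalized at or before the decision point.

    Each bar carries an 'as_of' timestamp recording when its data became known,
    e.g. a four-hour bar's as_of is the close of its final hourly bar.
    """
    return [b for b in bars if b["as_of"] <= decision_time]

bars = [
    {"tf": "4h", "close": 101.0, "as_of": dt.datetime(2024, 1, 1, 4)},
    {"tf": "4h", "close": 103.5, "as_of": dt.datetime(2024, 1, 1, 8)},
]
# At 06:00 the 08:00 four-hour bar does not exist yet, so it is excluded.
print(bars_visible_at(bars, dt.datetime(2024, 1, 1, 6)))
```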
Design Principle
Lookahead bias is the silent killer of backtesting credibility. It is surprisingly easy to introduce and extremely difficult to detect after the fact. The correct solution is architectural: enforce data availability rules at the storage/access layer, so that strategies cannot access future data even if they try. Do not rely on strategy authors being careful. Make the system enforce correctness.
You Understand This When…
- You know why time-series databases are superior to general-purpose databases for candle storage
- You understand the symbol normalization problem and why it matters for cross-venue analysis
- You can explain multi-timeframe lookahead bias and how architectural enforcement prevents it
2.3 Market State Embeddings
Price is one dimension. The full market state is dozens of dimensions: trend, volatility, positioning, sentiment, macro context, cross-asset correlations. ATLAS compresses this multi-dimensional state into vectors and stores them. The result: a searchable memory of every market condition the system has ever seen.
What Gets Embedded
At every candle close, across every market, ATLAS captures a snapshot of the full market state. This includes:
- Price structure — Trend direction, distance from key levels, recent swing points
- Volume characteristics — Above or below average, distribution shape
- Derivatives data — Funding rates, open interest changes, positioning extremes
- Sentiment — Fear/greed levels, social volume
- Cross-asset context — Dollar direction, equity momentum, commodity trends
- Macro cycle position — Rate cycle phase, economic calendar proximity
- Regime indicators — Raw measurements of trend strength, volatility level, and correlation stability
This snapshot is converted into a numerical vector and stored in a vector database alongside metadata: timestamp, asset, timeframe, and — critically — what happened in the market over the following hours and days.
Similarity Search
The power of embedding market states is similarity search. Given the current market conditions, ATLAS can query: “Find the twenty historical moments where conditions most closely resembled right now.”
Each of those historical moments has a known outcome. The models can reason over these outcomes with current context: “In similar conditions, price tended to do X, but three of those instances had a macro catalyst we don’t have today.”
This is not curve-fitting. It is giving the models a structured historical memory to reason with, rather than asking them to reason from their training data alone.
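A toy version of the similarity query. The vectors and outcomes below are random placeholders, and a production system would use a vector database rather than brute-force cosine similarity, but the shape of the operation is the same:

```python
import numpy as np

def most_similar(current: np.ndarray, history: np.ndarray, outcomes: list[str], k: int = 5):
    """Return the k most similar historical market states and what followed each one."""
    sims = history @ current / (np.linalg.norm(history, axis=1) * np.linalg.norm(current))
    top = np.argsort(sims)[::-1][:k]
    return [(round(float(sims[i]), 3), outcomes[i]) for i in top]

rng = np.random.default_rng(0)
history = rng.normal(size=(500, 8))                               # 500 stored state vectors (toy dimensionality)
outcomes = rng.choice(["up", "down", "flat"], size=500).tolist()  # what happened after each snapshot
current = rng.normal(size=8)                                      # the current market state, embedded
print(most_similar(current, history, outcomes))
```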
Key Insight
The embedding store is the long-term moat. Every day the system runs, it accumulates more market state snapshots with known outcomes. After months of operation, the system has “seen” more market conditions than most human traders encounter in a career. This advantage is structural and compounding — a new entrant starts with an empty memory, regardless of how good their models are.
You Understand This When…
- You know what a market state embedding contains and why it is multi-dimensional
- You understand how similarity search enables historically-grounded reasoning
- You see the embedding store as a compounding competitive advantage
2.4 The Cold Start Problem
An autonomous system that relies on historical memory needs history before it can be useful. The first weeks of operation are a degraded-capability phase. Understanding this — and planning for it — is essential.
What You Need Before Go-Live
Before the system can meaningfully operate, it needs:
- Historical price data — Months to years of candle data across all target instruments and timeframes. This enables backtesting and populates the initial embedding store.
- Backfill quality — Historical data must pass the same quality checks as live data: no gaps, no duplicates, OHLC integrity, correct timestamps. A backfill script that fetches data in pages, respects rate limits, and validates results is a prerequisite, not an optimization.
- Seeded hypotheses — The system needs starting questions to begin investigating. These are not answers — they are initial directions for the discovery engine to explore, validate, or kill.
Accepting Degraded Capability
Even with good historical data, the system operates at reduced capability initially:
- The embedding store has historical snapshots but no live-observed outcomes yet
- Model calibration scores are meaningless until sufficient predictions have been tracked against results
- The anti-pattern library is empty — no failure conditions have been recorded
- The leaderboard shows no meaningful differentiation between models
This is expected and acceptable. The system is designed to improve over time. The cold start phase is measured in weeks, not months, and each day of operation reduces the capability gap.
Warning
The temptation during cold start is to skip paper trading and go straight to live “because the backtest looks good.” Resist this. The backtest validates the strategy logic. Paper trading validates the system — data ingestion, signal generation, order management, exit logic, and all the integration points between them. Every system has bugs that only appear in live operation. Paper trading finds them at zero cost.
You Understand This When…
- You know what data the system needs before it can begin operating
- You accept the degraded-capability phase as expected, not a failure
- You understand why paper trading is mandatory even when backtests are strong
Module 3
3.1 Why a Separate Engine
The backtesting engine is not part of the live trading workflow. It is a completely separate, standalone, synchronous system. This is a deliberate architectural choice, not a compromise.
The Case for Separation
A live trading system is asynchronous, event-driven, and connected to external services. A backtesting engine needs to be the opposite:
| Property | Live System | Backtesting Engine |
|---|---|---|
| Execution model | Async, event-driven | Synchronous, bar-by-bar |
| Data source | Live feeds, streaming | Historical database, batch reads |
| Timing | Real-time, unpredictable | Deterministic, reproducible |
| Failure mode | Must recover gracefully | Must fail loudly |
| Dependencies | Exchange APIs, feeds, cache | Database only |
Embedding backtesting inside the live workflow introduces async overhead, nondeterministic scheduling, and coupling to services that have nothing to do with historical simulation. A standalone engine runs faster, produces reproducible results, and can be tested independently.
The Execution Model
The engine processes candles in strict chronological order. At each bar:
- Update indicators using data up to and including the current bar
- Evaluate strategy rules against current state
- If a signal fires: place orders for execution at the next bar’s open
- Evaluate open positions against stops and targets
- Accrue funding/swap costs for held positions
- Log everything
The critical rule: decisions are made on the current bar’s close, execution happens at the next bar’s open. This eliminates the most common form of lookahead bias in backtesting — using a price to make a decision and then executing at that same price, which is impossible in real trading.
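A skeletal version of that loop. The strategy and broker objects are stand-ins, so this is a sketch of the execution model rather than a working engine, but it makes the close-to-open rule explicit:

```python
def run_backtest(bars, strategy, broker):
    """Bar-by-bar, strictly chronological. Decide on a bar's close, fill at the next bar's open."""
    pending_order = None
    for bar in bars:
        # 1. Fill the order placed on the previous bar's close at this bar's open.
        if pending_order is not None:
            broker.fill(pending_order, price=bar["open"])
            pending_order = None
        # 2. Evaluate open positions against stops and targets using this bar's range.
        broker.evaluate_exits(bar)
        # 3. Accrue funding/swap for positions held through this bar.
        broker.accrue_costs(bar)
        # 4. Decide using only data up to and including this bar's close.
        pending_order = strategy.on_bar_close(bar)
```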
You Understand This When…
- You can explain why the backtesting engine is separate from the live system
- You understand the close-to-open execution model and why it prevents lookahead bias
- You know why reproducibility requires synchronous, deterministic execution
3.2 Realistic Simulation
A backtest that doesn’t model the messy realities of execution is a fantasy. The broker simulation must handle multiple positions, weekend gaps, overnight costs, and the fundamental ambiguity of what happens inside a single price bar.
Position Tracking
The simulated broker tracks individual position lots, not just net exposure. This is essential for strategies that pyramid (add to winners) or use partial exits. Each lot has its own entry price, entry time, and associated stop/target orders. Net exposure per instrument is computed separately for portfolio-level risk checks.
Weekend Gaps
Markets close on Friday and reopen on Sunday or Monday (depending on asset class). If the opening price gaps through a stop-loss, the stop cannot fill at its trigger price — it fills at the opening price, which may be significantly worse. The simulation models this: any stop or target that is “gapped through” during a market closure fills at the opening price, not the order price.
The Intrabar Ambiguity Problem
This is the most subtle issue in OHLC-based backtesting. A single bar has an open, high, low, and close — but you do not know the sequence in which high and low were reached. If a position has both a stop-loss and a take-profit within the bar’s range, you cannot determine which was hit first.
There are three common approaches:
| Approach | Assumption | Bias |
|---|---|---|
| Optimistic | Target hit first | Inflates profits |
| Pessimistic | Stop hit first | Conservative, understates edge |
| Random | Coin flip each bar | Neutral on average, noisy per run |
ATLAS uses the pessimistic rule as the default: when both stop and target are within a bar’s range, assume the adverse outcome. The reasoning: a verification harness should be conservative. If a strategy survives worst-case fill assumptions, it is more likely to survive real trading. If it only works under optimistic assumptions, it probably doesn’t work at all.
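A compact sketch of the pessimistic rule for a long position; the function and field names are illustrative:

```python
def resolve_intrabar(bar: dict, stop: float, target: float) -> str | None:
    """Pessimistic intrabar resolution for a long position.

    If both the stop and the target fall inside the bar's range, OHLC data cannot
    tell us which was touched first, so assume the adverse outcome: stop filled first.
    """
    stop_hit = bar["low"] <= stop
    target_hit = bar["high"] >= target
    if stop_hit:
        return "stopped"          # covers the ambiguous case as well: pessimistic default
    if target_hit:
        return "target"
    return None                   # neither level touched; the position stays open

print(resolve_intrabar({"high": 105.0, "low": 95.0}, stop=96.0, target=104.0))  # -> "stopped"
```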
Design Principle
The backtest is a filter, not a predictor. It does not need to precisely match live execution. It needs to be conservative enough to eliminate bad strategies and realistic enough to not eliminate good ones. Erring on the side of pessimism is correct — you would rather reject a marginally profitable strategy than deploy a marginally unprofitable one.
You Understand This When…
- You know why position lot tracking matters for pyramiding strategies
- You understand gap-through fills and why they must be modeled
- You can explain the intrabar ambiguity problem and why the pessimistic default is correct
3.3 Cost Modeling Done Right
Most backtests underestimate costs. They use a flat fee percentage, ignore funding, and pretend slippage is deterministic. Real trading costs are variable, session-dependent, and frequently the difference between a profitable strategy and a losing one.
The Three Cost Components
Spread & Fees (Variable by Session)
Spreads are not constant. They widen during low-liquidity sessions (overnight, weekends) and tighten during peak hours. A strategy that enters during the London open faces a different cost structure than one that enters during the Asian session. The cost model must be session-aware — applying different spread assumptions based on the time of day and day of week.
Fees depend on whether you are a maker (providing liquidity with limit orders) or a taker (consuming liquidity with market orders). Maker fees can be zero or even negative (rebates) on some venues. Taker fees are always positive. Modeling all trades as taker fees is conservative but may reject valid maker-oriented strategies.
Slippage (Stochastic, Not Deterministic)
Slippage is the difference between the price you intend to execute at and the price you actually get. It depends on order size, current liquidity, and volatility. Modeling slippage as a fixed number (e.g., 1 pip) is wrong — it understates slippage during volatile periods and overstates it during calm ones.
A more realistic approach models slippage as a random draw from a distribution that varies by session and volatility regime. The distribution should be calibrated conservatively: overestimating slippage is safer than underestimating it.
Funding & Swap (Per-Interval Accrual)
On perpetual futures, funding is exchanged between longs and shorts at regular intervals. On FX, overnight swap rates apply to positions held past the daily rollover. These costs are not flat percentages — they vary by instrument, direction, and market conditions.
The cost model must accrue funding/swap at each interval for the exact duration the position is held. A flat annual rate divided by 365 is a poor approximation when rates can spike from near-zero to extreme values within hours. Historical rate data should inform the model.
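A minimal sketch of per-interval accrual; the rates below are invented to show why a flat approximation misses spikes:

```python
def accrued_funding(notional: float, interval_rates: list[float], is_long: bool) -> float:
    """Sum funding over each interval the position was actually held.

    interval_rates holds one observed rate per funding interval (e.g. every 8 hours).
    A positive result is a cost to the position; a negative result is income.
    """
    sign = 1.0 if is_long else -1.0
    return sum(sign * notional * rate for rate in interval_rates)

# Three 8-hour intervals: calm, a spike, calm again. The spike dominates the total.
print(accrued_funding(10_000, [0.0001, 0.0025, 0.0001], is_long=True))   # 27.0
```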
Key Insight
The cost-to-move ratio determines which instruments are viable for systematic trading. An instrument where a typical strategy move is large relative to transaction costs will support edges that survive. An instrument where the move is small relative to costs will kill the same edges. This is why instrument selection matters more than strategy selection — you are choosing the cost environment first, then finding strategies that work within it.
You Understand This When…
- You can name the three cost components and explain why each must be variable, not fixed
- You understand session-aware spread modeling and why it matters
- You know the difference between maker and taker execution and its impact on costs
- You can explain per-interval funding accrual and why flat approximations are dangerous
3.4 Statistical Validation
A positive backtest is the beginning of validation, not the end. Walk-forward testing, bootstrap confidence intervals, and multiple testing correction separate strategies that have a real edge from those that got lucky.
Walk-Forward Testing
The principle: train on one period, test on a period the strategy has never seen. Then roll forward and repeat.
- Divide history into rolling windows: a longer in-sample period followed by a shorter out-of-sample period
- Optimize parameters (if any) using only in-sample data
- Lock the parameters and test on the out-of-sample period
- Roll forward and repeat
- Only out-of-sample results are reportable
In-sample performance tells you how well you can fit to historical data. Out-of-sample performance tells you whether the edge is real. Only the latter matters.
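A minimal generator for rolling walk-forward windows. The window lengths are illustrative; the structure (optimize in-sample, report only out-of-sample, roll forward) is the point:

```python
def walk_forward_windows(n_bars: int, in_sample: int, out_of_sample: int):
    """Yield (train_indices, test_indices) pairs that roll forward through history."""
    start = 0
    while start + in_sample + out_of_sample <= n_bars:
        train = range(start, start + in_sample)
        test = range(start + in_sample, start + in_sample + out_of_sample)
        yield train, test
        start += out_of_sample            # roll forward by the out-of-sample length

for train, test in walk_forward_windows(n_bars=1000, in_sample=400, out_of_sample=100):
    print(f"optimize on bars {train.start}-{train.stop - 1}, report on bars {test.start}-{test.stop - 1}")
```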
Bootstrap Confidence Intervals
A point estimate (“the Sharpe ratio is 1.8”) is meaningless without a confidence interval. Bootstrap resampling provides this: resample the trade outcomes thousands of times (with replacement) and compute the metric on each resample. The distribution of resampled metrics gives you a confidence interval.
Critical nuance: block bootstrap, not naive shuffle. Trading returns are not independent — losses tend to cluster during adverse regimes. Naive resampling (shuffling individual trades) destroys this serial correlation and understates the risk of clustered losses. Block bootstrap preserves temporal dependencies by resampling blocks of consecutive trades rather than individual trades. This produces wider, more honest confidence intervals.
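A compact sketch of a circular block bootstrap on per-trade returns. The block length, resample count, and the metric (mean return) are illustrative choices:

```python
import numpy as np

def block_bootstrap_ci(returns: np.ndarray, block_len: int = 10,
                       n_resamples: int = 5000, alpha: float = 0.05, seed: int = 0):
    """Confidence interval on mean return that preserves serial correlation within blocks."""
    rng = np.random.default_rng(seed)
    n = len(returns)
    n_blocks = int(np.ceil(n / block_len))
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)) % n   # circular blocks wrap around the end
        stats[i] = returns[idx.ravel()][:n].mean()
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(1)
trade_returns = rng.normal(0.1, 1.0, size=300)   # toy per-trade returns
print(block_bootstrap_ci(trade_returns))          # [lower, upper] bounds on mean return
```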
Common Mistake
Many backtesting frameworks implement Monte Carlo simulation by randomly shuffling trade order. This is presented as “seeing all possible equity paths.” It is a well-known flawed technique for time-series data. Real-world drawdowns are caused by sequences of correlated losses during unfavorable regimes, not by unlucky random orderings of independent trades. If your confidence intervals come from naive shuffling, they are too narrow and you are underestimating risk.
Multiple Testing Correction
If you test dozens of strategies and pick the ones that “passed,” some of those passes are false positives. At a 5% significance level, testing 20 strategies produces one false positive on average, purely by chance.
Multiple testing correction (such as the Bonferroni method) adjusts significance thresholds based on how many tests were conducted. The more strategies tested, the higher the bar each individual strategy must clear. This is not optional — without it, you are systematically promoting lucky noise alongside genuine edges.
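The Bonferroni adjustment itself is one line. A sketch of the arithmetic from the paragraph above:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after correcting for the number of strategies examined."""
    return alpha / n_tests

# Testing 20 strategies at a nominal 5% level: each one must now clear 0.25%.
print(bonferroni_threshold(0.05, 20))   # 0.0025
```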
Minimum Trade Count Gates
A strategy with five out-of-sample trades and a high Sharpe ratio is not validated — it is statistically meaningless. The confidence interval will be so wide that it is consistent with both strong profitability and significant loss.
Rather than using an arbitrary minimum trade count (which varies by strategy frequency), ATLAS gates on confidence interval width. A strategy qualifies when the lower bound of its confidence interval exceeds the cost threshold — not when it hits a fixed number of trades. This naturally requires more trades for noisier strategies and fewer for consistent ones.
You Understand This When…
- You can explain walk-forward testing and why only OOS results matter
- You know the difference between naive shuffle and block bootstrap and why it matters
- You understand multiple testing correction and can explain why it is necessary
- You know why CI-width gates are superior to fixed minimum trade counts
Module 4
4.1 Discovery Prompt Design
The quality of the system’s output depends entirely on the quality of the prompts it receives. The discovery prompt is not “what should I trade?” — it is an explicit mandate to surprise, disagree, and find what everyone else is missing.
Designing for Novelty
The natural tendency of an LLM given market data is to produce safe, consensus analysis. “BTC is in an uptrend. Support at X.” This is worthless. If it is obvious to the model, it is obvious to everyone, and it is already in the price.
The discovery prompt is structured to explicitly counteract this tendency:
- Mandate to disagree. Each model is told that contradicting the other models is rewarded, not punished. Agreement scores nothing. A unique finding that turns out to be correct is the highest-value output.
- Cross-tradition thinking. “What would a commodities trader notice? A quantitative researcher? An options market maker? Set aside the existing framework and look with fresh eyes.”
- Concrete requirements. Every hypothesis must include a precise entry signal, a precise exit signal, the mechanical reason it should work, and historical examples the model can point to in the data.
- Leaderboard context. Each model sees the current standings and knows that safe, consensus observations score zero. The incentive structure rewards originality.
Key Insight
A hypothesis that all four models propose is worth nothing. A hypothesis that only one model finds, that survives the verification harness, is worth everything. The prompt must make this incentive explicit. The models are not collaborators — they are competitors whose disagreements are the most valuable signal the system produces.
You Understand This When…
- You know why consensus-seeking prompts produce worthless output
- You can explain how the discovery prompt incentivizes novelty over safety
- You understand why concrete entry/exit requirements prevent hand-waving
4.2 Multi-Source Research
The models generate hypotheses from data analysis. But the richest source of trading ideas is not data — it is the accumulated wisdom of profitable traders, encoded in videos, social channels, and published research. The system extracts and tests this automatically.
From Unstructured Wisdom to Testable Rules
A profitable trader explains their approach in a two-hour video. The methodology is real, but it is buried in context, examples, and narrative. The challenge is converting this unstructured explanation into precise, mechanical rules that a backtesting engine can evaluate.
The pipeline:
- Automated transcript extraction — Headless browser automation retrieves full transcripts from video content
- Rule parsing via LLM — A model extracts: instruments traded, timeframes, session preferences, entry conditions, exit conditions, and filters
- Ambiguity handling — When the source is ambiguous, produce ranked candidate interpretations rather than guessing. The difference between two interpretations might itself be the edge.
- Hypothesis registration — Parsed rules enter the hypothesis registry for automated testing
Design Principle
Keep each trader’s methodology distinct. Never merge approaches until each is independently verified. Two traders may use similar concepts but the difference between their implementations — a session filter here, a different exit rule there — might be the profitable part. Premature deduplication destroys information.
You Understand This When…
- You know how unstructured trading wisdom is converted into testable rules
- You understand why ambiguity should produce candidates, not guesses
- You know why distinct methodologies must be tested independently before merging
4.3 The Hypothesis Registry
Every proposed edge — from any source, by any model — lives in a structured database. The registry is the single source of truth for what the system is investigating, trading, or has killed.
What the Registry Tracks
For each hypothesis, the registry maintains:
- The rules — Precise entry conditions, exit conditions, timeframes, and target markets
- Provenance — Which model proposed it, when, based on what data
- Multi-model confidence — Each model’s assessment, updated after every ensemble session. Not averaged — preserved individually so disagreements are visible.
- Performance history — Paper and live results: trade count, win rate, expected value, risk-adjusted return, confidence intervals. Broken down by market regime.
- Lifecycle status — Where the hypothesis sits in the lifecycle (observing, paper trading, live eligible, live, suspended, killed)
- Kill reason — If dead, why. What the data showed. What the failure conditions were.
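A sketch of what a registry record could look like. The schema is an illustrative assumption; the essential property is that per-model confidence and failure conditions are stored individually, never collapsed:

```python
from dataclasses import dataclass, field

@dataclass
class HypothesisRecord:
    hypothesis_id: str
    rules: dict                                              # entry/exit conditions, timeframes, markets
    provenance: dict                                         # which model proposed it, when, on what data
    model_confidence: dict = field(default_factory=dict)     # per-model assessments, never averaged
    performance: dict = field(default_factory=dict)          # paper/live stats, broken down by regime
    status: str = "observing"                                # observing | paper | live_eligible | live | suspended | killed
    kill_reason: str | None = None
    anti_patterns: list = field(default_factory=list)        # failure conditions from loss attribution
```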
The Anti-Pattern Library
After every losing trade, the system runs loss attribution: what data was available at the time of entry that, in hindsight, predicted the loss? These failure conditions are recorded per hypothesis, building a growing library of anti-patterns.
Over time, this library becomes as valuable as the strategy rules themselves. The system does not just learn what works — it learns what doesn’t work, and under which conditions. This is the other half of intelligence that most systems ignore entirely.
From Experience
The pass rate from hypothesis proposal to validated paper trading is low — roughly one in ten. This was initially surprising. It means the system generates far more failed ideas than successful ones. But this is exactly the expected base rate for systematic strategy research. The academic literature on quantitative alpha discovery consistently shows that the vast majority of tested ideas do not survive rigorous validation. A system where most ideas succeed is not rigorous enough.
You Understand This When…
- You know what the hypothesis registry tracks and why multi-model confidence is preserved individually
- You understand loss attribution and the anti-pattern library concept
- You accept that a low pass rate is a sign of rigorous filtering, not system failure
4.4 Testing at Scale
ATLAS tests hypotheses through two parallel paths: LLM-generated strategy code and composition from validated building blocks. Running both paths on the same hypothesis provides a built-in quality control mechanism.
Two Paths, One Goal
| | Path A: LLM Code Generation | Path B: Building Block Composition |
|---|---|---|
| How it works | LLM writes complete strategy code from the hypothesis description | System maps hypothesis to pre-validated entry/exit/filter components |
| Strength | Creative. Can implement novel logic the building blocks don’t cover. | Reliable. Components are individually tested. Fewer bugs. |
| Weakness | Code may have bugs. May misinterpret the hypothesis. | Limited to what existing components can express. |
| Speed | Slower (LLM call + code review) | Fast (configuration, not code generation) |
When both paths test the same hypothesis, the results can be compared. If they agree, confidence increases. If they disagree, the discrepancy reveals either a bug in the generated code or a limitation in the building blocks — both are valuable information.
Design Principle
The dual-path approach is not about choosing the “better” path. Path A is more creative (it can express ideas the building blocks cannot). Path B is more reliable (fewer implementation bugs). Together they provide independent verification. The low agreement rate between paths on novel hypotheses confirms that each path brings genuinely different perspectives — which is exactly the point.
You Understand This When…
- You can explain the two testing paths and why both exist
- You know what it means when the paths agree vs disagree
- You understand why low agreement rate on novel hypotheses is actually a positive signal
Module 5
5.1 The Breadth-First Approach
Test everything on everything. Don’t pre-filter by intuition. Let the data tell you what works.
The natural temptation is to pick a strategy you believe in and test it on the instrument you are most familiar with. This introduces selection bias before you have any data. The better approach: test dozens of strategies across dozens of instruments and let the results determine where to focus.
When you run a strategy across many instruments simultaneously, patterns emerge that you would never see in a single-instrument test:
- Some strategies work on an entire asset class but fail on another
- Some instruments support multiple strategy types; others support none
- The best instrument for a given strategy is often not the one you would have guessed
Breadth first, depth on the survivors. Let the matrix of results guide your focus.
You Understand This When…
- You know why intuition-based pre-filtering introduces selection bias
- You understand the breadth-first, depth-on-survivors approach
5.2 The Multi-Timeframe Insight
This is the single most important architectural finding. The same trading concept, applied on a single timeframe, produces a handful of trades. Applied across multiple timeframes simultaneously, it produces an order of magnitude more — and the edge survives.
Why Single-Timeframe Fails
Most published strategies operate on one timeframe. “When RSI crosses below 30 on the daily chart, buy.” This produces a testable signal, but it only fires when conditions align on that one timeframe. The resulting trade count is often too low for statistical validation, and the strategy misses setups that are valid on adjacent timeframes.
The Multi-Timeframe State Machine
Profitable traders do not operate on one timeframe. They maintain a mental model across multiple timeframes simultaneously (as described in Module 0’s five-layer model). The multi-timeframe architecture replicates this:
- Higher timeframes establish bias and context (trend direction, key levels, regime)
- Medium timeframes identify setups (pullbacks, pattern formations, zone entries)
- Lower timeframes provide entry triggers and exit management
When all three layers align, a trade fires. When they don’t, the system waits. This produces dramatically more trades than single-timeframe approaches because the alignment can occur across many different timeframe combinations.
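In code, the alignment requirement is a conjunction across layers. A hypothetical sketch with toy fields, not the production signal logic:

```python
def aligned_long_signal(htf: dict, mtf: dict, ltf: dict) -> bool:
    """Fire only when bias, setup, and trigger agree across the timeframe layers."""
    bias_ok    = htf["trend"] == "up" and htf["price"] > htf["key_level"]   # higher-timeframe bias
    setup_ok   = mtf["pullback_complete"]                                   # medium-timeframe setup
    trigger_ok = ltf["breakout_confirmed"]                                  # lower-timeframe trigger
    return bias_ok and setup_ok and trigger_ok

print(aligned_long_signal(
    {"trend": "up", "price": 105.0, "key_level": 100.0},
    {"pullback_complete": True},
    {"breakout_confirmed": True},
))  # True: all three layers agree, so a trade fires
```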
Key Insight
The trade count difference between single-timeframe and multi-timeframe implementations of the same concept is not incremental — it is an order of magnitude. And the multi-timeframe version tends to produce positive risk-adjusted returns where the single-timeframe version does not. The additional context from multiple timeframes acts as a filter, removing the false signals that make single-timeframe approaches marginal after costs.
You Understand This When…
- You can explain why the same concept produces dramatically more trades across multiple timeframes
- You understand the bias/setup/trigger framework across timeframe layers
- You know why multi-timeframe approaches tend to survive costs where single-timeframe versions fail
5.3 Instrument Selection Is the Edge
When you test the same strategies across many instruments, the pattern is overwhelming: some instrument classes consistently support edges after costs. Others consistently destroy them. The instrument matters more than the strategy.
This finding was counterintuitive. We expected strategy quality to be the primary driver of results. Instead, the cost-to-move ratio (introduced in Module 1.3) dominates. A strategy that is marginally positive on a high-cost instrument becomes clearly profitable on a low-cost one — and vice versa.
The implication: choose your instruments first, then find strategies that work on them. Most people do it the other way around — they develop a strategy and then look for an instrument to trade it on. This is backwards. The instrument determines the cost environment; the cost environment determines which edges can survive.
You Understand This When…
- You accept that instrument selection is more important than strategy selection
- You know why the cost-to-move ratio is the explanatory variable
- You would choose instruments first and strategies second
5.4 The Kill Rate
Most strategies die. The graveyard is larger than the live portfolio. This is the process working correctly.
When the system first started producing results, the kill rate was discouraging. The vast majority of proposed hypotheses — whether from LLM discovery, video transcript extraction, or seeded ideas — failed the verification harness. Some failed spectacularly. Some failed boringly. A few showed promise and then died in walk-forward testing.
This is exactly what should happen. A system where most ideas survive is not rigorous enough. The academic literature on quantitative strategy research consistently shows that the base rate for genuine, tradeable alpha from systematic testing is low — somewhere around one in ten proposals at best.
The kill reasons are themselves informative:
- Some edges are structurally fragile — they depend on a single market condition that may not recur
- Some produce phantom edges from directional bias — they look profitable because the underlying asset went up during the test period, not because the entry logic was correct
- Some die on costs — the edge is real but too small to survive transaction costs on the available instruments
- Some are overfitted — they work beautifully in-sample and fail immediately out-of-sample
You Understand This When…
- You see a high kill rate as evidence of rigorous filtering
- You know the common kill reasons and what each reveals about the hypothesis
5.5 Lessons from Failure
The killed strategies teach as much as the surviving ones. Here are the generalized patterns from the graveyard.
Generalized Failure Patterns
Directional Bias Masquerading as Edge
Some instruments have strong long-term drift in one direction. A strategy that is net long on such an instrument will appear profitable regardless of signal quality. The test: run the same strategy in the opposite direction. If it also works, the edge is real. If it fails, you were just riding the drift.
Session Timing Is Everything
The same strategy applied at different times of day produces wildly different results. Session boundaries (when major financial centers open and close) create predictable liquidity and volatility patterns. A strategy that works during one session may be catastrophic during another. This is not noise — it is a structural feature of markets.
Certain Directions on Certain Instruments Are Structurally Toxic
Some instrument-direction combinations consistently produce losses across all strategies tested. Not “most strategies lose” — all strategies lose. When you find this pattern across hundreds of trades and dozens of approaches, it is a structural feature of that market, not bad luck. Respect it and stop trying.
Universal Patterns That Are Not Tradeable
Some market phenomena are statistically real but not tradeable as standalone strategies. They recur consistently, but the edge is too small or too infrequent to survive costs. These become filters or overlays — they add value when combined with other signals but cannot justify a position on their own.
You Understand This When…
- You can identify directional bias masquerading as edge
- You understand why session timing creates structural effects
- You know that some instrument-direction combinations are universally toxic
- You distinguish between statistically real phenomena and tradeable strategies
Previous: 5.4 The Kill Rate
Next: 6.1 Scoring What Matters
Module 6
6.1 Scoring What Matters
The leaderboard creates the incentive structure. What you measure determines what the models optimize for. Score the wrong things and you get noise. Score the right things and you get genuine alpha discovery.
The scoring system rewards two things above all else: finding unique edges and correctly identifying when an edge is degrading. Both are hard. Both are valuable.
- A hypothesis that clears the statistical gate earns points. A hypothesis that goes live and remains profitable earns more.
- A degradation flag raised before a strategy starts losing money earns significant points. This is harder than finding a new edge — it requires seeing the early signs of decay.
- A contradiction that is later proven correct earns points. Being a contrarian who turns out to be right is the highest-skill output.
- Noise is mildly penalized. Wrong contrarianism is significantly penalized. This prevents models from gaming the system with volume or reflexive disagreement.
The incentive structure is asymmetric by design: the reward for being uniquely right is much larger than the penalty for being uniquely wrong. This encourages risk-taking in hypothesis generation, which is where the value is.
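A sketch of what such an asymmetric scorebook might look like; the event names and point values below are placeholders chosen to show the shape of the incentives, not the actual ATLAS weights:

```python
# Hypothetical point values; they only illustrate the asymmetry described above.
SCOREBOOK = {
    "hypothesis_validated": 5,         # cleared the statistical gate
    "hypothesis_live_profitable": 15,  # went live and stayed profitable
    "degradation_flag_correct": 20,    # called decay before the losses arrived
    "contradiction_correct": 25,       # contrarian and later proven right
    "noise": -2,                       # mild penalty for low-value output
    "contradiction_wrong": -10,        # significant penalty for reflexive disagreement
}

def score_session(events: list[str]) -> int:
    """Sum one model's scored events for a single ensemble session."""
    return sum(SCOREBOOK[e] for e in events)

print(score_session(["hypothesis_validated", "contradiction_correct", "noise"]))  # 28
```

Note that being uniquely right (+25) outweighs being uniquely wrong (-10), which is the asymmetry doing its work.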
You Understand This When…
- You can explain why the scoring system rewards unique findings over consensus
- You know why degradation detection is scored highly
- You understand the asymmetric incentive structure and why it encourages productive risk-taking
Previous: 5.5 Lessons from Failure
6.2 Calibration Over Time
Knowing a model is confident is not useful. Knowing that this model’s confidence, in this type of market condition, historically correlates with correct outcomes — that is useful.
Every model expresses confidence ratings on hypotheses and market assessments. The system tracks these ratings against actual outcomes, building a rolling calibration profile for each model:
- When Model A says “high confidence” on macro calls, is it right more often than when it says “medium”?
- Is Model B well-calibrated on crypto but consistently overconfident on FX?
- Does Model C’s accuracy vary by regime — sharp in trending markets, unreliable in ranges?
Over time, these calibration profiles become a weighting function. When the system needs a quick consultation during a live signal — “should we take this trade?” — it knows which model to ask based on the asset class, market regime, and historical calibration.
This is a compounding advantage. A new system with the same models starts with equal weighting. A system with months of calibration data knows who to trust about what.
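A minimal sketch of the rolling calibration profile, assuming every resolved prediction is logged with the model, asset class, confidence label, and outcome; the window size and minimum-sample cutoff are illustrative choices:

```python
from collections import defaultdict, deque

class CalibrationTracker:
    """Rolling hit rate per (model, asset_class, confidence) bucket."""

    def __init__(self, window: int = 200):
        self.records = defaultdict(lambda: deque(maxlen=window))

    def record(self, model: str, asset_class: str, confidence: str, correct: bool) -> None:
        self.records[(model, asset_class, confidence)].append(correct)

    def hit_rate(self, model: str, asset_class: str, confidence: str) -> float | None:
        bucket = self.records[(model, asset_class, confidence)]
        if len(bucket) < 20:  # not enough resolved predictions to trust yet
            return None
        return sum(bucket) / len(bucket)

tracker = CalibrationTracker()
for i in range(50):
    tracker.record("model_a", "crypto", "high", correct=(i % 3 != 0))  # ~66% right
print(tracker.hit_rate("model_a", "crypto", "high"))
```

The hit rates by bucket become the weighting function: when the live signal arrives, the system consults the model whose bucket for that asset class and confidence level has the best track record.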
You Understand This When…
- You understand the difference between raw confidence and calibrated confidence
- You know why calibration varies by model, asset class, and regime
- You see calibration data as a compounding advantage
Previous: 6.1 Scoring What Matters
6.3 The Sequential Ensemble
The models don’t analyze in parallel. They analyze in sequence, each reading the previous models’ output. This creates a structured debate, not redundant independent analysis.
Why Order Matters
In a parallel ensemble, each model sees the same input and produces independent output. The outputs are then combined. This produces four independent views, which is useful but misses the value of interaction.
In a sequential ensemble, each model sees the data and what previous models said about it. This changes the dynamic entirely:
- The second model can agree, disagree, or build on the first model’s analysis
- The third model sees two prior perspectives and can identify where they agree, where they conflict, and what both missed
- The orchestrator sees all prior analysis and synthesizes — resolving conflicts, flagging unresolved disagreements, and determining what is actionable
This mimics how a well-run investment team works: analyst presents, second analyst challenges, macro strategist provides context, portfolio manager synthesizes and decides.
Mandatory Disagreement
Each model is required to produce two mandatory sections in every analysis: explicit contradictions of other models’ claims, and degradation flags for strategies the model believes are losing their edge. These sections cannot be omitted. Saying “I agree with everything” is technically possible but scores zero.
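A sketch of the sequential loop under stated assumptions: `call_model` is a placeholder for your LLM client, the model roles are illustrative names, and the mandatory-section check is reduced to two keys:

```python
# Assumed stand-in for an actual LLM call; returns a dict with the analysis
# plus the two mandatory sections.
def call_model(model: str, market_data: dict, prior_analyses: list[dict]) -> dict:
    raise NotImplementedError("wire up your LLM client here")

MODELS = ["analyst_1", "analyst_2", "macro_strategist", "orchestrator"]
REQUIRED_SECTIONS = ("contradictions", "degradation_flags")

def run_sequential_ensemble(market_data: dict) -> list[dict]:
    analyses: list[dict] = []
    for model in MODELS:
        # Each model sees the data AND everything said before it.
        output = call_model(model, market_data, prior_analyses=analyses)
        missing = [s for s in REQUIRED_SECTIONS if s not in output]
        if missing:
            raise ValueError(f"{model} omitted mandatory sections: {missing}")
        analyses.append(output)
    return analyses
```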
You Understand This When…
- You know why sequential analysis creates richer output than parallel analysis
- You understand how mandatory disagreement sections prevent groupthink
- You can trace the flow from first analyst through to orchestrator synthesis
Previous: 6.2 Calibration Over Time
Next: 6.4 Fresh Eyes
6.4 Fresh Eyes
Periodically, one model gets raw data with zero context. No hypothesis registry. No existing framework. No prior analysis. Just data. The value of deliberate naivety in a system that builds up strong priors.
Any system that accumulates knowledge develops anchoring. The models learn the existing framework, the current hypotheses, and the historical patterns. This is valuable — but it can also create blind spots. The models start seeing what they expect to see.
The fresh eyes session counteracts this. A model given raw data with no context is forced to analyze from first principles. It cannot anchor on existing hypotheses because it does not know they exist. It cannot conform to the current framework because it has not seen it.
The most valuable output from fresh eyes sessions is often not a new hypothesis — it is a challenge to an existing assumption. “Why are you treating this as a mean-reverting market? The data suggests a regime change that your framework has not recognized.”
Design Principle
Systems that only accumulate knowledge become rigid. Periodically introducing deliberate naivety — forcing a reset to first-principles analysis — keeps the system flexible. The cost is one session of potentially redundant analysis. The benefit is catching framework errors that would otherwise compound unnoticed.
You Understand This When…
- You understand why accumulated knowledge creates anchoring
- You know the value of periodic first-principles analysis
- You see fresh eyes sessions as a systematic defense against framework rigidity
Previous: 6.3 The Sequential Ensemble
Next: 7.1 Position Sizing
Module 7
7.1 Position Sizing
Position sizing determines whether a strategy that works in theory survives in practice. Get it wrong and even a genuine edge will destroy your account. Get it right and a modest edge compounds into significant returns.
The Kelly Criterion
The Kelly criterion provides the mathematically optimal fraction of your bankroll to risk on each bet, given your edge and odds. In its simplest form:
f* = (p × b − q) / b
Where p is win probability, q is loss probability (1 − p), and b is the ratio of average win to average loss. ATLAS uses a conservative fraction of the Kelly amount — typically half — because full Kelly produces equity curves with drawdowns that are psychologically and practically unsustainable.
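A worked version of the formula, with the half-Kelly haircut applied; the 40% win rate and 2:1 payoff below are illustrative numbers, not ATLAS parameters:

```python
def kelly_fraction(win_prob: float, win_loss_ratio: float) -> float:
    """f* = (p*b - q) / b, where q = 1 - p and b = avg win / avg loss."""
    q = 1.0 - win_prob
    return (win_prob * win_loss_ratio - q) / win_loss_ratio

p, b = 0.40, 2.0                   # illustrative: 40% winners, 2:1 payoff
full_kelly = kelly_fraction(p, b)  # 0.10 -> risk 10% of equity per trade
half_kelly = 0.5 * full_kelly      # the conservative fraction actually used
print(full_kelly, half_kelly)      # 0.1 0.05
```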
What Drives Sizing
Position size in ATLAS is not fixed. It is determined by:
- Statistical confidence in the edge. A hypothesis with a narrow confidence interval and many trades gets sized larger than one with a wide interval and few trades.
- Current portfolio exposure. Before any trade, the system checks aggregate directional exposure. If multiple strategies are all pointing the same direction on correlated instruments, that is one concentrated bet, not diversification. Sizing is reduced when exposure is concentrated.
Common Error
Many introductory texts state that risking 1% per trade means you can survive 100 consecutive losses before ruin. This is mathematically wrong under fractional (fixed-percentage) sizing. After N consecutive losses, equity is (1 − r)^N of the starting value. At 1% risk per trade, after 100 consecutive losses you retain about 36.6% of equity — not zero. The correct framing is the probability of reaching a specific drawdown threshold, given the strategy’s win rate and reward-to-risk ratio. If you see a “100 losses to ruin” claim, the author does not understand compounding.
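A quick check of that arithmetic, plus the correct framing as a drawdown probability estimated by simulation; the win rate, payoff, and trade count are illustrative:

```python
import random

# Equity after N consecutive losses at fixed fractional risk r.
r, N = 0.01, 100
print((1 - r) ** N)  # ~0.366, not zero

# Correct framing: probability of ever hitting a 20% drawdown over 500 trades,
# for an illustrative 40% win-rate, 2:1 payoff strategy, estimated by Monte Carlo.
def hit_drawdown(win_prob=0.40, payoff=2.0, risk=0.01, trades=500, dd_limit=0.20) -> bool:
    equity, peak = 1.0, 1.0
    for _ in range(trades):
        stake = risk * equity
        equity += stake * payoff if random.random() < win_prob else -stake
        peak = max(peak, equity)
        if equity / peak <= 1 - dd_limit:
            return True
    return False

random.seed(0)
runs = 2000
print(sum(hit_drawdown() for _ in range(runs)) / runs)
```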
You Understand This When…
- You can state the Kelly criterion and explain why ATLAS uses a conservative fraction
- You know why sizing depends on statistical confidence and portfolio correlation
- You can identify the “100 losses to ruin” error and explain the correct compounding math
Previous: 6.4 Fresh Eyes
7.2 Exchange-Side Execution
Your trading bot will crash. Your server will lose connectivity. Your code will have bugs. The question is not whether this will happen — it is whether your open positions survive when it does.
Why Exchange-Side Orders Are Non-Negotiable
A stop-loss managed by your software (“if price reaches X, send a market sell order”) fails when your software is not running. A stop-loss placed as an exchange-side order (“the exchange will close this position at X regardless of whether my bot is connected”) works even if your entire infrastructure is offline.
The same applies to take-profit targets. Both must be exchange-side orders, placed immediately upon entry, and confirmed by the exchange. Software-side risk management is a supplement, not a replacement.
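A sketch of the entry sequence against a hypothetical exchange client; the method names and parameters below are assumptions standing in for your venue's SDK, not a real library's API:

```python
from dataclasses import dataclass
from typing import Protocol

# Hypothetical client interface; substitute your exchange SDK.
@dataclass
class Order:
    order_id: str
    status: str

class ExchangeClient(Protocol):
    def create_order(self, symbol: str, side: str, qty: float, order_type: str,
                     trigger_price: float | None = None,
                     trigger_source: str = "mark") -> Order: ...
    def order_status(self, order_id: str) -> str: ...

def enter_with_protection(ex: ExchangeClient, symbol: str, qty: float,
                          stop_price: float, target_price: float) -> None:
    """Entry is not complete until both protective orders sit on the exchange."""
    ex.create_order(symbol, "buy", qty, order_type="market")
    stop = ex.create_order(symbol, "sell", qty, order_type="stop_market",
                           trigger_price=stop_price, trigger_source="mark")
    target = ex.create_order(symbol, "sell", qty, order_type="take_profit",
                             trigger_price=target_price)
    for order in (stop, target):
        if ex.order_status(order.order_id) != "accepted":
            raise RuntimeError(f"order {order.order_id} not confirmed: flatten and retry")
```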
Mark Price vs Last Price
Exchanges offer two trigger types for stop-loss orders:
- Last price: Triggers based on the most recent trade on this exchange. Can be manipulated by a single large order creating a wick.
- Mark price: Triggers based on a composite price derived from multiple exchanges. Much harder to manipulate, but may not reflect the actual price on your specific venue during extreme conditions.
Exchanges use mark price for liquidation calculations specifically to prevent manipulation. Your stop-loss trigger type should be a deliberate decision based on the instrument and venue, not a default you never examined.
Key Insight
Exchange-side orders eliminate the “fill ambiguity” problem that plagues backtesting. In simulation, you must make assumptions about intrabar execution order. In live trading with exchange-side orders, the exchange resolves the ambiguity for you in real time, tick by tick. The backtest is a filter; the exchange is the truth.
You Understand This When…
- You know why exchange-side orders are mandatory, not optional
- You understand the mark price vs last price distinction and when each is appropriate
- You see how exchange-side execution eliminates the backtest fill ambiguity problem
Previous: 7.1 Position Sizing
7.3 The Paper-to-Live Bridge
A strategy that survives backtesting has proven the concept. Paper trading proves the system — that the data pipeline, signal generation, order management, and exit logic all work together in real time. Live deployment requires both.
Why Paper Trading Is Mandatory
Backtests run on historical data with a simulated broker. Paper trading runs on live data with simulated execution. The difference is critical:
- Data ingestion bugs only appear with live data (delayed feeds, format changes, connection drops)
- Signal timing issues only appear in real time (race conditions, stale cache, timezone confusion)
- Exit logic edge cases only appear in production (session boundaries, overnight holds, holiday schedules)
Paper trading finds these bugs at zero cost. Skipping paper trading finds them with real money.
The Graduation Gate
Strategies graduate from paper to live when they meet statistical gates that require evidence, not intuition:
- Sufficient trade count for meaningful confidence intervals
- Paper performance consistent with backtest expectations (within a tolerance band)
- Robustness across the market regimes encountered during paper trading
- Explicit human approval as the final gate
The human gate is deliberate. The system can identify candidates autonomously, but deploying real capital is a decision with consequences that justify human confirmation — at least until the system has built a sufficient track record to justify full autonomy.
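A sketch of the gate as an explicit checklist, assuming paper results are summarized into a few fields; the thresholds are placeholders, not the actual graduation criteria:

```python
from dataclasses import dataclass

@dataclass
class PaperResults:
    trade_count: int
    paper_expectancy: float     # mean return per trade in paper trading
    backtest_expectancy: float  # mean return per trade the backtest predicted
    regimes_seen: set[str]      # e.g. {"trend", "range", "high_vol"}

def graduation_checks(r: PaperResults, human_approved: bool,
                      min_trades: int = 50, tolerance: float = 0.5,
                      required_regimes: frozenset[str] = frozenset({"trend", "range"})) -> dict[str, bool]:
    """Every check must pass; the human approval flag is the final gate."""
    consistent = abs(r.paper_expectancy - r.backtest_expectancy) <= tolerance * abs(r.backtest_expectancy)
    return {
        "enough_trades": r.trade_count >= min_trades,
        "consistent_with_backtest": consistent,
        "regime_coverage": required_regimes <= r.regimes_seen,
        "human_approval": human_approved,
    }

results = PaperResults(trade_count=64, paper_expectancy=0.18,
                       backtest_expectancy=0.22, regimes_seen={"trend", "range", "high_vol"})
checks = graduation_checks(results, human_approved=True)
print(checks, "-> graduate" if all(checks.values()) else "-> stay on paper")
```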
You Understand This When…
- You know why paper trading catches bugs that backtesting cannot
- You can describe the graduation gate criteria
- You understand why human approval is the final gate during the trust-building phase
Previous: 7.2 Exchange-Side Execution
Next: 7.4 Drawdown Philosophy
7.4 Drawdown Philosophy
The system has hard stops for catastrophic drawdowns. But it does not auto-suspend strategies on losing streaks. This is a deliberate, experience-driven decision.
Why Auto-Suspend Fails
Many automated systems include a rule: “if a strategy loses N trades in a row, suspend it.” This sounds prudent. It is actually destructive for a large class of validated strategies.
Many profitable trading approaches have low win rates — sometimes well below 50%. They are profitable because their winning trades are significantly larger than their losing trades. A strategy with a 30% win rate and 3:1 reward-to-risk will routinely produce five, seven, even ten consecutive losses as normal operation. Auto-suspending after five losses guarantees you will always cut the strategy before its next large winner arrives.
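A quick simulation of how routine those streaks are, using the 30% win rate from the example; the 200-trade horizon and streak length are illustrative:

```python
import random

def longest_losing_streak(win_prob: float, trades: int) -> int:
    streak = worst = 0
    for _ in range(trades):
        if random.random() < win_prob:
            streak = 0
        else:
            streak += 1
            worst = max(worst, streak)
    return worst

random.seed(1)
runs = 5000
streaks = [longest_losing_streak(win_prob=0.30, trades=200) for _ in range(runs)]
# Share of 200-trade runs that contain at least one 7-loss streak.
print(sum(s >= 7 for s in streaks) / runs)
```

For a 30% win-rate strategy, nearly every 200-trade run contains a seven-loss streak; an auto-suspend rule would cut it again and again for behaving exactly as expected.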
Design Principle
The system reports performance data. The human decides when to kill a strategy. No auto-suspension on losing streaks. No auto-kill on drawdown thresholds. The system accumulates evidence and presents it clearly; the human applies judgment. A strategy needs a minimum number of trades before its performance can be meaningfully evaluated. Cutting it short because the first few were losers is statistically illiterate.
What Does Get Stopped
While losing streaks don’t trigger suspension, some conditions do warrant automatic intervention:
- System malfunction — Zero trades generated, execution errors, data feed failures. These are infrastructure problems, not strategy problems.
- Account-level circuit breaker — If the total account drawdown exceeds a hard threshold, all trading pauses and the operator is alerted. This is a catastrophic-event safety net, not a strategy management tool.
You Understand This When…
- You can explain why auto-suspend on losing streaks is destructive for low-win-rate strategies
- You know the difference between strategy performance management and catastrophic risk management
- You understand that the human decides when to kill — the system provides data, not judgment
Previous: 7.3 The Paper-to-Live Bridge
Module 8
8.1 Containerized Architecture
ATLAS runs as a set of containerized services: database, cache, application, and monitoring. Each can fail, restart, and scale independently. This separation is not over-engineering — it is the minimum viable reliability for a 24/7 system managing real money.
Why Containers
- Fault isolation. A crash in the monitoring stack does not take down the trading engine. A database restart does not crash the application — the application reconnects.
- Reproducible environments. The exact same environment runs in development, testing, and production. No “works on my machine” surprises.
- Resource limits. Each service has explicit CPU and memory constraints. A runaway backtest cannot starve the live trading engine of resources.
- Independent updates. You can update the monitoring dashboard without restarting the trading engine. Database upgrades do not require rebuilding the application.
Key Design Choices
- Shared data layer. All services read from the same time-series database. This prevents data divergence — there is one source of truth for candles, trades, and hypotheses.
- Health checks. Services wait for their dependencies to be healthy before starting. The application does not launch until the database is accepting connections.
- Persistent volumes. Database data and cache state survive container restarts. A docker restart does not lose your trade history.
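The health-check idea above, sketched at the application level with nothing but the standard library; the service names and ports are illustrative:

```python
import socket
import time

def wait_for(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 2.0) -> None:
    """Block until a TCP connection to host:port succeeds, or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2.0):
                return
        except OSError:
            time.sleep(interval_s)
    raise TimeoutError(f"{host}:{port} not reachable after {timeout_s:.0f}s")

# Illustrative service names and ports from a typical container network.
for service, port in (("database", 5432), ("cache", 6379)):
    wait_for(service, port)
```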
You Understand This When…
- You know why containerization provides fault isolation for a trading system
- You understand the shared data layer / independent services architecture
- You know why health checks and persistent volumes are mandatory, not optional
Previous: 7.4 Drawdown Philosophy
8.2 Data Ingestion Patterns
Data ingestion runs continuously. Multiple feeds, multiple exchanges, multiple asset classes. The patterns that make this reliable are not glamorous, but getting them wrong corrupts everything downstream.
The Subtle Bugs
The most dangerous data bugs are not crashes or connection failures — those are loud and obvious. The dangerous ones are silent:
- A feed fails silently — The connection stays open but no new data arrives. Strategies scan stale candles and may generate signals based on outdated information.
- A schema conflict — An internal naming collision causes the database layer to reject writes without raising an obvious error. The application runs normally but no new data is stored.
- A backfill gap — The pagination logic miscounts or skips a page, leaving a hole in the historical data. Indicators calculated across the gap produce incorrect values.
Each of these has happened. Each was fixed. The lesson is always the same: your ingestion layer needs active health monitoring that measures data freshness, not just connection status.
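A sketch of a freshness check, assuming each feed can report the timestamp of its newest stored candle; the per-timeframe limits are illustrative:

```python
from datetime import datetime, timedelta, timezone

# Maximum tolerated age of the newest candle, per feed timeframe (illustrative).
FRESHNESS_LIMITS = {
    "1m": timedelta(minutes=3),
    "1h": timedelta(hours=2),
    "1d": timedelta(days=2),
}

def stale_feeds(latest_candle_ts: dict[tuple[str, str], datetime],
                now: datetime | None = None) -> list[tuple[str, str]]:
    """Return (feed, timeframe) pairs whose newest candle is older than allowed.

    A feed with an open connection but no new rows shows up here, which is
    exactly the failure mode a connection-status check misses.
    """
    now = now or datetime.now(timezone.utc)
    return [key for key, ts in latest_candle_ts.items()
            if now - ts > FRESHNESS_LIMITS[key[1]]]

now = datetime.now(timezone.utc)
print(stale_feeds({
    ("exchange_a", "1m"): now - timedelta(minutes=1),   # healthy
    ("exchange_b", "1m"): now - timedelta(minutes=45),  # silent failure
}))
```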
You Understand This When…
- You know why silent data failures are more dangerous than crashes
- You understand why data freshness monitoring is essential
Previous: 8.1 Containerized Architecture
Next: 8.3 State Management
8.3 State Management
A trading system has two kinds of state: fast-changing state that needs sub-second access (current positions, scanner progress), and durable state that must survive restarts (trade history, hypothesis data). You need both, and they must stay in sync.
Dual Persistence
- Fast cache for real-time state: current scanner positions, pending signal evaluations, session tracking. This needs to be fast (milliseconds), small (kilobytes), and expendable (can be rebuilt from durable state if lost).
- Durable database for everything else: candle data, trade records, hypothesis registry, model logs. This is the system of record. It must survive container restarts, server reboots, and disk failures.
The cache is configured for persistence across normal restarts. If the cache is lost (rare), the system rebuilds it from the database on startup. This means a clean restart recovers the exact state from before shutdown, with a brief warm-up period.
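A sketch of the rebuild-on-startup pattern; the two store interfaces below are placeholders for your cache and database clients:

```python
from typing import Protocol

class DurableStore(Protocol):
    def open_positions(self) -> list[dict]: ...
    def active_hypotheses(self) -> list[dict]: ...

class FastCache(Protocol):
    def is_empty(self) -> bool: ...
    def load(self, key: str, rows: list[dict]) -> None: ...

def warm_up(cache: FastCache, db: DurableStore) -> None:
    """On startup, rebuild volatile state from the system of record if needed.

    The cache is expendable by design: losing it costs a warm-up pass,
    never trade history.
    """
    if cache.is_empty():
        cache.load("open_positions", db.open_positions())
        cache.load("active_hypotheses", db.active_hypotheses())
```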
You Understand This When…
- You know why dual persistence is necessary
- You understand the cache-rebuilds-from-database recovery pattern
Previous: 8.2 Data Ingestion Patterns
8.4 Monitoring & Alerting
What to measure, what to alert on, and — critically — what to ignore. The goal is not maximum visibility. It is the minimum information needed to know whether the system is healthy and performing as expected.
The Operator’s Daily Touchpoint
ATLAS produces a daily briefing delivered to the operator’s messaging platform. It contains:
- Current open positions and their P&L
- Top signals from the most recent ensemble session
- Any conflicts between models that remain unresolved
- Strategies pending approval for status changes
- System health: data freshness, service status
The briefing is designed to be readable on a phone in under two minutes. The operator does not need to log into dashboards or read logs during normal operation. If something requires attention, the briefing tells them.
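A sketch of how the briefing might be assembled; the field names and formatting below are illustrative, not the actual template:

```python
def render_briefing(positions: list[str], top_signals: list[str],
                    conflicts: list[str], pending_approvals: list[str],
                    health: str) -> str:
    """Assemble the daily briefing as a short, phone-readable message."""
    sections = [
        ("Open positions", positions),
        ("Top signals", top_signals),
        ("Unresolved conflicts", conflicts),
        ("Pending approvals", pending_approvals),
    ]
    lines = [f"{title}: {', '.join(items) if items else 'none'}" for title, items in sections]
    lines.append(f"Health: {health}")
    return "\n".join(lines)

print(render_briefing(["BTC long +1.2%"], ["EURUSD breakout"], [], ["strat_17 -> live"],
                      "all feeds fresh, all services up"))
```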
What Gets Measured
- Data freshness per source — How old is the most recent candle from each feed?
- Service uptime — Are all containers running and healthy?
- Strategy performance vs expectations — Is each strategy tracking within tolerance of its backtest expectations?
- Model output quality — Are the models producing structured, parseable output? (LLMs can degrade in subtle ways.)
Design Principle
Monitor with minimum effective dose. An operator who receives 50 alerts a day ignores all of them. An operator who receives one alert a week reads it carefully. Design your monitoring to surface only what changes future decisions. Everything else is noise.
You Understand This When…
- You know what the daily briefing contains and why it is designed for two-minute consumption
- You understand the minimum effective dose principle for monitoring
- You can distinguish between metrics that change decisions and metrics that are noise
Previous: 8.3 State Management
Module 9
9.1 Loss Attribution as a Feature
Most systems learn from their wins. ATLAS learns equally from its losses. Every losing trade triggers an automatic analysis: what data was available at entry that, in hindsight, predicted the failure?
After every losing trade, the system examines the state at the time of entry and asks: what was different about this trade compared to the winners? Was there a regime signal the strategy did not check? A session condition that correlates with losses? A cross-asset indicator that was flashing a warning?
The answers accumulate into an anti-pattern library for each hypothesis. Over time, the library becomes a precision filter: the system knows not just when to trade, but when not to trade — and the “when not to” conditions are derived from real losses, not theoretical edge cases.
This is the other half of intelligence. Pattern discovery finds what works. Loss attribution finds what kills. A system that does both learns from every trade, not only the winning ones.
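A minimal sketch of the attribution step, assuming each trade is stored with boolean tags describing the market state at entry; the tag names and the simple rate comparison stand in for the full analysis:

```python
from collections import Counter

def loss_conditions(trades: list[dict], min_gap: float = 0.25) -> list[str]:
    """Flag entry-state conditions that are much more common among losers.

    Each trade dict has 'pnl' and a set of entry-state tags, e.g.
    {'pnl': -1.0, 'tags': {'high_vol', 'asia_session'}}.
    """
    losers = [t for t in trades if t["pnl"] < 0]
    winners = [t for t in trades if t["pnl"] >= 0]
    loser_counts = Counter(tag for t in losers for tag in t["tags"])
    winner_counts = Counter(tag for t in winners for tag in t["tags"])
    flagged = []
    for tag, n in loser_counts.items():
        loss_rate = n / max(len(losers), 1)
        win_rate = winner_counts[tag] / max(len(winners), 1)
        if loss_rate - win_rate >= min_gap:
            flagged.append(tag)  # candidate anti-pattern condition
    return flagged

# Toy sample only.
trades = [
    {"pnl": -1.0, "tags": {"asia_session", "high_vol"}},
    {"pnl": -0.6, "tags": {"asia_session"}},
    {"pnl": 1.4, "tags": {"london_session", "high_vol"}},
    {"pnl": 0.9, "tags": {"london_session"}},
]
print(loss_conditions(trades))  # ['asia_session'] on this toy sample
```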
You Understand This When…
- You know why loss attribution is as important as pattern discovery
- You understand how the anti-pattern library compounds over time
Previous: 8.4 Monitoring & Alerting
9.2 Meta-Strategy Analysis
The system treats its own aggregate performance as a data series. Are there patterns in when ATLAS itself performs best or worst? This is a second-order edge that no individual strategy captures.
Individual strategies have their own performance profiles. But the aggregate system — all strategies running together — may exhibit patterns that transcend any single strategy:
- Does the system perform better in the first week after deploying a new strategy? (novelty advantage before the market adapts?)
- Does aggregate performance correlate with macro variables that no individual strategy explicitly tracks?
- Are there time-of-day or day-of-week effects in system-level performance that don’t appear at the strategy level?
The meta-strategy layer treats these questions as hypotheses and tests them with the same rigor applied to any trading idea. If a system-level pattern is real, it becomes a portfolio-level overlay: adjust total exposure based on conditions that predict system-wide performance.
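A sketch of one such system-level question, a weekday profile of aggregate daily P&L; the toy data is illustrative, and any pattern found this way would still have to pass the same validation harness as a strategy idea:

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def weekday_profile(daily_pnl: dict[date, float]) -> dict[str, float]:
    """Mean system-level P&L per weekday: a candidate second-order pattern."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for d, pnl in daily_pnl.items():
        buckets[d.strftime("%A")].append(pnl)
    return {day: mean(vals) for day, vals in buckets.items()}

# Toy sample only.
sample = {date(2024, 1, 1): 0.4, date(2024, 1, 2): -0.2,
          date(2024, 1, 8): 0.6, date(2024, 1, 9): -0.1}
print(weekday_profile(sample))
```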
You Understand This When…
- You understand the concept of a second-order edge at the system level
- You know why meta-analysis requires the same statistical rigor as strategy-level analysis
Previous: 9.1 Loss Attribution as a Feature
9.3 Model Calibration Evolution
In month one, all models are weighted equally because there is no data to differentiate them. In month six, the system knows which model to trust about what, and under which conditions. This knowledge is earned, not programmed.
Calibration data accumulates with every ensemble session and every resolved prediction. Over time, distinct profiles emerge:
- One model may be consistently well-calibrated on certain asset classes but overconfident on others
- Another may have poor overall accuracy but excellent timing on regime-change calls
- A third may be the most reliable during high-volatility periods but add noise during quiet markets
These profiles are not static — they evolve as models are updated and as markets change. The system continuously recalculates calibration scores on a rolling basis, ensuring that the weighting reflects current, not historical, reliability.
This is fundamentally different from fixed model weighting. A system with fixed weights cannot adapt. A system with calibration-derived weights improves its judgment automatically as evidence accumulates.
You Understand This When…
- You know why calibration-derived weighting is superior to fixed weighting
- You understand that calibration profiles are conditional (by asset class, regime, signal type)
- You see calibration data as a compounding, non-transferable advantage
Previous: 9.2 Meta-Strategy Analysis
Next: 9.4 The Memory Moat
9.4 The Memory Moat
The vector store grows every day. The anti-pattern library grows with every loss. The calibration data grows with every prediction. None of this can be copied, purchased, or shortcut. It can only be earned through time.
Consider what ATLAS accumulates over six months of operation:
- Market state embeddings with known outcomes for every candle close across every market
- Calibration profiles for each model, broken down by asset class and regime
- Anti-pattern conditions derived from every losing trade
- A hypothesis graveyard with detailed kill reasons for every failed idea
- Model disagreement resolution data showing who was right about what, and when
A competitor can copy the architecture. They can use the same models, the same graph structure, the same verification harness. But they start with an empty memory, uncalibrated models, no anti-pattern library, and no disagreement history. They are six months behind on day one, and the gap widens every day.
Key Insight
The moat is not the code. The code is tens of thousands of lines that a competent team could rewrite. The moat is the accumulated intelligence: the patterns, the anti-patterns, the calibration, the graveyard, the memory. This intelligence compounds. It cannot be transferred. It cannot be faked. It can only be earned by running the system, making mistakes, learning from them, and running it again. Every day the system operates, the moat deepens.
You Understand This When…
- You can enumerate the five types of accumulated intelligence that form the moat
- You understand why the moat compounds and cannot be shortcut
- You see that the architecture is reproducible but the intelligence is not
Previous: 9.3 Model Calibration Evolution
The system discovers.
The system validates.
The system improves.
ATLAS is not a finished product. It is a machine that gets better every day — more market memory, sharper model calibration, a growing library of what works and what doesn’t. The moat is not the code. It’s the compounding intelligence that no copycat can shortcut.
Powered by four AIs. Earning its way to live capital.