The AI Sizing Challenge: Why Contextual Bandit Couldn't Optimize Leverage

A beginner-friendly summary of the verification: “The AI Sizing Challenge: Why Contextual Bandit Couldn’t Optimize Leverage”.

What’s the idea?

We’ve explored various ways to make our algorithmic FX trading systems (EAs) smarter, especially when it comes to deciding how much to trade, which we call “position sizing.” This is like a chef adjusting the amount of spice based on the ingredients and the diners’ preferences. Get it right, and you could boost returns while managing risk. Get it wrong, and… well, let’s just say it’s not good!

Previously, we tried using Machine Learning (ML) for this (Research 100), aiming to dynamically adjust leverage (how much borrowed money we use for trades) based on market conditions. But that experiment didn’t pan out; the ML system performed no better than a “placebo” (a shuffled version of its own output). It seemed the underlying market structure it was trying to learn was either too complex or already captured by our existing, simpler methods.

This time, we’re trying a different flavor of AI: a contextual bandit. Think of it as a simpler, more focused type of reinforcement learning (RL). Instead of learning a complex sequence of actions, a contextual bandit learns the best single action to take in a given “context” or market state. Our goal was to see if this approach could learn to pick optimal leverage levels (0.5x, 1.0x, or 1.5x) based on different market scenarios, outperforming our existing, hand-coded sizing logic.

How I tested it

To test our contextual bandit, we needed to define the “contexts” (or “states”) it would observe and the “actions” (leverage levels) it could choose from. We kept the states relatively low-dimensional compared to our previous ML attempt to avoid complexity and potential overfitting. Here’s what our bandit looked at to decide on leverage:

Volatility: We categorized market volatility into three levels (quantiles) over a 3-minute period. Is the market calm, moderate, or wild?
Stock Market Status: Is the stock market “on” or “off”? This could indicate broader risk sentiment.
Drawdown (DD) State: Is our trading system currently in a significant drawdown? This is crucial for risk management; you might want to reduce leverage during tough times.

Based on these three pieces of information, the contextual bandit would then choose one of three leverage levels: 0.5x, 1.0x, or 1.5x.

We used a walk-forward learning approach. This means the system would learn from a block of past data, then apply its learned strategy to a new, unseen block of data, and then repeat the process. This helps simulate real-world trading and prevent the system from just memorizing historical patterns.

Crucially, just like last time, we ran a “placebo” test. This involved shuffling the learned leverage assignments to different states. If the actual learning algorithm can’t beat this random assignment, it suggests it hasn’t truly learned anything useful about the market structure.
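The setup above can be sketched in a few lines of Python. This is a minimal illustration and not our actual research code: the function names, the block sizes, and the reward function are all assumptions made for the example. It also assumes that the return at any leverage is simply the leverage times the 1.0x return, so every leverage can be scored on every historical trade. With three volatility levels, two stock-market states, and two drawdown states, there are 3 x 2 x 2 = 12 possible contexts.

```python
import random
from collections import defaultdict

LEVERAGES = (0.5, 1.0, 1.5)

def encode_state(vol_tercile, stock_on, in_drawdown):
    """Combine volatility tercile (0-2), stock-market flag and drawdown flag into one state."""
    return (vol_tercile, bool(stock_on), bool(in_drawdown))

def fit_policy(train_rows, reward_fn):
    """Per state, pick the leverage with the highest mean reward on the training block.

    train_rows: list of (state, base_return), where base_return is the 1.0x return.
    Assumption: the return at leverage lev is lev * base_return.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, base_return in train_rows:
        counts[state] += 1
        for lev in LEVERAGES:
            totals[(state, lev)] += reward_fn(lev, base_return)
    return {
        state: max(LEVERAGES, key=lambda lev: totals[(state, lev)] / n)
        for state, n in counts.items()
    }

def shuffle_policy(policy, rng):
    """Placebo: keep the same leverage values but reassign them to states at random."""
    states = sorted(policy)
    levs = [policy[s] for s in states]
    rng.shuffle(levs)
    return dict(zip(states, levs))

def walk_forward(rows, train_size, test_size, reward_fn, placebo_rng=None):
    """Fit on one block, apply to the next unseen block, then roll forward."""
    results = []
    start = 0
    while start + train_size + test_size <= len(rows):
        split = start + train_size
        policy = fit_policy(rows[start:split], reward_fn)
        if placebo_rng is not None:
            policy = shuffle_policy(policy, placebo_rng)
        for state, base_return in rows[split:split + test_size]:
            lev = policy.get(state, 1.0)  # unseen state: fall back to 1.0x (assumption)
            results.append(lev * base_return)
        start += test_size
    return results
```

Running walk_forward once with placebo_rng=None and once with, say, placebo_rng=random.Random(0) gives the two out-of-sample return series to compare. If the shuffled version does as well as the learned one, the state-to-leverage mapping carried no usable information.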

What happened?

The results were, shall we say, illuminating, and not in a good way for the contextual bandit! Our reinforcement learning system quickly developed what we call a degenerate policy. Essentially, because we penalized drawdowns in its reward structure, the system learned to be extremely cautious. It defaulted to the lowest leverage (0.5x) across almost all market states, especially when any drawdown was present. In other words, it became overly conservative to avoid penalties, which isn’t ideal for maximizing returns.

When we looked at its performance, especially during months where the system experienced a drawdown of 10% or more, its average monthly return was a dismal +0.72%. This was actually one of the worst outcomes we’ve seen!

But here’s where it gets really interesting (and a bit disheartening for the AI): the placebo test actually performed better! The system where we shuffled the state-to-leverage assignments, effectively making the choices random with respect to the actual market conditions, yielded an average monthly return of +1.57%. Yes, you read that right: a random assignment of leverage (the placebo) outperformed our sophisticated reinforcement learning algorithm!

This result is a strong echo of our previous ML sizing experiment (Research 100). It suggests that the market structure our contextual bandit was trying to learn (how volatility, stock market status, and drawdown should influence leverage) is either not present in a way that can be easily learned, or more likely, it’s already effectively captured by our existing, simpler, hand-crafted system. Our current system, “vol-target + stock filter (v1.4.0),” already adjusts position size based on volatility and filters trades based on stock market conditions. It seems to have already captured the essence of what these complex algorithms are trying to find.

We could try to “tweak” the rewards for the contextual bandit, making it less penalty-averse or more reward-seeking. However, this path often leads to overfitting, where the system learns to perform well on past data but fails miserably in real-time. Given that the placebo beat the learned policy, trying to fine-tune the reward design felt like chasing ghosts.
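To see how a drawdown penalty can push a policy toward 0.5x everywhere, here is a small sketch. The reward form below is hypothetical (our actual reward design is not reproduced here), and the sample returns are made up for illustration. The point is that when the reward scales linearly with leverage, the best choice is always a corner: the highest leverage if the penalized average is positive, the lowest if it is negative.

```python
LEVERAGES = (0.5, 1.0, 1.5)

def penalized_reward(lev, base_return, penalty):
    """Leveraged return minus a penalty proportional to the leveraged loss (hypothetical form)."""
    pnl = lev * base_return
    return pnl - penalty * max(0.0, -pnl)

def best_leverage(base_returns, penalty):
    """Leverage with the highest mean penalized reward over a sample of 1.0x returns."""
    def mean_reward(lev):
        total = sum(penalized_reward(lev, r, penalty) for r in base_returns)
        return total / len(base_returns)
    return max(LEVERAGES, key=mean_reward)

# Made-up sample with a positive average return: two 2% gains and one 3% loss.
sample = [0.02, 0.02, -0.03]

print(best_leverage(sample, penalty=0.0))  # 1.5: no penalty, so more leverage is better
print(best_leverage(sample, penalty=1.0))  # 0.5: losses count double, so less is better
```

In this toy example the flip happens once the penalty weight exceeds 1/3, which is also why retuning the penalty on historical data is risky: a small change in the weight moves whole groups of states from one extreme to the other.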

What I learned

The conclusion is clear and consistent across both our ML and RL sizing experiments: neither advanced machine learning nor reinforcement learning was able to learn a position sizing strategy that outperformed our existing, hand-crafted vol-target + stock filter (v1.4.0) system.

It appears that the core relationships between market states (like volatility and stock market sentiment) and optimal leverage are already effectively modeled by our simpler, deterministic rules. Adding complex learning algorithms, in this case, didn’t provide any additional edge. In fact, they either performed worse or were easily beaten by a “random” placebo.

This doesn’t mean AI is useless for trading, but it does highlight an important lesson: sometimes, the simplest solution is the best. If a hand-coded system already captures the significant market dynamics, throwing more complex AI at the problem might just lead to over-complication or, worse, overfitting.

Therefore, we’re sticking with our current, proven system. No changes to the core trading system based on these sizing experiments! It’s another valuable lesson learned on our journey to build robust and profitable EAs.

How this connects

This verification builds on earlier ones: what failed before, what I tried this time, and how the approaches compare.

AI’s Risk Prediction: Was Its Trading Logic a Placebo?