
About the test data — where it comes from and how it's cleaned
The source of the historical data used here, how it's stored, the period covered, and how data quality is checked.
Testing lives and dies by its data. Here’s an honest look at the data behind the tests on this site.
What data is used
The main feed is 1-minute (M1) OHLCV data for major FX pairs, gold, and so on. OHLCV means each bar’s open, high, low, close and volume. With 1-minute bars I can rebuild any higher timeframe I want — 5-minute, hourly, daily.
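To make the "rebuild any higher timeframe" point concrete, here is a minimal sketch in plain Python. It is not the code used on this site: the `Bar` class and `resample` function are hypothetical names, and it assumes the bars are sorted by time and that the target timeframe divides a day evenly (5, 60, 1440 minutes and so on). Buckets are counted from midnight on whatever clock the timestamps use, which may differ from a broker’s daily cut-off.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Bar:
    time: datetime  # bar open time
    open: float
    high: float
    low: float
    close: float
    volume: float


def resample(bars, minutes):
    """Aggregate time-sorted M1 bars into `minutes`-minute bars."""
    out = []
    for b in bars:
        # Floor the timestamp to the start of its bucket, counted from midnight.
        minute_of_day = b.time.hour * 60 + b.time.minute
        midnight = b.time.replace(hour=0, minute=0, second=0, microsecond=0)
        start = midnight + timedelta(minutes=minute_of_day - minute_of_day % minutes)
        if out and out[-1].time == start:
            cur = out[-1]
            cur.high = max(cur.high, b.high)   # highest high in the bucket
            cur.low = min(cur.low, b.low)      # lowest low in the bucket
            cur.close = b.close                # last close wins
            cur.volume += b.volume             # volume adds up
        else:
            # First M1 bar of a new bucket supplies the open.
            out.append(Bar(start, b.open, b.high, b.low, b.close, b.volume))
    return out


# Made-up sample bars, for illustration only.
m1 = [
    Bar(datetime(2020, 3, 9, 9, 0), 1.10, 1.12, 1.09, 1.11, 10),
    Bar(datetime(2020, 3, 9, 9, 1), 1.11, 1.15, 1.10, 1.14, 20),
    Bar(datetime(2020, 3, 9, 9, 5), 1.14, 1.16, 1.13, 1.15, 5),
]
m5 = resample(m1, 5)
```

The first two M1 bars fall into the 09:00 bucket and the third starts the 09:05 bucket. The rule is the same for any timeframe: first open, highest high, lowest low, last close, summed volume.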
The period runs from roughly 2015 to the recent past. A long span lets me test across bull markets, bear markets and turbulent stretches (like the COVID crash), which exposes strategies that only work in one kind of market.
For stock indices (S&P 500, Nikkei 225, etc.) I also use other sources (Yahoo Finance daily, Dukascopy minute data). Indices move differently from FX, so they’re very useful for diversification.
Storage — why Parquet
The raw data is CSV (text), but reading that every time is slow. So I convert it to Parquet, a columnar format — think “compress and store each column tightly.” It loads far faster and is safe to read from many tests at once. That means I can run lots of tests in parallel without data loading becoming the bottleneck.
Data-quality checks (quietly important)
I’ve been burned here. Some gold data had anomalies that made a strategy look wildly profitable, but most of the gains came from those broken values. Since then I’ve added a step that automatically detects and removes abnormal bars (e.g., an absurd daily range, or unnatural price jumps). Tests run on this clean data by default.
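A rough sketch of what such a filter can look like is below. The rules and thresholds are assumptions for illustration, not the actual checks used on this site: `clean_bars`, `max_range_mult` and `max_jump` are hypothetical names, and the sample bars are made up. It drops a bar if its OHLC values contradict each other, if its range is many times the median range, or if its open jumps too far from the last bar that was kept.

```python
from statistics import median


def clean_bars(bars, max_range_mult=10.0, max_jump=0.05):
    """Split bars (dicts with open/high/low/close) into (kept, dropped)."""
    typical = median(b["high"] - b["low"] for b in bars)
    kept, dropped = [], []
    for b in bars:
        body_low = min(b["open"], b["close"])
        body_high = max(b["open"], b["close"])
        # Rule 1: low must be the minimum and high the maximum of the bar.
        inconsistent = not (b["low"] <= body_low and body_high <= b["high"])
        # Rule 2: range far above the median range (assumed multiplier).
        too_wide = typical > 0 and (b["high"] - b["low"]) > max_range_mult * typical
        # Rule 3: open too far from the last kept close (assumed 5% limit).
        jump = bool(kept) and abs(b["open"] / kept[-1]["close"] - 1) > max_jump
        if inconsistent or too_wide or jump:
            dropped.append(b)
        else:
            kept.append(b)
    return kept, dropped


# Made-up daily bars with two planted anomalies.
bars = [
    {"open": 1500.0, "high": 1510.0, "low": 1495.0, "close": 1505.0},
    {"open": 1505.0, "high": 1512.0, "low": 1500.0, "close": 1508.0},
    {"open": 1508.0, "high": 2600.0, "low": 1507.0, "close": 1509.0},  # absurd range
    {"open": 1509.0, "high": 1515.0, "low": 1503.0, "close": 1511.0},
    {"open": 1900.0, "high": 1905.0, "low": 1895.0, "close": 1901.0},  # price jump
    {"open": 1511.0, "high": 1516.0, "low": 1506.0, "close": 1512.0},
]
kept, dropped = clean_bars(bars)
```

Comparing each bar with the last kept bar, rather than its raw neighbour, keeps a single bad bar from also knocking out the good bar that follows it. The catch is that a bad first bar, or a genuine large gap, would cause everything after it to be dropped, so thresholds like these need tuning per instrument and timeframe.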
An honest note on the source
The FX historical data is a broker’s real feed. The format is a common text layout, but the values are that broker’s quotes. Different brokers have slightly different spreads and prices, so results can shift a bit depending on which data you use. That’s exactly why I estimate costs (spread and slippage) on the conservative side.
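As a small illustration of what "conservative" can mean in practice, the sketch below inflates the assumed spread and slippage by a safety factor before subtracting them from each trade. Every number and name here (`net_pips`, the `safety` factor of 1.5, the pip values) is hypothetical and not taken from this site’s actual settings.

```python
def net_pips(gross_pips, spread_pips, slippage_pips, safety=1.5):
    """Round-trip trade result in pips after costs inflated by a safety factor."""
    cost = (spread_pips + slippage_pips) * safety
    return gross_pips - cost


# Hypothetical gross results of four trades, in pips.
trades = [12.0, -8.0, 5.0, 3.0]
net = [net_pips(g, spread_pips=1.0, slippage_pips=0.5) for g in trades]

gross_total = sum(trades)
net_total = sum(net)
```

If a strategy still looks good after costs are padded like this, small differences between one broker’s quotes and another’s are less likely to flip the result.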
The raw historical data isn’t included in the public repository (licensing, size). What’s shared here is the method — how the data is handled.