Beyond the Backtest

Having worked in systematic trading for over twenty years, we have used many validation approaches, all with a common objective: to increase the probability that a model developed on historical data will continue to perform well in the future.

But what are the crucial elements of this project, and how can a simple Performance Report help us achieve this goal?

A primary distinction, rooted in statistics, relates to the operational horizon. If the backtest is conducted on a High-Frequency Trading (HFT) strategy (operating in the very short term, even below a second), we will rely on certain metrics and employ specific countermeasures to enhance the robustness of the results. Conversely, if we wish to design an investment system for the multi-day horizon, we must give greater emphasis to different metrics and propose validation criteria more suitable for a medium to long-term timeframe.

In any case, we must start with a matrix of values appropriately organized in a Performance Report that is both concise and comprehensive. So, which one should we rely on?

There are dozens of commercial suites (such as TradeStation, MultiCharts, AmiBroker, MetaTrader, Trade Navigator, etc.) that offer their own version of the Performance Report. All have their merits, but in our opinion they also share a shortcoming: the report layouts and metrics cannot be customized.

Our solution has been to build “open” Python code, which has evolved over time and can be customized by each user (the code for the functions and the backtesting engine is provided within the Python Academy). We are talking about the latest version of the “Wintermute” engine, accelerated with Numba (a JIT compiler for Python).

This means that a dedicated backtesting platform is no longer necessary: one can work on the data directly in Python, without additional “intermediaries.” And that is no small thing.

Writing a strategy in Python is much simpler than doing so in EasyLanguage or PowerLanguage (once you have learned how to simplify the code using functions and to rationalize dependencies). A little upfront work buys a freedom that was unthinkable in previous years.

But let’s illustrate this with an example that allows us to see the new version of the Wintermute engine in action (version 20231114).

Let’s take the continuous Gold Future on 60-minute bars, an instrument that, with some exceptions (for example, in 2018), has consistently responded to level and volatility breakout logic.

Let’s begin by setting the working constants (which, in Python convention, are written in uppercase).

At the beginning, we have the general service settings. Let’s set the name of the Trading System (“NAME”), disable detailed log writing (“WRITELOG”), which can be useful for in-depth analysis, and set the backtest engine to “single run” mode (“OPTIMIZATION”).

Now, let’s configure the backtest engine’s operation:


This allows the engine to close a trade on the same bar on which it was opened. Disabling this option enables specific simulations for strategies that must account for a certain time inertia or, in other words, certain non-idealities.


This allows starting a trade on the same bar on which the previous trade ended.


Limits the maximum number of trades per daily session to 1. A customizable mode to reduce the number of false signals.


This inhibits an instability factor that we will activate after the first backtest. It is a crucial parameter in validation criteria to increase model robustness and enhance the probability that a strategy can continue to perform after backtesting on unknown data. The theory behind this part is borrowed from the world of machine learning.

Let’s dive into the settings related to the operational part:


We decide to operate on a futures contract, and the system will carry all the related settings.

TICK = 0.1

Defines the value of the minimum price movement (instrument granularity). There is also an automatic way to read this value.


Represents the dollars (USD) gained or lost for each one-point movement in the underlying asset.

DIRECTION = "long"

Sets the operation only on the long side (entry with buy orders). Often, dividing the analysis between long and short positions can be useful to characterize and correct some peculiar behaviors on both fronts.

ORDER_TYPE = "stop"

Decides to use “stop” orders, meaning they are triggered upon reaching a certain price level. A stop order is converted into a market order when the condition is met.


Uses a single contract for the Gold Future.


Sets a 10% margin requirement (typical of many brokers). This setting accounts for how much capital is tied up for each contract while the trade is open.


Assumes starting with a capital of $20,000. This threshold will be used to calculate metrics such as drawdown.


Sets a zero annual “risk-free” interest rate. This parameter will affect the calculation of many metric indicators such as the Sortino Ratio.


Regarding costs, sets a fixed fee of $20 per contract operation (a first assumption). This translates to $40 “round turn” to open and close a trade.
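As a rough illustration of how the point value and the fixed fees combine into the net result of a trade, here is a minimal sketch. The function name and the point value of 100 USD are assumptions made for the example (the article does not show the engine's own constant), and the prices are invented.

```python
# Illustrative values only: this is not the Wintermute configuration.
BIG_POINT_VALUE = 100.0  # ASSUMED USD per one-point move of the future
COST_PER_SIDE = 20.0     # fixed fee per contract operation (from the text)


def net_trade_pnl(entry_price, exit_price, contracts=1):
    """Net P&L in USD of a long trade after round-turn costs."""
    gross = (exit_price - entry_price) * BIG_POINT_VALUE * contracts
    round_turn = 2 * COST_PER_SIDE * contracts  # one fee to open, one to close
    return gross - round_turn


# A 5-point gain on one contract: 5 * 100 USD gross, minus 40 USD round turn.
example_pnl = net_trade_pnl(1950.0, 1955.0)
print(example_pnl)
```

With these assumptions, a trade must gain 0.4 points just to cover the $40 round turn.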

We do not use any of the other coded exit modes.

Now, let’s move to the heart of the system.

In this case, let’s define an “enter_level” that buys on a break above the highest of the last 23 highs. As a filter, we use Larry Williams’ session “blastoff” on the higher time frame (we chose 0.3 as the parameter, which is quite stringent for Gold). We exit the trade at midnight (American session). No other settings are used.
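The entry logic can be sketched in plain Python as follows. This is not the Wintermute code: the function names are hypothetical, the blastoff is written in its commonly cited form (bar body divided by bar range), and we assume that the breakout level excludes the current bar and that a buy stop fills at the level or at the open when the bar gaps above it.

```python
def highest_high(highs, lookback=23):
    """Highest high of the `lookback` bars preceding each bar.

    Returns None for bars that do not yet have enough history.
    """
    levels = []
    for i in range(len(highs)):
        if i < lookback:
            levels.append(None)
        else:
            levels.append(max(highs[i - lookback:i]))
    return levels


def blastoff(open_, high, low, close):
    """Body-to-range ratio of a bar: |close - open| / (high - low)."""
    bar_range = high - low
    if bar_range == 0:
        return 1.0  # zero-range bar: treat it as "not compressed"
    return abs(close - open_) / bar_range


def buy_stop_fill(bar_open, bar_high, level):
    """Fill price of a buy stop on one bar, or None if it is not triggered."""
    if level is None or bar_high < level:
        return None
    return max(bar_open, level)  # a gap above the level fills at the open


# Toy data with a short lookback, just to show the mechanics.
toy_levels = highest_high([10, 11, 12, 11.5, 13], lookback=3)
compressed = blastoff(100.0, 110.0, 98.0, 101.0) < 0.3  # filter passes
fill = buy_stop_fill(bar_open=11.8, bar_high=13.0, level=12)
gap_fill = buy_stop_fill(bar_open=12.5, bar_high=13.0, level=12)
```

A low blastoff value marks a session whose open and close are near each other relative to its range, which is the compression the filter looks for before a breakout.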

Now, let’s invoke the Wintermute engine’s “apply_trading_system” function and populate the following elements: tradelist, open_equity, closed_equity, operation_equity.

Once these elements, which characterize everything we need, are produced, we can initiate the printing of the performance report by invoking the “performance_report” function.

All that remains is to review the report, in the version that we will present during the next live session of the Python Academy on Thursday, November 23, 2023.

The notification at the top informs us that the last trade is still open and was fictitiously closed on the last bar’s closing (to account for it).

The first section focuses on expected profit, explicitly detailing CAGR and Annual Return. The main difference between CAGR (Compound Annual Growth Rate) and Annual Return lies in the management of reinvesting profits in the case of CAGR (each related function is coded and explained in detail within the course, but information can be obtained from various official sources).
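To make the difference concrete, here is a minimal sketch using the textbook definitions, which may not match the engine's exact formulas. The inputs are the figures quoted in this article (a starting capital of $20,000 and a final profit of about $140,000 over 17 years); the outputs are not meant to reproduce the values printed in the report.

```python
def cagr(initial_capital, final_capital, years):
    """Compound annual growth rate: the constant yearly rate that,
    with profits reinvested, turns the initial capital into the final one."""
    return (final_capital / initial_capital) ** (1.0 / years) - 1.0


def simple_annual_return(initial_capital, final_capital, years):
    """Total return on the initial capital divided evenly across the years."""
    return (final_capital - initial_capital) / initial_capital / years


INITIAL = 20_000.0
FINAL = INITIAL + 140_000.0
YEARS = 17

growth = cagr(INITIAL, FINAL, YEARS)
simple = simple_annual_return(INITIAL, FINAL, YEARS)
print(f"CAGR: {growth:.2%}  simple annual return: {simple:.2%}")
```

Because the CAGR assumes compounding, it is lower than the simple annual return whenever the total return is positive and the period is longer than one year.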

The second section provides a concise overview of performance metrics commonly used in both investing and trading:

Sharpe Ratio: A performance measure that calculates the excess return of an investment (compared to the risk-free return) per unit of assumed risk. This metric helps investors assess whether the return obtained from an investment is adequate relative to the level of risk taken.

Sortino Ratio: A performance indicator that measures the return of an investment relative to the risk below a certain acceptable minimum level. Unlike the more common Sharpe Ratio, which considers the standard deviation of all returns (positive and negative), the Sortino Ratio focuses only on the standard deviation of negative returns. In short, it evaluates the ability of an investment to generate positive returns in relation to negative risk, providing a more focused measure of downside risk management.

Calmar Ratio: An indicator that evaluates the performance of an investment in relation to the risk taken. In this basic version, it is calculated as the annualized return divided by the maximum drawdown. Essentially, the Calmar Ratio provides a measure of performance in relation to volatility or risk, helping investors assess the efficiency of a strategy or fund, where higher values indicate better risk-adjusted performance.

Calmar Ratio Mean: The average version of the Calmar Ratio calculated year by year. It aims to make the classic Calmar Ratio value (which increases for profitable strategies as time goes on) independent of time.

Omega Ratio: A performance measure that takes into account not only the return but also the distribution of returns, comparing gains above a minimum acceptable threshold with losses below it (the Profit Factor corresponds to the case in which this threshold is zero).

Kestner Ratio: A measure of the regularity of the returns curve, based on the mean squared deviation associated with the linear regression line.

Treynor Ratio: A measure of performance relative to risk that evaluates how much gain an investment has generated in relation to the market risk it has assumed. In this case, the denominator is the investment’s Beta.

Information Ratio: A measure of a portfolio’s excess return compared to a benchmark, adjusted for the risk taken. Essentially, the Information Ratio evaluates how much value a portfolio manager has added compared to a benchmark in relation to the portfolio’s volatility.

Beta: A measure of a security’s sensitivity to market changes overall. It represents the statistical relationship between the return of a security and the return of a benchmark market index. Beta is commonly used in portfolio theory and financial risk analysis.
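Some of the ratios above can be sketched with the standard library alone. These are common textbook forms, not the functions shipped with the engine: the annualization factor of 252 periods is an assumption, and the Sortino Ratio here uses the downside deviation computed over all observations, which is one of several conventions.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return per unit of total volatility."""
    excess = [r - risk_free / periods_per_year for r in returns]
    return (math.sqrt(periods_per_year) * statistics.mean(excess)
            / statistics.stdev(excess))


def sortino_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Like Sharpe, but only negative excess returns count as risk."""
    excess = [r - risk_free / periods_per_year for r in returns]
    downside = math.sqrt(sum(min(e, 0.0) ** 2 for e in excess) / len(excess))
    return math.sqrt(periods_per_year) * statistics.mean(excess) / downside


def omega_ratio(returns, threshold=0.0):
    """Sum of gains above the threshold over sum of shortfalls below it."""
    gains = sum(max(r - threshold, 0.0) for r in returns)
    losses = sum(max(threshold - r, 0.0) for r in returns)
    return gains / losses


def calmar_ratio(annual_return, max_drawdown):
    """Annualized return divided by the magnitude of the maximum drawdown."""
    return annual_return / abs(max_drawdown)


# Invented periodic returns; the sketch assumes at least one loss.
toy_returns = [0.01, -0.005, 0.007, -0.002, 0.004]
toy_sharpe = sharpe_ratio(toy_returns)
toy_sortino = sortino_ratio(toy_returns)
toy_omega = omega_ratio(toy_returns)
toy_calmar = calmar_ratio(0.20, -0.40)
```

On this toy series the Sortino Ratio comes out higher than the Sharpe Ratio, because the positive swings inflate the standard deviation but not the downside deviation.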

In the next section, we find a detailed analysis. This means that the first two sections provide an initial overview of the goodness or otherwise of the strategy, while the following ones allow for a detailed examination of strengths and weaknesses.

Operations: The number of trades.

Profit: The profit net of fixed costs (slippage and commissions).

Average Trade: The average profit/loss of the strategy. One of the most important aspects to understand if the strategy is sufficiently robust to be implemented.

Profit Factor: A measure used to evaluate the profitability of a trading system or strategy. It is calculated as the ratio of Gross Profit to Gross Loss (taken in absolute value). A PF below 1 means that the system loses money overall.

Gross Profit: The sum of all positive operations.
Gross Loss: The sum of all negative operations.

Percent Winning Trades: The percentage of positive trades calculated on “Operations.” A value above 50% generally has positive psychological implications; a value below 50% has the opposite effect.

Reward Risk Ratio: The ratio between the average gain of profitable operations and the average loss of losing operations. This metric generally has an inverse relationship with Percent Winning Trades.

Sustainability: A linear combination of the product between (Percent Winning Trades x Avg Winning Trades) and (Percent Losing Trades x Avg Losing Trades). The higher the positive value, the statistically more robust the strategy is to absorb negative outliers.

Trading Fees: The cumulative fixed costs incurred.

Avg Gain: The average of positive operations.
Max Gain: The maximum positive occurrence among operations.

Avg Loss: The average of negative operations.
Max Loss: The maximum negative occurrence among operations.

Avg Win Trade Length: The average number of consecutive positive operations.
Max Win Trade Length: The maximum number of consecutive positive operations.

Avg Losing Trade Length: The average number of consecutive negative operations.
Max Losing Trade Length: The maximum number of consecutive negative operations.
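Most of the metrics in this section follow directly from the list of per-trade results. The sketch below shows one way to compute them in plain Python; the function names and the toy trade list are invented for the example, and trades with exactly zero P&L are counted as neither wins nor losses, which may differ from the engine's convention.

```python
def trade_metrics(trades):
    """Basic report metrics from a list of per-trade net P&L values."""
    wins = [t for t in trades if t > 0]
    losses = [t for t in trades if t < 0]
    gross_profit = sum(wins)
    gross_loss = sum(losses)  # negative number
    avg_gain = gross_profit / len(wins)
    avg_loss = gross_loss / len(losses)
    return {
        "operations": len(trades),
        "profit": sum(trades),
        "average_trade": sum(trades) / len(trades),
        "profit_factor": gross_profit / abs(gross_loss),
        "percent_winning": 100.0 * len(wins) / len(trades),
        "avg_gain": avg_gain,
        "avg_loss": avg_loss,
        "reward_risk_ratio": avg_gain / abs(avg_loss),
    }


def max_streak(trades, winning=True):
    """Longest run of consecutive winning (or losing) trades."""
    best = current = 0
    for t in trades:
        hit = t > 0 if winning else t < 0
        current = current + 1 if hit else 0
        best = max(best, current)
    return best


# Invented trade list; the sketch assumes at least one win and one loss.
toy_trades = [300.0, -100.0, 200.0, 400.0, -200.0, -100.0]
metrics = trade_metrics(toy_trades)
longest_win_run = max_streak(toy_trades, winning=True)
longest_loss_run = max_streak(toy_trades, winning=False)
```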

Before moving on to the next section, we want to emphasize that the test strategy presented here (a basic strategy, it should be made clear) is sustainable net of costs, which could even be increased ($137 of Average Trade and a Profit Factor of 1.49), with good psychological confidence (52.74% of Percent Winning Trades and a Reward Risk Ratio of 1.33).
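These figures can be cross-checked against each other. If there are no break-even trades, the Profit Factor equals the odds of winning multiplied by the Reward Risk Ratio: PF = p / (1 - p) × RRR, where p is the fraction of winning trades. The short sketch below applies this identity to the rounded values quoted above; it is a consistency check, not part of the engine.

```python
win_rate = 0.5274    # Percent Winning Trades, as a fraction
reward_risk = 1.33   # Reward Risk Ratio (rounded in the report)

# Gross Profit / |Gross Loss| = (N_win * avg_gain) / (N_loss * |avg_loss|)
implied_profit_factor = win_rate / (1.0 - win_rate) * reward_risk
print(round(implied_profit_factor, 2))
```

The result is within about 0.01 of the reported 1.49, a gap that the rounding of the two inputs is enough to explain.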

The second part of the Performance Report focuses more on the expected risk of the strategy:

Avg Delay Between Peaks: The average delay (in bars) before a new peak in the equity line.

Max Delay Between Peaks: The maximum delay (in bars) before a new peak in the equity line.

Avg Time in Trade: The average duration in bars per trade.

Max Time in Trade: The maximum length in bars of a trade.

Min Time in Trade: The minimum length in bars of a trade.

Trades Standard Deviation: The standard deviation calculated on trades. A measure of variance and therefore expected risk.

Equity Standard Deviation: The standard deviation calculated on open equity. A measure of variance and therefore expected risk.

Correlation: The correlation, expressed in percentage points, between the daily aggregated series of profits and the series of changes in closures of the working instrument (in our case, Gold Future).

We now move to the section dedicated to drawdown:

Avg Open Drawdown: The average monetary correction of the open equity curve from a new peak.

Avg Open Drawdown %: The average percentage correction of the open equity curve from a new peak.

Max Open Drawdown: The maximum monetary correction of the open equity curve from a new peak.

Max Open Drawdown %: The maximum percentage correction of the open equity curve from a new peak.

In our case, we observe a max open drawdown per contract of -$16,300 (recorded on November 20, 2008) and an average value of -$2,313. To be conservative, we recommend always reviewing this entry rather than the one found below for closed operations.

Avg Closed Drawdown: The average monetary correction of the closed equity curve from a new peak.

Avg Closed Drawdown %: The average percentage correction of the closed equity curve from a new peak.

Max Closed Drawdown: The maximum monetary correction of the closed equity curve from a new peak.

Max Closed Drawdown %: The maximum percentage correction of the closed equity curve from a new peak.

These values are extremely useful if our strategy does not include stop-loss or any stop criteria during the session.
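The monetary and percentage drawdown profiles can be derived from an equity curve by tracking its running peak. The sketch below is an assumption about the mechanics, not the engine's code: in particular, the percentage is computed against the initial capital plus the profit at the running peak, which is only one possible convention.

```python
def drawdown_series(equity, initial_capital=20_000.0):
    """Monetary and percentage drawdown for a cumulative P&L curve.

    `equity` holds cumulative profit in USD (starting near zero).
    """
    monetary, percent = [], []
    peak = equity[0]
    for value in equity:
        peak = max(peak, value)       # running maximum of the curve
        drawdown = value - peak       # zero at new peaks, negative otherwise
        monetary.append(drawdown)
        percent.append(100.0 * drawdown / (initial_capital + peak))
    return monetary, percent


# Invented equity curve, in USD of cumulative profit.
toy_equity = [0.0, 5_000.0, 2_000.0, 6_000.0, 1_000.0, 7_000.0]
dd_usd, dd_pct = drawdown_series(toy_equity)
max_drawdown_usd = min(dd_usd)
max_drawdown_pct = min(dd_pct)
```

Feeding the function the open equity or the closed equity gives the two families of values described above.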

Following is the percentile drawdown profile, useful for adjusting any static or adaptive stops.

Finally, we find a summary of the exit reasons, which in this case are all attributable to the exit rule (the total does not reach 100% of the generated trades because the last trade was still open).

Let’s now move on to the charts.

Let’s examine the equity line net of fixed costs, highlighting (in “lime” color) the new equity line highs. The final profit reaches $140,000 per contract over 17 years.

The next chart displays the same curve overlaid on the Gold Futures closing prices (the working instrument). Please note the two different vertical scales. This view is very useful for assessing the level of correlation between the two series.

Next, we examine the monetary drawdown over time, which shows that a drawdown of comparable size has not been recorded since 2008.

The percentage profile tells a different story, emphasizing the nearly -50% drawdown (the roughly -$16,000 from the previous chart) recorded at the beginning of operations. Like all percentage scales, this one is affected by the nominal value reached over time.

Next is the profile of the contract’s notional value (in green) and the portion of capital tied up as margin (in “lime”), which reflects the leverage implied by the margin requirement.
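The relationship between the two curves is simple arithmetic, sketched below. The point value of 100 USD and the price of 2,000 are assumptions used only for illustration; the 10% margin is the setting described earlier in this article.

```python
BIG_POINT_VALUE = 100.0  # ASSUMED USD per one-point move of the future
MARGIN_PERCENT = 10.0    # margin requirement, as set for this backtest


def notional_value(price, contracts=1):
    """USD value of the position controlled by the contracts."""
    return price * BIG_POINT_VALUE * contracts


def margin_required(price, contracts=1):
    """Capital tied up while the position is open."""
    return notional_value(price, contracts) * MARGIN_PERCENT / 100.0


example_price = 2_000.0  # illustrative price, not taken from the data
notional = notional_value(example_price)
margin = margin_required(example_price)
leverage = notional / margin
```

A 10% margin therefore corresponds to a leverage of 10 on the capital actually tied up, whatever the price level.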

We then find the annual aggregate of profits and losses. In this specific case, we observe that, net of fixed costs, there has been a profit almost every year except for 2008 and 2018.

For enthusiasts of Bias systems, we can observe the average profit/loss per calendar month. There is a noticeable downturn in February, May, and October. Since dynamics tend to shift, in this case, we advise against filtering trades based on this.

The “heatmap” of profits and losses represents a mosaic organized by month and year. It allows highlighting anomalous areas with periodic consistency.
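The data behind both the calendar-month view and the heatmap is an aggregation of trade results by year and month. Here is a minimal sketch, assuming each trade is attributed to the month of its exit date and that the monthly average is taken across years; both choices, like the function names and the toy trades, are assumptions.

```python
from collections import defaultdict
from datetime import date


def monthly_heatmap(trades):
    """Sum P&L by (year, month). `trades` is a list of (exit_date, pnl)."""
    grid = defaultdict(float)
    for exit_date, pnl in trades:
        grid[(exit_date.year, exit_date.month)] += pnl
    return dict(grid)


def average_by_month(grid):
    """Average of the monthly totals across years, keyed by calendar month."""
    per_month = defaultdict(list)
    for (_year, month), pnl in grid.items():
        per_month[month].append(pnl)
    return {m: sum(v) / len(v) for m, v in sorted(per_month.items())}


# Invented trades: (exit date, net P&L in USD).
toy_trades_by_date = [
    (date(2022, 2, 3), -200.0),
    (date(2022, 2, 17), 50.0),
    (date(2022, 3, 1), 300.0),
    (date(2023, 2, 9), -100.0),
]
heatmap = monthly_heatmap(toy_trades_by_date)
month_bias = average_by_month(heatmap)
```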

The MAE chart (“Maximum Adverse Excursion”) plots the profit/loss of each trade against the maximum adverse move experienced during the life of the trade. If the trade ends with a profit, it is marked with an upward-pointing green triangle; if it ends with a loss, with a downward-pointing red triangle. In our case, it is evident that a $4,000 stop loss would eliminate 6 negative trades.

Reversing the logic, we obtain the MFE chart (“Maximum Favorable Excursion”), which displays the maximum run-up during the life of each trade. Here, a take profit of $6,000 would cap the profit of just a single trade (and less time in the market translates to lower risk).
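The two excursions can be computed for a long trade from the bar lows and highs seen while the trade is open, and the effect of a candidate stop can then be counted on the resulting pairs. The sketch below uses hypothetical function names, an assumed point value of 100 USD and invented trades; it ignores gaps and fees.

```python
def excursions(entry_price, lows, highs, big_point_value=100.0):
    """MAE and MFE in USD of a long trade, from the bars of its lifetime."""
    mae = min(0.0, (min(lows) - entry_price) * big_point_value)
    mfe = max(0.0, (max(highs) - entry_price) * big_point_value)
    return mae, mfe


def stop_loss_impact(trades, stop_loss):
    """Count losers removed and winners cut by a stop of `stop_loss` USD.

    `trades` is a list of (mae, final_pnl) pairs, with MAE <= 0.
    """
    hit = [pnl for mae, pnl in trades if mae <= -abs(stop_loss)]
    losers_removed = sum(1 for pnl in hit if pnl < 0)
    winners_cut = sum(1 for pnl in hit if pnl > 0)
    return losers_removed, winners_cut


toy_mae, toy_mfe = excursions(
    1950.0, lows=[1948.0, 1940.0, 1945.0], highs=[1952.0, 1951.0, 1960.0]
)

# Invented (MAE, final P&L) pairs in USD.
toy_pairs = [(-4500.0, -4200.0), (-500.0, 300.0),
             (-4100.0, 900.0), (-1000.0, -800.0)]
removed, cut = stop_loss_impact(toy_pairs, stop_loss=4000.0)
```

Counting the winners that the same stop would have cut is what makes the MAE chart useful: a stop is worth adding only if it removes more losses than the profits it gives up.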

The information regarding “Trade Duration” needs to be correlated with the Maximum Adverse Excursion (MAE) to understand whether there is a possibility of cutting trades temporally even more efficiently. In our case, this possibility appears to be nonexistent, as trades are evenly distributed across 1 to 23 bars of life.

Below is the display of three consecutive trades.

We have also just introduced the possibility of running a Monte Carlo Analysis, although it is not strictly a traditional Monte Carlo approach. It relies on the Instability Factor and enables us to assess whether the newly formulated strategy is generally “optimistic” or not (we explain the details within the Python Academy).

Let’s start by defining the following two parameters:

IF_SAMPLE: Specifies the number of resamples (non-Monte Carlo) to use for the simulation.
INSTABILITY_FACTOR: Relates to an instability factor designed to make overfitting more challenging.

The swarm of curves above represents the 1000 replicates (which, I reiterate, are not a recombination of trades as in the classic Monte Carlo) of the original equity line (colored in “lime”). From a qualitative perspective, it is evident that the original curve is situated in the higher percentiles of the distribution.

The judgment on the drawdown profile is more challenging. We seek assistance from quantitative analysis:

From a purely profit-oriented perspective, the original curve stands at the 89th percentile of the distribution. Considering that a neutral position corresponds to the 50th percentile, with a pessimistic stance falling below this level and an optimistic one above, this indicates that we have devised a strategy whose outcomes are to be considered strongly optimistic (+39.2 above neutrality).

Regarding the drawdown, we need to reverse the scale (as these are inherently negative values): a value above the 50th percentile translates to a pessimistic forecast, and a value below it to an optimistic one. In our case, we thus have a pessimistic projection of +12.7.

In summary, we have a strategy that will likely generate a lower profit than the backtest shows, but probably with a smaller maximum drawdown as well.
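The distance from neutrality quoted above is a percentile rank minus 50. The sketch below shows only that final step: the replicate values are evenly spaced stand-ins, not the output of the Instability Factor procedure, which is not described in this article, and the function names are hypothetical.

```python
def percentile_rank(value, sample):
    """Percentage of sample values less than or equal to `value`."""
    return 100.0 * sum(1 for s in sample if s <= value) / len(sample)


def distance_from_neutral(value, sample):
    """Percentile rank minus 50: positive means above the median replicate."""
    return percentile_rank(value, sample) - 50.0


# Stand-in final profits for 1000 replicates (NOT real engine output).
replicate_profits = [100.0 * i for i in range(1, 1001)]
original_profit = 89_200.0

rank = percentile_rank(original_profit, replicate_profits)
score = distance_from_neutral(original_profit, replicate_profits)
print(f"percentile: {rank:.1f}  distance from neutral: {score:+.1f}")
```

For profit, a positive distance reads as optimistic; for drawdown the same number is read on the reversed scale described above.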

The new “Wintermute” engine (version 20231114) will be delivered to all course participants (who will be free to modify it as they please) tomorrow, during the live broadcast of the new Python Academy 2023.


Giovanni Trombetta

Founder – Head of R&D
Gandalf Project
