31 August 2015

The most reliable way to test systematic investment strategies is on data that was not available when the strategy was designed. This ‘out-of-sample’ testing removes the possibility that performance is due to overfitting, which is a common problem when relying on back-testing to estimate strategy performance.

The difficulty with out-of-sample tests is that many strategies have weak predictive power, so validation can take a long time: reaching a statistically significant conclusion typically requires decades of new data.
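As a rough illustration of why validation takes so long, the sketch below estimates how many years of data are needed before the expected t-statistic of a strategy's mean return reaches a one-sided critical value. It assumes independent, normally distributed annual returns, so that the expected t-statistic is approximately the Sharpe ratio multiplied by the square root of the number of years. The function name `years_for_significance` is hypothetical, and the calculation ignores statistical power, so it is a back-of-the-envelope guide rather than a sample-size formula.

```python
from statistics import NormalDist


def years_for_significance(sharpe: float, significance: float = 0.05) -> float:
    """Years of data at which the expected t-statistic hits the critical value.

    Assumes independent, normally distributed annual returns, so that
    E[t-statistic] is approximately sharpe * sqrt(years).
    """
    if sharpe <= 0:
        raise ValueError("sharpe must be positive")
    z_crit = NormalDist().inv_cdf(1.0 - significance)  # one-sided critical value
    return (z_crit / sharpe) ** 2


for s in (0.2, 0.5, 1.0):
    print(f"Sharpe {s:.1f}: about {years_for_significance(s):.0f} years")
```

Under these assumptions, a Sharpe ratio of 0.2 needs nearly 70 years of data, which is why testing strategies one at a time is impractical.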

Here, we describe a unique experiment that allowed Winton to reach conclusions about a collection of strategies in a relatively short time, thus demonstrating the effectiveness of the firm’s research process.

At Winton, we have a well-defined process for testing and screening signals, which has been honed over the past two decades through a combination of trial, error, and careful thought. Our research process is designed to answer the question: does a trading strategy work? But we should also ask the question: does our research process work?

Out-of-sample validation

The strategies we develop can have estimated Sharpe ratios as low as 0.2. A strategy with a true Sharpe ratio of 0.2 has more than a 7% chance of losing money in three consecutive years, and so risks being rejected, even though it may be profitable in the long run.
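The 7% figure can be checked with a short calculation. The sketch below assumes that annual returns, measured in units of annual volatility, are independent and normally distributed with mean equal to the Sharpe ratio; the function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_consecutive_losing_years(sharpe: float, years: int) -> float:
    """Probability that every one of `years` consecutive years loses money.

    Assumes each annual return, in units of annual volatility, is an
    independent draw from N(sharpe, 1).
    """
    p_loss_one_year = normal_cdf(-sharpe)
    return p_loss_one_year ** years


p = prob_consecutive_losing_years(0.2, 3)
print(f"Probability of three losing years in a row: {p:.2%}")
```

With a Sharpe ratio of 0.2, the chance of a loss in any single year is about 42%, and cubing this gives roughly 7.4%.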

To overcome this problem, we ask whether the average skill of all our strategies is positive. If it is, then this would support the idea that our screening protocols can distinguish genuinely skilful strategies from ones whose historical performance is due solely to luck.

In 2012, we designed an experiment to test whether our strategies possess out-of-sample skill. The experiment initially included eight non-trend strategies: we wanted to test our ability to develop strategies that complement the trend-following strategies that Winton has traded since its inception in 1997. The experiment allowed new strategies to be added if they passed our screening protocols. The number of strategies stood at 26 by the experiment’s conclusion.

A particular feature of the experiment was the use of a sequential statistical test, rather than a fixed sample size test. Sequential testing is widely used in industrial quality control, and has also been applied to drug trials. We keep collecting data until we can accept either the null hypothesis that our strategies have no skill or the alternative hypothesis that they have skill.

The experiment’s rules are designed to avoid the multiple testing problem. The advantage of sequential methods is that they can allow early termination of an experiment if the evidence overwhelmingly favours one of the competing hypotheses. A disadvantage is that the amount of data that must be collected is not known in advance.

Before commencing the experiment, we ran simulations using different assumptions about how skilful our strategies might be, and how frequently new strategies would be added to the experiment. These simulations indicated that the experiment would probably take more than three years, and possibly up to a decade.

Such time scales for an R&D project are more typical in the pharmaceuticals or aerospace industries than in finance. But since our aim was to validate predictive signals that are weak compared to the background noise, we had to collect enough data to make our conclusions statistically credible.

Sequential probability ratio test

We used the Sequential Probability Ratio Test. This statistical test was developed by Abraham Wald during the Second World War to make wartime industrial quality control more efficient [1]. For most of the war it was classified, but Wald was eventually allowed to publish it in 1945 [2].

The test involves calculating the log-likelihood ratio (LLR), which contrasts the probability of obtaining the observed data under each of the competing hypotheses. In our experiment these competing hypotheses are H0, that the average Sharpe ratio of our systems is zero, and H1, that the average Sharpe ratio is 0.2.
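To make the log-likelihood ratio concrete, here is a minimal sketch for a single return stream. It assumes independent, normally distributed daily returns with known volatility and 252 trading days per year; these are illustrative assumptions, not a description of Winton's actual calculation, and the name `llr_path` is hypothetical.

```python
import math
from typing import Iterable, List

TRADING_DAYS = 252  # assumed number of trading days per year


def llr_path(daily_returns: Iterable[float], daily_vol: float,
             sharpe_h1: float = 0.2, sharpe_h0: float = 0.0) -> List[float]:
    """Cumulative log-likelihood ratio of H1 against H0 after each day.

    Assumes daily returns are independent draws from a normal distribution
    with known standard deviation `daily_vol`; only the mean differs
    between the two hypotheses.
    """
    # Daily mean return implied by an annualised Sharpe ratio.
    mu1 = sharpe_h1 * daily_vol / math.sqrt(TRADING_DAYS)
    mu0 = sharpe_h0 * daily_vol / math.sqrt(TRADING_DAYS)
    var = daily_vol ** 2
    llr = 0.0
    path = []
    for r in daily_returns:
        # log N(r; mu1, var) - log N(r; mu0, var)
        llr += (mu1 - mu0) * r / var - (mu1 ** 2 - mu0 ** 2) / (2.0 * var)
        path.append(llr)
    return path


# A small but positive daily return can still push the LLR down.
small_gains = [0.00005] * 5
print(llr_path(small_gains, daily_vol=0.01)[-1] < 0)
```

In this Gaussian setting the LLR rises only when a day's return exceeds the midpoint between the two hypothesised means, so a stream of small gains can make money while still favouring the no-skill hypothesis.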

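The LLR is updated as new data become available. The decision whether to accept hypothesis H0 or H1 is made using the following rules: if the LLR falls below a lower threshold a, we accept H0; if it rises above an upper threshold b, we accept H1; otherwise, we continue collecting data.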

The thresholds a and b are determined by the false positive rate, α, and the false negative rate, β, that we are prepared to tolerate. In our experiment we set α=β=0.05.
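The article does not give the threshold formulas. A common choice, assumed here, is Wald's approximation: a = ln(β/(1−α)) and b = ln((1−β)/α). The sketch below computes these values and applies the stopping rule; the function names are hypothetical.

```python
import math
from typing import Tuple


def wald_thresholds(alpha: float, beta: float) -> Tuple[float, float]:
    """Lower and upper LLR thresholds (a, b) from Wald's approximation.

    alpha: tolerated false positive rate (accepting H1 when H0 is true).
    beta:  tolerated false negative rate (accepting H0 when H1 is true).
    """
    a = math.log(beta / (1.0 - alpha))
    b = math.log((1.0 - beta) / alpha)
    return a, b


def sprt_decision(llr: float, a: float, b: float) -> str:
    """Apply the sequential stopping rule to the current LLR."""
    if llr <= a:
        return "accept H0"
    if llr >= b:
        return "accept H1"
    return "continue"


a, b = wald_thresholds(alpha=0.05, beta=0.05)
print(f"a = {a:.3f}, b = {b:.3f}")
print(sprt_decision(1.0, a, b))
```

With α=β=0.05 the thresholds are symmetric at plus and minus ln(19), or about 2.94, so an LLR of 1.0 means the experiment continues.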


The experiment was initiated in July 2012 and strategies that passed our screening procedures were added to the experiment. On average, a strategy was added once every two months.
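The kind of pre-experiment simulation described earlier can be sketched as a small Monte Carlo study. This is an illustration only, not a reconstruction of Winton's simulations: it assumes independent strategies with identical true Sharpe ratios, Gaussian daily returns scaled to unit daily volatility, Wald's approximate thresholds, 252 trading days per year, and one new strategy every 42 trading days. All function and constant names are hypothetical.

```python
import math
import random
from typing import Tuple

TRADING_DAYS = 252      # assumed trading days per year
DAYS_PER_ADDITION = 42  # assumed: one new strategy about every two months


def simulate_sprt_duration(true_sharpe: float, rng: random.Random,
                           sharpe_h1: float = 0.2,
                           alpha: float = 0.05, beta: float = 0.05,
                           initial_strategies: int = 8,
                           max_years: int = 30) -> Tuple[str, float]:
    """Simulate one experiment and return (decision, duration in years)."""
    a = math.log(beta / (1.0 - alpha))   # lower threshold (Wald approximation)
    b = math.log((1.0 - beta) / alpha)   # upper threshold (Wald approximation)
    m1 = sharpe_h1 / math.sqrt(TRADING_DAYS)      # daily mean under H1
    m_true = true_sharpe / math.sqrt(TRADING_DAYS)  # actual daily mean
    llr = 0.0
    for day in range(1, max_years * TRADING_DAYS + 1):
        n = initial_strategies + (day - 1) // DAYS_PER_ADDITION
        # Sum of n independent unit-variance daily returns.
        total = rng.gauss(n * m_true, math.sqrt(n))
        # Sum of the n per-strategy Gaussian LLR increments (H0 mean is zero).
        llr += m1 * total - n * m1 ** 2 / 2.0
        if llr <= a:
            return "accept H0", day / TRADING_DAYS
        if llr >= b:
            return "accept H1", day / TRADING_DAYS
    return "undecided", float(max_years)


rng = random.Random(0)
results = [simulate_sprt_duration(0.2, rng) for _ in range(200)]
durations = sorted(years for _, years in results)
median_years = durations[len(durations) // 2]
share_h1 = sum(d == "accept H1" for d, _ in results) / len(results)
print(f"median duration: {median_years:.1f} years, accepted H1: {share_h1:.0%}")
```

Setting `true_sharpe=0.0` instead gives an estimate of the false positive rate under the same assumptions. Because real strategies are correlated, the durations from this independent-strategy sketch are likely to be optimistic.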

The evolution of the log-likelihood ratio is shown in Figure 1. The log-likelihood ratio is not the same as profit-and-loss. For example, a falling log-likelihood ratio does not necessarily mean that the strategies are losing money, but merely that the no-skill hypothesis is looking more likely.

In June 2015, the log-likelihood ratio crossed the upper threshold b, indicating that we can accept the hypothesis that, on average, our strategies have skill. There was a period of particularly good performance by many of the constituent systems towards the end of 2014.

It is not possible to rule out the possibility that this performance is just lucky. We have continued collecting data, however, to keep testing the hypothesis that our strategies possess skill.

The log-likelihood ratio is corrected for correlations between the systems, which reduce the effective amount of data being collected.
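The article does not say how the correlation correction is made. One simple way to see its effect, assuming every pair of strategies shares the same correlation ρ (an assumption made purely for illustration), is to compute the equivalent number of independent strategies, N / (1 + (N − 1)ρ). In a Gaussian model with equal means and volatilities, the expected rate at which the log-likelihood ratio grows is proportional to this number. The function name below is hypothetical.

```python
def effective_strategy_count(n: int, rho: float) -> float:
    """Equivalent number of independent strategies under equal correlation.

    Assumes n strategies with identical volatility and a common pairwise
    correlation rho between 0 and 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie between 0 and 1")
    return n / (1.0 + (n - 1) * rho)


for rho in (0.0, 0.1, 0.3):
    n_eff = effective_strategy_count(26, rho)
    print(f"rho = {rho:.1f}: 26 strategies count as about {n_eff:.1f}")
```

Even a modest common correlation of 0.1 would make 26 strategies worth fewer than eight independent ones, which shows why ignoring correlations would overstate the evidence.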

There was a spread in the performances of the strategies included in the experiment. Several of the strategies lost money during the period of the test, while some performed exceptionally well. The observed differences in performance, however, were not statistically significant. It was not possible to reject the hypothesis that all the strategies possess the same amount of skill.

Figure 1: The log-likelihood ratio comparing the hypothesis that our strategies have no skill with the hypothesis that they do possess skill

The lower and upper horizontal lines are the termination thresholds, a and b. If the log-likelihood ratio exceeds the upper threshold, the evidence favours the skill hypothesis; if it falls below the lower threshold, the hypothesis of no skill is accepted. The values of these thresholds are determined by the false positive and false negative rates that we were prepared to tolerate (α=β=0.05).

Experiment limitations

The positive experiment result provides evidence that our screening protocols are effective, but there are a few issues that should be borne in mind. Firstly, even if all the underlying assumptions are correct, the test will have a false positive rate of 0.05. In other words, if the strategies included in the experiment have no skill, there would still be a 5 per cent probability of obtaining a positive result.

Beyond the caveat that a positive result might be a statistical fluke, there is also the possibility that strategy performance is non-stationary. That is, a strategy that had genuine skill in the past might not continue to do so in the future. The possibility that strategies might become less skilful over time underlines the need for performance monitoring.


In a carefully designed experiment, a collection of 26 research-approved trading strategies was shown, on average, to have genuine forecasting skill. This is an encouraging result not only for the profitability of these systems, but also for Winton’s research methods.

Winton is in the unique position of being able to allocate resources to important but painstaking research projects such as this one, which require patience and technical expertise in order to critically review our own methodologies.


[1] W.A. Wallis, The Statistical Research Group, 1942-1945, Journal of the American Statistical Association, 75, 320-330, 1980.

[2] A. Wald, Sequential tests of statistical hypotheses, Annals of Mathematical Statistics, 16, 117-186, 1945.
