20 December 2013 - 12 minute read

Overfitting is a well-known problem ‒ and one would expect clever statisticians and scientists not to succumb to its temptations. But what amounts to overfitting can occur more subtly through the collective behaviour of many individuals within an organisation or across many organisations. Here we describe how such "meta-overfitting" may be endemic in finance as well as other fields of research.

Mark Twain is often quoted as saying, "It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so." A perennial problem in both science and finance is distinguishing signal ‒ predictable effects that are likely to persist ‒ from noise ‒ random fluctuations. A common pitfall is to misidentify noise as signal and bet that it will persist. When that happens, incorrect conclusions can lead to undesirable consequences such as ineffective drugs.

Recent articles in The Economist, Nature and elsewhere suggest that this problem is more common than has been generally appreciated in the scientific literature [1, 2, 3, 4]. Publication bias is the issue they address: the tendency for papers that detect some effect to be published over papers reporting a null result. When this bias is compounded by the confusion of noise with signal ‒ so that a false positive occurs ‒ the result is that many published conclusions are actually wrong.

Here we highlight how these issues are also relevant to the world of finance, with equally serious consequences. Whether it is investors having to wade through advertising literature from fund managers, or "quants" inside banks having to pick and weight strategies, there is the potential for noise to get confused with signal. As with publication bias, when one selectively picks apparently well-performing investments based on noisy estimates of their true performance, one may end up overly optimistic about what to expect in the future, and therefore make poor investment decisions.

Noise can be confused with signal in many ways, some of which are well known and easy to avoid, while others are more insidious and threaten the reliability of whole areas of scientific and financial research. In this paper, we introduce the reader to some basic concepts and terms and discuss the ways in which one can be misled, before going through possible solutions.

Hypothesis testing and “p-values”

We begin with two workhorses of statistics: significance testing and p-values.

In many areas of science, "null hypothesis significance testing" has become the standard framework for testing hypotheses. In this framework, we start by specifying a null hypothesis. You can think of this as effectively what we are going to believe unless there is strong evidence against it. Our data is then condensed into a statistic, and its "p-value" is determined. This is the probability of getting a value of the statistic as extreme or more extreme than that observed if the null hypothesis is true. A small p-value means that either a very unlikely event has occurred, or the null hypothesis is false. Small enough, and we are inclined to reject the null hypothesis.

The null hypothesis is analogous to the presumption of innocence in a criminal court. In finance, an appropriate null hypothesis is often that price changes are not predictable. The statistic we often consider is the historical performance of an investment strategy, whose expected value under this null hypothesis is zero. If the p-value ‒ the probability of obtaining the observed historical trading performance, or better, due to chance alone ‒ is sufficiently small, we feel justified in rejecting the null hypothesis and adopting an alternative explanation instead.

As an example, suppose a researcher presents a strategy that has a historical Sharpe ratio of 0.6. We can calculate the distribution of Sharpe ratios we would get from skill-free strategies that buy and sell entirely at random over the same time period. This is the red distribution in Figure 1. We find that only 0.2% of these random strategies have Sharpe ratios better than 0.6. That means there's a probability (a p-value) of only 0.002 of getting such a high Sharpe ratio if the strategy has no skill. So with some confidence we conclude that it “works”.      

Figure 1: Skill-free performance distribution

The performance distribution, as measured by the Sharpe ratio, of skill-free trading strategies over 23 years. In this example, the probability of obtaining a Sharpe ratio better than 0.6 is only 0.002. This is the "p-value".
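For readers who want to reproduce the flavour of this calculation, the sketch below simulates such a null distribution under simple assumptions that are not taken from the article: roughly 23 years of daily data, constant volatility, and strategies that are randomly long or short one unit each day. The daily volatility and the number of simulated strategies are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

N_DAYS = 23 * 252        # roughly 23 years of daily data (assumption)
N_STRATEGIES = 20_000    # number of skill-free strategies to simulate
DAILY_VOL = 0.01         # assumed daily volatility of market moves

sharpes = np.empty(N_STRATEGIES)
for i in range(N_STRATEGIES):
    # Market moves with no predictable component: pure noise.
    market = rng.normal(0.0, DAILY_VOL, N_DAYS)
    # A skill-free strategy is long or short one unit at random each day.
    position = rng.choice([-1.0, 1.0], N_DAYS)
    pnl = position * market
    # Annualised Sharpe ratio of this random strategy.
    sharpes[i] = np.sqrt(252) * pnl.mean() / pnl.std()

# p-value: fraction of skill-free strategies beating the observed 0.6.
print("p-value:", (sharpes > 0.6).mean())   # close to 0.002 for ~23 years
```

With around 23 years of data the standard deviation of a skill-free annualised Sharpe ratio is roughly 0.2, so a Sharpe ratio of 0.6 sits about three standard deviations out, consistent with a p-value near 0.002.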

How to confuse signal and noise

Various research practices can lead to unreliable results, and overconfidence in rejecting a null hypothesis. The root cause is often similar but the label given depends on the context.


Overfitting

Overfitting occurs when a model has too many parameters relative to the amount of data available. The consequence is that the model is tuned to noise as well as signal, so it does a good job of reproducing the data on which it was fitted. But if faced with new data, generated by the same true underlying process, the model falls down.

An example of this is shown in Figure 2. Here the dataset was generated with a two-parameter linear model and some noise, but we have fitted models with up to 10 parameters. The more complicated models appear to fit better, in that they go through (or near) more data points. But such a model is not expected to predict new data very well if that data comes from the linear model.

Figure 2: An example of overfitting

30 data points generated from a first-order polynomial (two parameters), subsequently fitted with polynomials of order 1, 4, and 9. The higher-order polynomials have lower average residuals between the data and the model fit, but they are unlikely to fit new data well.
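The sketch below reproduces the spirit of this experiment. The true slope, intercept, and noise level are assumptions for illustration, not the values behind Figure 2.

```python
import numpy as np

rng = np.random.default_rng(1)

# 30 training points from a two-parameter linear model plus noise.
x_train = np.linspace(0.0, 1.0, 30)
y_train = 1.0 + 2.0 * x_train + rng.normal(0.0, 0.3, x_train.size)

# Fresh data from the same underlying process, never seen during fitting.
x_test = rng.uniform(0.0, 1.0, 30)
y_test = 1.0 + 2.0 * x_test + rng.normal(0.0, 0.3, x_test.size)

for degree in (1, 4, 9):
    coeffs = np.polyfit(x_train, y_train, degree)        # fit the polynomial
    in_sample = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    out_sample = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: in-sample MSE {in_sample:.3f}, "
          f"out-of-sample MSE {out_sample:.3f}")
# The in-sample error always falls as the degree rises; the out-of-sample
# error of the higher-order fits is typically worse than the simple line's.
```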

In finance, this is an easy trap to fall into ‒ improving the apparent goodness of fit can improve the profitability shown by historical simulations.

An example of overfitting is shown in Figure 3, which shows the backtested performance of a seasonal trading strategy. The parameters of the strategy were chosen based on the period 1980 to 2004, and the strategy was then applied to the out-of-sample periods before 1980 and after 2004. The performance in both of these periods is significantly worse than the in-sample performance.

Figure 3: A seasonal trading strategy that has been overfitted

The model parameters were chosen based on the period 1980 to 2004. When applied to the out-of-sample periods the performance is significantly worse than during the in-sample period.

If only the post-2004 period had been tested, we might attribute the drop in performance to the seasonal effect disappearing. However, the similarly poor performance prior to 1980 strongly suggests that overfitting is the culprit.

Selection bias

Models with a large number of free parameters are most prone to overfitting. But what if we try hundreds of separate simple models and retain only those that best fit the data? The effect is the same as overfitting with multiple parameters, but because we are now choosing among multiple hypotheses rather than tuning parameters, it is usually referred to as "selection bias".

Selection bias can occur when one researcher tests many ideas, but it can also occur at an institutional level. Imagine that one hundred researchers each test the efficacy of a single variable for predicting market movements. Each individual researcher might be meticulous about avoiding overfitting and correctly compute their p-values. But if only the one researcher who found the highest level of predictability reports their results to management, this paints a similarly misleading picture.

As an example, suppose a researcher presents us with an investment strategy that has a historical Sharpe ratio of 0.6. In Figure 1 we already calculated the distribution of Sharpe ratios we would get from random, skill-free strategies over the same time period. We found that only 0.2% of these random strategies have Sharpe ratios better than 0.6, which made us confident about the strategy's predictive power.

However, if this strategy was presented to us because it was actually the best of 100 tested strategies, then it is not appropriate to compare its Sharpe ratio with the distribution in Figure 1. We should instead compare it with the distribution of the best Sharpe ratio selected from 100 random strategies, each of which has no real predictive power.

This is the blue distribution in Figure 4. In this distribution, 15% of the skill-free strategies have Sharpe ratios exceeding 0.6. In general, a p-value of 0.15 is not deemed statistically significant: such a result could quite easily be achieved without any true predictive power.

Figure 4: Skill-free performance distribution versus cherry-picked performance distribution

The performance distribution of individual skill-free trading strategies, over 23 years is shown in red. The performance distribution of a skill-free trading strategy cherry-picked as the best performer from a set of 100 is shown in blue. In this example, the probability of obtaining a Sharpe ratio better than 0.6 is only 0.002 in the individual distribution but increases to 0.15 in the “best-of-100" distribution.
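The cherry-picked distribution can be built with a small extension of the earlier simulation: generate 100 skill-free strategies, keep only the best Sharpe ratio, and repeat. As before, the horizon, volatility, and trial counts below are illustrative assumptions rather than the settings behind Figure 4.

```python
import numpy as np

rng = np.random.default_rng(2)

N_DAYS = 23 * 252     # roughly 23 years of daily data (assumption)
N_TRIALS = 1_000      # number of "best of 100" experiments to run
N_CANDIDATES = 100    # skill-free strategies tested before reporting the best

best_sharpes = np.empty(N_TRIALS)
for i in range(N_TRIALS):
    # 100 skill-free strategies whose daily P&L is pure noise.
    pnl = rng.normal(0.0, 0.01, size=(N_CANDIDATES, N_DAYS))
    sharpes = np.sqrt(252) * pnl.mean(axis=1) / pnl.std(axis=1)
    # Only the best performer is ever shown to management.
    best_sharpes[i] = sharpes.max()

# Chance that a cherry-picked, skill-free strategy beats a 0.6 Sharpe ratio.
print("best-of-100 p-value:", (best_sharpes > 0.6).mean())
# Far larger than 0.002; Figure 4 puts the corresponding figure at about 0.15.
```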

A memorable lesson in the dangers of multiplicity was given when neuroscientists placed a dead salmon in a functional magnetic resonance imaging (fMRI) scanner. The dead salmon was asked to look at photographs depicting human emotions, and the scientists detected statistically significant activity in parts of its brain! The problem (as the scientists were trying to illustrate) was that by looking for activity anywhere in the salmon's brain they were effectively testing thousands of hypotheses. When they corrected for this, the result was no longer significant [5].
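The scale of such multiple-testing effects can also be sanity-checked analytically. If the 100 strategy tests above were independent ‒ a simplifying assumption ‒ the chance that the best of them clears a threshold that any individual skill-free strategy clears with probability p = 0.002 is

```latex
\[
  P_{\text{best of } N} \;=\; 1 - (1 - p)^{N}
                        \;=\; 1 - (1 - 0.002)^{100}
                        \;\approx\; 0.18
\]
```

which is of the same order as the 0.15 obtained from the simulated best-of-100 distribution in Figure 4; any correlation between the candidate strategies reduces the effective number of independent tests and pulls the figure down.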

Again, the problem is not just academic: in trying to find a good model, a researcher will study many ideas, as it would be perverse to try only one.

Publication bias

The selective reporting of results is a recognised issue in many fields of science, including medicine, psychology, economics and political science [6, 7, 8, 9]. It may arise because researchers are unmotivated to write up negative results or because scientific journals don't want to publish such results. In areas such as clinical drug testing, there might even be commercial incentives to conceal negative findings.

Whatever the cause, this selectivity gives rise to publication bias, also known as the "file drawer effect", a reference to the unpublished results hidden away in file drawers. This selection process means that a disproportionate number of results published in the scientific literature are nothing more than statistical flukes. Subsequent experiments are therefore unable to reproduce these results; a drug which appeared to work loses its efficacy.
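To see how quickly this selection can corrupt a literature, the sketch below simulates a field in which only "significant" results get written up. The study counts, effect sizes, and the assumed 10% rate of genuinely true hypotheses are illustrative assumptions, not estimates for any real field.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

N_STUDIES = 10_000       # hypotheses tested across a field (assumption)
TRUE_EFFECT_RATE = 0.1   # fraction of hypotheses that are genuinely true
EFFECT_SIZE = 0.5        # standardised effect size when an effect is real
N_OBS = 30               # observations collected per study
ALPHA = 0.05             # significance threshold required for "publication"

published_true = published_false = 0
for _ in range(N_STUDIES):
    real = rng.random() < TRUE_EFFECT_RATE
    sample = rng.normal(EFFECT_SIZE if real else 0.0, 1.0, N_OBS)
    # Two-sided z-test of the sample mean against a null of zero effect.
    z = sample.mean() * math.sqrt(N_OBS) / sample.std(ddof=1)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    if p < ALPHA:                  # only "positive" results get written up
        published_true += real
        published_false += not real

share = published_false / (published_false + published_true)
print(f"share of published findings that are flukes: {share:.2f}")
# A substantial fraction, even though every single test used a 5% threshold.
```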

It is easy to see how finance could be particularly susceptible to these problems. The short-term gains of a positive result will push people to chase products and strategies that they can show "work" on past data. Evidence supports this suspicion: the poor average performance of the investment fund industry [14], and studies showing that hypothetical performance data is not consistent with live performance results in the case of ETFs and CTAs [10, 11]. This could be due to malicious misselling or poor analysis (not taking into account transaction costs, for example), but publication bias could also be contributing.

Survivorship bias

Survivorship bias arises when the performance of a model affects its inclusion in your study. For example, if you select the 10 largest hedge funds, then you are probably inadvertently selecting funds which have performed well in the past. Therefore, a dataset of their past performance will not be representative of the wider industry, or provide reliable forecasts of what we can expect to happen in the future. Survivorship bias is a significant problem in finance, where dead funds often disappear from memories and databases.

Consider a set of funds with no skill. Some will produce decent returns, simply by chance, and these will attract investors, while the poorly performing funds will close and their results may disappear. Looking at the results of those surviving funds, you would think that on average they do have some skill. But they don't, and in the future they should not be expected to perform as well as they did in the past.

Even if the funds really do have investment skill, the best performers will still tend to be the ones that have also been lucky. As a result, we should not expect the best performers in the population to do quite so well in the future: there will be some "regression to the mean" in their returns.
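A minimal sketch of both effects, assuming a population of entirely skill-free funds and a database that silently drops the worse-performing half, might look like this:

```python
import numpy as np

rng = np.random.default_rng(4)

N_FUNDS = 1_000        # funds with no skill at all (assumption)
N_DAYS = 5 * 252       # five years of daily returns per period (assumption)
DAILY_VOL = 0.01

def sharpe(returns):
    """Annualised Sharpe ratio of each row of daily returns."""
    return np.sqrt(252) * returns.mean(axis=1) / returns.std(axis=1)

# Past period: every fund's returns are pure noise, so true skill is zero.
past = sharpe(rng.normal(0.0, DAILY_VOL, size=(N_FUNDS, N_DAYS)))

# The worse-performing half close and vanish from the database.
survivors = past > np.median(past)

# Future period for the survivors: still pure noise.
future = sharpe(rng.normal(0.0, DAILY_VOL, size=(survivors.sum(), N_DAYS)))

print("all funds, past Sharpe:   ", round(past.mean(), 2))             # ~0.0
print("survivors, past Sharpe:   ", round(past[survivors].mean(), 2))  # clearly > 0
print("survivors, future Sharpe: ", round(future.mean(), 2))           # back to ~0.0
```

With genuinely skilled funds the reversion would be partial rather than complete, which is the "regression to the mean" described above.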


Possible solutions

The numerous sources of bias might seem rather overwhelming, but there are a number of statistical tools and processes that can help mitigate them. Most generally, one should invoke "Occam's razor" ‒ that is, one should prefer the simplest of the possible explanations. A more quantitative solution for overfitting is cross-validation, which is the principle that the data used to develop a hypothesis cannot also be used to test that hypothesis.
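As an illustration, a k-fold cross-validation of the polynomial fits from Figure 2 might look like the following sketch; the data-generating parameters are again assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Data from the same kind of two-parameter linear process as in Figure 2.
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, x.size)

K = 5                              # number of cross-validation folds
folds = np.arange(x.size) % K      # assign each point to one of the K folds

for degree in (1, 4, 9):
    errors = []
    for k in range(K):
        train, test = folds != k, folds == k
        # Fit on K-1 folds; score only on the held-out fold.
        coeffs = np.polyfit(x[train], y[train], degree)
        errors.append(np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2))
    print(f"degree {degree}: cross-validated MSE {np.mean(errors):.3f}")
# Judged only on data it was not fitted to, the simple degree-1 model
# usually comes out ahead; the higher orders pay for chasing the noise.
```

Because each model is scored only on data it never saw during fitting, the extra in-sample flexibility of the high-order polynomials no longer counts in their favour.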

A non-statistical strategy for tackling the file-drawer problem is to force researchers to publish the results of all their studies. This strategy is being increasingly adopted for clinical drug trials, and many countries now require that trials are preregistered. The aim is to ensure that the literature is representative of all research, rather than a potentially biased subset of research, as is currently the case [12, 13].


The aim of statistical inference, whether it is in investment strategy development, medical research, the evaluation of public policy, or any other area, is not to say what would have been most effective in the past, but to make a statement about what will hold in the future. For valid inferences, we need to avoid bias in our conclusions.

Bias can be introduced into the scientific literature, as well as into estimates of the performance of trading strategies, in a variety of ways. Individual researchers can introduce bias by overfitting their models or cherry-picking their predictor variables or model structures. Institutions can introduce bias by inadvertently operating policies which tend to favour the reporting of positive results. Universities, academic journals, and financial modellers are all open to the risk.

In this note, we have described the causes of such bias and some of the tools which can be used to reduce the problem, yielding valid scientific conclusions. Unreliable results can also be the product of deliberate misselling, or even incompetence. These are quite different from the problems discussed here, which we believe are subtle enough to fool many smart investors, scientists, and researchers inside the financial industry, as shown by recent studies [10, 11].


References

[1] Trouble at the lab, The Economist, 23-27, Oct. 19th-25th, 2013.

[2] D. Sarewitz, Beware the creeping cracks of bias, Nature, 485, 149, 2012.

[3] J. P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Medicine, 2(8), e124, 2005.

[4] J. P. Simmons, L. D. Nelson, U. Simonsohn, False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant, Psychological Science, 22, 1359-1366, 2011.

[5] C. M. Bennett, et al., Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction, Human Brain Mapping Conference, San Francisco, Jun. 2009.

[6] J. P. A. Ioannidis, Contradicted and initially stronger effects in highly cited clinical research, Journal of the American Medical Association, 294, 218-228, 2005.

[7] E. J. Masicampo, D. R. Lalande, A peculiar prevalence of p values just below .05, The Quarterly Journal of Experimental Psychology, DOI: 10.1080/17470218.2012.711335, 2012.

[8] C. Doucouliagos, T. D. Stanley, Are all economic facts greatly exaggerated? Theory competition and selectivity, Journal of Economic Surveys, 27, 316-339, 2013.

[9] A. Gerber, N. Malhotra, Do statistical reporting standards affect what is published? Publication bias in two leading political science journals, Quarterly Journal of Political Science, 3, 313-326, 2008.

[10] J. M. Dickson, S. Padmawar, S. Hammer, Joined at the hip: ETF and index development, Vanguard research, 2012.

[11] Winton Research, The Hypothetical Performance of CTAs, 2013.

[12] C. Chambers et al., Trust in science would be improved by study pre-registration, The Guardian, Letters, Jun. 5, 2013.

[13] E. J. Wagenmakers et al., An agenda for purely confirmatory research, Perspectives on Psychological Science, 7, 632-638, 2012.

[14] E. F. Fama, K. R. French, Luck Versus Skill in the Cross Section of Mutual Fund Returns, Journal of Finance, 65, 1915-1947, 2010.

This article contains simulated or hypothetical performance results that have certain inherent limitations. Unlike the results shown in an actual performance record, these results do not represent actual trading.  Also, because these trades have not actually been executed, these results may have under- or over-compensated for the impact, if any, of certain market factors, such as lack of liquidity and cannot completely account for the impact of financial risk in actual trading.  There are numerous other factors related to the markets in general or to the implementation of any specific trading program which cannot be fully accounted for in the preparation of hypothetical performance results and all of which can adversely affect actual trading results. Simulated or hypothetical trading programs in general are also subject to the fact that they are designed with the benefit of hindsight. No representation is being made that any investment will or is likely to achieve profits or losses similar to those being shown.  
