31 January 2015

Value, size and momentum have a long history as stock price predictors, and similar indicators have been applied to stock indices in order to predict the performance of one national index against another. Published backtests of trading systems based on these ideas have shown impressive performance, but in this paper we find that this performance does not continue past the publication dates.

We argue that selection bias at the time of publication has a part to play in the disappointing out-of-sample performance of these indicators. We show how the combination of estimation uncertainty and selective reporting can readily explain the observed deterioration in performance. Importantly, with a fuller understanding of these effects, the long-term poor performance of the indicators could have been anticipated at the time.

Efforts to find predictors for stock returns have a long history. Quantitative work on momentum goes back at least to the 1960s, with the observation that, over timescales of a few months, stocks that performed well in the past tend to also perform well in the future [1].

Later work showed some evidence for a negative effect (mean reversion) on a longer timescale [2, 3]. Valuation ratios also have a long history. The controversy over the Value Line system in the 1960s [4, 5] is a well-known example. Fama and French recently reviewed the evidence for value, size and momentum factors across the world’s stock markets [6].

Starting in the 1990s, a series of papers was published suggesting that similar effects could be seen in national stock indices [2, 3, 7, 8]. In analogy to cross-sectional equity systems [9], countries were ranked on size, value or momentum indicators to form portfolios with long positions in the indices for the highest-ranked countries and short positions for the lowest-ranked countries. The published work showed excellent historical performance for such trading systems.

Here we review the performance of these cross-country equity portfolios. Nearly twenty years have passed, so we have the advantage of a large set of historical data that was not available to the authors of the listed papers.

First, we replicate the published evaluations in the period before 1995, before showing that the performance of the trading systems on more recent data is disappointing. We believe this is because of selection bias in the published results, and we go on to show how this may have led to over-optimistic assessments of the systems. More importantly, we demonstrate how this could have been avoided at the time.

Replicating the published work

The papers [2, 3, 8] use national stock market indices provided by MSCI [10] across 18 developed countries in the period 1970-1995. Table 1 lists the four indicators calculated from this dataset that were used in these papers.

Table 1: Indicators used in the cited publications to construct long‐short trading systems on national equity indices

The published Sharpe ratios are compared with the Sharpe ratios from our tests.

The momentum indicator (MOM) is the total return of an index over the last year, ignoring the return from the last month. The value indicator (V) is the book-to-market ratio of the index and size (S) is the inverse free-float market capitalisation. The book value and market capitalisation are each calculated for the index as a whole by summing the values for the component stocks. These three indicators are the same as those used in [8].

The mean-reversion indicator (MR) is defined in a number of different ways in [2] and [3]. We choose the fractional return between three years and one year ago so that the MR indicator is nearly independent of the MOM indicator.
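The two price-based indicators can be sketched in a few lines. This is a minimal illustration, not the code used for the tests reported here: it assumes a list of month-end total return index levels for one country with the most recent month last, and the function names `momentum` and `mean_reversion` are our own.

```python
def momentum(levels):
    """MOM: return over the last year, ignoring the most recent month.

    `levels` holds month-end total return index levels, latest last
    (an assumed input layout). The window runs from 12 months ago
    to 1 month ago.
    """
    if len(levels) < 13:
        raise ValueError("need at least 13 month-end levels")
    return levels[-2] / levels[-13] - 1.0


def mean_reversion(levels):
    """MR: fractional return between three years ago and one year ago."""
    if len(levels) < 37:
        raise ValueError("need at least 37 month-end levels")
    return levels[-13] / levels[-37] - 1.0


# Hypothetical index growing 1% per month, for illustration only.
example_levels = [100.0 * 1.01 ** i for i in range(37)]
print(momentum(example_levels), mean_reversion(example_levels))
```

The two windows do not overlap, which is why the MR indicator defined this way is nearly independent of MOM.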

Each indicator defines a distinct trading system which will be associated with a series of portfolios through time. The indicator is calculated for each country at the end of a calendar month. The countries are then ranked by the values of their indicator. Out of the 18 countries, we take a positive position in the top six countries and a negative position in the bottom six. All positions are equal in their allocation, so the portfolio is dollar neutral. The positions are held for one month.
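The ranking step described above can be sketched as follows. The function name, the country labels and the indicator values are hypothetical, and the tie-breaking rule (alphabetical by country) is an assumption, since the way ties are resolved is not specified in the published work.

```python
def long_short_positions(indicator, n_long=6, n_short=6):
    """Return +1 / -1 / 0 positions from a {country: indicator} mapping.

    Countries are ranked from highest to lowest indicator value.
    Ties are broken alphabetically by country name (an assumption).
    """
    if n_long + n_short > len(indicator):
        raise ValueError("not enough countries for the requested positions")
    ranked = sorted(indicator, key=lambda c: (-indicator[c], c))
    positions = {country: 0 for country in ranked}
    for country in ranked[:n_long]:
        positions[country] = 1       # equal-sized long position
    for country in ranked[len(ranked) - n_short:]:
        positions[country] = -1      # equal-sized short position
    return positions


# Hypothetical indicator values for 18 placeholder countries.
example = {f"C{i:02d}": 0.01 * i for i in range(1, 19)}
positions = long_short_positions(example)
print(sum(positions.values()))  # dollar neutral: longs and shorts cancel
```

The positions would be recomputed at the end of each calendar month and held for the following month.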

Table 1 also shows Sharpe ratios from the published sources and from our own tests over the period 1970-1995. We believe that the remaining differences between the published values and our own are due to small differences in the definition of the trading systems (for example, the way in which ties in the ranks are resolved).

What happened next? Out‐of‐sample performance

The Sharpe ratios in Table 1 are in-sample results: the same dataset (1970-1995) that was used to develop the trading systems was also used to evaluate them. We have an advantage over the researchers who published the original work. Working in 2014, we can compute the performance of the systems on market data from 1995 to 2014, which gives an out-of-sample test. The results are shown in Figure 1 and Table 2.
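As an illustration of the in-sample and out-of-sample split, the sketch below computes an annualised Sharpe ratio on each side of a cut-off year. It rests on several assumptions of ours: returns are monthly and supplied as (year, return) pairs, no risk-free rate is subtracted because the portfolio is dollar neutral, and returns dated in or after the split year count as out-of-sample.

```python
import math
import statistics


def annualised_sharpe(monthly_returns, periods_per_year=12):
    """Mean over sample standard deviation, scaled to an annual figure."""
    mean = statistics.mean(monthly_returns)
    sd = statistics.stdev(monthly_returns)
    return mean / sd * math.sqrt(periods_per_year)


def split_sharpe(dated_returns, split_year=1995):
    """Sharpe ratios before and from `split_year` onwards.

    `dated_returns` is a list of (year, monthly_return) pairs
    (an assumed layout chosen for this sketch).
    """
    in_sample = [r for year, r in dated_returns if year < split_year]
    out_of_sample = [r for year, r in dated_returns if year >= split_year]
    return annualised_sharpe(in_sample), annualised_sharpe(out_of_sample)
```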

Figure 1. Cumulative return of the published systems over the in‐sample and out‐of‐sample periods

Cumulative profit (positive) or loss (negative) in billion US dollars of the published systems over the in‐sample and out‐of‐sample periods (left and right of the vertical line). A US$100M long or short position is taken in each single country selected. The performance replications for MOM and MR start later due to the required price history, and the V replication starts in 1975, when the book value data become available. The combined performance of the four systems is shown in the bottom section.

All four systems have worse performance in the out-of-sample period than in the original test period. Two of the four systems lose money after 1995, and even the momentum system shows almost no profit after 2000.

Table 2. Sharpe ratios for the four published systems, in‐sample and out‐of‐sample

The Sharpe ratios for the combined system are also given. All systems show a decline in performance.

A family of trading systems: Evidence for selection bias

As we saw in the previous section, the performance of the national stock index portfolios is disappointing after the publication of these papers. This might be explained in several ways. Perhaps the poor out-of-sample performance is due to bad luck and will be reversed in the future. Or the market environment might have changed in a way that makes the systems perform more poorly.

However, the most likely explanation is selection bias. In fact, we should have expected a drop in performance even without the benefit of hindsight.

We can understand this better by looking at the family of trading systems which the authors selected from. Usually it is difficult to know which trading systems were considered before one was selected. In this case, we can make a reasonable guess.

The MSCI dataset that contains the book-to-market ratio and the market capitalisation for each country also contains a selection of other indicators, which have likely been studied alongside the published indicators. As further evidence, we find that in [7] (which is referred to by [8]) the correlations between index returns and some of these additional indicators are studied. The book-to-market (V) indicator is assessed as the most promising.

Furthermore, the momentum and mean-reversion systems use specific timescales suggested by successful systems trading individual equities. But other timescales, from one quarter to five years, have also been used in systems trading both indices and individual equities [9, 11].

These considerations suggest a wider group of ten trading systems which might have been tested by researchers in the 1990s alongside the published systems. They are listed in Table 3. To compare the performance of this family of systems across the entire time range, we use the 13 countries with data for all the indicators across the period 1975-2014. The in-sample results are shown in Figure 2.

Table 3. Indicators for ten trading systems that could have been tested in the 1990s

Includes the four already tested.

Figure 2. In‐sample (pre‐1995) Sharpe ratios of the ten equity index trading systems listed in Table 3

The four published systems discussed above are highlighted in red. The mean (0.34) is indicated by a black line, and the one‐standard‐deviation range (σ=0.23) is shown as the red band around the mean. Error bars show the estimated sampling error σs, calculated from a formula given in [12].

Strikingly, all four of the published Sharpe ratios are above the average. This is very suggestive of active performance-based selection of trading systems.

If the published trading systems were picked because they were among the best in a wider set of systems, then we would expect their out-of-sample performance to be worse than their in-sample performance [13].

This phenomenon of “regression to the mean” applies in many other contexts and often leads to controversy [14]. Students with the best test scores in a particular school year are expected to decline in performance the next year (leading to accusations that they are a neglected group). Patients selected because they have high blood pressure will tend to show a decrease in blood pressure in the next stage of a clinical trial (leading to a positive assessment of any treatment they are given – even if it is ineffective).

The reason for this decline is easy to understand. In each case there is a large random variation in the quantity measured. Trying to select the very best, we select the very lucky. And the very lucky are not likely to be as lucky next time.
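A small simulation makes the effect concrete. It is a sketch under deliberately simple assumptions that are ours, not results from the data: every system has the same true Sharpe ratio, measurement noise is Gaussian and independent between the two periods, and the best few systems in the first period are the ones selected. The default values 0.34 and 0.23 are borrowed from the family mean and standard deviation quoted for Figure 2.

```python
import random
import statistics


def simulate_selection(n_systems=10, n_selected=4, true_sharpe=0.34,
                       noise=0.23, n_trials=2000, seed=0):
    """Average measured Sharpe of the selected systems, in and out of sample."""
    rng = random.Random(seed)
    in_selected, out_selected = [], []
    for _ in range(n_trials):
        # Same true skill for every system; only the luck differs.
        in_s = [rng.gauss(true_sharpe, noise) for _ in range(n_systems)]
        out_s = [rng.gauss(true_sharpe, noise) for _ in range(n_systems)]
        # Pick the systems that looked best in the first period.
        best = sorted(range(n_systems), key=lambda i: in_s[i],
                      reverse=True)[:n_selected]
        in_selected.append(statistics.mean([in_s[i] for i in best]))
        out_selected.append(statistics.mean([out_s[i] for i in best]))
    return statistics.mean(in_selected), statistics.mean(out_selected)


in_avg, out_avg = simulate_selection()
print(f"selected, in-sample: {in_avg:.2f}; out-of-sample: {out_avg:.2f}")
```

In this setting the selected systems look well above average in the first period and fall back towards the true value in the second, even though nothing about the systems has changed.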

Estimating the effect of selection bias

We now understand why trading systems selected for their performance in-sample are expected to show a decline in returns outside the sample period.

If we can estimate the size of this effect using only an in-sample test, then we will not need to wait until out-of-sample data are available; we can make a corrected estimate of the future performance of a trading system immediately. This type of calculation is a valuable tool for the analysis of trading systems and portfolio construction.

To do this, we need to know how much of the difference in performance between the systems is caused by real differences in the systems’ effectiveness and how much is due to luck. The luck here is random sampling error, caused by the use of a finite amount of data to estimate the Sharpe ratios. Real differences in effectiveness should be consistent between the in-sample and out-of-sample periods, but luck will not persist into the out-of-sample period.

The standard formula [12] for the sampling error σs of Sharpe ratios gives values of σs between 0.22 and 0.25 for all the systems in the family (Table 3). These are the error bars in Figure 2. In every case, the estimated sampling error σs is close to the standard deviation σ=0.23 of the set of Sharpe ratios (the measured range of Sharpe ratios across the systems).
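For reference, the sketch below applies the form of the formula in [12] for independent, identically distributed returns, in which the standard error of a per-period Sharpe ratio SR estimated from T observations is sqrt((1 + SR²/2) / T). The annualisation by sqrt(12), the function name and the 240-month example (roughly 1975-1995) are our assumptions for illustration. The values quoted above may rest on different sample lengths for each system.

```python
import math


def sharpe_standard_error(annual_sharpe, n_months, periods_per_year=12):
    """Sampling error of an annualised Sharpe ratio under IID returns."""
    # Convert the annual Sharpe ratio to a per-month figure.
    sr_period = annual_sharpe / math.sqrt(periods_per_year)
    # IID result: Var(SR_hat) is approximately (1 + SR^2 / 2) / T.
    se_period = math.sqrt((1.0 + 0.5 * sr_period ** 2) / n_months)
    # Scale the standard error back to annual units.
    return se_period * math.sqrt(periods_per_year)


# Hypothetical case: family-mean Sharpe ratio, about 20 years of monthly data.
print(round(sharpe_standard_error(0.34, 240), 3))
```

For Sharpe ratios of this size the result is dominated by the sample length: about twenty years of monthly data cannot pin down a Sharpe ratio to much better than ±0.2.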

Differences between the Sharpe ratios can therefore be attributed to random sampling error alone. They do not indicate true differences in performance.

This has an important consequence. Given only in-sample data, our best estimate of the future (out-of-sample) Sharpe ratio of one of the systems is not the in-sample Sharpe ratio of that system: it is the mean in-sample Sharpe ratio of the whole family (0.34 in this case). The Tweedie formula [15], a well-known method for correcting selection bias, confirms this conclusion.
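The logic of this correction can be sketched with a simple shrinkage estimator. This is not the full calculation of [15]: it assumes the measured Sharpe ratios across the family are Gaussian, in which case the correction reduces to pulling each measurement towards the family mean by the fraction of the observed variance that is due to sampling error. The function name and the in-sample value of 0.8 in the example are hypothetical.

```python
def shrunk_sharpe(in_sample_sharpe, family_mean, family_sd, sampling_error):
    """Shrink a measured Sharpe ratio towards the family mean.

    Assumes a Gaussian spread of measured Sharpe ratios. The weight on
    the individual measurement is the share of the observed variance
    that is not explained by sampling error, floored at zero.
    """
    if family_sd <= 0:
        raise ValueError("family_sd must be positive")
    signal_var = max(family_sd ** 2 - sampling_error ** 2, 0.0)
    weight = signal_var / family_sd ** 2
    return family_mean + weight * (in_sample_sharpe - family_mean)


# Sampling error equal to the observed spread: no weight on the measurement.
print(shrunk_sharpe(0.8, family_mean=0.34, family_sd=0.23, sampling_error=0.23))
```

When the sampling error is as large as the spread across systems, the weight on the individual measurement is zero and the estimate is the family mean, whatever the system's own in-sample result.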

We can now look at the out-of-sample data to see whether they confirm this conclusion. Figure 3 shows the in-sample and out-of-sample Sharpe ratios for the ten systems. It’s clear that the in-sample and out-of-sample results are not correlated, and a statistical test confirms this.

In other words, exceptionally ‘good’ systems in-sample are not likely to remain good in out-of-sample tests. This is exactly what the analysis of the sampling error led us to expect. We have managed to foresee the drop in performance, due to selection bias, without requiring the extra 20 years of data.

Figure 3. In‐sample (pre‐1995) and out‐of‐sample (post‐1995) Sharpe ratios for the family of 10 trading systems

The one‐standard deviation ranges and the mean values are shown in the side bars.

A large change in the overall mean would suggest some change in the world’s financial systems between the two periods. In fact, the small decrease in the mean Sharpe ratio (0.34 to 0.25) is of the same order of magnitude as the change expected from random variations, so there is no evidence for such a change.


We have seen how published systems on national stock indices from the 1990s have underperformed in the following 20 years, and we have shown strong evidence that the decrease in performance was caused by cherry-picking a set of indicators based on in-sample performance.

This selection is not always done by an individual researcher or group. For example, it is possible that different researchers tried the different systems, and only the ones who obtained positive results published their work. Or one group of researchers may evaluate a set of possible indicators (as in [7]), leading a different group of researchers (such as [8]) to make a particular choice of trading systems.

Investment managers will always select trading systems that have performed well in the past, and we do not argue that this is a bad policy. But as our research shows, the selection introduces a bias, which should be corrected so that the performance of individual trading systems is not over-stated.

Methods to do this are an important part of the arsenal of scientific quantitative investment. Recent concerns expressed in the academic literature and in the financial community [16, 17, 18] make it clear that not everyone working in finance has fully adopted these methods yet.


[1] R. Levy, Relative Strength as a Criterion for Investment Selection, Journal of Finance, pp. 595‐610, 1967.

[2] R. Balvers, Mean Reversion across National Stock Markets and Parametric Contrarian Investment Strategies, Journal of Finance, pp. 745‐772, 2000.

[3] A. Richards, Winner‐Loser Reversals in National Stock Market Indices: Can They be Explained?, IMF Working Paper, 1997.

[4] J. Shelton, The Value Line Contest: a Test of Predictability of Stock‐Price Changes, Journal of Business, pp. 251‐269, 1967.

[5] F. Black, Yes Virginia there is Hope: Tests of the Value Line Ranking System, Financial Analysts Journal, 29, 1973.

[6] E. Fama and K. French, Size, Value and Momentum in International Stock Returns, Journal of Financial Economics, pp. 457‐472, 2012.

[7] L. Heckman, Valuation Ratios and Cross‐Country Equity Allocation, Journal of Investing, pp. 54‐63, 1996.

[8] C. Asness, J. Liew and R. Stevens, Parallels between the Cross‐Sectional Predictability of Stock and Country Returns, Journal of Portfolio Management, pp. 79‐87, 1997.

[9] N. Jegadeesh and S. Titman, Returns to Buying Winners and Selling Losers: Implications for Stock Market Efficiency, Journal of Finance, pp. 65‐91, 1993.

[10] MSCI country and regional indices, www.msci.com/market-cap-weighted-indexes

[11] F. DeBondt and R. Thaler, Does the Stock Market Overreact?, Proceedings of the 43rd Annual Meeting of the American Finance Association, pp. 28‐30, 1985.

[12] A. Lo, The Statistics of Sharpe Ratios, Financial Analysts Journal, pp. 36‐52, 2002.

[13] Winton Research, Blinded by Optimism, December 2013.

[14] J. Kruger, Superstition and the Regression Effect, Skeptical Inquirer, March/April 1999.

[15] B. Efron, Tweedie’s Formula and Selection Bias, Journal of the American Statistical Association, pp. 1602‐1614, 2011.

[16] D. Bailey, Pseudo‐Mathematics and Financial Charlatanism, Notices of the American Mathematical Society, pp. 458‐471, 2014.

[17] Wall Street Journal, Huge returns at low risk? Not so fast, 27 June 2014, blogs.wsj.com/moneybeat/2014/06/27/huge-returns-at-low-risk-not-so-fast.

[18] Vanguard, Joined at the Hip: ETF and Index Development, July 2012, www.vanguard.com/pdf/s319.pdf.
