How hard is it to pick the right model? MCS and backtest overfitting
Article type: Research Article
Authors: Aparicio, Diego [a],[*] | López de Prado, Marcos [b],[1]
Affiliations: [a] Department of Economics, Massachusetts Institute of Technology, Cambridge, MA, USA | [b] True Positive Technologies, New York, NY, USA
Correspondence: [*] Correspondence to: Diego Aparicio, Massachusetts Institute of Technology, Department of Economics. Address: 77 Massachusetts Ave, Building E52-301, Cambridge, MA 02142, USA. E-mail: dapa@mit.edu.
Note: [1] True Positive Technologies, New York, NY, USA; Lawrence Berkeley National Laboratory, Berkeley, CA, USA. Email: lopezdeprado@lbl.gov
Note: [2] In fact, machine learning and artificial intelligence algorithms can be trained to scan billions of data signals in order to design millions, if not billions, of different virtual trading strategies. AI equity research robots are already tracking and providing views on asset prices.
Note: [3] Bailey & López de Prado (2014) and Harvey & Liu (2014) discuss ways to adjust Sharpe ratios and p-values based on the number of trials. See also Barras et al. (2010) for a discussion of false discoveries in mutual fund performance.
Note: [4] The data generating process (DGP) in the following simulations is a simplified yet standard assumption in the literature (e.g., Harvey & Liu (2014)).
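As an illustration, the sketch below simulates one such DGP: M strategies with i.i.d. Gaussian daily returns, all but one of which has zero true mean. The function name, parameter values, and single-skilled-strategy setup are our own illustrative assumptions, not the paper's exact calibration.

```python
import numpy as np

def simulate_returns(M=100, T=1000, sharpe_true=1.0, sigma=0.01, seed=0):
    """Simulate T daily returns for M strategies: M-1 have zero true mean,
    and one 'skilled' strategy has the daily mean implied by an annualized
    Sharpe ratio of sharpe_true (assuming 252 trading days per year)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(M)
    mu[0] = sharpe_true * sigma / np.sqrt(252)  # daily mean from annual Sharpe
    return rng.normal(loc=mu, scale=sigma, size=(T, M))
```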
Note: [5] The Python code to reproduce the results is available on the authors’ webpage.
Note: [6] Results remain qualitatively similar under a loss function with squared errors.
Note: [7] We first show results for the MCS specification using the $T_{\mathrm{Range},M}$ test statistic, a moving-block bootstrap of block length ℓ = 5, and B = 500 bootstrap samples. Results are similar under alternative specifications of the $T_{\mathrm{Range},M}$ statistic. However, we find somewhat inconsistent results using the $T_{\max,M}$ test statistic. See Section 3.3.
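For readers unfamiliar with this setup, the following is a minimal sketch of one MCS equal-predictive-ability test using the $T_{\mathrm{Range},M}$ statistic with a moving-block bootstrap. It follows the construction in Hansen et al. (2011), but the function names and code are our own, not the authors' released implementation.

```python
import numpy as np

def moving_block_bootstrap(T, block_len=5, B=500, rng=None):
    """Generate B bootstrap index arrays of length T by concatenating
    overlapping blocks of length block_len (moving-block bootstrap)."""
    rng = rng or np.random.default_rng(0)
    n_blocks = int(np.ceil(T / block_len))
    starts = rng.integers(0, T - block_len + 1, size=(B, n_blocks))
    idx = (starts[:, :, None] + np.arange(block_len)).reshape(B, -1)[:, :T]
    return idx

def mcs_range_pvalue(losses, block_len=5, B=500, seed=0):
    """Bootstrap p-value of the equal-predictive-ability null using the
    T_Range,M statistic. losses: (T, M) array of per-period losses for the
    M surviving models."""
    rng = np.random.default_rng(seed)
    T, M = losses.shape
    dbar = losses.mean(axis=0)                     # average loss per model
    d_ij = dbar[:, None] - dbar[None, :]           # pairwise mean differentials
    idx = moving_block_bootstrap(T, block_len, B, rng)
    boot_means = losses[idx].mean(axis=1)          # (B, M) bootstrap means
    boot_dev = boot_means - dbar                   # centered at sample means
    # Bootstrap estimate of Var(dbar_ij) from centered pairwise deviations.
    var_ij = ((boot_dev[:, :, None] - boot_dev[:, None, :]) ** 2).mean(axis=0)
    np.fill_diagonal(var_ij, 1.0)                  # avoid divide-by-zero
    t_range = np.abs(d_ij / np.sqrt(var_ij)).max() # observed T_Range,M
    boot_t = np.abs((boot_dev[:, :, None] - boot_dev[:, None, :])
                    / np.sqrt(var_ij)).max(axis=(1, 2))
    return (boot_t >= t_range).mean()              # bootstrap p-value
```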
Note: [8] The DGP assumes M independent strategies, although we note that in practice some will tend to be correlated. Correlated returns would reduce the variance of $d_{ij,t} \equiv L_{i,t} - L_{j,t}$, and therefore reduce the sample size required in MCS to identify the superior model.
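A quick numerical check of this point, under an illustrative one-factor correlation structure of our own choosing: for two loss series with common correlation ρ and equal variance σ², $\mathrm{Var}(d_{ij,t}) = 2\sigma^2(1-\rho)$, so higher correlation shrinks the differential's variance.

```python
import numpy as np

rng = np.random.default_rng(1)
T, rho, sigma = 10_000, 0.8, 0.01
# Two loss series with correlation rho, built from a shared common factor.
common = rng.normal(size=T)
L_i = sigma * (np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.normal(size=T))
L_j = sigma * (np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.normal(size=T))
d = L_i - L_j
# Sample variance vs. the theoretical value 2 * sigma^2 * (1 - rho).
print(d.var(), 2 * sigma**2 * (1 - rho))
```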
Note: [9] Such Sharpe ratios are rarely seen in practice. As a reference, the S&P 500 Sharpe ratio is estimated at 0.38 during 1996–2014; even the best-performing hedge funds typically have average Sharpe ratios below 2 (Titman & Tiu (2010), Getmansky et al. (2015)).
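For reference, annualized Sharpe ratios such as these are conventionally computed from daily returns as in the sketch below, under the usual i.i.d. approximation; the function name and zero risk-free rate are our own simplifying assumptions.

```python
import numpy as np

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series, assuming i.i.d. returns
    and, for simplicity, a zero risk-free rate."""
    r = np.asarray(returns, dtype=float)
    return r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year)
```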
Note: [10] See Bailey & López de Prado (2014), Harvey & Liu (2015), Harvey et al. (2016), and Bailey et al. (2017) for recent methodologies to address backtest overfitting. See Ioannidis (2005) for a general discussion.
Note: [11] Holm’s method stops once the first null hypothesis cannot be rejected. Holm’s method is less strict than Bonferroni’s, which inflates all p-values equally. In fact, $p_m^{\mathrm{Holm}} \le p_m^{\mathrm{Bonf}},\ \forall m \in \mathcal{M}$.
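To make the comparison concrete, here is a small sketch of both adjustments (our own helper functions, not from the paper): Bonferroni multiplies every p-value by the number of tests M, while Holm multiplies the m-th smallest by (M − m + 1) and enforces monotonicity, so a Holm-adjusted p-value can never exceed its Bonferroni counterpart.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni adjustment: inflate every p-value by the number of tests M."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def holm(pvals):
    """Holm step-down adjustment: the m-th smallest p-value is inflated by
    (M - m + 1); a running maximum enforces monotonicity, mirroring the rule
    that testing stops at the first non-rejected null."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    M = p.size
    scaled = p[order] * (M - np.arange(M))          # multipliers M, M-1, ..., 1
    adj_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adj = np.empty(M)
    adj[order] = adj_sorted
    return adj

pvals = [0.001, 0.04, 0.01, 0.20]
print(holm(pvals) <= bonferroni(pvals))             # True for every test
```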
Note: [12] Consistent with these results, it has come to our attention that the $T_{\max,M}$ statistic, and therefore the elimination rule $e_{\max,M}$, is not recommended in practice. See the corrigendum to Hansen et al. (2011).
Abstract: Recent advances in machine learning, artificial intelligence, and the availability of billions of high-frequency data signals have made model selection a challenging and pressing need. However, most of the model selection methods available in modern finance are subject to backtest overfitting: the probability of selecting a financial strategy that outperforms in the backtest but underperforms in practice. We evaluate the performance of the novel model confidence set (MCS) introduced in Hansen et al. (2011a) in a simple machine learning trading strategy problem. We find that MCS is not robust to multiple testing and that it requires a very high signal-to-noise ratio to be useful. More generally, we raise awareness of the limitations of model selection in finance.
Keywords: Forecasting, model confidence set, machine learning, model selection, multiple testing
JEL Codes: G17, C52, C53
DOI: 10.3233/AF-180231
Journal: Algorithmic Finance, vol. 7, no. 1-2, pp. 53-61, 2018