It’s been a week. For some reason I got quoted in Bloomberg again, talking about meme stocks.
As I discussed in my previous post, the art of prediction is a dangerous game. It’s quite easy to fall prey to biases, especially given humans’ natural optimism and our desire to find explanations for pure randomness.
Let’s take a step back. What does prediction even mean?
Correlation and Confusion
Most echo the axiom “correlation does not imply causation” without really understanding what it means.
Correlation is simply a statistical relationship between two variables. These can be logically related variables, and often are. We can construct a simple example of correlation by looking at the relationship below between electricity bills and ambient temperature:
In this contrived example, we’re looking at the relationship between ambient temperature (measured in Celsius) and the monthly electrical bill (measured here in rupees). This shows a clear positive correlation — as temperature increases, so does the electrical bill. This intuitively makes sense — as the temperature rises, we expect people to turn on their air conditioning, which will increase their electrical bill.
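To make this concrete, here is a quick sketch of measuring that correlation with NumPy. The temperature and bill figures are invented for illustration:

```python
import numpy as np

# Hypothetical monthly data: ambient temperature (°C) vs. electricity bill (₹).
temps = np.array([18, 22, 25, 28, 31, 34, 37, 40])
bills = np.array([900, 1100, 1400, 1700, 2100, 2600, 3100, 3700])

# Pearson correlation coefficient: +1 would be a perfect positive linear fit.
r = np.corrcoef(temps, bills)[0, 1]
print(f"correlation: {r:.2f}")
```

A coefficient close to +1 is exactly the "clear positive correlation" the chart shows.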
However, correlations can also be negative — an inverse relationship between two variables.
In the above, we can see a clear negative relationship between the number of missed classes and exam score. This also makes sense intuitively, given we expect the relationship between the two to be causal — missing class means missing learning opportunities, which likely means a poorer understanding of the material.
In both of the above examples, I made it easy for you by conflating correlation examples with causal examples. In the first — electricity bill versus temperature — the relationship is causal, mediated by a third variable — usage of the air conditioning unit. In this relationship, we can model causality between the two variables as such:
The above is a representation of a belief network, an abstract representation of dependencies between variables. In English, this can be summed up as: “Increased temperature causes (often with some level of probability, not surety) increased A/C usage, which causes (again probabilistically) an increased electrical bill”.
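As a rough sketch of that chain, we can simulate it directly: heat makes A/C usage likely (but not certain), and A/C usage drives up the bill. All probabilities and rupee figures here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical belief-network chain: heat -> A/C usage -> bill.
# Each link is probabilistic, not a sure thing.
def simulate_month(hot: bool) -> float:
    ac_on = rng.random() < (0.9 if hot else 0.2)  # heat makes A/C likely, not certain
    return 800 + (1500 if ac_on else 0) + rng.normal(0, 100)  # base + A/C + noise

hot_bills = [simulate_month(hot=True) for _ in range(1000)]
cool_bills = [simulate_month(hot=False) for _ in range(1000)]
print(f"avg bill, hot months: {np.mean(hot_bills):.0f}; cool months: {np.mean(cool_bills):.0f}")
```

Even though no single hot month is guaranteed a big bill, the causal chain shows up clearly in the averages.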
This statement is a lot more powerful than simple correlation. Why? Take a look at the following chart:
In the above, we can observe a simple positive relationship between the total revenue generated by arcades and the number of computer science doctorates awarded in the U.S. To make matters worse, the correlation coefficient (in English, the strength of the correlation) observed is extremely high. However, we know a causal relationship here, at least a direct one, is nonsense. Unlike our prior examples, there is no common-sense argument linking the number of computer science doctorates to arcade revenue.
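This kind of spurious correlation is easy to manufacture. The sketch below generates pairs of completely independent random walks and reports the strongest correlation it stumbles across; there is no causal link anywhere, yet the coefficient can look impressive:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 pairs of completely independent random walks, 50 steps each.
# No pair has any causal link, yet trending noise correlates surprisingly often.
best = 0.0
for _ in range(200):
    a = rng.standard_normal(50).cumsum()
    b = rng.standard_normal(50).cumsum()
    best = max(best, abs(np.corrcoef(a, b)[0, 1]))

print(f"strongest spurious correlation found: {best:.2f}")
```

Search enough unrelated series and a "strong" correlation is nearly guaranteed, which is exactly the arcade-and-doctorates effect.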
Causation is a tricky beast, and more or less a holy grail in statistical analysis. This is especially salient in quantitative analysis because of the curse of an absolutely massive amount of data. It is extraordinarily easy when doing exploratory analysis to fall back on what is called a parameter grid search. While I could give a succinct and mathematical explanation of what that is, I prefer to torture my readers, and hence I will construct a story instead.
Imagine you’re a young would-be quantitative researcher, and you’re looking to find America’s Next Top Quantitative Model to break the markets. Aha, you say to yourself, this problem is purely one of data — I don’t really need to use my thinking brain here, I can just feed in a ton of features and permutations of those features to my magic machine learning model, and science will decide what’s important. So you feed your model the entire tick data of the NYSE going back to 1400 AD, as well as the transcriptions of every episode of Gilmore Girls and a dramatic rendition of the Color Purple starring Oprah Winfrey, and you wait for machine learning to do its magic.
A couple hours later, you come back and the model found something and it works beautifully! You look at the test results, and historically, it hit 20x SPY returns. Fantastic, you think, you’re going to be the next Jim Simons!
You take a look at the model weights, and you see that it has maybe forty or fifty different features. You used a random forest approach, so perhaps these features are segmented using thresholds the human mind can’t even comprehend by the end of it. Who cares? Your results curve looks like this:
Then a funny thing happens. You remember from your old machine learning courses that you’re supposed to split your data into a training set and a test set. This of course is really just for nerds, but humor me for a second and try it.
You test it on the test set. For financial data, you often need to split your train and test sets sequentially, since it doesn’t make sense to randomize the data (or in English: you can’t really test a model on random days scattered over 10 years — you test it on a continuous stretch).
And your result curve looks like this:
Pretty bad, huh? What happened? Well, you conducted a search to optimize all your features on your training data and your model did exactly that. It found the optimal correlations to produce the best results…but correlation is not by itself predictive. Much like the arcade and the computer science doctorate, your model created spurious correlations which held true during the period it learned from (after all, machine learning is a way to optimize, not necessarily to solve). But because these complex, arcane correlations had no causal basis, two things can happen:
- These correlations can continue to work for some time — This happens fairly frequently. If it does in your test dataset, congrats! You may have a model to test in production. But given your results curve, probably not this.
- These correlations may fade out at random — Unlike the causal relationship discussed in the electrical bill example, many correlations in noisy data are completely spurious. You may see completely unrelated assets track each other solely due to probability. These cannot be relied upon, and they will likely fail when you actually trade on them.
The second point is worth reiterating. The larger your input space grows (in this case, the number of features and the number of potential thresholds per feature defines your input space), the more likely you are to observe a correlation purely due to random chance. In statistics, one of the most used significance levels in hypothesis testing is the 5% significance level. This is based on the normal distribution via the central limit theorem, which tl;dr states that with a sufficiently large number of observations/tests, the results of your tests should approximate a normal distribution. The 5% significance level implies that approximately 5% of the time, you would expect to achieve your results even when no relationship exists (the null hypothesis).
Understanding this fully is critical. This implies that if you were to run approximately 20 completely independent tests for significance (let’s say on various combinations of variables), you would expect one to potentially show significance at the 5% level purely due to random chance.
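We can sanity-check that arithmetic with a quick simulation: batches of 20 "tests" on pure noise, where each test correlates two unrelated series of 30 points and calls the result significant past the approximate two-sided 5% threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Repeatedly run batches of 20 independent "tests" on pure noise: correlate two
# unrelated series of n=30 points and call |r| > 0.361 "significant"
# (the approximate two-sided 5% critical value for n=30).
n_batches, tests_per_batch, n = 2000, 20, 30
batches_with_hit = 0
for _ in range(n_batches):
    rs = [abs(np.corrcoef(rng.standard_normal(n), rng.standard_normal(n))[0, 1])
          for _ in range(tests_per_batch)]
    if max(rs) > 0.361:
        batches_with_hit += 1

frac = batches_with_hit / n_batches
print(f"batches with at least one 'significant' result: {frac:.0%}")
```

Roughly two-thirds of batches produce at least one spurious "discovery", which matches the 1 − 0.95²⁰ ≈ 64% you’d expect from the arithmetic above.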
This is the danger of grid searching — while it seems easy and alluring, by checking over large potential combinations of data and thresholds to find a relationship (rather than properly feature engineering in the first place), you will find nonsense relationships in the noise. In this case, you are still observing correlations in your data, but they are not predictive.
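The grid-search trap is easy to reproduce end to end. A minimal sketch, assuming nothing about any real market data: fit ordinary least squares on hundreds of meaningless features, split sequentially as described above, and watch the in-sample fit evaporate out of sample:

```python
import numpy as np

rng = np.random.default_rng(42)

# Pure noise: 400 "days", 300 meaningless features, and a target with no
# relationship to any of them.
n_train, n_test, n_features = 350, 50, 300
X = rng.standard_normal((n_train + n_test, n_features))
y = rng.standard_normal(n_train + n_test)

# Sequential split, as financial data demands: no shuffling across time.
X_tr, X_te, y_tr, y_te = X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# "Optimize all your features on your training data" via ordinary least squares.
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

def r2(X, y):
    return 1 - np.var(y - X @ beta) / np.var(y)

print(f"train R^2: {r2(X_tr, y_tr):.2f}")  # looks like a discovery
print(f"test  R^2: {r2(X_te, y_te):.2f}")  # falls apart out of sample
```

With enough free parameters, the fit finds "structure" in noise every single time; only the out-of-sample test exposes it.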
Let’s, however, look at another example to understand the nature of prediction. Correlation, by itself, is not bidirectionally predictive. For example, we know there is an obvious correlation between earthquakes and human deaths. When significant earthquakes occur, deaths occur. This can be modeled in a belief network like above:
This implies that if we know an earthquake has occurred, we can rightly predict that more human deaths will occur. However, the reverse is not true. Knowing that more human deaths occurred than usual, we cannot assume an earthquake must have occurred.
A particularly memorable trading example here can be observed with RSI, the Relative Strength Index. RSI is a price-momentum indicator used for simple mean-reversion trading. It, at least theoretically, indicates when an equity is oversold (usually when the RSI falls below 30) or overbought (when the RSI rises above 70).
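For reference, here is a minimal RSI implementation using Wilder’s smoothing, the standard construction; the example series below are made up:

```python
import numpy as np

def rsi(prices, period=14):
    """Relative Strength Index via Wilder's smoothing.

    RSI = 100 - 100 / (1 + RS), where RS = (avg gain) / (avg loss)
    over the lookback period.
    """
    deltas = np.diff(prices)
    gains = np.clip(deltas, 0, None)
    losses = np.clip(-deltas, 0, None)

    # Seed with simple averages, then apply Wilder's recursive smoothing.
    avg_gain = gains[:period].mean()
    avg_loss = losses[:period].mean()
    for g, l in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + g) / period
        avg_loss = (avg_loss * (period - 1) + l) / period

    if avg_loss == 0:
        return 100.0  # no losses at all: maximally "overbought"
    return 100 - 100 / (1 + avg_gain / avg_loss)

# A steadily rising series pins RSI at 100; a steadily falling one pins it at 0.
print(rsi(np.linspace(100, 120, 30)))
print(rsi(np.linspace(120, 100, 30)))
```

The 30/70 thresholds mentioned above are just conventions applied to this 0–100 output.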
What’s interesting about RSI in particular is its relationship over time (in the most common iteration of it, RSI(14)) to crash and correction periods. Let’s observe the graph below:
We can note quite astonishingly that yes, according to our overbought-oversold hypothesis, corrections in 2020 seemed to occur exactly at the overbought times! Beautiful, we have a correction indicator.
The issue here is two-fold.
As we highlighted in the previous section on earthquakes and human deaths, oftentimes there is no bidirectional prediction — we cannot assume more human deaths imply an earthquake occurred, although we can assume the reverse. Using the 1-day, 14-day RSI, we can see that overbought conditions occur right before corrections, confirming a correlation. However, the reverse also seems to be true — we see overbought conditions occurring in cases where a correction (even by the loosest definition of the word) did not occur. Worse yet, even in cases where it did seem to presage a correction, the RSI warned us well in advance, and we may have missed substantial returns in the meantime.
Unfortunately, it gets worse. The RSI, like many other technical indicators, is parametrized, meaning you can compute it over any timeframe with any lookback length. The most popular one for swing trading is the 14-day RSI (RSI(14)), but we see a very different picture if we zoom in:
We can observe, looking at the last month, that RSI readings above 70 were not particularly predictive at all. While they seemed to occur on days where the market went up, there was no extra information given — rather, by looking at price alone you would observe the same correlations.
This issue plagues a lot of technical indicators, unfortunately. While retrospectively many can outperform, this tends to be a result of two effects:
- Data snooping — This is essentially when you create a backtest or analysis already knowing the shape of your data. If, for instance, you already knew 2020 returns, you could specifically select indicators that would’ve worked well in 2020.
- Hindsight bias — This is the most dangerous one, and why, when I do give predictions, I always give them before the event. In general, humans are optimistic creatures, and we believe that randomness is a lot more predictable than it is. We tend to find patterns in the past to explain things, and in hindsight view things as “obvious”.
Here’s a great example, stolen from my favorite online octopus, @macrocephalopod on Twitter. Check out the following graph, which is a price history of $GLD:
Looks obviously like a head and shoulders, right? It seems pretty obvious in retrospect that you could’ve sold in Jul 2018 and made a pretty penny before it fell down again.
Except oops, it’s not $GLD. It’s actually a completely random algorithm which drew this. By definition there was no predictiveness at any point here. Correlation does not imply causation.
In essence, this argument is a spin on an age-old tradeoff — specificity versus sensitivity.
To borrow these terms from my bioinformatics background, sensitivity is the true positive rate — for example, what is the probability of a correction beginning within about 10 days from now, given an RSI(14) signal? Specificity is, from what I’ve observed, much harder to achieve — that is, assuming RSI(14) is not above 70, what is the chance there won’t be a correction in the next 10 days (the true negative rate)?
This can be more succinctly summarized in the concept of a confusion matrix:
When you choose some tool or ensemble set to predict outcomes, you care about all of the outcomes represented in the above box. This is salient in RSI. RSI may for example correctly warn you relatively in advance before a correction period (true positive), but will warn you a lot of other times too (false positives). Similarly, you may observe corrections even in times RSI is not elevated (false negative); however, most of the time you won’t (true negative).
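Here is a small sketch computing sensitivity and specificity from the four confusion-matrix cells; the signal/outcome arrays are invented for illustration:

```python
import numpy as np

def confusion_stats(predicted, actual):
    """Sensitivity (true positive rate) and specificity (true negative rate)
    from boolean signal/outcome arrays."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    tp = np.sum(predicted & actual)     # signal fired, event happened
    fp = np.sum(predicted & ~actual)    # signal fired, nothing happened
    fn = np.sum(~predicted & actual)    # no signal, event happened anyway
    tn = np.sum(~predicted & ~actual)   # no signal, nothing happened
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical: did RSI(14) flash "overbought", and did a correction follow?
signal     = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0], dtype=bool)
correction = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0], dtype=bool)

sens, spec = confusion_stats(signal, correction)
print(f"sensitivity: {sens:.2f}, specificity: {spec:.2f}")
```

In this made-up example the signal catches most corrections (decent sensitivity) while crying wolf fairly often (mediocre specificity), which is the RSI pattern described above.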
What’s interesting about developing predictive tools is that perhaps we should not weight all of these outcomes as equally important. From experience, it seems a lot easier to deal with signals which have high true positive rates even if the true negative rate isn’t great. With a very high sensitivity to certain events (correction periods, for example), you can buttress your model with more features to potentially achieve better prediction (not always — your signals may not be independent of each other, for example).
The converse isn’t true; if your signal has a high true negative rate but a high false negative rate, it may be worse than useless.
Probability is beautiful and extremely unintuitive. In general, we tend to overestimate our ability to predict things, partly due to believing in an internal locus of control. Oftentimes market events are multiple steps divorced from our own information, and we try to attach an explanation to them even in the face of imperfect or non-existent information. For example, when something like this happens:
On Wednesday, our favorite stock fell almost 50% in the span of a few minutes. In retrospect, this was likely due to a large whale order hitting the tape and slicing through the retail-driven order book like a hot knife through butter. Could a model predict this? Probably not. It was likely the action of one or a few humans, acting independently of any predictable event. Could one say that there was a higher likelihood of this occurring due to X, Y, and Z market factors? Most assuredly. However, this wouldn’t necessarily provide us much leeway to predict when or how it would happen.
Well, this post was a doozy. Going to close it off for now.