Between 2010 and 2015, I developed and operated a website called backrecord.com in collaboration with my friend Jamie. This wasn't a full-time endeavor, but we put quite a lot of effort into it. The goal of the website was to collect and analyze opinions of people in the public spotlight. In particular, we developed a fairly sophisticated methodology for assessing market predictions. We were hopeful this analysis would prove valuable, but we ultimately found no evidence of any exploitable market inefficiency (which I guess was always the most likely outcome).
We've taken the website offline since it requires resources and effort to keep running, but here are some screenshots and an overview of the assessment methodology for posterity. Hopefully someone finds this useful (if only as a recipe for what not to do!).
Because prediction timeframes are generally not known exactly, the assessment of a prediction is actually an aggregation of assessments over a range of times, with weightings described by a trapezoidal time kernel function. Time kernel parameters are based on available timing information, typically an interpretation of the language used in a quote. This approach minimises sensitivity to an arbitrary cut-off date.
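As a rough illustration of the weighting idea, here is a minimal Python sketch of a trapezoidal kernel. The parameter names (start, full_start, full_end, end), the use of day offsets, and the normalisation to a sum of 1 are assumptions made for the example, not a description of the code the site actually ran.

```python
def trapezoid_weight(t, start, full_start, full_end, end):
    """Unnormalised trapezoidal weight for a day offset t.

    Hypothetical parameters: the weight ramps up from `start` to
    `full_start`, stays at 1 until `full_end`, then ramps down to `end`.
    """
    if t < start or t > end:
        return 0.0
    if t < full_start:
        return (t - start) / (full_start - start)
    if t <= full_end:
        return 1.0
    return (end - t) / (end - full_end)


def kernel_weights(days, start, full_start, full_end, end):
    """Weights over the given days, normalised to sum to 1."""
    raw = [trapezoid_weight(d, start, full_start, full_end, end) for d in days]
    total = sum(raw)
    if total == 0:
        raise ValueError("kernel has no support on the given days")
    return [w / total for w in raw]


def aggregate(daily_assessments, weights):
    """Kernel-weighted aggregate of per-day assessment values."""
    return sum(a * w for a, w in zip(daily_assessments, weights))


if __name__ == "__main__":
    weights = kernel_weights(range(7), start=0, full_start=2, full_end=4, end=6)
    print(weights)
```

Because the weights fade in and out rather than switching on and off, moving any one of the four dates by a day changes the aggregate only slightly.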
For a given day, the assessment of a prediction makes use of a probability distribution of the associated topic prices (and comparison topic prices where there is a comparison topic) taken from a statistical model of market movements that is constructed using only knowledge available at the prediction's stated date.
During an assessment, we calculate a number of values, the most important being the Risk Adjusted Annualized Excess Return (RAAER) and Standardized Score.
RAAER is the annualized return of a market position corresponding to the prediction, relative to the average return expected by the market model (which is currently an estimate of the risk-free rate for all topics) and scaled according to the topic variance. Maintaining a given level of excess return over a longer time is more impressive (less likely to happen by chance alone) than the same annualized return over a shorter time, so for short-term or partially assessed predictions, some very large RAAERs will occur by chance.
The standardized score is effectively the RAAER adjusted to account for this time effect, and is designed so that a random collection of prediction scores will have a standard deviation of 1 and a mean of 0.
RAAER is the best estimate of the underlying under- or out-performance due to skill of the source, while the standardized score is a measure of the statistical significance of the result. Note that the distribution of standardized scores is not normal - the heavy-tailed distribution of market returns (on all time scales) frequently results in some relatively large standardized scores.
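To make the time effect concrete, the following sketch works under some simplifying assumptions that are mine rather than the site's: a long position, log returns, a constant annual volatility used as the risk scaling, and a random-walk model in which the spread of an annualized return shrinks like one over the square root of the holding time. Under those assumptions, multiplying the RAAER by the square root of the time in years gives a score with unit standard deviation. The function names are hypothetical.

```python
import math


def raaer(log_return, years, risk_free_rate, annual_volatility):
    """Annualized excess return scaled by annual volatility (assumed form)."""
    excess_per_year = log_return / years - risk_free_rate
    return excess_per_year / annual_volatility


def standardized_score(log_return, years, risk_free_rate, annual_volatility):
    """RAAER rescaled so that chance results have unit spread.

    Assumes a random walk, where the standard deviation of the RAAER
    over `years` is 1 / sqrt(years).
    """
    return raaer(log_return, years, risk_free_rate, annual_volatility) * math.sqrt(years)


if __name__ == "__main__":
    # Same 40% annualized return, held for three months and for four years.
    for log_return, years in [(0.10, 0.25), (1.60, 4.0)]:
        print(years,
              raaer(log_return, years, 0.02, 0.20),
              standardized_score(log_return, years, 0.02, 0.20))
```

Both positions have the same RAAER, but the four-year one gets a standardized score four times larger, which matches the intuition that sustaining an excess return for longer is harder to do by luck.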
Some topics have a benchmark index defined (for example, the benchmark currently associated with all US stocks is the S&P 500 Total Return). For predictions where this is the case, we assess the prediction both with and without the benchmark index set as the comparison topic.
To assess comparison predictions we replace the primary topic by a composite formed by subtracting a multiple of the comparison topic returns from the primary topic returns. For benchmarking, this multiple is calculated using historical variance and correlation and chosen so that the composite is theoretically uncorrelated to the comparison topic. Thus benchmarked assessments represent the return with exposure to the broader market removed as far as possible.
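A minimal sketch of the composite construction follows, assuming the multiple is the usual regression coefficient (correlation times the ratio of standard deviations, which equals covariance divided by benchmark variance). The function names are hypothetical, and for brevity the multiple is estimated on the same series it is applied to, whereas the description above uses historical data.

```python
def hedge_multiple(primary, benchmark):
    """Multiple of benchmark returns to subtract (assumed: cov / var)."""
    n = len(primary)
    mean_p = sum(primary) / n
    mean_b = sum(benchmark) / n
    cov = sum((p - mean_p) * (b - mean_b) for p, b in zip(primary, benchmark))
    var_b = sum((b - mean_b) ** 2 for b in benchmark)
    return cov / var_b


def composite_returns(primary, benchmark, multiple):
    """Primary returns with `multiple` times the benchmark returns removed."""
    return [p - multiple * b for p, b in zip(primary, benchmark)]


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


if __name__ == "__main__":
    benchmark = [0.01, -0.02, 0.03, 0.00, 0.015]
    primary = [0.03, -0.05, 0.04, 0.01, 0.02]
    multiple = hedge_multiple(primary, benchmark)
    composite = composite_returns(primary, benchmark, multiple)
    print(multiple, sample_covariance(composite, benchmark))
```

With this choice of multiple, the composite has zero sample covariance with the benchmark over the estimation window. Out of sample the correlation is only approximately removed, which is why the text says "as far as possible".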
We produce a number of assessments for each person (or other source) based on various subsets of their assessed predictions. The most comprehensive of these (referred to as "Mixed") is derived from the set of all benchmarked prediction assessments, where available, and absolute prediction assessments where not. Other assessments include "Benchmarked" (the set of all benchmarked assessments only), "Absolute" (the set of all absolute assessments only), "Market" (the set of assessments of predictions about broad based US market indices) and "< 1yr" (the set of assessments of predictions that have a representative timeframe less than one year).
One of the components of a person assessment is the person RAAER - a weighted average of the RAAERs of their assessed predictions. The person RAAER is a best estimate of the excess return that might be achieved by following the person's advice (assuming a person's ability is constant through time and continues into the future). We also display this value multiplied by the standard deviation of the yearly S&P 500 returns, making the two directly comparable. Unfortunately, the person RAAER is usually very unreliable (has a large associated error), and so in practice it is not very useful.
As part of a person assessment we also produce a p-value, which is an estimate of the probability that the person RAAER could have been obtained by chance alone (no predictive skill). People who are more likely to possess skill (positive or negative) will have smaller p-values. p-values below 0.05 (5%) are usually considered to be "statistically significant". We do see a number of people with p-values close to or below the statistical significance level, suggesting there is some value in looking at this number (and noting whether the RAAER is positive or negative).
There are a number of factors that we account for which complicate calculation of the person RAAER and p-values considerably. The most important is probably correlation between prediction RAAERs - the results of multiple predictions about the same or similar topics over overlapping time periods convey (often much) less information than independent results. Another very important consideration is the heavy-tailed distribution of standardised scores. There are also a number of other less important factors that we have incorporated into our methodology.
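The effect of correlation on the p-value can be sketched with a simplified calculation. The version below is an illustration under stated assumptions, not the method described above: it combines standardized scores with given weights, takes a correlation matrix between predictions as an input, and uses a normal approximation for the null distribution, which ignores the heavy tails. All function names are hypothetical.

```python
import math


def person_raaer(raaers, weights):
    """Weighted average of prediction RAAERs."""
    return sum(w * r for w, r in zip(weights, raaers)) / sum(weights)


def two_sided_p_value(scores, weights, corr):
    """p-value for a weighted sum of standardized scores.

    Assumes each score has unit variance under the no-skill hypothesis,
    that corr[i][j] is the correlation between scores i and j, and that
    the weighted sum is approximately normal (a simplification).
    """
    n = len(scores)
    total = sum(w * z for w, z in zip(weights, scores))
    variance = sum(weights[i] * weights[j] * corr[i][j]
                   for i in range(n) for j in range(n))
    z = total / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    scores = [1.5, 1.5]
    weights = [1.0, 1.0]
    independent = [[1.0, 0.0], [0.0, 1.0]]
    duplicated = [[1.0, 1.0], [1.0, 1.0]]
    print(two_sided_p_value(scores, weights, independent))
    print(two_sided_p_value(scores, weights, duplicated))
```

Two independent scores of 1.5 together fall below the 5% level, while two perfectly correlated scores of 1.5 carry no more information than one and do not. A heavy-tailed null distribution would push both p-values higher still.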
In addition to the RAAER and p-value, we produce a set of simpler metrics. The Annualized Excess Return (AER) is an annualized return corrected for the risk-free rate of return but not historical volatility, so it represents the return in excess of the risk-free rate obtained by holding the recommended assets without leverage. The BM AER is the equivalent annualized return generated by holding the primary benchmark (S&P 500) instead of the recommended assets in the same amount over the same period, while the Excess AER is the difference. Excess AER can be interpreted as the return that would have been achieved by following a person's advice, relative to simply investing in the S&P 500, without making any attempt to correct for differences in risk.
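The arithmetic behind these simpler metrics can be sketched as follows. Geometric annualization of a simple total return and a constant risk-free rate are assumptions made for the example, and the figures are invented purely to show the calculation.

```python
def annualized_return(total_return, years):
    """Geometric annualization of a simple total return (assumed convention)."""
    return (1.0 + total_return) ** (1.0 / years) - 1.0


def aer(total_return, years, risk_free_rate):
    """Annualized return in excess of the risk-free rate."""
    return annualized_return(total_return, years) - risk_free_rate


if __name__ == "__main__":
    years = 2.0
    risk_free_rate = 0.02
    asset_aer = aer(0.21, years, risk_free_rate)    # recommended asset: +21%
    bm_aer = aer(0.1025, years, risk_free_rate)     # benchmark: +10.25%
    excess_aer = asset_aer - bm_aer
    print(asset_aer, bm_aer, excess_aer)
```

Note that the risk-free rate cancels in the Excess AER, so it is simply the difference between the two annualized returns.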
The person assessment results together with all predictions that are currently open or partially open are used to produce a statistical market model that we hope has some predictive power.
We start with an instance of the market model identical to that which would be used to assess predictions with a stated date equal to the current time. This model captures the nature of market volatility quite well, but has no exploitable predictive power. We then update this model (by way of a Monte Carlo simulation) so that the assessments of all open predictions are as consistent as possible with all person assessments.
A key function of the forecasting algorithm is to deal with conflict in the best way possible - not every source with a high positive score will be in agreement about a particular topic, nor will they always disagree with sources with negative scores. We are effectively trying to find topics that have the least controversy amongst sources with the highest significant scores (either positive or negative).
There are a number of factors that make generating a reliable forecast very difficult. The most important of these is correlation between the person assessments. One reason for the correlation is that different people might work from the same sort of analysis or model. It could be a good approach, but when it fails everyone using it is wrong at the same time. A more pernicious effect is that people are swayed by what others are saying, which at the extreme can lead to situations of group-think where an incorrect consensus emerges, often seen before a market crash. By modelling at least some correlation between all sources, we retain a residual level of doubt in the face of overwhelming agreement.
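The "residual level of doubt" has a simple statistical counterpart, shown here as a toy calculation rather than as the forecaster's actual model. Assume every source's error has the same standard deviation sigma and every pair of sources shares the same correlation rho (both hypothetical simplifications). The variance of the average opinion is then sigma squared times (rho + (1 - rho) / n), which never drops below rho times sigma squared however many sources agree.

```python
def consensus_variance(n_sources, rho, sigma=1.0):
    """Variance of the mean of n equally correlated source errors.

    Assumes a common standard deviation `sigma` and a common pairwise
    correlation `rho` between all sources.
    """
    return sigma ** 2 * (rho + (1.0 - rho) / n_sources)


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, consensus_variance(n, rho=0.0), consensus_variance(n, rho=0.2))
```

With no correlation, a thousand agreeing sources shrink the uncertainty a thousandfold. With a correlation of 0.2, it barely falls below a fifth of a single source's variance.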
Note: Forecaster is still experimental and we don't expose its output yet. It appears to be working quite well, but requires more testing and more prediction data before we are prepared to declare the results useful.