Using correlation to track model performance tends to provoke two reactions at once: “a mistake that nobody would ever make” and a vague “what would be wrong if I did do that” feeling. I hope that after reading this you feel at least a small urge to double-check your work and presentations to make sure you have not reported correlation where R-squared, likelihood, or root mean square error (RMSE) would have been more appropriate.
It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function.
The correlation function (which we will call cor(,)) has a huge number of obscuring symmetries: it is unchanged under positive scaling, shifts, and the swap of its two arguments. This means it is in fact scoring whether some ideal shift plus re-scaling of your model predictions performs well, instead of scoring the predictions you are actually using. This is not what you want: models in a production environment are supposed to make actual good predictions. Measurements in production are supposed to tell you if the model or data have drifted (not to merely assume they have not).
Here is some R-code showing symmetries in cor(,):
    > y = runif(10)
    > x = y + 0.5*runif(10)
    > cor(x,y)
    [1] 0.8893743
    > cor(y,x)
    [1] 0.8893743
    > cor(10*x,y)
    [1] 0.8893743
    > cor(x+10,y)
    [1] 0.8893743
R-squared (written as a function as rsq(,)) has none of these symmetries: it changes under simple alterations of its arguments and can become arbitrarily negative.
Here is some R-code showing the lack of symmetries in rsq(,):
    > rsq = function(y,f) { 1 - sum((y-f)^2)/sum((y-mean(y))^2) }
    > rsq(x,y)
    [1] -0.4966555
    > rsq(y,x)
    [1] 0.09424879
    > rsq(10*x,y)
    [1] -9.197255
    > rsq(x+10,y)
    [1] -2250.407
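RMSE, mentioned at the start of this article, behaves like R-squared in this respect. The following is only a minimal sketch: the helper name rmse(,), the seed, and the argument order (outcomes first, predictions second) are our own choices for illustration and are not part of the session above.

```r
# Illustrative sketch: rmse() and the seed are assumptions made for this example.
set.seed(2024)
y <- runif(10)
x <- y + 0.5 * runif(10)

# Root mean square error of predictions f against outcomes y.
rmse <- function(y, f) { sqrt(mean((y - f)^2)) }

print(rmse(y, x))       # error of the predictions as given
print(rmse(y, x + 10))  # much larger once the predictions are shifted
print(rmse(y, 10 * x))  # also larger once the predictions are re-scaled
print(cor(y, x + 10))   # identical to cor(y, x): correlation ignores the shift
```

Unlike cor(,), the RMSE reacts to a shift or re-scaling of the predictions, which is the behavior you want from a production monitoring score.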
And here is some R-code to remind you that correlation squared and R-squared do agree on training data:
    > model = lm(y~x)
    > rsq(y,predict(model))
    [1] 0.7909866
    > cor(y,predict(model))^2
    [1] 0.7909866
    > model

    Call:
    lm(formula = y ~ x)

    Coefficients:
    (Intercept)            x
        -0.3309       1.1432
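That agreement is a property of the training data and need not survive drift. As a hedged sketch of what can happen in production, suppose the input later picks up a constant offset. The offset of 0.5, the variable names drifted and pred, and the seed are all hypothetical, chosen only for illustration.

```r
# Illustrative sketch: the drift, names, and seed are assumptions for this example.
set.seed(2024)
rsq <- function(y, f) { 1 - sum((y - f)^2) / sum((y - mean(y))^2) }

# Training data, generated the same way as in the article's session.
y <- runif(10)
x <- y + 0.5 * runif(10)
model <- lm(y ~ x)

# Hypothetical drift: the input picks up a constant offset of 0.5.
drifted <- data.frame(x = x + 0.5)
pred <- predict(model, newdata = drifted)

print(cor(y, pred)^2)  # same value as on the training data
print(rsq(y, pred))    # lower than on the training data
```

The drifted predictions are just the training predictions plus a constant, so correlation squared cannot see the change, while R-squared drops because the predictions themselves are now worse.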
If you look at this with an open or learning mind it should seem very strange that a function like cor(,), with a huge number of symmetries, is closely associated with a function like rsq(,), which has many fewer symmetries. At this point we re-recommend Nina Zumel’s “Correlation and R-squared” article to remind ourselves why correlation squared and R-squared are the same on training data. But the point we want to leave you with is that the correlation function uses its many symmetries to evaluate whether some simple function of a value is a good prediction (hence correlation is a great way to vet possible model inputs); correlation does not score whether the unaltered predictions at hand are in fact good.
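This last point can be checked directly. In the minimal sketch below, the badly calibrated predictions f, the name refit, and the seed are hypothetical and chosen only for illustration. Correlation squared matches the R-squared of the best least-squares shift and re-scaling of f, not the R-squared of f itself.

```r
# Illustrative sketch: f, refit, and the seed are assumptions for this example.
set.seed(2024)
rsq <- function(y, f) { 1 - sum((y - f)^2) / sum((y - mean(y))^2) }

y <- runif(10)
f <- 10 * (y + 0.5 * runif(10)) + 3  # hypothetical badly calibrated predictions

# Best shift and re-scaling of f, found by least squares.
refit <- lm(y ~ f)

print(rsq(y, f))              # score of the predictions as given: very negative
print(rsq(y, fitted(refit)))  # score of the idealized re-fit
print(cor(y, f)^2)            # matches the re-fit score, not the raw score
```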