So you think you know about linear regression ...
Submitted by Chris Stucchio (@stucchio) on Monday, 11 June 2018
Everyone has used linear regression. It’s boring, standard mathematics that we learned in Stats 101.
But how many of us really understand it at a deep level? One of the “rules” of linear regression is that your features must not exhibit multicollinearity. But where does this rule come from? What happens if we violate it? Many people suggest regularization or ridge regression as a solution, but why do these methods work? What are we actually doing?
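The multicollinearity question above can be made concrete with a small sketch. It uses NumPy on made-up data (the feature names, sample size, and penalty `lam` are illustrative assumptions, not taken from the talk): one feature is an exact copy of another, so the matrix in the normal equations loses rank, and adding a ridge penalty makes the system solvable again.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Hypothetical worst case: x2 is an exact copy of x1 (perfect multicollinearity).
x1 = rng.normal(size=n)
x2 = x1.copy()
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# The normal equations are (X^T X) beta = X^T y. With duplicated columns,
# X^T X has rank 1, so there is no unique least-squares solution.
gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", rank)

# Ridge regression adds lam * I to X^T X, which makes the matrix invertible.
lam = 1.0  # illustrative penalty strength
beta_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)
print("ridge coefficients:", beta_ridge)
```

In this sketch ridge splits the effect evenly between the two identical features, and the two coefficients together account for the effect of the underlying variable. Ordinary least squares cannot choose among the infinitely many splits that fit equally well.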
In this talk I’ll discuss linear regression from the Bayesian perspective. It is a simple way of thinking about regression that makes the answers to these questions quite transparent. It also provides an avenue for solving various harder problems (e.g. non-Gaussian errors) that you might not have seen before.
As a running example I’ll consider predicting scores in fantasy sports, specifically the scores of a batter in baseball or cricket.
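As one standard illustration of why the Bayesian view makes these questions transparent, here is a sketch of the textbook derivation linking a Gaussian prior to ridge regression. The symbols ($\beta$, $\sigma$, $\tau$, $\lambda$) are notation chosen here, not taken from the talk, and the noise scale $\sigma$ is assumed known.

```latex
% Model: Gaussian noise with known scale \sigma, Gaussian prior with scale \tau.
y = X\beta + \varepsilon, \qquad
\varepsilon \sim \mathcal{N}(0, \sigma^2 I), \qquad
\beta \sim \mathcal{N}(0, \tau^2 I)

% Negative log posterior, up to a constant that does not depend on \beta:
-\log p(\beta \mid X, y)
  = \frac{1}{2\sigma^2} \lVert y - X\beta \rVert^2
  + \frac{1}{2\tau^2} \lVert \beta \rVert^2
  + \text{const}

% Maximizing the posterior is therefore ridge regression with \lambda = \sigma^2 / \tau^2:
\hat{\beta}_{\mathrm{MAP}}
  = \arg\min_{\beta} \; \lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert^2
  = (X^\top X + \lambda I)^{-1} X^\top y
```

Letting $\tau \to \infty$ (a flat prior) sends $\lambda \to 0$ and recovers ordinary least squares, which needs $X^\top X$ to be invertible; that is exactly what fails under perfect multicollinearity. Replacing the Gaussian prior with a Laplace prior gives an L1 penalty by the same argument.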
1) Introduce the idea of a Bayesian posterior, and illustrate the general idea with PyMC.
2) Show how to set up linear regression in PyMC, and generate a sample of answers that illustrates the model uncertainty, how uncertainty varies with sample size, etc.
2a) Illustrate with a practical example, namely predicting cricket or baseball scores (batter vs bowler).
3) Show how ordinary least squares corresponds to maximizing the likelihood, and why various assumption violations make that maximization impossible. Show how the Bayesian perspective still works and simply gives a different picture.
4) Many violations of the rules of linear regression stem from unreasonable solutions to the problem. I’ll show that if we incorporate reasonable assumptions into our model (Bayesian priors), then we get reasonable results out.
4a) If you do naive OLS on the batter vs bowler data set, you get crazy results for batters who’ve only played in 1 or 2 games. But you can fix this by making reasonable assumptions and putting them into the math.
5) Show how different Bayesian priors correspond to many regression tricks, e.g. ridge regression, L1 regularization, etc. No magic here, just assumptions expressed as math.
6) If the data violates our assumptions, just change our assumptions. Bayesian regression still works.
6a) Errors in sports data are not normally distributed. But we can fix that!
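A few of the outline's ideas can be sketched without a sampler. The example below is not the PyMC workflow the talk describes; it swaps in the closed-form posterior for the special case of a Gaussian prior and Gaussian noise with known scale, computed in NumPy. The function name `gaussian_posterior`, the coefficients in `true_beta`, and the sample sizes are all hypothetical choices for illustration.

```python
import numpy as np

def gaussian_posterior(X, y, sigma=1.0, tau=1.0):
    """Posterior over beta for y = X @ beta + N(0, sigma^2) noise
    with prior beta ~ N(0, tau^2 I). Assumes sigma is known."""
    d = X.shape[1]
    # Posterior precision = data precision + prior precision.
    precision = X.T @ X / sigma**2 + np.eye(d) / tau**2
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / sigma**2
    return mean, cov

rng = np.random.default_rng(1)
true_beta = np.array([2.0, -1.0])  # made-up coefficients for simulation

results = {}
for n in (2, 20, 2000):
    # Simulated data; a real analysis would use batter vs bowler records.
    X = rng.normal(size=(n, 2))
    y = X @ true_beta + rng.normal(size=n)
    mean, cov = gaussian_posterior(X, y)
    sd = np.sqrt(np.diag(cov))  # posterior standard deviation per coefficient
    results[n] = (mean, sd)
    print(f"n={n:5d}  mean={mean.round(2)}  sd={sd.round(3)}")
```

With only two observations the prior keeps the estimate finite and the posterior standard deviation stays large; with thousands of observations the data dominate and the uncertainty shrinks. The posterior mean here coincides with the ridge solution for penalty `sigma**2 / tau**2`, which is one instance of a prior corresponding to a regression trick. Non-Gaussian errors have no such closed form in general, which is where a sampler such as PyMC comes in.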
End goal of this talk: if you have highly correlated input data, non-normal errors, domain knowledge exceeding input data, or other common problems, you shouldn’t get stuck. You might need to custom hack some tools
Familiarity with linear regression and some basic mathematics (calculus, linear algebra) is helpful. If you’re familiar with Bayesian reasoning, so much the better (but not required).
Chris Stucchio is a former physicist, high-frequency trader and software developer. He’s currently the head of data science at Simpl. He’s been working in decision theory and Bayesian optimization for the past 5 years, and has been teaching statistics to novices for much longer.