Ridge regression

Why care about ridge regression (RR)?

  • One typical motivation comes from people who want to build encoding models.
  • Encoding models: explicit model of what potential stimulus/cognitive/motor information might be "encoded" in brain activity measurements
  • Typically one has many stimulus/experimental predictors to try to "explain" some brain data
  • See  🐰Modeling concepts 
  • But keep in mind that ridge regression is a general-purpose statistical method that can find applications in many other contexts
  • Even if you don't use ridge regression, understanding it deeply will help you better understand standard regression.

Basics of ridge regression

  • Regression: Find me the best set of weights on my predictors such that a weighted sum of the predictors (with the weights) is as close as possible to the dependent variable (usually in the squared-error sense).
  • Standard ordinary least-squares (OLS) is the most common approach to find weights. See also:  ↗️Basic regression 
  • What is the problem with regression? Getting good weights is hard. We want optimal weights, where "optimal" means the weights you would obtain with infinite samples (data points). Standard OLS regression gives just one estimate among many possible solutions. With limited data and/or noise, the OLS solution might correspond to a crappy model (where "crappy" means a very poor match to the true underlying system, or very poor out-of-sample generalization). Hence, ridge regression (and other regularization approaches) to the rescue!
  • Note that RR is still just giving you a set of weights. So, the overall modeling outcome hasn't fundamentally changed.
  • Another issue to consider (but out of scope) is biological plausibility... maybe the RR weights are more plausible.
  • Yet another issue to consider is interpretability... maybe the RR weights are easier to interpret (or maybe they are challenging to interpret if there are tons of different predictors/weights).
  • Why might you want to introduce RR?
  • Measurement noise, as well as sampling issues. This is going to limit the accuracy of your model, and the OLS solution might be poor.
  • Overfitting. OLS can/will overfit (in the sense that noise in the data may be exerting too much influence on your weights).
  • Bias. OLS provides unbiased estimates (the mean of the weights across many fitting instances will converge to the "true" mean). But the limitation of OLS is that for the data we have (which is all we have), the model estimate may be crappy. Can we do something to improve the estimate we have on the data we have? Yes... we can bias the estimation.

Ridge Regression

RR sets up a new cost function where we minimize a combination of the sum of the squares of the residuals as well as the sum of the squares of the estimated weights (L2 penalty). The relative influence of this penalty is controlled by lambda, which has to be set somehow.
  • The prior that RR embodies is that weights are generally small.
  • Note that when lambda is 0, RR just reduces to standard regression. (In fractional ridge regression, this is a fraction of 1.)
  • Note that when lambda is infinite, RR gives you a set of weights that are all zero. (In fractional ridge regression, this is a fraction of 0.)
  • The typical method to set lambda is to try to maximize out-of-sample performance. The direct way to do this is to perform cross-validation to set lambda.
  • Advantages of RR: improves prediction performance, gets you closer to the right answer.
  • Disadvantages of RR: yields biased estimates.
  • Geometric interpretation: with correlated predictors, ridge regression shrinks weights most along the low-variance principal components of the predictors and tends to "preserve" the top PCs (those that dominate the correlational structure and are responsible for most of the variance in the predictors).
  • Ridge regression is in a sense only useful for the case of correlational structure. If all of your predictors are already fully orthogonal, there is "not much" RR can do.
  • All of this discussion assumes a fixed set of predictors (i.e. model class). Keep in mind that how you determine the weights is just one issue to consider. You could also consider changing the model class (either wholesale, or subtle change to the existing model class, or extracting out a small subset of predictors). Changing the model class has the potential to completely change the modeling performance.
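The cost function above has a simple closed-form minimizer, which makes the two lambda limits easy to verify directly. Here is a minimal NumPy sketch (the data, sizes, and lambda value are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples, 5 predictors, with two predictors made correlated.
n, p = 100, 5
X = rng.standard_normal((n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(n)  # induce correlation
true_w = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = X @ true_w + rng.standard_normal(n)

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_ols = ridge_weights(X, y, 0.0)     # lambda = 0 reduces to standard OLS
w_ridge = ridge_weights(X, y, 10.0)  # lambda > 0 shrinks the weights

# As lambda grows, the L2 norm of the weight vector shrinks toward 0.
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```

Note that the lambda = 0 case reproduces the OLS solution exactly (same normal equations), which is one way to sanity-check a ridge implementation.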

Data splitting

  • Basic terminology: [training / validation] vs. testing
  • Training data - it's the data that informs your model weights.
  • Validation data - it helps you try to set lambda.
  • Testing data - it quantifies your FINAL model prediction performance.
  • [60 / 20 / 20] or [65 / 15 / 20] is maybe a reasonable starting point?
  • To reduce stochasticity of a single split, it is possible to do many splits and average over results (somehow).
  • General approach: loop over possible lambdas (using the training data to determine weights), and choose the one that maximizes performance on the validation data. Then, once you are done figuring out lambda, evaluate model performance on the testing data.
  • The size of lambda is inscrutable... which motivates fractional ridge regression.
  • RidgeCV (e.g. in scikit-learn) can use efficient [[leave-one-sample-out]] cross-validation to set lambda?
  • See  🔧Resampling techniques  and  🐰Modeling concepts 
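The general approach above (split, loop over lambdas, pick the validation winner, then score once on the test set) can be sketched as follows. This is a minimal illustration; the split proportions, lambda grid, and simulated data are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + 2.0 * rng.standard_normal(n)

# 60 / 20 / 20 split of the samples (one possible starting point).
i_train, i_val, i_test = np.split(rng.permutation(n), [int(0.6 * n), int(0.8 * n)])

def fit(X, y, lam):
    # Closed-form ridge solution on the given data.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = 10.0 ** np.arange(-3, 4)  # candidate grid (log-spaced; range is a guess)
val_err = [np.mean((y[i_val] - X[i_val] @ fit(X[i_train], y[i_train], lam)) ** 2)
           for lam in lambdas]

best_lam = lambdas[int(np.argmin(val_err))]       # lambda chosen on validation data
w_final = fit(X[i_train], y[i_train], best_lam)
test_mse = np.mean((y[i_test] - X[i_test] @ w_final) ** 2)  # FINAL performance
```

The key discipline is that the test split is touched exactly once, after lambda is fixed; otherwise the "final" number is contaminated by the selection procedure.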

Constant, scale, offset issues:

  • Often, you can z-score each predictor as a preparation step.
  • You have to make sure to follow it through. (The z-scoring operation becomes part of your overall model.)
  • If the predictors that you prepare have very different ranges, this is going to influence your ridge regression results.
  • If you omit z-scoring, the predictors with high variance are going to dominate.
  • If you do z-score, all the predictors get a chance to exert influence on the regression results.
  • Whether or not you z-score is essentially just changing the prior that you bring to the problem.
  • See  Scale and offset .
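"Following it through" means the z-scoring parameters are estimated on the training data and then re-applied to any new data, since they are part of the model. A small sketch (the predictor scales here are exaggerated on purpose; all values are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
scales = np.array([1.0, 100.0, 0.01])  # predictors with very different ranges
X_train = rng.standard_normal((80, 3)) * scales
X_test = rng.standard_normal((20, 3)) * scales

# Estimate z-scoring parameters on the TRAINING data only...
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)

# ...and carry them through: the same mu/sd are applied to any new data,
# so the z-scoring operation becomes part of the overall model.
Z_train = (X_train - mu) / sd
Z_test = (X_test - mu) / sd
```

Without this step, the lambda penalty effectively punishes the small-scale predictor (which needs large weights) far more than the large-scale one, which is exactly the implicit-prior issue described above.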

Other issues:

  • Lasso (L1 penalty). Sparsity (you might want sparsity (many weights being zero) for interpretation reasons). Compared to RR, lasso can be interpreted as bringing a different prior to the estimation.
  • Banded ridge regression
  • Univariate regression - In standard regression and ridge regression, there is only one output variable. Multivariate regression (multiple output variables) is a distinct topic. That path then leads to CCA.
  • Noise correlations. Standard regression and ridge regression ignore all of these things.
  • Decoding vs. encoding. You can flip the problem (and instead use brain activity to model some experimental variable). However, different assumptions are involved; for example, noise now lives in the predictors and isn't really taken into account by the regression machinery.
  • Interpretation of weights is very tricky. What can we really conclude from how the weight estimates turn out? Especially when there are tricky things like correlations across predictors.
  • If one treats ridge regression as purely a prediction-maximizer, then you avoid these issues.
  • But if you are trying to interpret ridge regression weights/results, things get more complicated.
  • Basic practical tips: Look at your data. Look at your predictors. Understand their units. Try simple ordinary least squares regression.

Resources

  •  Mathematical details of gradient descent with early stopping 
  • Explanation of ridge regression with correlated regressors  [video]   [files]  
  •  Rokem & Kay (2020) Gigascience - Fractional ridge regression 
  •  Kay et al., Nature, 2008 - uses gradient descent with early stopping which is very similar to ridge regression 
  •  Hastie - Elements of Statistical Learning - general textbook (advanced but good) 
  •  Implementation of ridge regression in relating EEG or MEG data to multiple stimulus features