Resampling techniques

Introduction

  • Here is a broad definition of what we mean by "resampling techniques": ways of manipulating your data to learn something about them; exactly what you learn depends on the specific technique in question.
  • Falls under the umbrella of "computational statistics", in which we opt for more nonparametric methods that minimize assumptions about the data and that tend to be CPU-heavy (they rely on sheer computational power). This is in contrast to analytic techniques, which tend to make more assumptions and are more mathematical in nature.
  • Resampling techniques operate on your data. So in that sense, they enjoy the benefit that they stay very close to the data you actually have.
  • Pros: easy to do, useful, easy to think about and interpret, nonparametric
  • Cons: Requires computers, harder to apply to very large datasets
  • The techniques below, although they are all resampling techniques, are very different from one another, so don't conflate them!

Statistical independence

  • This is a fundamental concept that is critical to resampling techniques. The concept of independence is rooted in the basic statistical notion of random sampling.
  • The key is that we want to resample data units (e.g. individual data points, runs, sets of trials, subjects, etc.) that we can (more or less) assume to be independent of each other.
  • What happens when independence is violated?
  • If your data units are not independent, the bootstrap might deliver overly optimistic (or overly pessimistic) sampling variability estimates.
  • If your data units are not independent, cross-validation might deliver overly optimistic model performances.
  • If your data units are not independent, permutation may create unrealistic null scenarios, and it may be too "easy" to show that the actual data deviate from these null scenarios.
  • Some examples:
  • Noise over time in fMRI time-series tends to show autocorrelation. Hence, if you permute the time-series, this creates very unrealistic null scenarios.
  • Voxels are not independent (as there are global sources of noise). Hence, bootstrapping voxels is inappropriate and will likely create artificially low estimates of variability.
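The autocorrelation point can be illustrated with a toy sketch (the AR(1)-style series, the 0.9 coefficient, and the Gaussian noise are all illustrative assumptions, not fMRI data): the actual series shows strong lag-1 autocorrelation, and shuffling destroys that temporal structure, which is why shuffled time-series make unrealistic nulls.

```python
import random

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of a sequence (a minimal sketch)."""
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

rng = random.Random(0)
# AR(1)-style noise: each sample leans on the previous one (autocorrelated)
series = [0.0]
for _ in range(999):
    series.append(0.9 * series[-1] + rng.gauss(0, 1))

r_actual = lag1_autocorr(series)      # strongly positive
shuffled = series[:]
rng.shuffle(shuffled)                 # shuffling destroys the temporal structure
r_shuffled = lag1_autocorr(shuffled)  # near zero
```

A null distribution built from such shuffles would be far narrower than the dependence in the real data warrants.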

Permutation

  • Definition: Re-ordering (shuffling) your data and then analyzing the resulting data; repeating this process many times (e.g. 100-1000 iterations); and then examining how your actual result (no shuffling) compares to the permutation results.
  • Interpretation: The shuffling deliberately breaks the relationship (for example, between two variables), and therefore creates a null distribution that we can compare the actual result to. This leads us to the concept of null-hypothesis significance testing.
  • Why does it work?: You can view shuffling as akin to preserving the empirical marginal distribution but then treating the data as random draws from that distribution. This creates a null scenario in which the order of the data is completely incidental/random.
  • Pros: Easy
  • Cons: It takes CPU time
  • Note that if you have a very small sample size, there may be only a small number of distinct permutations; in that case, you may wish to enumerate and evaluate all of them.
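The procedure above can be sketched as a two-sample permutation test (a minimal sketch; the difference-of-means statistic and the toy data are illustrative choices):

```python
import random
import statistics

def permutation_test(x, y, n_iter=1000, seed=0):
    """Two-sample permutation test on the difference of means.

    Shuffling the pooled data breaks the group labels, creating a null
    distribution of the statistic under "no group difference".
    """
    rng = random.Random(seed)
    observed = statistics.mean(x) - statistics.mean(y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)              # re-order (shuffle) the data
        perm_x = pooled[:len(x)]
        perm_y = pooled[len(x):]
        stat = statistics.mean(perm_x) - statistics.mean(perm_y)
        if abs(stat) >= abs(observed):
            count += 1
    # proportion of shuffles at least as extreme as the actual result
    return (count + 1) / (n_iter + 1)

# two toy groups with a clear mean difference
p = permutation_test([5.1, 5.4, 5.8, 6.0, 5.6],
                     [3.9, 4.2, 4.0, 4.4, 4.1])
```

Because the two toy groups do not overlap, very few shuffles reproduce a difference as extreme as the observed one, so the p-value comes out small.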

Bootstrapping

  • Definition: Resample your data (with replacement) to generate a bootstrap sample; analyze that bootstrap sample; repeat the whole process many times (e.g. 100-1000 iterations); then examine the variability of your analysis result across all of those iterations.
  • Interpretation: The variability that you see is an estimate of your sampling variability. Bootstrapping is a method that provides this estimate!
  • Why does it work?: You are using your data as a proxy (approximation) of the true underlying population distribution.
  • Pros: it's easy, it's nonparametric, it makes use of the data that you have.
  • Cons: it takes CPU time.
  • Note that each bootstrap sample must be the same size as your actual data, because sampling variability depends on sample size.
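A minimal sketch of the bootstrap for the mean (the data values are illustrative; any statistic could be plugged in for `stat_fn`):

```python
import random
import statistics

def bootstrap_se(data, stat_fn=statistics.mean, n_iter=1000, seed=0):
    """Bootstrap estimate of the sampling variability (standard error)
    of a statistic. Each bootstrap sample is drawn WITH replacement and
    has the same size as the actual data.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_iter):
        # resample with replacement, same size as the original data
        sample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat_fn(sample))
    # the spread of the bootstrap estimates approximates sampling variability
    return statistics.stdev(estimates)

se = bootstrap_se([2.3, 1.9, 3.1, 2.7, 2.5, 3.3, 2.0, 2.8])
```

For the mean, this bootstrap standard error should land near the analytic value s/sqrt(n), which is a useful sanity check.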

Cross-validation

  • Definition: Hold out some of your data, fit a model on the remaining data ("training data"), compute the model prediction for the held-out data ("testing data"), and then quantify how well model predictions match the held-out data.
  • Interpretation: This procedure is a direct quantification of the generalization ability of your fitted model; cross-validation is a method that provides an essentially unbiased estimate of generalization performance.
  • If you don't do anything special, and just look at percent variance explained on the training data, that number is positively biased.
  • Factors that affect cross-validation performance:
  • Noise in the test data
  • The appropriateness of the model type
  • How well you estimated model parameters from the training data
  • Pros: it's easy
  • Cons: it eats up training data (for example, a model trained on 80% of the data is not as good as one trained on 100%); it's stochastic (you will get different results for different splits of the data); there are knobs (there are various choices you have to make regarding exactly how you do the splitting)
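The hold-out procedure can be sketched as K-fold cross-validation of a simple least-squares line fit (a minimal sketch; the linear model, the fold scheme, and the toy data are all illustrative assumptions):

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x (closed form)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

def kfold_mse(xs, ys, k=5):
    """K-fold cross-validation: hold out each fold in turn, fit the model
    on the remaining (training) data, and score predictions on the
    held-out (testing) data."""
    n = len(xs)
    errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))   # every k-th point is held out
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        for i in test_idx:
            errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return statistics.mean(errors)          # out-of-sample mean squared error

xs = list(range(10))
ys = [2 * x + 1 + (0.3 if x % 2 else -0.3) for x in xs]  # near-linear toy data
mse = kfold_mse(xs, ys)
```

Because every prediction is scored on data the model never saw, this error estimate avoids the positive bias of training-data fits.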

Jackknifing

  • Definition: Split your data into a number of chunks (e.g. 10 or 20); in each iteration, leave out one chunk and analyze the remaining data; examine the variability of analysis results across iterations.
  • Interpretation: It aims at the same quantity that bootstrapping targets, namely sampling variability. However, it is not widely used any more, as the bootstrap is generally more accurate.
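A minimal sketch of the leave-one-out jackknife (chunks of size 1; the data values are illustrative, and the variance formula is the standard jackknife formula):

```python
import math
import statistics

def jackknife_se(data, stat_fn=statistics.mean):
    """Leave-one-out jackknife estimate of the standard error of a
    statistic: recompute the statistic with each observation left out,
    then measure the spread of those leave-one-out estimates."""
    n = len(data)
    loo = [stat_fn(data[:i] + data[i + 1:]) for i in range(n)]
    mean_loo = statistics.mean(loo)
    # standard jackknife variance formula
    var = (n - 1) / n * sum((v - mean_loo) ** 2 for v in loo)
    return math.sqrt(var)

data = [2.3, 1.9, 3.1, 2.7, 2.5, 3.3, 2.0, 2.8]
se = jackknife_se(data)
```

For the mean, the jackknife standard error reduces exactly to the familiar s/sqrt(n), which makes the toy case easy to verify.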