Error is a fundamental concept. In supervised learning, we generally want to minimize error (the discrepancy between model output and data).
But... overfitting (there is noise in the training data that we don't really want to fit)
But... there are different error metrics (how exactly do you quantify error?)
But... a summary measure for error can hide important issues (like imbalanced classes, or some data points are better modeled than others)
Regularization is a fundamental concept. Constrain/massage the model to improve out-of-sample generalization performance.
Case study:
Consider the simple case of multiple linear regression. y (d data points x 1) and X (d data points x p predictors). Goal: use a weighted combination of the columns of X to "fit" y. Find good estimates of the weights.
Assuming that noise exists in y (fake/incidental/unimportant variability), we don't really want our model to fit it.
Note that sometimes the "noise" in some data is actually the thing you are trying to systematically characterize, in which case it's not really noise in the "random" sense... so just be careful.
Noise will cause "noise" in the weights, and hence in the model you actually estimate. Ideally, in general, we want d >> p in order to get stable weights. (As a corollary, if the weights are unstable, they are probably inaccurate, which will limit model generalization performance.)
If the true relationship between X and y is not linear (i.e. not captured by a weighted combination of Xs), there will be extra error (i.e. residuals) and maybe some people might call that "noise".
Note that noise could exist in X, but for the most part, this is ignored.
If I gather more training data (d), I expect test performance to improve, simply because the quality of the weights should improve. But there are limits to that, since the true underlying thing you are modeling may not be accurately captured by the model class/form you are using. (More on this later.)
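The case study above can be sketched in a small simulation (all names and numbers here are illustrative, not from the notes): generate y = Xw + noise, recover the weights by ordinary least squares, and compare d barely above p against d >> p.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5                        # number of predictors
w_true = rng.normal(size=p)  # ground-truth weights

def mean_weight_error(d, n_reps=50, noise_sd=1.0):
    """Average absolute error of least-squares weight estimates,
    over n_reps simulated datasets of d data points each."""
    errs = []
    for _ in range(n_reps):
        X = rng.normal(size=(d, p))
        y = X @ w_true + noise_sd * rng.normal(size=d)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        errs.append(np.abs(w_hat - w_true).mean())
    return float(np.mean(errs))

err_small = mean_weight_error(d=10)    # d barely above p: unstable weights
err_large = mean_weight_error(d=1000)  # d >> p: stable weights
print(err_small, err_large)
```

The point is simply that weight error shrinks as d grows (up to the limits noted above about model-class mismatch).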
Sampling issues
As a matter of experimental design, you can control how you sample y and X. Thus, you can improve the quality of your model fitting simply by judiciously choosing X (e.g., use high dynamic range).
If you really care about deciding which predictor is more important than others (i.e., assign accurate weights to the predictors and figure out which is biggest), then it's important to sample in such a way (if possible) that the predictors are uncorrelated (or less correlated). The more correlated the predictors, the more uncertain we will be in the weights assigned to the predictors.
If you have the wrong model class, limited narrow-minded sampling might be really problematic (i.e. really bad extrapolations to other parts of the space).
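A quick simulation of the correlated-predictors point (illustrative numbers; `weight_sd` is a made-up helper): the more correlated the two predictors, the larger the spread in the estimated weights across repeated datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -1.0])

def weight_sd(corr, d=100, n_reps=200):
    """Std. dev. of the first estimated weight across simulated
    datasets whose two predictors have the given correlation."""
    cov = np.array([[1.0, corr], [corr, 1.0]])
    w_hats = []
    for _ in range(n_reps):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=d)
        y = X @ w_true + rng.normal(size=d)
        w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        w_hats.append(w_hat[0])
    return float(np.std(w_hats))

sd_uncorr = weight_sd(corr=0.0)   # predictors sampled to be uncorrelated
sd_corr = weight_sd(corr=0.95)    # highly correlated predictors
print(sd_uncorr, sd_corr)
```

This is the sense in which sampling predictors to be less correlated buys you more certainty in the weights.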
Testing data and predictive performance
Usually, we use testing data to unbiasedly assess performance of the model we get from the training data. You can't just assess on the training data since that would be overly optimistic. However, note that there is "noise" in the testing data. We can never achieve perfect predictions. (Hence, the concept of the 'noise ceiling'.)
Other times, the goal is to really stress-test the learned relationship; in which case, you want to actually get strange diverse testing data to check the generalizability of your model.
Dividing data into subsets
Typical setup: [Training vs. validation] vs. testing.
One idea: [[80 / 20] within the 80] vs. 20
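The [[80 / 20] within the 80] vs. 20 idea can be sketched as follows (shuffling indices is one common approach; the sizes are just from the example above):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
indices = rng.permutation(n)  # shuffle so the splits are random

# Outer split: 80% for model development, 20% held out for testing.
n_test = n // 5
dev, test = indices[:-n_test], indices[-n_test:]

# Inner split: 80/20 of the development set for training vs. validation.
n_val = len(dev) // 5
train, val = dev[:-n_val], dev[-n_val:]

print(len(train), len(val), len(test))
```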
Sometimes we make the testing data special (by collecting lots of trials). Averaging across many trials improves our models' prediction numbers (because noise is not predictable), but note that a different issue is how well you sampled the space of stimuli ("predicting r = 0.99 for 5 stimuli is not very impressive, because maybe you just got lucky for those 5 stimuli"). (More on this later.)
Multiple splits?
If you do a single split (e.g. 80/20 for training/testing), there is incidental variability in both the training and testing data, and we aren't making full use of the data for training. Also, our testing assessment will be "noisy".
Thus, there is some motivation to do n-fold cross-validation, the idea being that we can pool results across the n folds to stabilize our accuracy measures.
Out-of-study testing of models
This goes beyond formal statistics.
Validation/replication across newer experiments, different model implementations, etc.
Deep thoughts
Bootstrapping (is a non-parametric approach that) assesses intrinsic ("measurement noise") variability of analysis results. You can, for example, draw a bootstrap sample and run the full modeling procedure on that sample, and then repeat the entire process for different bootstrap samples. Limitation: CPU time, and it's only as accurate as the diversity/size of your sample.
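A minimal bootstrap sketch (the "full modeling procedure" here is just a correlation, to keep it short; numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=80)
y = 0.5 * x + rng.normal(size=80)

def bootstrap_corr(x, y, n_boot=1000):
    """Draw bootstrap samples (with replacement) and re-run the
    analysis (here: a correlation) on each one."""
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample indices with replacement
        stats.append(np.corrcoef(x[idx], y[idx])[0, 1])
    stats = np.array(stats)
    return float(stats.mean()), float(stats.std())

m, sd = bootstrap_corr(x, y)
print(m, sd)  # sd quantifies the variability of the analysis result
```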
Permutation test directly assesses outcomes under the null hypothesis. For example, consider shuffling the "y" variable (notice this preserves the marginal distribution of "y", but DESTROYS the relationship between the independent variables and the outcome variable) and re-running the entire analysis. Or, considering a simple correlation of two variables, imagine shuffling one of the variables and re-computing correlation values. Limitation: CPU time, need sufficient samples/trials in order to generate a diverse set of permuted outcomes.
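The shuffle-one-variable version of the permutation test can be sketched as (illustrative data and counts):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = 0.6 * x + rng.normal(size=100)

observed = np.corrcoef(x, y)[0, 1]

# Null distribution: shuffle y (this preserves its marginal distribution
# but destroys any relationship with x), then re-compute the correlation.
null = np.array([np.corrcoef(x, rng.permutation(y))[0, 1]
                 for _ in range(2000)])

# One-sided p-value: fraction of permuted outcomes >= the observed one
# (with the +1 correction so p is never exactly zero).
p_value = (np.sum(null >= observed) + 1) / (len(null) + 1)
print(observed, p_value)
```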
Models can be used in many different ways:
Prediction - machine learning-esque. Often the idea is to strive for perfect prediction. But note that in some cases, models can still be useful even if they don't explain 100% of the variance.
Summarize what's going on - e.g. the linear weights summarize the system
Denoising - if you trust your model, its outputs could have much less noise than the actual data themselves
Maybe the model itself is of interest - e.g., Newton 2nd Law, parametric model of a timecourse, or biophysical model with different components, models that summarize some "lawful" behavior (e.g. Weber's Law), cognitive models
Maybe the parameters themselves are of interest (e.g. pRF size)
Feature comparison - comparing the magnitudes of different potential model components (predictors).
If the testing data have 'noise', we can never predict all of it. This is the concept of the "noise ceiling".
Why is it useful to predict? Prediction helps check your code, and it forces you to "put your money where your mouth is": prediction is not guaranteed to work.
Issue of 'how similar' the training and testing data are. Do we want to make them as similar as possible? ('weak' generalization) Or do we want to perform a strong test of generalization? ('strong' generalization). Both are interesting.
All else being equal, if we have more training data (more observations), what can we expect to happen?
Your parameter estimates are going to get better. By consequence, overall predictive performance should get better.
In the limit, with infinite training data, the training variance explained will asymptote to some number (likely less than 100%).
All else being equal, if we have more testing data (more observations), what can we expect to happen?
Variance on the prediction metric will decrease (i.e. you can be more confident in the prediction number you are getting).
In the limit, with infinite testing data, the testing variance explained will asymptote to some number that is the "true" predictive power of the model that we have estimated (remember that we are holding the training data constant).
What if we do some 'trial averaging' on testing data before assessing predictive performance?
We expect that there is less noise in the trial-averaged data.
We expect, on average, that the model that we have estimated will do better on the trial-averaged data (since noise has been removed).
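A toy simulation of the trial-averaging point (illustrative setup: the "model" is assumed to be perfect, so any shortfall in the prediction metric is pure test-data noise):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 40                       # distinct test conditions
signal = rng.normal(size=m)  # "true" responses; also the model's prediction

def prediction_r(n_trials, noise_sd=1.0):
    """Correlate the (perfect) model prediction with test data
    averaged over n_trials noisy repeats of each condition."""
    trials = signal[None, :] + noise_sd * rng.normal(size=(n_trials, m))
    data_avg = trials.mean(axis=0)
    return float(np.corrcoef(signal, data_avg)[0, 1])

# Averaging over repeated runs just to stabilize the comparison.
r_single = np.mean([prediction_r(1) for _ in range(100)])
r_avg = np.mean([prediction_r(10) for _ in range(100)])
print(r_single, r_avg)
```

Even a perfect model falls short of r = 1 on single trials (the noise ceiling); averaging trials raises the attainable r.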
How should we design the T trials for training? M distinct conditions x N trials.
That is, should we dedicate more experimental time to increasing M, or instead keep a large N?
Some algorithms/models WANT variance (covariance), meaning they want N >> 1.
But in general, it shouldn't matter too much for the purposes of getting accurate model parameter estimates... (but it really depends on the details and the specific model...)
Rule of thumb is to ensure M is large (i.e., span a diverse range of the stimulus space).
How should we design the T trials for testing? M distinct conditions x N trials.
Well, if you trial-average a lot, you really increase your noise ceiling and you will make your model prediction metrics look big.
But beware that if M is low, you have very little reliability in the condition-generalization sense.
On the issue of hyperparameter optimality and its dependence on the amount of data. Imagine a 60/20/20 split for training/validation/testing. Should I have dedicated more data to the validation (e.g., to figure out the ridge regression hyperparameter)?
All else equal, more validation data is better: you want to get a good estimate of what the hyperparameter should be.
However, if you are contemplating 60/20 → 50/30, you risk making your actual model estimate worse.
Note that any single split has incidental variability in it. So, multiple splits is a good idea to "smooth out" the incidentalness.
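The validation-set-for-hyperparameters idea can be sketched with the closed-form ridge solution (all sizes and lambda values are illustrative, and the testing portion is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(7)
p = 20
w_true = rng.normal(size=p)
X = rng.normal(size=(60, p))
y = X @ w_true + 3.0 * rng.normal(size=60)

train, val = np.arange(0, 40), np.arange(40, 60)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def val_error(lam):
    """Fit on the training split; score mean squared error on validation."""
    w_hat = ridge(X[train], y[train], lam)
    return float(np.mean((y[val] - X[val] @ w_hat) ** 2))

lams = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: val_error(lam) for lam in lams}
best_lam = min(errors, key=errors.get)
print(errors, best_lam)
```

Note that a noisy validation set can pick a poor lambda, which is exactly the motivation for multiple splits.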
Feature selection - intimately connected to Lasso since it wants to "zero" lots of the predictors for you. When many predictors are assigned a weight of zero, they are effectively "removed" from the model.
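A sketch of lasso's zeroing behavior, fit here via ISTA (proximal gradient descent), which is one standard way to solve the lasso; the data and lambda are illustrative:

```python
import numpy as np

rng = np.random.default_rng(8)
d, p = 200, 10
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]   # only 3 of the 10 predictors matter
X = rng.normal(size=(d, p))
y = X @ w_true + 0.5 * rng.normal(size=d)

def lasso_ista(X, y, lam, n_iter=2000):
    """Lasso via ISTA: a gradient step on the squared error, followed
    by soft-thresholding, which drives small weights exactly to zero."""
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2  # 1 / Lipschitz constant
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = w - step * X.T @ (X @ w - y)                      # gradient step
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)  # shrink
    return w

w_hat = lasso_ista(X, y, lam=100.0)
print(np.round(w_hat, 2))  # most irrelevant predictors end up exactly zero
```

The exact zeros come from the soft-threshold step; that is the sense in which lasso "removes" predictors for you.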
Why would one want to regularize? Inducing some bias in order to reduce variance and therefore improve out-of-sample generalization
Don't obsess too much about this resampling/statistics/machine-learning stuff; you should also worry about the data, the experiment, and the goal of your analysis/modeling.
Pitfalls/considerations of the modeling process
Model behavior vs. model innards. There is some danger in reifying the model innards too much. (They are "real" only insofar that you are wedded to your model.) The model behavior, on the other hand, is concrete, objective, and closely matched to the observed data.
Often, people prioritize (almost overemphasize?) predictive performance in the sense of maximizing the correlation of a model output with neural data. This is in the spirit of the 'model behavior' point above. However, this view neglects the innards of the model and forgoes potentially useful interpretation.
The performance of a model is meaningful only with respect to the stimulus/experimental variation on which it is trained/tested. In other words, saying that r = 0.9 is not very meaningful, unless we know exactly what neural data, what stimuli, etc. this was evaluated over. Moreover, it may be the case that a given model might fare much worse on a very different stimulus set (or task manipulation).
Model class vs. two instances of a single model (two different sets of parameters). Sometimes we talk about two different model classes. But even within a single model class, two different instantiations of that model are still two different models. (So, be careful and precise with your wording and interpretation.)
Two models might have very similar "predictive" performance but vastly different innards. This highlights the challenge of using predictive performance alone. And also highlights the challenge of performing effective interpretation of model innards. (If two models are equally "good", then in what sense can we really make a big deal out of one model's innards?)
If you have two sets of parameters that work equally well for a given model, what can we do? There are different views one can take. One view is that the two sets of parameters are equally good and so either model is correct. Another view is that you should perhaps "fix" your model setup so that the two sets of parameters actually give different behavior.
What to do when your model parameters become really wacky? There are different possibilities:
1. It's just noise. An action item is to confirm that it's just estimation noise.
2. If the model is working perfectly well but the numbers you are getting for the parameters are strange, maybe the way you structured/designed your model is suboptimal.
3. Use simulations to diagnose the strange behaviors of your model; this may provide insight into how to "fix" the model.
4. Limit the range of the parameters (but this is dangerous and somewhat arbitrary).
5. Maybe it's not noise but reflects some edge case of sampling... so one approach is to just identify those cases and ignore them.
6. Maybe the space for your parameter numbers is improper (e.g., maybe it's better to consider your parameters on a log scale vs. a linear scale).
Often we have to estimate parameters of a model (on basis of data). Because of that, model parameters are always wrong (there is always some estimation error). This is important to bear in mind.
Trying to fit data vs. trying to simulate data (model simulations). These are very different approaches. The latter is often designed and explored in a way that is fairly detached from real data. However, simulations can still be combined with empirical data/results eventually.
Word (non-quantitative) model vs. phenomenological/descriptive (could be quantitative) vs. computational/mechanistic model vs. a real biological model. These are different levels of abstraction. However, it is not necessarily the case that abstraction is bad. Having abstractions in one's model is arguably a good thing: modeling every single biological detail may not be terribly useful for the scientific enterprise. Thinking about these distinctions is important to ensure that when we evaluate each other's work that we understand what we are each doing.
When your model doesn't work, what's the problem? Maybe there is too much noise (defined as unwanted uncontrolled variability) in the data. Maybe the signal in the data is not what you were hoping for (your model isn't the right one). Maybe the parameter fitting approach you are using is failing, either because of practical problems (e.g. local minima, code implementation was bad) or because you do not have sufficient data to estimate the parameters well (ill-posed). Maybe the experiment did not sample the relevant space well enough.
Sampling. Typically, your model is going to perform best in the range of sampling that you train it on. (So, take experimental design seriously.) Sampling may be non-trivial. Maybe the space that you assume the sampling to be in may not be the correct space. Often, we choose to sample uniformly, or randomly somehow. But what if the space is logarithmic? Sometimes we don't know the proper sampling range, so it may become a chicken-and-egg problem: we have to do some experiments to figure out the relevant range, and then run a new experiment with this new knowledge in hand.
Re-parameterization. For technical reasons, it may be useful to re-parameterize your model. For example, it may make the fitting better behaved.
What's in the data vs. what's in the model. There's no guarantee that any effect of interest that is present in the model is present in the data. Thus, it's always a good idea to look at the data.
Pernicious parameters. Think about the full set of possibilities for the parameters in your model and think about the model behaviors that could be elicited. There could be parameter combinations that cause really wacky model behavior.
A model may "work" but there may be unconsidered confounds/correlated variables that are responsible for the model's performance. A simple example is head motion that is correlated with the experimental conditions; the head motion may cause signal fluctuations that masquerade as activation.
Correlated-predictors scenarios. Some models may have multiple predictors that are correlated with each other. This causes some statistical issues.
Models are often complex and have model components that interact with one another. Fully understanding a model involves understanding these interactions.