Noise, broadly speaking, is variability you don't want or are not interested in.
Why is SNR an issue? Because fMRI data (and many other types of data) have a lot of noise. Thus, you have to work hard to study the signal (the thing you care about).
Example: Hard to get clean anatomical data-driven masks
Example: If you have too little SNR, you really can't study anything in any useful way
fMRI data are often on the edge of being garbage/pure noise. If we lived in a world where you could "see the activity changes" in the raw data, that would be a very different world.
Another issue is that you have to know roughly what your SNR is in order to know how much data to collect.
Note that there are many different sources of noise. In order to speak intelligibly, you may have to define what you are talking about.
When you are quantifying something in some data, you have to define clearly the specific thing you are quantifying (e.g. trial-level, condition-level, ROI-level, subject-level). Different things emerge at different levels. For example, the spatial scale of a single voxel differs from the spatial scale of a whole region.
Noise can have structure (e.g. voxels are "correlated" and not independent), so be careful. Noise is not necessarily "uncorrelated".
Things that we know about fMRI data:
Subjects vary a lot in SNR
Sessions can vary a lot in SNR
Voxels can vary a lot in SNR
Units are tricky. Typically, we like to work in %BOLD (i.e. amplitude divided by mean signal (or baseline signal) times 100), but sometimes people work in some sort of t-statistic units, or z-score units, etc. Note that %BOLD is just a specific instance of the idea of "delta-S/S".
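The "delta-S/S" idea above can be sketched in a few lines. This is a minimal illustration, not any lab's specific pipeline; the function name and the use of the time-series mean as the baseline are my own assumptions (in practice you might estimate the baseline from rest periods instead).

```python
import numpy as np

# Hypothetical sketch: convert a raw voxel time series to %BOLD units
# (amplitude relative to baseline, times 100). Using the mean over time
# as the "baseline" is an assumption for illustration.
def to_percent_bold(timeseries):
    baseline = np.mean(timeseries)
    return (timeseries - baseline) / baseline * 100

raw = np.array([1000.0, 1010.0, 1020.0, 990.0, 980.0])
print(to_percent_bold(raw))  # [ 0.  1.  2. -1. -2.]
```

Note that scaling the raw time series by a constant (e.g. `raw * 3`) gives exactly the same %BOLD values, since the amplitude and the baseline scale together.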
Signal — defined as %BOLD magnitude — potentially varies a lot (across voxels, subjects, sessions). In theory it is invariant to RF coil distance (meaning whether a voxel is near the outskirts of the brain or buried deep in the brain, it should be constant).
Noise — defined as variability of signal across trials — also potentially varies a lot (across voxels, subjects, sessions). It is NOT invariant to RF coil distance (since thermal noise makes up a relatively larger proportion of the measurements obtained from "dark voxels" near the middle of the brain as opposed to "bright voxels" that are very close to the RF coil). But even that observation depends on what units and what specific quantification you are using...
MRI people often use the term "tSNR" which is temporal signal-to-noise ratio, which is just mean signal intensity in a voxel divided by the standard deviation of signal intensity over time. Note that this has nothing to do with actual evoked BOLD responses. The advantage is that tSNR is easy to get and compute, and it's mainly useful for thinking about thermal noise. (But note that real BOLD responses actually tend to make tSNR LOWER, ironically. But real BOLD responses are so weak anyway that it really doesn't matter that much!)
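The tSNR computation described above is trivial to sketch. This is a toy version on simulated data; the array shape (voxels x timepoints) and the numbers are made up for illustration.

```python
import numpy as np

# Sketch of tSNR: mean signal intensity over time divided by the
# standard deviation over time, computed per voxel.
# `data` is assumed to be a (voxels x timepoints) array.
def tsnr(data):
    return np.mean(data, axis=1) / np.std(data, axis=1)

rng = np.random.default_rng(0)
data = 1000 + rng.standard_normal((10, 200)) * 20  # baseline ~1000, noise sd ~20
print(tsnr(data))  # each voxel comes out around 50
```

Note there is no task or evoked response anywhere in this computation, consistent with the point above that tSNR has nothing to do with actual BOLD responses.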
%BOLD provides, to some degree, a nice interpretable number. For example, a %BOLD of 2 is fairly "high". However, %BOLD varies across voxels/areas/subjects for potentially non-interesting non-neural reasons (e.g. proximity to large veins).
Note that if you simply scale a voxel time series by a constant factor, this does not change %BOLD (the evoked amplitude and the baseline scale together, so the ratio is unchanged).
Keep in mind the difference between the magnitude of the baseline signal intensity (e.g. how bright a brain voxel is) and the magnitude of evoked amplitudes over and above that baseline (e.g. what we call %BOLD response amplitudes).
Practical analysis issues:
In fMRI you often have to filter or threshold out voxels (voxel selection). An issue to consider is "whether the results are basically the same no matter what specific threshold you choose".
In most cases, the safest comparisons involve holding the voxels/region constant and seeing how those exact voxels/regions change ACROSS conditions. In other words, try NOT to change which voxel/regions are being quantified when you look at experimental manipulations. Another way to think about this is: voxel selection is hard/tricky, but at least try to do voxel selection ONCE (and in a FAIR way) and don't change the voxel selection when you are looking at your effects of interest.
What about making claims across subjects? That situation is a bit different from the situation in which you are directly comparing an experimental manipulation within the same brain tissue/voxels.
Thinking about analyzing data other than the real data themselves:
A useful exercise is to analyze single-coil MRI data (or high-bias surface coil data). (In these cases, the variation of signal and noise is drastic and will force you to grapple with data demons.)
Another useful exercise, in the same vein, is to analyze data into which you have deliberately injected noise.
(More on this below...)
SNR:
SNR can have many different meanings and metrics, so be careful!!!
SNR is complicated, as it is a function of both signal as well as noise, and those two things are very different things that themselves have very different complex properties.
Note that standard deviation units are different from variance units.
As a robust alternative, consider the interquartile range (75th minus 25th percentile) and the semi-IQR (half the IQR).
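A quick sketch of why these robust spread measures can be preferable; the toy data with a single wild outlier are made up for illustration.

```python
import numpy as np

# Sketch: interquartile range as a robust spread measure.
def iqr(x):
    q75, q25 = np.percentile(x, [75, 25])
    return q75 - q25

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one wild outlier
print(np.std(x))   # ~39, blown up by the outlier
print(iqr(x) / 2)  # semi-IQR = 1.0, barely affected
```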
Note that SNR metrics are themselves potentially noisy (since they are calculated from data, which are always limited), so you have to be careful. (I.e., whenever you make a measurement of something, it is noisy/subject to noise.) You can (and maybe should) put bounds (i.e. error bars) on your SNR calculations.
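One common way to put error bars on an SNR estimate is the bootstrap. The sketch below is a generic illustration, not a prescribed method; the SNR definition (mean over std of trial-wise amplitudes) and all numbers are assumptions for the example.

```python
import numpy as np

# Sketch: bootstrap error bars on an SNR estimate computed from
# trial-wise response amplitudes (hypothetical definition: mean/std).
rng = np.random.default_rng(1)
amplitudes = rng.normal(loc=2.0, scale=1.0, size=50)  # 50 trials, true SNR = 2

def snr(x):
    return np.mean(x) / np.std(x, ddof=1)

# Resample trials with replacement and recompute the metric each time.
boots = np.array([snr(rng.choice(amplitudes, size=len(amplitudes), replace=True))
                  for _ in range(2000)])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"SNR = {snr(amplitudes):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The width of that interval is a direct reminder that the SNR number itself is a noisy measurement.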
Thresholding / voxel selection:
The general goal is to report on stuff that's real. Find real brain signals to talk about.
One way is to formally follow NHST (null hypothesis significance testing) in order to choose voxels.
A different way is to do voxel selection but without any p-value interpretation. (Sort of depends on the situation.) For example, if you do an "anatomical mask" (gray matter; V1), notice that you are basically saying p <= 1 (i.e., all voxels are welcome even if noisy!).
Different situations call for different approaches, so you really just have to take it on a case-by-case basis.
When you draw ROIs, exactly how big or small you make the ROIs is a type of voxel selection issue.
Statistical issues:
Often, in a dataset, we work under the presumption that a substantial proportion of the units (e.g. voxels) are pure noise (i.e. practically indistinguishable from noise). Hence, when you focus on / select voxels, you have to consider whether a lot of what you are including is just pure noise.
One strategy: perform the analysis on "known bad voxels" (e.g. white matter) and decide a threshold accordingly.
Another strategy: use a p-value threshold (based on the scientific hypotheses) to accept/reject voxels.
A variant of that strategy: use a p-value threshold for some generic "reliability" (test-retest) criterion.
Another strategy: use an anatomical ROI (or some ROI based on some other independent data)
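The "known bad voxels" strategy above can be sketched as follows. This is a toy illustration on simulated numbers; the metric values, effect size, and the 99th-percentile choice are all assumptions, not a recommendation.

```python
import numpy as np

# Sketch: derive a threshold from "known bad" voxels (e.g. white matter),
# then apply it to the voxels of interest. All numbers are made up.
rng = np.random.default_rng(4)
bad = rng.standard_normal(5000)               # metric in white matter (pure noise)
candidates = rng.standard_normal(5000) + 0.5  # gray-matter voxels, some real signal

threshold = np.percentile(bad, 99)  # accept ~1% of pure-noise voxels
selected = candidates > threshold
print(threshold, selected.mean())   # threshold near 2.3; a few % of voxels pass
```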
Summarizing distributions. We have to boil things down (otherwise there are too many numbers to think about).
Summarizing is very tricky. The mean is nice if it works. The median is more robust, but it can be weird and discrete and may not accurately describe your distribution. Trimmed means are another idea (as is the related Winsorized mean). Reporting several percentiles (e.g. [5 25 50 75 95]) rather than a single number is another approach. Yet another approach is to nonlinearly transform the data, compute the mean, and then transform back. You could also fit an actual probability distribution to your data and report its parameters.
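A quick sketch comparing some of these summaries on a toy distribution with one crazy value. The trimmed-mean implementation here (drop the lowest and highest 10% by percentile cutoffs) is one simple variant among several, and the data are made up.

```python
import numpy as np

# Sketch: several ways to summarize a messy distribution of values.
def trimmed_mean(x, frac=0.1):
    # drop values outside the [frac, 1-frac] percentile range, then average
    lo, hi = np.percentile(x, [frac * 100, (1 - frac) * 100])
    return np.mean(x[(x >= lo) & (x <= hi)])

vals = np.array([0.8, 1.0, 1.1, 0.9, 1.2, 50.0])  # one crazy value
print(np.mean(vals))                              # ~9.2, dragged up by the outlier
print(np.median(vals))                            # 1.05, robust
print(trimmed_mean(vals))                         # 1.05, also robust
print(np.percentile(vals, [5, 25, 50, 75, 95]))   # a fuller picture
```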
A general idea could be (if you are worried): do it multiple ways and show that the results are qualitatively the same.
Be careful about crazy values (e.g. 1e4 or -1e9 or NaN) and think about how your metric fares.
Sometimes you are interested in the central tendency. But other times you might actually care about the extremes (for example, maybe all the interesting signals to analyze are in the extremes and all the crappy noise is in the middle). So it just depends.
Controls and related checks.
Use control regions (air voxels), use control measurements (data that should NOT have the effect), shuffling/permutation, and/or generate data from scratch (randn).
You have to be careful! Different control approaches test for and assume different things!
If you generate/simulate noise (and certainly, there are different flavors of noise), what does your metric do? What does the distribution across repeated noise experiments look like? You need to have a clear sense of what the "null" level of your metric is.
Alternatively, if you take experimental data and corrupt it deliberately (e.g. shuffle time-series, or shuffle trials, or shuffle across voxels (which is very different!)), what does your metric do?
If you take experimental data and inject extra noise into it, what happens?
Depending on what you corrupt/shuffle, you corrupt different characteristics of the data. For example, if you shuffle across voxels, you will completely mess up the weird spatial correlational structure of the data. Whether you want to do this or not really depends on what you are trying to get at...
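The shuffling idea above can be made concrete. The sketch below uses a hypothetical metric (split-half correlation of per-voxel responses) and simulated data; shuffling one half destroys the voxel correspondence, so the shuffled metric shows you the "null" level.

```python
import numpy as np

# Sketch: establish the null level of a metric by shuffling.
rng = np.random.default_rng(2)
signal = rng.standard_normal(100)                # per-voxel "true" pattern
half1 = signal + rng.standard_normal(100) * 0.5  # two noisy measurements
half2 = signal + rng.standard_normal(100) * 0.5

def metric(a, b):
    return np.corrcoef(a, b)[0, 1]

observed = metric(half1, half2)
# Shuffle across voxels (destroying voxel correspondence) many times.
null = np.array([metric(half1, rng.permutation(half2)) for _ in range(1000)])
print(observed, np.percentile(null, [2.5, 97.5]))  # observed far outside the null
```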
If you take experimental data and subsample it (e.g. use only half of it), what happens to your metric?
This is sort of related to the statistical concepts of convergence, consistency, bias, etc. Basically, what happens to your metric if you increase the amount of data given to it (while keeping everything constant about the system). For example, consider what happens to a given SNR metric as you increase the number of trials (shouldn't it stay constant?).
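A sketch of the trial-count sanity check suggested above: recompute the same (hypothetical) SNR metric at several trial counts and see that it hovers around the same value rather than drifting. Everything here is simulated for illustration.

```python
import numpy as np

# Sketch: does the metric stay roughly constant as data accumulate?
rng = np.random.default_rng(3)

def snr(x):
    return np.mean(x) / np.std(x, ddof=1)

# True SNR is 2 by construction; estimates should converge toward 2,
# getting less variable (not systematically bigger) with more trials.
results = {n: snr(rng.normal(loc=2.0, scale=1.0, size=n)) for n in [10, 100, 10000]}
print(results)
```

A metric whose expected value changes with the amount of data (as p-values do) needs very different handling in comparisons.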
Suppose you have a metric and want to compare how it fares across brain areas or subjects. Is your metric affected by gross "SNR differences" or is it expected to be invariant to such differences? For example, if you quantify a tuning property like RF position, obviously it shouldn't be fundamentally systematically biased one way or another by measurement noise (hopefully).
Camp A [cases where the result is dependent on SNR, amount of data, pulse sequence, stimulus type, etc.]: p-values. MVPA decoding performance. How much of the brain passes some statistical threshold.
Camp B [cases where the result is independent of the above issues]: Anatomical regions. Preferred stimulus / tuning peak of a voxel or region.