Basic statistics

  • mean
  • variance
  • statistic - some summary property of a set of data
  • standard deviation - the square root of the variance. Note that the reason for dividing by n-1 (Bessel's correction) is to obtain an unbiased estimator of the population variance (and an almost unbiased estimator of the population standard deviation)
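A quick NumPy sketch of the n-1 distinction (`ddof` is NumPy's knob for it; the data values here are arbitrary):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance: divide by n (ddof=0, NumPy's default)
var_pop = data.var(ddof=0)
# Sample variance: divide by n-1 (ddof=1), unbiased for the population variance
var_samp = data.var(ddof=1)

print(var_pop)   # 4.0
print(var_samp)  # ~4.571 (i.e. 32/7)
```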
  • sample vs. population
  • standard error - the standard deviation of the sampling distribution of a statistic; the most common case is the standard error of the mean (SEM)
  • confidence interval - the N% confidence interval is an interval such that N% of the time across repeated experiments we would expect to find the true population parameter in the interval
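A sketch of the standard error and an approximate confidence interval (assumes NumPy; the 1.96 critical value is the normal approximation, so treat this as illustrative rather than exact):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=100)

mean = sample.mean()
# Standard error of the mean: sample std (n-1 flavor) divided by sqrt(n)
sem = sample.std(ddof=1) / np.sqrt(len(sample))

# Approximate 95% confidence interval using the normal critical value 1.96
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```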
  • estimator - procedure that estimates the population parameter
  • expectation - mean across an infinite number of samples/experiments
  • bias - the discrepancy between the expected value of the estimate vs. the population parameter
  • median - 50th percentile. Associated with non-parametric statistics and robustness (as a major advantage).
  • mode - value corresponding to the peak of a probability distribution
  • robust - referring to working well across a variety of different situations
  • probability distribution
  • histogram
  • parametric - assuming or conforming to some model. In the case of statistics, this typically refers to assumptions about the shape of the probability distribution
  • non-parametric - tending to not make assumptions about probability distributions
  • percentiles - The 99th percentile of a set of values is the number at or below which 99% of the values fall. Non-parametric concept. Note that with finite data, there are funny interpolation games one must play in order to get sensible values
  • quartile - the cut points at the [25 50 75] percentiles (sometimes the 0th and 100th, i.e. the min and max, are included as well)
  • range - difference between max and min values
  • spread - referring to how dispersed a set of values is (how much the numbers differ from each other)
  • iqr (and semi-iqr) - the interquartile range is p(75)-p(25). The semi-interquartile range is one half of that.
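A small NumPy sketch of percentiles, quartiles, and the IQR; note the interpolated values, which is the "funny interpolation games" point above (the data values are arbitrary):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)

# NumPy's default is linear interpolation between order statistics
p25, p50, p75 = np.percentile(values, [25, 50, 75])
iqr = p75 - p25
semi_iqr = iqr / 2

print(p25, p50, p75)  # 3.25 5.5 7.75 (interpolated, not actual data values)
print(iqr)            # 4.5
```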
  • centering (a.k.a. mean-centering) - subtract the mean of a set of values
  • z-scoring - centering and dividing by the standard deviation. This transforms the numbers into z-score units
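Z-scoring in a couple of lines (NumPy; ddof=0 is used here, though some people use ddof=1):

```python
import numpy as np

x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Center, then divide by the standard deviation
z = (x - x.mean()) / x.std(ddof=0)

# The result has mean 0 and standard deviation 1 (z-score units)
print(z.mean())  # ~0.0
print(z.std())   # 1.0
```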
  • rank - one approach is to convert your continuous-valued data into ranks (e.g. 1 2 3 4 5 ...). This can be useful when pursuing non-parametric methods.
  • correlation - Pearson's correlation is typically what we mean by correlation
  • for z-scored data (x) and z-scored data (y), correlation is equal to the average product between corresponding data points in x and y (similar to a dot product)
  • it's equivalent-ish to fitting a line on x to predict y: correlation is a measure of how well a linear function of x predicts y (and vice versa).
  • correlation is symmetric (order does not matter)
  • Spearman's correlation - Pearson's correlation on rank-transformed data. Good for being "robust" to simple monotonic nonlinearities
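A sketch contrasting Pearson and Spearman on a monotonic nonlinearity (NumPy only; the double-argsort rank trick assumes no ties, and the exponential relationship is an arbitrary illustrative choice):

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    # Rank-transform (no ties here), then take Pearson's r on the ranks
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

x = np.linspace(1, 5, 50)
y = np.exp(x)  # monotonic but nonlinear function of x

print(round(pearson(x, y), 3))   # < 1: the relationship is not linear
print(round(spearman(x, y), 3))  # 1.0: the ranks are perfectly preserved
```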
  • supervised learning - Can be viewed as mapping X to y, or predicting y from X. You have a bunch of data examples where X and y are known, and you are trying to learn a good (predictive) mapping.
  • unsupervised learning - learn structure in a set of data. Basically you only have X and you want to learn something about X. Examples include PCA and clustering.
  • multivariate statistics
  • observation matrix - 2D matrix where you have observations x features. Each row is a "subject" (or "trials" or repeated experiments), each column is some measured property.
  • feature - sometimes referred to generically as "dimensions". For example, 2 features => 2 dimensional space. You can think of your observation matrix as just a bunch of points in high-dimensional feature space.
  • distance - there are different metrics for quantifying distances in high-dimensional feature space (e.g., Euclidean distance, Manhattan distance, cosine distance, correlation distance (1-r), Mahalanobis distance, etc.)
  • distance matrix - a square matrix (observations x observations) that quantifies the distance between all pairs of observations.
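A distance matrix via broadcasting (NumPy; Euclidean distance, with toy data chosen to give clean values):

```python
import numpy as np

# 4 observations x 2 features
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [0.0, 4.0],
              [3.0, 0.0]])

# Pairwise Euclidean distances: broadcast to a (4, 4, 2) array of
# coordinate differences, then reduce over the feature axis
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(D.shape)  # (4, 4): observations x observations
print(D[0, 1])  # 5.0 (a 3-4-5 triangle)
```

The result is symmetric with zeros on the diagonal, as any distance matrix should be.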
  • similarity vs. dissimilarity - one is just the flip side of the other (high similarity means low dissimilarity), and in different situations, people use one or the other.
  • statistical significance
  • t-test
  • two-sample t-test
  • paired t-test
  • null hypothesis
  • ANOVA - one-way ANOVA is just the t-test extended to more than two groups
  • alpha - the significance threshold; typically, 0.05
  • effect size - refers to the population (the system you are trying to measure and characterize)... it's the size of the effect. For example, a 30% increase in BOLD response from condition A (1% BOLD increment above baseline) to condition B (1.3% BOLD increment above baseline); alternatively, you could quantify the effect size as 0.3% BOLD. Importantly, the effect size is a property of the population and is independent of how you choose to sample it.
  • Type I error - rejecting the null when in fact the null hypothesis is correct
  • Type II error - not rejecting the null when in fact the null hypothesis is incorrect
  • power - in a given scenario (e.g. for a certain sample size, for a certain effect size), the probability of rejecting the null hypothesis (when it is in fact incorrect)
  • null hypothesis statistical testing (NHST) - Under the null hypothesis, how likely is our current observation? The answer is the p-value (i.e. the statistical significance level). If the p-value is lower than some specified (arbitrary) alpha level, we REJECT the null hypothesis. If the p-value is relatively high (e.g. 0.3), then we just fail to reject the null hypothesis. Technically, we are not really proffering evidence of the truth of the null hypothesis.
  • permutation approaches - non-parametric methods for calculating/determining statistical significance.
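A minimal permutation test sketch for a two-group mean difference (NumPy; the group sizes, true effect size, and 10,000-permutation count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 50)
group_b = rng.normal(1.0, 1.0, 50)  # true mean difference of 1

observed = group_b.mean() - group_a.mean()

# Under the null, group labels are exchangeable: shuffle the labels many
# times and count how often a shuffled difference is at least as extreme.
pooled = np.concatenate([group_a, group_b])
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    perm_diff = pooled[50:].mean() - pooled[:50].mean()
    if abs(perm_diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(p_value)  # small: the observed difference is unlikely under the null
```

No distributional assumptions were needed, which is the "non-parametric" part.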
  • regression - one or more predictor variables ("regressors") are used to "predict" an outcome variable ("the data"), where the outcome variable is treated as continuous
  • model fitting / estimation
  • residual sum of squares (RSS) - typically, regression is aimed at tweaking the parameters of a model such that RSS is minimal
  • likelihood - probability of a data point (or multiple data points) given some probabilistic model
  • estimation - a process for determining the value of some unknown parameter based on a set of data
  • maximum likelihood estimation (MLE) - one flavor of method for estimation, where you choose the value that maximizes the likelihood of the data. Note that MLE for Gaussian noise assumptions is equivalent to finding parameters that minimize squared error.
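A sketch of the MLE / least-squares equivalence under Gaussian noise (NumPy; the model, noise level, and grid-search range are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 40)

# Least-squares fit: minimizes the sum of squared residuals
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# Gaussian log-likelihood of the data as a function of (slope, intercept);
# with sigma fixed, it is just a negative scaled RSS plus a constant
def loglik(b, sigma=0.1):
    resid = y - (b[0] * x + b[1])
    return -0.5 * np.sum(resid ** 2) / sigma ** 2 - len(y) * np.log(sigma)

# A tiny grid search over slopes confirms the least-squares slope also
# maximizes the likelihood
slopes = np.linspace(slope - 0.5, slope + 0.5, 201)
best = slopes[np.argmax([loglik((s, intercept)) for s in slopes])]
print(abs(best - slope) < 0.01)  # True: MLE and least squares agree
```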
  • maximum a posteriori (MAP) estimation - if you impose a prior (a la Bayesian statistics), this is a different flavor of estimation method where your MAP estimate is the peak of the posterior distribution.
  • point estimation - just one value as your guess of the parameter
  • interval estimation - providing a range (like a confidence interval) for your guess of the parameter
  • classification - the outcome variable is discrete rather than continuous, consisting of "classes" or "categories". Note that 'logistic regression' is actually a method for classification.
  • methods to manipulate your data (see  🔧Resampling techniques  for more information)
  • permutation - usually associated with NHST
  • bootstrapping - a cool, conceptually easy, computationally demanding non-parametric method for assessing the reliability of your data measures
  • cross-validation - testing a model on out-of-sample data (i.e. data to which your model was blind (it had no access)). Specifically: take the data you have, split it into two groups, train and test; fit the model on train and evaluate it on test. Good for obtaining unbiased assessments of model accuracy and, therefore, for comparing models.
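Bootstrap and cross-validation sketches (NumPy; the statistic, split sizes, and slope-only model are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(5.0, 2.0, 200)

# --- Bootstrap: resample with replacement, recompute the statistic each time ---
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(2000)
])
# The spread of the bootstrap distribution quantifies reliability
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap 95% CI for the median: [{ci_low:.2f}, {ci_high:.2f}]")

# --- Cross-validation: fit on train, evaluate on held-out test data ---
x = np.linspace(0, 1, 100)
y = 3.0 * x + rng.normal(0, 0.2, 100)
idx = rng.permutation(100)
train, test = idx[:70], idx[70:]
slope = (x[train] @ y[train]) / (x[train] @ x[train])  # slope-only fit on train
test_mse = np.mean((y[test] - slope * x[test]) ** 2)   # out-of-sample error
print(test_mse)
```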