Basic linear algebra:
  • Vector (an ordered list of numbers)
  • Matrix multiplication (just a bunch of dot products)
  • High dimensions
  • Vector length (L2 norm, Euclidean distance, vector magnitude)
  • Vector direction
  • Unit length (vector length is 1)
  • In high dimensions, points are sparse and volumes grow like crazy
  • subspace
  • span
  • basis vectors
  • weighted sums (linear combinations)
  • dot product
  • orthogonal (perpendicular) <=> dot product of two vectors is 0
  • projection
  • Pearson's correlation
  • mean subtraction
  • unit-length normalization
  • z-scoring (mean subtraction + standardization)
  • the dot product of two unit-length mean-subtracted vectors is EQUAL to Pearson's correlation
  • cosine similarity (cos of the angle that two vectors make)
  • cos(theta) = -1 (when theta = 180)
  • cos(theta) = 1 (when theta = 0)
  • cos(theta) = 0 (when theta = 90)
  • If two vectors are each mean-subtracted, cosine similarity is IDENTICAL to Pearson correlation
  • The reason that cosine similarity is a useful metric beyond Pearson correlation is that sometimes you don't want to mean subtract your two vectors.
  • orthogonalization
  • orthogonalize B with respect to A (i.e. calculate [B minus the projection of B onto A]). After orthogonalization, the orthogonalized B is now perpendicular to A
  • alternative interpretation: in a regression sense, fit B (data) with A (predictor) and then calculate the residuals (data minus the model fit)
  • residuals (the difference between the data and the model fit)
  • linear regression - using weighted sums of predictors to approximate (explain) some target data
  • sometimes you hear about the distinction between multiple linear regression (where more than one predictor is used) and simple linear regression (where one predictor is used)
  • regression (y = Xh + n)
  • The least-squares solution for the case of a single predictor that has already been normalized to unit length is simply the dot product of that predictor with the data!
  • OLS (ordinary least-squares) solution: h_estimate = inv(X'*X)*X'*y
  • model fit = X*h_estimate
  • residuals = data - model fit
  • If two variables are mean-subtracted and unit-length normalized, then Pearson's correlation is numerically identical to the least-squares solution for the regression problem that attempts to explain one variable in terms of the other variable
  • This is because y = xh + n => h_estimate = inv(x'*x)*x'*y = inv(1)*x'*y = x'*y
  • The mean can be thought of as regression: a single predictor consisting of all ones can be fit to a set of values, assuming minimization of squared error, and the answer is identical to the mean
  • The median can also be thought of as regression: a single predictor consisting of all ones can be fit to a set of values, assuming minimization of absolute error, and the answer is identical to the median
  • If X is orthonormal <=> X'*X is the identity matrix and h_estimate = X'*y
  • orthogonal predictors => X'*X is a diagonal matrix
  • orthogonal predictors + unit length => X'*X is the identity matrix
  • If two predictors are identical, there is no unique regression solution
  • If two predictors are highly correlated (but not identical), then there is a unique regression solution, but betas on predictors are going to be highly unstable (and they trade off with one another)
  • To orthogonalize x_2 with respect to x_1 means to "fit x_2 (data) with x_1 (predictor) and subtract the model fit"
  • "fit X and subtract the model fit" - achieved by multiplying the data by the residual-forming matrix I - X*inv(X'*X)*X'
  • The least-squares coefficient associated with one predictor in a multiple linear regression model is IDENTICAL to what you would get if you orthogonalized that predictor with respect to all the other predictors and then fit the data using that orthogonalized predictor.
  • Sequential fitting is ... (meaning, we fit using x_1 first, then analyze the residuals)
  • Nested models...
  • Gram-Schmidt process - takes a set of vectors and makes them all orthogonal to one another. It is simple: just orthogonalize the second vector with respect to the first one, then orthogonalize the third vector with respect to the first two, and so forth.
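The claimed equivalence between the dot product of unit-length mean-subtracted vectors, cosine similarity, and Pearson's correlation can be checked numerically. A minimal NumPy sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100)
b = 0.5 * a + rng.standard_normal(100)  # correlated with a

def unitize(v):
    """Mean-subtract, then scale to unit length."""
    v = v - v.mean()
    return v / np.linalg.norm(v)

# dot product of the two unit-length, mean-subtracted vectors
dot = unitize(a) @ unitize(b)

# Pearson's correlation computed directly
r = np.corrcoef(a, b)[0, 1]

# cosine similarity of the mean-subtracted vectors
am, bm = a - a.mean(), b - b.mean()
cos_sim = (am @ bm) / (np.linalg.norm(am) * np.linalg.norm(bm))
```

All three quantities agree up to floating-point error, which is exactly the point of the bullets above: correlation is just a dot product after the right normalization.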
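The two interpretations of orthogonalization above (projection subtraction vs. regression residuals) can be sketched and shown to coincide; A and B here are just random example vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal(50)
B = rng.standard_normal(50)

# geometric view: subtract the projection of B onto A
proj = (A @ B) / (A @ A) * A
B_orth = B - proj  # now perpendicular to A

# regression view: fit B (data) with A (predictor), keep the residuals
h = (A @ B) / (A @ A)      # least-squares coefficient for a single predictor
residuals = B - h * A      # data minus model fit
```

`B_orth` and `residuals` are the same vector, and its dot product with A is (numerically) zero.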
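The OLS recipe above (h_estimate, model fit, residuals) and the residual-forming matrix can be sketched directly; the sizes and noise level here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.standard_normal((n, p))                 # predictors (columns of X)
h_true = np.array([2.0, -1.0, 0.5])
y = X @ h_true + 0.1 * rng.standard_normal(n)   # y = X*h + noise

# OLS solution: h_estimate = inv(X'*X)*X'*y
h_estimate = np.linalg.inv(X.T @ X) @ (X.T @ y)

model_fit = X @ h_estimate
residuals = y - model_fit

# "fit X and subtract the model fit" as a single matrix multiplication:
# the residual-forming matrix R = I - X*inv(X'*X)*X'
R = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
residuals_via_R = R @ y
```

A defining property of least squares falls out of this: the residuals are orthogonal to every predictor (X'*residuals = 0). (In production code `np.linalg.lstsq` is preferred over an explicit inverse; the inverse is used here only to mirror the formula in the notes.)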
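The "mean and median as regression" bullets can be verified with an all-ones predictor; the grid search for the absolute-error minimizer is a brute-force illustration, not how you would solve this in practice:

```python
import numpy as np

rng = np.random.default_rng(3)
y = 3.0 * rng.standard_normal(101) + 5.0   # an odd number of values
ones = np.ones_like(y)                     # a single all-ones predictor

# least-squares fit of the all-ones predictor: h = inv(1'*1)*1'*y = sum(y)/n
h_ls = (ones @ y) / (ones @ ones)          # equals the mean

# minimizing *absolute* error instead: a fine grid search lands on the median
grid = np.linspace(y.min(), y.max(), 10001)
abs_err = np.abs(y[None, :] - grid[:, None]).sum(axis=1)
h_l1 = grid[np.argmin(abs_err)]            # equals the median (up to grid spacing)
```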
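The bullet about the multiple-regression coefficient being identical to the fit of the orthogonalized predictor (the Frisch–Waugh–Lovell result) can be demonstrated on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = rng.standard_normal((n, 3))
y = rng.standard_normal(n)

# full multiple regression: coefficient on the first predictor
h_full = np.linalg.inv(X.T @ X) @ (X.T @ y)
beta_0 = h_full[0]

# orthogonalize predictor 0 with respect to the other predictors
others = X[:, 1:]
fit = others @ np.linalg.inv(others.T @ others) @ (others.T @ X[:, 0])
x0_orth = X[:, 0] - fit

# simple regression of y on the orthogonalized predictor gives the same beta
beta_0_orth = (x0_orth @ y) / (x0_orth @ x0_orth)
```

This is why correlated predictors trade off with one another: each coefficient only reflects the part of its predictor that the other predictors cannot explain.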
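The Gram-Schmidt recipe in the last bullet (orthogonalize each vector against all earlier ones) is only a few lines; a minimal sketch without the optional unit-length normalization:

```python
import numpy as np

def gram_schmidt(V):
    """Orthogonalize the columns of V, in order, against the earlier columns."""
    U = V.astype(float).copy()
    for j in range(U.shape[1]):
        for i in range(j):
            # orthogonalize column j with respect to column i
            U[:, j] -= (U[:, i] @ U[:, j]) / (U[:, i] @ U[:, i]) * U[:, i]
    return U

rng = np.random.default_rng(5)
V = rng.standard_normal((20, 4))
U = gram_schmidt(V)
G = U.T @ U   # off-diagonal entries should be (numerically) zero
```

Dividing each resulting column by its norm would additionally make the set orthonormal, so that U'*U is the identity matrix (the orthonormal case from the regression bullets above).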