Scatter plots and low-dimensional regression

Introduction:

  • Why are scatter plots so useful? From a visualization point of view, they show you all the data.
  • Scatter plots have a close linkage to model building
  • Scatter plots have a close linkage to nonparametric methods
  • Scatter plots force you to think deeply about "what is error?"
  • Recall distinction between marginal distribution and joint distribution
  • (Hint: see the recording on 🏄‍♀️Special Topics for figures and demonstrations)

Issues:

  • Fitting a line in a scatter plot
    • Standard least-squares regression.
    • Simple line fitting is a two-parameter multiple regression (the predictors are x and a constant term): y = ax + b.
    • What is error? OLS implies squared errors (vertical distances) and that the error is in the y-variable. The line minimizes the sum of the squares of these residuals.
    • Goodness-of-fit: R². Conceptually, this is the size of your residuals (in explaining y) compared to the variance of y.
    • Model reliability / parameter reliability [e.g. bootstrapping]
    • Model selection [e.g. cross-validation, statistical significance of higher-order terms, AIC/BIC]. One approach is to build nested models...
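To make the pieces above concrete, here is a minimal sketch with NumPy: a two-parameter fit y = ax + b, the R² computation, and a bootstrap confidence interval for the slope. The synthetic data, noise level, and bootstrap count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (purely illustrative): y = 2x + 1 plus Gaussian noise in y.
x = rng.uniform(0, 10, 200)
y = 2 * x + 1 + rng.normal(0, 1.5, 200)

# Two-parameter least squares: y = a*x + b.
a, b = np.polyfit(x, y, 1)

# R²: size of the residuals relative to the total variance in y.
resid = y - (a * x + b)
r2 = 1 - resid.var() / y.var()

# Parameter reliability via bootstrapping: resample (x, y) pairs with replacement
# and refit; the spread of refit slopes gives a confidence interval.
slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(x), len(x))
    slopes.append(np.polyfit(x[idx], y[idx], 1)[0])
ci = np.percentile(slopes, [2.5, 97.5])

print(f"a={a:.2f}, b={b:.2f}, R2={r2:.2f}, slope 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

Note that the bootstrap here resamples whole (x, y) pairs, which matches the "fresh sampling of noise" view of independence discussed below.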
  • Alternatives that go beyond vanilla line fitting:
    • Robust regression (median, not mean) [e.g. "median absolute deviation"]
    • Error in two dimensions (see below)
    • Fit relationships that are more complex than straight lines
    • You can go nonparametric (binned scatter plot, local regression)
    • Mixed-effects models
    • Bayesian parameter estimation
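As one concrete robust alternative (a sketch, not the only option): the Theil-Sen estimator replaces the mean-based OLS slope with the median of all pairwise slopes. The data and the outlier corruption below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data with outliers: OLS gets pulled, a median-based fit does not.
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)
y[:5] += 30  # corrupt a few points

# Theil-Sen estimator: median (not mean) of all pairwise slopes.
i, j = np.triu_indices(len(x), k=1)
slopes = (y[j] - y[i]) / (x[j] - x[i])
a_robust = np.median(slopes)
b_robust = np.median(y - a_robust * x)

# Ordinary least squares for comparison.
a_ols = np.polyfit(x, y, 1)[0]
print(f"OLS slope: {a_ols:.2f}, Theil-Sen slope: {a_robust:.2f}")
```

The median makes the fit insensitive to a modest fraction of wild points, at the cost of computing all O(n²) pairwise slopes.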
  • The issue of independence
    • Are the data points in your scatter plot independent, and if so, what does that independence reflect? One way to think about it: does each dot reflect a fresh sampling of the noise?
    • Note that a different issue is the independence (or dependence) of the two variables we are plotting. That is ultimately what we are usually trying to understand when we make a scatter plot.
  • Errors in two dimensions
    • Fundamentally different!
    • Known as "Deming regression"
    • Might be useful.
    • One drawback is that it can be more CPU-intensive: with a known ratio of error variances there is a closed-form solution, but more general versions require iterative methods.
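A sketch of the equal-error-variance case, which does have a closed form. The simulated noise levels are assumptions, chosen so that noise in x visibly attenuates the OLS slope while the Deming fit does not.

```python
import numpy as np

rng = np.random.default_rng(2)

# True relationship y = 2x + 1, with noise added to BOTH x and y (equal variances).
x_true = rng.uniform(0, 10, 300)
x = x_true + rng.normal(0, 0.8, 300)
y = 2 * x_true + 1 + rng.normal(0, 0.8, 300)

# Deming regression with error-variance ratio delta = 1 (equal errors):
# the slope has a closed form in the sample variances and covariance.
sxx = np.var(x, ddof=1)
syy = np.var(y, ddof=1)
sxy = np.cov(x, y)[0, 1]
slope = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
intercept = y.mean() - slope * x.mean()

# OLS, by contrast, attenuates the slope when x is noisy.
slope_ols = np.polyfit(x, y, 1)[0]
print(f"Deming slope: {slope:.2f}, OLS slope: {slope_ols:.2f}")
```

The attenuation of the OLS slope here is the classic "errors-in-variables" bias: noise in x flattens the fitted line, which is exactly what treating error in two dimensions corrects.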
  • Moving to nonlinear relationships. What are our strategies?
    • Binned scatter plot. (Requires some choice of bin size.)
    • Higher-order polynomials or other nonlinear transformations of your predictors
    • The Fourier transform is one choice for parameterizing the x-axis. If you smooth your data (i.e. delete high frequencies), this can be viewed as fitting a nonlinear function to your data (i.e. using a basis set of sinusoids that exist only at low frequencies).
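A minimal sketch of that idea (the grid size, the signal, and the cutoff of 10 frequencies are all illustrative assumptions; note that the FFT view assumes evenly spaced x):

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of a smooth signal on a regular grid.
n = 256
t = np.linspace(0, 1, n, endpoint=False)
y = np.sin(2 * np.pi * 3 * t) + rng.normal(0, 0.5, n)

# "Fit" by deleting high frequencies: keep only the lowest 10 frequency components.
coef = np.fft.rfft(y)
coef[10:] = 0
y_smooth = np.fft.irfft(coef, n)

# The result is the least-squares projection of y onto low-frequency sinusoids.
err = np.sqrt(np.mean((y_smooth - np.sin(2 * np.pi * 3 * t)) ** 2))
print(f"RMS error after smoothing: {err:.3f}")
```

Keeping 10 of 129 frequency components discards most of the noise power while retaining the (low-frequency) signal, so the smoothed curve tracks the true function closely.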
    • Local regression (a.k.a. LOWESS = locally weighted scatterplot smoothing)
      • A CPU-intensive, data-driven method that flexibly allows any shape of model.
      • Basically, you fit a simple model (e.g. a linear model) to local windows of your data.
      • Window size is a major choice. The choice can be viewed in terms of bias and variance.
      • Pros: minimal assumptions, elegant
      • Cons: CPU-intensive; breaks down (poor performance) in high-dimensional situations.
      • Fieldmap regularization is an example of local regression in 3D.
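The local-regression recipe can be sketched by hand in a few lines. This hand-rolled version is illustrative only (it omits the robustness iterations of full LOWESS; libraries such as statsmodels provide production implementations), and the test signal is an assumption.

```python
import numpy as np

def lowess(x, y, frac=0.3):
    """Minimal local linear regression sketch (no robustness iterations).

    For each point, fit a weighted straight line to its nearest neighbors,
    with tricube weights that fall off with distance. `frac` is the window
    size: the fraction of the data used per local fit (the bias/variance knob).
    """
    n = len(x)
    k = max(2, int(frac * n))
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]                       # the k nearest neighbors
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube weights
        # Weighted least squares for a local line a*x + b.
        A = np.vstack([x[idx], np.ones(k)]).T * np.sqrt(w)[:, None]
        a, b = np.linalg.lstsq(A, y[idx] * np.sqrt(w), rcond=None)[0]
        fitted[i] = a * x[i] + b
    return fitted

# Illustrative use: recover a sine wave from noisy samples.
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)
y_hat = lowess(x, y, frac=0.2)
```

Shrinking `frac` lowers bias (the local line tracks curvature better) but raises variance (fewer points per window), which is the bias/variance trade-off named above.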
  • Data sampling and impact on model fitting