STAT 211 Topic 10

Thursday, April 14, 2011


Simple Linear Regression

Predicting linear relationships between two variables $x$ and $Y$: $Y = \beta_0 + \beta_1 x + \varepsilon$, where $Y$ and $\varepsilon$ are random variables (rvs).

Goal: Find the line such that the sum of squared vertical distances between the predicted and observed values is minimized (squaring keeps every distance term positive).

What the line tells us:

  • Estimates/predicts values of y (interpolation within the observed range of x; extrapolation outside of that range)
  • Describes the rate of change of y per unit change in x (slope)

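Regression Model Equation: $Y = \beta_0 + \beta_1 x + \varepsilon$
  • $\beta_0$ and $\beta_1$ are assumed to be unknown fixed parameters; $x$ is treated as a fixed (non-random) value.
  • ε Error variable: $\varepsilon \sim N(0, \sigma^2)$ ($\sigma^2$ is unknown)
  • Therefore $Y \sim N(\beta_0 + \beta_1 x, \sigma^2)$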

We have observed values of $x$ and $y$ (sample data), but we need estimates for $\beta_0$ and $\beta_1$.


Least Squares

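Given points $(x_1, y_1), \dots, (x_n, y_n)$, our regression model is: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Squared distance between an observed point and the line is given by $(y_i - (\beta_0 + \beta_1 x_i))^2$. Therefore the sum of squared differences is: $f(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

Find $\beta_0$ and $\beta_1$ such that $f(\beta_0, \beta_1)$ is minimized. (Take the partial derivative with respect to each parameter, equate to zero, and solve.)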
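The minimization step can be written out as a short worked derivation. Here $f(\beta_0, \beta_1)$ denotes the sum of squared differences, and $S_{xx}$, $S_{xy}$ are shorthand introduced for this sketch:

```latex
f(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2

% Set both partial derivatives to zero
\frac{\partial f}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
\frac{\partial f}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0

% Rearranged: the normal equations
n \beta_0 + \beta_1 \sum x_i = \sum y_i
\beta_0 \sum x_i + \beta_1 \sum x_i^2 = \sum x_i y_i

% Solving the pair, with S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) and S_{xx} = \sum (x_i - \bar{x})^2
\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```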


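Estimated Solutions (both are normally distributed regardless of sample size, because the errors ε are assumed to be normal):

$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$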

Therefore, the final equation becomes $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

The fitted (predicted) value of $y_i$ is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$


Residuals (observed − fitted; $e_i = y_i - \hat{y}_i$) can be plotted to check how well the line fits.
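The least squares fit, fitted values, and residuals can be sketched in a few lines of plain Python. The function name `fit_line` and the five data points are made up for illustration; they are not from the lecture.

```python
def fit_line(xs, ys):
    """Return (b0, b1): the least squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # S_xx and S_xy: sums of squared / cross deviations from the means
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]               # fitted values on the line
residuals = [y - f for y, f in zip(ys, fitted)]  # observed minus fitted

print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("residuals:", [round(e, 4) for e in residuals])
```

For a least squares line with an intercept, the residuals sum to zero (up to floating-point rounding), which is a quick sanity check on the fit.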


Most statistical software reports a coefficient of determination, $r^2 = 1 - SSE/SST$ (the proportion of the variation in y explained by the line). This value is always between 0 and 1, and a value close to 1 implies a better-fitting line.
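As a rough illustration, the coefficient of determination can be computed by hand as 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of y from its mean. The helper name `r_squared` and the data are assumptions for this sketch.

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # SSE: variation left over after fitting the line
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # SST: total variation of y around its mean
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical sample data
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print("r^2 =", round(r2, 4))
```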


Tests for β1

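Since we don't know σ, we substitute an estimate $s$ (see below). The standardized slope estimate

$T = \frac{\hat{\beta}_1 - \beta_1}{s / \sqrt{S_{xx}}}$, where $S_{xx} = \sum (x_i - \bar{x})^2$,

then has a t distribution with df $= n - 2$.

Therefore, we can compute confidence intervals, $\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot s / \sqrt{S_{xx}}$, and run a t-test for $H_0: \beta_1 = \beta_{10}$.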
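A minimal sketch of the slope test and confidence interval follows, assuming the usual standard error s / sqrt(S_xx) with s² = SSE / (n − 2). The critical value is passed in by hand (3.182 is the t-table value for α = 0.05 with 3 degrees of freedom); the function name and data are hypothetical.

```python
import math


def slope_inference(xs, ys, t_crit, beta_10=0.0):
    """Return (t_stat, ci_low, ci_high) for the slope beta_1.

    t_crit is the t critical value t_{alpha/2, n-2}, looked up by the caller.
    beta_10 is the hypothesized slope under H0.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))   # estimate of sigma
    se_b1 = s / math.sqrt(s_xx)    # estimated standard error of the slope
    t_stat = (b1 - beta_10) / se_b1
    return t_stat, b1 - t_crit * se_b1, b1 + t_crit * se_b1


# Hypothetical data: n = 5, so df = 3; 3.182 is the table value for alpha = 0.05
t_stat, ci_low, ci_high = slope_inference(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_crit=3.182
)
print("t =", round(t_stat, 3), "CI =", (round(ci_low, 3), round(ci_high, 3)))
```

H0 is rejected at level α when |t| exceeds the critical value, which is equivalent to the confidence interval excluding β10.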


Estimating σ²

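The minimum value of the sum of squared differences is the Error Sum of Squares:

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

To obtain an estimate for σ², we divide SSE by the remaining degrees of freedom (after estimating β0 and β1):

$\hat{\sigma}^2 = s^2 = \frac{SSE}{n - 2}$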

ANOVA for Regression

Source      Sum of Squares  Degrees of Freedom  Mean Square          f statistic
Regression  SSR             1                   MSR = SSR / 1        MSR / MSE
Error       SSE             n − 2               MSE = SSE / (n − 2)
Total       SST             n − 1
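The table entries can be computed directly from the data. This sketch assumes the usual degrees of freedom for simple linear regression (1 for regression, n − 2 for error) and uses SST = SSR + SSE; the function name `anova_table` and the data are made up.

```python
def anova_table(xs, ys):
    """Return the ANOVA quantities for a simple linear regression as a dict."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sst = sum((y - y_bar) ** 2 for y in ys)                      # total
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # error
    ssr = sst - sse                                              # regression
    msr = ssr / 1
    mse = sse / (n - 2)  # also the estimate of sigma^2
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "f": msr / mse}


# Hypothetical sample data
table = anova_table([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(name, round(value, 4))
```

In simple linear regression the f statistic equals the square of the t statistic for testing β1 = 0, so the two tests agree.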


Predicting with Regression

Suppose we have a new value that we want to estimate: x*

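Two possible ways to calculate this:

  1. mean response (only think of the value of the regression model line): $E(Y \mid x^*) = \beta_0 + \beta_1 x^*$
  2. prediction (include error variable ε): $Y^* = \beta_0 + \beta_1 x^* + \varepsilon$

In both cases we plug the new value into the fitted equation to obtain the same point estimate, $\hat{\beta}_0 + \hat{\beta}_1 x^*$; the two differ only in the width of the interval around it.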

Mean Response

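Here we estimate only the height of the regression line at x*, so the variance of ε for a single new observation is disregarded. The only uncertainty comes from estimating β0 and β1, which gives a narrower interval.

Calculate a 100(1−α)% confidence interval for the mean response with:

$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$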

Prediction

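Because the regression model has the ε term, a single new observation varies around the line, so the 100(1−α)% prediction interval is wider than the confidence interval for the mean response:

$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$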
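To compare the two intervals at a new value x*, the sketch below computes both half-widths from the standard formulas: the mean-response interval uses 1/n + (x* − x̄)²/S_xx under the square root, and the prediction interval adds 1 to that quantity. The function name, the data, x* = 3.5, and the hand-supplied critical value 3.182 (α = 0.05, df = 3) are all assumptions for illustration.

```python
import math


def intervals_at(xs, ys, x_star, t_crit):
    """Return (y_hat, ci_half, pi_half) at x_star.

    ci_half: half-width of the confidence interval for the mean response.
    pi_half: half-width of the prediction interval for one new observation.
    t_crit:  t critical value t_{alpha/2, n-2}, looked up by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_star
    # Variance factor from estimating the line at x_star
    line_term = 1 / n + (x_star - x_bar) ** 2 / s_xx
    ci_half = t_crit * s * math.sqrt(line_term)
    pi_half = t_crit * s * math.sqrt(1 + line_term)  # extra 1 accounts for epsilon
    return y_hat, ci_half, pi_half


# Hypothetical data and new value
y_hat, ci_half, pi_half = intervals_at(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x_star=3.5, t_crit=3.182
)
print("estimate:", round(y_hat, 3))
print("mean response: +/-", round(ci_half, 3))
print("prediction:    +/-", round(pi_half, 3))
```

Both intervals are centered on the same point estimate and are narrowest at x* = x̄; the prediction interval is always the wider of the two.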

Correlation Coefficient

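Shows how strongly two random variables are linearly related:

$\rho = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$, with $-1 \le \rho \le 1$

where Covariance is $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y$

Note: If X and Y are independent, then the correlation is 0 (the converse does not hold in general).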


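For a population where we do not know the joint pdf ƒ(x, y), we can estimate the correlation with the sample correlation coefficient:

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$, where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum (x_i - \bar{x})^2$, and $S_{yy} = \sum (y_i - \bar{y})^2$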
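A short sketch of the sample correlation coefficient, computed from sums of deviations about the sample means. The function name `sample_correlation` and the data are hypothetical.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical sample data
r = sample_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print("r =", round(r, 4))
```

In simple linear regression, squaring r gives the coefficient of determination, and r has the same sign as the fitted slope.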