Thursday, April 14, 2011
Simple Linear Regression
Predicting linear relationships between two variables $x$ and $Y$:

$Y = \beta_0 + \beta_1 x + \varepsilon$, where the error rvs $\varepsilon \sim N(0, \sigma^2)$.
Goal: Find the line such that the sum of squared distances (squaring keeps the distance values positive) between the predicted and observed values is minimized.
What the line tells us:
- Estimates/predicts values (interpolation within the range of the data; extrapolation outside of it)
- Describes the rate of change between the two variables (slope)

Regression Model Equation
- The model is $Y = \beta_0 + \beta_1 x + \varepsilon$; $\beta_0$ and $\beta_1$ are unknown fixed parameters, and $x$ is fixed (not random).
- $\varepsilon$ is the error variable: $\varepsilon \sim N(0, \sigma^2)$ ($\sigma^2$ is unknown)
- Therefore $E[Y] = \beta_0 + \beta_1 x$ and $V(Y) = \sigma^2$
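These two moments can be sanity-checked by simulation. A minimal sketch in plain Python; the parameter values below are made up for illustration:

```python
import random

random.seed(0)  # reproducible draws

# Made-up parameter values for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = 3.0  # a fixed x value

# Draw many realizations of Y = beta0 + beta1*x + eps, eps ~ N(0, sigma^2)
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for _ in range(100_000)]

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
# mean_y should be close to beta0 + beta1*x = 3.5, and var_y close to sigma^2 = 1
```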

We have sample data $(x_i, y_i)$, but we need estimates for $\beta_0$ and $\beta_1$.
Least Squares
Given points $(x_1, y_1), \ldots, (x_n, y_n)$, our regression model is $y = \beta_0 + \beta_1 x$. The squared distance between an observed point and the line is $(y_i - (\beta_0 + \beta_1 x_i))^2$. Therefore the sum of squared differences is:

$f(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2$

Find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that $f$ is minimized. (Take partial derivatives, equate to zero, and solve.)
Estimated solutions (both have a normal distribution regardless of sample size):

$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

Therefore, the final equation becomes $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
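The least-squares estimates are easy to compute by hand. A minimal sketch in plain Python, on a made-up data set:

```python
# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# S_xx = sum of (x_i - x_bar)^2,  S_xy = sum of (x_i - x_bar)(y_i - y_bar)
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

beta1_hat = s_xy / s_xx                 # slope estimate
beta0_hat = y_bar - beta1_hat * x_bar   # intercept estimate
# For this data: beta1_hat = 0.6 and beta0_hat = 2.2
```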
The fitted (predicted) value of $y_i$ is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. The residuals (observed − fitted; $e_i = y_i - \hat{y}_i$) can be plotted to check how well the line fits.
All computer stat systems will return a coefficient of determination $r^2 = 1 - \frac{SSE}{SST}$. This value is always between 0 and 1, and a value close to 1 implies a better-fitting line.
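A sketch of computing the residuals and the coefficient of determination by hand, on a made-up data set:

```python
# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]                 # y-hat values
residuals = [y - f for y, f in zip(ys, fitted)]    # observed - fitted

sse = sum(e ** 2 for e in residuals)               # error sum of squares
sst = sum((y - y_bar) ** 2 for y in ys)            # total sum of squares
r_squared = 1 - sse / sst
# r_squared = 0.6 here: the line explains 60% of the variation in y
```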
Tests for β1
Since we don't know $\sigma$, we substitute the estimate $s$ (see below); then the standardized statistic

$T = \dfrac{\hat{\beta}_1 - \beta_1}{s / \sqrt{S_{xx}}}$

has a t distribution with $n - 2$ degrees of freedom. Therefore, we can build confidence intervals $\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \cdot s / \sqrt{S_{xx}}$ and run t-tests for $H_0\colon \beta_1 = \beta_{10}$.
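A sketch of the slope test on a made-up data set; the critical value $t_{0.025,3} \approx 3.182$ is looked up in a t table:

```python
import math

# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))        # estimate of sigma

se_b1 = s / math.sqrt(s_xx)         # estimated standard error of the slope
t_stat = b1 / se_b1                 # test statistic for H0: beta1 = 0

t_crit = 3.182                      # t_{0.025, n-2} from a table (n - 2 = 3 df)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
# t_stat is about 2.12 < 3.182, so H0: beta1 = 0 is not rejected at the 5% level
```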
Estimating σ2
The minimum value of $f(\beta_0, \beta_1)$ is the Error Sum of Squares: $SSE = \sum (y_i - \hat{y}_i)^2$. To obtain an estimate for $\sigma^2$, we divide SSE by the remaining degrees of freedom (after estimating $\beta_0$ and $\beta_1$):

$\hat{\sigma}^2 = s^2 = \dfrac{SSE}{n - 2}$
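A short sketch of the estimate on a made-up data set:

```python
# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sigma2_hat = sse / (n - 2)   # s^2 = SSE / (n - 2): two df spent on beta0, beta1
```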
ANOVA for Regression
Source | Sum of Squares | Degrees of Freedom | Mean Square | f statistic
Regression | SSR | 1 | MSR = SSR / 1 | MSR / MSE
Error | SSE | n − 2 | MSE = SSE / (n − 2) |
Total | SST | n − 1 | |

Note that $SST = SSR + SSE$.
Predicting with Regression
Suppose we have a new value $x^*$ at which we want to estimate the response.
2 possible ways to calculate this:
- mean response (only the value of the regression model line): $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 x^*$
- prediction (include the error variable $\varepsilon$): $Y = \hat{\beta}_0 + \hat{\beta}_1 x^* + \varepsilon$

Basically, we plug the new $x^*$ value into the model equation to obtain a $\hat{y}$ estimate.
Mean Response
We want to narrow the range and reduce the variance as much as possible, so we disregard the variance of $\varepsilon$. Calculate a $100(1-\alpha)\%$ confidence interval for the mean response with:

$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s \sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}$
Prediction
Because the regression model has the $\varepsilon$ term, the prediction is actually within a range. The $100(1-\alpha)\%$ prediction interval is:

$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\, n-2} \cdot s \sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar{x})^2}{S_{xx}}}$
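A sketch of both interval half-widths at a new point $x^* = 4$, on a made-up data set; the critical value $t_{0.025,3} \approx 3.182$ is looked up in a t table:

```python
import math

# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
b0 = y_bar - b1 * x_bar

s = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
t_crit = 3.182                      # t_{0.025, n-2} from a table (n - 2 = 3 df)

x_star = 4.0
y_hat = b0 + b1 * x_star            # point estimate, same for both intervals

# Mean response: variance of eps is disregarded
half_ci = t_crit * s * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
# Prediction: the extra "1 +" accounts for the variance of eps
half_pi = t_crit * s * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / s_xx)
# The prediction interval is always wider than the confidence interval
```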
Correlation Coefficient
Shows how strongly related two random variables are:

$\rho = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

Note: if $X$ and $Y$ are independent, then the correlation is 0, where the covariance is

$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$

For a population where we do not know the pdf ($f(x, y)$), we can estimate the sample correlation coefficient using

$r = \dfrac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}$
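A sketch of the sample correlation coefficient on a made-up data set:

```python
import math

# Hypothetical sample data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / math.sqrt(s_xx * s_yy)   # r = S_xy / sqrt(S_xx * S_yy)
# r is about 0.775: a moderately strong positive linear relationship
```

Note that squaring this $r$ gives the coefficient of determination $r^2$ from the fitted line.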