Thursday, April 14, 2011
Simple Linear Regression
Predicting linear relationships between two variables $x$ and $Y$: $Y = \beta_0 + \beta_1 x + \varepsilon$, where the error $\varepsilon$ is a random variable with $\varepsilon \sim N(0, \sigma^2)$.
Goal: Find the line $y = \hat\beta_0 + \hat\beta_1 x$ such that the sum of the squared distances (squaring keeps distances positive) between the predicted and observed values is minimized.
What the line tells us:
- Estimates/predicts values of $Y$ (interpolation within the observed range of $x$; extrapolation outside that range)
- Describes the rate of change of $Y$ with respect to $x$ (the slope $\beta_1$)
Regression Model Equation
$$Y = \beta_0 + \beta_1 x + \varepsilon$$
- $\beta_0$ and $\beta_1$ are unknown fixed parameters; $x$ is a fixed (non-random) value.
- $\varepsilon$ is the error variable: $\varepsilon \sim N(0, \sigma^2)$, where $\sigma^2$ is unknown.
- Therefore $Y \sim N(\beta_0 + \beta_1 x,\ \sigma^2)$, i.e. $E[Y] = \beta_0 + \beta_1 x$ and $V(Y) = \sigma^2$.
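As a quick illustration of these assumptions, here is a minimal Python sketch that simulates responses from the model; the parameter values and $x$ values are hypothetical, chosen only for illustration:

```python
import random

# Hypothetical "true" parameters (unknown in practice)
beta0, beta1, sigma = 2.0, 1.5, 0.8

# Fixed x values at which the response is observed
xs = [1, 2, 3, 4, 5, 6, 7, 8]

random.seed(0)
# Each observed y is the line value plus a N(0, sigma^2) error
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

for x, y in zip(xs, ys):
    print(f"x = {x}, y = {y:.2f}")
```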
We have the sample data (observed $x$ and $y$ values), but we need estimates for $\beta_0$ and $\beta_1$.
Least Squares
Given points $(x_1, y_1), \dots, (x_n, y_n)$, our regression model is:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
The squared distance between an observed and a predicted point is $\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$. Therefore the sum of squared differences is:
$$f(\beta_0, \beta_1) = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$
Find $\hat\beta_0$ and $\hat\beta_1$ such that $f(\beta_0, \beta_1)$ is minimized. (Take the partial derivatives, equate them to zero, and solve.)
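Sketch of that step: setting the partial derivatives of $f$ to zero yields the normal equations, which the estimates below solve.

```latex
\frac{\partial f}{\partial \beta_0} = -2\sum_{i=1}^{n}\bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
  \quad\Longrightarrow\quad n\,\beta_0 + \beta_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i

\frac{\partial f}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\bigl(y_i - \beta_0 - \beta_1 x_i\bigr) = 0
  \quad\Longrightarrow\quad \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
```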
Estimated solutions (both estimators are normally distributed regardless of sample size):
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x$$
Therefore, the final equation becomes
$$\hat y = \hat\beta_0 + \hat\beta_1 x$$
The fitted (predicted) value of $y_i$ is $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$.
Residuals (observed − fitted; $e_i = y_i - \hat y_i$) can be plotted to check how well the line fits.
All statistical software will return a coefficient of determination $r^2 = 1 - \mathrm{SSE}/\mathrm{SST}$. This value is always between 0 and 1; a value close to 1 implies a better-fitting line.
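A minimal Python sketch of these least-squares computations (the data values are made up for illustration):

```python
# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 5.3, 6.2, 8.4, 9.4, 11.2, 12.1, 14.3]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# S_xx and S_xy from the least-squares formulas
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

b1 = Sxy / Sxx          # slope estimate (beta1-hat)
b0 = ybar - b1 * xbar   # intercept estimate (beta0-hat)

fitted = [b0 + b1 * x for x in xs]                    # y-hat_i
residuals = [y - yh for y, yh in zip(ys, fitted)]     # e_i = y_i - y-hat_i

SSE = sum(e ** 2 for e in residuals)                  # error sum of squares
SST = sum((y - ybar) ** 2 for y in ys)                # total sum of squares
r2 = 1 - SSE / SST                                    # coefficient of determination

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, r^2 = {r2:.3f}")
```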
Tests for β1
Since we don't know $\sigma$, we substitute its estimate $s$ (see below); then the standardized statistic
$$T = \frac{\hat\beta_1 - \beta_1}{s / \sqrt{S_{xx}}}$$
follows a $t$ distribution with $n - 2$ degrees of freedom.
Therefore, we can construct confidence intervals and perform a $t$-test for $H_0\colon \beta_1 = \beta_{10}$.
Estimating σ2
The minimum value of $f(\beta_0, \beta_1)$ is the Error Sum of Squares:
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$$
To obtain an estimate for $\sigma^2$, we divide SSE by the remaining degrees of freedom (after estimating $\beta_0$ and $\beta_1$):
$$s^2 = \hat\sigma^2 = \frac{\mathrm{SSE}}{n - 2}$$
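A minimal Python sketch of the estimate $s^2$ and the $t$-based inference for $\beta_1$, reusing the same hypothetical data (scipy is assumed to be available for the $t$ quantile):

```python
from scipy.stats import t

# Hypothetical sample data (same made-up values as above)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 5.3, 6.2, 8.4, 9.4, 11.2, 12.1, 14.3]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s2 = SSE / (n - 2)           # estimate of sigma^2, with n - 2 degrees of freedom
s = s2 ** 0.5

se_b1 = s / Sxx ** 0.5       # estimated standard error of beta1-hat

# Test H0: beta1 = 0 and build a 95% confidence interval
t_stat = (b1 - 0) / se_b1
t_crit = t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
p_value = 2 * t.sf(abs(t_stat), df=n - 2)

print(f"s^2 = {s2:.3f}, t = {t_stat:.3f}, p = {p_value:.4f}, 95% CI = {ci}")
```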
ANOVA for Regression
| Source | Sum of Squares | Degrees of Freedom | Mean Square | f statistic |
|---|---|---|---|---|
| Regression | SSR | 1 | MSR = SSR / 1 | MSR / MSE |
| Error | SSE | n − 2 | MSE = SSE / (n − 2) | |
| Total | SST | n − 1 | | |
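A sketch of the ANOVA quantities for the same hypothetical data, with SSR obtained as SST − SSE (scipy is assumed for the F-distribution p-value):

```python
from scipy.stats import f

# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 5.3, 6.2, 8.4, 9.4, 11.2, 12.1, 14.3]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

fitted = [b0 + b1 * x for x in xs]
SST = sum((y - ybar) ** 2 for y in ys)                  # total sum of squares
SSE = sum((y - yh) ** 2 for y, yh in zip(ys, fitted))   # error sum of squares
SSR = SST - SSE                                         # regression sum of squares

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE
p_value = f.sf(F, 1, n - 2)   # upper-tail probability of the F statistic

print(f"SSR = {SSR:.3f} (df 1), SSE = {SSE:.3f} (df {n - 2}), SST = {SST:.3f} (df {n - 1})")
print(f"F = {F:.3f}, p = {p_value:.4f}")
```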
Predicting with Regression
Suppose we have a new value $x^*$ at which we want to estimate the response.
Two possible ways to calculate this:
- Mean response (consider only the value of the regression line): estimate $E[Y \mid x^*] = \beta_0 + \beta_1 x^*$
- Prediction (include the error variable ε): predict $Y^* = \beta_0 + \beta_1 x^* + \varepsilon$
In both cases, we are basically plugging the new value into the model equation to obtain an estimate $\hat y^* = \hat\beta_0 + \hat\beta_1 x^*$.
Mean Response
Since we are estimating the mean of $Y$ at $x^*$ rather than an individual observation, we disregard the variance of ε; this narrows the interval and reduces the variance as much as possible.
Calculate the $100(1-\alpha)\%$ confidence interval for the mean response with:
$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{\alpha/2,\, n-2}\; s \sqrt{\frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$$
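A sketch of this confidence interval in Python for the same hypothetical data, with an arbitrary new point $x^* = 4.5$:

```python
from scipy.stats import t

# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 5.3, 6.2, 8.4, 9.4, 11.2, 12.1, 14.3]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = (sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)) ** 0.5

x_star = 4.5                      # new x value (arbitrary choice)
y_hat = b0 + b1 * x_star          # point estimate of the mean response

# Standard error of the estimated mean response (no epsilon variance term)
se_mean = s * (1 / n + (x_star - xbar) ** 2 / Sxx) ** 0.5
t_crit = t.ppf(0.975, df=n - 2)   # 95% confidence level

ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
print(f"mean response at x* = {x_star}: {y_hat:.3f}, 95% CI = {ci}")
```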
Prediction
Because the regression model includes the ε term, the prediction for a single new observation has extra variability, so the $100(1-\alpha)\%$ prediction interval is wider:
$$\hat\beta_0 + \hat\beta_1 x^* \pm t_{\alpha/2,\, n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar x)^2}{S_{xx}}}$$
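The corresponding prediction-interval sketch for the same hypothetical data; the only change from the mean-response interval is the extra 1 inside the square root, which accounts for the variance of ε:

```python
from scipy.stats import t

# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 5.3, 6.2, 8.4, 9.4, 11.2, 12.1, 14.3]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
s = (sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)) ** 0.5

x_star = 4.5                      # same arbitrary new x value
y_hat = b0 + b1 * x_star

# The "1 +" term adds the variance of epsilon for a single new observation
se_pred = s * (1 + 1 / n + (x_star - xbar) ** 2 / Sxx) ** 0.5
t_crit = t.ppf(0.975, df=n - 2)

pi = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)
print(f"95% prediction interval at x* = {x_star}: {pi}")
```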
Correlation Coefficient
Shows how strongly related two random variables are.
Note: If X and Y are independent, then their correlation is 0 (but zero correlation does not imply independence).
$$\rho = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y},$$
where the covariance is
$$\mathrm{Cov}(X, Y) = E\bigl[(X - \mu_X)(Y - \mu_Y)\bigr] = E[XY] - \mu_X \mu_Y$$
For a population where we do not know the joint pdf $f(x, y)$, we can estimate $\rho$ with the sample correlation coefficient:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}}\,\sqrt{S_{yy}}} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum (x_i - \bar x)^2}\,\sqrt{\sum (y_i - \bar y)^2}}$$
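A short sketch computing $r$ for the same hypothetical data:

```python
# Hypothetical sample data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.1, 5.3, 6.2, 8.4, 9.4, 11.2, 12.1, 14.3]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

Sxx = sum((x - xbar) ** 2 for x in xs)
Syy = sum((y - ybar) ** 2 for y in ys)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))

r = Sxy / (Sxx ** 0.5 * Syy ** 0.5)       # sample correlation coefficient
print(f"r = {r:.3f}, r^2 = {r * r:.3f}")  # r^2 matches the coefficient of determination
```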