STAT 211 Topic 10
Thursday, April 14, 2011
Simple Linear Regression
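Predicting linear relationships between two variables $x$ and $y$: $y = \beta_0 + \beta_1 x + \epsilon$, where $y$ and $\epsilon$ are random variables (rvs).
Goal: Find the line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ such that the sum of squared distances between the predicted and observed values is minimized (distances are squared so that every term is positive).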
What the line tells us:
- Estimates/predicts values (interpolation within the range of the observed $x$ values; extrapolation outside of that range)
- Describes the rate of change of $y$ per unit change in $x$ (slope)
Regression Model Equation
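$$y = \beta_0 + \beta_1 x + \epsilon$$
- β0 and β1 are assumed to be unknown fixed parameters; x is treated as fixed (not random).
- ε Error variable: $\epsilon \sim N(0, \sigma^2)$ (σ² is unknown)
- Therefore $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$
We have observed values of $x$ and $y$ (sample data), but we need estimates for $\beta_0$ and $\beta_1$.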
Least Squares
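Given $n$ points $(x_1, y_1), \ldots, (x_n, y_n)$, our regression model is:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
Squared distance between the observed point and the line is given by $(y_i - (\beta_0 + \beta_1 x_i))^2$. Therefore the sum of squared differences is:
$$D = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2$$
Find $\beta_0$ and $\beta_1$ such that $D$ is minimized. (Take the partial derivative with respect to each, equate to zero, and solve.)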
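The "take derivative, equate to zero" step can be written out as follows. This is the standard derivation, with $D = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2$ denoting the sum of squared differences. The two resulting conditions are usually called the normal equations, a name that is not used in these notes.

```latex
\begin{align}
\frac{\partial D}{\partial \beta_0}
  &= -2 \sum_{i=1}^n \left( y_i - \beta_0 - \beta_1 x_i \right) = 0
  &&\Longrightarrow\quad n \beta_0 + \beta_1 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i \\
\frac{\partial D}{\partial \beta_1}
  &= -2 \sum_{i=1}^n x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = 0
  &&\Longrightarrow\quad \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i
\end{align}
```

These are two linear equations in the two unknowns $\beta_0$ and $\beta_1$, so they can be solved directly.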
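Estimated Solutions: (both have normal distribution regardless of sample size)
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Therefore, the final equation becomes $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.
The fitted (predicted) value of $y_i$ is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
Residuals (observed − fitted; $e_i = y_i - \hat{y}_i$) can be plotted to check how well the line fits.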
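As a minimal sketch of the least-squares fit, the snippet below computes the slope as $S_{xy}/S_{xx}$ and the intercept as $\bar{y} - \hat{\beta}_1 \bar{x}$ (the standard closed-form solution), then the fitted values and residuals. The function name `fit_line` and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1): the least-squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx          # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Illustrative data only (not from the course notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]  # observed - fitted

print("intercept:", round(b0, 4), "slope:", round(b1, 4))
print("residuals:", [round(e, 4) for e in residuals])
```

For this data the intercept is about 2.2 and the slope about 0.6. The residuals sum to zero up to rounding, which holds for any least-squares line that includes an intercept.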
Most statistical software reports the coefficient of determination $R^2$ (defined under ANOVA for Regression). This value is always between 0 and 1; a value closer to 1 indicates a better-fitting line.
Tests for β1
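Since we don't know σ, we substitute the estimate $s_\epsilon$ (see below). The standardized slope estimate then follows a t-distribution with df $n-2$:
$$T = \frac{\hat{\beta}_1 - \beta_1}{s_\epsilon / \sqrt{S_{xx}}} \sim t_{n-2}, \qquad S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$$
Therefore, we can construct confidence intervals, $\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\, s_\epsilon / \sqrt{S_{xx}}$, and perform t-tests for $H_0: \beta_1 = \beta_{10}$.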
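The slope test can be sketched in a few lines. This is an illustration under stated assumptions: the data are made up, the helper name `slope_inference` is hypothetical, and the critical value 3.182 is the two-sided 5% point of the t-distribution with 3 degrees of freedom, read from a standard t table rather than computed.

```python
import math


def slope_inference(xs, ys, beta1_null=0.0):
    """Return (b1, se_b1, t_stat) for testing H0: beta1 = beta1_null."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    # Estimate sigma from the residuals, using n - 2 degrees of freedom
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_eps = math.sqrt(sse / (n - 2))
    se_b1 = s_eps / math.sqrt(s_xx)       # standard error of the slope
    t_stat = (b1 - beta1_null) / se_b1    # compare against t with n - 2 df
    return b1, se_b1, t_stat


# Illustrative data only (n = 5, so df = 3)
b1, se_b1, t_stat = slope_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])

t_crit = 3.182  # assumed table value: t_{0.025, 3}
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
reject_h0 = abs(t_stat) > t_crit

print("t statistic:", round(t_stat, 4))
print("95% CI for slope:", tuple(round(c, 4) for c in ci))
print("reject H0 (beta1 = 0):", reject_h0)
```

With so few points the interval for the slope contains 0, so the test does not reject $H_0: \beta_1 = 0$ at the 5% level even though the estimated slope is positive.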
Estimating σ²
Minimum value for $D$ is the Error Sum of Squares:
$$D = \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 = SSE$$
To obtain an estimate for σ², we divide SSE by the (remaining) degrees of freedom (after estimating β0 and β1), $n-2$; taking the square root gives the estimate of σ:
$$s_\epsilon = \hat{\sigma} = \sqrt{\frac{SSE}{n-2}}$$
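A small sketch of this calculation, assuming the coefficients have already been fitted. The function name `residual_std_error` is hypothetical, and the data and coefficients (intercept 2.2, slope 0.6, which is the least-squares fit for these five points) are illustrative only.

```python
import math


def residual_std_error(xs, ys, b0, b1):
    """Return (SSE, s_eps) for the fitted line y = b0 + b1 * x."""
    n = len(xs)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Two degrees of freedom are used up by estimating b0 and b1
    s_eps = math.sqrt(sse / (n - 2))
    return sse, s_eps


# Illustrative data and its least-squares coefficients
sse, s_eps = residual_std_error([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], 2.2, 0.6)
print("SSE:", round(sse, 4), "s_eps:", round(s_eps, 4))
```

Here SSE is about 2.4 and the estimate of σ is about 0.894.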
ANOVA for Regression
$$\begin{aligned} SST = SSR + SSE &= \sum_{i=1}^n (y_i-\bar{y})^2 & \text{Total Sum of Squares} \\ SSE &= \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 & \text{Error Sum of Squares} \\ SSR = SST - SSE &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 & \text{Regression Sum of Squares} \\ R^2 &= \frac{SSR}{SST} & \text{Coefficient of Determination} \end{aligned}$$
| Source | Sum of Squares | Degrees of Freedom | Mean Square | F statistic |
|---|---|---|---|---|
| Regression | SSR | 1 | MSR = SSR / 1 | MSR / MSE |
| Error | SSE | $n-2$ | MSE = SSE / ($n-2$) | |
| Total | SST | $n-1$ | | |
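The entries of the ANOVA table can be computed directly from the data. The sketch below uses made-up data and a hypothetical helper `anova_table`; it returns the quantities as a plain dictionary rather than in any particular package's output format.

```python
def anova_table(xs, ys):
    """Return the ANOVA quantities for a simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    sst = sum((y - y_bar) ** 2 for y in ys)                      # total
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # error
    ssr = sst - sse                                              # regression

    msr = ssr / 1        # regression has 1 degree of freedom
    mse = sse / (n - 2)  # error has n - 2 degrees of freedom
    return {
        "SSR": ssr, "SSE": sse, "SST": sst,
        "MSR": msr, "MSE": mse,
        "F": msr / mse,
        "R2": ssr / sst,
    }


# Illustrative data only
table = anova_table([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
for name, value in table.items():
    print(name, round(value, 4))
```

For this data SST is 6, SSE about 2.4 and SSR about 3.6, giving an $R^2$ of about 0.6 and an F statistic of about 4.5. In simple linear regression the F statistic equals the square of the t statistic for testing $\beta_1 = 0$.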
Predicting with Regression
Suppose we have a new value $x^*$ at which we want to estimate the response.
There are two possible quantities to calculate:
- mean response (only the value on the regression model line): $E(y \mid x=x^*)=\hat{\beta}_0 + \hat{\beta}_1x^*$
- prediction (includes the error variable ε): $\hat{y}=\hat{\beta}_0 + \hat{\beta}_1x^*$
In both cases we plug the new $x$ value into the model equation to obtain a $\hat{y}$ estimate; the two differ only in the width of the interval around it.
Mean Response
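The mean response is the average value of $y$ at $x = x^*$, so its interval reflects only the uncertainty in $\hat{\beta}_0$ and $\hat{\beta}_1$. The variance of ε is disregarded, which gives the narrower range.
Calculate the 100(1-α)% confidence interval for the mean response with:
$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2}\; s_\epsilon \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}, \qquad S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$$
Prediction
Because the regression model has the ε term, an individual prediction also carries the variance of ε, so the 100(1-α)% prediction interval is wider:
$$\hat{\beta}_0 + \hat{\beta}_1 x^* \pm t_{\alpha/2,\,n-2}\; s_\epsilon \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$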
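To compare the two kinds of interval numerically, the sketch below evaluates both at a new point using the standard formulas: the mean-response interval uses $\sqrt{1/n + (x^*-\bar{x})^2/S_{xx}}$, and the prediction interval adds 1 under the square root. The data, the helper name `intervals_at`, and the table value 3.182 ($t_{0.025,3}$) are assumptions for illustration.

```python
import math


def intervals_at(xs, ys, x_star, t_crit):
    """Return (y_hat, mean_response_interval, prediction_interval) at x_star."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_eps = math.sqrt(sse / (n - 2))

    y_hat = b0 + b1 * x_star
    # Uncertainty from estimating the line itself
    line_term = 1 / n + (x_star - x_bar) ** 2 / s_xx
    half_mean = t_crit * s_eps * math.sqrt(line_term)
    # A single new observation adds the variance of the error term
    half_pred = t_crit * s_eps * math.sqrt(1 + line_term)
    return (
        y_hat,
        (y_hat - half_mean, y_hat + half_mean),
        (y_hat - half_pred, y_hat + half_pred),
    )


# Illustrative data only; t_crit is the assumed table value t_{0.025, 3}
y_hat, ci_mean, pi_new = intervals_at(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x_star=3, t_crit=3.182
)
print("estimate:", round(y_hat, 4))
print("95% CI for mean response:", tuple(round(v, 4) for v in ci_mean))
print("95% prediction interval:", tuple(round(v, 4) for v in pi_new))
```

Both intervals are centred on the same estimate, and both are narrowest at $x^* = \bar{x}$ and widen as $x^*$ moves away from it, which is one reason extrapolation is less reliable than interpolation.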
Correlation Coefficient
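Shows how strongly two random variables are linearly related:
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
where Covariance is
$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - \mu_X \mu_Y$$
For a population where we do not know the pdf (ƒ(x, y)), we can estimate ρ with the sample correlation coefficient:
$$r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}}$$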
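A short sketch of the sample correlation coefficient, computed as $S_{xy} / \sqrt{S_{xx} S_{yy}}$. The function name `sample_correlation` and the data are illustrative assumptions.

```python
import math


def sample_correlation(xs, ys):
    """Return the sample (Pearson) correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Illustrative data only
r = sample_correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print("r:", round(r, 4), "r squared:", round(r ** 2, 4))
```

For this data $r$ is about 0.775. In simple linear regression $r^2$ equals the coefficient of determination $R^2$, and $r$ has the same sign as the fitted slope.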