STAT 211 Topic 10

Thursday, April 14, 2011

Simple Linear Regression

Simple linear regression models a linear relationship between two variables $X$ and $Y$: $Y=A+BX$, where the random variables $A,B\sim \mathrm {Normal}$.

Goal: Find the line that minimizes the sum of squared distances between the predicted and observed values (squaring keeps every term non-negative).

What the line tells us:

Estimates/predicts values (interpolation within the observed range; extrapolation outside of it)
Describes the rate of change of $y$ per unit change in $x$ (slope)

$y=\beta _{0}+\beta _{1}x+\epsilon$

Regression Model Equation

β₀ and β₁ are assumed to be unknown fixed parameters; x is treated as a fixed (non-random) observed value.
ε is the error variable: $\epsilon \sim \mathrm {Normal} (0,\sigma ^{2})$ (σ² is unknown)
Therefore $y\sim \mathrm {Normal} (\beta _{0}+\beta _{1}x,\ \sigma ^{2})$

We have observed values of $x$ (sample data), but we need estimates for $\beta _{0}$ and $\beta _{1}$.

Least Squares

Given points $(x_{1},y_{1}),(x_{2},y_{2}),\ldots ,(x_{n},y_{n})$ , our regression model is:

$y_{i}=\beta _{0}+\beta _{1}x_{i}+\epsilon _{i}$

The squared vertical distance between an observed point and the line is $(y_{i}-(\beta _{0}+\beta _{1}x_{i}))^{2}$. Therefore the sum of squared differences is:

$S(\beta _{0},\beta _{1})=\sum _{i=1}^{n}(y_{i}-\beta _{0}-\beta _{1}x_{i})^{2}$

Find $\beta _{0}$ and $\beta _{1}$ such that $S(\beta _{0},\beta _{1})$ is minimized. (Take derivative, equate to zero, and solve)
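The "take derivative, equate to zero, and solve" step can be written out as follows. This is a sketch of the standard derivation using only the symbols already defined above; it is not taken from the lecture itself.

```latex
\begin{aligned}
\frac{\partial S}{\partial \beta_0} &= -2\sum_{i=1}^{n}(y_i-\beta_0-\beta_1 x_i) = 0
  &&\Rightarrow\quad \beta_0 = \bar{y}-\beta_1\bar{x} \\
\frac{\partial S}{\partial \beta_1} &= -2\sum_{i=1}^{n}x_i\,(y_i-\beta_0-\beta_1 x_i) = 0 \\
\text{substitute } \beta_0:\quad
  0 &= \sum_{i=1}^{n}x_i\bigl((y_i-\bar{y})-\beta_1(x_i-\bar{x})\bigr) \\
\beta_1 &= \frac{\sum x_i(y_i-\bar{y})}{\sum x_i(x_i-\bar{x})}
         = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^{2}}
\end{aligned}
```

The last equality holds because $\sum {\bar {x}}(y_{i}-{\bar {y}})=0$ and $\sum {\bar {x}}(x_{i}-{\bar {x}})=0$.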

Estimated solutions (both are normally distributed regardless of sample size, because the errors are assumed normal):

${\begin{aligned}{\hat {\beta }}_{1}&={\frac {\sum (x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sum (x_{i}-{\bar {x}})^{2}}}&\sim \mathrm {N} \left(\beta _{1},\ {\frac {\sigma ^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}\right)\\{\hat {\beta }}_{0}&={\bar {y}}-{\hat {\beta }}_{1}{\bar {x}}&\sim \mathrm {N} \left(\beta _{0},\ \sigma ^{2}\left({\frac {1}{n}}+{\frac {{\bar {x}}^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}\right)\right)\end{aligned}}$

Therefore, the fitted (predicted) value of $y_{i}$ is

${\hat {y}}_{i}={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i}$
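As a minimal sketch, the least-squares estimates can be computed directly from the sums in the formulas (slope = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², intercept = ȳ − slope · x̄). The five data points and the function name `least_squares` are made up for illustration and do not come from the lecture.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def least_squares(xs, ys):
    """Return (b0, b1): the least-squares intercept and slope."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope estimate
    b0 = y_bar - b1 * x_bar  # intercept estimate
    return b0, b1


b0, b1 = least_squares(xs, ys)
fitted = [b0 + b1 * x for x in xs]  # fitted values y-hat_i
print("intercept:", b0, "slope:", b1)
print("fitted:", fitted)
```

For this data the intercept is about 2.2 and the slope about 0.6.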

Residuals (observed − fitted; $y_{i}-{\hat {y}}_{i}$ ) can be plotted to check how well the line fits.

Most statistical software reports a coefficient of determination $R^{2}$. This value is always between 0 and 1 and gives the proportion of the variation in $y$ explained by the line, so $R^{2}$ close to 1 indicates a better-fitting line.
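The residuals and $R^{2}$ can be computed by hand as a check on software output. The sketch below uses made-up data and computes $R^{2}$ as 1 − SSE/SST, which equals SSR/SST for a least-squares line with an intercept.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]  # observed - fitted

sse = sum(e ** 2 for e in residuals)        # unexplained variation
sst = sum((y - y_bar) ** 2 for y in ys)     # total variation
r_squared = 1 - sse / sst

print("residuals:", residuals)
print("R^2:", r_squared)
```

With an intercept in the model the residuals sum to zero (up to rounding), which is a quick sanity check before plotting them.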

Tests for β₁

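Since we don't know σ, we substitute the estimate $s_{\epsilon }$ (see below), which gives the estimated standard deviation of ${\hat {\beta }}_{1}$ as $s_{{\hat {\beta }}_{1}}=s_{\epsilon }/{\sqrt {\sum (x_{i}-{\bar {x}})^{2}}}$. The standardized statistic $({\hat {\beta }}_{1}-\beta _{1})/s_{{\hat {\beta }}_{1}}$ then follows a t-distribution with $n-2$ degrees of freedom.

Therefore, we can compute confidence intervals $\left({\hat {\beta }}_{1}\pm t_{\alpha /2,n-2}\cdot s_{{\hat {\beta }}_{1}}\right)$ and run a t-test of H₀: β₁ = β₁₀ using $\left(t={\tfrac {{\hat {\beta }}_{1}-\beta _{10}}{s_{{\hat {\beta }}_{1}}}}\right)$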
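A small sketch of the slope test and confidence interval, under stated assumptions: the data are made up, the null value β₁₀ = 0 is chosen for illustration, and the critical value 3.182 is the t-table entry for α/2 = 0.025 with 3 degrees of freedom (n = 5 here). The standard library has no t quantile function, so the critical value is entered by hand.

```python
import math

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_eps = math.sqrt(sse / (n - 2))   # estimate of sigma
s_b1 = s_eps / math.sqrt(sxx)      # estimated std. deviation of the slope

beta_10 = 0.0                      # assumed null value (H0: slope = 0)
t_stat = (b1 - beta_10) / s_b1

t_crit = 3.182                     # t table: alpha/2 = 0.025, df = n - 2 = 3
ci_low = b1 - t_crit * s_b1
ci_high = b1 + t_crit * s_b1

print("t statistic:", t_stat)
print("95% CI for slope:", (ci_low, ci_high))
print("reject H0 at 5%:", abs(t_stat) > t_crit)
```

For this data |t| is smaller than the critical value, so H₀ is not rejected at the 5% level; equivalently, the confidence interval contains 0.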

Estimating σ²

The minimum value of $S(\beta _{0},\beta _{1})$, attained at the least-squares estimates, is the Error Sum of Squares:

$S({\hat {\beta }}_{0},{\hat {\beta }}_{1})=\sum _{i=1}^{n}(y_{i}-({\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i}))^{2}=SSE$

To obtain an estimate for σ², we divide SSE by the degrees of freedom remaining after estimating β₀ and β₁, which is $n-2$ :

$s_{\epsilon }={\hat {\sigma }}={\sqrt {\frac {SSE}{n-2}}}$
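A brief sketch of this calculation on made-up data; the helper name `error_std` is hypothetical.

```python
import math

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def error_std(xs, ys):
    """Return (SSE, s_eps) for the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Two parameters were estimated, so n - 2 degrees of freedom remain.
    return sse, math.sqrt(sse / (n - 2))


sse, s_eps = error_std(xs, ys)
print("SSE:", sse)
print("s_eps:", s_eps)
```

Here SSE is about 2.4, so the variance estimate is 2.4 / 3 = 0.8 and $s_{\epsilon }$ is about 0.894.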

ANOVA for Regression

${\begin{aligned}SST=SSR+SSE&=\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}&{\mbox{Total Sum of Squares}}\\SSE&=\sum _{i=1}^{n}(y_{i}-({\hat {\beta }}_{0}+{\hat {\beta }}_{1}x_{i}))^{2}&{\mbox{Error Sum of Squares}}\\SSR=SST-SSE&=\sum _{i=1}^{n}({\hat {y}}_{i}-{\bar {y}})^{2}&{\mbox{Regression Sum of Squares}}\\R^{2}&={\frac {SSR}{SST}}&{\mbox{Coefficient of Determination}}\end{aligned}}$

Source	Sum of Squares	Degrees of Freedom	Mean Square	f statistic
Regression	SSR	1	MSR = SSR / 1	MSR / MSE
Error	SSE	$n-2$	MSE = SSE / ($n-2$)
Total	SST	$n-1$
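The entries of the ANOVA table can be filled in with a few lines of code. This is an illustrative sketch on made-up data, not output from any particular statistics package.

```python
# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)              # total
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error
ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression

msr = ssr / 1          # regression has 1 degree of freedom
mse = sse / (n - 2)    # error has n - 2 degrees of freedom
f_stat = msr / mse

print(f"{'Source':<12}{'SS':>8}{'df':>4}{'MS':>8}{'f':>8}")
print(f"{'Regression':<12}{ssr:>8.3f}{1:>4}{msr:>8.3f}{f_stat:>8.3f}")
print(f"{'Error':<12}{sse:>8.3f}{n - 2:>4}{mse:>8.3f}")
print(f"{'Total':<12}{sst:>8.3f}{n - 1:>4}")
```

In simple linear regression the f statistic equals the square of the t statistic for testing β₁ = 0, so the two tests agree.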

Predicting with Regression

Suppose we have a new value x* at which we want to estimate the response.

There are two quantities we can estimate:

mean response (only the value on the regression line): the estimate of $E(y|x=x^{*})$ is ${\hat {\beta }}_{0}+{\hat {\beta }}_{1}x^{*}$
prediction of a single new observation (includes the error variable ε): ${\hat {y}}={\hat {\beta }}_{0}+{\hat {\beta }}_{1}x^{*}$

Both point estimates come from plugging the new $x$ value into the fitted model equation; the two cases differ only in the width of the interval around that estimate.

Mean Response

The mean response targets the regression line itself, so the interval reflects only the uncertainty in ${\hat {\beta }}_{0}$ and ${\hat {\beta }}_{1}$ and leaves out the variance of ε. This makes it narrower than the prediction interval.

Calculate a 100(1-α)% confidence interval for the mean response with:

${\hat {\beta }}_{0}+{\hat {\beta }}_{1}x^{*}\pm t_{\alpha /2,n-2}\cdot s_{\epsilon }{\sqrt {{\frac {1}{n}}+{\frac {(x^{*}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}$

Prediction

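Because the regression model has the ε term, a single new observation also varies around the line, so the prediction is reported as a range. The 100(1-α)% prediction interval adds the variance of ε (the extra 1 under the square root):

${\hat {\beta }}_{0}+{\hat {\beta }}_{1}x^{*}\pm t_{\alpha /2,n-2}\cdot s_{\epsilon }{\sqrt {1+{\frac {1}{n}}+{\frac {(x^{*}-{\bar {x}})^{2}}{\sum (x_{i}-{\bar {x}})^{2}}}}}$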
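The two intervals can be compared side by side. The sketch below assumes made-up data, a hypothetical new point x* = 4, and a hand-entered critical value of 3.182 (t table, α/2 = 0.025, 3 degrees of freedom for n = 5).

```python
import math

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
x_star = 4.0     # assumed new x value
t_crit = 3.182   # t table: alpha/2 = 0.025, df = n - 2 = 3

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
s_eps = math.sqrt(sse / (n - 2))

y_hat = b0 + b1 * x_star                     # same point estimate for both
leverage = 1 / n + (x_star - x_bar) ** 2 / sxx

se_mean = s_eps * math.sqrt(leverage)        # mean response
se_pred = s_eps * math.sqrt(1 + leverage)    # single new observation

mean_ci = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)
pred_int = (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)

print("estimate at x*:", y_hat)
print("95% CI for mean response:", mean_ci)
print("95% prediction interval:", pred_int)
```

Both intervals are centered on the same estimate, and both widen as x* moves away from x̄ because of the $(x^{*}-{\bar {x}})^{2}$ term.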

Correlation Coefficient

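Shows how strongly two random variables are linearly related.

Note: If X and Y are independent, then the correlation is 0. The converse does not hold in general: a correlation of 0 does not imply independence.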

$\mathrm {Corr} (X,Y)=\rho _{X,Y}={\frac {\mathrm {Cov} (X,Y)}{\sqrt {V(X)V(Y)}}}$

where Covariance is

$\mathrm {Cov} (X,Y)=E(XY)-E(X)E(Y)$

For a population where we do not know the pdf (ƒ(x, y)), we can estimate ρ with the sample correlation coefficient:

$r={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}\,\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}$
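A minimal sketch of the sample correlation coefficient on made-up data; the function name `sample_corr` is hypothetical.

```python
import math

# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sample_corr(xs, ys):
    """Return the sample correlation coefficient r of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = sample_corr(xs, ys)
print("r:", r)
print("r squared:", r ** 2)
```

In simple linear regression, r² equals the coefficient of determination $R^{2}$, and r has the same sign as the fitted slope.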
