STAT 211 Topic 10

Thursday, April 14, 2011


Simple Linear Regression

Predicting linear relationships between two variables $x$ and $y$: $y = \beta_0 + \beta_1 x + \epsilon$, where the errors are rvs $\epsilon \sim N(0, \sigma^2)$.

Goal: Find the line that minimizes the sum of squared vertical distances between the observed and predicted values (squaring keeps every term non-negative).

What the line tells us:

  • Estimates/predicts values of y (interpolation within the range of the observed x; extrapolation outside that range)
  • Describes the rate of change of y per unit change in x (slope)

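Regression Model Equation

$$y = \beta_0 + \beta_1 x + \epsilon$$

  • β0 and β1 are unknown fixed parameters; x is treated as a fixed (non-random) value.
  • ε is the error variable: $\epsilon \sim N(0, \sigma^2)$ ($\sigma^2$ is unknown)
  • Therefore $y \sim N(\beta_0 + \beta_1 x, \sigma^2)$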

We have observed values of $x$ and $y$ (sample data), but we need estimates for $\beta_0$ and $\beta_1$.


Least Squares

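Given $n$ points $(x_1, y_1), \ldots, (x_n, y_n)$, our regression model is:

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

The squared distance between an observed point and the line is given by $(y_i - (\beta_0 + \beta_1 x_i))^2$. Therefore the sum of squared differences is:

$$D = \sum_{i=1}^n (y_i - (\beta_0 + \beta_1 x_i))^2$$

Find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that $D$ is minimized. (Take the partial derivative with respect to each parameter, equate to zero, and solve.)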
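The minimization step can be written out as a short worked derivation. This is a sketch that writes $D$ for the sum of squared differences and uses the standard calculus argument; the notes themselves only outline it.

```latex
\begin{align}
\frac{\partial D}{\partial \beta_0} &= -2 \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i\right) = 0 \\
\frac{\partial D}{\partial \beta_1} &= -2 \sum_{i=1}^n x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0
\end{align}

% Rearranging gives the two "normal equations" in the unknowns beta_0, beta_1:
\begin{align}
n \beta_0 + \beta_1 \sum_{i=1}^n x_i &= \sum_{i=1}^n y_i \\
\beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 &= \sum_{i=1}^n x_i y_i
\end{align}
```

Solving this pair of linear equations for the two unknowns gives the least-squares estimates.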


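Estimated Solutions: (under the normal-error model, both have a normal distribution regardless of sample size)

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n (x_i-\bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$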
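As a minimal sketch of the least-squares solution, the function below computes the slope as the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept as the mean of y minus the slope times the mean of x. The function name `fit_line` and the five data points are hypothetical, chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (b0, b1): least-squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1 = fit_line(xs, ys)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")  # y-hat = 2.20 + 0.60 x
```

The same hypothetical data set is reused in the later sketches so the numbers can be compared across sections.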

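Therefore, the final equation becomes

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

The fitted (predicted) value of $y_i$ is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

Residuals (observed − fitted; $e_i = y_i - \hat{y}_i$) can be plotted to check how well the line fits.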
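A short sketch of the residual check, using hypothetical data. It lists each residual as text rather than plotting it, and also verifies two consequences of the least-squares fit: the residuals sum to zero, and so do the residuals weighted by x.

```python
# Hypothetical sample data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]  # observed - fitted

for x, y, f, e in zip(xs, ys, fitted, residuals):
    print(f"x={x}  observed={y}  fitted={f:.1f}  residual={e:+.1f}")

# Both sums are zero (up to rounding) for any least-squares line with an intercept.
resid_sum = sum(residuals)
resid_x_sum = sum(x * e for x, e in zip(xs, residuals))
```

A visible pattern in the residuals (for example, a curve or a funnel shape when plotted against x) would suggest the straight-line model is a poor fit.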


Statistical software typically reports a coefficient of determination, $R^2$. This value is always between 0 and 1; values closer to 1 indicate a better-fitting line.


Tests for β1

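Since we don't know σ, we substitute the estimate $s_\epsilon$ (see below). The standardized slope estimate

$$T = \frac{\hat{\beta}_1 - \beta_1}{s_\epsilon / \sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}}$$

then follows a t-distribution with $n-2$ degrees of freedom.

Therefore, we can construct confidence intervals and run a t-test for $H_0: \beta_1 = \beta_{10}$.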
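The test and interval for the slope can be sketched as follows. This assumes the usual standard error of the slope, the residual standard deviation divided by the square root of the sum of squared x-deviations, and it takes the critical value from a t-table as an argument instead of computing it. The function name `slope_inference` and the data are hypothetical.

```python
import math


def slope_inference(xs, ys, t_crit, beta10=0.0):
    """Return the t statistic for H0: beta1 = beta10 and a CI for beta1.

    t_crit is the critical value t_{alpha/2, n-2}, looked up by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_eps = math.sqrt(sse / (n - 2))  # estimate of sigma
    se_b1 = s_eps / math.sqrt(s_xx)   # standard error of the slope
    t_stat = (b1 - beta10) / se_b1
    return t_stat, (b1 - t_crit * se_b1, b1 + t_crit * se_b1)


# Hypothetical sample data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
# 3.182 is the tabulated t_{0.025, 3}: a 95% interval with n - 2 = 3 df.
t_stat, (lo, hi) = slope_inference(xs, ys, t_crit=3.182)
print(f"t = {t_stat:.3f}, 95% CI for beta1: ({lo:.3f}, {hi:.3f})")
```

For this data the interval contains 0 and the t statistic is smaller in magnitude than the critical value, so H0: β1 = 0 would not be rejected at the 5% level.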


Estimating σ²

The minimum value of $D$ is the Error Sum of Squares:

$$D = \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 = SSE$$

To obtain an estimate for σ², we divide SSE by the remaining degrees of freedom, $n-2$ (two are used up in estimating β0 and β1). Taking the square root gives the estimate of σ:

$$s_\epsilon = \hat{\sigma} = \sqrt{\frac{SSE}{n-2}}$$
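A small sketch of this estimate, assuming the fitted coefficients are already available. The helper name `residual_std`, the data, and the coefficient values 2.2 and 0.6 (the least-squares fit for this particular data) are illustrative, not taken from the notes.

```python
import math


def residual_std(xs, ys, b0, b1):
    """Return (SSE, s_eps) for the fitted line y-hat = b0 + b1*x."""
    n = len(xs)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    # Divide by n - 2 because two parameters were estimated.
    return sse, math.sqrt(sse / (n - 2))


# Hypothetical sample data and its least-squares coefficients.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sse, s_eps = residual_std(xs, ys, b0=2.2, b1=0.6)
print(f"SSE = {sse:.2f}, s_eps = {s_eps:.4f}")  # SSE = 2.40, s_eps = 0.8944
```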

ANOVA for Regression

$$\begin{aligned}
SST = SSR + SSE &= \sum_{i=1}^n (y_i-\bar{y})^2 & \text{Total Sum of Squares} \\
SSE &= \sum_{i=1}^n (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 & \text{Error Sum of Squares} \\
SSR = SST - SSE &= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 & \text{Regression Sum of Squares} \\
R^2 &= \frac{SSR}{SST} & \text{Coefficient of Determination}
\end{aligned}$$

Source       Sum of Squares   Degrees of Freedom   Mean Square   f statistic
Regression   SSR              1                    MSR           MSR / MSE
Error        SSE              $n-2$                MSE
Total        SST              $n-1$
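The ANOVA quantities can be computed directly from the sums of squares. The sketch below assumes each mean square is its sum of squares divided by its degrees of freedom (1 for regression, n − 2 for error); the function name `anova_table` and the data are hypothetical.

```python
def anova_table(xs, ys):
    """Return the simple-linear-regression ANOVA quantities as a dict."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b0 = y_bar - b1 * x_bar
    sst = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssr = sst - sse
    msr = ssr / 1        # regression has 1 degree of freedom
    mse = sse / (n - 2)  # error has n - 2 degrees of freedom
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "F": msr / mse, "R2": ssr / sst}


# Hypothetical sample data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
table = anova_table(xs, ys)
for name, value in table.items():
    print(f"{name:>3} = {value:.3f}")
```

In simple linear regression the f statistic equals the square of the t statistic for testing β1 = 0, so the two tests agree.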


Predicting with Regression

Suppose we have a new value x* at which we want to estimate the response.

There are two ways to interpret this estimate:

  1. mean response (only consider the value on the regression line): $E(y|x=x^*)=\hat{\beta}_0 + \hat{\beta}_1x^*$
  2. prediction (a single new observation, which includes the error variable ε): $\hat{y}=\hat{\beta}_0 + \hat{\beta}_1x^*$

In both cases we plug the new $x$ value into the fitted model equation to obtain a $\hat{y}$ estimate; the two differ in the width of the interval around it.

Mean Response

Here we estimate the average of y at x*, so only the uncertainty in the fitted line matters. The variance of an individual ε is not included, which gives the narrower of the two intervals.

Calculate a 100(1-α)% confidence interval for the mean response with:

$$\hat{\beta}_0 + \hat{\beta}_1x^* \pm t_{\alpha/2, n-2} \cdot s_\epsilon \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i-\bar{x})^2}}$$

Prediction

Because a single new observation includes its own ε term, the prediction interval must also account for the variance of ε. This adds the extra 1 under the square root and makes the interval wider than the one for the mean response:

$$\hat{\beta}_0 + \hat{\beta}_1x^* \pm t_{\alpha/2, n-2} \cdot s_\epsilon \sqrt{1+\frac{1}{n} + \frac{(x^*-\bar{x})^2}{\sum (x_i-\bar{x})^2}}$$
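To compare the two intervals numerically, the sketch below computes the point estimate at a new x* together with the half-width of the mean-response confidence interval and of the prediction interval. The critical value is passed in from a t-table; the function name `intervals_at`, the data, and the choice x* = 4 are hypothetical.

```python
import math


def intervals_at(xs, ys, x_star, t_crit):
    """Return (y_hat, ci_half, pi_half) at x_star.

    ci_half: half-width of the confidence interval for the mean response.
    pi_half: half-width of the prediction interval for one new observation.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_eps = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_star
    # Term shared by both intervals: 1/n + (x* - x_bar)^2 / S_xx.
    h = 1 / n + (x_star - x_bar) ** 2 / s_xx
    ci_half = t_crit * s_eps * math.sqrt(h)
    pi_half = t_crit * s_eps * math.sqrt(1 + h)  # extra 1 for the new epsilon
    return y_hat, ci_half, pi_half


# Hypothetical sample data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
# 3.182 is the tabulated t_{0.025, 3} for 95% intervals with 3 df.
y_hat, ci_half, pi_half = intervals_at(xs, ys, x_star=4, t_crit=3.182)
print(f"y-hat = {y_hat:.2f}, mean response +/- {ci_half:.3f}, prediction +/- {pi_half:.3f}")
```

Both intervals are centered on the same point estimate and both widen as x* moves away from the mean of the observed x values.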

Correlation Coefficient

Shows how strongly two random variables are linearly related.

Note: If X and Y are independent, then the correlation is 0 (the converse does not hold in general).

$$\mathrm{Corr}(X,Y) = \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{V(X)V(Y)}}$$

where the covariance is

$$\mathrm{Cov}(X,Y)=E(XY)-E(X)E(Y)$$


For a population where we do not know the pdf (ƒ(x, y)), we can estimate ρ with the sample correlation coefficient:

$$r=\frac{\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n (x_i-\bar{x})^2\,\sum_{i=1}^n (y_i-\bar{y})^2}}$$
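A minimal sketch of the sample correlation coefficient, computed from the sums of deviations about the sample means. The function name `sample_corr` and the data are hypothetical.

```python
import math


def sample_corr(xs, ys):
    """Return the sample correlation coefficient r of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical sample data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = sample_corr(xs, ys)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")  # r = 0.7746, r^2 = 0.6000
```

In simple linear regression, the square of r equals the coefficient of determination, and r has the same sign as the fitted slope.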