Lecture 1

Lecture 1 Notes

Tuesday, January 18, 2011

Introduction

Please do the following as soon as possible:

Print and sign the syllabus; bring it on Thursday, January 20, 2011
Send an email with your bio and picture to stat211.jun@gmail.com
Create an account on http://dl.stat.tamu.edu/dostat and upload the same picture there
Verify Textbook: Miller and Freund’s Probability and Statistics for Engineers (8th ed)

What is Statistics

Example: M&Ms

Number of candies in each bag; in particular, how many red?

Science of collecting, classifying, and interpreting data

Vocabulary

population: entire group of interest (normally very big, and potentially more than one!); EX: all M&Ms
sample: subset of population selected for analysis; EX: M&Ms purchased by students
parameter: fixed unknown number that describes population (what we're trying to figure out); EX: avg. number of red M&Ms in total production
statistic: number produced from a sample that estimates parameter; this is the goal of statistics in general; EX: avg. number of red M&Ms in sample
variable: any characteristic whose value may change from one object to another in the population; EX: number of red M&Ms in each bag

Interpreting Data: Histograms

bar graph drawn across the $x$-axis, where the area of each bar represents the relative frequency of the results in that bar's interval:

$A_{\mbox{bars}}={\mbox{relative frequency}}$

${\mbox{relative frequency}}={\frac {\mbox{occurrences that fit}}{\mbox{total number of results}}}$

Height of bar = density

If you add up all the areas in a histogram, the result is 1 (100%)

Note: inclusion of endpoints is very important! The entire shape of a histogram can change between intervals of the form a ≤ x < b and a < x ≤ b
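As a rough illustration of the area rule and the endpoint note, the sketch below builds a density histogram by hand. The function name `density_histogram`, the data, and the bin edges are made up for this example and are not from the lecture.

```python
def density_histogram(data, edges, right_closed=False):
    """Return (relative frequencies, densities), one entry per bin.

    right_closed=False counts a value x in a bin when lo <= x < hi;
    right_closed=True counts it when lo < x <= hi.
    """
    n = len(data)
    rel_freq, density = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if right_closed:
            count = sum(1 for x in data if lo < x <= hi)
        else:
            count = sum(1 for x in data if lo <= x < hi)
        rel_freq.append(count / n)             # area of the bar
        density.append(count / n / (hi - lo))  # height of the bar
    return rel_freq, density


data = [1, 2, 2, 3, 4, 4, 4, 5]   # hypothetical observations
edges = [0, 2, 4, 6]              # hypothetical bin edges (width 2)

left_rel, left_dens = density_histogram(data, edges)
right_rel, right_dens = density_histogram(data, edges, right_closed=True)

print(left_rel)   # bins [0,2), [2,4), [4,6)
print(right_rel)  # bins (0,2], (2,4], (4,6]
print(sum(d * 2 for d in left_dens))  # total area of all bars
```

With these values the tallest bar moves from the last bin to the middle bin when the closed endpoint is switched, even though the data are unchanged.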

Lecture 2

Lecture 2 Notes

Thursday, January 20, 2011

Histograms (cont'd)

Plotting a fit-line over a histogram reveals its general shape, which can be described with the following terms:

symmetric: similar on both sides
unimodal: 1 maximum
bimodal: 2 maxima
multimodal: 3+ maxima
positively skewed: long, low tail on the right side; example: income
negatively skewed: long, low tail on the left side

Histograms can be described using more than one term.

Measures of Location

Summarizing data with one number

"center" values

mean/average: ${\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}$
median: "data point in middle"
1. sort the data
2. if odd number of data, use data[(n+1)/2] as ordered value
3. if even number of data, use average of data[n/2] and data[n/2+1]

Example

Data set: { 1, 3, 10, 4, 6 }

mean: ${\bar {x}}=4.8$
median: $\{1,3,4,6,10\}\rightarrow {\tilde {x}}=4$

Suppose we add 100:

mean: ${\bar {x}}={\frac {124}{6}}\approx 20.67$
median: $\{1,3,4,6,10,100\}\rightarrow {\tilde {x}}={\mbox{avg}}(4,6)=5$
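The example can be checked with a few lines of Python. This is a minimal sketch following the sort-then-pick rule from the notes; the function names `mean` and `median` are just chosen for this illustration.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # 1-based position (n+1)/2
    return (ordered[mid - 1] + ordered[mid]) / 2   # 1-based positions n/2 and n/2+1


data = [1, 3, 10, 4, 6]
print(mean(data), median(data))

with_extra = data + [100]
print(mean(with_extra), median(with_extra))
```

Adding the single value 100 moves the mean from 4.8 to about 20.67, while the median only moves from 4 to 5.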

Mean vs. Median

Any data point that is unusually large or small compared to the surrounding values is called an outlier

mean is more sensitive to outliers

median is robust in that it is not sensitive to outliers

Going back to histograms

symmetric & unimodal: mean = median
positively skewed: mean > median
negatively skewed: mean < median

For a roughly symmetric, unimodal histogram, the median occurs near the maximum of the histogram.

Percentiles and Quartiles

Scoring at the 90th percentile of SAT scores means that 90% of the people who took the SAT scored below you and 10% scored above.

Quartiles (robust):

Q₁ (First Quartile) is the 25th percentile
Q₂ (Second Quartile) is the 50th percentile (= Median)
Q₃ (Third Quartile) is the 75th percentile
IQR (Interquartile Range or Fourth Spread) = Q₃ - Q₁

more precise definition of outliers:

any observation that lies more than 1.5 × IQR below Q₁ or more than 1.5 × IQR above Q₃, i.e. outside the interval [Q₁ − 1.5 × IQR, Q₃ + 1.5 × IQR]
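A minimal sketch of the 1.5 × IQR rule, assuming the quartiles have already been computed and are passed in as arguments. The function name `find_outliers` and the quartile values used below are illustrative assumptions, not part of the lecture.

```python
def find_outliers(data, q1, q3):
    """Return the observations that fall outside the 1.5 * IQR fences."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


data = [1, 3, 4, 6, 10, 100]
# Assumed quartiles for this data set: Q1 = 3 and Q3 = 10, so IQR = 7
# and the fences are 3 - 10.5 = -7.5 and 10 + 10.5 = 20.5.
flagged = find_outliers(data, q1=3, q3=10)
print(flagged)
```

Only 100 falls outside the fences here, which agrees with the informal idea of an outlier used earlier.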

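Calculation of the p-th percentile:

Order the n values from smallest to largest
Calculate the product (n*p)/100
If the product is not an integer, round up to the next integer (ceil()) and use the ordered value at that position
If the product is an integer k, use the average of the k-th and (k+1)-th ordered values (this matches the median rule for even n)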
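The percentile steps can be sketched as follows. The function name `percentile` is hypothetical. When the product is a whole number k, the sketch averages the k-th and (k+1)-th ordered values, the same convention as the median rule for even n; treat that branch as an assumption if your textbook uses a different convention.

```python
import math


def percentile(values, p):
    """p-th percentile (0 < p < 100) using the round-up rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    product = n * p / 100
    if not float(product).is_integer():
        k = math.ceil(product)        # round up to the next position
        return ordered[k - 1]         # positions are 1-based in the notes
    k = int(product)
    # ASSUMPTION: whole-number product -> average the k-th and (k+1)-th values.
    return (ordered[k - 1] + ordered[k]) / 2


data = [1, 3, 4, 6, 10, 100]
q1 = percentile(data, 25)   # 6 * 25 / 100 = 1.5 -> position 2
q2 = percentile(data, 50)   # 6 * 50 / 100 = 3   -> average of positions 3 and 4
q3 = percentile(data, 75)   # 6 * 75 / 100 = 4.5 -> position 5
print(q1, q2, q3, q3 - q1)
```

For this data set the sketch gives Q₁ = 3, Q₂ = 5, Q₃ = 10 and IQR = 7, and Q₂ agrees with the median computed by hand.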

Variables

Quantitative

Recall that variable is a characteristic or quantity to be measured.

quantitative variables take numerical values that we can manipulate arithmetically

Categorical

Places a unit into one of several categories:

EX: Gender, race, political party

Think of a sample proportion:

${\hat {p}}={\frac {\mbox{number in sample belonging to the category (gender, race, party, etc.)}}{\mbox{total number in sample}}}$
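As a small illustration, the sample proportion of each category can be computed by counting. The function name `sample_proportions` and the party labels below are invented for this sketch.

```python
from collections import Counter


def sample_proportions(labels):
    """Map each category to (count in that category) / (total sample size)."""
    counts = Counter(labels)
    n = len(labels)
    return {category: count / n for category, count in counts.items()}


# Hypothetical sample of 8 people and their party labels.
parties = ["A", "B", "A", "C", "A", "B", "A", "B"]
proportions = sample_proportions(parties)
print(proportions)
```

The proportions always add up to 1, since every unit in the sample falls into exactly one category.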

Variance

How is the data spread out?

range

Difference between maximum and minimum (max − min)

very sensitive to outliers

sample variance

(for entire data set)

deviation from mean of each item in data set:

$x_{i}-{\bar {x}}$

calculate sample variance:

$s^{2}={\frac {\sum _{i=1}^{n}\left(x_{i}-{\bar {x}}\right)^{2}}{n-1}}$

(don't forget to square units!)

Sample standard deviation

$s={\sqrt {s^{2}}}$

(the ± deviation)
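A minimal sketch of the two formulas, reusing the data set { 1, 3, 10, 4, 6 } from the mean/median example. The function names `sample_variance` and `sample_std` are chosen for this illustration only.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [1, 3, 10, 4, 6]
# Deviations from the mean 4.8: -3.8, -1.8, 5.2, -0.8, 1.2
# Squared deviations sum to 46.8, so s^2 = 46.8 / 4 = 11.7
s2 = sample_variance(data)
s = sample_std(data)
print(s2, s)
```

Note the divisor is n − 1 = 4, not n = 5; the variance is in squared units, while the standard deviation is back in the units of the data.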

Effects of math on Mean and Variance

if a professor added 5 points to everyone's test

the mean would increase by 5 points ( ${\bar {x}}=A+{\bar {x}}_{0}\quad A=5$ )
deviation and variance would not change

if a company raises every salary by 10%

mean would be multiplied by 1.1 ( ${\bar {x}}=A\times {\bar {x}}_{0}\quad A=1.1$ )
variance would change ( $s^{2}=A^{2}\times s_{0}^{2}$ )
standard deviation would change ( $s=\left|A\right|\times s_{0}$ )
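These rules can be checked numerically. The sketch below uses Python's standard `statistics` module on a made-up list of scores; the data values are hypothetical.

```python
from statistics import mean, stdev, variance

scores = [70, 80, 90, 100]            # hypothetical data
shifted = [x + 5 for x in scores]     # add A = 5 to every value
scaled = [1.1 * x for x in scores]    # multiply every value by A = 1.1

# Adding a constant moves the mean but leaves the spread alone.
print(mean(scores), mean(shifted))
print(variance(scores), variance(shifted))

# Multiplying by a constant scales the mean by A, the variance by A^2,
# and the standard deviation by |A|.
print(mean(scaled) / mean(scores))
print(variance(scaled) / variance(scores))
print(stdev(scaled) / stdev(scores))
```

Up to floating-point rounding, the last three ratios come out as 1.1, 1.21 and 1.1.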

Box Plots

Male and Female Height

Way to show center and spread of data set

Box extends from Q₁ to Q₃ with a line drawn inside at the median. Whiskers extend from both sides of the box by a length of 1.5 × IQR (can be truncated to the max/min of the data)

Very useful for comparing two variables
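A rough sketch of where the whiskers end under the description above: each whisker reaches 1.5 × IQR beyond the box unless the data's minimum or maximum is closer. The function name `whisker_ends` and the quartile values are assumptions for illustration; plotting libraries may use slightly different whisker conventions.

```python
def whisker_ends(data, q1, q3):
    """Return (low, high) whisker ends for a box spanning q1 to q3."""
    iqr = q3 - q1
    low = max(min(data), q1 - 1.5 * iqr)    # truncate at the smallest value
    high = min(max(data), q3 + 1.5 * iqr)   # truncate at the largest value
    return low, high


data = [1, 3, 4, 6, 10, 100]
# Assumed quartiles for this data set: Q1 = 3, Q3 = 10 (IQR = 7).
ends = whisker_ends(data, q1=3, q3=10)
print(ends)
```

Here the lower whisker stops at the minimum, 1, while the upper whisker stops at 10 + 10.5 = 20.5, leaving the value 100 beyond it.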

Tuesday, January 25, 2011

Experiments

Types of data collection

Observational study: observe a group and measure quantities. Very passive / non-invasive (i.e. does not influence the group); all terms studied in Lecture 1 apply to this type of study
Experiment: deliberately expose a group to certain environments/treatments and observe responses.

New Vocabulary

Experimental Group: experimental units subjected to a real treatment
Control Group: experimental units subjected to the same conditions as experimental groups, but no treatment is imposed
Confounding effects: differences between the control group and experimental groups other than the treatment itself, which can distort the comparison; should be avoided when possible.
Blinding: subjects are not told which group they are in, because data can be affected if people know that they are in the control group; e.g. subjects in a medical study can be given either the treatment or a placebo
Double-blinding: in addition, all other people involved in the experiment do not know whether a subject is in the control or experimental group; e.g. doctors in a medical study are not told whether a patient is experimental or control

STAT 211 Topic 1
