We now investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we'll describe this distribution and assess the normality of our data.
10-mark banknote: Carl Friedrich Gauss (born 1777; died in Göttingen, 1855)
names(data)
[1] "SCHOOL" "EXP1" "EXP2" "EXP3" "EXP" "COU" "PAAU"
head(data)
SCHOOL EXP1 EXP2 EXP3 EXP COU PAAU
1 3 6.85 6.85 6.80 6.75 6.50 6.58
2 3 7.57 7.75 7.95 7.76 7.75 7.55
3 3 7.56 7.15 7.00 7.29 7.44 6.25
4 3 8.96 8.70 8.20 8.45 7.94 7.96
5 3 8.37 8.70 8.25 8.36 8.13 7.63
6 3 5.62 6.00 6.40 5.94 5.75 6.00
summary(data$PAAU)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.440 4.850 5.580 5.616 6.380 9.600
sd(data$PAAU)
[1] 1.123641
hist(data$PAAU, prob=TRUE, col="blue", main = "Histogram of PAAU scores (3609 students, year xx)")
# lines(density(data$PAAU), col="green")
curve(dnorm(x, mean=mean(data$PAAU), sd=sd(data$PAAU)), col="red", lwd=3, add=TRUE)
text(8, 0.30, "mean = 5.616", col ="red")
text(8, 0.28, "sd = 1.12", col ="red")
text(8.2, 0.25, "% between 4 and 8: 91.4%", col="blue")
text(8.2, 0.22, "(normal approximation: 90.8%)", col="magenta")
mean(data$PAAU <= 8) - mean(data$PAAU <= 4)
[1] 0.9135495
pnorm(8, 5.616312, 1.123641) - pnorm(4, 5.616312, 1.123641)
[1] 0.9079039
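The two calculations above hard-code the mean and standard deviation. As a minimal sketch, the same comparison can be wrapped in a small function that estimates both from the data. The name `compare_coverage` is hypothetical, and because the PAAU data set is not reproduced here, the example runs on simulated scores (an assumption, not the real data).

```r
# Hypothetical helper: share of values in (lo, hi] observed in the data,
# next to the share predicted by a normal model with the same mean and sd.
compare_coverage <- function(v, lo, hi) {
  empirical <- mean(v <= hi) - mean(v <= lo)
  normal    <- pnorm(hi, mean(v), sd(v)) - pnorm(lo, mean(v), sd(v))
  c(empirical = empirical, normal = normal)
}

# Simulated stand-in for data$PAAU (assumption: NOT the real scores)
set.seed(1)
scores <- rnorm(3609, mean = 5.616, sd = 1.124)

res <- compare_coverage(scores, 4, 8)
print(res)
```

With the real data, `compare_coverage(data$PAAU, 4, 8)` should reproduce the two proportions shown above.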
To see how closely the data follow a normal distribution, we can plot a normal density curve on top of a histogram, as in the plot above. This normal curve should have the same mean and standard deviation as the data; here these are the mean (5.616) and standard deviation (1.124) of the PAAU scores calculated earlier.
The top of the curve is cut off because the limits of the x- and y-axes are set to best fit the histogram.
Based on this plot, does it appear that the data follow a nearly normal distribution?
Evaluating the normal distribution
Eyeballing the shape of the histogram is one way to determine whether the data appear to be nearly normally distributed, but it can be frustrating to decide just how close the histogram is to the curve. An alternative approach is to construct a normal probability plot, also called a normal Q-Q plot, for "quantile-quantile".
qqnorm(data$PAAU)
qqline(data$PAAU, col="red")
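To make clear what `qqnorm` and `qqline` draw, here is a sketch that builds the same plot by hand: the sorted observations are plotted against standard normal quantiles, and the reference line passes through the first and third quartiles. It uses simulated scores as a stand-in for `data$PAAU` (an assumption), and the variable names are illustrative only.

```r
# Simulated stand-in for data$PAAU (assumption: NOT the real scores)
set.seed(1)
scores <- rnorm(500, mean = 5.6, sd = 1.1)

# Theoretical quantiles: standard normal quantiles at the plotting positions
theoretical <- qnorm(ppoints(length(scores)))
# Sample quantiles: simply the sorted observations
observed <- sort(scores)

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q plot built by hand")

# Reference line through the first and third quartiles (qqline's default)
qs        <- unname(quantile(scores, c(0.25, 0.75)))
slope     <- diff(qs) / diff(qnorm(c(0.25, 0.75)))
intercept <- qs[1] - slope * qnorm(0.25)
abline(intercept, slope, col = "red")
```

If the data are nearly normal, the points lie close to the red line; systematic curvature at the ends points to heavier or lighter tails than the normal distribution.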
set.seed(291114)
n <- 1000000
x <- rnorm(n, 5.616312, 1.123641)
hist(x, breaks=100, prob=TRUE, col="blue", main = "The same distribution with n=1000000", ylim=c(0,.37))
# lines(density(x), col="green")
ra<- seq(min(x), max(x),length=300)
lines(ra,dnorm(ra, mean=mean(x), sd=sd(x)), col="red", lwd=3 )
abline(v = 5.616312, col="grey", lwd=3)
abline(v = 5.616312-1.123641, col="green", lwd=.8)
abline(v = 5.616312+1.123641, col="green", lwd=.8)
text(5.616312+.2,0.37, expression(mu))
text(5.616312-1.123641 +.2,0.37, expression(mu-sigma))
text(5.616312+1.123641 +.2,0.37, expression(mu+sigma))
abline(v = (5.616312+1.96*1.123641), col="orange", lwd=3, lty=3)
abline(v = (5.616312-1.96*1.123641), col="orange", lwd=3, lty=3)
text(5.616312- 1.96*1.123641 +.2,0.34, expression(mu-1.96*sigma), cex=.8)
text(5.616312+1.96*1.123641 +.2,0.34, expression(mu+1.96*sigma),cex=.8)
The interval \(\mu \pm 1.96\sigma\) contains 95% of the values of the variable (\(\mu \pm 2\sigma\) contains about 95.45%); to capture 99%, the interval \(\mu \pm 2.58\sigma\) is needed.
The important result here is that 95% of values from a normal distribution fall within 1.96 standard deviations of the mean. We can also show that 99% of values from a normal distribution fall within 2.58 standard deviations of the mean. We will use both of these results later in the course.
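These coverage figures can be checked directly with `pnorm` and `qnorm`. The sketch below uses the standard normal distribution, since the share of values within k standard deviations of the mean is the same for every normal distribution. The helper name `within_k` is hypothetical.

```r
# Probability of falling within k standard deviations of the mean:
# P(|Z| <= k) = pnorm(k) - pnorm(-k), with Z standard normal
within_k <- function(k) pnorm(k) - pnorm(-k)

within_k(1.96)  # about 0.950
within_k(2)     # about 0.954
within_k(2.58)  # about 0.990

# Going the other way: the multiplier that captures a given central share
qnorm(0.975)    # about 1.96  (central 95%)
qnorm(0.995)    # about 2.576 (central 99%)
```

The multipliers come from the upper tail: a central 95% leaves 2.5% in each tail, hence `qnorm(0.975)`.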