Capítol 1 Teoria Setmana 1

Discussió sobre:

  • Logística del curs:
    • Teoria i seminaris: Horaris
    • Canals: Aula global, Web de curs
  • El PDA
  • Avaluació: Participació seminaris (15%), Treball (35%), Examen Final (50%)
  • Disseny del curs: Conceptes, dades, programari
    • Eina de treball: R

1.1 Avaluació

1.1.1 El treball

Consulta la Guía del Treball de fi de curs.

Consulta a l’Aula global els Exemples de treballs d’anys anteriors.

1.1.2 Exàmens

Consulta a l’Aula global els Exàmens d’anys anteriors.

1.1.3 Participació

A l’aula d’informàtica es resoldran deures a classe.

A la secció de la Llista de deures 14, es publicaràn fins a 3 problemes que s’hauràn d’entregar de forma individual i es resoldrà a classe.

1.3 Introducció conceptes Estadística

Introducció a l’assignatura. Llibres, la nostra calculadora de butxaca. Interacció entre Teoria (preguntes), Dades, Estadística (conceptes, no fórmules). Un exemple, per començar a rodar.

1.4 Gender gap, at birth

Question: Is there a gender birth rate?. At birth, probability of a boy equal to the probability of a girl?

What CIA says on this issue? (Wikipedia)

CIA estimates that the current world wide sex ratio at birth is 107 boys to 100 girls, 107/207= 0.5169082

1.5 With Arbuthnott’s data, collected on years 1629 to 1710

John Graunt was the first person to compile data that showed an excess of male births over female births. He also noticed spatial and temporal variation in the sex ratio, but the variation in his data is not significant. John Arbuthnott was the first person to demonstrate that the excess of male births is statistically significant.

John Arbuthnot (1710) used these time series data on the ratios of male to female christenings in London from 1629-1710 to carry out the first known significance test, comparing observed data to a null hypothesis.

Arbuthnot, J. 1710. An argument for divine providence. Philosophical Transactions 27:186-190.

dat<- read.table("http://84.89.132.1/~satorra/dades/arbuthnot.txt", header=TRUE)
head(dat)
##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820
dim(dat)
## [1] 82  3
names(dat)
## [1] "year"  "boys"  "girls"
attach(dat)
mean(boys)
## [1] 5907.098
mean(girls)
## [1] 5534.646
mean(boys/girls)
## [1] 1.070748
total<- boys+girls
plot(year,total, col="blue", main="total birth along the years")
mean(boys/(boys+girls))
## [1] 0.5169751
require(stats) # for lowess, rpois, rnorm
lines(year,lowess(total,f=1/6)$y, lty=3, lwd=4)

boxplot(boys, girls, names=c("boys","girls"), col=c("blue","magenta"))

boysrate<-  boys/(boys+girls)
plot(year, boysrate, ylim = c(.49,.53), col="blue")
abline(h=0.5,  col="red", lwd=3, lty=3)

t<-   (mean(boysrate) - 0.5)/(sd(boysrate)/sqrt(length(boysrate)))
t
## [1] 21.29243
t.test(boysrate, mu=0.5)
## 
##  One Sample t-test
## 
## data:  boysrate
## t = 21.292, df = 81, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  0.5153889 0.5185614
## sample estimates:
## mean of x 
## 0.5169751
plot(year, boysrate, ylim = c(.49,.53), col="blue", main="Boysrate values with 95% confidence bound (green lines) ")
abline(h=0.5,  col="red", lwd=3, lty=3)
# Intervals
abline(h=mean(boysrate),  col="blue", lwd=.8)
lines(year, rep(0.5153889, length(year)) , lty = 'dashed', col = 'green', lwd=3)
lines(year, rep(0.5186,  length(year)) , lty = 'dashed', col = 'green', lwd=3)
# add fill
n<- length(year)
polygon(c(rev(year), year), c( rev(rep(0.5153889, n )), rep(0.5186, n )),  border = NA)

1.5.1 Now with recent USA’s data, present day birth records in the United States

source("http://www.openintro.org/stat/data/present.R")
head(present)
##   year    boys   girls
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869
dim(present)
## [1] 63  3
mean(present$boys/(present$boys+present$girls))
## [1] 0.512516
boysrate<-  present$boys/(present$boys+present$girls)
t.test(boysrate, mu=0.5)
## 
##  One Sample t-test
## 
## data:  boysrate
## t = 147, df = 62, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  0.5123458 0.5126862
## sample estimates:
## mean of x 
##  0.512516

1.6 Distribució d’una variable, distribució normal, altres distribucions

Distribució d’una variable quantitativa (variable numérica): anàlisi de variància.

library(foreign)
data= read.spss( "http://84.89.132.1/~satorra/dades/PAISOS.SAV", use.value.labels = TRUE, to.data.frame = TRUE )  
attach(data)   
names(data)
##  [1] "IDH"      "NIVELL"   "PAIS"     "ESPVIDA"  "PIB"      "ALFAB"    "CONT"     "CALORIES" "HABMETG"  "DIARIS"   "TV"      
## [12] "SANITAT"  "AGRICULT" "INDUST"
ESPVIDA
##   [1] 46.4 52.1 47.5 39.0 50.7 53.5 44.9 50.2 55.6 43.5 47.5 56.5 45.6 47.3 51.0 46.5 60.4 46.0 47.4 65.2 50.4 55.7 48.0 66.7 48.9
##  [26] 45.0 65.5 55.0 47.6 49.4 61.5 56.0 68.5 44.5 56.0 51.5 71.9 67.7 53.7 60.5 63.6 51.0 62.7 62.1 70.4 59.4 49.3 66.3 56.0 64.7
##  [51] 67.6 55.8 64.8 63.3 69.6 57.5 68.8 51.3 67.9 69.9 66.4 65.2 70.3 69.3 66.0 73.6 71.2 70.0 58.8 67.8 69.0 67.1 71.1 76.3 66.5
##  [76] 70.9 70.0 71.5 67.5 71.5 73.6 64.9 72.8 76.0 71.3 73.8 70.2 66.3 67.6 62.9 70.8 70.5 71.7 69.0 72.5 70.8 71.6 67.5 53.5 70.0
## [101] 72.0 72.1 75.6 69.6 71.1 77.6 74.6 69.7 71.6 77.0 73.1 75.5 75.3 76.5 77.6 78.6 70.5 74.8 77.6 76.2 74.9 77.5 77.4 77.4 76.4
## [126] 76.9 73.8 75.7 76.2 76.0 76.0 78.2 76.9 75.3 78.2 79.5 75.7 78.0 72.0 76.1 68.5 66.0 63.1 67.1 75.3 63.7 71.1 43.5 50.2 65.2
## [151] 56.6 57.6 52.0 53.0 46.5 51.6 55.4 47.0 74.2 48.3
summary(ESPVIDA)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39.00   55.67   67.60   64.50   72.58   79.50
hist(ESPVIDA,  col="blue")

boxplot(ESPVIDA, col="blue")

1.6.1 Valor estandarditzat d’una variable

# variable centrads
head(ESPVIDA - mean(ESPVIDA))
## [1] -18.10437 -12.40437 -17.00437 -25.50437 -13.80437 -11.00437
# variable estandarditzada 
head((ESPVIDA - mean(ESPVIDA))/sd(ESPVIDA))
## [1] -1.715521 -1.175405 -1.611288 -2.416725 -1.308065 -1.042744
# Amb R
head(scale(ESPVIDA))
##           [,1]
## [1,] -1.715521
## [2,] -1.175405
## [3,] -1.611288
## [4,] -2.416725
## [5,] -1.308065
## [6,] -1.042744

1.6.2 Distribució Normal

We now investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll describe this distribution and assess the normality of our data.

Bitllet de 10 marcs: Carl Friedrich Gauss (1777 - Gotinga;1855)

1.6.2.1 Distribució de notes d’estudiants

Veure exemple enllaç.

1.6.3 Exemple amb la variable diners en butxaca (estudiants de MQ III)

Dades_enq_paper (diners en butxaca, genere):

datadiners<-read.table("http://84.89.132.1/~satorra/dades/ECP2019diners.txt", header=TRUE)

names(datadiners)
## [1] "diners" "genere"
attach(datadiners) 
summary(diners)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   7.995  16.249  21.625 100.000
boxplot(diners)

hist(diners)

boxplot(diners ~ genere, col=c("blue","violet"))

aggregate(diners, list(genere), mean) 
##   Group.1        x
## 1       0 18.30667
## 2       1 13.30833

Diners en butxaca: distribució normal?

qqnorm(datadiners$diners)
qqline(datadiners$diners ,col="red", lwd=3)

1.6.3.1 Log de diners en butxaca: distribució normal?

qqnorm(log(datadiners$diners+ .1))
qqline(log(datadiners$diners +.1),col="red", lwd=3)

1.6.3.2 log de diners en butxaca (no zero): distribució normal?

ind<- datadiners$diners > 0
qqnorm(log(datadiners$diners[ind]))
qqline(log(datadiners$diners[ind]),col="red", lwd=3)

1.6.4 Altres distribucions univariants

Distribucions univariants en R

1.6.4.1 Variables quantitatives i qualitatives

Dades Enquesta Socrative

d<-matrix(scan("http://84.89.132.1/~satorra/dades/ECP2019enquestasocrativeG1.txt", what="character",skip=1),ncol=20,byrow=TRUE);
data<-d[-1,];
colnames(data)<-d[1,];
table(data[,4]); 
## 
##  av  cd mai 
##  12  13   7
round(100*prop.table(table(data[,4], data[,1]),2),1);  
##      
##          0    1
##   av  42.1 30.8
##   cd  36.8 46.2
##   mai 21.1 23.1
pie(table(data[,4]), main="fer-se el llit")

## interess assignatura
aggregate(as.numeric(data[,7]), list(data[,4]), mean)
##   Group.1        x
## 1      av 6.625000
## 2      cd 5.923077
## 3     mai 4.571429
## tirada de dues monedes
table(data[,14])
## 
##  0  1  2  3 
##  2 11 15  4
round(prop.table(table(data[,14])),2)
## 
##    0    1    2    3 
## 0.06 0.34 0.47 0.12