Capítol 1 Teoria Setmana 1

Discussió sobre:

Logística del curs:
- Teoria i seminaris: Horaris
- Canals: Aula global, Web de curs
El PDA
Avaluació: Participació seminaris (15%), Treball (35%), Examen Final (50%)
Disseny del curs: Conceptes, dades, programari
- Eina de treball: R

1.1 Avaluació

1.1.1 El treball

Consulta la Guía del Treball de fi de curs.

Consulta a l’Aula global els Exemples de treballs d’anys anteriors.

1.1.2 Exàmens

Consulta a l’Aula global els Exàmens d’anys anteriors.

1.1.3 Participació

A l’aula d’informàtica es resoldran deures a classe.

A la secció de la Llista de deures 14, es publicaràn fins a 3 problemes que s’hauràn d’entregar de forma individual i es resoldrà a classe.

1.2 Bibliografia

Mètodes aplicats a Ciències polítiques

AnalizaR Datos Políticos. Bookdown. 2020.
Quantitative Politics with R. Bookdown. November 2019.
Lab Guide to Quantitative Research Methods in Political Science, Public Policy & Public Administration. Bookdown
Quantitative Research Methods for Political Science, Public Policy and Public Administration: 4th Edition With Applications in R. Bookdown

General R y Estadística

R Markdown: The Definitive Guide. Chapman & Hall/CRC; 2020.
R Graphics Cookbook, O’Reilly Media, Inc. 2nd ed.; 2020.
Data Visualization with R. 2018.
R for Data Science. O’Reilly; 2017. (Castellano)
An Introduction to Statistical Learning with applications in R. Springer; 2017.

1.3 Introducció conceptes Estadística

Introducció a l’assignatura. Llibres, la nostra calculadora de butxaca. Interacció entre Teoria (preguntes), Dades, Estadística (conceptes, no fórmules). Un exemple, per començar a rodar.

1.4 Gender gap, at birth

Question: Is there a gender birth rate?. At birth, probability of a boy equal to the probability of a girl?

What CIA says on this issue? (Wikipedia)

CIA estimates that the current world wide sex ratio at birth is 107 boys to 100 girls, 107/207= 0.5169082

1.5 With Arbuthnott’s data, collected on years 1629 to 1710

John Graunt was the first person to compile data that showed an excess of male births over female births. He also noticed spatial and temporal variation in the sex ratio, but the variation in his data is not significant. John Arbuthnott was the first person to demonstrate that the excess of male births is statistically significant.

John Arbuthnot (1710) used these time series data on the ratios of male to female christenings in London from 1629-1710 to carry out the first known significance test, comparing observed data to a null hypothesis.

Arbuthnot, J. 1710. An argument for divine providence. Philosophical Transactions 27:186-190.

dat<- read.table("http://84.89.132.1/~satorra/dades/arbuthnot.txt", header=TRUE)
head(dat)

##   year boys girls
## 1 1629 5218  4683
## 2 1630 4858  4457
## 3 1631 4422  4102
## 4 1632 4994  4590
## 5 1633 5158  4839
## 6 1634 5035  4820

dim(dat)

## [1] 82  3

names(dat)

## [1] "year"  "boys"  "girls"

attach(dat)
mean(boys)

## [1] 5907.098

mean(girls)

## [1] 5534.646

mean(boys/girls)

## [1] 1.070748

total<- boys+girls
plot(year,total, col="blue", main="total birth along the years")
mean(boys/(boys+girls))

## [1] 0.5169751

require(stats) # for lowess, rpois, rnorm
lines(year,lowess(total,f=1/6)$y, lty=3, lwd=4)

boxplot(boys, girls, names=c("boys","girls"), col=c("blue","magenta"))

boysrate<-  boys/(boys+girls)
plot(year, boysrate, ylim = c(.49,.53), col="blue")
abline(h=0.5,  col="red", lwd=3, lty=3)

t<-   (mean(boysrate) - 0.5)/(sd(boysrate)/sqrt(length(boysrate)))
t

## [1] 21.29243

t.test(boysrate, mu=0.5)

## 
##  One Sample t-test
## 
## data:  boysrate
## t = 21.292, df = 81, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  0.5153889 0.5185614
## sample estimates:
## mean of x 
## 0.5169751

plot(year, boysrate, ylim = c(.49,.53), col="blue", main="Boysrate values with 95% confidence bound (green lines) ")
abline(h=0.5,  col="red", lwd=3, lty=3)
# Intervals
abline(h=mean(boysrate),  col="blue", lwd=.8)
lines(year, rep(0.5153889, length(year)) , lty = 'dashed', col = 'green', lwd=3)
lines(year, rep(0.5186,  length(year)) , lty = 'dashed', col = 'green', lwd=3)
# add fill
n<- length(year)
polygon(c(rev(year), year), c( rev(rep(0.5153889, n )), rep(0.5186, n )),  border = NA)

1.5.1 Now with recent USA’s data, present day birth records in the United States

source("http://www.openintro.org/stat/data/present.R")
head(present)

##   year    boys   girls
## 1 1940 1211684 1148715
## 2 1941 1289734 1223693
## 3 1942 1444365 1364631
## 4 1943 1508959 1427901
## 5 1944 1435301 1359499
## 6 1945 1404587 1330869

dim(present)

## [1] 63  3

mean(present$boys/(present$boys+present$girls))

## [1] 0.512516

boysrate<-  present$boys/(present$boys+present$girls)
t.test(boysrate, mu=0.5)

## 
##  One Sample t-test
## 
## data:  boysrate
## t = 147, df = 62, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0.5
## 95 percent confidence interval:
##  0.5123458 0.5126862
## sample estimates:
## mean of x 
##  0.512516

1.6 Distribució d’una variable, distribució normal, altres distribucions

Distribució d’una variable quantitativa (variable numérica): anàlisi de variància.

library(foreign)
data= read.spss( "http://84.89.132.1/~satorra/dades/PAISOS.SAV", use.value.labels = TRUE, to.data.frame = TRUE )  
attach(data)   
names(data)

##  [1] "IDH"      "NIVELL"   "PAIS"     "ESPVIDA"  "PIB"      "ALFAB"    "CONT"     "CALORIES" "HABMETG"  "DIARIS"   "TV"      
## [12] "SANITAT"  "AGRICULT" "INDUST"

ESPVIDA

##   [1] 46.4 52.1 47.5 39.0 50.7 53.5 44.9 50.2 55.6 43.5 47.5 56.5 45.6 47.3 51.0 46.5 60.4 46.0 47.4 65.2 50.4 55.7 48.0 66.7 48.9
##  [26] 45.0 65.5 55.0 47.6 49.4 61.5 56.0 68.5 44.5 56.0 51.5 71.9 67.7 53.7 60.5 63.6 51.0 62.7 62.1 70.4 59.4 49.3 66.3 56.0 64.7
##  [51] 67.6 55.8 64.8 63.3 69.6 57.5 68.8 51.3 67.9 69.9 66.4 65.2 70.3 69.3 66.0 73.6 71.2 70.0 58.8 67.8 69.0 67.1 71.1 76.3 66.5
##  [76] 70.9 70.0 71.5 67.5 71.5 73.6 64.9 72.8 76.0 71.3 73.8 70.2 66.3 67.6 62.9 70.8 70.5 71.7 69.0 72.5 70.8 71.6 67.5 53.5 70.0
## [101] 72.0 72.1 75.6 69.6 71.1 77.6 74.6 69.7 71.6 77.0 73.1 75.5 75.3 76.5 77.6 78.6 70.5 74.8 77.6 76.2 74.9 77.5 77.4 77.4 76.4
## [126] 76.9 73.8 75.7 76.2 76.0 76.0 78.2 76.9 75.3 78.2 79.5 75.7 78.0 72.0 76.1 68.5 66.0 63.1 67.1 75.3 63.7 71.1 43.5 50.2 65.2
## [151] 56.6 57.6 52.0 53.0 46.5 51.6 55.4 47.0 74.2 48.3

summary(ESPVIDA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39.00   55.67   67.60   64.50   72.58   79.50

hist(ESPVIDA,  col="blue")

boxplot(ESPVIDA, col="blue")

1.6.1 Valor estandarditzat d’una variable

# variable centrads
head(ESPVIDA - mean(ESPVIDA))

## [1] -18.10437 -12.40437 -17.00437 -25.50437 -13.80437 -11.00437

# variable estandarditzada 
head((ESPVIDA - mean(ESPVIDA))/sd(ESPVIDA))

## [1] -1.715521 -1.175405 -1.611288 -2.416725 -1.308065 -1.042744

# Amb R
head(scale(ESPVIDA))

##           [,1]
## [1,] -1.715521
## [2,] -1.175405
## [3,] -1.611288
## [4,] -2.416725
## [5,] -1.308065
## [6,] -1.042744

1.6.2 Distribució Normal

We now investigate the probability distribution that is most central to statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we’ll describe this distribution and assess the normality of our data.

Bitllet de 10 marcs: Carl Friedrich Gauss (1777 - Gotinga;1855)

1.6.2.1 Distribució de notes d’estudiants

Veure exemple enllaç.

1.6.3 Exemple amb la variable diners en butxaca (estudiants de MQ III)

Dades_enq_paper (diners en butxaca, genere):

datadiners<-read.table("http://84.89.132.1/~satorra/dades/ECP2019diners.txt", header=TRUE)

names(datadiners)

## [1] "diners" "genere"

attach(datadiners) 
summary(diners)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   7.995  16.249  21.625 100.000

boxplot(diners)

hist(diners)

boxplot(diners ~ genere, col=c("blue","violet"))

aggregate(diners, list(genere), mean)

##   Group.1        x
## 1       0 18.30667
## 2       1 13.30833

Diners en butxaca: distribució normal?

qqnorm(datadiners$diners)
qqline(datadiners$diners ,col="red", lwd=3)

1.6.3.1 Log de diners en butxaca: distribució normal?

qqnorm(log(datadiners$diners+ .1))
qqline(log(datadiners$diners +.1),col="red", lwd=3)

1.6.3.2 log de diners en butxaca (no zero): distribució normal?

ind<- datadiners$diners > 0
qqnorm(log(datadiners$diners[ind]))
qqline(log(datadiners$diners[ind]),col="red", lwd=3)

1.6.4 Altres distribucions univariants

Distribucions univariants en R

1.6.4.1 Variables quantitatives i qualitatives

Dades Enquesta Socrative

d<-matrix(scan("http://84.89.132.1/~satorra/dades/ECP2019enquestasocrativeG1.txt", what="character",skip=1),ncol=20,byrow=TRUE);
data<-d[-1,];
colnames(data)<-d[1,];
table(data[,4]);

## 
##  av  cd mai 
##  12  13   7

round(100*prop.table(table(data[,4], data[,1]),2),1);

##      
##          0    1
##   av  42.1 30.8
##   cd  36.8 46.2
##   mai 21.1 23.1

pie(table(data[,4]), main="fer-se el llit")

## interess assignatura
aggregate(as.numeric(data[,7]), list(data[,4]), mean)

##   Group.1        x
## 1      av 6.625000
## 2      cd 5.923077
## 3     mai 4.571429

## tirada de dues monedes
table(data[,14])

## 
##  0  1  2  3 
##  2 11 15  4

round(prop.table(table(data[,14])),2)

## 
##    0    1    2    3 
## 0.06 0.34 0.47 0.12