Capítol 4 Teoria Setmana 4
4.1 Packages
install.packages(texreg)
install.packages(lmtest)
install.packages(sandwich)
4.2 Salaries
El conjunt de dades Salary informa d’una mostra de salaris de professors universitaris recopilats durant l’any acadèmic 2008-2009 a USA. Addicionalment als salaris en dòlars, les dades inclouen les següents 5 variables: gènere, anys des del Doctorat, anys de servei, disciplina (teoria (1) o aplicat (2)) i rang acadèmic.
Hi ha una consulta sobre la possible discriminació salarial per gènere.
library(car)
names(Salaries)## [1] "rank" "discipline" "yrs.since.phd" "yrs.service" "sex" "salary"
dim(Salaries)## [1] 397 6
4.2.1 Anàlisis univariant
suppressPackageStartupMessages(library(tidyverse))
Salaries %>%
ggplot(aes(x=salary)) +
geom_density()+
labs(title="Densitat Salaris")
Salaries %>%
ggplot(aes(x=log(salary))) +
geom_density()+
labs(title="Densitat Log-Salaris")
Salaries %>%
ggplot(aes(x=sex)) +
geom_bar()
Salaries %>%
ggplot(aes(x=rank)) +
geom_bar()
Salaries %>%
ggplot(aes(x=discipline)) +
geom_bar()
Salaries %>%
ggplot(aes(x=yrs.since.phd)) +
geom_density()
Salaries %>%
ggplot(aes(x=log(yrs.since.phd))) +
geom_density()
Salaries %>%
ggplot(aes(x=yrs.service)) +
geom_density()
Salaries %>%
ggplot(aes(x=log(yrs.service))) +
geom_density()## Warning: Removed 11 rows containing non-finite values (stat_density).

4.2.2 Anàlisis bivariant
Salaries <- Salaries %>%
mutate(log_salary = log(salary)
,log_yrs.since.phd = log(yrs.since.phd)
,yrs.since.phd2 = yrs.since.phd**2
,log_yrs.service = log(yrs.service+1)
,yrs.service2 = yrs.service**2
)
Salaries %>% select(salary,log_salary,yrs.since.phd,log_yrs.since.phd,yrs.since.phd2) %>% plot()
Salaries %>% select(salary,log_salary,yrs.since.phd,log_yrs.since.phd,yrs.since.phd2) %>% cor() %>% round(3)## salary log_salary yrs.since.phd log_yrs.since.phd yrs.since.phd2
## salary 1.000 0.988 0.419 0.496 0.305
## log_salary 0.988 1.000 0.426 0.520 0.297
## yrs.since.phd 0.419 0.426 1.000 0.915 0.963
## log_yrs.since.phd 0.496 0.520 0.915 1.000 0.790
## yrs.since.phd2 0.305 0.297 0.963 0.790 1.000
Salaries %>%
ggplot(aes(x=sex,y=salary)) +
geom_boxplot()
Salaries %>%
ggplot(aes(x=sex,y=log(salary))) +
geom_boxplot()
reg1 <- lm(salary~sex, data=Salaries)
summary(reg1)##
## Call:
## lm(formula = salary ~ sex, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57290 -23502 -6828 19710 116455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101002 4809 21.001 < 2e-16 ***
## sexMale 14088 5065 2.782 0.00567 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30030 on 395 degrees of freedom
## Multiple R-squared: 0.01921, Adjusted R-squared: 0.01673
## F-statistic: 7.738 on 1 and 395 DF, p-value: 0.005667
Observem que els salaris en homes són $14.088 superiors que en les dones. Això ens diu que, quan comparem els salaris de forma directe, entre homes i dones, amb un nivell de confiança del 95%, hi ha diferències significatives.
library(stargazer)
logreg1 <- lm(log(salary)~sex, data=Salaries)
stargazer(reg1,logreg1, type="text")##
## ===========================================================
## Dependent variable:
## ----------------------------
## salary log(salary)
## (1) (2)
## -----------------------------------------------------------
## sexMale 14,088.010*** 0.129***
## (5,064.579) (0.043)
##
## Constant 101,002.400*** 11.491***
## (4,809.386) (0.041)
##
## -----------------------------------------------------------
## Observations 397 397
## R2 0.019 0.022
## Adjusted R2 0.017 0.019
## Residual Std. Error (df = 395) 30,034.610 0.258
## F Statistic (df = 1; 395) 7.738*** 8.810***
## ===========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
La comparativa en termes de logaritme neperià del salari, mostra igualment diferencies significatives amb un 95% de confiança. El coeficient 0,129 del gènere, ens diu que els homes, tenen un salari un 12,9% (aproximadament) per sobre de les dones.
Incorporem ara nous factor.
logreg2 <- lm(log(salary) ~ sex + yrs.since.phd, data= Salaries)
stargazer(logreg1, logreg2, type="text")##
## ==================================================================
## Dependent variable:
## ----------------------------------------------
## log(salary)
## (1) (2)
## ------------------------------------------------------------------
## sexMale 0.129*** 0.075*
## (0.043) (0.040)
##
## yrs.since.phd 0.008***
## (0.001)
##
## Constant 11.491*** 11.353***
## (0.041) (0.041)
##
## ------------------------------------------------------------------
## Observations 397 397
## R2 0.022 0.188
## Adjusted R2 0.019 0.184
## Residual Std. Error 0.258 (df = 395) 0.235 (df = 394)
## F Statistic 8.810*** (df = 1; 395) 45.715*** (df = 2; 394)
## ==================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Observem que la variable “anys des del doctorat” guanya molt pes en la regressió, i produeix que la variables gènere informi que els homes guanyen un 7,5% més les dones (a igual nombre de anys des del doctorat), essent ara, aquesta relació, no significativa amb un nivell del confiança del 95%.
En aquest cas, el coeficient 0.008 ens diu que per cada any addicional, hi ha un augment del 0,8% del salari.
Salaries %>%
ggplot(aes(x=yrs.since.phd, y=log(salary), color=sex))+
geom_point()
4.2.3 Realció log-log
loglogreg2 <- lm(log(salary) ~ sex + log(yrs.since.phd), data= Salaries)
stargazer(logreg1, logreg2, loglogreg2, type="text")##
## ==========================================================================================
## Dependent variable:
## ----------------------------------------------------------------------
## log(salary)
## (1) (2) (3)
## ------------------------------------------------------------------------------------------
## sexMale 0.129*** 0.075* 0.076**
## (0.043) (0.040) (0.038)
##
## yrs.since.phd 0.008***
## (0.001)
##
## log(yrs.since.phd) 0.167***
## (0.014)
##
## Constant 11.491*** 11.353*** 11.061***
## (0.041) (0.041) (0.051)
##
## ------------------------------------------------------------------------------------------
## Observations 397 397 397
## R2 0.022 0.188 0.278
## Adjusted R2 0.019 0.184 0.274
## Residual Std. Error 0.258 (df = 395) 0.235 (df = 394) 0.222 (df = 394)
## F Statistic 8.810*** (df = 1; 395) 45.715*** (df = 2; 394) 75.904*** (df = 2; 394)
## ==========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Si agafem els anys d’experiència en termes de logaritme observem 3 coses:
- El coeficient 0,167 indica que per cada augment del 100% (duplicar) els anys des del doctorat, el salari augmenta un 16,7%.
- Per altra banda la variable gènere, passa a ser significativa amb un 95% de confiança.
- La regressió passa a explicar del 18,8% a un 27,8% de la variabilitat del logaritme dels salaris.
Salaries %>%
ggplot(aes(x=log(yrs.since.phd), y=log(salary), color=sex))+
geom_point()
4.2.4 Realció x**2
logreg22 <- lm(log(salary) ~ sex + yrs.since.phd +yrs.since.phd2, data= Salaries)
stargazer(logreg1, logreg2, loglogreg2, logreg22, type="text")##
## ==================================================================================================================
## Dependent variable:
## ----------------------------------------------------------------------------------------------
## log(salary)
## (1) (2) (3) (4)
## ------------------------------------------------------------------------------------------------------------------
## sexMale 0.129*** 0.075* 0.076** 0.087**
## (0.043) (0.040) (0.038) (0.035)
##
## yrs.since.phd 0.008*** 0.039***
## (0.001) (0.003)
##
## log(yrs.since.phd) 0.167***
## (0.014)
##
## yrs.since.phd2 -0.001***
## (0.0001)
##
## Constant 11.491*** 11.353*** 11.061*** 11.081***
## (0.041) (0.041) (0.051) (0.044)
##
## ------------------------------------------------------------------------------------------------------------------
## Observations 397 397 397 397
## R2 0.022 0.188 0.278 0.366
## Adjusted R2 0.019 0.184 0.274 0.361
## Residual Std. Error 0.258 (df = 395) 0.235 (df = 394) 0.222 (df = 394) 0.208 (df = 393)
## F Statistic 8.810*** (df = 1; 395) 45.715*** (df = 2; 394) 75.904*** (df = 2; 394) 75.720*** (df = 3; 393)
## ==================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Ara no hi ha una interpretació directa. Veiem que és una realció significativa al 95% de confiança en tots els factors.
Salaries %>%
ggplot(aes(x=yrs.since.phd, y=log(salary), color=sex))+
stat_smooth(method = "lm", formula = y ~ x + I(x^2), size = 1)+
geom_point()
4.2.5 Interaccions
Anem a analitzar la interacció entre gènere i anys des del doctorat.
logreg3 <- lm(log(salary) ~ sex * yrs.since.phd , data= Salaries)
stargazer(logreg1, logreg2, logreg3, type="text")##
## ============================================================================================
## Dependent variable:
## ----------------------------------------------------------------------
## log(salary)
## (1) (2) (3)
## --------------------------------------------------------------------------------------------
## sexMale 0.129*** 0.075* 0.210***
## (0.043) (0.040) (0.078)
##
## yrs.since.phd 0.008*** 0.016***
## (0.001) (0.004)
##
## sexMale:yrs.since.phd -0.008**
## (0.004)
##
## Constant 11.491*** 11.353*** 11.229***
## (0.041) (0.041) (0.074)
##
## --------------------------------------------------------------------------------------------
## Observations 397 397 397
## R2 0.022 0.188 0.197
## Adjusted R2 0.019 0.184 0.190
## Residual Std. Error 0.258 (df = 395) 0.235 (df = 394) 0.234 (df = 393)
## F Statistic 8.810*** (df = 1; 395) 45.715*** (df = 2; 394) 32.040*** (df = 3; 393)
## ============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
logreg4 <- lm(log(salary) ~ sex * yrs.since.phd + rank + discipline , data= Salaries)
summary(logreg4)##
## Call:
## lm(formula = log(salary) ~ sex * yrs.since.phd + rank + discipline,
## data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.68720 -0.11114 -0.00746 0.09789 0.57698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1613950 0.0593982 187.908 < 2e-16 ***
## sexMale 0.0627753 0.0617296 1.017 0.310
## yrs.since.phd 0.0006863 0.0032390 0.212 0.832
## rankAssocProf 0.1536658 0.0337357 4.555 7.02e-06 ***
## rankProf 0.4550346 0.0344904 13.193 < 2e-16 ***
## disciplineB 0.1279222 0.0188376 6.791 4.17e-11 ***
## sexMale:yrs.since.phd -0.0012207 0.0031494 -0.388 0.699
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1819 on 390 degrees of freedom
## Multiple R-squared: 0.5185, Adjusted R-squared: 0.5111
## F-statistic: 70 on 6 and 390 DF, p-value: < 2.2e-16
logreg5 <- lm(log(salary) ~ sex + yrs.since.phd + yrs.since.phd2 + rank + discipline , data= Salaries)
summary(logreg5)##
## Call:
## lm(formula = log(salary) ~ sex + yrs.since.phd + yrs.since.phd2 +
## rank + discipline, data = Salaries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62180 -0.10560 -0.00586 0.10151 0.59825
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.111e+01 4.262e-02 260.719 < 2e-16 ***
## sexMale 5.020e-02 3.088e-02 1.626 0.10481
## yrs.since.phd 1.277e-02 4.522e-03 2.824 0.00498 **
## yrs.since.phd2 -2.290e-04 7.604e-05 -3.012 0.00277 **
## rankAssocProf 8.627e-02 4.025e-02 2.143 0.03270 *
## rankProf 3.481e-01 4.945e-02 7.038 8.81e-12 ***
## disciplineB 1.304e-01 1.864e-02 6.997 1.15e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1799 on 390 degrees of freedom
## Multiple R-squared: 0.5293, Adjusted R-squared: 0.522
## F-statistic: 73.09 on 6 and 390 DF, p-value: < 2.2e-16
avPlots(logreg5)
vif(logreg5)## GVIF Df GVIF^(1/(2*Df))
## sex 1.036197 1 1.017938
## yrs.since.phd 41.568673 1 6.447377
## yrs.since.phd2 29.791275 1 5.458138
## rank 4.291107 2 1.439270
## discipline 1.057661 1 1.028426
plot(logreg5,1)
4.2.6 Estimadors robustos
- HC0 = es el original de White (Wooldridge 2016)
- HC1 = Es el que usa el software de Stata
- HC3 =Es el más conservador, por lo tanto, es muy recomendable
library(texreg)## Version: 1.37.5
## Date: 2020-06-17
## Author: Philip Leifeld (University of Essex)
##
## Consider submitting praise using the praise or praise_interactive functions.
## Please cite the JSS article in your publications -- see citation("texreg").
##
## Attaching package: 'texreg'
## The following object is masked from 'package:tidyr':
##
## extract
library(lmtest)## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
logreg5_robust_3 <- coeftest(logreg5, vcov = vcovHC(logreg5, "HC3"))
logreg5_robust_1 <- coeftest(logreg5, vcov = vcovHC(logreg5, "HC1"))
logreg5_robust_0 <- coeftest(logreg5, vcov = vcovHC(logreg5, "HC0"))
models_robust <- list(logreg5, logreg5_robust_0,
logreg5_robust_1, logreg5_robust_3)
screenreg(models_robust,
custom.model.names = c("sin ES robustos",
"robustos HC0", "robustos HC1", "robustos HC3"))##
## =========================================================================
## sin ES robustos robustos HC0 robustos HC1 robustos HC3
## -------------------------------------------------------------------------
## (Intercept) 11.11 *** 11.11 *** 11.11 *** 11.11 ***
## (0.04) (0.04) (0.04) (0.04)
## sexMale 0.05 0.05 * 0.05 * 0.05 *
## (0.03) (0.02) (0.02) (0.02)
## yrs.since.phd 0.01 ** 0.01 * 0.01 * 0.01 *
## (0.00) (0.01) (0.01) (0.01)
## yrs.since.phd2 -0.00 ** -0.00 * -0.00 * -0.00 *
## (0.00) (0.00) (0.00) (0.00)
## rankAssocProf 0.09 * 0.09 ** 0.09 ** 0.09 **
## (0.04) (0.03) (0.03) (0.03)
## rankProf 0.35 *** 0.35 *** 0.35 *** 0.35 ***
## (0.05) (0.04) (0.04) (0.04)
## disciplineB 0.13 *** 0.13 *** 0.13 *** 0.13 ***
## (0.02) (0.02) (0.02) (0.02)
## -------------------------------------------------------------------------
## R^2 0.53
## Adj. R^2 0.52
## Num. obs. 397
## =========================================================================
## *** p < 0.001; ** p < 0.01; * p < 0.05
## Normalidad
plot(logreg5,2)
Outlyers
plot(logreg5,5)