03/11, 2020

Variables fijas vs aleatorias

Variables fijas vs aleatorias

  • Fijas (continuas o categóricas)
    • Se espera que tengan una influencia predecible y sistemática en sobre lo que queremos explicar. Además usan todos los niveles de un factor (Ejemplo genero)
  • Aleatorias

    • Se espera que su influencia sea impredecible e idiosincratica. Además no se usan todos los niveles de un factor (todos los individuos) A + Error(B)

Ejemplo CO2

Ejemplo CO2

Plant Type Treatment conc uptake
Qn1 Quebec nonchilled 95 16.0
Qn1 Quebec nonchilled 175 30.4
Qn1 Quebec nonchilled 250 34.8
Qn1 Quebec nonchilled 350 37.2
Qn1 Quebec nonchilled 500 35.3
Qn1 Quebec nonchilled 675 39.2
Qn1 Quebec nonchilled 1000 39.7
Qn2 Quebec nonchilled 95 13.6
Qn2 Quebec nonchilled 175 27.3
Qn2 Quebec nonchilled 250 37.1
Qn2 Quebec nonchilled 350 41.8
Qn2 Quebec nonchilled 500 40.6
Qn2 Quebec nonchilled 675 41.4
Qn2 Quebec nonchilled 1000 44.3
Qn3 Quebec nonchilled 95 16.2
Qn3 Quebec nonchilled 175 32.4
Qn3 Quebec nonchilled 250 40.3
Qn3 Quebec nonchilled 350 42.1
Qn3 Quebec nonchilled 500 42.9
Qn3 Quebec nonchilled 675 43.9
Qn3 Quebec nonchilled 1000 45.5
Qc1 Quebec chilled 95 14.2
Qc1 Quebec chilled 175 24.1
Qc1 Quebec chilled 250 30.3
Qc1 Quebec chilled 350 34.6
Qc1 Quebec chilled 500 32.5
Qc1 Quebec chilled 675 35.4
Qc1 Quebec chilled 1000 38.7
Qc2 Quebec chilled 95 9.3
Qc2 Quebec chilled 175 27.3
Qc2 Quebec chilled 250 35.0
Qc2 Quebec chilled 350 38.8
Qc2 Quebec chilled 500 38.6
Qc2 Quebec chilled 675 37.5
Qc2 Quebec chilled 1000 42.4
Qc3 Quebec chilled 95 15.1
Qc3 Quebec chilled 175 21.0
Qc3 Quebec chilled 250 38.1
Qc3 Quebec chilled 350 34.0
Qc3 Quebec chilled 500 38.9
Qc3 Quebec chilled 675 39.6
Qc3 Quebec chilled 1000 41.4
Mn1 Mississippi nonchilled 95 10.6
Mn1 Mississippi nonchilled 175 19.2
Mn1 Mississippi nonchilled 250 26.2
Mn1 Mississippi nonchilled 350 30.0
Mn1 Mississippi nonchilled 500 30.9
Mn1 Mississippi nonchilled 675 32.4
Mn1 Mississippi nonchilled 1000 35.5
Mn2 Mississippi nonchilled 95 12.0
Mn2 Mississippi nonchilled 175 22.0
Mn2 Mississippi nonchilled 250 30.6
Mn2 Mississippi nonchilled 350 31.8
Mn2 Mississippi nonchilled 500 32.4
Mn2 Mississippi nonchilled 675 31.1
Mn2 Mississippi nonchilled 1000 31.5
Mn3 Mississippi nonchilled 95 11.3
Mn3 Mississippi nonchilled 175 19.4
Mn3 Mississippi nonchilled 250 25.8
Mn3 Mississippi nonchilled 350 27.9
Mn3 Mississippi nonchilled 500 28.5
Mn3 Mississippi nonchilled 675 28.1
Mn3 Mississippi nonchilled 1000 27.8
Mc1 Mississippi chilled 95 10.5
Mc1 Mississippi chilled 175 14.9
Mc1 Mississippi chilled 250 18.1
Mc1 Mississippi chilled 350 18.9
Mc1 Mississippi chilled 500 19.5
Mc1 Mississippi chilled 675 22.2
Mc1 Mississippi chilled 1000 21.9
Mc2 Mississippi chilled 95 7.7
Mc2 Mississippi chilled 175 11.4
Mc2 Mississippi chilled 250 12.3
Mc2 Mississippi chilled 350 13.0
Mc2 Mississippi chilled 500 12.5
Mc2 Mississippi chilled 675 13.7
Mc2 Mississippi chilled 1000 14.4
Mc3 Mississippi chilled 95 10.6
Mc3 Mississippi chilled 175 18.0
Mc3 Mississippi chilled 250 17.9
Mc3 Mississippi chilled 350 17.9
Mc3 Mississippi chilled 500 17.9
Mc3 Mississippi chilled 675 18.9
Mc3 Mississippi chilled 1000 19.9

Modelos

library(lme4)
mod1 <- lm(uptake ~ Type * Treatment + I(log(conc)) + conc, data = CO2)
mod2 <- lmer(uptake ~ Type * Treatment + I(log(conc)) + conc + (1 | Plant), data = CO2)
options(na.action = "na.fail")
Max.Vars <- floor(nrow(CO2)/10)
Seleccion <- dredge(mod2, m.lim = c(0, Max.Vars))

Selección

(Intercept) conc I(log(conc)) Treatment Type Treatment:Type df logLik AICc delta weight
-57.247 -0.025 17.789
8 -230.705 479.331 0.000 0.963
-55.608 -0.025 17.789
NA 7 -235.188 485.850 6.519 0.037
-59.037 -0.025 17.789 NA
NA 6 -241.972 497.036 17.705 0.000
-14.037 NA 8.484
7 -241.168 497.809 18.478 0.000
-12.398 NA 8.484
NA 6 -245.650 504.392 25.061 0.000
-61.937 -0.025 17.789
NA NA 6 -246.728 506.546 27.215 0.000
-65.367 -0.025 17.789 NA NA NA 5 -250.329 511.428 32.097 0.000
-15.827 NA 8.484 NA
NA 5 -252.435 515.638 36.307 0.000
-18.727 NA 8.484
NA NA 5 -257.190 525.149 45.818 0.000
-22.157 NA 8.484 NA NA NA 4 -260.792 530.089 50.759 0.000
27.621 0.018 NA
7 -267.614 550.702 71.371 0.000
29.260 0.018 NA
NA 6 -272.096 557.284 77.953 0.000
25.830 0.018 NA NA
NA 5 -278.881 568.530 89.200 0.000
22.930 0.018 NA
NA NA 5 -283.636 578.041 98.710 0.000
19.500 0.018 NA NA NA NA 4 -287.238 582.982 103.651 0.000
35.333 NA NA
6 -286.019 585.128 105.798 0.000
36.973 NA NA
NA 5 -289.930 590.630 111.299 0.000
33.543 NA NA NA
NA 4 -296.656 601.819 122.488 0.000
30.643 NA NA
NA NA 4 -301.412 611.330 131.999 0.000
27.213 NA NA NA NA NA 3 -305.013 616.327 136.996 0.000
  • Solo el primer modelo es seleccionado
  • Modelo

    BestModel <- get.models(Seleccion, 1)[[1]]
    broom.mixed::tidy(BestModel)
    effect group term estimate std.error statistic
    fixed NA (Intercept) -57.247 7.867 -7.277
    fixed NA conc -0.025 0.004 -6.075
    fixed NA I(log(conc)) 17.789 1.622 10.970
    fixed NA Treatmentchilled -3.581 1.835 -1.952
    fixed NA TypeMississippi -9.381 1.835 -5.112
    fixed NA Treatmentchilled:TypeMississippi -6.557 2.595 -2.527
    ran_pars Plant sd__(Intercept) 1.769 NA NA
    ran_pars Residual sd__Observation 3.667 NA NA

    Modelo

    Control mas detallado de dredge en MuMin

    Cemento

    • ¿Que problema tiene la siguiente base de datos para LMs o GLMs?
    • Ejemplo base de datos Cement de MuMIn

    Como agrego eso a MuMIn

    GlobalMod <- lm(Calorias ~ ., data = Cement2)
    cor(Cement2[, -1])
    CaAl SiCa3 Ca2AlFe Ca2Si
    CaAl 1.0000000 0.2285795 -0.8241338 -0.2454451
    SiCa3 0.2285795 1.0000000 -0.1392424 -0.9729550
    Ca2AlFe -0.8241338 -0.1392424 1.0000000 0.0295370
    Ca2Si -0.2454451 -0.9729550 0.0295370 1.0000000

    Como agrego eso a MuMIn (cont)

    nm <- colnames(Cement2[-1])
    smat <- abs(cor(Cement2[, -1])) <= 0.7
    smat[!lower.tri(smat)] <- NA
    CaAl SiCa3 Ca2AlFe Ca2Si
    CaAl NA NA NA NA
    SiCa3 TRUE NA NA NA
    Ca2AlFe FALSE TRUE NA NA
    Ca2Si TRUE FALSE TRUE NA

    Como agrego eso a MuMIn (cont)

    options(na.action = "na.fail")
    Selected <- dredge(GlobalMod, subset = smat)
    (Intercept) Ca2AlFe Ca2Si CaAl SiCa3 df logLik AICc delta weight
    52.58 NA NA 1.47 0.66 4 -28.16 69.31 0.00 0.84
    103.10 NA -0.61 1.44 NA 4 -29.82 72.63 3.32 0.16
    131.28 -1.20 -0.72 NA NA 4 -35.37 83.74 14.43 0.00
    72.07 -1.01 NA NA 0.73 4 -40.96 94.93 25.62 0.00
    117.57 NA -0.74 NA NA 3 -45.87 100.41 31.10 0.00
    57.42 NA NA NA 0.79 3 -46.04 100.74 31.42 0.00
    81.48 NA NA 1.87 NA 3 -48.21 105.08 35.77 0.00
    110.20 -1.26 NA NA NA 3 -50.98 110.63 41.31 0.00
    95.42 NA NA NA NA 2 -53.17 111.54 42.22 0.00

    Incluir además límite de numero de variables

    options(na.action = "na.fail")
    Selected <- dredge(GlobalMod, subset = smat, m.lim = c(0, floor(nrow(Cement)/10)))
    (Intercept) Ca2AlFe Ca2Si CaAl SiCa3 df logLik AICc delta weight
    117.57 NA -0.74 NA NA 3 -45.87 100.41 0.00 0.51
    57.42 NA NA NA 0.79 3 -46.04 100.74 0.33 0.43
    81.48 NA NA 1.87 NA 3 -48.21 105.08 4.67 0.05
    110.20 -1.26 NA NA NA 3 -50.98 110.63 10.22 0.00
    95.42 NA NA NA NA 2 -53.17 111.54 11.13 0.00

    Crossvalidation

    A veces no se pueden usar criterios de información

    • No se pueden calcular IC
    • No se cumplen supuestos de criterios de información
    • Comparacion entre distintos tipos de modelos (GLM, vs GAM vs RPART, etc)
    • Se aplican distintas transformaciones a la variable respuesta (GLM)
    • Varios terminos polinomiales (no se debe promediar modelos)

    Alternativas a los métodos de criterios de información

    • Simular que presentamos datos nuevos
      • Crossvalidation
      • Bootstrapping
      • Leave one out

    k-fold Crossvalidation

    • Divido aleatoreamente mi base de datos en \(k\) grupos
    • Entreno mis modelos con \(k-1\) grupos
    • Testeo con el grupo \(k_i\)
    • Promedio medida de desempeño por ejemplo \(R^2\)

    Volvamos al ejemplo de hp

    • Veamos como evaluamos 1 modelo \(mpg = \beta_1 hp +c\)
    • \(R^2\) = 0.6

    Paso 1 K-fold

    Divido my base en K

    • En este caso K = 5
      • Dividiremos nuestra base de datos en 5 partes iguales

    Paso 2 Entreno y testeo para cada K

    Fold 1

    • Rsq = 0.73

    Fold 2

    • Rsq = 0.73, 0.6

    Fold 3

    • Rsq = 0.73, 0.6, 0.85

    Fold 4

    • Rsq = 0.73, 0.6, 0.85, 0.5

    Fold 5

    • Rsq = 0.73, 0.6, 0.85, 0.5, 0.71
    • Rsq = 0.678

    Con Caret

    CV en caret

    • para 1 modelo
    set.seed(2020)
    ctrl <- trainControl(method = "cv", number = 5)
    Modelo <- train(mpg ~ hp, data = mtcars, method = "lm", trControl = ctrl)
    DF <- Modelo$resample
    DF <- DF %>% select(Rsquared, Resample)

    Continuado

    Rsquared Resample
    0.845 Fold1
    0.921 Fold2
    0.603 Fold3
    0.832 Fold4
    0.783 Fold5

    Seleccionando por crossvalidation

    form1 <- "mpg ~ hp"
    form2 <- "mpg ~ hp + I(hp^2)"
    form3 <- "mpg ~ hp + I(hp^2) + I(hp^3)"
    form4 <- "mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4)"
    form5 <- "mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5)"
    form6 <- "mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) + I(hp^6)"
    forms <- list(form1, form2, form3, form4, form5, form6)
    K = (2:7)
    ctrl <- trainControl(method = "cv", number = 5)

    Continuado

    set.seed(2020)
    Tests <- forms %>% map(~train(as.formula(.x), 
        data = mtcars, method = "lm", 
        trControl = ctrl)) %>% map(~as.data.frame(.x$resample)) %>% 
        map(~select(.x, Rsquared)) %>% 
        map(~summarise_all(.x, funs(mean, 
            sd), na.rm = T)) %>% map2(.y = forms, 
        ~mutate(.x, model = .y)) %>% 
        reduce(bind_rows) %>% mutate(K = K) %>% 
        arrange(desc(mean))

    Continuado

    mean sd model K
    0.808 0.116 mpg ~ hp + I(hp^2) + I(hp^3) 4
    0.797 0.119 mpg ~ hp 2
    0.786 0.128 mpg ~ hp + I(hp^2) 3
    0.705 0.326 mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) 6
    0.656 0.374 mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) 5
    0.618 0.302 mpg ~ hp + I(hp^2) + I(hp^3) + I(hp^4) + I(hp^5) + I(hp^6) 7

    Entender la hipótesis estadística a generar basado en la hipótesis biológica

    Pregunta 1

    • ¿Cual de estas relaciones es la que uno esperaría ver al finalizar el experimento?

    • Respuesta: A

    Pregunta 2

    • ¿Cuál de estos códigos resultaría en el modelo que generaría este gráfico?

    A <- glm(weight ~ Time + Diet, family = poisson)
    B <- glm(weight ~ Time * Diet, family = poisson)
    C <- glm(weight ~ Time + Time:Diet, family = poisson)

    Pregunta 2 (cont)

    library(lme4)
    ChickPoissMM1 <- glmer(weight ~ Time + Diet + (1 | Chick), family = poisson, 
        data = ChickWeight)
    ChickPoissMM2 <- glmer(weight ~ Time + Time:Diet + (1 | Chick), 
        family = poisson, data = ChickWeight)
    ChickPoissMM3 <- glmer(weight ~ Time + Time * Diet + (1 | Chick), 
        family = poisson, data = ChickWeight)

    Pregunta 3

    • ¿Cuanto pesaria un pollo en el dia 10 en la dieta 2 basado en el modelo anterior?
    effect group term estimate std.error statistic p.value
    fixed NA (Intercept) 3.84 0.03 121.74 0
    fixed NA Time 0.07 0.00 63.72 0
    fixed NA Time:Diet2 0.01 0.00 4.99 0
    fixed NA Time:Diet3 0.02 0.00 12.59 0
    fixed NA Time:Diet4 0.01 0.00 5.91 0
    ran_pars Chick sd__(Intercept) 0.21 NA NA NA
    • A = 94.6324083
    • B = 113.2955623
    • C = 4.73

    Pregunta 3 (cont)

    exp(3.83 + 10 * 0.07 + 0.02)
    ## [1] 94.63241
    exp(3.83 + 10 * 0.07 + (10 * 0.02))
    ## [1] 113.2956
    3.83 + 10 * 0.07 + (10 * 0.02)
    ## [1] 4.73

    Pregunta 4

    • ¿Cuál es el código que mejor representa este gráfico?

    A <- glm(weight ~ Time + Diet, family = poisson)
    B <- glm(weight ~ Time * Diet, family = poisson)
    C <- glm(weight ~ Time + Time:Diet, family = poisson)

    Pregunta 5

    • ¿Cuantas pendientes e interceptos espero ver en este modelo?

    Pregunta 5 (cont)

    term estimate std.error statistic p.value
    (Intercept) 3.70 0.01 346.30 0
    Time 0.08 0.00 125.25 0
    Diet2 0.15 0.01 13.69 0
    Diet3 0.30 0.01 29.46 0
    Diet4 0.26 0.01 24.88 0

    Dudas