glm() model

Highlights & Limitations

  • Defaults to 0-to-1 predictions for binomial family models. That is akin to running predict(model, type = "response")
  • Only treatment contrasts (contr.treatment) are supported.
  • offset is supported
  • Categorical variables are supported
  • In-line functions in the formulas are not supported:
    • OK - wt ~ mpg + am
    • OK - mutate(mtcars, newam = paste0(am)) and then wt ~ mpg + newam
    • Not OK - wt ~ mpg + as.factor(am)
    • Not OK - wt ~ mpg + as.character(am)
  • Interval functions are not supported: tidypredict_interval() & tidypredict_sql_interval()
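
The workaround in the list above, creating the converted column before fitting so that the formula holds only plain column names, might look like the following in base R. This is a minimal sketch: the column name newam comes from the example above, while the use of lm() and of base R assignment in place of mutate() is an assumption made to keep the example free of dependencies.

```r
# Create the converted column first, so the formula needs no in-line function.
df <- mtcars
df$newam <- paste0(df$am)   # character version of am: "0" or "1"

# The formula now refers only to plain column names.
model <- lm(wt ~ mpg + newam, data = df)

# The character column is expanded into a treatment-contrast coefficient.
names(coef(model))
```

Because the conversion happens in the data rather than in the formula, the model only ever refers to columns that exist in the data frame.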

How it works

df <- mtcars %>%
  mutate(char_cyl = paste0("cyl", cyl)) %>%
  select(wt, char_cyl, am) 

model <- glm(am ~ wt + char_cyl, data = df, family = "binomial")

tidypredict_sql() returns a SQL query in which each coefficient (model$coefficients) is multiplied by its variable or by an indicator for its categorical variable value. In most cases the resulting SQL contains one short term per coefficient, with a CASE WHEN statement for each categorical value. The offset field or value is appended if one is provided.

For binomial models, the sigmoid equation is applied. This means that the target SQL database will need to support the exponential function (EXP).
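
To make the arithmetic concrete, the sketch below rebuilds the same linear predictor by hand in base R and applies the sigmoid in the form 1 - 1/(1 + exp(eta)). It does not use tidypredict. The variable names eta, manual and max_diff are illustrative, and the data frame is rebuilt with data.frame() instead of dplyr as an assumption to keep the block self-contained.

```r
# Rebuild the example data and model with base R only.
df <- data.frame(
  wt = mtcars$wt,
  char_cyl = paste0("cyl", mtcars$cyl),
  am = mtcars$am
)
model <- glm(am ~ wt + char_cyl, data = df, family = "binomial")
b <- coef(model)

# Linear predictor: intercept + numeric term + one 0/1 indicator per level.
eta <- b[["(Intercept)"]] +
  b[["wt"]] * df$wt +
  b[["char_cylcyl6"]] * (df$char_cyl == "cyl6") +
  b[["char_cylcyl8"]] * (df$char_cyl == "cyl8")

# Sigmoid applied to the linear predictor, giving 0-to-1 predictions.
manual <- 1 - 1 / (1 + exp(eta))

# Largest absolute difference from predict(model, type = "response").
max_diff <- max(abs(manual - unname(predict(model, type = "response"))))
max_diff
```

The comparisons against "cyl6" and "cyl8" play the role of the CASE WHEN statements: each evaluates to 1 when the row has that level and 0 otherwise.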

library(tidypredict)

tidypredict_sql(model, dbplyr::simulate_mssql())
## <SQL> 1.0 - 1.0 / (1.0 + EXP((((20.8527831345691) + ((`wt`) * (-7.85934263583835))) + ((CASE WHEN (((`char_cyl`) = ('cyl6')) =  'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl6')) =  'FALSE') THEN (0.0) END) * (3.10462643177453))) + ((CASE WHEN (((`char_cyl`) = ('cyl8')) =  'TRUE') THEN (1.0) WHEN (((`char_cyl`) = ('cyl8')) =  'FALSE') THEN (0.0) END) * (5.37942092366097))))

Alternatively, use tidypredict_to_column() if the results are to be used or previewed in dplyr.

df %>%
  tidypredict_to_column(model) %>%
  head(10) 
##       wt char_cyl am        fit
## 1  2.620     cyl6  1 0.96662269
## 2  2.875     cyl6  1 0.79605201
## 3  2.320     cyl4  1 0.93208127
## 4  3.215     cyl6  0 0.21242376
## 5  3.440     cyl8  0 0.30918450
## 6  3.460     cyl6  0 0.03783629
## 7  3.570     cyl8  0 0.13875740
## 8  3.190     cyl4  0 0.01450687
## 9  3.150     cyl4  0 0.01975984
## 10 3.440     cyl6  0 0.04399324

Under the hood

The parser reads several parts of the glm object to tabulate all of the needed variables. One entry per coefficient is added to the final table, and other variables are added at the end. Some variables are not required for every parsed model. For example, an offset entry is listed only when the offset is part of the model's formula (call); if a given model has no offset, that line does not exist.

parse_model(model)
## # A tibble: 10 x 10
##    labels    estimate type   field_1 field_2    qr_1   qr_2    qr_3   qr_4
##    <chr>        <dbl> <chr>  <chr>   <chr>     <dbl>  <dbl>   <dbl>  <dbl>
##  1 (Interce~    20.9  term   <NA>    <NA>    - 0.679 - 5.11 - 1.15    6.05
##  2 wt          - 7.86 term   <NA>    {{:}}     0       1.62   0.232 - 2.58
##  3 char_cyl~     3.10 term   cyl6    <NA>      0       0      1.56    1.86
##  4 char_cyl~     5.38 term   cyl8    <NA>      0       0      0       3.20
##  5 labels        0    varia~ char_c~ wt       NA      NA     NA      NA   
##  6 model        NA    varia~ <NA>    <NA>     NA      NA     NA      NA   
##  7 version      NA    varia~ <NA>    <NA>     NA      NA     NA      NA   
##  8 residual     NA    varia~ <NA>    <NA>     NA      NA     NA      NA   
##  9 family       NA    varia~ <NA>    <NA>     NA      NA     NA      NA   
## 10 link         NA    varia~ <NA>    <NA>     NA      NA     NA      NA   
## # ... with 1 more variable: vals <chr>
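
The general shape of the table above, one row per coefficient followed by model-level entries, can be approximated in base R. This is only a rough sketch and not the actual implementation of parse_model(): the names terms_tbl, extras_tbl and parsed are hypothetical, only two of the trailing entries are reproduced, and placing the family and link names in vals is an assumption, since the contents of that column are not shown in the output.

```r
# Rebuild the example data and model with base R only.
df <- data.frame(
  wt = mtcars$wt,
  char_cyl = paste0("cyl", mtcars$cyl),
  am = mtcars$am
)
model <- glm(am ~ wt + char_cyl, data = df, family = "binomial")

# One row per coefficient, mirroring the labels and estimate columns.
terms_tbl <- data.frame(
  labels = names(coef(model)),
  estimate = unname(coef(model)),
  vals = NA_character_,
  stringsAsFactors = FALSE
)

# Model-level entries appended at the end (assumed to be stored in vals).
extras_tbl <- data.frame(
  labels = c("family", "link"),
  estimate = NA_real_,
  vals = c(model$family$family, model$family$link),
  stringsAsFactors = FALSE
)

parsed <- rbind(terms_tbl, extras_tbl)
parsed
```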

The output from parse_model() is transformed into a dplyr (a.k.a. Tidy Eval) formula. Each categorical variable value is encoded as a 0/1 indicator, which appears as ifelse() in the output.

tidypredict_fit(model)
## 1 - 1/(1 + exp((((20.8527831345691) + ((wt) * (-7.85934263583835))) + 
##     ((ifelse((char_cyl) == ("cyl6"), 1, 0)) * (3.10462643177453))) + 
##     ((ifelse((char_cyl) == ("cyl8"), 1, 0)) * (5.37942092366097))))
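
Since the formula is an ordinary R expression, it can also be evaluated without dplyr. The sketch below copies the printed expression verbatim, captures it with quote(), and evaluates it against the data with base R's eval(). The name fit_expr is illustrative, and hard-coding the expression instead of calling tidypredict_fit() is an assumption made so that the block runs without the package installed.

```r
# The printed expression, copied verbatim and captured unevaluated.
fit_expr <- quote(
  1 - 1/(1 + exp((((20.8527831345691) + ((wt) * (-7.85934263583835))) +
    ((ifelse((char_cyl) == ("cyl6"), 1, 0)) * (3.10462643177453))) +
    ((ifelse((char_cyl) == ("cyl8"), 1, 0)) * (5.37942092366097))))
)

# Same columns as the example data, built with base R.
df <- data.frame(
  wt = mtcars$wt,
  char_cyl = paste0("cyl", mtcars$cyl),
  am = mtcars$am
)

# Evaluate the expression with the data frame's columns in scope.
df$fit <- eval(fit_expr, df)
head(df, 3)
```

The resulting fit column should agree with the tidypredict_to_column() output shown earlier, up to the precision of the printed coefficients.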

From there, the Tidy Eval formula can be used anywhere it can be evaluated. tidypredict provides three paths:

  • Use it directly inside dplyr: mutate(df, !! tidypredict_fit(model))
  • Add tidypredict_to_column(model) to a piped command set
  • Use tidypredict_sql() to retrieve the SQL statement

Interval functions are not supported for glm() models, so these paths apply only to the fitted values.

How it performs

Testing the tidypredict results is easy. The tidypredict_test() function automatically uses the model object's data frame to compare the results of tidypredict_fit() to those given by predict().

tidypredict_test(model)
## tidypredict test results
## Difference threshold: 1e-12
## 
##  All results are within the difference threshold