Title: | Generalized Linear Models for Categorical Responses |
---|---|
Description: | In statistical modeling, there is a wide variety of regression models for categorical dependent variables (nominal or ordinal data); yet no software embraces all of these models in a uniform and generalized format. Following the methodology proposed by Peyhardi, Trottier, and Guédon (2015) <doi:10.1093/biomet/asv042>, we introduce 'GLMcat', an R package to estimate generalized linear models implemented under the unified specification (r, F, Z), where r represents the ratio of probabilities (reference, cumulative, adjacent, or sequential), F the cumulative distribution function for the link, and Z the design matrix. |
Authors: | Lorena León [aut, cre], Jean Peyhardi [aut], Catherine Trottier [aut] |
Maintainer: | Lorena León <[email protected]> |
License: | GPL-3 |
Version: | 0.2.7 |
Built: | 2024-11-04 21:45:46 UTC |
Source: | https://github.com/ylleonv/glmcat |
This dataset contains information about various accidents, including details such as accident severity, road and weather conditions, light conditions, and the number of casualties.
accidents
A data frame with 109,577 rows and 12 variables:
Factor with levels Slight, Serious, Fatal
Factor with levels Dual carriageway, One way street, Roundabout, Single carriageway, Slip road
Factor with levels Fine + high winds, Fine no high winds, Fog or mist, Raining + high winds, Raining no high winds, Snowing
Factor with levels Darkness, Daylight
Factor with levels Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday
Numeric, number of casualties in the accident
Factor with levels Urban, Rural
Numeric, speed limit at the accident location
Factor with levels Not at junction or within 20 metres, T or staggered junction, Crossroads, Roundabout, Other junction, Private drive or entrance
Factor with levels Any animal in carriageway (except ridden horse), Data missing or out of range, None, Other object on road, Pedestrian in carriageway - not injured, Previous accident, Vehicle load on road
Factor with levels Fine + high winds, Fine no high winds, Fog or mist, Raining + high winds, Raining no high winds, Snowing
Factor with levels Dual carriageway, One way street, Roundabout, Single carriageway, Slip road
Data from 2019, openly available at https://www.data.gov.uk/, accessed in September 2023.
data(accidents)
Compute an analysis of deviance table for one fitted glmcat model object.
## S3 method for class 'glmcat'
anova(object, ...)
object | an object of class glmcat. |
... | additional arguments. |
Returns the coefficient estimates of the fitted glmcat model object.
## S3 method for class 'glmcat'
coef(object, na.rm = FALSE, ...)
object | a fitted object of class glmcat. |
na.rm | TRUE for NA coefficients to be removed; the default is FALSE. |
... | additional arguments affecting the |
Computes confidence intervals for all the parameters of a fitted glmcat model object.
## S3 method for class 'glmcat'
confint(object, parm, level, ...)
object | a fitted object of class glmcat. |
parm | a numeric or character vector indicating which regression coefficients should be displayed. |
level | the confidence level. |
... | other parameters. |
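As a rough illustration of what an interval at a given level looks like, the sketch below computes Wald-type intervals from estimates and standard errors in base R. It is an assumption that confint for glmcat objects uses this construction; the helper wald_ci and the numbers are hypothetical and only show the arithmetic.

```r
# Hypothetical helper: Wald-type interval, estimate +/- z * standard error.
# Whether confint() for glmcat objects uses this construction is an assumption.
wald_ci <- function(est, se, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)  # two-sided standard normal quantile
  cbind(lower = est - z * se, upper = est + z * se)
}

# Illustrative (made-up) estimates and standard errors
est <- c(a = 1.2, b = -0.4)
se  <- c(0.3, 0.1)
ci  <- wald_ci(est, se, level = 0.95)
print(ci)
```

Raising level widens the interval, since the normal quantile z grows with the confidence level.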
Set control parameters for glmcat models.
control_glmcat(maxit = 25, epsilon = 1e-06, beta_init = NA)
maxit | the maximum number of iterations of the Fisher scoring algorithm. Defaults to 25. |
epsilon | a double that sets the convergence criterion of GLMcat models. |
beta_init | an appropriately sized vector of initial values for the first iteration of the algorithm. |
Family of models for Discrete Choice. Fits discrete choice models, which require data in long form: for each individual (or decision maker) there are multiple observations (rows), one for each of the alternatives the individual could have chosen. The group of observations belonging to the same individual is a "case". Note that each case represents a single statistical observation, although it comprises multiple rows.
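To make the long-form layout concrete, here is a minimal base R sketch of a toy data frame with two cases and three alternatives each. The column names indv, mode, and choice mirror those used in the package examples; the cost column and all values are hypothetical.

```r
# Toy long-format data: one row per (case, alternative) pair.
# All values are made up for illustration.
long <- data.frame(
  indv   = rep(1:2, each = 3),                        # case identifier
  mode   = rep(c("air", "bus", "car"), times = 2),    # alternatives
  choice = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE), # chosen alternative
  cost   = c(70, 35, 10, 68, 30, 12)                  # hypothetical alternative-specific variable
)

# Each case is a single statistical observation, so it should
# contain exactly one chosen alternative.
chosen_per_case <- tapply(long$choice, long$indv, sum)
print(long)
print(chosen_per_case)
```

A check like chosen_per_case is a cheap way to catch malformed cases before fitting.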
discrete_cm(
  formula,
  case_id,
  alternatives,
  reference,
  alternative_specific = NA,
  data,
  cdf = list(),
  intercept = "standard",
  normalization = 1,
  control = list(),
  na.action = "na.omit",
  find_nu = FALSE
)
formula | a symbolic description of the model to be fit. An expression of the form y ~ predictors is interpreted as a specification that the response y is modeled by a linear predictor specified symbolically by predictors. A particularity of the formula is that, for the case-specific variables, the user can define a specific effect for a category (in the parameter 'alternative_specific'). |
case_id | a string with the name of the column that identifies each case. |
alternatives | a string with the name of the column that identifies the vector of alternatives the individual could have chosen. |
reference | a string indicating the reference category. |
alternative_specific | a character vector with the names of the explanatory variables that vary across the alternatives within a case; these are the alternative-specific variables. By default, the case-specific variables are the explanatory variables that are part of the formula but are not identified here. |
data | a data frame (in long format), with the dependent variable as a factor. |
cdf | a parameter specifying the inverse distribution function to be used as part of the link function. If the distribution has no parameters to specify, it should be entered as a string indicating the name. The default value is 'logistic'. If there are parameters to specify, a list must be entered. For example, for Student's distribution, it would be 'list("student", df=2)'. For the non-central Student distribution, it would be 'list("noncentralt", df=2, mu=1)'. |
intercept | if set to "conditional", the design will be equivalent to the conditional logit model. |
normalization | the quantile to use for the normalization of the estimated coefficients, where the logistic distribution is used as the base cumulative distribution function. |
control | a list specifying additional control parameters. - 'maxit': the maximum number of iterations for the Fisher scoring algorithm. - 'epsilon': a double that sets the convergence criterion. - 'beta_init': an appropriately sized vector for the initial iteration of the algorithm. |
na.action | an argument to handle missing data. Available options are na.omit, na.fail, and na.exclude, which come from the stats package. The na.pass option is not supported. |
find_nu | a logical argument indicating whether the user intends to use the Student CDF and wants an optimization algorithm to identify an optimal degrees of freedom setting for the model. |
Family of models for Discrete Choice
For these models, the intercept cannot be excluded.
library(GLMcat)
data(TravelChoice)
discrete_cm(formula = choice ~ hinc + gc + invt,
            case_id = "indv", alternatives = "mode", reference = "air",
            data = TravelChoice, cdf = "logistic")

# Model with alternative-specific effects for gc and invt:
discrete_cm(formula = choice ~ hinc + gc + invt,
            case_id = "indv", alternatives = "mode", reference = "air",
            data = TravelChoice, alternative_specific = c("gc", "invt"),
            cdf = "logistic")

# A more specific design was studied by Louviere et al. (2000, p. 157) and Greene (2003, p. 730).
# These analyses set the effect of the variables hinc and psize exclusively for the category air.
discrete_cm(formula = choice ~ hinc[air] + psize[air] + gc + ttme,
            case_id = "indv", alternatives = "mode", reference = "car",
            alternative_specific = c("gc", "ttme"),
            data = TravelChoice)
Boys' disturbed dreams: a benchmark dataset drawn from a study that cross-classified boys by their age and by the severity (not severe, severe 1, severe 2, very severe) of their disturbed dreams (Maxwell, 1961).
data(DisturbedDreams)
A dataframe containing:
Age | Individual's age |
Level | Severity level: Not.severe, Severe.1, Severe.2, Very.severe. |
Maxwell, A.E. (1961) Analyzing qualitative data, Methuen London, 73.
data(DisturbedDreams)
Method to compute the (generalized) Akaike An Information Criterion for a fitted object of class glmcat.
## S3 method for class 'glmcat'
extractAIC(fit, ...)
fit | a fitted object of class glmcat. |
... | further arguments (currently unused in base R). |
model <- glmcat(formula = Level ~ Age, data = DisturbedDreams,
                ref_category = "Very.severe", ratio = "cumulative")
extractAIC(model)
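For reference, the generic extractAIC in base R returns the pair (edf, AIC), with AIC = -2 logLik + k * edf. The sketch below reproduces that arithmetic for a stats::glm fit on the built-in mtcars data. Whether the glmcat method follows exactly the same convention is an assumption, and aic_from_loglik is a hypothetical helper.

```r
# Hypothetical helper: AIC from a log-likelihood and the equivalent
# degrees of freedom (edf); k = 2 gives the usual AIC.
aic_from_loglik <- function(loglik, edf, k = 2) {
  c(edf = edf, AIC = -2 * loglik + k * edf)
}

# Base R logistic regression used purely for illustration (not a glmcat fit)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
ll  <- logLik(fit)

manual  <- aic_from_loglik(as.numeric(ll), attr(ll, "df"))
builtin <- extractAIC(fit)  # c(edf, AIC) as computed by stats

print(manual)
print(builtin)
```

Passing a different k, for example k = log(nobs(fit)), gives a BIC-type criterion under the same formula.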
Estimate generalized linear models implemented under the unified specification (ratio, cdf, Z), where ratio represents the ratio of probabilities (reference, cumulative, adjacent, or sequential), cdf the cumulative distribution function for the linkage, and Z the design matrix, which must be specified through the parallel and the threshold arguments.
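The four ratios can be sketched in base R for a single vector of category probabilities. This is an illustrative reading of the definitions in Peyhardi, Trottier, and Guédon (2015), not code from the package; the function ratios and the probability vector p are hypothetical.

```r
# Hypothetical helper: the four ratios of probabilities for categories 1..J,
# each evaluated for j = 1, ..., J - 1.
ratios <- function(p) {
  J <- length(p)
  j <- seq_len(J - 1)
  tail_sum <- rev(cumsum(rev(p)))  # p_j + ... + p_J
  list(
    reference  = p[j] / (p[j] + p[J]),      # category j versus the reference J
    adjacent   = p[j] / (p[j] + p[j + 1]),  # category j versus j + 1
    cumulative = cumsum(p)[j],              # p_1 + ... + p_j
    sequential = p[j] / tail_sum[j]         # category j versus j or above
  )
}

p <- c(0.1, 0.2, 0.3, 0.4)  # made-up probabilities summing to 1
r <- ratios(p)
print(r)
```

Under the (ratio, cdf, Z) specification, each of these J - 1 ratios is then modeled as the cdf applied to a linear predictor built from the design matrix Z.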
glmcat(
  formula,
  data,
  ratio = c("reference", "cumulative", "sequential", "adjacent"),
  cdf = list(),
  parallel = NA,
  categories_order = NA,
  ref_category = NA,
  threshold = c("standard", "symmetric", "equidistant"),
  control = list(),
  normalization = 1,
  na.action = "na.omit",
  find_nu = FALSE,
  ...
)
formula | a symbolic description of the model to be fit. An expression of the form 'y ~ predictors' is interpreted as a specification that the response 'y' is modeled by a linear predictor specified by 'predictors'. |
data | a data frame, with the dependent variable as a factor. |
ratio | a string indicating the ratio (equivalent to the family). Options are: reference, adjacent, cumulative, and sequential. The user must specify the desired ratio, as there is no default value. |
cdf | The inverse distribution function to be used as part of the link function. - If the distribution has no parameters to specify, then it should be entered as a string indicating the name, e.g., 'cdf = "normal"'. The default value is 'cdf = "logistic"'. - If there are parameters to specify, then a list must be entered. For example, for Student's distribution: 'cdf = list("student", df=2)'. For the non-central Student distribution: 'cdf = list("noncentralt", df=2, mu=1)'. |
parallel | a character vector indicating the names of the variables with a parallel effect. If a variable is categorical, specify the name and the level of the variable as a string, e.g., '"namelevel"'. |
categories_order | a character vector indicating the incremental order of the categories, e.g., 'c("a", "b", "c")' for 'a < b < c'. Alphabetical order is assumed by default. Order is relevant for the adjacent, cumulative, and sequential ratios. |
ref_category | a string indicating the reference category. This option is suitable for models with the reference ratio. |
threshold | a restriction to impose on the thresholds. Options are: 'standard', 'equidistant', or 'symmetric'. This is valid only for the cumulative ratio. |
control | a list of control parameters for the estimation algorithm. - 'maxit': The maximum number of iterations for the Fisher scoring algorithm. - 'epsilon': A double that sets the convergence criterion of GLMcat models. - 'beta_init': An appropriately sized vector for the initial iteration of the algorithm. |
normalization | the quantile to use for the normalization of the estimated coefficients when the logistic distribution is used as the base cumulative distribution function. |
na.action | an argument to handle missing data. Available options are 'na.omit', 'na.fail', and 'na.exclude'. The 'na.pass' option is not supported. |
find_nu | a logical argument indicating whether the user intends to use the Student CDF and wants an optimization algorithm to identify an optimal degrees of freedom setting for the model. |
... | additional arguments. |
Fitting models for categorical responses
This function fits generalized linear models for categorical responses using the unified specification framework introduced by Peyhardi, Trottier, and Guédon (2015).
Peyhardi J, Trottier C, Guédon Y (2015). “A new specification of generalized linear models for categorical responses.” Biometrika, 102(4), 889–906. doi:10.1093/biomet/asv042.
data(DisturbedDreams)
ref_log_com <- glmcat(formula = Level ~ Age, data = DisturbedDreams,
                      ref_category = "Very.severe", cdf = "logistic",
                      ratio = "reference")
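For an ordinal reading of the same data, a cumulative-ratio fit might look like the sketch below. It uses only arguments documented above (ratio, cdf, categories_order, parallel, control), but the particular combination and the control values are assumptions chosen for illustration. The call is guarded so that the snippet still runs when GLMcat is not installed.

```r
# Explicit ordering of the response categories (level names taken from the
# DisturbedDreams description); severity_order and ctrl are illustrative names.
severity_order <- c("Not.severe", "Severe.1", "Severe.2", "Very.severe")

# Control list with the documented components; the values are arbitrary.
ctrl <- list(maxit = 50, epsilon = 1e-8)

if (requireNamespace("GLMcat", quietly = TRUE)) {
  data("DisturbedDreams", package = "GLMcat")
  # Assumed usage: cumulative logistic model with a parallel effect of Age
  cum_fit <- GLMcat::glmcat(
    formula = Level ~ Age,
    data = DisturbedDreams,
    ratio = "cumulative",
    cdf = "logistic",
    categories_order = severity_order,
    parallel = "Age",
    control = ctrl
  )
  print(cum_fit)
}
```

Setting categories_order explicitly avoids relying on the default alphabetical ordering, which matters for the cumulative, adjacent, and sequential ratios.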
Extract the log-likelihood of a fitted glmcat model object.
## S3 method for class 'glmcat'
logLik(object, ...)
object | a fitted object of class glmcat. |
... | additional arguments affecting the loglik. |
Extract the number of observations of the fitted glmcat model object.
## S3 method for class 'glmcat'
nobs(object, ...)
object | a fitted object of class glmcat. |
... | additional arguments affecting the |
Plot of the log-likelihood profile for a fitted glmcat model object.
## S3 method for class 'glmcat'
plot(x, ...)
x | an object of class glmcat. |
... | additional arguments. |
Obtains predictions from a fitted glmcat model object.
## S3 method for class 'glmcat'
predict(object, newdata, type, ...)
object | a fitted object of class glmcat. |
newdata | optionally, a data frame in which to look for the variables involved in the model. If omitted, the fitted linear predictors are used. |
type | the type of prediction required. The default is |
... | further arguments. The default is |
print.anova method for GLMcat objects.
## S3 method for class 'anova.glmcat'
print(x, digits = max(getOption("digits") - 2, 3), ...)
x | an object of class anova.glmcat. |
digits | the number of digits in the printed table. |
... | additional arguments affecting the summary produced. |
print method for a fitted glmcat model object.
## S3 method for class 'glmcat'
print(x, ...)
x | an object of class glmcat. |
... | additional arguments. |
model <- glmcat(formula = Level ~ Age, data = DisturbedDreams,
                ref_category = "Very.severe", ratio = "cumulative")
print(model)
print.summary method for GLMcat objects.
## S3 method for class 'summary.glmcat'
print(x, digits = max(3, getOption("digits") - 3), ...)
x | an object of class summary.glmcat. |
digits | the number of digits in the printed table. |
... | additional arguments affecting the summary produced. |
Stepwise model selection for a glmcat model object based on the AIC.
## S3 method for class 'glmcat'
step(object, scope, scale, direction, trace, keep, steps, k, ...)
object | a fitted object of class glmcat. |
scope | defines the range of models examined in the stepwise search (same as in the step function of the stats package). This should be either a single formula, or a list containing components upper and lower, both formulae. |
scale | the scaling parameter (if applicable). |
direction | the mode of the stepwise search. |
trace | to print the process information. |
keep | a logical value indicating whether to keep the models from all steps. |
steps | the maximum number of steps. |
k | additional arguments (if needed). |
... | additional arguments passed to the function. |
Summary method for a fitted 'glmcat' model object.
## S3 method for class 'glmcat'
summary(object, normalized = FALSE, correlation = FALSE, ...)
object | a fitted object of class 'glmcat'. |
normalized | if 'TRUE', the summary method yields the normalized coefficients. |
correlation | if 'TRUE', prints the correlation matrix. |
... | additional arguments affecting the summary produced. |
mod1 <- discrete_cm(formula = choice ~ hinc + gc + invt,
                    case_id = "indv", alternatives = "mode", reference = "air",
                    data = TravelChoice, alternative_specific = c("gc", "invt"),
                    cdf = "normal", normalization = 0.8)
summary(mod1, normalized = TRUE)
Returns the terms of a fitted glmcat model object.
## S3 method for class 'glmcat'
terms(x, ...)
x | an object of class glmcat. |
... | additional arguments. |
The data set contains 210 observations on mode choice for travel between Sydney and Melbourne, Australia.
data(TravelChoice)
A dataframe containing:
indv | Id of the individual |
mode | available options: air, train, bus or car |
choice | a logical vector indicating as TRUE the transportation mode chosen by the traveler |
As category-specific variables:
invt | travel time in vehicle |
gc | generalized cost measure |
ttme | terminal waiting time for plane, train and bus; 0 for car |
invc | in vehicle cost |
As case-specific variables:
hinc | household income |
psize | traveling group size in mode chosen |
Downloaded (18/09/2020) from the online complements to Greene, W.H. (2011) Econometric Analysis, Prentice Hall, 7th Edition, Table F18-2.
Greene, W.H. and D. Hensher (1997) Multinomial logit and discrete choice models, in Greene, W.H. (1997) LIMDEP version 7.0 user's manual revised, Plainview, New York: Econometric Software, Inc.
data(TravelChoice)
Returns the variance-covariance matrix of the main parameters of a fitted glmcat model object.
## S3 method for class 'glmcat'
vcov(object, ...)
object | an object of class glmcat. |
... | additional arguments. |