Direct Pooling and model selection of Linear and Logistic regression models across multiply imputed data.

glm_mi Pooling and backward or forward selection of Linear and Logistic regression models across multiply imputed data using selection methods RR, D1, D2, D3, D4 and MPR (without use of the with function).

glm_mi(
  data,
  formula = NULL,
  nimp = 5,
  impvar = NULL,
  keep.predictors = NULL,
  p.crit = 1,
  method = "RR",
  direction = NULL,
  model_type = NULL
)

Arguments

data: Data frame with stacked multiply imputed datasets. The original dataset that contains missing values must be excluded. The imputed datasets must be distinguished by an imputation variable, specified under impvar, with values starting at 1.
formula: A formula object to specify the model as normally used by glm. See under "Details" and "Examples" how these can be specified. If a formula object is used, set predictors, cat.predictors, spline.predictors or int.predictors to their default value of NULL.
nimp: A numerical scalar. Number of imputed datasets. Default is 5.
impvar: A character vector. Name of the variable that distinguishes the imputed datasets.
keep.predictors: A single string or a vector of strings naming the variables that are forced into the model during predictor selection. All types of variables are allowed.
p.crit: A numerical scalar. P-value selection criterion. A value of 1 provides the pooled model without selection.
method: A character vector to indicate the method used to pool p-values, either for the total model or during predictor selection. This can be "RR", "D1", "D2", "D3", "D4", or "MPR". See Details for more information. Default is "RR".
direction: The direction of predictor selection, "BW" means backward selection and "FW" means forward selection.
model_type: A character vector for type of model, "binomial" is for logistic regression and "linear" is for linear regression models.
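
The layout required for data, impvar and nimp can be checked before calling glm_mi. The following is a minimal base R sketch; check_stacked is a hypothetical helper that is not part of the package, and the toy data frame contains made-up values.

```r
# Hypothetical helper (not part of the package): checks that a stacked data
# frame matches the layout described for `data`, `impvar` and `nimp`.
check_stacked <- function(data, impvar, nimp) {
  stopifnot(is.data.frame(data), impvar %in% names(data))
  ids <- sort(unique(data[[impvar]]))
  # Imputation numbers must run from 1 to nimp; a value of 0 would suggest
  # that the original incomplete dataset is still part of the stack.
  if (!isTRUE(all.equal(as.numeric(ids), as.numeric(seq_len(nimp))))) {
    stop("Values of '", impvar, "' must be exactly 1, ..., ", nimp)
  }
  invisible(TRUE)
}

# Toy stacked data: 2 imputed copies of a 3-row dataset (made-up values)
toy <- data.frame(
  Impnr = rep(1:2, each = 3),
  y     = c(0, 1, 1, 0, 1, 1),
  x     = c(1.2, 3.4, 2.2, 1.2, 3.1, 2.2)
)

ok <- check_stacked(toy, impvar = "Impnr", nimp = 2)
```

A stack whose imputation variable starts at 0 makes the helper stop with an error, which is the situation the data argument warns against.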

Value

An object of class pmods (multiply imputed models) from which the following objects can be extracted:

data imputed datasets
RR_model pooled model at each selection step
RR_model_final final selected pooled model
multiparm pooled p-values at each step according to pooling method
multiparm_final pooled p-values at final step according to pooling method
multiparm_out (only when direction = "FW") pooled p-values of removed predictors
formula_step formula object at each step
formula_final formula object at final step
formula_initial formula object at initial step (start model)
predictors_in predictors included at each selection step
predictors_out predictors excluded at each step
impvar name of variable used to distinguish imputed datasets
nimp number of imputed datasets
Outcome name of the outcome variable
method selection method
p.crit p-value selection criterion
call function call
model_type type of regression model used
direction direction of predictor selection
predictors_final names of predictors in final selection step
predictors_initial names of predictors in start model
keep.predictors names of predictors that were forced in the model

Details

The basic pooling procedure to derive pooled coefficients, standard errors, 95% confidence intervals and p-values is Rubin's Rules (RR). However, RR is only possible when the model includes only continuous and dichotomous variables. Specific procedures are available when the model also includes categorical (> 2 categories) or restricted cubic spline variables. These pooling methods are: "D1", pooling of the total covariance matrix; "D2", pooling of Chi-square values; "D3" and "D4", pooling of likelihood ratio statistics (method of Meng and Rubin); and "MPR", pooling of median p-values (MPR rule). Spline regression coefficients are defined by using the rcs function for restricted cubic splines of the rms package. A minimum of 3 knots is required, as in rcs(variable, 3).
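
To make the RR and MPR ideas concrete, the following base R sketch pools a single coefficient by hand. All numbers are made up for illustration, and the degrees of freedom follow the classic Rubin (1987) formula; this is an assumption, and the package may apply a different adjustment.

```r
# Made-up estimates and standard errors of one coefficient, one pair per
# imputed dataset (m = 5); illustrative numbers only.
est <- c(0.98, 1.02, 1.05, 0.97, 1.00)
se  <- c(0.17, 0.18, 0.17, 0.16, 0.18)
m   <- length(est)

# Rubin's Rules
pooled_est <- mean(est)              # pooled coefficient
W  <- mean(se^2)                     # within-imputation variance
B  <- var(est)                       # between-imputation variance
Tv <- W + (1 + 1 / m) * B            # total variance
pooled_se <- sqrt(Tv)

# Classic Rubin (1987) degrees of freedom (assumed here for illustration)
r  <- (1 + 1 / m) * B / W            # relative increase in variance
df <- (m - 1) * (1 + 1 / r)^2

t_stat <- pooled_est / pooled_se
p_val  <- 2 * pt(-abs(t_stat), df)
ci     <- pooled_est + c(-1, 1) * qt(0.975, df) * pooled_se

# Median p-value rule: take the median of the per-imputation p-values
p_per_imp <- c(0.031, 0.048, 0.020, 0.060, 0.041)   # made-up p-values
p_mpr <- median(p_per_imp)
```

The total variance is always at least as large as the within-imputation variance, so pooled standard errors reflect the extra uncertainty caused by the missing data.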

A typical formula object has the form Outcome ~ terms. Categorical variables have to be defined as Outcome ~ factor(variable), restricted cubic spline variables as Outcome ~ rcs(variable, 3). Interaction terms can be defined as Outcome ~ variable1*variable2 or Outcome ~ variable1 + variable2 + variable1:variable2. All variables in the terms part have to be separated by a "+". If a formula object is used, set predictors, cat.predictors, spline.predictors or int.predictors to their default value of NULL.
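
The two ways of writing an interaction can be compared with base R alone. The sketch below uses only terms() and all.vars(), which inspect a formula without evaluating it, so neither the data nor the rms package is needed; the variable names are taken from the Examples section.

```r
# Two equivalent ways to specify an interaction
f_star     <- Outcome ~ variable1 * variable2
f_explicit <- Outcome ~ variable1 + variable2 + variable1:variable2

labels_star     <- attr(terms(f_star), "term.labels")
labels_explicit <- attr(terms(f_explicit), "term.labels")
same_terms <- identical(labels_star, labels_explicit)

# A formula with a categorical and a restricted cubic spline term
f_full <- Chronic ~ Pain + factor(Satisfaction) + rcs(Tampascale, 3)
term_labels <- attr(terms(f_full), "term.labels")  # one label per model term
vars_used   <- all.vars(f_full)                    # underlying variable names

# Building the same kind of formula from a character vector of terms
rhs <- c("Pain", "factor(Satisfaction)", "rcs(Tampascale, 3)")
f_built <- as.formula(paste("Chronic ~", paste(rhs, collapse = " + ")))
```

Both interaction forms expand to the same three terms, so either can be passed as the formula argument.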

References

Eekhout I, van de Wiel MA, Heymans MW. Methods for significance testing of categorical covariates in logistic regression models after multiple imputation: power and applicability analysis. BMC Med Res Methodol. 2017;17(1):129.

Enders CK (2010). Applied missing data analysis. New York: The Guilford Press.

Meng X-L, Rubin DB. Performing likelihood ratio tests with multiply-imputed data sets. Biometrika. 1992;79:103-11.

van de Wiel MA, Berkhof J, van Wieringen WN. Testing the prediction error difference between 2 predictors. Biostatistics. 2009;10:550-60.

Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Med Res Methodol. 2009;9:57.

Van Buuren S. (2018). Flexible Imputation of Missing Data. 2nd Edition. Chapman & Hall/CRC Interdisciplinary Statistics. Boca Raton.

Steyerberg EW (2019). Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating (2nd edition). Springer Nature Switzerland AG.

http://missingdatasolutions.rbind.io/

Author

Martijn Heymans, 2021

Examples

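  library(psfmi)  # provides glm_mi and the lbpmilr example data

  # Forward selection (p < 0.05) of a logistic model, pooled with method D1;
  # the interaction term and Age are forced into the model.
  pool_lr <- glm_mi(data = lbpmilr,
    formula = Chronic ~ Pain + factor(Satisfaction) + rcs(Tampascale, 3) +
      Radiation + Radiation*factor(Satisfaction) + Age + Duration + BMI,
    p.crit = 0.05, direction = "FW", nimp = 5, impvar = "Impnr",
    keep.predictors = c("Radiation*factor(Satisfaction)", "Age"),
    method = "D1", model_type = "binomial")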
#> Entered at Step 1 is - Pain
#> 
#> Selection correctly terminated, 
#> No new variables entered the model

  pool_lr$RR_model_final
#> $`Final model`
#>                              term     estimate  std.error    statistic
#> 1                     (Intercept) -4.709498778 1.43056951 -3.292044710
#> 2                            Pain  1.000998071 0.17550142  5.703646588
#> 3                       Radiation  1.290636824 1.09101880  1.182964787
#> 4                             Age -0.015400069 0.02511643 -0.613147208
#> 5           factor(Satisfaction)2  0.006745828 0.72630877  0.009287825
#> 6           factor(Satisfaction)3 -2.245109289 1.28811308 -1.742944248
#> 7 Radiation:factor(Satisfaction)2 -1.112425071 1.29072002 -0.861863964
#> 8 Radiation:factor(Satisfaction)3 -0.328027439 1.69820886 -0.193160834
#>          df      p.value          OR    lower.EXP  upper.EXP
#> 1 121.39346 1.302896e-03 0.009009292 0.0005305503  0.1529871
#> 2  86.27941 1.612718e-07 2.720996221 1.9196231945  3.8569134
#> 3  30.54412 2.459495e-01 3.635100740 0.3922493500 33.6876464
#> 4 102.60849 5.411357e-01 0.984717906 0.9368661635  1.0350137
#> 5  94.66955 9.926090e-01 1.006768633 0.2380619605  4.2576440
#> 6  23.29136 9.453270e-02 0.105915964 0.0073877578  1.5184839
#> 7  40.99328 3.937762e-01 0.328760726 0.0242557883  4.4559927
#> 8  19.52844 8.488278e-01 0.720343254 0.0207352364 25.0247643
#>
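
The columns of the printed table are related in a simple way. The sketch below recomputes the statistic, p-value, OR and confidence limits for the Pain row from its printed estimate, std.error and df. That the interval uses a t quantile with the pooled df is inferred from the printed numbers, not taken from the package source.

```r
# Values copied from the Pain row of the final model above
est <- 1.000998071
se  <- 0.17550142
df  <- 86.27941

t_stat <- est / se                      # "statistic" column
p_val  <- 2 * pt(-abs(t_stat), df)      # two-sided p-value on df degrees of freedom
or     <- exp(est)                      # "OR" column
# Confidence limits on the odds ratio scale, using a t quantile
# (assumption inferred from the printed output)
ci_or  <- exp(est + c(-1, 1) * qt(0.975, df) * se)

round(c(statistic = t_stat, OR = or, lower = ci_or[1], upper = ci_or[2]), 4)
```

The same arithmetic applies to the other rows, each with its own df.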