Multiple Imputation in R


This article shows how to manage and analyze data after multiple imputation with the mice package in R. Multiple imputation handles missing data by creating several completed copies of the dataset, each with plausible values (or ‘fill-ins’) drawn for the missing entries; the analysis is then run on each copy and the results are combined.

Step 1: Data Preparation and Imputation

First, we download and prepare our data, introducing some missingness for demonstration purposes. We then use the mice package to perform the multiple imputation:

library(mice)

bfi <- read.csv("https://lukasnovak.online/media/data/bfi.csv", sep = ";")

# Introduce some missingness for demonstration
# (plain NA keeps the numeric columns numeric; NA_character_
#  would coerce them to character)
bfi[4, 1] <- NA
bfi[6, 2] <- NA
bfi[9, 1] <- NA
bfi[7, 2] <- NA
bfi[6, 1] <- NA

# Perform multiple imputation, creating m = 3 completed datasets
# (add a seed, e.g. mice(bfi, m = 3, seed = 123), for reproducible results)
imput.bfi <- mice(bfi, m = 3)
## 
##  iter imp variable
##   1   1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   1   2  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   1   3  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   2   1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   2   2  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   2   3  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   3   1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   3   2  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   3   3  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   4   1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   4   2  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   4   3  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   5   1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   5   2  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
##   5   3  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5  education
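Before moving on, it is worth inspecting what mice actually produced. The call returns a mids object, and the snippet below (a sketch using standard mice accessors) extracts one completed dataset and looks at the candidate values proposed for a variable:

```r
# Extract the first completed dataset, with missing values filled in
bfi_complete1 <- complete(imput.bfi, 1)

# The m imputed values proposed for each originally missing cell,
# here for the second column of bfi
imput.bfi$imp[[2]]

# Which imputation method was used per variable, and the missingness pattern
imput.bfi$method
md.pattern(bfi)
```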

Step 2: Analyzing Imputed Data

Once we have our imputed datasets, we fit the same model to each one. Here we use linear regression to estimate the relationship between the neuroticism item N1 and age:

# Fit the same linear regression on each imputed dataset;
# with() returns a mira object holding all m fitted models
lm.bfi <- with(imput.bfi, lm(N1 ~ age))

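The result is a mira object containing one fitted model per imputation. If you want to examine an individual fit before pooling, the models live in the `$analyses` list (standard for mira objects):

```r
# The regression fitted to the first imputed dataset
summary(lm.bfi$analyses[[1]])

# One fit per imputation, so with m = 3 this list has length 3
length(lm.bfi$analyses)
```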
Step 3: Pooling and Interpreting Results

After fitting the model to each imputed dataset, we pool the results using Rubin's rules to obtain overall estimates whose standard errors reflect both within- and between-imputation variability. This is where the magic of multiple imputation truly shines.

# Pooling the results
pooled_results <- pool(lm.bfi)
# print results: 
print(pooled_results)
## Class: mipo    m = 3 
##          term m    estimate         ubar            b            t dfcom
## 1 (Intercept) 3  3.28746417 6.736793e-03 1.342301e-05 6.754690e-03  2798
## 2         age 3 -0.01236588 7.075051e-06 7.990314e-09 7.085705e-06  2798
##         df         riv      lambda         fmi
## 1 2761.562 0.002656657 0.002649618 0.003371143
## 2 2783.016 0.001505820 0.001503556 0.002220347
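The total variance t in this output follows Rubin's rules: t = ubar + (1 + 1/m) * b, where ubar is the average within-imputation variance and b the between-imputation variance. We can verify this by hand with the intercept row from the table above:

```r
m    <- 3
ubar <- 6.736793e-03  # within-imputation variance (intercept row)
b    <- 1.342301e-05  # between-imputation variance (intercept row)

t_total <- ubar + (1 + 1/m) * b
t_total  # ~0.00675469, matching the t column above
```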

Understanding Pooled Results

The output of print(pooled_results) provides the key quantities. Besides the estimates, ubar is the average within-imputation variance, b the between-imputation variance, and t the total variance. Here’s the interpretation:

  • Estimates: The coefficients for the intercept (3.29) and age (-0.0124) are the averages of the corresponding estimates across the imputed datasets.
  • Relative increase in variance (riv): The proportional increase in sampling variance caused by the missing data.
  • Lambda: The proportion of total variance attributable to the missing data. The small values here indicate that the variability introduced by the missing data is negligible relative to the total variability in the model, so the missingness has little impact on the estimated regression parameters.
  • Fraction of missing information (fmi): The fraction of information about each parameter that is lost due to nonresponse.
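For reporting, summary() on the pooled object gives standard errors, t-statistics, and p-values, and its conf.int argument (part of mice's summary method for mipo objects) adds confidence intervals:

```r
# Pooled coefficients with standard errors and 95% confidence intervals
summary(pooled_results, conf.int = TRUE)
```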

Conclusion

By employing multiple imputation and pooling, we retain the full sample rather than discarding incomplete cases, preserving statistical power while properly quantifying the extra uncertainty that missing values add to the regression model parameters.

Lukas Novak
Researcher

My research interests include affective neuroscience and psychometrics.