Multiple Imputation in R
This article explores how to manage and analyze data after performing multiple imputation using the mice package in R. Multiple imputation is a statistical technique that handles missing data by creating several plausible imputations (or ‘fill-ins’) for each missing value.
Step 1: Data Preparation and Imputation
First, we download and prepare our data, introducing some missingness for demonstration purposes. The mice package is then used to perform multiple imputation:
library(mice)
# Read the semicolon-separated BFI data (base R read.csv, so readr is not needed)
bfi <- read.csv("https://lukasnovak.online/media/data/bfi.csv", sep = ";")
# Introducing some missingness; plain NA keeps each column's type,
# whereas NA_character_ would coerce a numeric column to character
bfi[4, 1] <- NA
bfi[6, 2] <- NA
bfi[9, 1] <- NA
bfi[7, 2] <- NA
bfi[6, 1] <- NA
# Performing multiple imputation with m = 3 imputed datasets
imput.bfi <- mice(bfi, m = 3)
##
## iter imp variable
## 1 1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 1 2 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 1 3 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 2 1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 2 2 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 2 3 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 3 1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 3 2 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 3 3 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 4 1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 4 2 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 4 3 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 5 1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 5 2 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
## 5 3 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O3 O4 O5 education
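Because mice draws imputed values at random, repeated runs give slightly different results. The following is a minimal sketch of a reproducible run, assuming the small nhanes dataset that ships with mice as a stand-in for bfi; the seed value 123 and the object names are arbitrary choices for illustration, not part of the original analysis.

```r
library(mice)

# nhanes is bundled with mice: 25 rows, columns age, bmi, hyp, chl
# (bmi, hyp and chl contain missing values)
imp <- mice(nhanes, m = 3, seed = 123, printFlag = FALSE)

# Imputation method chosen for each variable ("" means nothing to impute)
print(imp$method)

# Extract the first completed dataset and confirm that no NAs remain
completed_1 <- complete(imp, 1)
print(colSums(is.na(completed_1)))
```

Fixing the seed makes the imputations repeatable, and checking a completed dataset is a quick way to spot variables that were silently skipped.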
Step 2: Analyzing Imputed Data
Once we have our imputed datasets, we run the same analysis on each one. In this case, we use linear regression to model the relationship between N1 and age:
library(psych)
# Running linear regression on imputed data
lm.bfi <- with(imput.bfi, lm(N1 ~ age))
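The object returned by with() holds one fitted model per imputed dataset. As a hedged illustration, again assuming the bundled nhanes data and a hypothetical model bmi ~ age in place of the bfi variables, the individual fits can be inspected like this:

```r
library(mice)

# Stand-in data and model; the names imp, fit and coefs are illustrative only
imp <- mice(nhanes, m = 3, seed = 123, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ age))

# The result is a "mira" object; $analyses is a list with one lm fit per imputation
print(class(fit))
print(length(fit$analyses))

# Coefficients from each imputed dataset, one column per imputation
coefs <- sapply(fit$analyses, coef)
print(coefs)
```

Comparing the columns of coefs shows how much the estimates vary from one imputed dataset to the next, which is the between-imputation variability that pooling accounts for.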
Step 3: Pooling and Interpreting Results
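After analyzing each imputed dataset, we pool the results to get overall estimates. pool() combines the per-dataset fits using Rubin's rules: the coefficients are averaged, and their variance combines the within-imputation and between-imputation variability.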
# Pooling the results
pooled_results <- pool(lm.bfi)
# print results:
print(pooled_results)
## Class: mipo m = 3
## term m estimate ubar b t dfcom
## 1 (Intercept) 3 3.28746417 6.736793e-03 1.342301e-05 6.754690e-03 2798
## 2 age 3 -0.01236588 7.075051e-06 7.990314e-09 7.085705e-06 2798
## df riv lambda fmi
## 1 2761.562 0.002656657 0.002649618 0.003371143
## 2 2783.016 0.001505820 0.001503556 0.002220347
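The printed mipo object reports variance components rather than standard errors or p-values. A short sketch of how to obtain the usual regression table, assuming the bundled nhanes data and a hypothetical bmi ~ age model as a stand-in for the bfi analysis:

```r
library(mice)

# Stand-in data and model; object names are illustrative only
imp <- mice(nhanes, m = 3, seed = 123, printFlag = FALSE)
fit <- with(imp, lm(bmi ~ age))
pooled <- pool(fit)

# summary() on the pooled object adds standard errors, test statistics,
# degrees of freedom and p-values; conf.int = TRUE adds confidence limits
pooled_summary <- summary(pooled, conf.int = TRUE)
print(pooled_summary)
```

The same call on pooled_results from the article should give the corresponding table for N1 ~ age.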
Understanding Pooled Results
The output of print(pooled_results) summarizes the pooled model. Here’s the interpretation:
- Estimates: The pooled intercept (3.29) and the coefficient for age (-0.0124) are the averages of the estimates across the three imputed datasets.
- Relative Increase in Variance (RIV): The increase in variance due to the missing data, relative to the within-imputation variance. It indicates the additional uncertainty due to missing data.
- Lambda: The proportion of total variance attributable to the missing data. Small lambda values suggest that variability due to missing data is minimal compared to the total variability in the model. The present results imply that the missing data have a negligible impact on the estimated parameters in the linear regression model.
- Fraction of Missing Information (FMI): The proportion of information about the parameter that is lost because of the missing values. Here it is about 0.003 for the intercept and 0.002 for age.
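The quantities in the list above follow from the ubar (within-imputation variance), b (between-imputation variance) and df columns of the printed output. The sketch below recomputes them by hand with Rubin's rules, using the values copied from that output; the variable names are illustrative, and the results should match the t, riv, lambda and fmi columns up to rounding.

```r
# Values copied from the printed pooled output
m    <- 3
ubar <- c(intercept = 6.736793e-03, age = 7.075051e-06)  # within-imputation variance
b    <- c(intercept = 1.342301e-05, age = 7.990314e-09)  # between-imputation variance
df   <- c(intercept = 2761.562,     age = 2783.016)      # pooled degrees of freedom

# Total variance: within + between, with a correction for the finite number of imputations
total_var <- ubar + (1 + 1 / m) * b

# Relative increase in variance due to missing data
riv <- (1 + 1 / m) * b / ubar

# Proportion of total variance attributable to missing data
lambda <- (1 + 1 / m) * b / total_var

# Fraction of missing information
fmi <- (riv + 2 / (df + 3)) / (riv + 1)

# Pooled standard errors are the square roots of the total variances
se <- sqrt(total_var)

print(rbind(total_var, riv, lambda, fmi, se))
```

Because b is tiny compared with ubar for both terms, all three diagnostics are close to zero, which is what supports the conclusion that the missing data have little influence here.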
Conclusion
By employing multiple imputation and pooling, we can retain incomplete cases that listwise deletion would discard, which can increase the statistical power of our tests, while quantifying the uncertainty in the regression parameters that is attributable to missing values.