Title: | Simultaneously Impute the Missing and Censored Values |
---|---|
Description: | Implementing a multiple imputation algorithm for multivariate data with missing and censored values under a coarsening at random assumption (Heitjan and Rubin, 1991<doi:10.1214/aos/1176348396>). The multiple imputation algorithm is based on the data augmentation algorithm proposed by Tanner and Wong (1987)<doi:10.1080/01621459.1987.10478458>. The Gibbs sampling algorithm is adopted to to update the model parameters and draw imputations of the coarse data. |
Authors: | Hesen Li |
Maintainer: | Hesen Li <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.0.1 |
Built: | 2024-11-02 03:57:35 UTC |
Source: | https://github.com/hli226/mvnimpute |
The mvnimpute package implements multiple imputation for simultaneously imputing missing and censored values based on the joint normal model assumption.
Hesen Li
Calculates the autocorrelation function and draws the plots.
acf.calc(data.mat, lag = 50, plot = TRUE, title = NULL, details = FALSE)
acf.calc(data.mat, lag = 50, plot = TRUE, title = NULL, details = FALSE)
data.mat |
matrix including the variables of which autocorrelations are calculated. |
lag |
lag at which the autocorrelation is calculated, default is set as 50. |
plot |
logical variable to specify whether the plot is generated, default is set to TRUE. |
title |
title of each generated autocorrelation plot. |
details |
boolean variable to specify whether the autocorrelation values are returned, default is set to FALSE. |
This function calculates the autocorrelations of all the variables on a column by column base.
The default value of lag
is set as 50, the maximum number of lag should not exceed the number of rows of the dataset,
which reflects the corresponding number of iteration of running the multiple imputation.
If details
= TRUE, a matrix containing the calculated autocorrelations of all the variables in the dataset will be returned.
If plot
= TRUE, the autocorrelation plots of all the variables will be drawn.
### generate some data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### ACF plots acf.calc(data.mat = dat, title = paste0("Var ", 1:nrow(dat)))
### generate some data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### ACF plots acf.calc(data.mat = dat, title = paste0("Var ", 1:nrow(dat)))
Calculates the average simulated values of all parameters and generates plots.
avg.plot( data.mat, start, end, x.lab = "Iteration number", y.lab = "Average of simulated values", title = NULL, details = FALSE )
avg.plot( data.mat, start, end, x.lab = "Iteration number", y.lab = "Average of simulated values", title = NULL, details = FALSE )
data.mat |
data matrix including the simulated values for plot. |
start |
the number of cycle to start. |
end |
the number of cycle to end. |
x.lab |
label of the x axis in the generated plot, default is set to "Iteration number". |
y.lab |
label of the y axis in the generated plot, default is set to "Average of simulated values". |
title |
title of each generated plot. |
details |
logical variable to specify whether the average simulated values are returned, default is set to FALSE. |
This function calculates the average simulated values across simulations.
iter
can be any number of iterations you want to draw, the corresponding number of rows
of the data should be iter
+ 1.
The plot of averaged values across iterations. If details
= TRUE,
a matrix containing the averaged values of all the variables across iterations will be returned.
### generate some normal data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### set column names colnames(dat) <- paste0("Var ", 1:ncol(dat)) ### average values plot: take sample from 500 to 1000 rows avg.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")
### generate some normal data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### set column names colnames(dat) <- paste0("Var ", 1:ncol(dat)) ### average values plot: take sample from 500 to 1000 rows avg.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")
Draws convergence plot for the simulated parameter values of all variables.
conv.plot( data.mat, start, end, x.lab = "Iteration number", y.lab = "Simulated values", title = NULL )
conv.plot( data.mat, start, end, x.lab = "Iteration number", y.lab = "Simulated values", title = NULL )
data.mat |
data matrix including the simulated values. |
start |
the number of cycle to start. |
end |
the number of cycle to end. |
x.lab |
label of the x axis in the generated plot, default is set to "Iteration number". |
y.lab |
label of the y axis in the generated plot, default is set to "Simulated values". |
title |
title of each generated plot. |
The function generates the trace plot of simulated values across iterations.
iter
can be any number of iterations you want to draw, the corresponding number of rows
of the data is iter
+ 1.
The plot of simulated values across iterations.
### generate some data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### set column names colnames(dat) <- paste0("Var ", 1:ncol(dat)) ### convergence plot: select samples from 500 to 1000 rows conv.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")
### generate some data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### set column names colnames(dat) <- paste0("Var ", 1:ncol(dat)) ### convergence plot: select samples from 500 to 1000 rows conv.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")
Simulates multivariate normal data with missing and censored values. In this function, missing values will be generated first in the multivariate data, then censored values will be generated for the non-missing data.
data.generation( num_ind = 2000, mean_vec = rnorm(5), cov_mat = diag(5), miss_var = c(2, 3), miss_mech = "MCAR", miss_prob = c(0.2, 0.4), censor_var = 4, censor_type = "interval", censor_param = 0.1 )
data.generation( num_ind = 2000, mean_vec = rnorm(5), cov_mat = diag(5), miss_var = c(2, 3), miss_mech = "MCAR", miss_prob = c(0.2, 0.4), censor_var = 4, censor_type = "interval", censor_param = 0.1 )
num_ind |
number of subjects. |
mean_vec |
mean vectors. |
cov_mat |
covariance matrix. |
miss_var |
variables that have missing values. |
miss_mech |
missing mechanism. "MCAR" or "MAR". Default "MCAR". |
miss_prob |
missing data probability when missing data is MCAR. |
censor_var |
variables that have censored values. |
censor_type |
type of censoring. "interval", "right" or "left. Default "interval". |
censor_param |
rate parameter of the exponential distribution that the censoring times come from. |
A list containing the fully observed data, the observed data, the bounds information of the observed data and the data type indicator matrix.
### generate a multivariate normal dataset of 2000 sample size ### using the default arguments data.generation()
### generate a multivariate normal dataset of 2000 sample size ### using the default arguments data.generation()
Draws marginal density plots for all variables
marg.plot(data.mat, title = NULL)
marg.plot(data.mat, title = NULL)
data.mat |
data matrix including all the variables. |
title |
title of each generated plot. |
Marginal density plot for each variable in the dataset.
### generate some data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### set column names colnames(dat) <- paste0("Var ", 1:ncol(dat)) ### marginal plots marg.plot(data.mat = dat, title = paste0("Var", 1:nrow(dat)))
### generate some data dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4)) ### set column names colnames(dat) <- paste0("Var ", 1:ncol(dat)) ### marginal plots marg.plot(data.mat = dat, title = paste0("Var", 1:nrow(dat)))
Multiply imputes the missing and censored values in multivariate data.
multiple.imputation(data, prior.params, initial.values, iter, verbose = TRUE)
multiple.imputation(data, prior.params, initial.values, iter, verbose = TRUE)
data |
a list of data containing the lower and upper bounds information for the missing and censored values. |
prior.params |
list of prior parameter specifications. |
initial.values |
list of initial values. |
iter |
number of rounds for doing multiple imputation. |
verbose |
boolean variable indicating whether the running status is printed in the console. Default is set to TRUE. |
A multivariate normal model is assumed on the data, the sweep operator is adopted to calculate the parameters of the conditional models. The implemented multiple imputation algorithm is based on the data augmentation algorithm proposed by Tanner and Wong (1987). The Gibbs sampling algorithm is adopted to update the model parameters and draw imputations of the coarse data. Output is a list including the parameters of the normal models and the imputed data across different iterations of multiple imputation.
A list including the simulated mean and variance values of the assumed normal model, the covariance matrix, the imputed data, and the conditional model parameters across different iterations of multiple imputation.
Goodnight, J. H. (1979). A tutorial on the SWEEP operator. The American Statistician, 33(3), 149-158.
Tanner, M., & Wong, W. (1987). The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82(398), 528-540.
## Not run: ### data and indicator miss.dat <- simulated.dat[[1]] data.ind <- simulated.dat[[2]] ### number of observations and variables n <- nrow(miss.dat); p <- ncol(miss.dat) #### bound matrices b1 <- b2 <- matrix(nrow = nrow(data.ind), ncol = ncol(data.ind)) for (i in 1:nrow(b1)) { for (j in 1:ncol(b1)) { b1[i, j] <- ifelse(data.ind[i, j] != 1, NA, miss.dat[i, j]) b2[i, j] <- ifelse(data.ind[i, j] == 0, NA, miss.dat[i, j]) } } colnames(b1) <- colnames(b2) <- colnames(miss.dat) #### create a matrix for including the lower and upper bounds bounds <- list() bounds[[1]] <- b1; bounds[[2]] <- b2 ### prior specifications prior.param <- list( mu.0 = rep(0, p), Lambda.0 = diag(100, p), kappa.0 = 2, nu.0 = p * (p + 1) / 2 ) ### starting values start.vals <- list( mu = rep(0, p), sigma = diag(100, p) ) ### imputation sim.res <- multiple.imputation( data = bounds, prior.params = prior.param, initial.values = start.vals, iter = 500, verbose = FALSE ) ## End(Not run)
## Not run: ### data and indicator miss.dat <- simulated.dat[[1]] data.ind <- simulated.dat[[2]] ### number of observations and variables n <- nrow(miss.dat); p <- ncol(miss.dat) #### bound matrices b1 <- b2 <- matrix(nrow = nrow(data.ind), ncol = ncol(data.ind)) for (i in 1:nrow(b1)) { for (j in 1:ncol(b1)) { b1[i, j] <- ifelse(data.ind[i, j] != 1, NA, miss.dat[i, j]) b2[i, j] <- ifelse(data.ind[i, j] == 0, NA, miss.dat[i, j]) } } colnames(b1) <- colnames(b2) <- colnames(miss.dat) #### create a matrix for including the lower and upper bounds bounds <- list() bounds[[1]] <- b1; bounds[[2]] <- b2 ### prior specifications prior.param <- list( mu.0 = rep(0, p), Lambda.0 = diag(100, p), kappa.0 = 2, nu.0 = p * (p + 1) / 2 ) ### starting values start.vals <- list( mu = rep(0, p), sigma = diag(100, p) ) ### imputation sim.res <- multiple.imputation( data = bounds, prior.params = prior.param, initial.values = start.vals, iter = 500, verbose = FALSE ) ## End(Not run)
A dataset including the age, gender and diastolic blood pressure, body mass index and 24 PCB measurements.
NHANES.dat
NHANES.dat
A list including data frame with 5874 rows and 24 variables and associated indicator matrix:
Diastolic blood pressure
Gender, 1 = male, 2 = female
Age in years
Body mass index
The dataset is combined from the NHANES release cycles 1999-2000, 2001-2002, and 2003-2004. Almost all PCB have both the missing and censored values as falling below the limits of detection (LODs). The dataset include two components, the first component is the observed NHANES data where the censored PCB measurements are replaced by the LODs dividing the square root of 2. The second component is a data frame including the censoring indicators of the data, in that data frame, 0 indicates an observed PCB measurement, 1 indicates a censored PCB measurement, and 'NA' indicates a missing PCB measurement.
The subset provided here was selected to demonstrate the functionality of the mvnimpute package, no clinical conclusions should be derived from it.
https://www.cdc.gov/nchs/nhanes/index.htm
A dataset including simulated data with missing and censored values
simulated.dat
simulated.dat
A list including data matrix with 200 rows and 4 variables and associated indicator matrix:
Outcome variable to be used in the regression model after imputation
First covariate variable subject to MAR missing and non-informative censored values
Second covariate variable subject to MAR missing and non-informative censored values
Third covariate variable that is fully observed
A simulated dataset and its associated indicator matrix are included into a list. In the indicator matrix, 0 stands for the missing values, 1 stands for the observed values, and 3 stands for the left censored values.
Draws plot that graphically shows the percentages of the missing, censored and observed data. It supports generating plots for all major types of censoring including left, right and interval censoring.
visual.plot(data.indicator, title = "Percentages of different data type")
visual.plot(data.indicator, title = "Percentages of different data type")
data.indicator |
matrix including the data type indicators of the original data. |
title |
title of the generated plot, default is set to "Percentages of different data type". |
The function draws the plot that graphically shows the percentages of the missing, censored and observed
data in the dataset. data.indicator
should be a matrix containing the data type indicators as generated in the
data preparation step. 0 for missing values, 1 for observed values, and 2 for right censored values, 3 for left censored values,
and 4 for interval censored values. title
is the title
of the generated plot.
The plot that shows the details of the different type of data in the dataset.
data.ind <- simulated.dat[[2]] visual.plot(data.ind)
data.ind <- simulated.dat[[2]] visual.plot(data.ind)