Package 'mvnimpute' reference manual

Title:	Simultaneously Impute the Missing and Censored Values
Description:	Implementing a multiple imputation algorithm for multivariate data with missing and censored values under a coarsening at random assumption (Heitjan and Rubin, 1991<doi:10.1214/aos/1176348396>). The multiple imputation algorithm is based on the data augmentation algorithm proposed by Tanner and Wong (1987)<doi:10.1080/01621459.1987.10478458>. The Gibbs sampling algorithm is adopted to to update the model parameters and draw imputations of the coarse data.
Authors:	Hesen Li
Maintainer:	Hesen Li <[email protected]>
License:	GPL-2 \| GPL-3
Version:	1.0.1
Built:	2025-04-01 04:18:07 UTC
Source:	https://github.com/hli226/mvnimpute

mvnimpute: Multiple imputation for multivariate data with missing and censored values

Description

The mvnimpute package implements multiple imputation for simultaneously imputing missing and censored values based on the joint normal model assumption.

Author(s)

Hesen Li

Autocorrelation function

Description

Calculates the autocorrelation function and draws the plots.

Usage

acf.calc(data.mat, lag = 50, plot = TRUE, title = NULL, details = FALSE)
acf.calc(data.mat, lag = 50, plot = TRUE, title = NULL, details = FALSE)

Arguments

`data.mat`	matrix including the variables of which autocorrelations are calculated.
`lag`	lag at which the autocorrelation is calculated, default is set as 50.
`plot`	logical variable to specify whether the plot is generated, default is set to TRUE.
`title`	title of each generated autocorrelation plot.
`details`	boolean variable to specify whether the autocorrelation values are returned, default is set to FALSE.

Details

This function calculates the autocorrelations of all the variables on a column by column base. The default value of lag is set as 50, the maximum number of lag should not exceed the number of rows of the dataset, which reflects the corresponding number of iteration of running the multiple imputation.

Value

If details = TRUE, a matrix containing the calculated autocorrelations of all the variables in the dataset will be returned. If plot = TRUE, the autocorrelation plots of all the variables will be drawn.

Examples


### generate some data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### ACF plots
acf.calc(data.mat = dat, title = paste0("Var ", 1:nrow(dat)))

### generate some data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### ACF plots
acf.calc(data.mat = dat, title = paste0("Var ", 1:nrow(dat)))

Averaged simulated values plot function

Description

Calculates the average simulated values of all parameters and generates plots.

Usage

avg.plot(
  data.mat,
  start,
  end,
  x.lab = "Iteration number",
  y.lab = "Average of simulated values",
  title = NULL,
  details = FALSE
)
avg.plot(
  data.mat,
  start,
  end,
  x.lab = "Iteration number",
  y.lab = "Average of simulated values",
  title = NULL,
  details = FALSE
)

Arguments

`data.mat`	data matrix including the simulated values for plot.
`start`	the number of cycle to start.
`end`	the number of cycle to end.
`x.lab`	label of the x axis in the generated plot, default is set to "Iteration number".
`y.lab`	label of the y axis in the generated plot, default is set to "Average of simulated values".
`title`	title of each generated plot.
`details`	logical variable to specify whether the average simulated values are returned, default is set to FALSE.

Details

This function calculates the average simulated values across simulations. iter can be any number of iterations you want to draw, the corresponding number of rows of the data should be iter + 1.

Value

The plot of averaged values across iterations. If details = TRUE, a matrix containing the averaged values of all the variables across iterations will be returned.

Examples


### generate some normal data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### set column names
colnames(dat) <- paste0("Var ", 1:ncol(dat))

### average values plot: take sample from 500 to 1000 rows
avg.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")

### generate some normal data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### set column names
colnames(dat) <- paste0("Var ", 1:ncol(dat))

### average values plot: take sample from 500 to 1000 rows
avg.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")

Convergence plot function

Description

Draws convergence plot for the simulated parameter values of all variables.

Usage

conv.plot(
  data.mat,
  start,
  end,
  x.lab = "Iteration number",
  y.lab = "Simulated values",
  title = NULL
)
conv.plot(
  data.mat,
  start,
  end,
  x.lab = "Iteration number",
  y.lab = "Simulated values",
  title = NULL
)

Arguments

`data.mat`	data matrix including the simulated values.
`start`	the number of cycle to start.
`end`	the number of cycle to end.
`x.lab`	label of the x axis in the generated plot, default is set to "Iteration number".
`y.lab`	label of the y axis in the generated plot, default is set to "Simulated values".
`title`	title of each generated plot.

Details

The function generates the trace plot of simulated values across iterations. iter can be any number of iterations you want to draw, the corresponding number of rows of the data is iter + 1.

Value

The plot of simulated values across iterations.

Examples


### generate some data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### set column names
colnames(dat) <- paste0("Var ", 1:ncol(dat))

### convergence plot: select samples from 500 to 1000 rows
conv.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")

### generate some data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### set column names
colnames(dat) <- paste0("Var ", 1:ncol(dat))

### convergence plot: select samples from 500 to 1000 rows
conv.plot(data.mat = dat[500:1000, ], start = 500, end = 1000, title = "Random Variables")

Data generation function

Description

Simulates multivariate normal data with missing and censored values. In this function, missing values will be generated first in the multivariate data, then censored values will be generated for the non-missing data.

Usage

data.generation(
  num_ind = 2000,
  mean_vec = rnorm(5),
  cov_mat = diag(5),
  miss_var = c(2, 3),
  miss_mech = "MCAR",
  miss_prob = c(0.2, 0.4),
  censor_var = 4,
  censor_type = "interval",
  censor_param = 0.1
)
data.generation(
  num_ind = 2000,
  mean_vec = rnorm(5),
  cov_mat = diag(5),
  miss_var = c(2, 3),
  miss_mech = "MCAR",
  miss_prob = c(0.2, 0.4),
  censor_var = 4,
  censor_type = "interval",
  censor_param = 0.1
)

Arguments

`num_ind`	number of subjects.
`mean_vec`	mean vectors.
`cov_mat`	covariance matrix.
`miss_var`	variables that have missing values.
`miss_mech`	missing mechanism. "MCAR" or "MAR". Default "MCAR".
`miss_prob`	missing data probability when missing data is MCAR.
`censor_var`	variables that have censored values.
`censor_type`	type of censoring. "interval", "right" or "left. Default "interval".
`censor_param`	rate parameter of the exponential distribution that the censoring times come from.

Value

A list containing the fully observed data, the observed data, the bounds information of the observed data and the data type indicator matrix.

Examples

### generate a multivariate normal dataset of 2000 sample size
### using the default arguments
data.generation()

### generate a multivariate normal dataset of 2000 sample size
### using the default arguments
data.generation()

Marginal density plots function

Description

Draws marginal density plots for all variables

Usage

marg.plot(data.mat, title = NULL)
marg.plot(data.mat, title = NULL)

Arguments

`data.mat`	data matrix including all the variables.
`title`	title of each generated plot.

Value

Marginal density plot for each variable in the dataset.

Examples

### generate some data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### set column names
colnames(dat) <- paste0("Var ", 1:ncol(dat))

### marginal plots
marg.plot(data.mat = dat, title = paste0("Var", 1:nrow(dat)))
### generate some data
dat <- MASS::mvrnorm(n = 1000, mu = c(1, 2, 3, 4), Sigma = diag(4))

### set column names
colnames(dat) <- paste0("Var ", 1:ncol(dat))

### marginal plots
marg.plot(data.mat = dat, title = paste0("Var", 1:nrow(dat)))

Multiple imputation function

Description

Multiply imputes the missing and censored values in multivariate data.

Usage

multiple.imputation(data, prior.params, initial.values, iter, verbose = TRUE)
multiple.imputation(data, prior.params, initial.values, iter, verbose = TRUE)

Arguments

`data`	a list of data containing the lower and upper bounds information for the missing and censored values.
`prior.params`	list of prior parameter specifications.
`initial.values`	list of initial values.
`iter`	number of rounds for doing multiple imputation.
`verbose`	boolean variable indicating whether the running status is printed in the console. Default is set to TRUE.

Details

A multivariate normal model is assumed on the data, the sweep operator is adopted to calculate the parameters of the conditional models. The implemented multiple imputation algorithm is based on the data augmentation algorithm proposed by Tanner and Wong (1987). The Gibbs sampling algorithm is adopted to update the model parameters and draw imputations of the coarse data. Output is a list including the parameters of the normal models and the imputed data across different iterations of multiple imputation.

Value

A list including the simulated mean and variance values of the assumed normal model, the covariance matrix, the imputed data, and the conditional model parameters across different iterations of multiple imputation.

References

Goodnight, J. H. (1979). A tutorial on the SWEEP operator. The American Statistician, 33(3), 149-158.

Tanner, M., & Wong, W. (1987). The Calculation of Posterior Distributions by Data Augmentation. Journal of the American Statistical Association, 82(398), 528-540.

Examples

## Not run: 
### data and indicator
miss.dat <- simulated.dat[[1]]
data.ind <- simulated.dat[[2]]

### number of observations and variables
n <- nrow(miss.dat); p <- ncol(miss.dat)

#### bound matrices
b1 <- b2 <- matrix(nrow = nrow(data.ind), ncol = ncol(data.ind))

for (i in 1:nrow(b1)) {
  for (j in 1:ncol(b1)) {
    b1[i, j] <- ifelse(data.ind[i, j] != 1, NA,
                       miss.dat[i, j])
    b2[i, j] <- ifelse(data.ind[i, j] == 0, NA, miss.dat[i, j])
  }
}
colnames(b1) <- colnames(b2) <- colnames(miss.dat)

#### create a matrix for including the lower and upper bounds
bounds <- list()
bounds[[1]] <- b1; bounds[[2]] <- b2

### prior specifications
prior.param <- list(
  mu.0 = rep(0, p),
  Lambda.0 = diag(100, p),
  kappa.0 = 2,
  nu.0 = p * (p + 1) / 2
)

### starting values
start.vals <- list(
  mu = rep(0, p),
  sigma = diag(100, p)
)

### imputation
sim.res <- multiple.imputation(
   data = bounds,
   prior.params = prior.param,
   initial.values = start.vals,
   iter = 500,
   verbose = FALSE
)

## End(Not run)

## Not run: 
### data and indicator
miss.dat <- simulated.dat[[1]]
data.ind <- simulated.dat[[2]]

### number of observations and variables
n <- nrow(miss.dat); p <- ncol(miss.dat)

#### bound matrices
b1 <- b2 <- matrix(nrow = nrow(data.ind), ncol = ncol(data.ind))

for (i in 1:nrow(b1)) {
  for (j in 1:ncol(b1)) {
    b1[i, j] <- ifelse(data.ind[i, j] != 1, NA,
                       miss.dat[i, j])
    b2[i, j] <- ifelse(data.ind[i, j] == 0, NA, miss.dat[i, j])
  }
}
colnames(b1) <- colnames(b2) <- colnames(miss.dat)

#### create a matrix for including the lower and upper bounds
bounds <- list()
bounds[[1]] <- b1; bounds[[2]] <- b2

### prior specifications
prior.param <- list(
  mu.0 = rep(0, p),
  Lambda.0 = diag(100, p),
  kappa.0 = 2,
  nu.0 = p * (p + 1) / 2
)

### starting values
start.vals <- list(
  mu = rep(0, p),
  sigma = diag(100, p)
)

### imputation
sim.res <- multiple.imputation(
   data = bounds,
   prior.params = prior.param,
   initial.values = start.vals,
   iter = 500,
   verbose = FALSE
)

## End(Not run)

Combined NHANES dataset from 1999-2004 NHANES study

Description

A dataset including the age, gender and diastolic blood pressure, body mass index and 24 PCB measurements.

Usage

NHANES.dat
NHANES.dat

Format

A list including data frame with 5874 rows and 24 variables and associated indicator matrix:

BPXDAR: Diastolic blood pressure
RIAGENDR: Gender, 1 = male, 2 = female
RIDAGEYR: Age in years
BMXBMI: Body mass index

Details

The dataset is combined from the NHANES release cycles 1999-2000, 2001-2002, and 2003-2004. Almost all PCB have both the missing and censored values as falling below the limits of detection (LODs). The dataset include two components, the first component is the observed NHANES data where the censored PCB measurements are replaced by the LODs dividing the square root of 2. The second component is a data frame including the censoring indicators of the data, in that data frame, 0 indicates an observed PCB measurement, 1 indicates a censored PCB measurement, and 'NA' indicates a missing PCB measurement.

Note

The subset provided here was selected to demonstrate the functionality of the mvnimpute package, no clinical conclusions should be derived from it.

Source

https://www.cdc.gov/nchs/nhanes/index.htm

Simulated continuous data with missing and censored values

Description

A dataset including simulated data with missing and censored values

Usage

simulated.dat
simulated.dat

Format

A list including data matrix with 200 rows and 4 variables and associated indicator matrix:

y: Outcome variable to be used in the regression model after imputation
x1: First covariate variable subject to MAR missing and non-informative censored values
x2: Second covariate variable subject to MAR missing and non-informative censored values
x3: Third covariate variable that is fully observed

Details

A simulated dataset and its associated indicator matrix are included into a list. In the indicator matrix, 0 stands for the missing values, 1 stands for the observed values, and 3 stands for the left censored values.

Draws percentage plot for different type of values

Description

Draws plot that graphically shows the percentages of the missing, censored and observed data. It supports generating plots for all major types of censoring including left, right and interval censoring.

Usage

visual.plot(data.indicator, title = "Percentages of different data type")
visual.plot(data.indicator, title = "Percentages of different data type")

Arguments

`data.indicator`	matrix including the data type indicators of the original data.
`title`	title of the generated plot, default is set to "Percentages of different data type".

Details

The function draws the plot that graphically shows the percentages of the missing, censored and observed data in the dataset. data.indicator should be a matrix containing the data type indicators as generated in the data preparation step. 0 for missing values, 1 for observed values, and 2 for right censored values, 3 for left censored values, and 4 for interval censored values. title is the title of the generated plot.

Value

The plot that shows the details of the different type of data in the dataset.

Examples

data.ind <- simulated.dat[[2]]
visual.plot(data.ind)

data.ind <- simulated.dat[[2]]
visual.plot(data.ind)

Package 'mvnimpute'

Help Index

mvnimpute: Multiple imputation for multivariate data with missing and censored values

Description

Author(s)

Autocorrelation function

Description

Usage

Arguments

Details

Value

Examples

Averaged simulated values plot function

Description

Usage

Arguments

Details

Value

Examples

Convergence plot function

Description

Usage

Arguments

Details

Value

Examples

Data generation function

Description

Usage

Arguments

Value

Examples

Marginal density plots function

Description

Usage

Arguments

Value

Examples

Multiple imputation function

Description

Usage

Arguments

Details

Value

References

Examples

Combined NHANES dataset from 1999-2004 NHANES study

Description

Usage

Format

Details

Note

Source

Simulated continuous data with missing and censored values

Description

Usage

Format

Details

Draws percentage plot for different type of values

Description

Usage

Arguments

Details

Value

Examples