Comparative study of two imputation methods in R package for RGB color histogram data

Multiple imputation (MI) is a powerful tool for handling missing data. This paper compares the multiple imputation methods implemented in the Amelia II and MICE packages in R. Both packages are well known and widely relied upon for missing data research in numerous domains. Very little research has compared multiple imputation combined with other techniques in the context of image data. We use the mean absolute error (MAE) to evaluate the accuracy of the predicted values at missing data percentages of 20% and 50%. Although MICE is time consuming to run, the results show that it can deal with a large amount of missing values, whereas Amelia II can only deal with up to 20% missing values. Based on the MAE results, each package is superior on particular variables.


Introduction
Missing data is a common data quality problem. It can have a significant impact on the statistical inference drawn from data analysis results. The missing data problem has therefore received great attention from researchers, because it affects decision making and planning.

Literature review
Numerous efforts have been made to improve on existing solutions to the missing data problem [1]. Traditionally, deletion has been one of the easiest techniques for handling missing data. Later, imputation methods were introduced, in which missing values are replaced with plausible values such as the mean, median or mode. Although mean imputation may mitigate the drawbacks of deletion, the imputed data are clearly biased. Conditional mean imputation and stochastic regression imputation were devised to reduce this bias. All of the aforementioned techniques are categorized as single imputation methods [2], and they have limitations in dealing with high-dimensional data sets. Image data sets, moreover, normally consist of high-dimensional data. In the literature, CollaGAN was proposed to impute missing image data using a multi-domain image-to-image translation technique [3].
Multiple imputation [4] is one of the strategies for overcoming a number of weaknesses of single imputation. It was first introduced by Rubin in 1978, during the early development of the method. Multiple imputation is an iterative procedure consisting of three distinct phases: (i) The imputation phase creates several copies of the data set (say, m = 5) in which the missing values are replaced by plausible values, commonly generated by iterative stochastic regression imputation. Each of the m copies contains different imputed values. (ii) The analysis phase estimates the statistical quantities of interest, such as parameter estimates and standard errors, for each imputed data set, so the m imputed data sets yield m different sets of parameter estimates and standard errors. (iii) The pooling phase combines the m parameter estimates and standard errors into a single parameter estimate and standard error [5].
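The pooling phase can be illustrated with a minimal sketch, assuming the m results are combined with Rubin's rules (the usual choice, although the combination rule is not spelled out above). The function name `pool_rubin` and the numeric values are hypothetical and serve only as an illustration.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool m parameter estimates and standard errors (Rubin's rules, assumed here)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need m >= 2 estimates with matching standard errors")
    q_bar = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between         # total variance of the pooled estimate
    return q_bar, math.sqrt(total)

# Hypothetical analysis results from m = 5 imputed data sets (illustrative only).
estimates = [0.50, 0.52, 0.48, 0.51, 0.49]
std_errors = [0.10, 0.11, 0.10, 0.09, 0.10]
pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(pooled_estimate, pooled_se)
```

The pooled standard error grows with the spread of the m estimates, which is how multiple imputation carries the uncertainty about the missing values into the final inference.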
Figure 1: Scheme of the main steps in multiple imputation [6].
The purpose of this paper is to provide a simulation study of the multiple imputation approach on image data. We compare the performance of two well-known multiple imputation packages in R, Amelia II [7] and Multiple Imputation by Chained Equations (MICE) [8], each of which employs a different statistical approach. The first package, Amelia II, uses the expectation-maximization with bootstrapping (EMB) algorithm and is able to impute high-dimensional missing data in less time. It was implemented by Honaker and King in 2010. The second package, MICE, implements the approach also known as fully conditional specification or sequential regression multiple imputation. It provides several extension procedures that can be combined with the multiple imputation process. Among these, its authors specify predictive mean matching as an extension of the multiple imputation technique in MICE, and we applied it in this simulation study.
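The idea behind predictive mean matching can be shown with a deliberately simplified sketch. It assumes a single predictor and an ordinary least-squares fit, and it omits the posterior draw of the regression coefficients that a full implementation would include, so it is not the routine used by the MICE package. The names `fit_line` and `pmm_impute`, the donor count and the data are all hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x (single predictor)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def pmm_impute(x, y, donors=5, rng=None):
    """Fill the None entries of y by a simplified predictive mean matching on x."""
    rng = rng or random.Random()
    obs = [i for i, v in enumerate(y) if v is not None]
    a, b = fit_line([x[i] for i in obs], [y[i] for i in obs])
    predicted = [a + b * xi for xi in x]
    completed = list(y)
    for i, v in enumerate(y):
        if v is None:
            # Observed cases whose predicted value lies closest to this case's prediction.
            nearest = sorted(obs, key=lambda j: abs(predicted[j] - predicted[i]))[:donors]
            # Borrow the observed value of one randomly chosen donor.
            completed[i] = y[rng.choice(nearest)]
    return completed

# Hypothetical data: y is missing at two positions.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0.11, None, 0.29, 0.42, None, 0.61, 0.70, 0.79]
completed = pmm_impute(x, y, donors=3, rng=random.Random(0))
print(completed)
```

Because every imputed value is copied from an observed case, the imputations always stay within the range of the observed data, which suits bounded quantities such as histogram proportions.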

Methodology
Let $X$ be the $n \times P$ data matrix, where $n$ is the number of observations and $P$ is the total number of variables or components. We assume that $X$ follows a multivariate distribution that is completely specified by the unknown parameter $\theta$. We denote by $X^{obs}$ and $X^{mis}$ the observed and missing components, respectively. The standard multiple imputation algorithm in Amelia II and MICE follows these steps [9]: 1) Calculate the posterior distribution $p(\theta \mid X^{obs})$ of $\theta$ based on the observed data $X^{obs}$. 2) Draw a value $\theta^{*}$ from $p(\theta \mid X^{obs})$. 3) Draw a value of $X^{mis}$ from its conditional posterior distribution given $X^{obs}$ and $\theta = \theta^{*}$. The parameter estimates are obtained by sampling iteratively from conditional distributions … .
Both the MICE and Amelia methods have been proposed and incorporated into the multiple imputation framework. MICE employs fully conditional specification (FCS) for multiple imputation of the multivariate data $X$. The imputation is carried out through conditional models, imputing the missing values variable by variable. The Amelia package combines the expectation-maximization (EM) algorithm with bootstrapping in the multiple imputation framework. The EM algorithm is used to estimate the parameters $\mu$ and $\Sigma$, collected here in $\theta$. It consists of two iterative steps. The first is the E-step: compute $Q(\theta \mid \theta^{(t)}) = E[\log L(\theta; X) \mid X^{obs}, \theta^{(t)}]$, where $X^{obs}$ denotes the observed data and $t$ is the iteration counter. The second is the M-step: maximize $Q(\theta \mid \theta^{(t)})$ with respect to $\theta$. Before the EM algorithm is run, the data are resampled with replacement using the bootstrap. The EM algorithm is then applied to estimate the parameter $\theta$ from the bootstrapped data. Within the EM algorithm process, the missing data are imputed by drawing them conditional on the observed data $X^{obs}$ and the estimated parameter $\theta$.
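To make the final imputation draw concrete, the following worked expression assumes that the data are multivariate normal with mean $\mu$ and covariance $\Sigma$, and writes $X^{obs}$ and $X^{mis}$ for the observed and missing components of one observation. The subscripts $o$ and $m$, which pick out the corresponding blocks of $\mu$ and $\Sigma$, are notation introduced here for illustration only.

```latex
\mathbb{E}\left[X^{mis} \mid X^{obs}, \mu, \Sigma\right]
  = \mu_{m} + \Sigma_{mo}\,\Sigma_{oo}^{-1}\left(X^{obs} - \mu_{o}\right),
\qquad
\operatorname{Cov}\left[X^{mis} \mid X^{obs}, \mu, \Sigma\right]
  = \Sigma_{mm} - \Sigma_{mo}\,\Sigma_{oo}^{-1}\,\Sigma_{om}.
```

Under this assumption, an imputed value is a draw from a normal distribution with the conditional mean and covariance above, evaluated at the parameter estimates obtained from one bootstrapped sample.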
In this simulation, the missing data are assumed to be missing completely at random (MCAR), meaning that the missingness does not depend on any variable in the data set. The missing data pattern is presented in Figure 2.
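An MCAR pattern of this kind can be generated with a short sketch. It assumes that missing cells are chosen uniformly at random over the whole data matrix. The function names, the seed and the 10 x 64 matrix size are hypothetical, since the number of observations is not stated here.

```python
import random

def mcar_mask(n_rows, n_cols, missing_fraction, seed=None):
    """Boolean matrix in which True marks a cell chosen to be missing (MCAR)."""
    rng = random.Random(seed)
    n_cells = n_rows * n_cols
    n_missing = round(missing_fraction * n_cells)
    # Every cell is equally likely to be chosen, independent of any data value.
    chosen = set(rng.sample(range(n_cells), n_missing))
    return [[(r * n_cols + c) in chosen for c in range(n_cols)] for r in range(n_rows)]

def apply_mask(data, mask):
    """Replace masked cells with None and leave the other cells unchanged."""
    return [[None if miss else value for value, miss in zip(row, mask_row)]
            for row, mask_row in zip(data, mask)]

# Hypothetical 10 x 64 matrix of histogram proportions with 20% of the cells removed.
data = [[1 / 64] * 64 for _ in range(10)]
mask = mcar_mask(n_rows=10, n_cols=64, missing_fraction=0.20, seed=1)
incomplete = apply_mask(data, mask)
print(sum(cell is None for row in incomplete for cell in row))
```

Keeping the mask alongside the incomplete data makes it possible to compare imputed values with the true values at exactly the removed positions.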

Experimental design
We compare two well-established multiple imputation packages in R that are able to work with continuous variables. Through this simulation study, we aim to learn which package is better suited to high-dimensional continuous data sets.
We applied multiple imputation using the Amelia II and MICE packages in R to a data set containing RGB colour histograms, in which the colour features were extracted from 4 different images: (i) building, (ii) festival, (iii) beach and (iv) mountain. The colour histogram gives the distribution of RGB colours in an image. In general, the 3D RGB colour space is divided into cells and the number of pixels falling in each cell is counted.
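The cell-counting step can be sketched as follows, assuming 8-bit channel values and four equal-width cells per axis, which gives 64 bins. The function name, the ordering of the bins and the sample pixels are hypothetical choices made for illustration.

```python
def rgb_histogram(pixels, cells_per_axis=4, levels=256):
    """Normalized colour histogram with cells_per_axis ** 3 bins."""
    width = levels // cells_per_axis  # 64 intensity levels per cell for 256 / 4
    counts = [0] * cells_per_axis ** 3
    for r, g, b in pixels:
        # Map each channel to its cell index, then flatten the 3D index (assumed ordering).
        index = ((r // width) * cells_per_axis ** 2
                 + (g // width) * cells_per_axis
                 + (b // width))
        counts[index] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no pixels supplied")
    # Divide each bin by the total so the bins hold proportions that sum to 1.
    return [count / total for count in counts]

# Hypothetical pixels as (R, G, B) triples with values in 0..255.
pixels = [(0, 0, 0), (10, 20, 30), (255, 255, 255), (200, 10, 100)]
histogram = rgb_histogram(pixels)
print(len(histogram), histogram[0], histogram[49], histogram[63])
```

Each image then contributes one row of 64 proportions, which are the continuous variables passed to the imputation packages.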
Each pixel in the image is projected into a 3D RGB colour space, as illustrated in Figure 1. The 3D colour space was divided into 4x4x4 cells, which generates a colour histogram with 64 bins. The number of pixels in each cell is counted and stored in the colour histogram. The pixel counts of all bins are summed to obtain the total, and each bin value is divided by this total. The normalized data give the proportion of pixels in each bin, expressed as a percentage. The purpose of this simulation study is to measure the performance and prediction accuracy of the imputation by comparing predicted and actual values. The mean absolute error (MAE) was used to measure the average error between predicted and actual values. A greater deviation means a larger error between predicted and actual values, so a lower MAE indicates a better result. The two evaluation criteria are:

Experimental results and conclusion
We applied the MAE as the evaluation criterion to assess the accuracy of the imputed data. The results in Table 1 show that the multiple imputation method in the Amelia R package outperformed the multiple imputation algorithm in MICE. We assessed the performance of multiple imputation in both R packages at a missing data percentage (MDP) of 20%. The results show that Amelia II and MICE are each superior on specific variables, as indicated in bold. When we increased the MDP from 30% up to 50%, only MICE consistently managed to impute the missing data. Amelia, by contrast, only works well if the missing data percentage is less than 20%; otherwise, limitations such as collinearity issues appear.
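The MAE criterion used above can be written as a minimal sketch. It assumes that the error is averaged only over the cells that were removed and then imputed, which is one reasonable reading of the evaluation. The function name and the values are hypothetical.

```python
def mean_absolute_error(actual, predicted, missing_mask=None):
    """MAE over all positions, or only over positions where missing_mask is True."""
    if missing_mask is None:
        missing_mask = [True] * len(actual)
    errors = [abs(p - a) for a, p, m in zip(actual, predicted, missing_mask) if m]
    if not errors:
        raise ValueError("no positions to evaluate")
    return sum(errors) / len(errors)

# Hypothetical values for one histogram variable; two cells were removed and imputed.
actual = [0.10, 0.20, 0.30, 0.40]
imputed = [0.10, 0.25, 0.30, 0.30]
was_missing = [False, True, False, True]
mae = mean_absolute_error(actual, imputed, was_missing)
print(mae)
```

Computing the MAE separately for each of the 64 variables gives a per-variable comparison of the two packages, with lower values indicating better imputations.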

Figure 2: Missing data pattern (missingness map) of the 64 RGB colour variables in the image. The red spots indicate where missing values occur in the selected variables.

Figure 1: Illustration of the RGB colour cube.

Table 1: The MAE estimates