MISTRESS V1.17 DOCUMENTATION

Stef van Buuren
Department of Statistics
TNO Preventive and Health, Leiden
Nov 17, 1992
Email: Stef.vanBuuren@tno.nl


DESCRIPTION

This document describes the MISTRESS V1.17 program. The SAS macro MISTRESS
implements the imputation technique for missing categorical data as outlined
in Van Buuren & Van Rijckevorsel (1992). The technique maximizes the sum of
the p largest eigenvalues of the correlation matrix of the quantified imputed
data. Imputations adhere to the original category coding. The method works
best with a large number of variables that are substantially associated (say,
average r(i,j) > 0.60) and when the amount of missing data does not exceed,
say, 20%.


SYSTEM REQUIREMENTS

MISTRESS V1.17 was developed and tested under the Unix HP9000/720 series
release of the SAS System, Version 6.07, and requires base SAS and SAS/IML.
It has not been tested on earlier SAS versions.


TYPES OF DATA VALUES

The macro assumes that each data value falls within one of the following
classes:

OBSERVED  The value is known and is analyzed as usual. A value is observed
          if it is neither MISSING nor IDLE.

MISSING   The value is unknown and will be imputed. Missing values are coded
          as '.', '.a', '.b', etc. The macro does not distinguish between
          different types of missing values and treats all missing values
          alike.

IDLE      The value is unknown and will NOT be imputed. Data values that are
          equal to the macro input variable 'idle' are treated as idle. The
          default is 0. Idle character entries must be coded by the string
          "idle". If all unknown entries are idle, then MISTRESS becomes
          identical to 'HOMALS missing passive' (cf. Gifi, 1990, p. 136).


MACRO INVOCATION

The MISTRESS macro can be invoked as

    %mistress(<parameter1=value><,parameter2=value>...)

All parameters are optional. Only parameters that differ from their default
value need to be specified, and they may be given in any order in the macro
call.
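As a minimal sketch of this calling convention, the two calls below are meant to be equivalent. The data set names survey and survey_imp and the variable names are hypothetical and serve only as illustration; the parameter names are those documented in the parameter list.

```sas
/* Hypothetical input data set SURVEY with variables INCOME, CAR and SEX. */
/* Only parameters that differ from their defaults are given.             */
%mistress(data=survey, out=survey_imp, var={income car sex}, prt=2)

/* The same request with the keyword parameters in a different order.     */
%mistress(prt=2, var={income car sex}, out=survey_imp, data=survey)
```

Because all parameters are keyword parameters, their order carries no meaning; anything left out (here ndim, idle and the convergence settings) keeps its default value.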
In the parameter list below, the default value of each parameter follows the
"=" sign.


PARAMETER LIST

data=_LAST_
    Name of the input SAS data set. If no name is specified, the most
    recently created data set is used.

out=IMPUTED
    Name of the output SAS data set that contains the imputed data. If no
    name is specified, a SAS data set named IMPUTED will be created in the
    current library. The SAS library to which the data are written must
    exist. The output data set must be different from the input data set.
    If OUT=_NONE_, no output is generated.

var=_NUM_
    The list of input variable(s) included in the analysis. Variables must
    be listed individually between braces, e.g. var={income car sex}. The
    macro does not accept range conventions such as 'V1-V5'. Numerical and
    character variables can be mixed. Instead of listing variables
    separately, one of three keywords can be used:

        var=_NUM_ : include all numeric variables
        var=_CHAR_: include all character variables
        var=_ALL_ : include all variables

    By default, all numerical variables are read. Each unique value of a
    variable defines a category.

wgtvar=_NONE_
    Name of a variable that contains non-negative integer row weights. This
    option is especially useful for reading response profile counts or
    cross-classified data. Weight values of zero delete rows from the
    analysis. Note that the data are expanded into a new data set with unit
    row weights before the analysis begins, which may easily lead to large
    matrices. The weighting variable must be listed among the input
    variables, but will be deleted before the analysis begins. After
    weighting, the program prints the new number of observations and
    variables. By default, all rows are weighted equally.

ndim=1
    Dimensionality of the solution. The number of dimensions must not exceed
    the total number of categories minus the number of variables. For
    consistency maximizing imputations choose ndim=1, the default.

idle=0
    Numerical code for idle entries.
    Each data point with a value equal to this code will be treated as idle,
    which means that it is a missing value that will NOT be imputed. By
    default, idle entries are identified by zero. Idle character values must
    be coded with the string "idle". To avoid confusion, do not use
    'idle=.'.

prt=1
    Controls the amount of printing.

        prt=0: prints warnings and error messages
        prt=1: also prints the summary imputation table
        prt=2: also prints the history of iterations

maxit1=100
    The maximum number of iterations of the initial solution. The initial
    solution is used to obtain a reasonable starting imputation before
    optimization begins. Choosing maxit1=0 skips the rational start, and may
    introduce local minima. Changing defaults is not recommended in actual
    data analysis and should only be done to investigate the behaviour of
    the algorithm.

crit1=1E-6
    The convergence criterion of the initial solution. Changing this value
    is not recommended.

maxit2=100
    The maximum number of iterations of the final solution. If maxit2=0, no
    relocation step is performed. Any imputations will then be based on the
    HOMALS passive solution.

crit2=1E-7
    The convergence criterion of the final solution. Changing this value is
    not recommended.

out1=_NONE_
    Name of the output SAS data set that contains the object scores of the
    solution. The data set contains ndim variables of object scores, named
    'X1', 'X2', etc. No scores are saved by default.

out2=_NONE_
    Name of the output SAS data set that contains counts and quantifications
    per category. The following variables are saved: 'var' (variable
    number), 'cat' (category number), 'freq' (count), and 'Y1', 'Y2', etc.
    (quantifications per dimension). By default, no data set is created.

file=print
    Name of the file for printed output. Two filenames are of special
    interest: file=print prints to the output window (default), and file=log
    prints to the log window.


ERROR HANDLING

The macro may generate warnings and errors.
Warnings are printed to the output file and execution continues. Errors are
fatal and abort the program. A list of warnings and errors, each accompanied
by a short explanation, is given below.

Warnings:

1   Warning: Variable X has 32 categories.
    This warning is printed if a variable contains more than 25 categories,
    i.e. if it takes more than 25 different values. The warning is intended
    to spot any overlooked continuous variables in the data. The user may
    wish to recode such variables into fewer categories.

2   Warning: 8 empty row(s) deleted.
    Observations consisting of only missing or idle values are deleted from
    the analysis. The message can be suppressed by deleting these rows from
    the input data set.

3   Warning: 2 empty column(s) deleted.
    Variables consisting of only missing or idle values are deleted from the
    analysis. Delete the variable from the variable list.

4   Warning: Number of dimensions set to 1
    If ndim was specified out of range, the program finds the closest valid
    setting. One may ask for maximal dimensionality by setting ndim to an
    arbitrarily large value, e.g. ndim=999.

5   Warning: Expected counts too low for chi-square test
    This warning is printed if one of the following holds:
    * there are two valid categories and the expected count is lower than 5
    * the expected count is lower than 1 for any category
    * the percentage of expected counts below 5 exceeds 20%
    See Siegel & Castellan (1988, p. 49). In these cases, the p-value of the
    test should not be used.

6   Warning: Variable X not found.
    A variable named on the var= option could not be found in the data= data
    set. The variable will be skipped and will not enter the analysis. Check
    for errors in the var= and data= options.

Errors:

1   Error: Variable X contains no valid data.
    The variable consists entirely of idle and/or missing values and cannot
    be analyzed. Delete the variable from the analysis.
2   Error: Linear dependency in routine WGRAM
    The error may appear if the rank of the data is less than the number of
    dimensions. Try decreasing ndim.

3   Error: Data matrix contains no observations.
    The data set consists entirely of missing/idle values. A possible cause
    is a misspecification of the input data set, or an error in the
    weighting variable. Check the data= and wgtvar= options.

4   Error: Variable X has no categories.
    No valid categories were found for the variable. Check the input data
    set.

5   Error: Too few categories in variable X
    This error occurs if the variable has only one valid category and no
    idle values, which may for example occur in the analysis of binary
    variables in which no zero appears. Check the input data.

6   Error: Empty categories in variable X
    The variable contains categories that were not observed. This error
    should never occur.

7   Error: Iteration limit exceeded in RELMAT
    This error occurs if the relocation loop has not ended after 10 passes
    through the data. It prevents endless looping of the RELMAT routine
    (which normally should never occur).

8   Error: Weighting variable X not found.
    The weighting variable requested on the wgtvar= option could not be
    found in the input data. Check that the variable appears in the variable
    list of the var= option. Check the spelling in the wgtvar= option.

9   Error: Character variables cannot be weighting variables.
    A character variable was requested as a weighting variable. Choose a
    numerical weighting variable.


INVOCATION AS AN IML MODULE

It is possible to run MISTRESS from within IML by calling the MISTRESS
module with a proper set of arguments. The following IML modules must have
been compiled to do this successfully: IND, DEIND, WGRAM, MISTLOSS, DEVMN,
FINDY, CHECK1, CHECK2, RELMAT, INIMAT, CHISQ, PRINFO, MISTRESS.
The arguments of the MISTRESS module are, in this order:

OUTPUT
    hnew    (nobs,nvar)  completed categorical data matrix
    x       (nobs,ndim)  object scores
    y       (skj,ndim)   quantifications
    d       (1,skj)      marginal frequencies after imputation
    gv      (1,skj)      variable indicator
    cv      (1,skj)      original data values per cat.
    eva     (ndim,1)     eigenvalues per dimension
    eta2    (ndim,1)     consistency per dimension (=eva/nvar)

INPUT
    h       (nobs,nvar)  incomplete data matrix (numeric)
    ndim                 number of dimensions
    idle                 the value for idle codes
    crit1                convergence criterion, phase 1
    maxit1               maximum no. of iterations, phase 1
    crit2                convergence criterion, phase 2
    maxit2               maximum no. of iterations, phase 2
    prt                  printing flag (0, 1, 2)
    vars    (1,nvar)     character: variable names

The function returns 1 if an error occurred, otherwise zero. Suppose that
'h' contains the incomplete data and that 'vars' is an array of variable
names. Then the IML statements

    hnew=0; x=0; y=0; d=0; gv=0; cv=0; eva=0; eta2=0;
    err = mistress(hnew, x, y, d, gv, cv, eva, eta2,
                   h, 1, 0, 1E-6, 100, 1E-7, 100, 1, vars);

will run MISTRESS under the usual defaults.


EXAMPLE 1

The statements given below can be submitted to SAS and analyze the data in
Table 1 of Van Buuren & Van Rijckevorsel (1992).

----- BEGIN INPUT -----
/* Example 1: Table 1 of Van Buuren & Van Rijckevorsel (1992) */
%include 'md/mis117.sas';   /* specify place of mistress macro */
data table1;
  length income age $6 car $3;
  input id 1-2 income 4-9 age 11-16 car 18-20;
  cards;
 1        young  jpn
 2 middle middle am
 3        old    am
 4 low    young  jpn
 5 middle young  am
 6 high   old    am
 7 low    young  jpn
 8 high   middle am
 9 high          am
10 low    young  am
;
proc print data=table1;
%mistress(var=_CHAR_)
proc print data=imputed;
run;
----- END INPUT -----

The output is shown below:

----- BEGIN OUTPUT -----
OBS    INCOME    AGE       CAR    ID

  1              young     jpn     1
  2    middle    middle    am      2
  3              old       am      3
  4    low       young     jpn     4
  5    middle    young     am      5
  6    high      old       am      6
  7    low       young     jpn     7
  8    high      middle    am      8
  9    high                am      9
 10    low       young     am     10

MISTRESS V1.17 (c) NIPG-TNO, Leiden
Imputation statistics...
Variable Cat   Value  ------ Counts -------  ---- Percentages ----       p
                      Before  Expect  After  Before  Expect  After
INCOME   miss      .       2       0      0      20       0
         idle   0.00       0       0      0       0       0      0
         1      2.00       3    3.75      4      30      38     40  0.9355
         2      3.00       3    3.75      4      30      38     40
         3      4.00       2    2.50      2      20      25     20
Warning: Expected counts too low for chi-square test
AGE      miss      .       1       0      0      10       0
         idle   0.00       0       0      0       0       0      0
         1      2.00       2    2.22      3      20      22     30  0.8395
         2      3.00       5    5.56      5      50      56     50
         3      4.00       2    2.22      2      20      22     20
Warning: Expected counts too low for chi-square test
CAR      miss      .       0       0      0       0       0
         idle   0.00       0       0      0       0       0      0
         1      2.00       7    7.00      7      70      70     70  1.0000
         2      3.00       3    3.00      3      30      30     30
Warning: Expected counts too low for chi-square test
total    miss      .       3       0      0      10       0      0
total    idle   0.00       0       0      0       0       0

               Dim1
Consistency    0.85
Eigenvalue     2.55

MISTRESS finished OK

OBS    INCOME    AGE       CAR

  1    low       young     jpn
  2    middle    middle    am
  3    high      old       am
  4    low       young     jpn
  5    middle    young     am
  6    high      old       am
  7    low       young     jpn
  8    high      middle    am
  9    high      old       am
 10    low       young     am
----- END OUTPUT -----

First, SAS echoes the input data set, which has missing values at three
places. Next, the mistress macro is applied to all character variables. It
produces a table of imputation statistics, which contains the following
information:

1    the variable name
2    the category identification (missing, idle, 1, 2, ...)
3    for numerical variables, the original data value. This column is
     meaningless for character variables.
4    the marginal count before imputation
5    the expected count under the MCAR assumption. MCAR (Missing Completely
     At Random) is the simplest of all missing data mechanisms and means
     that missing values are supposed to be randomly distributed over the
     entire matrix.
6    the marginal count after imputation
7-9  idem, but then in percentages
10   the p-value of a one-sample chi-square test that measures the
     difference between the expected (under MCAR) and observed distribution.
     This p-value cannot be interpreted if the nonresponse is Missing at
     Random (MAR) or nonignorable.
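The Expect and p columns can be checked by hand, which may help in reading the table. The sketch below is an assumption about how these columns arise, inferred from the printed output rather than taken from the macro source: the expected count is taken to be the observed count scaled up to the full number of rows, and the test to be a Pearson chi-square on the counts after imputation with (number of categories - 1) degrees of freedom. The data set name mcar_check and its variable names are hypothetical. For the INCOME counts of Example 1 this gives expected counts 3.75, 3.75 and 2.50 and p = 0.9355, in agreement with the table.

```sas
/* INCOME in Example 1: counts per category before and after imputation. */
data mcar_check;
  input cat before after;
  cards;
1 3 4
2 3 4
3 2 2
;
run;

data mcar_check;
  set mcar_check end=last;
  /* Assumption: 10 rows in total, 2 of them missing on INCOME. */
  retain ntotal 10 nmiss 2;
  /* Expected count under MCAR: observed count scaled to all rows. */
  expect = before * ntotal / (ntotal - nmiss);
  /* Accumulate the Pearson chi-square statistic over the categories. */
  chisq + (after - expect)**2 / expect;
  if last then do;
    df = _n_ - 1;                  /* number of categories minus one */
    p  = 1 - probchi(chisq, df);   /* upper-tail p-value */
    put chisq= p=;                 /* written to the SAS log */
  end;
run;

proc print data=mcar_check;
  var cat before expect after;
run;
```

Substituting the AGE counts (before 2, 5, 2; after 3, 5, 2; nmiss 1) gives p = 0.8395 in the same way.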
The final two rows contain counts and percentages for the missing and idle
categories for all variables simultaneously. Finally, consistency measures
(0.00-1.00) and eigenvalues are listed per dimension. Also, the imputed
table is printed.


EXAMPLE 2

The second example demonstrates:

1    how contingency tables may be read using wgtvar
2    that maximally consistent imputations are all identical given the level
     of the other variables. For small tables, this may not always be
     realistic.

----- BEGIN INPUT -----
/* Example 2: Little and Rubin, p. 187 */
data table2;
  length clinic $1 precare $4 died $3;
  input clinic 1-1 precare 3-6 died 8-10 count 12-14;
  cards;
A less yes   3
A less no  176
A more yes   4
A more no  293
B less yes  17
B less no  197
B more yes   2
B more no   23
  less yes  10
  less no  150
  more yes   5
  more no   90
;
proc freq data=table2;
  weight count;
  tables clinic * precare * died /
         nocol norow nocum nopercent missprint;
%mistress(var={clinic precare died count}, wgtvar=count)
proc freq data=imputed;
  tables clinic * precare * died /
         nocol norow nocum nopercent missprint;
run;
----- END INPUT, BEGIN OUTPUT -----
TABLE 1 OF PRECARE BY DIED
CONTROLLING FOR CLINIC=

PRECARE     DIED

Frequency|  no    |yes     |  Total
---------+--------+--------+
less     |    150 |     10 |    160
---------+--------+--------+
more     |     90 |      5 |     95
---------+--------+--------+
Total         240       15      255

TABLE 2 OF PRECARE BY DIED
CONTROLLING FOR CLINIC=A

PRECARE     DIED

Frequency|  no    |yes     |  Total
---------+--------+--------+
less     |    176 |      3 |    179
---------+--------+--------+
more     |    293 |      4 |    297
---------+--------+--------+
Total         469        7      476

TABLE 3 OF PRECARE BY DIED
CONTROLLING FOR CLINIC=B

PRECARE     DIED

Frequency|  no    |yes     |  Total
---------+--------+--------+
less     |    197 |     17 |    214
---------+--------+--------+
more     |     23 |      2 |     25
---------+--------+--------+
Total         220       19      239

Weighted data contains 970 rows and 3 columns.

MISTRESS V1.17 (c) NIPG-TNO, Leiden
Imputation statistics...
Variable Cat   Value  ------ Counts -------  ---- Percentages ----       p
                      Before  Expect  After  Before  Expect  After
CLINIC   miss      .     255       0      0      26       0
         idle   0.00       0       0      0       0       0      0
         1      2.00     476  645.76    566      49      67     58  0.0000
         2      3.00     239  324.24    404      25      33     42
PRECARE  miss      .       0       0      0       0       0
         idle   0.00       0       0      0       0       0      0
         1      2.00     553  553.00    553      57      57     57  1.0000
         2      3.00     417  417.00    417      43      43     43
DIED     miss      .       0       0      0       0       0
         idle   0.00       0       0      0       0       0      0
         1      2.00     929  929.00    929      96      96     96  1.0000
         2      3.00      41   41.00     41       4       4      4
total    miss      .     255       0      0       9       0      0
total    idle   0.00       0       0      0       0       0

               Dim1
Consistency    0.55
Eigenvalue     1.65

MISTRESS finished OK

TABLE 1 OF PRECARE BY DIED
CONTROLLING FOR CLINIC=A

PRECARE     DIED

Frequency|  no    |yes     |  Total
---------+--------+--------+
less     |    176 |      3 |    179
---------+--------+--------+
more     |    383 |      4 |    387
---------+--------+--------+
Total         559        7      566

TABLE 2 OF PRECARE BY DIED
CONTROLLING FOR CLINIC=B

PRECARE     DIED

Frequency|  no    |yes     |  Total
---------+--------+--------+
less     |    347 |     27 |    374
---------+--------+--------+
more     |     23 |      7 |     30
---------+--------+--------+
Total         370       34      404
----- END OUTPUT -----

Note that all 90 observations with scores (more, survived) are assigned to
Clinic A. All other observations with a missing clinic are imputed as
Clinic B. Though this extrapolates the major trend in the data, the results
are hardly satisfactory. We recommend against applying MISTRESS to tables
with a very small number of cells.


DISCLAIMER:

This software is distributed 'as is'. Although I believe that the program
does what it is supposed to do, I do not take any responsibility for losses
that may result from its use. Furthermore, I cannot promise updates, fixes
or support. The reader is nevertheless encouraged to report any bugs found.


DISTRIBUTION AND USE:

The software may be distributed freely as long as it remains intact, and
must include this documentation. It may be used for scientific purposes as
long as proper credit is given. It may not be sold or otherwise be exploited
commercially.


REFERENCES:

Gifi, A., 1990. Nonlinear multivariate analysis. New York: Wiley.

Little, R.J.A.
& Rubin, D.B., 1987. Statistical analysis with missing data. New York: Wiley.

Siegel, S.S. & Castellan Jr, N.J., 1988. Nonparametric statistics for the
behavioral sciences. New York: McGraw Hill.

Van Buuren, S. & Van Rijckevorsel, J.L.A., 1992. Missing data imputation by
maximizing internal consistency. Psychometrika, 57, 4, 567-580.