Michael Friendly
2005-Jan-06 16:50 UTC
[R] patterns of missing data: determining monotonicity
Here is a problem that perhaps someone out here has an idea about. It vaguely reminds me of something I've seen before, but can't place. Can anyone help?

For multiple imputation, there are simpler methods available if the patterns of missing data are 'monotone' --- if Vj is missing then all variables Vk, k>j are also missing, vs. more complex methods required when the patterns are not monotone. The problem is to determine if, for a collection of variables, there is an ordering of them with a monotone missing data pattern, or, if not, what the longest monotone sequence is.

Here is an example, where in a dataset of 65 observations, there are 8 different patterns of missingness, with X and . representing observed and missing:

Group  V2  V3  V4  V5  V6  V7  V8  V9  V10  V11  nmiss
  1    X   X   X   X   X   X   X   X   X    X      0
  2    X   X   X   X   X   X   .   X   X    X      1
  3    X   X   X   X   X   .   X   X   X    X      1
  4    X   X   X   X   X   .   .   X   X    X      2
  5    X   X   .   X   .   X   X   X   X    X      2
  6    X   X   .   .   X   X   X   X   X    X      2
  7    X   X   .   .   X   .   X   X   X    X      3
  8    X   X   .   .   .   X   X   X   X    X      3

Treated as a binary matrix, one can sort the columns by the number of non-missing for each variable, and monotone means that there are at most 2 runs -- a string of 0s followed by all 1s for *all* patterns. But how to determine an ordering (or orderings) of variables of maximal length?

Group  V2  V3  V9  V10  V11  V6  V8  V5  V7  V4  nmiss
  1    0   0   0   0    0    0   0   0   0   0     0
  2    0   0   0   0    0    0   1   0   0   0     1
  3    0   0   0   0    0    0   0   0   1   0     1
  4    0   0   0   0    0    0   1   0   1   0     2
  5    0   0   0   0    0    1   0   0   0   1     2
  6    0   0   0   0    0    0   0   1   0   1     2
  7    0   0   0   0    0    0   0   1   1   1     3
  8    0   0   0   0    0    1   0   1   0   1     3
       ==  ==  ==  ===  ===  ==  ==  ==  ==  ==
       0   0   0   0    0    2   2   3   3   4

-- 
Michael Friendly               Email: friendly at yorku.ca
Professor, Psychology Dept.
York University                Voice: 416 736-5115 x66249  Fax: 416 736-5814
4700 Keele Street              http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
Michael Friendly
2005-Jan-07 13:13 UTC
[R] re: patterns of missing data: determining monotonicity
[Sorry for the re-post; my examples got garbled in the original cut/paste.]

Here is a problem that perhaps someone out here has an idea about. It vaguely reminds me of something I've seen before, but can't place. Can anyone help?

For multiple imputation, there are simpler methods available if the patterns of missing data are 'monotone' --- if Vj is missing then all variables Vk, k>j are also missing, vs. more complex methods required when the patterns are not monotone. The problem is to determine if, for a collection of variables, there is an ordering of them with a monotone missing data pattern, or, if not, what the longest monotone sequence is.

Here is an example, where in a dataset of 65 observations, there are 8 different patterns of missingness, with X and . representing observed and missing:

Group  V2  V3  V4  V5  V6  V7  V8  V9  V10  V11  nmiss
  1    x   x   x   x   x   x   x   x   x    x      0
  2    x   x   x   x   x   x   .   x   x    x      1
  3    x   x   x   x   x   .   x   x   x    x      1
  4    x   x   x   x   x   .   .   x   x    x      2
  5    x   x   .   x   .   x   x   x   x    x      2
  6    x   x   .   .   x   x   x   x   x    x      2
  7    x   x   .   .   x   .   x   x   x    x      3
  8    x   x   .   .   .   x   x   x   x    x      3

Treated as a binary matrix, one can sort the columns by the number of non-missing for each variable, and monotone means that there are at most 2 runs -- a string of 0s followed by all 1s for *all* patterns. But how to determine an ordering (or orderings) of variables of maximal length?

Group  V2  V3  V9  V10  V11  V6  V8  V5  V7  V4  nmiss
  1    0   0   0   0    0    0   0   0   0   0     0
  2    0   0   0   0    0    0   1   0   0   0     1
  3    0   0   0   0    0    0   0   0   1   0     1
  4    0   0   0   0    0    0   1   0   1   0     2
  5    0   0   0   0    0    1   0   0   0   1     2
  6    0   0   0   0    0    0   0   1   0   1     2
  7    0   0   0   0    0    0   0   1   1   1     3
  8    0   0   0   0    0    1   0   1   0   1     3
       ==  ==  ==  ===  ===  ==  ==  ==  ==  ==
       0   0   0   0    0    2   2   3   3   4

-- 
Michael Friendly               Email: friendly at yorku.ca
Professor, Psychology Dept.
York University                Voice: 416 736-5115 x66249  Fax: 416 736-5814
4700 Keele Street              http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
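The "sort the columns, then look for at most 2 runs" test described in the question can be sketched in base R. This is only an illustration, not code from the thread: the matrix M re-enters the eight example patterns by hand (1 = missing, 0 = observed, columns in the original V2..V11 order), and the helper name is_monotone is made up here. The sketch relies on one observation: in a monotone ordering the set of patterns missing each variable can only grow from left to right, so the missing counts must be nondecreasing. Sorting by count and testing that one ordering is therefore enough to decide whether any monotone ordering exists.

```r
# Example patterns, re-entered by hand (1 = missing, 0 = observed).
vars <- paste0("V", 2:11)
M <- matrix(c(
  0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,1,0,0,0,
  0,0,0,0,0,1,0,0,0,0,
  0,0,0,0,0,1,1,0,0,0,
  0,0,1,0,1,0,0,0,0,0,
  0,0,1,1,0,0,0,0,0,0,
  0,0,1,1,0,1,0,0,0,0,
  0,0,1,1,1,0,0,0,0,0),
  nrow = 8, byrow = TRUE, dimnames = list(NULL, vars))

# Hypothetical helper: TRUE if, in the given column order, every row is
# a run of 0s followed by a run of 1s (i.e. never drops from 1 back to 0).
is_monotone <- function(M) {
  all(apply(M, 1, function(r) all(diff(r) >= 0)))
}

# Sort columns by number of missing patterns (ties keep their original order).
ord <- order(colSums(M))
sorted <- M[, ord, drop = FALSE]

colnames(sorted)      # V2 V3 V9 V10 V11 V6 V8 V5 V7 V4, as in the second table
is_monotone(sorted)   # FALSE: no ordering of all ten variables is monotone
```

A subset can still pass the test: is_monotone(M[, c("V2", "V3", "V6", "V4")]) is TRUE for these patterns.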
(Ted Harding)
2005-Jan-07 14:13 UTC
[R] patterns of missing data: determining monotonicity
On 06-Jan-05 Michael Friendly wrote:
> Here is a problem that perhaps someone out here has an idea
> about. It vaguely reminds me of something I've seen before,
> but can't place. Can anyone help?
>
> For multiple imputation, there are simpler methods available
> if the patterns of missing data are 'monotone' --- if Vj is
> missing then all variables Vk, k>j are also missing, vs. more
> complex methods required when the patterns are not monotone.
> The problem is to determine if, for a collection of variables,
> there is an ordering of them with a monotone missing data pattern,
> or, if not, what the longest monotone sequence is.
>
> Here is an example, where in a dataset of 65 observations, there
> are 8 different patterns of missingness, with X and . representing
> observed and missing:
>
> Group  V2  V3  V4  V5  V6  V7  V8  V9  V10  V11  nmiss
>   1    X   X   X   X   X   X   X   X   X    X      0
>   2    X   X   X   X   X   X   .   X   X    X      1
>   3    X   X   X   X   X   .   X   X   X    X      1
>   4    X   X   X   X   X   .   .   X   X    X      2
>   5    X   X   .   X   .   X   X   X   X    X      2
>   6    X   X   .   .   X   X   X   X   X    X      2
>   7    X   X   .   .   X   .   X   X   X    X      3
>   8    X   X   .   .   .   X   X   X   X    X      3
>
> Treated as a binary matrix, one can sort the columns by the number
> of non-missing for each variable, and monotone means that there
> are at most 2 runs -- a string of 0s followed by all 1s for *all*
> patterns. But how
> to determine an ordering (or orderings) of variables of maximal length?
>
> Group  V2  V3  V9  V10  V11  V6  V8  V5  V7  V4  nmiss
>   1    0   0   0   0    0    0   0   0   0   0     0
>   2    0   0   0   0    0    0   1   0   0   0     1
>   3    0   0   0   0    0    0   0   0   1   0     1
>   4    0   0   0   0    0    0   1   0   1   0     2
>   5    0   0   0   0    0    1   0   0   0   1     2
>   6    0   0   0   0    0    0   0   1   0   1     2
>   7    0   0   0   0    0    0   0   1   1   1     3
>   8    0   0   0   0    0    1   0   1   0   1     3
>        ==  ==  ==  ===  ===  ==  ==  ==  ==  ==
>        0   0   0   0    0    2   2   3   3   4

Hi Michael,

Consider the following approach. It's not a full solution to the specific problem you have posed above, but it contains pathways to solutions.
If you're doing multiple imputation anyway, you should install the packages "cat" (for categorical data), "norm" (for continuous data, assumed Normal) and "mix" (for data mixing both kinds), and also "pan" for MI on "panel" data, which might also be useful to you.

I'll discuss the situation using "cat" as an example, though "norm" works the same way as far as this question is concerned.

First make sure your data are arranged as a matrix X (say) with rows representing "cases" and columns variables. If the variables are categorical, make sure that their values are represented as integers 1, 2, 3, ... (don't start with "0"), and represent missing values as NA.

Example of data matrix X:

  X
        [,1] [,2] [,3]
   [1,]    3    1    2
   [2,]    2    1    3
   [3,]    2    1   NA
   [4,]    2    3   NA
   [5,]    1    3   NA
   [6,]    2   NA   NA
   [7,]    2   NA   NA
   [8,]    3   NA   NA
   [9,]   NA   NA   NA
  [10,]   NA   NA   NA

(constructed to have monotone pattern). Now shuffle it:

  X<-X[,sample(1:3)]
  X<-X[sample(1:10),]
  X
        [,1] [,2] [,3]
   [1,]    1    2    3
   [2,]    3   NA    2
   [3,]    1   NA    2
   [4,]    3   NA    1
   [5,]    1    3    2
   [6,]   NA   NA   NA
   [7,]   NA   NA    2
   [8,]   NA   NA    3
   [9,]   NA   NA    2
  [10,]   NA   NA   NA

Consider this as a real data matrix where now it is not obvious that it has monotone missingness pattern. Then:

  library(cat)
  s <- prelim.cat(X)

Now read *very* carefully

  ?prelim.cat

and in particular what is said about its value (the value of s). Note also what is *not* said about it!

Now look at "s" by printing it to the console. Amongst its 17 components the following are of particular interest.

  s$x
        [,1] [,2] [,3]
   [1,]    1    3    2
   [2,]    1    2    3
   [3,]    3   NA    1
   [4,]    1   NA    2
   [5,]    3   NA    2
   [6,]   NA   NA    2
   [7,]   NA   NA    2
   [8,]   NA   NA    3
   [9,]   NA   NA   NA
  [10,]   NA   NA   NA

You can see that this is the same as X except that rows have been permuted to push the NAs downwards. The component

  s$ro
   [1]  2  5  4  3  1  9  6  8  7 10

shows the permutation: the original Row 1 of X is Row 2 of s$x, the original Row 2 of X is Row 5 of s$x, and so on.
Now look at the component s$nmis of s:

  s$nmis
  [1] 5 8 2

This gives the numbers of missing values in the different columns of X (and of s$x since the order of columns has not been changed).

Now you can sort s$nmis into increasing order using the "index.return=TRUE" option of 'sort' so as to get the column permutation:

  sort(s$nmis,index.return=TRUE)
  $x
  [1] 2 5 8

  $ix
  [1] 3 1 2

You can check directly that s$x[,c(3,1,2)] is in monotone pattern; more directly, you can get X re-structured into monotone pattern as

  s$x[,sort(s$nmis,index.return=TRUE)$ix]
        [,1] [,2] [,3]
   [1,]    2    1    3
   [2,]    3    1    2
   [3,]    1    3   NA
   [4,]    2    1   NA
   [5,]    2    3   NA
   [6,]    2   NA   NA
   [7,]    2   NA   NA
   [8,]    3   NA   NA
   [9,]   NA   NA   NA
  [10,]   NA   NA   NA

I hope this is some help. At least it shows you places where you can start digging.

If the original X is incompatible with monotone pattern, then the above should give you something which is close to monotone, though I'm not sure whether it will get you "as close as possible"; and you may need to do some more work to uncover how to determine your "longest monotone sequence".

In any case, since these MI packages (all based on Schafer's original S code) work internally with monotonicity in mind, for reasons of efficiency and fast convergence, you may find that your imputation needs are met by them.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 07-Jan-05 Time: 14:13:20
------------------------------ XFMail ------------------------------
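One possible way to do the "more work" on the longest monotone sequence, offered as a sketch rather than as anything the cat, norm or mix packages provide: a set of variables can be put in monotone order exactly when their missing sets are nested, so the longest monotone sequence is the longest chain of columns under "missing wherever the earlier one is missing". After sorting columns by missing count, that chain can be found with a simple dynamic programme. The function name longest_monotone is hypothetical, and the matrix M is the set of eight patterns from the original question, re-entered by hand with 1 = missing.

```r
# Patterns from the original question (1 = missing, 0 = observed).
vars <- paste0("V", 2:11)
M <- matrix(c(
  0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,1,0,0,0,
  0,0,0,0,0,1,0,0,0,0,
  0,0,0,0,0,1,1,0,0,0,
  0,0,1,0,1,0,0,0,0,0,
  0,0,1,1,0,0,0,0,0,0,
  0,0,1,1,0,1,0,0,0,0,
  0,0,1,1,1,0,0,0,0,0),
  nrow = 8, byrow = TRUE, dimnames = list(NULL, vars))

# Hypothetical helper: returns one longest monotone ordering of columns.
longest_monotone <- function(M) {
  M <- M[, order(colSums(M)), drop = FALSE]  # nondecreasing missing count
  p <- ncol(M)
  # nested[j, k]: every row missing column j is also missing column k
  nested <- outer(seq_len(p), seq_len(p),
                  Vectorize(function(j, k) all(M[, j] <= M[, k])))
  len  <- rep(1L, p)            # length of the longest chain ending at k
  prev <- rep(NA_integer_, p)   # predecessor of k in that chain
  for (k in seq_len(p)) {
    for (j in seq_len(k - 1L)) {
      if (nested[j, k] && len[j] + 1L > len[k]) {
        len[k]  <- len[j] + 1L
        prev[k] <- j
      }
    }
  }
  # Walk back from the best endpoint to recover the ordering.
  k <- which.max(len)
  chain <- integer(0)
  while (!is.na(k)) {
    chain <- c(k, chain)
    k <- prev[k]
  }
  colnames(M)[chain]
}

longest_monotone(M)   # "V2" "V3" "V9" "V10" "V11" "V6" "V4"
```

For these patterns the longest sequence has 7 variables. The function returns only one such ordering; swapping V6 for V5 gives another of the same length, and the five fully observed variables can be permuted freely.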