Hi,
I have a data frame containing a column of human judgments, some of which
are missing:
> pr[3]
label
1 4
2 4
3 4
4 4
5 NA
6 3
7 3
8 3
9 3
10 NA
11 NA
12 NA
13 2
14 2
15 2
16 NA
17 1
18 -1
19 -1
20 -1
Accompanying this is a matrix containing multinomial probabilities for the
missing values, note that they sum to 1 across the rows:
> probs
4 3 2 1 -1
[1,] NA NA NA NA NA
[2,] NA NA NA NA NA
[3,] NA NA NA NA NA
[4,] NA NA NA NA NA
[5,] 0.13619525 0.2592090 0.1688164 0.22337423 0.21240512
[6,] NA NA NA NA NA
[7,] NA NA NA NA NA
[8,] NA NA NA NA NA
[9,] NA NA NA NA NA
[10,] 0.34255536 0.2447324 0.2072017 0.14401648 0.06149409
[11,] 0.03116203 0.2954909 0.3652752 0.26606234 0.04200961
[12,] 0.25506687 0.3278237 0.2207594 0.16907757 0.02727250
[13,] NA NA NA NA NA
[14,] NA NA NA NA NA
[15,] NA NA NA NA NA
[16,] 0.14396164 0.2161286 0.2819815 0.02106388 0.33686437
[17,] NA NA NA NA NA
[18,] NA NA NA NA NA
[19,] NA NA NA NA NA
[20,] NA NA NA NA NA
I'm writing a function to simulate multiple random draws from these
distributions. My function currently looks something like this:
simulate.unknowns <- function(pr, probs, expr, trials=1000,
labels=colnames(probs)) {
isNA <- is.na(pr$label)
replicate(trials, {
for (i in which(isNA)) {
pr$label[i] <- sample(labels, 1, prob=probs[i,])
}
expr(pr)
})
}
That works fine but that inner loop is pretty slow. Anyone have suggestions
for how to improve it?
In my typical case there are far more than 20 rows, but still only 5
columns.
For the simple case when there are only 2 classes to draw from, there's an
easy trick which works well:
simulate.unknowns <- function(pr, probs, expr, trials=1000,
labels=colnames(probs)) {
isNA <- is.na(pr$label)
replicate(trials, {
pr$label[isNA] <- runif(sum(isNA)) < probs[isNA]
expr(pr)
})
}
Thanks.
--
Ken Williams
Research Scientist
The Thomson Reuters Corporation
Eagan, MN