Hi Tax,
I played around with several different functions. I keep thinking
that there should be an easier/faster way, but this is what I came up
with. Given the speed tests, it looks like foo4 is the best option
(they all give identical results).
#### The functions ####
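## Vectorized: build every (i, j) column pair with expand.grid(),
## AND the paired columns, and sum each resulting column
foo1 <- function(object) {
  object <- object == 1
  x <- ncol(object)
  vals <- expand.grid(1:x, 1:x)
  y <- colSums(object[, vals[, 1]] & object[, vals[, 2]])
  output <- matrix(y, ncol = x)
  return(output)
}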
## Double loop over column pairs; & treats nonzero values as TRUE
foo2 <- function(object) {
  x <- ncol(object)
  output <- matrix(0, nrow = x, ncol = x)
  for (i in 1:x) {
    for (j in 1:x) {
      output[i, j] <- sum(object[, i] & object[, j])
    }
  }
  return(output)
}
## Same double loop as foo2, but converts to logical once up front
foo3 <- function(object) {
  object <- object == 1
  x <- ncol(object)
  output <- matrix(0, nrow = x, ncol = x)
  for (i in 1:x) {
    for (j in 1:x) {
      output[i, j] <- sum(object[, i] & object[, j])
    }
  }
  return(output)
}
## One pass per column: object[, x] is recycled down every column of object
foo4 <- function(object) {
  object <- object == 1
  output <- sapply(1:ncol(object), function(x) {
    colSums(object[, x] & object)
  })
  return(output)
}
## foo4 without the up-front conversion to logical
foo5 <- function(object) {
  output <- sapply(1:ncol(object), function(x) {
    colSums(object[, x] & object)
  })
  return(output)
}
#### The test data ####
set.seed(2213)
dat <- sample(0:1, 10000, replace = TRUE)
test1 <- matrix(dat, ncol = 10)
test2 <- matrix(dat, ncol = 100)
test3 <- matrix(dat, ncol = 1000)
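One way to confirm that the loop and sapply() approaches agree is sketched below. The names loop_version and sapply_version are hypothetical stand-ins that mirror foo3 and foo4; they are redefined here only so the snippet runs on its own.

```r
## Rebuild the 100-column test matrix so the snippet is self-contained
set.seed(2213)
dat <- sample(0:1, 10000, replace = TRUE)
test2 <- matrix(dat, ncol = 100)

## Hypothetical stand-in mirroring foo3 (double loop)
loop_version <- function(object) {
  object <- object == 1
  x <- ncol(object)
  output <- matrix(0, nrow = x, ncol = x)
  for (i in seq_len(x)) {
    for (j in seq_len(x)) {
      output[i, j] <- sum(object[, i] & object[, j])
    }
  }
  output
}

## Hypothetical stand-in mirroring foo4 (sapply over columns)
sapply_version <- function(object) {
  object <- object == 1
  sapply(seq_len(ncol(object)), function(x) colSums(object[, x] & object))
}

## all.equal() compares values with a numeric tolerance
agree <- isTRUE(all.equal(loop_version(test2), sapply_version(test2)))
print(agree)
```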
#### Results ####
       10 cols  100 cols  1000 cols
foo1     0.012     0.586      2.336
foo2     0.013     0.338     20.285
foo3     0.014     0.313     19.550
foo4     0.007     0.065      0.689
foo5     0.008     0.070      0.731
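The harness behind these numbers is not shown. As a minimal sketch of how comparable timings could be collected, the snippet below assumes system.time() and an arbitrary 10 repetitions; sapply_version is a hypothetical stand-in that mirrors foo4, and the figures it prints will differ by machine.

```r
## Rebuild the 100-column test matrix so the snippet is self-contained
set.seed(2213)
dat <- sample(0:1, 10000, replace = TRUE)
test2 <- matrix(dat, ncol = 100)

## Hypothetical stand-in mirroring foo4
sapply_version <- function(object) {
  object <- object == 1
  sapply(seq_len(ncol(object)), function(x) colSums(object[, x] & object))
}

## Repeat the call so the elapsed time is large enough to measure
## (10 repetitions is an arbitrary choice, not taken from the results above)
timing <- system.time(for (k in 1:10) sapply_version(test2))
print(timing["elapsed"])
```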
Notice that although the data are the same, changing the number of
columns affects some functions far more than others (foo2 and foo3
slow down sharply at 1000 columns). If the small gain of foo4 over
foo5 is not important to you, I suggest this one-liner for simplicity
(essentially foo5 without the function wrapper):
sapply(1:ncol(object), function(x) {colSums(object[, x] & object)})
For example, using your data:
## This data was read in from your email and then conveniently
## provided using dput(tmp) from my system
tmp <- structure(list(t1 = c(1L, 1L, 1L), t2 = c(1L, 1L, 0L),
    t3 = c(0L, 0L, 0L), t4 = c(0L, 1L, 0L), t5 = c(1L, 1L, 1L)),
    .Names = c("t1", "t2", "t3", "t4", "t5"), class = "data.frame",
    row.names = c("d1", "d2", "d3"))
sapply(1:ncol(tmp), function(x) {colSums(tmp[, x] & tmp)})
   [,1] [,2] [,3] [,4] [,5]
t1    3    2    0    1    3
t2    2    2    0    1    2
t3    0    0    0    0    0
t4    1    1    0    1    1
t5    3    2    0    1    3
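Note that this result counts each term with itself on the diagonal. If a zero diagonal is what you want, diag() can reset it. The sketch below also swaps in a different technique from the functions above: for a 0/1 matrix X, crossprod(X) computes t(X) %*% X, which is the same co-occurrence count. The name cooc is just an illustrative choice.

```r
## Same data as above, rebuilt with data.frame() so the snippet runs alone
tmp <- data.frame(t1 = c(1L, 1L, 1L), t2 = c(1L, 1L, 0L),
                  t3 = c(0L, 0L, 0L), t4 = c(0L, 1L, 0L),
                  t5 = c(1L, 1L, 1L),
                  row.names = c("d1", "d2", "d3"))

## t(X) %*% X: entry (i, j) is the number of documents containing both terms
cooc <- crossprod(as.matrix(tmp))

## Drop the self-counts on the diagonal
diag(cooc) <- 0
print(cooc)
```

Because crossprod() keeps the column names, both the rows and the columns of cooc are labelled t1 to t5.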
Cheers,
Josh
On Wed, Nov 10, 2010 at 5:03 PM, tax botsis <taxbotsis at gmail.com> wrote:
> Thanks Josh,
>
> here is the table with binary values showing the presence or absence of a
> certain term (t1, t2, t3, t4, and t5) in a document (d1, d2 and d3):
>
>      [t1] [t2] [t3] [t4] [t5]
> [d1]   1    1    0    0    1
> [d2]   1    1    0    1    1
> [d3]   1    0    0    0    1
>
>
> and here is the (adjacency) matrix I would like to get, which counts the
> co-occurrences of each pair of terms across all documents:
>
>      [t1] [t2] [t3] [t4] [t5]
> [t1]   0    2    0    1    3
> [t2]   2    0    0    1    2
> [t3]   0    0    0    0    0
> [t4]   1    1    0    0    1
> [t5]   3    2    0    1    1
>
> Please let me know whether this reads better or whether I should post it
> again
>
> Thanks
> Tax
>
> 2010/11/10 Joshua Wiley <jwiley.psych at gmail.com>
>>
>> Hi Tax,
>>
>> Because the list does not accept HTML messages (per the posting guide),
>> your message was converted to plain text, and your table is difficult
>> to read. My suggestion would be to start with:
>>
>> ?table
>> ?xtabs
>>
>> If you make up a minimal example of the data you have and email it to
>> us, we can give more detailed and specific help. Suppose your data is
>> stored under the name "dat"; you can easily provide us the data using
>> the R function dput(). For example:
>>
>> dput(dat)
>>
>> will give you a bunch of output you can simply copy and paste into
>> your next plain text email.
>>
>> Best regards,
>>
>> Josh
>>
>>
>> On Wed, Nov 10, 2010 at 12:01 PM, tax botsis <taxbotsis at gmail.com> wrote:
>> > Hi all,
>> > I am trying to construct a pairwise co-occurrence matrix for certain terms
>> > appearing in a number of documents. For example, I have the following
>> > table with binary values showing the presence or absence of a certain
>> > term in a document:
>> >
>> >       term1 term2 term3 term4 term5
>> > doc1      1     1     0     0     1
>> > doc2      1     1     0     1     1
>> > doc3      1     0     0     0     1
>> >
>> > And I want to have a matrix with the number of pairwise
>> > co-occurrences.
>> > So, based on the above table, the matrix should be:
>> >
>> >        term1 term2 term3 term4 term5
>> > term1      0     2     0     1     3
>> > term2      2     0     0     1     2
>> > term3      0     0     0     0     0
>> > term4      1     1     0     0     1
>> > term5      3     2     0     1     1
>> >
>> > Any ideas on how to do that?
>> >
>> > Thanks
>> > Tax
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> University of California, Los Angeles
>> http://www.joshuawiley.com/
>
>
--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/