Hi,

I have data on individuals (B) who participated in events (A). If ALL participants in an event are a subset of the participants in another event, I would like to remove the smaller event, and if the participants in one event are exactly the same as the participants in another event, I would like to remove one of the two events (I don't care which one). The following example does that; however, it is extremely slow (and the true dataset is very large). What would be a more efficient way to solve the problem?

I really appreciate your help. Thanks!

DF <- data.frame(read.table(textConnection(" A B
12095 69832
12095 51750
12095 6734
18774 51750
18774 51733
18774 6734
18774 69833
19268 51750
19268 6734
19268 51733
19268 65251
5169 54441
5169 15480
5169 3228
5966 51733
5966 65251
5966 68197
5966 6734
5966 51750
5966 69833
7189 135523
7189 65251
7189 51733
7189 69833
7189 135522
7189 68197
7189 6734
7797 51750
7797 6734
7797 69833
7866 6734
7866 69833
7866 51733
8596 51733
8596 51750
8596 65251
8677 6734
8677 51750
8677 51733
8936 68197
8936 6734
8936 65251
8936 51733
9204 51750
9204 69833
9204 6734
9204 51733"), header=TRUE, stringsAsFactors=FALSE))

data <- unique(DF$A)
for (m in 1:length(data)) {
    for (m in 1:length(data)) {
        tdata <- data[-m]
        q <- 0
        for (n in 1:length(tdata)) {
            # TRUE when every participant of event data[m] also took part in event tdata[n]
            if (length(which(DF[DF$A == data[m], 2] %in% DF[DF$A == tdata[n], 2] == TRUE)) == length(DF[DF$A == data[m], 2])) {
                q <- q + 1
            }
        }
        if (q > 0) {
            # event data[m] is contained in at least one other event: drop it
            data <- data[-m]
            m <- m - 1  # note: assigning to m does not change the next value the for loop uses
        }
    }
}
DF <- DF[DF$A %in% data,]

--
View this message in context: http://r.789695.n4.nabble.com/Merging-fully-overlapping-groups-tp4470999p4470999.html
Sent from the R help mailing list archive at Nabble.com.
This code performs the same operation in about 1/10th the time on my machine. Give it a try.

look <- function(i, m) {
    # look for subsets: TRUE for every event whose attendees all attended event i
    # (m is passed in explicitly so that look() does not depend on a global m)
    dif <- m[, i] - m
    apply(dif, 2, min) > -0.5
}

nosubsets <- function(df) {
    # eliminate events that are subsets of other events in terms of attendance
    m <- table(df$B, df$A)
    nevents <- dim(m)[2]
    found <- sapply(seq(nevents), look, m = m)
    diag(found) <- FALSE
    # identical events are subsets of each other; keep the first of each such pair
    found[found & t(found) & upper.tri(found)] <- FALSE
    df[df$A %in% dimnames(m)[[2]][rowSums(found) < 0.5], ]
}

nosubsets(DF)

Jean

mdvaan wrote on 03/13/2012 10:56:33 PM:

> Hi,
>
> I have data on individuals (B) who participated in events (A). If ALL
> participants in an event are a subset of the participants in another event I
> would like to remove the smaller event and if the participants in one event
> are exactly similar to the participants in another event I would like to
> remove one of the events (I don't care which one). The following example
> does that however it is extremely slow (and the true dataset is very large).
> What would be a more efficient way to solve the problem? I really appreciate
> your help. Thanks!
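A related way to express the table-based idea is through a cross-product of the incidence matrix: entry [k, i] of crossprod(m) counts the participants that events k and i have in common, so event k is contained in event i exactly when that count equals the number of participants in k. What follows is only a minimal sketch of that idea, not the code from the message above. The function name drop_contained and the tie-breaking rule (keep the first of several identical events) are assumptions made for illustration.

```r
# Sketch: drop events whose participants are all contained in another event.
# Assumes columns A (event) and B (participant), as in the thread.
drop_contained <- function(df) {
  # 0/1 incidence matrix: rows = participants, columns = events
  m <- (unclass(table(df$B, df$A)) > 0) * 1
  shared <- crossprod(m)   # shared[k, i] = participants common to events k and i
  size <- diag(shared)     # size[k] = number of participants in event k
  # contained[k, i] is TRUE when event k is a subset of event i
  contained <- shared == size
  diag(contained) <- FALSE
  # identical events contain each other; keep the first one of each pair
  contained[contained & t(contained) & upper.tri(contained)] <- FALSE
  keep <- colnames(m)[rowSums(contained) == 0]
  df[df$A %in% keep, ]
}

# Tiny illustration: event 2 is inside event 1, events 1 and 3 are identical.
d <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                B = c(10, 11, 12, 10, 11, 10, 11, 12, 99))
res <- drop_contained(d)
```

Like the table() approach, this builds a dense participants-by-events matrix, so its memory use grows with the product of the two counts.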
On Tue, Mar 13, 2012 at 08:56:33PM -0700, mdvaan wrote:
> Hi,
>
> I have data on individuals (B) who participated in events (A). If ALL
> participants in an event are a subset of the participants in another event I
> would like to remove the smaller event and if the participants in one event
> are exactly similar to the participants in another event I would like to
> remove one of the events (I don't care which one). The following example
> does that however it is extremely slow (and the true dataset is very large).
> What would be a more efficient way to solve the problem? I really appreciate
> your help. Thanks!
>
> DF <- data.frame(read.table(textConnection(" A B
> 12095 69832
> 12095 51750
...

Hi.

Try the following.

data <- unique(DF$A)
# one sorted, duplicate-free vector of participants per event
gr <- split(DF$B, f=factor(DF$A, levels=data))
gr <- lapply(gr, FUN=sort)
gr <- lapply(gr, FUN=unique)
elim <- rep(FALSE, times=length(gr))
for (i in seq_along(gr)) {
    gr.i <- gr[[i]]
    for (j in seq_along(gr)) {
        gr.j <- gr[[j]]
        if (j < i && identical(gr.i, gr.j)) {
            # exact duplicate of an earlier event
            elim[i] <- TRUE
        } else if (i != j) {
            both <- unique(sort(c(gr.i, gr.j)))
            # gr.i is a proper subset of gr.j
            if (identical(gr.j, both) && !identical(gr.i, both)) {
                elim[i] <- TRUE
            }
        }
    }
}
DF1 <- DF[DF$A %in% data[!elim], ]

How often is an event eliminated in the real data?

Petr Savicky.
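The comparison inside the loop relies on a small identity: for sorted, duplicate-free vectors, gr.i is a subset of gr.j exactly when the sorted union of the two is identical to gr.j. Here is a minimal sketch of that test next to the more direct %in% form. The helper names are hypothetical and are not part of the posted code.

```r
# Hypothetical helpers; a and b are assumed to be sorted, duplicate-free
# vectors of the same type (as produced by the sort/unique steps above).
subset_by_union <- function(a, b) identical(b, unique(sort(c(a, b))))
subset_by_match <- function(a, b) all(a %in% b)

a <- c(2L, 5L)
b <- c(1L, 2L, 5L)
subset_by_union(a, b)   # is a contained in b?
subset_by_union(b, a)   # is b contained in a?
```

The union form only works on sorted, duplicate-free input, which is why each group is passed through sort and unique first. The %in% form does not need that preparation.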
Hi Jean and Petr,

Thanks for the help. Both options are indeed faster than my initial procedure.

Best,
Mathijs
On Tue, Mar 13, 2012 at 08:56:33PM -0700, mdvaan wrote:
> Hi,
>
> I have data on individuals (B) who participated in events (A). If ALL
> participants in an event are a subset of the participants in another event I
> would like to remove the smaller event and if the participants in one event
> are exactly similar to the participants in another event I would like to
> remove one of the events (I don't care which one). The following example
> does that however it is extremely slow (and the true dataset is very large).
> What would be a more efficient way to solve the problem? I really appreciate
> your help. Thanks!
>
> DF <- data.frame(read.table(textConnection(" A B
> 12095 69832
> 12095 51750
> 12095 6734
...

Hi.

If a lot of events are eliminated, then the following may be faster, since eliminated events are removed before the further comparisons take place.

data <- unique(DF$A)
# one sorted, duplicate-free vector of participants per event
gr <- split(DF$B, f=factor(DF$A, levels=data))
gr <- lapply(gr, FUN=sort)
gr <- lapply(gr, FUN=unique)
accept <- rep(FALSE, times=length(gr))
accept[1] <- TRUE
for (i in seq.int(from=2, length.out=length(accept)-1)) {
    cand <- gr[[i]]
    OK <- TRUE
    # reject the candidate if it is contained in an accepted event
    for (j in which(accept)) {
        prev <- gr[[j]]
        both <- unique(sort(c(cand, prev)))
        if (identical(prev, both)) {
            OK <- FALSE
            break
        }
    }
    if (OK) {
        # drop accepted events that are contained in the candidate
        for (j in which(accept)) {
            prev <- gr[[j]]
            both <- unique(sort(c(cand, prev)))
            if (identical(cand, both)) {
                accept[j] <- FALSE
            }
        }
        accept[i] <- TRUE
    }
}
DF2 <- DF[DF$A %in% data[accept], ]

Can you afford to compute table(DF$A, DF$B) for the real data? Its size will be proportional to length(unique(DF$A))*length(unique(DF$B)).

Petr Savicky.
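A possible variation on the same accept-as-you-go idea is to visit the events from largest to smallest. An event can only be contained in an event that is at least as large, so a later candidate can never knock out an event that was already accepted, and the second inner loop is no longer needed. This is a sketch under that assumption and not the code posted above. The function name drop_subsets_by_size is hypothetical, and the containment test uses %in% instead of the sorted-union comparison.

```r
# Sketch: visit events from largest to smallest, so a candidate can only be
# contained in an already accepted event, never the other way round.
# Assumes columns A (event) and B (participant), as in the thread.
drop_subsets_by_size <- function(df) {
  ids <- unique(df$A)
  gr <- split(df$B, f = factor(df$A, levels = ids))
  gr <- lapply(gr, unique)
  ord <- order(lengths(gr), decreasing = TRUE)
  accepted <- integer(0)   # positions in gr of the events kept so far
  for (i in ord) {
    cand <- gr[[i]]
    covered <- FALSE
    for (j in accepted) {
      # cand is contained in (or identical to) an accepted event
      if (all(cand %in% gr[[j]])) {
        covered <- TRUE
        break
      }
    }
    if (!covered) accepted <- c(accepted, i)
  }
  df[df$A %in% ids[accepted], ]
}

# Tiny illustration: event 2 is inside event 1, events 1 and 3 are identical.
d <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
                B = c(10, 11, 12, 10, 11, 10, 11, 12, 99))
res2 <- drop_subsets_by_size(d)
```

This version never builds the events-by-participants table, so its memory use stays close to the size of the data itself. The number of comparisons can still grow quadratically when few events are eliminated.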