thr3ads.net - R help - [R] Merging fully overlapping groups [Mar 2012]

mdvaan

2012-Mar-14 03:56 UTC

[R] Merging fully overlapping groups

Hi,

I have data on individuals (B) who participated in events (A). If all
participants in one event are a subset of the participants in another event,
I would like to remove the smaller event. If the participants in two events
are identical, I would like to remove one of the two (I don't care which
one). The following example does that, but it is extremely slow (and the
true dataset is very large). What would be a more efficient way to solve the
problem? I really appreciate your help. Thanks!

DF <- data.frame(read.table(textConnection("  A  B
12095	 69832
12095	 51750
12095	 6734
18774	 51750
18774	 51733
18774	 6734
18774	 69833
19268	 51750
19268	 6734
19268	 51733
19268	 65251
5169	 54441
5169	 15480
5169	 3228
5966	 51733
5966	 65251
5966	 68197
5966	 6734
5966	 51750
5966	 69833
7189	 135523
7189	 65251
7189	 51733
7189	 69833
7189	 135522
7189	 68197
7189	 6734
7797	 51750
7797	 6734
7797	 69833
7866	 6734
7866	 69833
7866	 51733
8596	 51733
8596	 51750
8596	 65251
8677	 6734
8677	 51750
8677	 51733
8936	 68197
8936	 6734
8936	 65251
8936	 51733
9204	 51750
9204	 69833
9204	 6734
9204	 51733"),head=TRUE,stringsAsFactors=FALSE))

data <- unique(DF$A)
for (m in 1:length(data))
	{
	for (m in 1:length(data))
		{
		tdata <- data[-m]
		q <- 0
		for (n in 1:length(tdata))
			{
			# TRUE when every participant of event data[m] also appears in event tdata[n]
			if (length(which(DF[DF$A == data[m], 2] %in% DF[DF$A == tdata[n], 2] == TRUE))
				== length(DF[DF$A == data[m], 2]))
				{
				q <- q + 1
				}
			}
		if (q > 0)
			{
			data <- data[-m]
			m <- m - 1
			}
		}
	}
DF <- DF[DF$A %in% data,]

--
View this message in context:
http://r.789695.n4.nabble.com/Merging-fully-overlapping-groups-tp4470999p4470999.html
Sent from the R help mailing list archive at Nabble.com.

Jean V Adams

2012-Mar-14 18:31 UTC

This code performs the same operation in about 1/10th the time on my machine.
Give it a try.

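nosubsets <- function(df) {
        # eliminate events that are subsets of other events in terms of attendance
        # m: attendance counts, individuals in rows and events in columns
        m <- table(df$B, df$A)
        look <- function(i) {
                # look for subsets: TRUE for every event fully covered by event i
                # (defined inside nosubsets so that it can see m)
                dif <- m[, i] - m
                apply(dif, 2, min) > -0.5
                }
        nevents <- dim(m)[2]
        found <- sapply(seq(nevents), look)
        diag(found) <- FALSE
        # note: two events with identical attendance cover each other, so both are dropped
        df[df$A %in% dimnames(m)[[2]][rowSums(found)<0.5], ]
        }
nosubsets(DF)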

Jean
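
The table-based idea above can also be written with a cross-product of the attendance matrix. The following is only a sketch under stated assumptions, not Jean's code: the function name `drop_subsets` and the toy data are hypothetical, and the tie-break (keep the first of several identical events) is one possible reading of "I don't care which one".

```r
# Toy data (hypothetical): events 1 and 2 are identical, event 3 is a
# strict subset of both, event 4 shares no participants with the others.
toy <- data.frame(A = c(1, 1, 2, 2, 3, 4, 4),
                  B = c(10, 20, 20, 10, 10, 30, 40))

drop_subsets <- function(df) {
  # Binary incidence matrix: rows are individuals, columns are events
  inc <- (table(df$B, df$A) > 0) * 1
  # shared[i, j] = number of participants that events i and j have in common
  shared <- crossprod(inc)
  size <- diag(shared)
  # sub[i, j] is TRUE when every participant of event i is also in event j
  sub <- shared == size
  diag(sub) <- FALSE
  # Identical events are subsets of each other: keep the first of each pair
  sub[sub & t(sub) & upper.tri(sub)] <- FALSE
  keep <- colnames(inc)[rowSums(sub) == 0]
  df[df$A %in% keep, ]
}

drop_subsets(toy)
```

Like the table approach, this builds a dense individuals-by-events matrix, so it is only practical when that matrix fits in memory.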




Petr Savicky

2012-Mar-14 20:05 UTC

Hi.

Try the following.

  data <- unique(DF$A)
  gr <- split(DF$B, f=factor(DF$A, levels=data))
  gr <- lapply(gr, FUN=sort)
  gr <- lapply(gr, FUN=unique)
  elim <- rep(FALSE, times=length(gr))
  for (i in seq.int(along=gr)) {
      gr.i <- gr[[i]]
      for (j in seq.int(along=gr)) {
          gr.j <- gr[[j]]
          if (j < i && identical(gr.i, gr.j)) {
              elim[i] <- TRUE
          } else if (i != j) {
              both <- unique(sort(c(gr.i, gr.j)))
              if (identical(gr.j, both) && !identical(gr.i, both)) {
                  elim[i] <- TRUE
              }
          }
      }
  }
  DF1 <- DF[DF$A %in% data[!elim], ]

How frequently are events eliminated in the real data?

Petr Savicky.
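
The subset test in the loop above relies on one fact: for sorted, duplicate-free groups, the group gr.i is contained in gr.j exactly when the sorted union of the two is identical to gr.j. A small sketch of that equivalence follows; the helper names `is_subset_union` and `is_subset_in` and the two example vectors are hypothetical.

```r
# Hypothetical helpers. Inputs are assumed to be sorted and duplicate-free,
# as produced by the sort() and unique() steps in the code above.
is_subset_union <- function(x, y) identical(y, unique(sort(c(x, y))))
is_subset_in    <- function(x, y) all(x %in% y)

small <- c(6734L, 51750L)
large <- c(6734L, 51733L, 51750L)

is_subset_union(small, large)   # small is contained in large
is_subset_in(small, large)
is_subset_union(large, small)   # large is not contained in small
is_subset_in(large, small)
```

The `%in%` form does not require sorted input, which may be convenient if the groups are not preprocessed.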

mdvaan

2012-Mar-14 20:15 UTC

Hi Jean and Petr,

Thanks for the help. Both options are indeed faster than my initial procedure.

Best,

Mathijs


Petr Savicky

2012-Mar-14 20:59 UTC

Hi.

If a lot of events are eliminated, then the following may
be faster, since eliminated events are removed before
the further comparisons take place.

  data <- unique(DF$A)
  gr <- split(DF$B, f=factor(DF$A, levels=data))
  gr <- lapply(gr, FUN=sort)
  gr <- lapply(gr, FUN=unique)
  accept <- rep(FALSE, times=length(gr))
  accept[1] <- TRUE
  for (i in seq.int(from=2, length=length(accept)-1)) {
      cand <- gr[[i]]
      OK <- TRUE
      for (j in which(accept)) {
          prev <- gr[[j]]
          both <- unique(sort(c(cand, prev)))
          if (identical(prev, both)) {
              OK <- FALSE
              break
          }
      }
      if (OK) {
          for (j in which(accept)) {
              prev <- gr[[j]]
              both <- unique(sort(c(cand, prev)))
              if (identical(cand, both)) {
                  accept[j] <- FALSE
              }
          }
          accept[i] <- TRUE
      }
  }
  DF2 <- DF[DF$A %in% data[accept], ]
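
A possible variation on the loop above, offered only as a sketch: if the groups are visited from largest to smallest, a later candidate can never strictly contain an already accepted group, so the second inner loop (which withdraws previously accepted groups) is not needed. The toy data and the variable names `kept` and `result` are hypothetical.

```r
# Toy data (hypothetical): events 1 and 2 are identical, event 3 is a
# strict subset of both, event 4 shares no participants with the others.
toy <- data.frame(A = c(1, 1, 2, 2, 3, 4, 4),
                  B = c(10, 20, 20, 10, 10, 30, 40))

gr <- lapply(split(toy$B, toy$A), unique)
# Largest groups first, so accepted groups never need to be withdrawn
gr <- gr[order(lengths(gr), decreasing = TRUE)]

kept <- list()
for (id in names(gr)) {
  cand <- gr[[id]]
  # Reject the candidate if an accepted group already contains all of it
  is_sub <- vapply(kept, function(prev) all(cand %in% prev), logical(1))
  if (!any(is_sub)) kept[[id]] <- cand
}
result <- toy[toy$A %in% names(kept), ]
```

Of several identical events, whichever comes first in the size ordering is kept.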

Can you afford to compute table(DF$A, DF$B) for the real data?
Its size will be proportional to length(unique(DF$A))*length(unique(DF$B)).
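
The size question can be checked before building the table. This is a rough sketch assuming that table() stores 4-byte integer counts and ignoring the overhead of dimnames; the toy data and variable names are hypothetical, so substitute the real data frame.

```r
# Toy data (hypothetical); replace with the real data frame
toy <- data.frame(A = c(1, 1, 2, 3, 3),
                  B = c(10, 20, 10, 30, 40))

n_events <- length(unique(toy$A))
n_people <- length(unique(toy$B))

# as.numeric() avoids integer overflow when both counts are large
cells <- as.numeric(n_events) * n_people
approx_mb <- cells * 4 / 1024^2

cells
approx_mb
```

If the estimate is far above the available memory, the list-based loops are the safer choice.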

Petr Savicky.
