Dear R users, Basically, from the following arbitrary data set: a <- data.frame(id=c(c("A1","A2","A3","A4","A5"),c("A3","A2","A3","A4","A5")),loc=c("B1","B2","B3","B4","B5"),clm=c(rep(("General"),6),rep("Life",4)))> aid loc clm 1 A1 B1 General 2 A2 B2 General 3 A3 B3 General 4 A4 B4 General 5 A5 B5 General 6 A3 B1 General 7 A2 B2 Life 8 A3 B3 Life 9 A4 B4 Life 10 A5 B5 Life I desire removing records (highlighted records above) with identical values in each fields ("id" & "loc") but with different value of "clm" (i.e according to category) i.e> categ <- table(a$id,a$clm) > categGeneral Life A1 1 0 A2 1 1 A3 2 1 A4 1 1 A5 1 1 The desired output is id loc clm 1 A1 B1 General 6 A3 B1 General Because the data set I am working on is quite big (~ 800,000 x 20) with majority of the fields values being long strings, looping turned out to be very inefficient in comapring individual rows.. Are there any alternative efficient methods in implementing this problem? Greatly appreciate for your expertise. Steven [[alternative HTML version deleted]]
On Oct 28, 2009, at 9:30 PM, Steven Kang wrote:> Dear R users, > > > Basically, from the following arbitrary data set: > > a <- > data > .frame > (id > > c > (c > ("A1 > ","A2 > ","A3 > ","A4 > ","A5 > "),c > ("A3 > ","A2 > ","A3 > ","A4","A5")),loc=c("B1","B2","B3","B4","B5"),clm=c(rep(("General"), > 6),rep("Life",4))) > >> a > id loc clm > 1 A1 B1 General > 2 A2 B2 General > 3 A3 B3 General > 4 A4 B4 General > 5 A5 B5 General > 6 A3 B1 General > 7 A2 B2 Life > 8 A3 B3 Life > 9 A4 B4 Life > 10 A5 B5 Life > > I desire removing records (highlighted records above) with identical > values > in each fields ("id" & "loc") but with different value of "clm" (i.e > according to category)Take a look at this merge operation on separate rows of "a". > merge( a[a$clm=="Life", ], a[a$clm=="General", ] , by=c("id", "loc"), all=T) id loc clm.x clm.y 1 A1 B1 <NA> General 2 A2 B2 Life General 3 A3 B1 <NA> General 4 A3 B3 Life General 5 A4 B4 Life General 6 A5 B5 Life General Assignment of that object and selection with is.na should complete the process. > a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ] , by=c("id", "loc"), all=T) > a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ] id loc clm.x clm.y 1 A1 B1 <NA> General 3 A3 B1 <NA> General Alternate methods might include paste-ing id to loc and removing duplicates.> i.e >> categ <- table(a$id,a$clm) >> categ > > General Life > A1 1 0 > A2 1 1 > A3 2 1 > A4 1 1 > A5 1 1 > > The desired output is > > id loc clm > 1 A1 B1 General > 6 A3 B1 General > > Because the data set I am working on is quite big (~ 800,000 x 20) > with majority of the fields values being long strings, looping > turned out to > be very inefficient in comapring individual rows.. > > Are there any alternative efficient methods in implementing this > problem? > Steven-- David Winsemius, MD Heritage Laboratories West Hartford, CT
Here is one way of doing it:> a <-+ data.frame(id=c(c("A1","A2","A3","A4","A5"), + c("A3","A2","A3","A4","A5")),loc=c("B1","B2","B3","B4","B5"), + clm=c(rep(("General"),6),rep("Life",4)))> # split the indices based on 'id' & 'loc' > a.indx <- split(seq(nrow(a)), paste(a$id, a$loc)) > # now take each group and see if 'clm' differs (don't know what you want to > # do if more than 2 are in the group) > result <- lapply(a.indx, function(.indx){+ if (length(.indx) == 1) return(.indx) + if (any(a$clm[.indx[1]] != a$clm[.indx])) return(NULL) + .indx + })> # output the matches > a[unlist(result),,drop=FALSE]id loc clm 1 A1 B1 General 6 A3 B1 General> >On Wed, Oct 28, 2009 at 9:30 PM, Steven Kang <stochastickang at gmail.com> wrote:> Dear R users, > > > Basically, from the following arbitrary data set: > > a <- > data.frame(id=c(c("A1","A2","A3","A4","A5"),c("A3","A2","A3","A4","A5")),loc=c("B1","B2","B3","B4","B5"),clm=c(rep(("General"),6),rep("Life",4))) > >> a > ? ?id ? loc ?clm > 1 ?A1 ?B1 General > 2 ?A2 ?B2 General > 3 ?A3 ?B3 General > 4 ?A4 ?B4 General > 5 ?A5 ?B5 General > 6 ?A3 ?B1 General > 7 ?A2 ?B2 ? ?Life > 8 ?A3 ?B3 ? ?Life > 9 ?A4 ?B4 ? ?Life > 10 A5 ?B5 ? ?Life > > I desire removing records (highlighted records above) with identical values > in each fields ("id" & "loc") but with different value of "clm" (i.e > according to category) > i.e >> categ <- table(a$id,a$clm) >> categ > > ? ? General Life > ?A1 ? ? ? 1 ? ?0 > ?A2 ? ? ? 1 ? ?1 > ?A3 ? ? ? 2 ? ?1 > ?A4 ? ? ? 1 ? ?1 > ?A5 ? ? ? 1 ? ?1 > > The desired output is > > ? ?id ? loc ?clm > 1 ?A1 ?B1 General > 6 ?A3 ?B1 General > > Because the data set I am working on is quite big (~ 800,000 x 20) > with majority of the fields values being long strings, looping turned out to > be very inefficient in comapring individual rows.. > > Are there any alternative efficient methods in implementing this problem? > > Greatly appreciate for your expertise. > > > > Steven > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?
Here is another way based on pasting ids as hinted below: a <- data.frame(id=c(c("A1","A2","A3","A4","A5"), c("A3","A2","A3","A4","A5")), loc=c("B1","B2","B3","B4","B5"), clm=c(rep(("General"),6),rep("Life",4))) a$uid <- paste(a$id, ".", a$loc, sep="") out <- tapply( a$clm, a$uid, paste ) # can also add collapse="," $A1.B1 [1] "General" $A2.B2 [1] "General" "Life" $A3.B1 [1] "General" $A3.B3 [1] "General" "Life" $A4.B4 [1] "General" "Life" $A5.B5 [1] "General" "Life" Then here are those with single policies. > out[ which( sapply(out, length) == 1 ) ] $A1.B1 [1] "General" $A3.B1 [1] "General" David Winsemius wrote:> On Oct 28, 2009, at 9:30 PM, Steven Kang wrote: > >> Dear R users, >> >> >> Basically, from the following arbitrary data set: >> >> a <- >> data >> .frame >> (id >> >> c >> (c >> ("A1 >> ","A2 >> ","A3 >> ","A4 >> ","A5 >> "),c >> ("A3 >> ","A2 >> ","A3 >> ","A4","A5")),loc=c("B1","B2","B3","B4","B5"),clm=c(rep(("General"), >> 6),rep("Life",4))) >> >>> a >> id loc clm >> 1 A1 B1 General >> 2 A2 B2 General >> 3 A3 B3 General >> 4 A4 B4 General >> 5 A5 B5 General >> 6 A3 B1 General >> 7 A2 B2 Life >> 8 A3 B3 Life >> 9 A4 B4 Life >> 10 A5 B5 Life >> >> I desire removing records (highlighted records above) with identical >> values >> in each fields ("id" & "loc") but with different value of "clm" (i.e >> according to category) > > Take a look at this merge operation on separate rows of "a". > > > merge( a[a$clm=="Life", ], a[a$clm=="General", ] , by=c("id", > "loc"), all=T) > id loc clm.x clm.y > 1 A1 B1 <NA> General > 2 A2 B2 Life General > 3 A3 B1 <NA> General > 4 A3 B3 Life General > 5 A4 B4 Life General > 6 A5 B5 Life General > > Assignment of that object and selection with is.na should complete the > process. > > > a2m <- merge( a[a$clm=="Life", ], a[a$clm=="General", ] , > by=c("id", "loc"), all=T) > > > a2m[ is.na(a2m$clm.x) | is.na(a2m$clm.y), ] > id loc clm.x clm.y > 1 A1 B1 <NA> General > 3 A3 B1 <NA> General > > Alternate methods might include paste-ing id to loc and removing > duplicates. > > >> i.e >>> categ <- table(a$id,a$clm) >>> categ >> General Life >> A1 1 0 >> A2 1 1 >> A3 2 1 >> A4 1 1 >> A5 1 1 >> >> The desired output is >> >> id loc clm >> 1 A1 B1 General >> 6 A3 B1 General >> >> Because the data set I am working on is quite big (~ 800,000 x 20) >> with majority of the fields values being long strings, looping >> turned out to >> be very inefficient in comapring individual rows.. >> >> Are there any alternative efficient methods in implementing this >> problem? >> Steven