thr3ads.net - R help - [R] finding both rows that are duplicated in a data frame [Sep 2013]

If this information is useful, please help other people find it:
Share via:

Robert Lynch

2013-Sep-07 07:02 UTC

[R] finding both rows that are duplicated in a data frame

I have a data frame that looks like

id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
GENDER<-sample(c("G-UNK","G-M","G-F"),16,
replace = TRUE)
ETH <-sample(c("E-AF","E-UNK","E-VT"),16,
replace = TRUE)
example<-cbind(id1,id2,GENDER,ETH)

where there are two id's and some duplicate entries for ID's that have
different GENDER or ETH(nicity)
I would like to get a data frame that doesn't have the duplicates, but the
ones that are kept are which ever GENDER is not G-UNK (unknown) and the
kept ETH is what ever is not E-UNK

the resultant data frame should have 10 rows with no *-UNK in either of the
last two columns ( unless both entries were UNK)

yes the example data may have some impossible results but it does capture
important aspects.
1) G-UNK is alphabetically last of G-F, G-M & G-UNK
2) E-UNK is in the middle alphabetically
3) some times the first entry is the unknown gender, some times it is the
second *likely to happen with random sample
4) some times both entries for one variable, GENDER or ETH are unknown.
5) only appears to be two of each row, * not 100% sure

Thanks!
 Robert

	[[alternative HTML version deleted]]

arun

2013-Sep-07 14:52 UTC

head link

[R] finding both rows that are duplicated in a data frame

Hi,
example<- data.frame(id1,id2,GENDER,ETH,stringsAsFactors=FALSE)

res<-unique(example[!(grepl("UNK",example$GENDER)|grepl("UNK",example$ETH)),])
?res
#?? id1 id2 GENDER? ETH
#1??? 1? 22??? G-M E-VT
#3??? 2? 34??? G-M E-AF
#5??? 3? 15??? G-M E-AF
#7??? 4? 76??? G-F E-VT
#8??? 5? 45??? G-F E-VT
#12?? 7? 37??? G-F E-AF
#13?? 8? 52??? G-F E-AF
#14?? 9? 66??? G-F E-AF
#16? 10? 91??? G-F E-VT


It is a bit unclear about the condition for id1 #6.? If I include both of them,
the nrows will be 11, now it is 9.

10?? 6? 84? G-UNK? E-AF
11?? 6? 84??? G-F E-UNK


A.K.



----- Original Message -----
From: Robert Lynch <robert.b.lynch at gmail.com>
To: R help <r-help at r-project.org>
Cc: 
Sent: Saturday, September 7, 2013 3:02 AM
Subject: [R] finding both rows that are duplicated in a data frame

I have a data frame that looks like

id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
GENDER<-sample(c("G-UNK","G-M","G-F"),16,
replace = TRUE)
ETH <-sample(c("E-AF","E-UNK","E-VT"),16,
replace = TRUE)
example<-cbind(id1,id2,GENDER,ETH)

where there are two id's and some duplicate entries for ID's that have
different GENDER or ETH(nicity)
I would like to get a data frame that doesn't have the duplicates, but the
ones that are kept are which ever GENDER is not G-UNK (unknown) and the
kept ETH is what ever is not E-UNK

the resultant data frame should have 10 rows with no *-UNK in either of the
last two columns ( unless both entries were UNK)

yes the example data may have some impossible results but it does capture
important aspects.
1) G-UNK is alphabetically last of G-F, G-M & G-UNK
2) E-UNK is in the middle alphabetically
3) some times the first entry is the unknown gender, some times it is the
second *likely to happen with random sample
4) some times both entries for one variable, GENDER or ETH are unknown.
5) only appears to be two of each row, * not 100% sure

Thanks!
Robert

??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

arun

2013-Sep-07 16:19 UTC

head link

[R] finding both rows that are duplicated in a data frame

Hi,

Suppose you have situations like this: (duplicates are both UNKNOWN and want to
remove those)


example1<-rbind(example,data.frame(id1=c(11,12,12),id2=c(93,95,95),GENDER=rep("G-UNK",3),ETH=rep("E-UNK",3)))
spl<- as.character(interaction(example1$id1,example1$id2))
?res1<-do.call(rbind,lapply(split(example1,spl),function(x)
{indx<-!(grepl("UNK",x[,3])|grepl("UNK",x[,4]));if(sum(indx)==0)
{x[,3]<-x[,3][-grep("UNK",x[,3])];x[,4]<-
x[,4][-grep("UNK",x[,4])];unique(x) } else unique(x[indx,])}))
res1<-res1[!(is.na(res1[,3])|is.na(res1[,4])),]? ##remove the rows with NA

?res2<-res1[order(res1$id1),]
?row.names(res2)<- 1:nrow(res2)
?res2
#? id1 id2 GENDER? ETH
#1??? 1? 22??? G-M E-VT
#2??? 2? 34??? G-M E-AF
#3??? 3? 15??? G-M E-AF
#4??? 4? 76??? G-F E-VT
#5??? 5? 45??? G-F E-VT
#6??? 6? 84??? G-F E-AF
#7??? 7? 37??? G-F E-AF
#8??? 8? 52??? G-F E-AF
#9??? 9? 66??? G-F E-AF
#10? 10? 91??? G-F E-VT

A.K.


----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Robert Lynch <robert.b.lynch at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, September 7, 2013 11:30 AM
Subject: Re: [R] finding both rows that are duplicated in a data frame

HI,
May be this is what you are looking for.


spl<- as.character(interaction(example$id1,example$id2))
res<-do.call(rbind,lapply(split(example,spl),function(x)
{indx<-!(grepl("UNK",x[,3])|grepl("UNK",x[,4]));if(sum(indx)==0)
{x[,3]<-x[,3][-grep("UNK",x[,3])];x[,4]<-
x[,4][-grep("UNK",x[,4])];unique(x) } else unique(x[indx,])}))
?
?res1<-res[order(res$id1),]
?row.names(res1)<-1:nrow(res1)
?res1
#?? id1 id2 GENDER? ETH
#1??? 1? 22??? G-M E-VT
#2??? 2? 34??? G-M E-AF
#3??? 3? 15??? G-M E-AF
#4??? 4? 76??? G-F E-VT
#5??? 5? 45??? G-F E-VT
#6??? 6? 84??? G-F E-AF
#7??? 7? 37??? G-F E-AF
#8??? 8? 52??? G-F E-AF
#9??? 9? 66??? G-F E-AF
#10? 10? 91??? G-F E-VT
A.K.



----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: Robert Lynch <robert.b.lynch at gmail.com>
Cc: R help <r-help at r-project.org>
Sent: Saturday, September 7, 2013 10:52 AM
Subject: Re: [R] finding both rows that are duplicated in a data frame

Hi,
example<- data.frame(id1,id2,GENDER,ETH,stringsAsFactors=FALSE)

res<-unique(example[!(grepl("UNK",example$GENDER)|grepl("UNK",example$ETH)),])
?res
#?? id1 id2 GENDER? ETH
#1??? 1? 22??? G-M E-VT
#3??? 2? 34??? G-M E-AF
#5??? 3? 15??? G-M E-AF
#7??? 4? 76??? G-F E-VT
#8??? 5? 45??? G-F E-VT
#12?? 7? 37??? G-F E-AF
#13?? 8? 52??? G-F E-AF
#14?? 9? 66??? G-F E-AF
#16? 10? 91??? G-F E-VT


It is a bit unclear about the condition for id1 #6.? If I include both of them,
the nrows will be 11, now it is 9.

10?? 6? 84? G-UNK? E-AF
11?? 6? 84??? G-F E-UNK


A.K.



----- Original Message -----
From: Robert Lynch <robert.b.lynch at gmail.com>
To: R help <r-help at r-project.org>
Cc: 
Sent: Saturday, September 7, 2013 3:02 AM
Subject: [R] finding both rows that are duplicated in a data frame

I have a data frame that looks like

id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
GENDER<-sample(c("G-UNK","G-M","G-F"),16,
replace = TRUE)
ETH <-sample(c("E-AF","E-UNK","E-VT"),16,
replace = TRUE)
example<-cbind(id1,id2,GENDER,ETH)

where there are two id's and some duplicate entries for ID's that have
different GENDER or ETH(nicity)
I would like to get a data frame that doesn't have the duplicates, but the
ones that are kept are which ever GENDER is not G-UNK (unknown) and the
kept ETH is what ever is not E-UNK

the resultant data frame should have 10 rows with no *-UNK in either of the
last two columns ( unless both entries were UNK)

yes the example data may have some impossible results but it does capture
important aspects.
1) G-UNK is alphabetically last of G-F, G-M & G-UNK
2) E-UNK is in the middle alphabetically
3) some times the first entry is the unknown gender, some times it is the
second *likely to happen with random sample
4) some times both entries for one variable, GENDER or ETH are unknown.
5) only appears to be two of each row, * not 100% sure

Thanks!
Robert

??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

jim holtman

2013-Sep-07 16:22 UTC

head link

[R] finding both rows that are duplicated in a data frame

try this.  Splits the dataframe based on the two IDs and then chooses
the first one in cases where condition not met.

> id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
>  id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
>  GENDER<-sample(c("G-UNK","G-M","G-F"),16,
replace = TRUE)
>  ETH <-sample(c("E-AF","E-UNK","E-VT"),16,
replace = TRUE)
>  example<-data.frame(id1,id2,GENDER,ETH, stringsAsFactors = FALSE)
> # find dups by spliting on id1,id2
> result <- do.call(rbind+ , lapply(split(example, list(example$id1, example$id2), drop TRUE),
function(x){
+ indx <- which(!grepl("UNK", x$GENDER) &
!grepl("UNK", x$ETH))[1L] #
choose first one
+ if (is.na(indx)) indx <- 1L  # none match so choose one
+ x[indx,]
+ })
+ )> result      id1 id2 GENDER   ETH
3.15    3  15    G-F  E-AF
1.22    1  22    G-F  E-VT
2.34    2  34    G-F  E-AF
7.37    7  37  G-UNK  E-VT
5.45    5  45    G-M  E-AF
8.52    8  52    G-F  E-AF
9.66    9  66  G-UNK  E-AF
4.76    4  76    G-M  E-AF
6.84    6  84    G-M  E-VT
10.91  10  91    G-F E-UNK
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


On Sat, Sep 7, 2013 at 3:02 AM, Robert Lynch <robert.b.lynch at gmail.com>
wrote:> I have a data frame that looks like
>
> id1<-c(1,1,2,2,3,3,4,5,5,6,6,7,8,9,9,10)
> id2<-c(22,22,34,34,15,15,76,45,45,84,84,37,52,66,66,91)
> GENDER<-sample(c("G-UNK","G-M","G-F"),16,
replace = TRUE)
> ETH <-sample(c("E-AF","E-UNK","E-VT"),16,
replace = TRUE)
> example<-cbind(id1,id2,GENDER,ETH)
>
> where there are two id's and some duplicate entries for ID's that
have
> different GENDER or ETH(nicity)
> I would like to get a data frame that doesn't have the duplicates, but
the
> ones that are kept are which ever GENDER is not G-UNK (unknown) and the
> kept ETH is what ever is not E-UNK
>
> the resultant data frame should have 10 rows with no *-UNK in either of the
> last two columns ( unless both entries were UNK)
>
> yes the example data may have some impossible results but it does capture
> important aspects.
> 1) G-UNK is alphabetically last of G-F, G-M & G-UNK
> 2) E-UNK is in the middle alphabetically
> 3) some times the first entry is the unknown gender, some times it is the
> second *likely to happen with random sample
> 4) some times both entries for one variable, GENDER or ETH are unknown.
> 5) only appears to be two of each row, * not 100% sure
>
> Thanks!
>  Robert
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

R help - Sep 2013 - finding both rows that are duplicated in a data frame

[R] finding both rows that are duplicated in a data frame

[R] finding both rows that are duplicated in a data frame

[R] finding both rows that are duplicated in a data frame

[R] finding both rows that are duplicated in a data frame