thr3ads.net - R help - [R] filtering out replicate (duplicate) observations under special conditions [Aug 2013]

arun

2013-Aug-22 13:25 UTC

[R] filtering out replicate (duplicate) observations under special conditions

Hi Samuel,

Based on the output you wanted:
(It would be better to use ?dput() to show the example dataset)


dat1 <- structure(list(SiteID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L), SiteName = c("Big Platte Lake", "Big Platte Lake",
    "Big Platte Lake", "Big Platte Lake", "Big Platte Lake", "Big Platte Lake",
    "Big Platte Lake", "Big Platte Lake", "Big Platte Lake", "Big Platte Lake",
    "Big Platte Lake", "Big Platte Lake", "Big Platte Lake", "Big Platte Lake",
    "Big Platte Lake"), SampDate = c("2006-09-20", "2006-09-20",
    "2006-09-20", "2006-09-20", "2006-09-20", "2006-09-20", "2006-09-20",
    "2006-09-20", "2006-09-20", "2006-09-20", "2006-09-20", "2006-09-20",
    "2006-09-20", "2006-09-20", "2006-09-20"), DepthM = c(0, 0, 0,
    2.286, 2.286, 2.286, 4.572, 4.572, 4.572, 9.144, 9.144, 9.144,
    13.716, 13.716, 13.716), PDesc = c("TP", "TP", "TP", "TP", "TP",
    "TP", "TP", "TP", "TP", "TP", "TP", "TP", "TP", "TP", "TP"),
    MAbbr = c("Grab", "Grab", "Grab", "Grab", "Grab", "Grab",
    "Grab", "Grab", "Grab", "Grab", "Grab", "Grab", "Grab", "Grab",
    "Grab"), Measure = c(6.58, 6.84, 6.59, 7.76, 8.57, 8.49,
    9.71, 8.47, 7.71, 7.51, 7.85, 6.81, 7.94, 8.76, 8.4), DNU = c(FALSE,
    FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
    FALSE, FALSE, FALSE, FALSE, FALSE)), .Names = c("SiteID",
"SiteName", "SampDate", "DepthM", "PDesc", "MAbbr", "Measure",
"DNU"), class = "data.frame", row.names = c("16042", "16043",
"16044", "16045", "16046", "16047", "16048", "16049", "16050",
"16051", "16052", "16053", "16054", "16055", "16056"))


dat1[!duplicated(dat1$DepthM),] #current example with one SampDate
#      SiteID        SiteName   SampDate DepthM PDesc MAbbr Measure   DNU
#16042      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.58 FALSE
#16045      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    7.76 FALSE
#16048      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    9.71 FALSE
#16051      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    7.51 FALSE
#16054      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    7.94 FALSE
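
Since it does not matter which replicate at a depth is kept, the default of duplicated() (keep the first occurrence) is as good as any. If the last replicate is preferred instead, the fromLast argument should do it. A minimal sketch on a made-up stand-in for dat1 (the name `toy` and its values are only for illustration):

```r
# Small stand-in for dat1: three replicates at each of two depths
toy <- data.frame(
  DepthM  = c(0, 0, 0, 2.286, 2.286, 2.286),
  Measure = c(6.58, 6.84, 6.59, 7.76, 8.57, 8.49)
)

# Default: the first row seen for each depth is kept
first_kept <- toy[!duplicated(toy$DepthM), ]

# fromLast = TRUE scans from the end, so the last row for each depth is kept
last_kept <- toy[!duplicated(toy$DepthM, fromLast = TRUE), ]

first_kept$Measure
last_kept$Measure
```

Either way there is one row per depth; only the retained Measure differs.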

#with more than one SampDate (hopefully it works):

dat1[unlist(with(dat1,tapply(DepthM,list(SampDate),FUN=function(x) !duplicated(x)))),]
#      SiteID        SiteName   SampDate DepthM PDesc MAbbr Measure   DNU
#16042      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.58 FALSE
#16045      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    7.76 FALSE
#16048      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    9.71 FALSE
#16051      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    7.51 FALSE
#16054      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    7.94 FALSE
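
One caveat on the tapply() version: unlist() returns the flags grouped by SampDate (in sorted order of the dates), so they line up with the rows only when the data frame is already ordered by SampDate. A sketch of an alternative that should not depend on row order is to call duplicated() on the two columns together. The data frame `toy` below is invented for illustration and is deliberately not sorted by date:

```r
# Toy data: two sampling dates, rows interleaved rather than sorted by date
toy <- data.frame(
  SampDate = c("2006-09-20", "2006-08-15", "2006-09-20", "2006-08-15", "2006-09-20"),
  DepthM   = c(0, 0, 0, 2.286, 2.286),
  Measure  = c(6.58, 5.10, 6.84, 5.30, 7.76),
  stringsAsFactors = FALSE
)

# duplicated() on a data frame compares whole rows, so passing only the
# SampDate and DepthM columns flags every repeat of a date/depth pair
keep <- !duplicated(toy[, c("SampDate", "DepthM")])
res <- toy[keep, ]
res
```

Here only the third row (a second observation at depth 0 on 2006-09-20) is dropped, and the remaining rows stay in their original order.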



A.K.




----- Original Message -----
From: Samuel T. Christel <schristel at wisc.edu>
To: smartpink111 at yahoo.com
Cc: 
Sent: Thursday, August 22, 2013 9:17 AM
Subject: Re: filtering out replicate (duplicate) observations under special conditions

Hello,

I updated the post to the following:


Update: The previous description of my expected output may have been a bit
confusing as the example table did not illustrate the problem... A better
example of the data table is:

      SiteID        SiteName   SampDate DepthM PDesc MAbbr Measure   DNU
16042      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.58 FALSE
16043      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.84 FALSE
16044      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.59 FALSE
16045      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    7.76 FALSE
16046      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    8.57 FALSE
16047      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    8.49 FALSE
16048      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    9.71 FALSE
16049      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    8.47 FALSE
16050      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    7.71 FALSE
16051      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    7.51 FALSE
16052      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    7.85 FALSE
16053      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    6.81 FALSE
16054      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    7.94 FALSE
16055      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    8.76 FALSE
16056      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    8.40 FALSE

On a given "SampDate" I am only interested in ONE unique
"DepthM." That is to say in the table above I would like to remove the
replicate observations of "DepthM" for "DepthM" values of
0.000, 2.286, 4.572, 9.144, and 13.716.

The final table would look like this: 

      SiteID        SiteName   SampDate DepthM PDesc MAbbr Measure   DNU
16042      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.58 FALSE
16045      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    7.76 FALSE
16048      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    9.71 FALSE
16051      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    7.51 FALSE
16054      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    7.94 FALSE

Note that it does not matter which observation at a particular depth (on that
sampling date) is kept or discarded!




Any advice you might have would be most appreciated!

-STC



On 08/21/13, smartpink111 at yahoo.com wrote:
> Hi,
> 
> Could you show your expected output? The description is confusing. Based
> on the example dataset, all the rows look unique for a combination of SampDate
> and DepthM.
> 
> A.K.
> <quote author='limno.sam'>
> Hi,
> 
> I am working with a data.frame with the following structure:
> 
>   SiteID        SiteName   SampDate DepthM PDesc MAbbr Measure   DNU
> 1      1 Big Platte Lake 1982-06-17  0.000   Alk  Grab     143 FALSE
> 2      1 Big Platte Lake 1992-09-09  0.000   Alk  Grab      64 FALSE
> 3      1 Big Platte Lake 1992-09-09  4.572   Alk  Grab     126 FALSE
> 4      1 Big Platte Lake 1992-09-09  9.144   Alk  Grab     130 FALSE
> 5      1 Big Platte Lake 1992-09-09 13.716   Alk  Grab     142 FALSE
> 6      1 Big Platte Lake 1992-09-09 18.288   Alk  Grab     146 FALSE
> 
> I would like to filter out replicate observations (Measure). However, there
> is no column in the source data indicating whether or not an observation is
> a replicate. Therefore, I am only interested in a data frame where one
> unique "SampDepth" is attached to one unique "SampDate." That is to say I am
> only interested in observations unique to both one sample date and sample
> depth (these data were collected from a lake). I am having issues getting my
> code to work, and I've only been coding in R for a few months.
> 
> I have named my data.frame (after other filtering steps) as "data" and
> tried the following:
> data=data[(duplicated(data$SampDate,incomparables=FALSE,fromLast=FALSE,
> nmax=NA[which(!duplicated(data$DepthM))])==TRUE),]
> 
> I end up with a data frame of all unique "SampDate" values, but the unique
> "SampDepth" values for a given "SampDate" are filtered out.
> 
> Any suggestions? 
> </quote>
> Quoted from:
> http://r.789695.n4.nabble.com/filtering-out-replicate-duplicate-observations-under-special-conditions-tp4674253.html
> 
> 
> _____________________________________
> Sent from http://r.789695.n4.nabble.com

arun

2013-Aug-22 13:48 UTC

[R] filtering out replicate (duplicate) observations under special conditions

Hi,
Also, you could use:
library(data.table)
dt1<- data.table(dat1,key=c('SampDate','DepthM'))


#by=key(dt1) is needed in data.table >= 1.9.8, where unique() compares all columns by default
unique(dt1, by=key(dt1))
#   SiteID        SiteName   SampDate DepthM PDesc MAbbr Measure   DNU
#1:      1 Big Platte Lake 2006-09-20  0.000    TP  Grab    6.58 FALSE
#2:      1 Big Platte Lake 2006-09-20  2.286    TP  Grab    7.76 FALSE
#3:      1 Big Platte Lake 2006-09-20  4.572    TP  Grab    9.71 FALSE
#4:      1 Big Platte Lake 2006-09-20  9.144    TP  Grab    7.51 FALSE
#5:      1 Big Platte Lake 2006-09-20 13.716    TP  Grab    7.94 FALSE


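res1<-dat1[unlist(with(dat1,tapply(DepthM,list(SampDate),FUN=function(x) !duplicated(x)))),]
row.names(res1)<-1:nrow(res1)
#by=key(dt1) restricts the comparison to SampDate and DepthM (required in data.table >= 1.9.8)
res2<- as.data.frame(unique(dt1, by=key(dt1)))
identical(res1,res2)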
#[1] TRUE

#or
dt1[unique(dt1[,key(dt1),with=FALSE]),mult='first']
#     SampDate DepthM SiteID        SiteName PDesc MAbbr Measure   DNU
#1: 2006-09-20  0.000      1 Big Platte Lake    TP  Grab    6.58 FALSE
#2: 2006-09-20  2.286      1 Big Platte Lake    TP  Grab    7.76 FALSE
#3: 2006-09-20  4.572      1 Big Platte Lake    TP  Grab    9.71 FALSE
#4: 2006-09-20  9.144      1 Big Platte Lake    TP  Grab    7.51 FALSE
#5: 2006-09-20 13.716      1 Big Platte Lake    TP  Grab    7.94 FALSE



A.K.

