I have several thousand rows of shipment data imported into R as a data frame, with two columns of particular interest: col 1 is the entry date and col 2 is the tracking number (colname is REQ.NR). Tracking numbers should be unique but occasionally aren't, because they get entered more than once. This creates two or more rows with the same tracking number but different dates. I wrote a for loop that keeps the row with the oldest date, but it is extremely slow.

Any suggestions on how I should write this so that it is faster?

# Create a vector of the unique tracking numbers #
u <- na.omit(unique(Para.5C$REQ.NR))

# Create a data frame to rbind unique rows to #
Para.5C.final <- data.frame()

# For each value in u, subset Para.5C, find the min date, and rbind it to Para.5C.final #
for (i in 1:length(u))
{
  x <- subset(Para.5C, Para.5C$REQ.NR == u[i])
  Para.5C.final <- rbind(Para.5C.final, x[which(x[, 1] == min(x[, 1])), ])
}

--
View this message in context: http://r.789695.n4.nabble.com/Removing-duplicates-without-a-for-loop-tp4644255.html
Sent from the R help mailing list archive at Nabble.com.
Hello,

If I understand it correctly, something like this will get you what you want.

d <- Sys.Date() + 1:4
d2 <- sample(d, 2)
dat <- data.frame(id = 1:6, date = c(d, d2), value = rnorm(6))

aggregate(dat, by = list(dat$date), FUN = tail, 1)

Hope this helps,

Rui Barradas
Sorry, but in my previous post I confused the columns. It should be by REQ.NR, not by date.

REQ.NR <- 1:4
REQ.NR <- c(REQ.NR, sample(REQ.NR, 2))
dat <- data.frame(date = Sys.Date() + 1:6, REQ.NR = REQ.NR, value = rnorm(6))

# head, 1 keeps the first row of each REQ.NR group; since dat is
# ordered by date, that is the row with the oldest date
aggregate(dat, by = list(dat$REQ.NR), FUN = head, 1)

Rui Barradas
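To check an approach like this against the original goal (keep the oldest-date row per tracking number), one can order by date before grouping; the toy data and the set.seed call below are illustrative additions, not from the original reply:

```r
set.seed(42)  # added for reproducibility; not in the original reply
REQ.NR <- 1:4
REQ.NR <- c(REQ.NR, sample(REQ.NR, 2))
dat <- data.frame(date = Sys.Date() + 1:6, REQ.NR = REQ.NR, value = rnorm(6))

# Order by date so that, within each REQ.NR group, the first row is the oldest
dat <- dat[order(dat$date), ]

# Taking the first element of every column per group keeps the oldest-date row
aggregate(dat, by = list(dat$REQ.NR), FUN = head, 1)
```

Note that aggregate applies FUN column by column within each group, so the result has one row per unique REQ.NR (plus an extra Group.1 grouping column).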
This might be quicker.

Para.5C.sorted <- Para.5C[order(Para.5C[, 1]), ]
Para.5C.final <- Para.5C.sorted[!duplicated(Para.5C.sorted$REQ.NR), ]

If your data are already sorted by date, then you can skip the first step and just run

Para.5C.final <- Para.5C[!duplicated(Para.5C$REQ.NR), ]

Jean
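A small worked example of the order/!duplicated idiom (the toy data frame below is illustrative, not from the thread):

```r
# Toy data: REQ.NR "A" and "B" each appear twice with different dates
Para.5C <- data.frame(date   = Sys.Date() + c(3, 1, 2, 5, 4),
                      REQ.NR = c("A", "A", "B", "B", "C"))

# Sort by date so the first occurrence of each REQ.NR is its oldest entry,
# then drop the later duplicates in one vectorised pass
Para.5C.sorted <- Para.5C[order(Para.5C[, 1]), ]
Para.5C.final  <- Para.5C.sorted[!duplicated(Para.5C.sorted$REQ.NR), ]

Para.5C.final  # one row per REQ.NR, each with its earliest date
```

Because duplicated() marks every occurrence after the first, sorting by date beforehand guarantees the retained row is the oldest, and the whole operation avoids the per-group subset/rbind cost of the original loop.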