Hey guys,

I am working on a function to make duplicated values unique. For example, the original vector would be:

    a = c(2,1,1,3,3,3,4)

I'd like to transform it into:

    a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4

Basically, find the duplicates and make each one unique by adding a small amount, while keeping the original order. I came up with the following code, but it runs slowly if t is large. Is there a better way to do it?

    nodup = function(t)
    {
      t.index = 0
      t.dup = duplicated(t)          # TRUE for every repeat of an earlier value
      for (i in seq_along(t)[-1])    # like 2:length(t), but safe for length 1
      {
        if (t.dup[i])
          t.index = t.index + 0.01   # next repeat: grow the offset
        else t.index = 0             # first occurrence: reset the offset
        t[i] = t[i] + t.index
      }
      return(t)
    }

--
View this message in context: http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
Sent from the R help mailing list archive at Nabble.com.
One way is the following:

    a <- c(2,1,1,3,3,3,4)
    # one offset per element: 0.01, 0.02, ... within each run of length > 1
    d <- unlist(sapply(rle(a)$lengths, function (x)
        if (x > 1) seq(0.01, by = 0.01, length.out = x) else 0))
    a + d

I hope it helps.

Best,
Dimitris

On 11/25/2010 3:49 PM, randomcz wrote:
> I am working on a function to make a duplicated value unique.
> [...]
> I come up with the following codes, but it runs slow if t is large. Is there
> a better way to do it?

--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web: http://www.erasmusmc.nl/biostatistiek/
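To make the rle() step above concrete, here is a small sketch of the intermediate values on the example vector. It uses lapply() and seq_len() in place of sapply() and seq(), which is a substitution made here for illustration, not part of the reply. Note that this approach shifts every member of a run (1.01, 1.02), as the question asked, whereas the loop in the original post leaves the first member of each run unchanged (1, 1.01).

```r
a <- c(2, 1, 1, 3, 3, 3, 4)

# rle() describes the vector as runs: how long each run is and which value it holds
r <- rle(a)
r$lengths   # 1 2 3 1
r$values    # 2 1 3 4

# Offsets: runs of length 1 get 0, longer runs get 0.01, 0.02, ...
d <- unlist(lapply(r$lengths, function(n) if (n > 1) seq_len(n) / 100 else 0))

a + d       # 2.00 1.01 1.02 3.01 3.02 3.03 4.00
```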
If the input vector t is known to be ordered (or if you only care about runs of duplicated values, not all duplicated values), the following is pretty quick:

    nodup3 <- function (t) {
       t + (sequence(rle(t)$lengths) - 1)/100
    }

If you don't know whether the input will be ordered, then ave() will do it a bit faster than your code:

    nodup2 <- function (t) {
       ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
    }

E.g., for a sorted sequence of 300,000 numbers drawn with replacement from 1:100,000 I get:

    > a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
    > system.time(v <- nodup(a2))
       user  system elapsed
       2.78    0.05    3.97
    > system.time(v2 <- nodup2(a2))
       user  system elapsed
       1.83    0.02    2.66
    > system.time(v3 <- nodup3(a2))
       user  system elapsed
       0.18    0.00    0.14
    > identical(v,v2) && identical(v,v3)
    [1] TRUE

If speed is truly an issue, the built-in sequence() may be replaced by a faster one that does the same thing:

    nodup3a <- function (t) {
       faster.sequence <- function(nvec) {
          # position in the whole vector minus the number of elements
          # that precede the current run
          seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
       }
       t + (faster.sequence(rle(t)$lengths) - 1)/100
    }

That took 0.05 seconds on the a2 dataset and produced identical results.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of randomcz
> Sent: Thursday, November 25, 2010 6:49 AM
> To: r-help at r-project.org
> Subject: [R] help: program efficiency
>
> [...]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
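The distinction drawn above between "runs of duplicated values" and "all duplicated values" can be seen on a small unsorted vector. The following is only an illustrative sketch; the test vector x is made up here and does not come from the thread.

```r
nodup2 <- function(t) ave(t, t, FUN = function(x) x + (seq_along(x) - 1) / 100)
nodup3 <- function(t) t + (sequence(rle(t)$lengths) - 1) / 100

# sequence() concatenates 1:n for each n it is given
sequence(c(1, 2, 3))   # 1 1 2 1 2 3

# Hypothetical unsorted input: repeats exist, but never next to each other
x <- c(1, 2, 1, 2, 1)

nodup3(x)   # 1 2 1 2 1  (every run has length 1, so nothing changes)
nodup2(x)   # 1.00 2.00 1.01 2.01 1.02  (grouped by value, order preserved)
```

On sorted input the two agree, which is the case the timings above were run on.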
----------------------------------------
> Date: Thu, 25 Nov 2010 06:49:19 -0800
> From: randomcz at gmail.com
> To: r-help at r-project.org
> Subject: [R] help: program efficiency
>
> [...]
> I come up with the following codes, but it runs slow if t is large. Is there
> a better way to do it?

I guess I'd just make a vector of uniform or even normal random numbers and add it to your input vector. This of course is not guaranteed to give unique values, and it also perturbs values that were already unique, but you can test for duplicates and repeat until there are none. It is probably closer to what you want, but I am only speculating on your objectives.
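One possible reading of the "test and repeat" suggestion above is sketched below. The function name jitter_until_unique, the noise width eps, and the retry limit max_tries are all assumptions made for this illustration; the reply did not specify any of them.

```r
# Hypothetical helper: add uniform noise, retry until no duplicates remain
jitter_until_unique <- function(x, eps = 0.005, max_tries = 100) {
  for (i in seq_len(max_tries)) {
    y <- x + runif(length(x), min = -eps, max = eps)
    if (anyDuplicated(y) == 0) return(y)   # anyDuplicated() is 0 when all unique
  }
  stop("no duplicate-free result after ", max_tries, " tries")
}

set.seed(1)                      # only so the example is repeatable
a <- c(2, 1, 1, 3, 3, 3, 4)
b <- jitter_until_unique(a)
```

Unlike the deterministic answers in this thread, the noise does not keep tied values in their original order and it moves the values that were already unique, so it only fits if those properties do not matter.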
> -----Original Message-----
> From: William Dunlap
> Sent: Thursday, November 25, 2010 9:31 AM
> To: 'randomcz'; r-help at r-project.org
> Subject: RE: [R] help: program efficiency
>
> [...]
>
> If speed is truly an issue, the built-in sequence may
> be replaced by a faster one that does the same thing:
>
> nodup3a <- function (t) {
>    faster.sequence <- function(nvec) {
>       seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
>    }
>    t + (faster.sequence(rle(t)$lengths) - 1)/100
> }
>
> That took 0.05 seconds on the a2 dataset and produced
> identical results.

rle() computes a sort of second difference, and nodup3a computes a cumsum on that second difference to get back to a first difference. The following avoids that wasted operation (along with rle's computation of the values component of its output).

    nodup4 <- function(t) {
       n <- length(t)
       # p holds the last index of each run, with a leading 0
       p <- c(0L, which(t[-1L] != t[-n]), n)
       # subtract each run's starting index from every position in the run
       t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
    }

That reduced nodup3a's time by about 30% on that dataset.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
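For readers tracing nodup4, the intermediate values on the seven-element example from the original question look like this. The variable names starts, run_len, and offset are introduced here purely for the walkthrough and do not appear in the function itself.

```r
t <- c(2, 1, 1, 3, 3, 3, 4)
n <- length(t)

# Indices where a value differs from its successor mark the end of a run
p <- c(0L, which(t[-1L] != t[-n]), n)            # 0 1 3 6 7

starts  <- p[-length(p)] + 1L                    # first index of each run: 1 2 4 7
run_len <- diff(p)                               # length of each run:      1 2 3 1

# Position within the run, counted from 0
offset <- seq_len(n) - rep.int(starts, run_len)  # 0 0 1 0 1 2 0

t + offset / 100                                 # 2.00 1.00 1.01 3.00 3.01 3.02 4.00
```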
See if this works for you.

    a <- c(2,1,1,3,3,3,4)
    a.fac <- as.factor(a)
    b <- split(a, f = a.fac)
    system.time(lapply(X = b, FUN = function(x) {
        swn <- seq(from = 0, to = 0 + 0.01*length(x), by = 0.01)
        out <- x + swn
        return(out)
    }))

Cheers,
Roman

--
View this message in context: http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3060801.html
Oops, tiny mistake. Try

    lapply(X = b, FUN = function(x) {
        swn <- seq(from = 0, to = (0 + 0.01*length(x)) - 0.01, by = 0.01)
        out <- x + swn
        return(out)
    })

--
View this message in context: http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3060806.html
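The split()/lapply() approach above returns a list grouped by value rather than a vector in the original order. A minimal sketch of one way to get back to a vector is below, assuming base R's unsplit() is acceptable here. It also builds the offsets with seq_along() instead of seq(from, to, by), which is a substitution made for this illustration to avoid depending on a computed floating-point endpoint.

```r
a <- c(2, 1, 1, 3, 3, 3, 4)

# Group by value, then add 0, 0.01, 0.02, ... within each group
b   <- split(a, a)
res <- lapply(b, function(x) x + (seq_along(x) - 1) / 100)

# unsplit() reverses split(), putting each element back at its original position
a.nodup <- unsplit(res, a)
a.nodup   # 2.00 1.00 1.01 3.00 3.01 3.02 4.00
```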