thr3ads.net - R help - [R] help: program efficiency [Nov 2010]

randomcz

2010-Nov-25 14:49 UTC

[R] help: program efficiency

Hey guys,

I am working on a function that makes duplicated values unique. For example,
the original vector would be: a = c(2,1,1,3,3,3,4)
I'd like to transform it into:
a.nodup = 2, 1.01, 1.02, 3.01, 3.02, 3.03, 4
Basically, find the duplicates and make each one unique by adding a small
amount, keeping the values in order.
I came up with the following code, but it runs slowly when t is large. Is there
a better way to do it?
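nodup = function(t)
{
  t.index = 0                  # offset accumulated over the current run of duplicates
  t.dup = duplicated(t)        # TRUE where a value has already appeared earlier
  for (i in seq_along(t)[-1])  # start at 2; also safe when length(t) < 2
  {
    if (t.dup[i])
      t.index = t.index + 0.01
    else t.index = 0
    t[i] = t[i] + t.index
  }
  return(t)
}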


-- 
View this message in context:
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3059079.html
Sent from the R help mailing list archive at Nabble.com.

Dimitris Rizopoulos

2010-Nov-25 15:48 UTC

one way is the following:

a <- c(2,1,1,3,3,3,4)

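# lapply() rather than sapply(): sapply() would return a matrix
# when all runs happen to have the same length
d <- unlist(lapply(rle(a)$lengths, function (x)
     if (x > 1) seq(0.01, by = 0.01, length.out = x) else 0))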

a + d
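
A minimal sketch of the intermediate values in the run-length approach above, using the example vector from the question. The names runs and result are only for illustration.

```r
a <- c(2, 1, 1, 3, 3, 3, 4)

# Lengths of the runs of consecutive equal values
runs <- rle(a)$lengths
print(runs)    # 1 2 3 1

# One offset vector per run: 0 for a singleton, 0.01, 0.02, ... otherwise.
# lapply() always returns a list, so unlist() gives a plain vector.
d <- unlist(lapply(runs, function(n) {
    if (n > 1) seq(0.01, by = 0.01, length.out = n) else 0
}))

result <- a + d
print(result)  # 2.00 1.01 1.02 3.01 3.02 3.03 4.00
```

This numbers the first element of each run as well (1.01, 1.02), as in the target vector in the question, whereas the loop in the question leaves the first occurrence unchanged (1, 1.01).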


I hope it helps.

Best,
Dimitris


-- 
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
Web: http://www.erasmusmc.nl/biostatistiek/

William Dunlap

2010-Nov-25 17:31 UTC

If the input vector t is known to be ordered
(or if you only care about runs of duplicated
values, not all duplicated values) the following
is pretty quick

nodup3 <- function (t) { 
    t + (sequence(rle(t)$lengths) - 1)/100
}
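
As a rough illustration of what nodup3 does step by step, and of the caveat about ordered input, here is a sketch on two small vectors (a is from the question, b is a made-up unsorted example):

```r
nodup3 <- function(t) {
    t + (sequence(rle(t)$lengths) - 1) / 100
}

a <- c(2, 1, 1, 3, 3, 3, 4)
print(rle(a)$lengths)            # 1 2 3 1
print(sequence(rle(a)$lengths))  # 1 1 2 1 2 3 1  (position within each run)

res_runs <- nodup3(a)
print(res_runs)                  # 2.00 1.00 1.01 3.00 3.01 3.02 4.00

# Duplicates that are not adjacent form separate runs of length 1,
# so they are left untouched:
b <- c(1, 2, 1)
res_unsorted <- nodup3(b)
print(res_unsorted)              # 1 2 1
```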

If you don't know whether the input will be ordered,
then ave() will do it a bit faster than your
code

nodup2 <- function (t) { 
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1)/100)
}
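
A small sketch of how the ave() version behaves on unsorted input. The vector b is made up for illustration; ave() groups equal values wherever they occur and writes the results back to their original positions.

```r
nodup2 <- function(t) {
    ave(t, t, FUN = function(x) x + (seq_along(x) - 1) / 100)
}

b <- c(3, 1, 3, 2, 1, 3)
res <- nodup2(b)
print(res)   # 3.00 1.00 3.01 2.00 1.01 3.02
```

Each repeat of a value gets the next offset in order of appearance, even when the repeats are not adjacent.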

E.g., for a sorted sequence of 300,000 numbers drawn with
replacement from 1:100,000 I get:
> a2 <- sort(sample(1:1e5, size=3e5, replace=TRUE))
> system.time(v <- nodup(a2))
   user  system elapsed 
   2.78    0.05    3.97 
> system.time(v2 <- nodup2(a2))
   user  system elapsed 
   1.83    0.02    2.66 
> system.time(v3 <- nodup3(a2))
   user  system elapsed 
   0.18    0.00    0.14 
> identical(v,v2) && identical(v,v3)
[1] TRUE

If speed is truly an issue, the built-in sequence may
be replaced by a faster one that does the same thing:

nodup3a <- function (t) {
    faster.sequence <- function(nvec) {
        seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), 
            nvec)
    }
    t + (faster.sequence(rle(t)$lengths) - 1)/100
}

That took 0.05 seconds on the a2 dataset and produced
identical results.
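
To see why faster.sequence gives the same result as sequence(), here is a sketch that prints its intermediate pieces for the run lengths of the example vector. The name starts is only for illustration, and the timings quoted above are from 2010, so the relative speed may differ in other R versions.

```r
faster.sequence <- function(nvec) {
    seq_len(sum(nvec)) - rep(cumsum(c(0L, nvec[-length(nvec)])), nvec)
}

nvec <- c(1L, 2L, 3L, 1L)

# Overall positions 1..sum(nvec)
print(seq_len(sum(nvec)))      # 1 2 3 4 5 6 7

# Number of elements that come before each run
starts <- cumsum(c(0L, nvec[-length(nvec)]))
print(starts)                  # 0 1 3 6

# Repeat each start once per element of its run, then subtract
print(rep(starts, nvec))       # 0 1 1 3 3 3 6
res <- faster.sequence(nvec)
print(res)                     # 1 1 2 1 2 3 1
```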

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

Mike Marchywka

2010-Nov-25 17:34 UTC

I guess I'd just make a vector of uniform or even normal random numbers
and add it to your input vector. This is of course not guaranteed to remove
every duplicate, and it also perturbs values that were already unique, but you
can test the result and repeat if needed. It is probably closer to what you
want, though I am only speculating on your objectives.
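
One way the test-and-repeat idea could look, as a sketch only. The function name jitter_until_unique and the parameters amount and max_tries are hypothetical, not something given in this thread.

```r
# Add uniform noise in [-amount, amount] and retry until no duplicates remain.
# Function name and defaults are assumptions made for this illustration.
jitter_until_unique <- function(x, amount = 0.005, max_tries = 100L) {
    for (i in seq_len(max_tries)) {
        y <- x + runif(length(x), min = -amount, max = amount)
        if (anyDuplicated(y) == 0L) return(y)
    }
    stop("could not make values unique in ", max_tries, " tries")
}

set.seed(1)  # only to make the illustration reproducible
a <- c(2, 1, 1, 3, 3, 3, 4)
res <- jitter_until_unique(a)
print(anyDuplicated(res))  # 0, i.e. no duplicates left
```

Unlike the deterministic versions in this thread, the values within a run of ties come out in random order, so this does not keep them increasing.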


William Dunlap

2010-Nov-26 19:01 UTC

rle() computes a sort of second difference and
nodup3a computes a cumsum on that second difference
to get back to a first difference.  The following
avoids that wasted operation (along with rle's
computation of the values component of its output).

nodup4 <- function(t) {
    n <- length(t)
    p <- c(0L, which(t[-1L] != t[-n]), n)
    t + ( seq_len(n) - rep.int(p[-length(p)] + 1L, diff(p)) ) /100
}

That reduced nodup3a's time by about 30% on that dataset.
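
A sketch that unpacks nodup4 on the example vector from the question. The names x, breaks, run_start and offsets are only for illustration.

```r
x <- c(2, 1, 1, 3, 3, 3, 4)
n <- length(x)

# Positions where the next value differs, i.e. the last index of each run
# except the final one
breaks <- which(x[-1L] != x[-n])
print(breaks)      # 1 3 6

# Run boundaries: 0, the ends of the runs, and n
p <- c(0L, breaks, n)
print(p)           # 0 1 3 6 7

# Start index of the run each element belongs to
run_start <- rep.int(p[-length(p)] + 1L, diff(p))
print(run_start)   # 1 2 2 4 4 4 7

# Position within the run (0-based), scaled to steps of 0.01
offsets <- (seq_len(n) - run_start) / 100
res <- x + offsets
print(res)         # 2.00 1.00 1.01 3.00 3.01 3.02 4.00
```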
 
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

Roman Luštrik

2010-Nov-26 19:21 UTC

See if this works for you.

a <- c(2,1,1,3,3,3,4)
a.fac <- as.factor(a)
b <- split(a, f = a.fac)
system.time(lapply(X = b, FUN = function(x) {
    swn <- seq(from = 0, to = 0 + 0.01*length(x), by = 0.01)
    out <- x + swn
    return(out)
}))

Cheers,
Roman
-- 
View this message in context:
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3060801.html
Sent from the R help mailing list archive at Nabble.com.

Roman Luštrik

2010-Nov-26 19:25 UTC

Oops, tiny mistake. Try

lapply(X = b, FUN = function(x) {
    swn <- seq(from = 0, to = (0 + 0.01*length(x))-0.01, by = 0.01)
    out <- x + swn
    return(out)
})
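
The lapply() call above returns a list grouped by value. If a single vector in the original order is wanted (an assumption about the goal), one option is unsplit(), sketched here with offsets computed as (seq_along(x) - 1)/100 instead of seq():

```r
a <- c(2, 1, 1, 3, 3, 3, 4)
a.fac <- as.factor(a)

# Split into one group per distinct value
b <- split(a, f = a.fac)

# Add 0, 0.01, 0.02, ... within each group
shifted <- lapply(b, function(x) x + (seq_along(x) - 1) / 100)

# Put the groups back into the original positions
res <- unsplit(shifted, a.fac)
print(res)   # 2.00 1.00 1.01 3.00 3.01 3.02 4.00
```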
-- 
View this message in context:
http://r.789695.n4.nabble.com/help-program-efficiency-tp3059079p3060806.html
Sent from the R help mailing list archive at Nabble.com.
