Matthew Keller
2011-May-30 00:10 UTC
[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())
Hi all,
I'm full of questions today :). Thanks in advance for your help!
Here's the problem:
x <- c('18x.6','12x.9','302x.3')
I want to get a vector that is c('18x','12x','302x')
This is easily done using this code:
unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1]))
So far so good. The problem is that x is a vector of length 132e6.
When I run the above code, it runs for more than 30 minutes and takes
more than 23 GB of RAM (no kidding!).
Does anyone have ideas about how to speed up the code above and (more
importantly) reduce the RAM footprint? I'd prefer not to change the
file on disk using, e.g., awk, but I will do that as a last resort.
Best
Matt
--
Matthew C Keller
Asst. Professor of Psychology
University of Colorado at Boulder
www.matthewckeller.com
jim holtman
2011-May-30 00:40 UTC
[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())
Try this approach:

> x <- c('18x.6','12x.9','302x.3')
> gsub("^(.*)\\..*", '\\1', x)
[1] "18x" "12x" "302x"

--
Jim Holtman
Data Munger Guru
What is the problem that you are trying to solve?
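A note on the pattern above: `.*` is greedy, so "^(.*)\\..*" keeps everything up to the last dot, whereas the strsplit() version keeps only the part before the first dot. The two agree on the sample data. The sketch below adds one made-up value with two dots (an assumption, not something from the original post) to show where they would differ, together with a sub() variant that cuts at the first dot.

```r
# Sample data from the thread, plus one hypothetical value containing two dots
x <- c('18x.6', '12x.9', '302x.3', '7x.1.2')

# Greedy pattern: keeps everything before the LAST dot
greedy <- gsub("^(.*)\\..*", "\\1", x)

# Removes everything from the FIRST dot onward
first_dot <- sub("\\..*$", "", x)

# Reference result from the original strsplit() approach
reference <- unlist(lapply(strsplit(x, ".", fixed = TRUE), function(p) p[1]))

print(greedy)
print(first_dot)
print(identical(first_dot, reference))
```

If the real data never contain more than one dot per element, either pattern should give the same answer.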
Joshua Wiley
2011-May-30 00:41 UTC
[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())
Hi Matt,
There are likely more efficient ways still, but this is a big
performance boost time-wise for me:
x <- c('18x.6','12x.9','302x.3')
gsub("\\.(.+$)", "", x)
x <- rep(x, 10^5)
> system.time(out1 <-
+   unlist(lapply(strsplit(x,".",fixed=TRUE),function(x) x[1])))
   user  system elapsed
   2.89    0.03    2.96
> system.time(out2 <- gsub("\\.(.+$)", "", x))
   user  system elapsed
   0.57    0.00    0.59
> all.equal(out1, out2)
[1] TRUE
Cheers,
Josh
--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
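One further option, not proposed in the thread, is to skip regular expressions altogether: locate the first literal dot with regexpr(fixed = TRUE) and cut with substr(). Like gsub(), this avoids building the large intermediate list that strsplit() returns. Whether it is actually faster or lighter on a vector of length 132e6 has not been measured here, so treat it as a sketch to benchmark rather than a recommendation. The name out3 is only illustrative.

```r
x <- c('18x.6', '12x.9', '302x.3')

# Position of the first literal dot in each element (-1 where there is none)
pos <- as.integer(regexpr(".", x, fixed = TRUE))

# Elements without a dot are kept whole
no_dot <- pos == -1L
pos[no_dot] <- nchar(x[no_dot]) + 1L

# Keep the characters before the first dot
out3 <- substr(x, 1L, pos - 1L)
print(out3)

# Compare with the original strsplit() approach
out1 <- unlist(lapply(strsplit(x, ".", fixed = TRUE), function(p) p[1]))
print(identical(out1, out3))
```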
Timothy Bates
2011-May-30 01:38 UTC
[R] ideas about how to reduce RAM & improve speed in trying to use lapply(strsplit())
Hi Matt,
Though it's the last solution on your list, I would treat this as a
text editing problem: just find and replace "\.[0-9]", then read in
the result.
perl -pi -e 's/\.[0-9]//g' *test.txt
This will likely be done in seconds.
But other R solutions seem to be coming in in a fairly timely manner too.
t
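If changing the file on disk is to be avoided, a middle road is to read it in chunks and strip the suffix from each chunk before reading the next, so that the full-length strings are never all in memory at once. The sketch below is only an illustration: the thread does not describe the file layout, so it assumes one value per line, writes a small temporary file as a stand-in, and uses hypothetical names (path, chunk_size, pieces) and a deliberately tiny chunk size.

```r
# Hypothetical stand-in for the real file: one value per line
path <- tempfile(fileext = ".txt")
writeLines(rep(c('18x.6', '12x.9', '302x.3'), 1000), path)

chunk_size <- 500L  # illustrative; a real run would use a much larger chunk
con <- file(path, open = "r")
pieces <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0L) break
  # Drop everything from the first dot onward; only the short prefix is kept
  pieces[[length(pieces) + 1L]] <- sub("\\..*$", "", lines)
}
close(con)

# Combine the per-chunk results into one character vector
out <- unlist(pieces, use.names = FALSE)
print(length(out))
print(head(out, 3))

unlink(path)
```

The original file is left untouched, and the chunk size can be tuned to trade speed against peak memory.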