Hello,
I want to do the following: given a set of (number, value) pairs, I want
to create a list l so that l[[toString(number)]] returns the vector of
values associated with that number. My R code is hundreds of times slower
than the equivalent that I would write in Python. I'm pretty new to R so I
bet I'm using its data structures inefficiently, but I've tried more or
less everything I can think of and can't really speed it up. I have done
some profiling which helped me find problem areas, but I couldn't speed
things up even with that information. I'm thinking I'm just
fundamentally using R in a silly way.
I've included code for the different versions. I wrote the Python code
in a style to make it as clear to R programmers as possible. Thanks a
lot! Any help would be greatly appreciated!
Cheers,
Thomas
R code (with two versions depending on commenting):
-----
# Build 50,000 keys: five draws of 10,000 distinct numbers from 1:30000
numbers <- numeric(0)
for (i in 1:5) {
  numbers <- c(numbers, sample(1:30000, 10000))
}
# One random value in 1:10 per key
values <- numeric(0)
for (i in 1:length(numbers)) {
  values <- append(values, sample(1:10, 1))
}
starttime <- Sys.time()
d = list()
for (i in 1:length(numbers)) {
  number = toString(numbers[i])
  value = values[i]
  # First time this key is seen: start a new vector; otherwise extend it
  if (is.null(d[[number]])) {
  #if (!(number %in% names(d))) {
    d[[number]] <- c(value)
  } else {
    d[[number]] <- append(d[[number]], value)
  }
}
endtime <- Sys.time()
print(format(endtime - starttime))
-----
is.null() test (the active line): "45.64791 secs"
%in% test (the commented-out line swapped in): "1.423056 mins"
Another version of R code:
-----
numbers <- numeric(0)
for (i in 1:5) {
  numbers <- c(numbers, sample(1:30000, 10000))
}
values <- numeric(0)
for (i in 1:length(numbers)) {
  values <- append(values, sample(1:10, 1))
}
starttime <- Sys.time()
d = list()
for (number in unique(numbers)) {
  d[[toString(number)]] <- numeric(0)
}
for (i in 1:length(numbers)) {
  number = toString(numbers[i])
  value = values[i]
  d[[number]] <- append(d[[number]], value)
}
endtime <- Sys.time()
print(format(endtime - starttime))
-----
"47.15579 secs"
The python code:
-----
import random
import time
numbers = []
for i in range(5):
    numbers += random.sample(range(30000), 10000)
values = []
for i in range(len(numbers)):
    values.append(random.randint(1, 10))
starttime = time.time()
d = {}
for i in range(len(numbers)):
    number = numbers[i]
    value = values[i]
    if d.has_key(number):
        d[number].append(value)
    else:
        d[number] = [value]
endtime = time.time()
print endtime - starttime, "seconds"
-----
0.123021125793 seconds
Hi,
perhaps pre-generating the list before processing would speed it up
significantly, though it may still be slower than Python. E.g. try
something like:
d = as.list(rep(NA, length(numbers)))
rather than:
d = list()
Olivier.
On Thu, 30 Oct 2014 11:17:59 -0400 Thomas Nyberg <tomnyberg at gmail.com> wrote:
> [...]
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Olivier Crouzet, PhD
Laboratoire de Linguistique -- EA3827
Université de Nantes
Chemin de la Censive du Tertre - BP 81227
44312 Nantes cedex 3
France
phone: (+33) 02 40 14 14 05 (lab.)
       (+33) 02 40 14 14 36 (office)
fax: (+33) 02 40 14 13 27
e-mail: olivier.crouzet at univ-nantes.fr
http://www.lling.univ-nantes.fr/
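As a rough sketch of the pre-allocation idea above (not Olivier's exact code): the list is created once, with one slot per distinct number, and each pair is routed to its slot by integer position instead of by name. The tiny numbers/values vectors and the names keys and idx are made up for illustration only.

```r
# Toy data standing in for the 50,000 pairs (illustrative values only)
numbers <- c(3, 1, 3, 2, 1)
values  <- c(10, 20, 30, 40, 50)

keys <- unique(numbers)              # distinct numbers, in order of first appearance
d <- vector("list", length(keys))    # pre-allocated list: one NULL slot per key
names(d) <- as.character(keys)

idx <- match(numbers, keys)          # integer slot for every pair, computed once
for (i in seq_along(numbers)) {
  # c(NULL, x) is just x, so the first value for a key starts its vector
  d[[idx[i]]] <- c(d[[idx[i]]], values[i])
}

d[["3"]]
```

The per-key vectors are still grown one element at a time here, so this only removes the cost of growing the list itself and of looking slots up by name.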
Look at the sqldf or data.table packages. Lists are slow for lookup and not
particularly efficient with memory. Numeric indexing into matrices or data
frames is more typical in R, and the above-mentioned packages support indexing
to speed up lookups. Also, carefully consider whether you can do your
processing in bulk: vector or relational processing can be critical for
performance.
---------------------------------------------------------------------------
Jeff Newmiller
DCN:<jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
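A minimal base-R sketch of the "process in bulk" advice above, assuming nothing beyond base R (it does not use sqldf or data.table). The test data is generated with whole-vector calls instead of growing loops, and the pairs are kept as two columns of a data frame. The names pairs and values_for, and the seed, are made up for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Five draws of 10,000 distinct numbers, built in one call rather than a growing loop
numbers <- as.vector(replicate(5, sample(1:30000, 10000)))

# All 50,000 values drawn in a single vectorized call
values <- sample(1:10, length(numbers), replace = TRUE)

# Keep the pairs as columns and select values by comparing the whole column at once
pairs <- data.frame(number = numbers, value = values)
values_for <- function(n) pairs$value[pairs$number == n]

values_for(numbers[1])
```

Each call to values_for still scans the whole column; the keyed indexing that the packages mentioned above provide is meant to avoid that scan.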
Repeatedly extending vectors takes a lot of time. You can do what you want with
d2 <- split(values, factor(numbers, levels=unique(numbers)))
If you would like the labels on d2 to be in numeric order then you can
simplify that to
d3 <- split(values, numbers)
Bill Dunlap
TIBCO Software
wdunlap tibco.com
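To make the two split() calls above concrete, here is a small sketch on toy data (the five pairs below are made up for illustration). It shows that both calls group the same values and differ only in the order of the resulting names.

```r
# Toy data standing in for the 50,000 pairs (illustrative values only)
numbers <- c(3, 1, 3, 2, 1)
values  <- c(10, 20, 30, 40, 50)

# Groups in order of first appearance of each number
d2 <- split(values, factor(numbers, levels = unique(numbers)))

# Groups in sorted order of the numbers
d3 <- split(values, numbers)

names(d2)   # "3" "1" "2"
names(d3)   # "1" "2" "3"
d3[["3"]]   # 10 30
```

Because split() names the groups by factor level, d3[[toString(number)]] gives the same kind of lookup as the list built in the original loop, without extending any vector inside a loop.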