Hello,

I want to do the following: Given a set of (number, value) pairs, I want to create a list l so that l[[toString(number)]] returns the vector of values associated to that number. It is hundreds of times slower than the equivalent that I would write in python. I'm pretty new to R so I bet I'm using its data structures inefficiently, but I've tried more or less everything I can think of and can't really speed it up. I have done some profiling which helped me find problem areas, but I couldn't speed things up even with that information. I'm thinking I'm just fundamentally using R in a silly way.

I've included code for the different versions. I wrote the python code in a style to make it as clear to R programmers as possible. Thanks a lot! Any help would be greatly appreciated!

Cheers,
Thomas


R code (with two versions depending on commenting):

-----

numbers <- numeric(0)
for (i in 1:5) {
    numbers <- c(numbers, sample(1:30000, 10000))
}

values <- numeric(0)
for (i in 1:length(numbers)) {
    values <- append(values, sample(1:10, 1))
}

starttime <- Sys.time()

d = list()
for (i in 1:length(numbers)) {
    number = toString(numbers[i])
    value = values[i]
    if (is.null(d[[number]])) {
    #if (number %in% names(d)) {
        d[[number]] <- c(value)
    } else {
        d[[number]] <- append(d[[number]], value)
    }
}

endtime <- Sys.time()

print(format(endtime - starttime))

-----

uncommented version: "45.64791 secs"
commented version: "1.423056 mins"


Another version of R code:

-----

numbers <- numeric(0)
for (i in 1:5) {
    numbers <- c(numbers, sample(1:30000, 10000))
}

values <- numeric(0)
for (i in 1:length(numbers)) {
    values <- append(values, sample(1:10, 1))
}

starttime <- Sys.time()

d = list()
for (number in unique(numbers)) {
    d[[toString(number)]] <- numeric(0)
}
for (i in 1:length(numbers)) {
    number = toString(numbers[i])
    value = values[i]
    d[[number]] <- append(d[[number]], value)
}

endtime <- Sys.time()

print(format(endtime - starttime))

-----

"47.15579 secs"


The python code:

-----

import random
import time

numbers = []
for i in range(5):
    numbers += random.sample(range(30000), 10000)

values = []
for i in range(len(numbers)):
    values.append(random.randint(1, 10))

starttime = time.time()

d = {}
for i in range(len(numbers)):
    number = numbers[i]
    value = values[i]
    if d.has_key(number):
        d[number].append(value)
    else:
        d[number] = [value]

endtime = time.time()

print endtime - starttime, "seconds"

-----

0.123021125793 seconds

Hi,

Pre-generating the list before processing might speed it up significantly, though it may still be slower than Python. For example, try something like:

d = as.list(rep(NA, length(numbers)))

rather than:

d = list()

Olivier.

On Thu, 30 Oct 2014 11:17:59 -0400 Thomas Nyberg <tomnyberg at gmail.com> wrote:
> Hello,
> [...]
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Olivier Crouzet, PhD
Laboratoire de Linguistique -- EA3827
Université de Nantes
Chemin de la Censive du Tertre - BP 81227
44312 Nantes cedex 3
France
phone: (+33) 02 40 14 14 05 (lab.)
       (+33) 02 40 14 14 36 (office)
fax:   (+33) 02 40 14 13 27
e-mail: olivier.crouzet at univ-nantes.fr
http://www.lling.univ-nantes.fr/
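
The preallocation idea above can be sketched as follows. This is a variant of the suggestion, not Olivier's exact code: it assumes one slot per distinct number (rather than one per input pair) and uses `match` to resolve each number to an integer position up front, so the loop never looks anything up by name. The small data set and the name `slot` are illustrative assumptions.

```r
# Small stand-in data (hypothetical; the thread uses 50,000 pairs).
set.seed(1)
numbers <- sample(1:30, 100, replace = TRUE)
values  <- sample(1:10, 100, replace = TRUE)

keys <- unique(numbers)

# Preallocate one (initially NULL) slot per distinct key instead of
# growing the list as new keys are encountered.
d <- vector("list", length(keys))
names(d) <- as.character(keys)

# Resolve every number to its integer slot once, so the loop indexes
# by position rather than by name on each iteration.
slot <- match(numbers, keys)

for (i in seq_along(numbers)) {
    d[[slot[i]]] <- c(d[[slot[i]]], values[i])
}
```

After the loop, `d[[toString(k)]]` returns the values recorded for `k`, in input order. The per-key vectors are still grown one element at a time, so this only removes part of the cost.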

Look at the sqldf or data.table packages. Lists are slow for lookup and not particularly efficient with memory. Numeric indexing into matrices or data frames is more typical in R, and the above-mentioned packages support indexing to speed up lookups.

Also, carefully consider whether you can program your processing in bulk... vector or relational processing can be critical for performance.

---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.

On October 30, 2014 8:17:59 AM PDT, Thomas Nyberg <tomnyberg at gmail.com> wrote:
> Hello,
> [...]
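
As a rough illustration of the "bulk" style described above, here is a minimal sketch in base R only. It does not use sqldf or data.table, whose keyed lookups are what the reply actually recommends. The helper name `vals_for` and the choice of `sum` as the per-number summary are assumptions made for the example.

```r
set.seed(1)
n <- 50000

# Vectorized generation replaces the element-by-element append loops.
# replicate() returns a 10000 x 5 matrix; as.vector() flattens it.
numbers <- as.vector(replicate(5, sample(1:30000, 10000)))
values  <- sample(1:10, n, replace = TRUE)

df <- data.frame(number = numbers, value = values)

# Hypothetical helper: look up all values for one number by logical
# indexing on a column, with no per-key list at all.
vals_for <- function(k) df$value[df$number == k]

# One bulk call computes a per-number summary for every key at once.
totals <- tapply(df$value, df$number, sum)
```

Each call to `vals_for` scans the whole column, so it suits occasional lookups; for many lookups a grouped structure or an indexed table is the better fit.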

Repeatedly extending vectors takes a lot of time. You can do what you want with

d2 <- split(values, factor(numbers, levels=unique(numbers)))

If you would like the labels on d2 to be in numeric order then you can simplify that to

d3 <- split(values, numbers)

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Oct 30, 2014 at 8:17 AM, Thomas Nyberg <tomnyberg at gmail.com> wrote:
> Hello,
> [...]
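
To make the difference between the two `split` calls concrete, here is a small worked example. The six-element input is made up for illustration; only the `split` calls come from the reply.

```r
# Hypothetical toy input: the key 3 appears first, then 1, then 2.
numbers <- c(3, 1, 3, 2, 1, 3)
values  <- c(10, 20, 30, 40, 50, 60)

# Groups labelled in order of first appearance.
d2 <- split(values, factor(numbers, levels = unique(numbers)))
names(d2)    # "3" "1" "2"

# Groups labelled in sorted order.
d3 <- split(values, numbers)
names(d3)    # "1" "2" "3"

# Lookup by the string form of the number, as in the original question.
d3[[toString(3)]]    # 10 30 60
```

Both results hold the same groups; only the order of the list elements differs, so name-based lookup works the same way on either.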