Hi, I'm faced with the following problem and would appreciate some
advice.

I could have a data frame x that looks like this:

      aa  bb
  a    1 "A"
  b    2 "B"

The advantage of this is that I could access all the individual
components easily. Also I could access all the rows and columns
easily.

Alternatively, I could have a list of lists that looks like this:

  xprime <- list()
  xprime$a <- list()
  xprime$b <- list()

  xprime$a$aa <- 1
  xprime$a$bb <- "A"

  xprime$b$aa <- 2
  xprime$b$bb <- "B"

etc.

If speed is important, would a list of lists be faster than a data
frame? (I know, for example, that scan is supposed to be faster than
read.table, but I don't know if that is related to issues with data
frames.)

My problem with a list of lists, though, is that if I want to access
all the bb subcomponents, a naive method like this one failed:

  y <- c( "a", "b" )
  xprime[[ y ]]$bb    (Does not work)

So to get all the bb subcomponents I seem to need to loop, which may
slow things down (presumably). But maybe people here know of a way.

Finally what would be the "best" way given the constraint of quick
access to all rows, columns and individual components?

I'd appreciate your thoughts and comments. Thanks very much.
Roger Peng
2003-May-01 01:25 UTC
[R] List of lists? Data frames? (Or other data structures?)
If you're talking about rows and columns, it seems like the appropriate
data structure for you is the data frame. I think your list of lists
representation might get unwieldy after a while. I can't really think of
why a data frame would be any slower than a list of lists -- I've never
experienced such behavior.

read.table() may be a little slower than scan() because read.table() reads
in an entire file and then converts each of the columns into an
appropriate data class. So there is some post-processing going on. It
doesn't have anything to do with data frames vs. lists.

-roger
_______________________________
UCLA Department of Statistics
http://www.stat.ucla.edu/~rpeng

On Thu, 1 May 2003, R A F wrote:

> Hi, I'm faced with the following problem and would appreciate some
> advice.
>
> I could have a data frame x that looks like this:
>       aa  bb
>   a    1 "A"
>   b    2 "B"
>
> The advantage of this is that I could access all the individual
> components easily. Also I could access all the rows and columns
> easily.
>
> Alternatively, I could have a list of lists that looks like this:
>
> xprime <- list()
> xprime$a <- list()
> xprime$b <- list()
>
> xprime$a$aa <- 1
> xprime$a$bb <- "A"
>
> xprime$b$aa <- 2
> xprime$b$bb <- "B"
>
> etc.
>
> If speed is important, would a list of lists be faster than a data
> frame? (I know, for example, that scan is supposed to be faster than
> read.table, but I don't know if that is related to issues with data
> frames.)
>
> My problem with a list of lists, though, is that if I want to access
> all the bb subcomponents, a naive method like this one failed:
>
> y <- c( "a", "b" )
> xprime[[ y ]]$bb    (Does not work)
>
> So to get all the bb subcomponents I seem to need to loop, which may
> slow things down (presumably). But maybe people here know of a way.
>
> Finally what would be the "best" way given the constraint of quick
> access to all rows, columns and individual components?
>
> I'd appreciate your thoughts and comments. Thanks very much.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
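
To make the comparison concrete, here is a minimal sketch of both representations from the question. It reuses the names x, xprime and y from the original post; bb_all is a name introduced only for this illustration. The sketch assumes that xprime[[ y ]] fails to select both elements because [[ with a vector of length two is treated as recursive indexing (roughly xprime[["a"]][["b"]]), and it swaps in sapply() to visit each element instead.

```r
# Data frame version of the example from the original post.
x <- data.frame(aa = c(1, 2), bb = c("A", "B"),
                row.names = c("a", "b"), stringsAsFactors = FALSE)

x["a", ]       # one row
x$bb           # one column
x["b", "aa"]   # one individual component

# List-of-lists version of the same data.
xprime <- list(a = list(aa = 1, bb = "A"),
               b = list(aa = 2, bb = "B"))

# Pull the bb subcomponent out of every selected element.
# sapply() still loops internally, but the call stays compact.
y <- c("a", "b")
bb_all <- sapply(xprime[y], function(el) el$bb)
```

With the data frame, a whole column is a single vector, so x$bb needs no loop at all; with the list of lists, every column-wise access has to walk over the elements.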
Thanks for your comments. I'm not too familiar with these differences,
but here's a simple experiment.

In a data file with 139,000 rows and 5 columns (double string double
double double),

>system.time( aaa <- read.table( "file" ) )
20.67 0.41 21.10 0.00 0.00

>system.time( aaa <- scan( "file", list( 0, "", 0, 0, 0 ) ) )
6.07 0.01 6.09 0.00 0.00

It seems like scan is much faster -- and as the data file grows,
read.table seems to choke. (I actually tried this with a data file
with over 2 million rows.)

I'm using a Sun-Sparc, Solaris 2.8 and R 1.5.1. Sorry I can't be more
specific about the hardware/software configurations, not being too
knowledgeable about this sort of thing.

By the way, it's not possible to create a matrix of mixed types, is it?
(I don't know how anyway.)

Any ideas as to the speed differences? Thanks again.

>From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
>To: Roger Peng <rpeng at stat.ucla.edu>
>CC: r-help at stat.math.ethz.ch, R A F <raf1729 at hotmail.com>
>Subject: Re: [R] List of lists? Data frames? (Or other data structures?)
>Date: Thu, 1 May 2003 08:42:55 +0100 (BST)
>
>On Wed, 30 Apr 2003, Roger Peng wrote:
>
> > If you're talking about rows and columns, it seems like the appropriate
> > data structure for you is the data frame. I think your list of lists
> > representation might get unwieldy after a while. I can't really think of
> > why a data frame would be any slower than a list of lists -- I've never
> > experienced such behavior.
> >
> > read.table() may be a little slower than scan() because read.table() reads
> > in an entire file and then converts each of the columns into an
> > appropriate data class. So there is some post-processing going on. It
> > doesn't have anything to do with data frames vs. lists.
>
>Only if you don't specify colClasses: if you do (and you would need the
>information to use scan()) there should be no performance penalty.
>(Note that matrices can be scan()-ed into a vector and the dimensions
>added, and that will be faster.)
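
The remark about scan()-ing a matrix into a vector and adding the dimensions can be sketched as follows. This is only an illustration: the temporary file, its contents, and the names tmp, v and m are made up here, and the file is assumed to be all numeric with a known number of columns.

```r
# Hypothetical all-numeric file with 3 columns, written to a temp file.
tmp <- tempfile()
writeLines(c("1 2 3", "4 5 6"), tmp)

# scan() reads the values into one numeric vector, row by row.
v <- scan(tmp, quiet = TRUE)

# Add the dimensions afterwards; byrow = TRUE restores the file layout.
m <- matrix(v, ncol = 3, byrow = TRUE)

unlink(tmp)
```

A matrix holds a single atomic type, so this route fits all-numeric (or all-character) files. The mixed double/string file discussed in the thread would still need a list or a data frame.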
Ah, thanks!

(It's not that I didn't read it -- I didn't understand it and so
I thought that it'd be easier to ask again. Thanks very much!)

>From: Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk>
>To: "R A F" <raf1729 at hotmail.com>
>CC: ripley at stats.ox.ac.uk, rpeng at stat.ucla.edu, r-help at stat.math.ethz.ch
>Subject: Re: [R] List of lists? Data frames? (Or other data structures?)
>Date: 01 May 2003 14:19:32 +0200
>
>You're not taking Brian's hint!:
For what it's worth, I followed the suggestion of using colClasses:

  cls <- c( "numeric", "character", "numeric", "numeric", "numeric" )
  system.time( bbb <- read.table( "file", colClasses = cls ) )

Here are the results from three tries:

  8.21 0.06 8.28 0.00 0.00
  8.94 0.10 9.10 0.00 0.00
  8.55 0.06 8.69 0.00 0.00

I also did

  system.time( aaa <- scan( "file", list( 0, "", 0, 0, 0 ) ) )

three times:

  6.46 0.04 6.59 0.00 0.00
  5.27 0.04 5.33 0.00 0.00
  5.14 0.05 5.19 0.00 0.00

By the way, I did the experiment in the order bbb, aaa, bbb, aaa, bbb,
aaa.

So it appears that read.table is still a little slower -- but it could
be just me doing something wrong.

Thanks.

>From: "R A F" <raf1729 at hotmail.com>
>To: p.dalgaard at biostat.ku.dk
>CC: r-help at stat.math.ethz.ch, rpeng at stat.ucla.edu, ripley at stats.ox.ac.uk
>Subject: Re: [R] List of lists? Data frames? (Or other data structures?)
>Date: Thu, 01 May 2003 12:20:57 +0000
>
>Ah, thanks!
>
>(It's not that I didn't read it -- I didn't understand it and so
>I thought that it'd be easier to ask again. Thanks very much!)
>
>>From: Peter Dalgaard BSA <p.dalgaard at biostat.ku.dk>
>>To: "R A F" <raf1729 at hotmail.com>
>>CC: ripley at stats.ox.ac.uk, rpeng at stat.ucla.edu, r-help at stat.math.ethz.ch
>>Subject: Re: [R] List of lists? Data frames? (Or other data structures?)
>>Date: 01 May 2003 14:19:32 +0200
>>
>>You're not taking Brian's hint!:
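
For anyone wanting to repeat the comparison, here is a small self-contained sketch under stated assumptions. The file is generated on the fly with made-up data (1000 rows rather than 139,000) in the same double/string/double/double/double layout; the names tmp, dat, t_rt, t_sc and aaa_df are introduced only for this illustration. Timings will differ by machine and R version, so none are shown.

```r
# Build a hypothetical 5-column test file: double, string, double, double, double.
set.seed(1)
n <- 1000
dat <- data.frame(v1 = runif(n),
                  v2 = sample(LETTERS, n, replace = TRUE),
                  v3 = runif(n), v4 = runif(n), v5 = runif(n),
                  stringsAsFactors = FALSE)
tmp <- tempfile()
write.table(dat, tmp, quote = FALSE, row.names = FALSE, col.names = FALSE)

# read.table() with the column classes given up front.
cls <- c("numeric", "character", "numeric", "numeric", "numeric")
t_rt <- system.time(bbb <- read.table(tmp, colClasses = cls))

# scan() with a template list describing the same column types.
t_sc <- system.time(aaa <- scan(tmp, list(0, "", 0, 0, 0), quiet = TRUE))

# scan() returns a plain list; name it and convert to get a data frame.
names(aaa) <- names(bbb)
aaa_df <- as.data.frame(aaa, stringsAsFactors = FALSE)

unlink(tmp)
```

Both routes end up with the same columns, so any remaining timing difference comes from how the file is read rather than from the data structure it is stored in.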