Dear Folks--

Is there a data frame analog to sparse matrices? I am working with a panel data set that has a large number of variables that are redefined repeatedly or exist for only a few years (out of 48). In my current structure, where variables are columns and rows are years, more than 90 percent of the cells, and more than 3/4 of the total size of my file, are NAs.

I am wondering whether there is an alternative file specification currently available, short of using a database, that still allows numeric, character and factor data to be stored.

A pointer in the right direction (or a solid "no" if that is the truth) would be greatly appreciated.

Sincerely, andrewH

--
View this message in context: http://r.789695.n4.nabble.com/Sparse-dataframes-tp4655614.html
Sent from the R help mailing list archive at Nabble.com.
andrewH wrote:
> Is there a data frame analog to sparse matrices? I am working with a panel
> data set that has a large number of variables that are redefined
> repeatedly or exist for only a few years (out of 48). In my current
> structure, where variables are columns and rows are years, more than 90
> percent of the cells and more than 3/4 of the total size of my file are
> NAs.
>
> I am wondering if there is an alternate file specification currently
> available that still allows numeric, character and factor data to be
> stored. Besides just using a database.

How about storing the data in a "long" format, like you get when you apply melt() (with na.rm=TRUE) from the "reshape2" package to your data frame? Parts of the data frame (the ID part) will be repeated on each row, which may make the data take up more space, but no rows are stored for NA cells, so for somewhat sparse data it will be a win. It also makes it very easy to reshape and analyse the data.

Here's an introduction (to the older "reshape" package, but "reshape2" is very similar):
http://www.jstatsoft.org/v21/i12

You might also be interested in this paper on "tidy" data:
http://vita.had.co.nz/papers/tidy-data.pdf

--
Karl Ove Hufthammer
E-mail: karl at huftis.org
Jabber: huftis at jabber.no
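The melt() suggestion above can be sketched as follows. This is only a minimal illustration: the data frame `wide` and its column names (`gdp_a`, `gdp_b`, `pop`) are made up for the example and are not from the original poster's data.

```r
library(reshape2)

# Hypothetical panel: one row per year, variables that exist only in some years.
wide <- data.frame(
  year  = 2001:2006,
  gdp_a = c(1.1, 1.2, NA, NA, NA, NA),   # defined only for 2001-2002
  gdp_b = c(NA, NA, 2.1, 2.2, 2.3, NA),  # redefined version, 2003-2005
  pop   = c(NA, NA, NA, NA, 5.0, 5.1)    # exists only from 2005 on
)

# Melt to long format; na.rm = TRUE drops every row whose value is NA.
long <- melt(wide, id.vars = "year", na.rm = TRUE)

print(long)                   # columns: year, variable, value
nrow(long)                    # number of rows kept in the long data
sum(!is.na(wide[-1]))         # number of non-missing cells in the wide data
```

The two counts at the end should agree, since each non-missing cell becomes one row. One caveat: a single `value` column can hold only one type, so if the measured columns mix numeric and character data, it may be necessary to melt each type separately.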
Hi Karl! Thanks for writing!

Doesn't this format require a column for a factor for every variable present in any observation, whether or not that variable is present in the observation in question? I think I would end up with data that consists mainly of columns of variables that are NA for all but a few years. But let me take a closer look at the data format and see.

Again, thanks!
--andrewH

On Tue, Jan 15, 2013 at 9:22 AM, andrewH <ahoerner@rprogress.org> wrote:
> Is there a data frame analog to sparse matrices?

--
J. Andrew Hoerner
Director, Sustainable Economics Program
Redefining Progress
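One way to check the question above is to count columns before and after melting. The sketch below uses simulated data (48 years, 30 hypothetical variables named `v1` to `v30`, each present for 3 consecutive years); none of these names or numbers come from the real data set.

```r
library(reshape2)

set.seed(1)
years  <- 1961:2008                        # 48 years, chosen arbitrarily
n_vars <- 30                               # hypothetical number of variables

# Build a sparse wide data frame: each variable has values in only 3 years.
wide <- data.frame(year = years)
for (j in seq_len(n_vars)) {
  x <- rep(NA_real_, length(years))
  start <- sample(length(years) - 2, 1)    # first year with data
  x[start:(start + 2)] <- rnorm(3)
  wide[[paste0("v", j)]] <- x
}

# Long format: the variable names move into ONE factor column.
long <- melt(wide, id.vars = "year", na.rm = TRUE)

dim(wide)   # one column per variable, plus year
dim(long)   # always three columns: year, variable, value

# Cast back to wide when an analysis needs it; absent cells become NA again.
back <- dcast(long, year ~ variable, value.var = "value")
```

In this sketch the long data has three columns however many variables exist, and a variable that is missing in a given year simply has no row for that year. Note that years with no data at all do not appear in `back`, because nothing in `long` records them.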