Marshall Feldman
2010-Apr-09 12:59 UTC
[R] Beyond reshape: automatically streamlining data
Hello,

I've been very impressed by the reshape package and how easy it makes reorganizing statistical data structures. This makes me wonder if there's another package out there that addresses another set of tasks that one often does when preparing data for analysis.

For any particular set of analyses, one typically recodes variables and deletes cases and variables. It would be really nice to have a package that, for example, if one selected cases from a larger data set based on the values of certain variables, would inspect the resulting data and drop any variables that have the same value for all cases. Similarly, if any cases are entirely zero or NA, the package could (under user control) drop these cases. Finally, it could take a set of data transformations and keep them as an object, so that the same selection/reshape/streamlining can easily be applied to similar data sets.

My motivation for this came from working with employment data this morning. I started out with 11 variables and 35569 cases for Rhode Island; a few selections later I had only 420 cases and 3 variables. It struck me that the process I went through, which included not only making selections but also inspecting the results and deleting unnecessary cases/variables, could be automated, at least to eliminate the inspection step. Also, since I want to do the same thing with data for other states, automation would be very nice indeed.

I realize that programming this kind of stuff in R is relatively easy, but the reshape package makes me wonder if someone has already done it.

Thanks,
Marsh Feldman
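For readers following the thread, the streamlining described above could be sketched in base R roughly as follows. This is only an illustration, not an existing package: the names `streamline`, `apply_pipeline`, `ri_steps`, and the toy `emp` data frame are all hypothetical, and the sketch assumes that "entirely zero or NA" means every remaining column is NA or numeric zero.

```r
# Hypothetical helper (not from any package): drop variables that are
# constant across all cases, then optionally drop cases that are entirely
# zero or NA in the remaining variables.
streamline <- function(df, drop_empty_rows = TRUE) {
  # With a single case every variable is trivially constant, so skip then.
  if (nrow(df) > 1) {
    varies <- vapply(df, function(col) length(unique(col)) > 1, logical(1))
    df <- df[, varies, drop = FALSE]
  }
  if (drop_empty_rows && nrow(df) > 0 && ncol(df) > 0) {
    # A cell counts as "empty" if it is NA, or numeric and equal to zero.
    empty <- do.call(cbind, lapply(df, function(col) {
      if (is.numeric(col)) is.na(col) | col == 0 else is.na(col)
    }))
    df <- df[rowSums(empty) < ncol(df), , drop = FALSE]
  }
  df
}

# Keep a sequence of transformations as an object (a list of functions)
# so the same steps can be replayed on similar data sets.
apply_pipeline <- function(df, steps) {
  Reduce(function(d, step) step(d), steps, df)
}

# Toy data, invented for illustration only.
emp <- data.frame(
  state  = c("RI", "RI", "RI", "MA"),
  year   = c(2009, 2009, 2009, 2009),
  sector = c("mfg", NA, "gov", "mfg"),
  jobs   = c(120, 0, 45, 300),
  stringsAsFactors = FALSE
)

ri_steps <- list(
  function(d) d[d$state == "RI", , drop = FALSE],  # selection
  streamline                                       # automatic clean-up
)

result <- apply_pipeline(emp, ri_steps)
print(result)
```

Because `ri_steps` is an ordinary list, the same clean-up can be reused for another state by swapping only the selection function. Note that the sketch makes a single pass: dropping empty cases could leave a further variable constant, which a fuller version might handle by repeating the steps until nothing changes.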
Hi Marshall,

On Fri, Apr 9, 2010 at 8:59 AM, Marshall Feldman <marsh at uri.edu> wrote:
> ...
> For any particular set of analyses, one typically recodes variables and
> deletes cases and variables. It would be really nice to have a package that,
> for example, if one selected cases from a larger data set based on the
> values of certain variables would inspect the resulting data and drop any
> variables that have the same value for all cases. Similarly, if any cases
> are entirely zero or NA, the package could (under user control) drop these
> cases. Finally, it could take a set of data transformations and keep them as
> an object, so that the same selection/reshape/streamlining can easily be
> applied to similar data sets.
> ...

Some of the utilities in the caret package might be related to the things you're after:

http://cran.r-project.org/package=caret

There is a writeup about using caret to build predictive models in R in the Journal of Statistical Software (it's a PDF):

http://www.jstatsoft.org/v28/i05/paper

I'd recommend reading through that if you haven't before, since caret offers many handy wrapper/utility functions. In particular, check out section 3, "Data Preparation", where Max talks about zero-variance predictors and the multicollinearity problem.

Hope that helps.

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
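As a rough illustration of the zero-variance check mentioned in the reply, the sketch below uses caret's `nearZeroVar()` with `saveMetrics = TRUE` to flag constant columns. The `toy` data frame and the `trimmed` name are made up for this example, and the code assumes the caret package is installed; it is skipped otherwise.

```r
# Sketch only: flag and drop zero-variance columns with caret::nearZeroVar().
# Assumes caret is installed; the toy data below is invented for illustration.
if (requireNamespace("caret", quietly = TRUE)) {
  toy <- data.frame(
    const = rep(1, 6),                       # same value for all cases
    x     = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.7),
    y     = c(10, 12, 9, 15, 14, 8)
  )

  # saveMetrics = TRUE returns one row per column, including a logical
  # zeroVar flag for columns with a single distinct value.
  metrics <- caret::nearZeroVar(toy, saveMetrics = TRUE)
  print(metrics)

  # Keep only the columns that are not zero-variance.
  trimmed <- toy[, !metrics$zeroVar, drop = FALSE]
  print(names(trimmed))
}
```

This covers only the "drop variables that have the same value for all cases" part of the original question; dropping empty cases and storing the steps for reuse would still need to be handled separately.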