As noted on the R-project web site itself ( www.r-project.org -> Manuals -> R Data Import/Export ), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing. The R Data Import/Export page recommends examples using SAS, Perl, Python, and Java. It takes a bit of courage to say that ( when you go to a corporate software web site, you'll never see a page saying "This is the type of problem that our product is not the best at, here's what we suggest instead" ). I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates. SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is you don't have thousands of spare dollars. Using Java for data cleaning is an exercise in sado-masochism, Java has a learning curve (almost) as difficult as C++. There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has excellent regular expression facilities. However, for some types of complex demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data - you could do it in Perl, but it would take way, way too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either. There are only three statistical programming languages that are well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning. If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventers don't realize this, I've learned it the hard way. But if you are a volunteer who likes to help out by evaluating, comparing, and reporting upon new candidates, well you could certainly help out R users and the developers of the products by kicking the tires of these products. And there is a huge need for such volunteers. 1. DAP This is an open source implementation of SAS. The founder: Susan Bassein Find it at: directory.fsf.org/math/stats (GNU GPL) 2. PSPP This is an open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are, it reflects the fact that he has only started doing the statistical tests. The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept. Also at : directory.fsf.org/math/stats (GNU GPL) 3. Vilno This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS datastep, and also transposes data and calculates averages and such. (No t-tests or regressions in this version). I created this, during the years 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in my opinion. The tarball includes about 100 or so test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data. The maintenance of Vilno has slowed down, because I am currently (desparately) looking for employment. But once I've found new employment and living quarters and settled in, I will continue to enhance Vilno in my spare time. The founder: that would be me, Robert Wilkins Find it at: code.google.com/p/vilno ( GNU GPL ) ( In particular, the tarball at code.google.com/p/vilno/downloads/list , since I have yet to figure out how to use Subversion ). 4. Who knows? It was not easy to find out about the existence of DAP and PSPP. So who knows what else is out there. However, I think you'll find a lot more statistics software ( regression , etc ) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't one agreed term: data cleaning , data munging , data crunching , or just getting the data ready for analysis.
An additional option for Windows users is Micro Osiris http://www.microsiris.com/ best robert On 6/7/07, Robert Wilkins <irishhacker at gmail.com> wrote:> As noted on the R-project web site itself ( www.r-project.org -> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare > messy and dirty data for analysis with the R tool itself. I've also > seen at least one S programming book (one of the yellow Springer ones) > that says, more briefly, the same thing. > The R Data Import/Export page recommends examples using SAS, Perl, > Python, and Java. It takes a bit of courage to say that ( when you go > to a corporate software web site, you'll never see a page saying "This > is the type of problem that our product is not the best at, here's > what we suggest instead" ). I'd like to provide a few more > suggestions, especially for volunteers who are willing to evaluate new > candidates. > > SAS is fine if you're not paying for the license out of your own > pocket. But maybe one reason you're using R is you don't have > thousands of spare dollars. > Using Java for data cleaning is an exercise in sado-masochism, Java > has a learning curve (almost) as difficult as C++. > > There are different types of data transformation, and for some data > preparation problems an all-purpose programming language is a good > choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has > excellent regular expression facilities. > > However, for some types of complex demanding data preparation > problems, an all-purpose programming language is a poor choice. For > example: cleaning up and preparing clinical lab data and adverse event > data - you could do it in Perl, but it would take way, way too much > time. A specialized programming language is needed. And since data > transformation is quite different from data query, SQL is not the > ideal solution either. > > There are only three statistical programming languages that are > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more > popular than S for data cleaning. > > If you're an R user with difficult data preparation problems, frankly > you are out of luck, because the products I'm about to mention are > new, unknown, and therefore regarded as immature. And while the > founders of these products would be very happy if you kicked the > tires, most people don't like to look at brand new products. Most > innovators and inventers don't realize this, I've learned it the hard > way. > > But if you are a volunteer who likes to help out by evaluating, > comparing, and reporting upon new candidates, well you could certainly > help out R users and the developers of the products by kicking the > tires of these products. And there is a huge need for such volunteers. > > 1. DAP > This is an open source implementation of SAS. > The founder: Susan Bassein > Find it at: directory.fsf.org/math/stats (GNU GPL) > > 2. PSPP > This is an open source implementation of SPSS. > The relatively early version number might not give a good idea of how > mature the > data transformation features are, it reflects the fact that he has > only started doing the statistical tests. > The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept. > Also at : directory.fsf.org/math/stats (GNU GPL) > > 3. Vilno > This uses a programming language similar to SPSS and SAS, but quite unlike S. > Essentially, it's a substitute for the SAS datastep, and also > transposes data and calculates averages and such. (No t-tests or > regressions in this version). I created this, during the years > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in > my opinion. The tarball includes about 100 or so test cases used for > debugging - for logical calculation errors, but not for extremely high > volumes of data. > The maintenance of Vilno has slowed down, because I am currently > (desparately) looking for employment. But once I've found new > employment and living quarters and settled in, I will continue to > enhance Vilno in my spare time. > The founder: that would be me, Robert Wilkins > Find it at: code.google.com/p/vilno ( GNU GPL ) > ( In particular, the tarball at code.google.com/p/vilno/downloads/list > , since I have yet to figure out how to use Subversion ). > > > 4. Who knows? > It was not easy to find out about the existence of DAP and PSPP. So > who knows what else is out there. However, I think you'll find a lot > more statistics software ( regression , etc ) out there, and not so > much data transformation software. Not many people work on data > preparation software. In fact, the category is so obscure that there > isn't one agreed term: data cleaning , data munging , data crunching , > or just getting the data ready for analysis. > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Robert Wilkins wrote:> As noted on the R-project web site itself ( www.r-project.org -> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare > messy and dirty data for analysis with the R tool itself. I've also > seen at least one S programming book (one of the yellow Springer ones) > that says, more briefly, the same thing. > The R Data Import/Export page recommends examples using SAS, Perl, > Python, and Java. It takes a bit of courage to say that ( when you go > to a corporate software web site, you'll never see a page saying "This > is the type of problem that our product is not the best at, here's > what we suggest instead" ). I'd like to provide a few more > suggestions, especially for volunteers who are willing to evaluate new > candidates. > > SAS is fine if you're not paying for the license out of your own > pocket. But maybe one reason you're using R is you don't have > thousands of spare dollars. > Using Java for data cleaning is an exercise in sado-masochism, Java > has a learning curve (almost) as difficult as C++. > > There are different types of data transformation, and for some data > preparation problems an all-purpose programming language is a good > choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has > excellent regular expression facilities. > > However, for some types of complex demanding data preparation > problems, an all-purpose programming language is a poor choice. For > example: cleaning up and preparing clinical lab data and adverse event > data - you could do it in Perl, but it would take way, way too much > time. A specialized programming language is needed. And since data > transformation is quite different from data query, SQL is not the > ideal solution either.We deal with exactly those kinds of data solely using R. R is exceptionally powerful for data manipulation, just a bit hard to learn. Many examples are at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf Frank> > There are only three statistical programming languages that are > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more > popular than S for data cleaning. > > If you're an R user with difficult data preparation problems, frankly > you are out of luck, because the products I'm about to mention are > new, unknown, and therefore regarded as immature. And while the > founders of these products would be very happy if you kicked the > tires, most people don't like to look at brand new products. Most > innovators and inventers don't realize this, I've learned it the hard > way. > > But if you are a volunteer who likes to help out by evaluating, > comparing, and reporting upon new candidates, well you could certainly > help out R users and the developers of the products by kicking the > tires of these products. And there is a huge need for such volunteers. > > 1. DAP > This is an open source implementation of SAS. > The founder: Susan Bassein > Find it at: directory.fsf.org/math/stats (GNU GPL) > > 2. PSPP > This is an open source implementation of SPSS. > The relatively early version number might not give a good idea of how > mature the > data transformation features are, it reflects the fact that he has > only started doing the statistical tests. > The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept. > Also at : directory.fsf.org/math/stats (GNU GPL) > > 3. Vilno > This uses a programming language similar to SPSS and SAS, but quite unlike S. > Essentially, it's a substitute for the SAS datastep, and also > transposes data and calculates averages and such. (No t-tests or > regressions in this version). I created this, during the years > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in > my opinion. The tarball includes about 100 or so test cases used for > debugging - for logical calculation errors, but not for extremely high > volumes of data. > The maintenance of Vilno has slowed down, because I am currently > (desparately) looking for employment. But once I've found new > employment and living quarters and settled in, I will continue to > enhance Vilno in my spare time. > The founder: that would be me, Robert Wilkins > Find it at: code.google.com/p/vilno ( GNU GPL ) > ( In particular, the tarball at code.google.com/p/vilno/downloads/list > , since I have yet to figure out how to use Subversion ). > > > 4. Who knows? > It was not easy to find out about the existence of DAP and PSPP. So > who knows what else is out there. However, I think you'll find a lot > more statistics software ( regression , etc ) out there, and not so > much data transformation software. Not many people work on data > preparation software. In fact, the category is so obscure that there > isn't one agreed term: data cleaning , data munging , data crunching , > or just getting the data ready for analysis. > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
Hi, Can you provide examples of data formats that are problematic to read and clean with R ? The only problematic cases I have encountered were cases with multiline and/or varying length records (optional information). Then, it is sometimes a good idea to preprocess the data to present in a tabular format (one record per line). For this purpose, I use awk (e.g. http://www.vectorsite.net/tsawk.html), which is very adept at processing ascii data files (awk is much simpler to learn than perl, spss, sas, ...). I have never encountered a data file in ascii format that I could not reformat with Awk. With binary formats, it is another story... But, again, this is my limited experience; I would like to know if there are situations where using SAS/SPSS is really a better approach. Christophe Pallier On 6/8/07, Robert Wilkins <irishhacker@gmail.com> wrote:> > As noted on the R-project web site itself ( www.r-project.org -> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare > messy and dirty data for analysis with the R tool itself. I've also > seen at least one S programming book (one of the yellow Springer ones) > that says, more briefly, the same thing. > The R Data Import/Export page recommends examples using SAS, Perl, > Python, and Java. It takes a bit of courage to say that ( when you go > to a corporate software web site, you'll never see a page saying "This > is the type of problem that our product is not the best at, here's > what we suggest instead" ). I'd like to provide a few more > suggestions, especially for volunteers who are willing to evaluate new > candidates. > > SAS is fine if you're not paying for the license out of your own > pocket. But maybe one reason you're using R is you don't have > thousands of spare dollars. > Using Java for data cleaning is an exercise in sado-masochism, Java > has a learning curve (almost) as difficult as C++. > > There are different types of data transformation, and for some data > preparation problems an all-purpose programming language is a good > choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has > excellent regular expression facilities. > > However, for some types of complex demanding data preparation > problems, an all-purpose programming language is a poor choice. For > example: cleaning up and preparing clinical lab data and adverse event > data - you could do it in Perl, but it would take way, way too much > time. A specialized programming language is needed. And since data > transformation is quite different from data query, SQL is not the > ideal solution either. > > There are only three statistical programming languages that are > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more > popular than S for data cleaning. > > If you're an R user with difficult data preparation problems, frankly > you are out of luck, because the products I'm about to mention are > new, unknown, and therefore regarded as immature. And while the > founders of these products would be very happy if you kicked the > tires, most people don't like to look at brand new products. Most > innovators and inventers don't realize this, I've learned it the hard > way. > > But if you are a volunteer who likes to help out by evaluating, > comparing, and reporting upon new candidates, well you could certainly > help out R users and the developers of the products by kicking the > tires of these products. And there is a huge need for such volunteers. > > 1. DAP > This is an open source implementation of SAS. > The founder: Susan Bassein > Find it at: directory.fsf.org/math/stats (GNU GPL) > > 2. PSPP > This is an open source implementation of SPSS. > The relatively early version number might not give a good idea of how > mature the > data transformation features are, it reflects the fact that he has > only started doing the statistical tests. > The founder: Ben Pfaff, either a grad student or professor at Stanford CS > dept. > Also at : directory.fsf.org/math/stats (GNU GPL) > > 3. Vilno > This uses a programming language similar to SPSS and SAS, but quite unlike > S. > Essentially, it's a substitute for the SAS datastep, and also > transposes data and calculates averages and such. (No t-tests or > regressions in this version). I created this, during the years > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in > my opinion. The tarball includes about 100 or so test cases used for > debugging - for logical calculation errors, but not for extremely high > volumes of data. > The maintenance of Vilno has slowed down, because I am currently > (desparately) looking for employment. But once I've found new > employment and living quarters and settled in, I will continue to > enhance Vilno in my spare time. > The founder: that would be me, Robert Wilkins > Find it at: code.google.com/p/vilno ( GNU GPL ) > ( In particular, the tarball at code.google.com/p/vilno/downloads/list > , since I have yet to figure out how to use Subversion ). > > > 4. Who knows? > It was not easy to find out about the existence of DAP and PSPP. So > who knows what else is out there. However, I think you'll find a lot > more statistics software ( regression , etc ) out there, and not so > much data transformation software. Not many people work on data > preparation software. In fact, the category is so obscure that there > isn't one agreed term: data cleaning , data munging , data crunching , > or just getting the data ready for analysis. > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Christophe Pallier (http://www.pallier.org) [[alternative HTML version deleted]]
On 6/7/07, Robert Wilkins <irishhacker at gmail.com> wrote:> As noted on the R-project web site itself ( www.r-project.org -> > Manuals -> R Data Import/Export ), it can be cumbersome to prepare > messy and dirty data for analysis with the R tool itself. I've also > seen at least one S programming book (one of the yellow Springer ones) > that says, more briefly, the same thing. > The R Data Import/Export page recommends examples using SAS, Perl, > Python, and Java. It takes a bit of courage to say that ( when you go > to a corporate software web site, you'll never see a page saying "This > is the type of problem that our product is not the best at, here's > what we suggest instead" ). I'd like to provide a few more > suggestions, especially for volunteers who are willing to evaluate new > candidates. > > SAS is fine if you're not paying for the license out of your own > pocket. But maybe one reason you're using R is you don't have > thousands of spare dollars. > Using Java for data cleaning is an exercise in sado-masochism, Java > has a learning curve (almost) as difficult as C++. > > There are different types of data transformation, and for some data > preparation problems an all-purpose programming language is a good > choice ( i.e. Perl , or maybe Python/Ruby ). Perl, for example, has > excellent regular expression facilities. > > However, for some types of complex demanding data preparation > problems, an all-purpose programming language is a poor choice. For > example: cleaning up and preparing clinical lab data and adverse event > data - you could do it in Perl, but it would take way, way too much > time. A specialized programming language is needed. And since data > transformation is quite different from data query, SQL is not the > ideal solution either. > > There are only three statistical programming languages that are > well-known, all dating from the 1970s: SPSS, SAS, and S. SAS is more > popular than S for data cleaning. > > If you're an R user with difficult data preparation problems, frankly > you are out of luck, because the products I'm about to mention are > new, unknown, and therefore regarded as immature. And while the > founders of these products would be very happy if you kicked the > tires, most people don't like to look at brand new products. Most > innovators and inventers don't realize this, I've learned it the hard > way. > > But if you are a volunteer who likes to help out by evaluating, > comparing, and reporting upon new candidates, well you could certainly > help out R users and the developers of the products by kicking the > tires of these products. And there is a huge need for such volunteers. > > 1. DAP > This is an open source implementation of SAS. > The founder: Susan Bassein > Find it at: directory.fsf.org/math/stats (GNU GPL) > > 2. PSPP > This is an open source implementation of SPSS. > The relatively early version number might not give a good idea of how > mature the > data transformation features are, it reflects the fact that he has > only started doing the statistical tests. > The founder: Ben Pfaff, either a grad student or professor at Stanford CS dept. > Also at : directory.fsf.org/math/stats (GNU GPL) > > 3. Vilno > This uses a programming language similar to SPSS and SAS, but quite unlike S. > Essentially, it's a substitute for the SAS datastep, and also > transposes data and calculates averages and such. (No t-tests or > regressions in this version). I created this, during the years > 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in > my opinion. The tarball includes about 100 or so test cases used for > debugging - for logical calculation errors, but not for extremely high > volumes of data. > The maintenance of Vilno has slowed down, because I am currently > (desparately) looking for employment. But once I've found new > employment and living quarters and settled in, I will continue to > enhance Vilno in my spare time. > The founder: that would be me, Robert Wilkins > Find it at: code.google.com/p/vilno ( GNU GPL ) > ( In particular, the tarball at code.google.com/p/vilno/downloads/list > , since I have yet to figure out how to use Subversion ). > > 4. Who knows? > It was not easy to find out about the existence of DAP and PSPP. So > who knows what else is out there. However, I think you'll find a lot > more statistics software ( regression , etc ) out there, and not so > much data transformation software. Not many people work on data > preparation software. In fact, the category is so obscure that there > isn't one agreed term: data cleaning , data munging , data crunching , > or just getting the data ready for analysis.Thanks for bringing up this topic. I think there is definitely a place for such languages, which I would regard as data-filtering languages, but I also think that trying to reproduce the facilities in SAS or SPSS for data analysis is redundant. Other responses in this thread have mentioned 'little language' filters like awk, which is fine for those who were raised in the Bell Labs tradition of programming ("why type three characters when two character names should suffice for anything one wants to do on a PDP-11") but the typical field scientist finds this a bit too terse to understand and would rather write a filter as a paragraph of code that they have a change of reading and understanding a week later. Frank Harrell indicated that it is possible to do a lot of difficult data transformation within R itself if you try hard enough but that sometimes means working against the S language and its "whole object" view to accomplish what you want and it can require knowledge of subtle aspects of the S language. General scripting languages like Perl, Python and Ruby can certainly be used for data filtering but that means learning the language and its idiosyncrasies, and those idiosyncrasies are often exactly the aspects that would be used to write a filter tersely. Readability suffers. ("Hell is reading someone else's Perl code - purgatory is reading your own Perl code.") The very generality of the languages means there is a lot to learn and understand before you can write something like a simple filter. So I do agree that it would be useful to have a language like the SAS data step (but Open Source, of course) in which to write a data filter. I have one suggestion to make - use the R data frame structure in the form of a .rda file as the binary output format for a data table. That way the user can get the best of both worlds by using a language like Viino to manipulate and rearrange huge data files then switching to R for the graphics and data analysis. As a further enhancement one might provide the ability to take a .rda file that contains a single data frame and select columns or rows, including a random sample of the rows, as a filter. Producing an R data frame may involve passing over the data twice, once to determine the size of the resulting structure and the second time to evaluate the data itself. This would have been a horrific penalty in the days that SAS and SPSS were developed but not now.