In clinical trial data preparation and many other data situations, the statistical programmer needs to merge and re-merge multiple input files countless times. A syntax for merging files that is clear and concise is very important for the statistical programmer's productivity.

Here is how Vilno does it:

    inlist dataset1 dataset2 dataset3 ;
    joinby variable1 variable2 where ( var3<=var4 ) ;

Each column in a dataset has a variable name (variable1, variable2, var3, var4). You are merging three input datafiles: dataset1, dataset2, and dataset3. The joinby statement asks for a many-to-many join, rather like the SQL SELECT statement. (The mergeby statement asks for a many-to-one join, which is more efficient. The readby statement asks for interleaving of rows: the rows don't "match up", one row simply goes under the preceding one, so 100 rows + 100 rows -> 200 output rows.)

The join (or merge) is done within variable1*variable2 subgroups: a row from dataset1 where variable1=4 and variable2="Sam" can only match a row from dataset2 where variable1=4 and variable2="Sam". In addition, any match-ups where var3<=var4 does not hold are excluded.

Here is how the SAS datastep does it:

    merge dataset1 dataset2 dataset3 ;
    by variable1 variable2 ;
    if ^( var3<=var4 ) then delete ;

(Actually, the SAS datastep can only do a many-to-one join, but you can use a PROC SQL paragraph to do an SQL SELECT statement, then export the results to a SAS datastep afterwards.)

The point is: there are lots of data preparation scenarios where large numbers of merges need to be done. This is an example where Vilno and SAS are easier to use than the competition. I'm sure an Awk programmer could come up with something, but the result would be awkward.

You can also find other data preparation problems where the best tool is Awk. Looking through "Sed & Awk" (O'Reilly) gives a good idea. I'm no expert Awk-er, for sure, but I think I can see that Awk and Vilno are really apples and oranges. For scanning inconsistently structured ASCII data files, where different rows have different column specifications, Awk is the better tool. For data problems that lend themselves to UNIX-style regular expressions, Awk, again, is a great tool. And if a data manipulation problem is incredibly simple, converting an ASCII data file to binary and back may not seem worth it, so Awk wins again. But the asciitobinary and binarytoascii statements (there and back) only take 4 lines or so, so Vilno is really not that bad.

Certain aspects of Vilno and SAS are a bit more user-friendly:

Each column has a variable name, such as "PatientID". Awk uses $1, $2, $3 as variable names for columns. Not user-friendly.

In both Vilno and SAS (and SQL), the possibility of MISSING (or NULL) is built into the data values held in the columns, so you don't have to use separate boolean variables to track MISSING vs NOT-MISSING. Very convenient.

Vilno also has a lot of functionality that is much harder to implement in most other programming languages. (You can implement that functionality, but it would take a ton of code; the three merge-in options in Vilno are an example.)

The upshot: Awk is a hammer. Vilno is a screwdriver.
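(For comparison on this list: a rough base-R equivalent of the same join. This is only a sketch, and it assumes dataset1, dataset2 and dataset3 have already been read into R as data frames:)

    ## many-to-many join within variable1*variable2 subgroups;
    ## merge() pairs every matching row, like an SQL inner join
    step   <- merge(dataset1, dataset2, by = c("variable1", "variable2"))
    step   <- merge(step,     dataset3, by = c("variable1", "variable2"))
    ## exclude match-ups where var3 <= var4 does not hold
    result <- subset(step, var3 <= var4)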
On 6/13/07, Robert Wilkins <irishhacker@gmail.com> wrote:
> The point is: there are lots of data preparation scenarios where
> large numbers of merges need to be done. This is an example where
> Vilno and SAS are easier to use than the competition. I'm sure an Awk
> programmer could come up with something, but the result would be
> awkward.

Agreed. In the awk+R scenario, it is clear that the merges are often better done with R. My strategy is to use awk only to clean/reformat data into a tabular format, and to do most of the "consolidation" (computations/filtering/merges) in R. I suggested using awk only to perform manipulations that would be more complex to do within R (especially multiline records or records with optional fields). I try to keep the scripts as simple as possible on both sides.

> Certain aspects of Vilno and SAS are a bit more user-friendly:
>
> Each column has a variable name, such as "PatientID". Awk uses
> $1, $2, $3 as variable names for columns. Not user-friendly.

In the first lines of awk scripts, I usually assign column numbers to variables (e.g. "Code=1; Time=3") and then access the fields with "$Code", "$Time"... Yet it is true that it is cumbersome, in awk, to use the labels on the first line of a file as variable names (my major complaint about awk). I looked at a few examples of SAS data step scripts on the net, and found that the awk scripts would be very similar (except for merges), but there may be manipulations which I missed.

> For scanning inconsistently structured ASCII data files, where
> different rows have different column specifications, Awk is the
> better tool.
>
> For data problems that lend themselves to UNIX-style regular
> expressions, Awk, again, is a great tool.

The examples of messy data formats that were described earlier on the list are good examples where regular expressions will help a lot. In the very first stage of data inspection, to detect coding "mistakes", awk (sometimes with the help of other GNU tools such as 'uniq' and 'sort') can be very efficient.

> The upshot: Awk is a hammer. Vilno is a screwdriver.

Nice analogy. Using the right tool for the right task is very important. So awk and vilno seem complementary. Yet, when R enters the equation, do you still "need" all three tools?

What we should really compare is the four situations:

    R alone
    R + awk
    R + vilno
    R + awk + vilno

and maybe "R + SAS data step", and see which scripts are more elegant (read: 'short and understandable').

Best,

Christophe

--
Christophe Pallier (http://www.pallier.org)
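(A side note on the column-name complaint above: in R, the labels on the first line of a file become column names for free via read.table's header argument. A minimal sketch, assuming a whitespace-delimited file "patients.txt", a hypothetical name, whose first line holds the variable names:)

    ## header = TRUE turns the labels on the first line into column names
    d <- read.table("patients.txt", header = TRUE)
    d$PatientID              ## fields are addressed by name, not $1, $2, $3
    sum(is.na(d$PatientID))  ## missingness (NA) is built into the values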
On 13-Jun-07 01:24:41, Robert Wilkins wrote:
> In clinical trial data preparation and many other data situations,
> the statistical programmer needs to merge and re-merge multiple
> input files countless times. A syntax for merging files that is
> clear and concise is very important for the statistical programmer's
> productivity.
>
> Here is how Vilno does it:
>
> inlist dataset1 dataset2 dataset3 ;
> joinby variable1 variable2 where ( var3<=var4 ) ;
> [...]

Thanks to Robert for this more explicit illustration of what Vilno does. Its potential usefulness is clear. I broadly agree with the comments that have been made about the various approaches (with/without awk/sed/R etc).

Is there any URL that leads to a fuller explicit exposition of Vilno? As I said previously, a web-search on "vilno" leads to very little that is relevant, and what I did find didn't amount to much.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-07  Time: 10:00:12
------------------------------ XFMail ------------------------------
Hey,

> What we should really compare is the four situations:
> R alone
> R + awk
> R + vilno
> R + awk + vilno
> and maybe "R + SAS data step",
> and see which scripts are more elegant (read: 'short and understandable')

What do you guys think of creating an R-wiki page for syntax comparisons among the various options, to enhance R use? I already have two suggestions:

1) syntax examples for using R and other tools to manipulate and analyze large datasets (with a concise description of the datasets);

2) syntax examples for using R and other tools (or R alone) to clean and prepare datasets (simple and very small datasets, for didactic purposes).

I think this could be interesting for R users and could promote other software tools, since it seems there are a lot of R users who use other tools as well. Besides that, questions on those two subjects come up frequently on this list, so a wiki page seems to be the right place to discuss and teach this to other users.

What do you think?

Rogerio