In clinical trial data preparation and many other data situations, the statistical programmer needs to merge and re-merge multiple input files countless times. A syntax for merging files that is clear and concise is very important for the statistical programmer's productivity.

Here is how Vilno does it:

    inlist dataset1 dataset2 dataset3 ;
    joinby variable1 variable2 where ( var3<=var4 ) ;

Each column in a dataset has a variable name (variable1, variable2, var3, var4). You are merging three input datafiles: dataset1, dataset2, and dataset3. The joinby statement asks for a many-to-many join, rather like the SQL SELECT statement. (The mergeby statement asks for a many-to-one join, which is more efficient. The readby statement asks for interleaving of rows: the rows don't "match up", one row simply goes under the preceding one, so 100 rows + 100 rows -> 200 output rows.)

The join (or merge) is done within variable1*variable2 subgroups: a row from dataset1 where variable1=4 and variable2="Sam" can only match a row from dataset2 where variable1=4 and variable2="Sam". In addition, any match-ups where var3<=var4 does not hold are excluded.

Here is how the SAS datastep does it:

    merge dataset1 dataset2 dataset3 ;
    by variable1 variable2 ;
    if ^( var3<=var4 ) then delete ;

(Actually, the SAS datastep can only do a many-to-one join, but you can use a PROC SQL paragraph to do an SQL SELECT statement, then export the results to a SAS datastep afterwards.)

The point is: there are lots of data preparation scenarios where large numbers of merges need to be done. This is an example where Vilno and SAS are easier to use than the competition. I'm sure an Awk programmer could come up with something, but the result would be awkward.

You can also find other data preparation problems where the best tool is Awk. Looking through "Sed & Awk" (O'Reilly) gives a good idea. I'm no expert Awk-er, for sure, but I think I can see that Awk and Vilno are really apples and oranges. For scanning inconsistently structured ASCII data files, where different rows have different column specifications, Awk is the better tool. For data problems that lend themselves to UNIX-style regular expressions, Awk, again, is a great tool. And if a data manipulation problem is incredibly simple, converting an ASCII data file to binary and back may not seem worth it, so Awk wins again. But the asciitobinary and binarytoascii statements (there and back) only take 4 lines or so, so Vilno is really not that bad.

Certain aspects of Vilno and SAS are a bit more user-friendly:

Each column has a variable name, such as "PatientID". Awk uses $1, $2, $3 as variable names for columns. Not user-friendly.

In both Vilno and SAS (and SQL), the possibility of MISSING (or NULL) is built into the data values held in the columns, so you don't have to use separate boolean variables to track MISSING vs NOT-MISSING. Very convenient.

Vilno also has a lot of functionality that is much harder to implement in most other programming languages. (You can implement that functionality, but it would take a ton of code; the three merge-in options in Vilno are an example.)

The upshot: Awk is a hammer. Vilno is a screwdriver.
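(For comparison on this list: a rough base-R equivalent of the same join. This is only a sketch, and it assumes dataset1, dataset2 and dataset3 have already been read into R as data frames:)

    ## many-to-many join within variable1*variable2 subgroups;
    ## merge() pairs every matching row, like an SQL inner join
    step   <- merge(dataset1, dataset2, by = c("variable1", "variable2"))
    step   <- merge(step,     dataset3, by = c("variable1", "variable2"))
    ## exclude match-ups where var3 <= var4 does not hold
    result <- subset(step, var3 <= var4)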
On 6/13/07, Robert Wilkins <irishhacker@gmail.com> wrote:
> The point is: there are lots of data preparation scenarios where
> large numbers of merges need to be done. This is an example where
> Vilno and SAS are easier to use than the competition. I'm sure an Awk
> programmer could come up with something, but the result would be
> awkward.

Agreed. In the awk+R scenario, it is clear that the merges are often better done with R. My strategy is to use awk only to clean/reformat data into a tabular format, and to do most of the "consolidation" (computations/filtering/merges) in R. I suggested using awk only to perform manipulations that would be more complex to do within R (especially multiline records or records with optional fields). I try to keep the scripts as simple as possible on both sides.

> Certain aspects of Vilno and SAS are a bit more user-friendly:
>
> Each column has a variable name, such as "PatientID". Awk uses
> $1, $2, $3 as variable names for columns. Not user-friendly.

In the first lines of awk scripts, I usually assign column numbers to variables (e.g. "Code=1; Time=3") and then access the fields with "$Code", "$Time"... Yet it is true that it is cumbersome, in awk, to use the labels on the first line of a file as variable names (my major complaint about awk). I looked at a few examples of SAS data step scripts on the net, and found that the awk scripts would be very similar (except for merges), but there may be manipulations which I missed.

> For scanning inconsistently structured ASCII data files, where
> different rows have different column specifications, Awk is the
> better tool.
>
> For data problems that lend themselves to UNIX-style regular
> expressions, Awk, again, is a great tool.

The examples of messy data formats that were described earlier on the list are good examples where regular expressions will help a lot. In the very first stage of data inspection, to detect coding "mistakes", awk (sometimes with the help of other GNU tools such as 'uniq' and 'sort') can be very efficient.

> The upshot: Awk is a hammer. Vilno is a screwdriver.

Nice analogy. Using the right tool for the right task is very important. So awk and vilno seem complementary. Yet, when R enters the equation, do you still "need" all three tools?

What we should really compare is the four situations:

    R alone
    R + awk
    R + vilno
    R + awk + vilno

and maybe "R + SAS data step", and see which scripts are more elegant (read: 'short and understandable').

Best,

Christophe

--
Christophe Pallier (http://www.pallier.org)
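(A side note on the column-name complaint above: in R, the labels on the first line of a file become column names for free via read.table's header argument. A minimal sketch, assuming a whitespace-delimited file "patients.txt", a hypothetical name, whose first line holds the variable names:)

    ## header = TRUE turns the labels on the first line into column names
    d <- read.table("patients.txt", header = TRUE)
    d$PatientID              ## fields are addressed by name, not $1, $2, $3
    sum(is.na(d$PatientID))  ## missingness (NA) is built into the values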
On 13-Jun-07 01:24:41, Robert Wilkins wrote:
> In clinical trial data preparation and many other data situations,
> the statistical programmer needs to merge and re-merge multiple
> input files countless times. A syntax for merging files that is
> clear and concise is very important for the statistical programmer's
> productivity.
>
> Here is how Vilno does it:
>
> inlist dataset1 dataset2 dataset3 ;
> joinby variable1 variable2 where ( var3<=var4 ) ;
> [...]

Thanks to Robert for this more explicit illustration of what Vilno does. Its potential usefulness is clear. I broadly agree with the comments that have been made about the various approaches (with/without awk/sed/R etc).

Is there any URL that leads to a fuller explicit exposition of Vilno? As I said previously, a web-search on "vilno" leads to very little that is relevant, and what I did find didn't amount to much.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jun-07  Time: 10:00:12
------------------------------ XFMail ------------------------------
Hey,

> What we should really compare is the four situations:
> R alone
> R + awk
> R + vilno
> R + awk + vilno
> and maybe "R + SAS data step",
> and see which scripts are more elegant (read: 'short and understandable')

What do you guys think of creating an R-wiki page for syntax comparisons among the various options, to enhance R use? I already have two suggestions:

1) syntax examples for using R and other tools to manipulate and analyze large datasets (with a concise description of the datasets);

2) syntax examples for using R and other tools (or R alone) to clean and prepare datasets (simple and very small datasets, for didactic purposes).

I think this could be interesting for R users and could promote other software tools, since it seems there are a lot of R users who use other tools as well. Besides that, questions on those two subjects come up frequently on this list, so a wiki page seems to be the right place to discuss and teach this to other users.

What do you think?

Rogerio