thr3ads.net - R help - [R] SAS to R migration questions [Sep 2004]

If this information is useful, please help other people find it:
Share via:

Matthew Wilson

2004-Sep-11 13:39 UTC

[R] SAS to R migration questions

Hi,


I'd like to get away from SAS, but I don't really know R well enough at
this point to know if it would be good for this project.  I tried to
describe the essence of the project below without getting bogged down in
details.

It starts when I receive a data flat file.  There's lots of columns, but
the relevant ones are:

    custid  (customer ID number)
    saledt  (date of sale)
    salepx  (sale price)


Step 1:

I read in this data into a SAS dataset.  Some of these flat files hold
several gigabytes of data.  SAS allows indexes to be created on columns
which really speeds up queries. 

I read the R import/export doc and it suggested using databases for
really big datasets.  I figured I'd probably use perl or python to read
the file and convert it to either an R .tab file or to load the data
into a SQL database for the big files (Postgres or MySQL, since I'm
trying to go 100% open source with this).



Step 2:

In the data, I'll usually find one row per sale, but occasionally, a
sale will be entered incorrectly at first, then later reversed, then a
third line will show the correct sale data:

    custid      saledt      salepx
    111         8/1/2004    $75
    111         9/1/2004    $50
    112         10/1/2004   $30
    112         10/1/2004   ($30)
    112         10/1/2004   $20

The fourth line reverses the third line by showing a negative charge for
the same customer ID and sale date, and the last line is the correct
line.  I want to compress all those adjustments and reversals lines out
of the data, so the outgoing data would look like this:

    custid      saledt      salepx
    111         8/1/2004    $75
    111         9/1/2004    $50
    112         10/1/2004   $20

In SAS, I use a proc summary step in SAS to accomplish this:

    proc summary data=d1;
        class custid saledt;
        var salepx;
        output out=d2 sum=;
    run;

This is where I need help:  how to do this step in R?


Step 3:

I print a list of number of sales per customer ID, ranking the customer
IDs from most to least.  I use a SAS proc freq step for this:

    proc freq data=d2 order=freq;
        tables custid;
    run;

and the output would look like this:

    custid      freq
    111         2
    112         1


Again, I have no idea how to do step 3 in R.  


Thanks in advance!  All help is welcome.  Is this kind of work what R is
good at?


-- 
My public key:
gpg --recv-keys --keyserver www.mandrakesecure.net 0x8D10BFD5

Spencer Graves

2004-Sep-12 19:55 UTC

head link

[R] SAS to R migration questions

I copied your data into Excel and saved it as *.txt.  From within 
R, the following commands produced for me the result cited in your email 
below: 

salesData <- read.table("R-qn.txt", header=TRUE,
                        sep="\t", as.is=TRUE)
sapply(salesData, class)# check class of columns of salesData
reversal <- regexpr("\\(", salesData$salepx)
rev <- which(reversal>0)
salesData[-c(rev, rev-1),]

      If "R-qn.txt" contained gigabytes, R might die in
"read.table".
For that, you will need to process the data base in smaller pieces.  The 
"R Data Import/Export" manual [available, e.g., via help.start() from 
within R] discusses various ways of interacting direction with 
relational databases, etc. 

      hope this helps.  spencer graves

Matthew Wilson wrote:
>Hi,
>
>
>I'd like to get away from SAS, but I don't really know R well enough
at
>this point to know if it would be good for this project.  I tried to
>describe the essence of the project below without getting bogged down in
>details.
>
>It starts when I receive a data flat file.  There's lots of columns, but
>the relevant ones are:
>
>    custid  (customer ID number)
>    saledt  (date of sale)
>    salepx  (sale price)
>
>
>Step 1:
>
>I read in this data into a SAS dataset.  Some of these flat files hold
>several gigabytes of data.  SAS allows indexes to be created on columns
>which really speeds up queries. 
>
>I read the R import/export doc and it suggested using databases for
>really big datasets.  I figured I'd probably use perl or python to read
>the file and convert it to either an R .tab file or to load the data
>into a SQL database for the big files (Postgres or MySQL, since I'm
>trying to go 100% open source with this).
>
>
>
>Step 2:
>
>In the data, I'll usually find one row per sale, but occasionally, a
>sale will be entered incorrectly at first, then later reversed, then a
>third line will show the correct sale data:
>
>    custid      saledt      salepx
>    111         8/1/2004    $75
>    111         9/1/2004    $50
>    112         10/1/2004   $30
>    112         10/1/2004   ($30)
>    112         10/1/2004   $20
>
>The fourth line reverses the third line by showing a negative charge for
>the same customer ID and sale date, and the last line is the correct
>line.  I want to compress all those adjustments and reversals lines out
>of the data, so the outgoing data would look like this:
>
>    custid      saledt      salepx
>    111         8/1/2004    $75
>    111         9/1/2004    $50
>    112         10/1/2004   $20
>
>In SAS, I use a proc summary step in SAS to accomplish this:
>
>    proc summary data=d1;
>        class custid saledt;
>        var salepx;
>        output out=d2 sum=;
>    run;
>
>This is where I need help:  how to do this step in R?
>
>
>Step 3:
>
>I print a list of number of sales per customer ID, ranking the customer
>IDs from most to least.  I use a SAS proc freq step for this:
>
>    proc freq data=d2 order=freq;
>        tables custid;
>    run;
>
>and the output would look like this:
>
>    custid      freq
>    111         2
>    112         1
>
>
>Again, I have no idea how to do step 3 in R.  
>
>
>Thanks in advance!  All help is welcome.  Is this kind of work what R is
>good at?
>
>
>  
>
-- 
Spencer Graves, PhD, Senior Development Engineer
O:  (408)938-4420;  mobile:  (408)655-4567

Fernando Henrique Ferraz P. da Rosa

2004-Sep-12 21:06 UTC

head link

[R] SAS to R migration questions

Matthew Wilson writes:> (...)
> Step 3:
> 
> I print a list of number of sales per customer ID, ranking the customer
> IDs from most to least.  I use a SAS proc freq step for this:
> 
>     proc freq data=d2 order=freq;
>         tables custid;
>     run;
> 
> and the output would look like this:
> 
>     custid      freq
>     111         2
>     112         1
> 
> 
> Again, I have no idea how to do step 3 in R.  
        Provided that you are going to work with this data stored in a
SQL database, it'll probably be more efficient to do this sort of
manipulation directly in SQL. Anyways, the following lines of code in R
will do what you described:

cust <- c(111,111,112)
cc <- data.frame(t(sapply(unique(cust),function(level,vec) {
c(custid=level,freq=sum(vec==level)) },cust))) 
cc[order(cc$freq,decreasing=T),]


Cheers,

--
Fernando Henrique Ferraz P. da Rosa
http://www.ime.usp.br/~feferraz

Fernando Henrique Ferraz P. da Rosa

2004-Sep-12 21:24 UTC

head link

[R] SAS to R migration questions

Fernando Henrique Ferraz P. da Rosa writes:> cust <- c(111,111,112)
> cc <- data.frame(t(sapply(unique(cust),function(level,vec) {
> c(custid=level,freq=sum(vec==level)) },cust))) 
> cc[order(cc$freq,decreasing=T),]
        An even simpler solution:
 
        cc <- data.frame(table(cust))
        cc[order(cc$Freq,decreasing=T),]


--
Fernando Henrique Ferraz P. da Rosa
http://www.ime.usp.br/~feferraz

Reasonably Related Threads

Search for more possibly parallel threads

R help - Sep 2004 - SAS to R migration questions

[R] SAS to R migration questions

[R] SAS to R migration questions

[R] SAS to R migration questions

[R] SAS to R migration questions

Reasonably Related Threads