Hi,
I'd like to get away from SAS, but I don't really know R well enough at
this point to know if it would be good for this project. I tried to
describe the essence of the project below without getting bogged down in
details.
It starts when I receive a data flat file. There's lots of columns, but
the relevant ones are:
custid (customer ID number)
saledt (date of sale)
salepx (sale price)
Step 1:
I read in this data into a SAS dataset. Some of these flat files hold
several gigabytes of data. SAS allows indexes to be created on columns
which really speeds up queries.
I read the R import/export doc and it suggested using databases for
really big datasets. I figured I'd probably use perl or python to read
the file and convert it to either an R .tab file or to load the data
into a SQL database for the big files (Postgres or MySQL, since I'm
trying to go 100% open source with this).
Step 2:
In the data, I'll usually find one row per sale, but occasionally, a
sale will be entered incorrectly at first, then later reversed, then a
third line will show the correct sale data:
custid saledt salepx
111 8/1/2004 $75
111 9/1/2004 $50
112 10/1/2004 $30
112 10/1/2004 ($30)
112 10/1/2004 $20
The fourth line reverses the third line by showing a negative charge for
the same customer ID and sale date, and the last line is the correct
line. I want to compress all those adjustments and reversals lines out
of the data, so the outgoing data would look like this:
custid saledt salepx
111 8/1/2004 $75
111 9/1/2004 $50
112 10/1/2004 $20
In SAS, I use a proc summary step in SAS to accomplish this:
proc summary data=d1;
class custid saledt;
var salepx;
output out=d2 sum=;
run;
This is where I need help: how to do this step in R?
Step 3:
I print a list of number of sales per customer ID, ranking the customer
IDs from most to least. I use a SAS proc freq step for this:
proc freq data=d2 order=freq;
tables custid;
run;
and the output would look like this:
custid freq
111 2
112 1
Again, I have no idea how to do step 3 in R.
Thanks in advance! All help is welcome. Is this kind of work what R is
good at?
--
My public key:
gpg --recv-keys --keyserver www.mandrakesecure.net 0x8D10BFD5
I copied your data into Excel and saved it as *.txt. From within
R, the following commands produced for me the result cited in your email
below:
salesData <- read.table("R-qn.txt", header=TRUE,
sep="\t", as.is=TRUE)
sapply(salesData, class)# check class of columns of salesData
reversal <- regexpr("\\(", salesData$salepx)
rev <- which(reversal>0)
salesData[-c(rev, rev-1),]
If "R-qn.txt" contained gigabytes, R might die in
"read.table".
For that, you will need to process the data base in smaller pieces. The
"R Data Import/Export" manual [available, e.g., via help.start() from
within R] discusses various ways of interacting direction with
relational databases, etc.
hope this helps. spencer graves
Matthew Wilson wrote:
>Hi,
>
>
>I'd like to get away from SAS, but I don't really know R well enough
at
>this point to know if it would be good for this project. I tried to
>describe the essence of the project below without getting bogged down in
>details.
>
>It starts when I receive a data flat file. There's lots of columns, but
>the relevant ones are:
>
> custid (customer ID number)
> saledt (date of sale)
> salepx (sale price)
>
>
>Step 1:
>
>I read in this data into a SAS dataset. Some of these flat files hold
>several gigabytes of data. SAS allows indexes to be created on columns
>which really speeds up queries.
>
>I read the R import/export doc and it suggested using databases for
>really big datasets. I figured I'd probably use perl or python to read
>the file and convert it to either an R .tab file or to load the data
>into a SQL database for the big files (Postgres or MySQL, since I'm
>trying to go 100% open source with this).
>
>
>
>Step 2:
>
>In the data, I'll usually find one row per sale, but occasionally, a
>sale will be entered incorrectly at first, then later reversed, then a
>third line will show the correct sale data:
>
> custid saledt salepx
> 111 8/1/2004 $75
> 111 9/1/2004 $50
> 112 10/1/2004 $30
> 112 10/1/2004 ($30)
> 112 10/1/2004 $20
>
>The fourth line reverses the third line by showing a negative charge for
>the same customer ID and sale date, and the last line is the correct
>line. I want to compress all those adjustments and reversals lines out
>of the data, so the outgoing data would look like this:
>
> custid saledt salepx
> 111 8/1/2004 $75
> 111 9/1/2004 $50
> 112 10/1/2004 $20
>
>In SAS, I use a proc summary step in SAS to accomplish this:
>
> proc summary data=d1;
> class custid saledt;
> var salepx;
> output out=d2 sum=;
> run;
>
>This is where I need help: how to do this step in R?
>
>
>Step 3:
>
>I print a list of number of sales per customer ID, ranking the customer
>IDs from most to least. I use a SAS proc freq step for this:
>
> proc freq data=d2 order=freq;
> tables custid;
> run;
>
>and the output would look like this:
>
> custid freq
> 111 2
> 112 1
>
>
>Again, I have no idea how to do step 3 in R.
>
>
>Thanks in advance! All help is welcome. Is this kind of work what R is
>good at?
>
>
>
>
--
Spencer Graves, PhD, Senior Development Engineer
O: (408)938-4420; mobile: (408)655-4567
Fernando Henrique Ferraz P. da Rosa
2004-Sep-12 21:06 UTC
[R] SAS to R migration questions
Matthew Wilson writes:> (...) > Step 3: > > I print a list of number of sales per customer ID, ranking the customer > IDs from most to least. I use a SAS proc freq step for this: > > proc freq data=d2 order=freq; > tables custid; > run; > > and the output would look like this: > > custid freq > 111 2 > 112 1 > > > Again, I have no idea how to do step 3 in R.Provided that you are going to work with this data stored in a SQL database, it'll probably be more efficient to do this sort of manipulation directly in SQL. Anyways, the following lines of code in R will do what you described: cust <- c(111,111,112) cc <- data.frame(t(sapply(unique(cust),function(level,vec) { c(custid=level,freq=sum(vec==level)) },cust))) cc[order(cc$freq,decreasing=T),] Cheers, -- Fernando Henrique Ferraz P. da Rosa http://www.ime.usp.br/~feferraz
Fernando Henrique Ferraz P. da Rosa
2004-Sep-12 21:24 UTC
[R] SAS to R migration questions
Fernando Henrique Ferraz P. da Rosa writes:> cust <- c(111,111,112) > cc <- data.frame(t(sapply(unique(cust),function(level,vec) { > c(custid=level,freq=sum(vec==level)) },cust))) > cc[order(cc$freq,decreasing=T),]An even simpler solution: cc <- data.frame(table(cust)) cc[order(cc$Freq,decreasing=T),] -- Fernando Henrique Ferraz P. da Rosa http://www.ime.usp.br/~feferraz