Hi, I'd like to get away from SAS, but I don't really know R well enough at this point to know if it would be good for this project. I tried to describe the essence of the project below without getting bogged down in details. It starts when I receive a data flat file. There's lots of columns, but the relevant ones are: custid (customer ID number) saledt (date of sale) salepx (sale price) Step 1: I read in this data into a SAS dataset. Some of these flat files hold several gigabytes of data. SAS allows indexes to be created on columns which really speeds up queries. I read the R import/export doc and it suggested using databases for really big datasets. I figured I'd probably use perl or python to read the file and convert it to either an R .tab file or to load the data into a SQL database for the big files (Postgres or MySQL, since I'm trying to go 100% open source with this). Step 2: In the data, I'll usually find one row per sale, but occasionally, a sale will be entered incorrectly at first, then later reversed, then a third line will show the correct sale data: custid saledt salepx 111 8/1/2004 $75 111 9/1/2004 $50 112 10/1/2004 $30 112 10/1/2004 ($30) 112 10/1/2004 $20 The fourth line reverses the third line by showing a negative charge for the same customer ID and sale date, and the last line is the correct line. I want to compress all those adjustments and reversals lines out of the data, so the outgoing data would look like this: custid saledt salepx 111 8/1/2004 $75 111 9/1/2004 $50 112 10/1/2004 $20 In SAS, I use a proc summary step in SAS to accomplish this: proc summary data=d1; class custid saledt; var salepx; output out=d2 sum=; run; This is where I need help: how to do this step in R? Step 3: I print a list of number of sales per customer ID, ranking the customer IDs from most to least. I use a SAS proc freq step for this: proc freq data=d2 order=freq; tables custid; run; and the output would look like this: custid freq 111 2 112 1 Again, I have no idea how to do step 3 in R. Thanks in advance! All help is welcome. Is this kind of work what R is good at? -- My public key: gpg --recv-keys --keyserver www.mandrakesecure.net 0x8D10BFD5
I copied your data into Excel and saved it as *.txt. From within R, the following commands produced for me the result cited in your email below: salesData <- read.table("R-qn.txt", header=TRUE, sep="\t", as.is=TRUE) sapply(salesData, class)# check class of columns of salesData reversal <- regexpr("\\(", salesData$salepx) rev <- which(reversal>0) salesData[-c(rev, rev-1),] If "R-qn.txt" contained gigabytes, R might die in "read.table". For that, you will need to process the data base in smaller pieces. The "R Data Import/Export" manual [available, e.g., via help.start() from within R] discusses various ways of interacting direction with relational databases, etc. hope this helps. spencer graves Matthew Wilson wrote:>Hi, > > >I'd like to get away from SAS, but I don't really know R well enough at >this point to know if it would be good for this project. I tried to >describe the essence of the project below without getting bogged down in >details. > >It starts when I receive a data flat file. There's lots of columns, but >the relevant ones are: > > custid (customer ID number) > saledt (date of sale) > salepx (sale price) > > >Step 1: > >I read in this data into a SAS dataset. Some of these flat files hold >several gigabytes of data. SAS allows indexes to be created on columns >which really speeds up queries. > >I read the R import/export doc and it suggested using databases for >really big datasets. I figured I'd probably use perl or python to read >the file and convert it to either an R .tab file or to load the data >into a SQL database for the big files (Postgres or MySQL, since I'm >trying to go 100% open source with this). > > > >Step 2: > >In the data, I'll usually find one row per sale, but occasionally, a >sale will be entered incorrectly at first, then later reversed, then a >third line will show the correct sale data: > > custid saledt salepx > 111 8/1/2004 $75 > 111 9/1/2004 $50 > 112 10/1/2004 $30 > 112 10/1/2004 ($30) > 112 10/1/2004 $20 > >The fourth line reverses the third line by showing a negative charge for >the same customer ID and sale date, and the last line is the correct >line. I want to compress all those adjustments and reversals lines out >of the data, so the outgoing data would look like this: > > custid saledt salepx > 111 8/1/2004 $75 > 111 9/1/2004 $50 > 112 10/1/2004 $20 > >In SAS, I use a proc summary step in SAS to accomplish this: > > proc summary data=d1; > class custid saledt; > var salepx; > output out=d2 sum=; > run; > >This is where I need help: how to do this step in R? > > >Step 3: > >I print a list of number of sales per customer ID, ranking the customer >IDs from most to least. I use a SAS proc freq step for this: > > proc freq data=d2 order=freq; > tables custid; > run; > >and the output would look like this: > > custid freq > 111 2 > 112 1 > > >Again, I have no idea how to do step 3 in R. > > >Thanks in advance! All help is welcome. Is this kind of work what R is >good at? > > > >-- Spencer Graves, PhD, Senior Development Engineer O: (408)938-4420; mobile: (408)655-4567
Fernando Henrique Ferraz P. da Rosa
2004-Sep-12 21:06 UTC
[R] SAS to R migration questions
Matthew Wilson writes:> (...) > Step 3: > > I print a list of number of sales per customer ID, ranking the customer > IDs from most to least. I use a SAS proc freq step for this: > > proc freq data=d2 order=freq; > tables custid; > run; > > and the output would look like this: > > custid freq > 111 2 > 112 1 > > > Again, I have no idea how to do step 3 in R.Provided that you are going to work with this data stored in a SQL database, it'll probably be more efficient to do this sort of manipulation directly in SQL. Anyways, the following lines of code in R will do what you described: cust <- c(111,111,112) cc <- data.frame(t(sapply(unique(cust),function(level,vec) { c(custid=level,freq=sum(vec==level)) },cust))) cc[order(cc$freq,decreasing=T),] Cheers, -- Fernando Henrique Ferraz P. da Rosa http://www.ime.usp.br/~feferraz
Fernando Henrique Ferraz P. da Rosa
2004-Sep-12 21:24 UTC
[R] SAS to R migration questions
Fernando Henrique Ferraz P. da Rosa writes:> cust <- c(111,111,112) > cc <- data.frame(t(sapply(unique(cust),function(level,vec) { > c(custid=level,freq=sum(vec==level)) },cust))) > cc[order(cc$freq,decreasing=T),]An even simpler solution: cc <- data.frame(table(cust)) cc[order(cc$Freq,decreasing=T),] -- Fernando Henrique Ferraz P. da Rosa http://www.ime.usp.br/~feferraz