Folks,

I have the following code, which works fine on smaller data sets. For larger datasets, it runs out of memory and runs far too slowly, because we are essentially creating large vectors with rep() and then calling median() on them. (I learned this approach from a post on the web.)

Below that, I have written the corresponding SAS code. The SAS code works fast because I can just tell the proc summary (by the weights option) that the Counts variable is a frequency.

So, the question is: is there a simple way to do the same thing in R? I have to run this on a large dataset -- for a small set it is not a problem.

---------------------- Begin R code ------------------------------------
N <- 1005 * 14;
myNorm <- data.frame(PaydexNormingCategory = numeric(N),
                     SIC = numeric(N), CatMedian = numeric(N));

k=1;
#j = 7941; ## For testing only
for (j in levels(SIC)){
  for (i in levels(PaydexNormingCategory)){
    myData <- dfpaydex[(Paydex==i) & (SIC==j),];
    myMedian <- with(myData, levels(Paydex)[median(rep(as.numeric(Paydex), Counts))]);
    myNorm[k] <-c( as.numeric(i), as.numeric(j), as.numeric(myMedian) );
    k <- k+1;
  }
}

---------------------- Begin SAS code ------------------------------------

proc summary data=SASUser.PaydexNormfull nway;

  class PaydexNormingCategory SIC ;
  weight Counts;
  var Paydex;

  output out=outstat (drop=_type_ _freq_)
    median= / autoname;
run;

---------------------- End SAS code ------------------------------------

Thanks for your guidance!

Vivek Satsangi
GE Capital Americas
On Nov 18, 2009, at 4:55 PM, Satsangi, Vivek (GE Capital) wrote:

> So, the question is, is there a simple way to do the same thing in
> R? I have to run this on a large dataset -- for a small set it is
> not a problem.

I'm not sure, and I see no reproducible dataset (that I recognize), but Harrell's Hmisc::wtd.quantile might be an alternative approach.
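For what it's worth, the median of rep(x, counts) can be read off the cumulative counts without ever building the expanded vector. Below is a minimal base-R sketch of that idea. The function name freq_median and the toy vectors are made up for illustration (they are not from the original post or from Hmisc), and it assumes the counts are non-negative whole numbers.

```r
# Hypothetical helper: frequency-weighted median.
# Gives the same answer as median(rep(x, counts)) for non-negative integer
# counts, but only sorts the distinct values and accumulates their counts.
freq_median <- function(x, counts) {
  keep <- counts > 0                      # zero-count values never appear
  x <- x[keep]
  counts <- counts[keep]
  if (length(x) == 0) return(NA_real_)    # nothing to summarise

  ord <- order(x)
  x <- x[ord]
  cum <- cumsum(as.numeric(counts[ord]))  # as.numeric avoids integer overflow
  n <- cum[length(cum)]                   # length of the (virtual) expanded vector

  # 1-based positions of the middle observation(s) in the expanded vector;
  # they coincide when n is odd.
  lo <- floor((n + 1) / 2)
  hi <- ceiling((n + 1) / 2)

  # The value at position p is the first sorted value whose cumulative
  # count reaches p.
  mean(c(x[which(cum >= lo)[1]], x[which(cum >= hi)[1]]))
}

# Toy data, invented for this example
x      <- c(10, 20, 30, 40)
counts <- c(3, 1, 2, 4)

freq_median(x, counts)    # 30
median(rep(x, counts))    # 30, the slow way, for comparison
```

Memory use grows with the number of distinct values rather than with sum(counts), which is where the rep() approach runs into trouble.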
You could use S+. Its median function has a weights argument. E.g.,

   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8))
   [1] 3
   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8+10))
   [1] 40000
   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8+1))
   [1] 20001.5

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Satsangi,
> Vivek (GE Capital)
> Sent: Wednesday, November 18, 2009 1:55 PM
> To: r-help at r-project.org
> Subject: [R] Median on Aggregated data
>
> So, the question is, is there a simple way to do the same
> thing in R? I have to run this on a large dataset -- for a
> small set it is not a problem.
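The S+ results above behave like median(rep(x, weights)), so they can probably be reproduced in plain R from cumulative counts and then applied once per PaydexNormingCategory/SIC group instead of looping over levels. The following is only a sketch under assumptions: freq_median is a hypothetical helper (not an existing R or S+ function), the small dfpaydex data frame is invented, Paydex is taken to be numeric rather than a factor, and every group is assumed to have at least one positive count.

```r
# Hypothetical helper: median of rep(x, counts) without expanding the data.
# Assumes non-negative integer counts with a positive total.
freq_median <- function(x, counts) {
  ord <- order(x)
  x <- x[ord]
  cum <- cumsum(as.numeric(counts[ord]))
  n <- cum[length(cum)]
  # Middle position(s) of the virtual expanded vector
  mid <- c(floor((n + 1) / 2), ceiling((n + 1) / 2))
  # findInterval() counts how many cumulative totals fall below each
  # position; adding 1 gives the index of the value at that position.
  mean(x[findInterval(mid - 0.5, cum) + 1])
}

# The three S+ examples quoted above
freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8))       # 3
freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8 + 10))  # 40000
freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8 + 1))   # 20001.5

# Invented aggregated data using the column names from the original post
dfpaydex <- data.frame(
  PaydexNormingCategory = c(1, 1, 1, 2, 2),
  SIC                   = c(7941, 7941, 7941, 7941, 7941),
  Paydex                = c(50, 70, 80, 60, 90),
  Counts                = c(2, 5, 1, 3, 3)
)

# One weighted median per (PaydexNormingCategory, SIC) combination;
# drop = TRUE skips combinations that do not occur in the data.
groups <- split(dfpaydex,
                list(dfpaydex$PaydexNormingCategory, dfpaydex$SIC),
                drop = TRUE)
myNorm <- do.call(rbind, lapply(groups, function(g) {
  data.frame(PaydexNormingCategory = g$PaydexNormingCategory[1],
             SIC                   = g$SIC[1],
             CatMedian             = freq_median(g$Paydex, g$Counts))
}))
myNorm$CatMedian   # 70 75
```

Splitting the data once avoids re-scanning the whole data frame for every pair of levels, which the nested loop in the original post does.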