thr3ads.net - R help - [R] Median on Aggregated data [Nov 2009]

Satsangi, Vivek (GE Capital)

2009-Nov-18 21:55 UTC

[R] Median on Aggregated data

Folks,
 
I have the following code, which works fine on smaller data sets. For
larger datasets, it runs out of memory and runs far too slowly, because
we are essentially creating large vectors with rep() and then calling
median() on them. (I learned this approach from a post on the web.)
 
Below that, I have written the corresponding SAS code. The SAS code
works fast because I can just tell proc summary (via the weight
statement) that the Counts variable is a frequency.
 
So, the question is, is there a simple way to do the same thing in R? I
have to run this on a large dataset -- for a small set it is not a
problem.
 
 
---------------------- Begin R code ------------------------------------
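N <- 1005 * 14
myNorm <- data.frame(PaydexNormingCategory = numeric(N),
    SIC = numeric(N), CatMedian = numeric(N))

k <- 1
# j <- 7941  ## For testing only
for (j in levels(dfpaydex$SIC)) {
  for (i in levels(dfpaydex$PaydexNormingCategory)) {
    # Rows for this norming category within this SIC
    myData <- dfpaydex[(dfpaydex$PaydexNormingCategory == i) &
                       (dfpaydex$SIC == j), ]
    # Expand each Paydex level code by its count, then take the median
    myMedian <- with(myData,
        levels(Paydex)[median(rep(as.numeric(Paydex), Counts))])
    # Store one result row per (category, SIC) pair
    myNorm[k, ] <- c(as.numeric(i), as.numeric(j), as.numeric(myMedian))
    k <- k + 1
  }
}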
 
---------------------- Begin SAS code ------------------------------------

proc summary data=SASUser.PaydexNormfull nway; 

   class PaydexNormingCategory SIC ;
   weight Counts;
  var Paydex;

 output out=outstat (drop=_type_ _freq_)    
        median= / autoname;                   
 run;

---------------------- End SAS code ------------------------------------

Thanks for your guidance!


Vivek Satsangi
GE Capital
Americas

GE imagination at work



David Winsemius

2009-Nov-18 22:12 UTC

[R] Median on Aggregated data

Not sure, and I see no reproducible dataset (that I recognize), but
Harrell's Hmisc::wtd.quantile might be an alternative approach.
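
For what it is worth, here is a minimal sketch of how wtd.quantile
could be applied to counts like these. The data frame `toy` and its
values are made up for illustration (the original dfpaydex was not
posted), and Paydex is assumed to be numeric rather than a factor.
With integer counts as weights and the default type, the result should
agree with median(rep(x, counts)) without ever building the expanded
vector.

```r
# Hypothetical toy data standing in for dfpaydex (not from the original post)
toy <- data.frame(
  PaydexNormingCategory = c(1, 1, 1, 2, 2),
  SIC    = c(7941, 7941, 7941, 7941, 7941),
  Paydex = c(50, 70, 80, 40, 90),
  Counts = c(3, 1, 1, 2, 5)
)

# Guarded so the sketch does nothing if Hmisc is not installed
if (requireNamespace("Hmisc", quietly = TRUE)) {
  # One data frame per (PaydexNormingCategory, SIC) cell
  cells <- split(toy, list(toy$PaydexNormingCategory, toy$SIC), drop = TRUE)

  # Weighted median per cell, treating Counts as frequencies
  wtd_med <- sapply(cells, function(d)
    unname(Hmisc::wtd.quantile(d$Paydex, weights = d$Counts, probs = 0.5)))

  # The memory-hungry approach from the question, for comparison
  rep_med <- sapply(cells, function(d) median(rep(d$Paydex, d$Counts)))

  print(cbind(wtd_med, rep_med))
}
```

The split() call replaces the nested for loops over levels; each cell
is handled independently, so empty combinations are simply dropped
(drop = TRUE) instead of producing NA rows.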


William Dunlap

2009-Nov-18 22:20 UTC

[R] Median on Aggregated data

You could use S+.  Its median function has
a weights argument.  E.g.,
   > median(c(1,2,3,4e4), weights=c(1e8,1e8,1,2e8))
   [1] 3
   > median(c(1,2,3,4e4),  weights=c(1e8,1e8,1,2e8+10))
   [1] 40000
   > median(c(1,2,3,4e4),  weights=c(1e8,1e8,1,2e8+1))
   [1] 20001.5
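
For readers without S+, the same idea can be sketched in base R. The
helper below is hypothetical (base R's median() has no weights
argument); it assumes non-negative integer counts and no missing
values, and it averages the two middle values when the total count is
even. It works from cumulative counts, so nothing of length sum(w) is
ever allocated.

```r
# Hypothetical helper (not part of base R): median of x where each x[i]
# occurs w[i] times, computed from cumulative counts instead of rep().
freq_median <- function(x, w) {
  keep <- w > 0                 # values with zero count do not occur
  x <- x[keep]; w <- w[keep]
  o <- order(x)                 # sort values, carrying counts along
  x <- x[o]; w <- w[o]
  n  <- sum(w)                  # size of the (virtual) expanded vector
  cw <- cumsum(w)
  # 1-based positions of the middle element(s) in the expanded data;
  # lo == hi when n is odd
  lo <- floor((n + 1) / 2)
  hi <- ceiling((n + 1) / 2)
  # First value whose cumulative count reaches each position
  (x[which(cw >= lo)[1]] + x[which(cw >= hi)[1]]) / 2
}

# The three calls from the S+ example above
freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8))
freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8 + 10))
freq_median(c(1, 2, 3, 4e4), c(1e8, 1e8, 1, 2e8 + 1))
```

These three calls should give the same 3, 40000 and 20001.5 as the S+
session, and on small inputs freq_median(x, w) can be checked directly
against median(rep(x, w)).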

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  
