ANGELO.LINARDI at bancaditalia.it
2008-Jun-06 10:12 UTC
[R] Agreggating data using external aggregation rules
Dear R experts, I am currently facing a tricky problem which I have read a lot about in the various R mailing lists without finding exactly what I need. I have a big data frame DF (about 2,000,000 rows) with 7 columns being variables and 1 being a measure (using reshape package nomeclature). There are no "duplicates" in it. Fot each of the variables I have some "rules" to apply, being COD_IN the value of the variable in the DF, COD_OUT the one to be transformed to; once obtained the "new codes" in the DF I have to aggregate the "new DF" (for example summing the measure). Usually the total transformation (merge+aggregate) really decreases the number of lines in the data frame, but sometimes it can grows depending on the rule. Just to give an idea, the first "rule" in v1 maps 820 different values into 7 ones. Using SQL and a database this can be done in a very straightforward way (for example on the variable v1): Select COD_OUT, v2, v3, v4, v5, v6, v7, sum(measure)>From DF, RULE_v1Where v1=COD_IN Group by v2, v3,v4, v5, v6, v7 So the first choice would be using a database; the second one would be splitting the data frame and then joining the results. Is there any other possibility to merge+aggregate caused by the merge ? Thank you in advance Angelo Linardi ** Le e-mail provenienti dalla Banca d'Italia sono trasmesse in buona fede e non comportano alcun vincolo ne' creano obblighi per la Banca stessa, salvo che cio' non sia espressamente previsto da un accordo scritto. Questa e-mail e' confidenziale. Qualora l'avesse ricevuta per errore, La preghiamo di comunicarne via e-mail la ricezione al mittente e di distruggerne il contenuto. La informiamo inoltre che l'utilizzo non autorizzato del messaggio o dei suoi allegati potrebbe costituire reato. Grazie per la collaborazione. -- E-mails from the Bank of Italy are sent in good faith but they are neither binding on the Bank nor to be understood as creating any obligation on its part except where provided for in a written agreement. This e-mail is confidential. If you have received it by mistake, please inform the sender by reply e-mail and delete it from your system. Please also note that the unauthorized disclosure or use of the message or any attachments could be an offence. Thank you for your cooperation. **
Gabor Grothendieck
2008-Jun-06 13:10 UTC
[R] Agreggating data using external aggregation rules
Use aggregate() for aggregation and use indexing or subset() for selection. Alternately try the sqldf package: http://sqldf.googlecode.com which allows one to perform SQL operations on data frames. On Fri, Jun 6, 2008 at 6:12 AM, <ANGELO.LINARDI at bancaditalia.it> wrote:> Dear R experts, > > I am currently facing a tricky problem which I have read a lot about in > the various R mailing lists without finding exactly what I need. > I have a big data frame DF (about 2,000,000 rows) with 7 columns being > variables and 1 being a measure (using reshape package nomeclature). > There are no "duplicates" in it. > Fot each of the variables I have some "rules" to apply, being COD_IN the > value of the variable in the DF, COD_OUT the one to be transformed to; > once obtained the "new codes" in the DF I have to aggregate the "new DF" > (for example summing the measure). > Usually the total transformation (merge+aggregate) really decreases the > number of lines in the data frame, but sometimes it can grows depending > on the rule. Just to give an idea, the first "rule" in v1 maps 820 > different values into 7 ones. > Using SQL and a database this can be done in a very straightforward way > (for example on the variable v1): > > Select COD_OUT, v2, v3, v4, v5, v6, v7, sum(measure) > >From DF, RULE_v1 > Where v1=COD_IN > Group by v2, v3,v4, v5, v6, v7 > > So the first choice would be using a database; the second one would be > splitting the data frame and then joining the results. > Is there any other possibility to merge+aggregate caused by the merge ? > > Thank you in advance > > Angelo Linardi > > > > ** Le e-mail provenienti dalla Banca d'Italia sono trasmesse in buona fede e non > comportano alcun vincolo ne' creano obblighi per la Banca stessa, salvo che cio' non > sia espressamente previsto da un accordo scritto. > Questa e-mail e' confidenziale. Qualora l'avesse ricevuta per errore, La preghiamo di > comunicarne via e-mail la ricezione al mittente e di distruggerne il contenuto. La > informiamo inoltre che l'utilizzo non autorizzato del messaggio o dei suoi allegati > potrebbe costituire reato. Grazie per la collaborazione. > -- E-mails from the Bank of Italy are sent in good faith but they are neither binding on > the Bank nor to be understood as creating any obligation on its part except where > provided for in a written agreement. This e-mail is confidential. If you have received it > by mistake, please inform the sender by reply e-mail and delete it from your system. > Please also note that the unauthorized disclosure or use of the message or any > attachments could be an offence. Thank you for your cooperation. ** > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >