Delcour Libertus
2007-Mar-24 17:35 UTC
[R] Two Problems while trying to aggregate a dataframe
Hello! Given is an Excel-Sheet with actually 11,000 rows and 9 columns. I want to work with the data in R. The contents are similar to my following example. I have a list with ID-number, personal name and two kinds of loan-values. I want to aggregate the list, that for each person only one row remains and where the loan-values are added. First I tried some commands with tapply but had no success at all. Then I found in this mailing list a hint for aggregate (though I did not understand most of that mail). So I made some efforts with aggregate() and it seems to lead the right way: [code]> atest <- read.csv2 ("aggregatetest.csv") > str(atest)`data.frame': 10 obs. of 4 variables: $ PrsNr : int 1 2 2 3 4 5 6 6 6 7 $ Namen : Factor w/ 7 levels "Holla","Mabba",..: 1 2 2 4 5 6 7 7 7 3 $ Betrag1: num 1.99 2.34 5.23 4.23 2.23 2.77 3.83 2.76 6.32 2.88 $ Betrag2: num 3.44 5.32 5.21 9.12 7.32 8.32 6.99 4.45 5.34 3.81> atestPrsNr Namen Betrag1 Betrag2 1 1 Holla 1.99 3.44 2 2 Mabba 2.34 5.32 3 2 Mabba 5.23 5.21 4 3 Pisa 4.23 9.12 5 4 Pulla 2.23 7.32 6 5 Raba 2.77 8.32 7 6 Saba 3.83 6.99 8 6 Saba 2.76 4.45 9 6 Saba 6.32 5.34 10 7 Mulla 2.88 3.81> aggregate(list(Betrag1=atest$Betrag1), by=list(PsrNr=atest$PrsNr,Namen=atest$Namen), sum) PsrNr Namen Betrag1 1 1 Holla 1.99 2 2 Mabba 7.57 3 7 Mulla 2.88 4 3 Pisa 4.23 5 4 Pulla 2.23 6 5 Raba 2.77 7 6 Saba 12.91 [/code] The result is nearly that I want. First problem: How do I get all columnss in my result. "Betrag2" is missing. Second problem: If I use the aggregate-command on the real data then it is for me impossible to use more than on by-grouping variable (my example above has two). Impossible because 1 GB RAM and 1.5 GB SWAP are not enough to process my command. My computer (Ubuntu Linux, Gmome) freezes. So I doubt wether I use the appropriate method to follow my target. Which ist the best way to aggregate dataframes as I want? Are there any better functions/commands or do I have to learn programming for this? Greetings Delcour
On 3/24/07, Delcour Libertus <delcour.libertus@gmail.com> wrote:> > Hello! > > Given is an Excel-Sheet with actually 11,000 rows and 9 columns. I want > to work with the data in R. The contents are similar to my following > example. > > I have a list with ID-number, personal name and two kinds of > loan-values. I want to aggregate the list, that for each person only one > row remains and where the loan-values are added. > > First I tried some commands with tapply but had no success at all. Then > I found in this mailing list a hint for aggregate (though I did not > understand most of that mail). > > So I made some efforts with aggregate() and it seems to lead the right > way: > > [code] > > atest <- read.csv2 ("aggregatetest.csv") > > str(atest) > `data.frame': 10 obs. of 4 variables: > $ PrsNr : int 1 2 2 3 4 5 6 6 6 7 > $ Namen : Factor w/ 7 levels "Holla","Mabba",..: 1 2 2 4 5 6 7 7 7 3 > $ Betrag1: num 1.99 2.34 5.23 4.23 2.23 2.77 3.83 2.76 6.32 2.88 > $ Betrag2: num 3.44 5.32 5.21 9.12 7.32 8.32 6.99 4.45 5.34 3.81 > > atest > PrsNr Namen Betrag1 Betrag2 > 1 1 Holla 1.99 3.44 > 2 2 Mabba 2.34 5.32 > 3 2 Mabba 5.23 5.21 > 4 3 Pisa 4.23 9.12 > 5 4 Pulla 2.23 7.32 > 6 5 Raba 2.77 8.32 > 7 6 Saba 3.83 6.99 > 8 6 Saba 2.76 4.45 > 9 6 Saba 6.32 5.34 > 10 7 Mulla 2.88 3.81 > > aggregate(list(Betrag1=atest$Betrag1), by=list(PsrNr=atest$PrsNr, > Namen=atest$Namen), sum) > PsrNr Namen Betrag1 > 1 1 Holla 1.99 > 2 2 Mabba 7.57 > 3 7 Mulla 2.88 > 4 3 Pisa 4.23 > 5 4 Pulla 2.23 > 6 5 Raba 2.77 > 7 6 Saba 12.91 > [/code] > > The result is nearly that I want. > > First problem: > > How do I get all columnss in my result. "Betrag2" is missing.It really doesn't make sense to have the column Betrag2. For example, for 'Saba' which of the three values to you want? If you want the sum of that column also, then::> aggregate(list(Betrag1=x.in$Betrag1, Betrag2=x.in$Betrag2),+ list(PrsNr=x.in$PrsNr, Namen=x.in$Namen), sum) PrsNr Namen Betrag1 Betrag2 1 1 Holla 1.99 3.44 2 2 Mabba 7.57 10.53 3 7 Mulla 2.88 3.81 4 3 Pisa 4.23 9.12 5 4 Pulla 2.23 7.32 6 5 Raba 2.77 8.32 7 6 Saba 12.91 16.78 would do it. Second problem:> > If I use the aggregate-command on the real data then it is for me > impossible to use more than on by-grouping variable (my example above > has two). Impossible because 1 GB RAM and 1.5 GB SWAP are not enough to > process my command. My computer (Ubuntu Linux, Gmome) freezes. So I > doubt wether I use the appropriate method to follow my target.A data.frame 11000x9 is not very large for processing in R. If most of the columns are numeric, this is less than 1MB. You need to provide exactly what you are doing, since I process dataframe with 100,000 row and 10 columns without any problems Here is a quick test of creating a 11000x9 data frame and then aggregating the columns; takes less than a second.> n <- 11000 > x <- data.frame(id=sample(1:20,n,T),runif(n), runif(n), runif(n),runif(n), + runif(n), runif(n), runif(n), runif(n))> str(x)'data.frame': 11000 obs. of 9 variables: $ id : int 10 7 2 7 19 10 20 18 5 5 ... $ runif.n. : num 0.590 0.960 0.493 0.497 0.746 ... $ runif.n..1: num 0.5058 0.0354 0.9629 0.0861 0.5001 ... $ runif.n..2: num 0.693 0.389 0.277 0.731 0.726 ... $ runif.n..3: num 0.1442 0.0585 0.5610 0.4206 0.4768 ... $ runif.n..4: num 0.7917 0.0351 0.5035 0.2469 0.1013 ... $ runif.n..5: num 0.259 0.157 0.492 0.264 0.502 ... $ runif.n..6: num 0.769 0.236 0.295 0.138 0.776 ... $ runif.n..7: num 0.278 0.613 0.219 0.199 0.689 ...> object.size(x)[1] 748984> system.time(print(aggregate(x[2:9], list(x$id), sum)))Group.1 runif.n. runif.n..1 runif.n..2 runif.n..3 runif.n..4 runif.n..5 runif.n..6 runif.n..7 1 1 292.8513 289.6298 290.0143 293.4324 293.3280 284.4095 295.2547 290.5937 2 2 262.3319 273.1242 271.9370 269.8798 277.1156 273.3304 282.2954 267.8959 3 3 289.1224 273.2288 270.3815 278.3573 269.7454 267.2329 269.0993 277.7569 4 4 284.3488 284.9606 281.8343 278.8463 281.6107 277.7917 279.8643 286.0857 5 5 285.0568 281.0035 286.2571 283.1572 279.0046 275.8996 288.9442 268.2956 6 6 271.3570 254.0755 273.3773 264.6663 260.6819 264.5135 263.4219 273.2450 7 7 260.7022 269.0428 266.2272 273.6763 273.4340 258.7884 265.0706 267.7709 8 8 272.3414 264.2267 276.7643 271.0522 266.9568 268.4925 276.1819 270.2247 9 9 303.8744 302.0581 283.6805 292.9412 305.3064 291.2466 308.8895 291.9919 10 10 282.4658 265.5525 265.3477 278.4066 267.9801 272.9873 267.4629 276.1851 11 11 272.4771 256.4079 255.4102 261.3409 264.8875 265.5583 274.0994 253.7260 12 12 293.0385 281.3208 288.6191 268.4726 299.9310 283.5328 267.8157 279.8099 13 13 287.5422 278.6447 286.0739 268.9741 268.8262 288.2162 290.8423 284.9556 14 14 266.3838 261.6319 260.5097 251.5177 257.8366 269.5637 261.1484 262.0399 15 15 269.9839 277.2644 278.5290 266.0161 268.4113 268.2144 272.2928 277.3032 16 16 275.6890 279.9139 263.0586 271.9163 267.7920 272.8441 259.7067 267.1122 17 17 287.0234 284.0590 283.1442 284.7422 293.1614 286.7737 291.3513 280.7318 18 18 264.9676 272.7865 265.7336 259.1395 266.8845 269.0353 262.1133 272.1571 19 19 283.5144 292.2356 268.5373 270.9601 280.9630 275.0372 264.3380 263.1144 20 20 276.5350 287.7986 266.6976 294.2297 277.7174 278.0096 286.9167 271.0283 [1] 0.08 0.32 0.41 NA NA> >Which ist the best way to aggregate dataframes as I want? Are there any> better functions/commands or do I have to learn programming for this? > > Greetings > > Delcour > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]]
Gabor Grothendieck
2007-Mar-24 18:17 UTC
[R] Two Problems while trying to aggregate a dataframe
Try this: aggregate(atest[3:4], atest[1:2], sum) Use a data base and SQL is you don't otherwise have enough computer resources. On 3/24/07, Delcour Libertus <delcour.libertus at gmail.com> wrote:> Hello! > > Given is an Excel-Sheet with actually 11,000 rows and 9 columns. I want > to work with the data in R. The contents are similar to my following > example. > > I have a list with ID-number, personal name and two kinds of > loan-values. I want to aggregate the list, that for each person only one > row remains and where the loan-values are added. > > First I tried some commands with tapply but had no success at all. Then > I found in this mailing list a hint for aggregate (though I did not > understand most of that mail). > > So I made some efforts with aggregate() and it seems to lead the right way: > > [code] > > atest <- read.csv2 ("aggregatetest.csv") > > str(atest) > `data.frame': 10 obs. of 4 variables: > $ PrsNr : int 1 2 2 3 4 5 6 6 6 7 > $ Namen : Factor w/ 7 levels "Holla","Mabba",..: 1 2 2 4 5 6 7 7 7 3 > $ Betrag1: num 1.99 2.34 5.23 4.23 2.23 2.77 3.83 2.76 6.32 2.88 > $ Betrag2: num 3.44 5.32 5.21 9.12 7.32 8.32 6.99 4.45 5.34 3.81 > > atest > PrsNr Namen Betrag1 Betrag2 > 1 1 Holla 1.99 3.44 > 2 2 Mabba 2.34 5.32 > 3 2 Mabba 5.23 5.21 > 4 3 Pisa 4.23 9.12 > 5 4 Pulla 2.23 7.32 > 6 5 Raba 2.77 8.32 > 7 6 Saba 3.83 6.99 > 8 6 Saba 2.76 4.45 > 9 6 Saba 6.32 5.34 > 10 7 Mulla 2.88 3.81 > > aggregate(list(Betrag1=atest$Betrag1), by=list(PsrNr=atest$PrsNr, > Namen=atest$Namen), sum) > PsrNr Namen Betrag1 > 1 1 Holla 1.99 > 2 2 Mabba 7.57 > 3 7 Mulla 2.88 > 4 3 Pisa 4.23 > 5 4 Pulla 2.23 > 6 5 Raba 2.77 > 7 6 Saba 12.91 > [/code] > > The result is nearly that I want. > > First problem: > > How do I get all columnss in my result. "Betrag2" is missing. > > Second problem: > > If I use the aggregate-command on the real data then it is for me > impossible to use more than on by-grouping variable (my example above > has two). Impossible because 1 GB RAM and 1.5 GB SWAP are not enough to > process my command. My computer (Ubuntu Linux, Gmome) freezes. So I > doubt wether I use the appropriate method to follow my target. > > Which ist the best way to aggregate dataframes as I want? Are there any > better functions/commands or do I have to learn programming for this? > > Greetings > > Delcour > > ______________________________________________ > R-help at stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >