thr3ads.net - R help - [R] Create new data frame with conditional sums [Oct 2023]

Jason Stout, M.D.

2023-Oct-14 14:16 UTC

[R] Create new data frame with conditional sums

That's very helpful and instructive, thank you!

Jason Stout, MD, MHS
Box 102359-DUMC
Durham, NC 27710
FAX 919-681-7494
________________________________
From: John Fox <jfox at mcmaster.ca>
Sent: Saturday, October 14, 2023 10:13 AM
To: Jason Stout, M.D. <jason.stout at duke.edu>
Cc: r-help at r-project.org <r-help at r-project.org>
Subject: Re: [R] Create new data frame with conditional sums

Dear Jason,

I don't think that there's anything wrong with using a loop to solve
this problem, but it's generally a good idea to pre-allocate space for
the result rather than build it up one value at a time, which may cause
unnecessary copying of the object.

Here are three solutions:

f1 <- function(Cutoff, Pct, Totpop){
   Pop <- numeric(0)
   for (i in seq_along(Cutoff))
     Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
   cbind(Cutoff, Pop)
}

f2 <- function(Cutoff, Pct, Totpop){
   Pop <- numeric(length(Cutoff))
   for (i in seq_along(Cutoff))
     Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
   cbind(Cutoff, Pop)
}

f3 <- function(Cutoff, Pct, Totpop){
   Pop <- sapply(Cutoff, function(c) sum(Totpop[Pct >= c]))
   cbind(Cutoff, Pop)
}

The first is similar to yours; the second pre-allocates space for the
result but still uses a loop; and the third avoids the loop. All produce
the same result, for example,

 > with(dummydata, f3(seq(0, 0.15, by=0.01), Pct, Totpop))
       Cutoff   Pop
  [1,]   0.00 43800
  [2,]   0.01 43800
  [3,]   0.02 39300
  [4,]   0.03 39300
  [5,]   0.04 31000
  [6,]   0.05 26750
  [7,]   0.06 22750
  [8,]   0.07 17800
  [9,]   0.08 12700
[10,]   0.09 12700
[11,]   0.10  8000
[12,]   0.11  8000
[13,]   0.12  8000
[14,]   0.13  3900
[15,]   0.14  3900
[16,]   0.15  3900
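
Note that the three functions above return a matrix (the result of cbind()), whereas the original question asked for a data frame. A minimal sketch of one way to get a data frame directly is below; the helper name f2_df is hypothetical and not part of the thread, and it simply swaps cbind() for data.frame() in the pre-allocating version.

```r
# Toy data from the original question
dummydata <- data.frame(
  Tract  = 1:10,
  Pct    = c(0.05, 0.03, 0.01, 0.12, 0.21, 0.04, 0.07, 0.09, 0.06, 0.03),
  Totpop = c(4000, 3500, 4500, 4100, 3900, 4250, 5100, 4700, 4950, 4800)
)

# Hypothetical helper (an assumption, not from the thread): same computation
# as f2, but returning a data frame rather than a matrix.
f2_df <- function(Cutoff, Pct, Totpop) {
  Pop <- numeric(length(Cutoff))             # pre-allocate the result
  for (i in seq_along(Cutoff))
    Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])  # conditional sum for cutoff i
  data.frame(Cutoff = Cutoff, Pop = Pop)
}

result <- with(dummydata, f2_df(seq(0, 0.15, by = 0.01), Pct, Totpop))
str(result)
```

An existing matrix result can also be converted after the fact with as.data.frame().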

Here are some timings:

 > microbenchmark::microbenchmark(
+   preallocate=with(dummydata, f2(seq(0, 0.15, by=0.01),
+                                  Pct, Totpop)),
+   yourloop=with(dummydata, f1(seq(0, 0.15, by=0.01),
+                               Pct, Totpop)),
+   sapply=with(dummydata, f3(seq(0, 0.15, by=0.01),
+                             Pct, Totpop)),
+   times=1000
+ )
Unit: microseconds
         expr    min      lq     mean  median     uq    max neval cld
  preallocate 13.776 14.3910 15.74195 14.9240 16.318 56.908  1000 a
     yourloop 15.129 15.7645 17.26809 16.3795 18.368 73.964  1000  b
       sapply 22.304 23.2060 25.19868 24.1080 26.814 48.544  1000   c

So, for this very small problem, there are small but reliable
differences in timing among the three solutions, and the version that
avoids the loop is slowest. I suspect, but haven't verified, that for a
much larger problem, your solution would be slowest.
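
The suspicion about larger problems can be checked with a sketch along the following lines. The simulated data, the sizes (10,000 tracts, 2,001 cutoffs), and the seed are all assumptions chosen for illustration, and base R's system.time() is used here instead of microbenchmark so that no extra package is needed. Timings will vary by machine, so none are shown.

```r
# The three functions from the message above
f1 <- function(Cutoff, Pct, Totpop){
  Pop <- numeric(0)                       # grown one element at a time
  for (i in seq_along(Cutoff))
    Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
  cbind(Cutoff, Pop)
}
f2 <- function(Cutoff, Pct, Totpop){
  Pop <- numeric(length(Cutoff))          # pre-allocated
  for (i in seq_along(Cutoff))
    Pop[i] <- sum(Totpop[Pct >= Cutoff[i]])
  cbind(Cutoff, Pop)
}
f3 <- function(Cutoff, Pct, Totpop){
  Pop <- sapply(Cutoff, function(c) sum(Totpop[Pct >= c]))
  cbind(Cutoff, Pop)
}

# Simulated larger problem (sizes and distributions are assumptions)
set.seed(123)
n_tracts <- 1e4
Pct    <- runif(n_tracts, 0, 0.25)
Totpop <- round(runif(n_tracts, 3000, 6000))
Cutoff <- seq(0, 0.25, length.out = 2001)

t1 <- system.time(r1 <- f1(Cutoff, Pct, Totpop))
t2 <- system.time(r2 <- f2(Cutoff, Pct, Totpop))
t3 <- system.time(r3 <- f3(Cutoff, Pct, Totpop))

# All three should agree; then compare elapsed seconds
same <- isTRUE(all.equal(r1, r2)) && isTRUE(all.equal(r1, r3))
print(same)
print(c(yourloop    = t1[["elapsed"]],
        preallocate = t2[["elapsed"]],
        sapply      = t3[["elapsed"]]))
```

A single system.time() call is a coarse measurement; repeating the runs, or using microbenchmark as above, would give a more reliable comparison.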

I hope this helps,
  John

--
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
web: https://www.john-fox.ca/

On 2023-10-13 4:13 p.m., Jason Stout, M.D. wrote:
> Caution: External email.
>
>
> This seems like it should be simple but I can't get it to work
> properly.  I'm starting with a data frame like this:
>
> Tract   Pct    Totpop
> 1       0.05   4000
> 2       0.03   3500
> 3       0.01   4500
> 4       0.12   4100
> 5       0.21   3900
> 6       0.04   4250
> 7       0.07   5100
> 8       0.09   4700
> 9       0.06   4950
> 10      0.03   4800
>
> And I want to end up with a data frame with two columns, a
> "Cutoff" column that is a simple sequence of equally spaced cutoffs
> (let's say in this case from 0 to 0.15 by 0.01) and a "Pop" column
> which equals the sum of "Totpop" over the rows of the prior data frame
> in which "Pct" is greater than or equal to "Cutoff."  So in this toy
> example, this is what I want for a result:
>
>     Cutoff   Pop
> 1    0.00 43800
> 2    0.01 43800
> 3    0.02 39300
> 4    0.03 39300
> 5    0.04 31000
> 6    0.05 26750
> 7    0.06 22750
> 8    0.07 17800
> 9    0.08 12700
> 10   0.09 12700
> 11   0.10  8000
> 12   0.11  8000
> 13   0.12  8000
> 14   0.13  3900
> 15   0.14  3900
> 16   0.15  3900
>
> I can do this with a for loop, but it seems there should be an easier,
> vectorized way that would be more efficient.  Here is a reproducible
> example:
>
>
> dummydata <- data.frame(
>   Tract  = seq(1, 10, by = 1),
>   Pct    = c(0.05, 0.03, 0.01, 0.12, 0.21, 0.04, 0.07, 0.09, 0.06, 0.03),
>   Totpop = c(4000, 3500, 4500, 4100, 3900, 4250, 5100, 4700, 4950, 4800)
> )
>
> dfrm <- data.frame(matrix(ncol = 2, nrow = 0,
>                           dimnames = list(NULL, c("Cutoff", "Pop"))))
> for (i in seq(0, 0.15, by = 0.01)) {
>   temp <- sum(dummydata[dummydata$Pct >= i, "Totpop"])
>   dfrm[nrow(dfrm) + 1, ] <- c(i, temp)
> }
>
> Jason Stout, MD, MHS
> Division of Infectious Diseases
> Dept of Medicine
> Duke University
> Box 102359-DUMC
> Durham, NC 27710
> FAX 919-681-7494

Bert Gunter

2023-Oct-15 14:29 UTC


[R] Create new data frame with conditional sums

Under the hood, sapply() is also a loop (at the interpreted level), as
are lapply() and the other apply-family functions.
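
For comparison, here is a sketch of one way to avoid an R-level loop altogether. This is not from the thread; it is an illustrative alternative using outer() and colSums(), and it assumes the full comparison matrix (one row per tract, one column per cutoff) fits in memory.

```r
# Toy data from the original question
dummydata <- data.frame(
  Tract  = 1:10,
  Pct    = c(0.05, 0.03, 0.01, 0.12, 0.21, 0.04, 0.07, 0.09, 0.06, 0.03),
  Totpop = c(4000, 3500, 4500, 4100, 3900, 4250, 5100, 4700, 4950, 4800)
)
Cutoff <- seq(0, 0.15, by = 0.01)

# outer() builds a logical matrix: entry [i, j] is TRUE when tract i has
# Pct >= Cutoff[j]. Multiplying by Totpop recycles it down each column,
# so colSums() gives the conditional sum for each cutoff.
Pop_outer <- with(dummydata,
                  colSums(outer(Pct, Cutoff, ">=") * Totpop))

# Reference result from the sapply() version, for checking
Pop_sapply <- with(dummydata,
                   sapply(Cutoff, function(c) sum(Totpop[Pct >= c])))

agree <- isTRUE(all.equal(Pop_outer, Pop_sapply))
print(agree)
print(data.frame(Cutoff = Cutoff, Pop = Pop_outer))
```

The trade-off is memory: the intermediate matrix has length(Pct) * length(Cutoff) entries, so for very large inputs the loop-based versions may be the more practical choice.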

-- Bert

