Rui Barradas
2021-Jan-28 20:21 UTC
[R] How to generate SE for the proportion value using a randomization process in R?
Hello, I don't know why you would need to see the indices but rewrite the function bootprop as bootprop_ind <- function(data, index){ d <- data[index, ] #sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE) index } and call in the same way. It will now return a matrix of indices with R = 1000 rows and 19 columns. Hope this helps, Rui Barradas ?s 19:29 de 28/01/21, Marna Wagley escreveu:> Hi Rui, > I am sorry for asking you several questions. > > In the given example, randomizations (reshuffle) were done 1000 times, > and its 1000 proportion values (results) are stored and it can be seen > using b$t; but I was wondering how the table was randomized (which rows > have been missed/or repeated in each randomizing procedure?). > > Is there any way we can see the randomized table and its associated > results? Here in this example, I randomized (or bootstrapped) the table > into three times (R=3) so I would like to store these three tables and > look at them later to know which rows were repeated/missed. Is there any > possibility? > The example data and the code is given below. > > Thank you for your help. > > #### > library(boot) > dat<-structure(list(Sample = structure(c(1L, 12L, 13L, 14L, 15L, 16L, > 17L, 18L, 19L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("id1", > "id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17", > "id18", "id19", "Id2", "id3", "id4", "id5", "id6", "id7", "id8", > "id9"), class = "factor"), Time1 = c(0L, 1L, 1L, 1L, 0L, 0L, > 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), Time2 = c(1L, > 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, > 1L, 1L)), .Names = c("Sample", "Time1", "Time2"), class = "data.frame", > row.names = c(NA, > -19L)) > daT<-data.frame(dat %>% > ? mutate(Time1.but.not.in.Time2 = case_when( > ? ? ? ? ? ? Time1 %in% "1" & Time2 %in% "0" ?~ "1"), > Time2.but.not.in.Time1 = case_when( > ? ? ? ? ? ? Time1 %in% "0" & Time2 %in% "1" ?~ "1"), > ?BothTimes = case_when( > ? ? ? ? ? ? Time1 %in% "1" & Time2 %in% "1" ?~ "1"))) > cols.num <- c("Time1.but.not.in.Time2","Time2.but.not.in.Time1", > "BothTimes") > daT[cols.num] <- sapply(daT[cols.num],as.numeric) > summary(daT) > > bootprop <- function(data, index){ > ? ?d <- data[index, ] > ? ?sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE) > } > > R <- 3 > set.seed(2020) > b <- boot(daT, bootprop, R) > b > b$t0? ? ?# original > b$t > sd(b$t)? # bootstrapped estimate of the SE of the sample prop. > hist(b$t, freq = FALSE) > > str(b) > b$data > b$seed > b$sim > b$strata > ################ > > > On Sat, Jan 23, 2021 at 12:36 AM Marna Wagley <marna.wagley at gmail.com > <mailto:marna.wagley at gmail.com>> wrote: > > Yes Rui, I can see we don't need to divide by square root of sample > size. The example is great to understand it. > Thank you. > Marna > > > On Sat, Jan 23, 2021 at 12:28 AM Rui Barradas <ruipbarradas at sapo.pt > <mailto:ruipbarradas at sapo.pt>> wrote: > > Hello, > > Inline. > > ?s 07:47 de 23/01/21, Marna Wagley escreveu: > > Dear Rui, > > I was wondering whether we have to square root of SD to find > SE, right? > > No, we don't. var already divides by n, don't divide again. > This is the code, that can be seen by running the function name > at a > command line. > > > sd > #function (x, na.rm = FALSE) > #sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), > #? ? na.rm = na.rm)) > #<bytecode: 0x55f3ce900848> > #<environment: namespace:stats> > > > > > > > bootprop <- function(data, index){ > >? ? ?d <- data[index, ] > >? ? ?sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], > na.rm = TRUE) > > } > > > > R <- 1e3 > > set.seed(2020) > > b <- boot(daT, bootprop, R) > > b > > b$t0? ? ?# original > > sd(b$t)? # bootstrapped estimate of the SE of the sample prop. > > sd(b$t)/sqrt(1000) > > pandit*(1-pandit) > > > > hist(b$t, freq = FALSE) > > > Try plotting the normal densities for both cases, the red line is > clearly wrong. > > > f <- function(x, xbar, s){ > ? ?dnorm(x, mean = xbar, sd = s) > } > > hist(b$t, freq = FALSE) > curve(f(x, xbar = b$t0, s = sd(b$t)), from = 0, to = 1, col > "blue", > add = TRUE) > curve(f(x, xbar = b$t0, s = sd(b$t)/sqrt(R)), from = 0, to = 1, > col > "red", add = TRUE) > > > Hope this helps, > > Rui Barradas > > > > > > > > > > > On Fri, Jan 22, 2021 at 3:07 PM Rui Barradas > <ruipbarradas at sapo.pt <mailto:ruipbarradas at sapo.pt> > > <mailto:ruipbarradas at sapo.pt <mailto:ruipbarradas at sapo.pt>>> > wrote: > > > >? ? ?Hello, > > > >? ? ?Something like this, using base package boot? > > > > > >? ? ?library(boot) > > > >? ? ?bootprop <- function(data, index){ > >? ? ? ? ?d <- data[index, ] > >? ? ? ? ?sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], > na.rm = TRUE) > >? ? ?} > > > >? ? ?R <- 1e3 > >? ? ?set.seed(2020) > >? ? ?b <- boot(daT, bootprop, R) > >? ? ?b > >? ? ?b$t0? ? ?# original > >? ? ?sd(b$t)? # bootstrapped estimate of the SE of the sample > prop. > >? ? ?hist(b$t, freq = FALSE) > > > > > >? ? ?Hope this helps, > > > >? ? ?Rui Barradas > > > >? ? ??s 21:57 de 22/01/21, Marna Wagley escreveu: > >? ? ? > Hi All, > >? ? ? > I was trying to estimate standard error (SE) for the > proportion > >? ? ?value using > >? ? ? > some kind of randomization process (bootstrapping or > jackknifing) > >? ? ?in R, but > >? ? ? > I could not figure it out. > >? ? ? > > >? ? ? > Is there any way to generate SE for the proportion? > >? ? ? > > >? ? ? > The example of the data and the code I am using is > attached for your > >? ? ? > reference. I would like to generate the value of > proportion with > >? ? ?a SE using > >? ? ? > a 1000 times randomization. > >? ? ? > > >? ? ? > dat<-structure(list(Sample = structure(c(1L, 12L, 13L, > 14L, 15L, 16L, > >? ? ? > 17L, 18L, 19L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, > 11L), .Label > >? ? ?= c("id1", > >? ? ? > "id10", "id11", "id12", "id13", "id14", "id15", > "id16", "id17", > >? ? ? > "id18", "id19", "Id2", "id3", "id4", "id5", "id6", > "id7", "id8", > >? ? ? > "id9"), class = "factor"), Time1 = c(0L, 1L, 1L, 1L, > 0L, 0L, > >? ? ? > 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), > Time2 = c(1L, > >? ? ? > 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, > 0L, 1L, 0L, > >? ? ? > 1L, 1L)), .Names = c("Sample", "Time1", "Time2"), class > >? ? ?"data.frame", > >? ? ? > row.names = c(NA, > >? ? ? > -19L)) > >? ? ? > daT<-data.frame(dat %>% > >? ? ? >? ? mutate(Time1.but.not.in.Time2 = case_when( > >? ? ? >? ? ? ? ? ? ? Time1 %in% "1" & Time2 %in% "0"? ~ "1"), > >? ? ? > Time2.but.not.in.Time1 = case_when( > >? ? ? >? ? ? ? ? ? ? Time1 %in% "0" & Time2 %in% "1"? ~ "1"), > >? ? ? >? ?BothTimes = case_when( > >? ? ? >? ? ? ? ? ? ? Time1 %in% "1" & Time2 %in% "1"? ~ "1"))) > >? ? ? >? ?daT > >? ? ? >? ?summary(daT) > >? ? ? > > >? ? ? > cols.num <- > c("Time1.but.not.in.Time2","Time2.but.not.in.Time1", > >? ? ? > "BothTimes") > >? ? ? > daT[cols.num] <- sapply(daT[cols.num],as.numeric) > >? ? ? > summary(daT) > >? ? ? > ProportionValue<-sum(daT$BothTimes, > na.rm=T)/sum(daT$Time1, na.rm=T) > >? ? ? > ProportionValue > >? ? ? > standard error?? > >? ? ? > > >? ? ? >? ? ? ?[[alternative HTML version deleted]] > >? ? ? > > >? ? ? > ______________________________________________ > >? ? ? > R-help at r-project.org <mailto:R-help at r-project.org> > <mailto:R-help at r-project.org <mailto:R-help at r-project.org>> > mailing list > >? ? ?-- To UNSUBSCRIBE and more, see > >? ? ? > https://stat.ethz.ch/mailman/listinfo/r-help > >? ? ? > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > >? ? ? > and provide commented, minimal, self-contained, > reproducible code. > >? ? ? > > > >
Marna Wagley
2021-Jan-28 20:29 UTC
[R] How to generate SE for the proportion value using a randomization process in R?
Thank you Rui, This is great. How about the following? SimilatedData<-boot.array(b, indices=T) seems it is giving the rows ID which are used in the calculation, isn't it? On Thu, Jan 28, 2021 at 12:21 PM Rui Barradas <ruipbarradas at sapo.pt> wrote:> Hello, > > I don't know why you would need to see the indices but rewrite the > function bootprop as > > bootprop_ind <- function(data, index){ > d <- data[index, ] > #sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE) > index > } > > > and call in the same way. It will now return a matrix of indices with R > = 1000 rows and 19 columns. > > Hope this helps, > > Rui Barradas > > > ?s 19:29 de 28/01/21, Marna Wagley escreveu: > > Hi Rui, > > I am sorry for asking you several questions. > > > > In the given example, randomizations (reshuffle) were done 1000 times, > > and its 1000 proportion values (results) are stored and it can be seen > > using b$t; but I was wondering how the table was randomized (which rows > > have been missed/or repeated in each randomizing procedure?). > > > > Is there any way we can see the randomized table and its associated > > results? Here in this example, I randomized (or bootstrapped) the table > > into three times (R=3) so I would like to store these three tables and > > look at them later to know which rows were repeated/missed. Is there any > > possibility? > > The example data and the code is given below. > > > > Thank you for your help. > > > > #### > > library(boot) > > dat<-structure(list(Sample = structure(c(1L, 12L, 13L, 14L, 15L, 16L, > > 17L, 18L, 19L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label > c("id1", > > "id10", "id11", "id12", "id13", "id14", "id15", "id16", "id17", > > "id18", "id19", "Id2", "id3", "id4", "id5", "id6", "id7", "id8", > > "id9"), class = "factor"), Time1 = c(0L, 1L, 1L, 1L, 0L, 0L, > > 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), Time2 = c(1L, > > 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, > > 1L, 1L)), .Names = c("Sample", "Time1", "Time2"), class = "data.frame", > > row.names = c(NA, > > -19L)) > > daT<-data.frame(dat %>% > > mutate(Time1.but.not.in.Time2 = case_when( > > Time1 %in% "1" & Time2 %in% "0" ~ "1"), > > Time2.but.not.in.Time1 = case_when( > > Time1 %in% "0" & Time2 %in% "1" ~ "1"), > > BothTimes = case_when( > > Time1 %in% "1" & Time2 %in% "1" ~ "1"))) > > cols.num <- c("Time1.but.not.in.Time2","Time2.but.not.in.Time1", > > "BothTimes") > > daT[cols.num] <- sapply(daT[cols.num],as.numeric) > > summary(daT) > > > > bootprop <- function(data, index){ > > d <- data[index, ] > > sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], na.rm = TRUE) > > } > > > > R <- 3 > > set.seed(2020) > > b <- boot(daT, bootprop, R) > > b > > b$t0 # original > > b$t > > sd(b$t) # bootstrapped estimate of the SE of the sample prop. > > hist(b$t, freq = FALSE) > > > > str(b) > > b$data > > b$seed > > b$sim > > b$strata > > ################ > > > > > > On Sat, Jan 23, 2021 at 12:36 AM Marna Wagley <marna.wagley at gmail.com > > <mailto:marna.wagley at gmail.com>> wrote: > > > > Yes Rui, I can see we don't need to divide by square root of sample > > size. The example is great to understand it. > > Thank you. > > Marna > > > > > > On Sat, Jan 23, 2021 at 12:28 AM Rui Barradas <ruipbarradas at sapo.pt > > <mailto:ruipbarradas at sapo.pt>> wrote: > > > > Hello, > > > > Inline. > > > > ?s 07:47 de 23/01/21, Marna Wagley escreveu: > > > Dear Rui, > > > I was wondering whether we have to square root of SD to find > > SE, right? > > > > No, we don't. var already divides by n, don't divide again. > > This is the code, that can be seen by running the function name > > at a > > command line. > > > > > > sd > > #function (x, na.rm = FALSE) > > #sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), > > # na.rm = na.rm)) > > #<bytecode: 0x55f3ce900848> > > #<environment: namespace:stats> > > > > > > > > > > > > bootprop <- function(data, index){ > > > d <- data[index, ] > > > sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], > > na.rm = TRUE) > > > } > > > > > > R <- 1e3 > > > set.seed(2020) > > > b <- boot(daT, bootprop, R) > > > b > > > b$t0 # original > > > sd(b$t) # bootstrapped estimate of the SE of the sample prop. > > > sd(b$t)/sqrt(1000) > > > pandit*(1-pandit) > > > > > > hist(b$t, freq = FALSE) > > > > > > Try plotting the normal densities for both cases, the red line is > > clearly wrong. > > > > > > f <- function(x, xbar, s){ > > dnorm(x, mean = xbar, sd = s) > > } > > > > hist(b$t, freq = FALSE) > > curve(f(x, xbar = b$t0, s = sd(b$t)), from = 0, to = 1, col > > "blue", > > add = TRUE) > > curve(f(x, xbar = b$t0, s = sd(b$t)/sqrt(R)), from = 0, to = 1, > > col > > "red", add = TRUE) > > > > > > Hope this helps, > > > > Rui Barradas > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 3:07 PM Rui Barradas > > <ruipbarradas at sapo.pt <mailto:ruipbarradas at sapo.pt> > > > <mailto:ruipbarradas at sapo.pt <mailto:ruipbarradas at sapo.pt>>> > > wrote: > > > > > > Hello, > > > > > > Something like this, using base package boot? > > > > > > > > > library(boot) > > > > > > bootprop <- function(data, index){ > > > d <- data[index, ] > > > sum(d[["BothTimes"]], na.rm = TRUE)/sum(d[["Time1"]], > > na.rm = TRUE) > > > } > > > > > > R <- 1e3 > > > set.seed(2020) > > > b <- boot(daT, bootprop, R) > > > b > > > b$t0 # original > > > sd(b$t) # bootstrapped estimate of the SE of the sample > > prop. > > > hist(b$t, freq = FALSE) > > > > > > > > > Hope this helps, > > > > > > Rui Barradas > > > > > > ?s 21:57 de 22/01/21, Marna Wagley escreveu: > > > > Hi All, > > > > I was trying to estimate standard error (SE) for the > > proportion > > > value using > > > > some kind of randomization process (bootstrapping or > > jackknifing) > > > in R, but > > > > I could not figure it out. > > > > > > > > Is there any way to generate SE for the proportion? > > > > > > > > The example of the data and the code I am using is > > attached for your > > > > reference. I would like to generate the value of > > proportion with > > > a SE using > > > > a 1000 times randomization. > > > > > > > > dat<-structure(list(Sample = structure(c(1L, 12L, 13L, > > 14L, 15L, 16L, > > > > 17L, 18L, 19L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, > > 11L), .Label > > > = c("id1", > > > > "id10", "id11", "id12", "id13", "id14", "id15", > > "id16", "id17", > > > > "id18", "id19", "Id2", "id3", "id4", "id5", "id6", > > "id7", "id8", > > > > "id9"), class = "factor"), Time1 = c(0L, 1L, 1L, 1L, > > 0L, 0L, > > > > 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L), > > Time2 = c(1L, > > > > 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, > > 0L, 1L, 0L, > > > > 1L, 1L)), .Names = c("Sample", "Time1", "Time2"), > class > > > "data.frame", > > > > row.names = c(NA, > > > > -19L)) > > > > daT<-data.frame(dat %>% > > > > mutate(Time1.but.not.in.Time2 = case_when( > > > > Time1 %in% "1" & Time2 %in% "0" ~ "1"), > > > > Time2.but.not.in.Time1 = case_when( > > > > Time1 %in% "0" & Time2 %in% "1" ~ "1"), > > > > BothTimes = case_when( > > > > Time1 %in% "1" & Time2 %in% "1" ~ "1"))) > > > > daT > > > > summary(daT) > > > > > > > > cols.num <- > > c("Time1.but.not.in.Time2","Time2.but.not.in.Time1", > > > > "BothTimes") > > > > daT[cols.num] <- sapply(daT[cols.num],as.numeric) > > > > summary(daT) > > > > ProportionValue<-sum(daT$BothTimes, > > na.rm=T)/sum(daT$Time1, na.rm=T) > > > > ProportionValue > > > > standard error?? > > > > > > > > [[alternative HTML version deleted]] > > > > > > > > ______________________________________________ > > > > R-help at r-project.org <mailto:R-help at r-project.org> > > <mailto:R-help at r-project.org <mailto:R-help at r-project.org>> > > mailing list > > > -- To UNSUBSCRIBE and more, see > > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > > PLEASE do read the posting guide > > > http://www.R-project.org/posting-guide.html > > > > and provide commented, minimal, self-contained, > > reproducible code. > > > > > > > > > >[[alternative HTML version deleted]]