Hi there,

I posted this message before, but there may have been some confusion in my previous post, so here is a clearer version.

I'd like to do bootstrap sampling for clustered data and then run some complicated models (say, mixed effects models) on the bootstrapped sample. Here id is the cluster. Note that different clusters have different numbers of subjects, e.g., id 2 has 2 observations and id 3 has 3 observations.

id = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
y = c(.5, .6, .4, .3, .4, 1, .9, 1, .5, 2, 2.2, 3)
x = c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)

xx = data.frame(id, x, y)

boot.cluster <- function(x, id){
  boot.id <- sample(unique(id), replace=T)
  out <- lapply(boot.id, function(i) x[id%in%i,])
  return( do.call("rbind",out) )
}

boot.xx = boot.cluster(xx, xx$id)

Here is the generated boot.xx dataset:

 id x y
 3 0 0.4
 3 0 1.0
 3 0 0.9
 1 0 0.5
 1 0 0.6
 5 1 2.2
 5 1 3.0
 2 1 0.4
 2 1 0.3
 1 0 0.5
 1 0 0.6

You can see that some clusters (ids) appear multiple times (e.g., id 1 appears in two places, 4 rows in total). Since the bootstrap samples with replacement, the same cluster can be drawn more than once. Thus, we cannot fit a mixed effects model to these data directly, because each sampled cluster should be treated as a distinct cluster in the new data. Instead, I want to reorganize the data as below (id is renumbered in order of appearance in boot.xx above). This is the step I need help with:

 id x y
 1 0 0.4
 1 0 1.0
 1 0 0.9
 2 0 0.5
 2 0 0.6
 3 1 2.2
 3 1 3.0
 4 1 0.4
 4 1 0.3
 5 0 0.5
 5 0 0.6

Can someone help me with it? Thanks!

Lei Liu
Professor of Biostatistics
Washington University in St. Louis
You are telling us that the ID values in your data set indicate clusters. However you went about making that determination in the first place might be an obvious(?) way to do it again with your bootstrapped sample, ignoring the cluster assignments you have in place.

This is the wrong place to have a discussion about which theoretical method for cluster identification you should use, and if you already know which one you want, then searching the web or using the sos package would be the appropriate way to find implementations of a specific clustering algorithm.

I am not an ME expert, but AFAIK "complicated" analyses such as mixed effects models tend to have rather hefty appetites for data completeness, so you may have to design a special sampling plan in order to avoid generating data sets for which those analyses will break, and you will probably need a very large data set to start with in order to have sufficient data in each cluster. That is, you may be better off keeping the original cluster identification and just restructuring your bootstrap sampling to sample within clusters.

The R-sig-me mailing list is probably a better venue for your questions.

On September 16, 2018 8:22:44 PM PDT, "Liu, Lei" <lei.liu at wustl.edu> wrote:
>[...]
>______________________________________________
>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

--
Sent from my phone. Please excuse my brevity.
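For reference, the "sample within clusters" idea mentioned in the reply above might look roughly like the following. This is only a minimal sketch under stated assumptions: boot.within is a hypothetical helper name that does not appear in the thread, and whether within-cluster resampling is statistically appropriate for a given mixed effects model is a separate question that is not settled here.

```r
# Example data from the original post
id <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
y  <- c(.5, .6, .4, .3, .4, 1, .9, 1, .5, 2, 2.2, 3)
x  <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)
xx <- data.frame(id, x, y)

# Hypothetical helper (not from the thread): keep every original cluster
# and resample its rows with replacement, so cluster sizes and the
# original id labels are preserved.
boot.within <- function(x, id) {
  # Row indices grouped by cluster
  idx.by.cluster <- split(seq_len(nrow(x)), id)
  # Within each cluster, draw as many rows as the cluster has
  rows <- unlist(lapply(idx.by.cluster, function(idx) {
    idx[sample.int(length(idx), replace = TRUE)]
  }), use.names = FALSE)
  out <- x[rows, ]
  rownames(out) <- NULL
  out
}

set.seed(1)  # arbitrary seed, only for reproducibility
boot.w <- boot.within(xx, xx$id)
table(boot.w$id)  # same cluster sizes as in xx
```

With this scheme no relabelling is needed, since each id still identifies exactly one cluster.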
Thanks for the help. My friend helped me and here is the solution:

boot.cluster <- function(x, id){
  # draw cluster ids with replacement
  boot.id <- sample(unique(id), replace=TRUE)
  # give each draw its own label (newid), so repeated clusters stay distinct
  out <- lapply(seq_along(boot.id), function(newid){cbind(x[id%in%boot.id[newid],],newid)})
  return( do.call("rbind",out) )
}

Lei

-----Original Message-----
From: Jeff Newmiller [mailto:jdnewmil at dcn.davis.ca.us]
Sent: Monday, September 17, 2018 2:32 AM
To: r-help at r-project.org; Liu, Lei <lei.liu at wustl.edu>; r-help at R-project.org
Subject: Re: [R] bootstrap sample for clustered data
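As a quick check of the solution above, here is a small self-contained sketch that applies it to the example data from the original post. The seed and the explicit `newid = k` column name are choices made for this illustration only; the thread does not show output for the new function, so none is reproduced here.

```r
# Example data from the original post
id <- c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
y  <- c(.5, .6, .4, .3, .4, 1, .9, 1, .5, 2, 2.2, 3)
x  <- c(0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1)
xx <- data.frame(id, x, y)

# Cluster bootstrap with relabelling, following the solution in the thread
boot.cluster <- function(x, id) {
  # Draw cluster ids with replacement
  boot.id <- sample(unique(id), replace = TRUE)
  # The k-th draw gets label k, so a cluster drawn twice appears
  # as two distinct clusters in the result
  out <- lapply(seq_along(boot.id), function(k) {
    cbind(x[id %in% boot.id[k], ], newid = k)
  })
  do.call("rbind", out)
}

set.seed(1)  # arbitrary seed, only for reproducibility
boot.xx <- boot.cluster(xx, xx$id)

# newid runs from 1 to the number of draws; id keeps the original label
unique(boot.xx$newid)
# Each newid corresponds to exactly one original id
tapply(boot.xx$id, boot.xx$newid, function(v) length(unique(v)))

# With lme4 installed, newid would then serve as the grouping factor, e.g.
#   lme4::lmer(y ~ x + (1 | newid), data = boot.xx)
# (not run here)
```

The original id column is kept alongside newid, which makes it easy to see which clusters were drawn more than once.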