In our recent workshop on "Multilevel Modeling in R" we discussed handling data for multilevel modeling. An classic example of such data are test scores of students grouped into schools. We may wish to model the scores as functions of both student-level covariates and school-level covariates. Such data are often organized in a multi-table format with a separate table for each level of information. The MathAchieve and MathAchSchool data frames in the nlme package are examples of such an organization. The HLM software requires the data to be organized like this. To fit a model in R we need to create a composite table by "joining" the columns of the student-level and school-level tables, in the relational database sense of "join". I have created a function to join the columns from two such frames according to the values of a key column. In relational database terms the key column must be a primary key for the second frame. I have called this function 'cjoin', by analogy to cbind. You can try data(MathAchieve, package = 'nlme') data(MathAchSchool, package = 'nlme') cjoin(MathAchieve, MathAchSchool, "School") cjoin(MathAchieve, MathAchSchool, "School", which = "Sector") as examples Several questions: - Am I duplicating existing functionality? - Is cjoin a good name for such a function? - Would this be useful in base? -------------- next part -------------- "cjoin" <- function(fr1, fr2, cnm1, cnm2 = cnm1, which) { val = as.data.frame(fr1) vnms = names(val) cnm1 = as.character(cnm1) if (length(cnm1) < 1 || any(is.na(match(cnm1, vnms)))) { stop("cnm1 must be one or more column names from fr1") } fr2nm = names(fr2) if (!missing(cnm2)) { cnm2 = as.character(cnm2) if (length(cnm2) != length(cnm1) || any(is.na(match(cnm2, fr2nm)))) { stop(paste("cnm2 must be", length(cnm1), "column name(s) from fr2")) } } else { if (any(missed <- is.na(match(cnm2, fr2nm)))) { stop(paste("No columns named", paste(cnm2[missed], sep = ", "), "in fr2")) } } if (length(cnm1) == 1) { mcol = fr2[[cnm2]] if (any(is.na(mcol))) { warning(paste("Missing values in column", mcol, "of fr2 cannot be matched")) mcol = mcol[is.na(mcol)] } if (length(mcol) != length(unique(mcol))) { stop(paste("column", mcol, "must be a unique key to fr2")) } mm = match(fr1[[cnm1]], fr2[[cnm2]]) if (any(is.na(mm))) { stop("Values present in column", cnm1, "of fr1 are not present in column", cnm2, "of fr2") } if (missing(which)) { which = fr2nm[!(fr2nm %in% cnm2)] } else { which = which[!(which %in% cnm2)] if (any(missed <- is.na(match(which, fr2nm)))) { stop(paste("No columns named", paste(which[missed], sep = ", "), "in fr2")) } } if (any(dups <- !is.na(match(which, vnms)))) { warning(paste("Join operation will duplicate column name(s)", paste(which[dups], sep = ", "))) } return(cbind(val, fr2[mm, which])) } else { stop("Matches on multiple columns not yet implemented") } }
On Tue, 2003-06-24 at 18:07, Douglas Bates wrote:> In our recent workshop on "Multilevel Modeling in R" we discussed > handling data for multilevel modeling. An classic example of such > data are test scores of students grouped into schools. We may wish to > model the scores as functions of both student-level covariates and > school-level covariates. > > Such data are often organized in a multi-table format with a separate table > for each level of information. The MathAchieve and MathAchSchool data > frames in the nlme package are examples of such an organization. The > HLM software requires the data to be organized like this. To fit a > model in R we need to create a composite table by "joining" the > columns of the student-level and school-level tables, in the > relational database sense of "join". > > I have created a function to join the columns from two such frames > according to the values of a key column. In relational database terms > the key column must be a primary key for the second frame. I have > called this function 'cjoin', by analogy to cbind. > > You can try > > data(MathAchieve, package = 'nlme') > data(MathAchSchool, package = 'nlme') > cjoin(MathAchieve, MathAchSchool, "School") > cjoin(MathAchieve, MathAchSchool, "School", which = "Sector") > > as examples > > Several questions: > > - Am I duplicating existing functionality? > > - Is cjoin a good name for such a function? > > - Would this be useful in base?Prof. Bates, Perhaps I am not seeing all of the details but a quick (perhaps too quick) review would suggest that merge(), which is in base, would at least be a parallel function. For example: data(MathAchieve, package = 'nlme') data(MathAchSchool, package = 'nlme') cj1 <- cjoin(MathAchieve, MathAchSchool, "School") mrg1 <- merge(MathAchieve, MathAchSchool, by = "School") dim(cj1) [1] 7185 12 dim(mrg1) [1] 7185 12 colnames(cj1) [1] "School" "Minority" "Sex" "SES" "MathAch" "MEANSES" [7] "Size" "Sector" "PRACAD" "DISCLIM" "HIMINTY" "MEANSES" colnames(mrg1) [1] "School" "Minority" "Sex" "SES" "MathAch" "MEANSES.x" [7] "Size" "Sector" "PRACAD" "DISCLIM" "HIMINTY" "MEANSES.y"> table(cj1 == mrg1)TRUE 86220 Note that cj1 appears to have duplicate colnames for "MEANSES", for which your function does issue a warning message, whereas mrg1 has appended ".x" and ".y" as default suffixes. In the second case, if I am reading the code correctly, the 'which' argument appears to restrict the columns retained from the second data frame after the key match, in this case "Sector".> cj2 <- cjoin(MathAchieve, MathAchSchool, "School", which = "Sector") > mrg2 <- mrg1[, c(1:6, 8)]> dim(cj2)[1] 7185 7> dim(mrg2)[1] 7185 7> colnames(cj2)[1] "School" "Minority" "Sex" "SES" [5] "MathAch" "MEANSES" "fr2[mm, which]"> colnames(mrg2)[1] "School" "Minority" "Sex" "SES" "MathAch" "MEANSES.x" [7] "Sector"> table(cj2 == mrg2)TRUE 50295 See ?merge for more information. Thoughts? Best regards, Marc Schwartz
Marc Schwartz <MSchwartz@medanalytics.com> writes:> On Tue, 2003-06-24 at 18:07, Douglas Bates wrote:> > Several questions: > > > > - Am I duplicating existing functionality?...> Prof. Bates, > > Perhaps I am not seeing all of the details but a quick (perhaps too > quick) review would suggest that merge(), which is in base, would at > least be a parallel function.No, you are seeing better than I. I had not thought to look up merge - I only thought of join. I can use merge. The answer to my first question is that I am duplicating functionality. It was an interesting exercise. Thanks for pointing out merge.
?merge says Merge two data frames by common columns or row names, or do other versions of database ``join'' operations. and it has done all the examples of this sort of thing that I have ever needed. On 24 Jun 2003, Douglas Bates wrote:> In our recent workshop on "Multilevel Modeling in R" we discussed > handling data for multilevel modeling. An classic example of such > data are test scores of students grouped into schools. We may wish to > model the scores as functions of both student-level covariates and > school-level covariates. > > Such data are often organized in a multi-table format with a separate table > for each level of information. The MathAchieve and MathAchSchool data > frames in the nlme package are examples of such an organization. The > HLM software requires the data to be organized like this. To fit a > model in R we need to create a composite table by "joining" the > columns of the student-level and school-level tables, in the > relational database sense of "join". > > I have created a function to join the columns from two such frames > according to the values of a key column. In relational database terms > the key column must be a primary key for the second frame. I have > called this function 'cjoin', by analogy to cbind. > > You can try > > data(MathAchieve, package = 'nlme') > data(MathAchSchool, package = 'nlme') > cjoin(MathAchieve, MathAchSchool, "School") > cjoin(MathAchieve, MathAchSchool, "School", which = "Sector") > > as examples > > Several questions: > > - Am I duplicating existing functionality? > > - Is cjoin a good name for such a function? > > - Would this be useful in base? > >-- Brian D. Ripley, ripley@stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595