thr3ads.net - R devel - [Rd] joining columns as in a relational database [Jun 2003]

If this information is useful, please help other people find it:
Share via:

Douglas Bates

2003-Jun-25 01:07 UTC

[Rd] joining columns as in a relational database

In our recent workshop on "Multilevel Modeling in R" we discussed
handling data for multilevel modeling.  An classic example of such
data are test scores of students grouped into schools.  We may wish to
model the scores as functions of both student-level covariates and
school-level covariates.

Such data are often organized in a multi-table format with a separate table
for each level of information.  The MathAchieve and MathAchSchool data
frames in the nlme package are examples of such an organization.  The
HLM software requires the data to be organized like this.  To fit a
model in R we need to create a composite table by "joining" the
columns of the student-level and school-level tables, in the
relational database sense of "join".

I have created a function to join the columns from two such frames
according to the values of a key column.  In relational database terms
the key column must be a primary key for the second frame.  I have
called this function 'cjoin', by analogy to cbind.

You can try

data(MathAchieve, package = 'nlme')
data(MathAchSchool, package = 'nlme')
cjoin(MathAchieve, MathAchSchool, "School")
cjoin(MathAchieve, MathAchSchool, "School", which =
"Sector")

as examples

Several questions:

 - Am I duplicating existing functionality?

 - Is cjoin a good name for such a function?

 - Would this be useful in base?

-------------- next part --------------
"cjoin" <-
    function(fr1, fr2, cnm1, cnm2 = cnm1, which)
{
    val = as.data.frame(fr1)
    vnms = names(val)
    cnm1 = as.character(cnm1)
    if (length(cnm1) < 1 || any(is.na(match(cnm1, vnms)))) {
        stop("cnm1 must be one or more column names from fr1")
    }
    fr2nm = names(fr2)
    if (!missing(cnm2)) {
        cnm2 = as.character(cnm2)
        if (length(cnm2) != length(cnm1) || any(is.na(match(cnm2, fr2nm)))) {
            stop(paste("cnm2 must be", length(cnm1),
                       "column name(s) from fr2"))
        }
    } else {
        if (any(missed <- is.na(match(cnm2, fr2nm)))) {
            stop(paste("No columns named", paste(cnm2[missed], sep =
", "),
                       "in fr2"))
        }
    }
    if (length(cnm1) == 1) {
        mcol = fr2[[cnm2]]
        if (any(is.na(mcol))) {
            warning(paste("Missing values in column", mcol,
                          "of fr2 cannot be matched"))
            mcol = mcol[is.na(mcol)]
        }
        if (length(mcol) != length(unique(mcol))) {
            stop(paste("column", mcol, "must be a unique key to
fr2"))
        }
        mm = match(fr1[[cnm1]], fr2[[cnm2]])
        if (any(is.na(mm))) {
            stop("Values present in column", cnm1,
                 "of fr1 are not present in column", cnm2, "of
fr2")
        }
        if (missing(which)) {
            which = fr2nm[!(fr2nm %in% cnm2)]
        } else {
            which = which[!(which %in% cnm2)]
            if (any(missed <- is.na(match(which, fr2nm)))) {
                stop(paste("No columns named", paste(which[missed],
sep = ", "),
                           "in fr2"))
            }
        }
        if (any(dups <- !is.na(match(which, vnms)))) {
            warning(paste("Join operation will duplicate column
name(s)",
                          paste(which[dups], sep = ", ")))
        }
        return(cbind(val, fr2[mm, which]))
    } else {
        stop("Matches on multiple columns not yet implemented")
    }
}

Marc Schwartz

2003-Jun-25 02:24 UTC

head link

[Rd] joining columns as in a relational database

On Tue, 2003-06-24 at 18:07, Douglas Bates wrote:> In our recent workshop on "Multilevel Modeling in R" we discussed
> handling data for multilevel modeling.  An classic example of such
> data are test scores of students grouped into schools.  We may wish to
> model the scores as functions of both student-level covariates and
> school-level covariates.
> 
> Such data are often organized in a multi-table format with a separate table
> for each level of information.  The MathAchieve and MathAchSchool data
> frames in the nlme package are examples of such an organization.  The
> HLM software requires the data to be organized like this.  To fit a
> model in R we need to create a composite table by "joining" the
> columns of the student-level and school-level tables, in the
> relational database sense of "join".
> 
> I have created a function to join the columns from two such frames
> according to the values of a key column.  In relational database terms
> the key column must be a primary key for the second frame.  I have
> called this function 'cjoin', by analogy to cbind.
> 
> You can try
> 
> data(MathAchieve, package = 'nlme')
> data(MathAchSchool, package = 'nlme')
> cjoin(MathAchieve, MathAchSchool, "School")
> cjoin(MathAchieve, MathAchSchool, "School", which =
"Sector")
> 
> as examples
> 
> Several questions:
> 
>  - Am I duplicating existing functionality?
> 
>  - Is cjoin a good name for such a function?
> 
>  - Would this be useful in base?

Prof. Bates,

Perhaps I am not seeing all of the details but a quick (perhaps too
quick) review would suggest that merge(), which is in base, would at
least be a parallel function.

For example:

data(MathAchieve, package = 'nlme')
data(MathAchSchool, package = 'nlme')

cj1 <- cjoin(MathAchieve, MathAchSchool, "School")
mrg1 <- merge(MathAchieve, MathAchSchool, by = "School")

dim(cj1)
[1] 7185   12

dim(mrg1)
[1] 7185   12

colnames(cj1)
 [1] "School"   "Minority" "Sex"     
"SES"      "MathAch"  "MEANSES"
 [7] "Size"     "Sector"   "PRACAD"  
"DISCLIM"  "HIMINTY"  "MEANSES"

colnames(mrg1)
 [1] "School"    "Minority"  "Sex"      
"SES"       "MathAch"
"MEANSES.x"
 [7] "Size"      "Sector"    "PRACAD"   
"DISCLIM"   "HIMINTY"
"MEANSES.y"

> table(cj1 == mrg1) 
 TRUE
86220


Note that cj1 appears to have duplicate colnames for "MEANSES", for
which your function does issue a warning message, whereas mrg1 has
appended ".x" and ".y" as default suffixes.


In the second case, if I am reading the code correctly, the 'which'
argument appears to restrict the columns retained from the second data
frame after the key match, in this case "Sector".
> cj2 <- cjoin(MathAchieve, MathAchSchool, "School", which =
"Sector")
> mrg2 <- mrg1[, c(1:6, 8)]
> dim(cj2)[1] 7185    7
> dim(mrg2)[1] 7185    7
> colnames(cj2)[1] "School"         "Minority"       "Sex"       
"SES"
[5] "MathAch"        "MEANSES"        "fr2[mm,
which]"
> colnames(mrg2)[1] "School"    "Minority"  "Sex"      
"SES"       "MathAch"
"MEANSES.x"
[7] "Sector"
> table(cj2 == mrg2) 
 TRUE
50295


See ?merge for more information.

Thoughts?


Best regards,

Marc Schwartz

Douglas Bates

2003-Jun-25 04:46 UTC

head link

[Rd] joining columns as in a relational database

Marc Schwartz <MSchwartz@medanalytics.com> writes:
> On Tue, 2003-06-24 at 18:07, Douglas Bates wrote:
> > Several questions:
> > 
> >  - Am I duplicating existing functionality?...
> Prof. Bates,
> 
> Perhaps I am not seeing all of the details but a quick (perhaps too
> quick) review would suggest that merge(), which is in base, would at
> least be a parallel function.
No, you are seeing better than I.  I had not thought to look up merge
- I only thought of join.  I can use merge.  The answer to my first
question is that I am duplicating functionality.

It was an interesting exercise.  Thanks for pointing out merge.

Prof Brian Ripley

2003-Jun-25 08:57 UTC

head link

[Rd] joining columns as in a relational database

?merge says

     Merge two data frames by common columns or row names, or do other
     versions of database ``join'' operations.

and it has done all the examples of this sort of thing that I have ever 
needed.

On 24 Jun 2003, Douglas Bates wrote:
> In our recent workshop on "Multilevel Modeling in R" we discussed
> handling data for multilevel modeling.  An classic example of such
> data are test scores of students grouped into schools.  We may wish to
> model the scores as functions of both student-level covariates and
> school-level covariates.
> 
> Such data are often organized in a multi-table format with a separate table
> for each level of information.  The MathAchieve and MathAchSchool data
> frames in the nlme package are examples of such an organization.  The
> HLM software requires the data to be organized like this.  To fit a
> model in R we need to create a composite table by "joining" the
> columns of the student-level and school-level tables, in the
> relational database sense of "join".
> 
> I have created a function to join the columns from two such frames
> according to the values of a key column.  In relational database terms
> the key column must be a primary key for the second frame.  I have
> called this function 'cjoin', by analogy to cbind.
> 
> You can try
> 
> data(MathAchieve, package = 'nlme')
> data(MathAchSchool, package = 'nlme')
> cjoin(MathAchieve, MathAchSchool, "School")
> cjoin(MathAchieve, MathAchSchool, "School", which =
"Sector")
> 
> as examples
> 
> Several questions:
> 
>  - Am I duplicating existing functionality?
> 
>  - Is cjoin a good name for such a function?
> 
>  - Would this be useful in base?
> 
> 
-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

Maybe Matching Threads

Search for more possibly parallel threads

R devel - Jun 2003 - joining columns as in a relational database

[Rd] joining columns as in a relational database

[Rd] joining columns as in a relational database

[Rd] joining columns as in a relational database

[Rd] joining columns as in a relational database

Maybe Matching Threads