Hi,

I agree that replacing data.frames for modeling functions would be too
painful. Also I agree with Thomas that new class(es) for tabular data
should not inherit from data.frame, and that data.frames should
conceptually inherit from some other base tabular data class.

At this point I'm not suggesting anything concrete --- I haven't
sorted it out in my own mind --- but I want to point out that (1) we
already have several tabular data classes (not one), including
matrices, arrays, contingency tables, data.frames, etc.; and that (2)
I feel we'll need to address some of the problems we get into when we
use data.frames inappropriately due to lack of other/better data
structures. But I don't think that one single class can truthfully
and completely represent all "tabular data".

For instance, interfaces to XML, spreadsheets, DBMS, etc., will
further expose some of the limitations of these existing objects.
Tables holding data from relational DBMS are an easy case: this class
should preserve the original data as much as possible, i.e., no
coercing into factors, no changing column names, no row names, but
otherwise very similar to data.frames. (Timothy Keitt has a more
interesting concept of "proxyTables" that presents some very
interesting issues: should proxyTables, which "point" to remote
relations in dbms, allow integer indexing? --- the relational database
model does not support it.)

Is there something common to all these objects? Obviously they all
support indexing x[i,j, ...] plus the methods dim() and (possibly
NULL) dimnames(). S4 defines vectors as a (virtual) class in terms of
the indexing operation in exactly this way -- thus in S4 lists are
vectors, and so are logicals, characters, etc. We may be able to
group the various "tabular data" classes under such a virtual class,
and provide simple coercion facilities so that users can easily fit,
say, a linear model to data coming as an XML document or in a table
stored in a dbms.

David A. James
Statistics Research, Room 2C-253        Phone:  (908) 582-3082
Bell Labs, Lucent Technologies          Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636

----------------------------------------------------------------------------
> From: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> MIME-Version: 1.0
> Content-Transfer-Encoding: 7bit
> Date: Thu, 8 Feb 2001 14:41:03 +0100
> To: David James <dj@research.bell-labs.com>
> Cc: Kurt.Hornik@ci.tuwien.ac.at, tlumley@u.washington.edu,
>     p.dalgaard@biostat.ku.dk, R-devel@r-project.org
> Subject: Re: [Rd] RE: [R] Removing "row.names"
>
> >>>>> David James writes:
>
> >> Date: Wed, 7 Feb 2001 09:33:12 -0800 (PST)
> >> From: Thomas Lumley <tlumley@u.washington.edu>
> >> To: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> >> cc: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk>, R-devel@r-project.org
> >> Subject: Re: [Rd] RE: [R] Removing "row.names"
> >> MIME-Version: 1.0
> >>
> >> On Wed, 7 Feb 2001, Kurt Hornik wrote:
> >>
> >> > >>>>> Thomas Lumley writes:
> >> >
> >> > > On Wed, 7 Feb 2001, Kurt Hornik wrote:
> >> > >> >>>>> Peter Dalgaard BSA writes:
> >> > >>
> >> > >> > Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
> >> > >> >> names(sampled) <- " "
> >> > >> >> and
> >> > >> >> dimnames(sampled)[[2]] <- " "
> >> > >> >>
> >> > >> >> happily introduce non-unique variable names in the data frame.
> >> > >> >>
> >> > >> >> Is the rule that row.names and names must be unique still on?
> >> > >> >>
> >> > >> >> Argh ...
> >> > >>
> >> > >> > Splus 3.4 dispatches on dimnames<-, but not on names<- with the
> >> > >> > following curious result:
> >> > >>
> >> > >> >> d <- data.frame(a=1:3,b=4:6)
> >> > >> >> names(d)<-c(" "," ")
> >> > >> >> d
> >> > >>
> >> > >> > 1 1 4
> >> > >> > 2 2 5
> >> > >> > 3 3 6
> >> > >> >> dimnames(d)[[1]] <- rep(" ",3)
> >> > >> > Error in "dimnames<-.data.frame"(d, .A0): column names must be unique
> >> > >> > Dumped
> >> > >>
> >> > >> > R dispatches similarly, but doesn't check the dimnames in
> >> > >> > dimnames<-.data.frame. It could do so quite easily. Just add
> >> > >>
> >> > >> > || any(duplicated(d[[1]])) || any(duplicated(d[[2]]))
> >> > >>
> >> > >> > at the appropriate spot.
> >> > >>
> >> > >> Thomas' view about what should be permitted seems to be different.
> >> >
> >> > > I wouldn't object to making it hard to create duplicated names(), but
> >> > > I think it would be a bad idea to have data.frame() make up unique
> >> > > names if it's given non-unique ones.
> >> >
> >> > Maybe `check.names' could also be used for uniqueness testing?
> >> >
> >> > In any case, I think we should specify what *exactly* a data frame is.
> >> >
> >>
> >> I think we should specify, and check.names is a logical way to
> >> allow/forbid non-unique columns.
> >>
> >> Having a new class would be messy: logically it shouldn't inherit from
> >> data.frame, data.frame should inherit from it, but that would be a real
> >> pain to set up.
> >>
>
> > Data frames were originally meant to be used in modeling functions.
> > The opening paragraph in Chapter 3 (Data for Models) in the White Book
> > says:
>
> > "This chapter describes the general structure for data that
> > will be used throughout the book. In particular, it introduces the
> > data frame, a class of objects to represent the data typically encountered
> > in fitting models."
>
> > However, data.frames may not be quite appropriate for representing
> > other types of tabular data (certainly a data.frame does not capture
> > the essence of, say, a "relational" table in the SQL sense, which
> > doesn't have the concept of row names). Several manifestations of
> > this problem are coercing character data to factors "at the drop of a
> > hat" (as someone wrote here or in s-news), the row.names issue now
> > being discussed, problems including general objects in the "cells" of
> > the data.frame, etc.
>
> > I think that the concept of a data.frame to represent data for fitting
> > models is fine, but we may (certainly I) have abused this concept. We
> > need other classes of tabular data objects in addition (not as a
> > replacement) to data.frames, together with coercion methods and
> > perhaps other utilities.
>
> Thomas had said that yes it would be nice to have something with less
> restrictions for modeling, but that it was uneconomical at least to
> introduce a new class that data.frame would then inherit from.
>
> I interpret your comment as suggesting that we introduce a new class for
> holding tabular data? Do you have specific ideas on this?
>
> -k

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
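
The duplicate-names question raised in the quoted exchange can be tried
out directly. What follows is a minimal sketch of how a current version
of R (not the 2001 release under discussion, whose behaviour may have
differed) treats non-unique column names. The object names d, fixed,
and kept are illustrative only.

```r
# Sketch for a current R release; the 2001 behaviour may have differed.
d <- data.frame(a = 1:3, b = 4:6)

# Assigning through names<- performs no uniqueness check.
names(d) <- c(" ", " ")
has_dup_names <- anyDuplicated(names(d)) > 0

# With check.names = TRUE (the default) data.frame() repairs the names,
# which includes making them unique.
fixed <- data.frame(a = 1, a = 2, check.names = TRUE)

# With check.names = FALSE the names are kept exactly as given.
kept <- data.frame(a = 1, a = 2, check.names = FALSE)

print(names(fixed))
print(names(kept))
```

So in current R, check.names acts as the switch suggested in the thread:
it decides whether data.frame() makes names unique, while a later
names<- assignment can still introduce duplicates.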
Kurt Hornik
2001-Feb-12 15:47 UTC
[Rd] Re: tabular data (was RE: [R] Removing "row.names")
>>>>> David James writes:

> Hi,

> For instance, interfaces to XML, spreadsheets, DBMS, etc., will
> further expose some of the limitations of these existing objects.
> Tables holding data from relational DBMS are an easy case: this class
> should preserve the original data as much as possible, i.e., no
> coercing into factors, no changing column names, no row names, but
> otherwise very similar to data.frames. (Timothy Keitt has a more
> interesting concept of "proxyTables" that presents some very
> interesting issues: should proxyTables, which "point" to remote
> relations in dbms, allow integer indexing? --- the relational database
> model does not support it)

> Is there something common to all these objects? Obviously they all
> support indexing x[i,j, ...] plus the methods dim() and (possibly
> NULL) dimnames(). S4 defines vectors as a (virtual) class in terms of
> the indexing operation in exactly this way -- thus in S4 lists are
> vectors, and so are logicals, characters, etc. We may be able to
> group the various "tabular data" classes under such virtual class, and
> provide simple coercion facilities so that users can easily fit, say,
> a linear model to data coming as an XML document or in a table stored
> in a dbms.

Maybe this is something to be kicked off at the official part of DSC and
discussed in the `r-core' part.

-k
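
The virtual-class idea in the quoted message can be sketched with the
S4-style classes of R's methods package. This is only one possible
reading, under stated assumptions: the class name Tabular and the
generic asModelFrame are hypothetical names chosen for illustration
(nothing in the thread proposes them), and only matrix and data.frame
are grouped here.

```r
library(methods)

# Hypothetical virtual class grouping two existing tabular classes.
setClassUnion("Tabular", c("matrix", "data.frame"))

# Hypothetical coercion generic: turn any Tabular object into a
# data.frame that a modeling function can consume.
setGeneric("asModelFrame",
           function(x, ...) standardGeneric("asModelFrame"))
setMethod("asModelFrame", "Tabular",
          function(x, ...) as.data.frame(x, ...))

# A plain numeric matrix with column names, standing in for data that
# arrived from some non-data.frame source.
m <- cbind(y = c(1.1, 1.9, 3.2, 3.8), x = c(0, 1, 2, 3))

# All members of the union support dim() and dimnames().
print(dim(m))
print(dimnames(m))

# Fit a linear model after coercion.
fit <- lm(y ~ x, data = asModelFrame(m))
print(names(coef(fit)))
```

A class union leaves data.frame itself untouched, which sidesteps the
"real pain" of re-parenting data.frame mentioned earlier in the thread;
whether that is an adequate design for proxy tables or DBMS results is
left open here.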