Hi,

I just had a very quick look at the StatDataML proposal --- nice
work! At the risk of showing my ignorance, I want to mention
my first impressions.

My first impression is that defining datasets in terms of
arrays and lists is a bit too high a level. What about
simpler vectors, scalars? (I know that R/S don't have scalars,
but other systems/applications do.) Can we think of a core
set of "basic" data types (factors, strings, integers, etc.)
from which to build other, possibly recursive types (perhaps
similar to CORBA's IDL basic data types or S's datadump?).

Would it make sense to imagine, say, xlispstat/python/java applications
reading and interpreting a StatDataML document without serious difficulties?

My gut feeling (which is often wrong) is that the DTD should make
the data self-describing: e.g., the factor "machineId" has
levels (or defining set) "Stepper1", "Stepper2", ... "Stepper20",
even though the particular dataset at hand has only a
subset of those. Similarly, perhaps allowing units and classes
to be included in the dataset (in the case of currency, it is certainly
a number, perhaps single precision, perhaps not, with specific units
dollars, euros, pesos, etc.)

More long-term, how about application-defined data? An application may have
its own set of data objects that fully exploit contextual
information that could be extremely useful to capture and
communicate.

Also, do the data have to be in ASCII format? What about
(possibly mime-encoded) images? sound?

As I mentioned, these are questions coming from my lack of experience
with XML, but may be worth raising now rather than later :-)

David A. James
Statistics Research, Room 2C-253        Phone: (908) 582-3082
Bell Labs, Lucent Technologies          Fax:   (908) 582-3340
Murray Hill, NJ 09794-0636
-----------------------------------------------------------------------------
> From: Friedrich Leisch <Friedrich.Leisch@ci.tuwien.ac.at>
> Date: Fri, 3 Mar 2000 17:07:37 +0100 (CET)
> To: omega-devel@omegahat.org, r-devel@R-project.org,
>     Erich.Neuwirth@univie.ac.at, hothorn@ci.tuwien.ac.at,
>     baier@ci.tuwien.ac.at, Christian.Buchta@wu-wien.ac.at
> Subject: [Omega-devel] StatDataML
> List-Id: Developers of Omega <omega-devel.www.omegahat.org>
>
>
> Hi,
>
> we have a first draft of R functions reading/writing data to XML files
> including a rather general DTD ... which borrows heavily from the data
> types of a certain programming language :-)
>
> The basic idea is to create an XML standard for data exchange,
> together with import/export functions for as many applications as
> possible. We here will need R, Matlab & Octave for our research
> program, but the idea is of course to create a general standard.
>
> After looking in several other applications we think that all the data
> types there can easily be represented using S constructs (i.e., arrays
> and lists together with attributes) ... so why make life complicated
> and invent something new.
>
> Of course this only applies to the low-level representation ... the
> real thing will come next when one starts defining higher level
> classes, this step we have avoided so far because one needs the
> low-level things first to have something to play with.
>
> A short description of the DTD and an R package with import/export
> functions can be found at
>
>         http://www.ci.tuwien.ac.at/~leisch/R
>
> (Modulo some bugs) R data objects can be saved/restored without loss
> of information. We don't intend to cover functions or models yet.
>
> All comments and ideas are appreciated! This is just a proposal and
> anything can still be changed ...
>
> Best,
> Fritz
>
> PS: Almost all the work has been done by Torsten Hothorn, I'm just
> writing the email ;-)
>
> --
> -------------------------------------------------------------------
>                         Friedrich Leisch
> Institut für Statistik                     Tel: (+43 1) 58801 10715
> Technische Universität Wien                Fax: (+43 1) 58801 10798
> Wiedner Hauptstraße 8-10/1071      Friedrich.Leisch@ci.tuwien.ac.at
> A-1040 Wien, Austria             http://www.ci.tuwien.ac.at/~leisch
>      PGP public key http://www.ci.tuwien.ac.at/~leisch/pgp.key
> -------------------------------------------------------------------
>
>
> _______________________________________________
> Omega-devel maillist  -  Omega-devel@www.omegahat.org
> http://www.omegahat.org/mailman/listinfo/omega-devel
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
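The self-describing factor raised in the message above (declared levels "Stepper1" to "Stepper20", data using only a subset) can be sketched roughly as follows. This is only an illustration in Python: the tag names (`factor`, `levels`, `level`, `values`, `v`) and the 0-based integer coding are assumptions made for the example and are not taken from the StatDataML DTD.

```python
import xml.etree.ElementTree as ET


def factor_to_xml(name, levels, values):
    """Serialize a factor, writing the full declared level set,
    not just the levels that happen to occur in the data."""
    unknown = set(values) - set(levels)
    if unknown:
        raise ValueError(f"values not in declared levels: {sorted(unknown)}")
    root = ET.Element("factor", {"name": name})  # hypothetical tag names
    levels_el = ET.SubElement(root, "levels")
    for lev in levels:
        ET.SubElement(levels_el, "level").text = lev
    index = {lev: i for i, lev in enumerate(levels)}
    values_el = ET.SubElement(root, "values")
    for v in values:
        # each observation is stored as a 0-based code into the level list
        ET.SubElement(values_el, "v").text = str(index[v])
    return root


def factor_from_xml(root):
    """Rebuild (name, levels, values) from the element written above."""
    levels = [e.text for e in root.find("levels")]
    values = [levels[int(e.text)] for e in root.find("values")]
    return root.get("name"), levels, values


# The dataset at hand uses only two of the twenty declared levels.
levels = [f"Stepper{i}" for i in range(1, 21)]
observed = ["Stepper3", "Stepper7", "Stepper3"]
doc = ET.tostring(factor_to_xml("machineId", levels, observed), encoding="unicode")
name, got_levels, got_values = factor_from_xml(ET.fromstring(doc))
```

After the round trip, a reader in any language still sees all twenty levels, which is the point of carrying the defining set in the document rather than inferring it from the data.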
>>>>> On Fri, 3 Mar 2000 16:36:36 -0500 (EST),
>>>>> David James (DJ) wrote:

DJ> Hi,
DJ> I just had a very quick look at the StatDataML proposal --- nice
DJ> work! At the risk of showing my ignorance, I want to mention
DJ> my first impressions.

DJ> My first impression is that defining datasets in terms of
DJ> arrays and lists is a bit too high a level. What about
DJ> simpler vectors, scalars? (I know that R/S don't have scalars,
DJ> but other systems/applications do.) Can we think of a core
DJ> set of "basic" data types (factors, strings, integers, etc.)
DJ> from which to build other, possibly recursive types (perhaps
DJ> similar to CORBA's IDL basic data types or S's datadump?).

Hmm, basically we have that ... just that I don't see why it's
necessary to differentiate between a vector (=1-dimensional array) and
higher dimensions, i.e., introduce different tags for it. But if many
others feel like this is necessary: I don't have a strong opinion about
it, we just wanted to keep the thing as simple as possible.

Regarding data types: Torsten and I just discussed that we want to
keep the mode of an array as abstract as possible such that
applications can use the internal representation that fits the data
best. IMO the following modes will be necessary to represent statistical
data:

        logical, nominal, ordinal, integer, real, complex

DJ> Would it make sense to imagine, say, xlispstat/python/java applications
DJ> reading and interpreting a StatDataML document without serious difficulties?

Sure! What's the difference?

DJ> My gut feeling (which is often wrong) is that the DTD should make
DJ> the data self-describing: e.g., the factor "machineId" has
DJ> levels (or defining set) "Stepper1", "Stepper2", ... "Stepper20",
DJ> even though the particular dataset at hand has only a
DJ> subset of those. Similarly, perhaps allowing units and classes
DJ> to be included in the dataset (in the case of currency, it is certainly
DJ> a number, perhaps single precision, perhaps not, with specific units
DJ> dollars, euros, pesos, etc.)

DJ> More long-term, how about application-defined data? An application may have
DJ> its own set of data objects that fully exploit contextual
DJ> information that could be extremely useful to capture and
DJ> communicate.

We definitely need (and want) any user to be able to extend
StatDataML, i.e., define new classes. There should be a set of
standard classes (like dataframe or time series), but also interfaces
for defining new classes.

The current idea (in R) is to have the following: If the SDML object
has a class and there exists a conversion function for that particular
class then use it, otherwise do the default thing. The conversion
function shouldn't do too much, probably mostly renaming some slots and
re-organizing the structure (as classes on different systems will
probably have different structures).

DJ> Also, do the data have to be in ASCII format? What about
DJ> (possibly mime-encoded) images? sound?

Hmm, haven't thought about that yet.

DJ> As I mentioned, these are questions coming from my lack of experience
DJ> with XML, but may be worth raising now rather than later :-)

YES!!! That's why we called it ``proposal'' rather than ``StatDataML
version 1.0'' :-)

Best,
Fritz

PS: We are also no XML experts!
--
-------------------------------------------------------------------
                        Friedrich Leisch
Institut für Statistik                     Tel: (+43 1) 58801 10715
Technische Universität Wien                Fax: (+43 1) 58801 10798
Wiedner Hauptstraße 8-10/1071      Friedrich.Leisch@ci.tuwien.ac.at
A-1040 Wien, Austria             http://www.ci.tuwien.ac.at/~leisch
     PGP public key http://www.ci.tuwien.ac.at/~leisch/pgp.key
-------------------------------------------------------------------
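The conversion rule described in the reply above ("if the SDML object has a class and there exists a conversion function for that particular class then use it, otherwise do the default thing") might look roughly like this. The reply describes the idea for R; the sketch below is in Python, and the registry, the dict layout of an object (`class`, `slots`), and the slot names `freq`/`frequency` are all hypothetical, chosen only to illustrate the dispatch.

```python
# Registry mapping a class name to its conversion function (hypothetical).
CONVERTERS = {}


def register(class_name):
    """Decorator that registers a conversion function for one class."""
    def wrap(fn):
        CONVERTERS[class_name] = fn
        return fn
    return wrap


def default_convert(obj):
    """The 'default thing': hand back the slots unchanged as a plain dict."""
    return dict(obj["slots"])


@register("timeseries")
def convert_timeseries(obj):
    """A converter that does very little: it only renames one slot."""
    slots = dict(obj["slots"])
    slots["frequency"] = slots.pop("freq")
    return slots


def from_sdml(obj):
    """Use the class-specific converter if one exists, else the default."""
    convert = CONVERTERS.get(obj.get("class"), default_convert)
    return convert(obj)


ts = from_sdml({"class": "timeseries", "slots": {"freq": 12, "data": [1, 2]}})
other = from_sdml({"class": "unknown", "slots": {"x": 1}})
```

Objects with an unregistered class, or with no class at all, fall through to the default, so a reader never fails just because a sender used a class it does not know.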
>>>>> On Mon, 6 Mar 2000 12:27:25 +0100 (CET),
>>>>> Friedrich Leisch (FL) wrote:

FL> IMO the following modes will be necessary to represent statistical
FL> data:

FL>     logical, nominal, ordinal, integer, real, complex

Sorry, forgot character.

.fritz
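With character added, the list of modes has seven entries. As a rough illustration of keeping the mode abstract and letting each application choose its own internal representation, here is one possible mapping in Python. The choice of Python types and the textual encoding ("true"/"false" for logicals, Python-style "1+2j" for complex numbers) are assumptions for the example, not part of the proposal.

```python
from enum import Enum


class Mode(Enum):
    """The seven modes named in the thread."""
    LOGICAL = "logical"
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTEGER = "integer"
    REAL = "real"
    COMPLEX = "complex"
    CHARACTER = "character"


# One application's choice of internal representation (an assumption).
PY_TYPE = {
    Mode.LOGICAL: bool,
    Mode.NOMINAL: str,
    Mode.ORDINAL: str,
    Mode.INTEGER: int,
    Mode.REAL: float,
    Mode.COMPLEX: complex,
    Mode.CHARACTER: str,
}


def parse_values(mode_name, texts):
    """Convert text fields to the local type chosen for the given mode."""
    mode = Mode(mode_name)  # raises ValueError for an unknown mode name
    if mode is Mode.LOGICAL:
        # bool("false") would be True, so logicals are compared as text
        return [t.strip().lower() == "true" for t in texts]
    return [PY_TYPE[mode](t) for t in texts]


ints = parse_values("integer", ["1", "2"])
flags = parse_values("logical", ["true", "false"])
```

Another application could swap in a different `PY_TYPE` table (for example, integer codes for nominal and ordinal data) without any change to the document format.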