Matthew Dowle
2009-Mar-31 01:37 UTC
[R] [R-pkgs] data.table is on CRAN (enhanced data.frame for time series joins and more)
Dear all,

The data.table package was released back in August 2008. This email is to publicise its existence, in response to several suggestions to do so. It seems I didn't send a general announcement at the time, so, perhaps not surprisingly, not many people know about it. Some recent r-help threads also suggest that a public announcement would be useful.

The main difference between data.frame and data.table is enhanced functionality in [.data.table, which is where most of the documentation for this package lives, i.e. help("[.data.table"). Selected extracts from the package documentation follow.

The package builds on base R functionality to reduce 2 types of time:

1. programming time (easier to write, read, debug and maintain)
2. compute time when combining database like operations (subset, with and by)

It also provides joins similar to those merge provides, but faster.

This is achieved by using R's column based ordered in-memory data.frame, eval within the environment of a list (i.e. with), the [.data.table mechanism to condense the features, and compiled C to make certain operations fast.

[.data.table is like [.data.frame but i and j can be expressions of column names directly. Furthermore, i may itself be a data.table, which invokes a fast table join using binary search in O(log n) time. Allowing i to be a data.table is consistent with subsetting an n-dimension array by an n-column matrix in base R.

data.tables do not have rownames but may instead have a key of one or more columns, set using setkey. This key may be used for row indexing instead of rownames.

Examples comparing [.data.frame and [.data.table :

library(data.table)
DF = data.frame(a=1:5, b=6:10)
DT = data.table(a=1:5, b=6:10)

tt = subset(DF,a==3)
ss = DT[a==3]
# Just use the column name 'a' directly. No need to remember the comma.
# The i argument is like the 'where' in SQL.
identical(as.data.table(tt), ss)

tt = with(subset(DF,a==3),a+b+1)
ss = DT[a==3,a+b+1]
# j is like select in SQL and the select argument of subset in base R.
# j can be an expression of column names directly, including a data.table of multiple expressions.
# Here the j expression is executed just for the rows matching the i argument.
identical(tt, ss)

# The examples above use vector scans, i.e. the "a==3" expression is evaluated for
# every row, creating a logical vector as long as the total number of rows.
# The examples below use binary search, invoked by passing in a data.table as the i argument.
# Joins in SQL are performed in the where clause and the i argument is like where,
# so this seems very natural (to me anyway!)

DT = data.table(a=letters[1:5], b=6:10)
setkey(DT,a)
identical(DT[J("d")], DT[4])      # binary search to row for 'd'

DT = data.table(id=rep(c("A","B"),each=3), date=c(20080501L,20080502L,20080506L), v=1:6)
setkey(DT,id,date)
DT["A"]                           # all 3 rows for A since mult by default is "all"
DT[J("A",20080502L)]              # row for A where date also matches exactly
DT[J("A",20080505L)]              # NA since 5 May is missing (outer join by default)
DT[J("A",20080505L),nomatch=0]    # inner join instead

dts = c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)
DT[J("A",dts)]                    # 3 of the dates in dts match exactly
DT[J("A",dts),roll=TRUE]          # roll previous data forward i.e. return the prevailing observation
DT[J("A",dts),rolltolast=TRUE]    # roll all but last observation forward

tables(mb=TRUE)                   # prints table names, number of rows, size in memory

Thanks to all those who have made suggestions and given feedback so far. Further comments and feedback on the package would be much appreciated.

Regards,
Matthew

_______________________________________________
R-packages mailing list
R-packages at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages
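
The roll=TRUE example above returns the prevailing observation for each lookup date. As a rough illustration of that idea only, and not of how data.table implements it, the same lookup for id "A" can be sketched in base R with findInterval(), which likewise relies on the dates being sorted. The variable names date, v, idx, rolled and exact are made up for this sketch.

```r
# Base R sketch of "roll previous data forward"; this is NOT data.table's implementation.
# The sorted 'date' vector plays the role of the key for id "A".
date <- c(20080501L, 20080502L, 20080506L)   # observation dates, sorted ascending
v    <- 1:3                                  # observed values for id "A"
dts  <- c(20080501L, 20080502L, 20080505L, 20080506L, 20080507L, 20080508L)

# Exact matching: NA wherever the lookup date has no observation.
exact <- v[match(dts, date)]

# findInterval() gives, for each lookup date, the position of the last
# observation date that is <= it; 0 means there is no earlier observation.
idx <- findInterval(dts, date)
idx[idx == 0L] <- NA_integer_

# Prevailing observation for every lookup date.
rolled <- v[idx]

print(data.frame(date = dts, exact = exact, rolled = rolled))
```

Here 5 May has no exact match, so exact is NA for that date, while rolled carries the 2 May value forward.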