Nick Switanek
2010-Mar-24 20:51 UTC
[R] translating SQL statements into data.table operations
I've recently stumbled across data.table, Matthew Dowle's package. I'm impressed by the speed of the package in handling operations with large data.frames, but am a bit overwhelmed with the syntax. I'd like to express the SQL statement below using data.table operations rather than sqldf (which was incredibly slow for a small subset of my financial data) or import/export with a DBMS, but I haven't been able to figure out how to do it. I would be grateful for your suggestions. nick My aim is to join events (trades) from two datasets ("edt" and "cdt") where, for the same stock, the events in one dataset occur between 15 and 75 days before the other, and within the same time window. I can only see how to express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. I'm also at a loss at whether I can express the remainder using data.table's %between% operator or not. ctqm <- sqldf("SELECT e.*, c.DATE 'DATEctrl', c.TIME 'TIMEctrl', c.PRICE 'PRICEctrl', c.SIZE 'SIZEctrl' FROM edt e, ctq c WHERE e.SYMBOL = c.SYMBOL AND julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND 75 AND strftime('%H:%M:%S',c.TIME) BETWEEN strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)") [[alternative HTML version deleted]]
Gabor Grothendieck
2010-Mar-24 22:29 UTC
[R] translating SQL statements into data.table operations
Note that, in general, you can speed up joins, even within sqldf, by adding indexes to your tables and ensuring that your select statement is written in such a way that the indexes are used. See example 4i on the sqldf home page. On Wed, Mar 24, 2010 at 4:51 PM, Nick Switanek <nswitanek at gmail.com> wrote:> I've recently stumbled across data.table, Matthew Dowle's package. I'm > impressed by the speed of the package in handling operations with large > data.frames, but am a bit overwhelmed with the syntax. I'd like to express > the SQL statement below using data.table operations rather than sqldf (which > was incredibly slow for a small subset of my financial data) or > import/export with a DBMS, but I haven't been able to figure out how to do > it. I would be grateful for your suggestions. > > nick > > > > My aim is to join events (trades) from two datasets ("edt" and "cdt") where, > for the same stock, the events in one dataset occur between 15 and 75 days > before the other, and within the same time window. I can only see how to > express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. I'm also > at a loss at whether I can express the remainder using data.table's > %between% operator or not. > > ctqm <- sqldf("SELECT e.*, > ? ? ? ? ? ? ? ? c.DATE 'DATEctrl', > ? ? ? ? ? ? ? ? c.TIME 'TIMEctrl', > ? ? ? ? ? ? ? ? c.PRICE 'PRICEctrl', > ? ? ? ? ? ? ? ? c.SIZE 'SIZEctrl' > > ? ? ? ? ? ? ? ? FROM edt e, ctq c > > ? ? ? ? ? ? ? ? WHERE e.SYMBOL = c.SYMBOL AND > ? ? ? ? ? ? ? ? ? ? ? julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND > 75 AND > ? ? ? ? ? ? ? ? ? ? ? strftime('%H:%M:%S',c.TIME) BETWEEN > strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)") > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >
Matthew Dowle
2010-Mar-25 12:33 UTC
[R] translating SQL statements into data.table operations
Nick, Good question, but just sent to the wrong place. The posting guide asks you to contact the package maintainer first before posting to r-help only if you don't hear back. I guess one reason for that is that if questions about all 2000+ packages were sent to r-help, then r-help's traffic could go through the roof. Another reason could be that some (i.e. maybe many, maybe few) package maintainers don't actually monitor r-help and might miss any messages you post here. I only saw this one thanks to google alerts. Since I'm writing anyway ... are you using the latest version on r-forge which has the very fast grouping? Have you set multi-column keys on both edt and cdt and tried edt[cdt,roll=TRUE] syntax ? We'll help you off list to climb the learning curve quickly. We are working on FAQs and a vignette and they should be ready soon too. Please do follow up with us (myself and Tom Short cc'd are the main developers) off list and one of us will be happy to help further. Matthew "Nick Switanek" <nswitanek at gmail.com> wrote in message news:772ec1011003241351v6a3f36efqb0b0787564691f3a at mail.gmail.com...> I've recently stumbled across data.table, Matthew Dowle's package. I'm > impressed by the speed of the package in handling operations with large > data.frames, but am a bit overwhelmed with the syntax. I'd like to express > the SQL statement below using data.table operations rather than sqldf > (which > was incredibly slow for a small subset of my financial data) or > import/export with a DBMS, but I haven't been able to figure out how to do > it. I would be grateful for your suggestions. > > nick > > > > My aim is to join events (trades) from two datasets ("edt" and "cdt") > where, > for the same stock, the events in one dataset occur between 15 and 75 days > before the other, and within the same time window. I can only see how to > express the "WHERE e.SYMBOL = c.SYMBOL" part in data.table syntax. I'm > also > at a loss at whether I can express the remainder using data.table's > %between% operator or not. > > ctqm <- sqldf("SELECT e.*, > c.DATE 'DATEctrl', > c.TIME 'TIMEctrl', > c.PRICE 'PRICEctrl', > c.SIZE 'SIZEctrl' > > FROM edt e, ctq c > > WHERE e.SYMBOL = c.SYMBOL AND > julianday(e.DATE) - julianday(c.DATE) BETWEEN 15 AND > 75 AND > strftime('%H:%M:%S',c.TIME) BETWEEN > strftime('%H:%M:%S',e.BEGTIME) AND strftime('%H:%M:%S',e.ENDTIME)") > > [[alternative HTML version deleted]] >