Farewell, Timothy
2014-Aug-11 13:06 UTC
[R] building a BIGLM model from three tables (related)
Hi all, I wonder if you can help me. THE PROBLEM I want to train and test a GLM with some large datasets. I am running into some problems when I flatten my tables together to feed into the GLM model, as it produces a very large table which is far too big for the memory on my computer. THREE TABLES - Pipes, Weekly Weather data, Bursts I have three tables, which are all related to each other. (1) Pipe cohorts (114,000 rows) with a range of explanatory variables. ((1) Linking fields: (A) Pipe cohort ID, (B) weathercell_ID) (2) Explanatory Weekly Weather data 12 years (e.g. 624 weeks for each pipe cohort) ((2) Linking fields: (C) week, (B) weathercell_ID) (3) Bursts (40,000 bursts) ((3) Linking fields: (A) Pipe cohort ID, (C) week) Effectively, the combination of tables (1) and (2) make the population. Table (3) are the events, or failures. JOINING THE THREE TABLES I have previously had far fewer pipe cohort rows. What I have been doing till now is joining the (1) pipe cohorts data to the (2) weekly weather data. This repeats the pipe cohort data, each week, for the 12 years, which, now, makes a very long table e.g. 624 x 114,000 rows = 71 million rows. I would then join the (3) burst data to that to see how many bursts there were that week, on that pipe cohort. This made a large, flat file, which I could feed into GLM. This worked ok when there are not so many pipe cohorts, but now there are 114,000 rows, when I join the data tables I produce a MASSIVE table (many, many GB) which kills my computers. RELATIONAL DATABASE APPROACH? I am thinking it would be better to have a relational database structure where, for each data point (row) being brought into the BIGLM model, it take the three tables and looks up the appropriate values each time, using the defined join fields (A, B +C), feeds that into the model, then goes back and looks up the next point. ADVICE? How would you approach this problem? I have the data prepared in the three tables. I need to fit lots of models to see which variables give me the best AIC (output: lots of model fits) Then predict bursts using the best model and the available (1) pipe and (2) weather data Would you use the package BIGLM, linking to a sqlite database? (Or do something completely different?) Many thanks, Tim [[alternative HTML version deleted]]