Christopher Austin-Lane
2004-Mar-05 04:31 UTC
[R] Slow reshape from 5x600000 to 6311 x 132
I have a dataset that's a few hundred thousand rows from a database
(read in via dbReadTable). The data look like:

    > str(measures)
    `data.frame':   609363 obs. of  5 variables:
     $ vih.id   : int  1 2 3 4 5 6 7 8 9 10 ...
     $ vi.id    : int  1 2 3 4 5 6 7 8 9 10 ...
     $ vih.value: chr  "0" "1989" "0" "N/A" ...
     $ vih.date : chr  "20040226012314" "20040226012315" "20040226012315" "20040226012315" ...
     $ vih.run.n: int  1 1 1 1 1 1 1 1 1 1 ...

I'm reshaping it to look like:

    > str(better)
    `data.frame':   132 obs. of  6311 variables:
     $ vih.run.n  : int  1 2 4 5 6 7 8 9 10 11 ...
     $ vih.value.1: chr  "0" "0" "0" "0" ...
     $ vih.value.2: chr  "1989" "1989" "1989" "1989" ...
     $ vih.value.3: chr  "0" "0" "0" "0" ...
     $ vih.value.4: chr  "N/A" "N/A" "N/A" "N/A" ...
     $ vih.value.5: chr  "3163979" "3163979" "3163979" "3163979" ...
     $ vih.value.6: chr  "5500073" "5500073" "5500073" "5500073" ...
    (etc., etc.)

This takes about 4-8 hours to accomplish. Should I

a) try to put it into the wide format row by row as I get the data from
   the DB instead of using dbReadTable, or

b) try to tune something in R? (I'm trying it now with
   R --min-vsize=600M --min-nsize=6M, although it doesn't seem any
   faster; I won't know for a while.)

(I'm using a home-compiled R 1.8.1 on Mac OS X 10.3.2, under emacs/ESS,
although my R 1.8.1 on Solaris 2.8 has been churning for a few hours as
well, on a split of the data that is 630 variables by 1000 obs.)

--Chris
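A minimal sketch of the kind of reshape() call that would produce this
wide layout (the post doesn't show the actual code, so the argument
mapping below is an assumption):

    # Assumed call: vi.id becomes the column suffix (timevar) and
    # vih.run.n the row identifier (idvar); the unused id and date
    # columns are dropped before reshaping.
    better <- reshape(measures,
                      idvar     = "vih.run.n",
                      timevar   = "vi.id",
                      v.names   = "vih.value",
                      drop      = c("vih.id", "vih.date"),
                      direction = "wide")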
Hi,

my reshape from ~1.4 million obs. down to ~150,000 obs. with 50
attributes goes surprisingly fast (1-2 minutes), but it is less complex
than yours. Perhaps it would be faster if the values were not character
strings, if that's possible for your data? Reshaping in the database is
possible with inner selects, but I prefer reshape() because it takes a
really long time in the DB.

christian

On Friday, 5 March 2004 at 05:31, Christopher Austin-Lane wrote:
> I have a dataset that's a few hundred thousand rows from a database
> (read in via dbReadTable). [...]
> This takes about 4-8 hours to accomplish. [...]
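A minimal sketch of the numeric-conversion idea above, assuming "N/A"
is the only non-numeric marker in vih.value:

    # Treat the literal string "N/A" as missing, then convert the value
    # column to numeric so reshape() works on numbers, not strings.
    measures$vih.value[measures$vih.value == "N/A"] <- NA
    measures$vih.value <- as.numeric(measures$vih.value)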