Charles Cheung
2006-May-09 00:20 UTC
[Rd] Efficient Merging of two huge sorted data frames?---Use merge()?
Hello all, A problem I encounter today is the speed which takes to sort two huge data frames... I wish to sort by (X,Y) Dataframe One consists of variables: X, Y, sequence, position having ~700 000 records another dataframe consists of X,Y, intensities having ~900 000 records Every (X,Y) pair in dataframe One is included in dataframe Two, however, the reverse is not true. Furthermore, (X,Y, position) in data frame One makes the record unique. (That means there can be multiple records with the same (X,Y) records!) Added together, it makes it hard to just combine the two data frames together by simply going data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond even in sorted records by X and Y. Intuitive, it should only require very little time <O(n) complexity> after the data records are sorted. However, it takes so long (I haven't finished the process in 20 minutes.. it should only take <1 min) to merge the list by X and Y using merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect this process is not optimized for already sorted list. * assuming the two frames have been sorted, I would be able to do the following: X Y seq Pos 1 1 AA 32 1 2 AG 44 1 3 GC 65 X Y intensities 1 1 0.4 1 3 0.552>>Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor >>moves to (1,2) (1,3) --> can't find.. cursor moves to (1,3) (1,3) .. >>merge that pairIs the merge function doing that already? Is there an efficient way to merge the data frames? (What do you suggest I should do?) (to produce) X Y seq pos intensities 1 1 AA 32 0.4 1 3 GC 65 0.552 Thank you in advance! Charles Cheung
Prof Brian Ripley
2006-May-09 07:50 UTC
[Rd] Efficient Merging of two huge sorted data frames?---Use merge()?
merge() is not optimized for large data frames. To do things on this scale you really want to be using a DBMS not R. See the `R Data Import/Export Manual'. Sorting is not really relevant, especially as merge is not assuming that the match is unique. Hashing could be used, but is not. As R is open source, you have the source code and it would be kinder to read it yourself rather than expect this list to read it for you. A useful contribution to the R project would be to contribute a more efficient version, and we look forwards to seeing your contribution. On Mon, 8 May 2006, Charles Cheung wrote:> Hello all, > > A problem I encounter today is the speed which takes to sort two huge data > frames... > > I wish to sort by (X,Y) > > Dataframe One consists of variables: > X, Y, sequence, position > having ~700 000 records > > another dataframe consists of > X,Y, intensities > having ~900 000 records > > > Every (X,Y) pair in dataframe One is included in dataframe Two, > however, the reverse is not true. > Furthermore, (X,Y, position) in data frame One makes the record unique. > (That means there can be multiple records with the same (X,Y) records!) > > Added together, it makes it hard to just combine the two data frames > together by simply going > data.frame(dataFrameOne, dataFrameTwo) because the mapping won't correspond > even in sorted records by X and Y. > > > Intuitive, it should only require very little time <O(n) complexity> after > the data records are sorted. > However, it takes so long (I haven't finished the process in 20 minutes.. it > should only take <1 min) to merge the list by X and Y using > > merge(dataFrameOne, dataFrameTwo, by=c("X","Y") , which leads me to suspect > this process is not optimized for already sorted list. > > * assuming the two frames have been sorted, I would be able to do the > following: > > > X Y seq Pos > 1 1 AA 32 > 1 2 AG 44 > 1 3 GC 65 > > > X Y intensities > 1 1 0.4 > 1 3 0.552 > >>> Cursor at beginning (1,1) (1,1) -->merge the (1,1) pair.. then cursor >>> moves to (1,2) (1,3) --> can't find.. cursor moves to (1,3) (1,3) .. >>> merge that pair > > Is the merge function doing that already? > > > Is there an efficient way to merge the data frames? (What do you suggest I > should do?) > > > (to produce) > X Y seq pos intensities > 1 1 AA 32 0.4 > 1 3 GC 65 0.552 > > Thank you in advance! > > > Charles Cheung > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > >-- Brian D. Ripley, ripley at stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595