William Dunlap
2008-Nov-25 21:02 UTC
[R] Efficient passing through big data.frame and modifying select
> -----Original Message----- > From: William Dunlap > Sent: Tuesday, November 25, 2008 9:16 AM > To: 'johannes_graumann at web.de' > Subject: Re: [R] Efficient passing through big data.frame and > modifying select fields > > > Johannes Graumann johannes_graumann at web.de > > Tue Nov 25 15:16:01 CET 2008 > > > > Hi all, > > > > I have relatively big data frames (> 10000 rows by 80 columns) > > that need to be exposed to "merge". Works marvelously well in > > general, but some fields of the data frames actually contain > > multiple ";"-separated values encoded as a character string without > > defined order, which makes the fields not match each other. > > > > Example: > > > > > frame1[1,1] > > [1] "some;thing" > > >frame2[2,1] > > [2] "thing;some" > > > > In order to enable merging/duplicate identification of columns > > containing these strings, I wrote the following function, which > > passes through the rows one by one, identifies ";"-containing cells, > > splits and resorts them. > > > > ResortCombinedFields <- function(dframe){ > > if(!is.data.frame(dframe)){ > > stop("\"ResortCombinedFields\" input needs to be a data frame.") > > } > > for(row in seq(nrow(dframe))){ > > for(mef in grep(";",dframe[row,])){ > > I needed to add drop=TRUE to the above dframe[row,] for this to work. > > > dframe[row,mef] <- > paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";") > > } > > } > > return(dframe) > > } > > > > works fine, but is horribly inefficient. How might this be > tackled more elegantly? > > > > Thanks for any input, Joh > > It is usually faster to loop over columns of an data frame and use row > subscripting, if needed, on individual columns. E.g., the following > 2 are much quicker on a sample 1000 by 4 dataset I made with > > dframe<-data.frame(lapply(c(One=1,Two=2,Three=3), > function(i)sapply(1:1000, > function(i) > > paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE), > collapse=";"))), > stringsAsFactors=FALSE) > dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe), > replace=TRUE) # no ;'s in column Four > > The first function, f1, doesn't try to find which rows may > need adjusting > and the second, f2, does. > > f1 <- function(dframe){ > if(!is.data.frame(dframe)){ > stop("\"ResortCombinedFields\" input needs to be a data frame.") > } > for(icol in seq_len(ncol(dframe))){ > dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol], > ";"), function(parts) paste(sort(parts), collapse=";"))) > } > return(dframe) > } > > f2 <- > function(dframe){ > if(!is.data.frame(dframe)){ > stop("\"ResortCombinedFields\" input needs to be a data frame.") > } > for(icol in seq_len(ncol(dframe))){ > col <- dframe[,icol] > irow <- grep(";", col) > if (length(irow)) { > col[irow] <- unlist(lapply(strsplit(col[irow], ";"), > function(parts) paste(sort(parts), collapse=";"))) > dframe[,icol] <- col > } > } > return(dframe) > } > > Times were > > > unix.time(z<-ResortCombinedFields(dframe)) > user system elapsed > 2.526 0.022 2.559 > > unix.time(f1z<-f1(dframe)) > user system elapsed > 0.509 0.000 0.508 > > unix.time(f2z<-f2(dframe)) > user system elapsed > 0.259 0.004 0.264 > > identical(z, f1z) > [1] TRUE > > identical(z, f2z) > [1] TRUEIn R 2.7.0 (April 2008) f1() and f2() both take time proportional to nrow(dframe), while your original ResortCombinedFields() takes time proportional to the square of nrow(dframe). E.g., for 50,000 rows ResortCombinedFields takes 4252 seconds while f2 takes 14 seconds It looks like 2.9 acts about the same. Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com
Johannes Graumann
2008-Nov-26 10:36 UTC
[R] Efficient passing through big data.frame and modifying select
Marvelous! Thanks guys for your hints and time! Very smooth now! Joh On Wednesday 26 November 2008 03:41:49 Henrik Bengtsson wrote:> Alright, here are another $.02: using 'use.names=FALSE' in unlist() is > much faster than the default 'use.names=TRUE'. /Henrik > > On Tue, Nov 25, 2008 at 6:40 PM, Henrik Bengtsson <hb at stat.berkeley.edu>wrote:> > My $.02: Using argument 'fixed=TRUE' in strsplit() is much faster than > > the default 'fixed=FALSE'. /Henrik > > > > On Tue, Nov 25, 2008 at 1:02 PM, William Dunlap <wdunlap at tibco.com> wrote: > >>> -----Original Message----- > >>> From: William Dunlap > >>> Sent: Tuesday, November 25, 2008 9:16 AM > >>> To: 'johannes_graumann at web.de' > >>> Subject: Re: [R] Efficient passing through big data.frame and > >>> modifying select fields > >>> > >>> > Johannes Graumann johannes_graumann at web.de > >>> > Tue Nov 25 15:16:01 CET 2008 > >>> > > >>> > Hi all, > >>> > > >>> > I have relatively big data frames (> 10000 rows by 80 columns) > >>> > that need to be exposed to "merge". Works marvelously well in > >>> > general, but some fields of the data frames actually contain > >>> > multiple ";"-separated values encoded as a character string without > >>> > defined order, which makes the fields not match each other. > >>> > > >>> > Example: > >>> > > frame1[1,1] > >>> > > >>> > [1] "some;thing" > >>> > > >>> > >frame2[2,1] > >>> > > >>> > [2] "thing;some" > >>> > > >>> > In order to enable merging/duplicate identification of columns > >>> > containing these strings, I wrote the following function, which > >>> > passes through the rows one by one, identifies ";"-containing cells, > >>> > splits and resorts them. > >>> > > >>> > ResortCombinedFields <- function(dframe){ > >>> > if(!is.data.frame(dframe)){ > >>> > stop("\"ResortCombinedFields\" input needs to be a data frame.") > >>> > } > >>> > for(row in seq(nrow(dframe))){ > >>> > for(mef in grep(";",dframe[row,])){ > >>> > >>> I needed to add drop=TRUE to the above dframe[row,] for this to work. > >>> > >>> > dframe[row,mef] <- > >>> > >>> paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";") > >>> > >>> > } > >>> > } > >>> > return(dframe) > >>> > } > >>> > > >>> > works fine, but is horribly inefficient. How might this be > >>> > >>> tackled more elegantly? > >>> > >>> > Thanks for any input, Joh > >>> > >>> It is usually faster to loop over columns of an data frame and use row > >>> subscripting, if needed, on individual columns. E.g., the following > >>> 2 are much quicker on a sample 1000 by 4 dataset I made with > >>> > >>> dframe<-data.frame(lapply(c(One=1,Two=2,Three=3), > >>> function(i)sapply(1:1000, > >>> function(i) > >>> > >>> paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE), > >>> collapse=";"))), > >>> stringsAsFactors=FALSE) > >>> dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe), > >>> replace=TRUE) # no ;'s in column Four > >>> > >>> The first function, f1, doesn't try to find which rows may > >>> need adjusting > >>> and the second, f2, does. > >>> > >>> f1 <- function(dframe){ > >>> if(!is.data.frame(dframe)){ > >>> stop("\"ResortCombinedFields\" input needs to be a data frame.") > >>> } > >>> for(icol in seq_len(ncol(dframe))){ > >>> dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol], > >>> ";"), function(parts) paste(sort(parts), collapse=";"))) > >>> } > >>> return(dframe) > >>> } > >>> > >>> f2 <- > >>> function(dframe){ > >>> if(!is.data.frame(dframe)){ > >>> stop("\"ResortCombinedFields\" input needs to be a data frame.") > >>> } > >>> for(icol in seq_len(ncol(dframe))){ > >>> col <- dframe[,icol] > >>> irow <- grep(";", col) > >>> if (length(irow)) { > >>> col[irow] <- unlist(lapply(strsplit(col[irow], ";"), > >>> function(parts) paste(sort(parts), collapse=";"))) > >>> dframe[,icol] <- col > >>> } > >>> } > >>> return(dframe) > >>> } > >>> > >>> Times were > >>> > >>> > unix.time(z<-ResortCombinedFields(dframe)) > >>> > >>> user system elapsed > >>> 2.526 0.022 2.559 > >>> > >>> > unix.time(f1z<-f1(dframe)) > >>> > >>> user system elapsed > >>> 0.509 0.000 0.508 > >>> > >>> > unix.time(f2z<-f2(dframe)) > >>> > >>> user system elapsed > >>> 0.259 0.004 0.264 > >>> > >>> > identical(z, f1z) > >>> > >>> [1] TRUE > >>> > >>> > identical(z, f2z) > >>> > >>> [1] TRUE > >> > >> In R 2.7.0 (April 2008) f1() and f2() both take time proportional > >> to nrow(dframe), while your original ResortCombinedFields() takes > >> time proportional to the square of nrow(dframe). E.g., for 50,000 > >> rows ResortCombinedFields takes 4252 seconds while f2 takes 14 seconds > >> It looks like 2.9 acts about the same. > >> > >> Bill Dunlap > >> TIBCO Software Inc - Spotfire Division > >> wdunlap tibco.com > >> > >> ______________________________________________ > >> R-help at r-project.org mailing list > >> https://stat.ethz.ch/mailman/listinfo/r-help > >> PLEASE do read the posting guide > >> http://www.R-project.org/posting-guide.html and provide commented, > >> minimal, self-contained, reproducible code.-------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 835 bytes Desc: This is a digitally signed message part. URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20081126/f2c08061/attachment.bin>