thr3ads.net - R help - [R] Efficient passing through big data.frame and modifying select [Nov 2008]

If this information is useful, please help other people find it:
Share via:

William Dunlap

2008-Nov-25 21:02 UTC

[R] Efficient passing through big data.frame and modifying select

> -----Original Message-----
> From: William Dunlap 
> Sent: Tuesday, November 25, 2008 9:16 AM
> To: 'johannes_graumann at web.de'
> Subject: Re: [R] Efficient passing through big data.frame and 
> modifying select fields
> 
> > Johannes Graumann johannes_graumann at web.de
> > Tue Nov 25 15:16:01 CET 2008
> > 
> > Hi all,
> > 
> > I have relatively big data frames (> 10000 rows by 80 columns)
> > that need to be exposed to "merge". Works marvelously well
in
> > general, but some fields of the data frames actually contain
> > multiple ";"-separated values encoded as a character string
without
> > defined order, which makes the fields not match each other.
> > 
> > Example:
> > 
> > > frame1[1,1]
> > [1] "some;thing"
> > >frame2[2,1]
> > [2] "thing;some"
> > 
> > In order to enable merging/duplicate identification of columns
> > containing these strings, I wrote the following function, which
> > passes through the rows one by one, identifies
";"-containing cells,
> > splits and resorts them.
> >
> > ResortCombinedFields <- function(dframe){
> >  if(!is.data.frame(dframe)){
> >    stop("\"ResortCombinedFields\" input needs to be a
data frame.")
> >  }
> >  for(row in seq(nrow(dframe))){
> >    for(mef in grep(";",dframe[row,])){
> 
> I needed to add drop=TRUE to the above dframe[row,] for this to work.
> 
> >      dframe[row,mef] <- 
>
paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";")
> >    }
> >  }
> >  return(dframe)
> > }
> > 
> > works fine, but is horribly inefficient. How might this be 
> tackled more elegantly?
> > 
> > Thanks for any input, Joh
> 
> It is usually faster to loop over columns of an data frame and use row
> subscripting, if needed, on individual columns.  E.g., the following
> 2 are much quicker on a sample 1000 by 4 dataset I made with
> 
> dframe<-data.frame(lapply(c(One=1,Two=2,Three=3),
>    function(i)sapply(1:1000,
>       function(i)
>          
> paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE), 
> collapse=";"))),
>    stringsAsFactors=FALSE)
> dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe), 
> replace=TRUE) # no ;'s in column Four
> 
> The first function, f1, doesn't try to find which rows may 
> need adjusting
> and the second, f2, does.
> 
> f1 <- function(dframe){
>   if(!is.data.frame(dframe)){
>     stop("\"ResortCombinedFields\" input needs to be a data
frame.")
>   }
>   for(icol in seq_len(ncol(dframe))){
>     dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol], 
> ";"), function(parts) paste(sort(parts),
collapse=";")))
>   }
>   return(dframe)
> }
> 
> f2 <-
> function(dframe){
>   if(!is.data.frame(dframe)){
>     stop("\"ResortCombinedFields\" input needs to be a data
frame.")
>   }
>   for(icol in seq_len(ncol(dframe))){
>     col <- dframe[,icol]
>     irow <- grep(";", col)
>     if (length(irow)) {
>         col[irow] <- unlist(lapply(strsplit(col[irow], ";"), 
> function(parts) paste(sort(parts), collapse=";")))
>         dframe[,icol] <- col
>     }
>   }
>   return(dframe)
> }
> 
> Times were
>  
> > unix.time(z<-ResortCombinedFields(dframe))
>    user  system elapsed
>   2.526   0.022   2.559
> > unix.time(f1z<-f1(dframe))
>    user  system elapsed
>   0.509   0.000   0.508
> > unix.time(f2z<-f2(dframe))
>    user  system elapsed
>   0.259   0.004   0.264
> > identical(z, f1z)
> [1] TRUE
> > identical(z, f2z)
> [1] TRUE
In R 2.7.0 (April 2008) f1() and f2() both take time proportional
to nrow(dframe), while your original ResortCombinedFields() takes
time proportional to the square of nrow(dframe).  E.g., for 50,000
rows ResortCombinedFields takes 4252 seconds while f2 takes 14 seconds
It looks like 2.9 acts about the same.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

Johannes Graumann

2008-Nov-26 10:36 UTC

head link

[R] Efficient passing through big data.frame and modifying select

Marvelous! Thanks guys for your hints and time! Very smooth now!

Joh

On Wednesday 26 November 2008 03:41:49 Henrik Bengtsson
wrote:> Alright, here are another $.02: using 'use.names=FALSE' in unlist()
is
> much faster than the default 'use.names=TRUE'. /Henrik
>
> On Tue, Nov 25, 2008 at 6:40 PM, Henrik Bengtsson <hb at
stat.berkeley.edu>
wrote:> > My $.02: Using argument 'fixed=TRUE' in strsplit() is much
faster than
> > the default 'fixed=FALSE'. /Henrik
> >
> > On Tue, Nov 25, 2008 at 1:02 PM, William Dunlap <wdunlap at
tibco.com> wrote:
> >>> -----Original Message-----
> >>> From: William Dunlap
> >>> Sent: Tuesday, November 25, 2008 9:16 AM
> >>> To: 'johannes_graumann at web.de'
> >>> Subject: Re: [R] Efficient passing through big data.frame and
> >>> modifying select fields
> >>>
> >>> > Johannes Graumann johannes_graumann at web.de
> >>> > Tue Nov 25 15:16:01 CET 2008
> >>> >
> >>> > Hi all,
> >>> >
> >>> > I have relatively big data frames (> 10000 rows by 80
columns)
> >>> > that need to be exposed to "merge". Works
marvelously well in
> >>> > general, but some fields of the data frames actually
contain
> >>> > multiple ";"-separated values encoded as a
character string without
> >>> > defined order, which makes the fields not match each
other.
> >>> >
> >>> > Example:
> >>> > > frame1[1,1]
> >>> >
> >>> > [1] "some;thing"
> >>> >
> >>> > >frame2[2,1]
> >>> >
> >>> > [2] "thing;some"
> >>> >
> >>> > In order to enable merging/duplicate identification of
columns
> >>> > containing these strings, I wrote the following function,
which
> >>> > passes through the rows one by one, identifies
";"-containing cells,
> >>> > splits and resorts them.
> >>> >
> >>> > ResortCombinedFields <- function(dframe){
> >>> >  if(!is.data.frame(dframe)){
> >>> >    stop("\"ResortCombinedFields\" input
needs to be a data frame.")
> >>> >  }
> >>> >  for(row in seq(nrow(dframe))){
> >>> >    for(mef in grep(";",dframe[row,])){
> >>>
> >>> I needed to add drop=TRUE to the above dframe[row,] for this
to work.
> >>>
> >>> >      dframe[row,mef] <-
> >>>
> >>>
paste(sort(unlist(strsplit(dframe[row,mef],";"))),collapse=";")
> >>>
> >>> >    }
> >>> >  }
> >>> >  return(dframe)
> >>> > }
> >>> >
> >>> > works fine, but is horribly inefficient. How might this
be
> >>>
> >>> tackled more elegantly?
> >>>
> >>> > Thanks for any input, Joh
> >>>
> >>> It is usually faster to loop over columns of an data frame and
use row
> >>> subscripting, if needed, on individual columns.  E.g., the
following
> >>> 2 are much quicker on a sample 1000 by 4 dataset I made with
> >>>
> >>> dframe<-data.frame(lapply(c(One=1,Two=2,Three=3),
> >>>    function(i)sapply(1:1000,
> >>>       function(i)
> >>>
> >>> paste(sample(LETTERS[1:5],size=sample(3,size=1),repl=FALSE),
> >>> collapse=";"))),
> >>>    stringsAsFactors=FALSE)
> >>> dframe$Four<-sample(LETTERS[1:5], size=nrow(dframe),
> >>> replace=TRUE) # no ;'s in column Four
> >>>
> >>> The first function, f1, doesn't try to find which rows may
> >>> need adjusting
> >>> and the second, f2, does.
> >>>
> >>> f1 <- function(dframe){
> >>>   if(!is.data.frame(dframe)){
> >>>     stop("\"ResortCombinedFields\" input needs
to be a data frame.")
> >>>   }
> >>>   for(icol in seq_len(ncol(dframe))){
> >>>     dframe[,icol] <- unlist(lapply(strsplit(dframe[,icol],
> >>> ";"), function(parts) paste(sort(parts),
collapse=";")))
> >>>   }
> >>>   return(dframe)
> >>> }
> >>>
> >>> f2 <-
> >>> function(dframe){
> >>>   if(!is.data.frame(dframe)){
> >>>     stop("\"ResortCombinedFields\" input needs
to be a data frame.")
> >>>   }
> >>>   for(icol in seq_len(ncol(dframe))){
> >>>     col <- dframe[,icol]
> >>>     irow <- grep(";", col)
> >>>     if (length(irow)) {
> >>>         col[irow] <- unlist(lapply(strsplit(col[irow],
";"),
> >>> function(parts) paste(sort(parts), collapse=";")))
> >>>         dframe[,icol] <- col
> >>>     }
> >>>   }
> >>>   return(dframe)
> >>> }
> >>>
> >>> Times were
> >>>
> >>> > unix.time(z<-ResortCombinedFields(dframe))
> >>>
> >>>    user  system elapsed
> >>>   2.526   0.022   2.559
> >>>
> >>> > unix.time(f1z<-f1(dframe))
> >>>
> >>>    user  system elapsed
> >>>   0.509   0.000   0.508
> >>>
> >>> > unix.time(f2z<-f2(dframe))
> >>>
> >>>    user  system elapsed
> >>>   0.259   0.004   0.264
> >>>
> >>> > identical(z, f1z)
> >>>
> >>> [1] TRUE
> >>>
> >>> > identical(z, f2z)
> >>>
> >>> [1] TRUE
> >>
> >> In R 2.7.0 (April 2008) f1() and f2() both take time proportional
> >> to nrow(dframe), while your original ResortCombinedFields() takes
> >> time proportional to the square of nrow(dframe).  E.g., for 50,000
> >> rows ResortCombinedFields takes 4252 seconds while f2 takes 14
seconds
> >> It looks like 2.9 acts about the same.
> >>
> >> Bill Dunlap
> >> TIBCO Software Inc - Spotfire Division
> >> wdunlap tibco.com
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html and provide commented,
> >> minimal, self-contained, reproducible code.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 835 bytes
Desc: This is a digitally signed message part.
URL:
<https://stat.ethz.ch/pipermail/r-help/attachments/20081126/f2c08061/attachment.bin>

Maybe Matching Threads

Search for more possibly parallel threads

R help - Nov 2008 - Efficient passing through big data.frame and modifying select

[R] Efficient passing through big data.frame and modifying select

[R] Efficient passing through big data.frame and modifying select

Maybe Matching Threads