Brian.J.GREGOR@odot.state.or.us
2004-Apr-08 21:12 UTC
[R] Why are Split and Tapply so slow with named vectors, why is a for loop faster than mapply
First, here's the problem I'm working on so you understand the context. I have a data frame of travel activity characteristics with 70,000+ records. These activities are identified by unique chain numbers. (Activities are part of trip chains.) There are 17,500 chains. I use the chain numbers as factors to split various data fields into lists of chain characteristics, with each element of the list representing one chain. For example:

> betaHomeDist[1:3]
$"400001111"
     1316      2319      2317      1364      1316
 0.000000 14.930820 24.431210  6.174959  0.000000

$"400001211"
     1316      2319      2319      1364      1316
 0.000000 14.930820 14.930820  6.174959  0.000000

$"400001212"
     1316      1364      2324      1364      1316
 0.000000  6.174959 14.392375  6.174959  0.000000

Each element of the list is a named vector, and each vector element is named with the zone that the activity occurred within. I use these names in subsequent computations.

What I've found, however, is that it is not easy (or I have not found the easy way) to split a named vector into a list that retains the vector names. For example, splitting an unnamed vector (70,000+) based on the chain numbers takes very little time:

> system.time(actTimeList <- split(actTime, chainId))
[1] 0.16 0.00 0.15 NA NA

But if the vector is named, R will work for minutes and still not complete the job:

> names(actTime) <- zoneNames
> system.time(actTimeList <- split(actTime, chainId))
Timing stopped at: 83.22 0.12 84.49 NA NA

The same thing happens when using tapply with a named vector, such as:

tapply(actTime, chainId, function(x) x)

Using the following function with a for loop accomplishes the job in a few seconds for all 70,000+ records:

> splitWithNames <- function(dataVector, nameVector, factorVector){
+ dataList <- split(dataVector, factorVector)
+ nameList <- split(nameVector, factorVector)
+ listLength <- length(dataList)
+ namedDataList <- list(NULL)
+ for(i in 1:listLength){
+ x <- dataList[[i]]
+ names(x) <- nameList[[i]]
+ namedDataList[[i]] <- x
+ }
+ namedDataList
+ }
> system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
[1] 8.04 0.00 9.03 NA NA

However, if I rewrite the function to use mapply instead of a for loop, it again takes a long (undetermined) amount of time to complete. Here are the results for just 5000 and 10000 records. You can see that there is a scaling issue:

> testfun <- function(dataVector, nameVector, factorVector){
+ dataList <- split(dataVector, factorVector)
+ nameList <- split(nameVector, factorVector)
+ nameFun <- function(x, xNames){
+ names(x) <- xNames
+ x
+ }
+ mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
+ }
> system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000], chainId[1:5000]))
[1] 2.99 0.00 2.98 NA NA
> system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000], chainId[1:10000]))
[1] 10.64 0.00 10.64 NA NA

My problem is solved for now with the home-brew splitWithNames function, but I'm curious: why do named vectors slow down split and tapply so much, and why is a function using mapply so much slower than a function that uses a for loop?

My computer is an 800+ MHz Pentium III with 512 MB of memory. The operating system is Windows NT 4.0. My R version is 1.8.1.

Thank you.

Brian Gregor, P.E.
Transportation Planning Analysis Unit
Oregon Department of Transportation
Brian.J.GREGOR at odot.state.or.us
(503) 986-4120
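[The comparison above can be tried on synthetic data of roughly the same shape. The sketch below is not from the thread: the object names mirror the post, but the sizes and values are made up, so the absolute timings will differ from those reported; it is the second split() call that hits the slow path discussed in the reply below under R 1.8.1.]

## Hypothetical stand-ins for the data described above: ~70,000 activity
## records spread over ~17,500 trip chains; values are random.
n         <- 70000
chainId   <- factor(sample(17500, n, replace = TRUE))
actTime   <- runif(n, 0, 120)
zoneNames <- as.character(sample(1000:3000, n, replace = TRUE))

## Unnamed vector: split() can use its fast internal code.
system.time(actTimeList <- split(actTime, chainId))

## Named vector: under R 1.8.1 this falls back to a much slower path
## (see the reply below); recent versions of R handle it quickly.
names(actTime) <- zoneNames
system.time(actTimeList <- split(actTime, chainId))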
Peter Dalgaard
2004-Apr-08 22:09 UTC
[R] Why are Split and Tapply so slow with named vectors, why is a for loop faster than mapply
Brian.J.GREGOR at odot.state.or.us writes:

> What I've found, however, is that it is not easy (or I have not found the
> easy way) to split a named vector into a list that retains the vector names.
> For example, splitting an unnamed vector (70,000+) based on the chain
> numbers takes very little time:
> > system.time(actTimeList <- split(actTime, chainId))
> [1] 0.16 0.00 0.15 NA NA
>
> But if the vector is named, R will work for minutes and still not complete
> the job:
> > names(actTime) <- zoneNames
> > system.time(actTimeList <- split(actTime, chainId))
> Timing stopped at: 83.22 0.12 84.49 NA NA
>
> The same thing happens when using tapply with a named vector, such as:
> tapply(actTime, chainId, function(x) x)
>
> Using the following function with a for loop accomplishes the job in a few
> seconds for all 70,000+ records:
> > splitWithNames <- function(dataVector, nameVector, factorVector){
> + dataList <- split(dataVector, factorVector)
> + nameList <- split(nameVector, factorVector)
> + listLength <- length(dataList)
> + namedDataList <- list(NULL)
> + for(i in 1:listLength){
> + x <- dataList[[i]]
> + names(x) <- nameList[[i]]
> + namedDataList[[i]] <- x
> + }
> + namedDataList
> + }
> > system.time(actTimeList <- splitWithNames(actTime, zoneNames, chainId))
> [1] 8.04 0.00 9.03 NA NA
>
> However, if I rewrite the function to use mapply instead of a for loop, it
> again takes a long (undetermined) amount of time to complete. Here are the
> results for just 5000 and 10000 records. You can see that there is a
> scaling issue:
> > testfun <- function(dataVector, nameVector, factorVector){
> + dataList <- split(dataVector, factorVector)
> + nameList <- split(nameVector, factorVector)
> + nameFun <- function(x, xNames){
> + names(x) <- xNames
> + x
> + }
> + mapply(nameFun, dataList, nameList, SIMPLIFY=TRUE)
> + }
> > system.time(actTimeList <- testfun(actTime[1:5000], zoneNames[1:5000],
> chainId[1:5000]))
> [1] 2.99 0.00 2.98 NA NA
> > system.time(actTimeList <- testfun(actTime[1:10000], zoneNames[1:10000],
> chainId[1:10000]))
> [1] 10.64 0.00 10.64 NA NA
>
> My problem is solved for now with the home-brew splitWithNames function, but
> I'm curious: why do named vectors slow down split and tapply so much, and
> why is a function using mapply so much slower than a function that uses a
> for loop?

If you look inside split.default, you'll see that it only uses fast internal code in simple cases:

    if (is.null(attr(x, "class")) && is.null(names(x)))
        return(.Internal(split(x, f)))

in the other cases, we use

    for (k in lf) y[[k]] <- x[f %in% k]

and if lf is large, we get a large number of calls to %in%. This wasn't really designed for that case, but I suppose we could be smarter about it.

Wouldn't know about mapply, but are you sure you want SIMPLIFY=TRUE in there???

--
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark        Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
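[Putting the two observations above together suggests a workaround along these lines. This is a sketch, not from the original thread; splitKeepNames and its argument names are made up. The idea: strip the names so split() takes the fast internal path quoted above, split the names separately against the same factor, and reattach them group by group with mapply(), using SIMPLIFY = FALSE so the result stays a list, as the closing question hints.]

splitKeepNames <- function(x, f) {
    nms <- names(x)
    names(x) <- NULL           # unnamed vector: split() can use .Internal(split(x, f))
    dataList <- split(x, f)    # fast split of the values
    nameList <- split(nms, f)  # fast split of the names against the same factor
    ## reattach the names group by group; SIMPLIFY = FALSE keeps a plain list
    mapply(function(vals, valNames) { names(vals) <- valNames; vals },
           dataList, nameList, SIMPLIFY = FALSE)
}

## e.g., with the objects from the original post:
## actTimeList <- splitKeepNames(actTime, chainId)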