James Sams
2014-Mar-15 17:53 UTC
[Rd] allocation error and high CPU usage from kworker and migration: memory fragmentation?
Hi,

I'm new to this list (and to R), but my impression is that this question is more appropriate here than on R-help. I hope that is right.

I'm having several issues with the performance of an R script. Occasionally it crashes with the well-known 'Error: cannot allocate vector of size X' (most recently X was 4.8 Gb). When it doesn't crash, CPU usage frequently drops quite low (often to 0) with high migration/X usage. Adding the 'last CPU used' field to top shows that the R process is hopping from core to core quite frequently. Using taskset to set an affinity to one core results in CPU usage more typically in the 40-60% range with no migration/X usage, but the core then starts sharing time with a kworker task. renice'ing doesn't seem to change anything. If I had to guess, I would think that the kworker task comes from R trying to rearrange things in memory to make space for my large objects.

2 machines:
- 128 and 256 GiB RAM,
- dual processor Xeons (16 cores + hyperthreading, 32 total 'cores'),
- Ubuntu 13.10 and 13.04 (both 64 bit),
- R 3.0.2,
- data.table 1.8.11 (svn r1129).*

Data: We have main fact tables stored in about 1000 R data files that range up to 3 GiB in size on disk, so up to about 50 GiB in RAM.

Questions:
- Why is R skipping around cores so much? I've never seen that happen before with other R scripts or with other statistical software. Is it something I'm doing?
- When I set the affinity of R to one core, why is there so much kworker activity? It seems obvious that it is the R script generating this kworker activity on the same core. I'm guessing this is R trying to recover from memory fragmentation?
- I suspect a lot of my problem is from the merges. If I did the two merges in one line, would this help at all?
      move <- merge(merge(move, upc, by=c('upc')), parent, by=c('store', 'year'))
  * other strategies to improve merge performance?
- If this is a memory fragmentation issue, is there a way to get lapply to allocate not just pointers to the data.tables that will be allocated, but to (over)allocate the data.tables themselves? The final list should be about 1000 data.tables long, with each data.table no larger than 6000x4.

I've used data.table in a similar strategy to build lists like this before without issue from the same data. I'm not sure what is different about this code compared to my other code. Perhaps the merging?

The gist of the R code is pretty basic (modified for simplicity). The action is all happening in reduction_function and the lapply. I keep reassigning to move to try to indicate to R that it can gc the previous object referenced by move.

    library(data.table)
    library(lubridate)

    # imports several data.tables, total 730 MiB
    load(UPC)     # provides PL_flag data.table
    load(STORES)  # and parent data.table

    timevar = 'month'
    by = c('retailer', 'month')
    save.dir = '/tmp/R_cache'

    reduction_function <- function(filename, upc, parent, timevar, by, save.dir=NA) {
        # imports move, a potentially large data.table (memory size 10 MiB-50 GiB)
        load(filename)
        move[, c(timevar, 'year') := list(floor_date(week_end, unit=timevar),
                                          year(week_end))]
        move <- merge(move, upc, by=c('upc'))               # adds is_PL column, a boolean
        move <- merge(move, parent, by=c('store', 'year'))  # adds parent column, an integer
        setkeyv(move, by)
        # this reduces move to a data.table with at most 6000 rows, but always 4 columns
        move <- move[, list(revenue=sum(price*units),
                            revenue_PL=sum(price*units*is_PL)),
                     keyby=by]
        # search and replace are not shown in this simplified version
        move[, category := gsub(search, replace, filename)]
        return(move)
    }

    each.parent <- rbindlist(lapply(sort(list.files(MOVEMENT, full.names=TRUE)),
                                    reduction_function,
                                    upc=PL_flag, parent=parent,
                                    timevar=timevar, by=by))

-- 
James Sams
sams.james at gmail.com
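
On the merge question above, one strategy that may be worth trying is a data.table update join, which adds the lookup columns to move by reference instead of building a merged copy of the whole table for each merge. What follows is only a minimal sketch under stated assumptions: it needs a data.table release that supports the on= argument (1.9.6 or later, so newer than the 1.8.11 mentioned above), and the toy tables move, upc_tab and parent_tab, their values, and the grouping columns are hypothetical stand-ins rather than the real data. Whether this helps with the allocation errors has not been measured here.

```r
library(data.table)

# Hypothetical toy stand-ins for the real tables.
move <- data.table(
  upc   = c(1L, 1L, 2L, 2L),
  store = c(10L, 11L, 10L, 11L),
  year  = c(2012L, 2012L, 2012L, 2012L),
  month = as.Date(c("2012-01-01", "2012-01-01", "2012-02-01", "2012-02-01")),
  price = c(2, 3, 4, 5),
  units = c(1L, 2L, 3L, 4L)
)
upc_tab    <- data.table(upc = c(1L, 2L), is_PL = c(TRUE, FALSE))
parent_tab <- data.table(store = c(10L, 11L), year = 2012L, parent = c(100L, 100L))

# Update joins: each lookup column is added to move by reference,
# so no merged copy of move is created.
move[upc_tab, is_PL := i.is_PL, on = "upc"]
move[parent_tab, parent := i.parent, on = c("store", "year")]

# merge() defaults to an inner join; an update join instead leaves NA in
# unmatched rows, so drop those rows to get the same set of rows.
move <- move[!is.na(is_PL) & !is.na(parent)]

# Same reduction as in reduction_function, grouped here by the
# (assumed) columns parent and month.
result <- move[, list(revenue    = sum(price * units),
                      revenue_PL = sum(price * units * is_PL)),
               keyby = c("parent", "month")]
print(result)
```

The filter after the joins matters: without it, rows that have no match in the lookup tables would stay in move with NA values and turn the sums into NA, which is a behaviour change relative to the merge() version.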
