I was thinking about [Bug 3099]: while it's easy to get a 2-3x speedup for the average app using parallel scans, the spread around that figure is wide, and in the worst case the result could be below 1x, i.e. a slowdown. That is very unlikely, but the chances go up with primitive or constrained hardware (in a container or VM). A better approach, with a smaller standard deviation I believe, might be to make all I/O calls asynchronous (AIO). It seems that would let the OS complete them at its own discretion, which in the ideal case would mean minimal wasted disk-head movement. The advantage of AIO is that the OS can coalesce calls at its leisure, whereas an upper-level app algorithm might divide the work fairly but can't know how much each underlying request costs in terms of wasted head movement. Is that already in there or in the works? And do you think it would avoid the worst case, where the file scanning is divided according to the FS hierarchy rather than the underlying disk layout?
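
To make the head-movement concern concrete, here is a toy model rather than a measurement. It assumes a single disk head whose seek cost is simply the distance between block offsets, and the offsets are made up for illustration. It compares servicing requests in directory-walk order against servicing the same requests sorted by offset, which is roughly the freedom a scheduler has when it can see every outstanding request at once.

```python
def head_travel(offsets, start=0):
    """Total distance the head moves when servicing offsets in the given order."""
    pos = start
    total = 0
    for off in offsets:
        total += abs(off - pos)
        pos = off
    return total


# Hypothetical block offsets of eight files, listed in the order a
# directory walk would visit them (FS hierarchy, not disk layout).
walk_order = [900, 10, 850, 40, 700, 120, 650, 200]

# The same requests ordered by position on disk, as a scheduler that
# sees all of them at once is free to do.
disk_order = sorted(walk_order)

print("walk order travel:", head_travel(walk_order))
print("disk order travel:", head_travel(disk_order))
```

With these invented offsets the walk order costs 5660 units of head travel and the sorted order 900. Real disks, caches, and SSDs will behave differently, so this only shows why an ordering based on the hierarchy can be far from the one the disk layout favors.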