thr3ads.net - R help - [R] Parallel Scan of Large File [Dec 2010]


Ryan Garner

2010-Dec-08 01:22 UTC

[R] Parallel Scan of Large File

Is it possible to scan a large file into a character vector in parallel, in
1M-record chunks, using scan() with the "doMC" package? Furthermore, can I
specify the tasks for each child?

I.e., I'm working on a Linux box with 8 cores and would like to scan in 8M
records at a time (all 8 cores scan 1M records each) from a file with 40M
records total.

file <- file("data.txt","r")
child <- foreach(i = icount(40)) %dopar%
{
    scan(file,what = "character",sep = "\n",skip = 0,nlines = 1e6)
}

Thus, each child would have a different skip argument. child[[1]]: skip = 0,
child[[2]]: skip = 1e6, child[[3]]: skip = 2e6, ... , child[[40]]:
skip = 39e6. I would then end up with a list of 40 vectors, with
child[[1]] containing records 1 to 1000000, child[[2]] containing records
1000001 to 2000000, ... , child[[40]] containing records 39000001 to
40000000.

Also, would one file connection suffice, or does there need to be a file
connection that opens and closes for each child?




-- 
View this message in context:
http://r.789695.n4.nabble.com/Parallel-Scan-of-Large-File-tp3077545p3077545.html
Sent from the R help mailing list archive at Nabble.com.

Mike Marchywka

2010-Dec-08 13:24 UTC


[R] Parallel Scan of Large File

> Date: Tue, 7 Dec 2010 17:22:57 -0800
> From: ryan.steven.garner at gmail.com
> To: r-help at r-project.org
> Subject: [R] Parallel Scan of Large File
>
> Is it possible to parallel scan a large file into a character vector in 1M
> chunks using scan() with the "doMC" package? Furthermore, can I specify the
> tasks for each child?
>
> i.e. I'm working on a Linux box with 8 cores and would like to scan in 8M
> records at time (all 8 cores scan 1M records at a time) from a file with 40M
> records total.

I can't comment on R approaches, but if your rationale here is speed
and you hope to scale this up to bigger files, I would suggest more
analysis or measurement. In the case you outline, disk IO is probably
going to be the rate-limiting step. It usually helps if you can make
things predictable so the disk and memory caches can be used efficiently.
If you split up disk IO among different threads, there is no reasonable
way the hardware can figure out which access is likely to be next.
Further, things like "skip()" are often implemented as dummy reads
on sequential file access calls.

If you pursue this, I'd be curious to see what kind of results you get as
you go from 1 to 8 cores with larger files.

You would probably be better off if you could find a way to pipeline this
work rather than split it up as you suggest. The idea sounds good, of course:
you end up with 8 cores looking at your text. But you could easily be limited
by some other resource, like bus bandwidth to disk or memory. As each core
gets a bigger chunk, eventually you run out of physical memory, and then of
course you are just doing disk IO for VM.
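
One possible reading of the sequential, predictable access suggested above is to keep a single connection open and pull fixed-size chunks from it in order, handing each chunk to a processing step as it arrives. The following is a minimal sketch under that assumption, not something posted in the thread; read_in_chunks and process_chunk are hypothetical names, and a small temporary file stands in for data.txt.

```r
# Sketch: read a file sequentially in fixed-size chunks from ONE open
# connection. `read_in_chunks` and `process_chunk` are hypothetical names.
read_in_chunks <- function(path, chunk_size, process_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  repeat {
    # readLines() continues from the connection's current position,
    # so no skip argument (and no re-reading of earlier lines) is needed.
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    results[[length(results) + 1L]] <- process_chunk(lines)
  }
  results
}

# Small stand-in for data.txt: 10 records, read 4 at a time.
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("record %d", 1:10), demo_path)

# Here the "processing" is just counting the records in each chunk.
chunk_lengths <- read_in_chunks(demo_path, 4L, length)
```

Because the file is only ever read front to back, the disk sees one sequential stream. Whether this beats the parallel approach would still have to be measured on the real file.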



Luedde, Mirko

2010-Dec-09 12:03 UTC


[R] Parallel Scan of Large File

Hi Ryan, 

the "Getting Started with doMC and foreach" manual 
tells me that you might have forgotten a 

  registerDoMC()

in the example you provided.
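
For illustration, a version of the example with the registration call added might look like the sketch below. It rests on assumptions not confirmed in this thread: each child is given the file name rather than a shared connection, so scan() opens and closes its own connection, and chunk i skips the (i - 1) * chunk_size lines that precede it. The file, chunk size and core count are small hypothetical stand-ins for data.txt, 1e6 and 8.

```r
library(foreach)
library(doMC)

# Register the parallel backend; without this, %dopar% runs sequentially
# and issues a warning. The 8-core machine in the question would use cores = 8.
registerDoMC(cores = 2)

# Small stand-in for data.txt: 10 records in chunks of 4 (hypothetical sizes;
# the case in the question is 40 chunks of 1e6 records).
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("record %d", 1:10), demo_path)
chunk_size <- 4
n_chunks <- 3

child <- foreach(i = seq_len(n_chunks)) %dopar% {
  # Passing the file name lets scan() open and close its own connection
  # inside each child; chunk i skips the lines belonging to earlier chunks.
  scan(demo_path, what = "character", sep = "\n",
       skip = (i - 1) * chunk_size, nlines = chunk_size, quiet = TRUE)
}
```

Here child[[1]] holds records 1 to 4 and the last element holds whatever remains. Each child still has to read past the lines it skips, so the disk IO concern raised earlier in the thread applies to this sketch as well.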

Best, Mirko
