Hello,

I am trying the DBSCAN clustering algorithm on a huge data matrix (26000 x 26000). I don't have the data points, just the distance matrix. It comes to 17 GB on disk, and needs to be loaded into R to use the DBSCAN implementation (in package fpc). I tried read.csv, but R crashed.

After it runs for about 10 minutes I just get the message 'Killed':

dist <- read.csv('dist.csv', header = FALSE)
Killed

So I checked whether there is an R package that handles big data like this, and came across the bigmemory package. I installed it and ran the command below, but that does not work either: R exits.

> dist <- read.big.matrix('dist.csv', sep = ',', header = FALSE)

*** caught bus error ***
address 0x7fbc4faba000, cause 'non-existent physical address'

Traceback:
 1: .Call("bigmemory_CreateSharedMatrix", PACKAGE = "bigmemory", row, col, colnames, rownames, typeLength, ini, separated)
 2: CreateSharedMatrix(as.double(nrow), as.double(ncol), as.character(colnames), as.character(rownames), as.integer(typeVal), as.double(init), as.logical(separated))
 3: big.matrix(nrow = numRows, ncol = createCols, type = type, dimnames = list(rowNames, colNames), init = NULL, separated = separated, backingfile = backingfile, backingpath = backingpath, descriptorfile = descriptorfile, binarydescriptor = binarydescriptor, shared = TRUE)
 4: read.big.matrix("dist.csv", sep = ",", header = FALSE)
 5: read.big.matrix("dist.csv", sep = ",", header = FALSE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 2
Save workspace image? [y/n/c]: n
Warning message:
In read.big.matrix("dist.csv", sep = ",", header = FALSE) :
  Because type was not specified, we chose double based on the first line of data.

So how do I handle such huge data in R for DBSCAN? Or is there a DBSCAN implementation in another programming language that can handle a 17 GB distance matrix?

Regards,
Ajay
The easy way is to use a machine with, say, 32 GB of RAM. You can rent them by the hour from AWS or Google Cloud at very reasonable prices.

Best,
Ista

On Nov 27, 2015 8:39 AM, "Ajay Ramaseshan" <ajay_ramaseshan at hotmail.com> wrote:
> I am trying the DBSCAN clustering algorithm on a huge data matrix (26000 x
> 26000). I don't have the data points, just the distance matrix. It comes to
> 17 GB on disk, and needs to be loaded into R to use the DBSCAN
> implementation (in package fpc). [...]
>
> So how do I handle such huge data in R for DBSCAN? Or is there a DBSCAN
> implementation in another programming language that can handle a 17 GB
> distance matrix?
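A rough back-of-the-envelope check of that sizing advice, assuming R's usual 8 bytes per double-precision element (the headroom comment is an assumption about typical duplication during operations):

n <- 26000
## Approximate memory for one in-memory copy of a 26000 x 26000
## matrix stored as doubles (8 bytes per element).
gb_per_copy <- n * n * 8 / 2^30
gb_per_copy                    # roughly 5 GB per copy
## R frequently duplicates objects during operations, so a 32 GB
## machine leaves headroom for several copies plus the OS.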
On 27/11/2015 6:03 AM, Ajay Ramaseshan wrote:
> Hello,
>
> I am trying the DBSCAN clustering algorithm on a huge data matrix (26000 x
> 26000). I don't have the data points, just the distance matrix. It comes to
> 17 GB on disk, and needs to be loaded into R to use the DBSCAN
> implementation (in package fpc). I tried read.csv, but R crashed.
>
> After it runs for about 10 minutes I just get the message 'Killed'.

This is coming from your OS, not from R.

> dist <- read.csv('dist.csv', header = FALSE)

This would be much faster if you specified the column types and number of rows. Try

read.csv('dist.csv', header = FALSE, colClasses = "numeric", nrows = 26000)

(assuming all entries are numeric). And once you've read it in, convert it to a matrix; data frame operations tend to be slow.

> Killed
>
> So I checked whether there is an R package that handles big data like this,
> and came across the bigmemory package. I installed it and ran the command
> below, but that does not work either: R exits.

Plain base R can handle a data frame or matrix of that size; you don't need a special package. To see this, try

m <- matrix(0, 26000, 26000)

However, it takes a lot of memory. Make sure you are trying this on a machine with 10 or 20 GB of free memory. (Each copy of your data takes about 5 GB; operations may result in duplication.)

Duncan Murdoch
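Putting those suggestions together, a minimal sketch of the workflow might look like the one below. It assumes dist.csv contains only numeric values with no header, and the eps and MinPts values are placeholders that would need tuning for the actual data:

library(fpc)

## Read the 26000 x 26000 distance matrix; specifying colClasses and
## nrows lets read.csv skip type guessing and pre-allocate storage.
d <- read.csv("dist.csv", header = FALSE,
              colClasses = "numeric", nrows = 26000)
d <- as.matrix(d)              # matrix operations are faster than data frame ones

## method = "dist" tells fpc::dbscan() that 'd' already holds pairwise
## distances, so it will not try to compute them from raw coordinates.
## eps and MinPts here are placeholders, not recommended settings.
fit <- dbscan(d, eps = 0.5, MinPts = 5, method = "dist")

table(fit$cluster)             # cluster 0 holds the points flagged as noise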