Hello,

I am trying the DBSCAN clustering algorithm on a huge data matrix (26000 x 26000). I don't have the data points, just the distance matrix. It comes to 17 GB on disk, and needs to be loaded into R to use the DBSCAN implementation (in package fpc). I tried read.csv, but R crashed.

After it runs for about 10 minutes I just get the message 'Killed':

dist <- read.csv('dist.csv', header = FALSE)
Killed

So I checked whether there is an R package that handles big data like this, and came across the bigmemory package. I installed it and ran the command below, but that does not work either: R exits.

> dist <- read.big.matrix('dist.csv', sep = ',', header = FALSE)

*** caught bus error ***
address 0x7fbc4faba000, cause 'non-existent physical address'

Traceback:
 1: .Call("bigmemory_CreateSharedMatrix", PACKAGE = "bigmemory", row, col, colnames, rownames, typeLength, ini, separated)
 2: CreateSharedMatrix(as.double(nrow), as.double(ncol), as.character(colnames), as.character(rownames), as.integer(typeVal), as.double(init), as.logical(separated))
 3: big.matrix(nrow = numRows, ncol = createCols, type = type, dimnames = list(rowNames, colNames), init = NULL, separated = separated, backingfile = backingfile, backingpath = backingpath, descriptorfile = descriptorfile, binarydescriptor = binarydescriptor, shared = TRUE)
 4: read.big.matrix("dist.csv", sep = ",", header = FALSE)
 5: read.big.matrix("dist.csv", sep = ",", header = FALSE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 2
Save workspace image? [y/n/c]: n
Warning message:
In read.big.matrix("dist.csv", sep = ",", header = FALSE) :
  Because type was not specified, we chose double based on the first line of data.

So how do I handle such huge data in R for DBSCAN? Or is there a DBSCAN implementation in another programming language that can handle a 17 GB distance matrix?

Regards,
Ajay
The easy way is to use a machine with, say, 32 GB of RAM. You can rent them by the hour from AWS or Google Cloud at very reasonable prices.

Best,
Ista

On Nov 27, 2015 8:39 AM, "Ajay Ramaseshan" <ajay_ramaseshan at hotmail.com> wrote:
> I am trying the DBSCAN clustering algorithm on a huge data matrix (26000 x
> 26000). I don't have the data points, just the distance matrix. It comes to
> 17 GB on disk, and needs to be loaded into R to use the DBSCAN
> implementation (in package fpc). [...]
>
> So how do I handle such huge data in R for DBSCAN? Or is there a DBSCAN
> implementation in another programming language that can handle a 17 GB
> distance matrix?
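A rough back-of-the-envelope check of that sizing advice, assuming R's usual 8 bytes per double-precision element (the headroom comment is an assumption about typical duplication during operations):

n <- 26000
## Approximate memory for one in-memory copy of a 26000 x 26000
## matrix stored as doubles (8 bytes per element).
gb_per_copy <- n * n * 8 / 2^30
gb_per_copy                    # roughly 5 GB per copy
## R frequently duplicates objects during operations, so a 32 GB
## machine leaves headroom for several copies plus the OS.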
On 27/11/2015 6:03 AM, Ajay Ramaseshan wrote:
> Hello,
>
> I am trying the DBSCAN clustering algorithm on a huge data matrix (26000 x
> 26000). I don't have the data points, just the distance matrix. It comes to
> 17 GB on disk, and needs to be loaded into R to use the DBSCAN
> implementation (in package fpc). I tried read.csv, but R crashed.
>
> After it runs for about 10 minutes I just get the message 'Killed'.

This is coming from your OS, not from R.

> dist <- read.csv('dist.csv', header = FALSE)

This would be much faster if you specified the column types and number of rows. Try

read.csv('dist.csv', header = FALSE, colClasses = "numeric", nrows = 26000)

(assuming all entries are numeric). And once you've read it in, convert it to a matrix; data frame operations tend to be slow.

> Killed
>
> So I checked whether there is an R package that handles big data like this,
> and came across the bigmemory package. I installed it and ran the command
> below, but that does not work either: R exits.

Plain base R can handle a data frame or matrix of that size; you don't need a special package. To see this, try

m <- matrix(0, 26000, 26000)

However, it takes a lot of memory. Make sure you are trying this on a machine with 10 or 20 GB of free memory. (Each copy of your data takes about 5 GB; operations may result in duplication.)

Duncan Murdoch
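Putting those suggestions together, a minimal sketch of the workflow might look like the one below. It assumes dist.csv contains only numeric values with no header, and the eps and MinPts values are placeholders that would need tuning for the actual data:

library(fpc)

## Read the 26000 x 26000 distance matrix; specifying colClasses and
## nrows lets read.csv skip type guessing and pre-allocate storage.
d <- read.csv("dist.csv", header = FALSE,
              colClasses = "numeric", nrows = 26000)
d <- as.matrix(d)              # matrix operations are faster than data frame ones

## method = "dist" tells fpc::dbscan() that 'd' already holds pairwise
## distances, so it will not try to compute them from raw coordinates.
## eps and MinPts here are placeholders, not recommended settings.
fit <- dbscan(d, eps = 0.5, MinPts = 5, method = "dist")

table(fit$cluster)             # cluster 0 holds the points flagged as noise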