I have a bunch of data sets that were created for the libsvm tool. They are in "colon separated sparse format".

For example,

1 5:1 27:3 345:10

is a row with the label "1" that has values only in columns 5, 27, and 345.

I want to read these into a data.frame in R.

Is there a simple way to do this?

--
Noah Silverman, M.S.
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095
Mr Silverman,

On 9 October 2012 00:56, Noah Silverman <noahsilverman@ucla.edu> wrote:
> I have a bunch of data sets that were created for the libsvm tool. They
> are in "colon separated sparse format".
> Is there a simple way to do this?

Use read.table with a sep of ':' and let me know how you get on.

--
H
Hello,
Here's a function that doesn't do it all but might help.
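fun <- function(x){
    # split on spaces and drop empty tokens
    x1 <- unlist(strsplit(x, " "))
    x2 <- x1[nchar(x1) > 0]
    # the first token is the label
    i <- as.integer(x2[1])
    # the remaining tokens are "col:value" pairs; after splitting,
    # odd positions hold column numbers and even positions hold values
    x3 <- unlist(strsplit(x2[-1], ":"))
    j <- as.integer(x3[rep(c(TRUE, FALSE), length(x3)/2)])
    # dense vector, zero everywhere except the listed columns
    y <- numeric(max(j))
    y[j] <- as.numeric(x3[rep(c(FALSE, TRUE), length(x3)/2)])
    list(row = i, line = y)
}

x <- "1 5:1 27:3 345:10"
fun(x)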
If you know that your labels, i.e., row numbers are consecutive, have
the function return just 'y', not a list.
Then use readLines to read the file in and lapply fun to it. Something like
ln <- readLines(filename)
lst <- lapply(ln, fun)
Then you'll have another problem: the lines' lengths. They won't all be
the same, because each vector is only as long as the largest column
number on its line, so in order to make a data.frame or matrix you'll
need extra work. Try the code above and say whether it's on the right track.
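
As a rough sketch of that extra work, one option is to pad every vector with zeros up to the longest length and then bind them row-wise. The helper name pad_to and the sample vectors below are made up for illustration; they stand in for the 'line' components that fun() returns.

```r
# Hypothetical helper: pad a numeric vector with trailing zeros to length n.
pad_to <- function(v, n) c(v, numeric(n - length(v)))

# Made-up stand-ins for the 'line' vectors returned by fun().
rows <- list(c(0, 0, 3), c(1, 0, 0, 0, 7), c(0, 2))

# Pad to the longest vector, then stack the rows into a matrix.
width <- max(lengths(rows))
m <- do.call(rbind, lapply(rows, pad_to, n = width))
df <- as.data.frame(m)
```

This builds a dense matrix, so it is only reasonable when the number of columns is modest; for very wide data a sparse representation is likely the better choice.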
Also, take a look at package Matrix. It's a recommended package and it
implements sparse matrices.
Hope this helps,
Rui Barradas
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
If you want something that is fast, read the file in, strip off the colon/data, write it out to a temp file, and then read it back in. Here is a 355K line file:

> temp <- tempfile()
> input <- readLines('/temp/colon.txt')
> length(input)
[1] 355212
> system.time(input <- gsub("(:[0-9]+)", "", input))
   user  system elapsed
   0.72    0.00    0.74
> head(input)
[1] "1 5 27 345" "1 5 27 345" "1 5 27 345" "1 5 27 345" "1 5 27 345" "1 5 27 345"
> writeLines(input, temp)
> system.time(newInput <- read.table(temp))
   user  system elapsed
   1.08    0.02    1.13
> dim(newInput)
[1] 355212      4
> head(newInput)
  V1 V2 V3  V4
1  1  5 27 345
2  1  5 27 345
3  1  5 27 345
4  1  5 27 345
5  1  5 27 345
6  1  5 27 345

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
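
Two things may be worth keeping in mind about the gsub call above: the pattern ":[0-9]+" removes only integer values, and the result keeps the column indices but not the values themselves. A minimal sketch of a broader pattern, assuming the values never contain whitespace (the sample line is made up for illustration):

```r
# Made-up line with decimal, negative, and exponent-style values.
x <- "1 5:0.5 27:-3 345:1e2"

# Remove each colon together with everything up to the next whitespace.
idx_only <- gsub(":[^[:space:]]+", "", x)
idx_only
```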
Matrix::spMatrix can help.
Read your data file with lns <- readLines("fileName") to get
something like
lns <- c("1 5:15 7:17 9:19",
"2 2:22 8:28",
"4 6:46")
Then use a function like the following that reformats the
data to the i=row,j=col,x=value vectors that spMatrix can use.
library(Matrix)  # provides spMatrix

f <- function(lns, nrow=NULL, ncol=NULL)
{
    # expect lines of the form
    #   "rowNum<whiteSpace>colNum:value[<whiteSpace>colNum:value ...]"
    # prefix every colNum:value pair with its row number
    triples <- unlist(lapply(strsplit(lns, "[ \t]+"),
                             function(ln)paste(sep=":",ln[1],ln[-1])))
    triples <- strsplit(triples, ":")
    if (any(vapply(triples, length, 0) != 3))
        stop("formatting error")
    # one row per entry: i (row), j (column), x (value)
    ijx <- matrix(as.numeric(unlist(triples)), ncol=3, byrow=TRUE)
    if (is.null(nrow)) nrow <- max(ijx[,1])
    if (is.null(ncol)) ncol <- max(ijx[,2])
    spMatrix(nrow=nrow, ncol=ncol, i=ijx[,1], j=ijx[,2], x=ijx[,3])
}
Use it as

> f(lns)
4 x 9 sparse Matrix of class "dgTMatrix"
[1,] .  . . . 15  . 17  . 19
[2,] . 22 . .  .  .  . 28  .
[3,] .  . . .  .  .  .  .  .
[4,] .  . . .  . 46  .  .  .

or, if you know the number of rows and columns, tell it:

> f(lns, 10, 10)
10 x 10 sparse Matrix of class "dgTMatrix"
 [1,] .  . . . 15  . 17  . 19 .
 [2,] . 22 . .  .  .  . 28  . .
 [3,] .  . . .  .  .  .  .  . .
 [4,] .  . . .  . 46  .  .  . .
 [5,] .  . . .  .  .  .  .  . .
 [6,] .  . . .  .  .  .  .  . .
 [7,] .  . . .  .  .  .  .  . .
 [8,] .  . . .  .  .  .  .  . .
 [9,] .  . . .  .  .  .  .  . .
[10,] .  . . .  .  .  .  .  . .
Use as.matrix() on its output if you don't want to continue
using the sparse matrix format.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
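
In libsvm files the first field on each line is normally a class label rather than a row number, which matches how the original question describes it. A hedged sketch of a variant that uses the line position as the row index and returns the labels separately is shown below. The helper name read_sparse_lines is hypothetical and not from the thread; it assumes the Matrix package is installed and that every line has at least one col:value pair.

```r
library(Matrix)

# Hypothetical helper: parse "label col:value ..." lines. The row index is
# the position of the line; the labels are returned as a separate vector.
read_sparse_lines <- function(lns, ncol = NULL) {
  fields <- strsplit(trimws(lns), "[ \t]+")
  labels <- vapply(fields, function(f) as.numeric(f[1]), 0)
  pairs  <- lapply(fields, function(f) f[-1])
  n      <- lengths(pairs)                      # entries per line
  kv     <- strsplit(unlist(pairs), ":", fixed = TRUE)
  if (any(lengths(kv) != 2)) stop("formatting error")
  kv     <- matrix(as.numeric(unlist(kv)), ncol = 2, byrow = TRUE)
  if (is.null(ncol)) ncol <- max(kv[, 1])
  x <- sparseMatrix(i = rep(seq_along(lns), n), j = kv[, 1], x = kv[, 2],
                    dims = c(length(lns), ncol))
  list(labels = labels, x = x)
}

lns <- c("1 5:15 7:17 9:19", "2 2:22 8:28", "4 6:46")
res <- read_sparse_lines(lns)

# Dense data.frame with the label in the first column; only sensible
# when the number of columns is small enough to hold in memory.
df <- data.frame(label = res$labels, as.matrix(res$x))
```

Keeping res$x sparse and carrying res$labels alongside it avoids the dense conversion entirely when the data are wide.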
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Noah Silverman
> Sent: Monday, October 08, 2012 9:57 PM
> To: r-help
> Subject: [R] Convert COLON separated format