Hi.

I have a matrix stored in a large, tab-delimited flat file. The first row contains column names. Because the matrix is symmetric, the file is in lower-triangular format: the second row contains one number, the third row two numbers, and so on. In general, row k+1 contains k numbers; the matrix has 3000 rows, so the file has 3001 rows. The file has variable-length records, so each row ends with its last piece of data. I read in the file and produced the full symmetric matrix as follows:

> mana01 <- scan( file = "C:/mat.dat", sep = "\t", nlines = 1, what = "character" )
Read 3000 items
> nco <- length( mana01 )
> malt <- matrix( 0, nrow = nco, ncol = nco )
> colnames( malt ) <- mana01
> rownames( malt ) <- mana01
> for ( i in 1:3000 ) { malt[ i, (1:i) ] <- scan( file = "C:/mat.dat", skip = i, n = i, quiet = TRUE ) }
> mat <- malt + t( malt ) - diag( diag( malt ) )

The for loop took a couple of hours to complete. I suspect there's a much faster way to do this. Any suggestions? Thanks!

-- TMK --
212-460-5430 home
917-656-5351 cell
After you read in the first line, read the rest of the file with a single scan:

rest <- scan(..., sep = "\t", what = 0, skip = 1)
index <- 1  # used to march through 'rest'
for (i in 1:3000) {
    for (j in 1:i) {
        malt[i, j] <- rest[index]
        index <- index + 1
    }
}

There are probably faster ways, but this should go quicker since most of your previous time was spent in the reading.

On Jan 2, 2008 6:05 PM, Talbot Katz <topkatz at msn.com> wrote:
> ...
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
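[A sketch along the same lines, not from the thread: the single-scan read can also be combined with a single indexed assignment instead of the double loop, because the file's row-major lower triangle arrives in exactly the column-major order of the matrix's upper triangle. This minimal, self-contained version uses a hypothetical 3x3 example written to a temporary file in place of C:/mat.dat.]

```r
## Build a tiny example file in the layout the question describes:
## a header row of names, then row k+1 holding the k entries of matrix row k.
tmp <- tempfile()
writeLines(c("a\tb\tc",
             "1",
             "2\t4",
             "3\t5\t6"), tmp)

nms  <- scan(tmp, sep = "\t", nlines = 1, what = "character", quiet = TRUE)
nco  <- length(nms)
vals <- scan(tmp, sep = "\t", skip = 1, quiet = TRUE)  # one pass over all data

malt <- matrix(0, nco, nco, dimnames = list(nms, nms))
## The lower triangle read row by row is the upper triangle in R's
## column-major order, so one assignment replaces the nested loops;
## 'malt' then holds the transpose, which symmetrization absorbs.
malt[upper.tri(malt, diag = TRUE)] <- vals
mat <- malt + t(malt) - diag(diag(malt))
```

Since the matrix is symmetric, filling the transpose costs nothing: the final mat is the same either way.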
Charilaos Skiadas
2008-Jan-03 01:42 UTC
[R] Seeking a more efficient way to read in a file
On Jan 2, 2008, at 6:05 PM, Talbot Katz wrote:
> ...

I saw Jim's reply just after having written a solution, so here is my take on it. The key thing, as Jim mentioned, is to not use scan each time, but to read the whole thing in and then process it. I read the lines, used strsplit to get a list of the individual lines, and then used sapply after extending each row by the right number of zeros. Not sure which of the two is faster.

nms <- scan("~/Desktop/testing.txt", sep = "\t", nlines = 1, what = character(0))
# read the data as a vector of lines
x <- scan("~/Desktop/testing.txt", sep = "\n", skip = 1, what = character(0))
splt <- strsplit(x, "\t")  # split at the tabs
nr <- length(nms)
# extend each row by the right number of zeros
splt <- sapply(splt, function(x) c(as.numeric(x), rep(0, nr - length(x))))

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
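[A note on finishing this approach, not from the thread: sapply binds each padded row as a *column*, so the result already holds the upper triangle (the transpose of the file's lower triangle), and only the symmetrization step remains. A self-contained sketch using a hypothetical 3x3 file in place of ~/Desktop/testing.txt:]

```r
## Same layout as in the question: header row, then growing data rows.
tmp <- tempfile()
writeLines(c("a\tb\tc",
             "1",
             "2\t4",
             "3\t5\t6"), tmp)

nms  <- scan(tmp, sep = "\t", nlines = 1, what = character(0), quiet = TRUE)
x    <- scan(tmp, sep = "\n", skip = 1, what = character(0), quiet = TRUE)
splt <- strsplit(x, "\t")          # one character vector per data row
nr   <- length(nms)
## sapply() returns the padded rows as columns, so 'up' is upper triangular.
up   <- sapply(splt, function(r) c(as.numeric(r), rep(0, nr - length(r))))
dimnames(up) <- list(nms, nms)
mat  <- up + t(up) - diag(diag(up))  # symmetrize, as in the original post
```

The final line mirrors the questioner's own symmetrization, just applied to the transpose, which for a symmetric matrix gives the same result.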