Dear all:

I have a big data file of 60000 columns and 60000 rows like that:

AA AC AA AA .......AT
CC CC CT CT.......TC
..........................
.........................

I want to transpose it, and the output is a new file like that:

AA CC ............
AC CC............
AA CT.............
AA CT.........
....................
....................
AT TC.............

The key point is that I can't read it into R with read.table() because the
data is too large, so I tried this:

c<-file("silygenotype.txt","r")
geno_t<-list()
repeat{
  line<-readLines(c,n=1)
  if (length(line)==0) break  #end of file
  line<-unlist(strsplit(line,"\t"))
  geno_t<-cbind(geno_t,line)
}
write.table(geno_t,"xxx.txt")

It works, but it is too slow. How can I optimize it?

Thank you

Yao He

Master candidate, 2nd year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology
China Agriculture University, Beijing, 100193
E-mail: yao.h.1988 at gmail.com
On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:

> Dear all:
>
> I have a big data file of 60000 columns and 60000 rows like that:
>
> AA AC AA AA .......AT
> CC CC CT CT.......TC
> ..........................
> .........................
>
> I want to transpose it, and the output is a new file like that:
>
> AA CC ............
> AC CC............
> AA CT.............
> AA CT.........
> ....................
> ....................
> AT TC.............
>
> The key point is that I can't read it into R with read.table() because the
> data is too large, so I tried this:
>
> c<-file("silygenotype.txt","r")
> geno_t<-list()
> repeat{
>   line<-readLines(c,n=1)
>   if (length(line)==0) break  #end of file
>   line<-unlist(strsplit(line,"\t"))
>   geno_t<-cbind(geno_t,line)
> }
> write.table(geno_t,"xxx.txt")
>
> It works, but it is too slow. How can I optimize it?

I hate to be negative, but this will also not work on a 60000 x 60000
matrix. At some point R will complain either about the lack of memory or
about you trying to allocate a vector that is too long.

I think your best bet is to look at file-backed data packages (for
example, the package bigmemory). Look at this URL:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

and scroll down to "Large memory and out-of-memory data". Some of the
packages may have the functionality you are looking for and may do it
faster than your code.

If this doesn't help, you _may_ be able to make your code work, albeit
slowly, if you replace the cbind() by data.frame(). cbind() will in this
case produce a matrix, and matrices are limited to 2^31 elements, which is
less than 60000 times 60000. A data.frame is a special type of list and so
_may_ be able to handle that many elements, given enough system RAM. There
are experts on this list who will correct me if I'm wrong.

If you are on a Linux system, you can use split (type man split at the
shell prompt to see the help) to split the file into smaller chunks of,
say, 5000 lines or so. Process each chunk separately, write it into a
separate output file, then use the Linux utility paste to "paste" the
files side by side into the final output.

Further, if you want to make it faster, do not grow geno_t by cbind'ing a
new column to it in each iteration. Pre-allocate a matrix or data frame
with the appropriate number of rows and columns and fill it in as you go.
But it will still be slow, which I think is due to the inherent slowness
of readLines and possibly strsplit.

HTH,

Peter
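Below is a rough sketch of the chunked, pre-allocated approach described
above; it is a sketch only, not tested on data of this size. The splitting
is done inside R with readLines() rather than with the shell split utility,
the input file name is taken from the original post, the chunk file names
are made up for illustration, and the chunk size may need to be reduced if
memory is tight.

## transpose the file chunk by chunk; each chunk of input lines becomes a
## block of columns in the output
con <- file("silygenotype.txt", "r")
chunk_size <- 5000   # lines per chunk; reduce this if memory is tight
chunk_id <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break                    # end of file
  chunk_id <- chunk_id + 1

  ## split every line into its fields; each input line becomes one column
  fields <- strsplit(lines, "\t", fixed = TRUE)

  ## pre-allocate a character matrix (one row per original column, one
  ## column per line in this chunk) instead of growing it with cbind()
  out <- matrix(NA_character_, nrow = length(fields[[1]]),
                ncol = length(fields))
  for (j in seq_along(fields)) out[, j] <- fields[[j]]

  write.table(out, sprintf("chunk_%03d.txt", chunk_id), sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}
close(con)

## on Linux the chunk files can then be combined column-wise with, e.g.:
##   paste chunk_*.txt > transposed.txt

Each chunk file holds 60000 rows and up to 5000 columns of the transposed
table; the zero-padded file names keep the chunks in the right order when
the shell glob expands them for paste.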
On Mar 7, 2013, at 01:18, Yao He wrote:

> Dear all:
>
> I have a big data file of 60000 columns and 60000 rows like that:
>
> AA AC AA AA .......AT
> CC CC CT CT.......TC
> ..........................
> .........................
>
> I want to transpose it, and the output is a new file like that:
>
> AA CC ............
> AC CC............
> AA CT.............
> AA CT.........
> ....................
> ....................
> AT TC.............
>
> The key point is that I can't read it into R with read.table() because the
> data is too large, so I tried this:
>
> c<-file("silygenotype.txt","r")
> geno_t<-list()
> repeat{
>   line<-readLines(c,n=1)
>   if (length(line)==0) break  #end of file
>   line<-unlist(strsplit(line,"\t"))
>   geno_t<-cbind(geno_t,line)
> }
> write.table(geno_t,"xxx.txt")
>
> It works, but it is too slow. How can I optimize it?

As others have pointed out, that's a lot of data! You seem to have the
right idea: if you read the columns line by line, there is nothing to
transpose. A couple of points, though:

- The cbind() is a potential performance hit since it copies the list
  every time around. Use

  geno_t <- vector("list", 60000)

  and then

  geno_t[[i]] <- <etc>

- You might use scan() instead of readLines()/strsplit().

- Perhaps consider the data type, as you seem to be reading strings with
  16 possible values. (I suspect that R already optimizes string storage
  to make this point moot, though.)

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45) 38153501
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
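Below is a minimal sketch combining these suggestions -- a pre-allocated
list filled with scan() -- assuming the same file names as in the original
post. It is untested at this scale, and the final conversion to a data
frame for writing still needs the whole table in memory.

con <- file("silygenotype.txt", "r")
geno_t <- vector("list", 60000)   # pre-allocate; no copying as the list grows

i <- 0
repeat {
  ## scan() reads one line's tab-separated fields directly as a character
  ## vector, so no separate strsplit() step is needed
  fields <- scan(con, what = character(), nlines = 1, sep = "\t", quiet = TRUE)
  if (length(fields) == 0) break  # end of file
  i <- i + 1
  geno_t[[i]] <- fields
}
close(con)

## each list element is one input line, i.e. one column of the transposed
## output; naming the elements lets them convert cleanly to a data frame
geno_t <- geno_t[seq_len(i)]
names(geno_t) <- paste0("V", seq_len(i))
write.table(as.data.frame(geno_t, stringsAsFactors = FALSE), "xxx.txt",
            quote = FALSE, row.names = FALSE, col.names = FALSE)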