Dear all:

I have a big data file of 60000 columns and 60000 rows like that:

AA AC AA AA .......AT
CC CC CT CT.......TC
..........................
.........................

I want to transpose it, and the output is a new file like that:

AA CC ............
AC CC............
AA CT.............
AA CT.........
....................
....................
AT TC.............

The key point is that I can't read it into R with read.table() because the
data is too large, so I tried this:

c<-file("silygenotype.txt","r")
geno_t<-list()
repeat{
  line<-readLines(c,n=1)
  if (length(line)==0) break  #end of file
  line<-unlist(strsplit(line,"\t"))
  geno_t<-cbind(geno_t,line)
}
write.table(geno_t,"xxx.txt")

It works, but it is too slow. How can I optimize it?

Thank you

Yao He

Master candidate, 2nd year
Department of Animal Genetics & Breeding
Room 436, College of Animal Science & Technology
China Agriculture University, Beijing, 100193
E-mail: yao.h.1988 at gmail.com
On Wed, Mar 6, 2013 at 4:18 PM, Yao He <yao.h.1988 at gmail.com> wrote:

> Dear all:
>
> I have a big data file of 60000 columns and 60000 rows like that:
>
> AA AC AA AA .......AT
> CC CC CT CT.......TC
> ..........................
> .........................
>
> I want to transpose it, and the output is a new file like that:
>
> AA CC ............
> AC CC............
> AA CT.............
> AA CT.........
> ....................
> ....................
> AT TC.............
>
> The key point is that I can't read it into R with read.table() because the
> data is too large, so I tried this:
>
> c<-file("silygenotype.txt","r")
> geno_t<-list()
> repeat{
>   line<-readLines(c,n=1)
>   if (length(line)==0) break  #end of file
>   line<-unlist(strsplit(line,"\t"))
>   geno_t<-cbind(geno_t,line)
> }
> write.table(geno_t,"xxx.txt")
>
> It works, but it is too slow. How can I optimize it?

I hate to be negative, but this will also not work on a 60000 x 60000
matrix. At some point R will complain either about the lack of memory or
about you trying to allocate a vector that is too long.

I think your best bet is to look at file-backed data packages (for
example, the package bigmemory). Look at this URL:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

and scroll down to "Large memory and out-of-memory data". Some of the
packages may have the functionality you are looking for and may do it
faster than your code.

If this doesn't help, you _may_ be able to make your code work, albeit
slowly, if you replace the cbind() by data.frame(). cbind() will in this
case produce a matrix, and matrices are limited to 2^31 elements, which is
less than 60000 times 60000. A data.frame is a special type of list and so
_may_ be able to handle that many elements, given enough system RAM. There
are experts on this list who will correct me if I'm wrong.

If you are on a Linux system, you can use split (type man split at the
shell prompt to see the help) to split the file into smaller chunks of,
say, 5000 lines or so. Process each chunk separately, write it into a
separate output file, then use the Linux utility paste to "paste" the
files side by side into the final output.

Further, if you want to make it faster, do not grow geno_t by cbind'ing a
new column to it in each iteration. Pre-allocate a matrix or data frame
with the appropriate number of rows and columns and fill it in as you go.
But it will still be slow, which I think is due to the inherent slowness
of readLines and possibly strsplit.

HTH,

Peter
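Below is a rough sketch of the chunked, pre-allocated approach described
above; it is a sketch only, not tested on data of this size. The splitting
is done inside R with readLines() rather than with the shell split utility,
the input file name is taken from the original post, the chunk file names
are made up for illustration, and the chunk size may need to be reduced if
memory is tight.

## transpose the file chunk by chunk; each chunk of input lines becomes a
## block of columns in the output
con <- file("silygenotype.txt", "r")
chunk_size <- 5000   # lines per chunk; reduce this if memory is tight
chunk_id <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break                    # end of file
  chunk_id <- chunk_id + 1

  ## split every line into its fields; each input line becomes one column
  fields <- strsplit(lines, "\t", fixed = TRUE)

  ## pre-allocate a character matrix (one row per original column, one
  ## column per line in this chunk) instead of growing it with cbind()
  out <- matrix(NA_character_, nrow = length(fields[[1]]),
                ncol = length(fields))
  for (j in seq_along(fields)) out[, j] <- fields[[j]]

  write.table(out, sprintf("chunk_%03d.txt", chunk_id), sep = "\t",
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}
close(con)

## on Linux the chunk files can then be combined column-wise with, e.g.:
##   paste chunk_*.txt > transposed.txt

Each chunk file holds 60000 rows and up to 5000 columns of the transposed
table; the zero-padded file names keep the chunks in the right order when
the shell glob expands them for paste.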
On Mar 7, 2013, at 01:18, Yao He wrote:

> Dear all:
>
> I have a big data file of 60000 columns and 60000 rows like that:
>
> AA AC AA AA .......AT
> CC CC CT CT.......TC
> ..........................
> .........................
>
> I want to transpose it, and the output is a new file like that:
>
> AA CC ............
> AC CC............
> AA CT.............
> AA CT.........
> ....................
> ....................
> AT TC.............
>
> The key point is that I can't read it into R with read.table() because the
> data is too large, so I tried this:
>
> c<-file("silygenotype.txt","r")
> geno_t<-list()
> repeat{
>   line<-readLines(c,n=1)
>   if (length(line)==0) break  #end of file
>   line<-unlist(strsplit(line,"\t"))
>   geno_t<-cbind(geno_t,line)
> }
> write.table(geno_t,"xxx.txt")
>
> It works, but it is too slow. How can I optimize it?

As others have pointed out, that's a lot of data! You seem to have the
right idea: if you read the columns line by line, there is nothing to
transpose. A couple of points, though:

- The cbind() is a potential performance hit since it copies the list
  every time around. Use

  geno_t <- vector("list", 60000)

  and then

  geno_t[[i]] <- <etc>

- You might use scan() instead of readLines()/strsplit().

- Perhaps consider the data type, as you seem to be reading strings with
  16 possible values. (I suspect that R already optimizes string storage
  to make this point moot, though.)

-- 
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45) 38153501
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
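Below is a minimal sketch combining these suggestions -- a pre-allocated
list filled with scan() -- assuming the same file names as in the original
post. It is untested at this scale, and the final conversion to a data
frame for writing still needs the whole table in memory.

con <- file("silygenotype.txt", "r")
geno_t <- vector("list", 60000)   # pre-allocate; no copying as the list grows

i <- 0
repeat {
  ## scan() reads one line's tab-separated fields directly as a character
  ## vector, so no separate strsplit() step is needed
  fields <- scan(con, what = character(), nlines = 1, sep = "\t", quiet = TRUE)
  if (length(fields) == 0) break  # end of file
  i <- i + 1
  geno_t[[i]] <- fields
}
close(con)

## each list element is one input line, i.e. one column of the transposed
## output; naming the elements lets them convert cleanly to a data frame
geno_t <- geno_t[seq_len(i)]
names(geno_t) <- paste0("V", seq_len(i))
write.table(as.data.frame(geno_t, stringsAsFactors = FALSE), "xxx.txt",
            quote = FALSE, row.names = FALSE, col.names = FALSE)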