Rui Barradas
2012-May-16 13:11 UTC
[R] Problem to resolve a step for reading a large TXT and split it into several files
Hello,

Your bug is obvious: on each pass through the loop you read twice but write only once, so the file pointer keeps moving forward. Use something like

while (length(pv <- readLines(con, n = n)) > 0) {  # note that this line changed
    i <- i + 1
    write.table(pv, file = paste(fileNames.temp.1, "_", i, ".txt", sep = ""),
                sep = "\t")
}

(or put the line with read.table where you now have readLines).

Anyway, I don't like it very much. If you know the number of lines in the input file, it would be much better to use integer division and modulus to determine how many times and how much to read. Something like the following (runnable versions of both snippets are sketched after the quoted message):

n <- 1000000
passes    <- number.of.lines.in.file %/% n
remaining <- number.of.lines.in.file %% n

for (i in seq.int(passes)) {
    # ... read n lines at a time and process them ...
}
if (remaining) {
    n <- remaining
    # ... read what's left ...
}

If you do not know how many lines there are in the file, see (package::function):

parser::nlines
R.utils::countLines

Hope this helps,

Rui Barradas

On 16-05-2012 11:00, r-help-request at r-project.org wrote:
> Date: Tue, 15 May 2012 22:16:42 +0200
> From: gianni lavaredo <gianni.lavaredo at gmail.com>
> To: r-help at r-project.org
> Subject: [R] Problem to resolve a step for reading a large TXT and
>          split it into several files
>
> Dear Researchers,
>
> It's the first time I am trying to resolve this problem. I have a TXT
> file with 1408452 rows. I wish to split it into several files, each
> with 1,000,000 rows, with the following procedure:
>
> # split into two files: one with 1,000,000 rows and one with 408,452 rows
>
> file <- "09G001_72975_7575_25_4025.txt"
> fileNames <- strsplit(as.character(file), ".", fixed = TRUE)
> fileNames.temp.1 <- unique(as.vector(do.call("rbind", fileNames)[, 1]))
>
> con <- file(file, open = "r")
> # n is the number of rows
> n <- 1000000
> i <- 0
> while (length(readLines(con, n = n)) > 0) {
>     i <- i + 1
>     pv <- read.table(con, header = F, sep = "\t", nrow = n)
>     write.table(pv, file = paste(fileNames.temp.1, "_", i, ".txt", sep = ""),
>                 sep = "\t")
> }
> close(con)
>
> When I use 1,000,000, I have in the directory only
> "09G001_72975_7575_25_4025_1.txt" (with 1,000,000 rows) and not
> "09G001_72975_7575_25_4025_2.txt" (with 408,452). I didn't understand
> where my bug is.
>
> Furthermore, when I wish to split into 3 files (where n is 469484 =
> 1408452/3), I get this message:
>
> Error in read.table(con, header = F, sep = "\t", nrow = n) :
>   no lines available in input
>
> Thanks for all the help, and sorry for the disturbance.
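For reference, here is the corrected loop as a complete, self-contained script. This is only a sketch: it assumes the file name from the original post, and it uses writeLines in place of write.table, since write.table would quote each line and prepend row names when handed the plain character vector that readLines returns.

# Corrected loop: readLines both tests for remaining input and captures
# the chunk, so each pass reads exactly once and writes exactly once.
# writeLines (not write.table, which the original code used) copies the
# lines out verbatim.
file <- "09G001_72975_7575_25_4025.txt"   # input file from the original post
base <- sub("\\.txt$", "", file)          # output file name stem

n   <- 1000000                            # lines per output file
i   <- 0
con <- file(file, open = "r")
while (length(pv <- readLines(con, n = n)) > 0) {
    i <- i + 1
    writeLines(pv, paste(base, "_", i, ".txt", sep = ""))
}
close(con)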
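And a runnable version of the division/modulus sketch, assuming the R.utils package is installed (it provides countLines, one of the two functions mentioned above):

library(R.utils)                          # assumed available; provides countLines()

file  <- "09G001_72975_7575_25_4025.txt"  # input file from the original post
base  <- sub("\\.txt$", "", file)
total <- as.integer(countLines(file))     # 1408452 in the original post

n         <- 1000000                      # lines per full chunk
passes    <- total %/% n                  # number of full chunks (here, 1)
remaining <- total %% n                   # leftover lines (here, 408452)

con <- file(file, open = "r")
for (i in seq_len(passes)) {              # seq_len() is safe even if passes == 0
    writeLines(readLines(con, n = n), paste(base, "_", i, ".txt", sep = ""))
}
if (remaining) {
    writeLines(readLines(con, n = remaining),
               paste(base, "_", passes + 1, ".txt", sep = ""))
}
close(con)

With n <- 469484 this produces exactly three files and no "no lines available in input" error: that error came from read.table being called after the readLines in the loop condition had already consumed the rest of the connection.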