Greetings,

I have a csv file of 76 fields and about 4 million records. I know that
some of the records have errors - unmatched quotes, specifically.
Reading the file with readLines and parsing the lines with
read.csv(text = ...) is really slow. I know that the first 2459465
records are good. So I try this:

> startTime <- Sys.time()
> first_records <- read.csv(file_name, nrows = 2459465)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")
elapsed time =  24.12598

> startTime <- Sys.time()
> second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")

This appears to never finish. I have been waiting over 20 minutes.

So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
than (nrows = 2459465)?

Thanks!

-dave

PS: readLines(n = 2459470) takes 10.42731 seconds.
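PPS: By "reading with readLines and parsing with read.csv(text = ...)"
I mean, roughly, a shape like this - one read.csv call per record, so a
bad record can't derail the rest of the file (a sketch, not my exact
code):

lines <- readLines(file_name)
header <- lines[1]
# parse each record individually -- robust, but very slow at ~4 million rows
rows <- lapply(lines[-1], function(l)
  read.csv(text = paste(header, l, sep = "\n")))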
Hi Dave,

That's rather frustrating. I've found vroom (from the package vroom) to
be helpful with large files like this. Does the following give you any
better luck?

vroom(file_name, delim = ",", skip = 2459465, n_max = 5)

Of course, when you know you've got errors & the files are big like
that, it can take a bit of work resolving things. The command line
tools awk & sed might even be a good plan for finding lines that have
errors & figuring out a fix, but I certainly don't envy you.

All the best,
Stevie
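PS: If you'd rather stay in R than reach for awk, a rough equivalent
for spotting the bad lines might be something like this (assuming
quoted fields never legitimately span lines):

lines <- readLines(file_name)
# quotes per line = characters removed when all double quotes are stripped
quote_count <- nchar(lines) - nchar(gsub('"', "", lines, fixed = TRUE))
bad_lines <- which(quote_count %% 2L == 1L)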
No idea, but have you tried using ?scan to read those next 5 rows? It
might give you a better idea of the pathologies that are causing
problems. For example, an unmatched quote might result in some huge
number of characters trying to be read into a single element of a
character variable. As your previous respondent said, resolving such
problems can be a challenge.

Cheers,
Bert
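PS: Something like this, say - pull those lines in raw, with quote
processing switched off, and inspect them (skip/n values taken from
your message; just a sketch):

nxt <- scan(file_name, what = character(), sep = "\n", quote = "",
            skip = 2459465, n = 5, quiet = TRUE)
nchar(nxt)  # sane lengths? then eyeball nxt itself for the stray quote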
On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon <ddixon at swcp.com> wrote:

> > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)

It may or may not be important that read.csv defaults to header = TRUE.
Having skipped 2459465 lines, it may attempt to parse the next one as a
header, so the second read.csv() call should probably include
header = FALSE.

Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.

--
Best regards,
Ivan
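Something like this, say (reusing the column names read.csv already
found for the first chunk; an untested sketch):

second_records <- read.csv(file_name, skip = 2459465, nrows = 5,
                           header = FALSE,
                           col.names = names(first_records))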
Hello,

Can the following function be of help? After reading the data with
quote = "" (so quotes get no special treatment), call a function that
applies gregexpr to the character columns, then transform the output
into a two column data.frame with columns Col - the column processed -
and Unbalanced - the rows with unbalanced double quotes.

I am assuming the quotes are double quotes. It shouldn't be difficult
to adapt it to other cases: single quotes, or both.

unbalanced_dquotes <- function(x) {
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # count the double quotes in each string; gregexpr returns -1 (not
    # an empty vector) when there is no match, hence sum(m > 0L)
    n_quotes <- gregexpr('"', y, fixed = TRUE) |>
      sapply(\(m) sum(m > 0L))
    Unbalanced <- which(n_quotes %% 2L == 1L)
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}

# read the data disregarding quoted strings
df1 <- read.csv(file_name, quote = "")

# determine which strings have unbalanced quotes, and where
unbalanced_dquotes(df1)

Hope this helps,

Rui Barradas
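P.S. A quick sanity check on a made-up frame (one stray quote in row 3
of column 1):

toy <- data.frame(a = c("fine", 'also "fine"', 'broken " field'),
                  b = 1:3)
unbalanced_dquotes(toy)
# -> Col = 1, Unbalanced = 3  (column 1, row 3)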