Greetings,

I have a csv file of 76 fields and about 4 million records. I know that
some of the records have errors - unmatched quotes, specifically.
Reading the file with readLines and parsing the lines with
read.csv(text = ...) is really slow. I know that the first 2459465
records are good. So I try this:

> startTime <- Sys.time()
> first_records <- read.csv(file_name, nrows = 2459465)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")
elapsed time =  24.12598

> startTime <- Sys.time()
> second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")

This appears to never finish. I have been waiting over 20 minutes.

So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
than (nrows = 2459465)?

Thanks!

-dave

PS: readLines(n = 2459470) takes 10.42731 seconds.
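PPS: By "reading with readLines and parsing with read.csv(text = ...)"
I mean, roughly, a shape like this - one read.csv call per record, so a
bad record can't derail the rest of the file (a sketch, not my exact
code):

lines <- readLines(file_name)
header <- lines[1]
# parse each record individually -- robust, but very slow at ~4 million rows
rows <- lapply(lines[-1], function(l)
  read.csv(text = paste(header, l, sep = "\n")))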
Hi Dave,

That's rather frustrating. I've found vroom (from the package vroom) to
be helpful with large files like this. Does the following give you any
better luck?

vroom(file_name, delim = ",", skip = 2459465, n_max = 5)

Of course, when you know you've got errors & the files are big like
that, it can take a bit of work resolving things. The command line
tools awk & sed might even be a good plan for finding lines that have
errors & figuring out a fix, but I certainly don't envy you.

All the best,
Stevie
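PS: If you'd rather stay in R than reach for awk, a rough equivalent
for spotting the bad lines might be something like this (assuming
quoted fields never legitimately span lines):

lines <- readLines(file_name)
# quotes per line = characters removed when all double quotes are stripped
quote_count <- nchar(lines) - nchar(gsub('"', "", lines, fixed = TRUE))
bad_lines <- which(quote_count %% 2L == 1L)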
No idea, but have you tried using ?scan to read those next 5 rows? It
might give you a better idea of the pathologies that are causing
problems. For example, an unmatched quote might result in some huge
number of characters trying to be read into a single element of a
character variable. As your previous respondent said, resolving such
problems can be a challenge.

Cheers,
Bert
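PS: Something like this, say - pull those lines in raw, with quote
processing switched off, and inspect them (skip/n values taken from
your message; just a sketch):

nxt <- scan(file_name, what = character(), sep = "\n", quote = "",
            skip = 2459465, n = 5, quiet = TRUE)
nchar(nxt)  # sane lengths? then eyeball nxt itself for the stray quote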
On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon <ddixon at swcp.com> wrote:

> > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)

It may or may not be important that read.csv defaults to header = TRUE.
Having skipped 2459465 lines, it may attempt to parse the next one as a
header, so the second read.csv() call should probably include
header = FALSE.

Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.

--
Best regards,
Ivan
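Something like this, say (reusing the column names read.csv already
found for the first chunk; an untested sketch):

second_records <- read.csv(file_name, skip = 2459465, nrows = 5,
                           header = FALSE,
                           col.names = names(first_records))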
Hello,

Can the following function be of help? After reading the data with
quote = "" (so quotes get no special treatment), call a function that
applies gregexpr to the character columns, then transform the output
into a two column data.frame with columns Col - the column processed -
and Unbalanced - the rows with unbalanced double quotes.

I am assuming the quotes are double quotes. It shouldn't be difficult
to adapt it to other cases: single quotes, or both.

unbalanced_dquotes <- function(x) {
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    # count the double quotes in each string; gregexpr returns -1 (not
    # an empty vector) when there is no match, hence sum(m > 0L)
    n_quotes <- gregexpr('"', y, fixed = TRUE) |>
      sapply(\(m) sum(m > 0L))
    Unbalanced <- which(n_quotes %% 2L == 1L)
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}

# read the data disregarding quoted strings
df1 <- read.csv(file_name, quote = "")

# determine which strings have unbalanced quotes, and where
unbalanced_dquotes(df1)

Hope this helps,

Rui Barradas
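P.S. A quick sanity check on a made-up frame (one stray quote in row 3
of column 1):

toy <- data.frame(a = c("fine", 'also "fine"', 'broken " field'),
                  b = 1:3)
unbalanced_dquotes(toy)
# -> Col = 1, Unbalanced = 3  (column 1, row 3)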