At 06:47 on 08/04/2024, Dave Dixon wrote:
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with
> read.csv(text = ...) is really slow. I know that the first 2459465
> records are good. So I try this:
>
> > startTime <- Sys.time()
> > first_records <- read.csv(file_name, nrows = 2459465)
> > endTime <- Sys.time()
> > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =  24.12598
>
> > startTime <- Sys.time()
> > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> > endTime <- Sys.time()
> > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465)?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.

Hello,

Can the following function be of help?

After reading the data with quote = "", call a function that applies
gregexpr to the character columns and turns the output into a two-column
data.frame with columns:

 Col - the column processed;
 Unbalanced - the rows with unbalanced double quotes.

I am assuming the quotes are double quotes. It shouldn't be difficult to
adapt it to other cases - single quotes, or both.


unbalanced_dquotes <- function(x) {
  char_cols <- sapply(x, is.character) |> which()
  lapply(char_cols, \(i) {
    y <- x[[i]]
    Unbalanced <- gregexpr('"', y) |>
      # count the quotes in each string (zero when there is no match)
      sapply(\(m) if (m[1L] == -1L) 0L else length(m)) |>
      {\(n) (n %% 2L) == 1L}() |>
      which()
    data.frame(Col = i, Unbalanced = Unbalanced)
  }) |>
    do.call(rbind, args = _)
}

# read the data disregarding quoted strings
df1 <- read.csv(fl, quote = "")
# determine which strings have unbalanced quotes and where
unbalanced_dquotes(df1)


Hope this helps,

Rui Barradas
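As an aside on the "adapt it to other cases" remark: a hypothetical variant,
not part of the original post, that takes the quote character as an argument
could look like this.

unbalanced_quotes <- function(x, quote = '"') {
  char_cols <- which(sapply(x, is.character))
  lapply(char_cols, \(i) {
    # count the chosen quote character in each string of column i
    n_quotes <- gregexpr(quote, x[[i]], fixed = TRUE) |>
      sapply(\(m) if (m[1L] == -1L) 0L else length(m))
    data.frame(Col = i, Unbalanced = which(n_quotes %% 2L == 1L))
  }) |>
    do.call(rbind, args = _)
}

# e.g. single quotes instead of double quotes
unbalanced_quotes(df1, quote = "'")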
@vi@e@gross m@iii@g oii gm@ii@com
2024-Apr-10 15:11 UTC
[R] Exceptional slowness with read.csv
It sounds like the discussion is now on how to clean your data, with a twist:
you want to clean it before you can properly read it in using standard
methods. Some of those standard methods already do quite a bit as they parse
the data, such as looking ahead to determine the data type for a column.

The specific problem being discussed seems to be a lack of balance in
individual lines of the CSV file - double quotes that are not closed - which
then messes up that row and the following rows for a while. I am not clear on
what the quotes mean to the user, but wonder if they can simply not be viewed
as quotes. Functions like read.csv() or the tidyverse variant read_csv()
allow you to specify the quote character or to disable it. So what would
happen to the damaged line/row in your case, or to any row with both quotes
intact, if you tried reading the file with an argument that disables
processing of quoted regions? It may cause problems, but in your case, maybe
it won't. If so, after reading in the file, you can march through it and make
fixes, such as discussed.

The other alternative seems to be to read the lines in the old-fashioned way,
do some surgery on whole lines rather than on individual row/column entries,
and then either feed the (large amount of) data to read.csv() as text = ...,
or write it out to another file and read that in again.

And, of course, if there is just one bad line, you might simply open the file
in a program such as Excel, or anything else that lets you edit it, and fix
it once by hand.
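For concreteness, a minimal sketch of the first suggestion - reading with
quote processing disabled and patching afterwards. file_name is taken from
the original post; the clean-up rule (simply stripping the stray double
quotes from the affected columns) is an assumption, not something proposed in
the thread.

# read everything with quote processing disabled
df1 <- read.csv(file_name, quote = "")

# then march through the character columns and make fixes -
# here, as an example, just dropping the stray double quotes
char_cols <- which(sapply(df1, is.character))
for (j in char_cols) {
  df1[[j]] <- gsub('"', '', df1[[j]], fixed = TRUE)
}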
That's basically what I did:

1. Get the text lines using readLines.
2. Use tryCatch to parse each line using read.csv(text = ...).
3. In the catch, use gregexpr to find any quotes not adjacent to a comma
   (gregexpr("[^,]\"[^,]", ...)).
4. Escape any quotes found by adding a second quote (using str_sub from
   stringr).
5. Parse the patched text using read.csv(text = ...).
6. Write out the parsed fields as I go along using
   write.table(..., append = TRUE), so I'm not keeping too much in memory.

I went directly to tryCatch because there were 3.5 million records and I only
expected a few to have errors. I found only 6 bad records, but it had to be
done to make the data file usable with read.csv(), for the benefit of other
researchers using these data.
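A rough sketch of that pipeline, under stated assumptions: the file names and
the quote-doubling helper are made up here (Dave used str_sub from stringr
for the patching), and warnings are caught as well as errors, since an
unmatched quote may surface only as a warning from read.csv.

in_file  <- "data.csv"        # assumed input path
out_file <- "data_fixed.csv"  # assumed output path

all_lines <- readLines(in_file)
header    <- all_lines[1]
body      <- all_lines[-1]

# double any quote that is not adjacent to a comma, so read.csv treats it
# as an escaped embedded quote instead of a field delimiter
fix_quotes <- function(line) {
  hits <- gregexpr('[^,]"[^,]', line)[[1]]
  if (hits[1L] == -1L) return(line)
  pos <- hits + 1L                      # positions of the offending quotes
  for (p in rev(pos)) {                 # right to left keeps positions valid
    line <- paste0(substr(line, 1L, p), '"',
                   substr(line, p + 1L, nchar(line)))
  }
  line
}

parse_line <- function(line) {
  read.csv(text = paste(header, line, sep = "\n"))
}

first <- TRUE
for (line in body) {
  row <- tryCatch(
    parse_line(line),
    # an unbalanced quote may show up as a warning or an error;
    # either way, patch the line and parse it again
    warning = function(w) parse_line(fix_quotes(line)),
    error   = function(e) parse_line(fix_quotes(line))
  )
  write.table(row, out_file, sep = ",", append = !first,
              col.names = first, row.names = FALSE)
  first <- FALSE
}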