On 03/19/2018 02:23 PM, Detlef Steuer wrote:
> Dear friends,
>
> I stumbled into behaviour of read.delim which I would consider a bug
> or at least an inconsistency that should be improved upon.
>
> Recently we had to work with data that used "", two double quotes, as
> symbol to start and end character input.
>
> Essentially the data looked like this
>
> data.csv
> =======
> V1, V2, V3
> ""data"", 3, """"
>
> The last sequence of """" indicating a missing value.
After processing the quotes, this is internally parsed as
data 3 "
which I think is correct; in particular, """" represents a single
double-quote character. This conforms to RFC 4180. In contrast, ""
represents an empty string.
Based on my reading of RFC 4180, ""data"" is not a valid field, but not
every CSV file follows that RFC, and R parses this pattern the way your
data expects. So you should be fine here.
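
The quote handling described above can be checked without a file. This is
only a sketch: parse_field is a hypothetical helper name, and the sentinel
fields x and y are just there so that each test field sits in the middle
of a line.

```r
# Raw CSV fields exactly as they would appear in a file.
fields <- c('""""', '""', '""data""', '"a ""b"" c"')

# Hypothetical helper: wrap the field between two sentinel fields and let
# scan() apply the CSV-style quote handling it uses when sep is non-default.
parse_field <- function(f) {
  scan(text = paste0("x,", f, ",y"), what = "", sep = ",",
       quote = "\"", quiet = TRUE)[2]
}

parsed <- vapply(fields, parse_field, character(1), USE.NAMES = FALSE)
print(parsed)
```

The first element should come back as a one-character string holding a
double quote, the second as an empty string, and the third as data.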
> One obvious solution to read in this data is using some gsub(),
> but that's not the point I want to make.
>
> Consider this case we found during tests:
>
> test.csv
> =======
> V1, V2, V3, V4
> """", """", 3, ""
>
> and read it with
>> read.delim("test.csv", sep=",", header=TRUE, na.strings="\"")
After processing the quotes, this is internally parsed as
" " 3 <empty_string>
which, I think, is again correct (and conforms to RFC 4180).
> you get the following
>
>   V1 V2 V3 V4
> 1 NA  "  3 NA
>
> (and a warning)
I do not get the warning on my system. The reason the second " is not
translated to NA by na.strings is the white space after the comma in the
CSV file. With strip.white=TRUE the result is more consistent:
> read.delim("test.csv", sep=",", header=TRUE, na.strings="\"", strip.white=TRUE)
  V1 V2 V3 V4
1 NA NA  3 NA
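
To see where the difference comes from, one can look at the fields before
na.strings is applied. This is a sketch using scan() directly on the data
line from test.csv; the variable names are only for illustration.

```r
# The data line from test.csv, supplied inline.
line <- '"""", """", 3, ""'

# Without strip.white, the blank after each comma stays part of the field.
raw <- scan(text = line, what = "", sep = ",", quiet = TRUE)

# With strip.white, leading white space is dropped before the quotes are
# processed.
stripped <- scan(text = line, what = "", sep = ",", quiet = TRUE,
                 strip.white = TRUE)

print(raw[1:2])       # first two fields, blank kept in the second
print(stripped[1:2])  # first two fields, blank removed

# Only fields that are exactly one double quote can match na.strings="\"".
print(raw[1:2] == "\"")
print(stripped[1:2] == "\"")
```

The second raw field is a blank followed by a quote, so it does not match
na.strings, while after stripping both fields are identical.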
If one needed to differentiate between " and <empty_string>, then it
might be necessary to run without the na.strings argument.
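
A minimal sketch of that, with the file contents supplied inline via
text=. Note that colClasses="character" is an assumption I am adding here:
without it, type conversion would still turn a column of empty fields
into NA.

```r
# Same contents as test.csv, supplied inline so the example is self-contained.
csv <- 'V1, V2, V3, V4\n"""", """", 3, ""\n'

# No na.strings override. colClasses = "character" (an added assumption)
# keeps every column as text, so an empty field stays "" rather than NA.
d <- read.delim(text = csv, sep = ",", header = TRUE,
                strip.white = TRUE, colClasses = "character")

print(d$V1 == "\"")  # field written as """" in the file
print(d$V4 == "")    # field written as "" in the file
```

With the columns kept as character, the two cases can then be recoded to
NA (or not) separately after reading.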
Best
Tomas
> I would have assumed to get some error message or at
> least the same result for both appearances of """" in the
> input file.
> (the setting na.strings="\"" turned out to be working for
> a colleague and his specific data, while I think it shouldn't)
>
> My main concern is the different interpretation for the two """"
> sequences.
>
> Real bug? Minor inconsistency? I don't know.
>
> All the best
> Detlef
>
>