thr3ads.net - R devel - [Rd] read.csv [Apr 2024]

jing hua zhao

2024-Apr-16 10:46 UTC

[Rd] read.csv

Dear R-developers,

I came across a somewhat unexpected behaviour of read.csv() which is trivial but
worthwhile to note -- my data involves a protein named "1433E" but to
save space I dropped the quotes so it becomes,

Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73

Both read.csv() and readr::read_csv() treat the prot(ein) name as numeric 1433
(possibly confused by scientific notation), which I only noticed when I tried
to combine data,

all_data <- data.frame()
for (protein in proteins[1:7])
{
   cat(protein,":\n")
   f <- paste0(protein,".csv")
   if(file.exists(f))
   {
     p <- read.csv(f)
     print(p)
     if(nrow(p)>0) all_data  <- dplyr::bind_rows(all_data,p)
   }
}

proteins[1:7]
[1] "1433B" "1433E" "1433F" "1433G" "1433S" "1433T" "1433Z"

dplyr::bind_rows() failed due to incompatible types; nevertheless, rbind()
went ahead without warnings.
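
For illustration, the silent coercion by rbind() can be reproduced in base R with a made-up pair of one-row data frames. The second row and its log10p value are invented for this sketch and are not taken from the data above.

```r
# First file: "1433E" was guessed as numeric, so prot holds the number 1433.
a <- data.frame(prot = 1433, log10p = 7.35)
# Second file (hypothetical values): "1433B" cannot be a number, so prot is character.
b <- data.frame(prot = "1433B", log10p = 6.1)

# rbind() combines the two without any warning and coerces prot to character,
# so the first protein name is now "1433" rather than "1433E".
combined <- rbind(a, b)
str(combined)
```

The type mismatch only shows up if the column class is inspected, or if a stricter function such as dplyr::bind_rows() is used.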

Best wishes,


Jing Hua

Dirk Eddelbuettel

2024-Apr-16 10:52 UTC

You may want to consider aiding read.csv() (and alternative reading
functions) by supplying column-type info instead of relying on educated
heuristic guesses, which appear to fail here due to the nature of your data.
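
A minimal sketch of supplying column types, assuming the four-column layout from the original message; the data is passed inline through the text argument only so that the example is self-contained.

```r
# Sample data from the original message, supplied inline instead of from a file.
csv_text <- "Gene,SNP,prot,log10p
YWHAE,13:62129097_C_T,1433E,7.35
YWHAE,4:72617557_T_TA,1433E,7.73"

# colClasses names the type of each column, so no guessing takes place.
p <- read.csv(text = csv_text,
              colClasses = c(Gene = "character", SNP = "character",
                             prot = "character", log10p = "numeric"))
str(p)
```

With a real file, the same colClasses argument can be passed to read.csv(f, ...). readr::read_csv() has a col_types argument for the same purpose.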

Other storage formats can store type info. That is generally safer and may be
an option too.
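
As one example of a format that stores type info, base R's own RDS serialization keeps column classes intact. This is only a sketch using a temporary file; which format suits the actual workflow is an open choice.

```r
# One row modelled on the data in the original message.
p <- data.frame(Gene = "YWHAE", SNP = "13:62129097_C_T",
                prot = "1433E", log10p = 7.35)

# Write to and read back from a temporary .rds file.
f <- tempfile(fileext = ".rds")
saveRDS(p, f)
q <- readRDS(f)

# The protein name is still a character string after the round trip.
class(q$prot)
```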

I think this was more of an email for r-help than r-devel.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org

Rui Barradas

2024-Apr-16 11:36 UTC

Hello,

I wrote a file with that content and read it back with


read.csv("filename.csv", as.is = TRUE)


There were no problems, it all worked as expected.

Hope this helps,

Rui Barradas





peter dalgaard

2024-Apr-16 12:03 UTC


Hum...

This boils down to
> as.numeric("1.23e")
[1] 1.23
> as.numeric("1.23e-")
[1] 1.23
> as.numeric("1.23e+")
[1] 1.23

which in turn comes from this code in src/main/util.c (function R_strtod)

    if (*p == 'e' || *p == 'E') {
        int expsign = 1;
        switch(*++p) {
        case '-': expsign = -1;
        case '+': p++;
        default: ;
        }
        for (n = 0; *p >= '0' && *p <= '9'; p++)
            n = (n < MAX_EXPONENT_PREFIX) ? n * 10 + (*p - '0') : n;
        expn += expsign * n;
    }

which sets the exponent to zero even if the for loop terminates immediately.  

This might qualify as a bug, as it differs from the C function strtod which
accepts

"A sequence of digits, optionally containing a decimal-point character (.),
optionally followed by an exponent part (an e or E character followed by an
optional sign and a sequence of digits)."

[Of course, there would be nothing to stop e.g. "1433E1" from being
converted to numeric.]
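
As a rough sketch of the stricter rule quoted above, a regular expression can require at least one digit after the exponent marker. The pattern and the helper name looks_numeric are assumptions made for this illustration. They are not part of R, and they ignore hexadecimal, Inf, NaN and NA inputs.

```r
# Decimal number with an optional exponent part; the exponent, if present,
# must contain at least one digit (as in the C strtod description).
strict_numeric <- "^[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?$"

# Hypothetical helper: TRUE where a string is a complete decimal number.
looks_numeric <- function(x) grepl(strict_numeric, trimws(x))

looks_numeric(c("1433E", "1433E1", "1.23e-", "7.35"))
# [1] FALSE  TRUE FALSE  TRUE
```

A check like this could be run on a character column before deciding whether to convert it, which would keep "1433E" as text while still accepting "1433E1" as a number.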

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Ben Bolker

2024-Apr-16 12:37 UTC


Tangentially, your code will be more efficient if you add the data 
files to a *list* one by one and then apply bind_rows or 
do.call(rbind,...) after you have accumulated all of the information 
(see chapter 2 of the _R Inferno_). This may or may not be practically 
important in your particular case.

Burns, Patrick. 2012. The R Inferno. Lulu.com. 
http://www.burns-stat.com/pages/Tutor/R_inferno.pdf.
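
A minimal sketch of the list-based approach, assuming one CSV file per protein as in the original loop. The temporary directory, the placeholder gene name GENE_X and the log10p value are made up so that the example runs on its own.

```r
proteins <- c("1433B", "1433E")

# Create small stand-in files (hypothetical contents) in a temporary directory.
tmp_dir <- tempfile("csv_")
dir.create(tmp_dir)
for (protein in proteins) {
  writeLines(c("Gene,prot,log10p", paste0("GENE_X,", protein, ",7.35")),
             file.path(tmp_dir, paste0(protein, ".csv")))
}

# Keep only the files that exist, read each into a list element,
# then bind once at the end instead of growing a data frame in the loop.
files <- file.path(tmp_dir, paste0(proteins, ".csv"))
files <- files[file.exists(files)]
pieces <- lapply(files, read.csv, colClasses = c(prot = "character"))
all_data <- do.call(rbind, pieces)
all_data
```

dplyr::bind_rows(pieces) could replace the do.call(rbind, ...) step if dplyr is in use.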



Reed A. Cartwright

2024-Apr-16 18:21 UTC


Gene names being misinterpreted by spreadsheet software (read.csv is
no different) is a classic issue in bioinformatics. It seems like
every practitioner ends up encountering this issue in due time. E.g.

https://pubmed.ncbi.nlm.nih.gov/15214961/

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7

https://www.nature.com/articles/d41586-021-02211-4

https://www.theverge.com/2020/8/6/21355674/human-genes-rename-microsoft-excel-misreading-dates


