thr3ads.net - R help - [R] read_xlsx(readxl) apparently mangling some data input [Feb 2020]

If this information is useful, please help other people find it:
Share via:

Chris Evans

2020-Feb-04 12:07 UTC

[R] read_xlsx(readxl) apparently mangling some data input

This is a very odd error I'm hitting using read_xlsx from the readxl package
(version 1.3.1) with R version 3.6.2 (2019-12-12) , platform x86_64-pc-linux-gnu
(and updated Ubuntu 18.04). I have some largeish Excel spreadsheets that contain
clinical data. I can't share the entire raw data but I think I can share the
specific problem columns as Excel files, but not via the list as I'm sure it
rightly rejects such attachments.

The particular column contains entries like
1
1, 14

1.14

That's to say it's a column that can have empty cells, or entries which
should be integers (a limited range of them) but cells may have multiple
integers and the data entry means that people use various separators, commas,
full stops and occasionally semi-colons or colons and all with or without
various amounts of space.

I thought this would be easy to handle but this illustrates the issue I'm
hitting:
> unique(read_xlsx("Book1.xlsx", col_types = "text"))# A tibble: 18 x 1
   NOWARN            
   <chr>             
 1 NA                
 2 14                
 3 8,12,14           
 4 13                
 5 58                
 6 9                 
 7 9.1300000000000008
 8 11                
 9 11.14             
10 10                
11 10.14             
12 9.14              
13 13.14             
14 9 ,13             
15 9.11              
16 1                 
17 1.1399999999999999
18 1, 14          

That's reading from a single column, 981 row (including column header) Excel
xlsx file in an up to date Windoze 10 Professional running in a VM on the Ubuntu
machine.

I created that file (which I can share) by copying the data from the full file
to a new Excel spreadsheet (M$ Orifice "Professional Plus 2019"
"Version 1912" "Build 12325.20344 Click-to-run" to an empty
new Excel file and using the default save_as.  The clinical data files were
created in, and updated in, versions of Excel that I can't access but the
file was certainly created first between two years and three months before now
so probably with different versions of Excel and probably in a Spanish or
Catalan M$ locale.

The weird thing is that looking at the Excel cells that created those
"9.1300000000000008" and "1.1399999999999999" entries they
show "9.13" and "1.14" (respectively!).  They continue to
show those values plus many trailing zeroes if I use Excel formatting to ask for
20 decimal places (I get less of course, but no arbitrary terminal rounding
digit).

It appears to me that read_xlxs() is only applying the "col_types =
"text"" argument _after_ reading the column freely, reading each
cell guessing the type by its contents and so ending up with numeric values for
"9.13" and "1.14" which are then picking up rounding errors
and being forced to character after that.  I say that the reading would appear
to be free across all cells in the column as there are entries of "8, 12,
14" coming before these problem entries:
> tmp <- read_xlsx("Book1.xlsx", col_types = "text")
> grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
[1] 932 948 954> grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
 [1]  73 189 190 271 272 390 511 645 686 710 744 830 899> tmp$NOWARN[20][1] "8,12,14"

This seems completely bizarre to me.  I find it very hard to believe that
read_xlsx() would guess content class (type) freely by for each individual entry
and only apply the col_types argument after doing that as that would seem likely
to be incredibly inefficient for really big spreadsheets. It seems equally hard
to believe that it would then create rounding errors (for some guessed numerics
like 9.13 and 1.14 but not for others like 11.4).  However, my guess would
appear to fit the results and I am only guessing because I'm sure my
programming comprehension isn't good enough to read into the sources to
actually work out how the function works.

To make things more interesting, and to suggest that at least some of the
problem is with Excel is that when I use LibreOffice (in Ubuntu) created a Excel
file in the same way, i.e. open the clinical Excel file but in LibreOffice, copy
and paste the same column into a new LibreOffice calc spreadsheet and save as
xlsx, tmp.xlsx, I get this:
> unique(read_xlsx("tmp.xlsx", col_types = "text"))# A tibble: 18 x 1
   NOWARN 
   <chr>  
 1 NA     
 2 14     
 3 8,12,14
 4 13     
 5 58     
 6 9      
 7 9.13   
 8 11     
 9 11.14  
10 10     
11 10.14  
12 9.14   
13 13.14  
14 9 ,13  
15 9.11   
16 1      
17 1.14   
18 1, 14  

Exactly what I think I should be seeing. I was working in Rstudio but get
exactly the same in a new R terminal session with only readxl loaded so I
don't think this is any weird environment or other clash.

Obviously I can, though not terribly easily for a fully generic fix, catch these
weird rounding errors and correct them, I am sure can also report this as a
suspected bug to the maintainer through the github issues system but I wanted to
check here whether anyone could see something I'm missing as I'm really
a (clinically retired) therapist and doctor, now full time researcher and
I'm not a professional statistician or programmer.

TIA,

Chris



-- 
Chris Evans <chris at psyctc.org> Visiting Professor, University of
Sheffield <chris.evans at sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris.evans at
roehampton.ac.uk> and other places
but <chris at psyctc.org> remains my main Email address.  I have a work
web site at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see: 
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
That page will also take you to my blog which started with earlier joys in
France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my diary
is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.

PIKAL Petr

2020-Feb-04 12:39 UTC

head link

[R] read_xlsx(readxl) apparently mangling some data input

Hi

Floating point representation

I prepared excel file with arbitrary first row and second row

45.65 and 45.65/5

The division result should be 9.13 (exactly), but based on floation point
representation in binary computers (FAQ 7.31) it results in 9.129999999...
However Excel shows exact value (9.13) although internally it stores this
9.129999.. Probably they do not want to disturb its audience.

Therefore read_xlsx reads it correctly
> temp <- read_xlsx(file.choose())
> temp# A tibble: 2 x 2
     a1 a2               
  <dbl> <chr>            
1  12   8,8,10           
2  45.6 9.129999999999999> as.data.frame(temp)     a1                a2
1 12.00            8,8,10
2 45.65 9.129999999999999

Cheers
Petr
> -----Original Message-----
> From: R-help <r-help-bounces at r-project.org> On Behalf Of Chris
Evans
> Sent: Tuesday, February 4, 2020 1:07 PM
> To: R-help Mailing List <r-help at r-project.org>
> Subject: [R] read_xlsx(readxl) apparently mangling some data input
> 
> This is a very odd error I'm hitting using read_xlsx from the readxl
package> (version 1.3.1) with R version 3.6.2 (2019-12-12) , platform
x86_64-pc-linux-> gnu (and updated Ubuntu 18.04). I have some largeish Excel spreadsheets
> that contain clinical data. I can't share the entire raw data but I
think
I can> share the specific problem columns as Excel files, but not via the list as
I'm> sure it rightly rejects such attachments.
> 
> The particular column contains entries like
> 1
> 1, 14
> 
> 1.14
> 
> That's to say it's a column that can have empty cells, or entries
which
should> be integers (a limited range of them) but cells may have multiple integers
> and the data entry means that people use various separators, commas, full
> stops and occasionally semi-colons or colons and all with or without
various> amounts of space.
> 
> I thought this would be easy to handle but this illustrates the issue
I'm
> hitting:
> 
> > unique(read_xlsx("Book1.xlsx", col_types =
"text"))
> # A tibble: 18 x 1
>    NOWARN
>    <chr>
>  1 NA
>  2 14
>  3 8,12,14
>  4 13
>  5 58
>  6 9
>  7 9.1300000000000008
>  8 11
>  9 11.14
> 10 10
> 11 10.14
> 12 9.14
> 13 13.14
> 14 9 ,13
> 15 9.11
> 16 1
> 17 1.1399999999999999
> 18 1, 14
> 
> That's reading from a single column, 981 row (including column header)
> Excel xlsx file in an up to date Windoze 10 Professional running in a VM
on> the Ubuntu machine.
> 
> I created that file (which I can share) by copying the data from the full
file to> a new Excel spreadsheet (M$ Orifice "Professional Plus 2019"
"Version
1912"> "Build 12325.20344 Click-to-run" to an empty new Excel file and
using the
> default save_as.  The clinical data files were created in, and updated in,
> versions of Excel that I can't access but the file was certainly
created
first> between two years and three months before now so probably with different
> versions of Excel and probably in a Spanish or Catalan M$ locale.
> 
> The weird thing is that looking at the Excel cells that created those
> "9.1300000000000008" and "1.1399999999999999" entries
they show "9.13"
> and "1.14" (respectively!).  They continue to show those values
plus many
> trailing zeroes if I use Excel formatting to ask for 20 decimal places (I
get less> of course, but no arbitrary terminal rounding digit).
> 
> It appears to me that read_xlxs() is only applying the "col_types
"text""
> argument _after_ reading the column freely, reading each cell guessing the
> type by its contents and so ending up with numeric values for
"9.13" and
> "1.14" which are then picking up rounding errors and being forced
to
> character after that.  I say that the reading would appear to be free
across all> cells in the column as there are entries of "8, 12, 14" coming
before
these> problem entries:
> 
> > tmp <- read_xlsx("Book1.xlsx", col_types =
"text")
> > grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
> [1] 932 948 954
> > grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
>  [1]  73 189 190 271 272 390 511 645 686 710 744 830 899
> > tmp$NOWARN[20]
> [1] "8,12,14"
> 
> This seems completely bizarre to me.  I find it very hard to believe that
> read_xlsx() would guess content class (type) freely by for each individual
> entry and only apply the col_types argument after doing that as that would
> seem likely to be incredibly inefficient for really big spreadsheets. It
seems> equally hard to believe that it would then create rounding errors (for
some> guessed numerics like 9.13 and 1.14 but not for others like 11.4).
However,> my guess would appear to fit the results and I am only guessing because
I'm> sure my programming comprehension isn't good enough to read into the
> sources to actually work out how the function works.
> 
> To make things more interesting, and to suggest that at least some of the
> problem is with Excel is that when I use LibreOffice (in Ubuntu) created a
> Excel file in the same way, i.e. open the clinical Excel file but in
LibreOffice,> copy and paste the same column into a new LibreOffice calc spreadsheet
> and save as xlsx, tmp.xlsx, I get this:
> 
> > unique(read_xlsx("tmp.xlsx", col_types = "text"))
> # A tibble: 18 x 1
>    NOWARN
>    <chr>
>  1 NA
>  2 14
>  3 8,12,14
>  4 13
>  5 58
>  6 9
>  7 9.13
>  8 11
>  9 11.14
> 10 10
> 11 10.14
> 12 9.14
> 13 13.14
> 14 9 ,13
> 15 9.11
> 16 1
> 17 1.14
> 18 1, 14
> 
> Exactly what I think I should be seeing. I was working in Rstudio but get
> exactly the same in a new R terminal session with only readxl loaded so I
> don't think this is any weird environment or other clash.
> 
> Obviously I can, though not terribly easily for a fully generic fix, catch
these> weird rounding errors and correct them, I am sure can also report this as
a> suspected bug to the maintainer through the github issues system but I
> wanted to check here whether anyone could see something I'm missing as
> I'm really a (clinically retired) therapist and doctor, now full time
researcher> and I'm not a professional statistician or programmer.
> 
> TIA,
> 
> Chris
> 
> 
> 
> --
> Chris Evans <chris at psyctc.org> Visiting Professor, University of
Sheffield
> <chris.evans at sheffield.ac.uk> I do some consultation work for the
> University of Roehampton <chris.evans at roehampton.ac.uk> and other
> places but <chris at psyctc.org> remains my main Email address.  I
have a
> work web site at:
>    https://www.psyctc.org/psyctc/
> and a site I manage for CORE and CORE system trust at:
>    http://www.coresystemtrust.org.uk/
> I have "semigrated" to France, see:
>    https://www.psyctc.org/pelerinage2016/semigrating-to-france/
> That page will also take you to my blog which started with earlier joys in
> France and Spain!
> 
> If you want to book to talk, I am trying to keep that to Thursdays and my
> diary is at:
>    https://www.psyctc.org/pelerinage2016/ceworkdiary/
> Beware: French time, generally an hour ahead of UK.
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

peter dalgaard

2020-Feb-04 13:01 UTC

head link

[R] read_xlsx(readxl) apparently mangling some data input

Excel itself will store numeric data as numeric unless you explicitly say that
they are not. I.e., 9.13 gest stored in floating point, with the innate binary
rounding issues which you can also see from R
> print(9.13, digits=20)
[1] 9.1300000000000007816> print(1.14, digits=20)[1] 1.1399999999999999023

This happens on a per-cell basis inside Excel, and the excess digits are
probably retained in .xlsx files (which, as far as I recall, are text-based, XML
format files). LibreOffice probably keeps fewer significant digits.

So read_xlsx just reads what is in the file, and if instructed to read as text,
gets you the result that you see. There seems to be no way so set a tolerance
for rounding (which would also logically be a peculiar thing to do as it
requires char -> num -> char conversion in ways that may or may not be
what the user wants)

Pragmatically, I think you need to jump through a few hoops, something like

x <- tbl$NOWARN
xn <- as.numeric(x)
ix <- !(is.na(xn)) & xn%%1 != 0
x[ix] <- as.character(xn[ix])

(possibly throw in a zapsmall(), caveat emptor). Then proceed as per the
original plan.

-pd
> On 4 Feb 2020, at 13:07 , Chris Evans <chrishold at psyctc.org>
wrote:
> 
> This is a very odd error I'm hitting using read_xlsx from the readxl
package (version 1.3.1) with R version 3.6.2 (2019-12-12) , platform
x86_64-pc-linux-gnu (and updated Ubuntu 18.04). I have some largeish Excel
spreadsheets that contain clinical data. I can't share the entire raw data
but I think I can share the specific problem columns as Excel files, but not via
the list as I'm sure it rightly rejects such attachments.
> 
> The particular column contains entries like
> 1
> 1, 14
> 
> 1.14
> 
> That's to say it's a column that can have empty cells, or entries
which should be integers (a limited range of them) but cells may have multiple
integers and the data entry means that people use various separators, commas,
full stops and occasionally semi-colons or colons and all with or without
various amounts of space.
> 
> I thought this would be easy to handle but this illustrates the issue
I'm hitting:
> 
>> unique(read_xlsx("Book1.xlsx", col_types = "text"))
> # A tibble: 18 x 1
>   NOWARN            
>   <chr>             
> 1 NA                
> 2 14                
> 3 8,12,14           
> 4 13                
> 5 58                
> 6 9                 
> 7 9.1300000000000008
> 8 11                
> 9 11.14             
> 10 10                
> 11 10.14             
> 12 9.14              
> 13 13.14             
> 14 9 ,13             
> 15 9.11              
> 16 1                 
> 17 1.1399999999999999
> 18 1, 14          
> 
> That's reading from a single column, 981 row (including column header)
Excel xlsx file in an up to date Windoze 10 Professional running in a VM on the
Ubuntu machine.
> 
> I created that file (which I can share) by copying the data from the full
file to a new Excel spreadsheet (M$ Orifice "Professional Plus 2019"
"Version 1912" "Build 12325.20344 Click-to-run" to an empty
new Excel file and using the default save_as.  The clinical data files were
created in, and updated in, versions of Excel that I can't access but the
file was certainly created first between two years and three months before now
so probably with different versions of Excel and probably in a Spanish or
Catalan M$ locale.
> 
> The weird thing is that looking at the Excel cells that created those
"9.1300000000000008" and "1.1399999999999999" entries they
show "9.13" and "1.14" (respectively!).  They continue to
show those values plus many trailing zeroes if I use Excel formatting to ask for
20 decimal places (I get less of course, but no arbitrary terminal rounding
digit).
> 
> It appears to me that read_xlxs() is only applying the "col_types =
"text"" argument _after_ reading the column freely, reading each
cell guessing the type by its contents and so ending up with numeric values for
"9.13" and "1.14" which are then picking up rounding errors
and being forced to character after that.  I say that the reading would appear
to be free across all cells in the column as there are entries of "8, 12,
14" coming before these problem entries:
> 
>> tmp <- read_xlsx("Book1.xlsx", col_types =
"text")
>> grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
> [1] 932 948 954
>> grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
> [1]  73 189 190 271 272 390 511 645 686 710 744 830 899
>> tmp$NOWARN[20]
> [1] "8,12,14"
> 
> This seems completely bizarre to me.  I find it very hard to believe that
read_xlsx() would guess content class (type) freely by for each individual entry
and only apply the col_types argument after doing that as that would seem likely
to be incredibly inefficient for really big spreadsheets. It seems equally hard
to believe that it would then create rounding errors (for some guessed numerics
like 9.13 and 1.14 but not for others like 11.4).  However, my guess would
appear to fit the results and I am only guessing because I'm sure my
programming comprehension isn't good enough to read into the sources to
actually work out how the function works.
> 
> To make things more interesting, and to suggest that at least some of the
problem is with Excel is that when I use LibreOffice (in Ubuntu) created a Excel
file in the same way, i.e. open the clinical Excel file but in LibreOffice, copy
and paste the same column into a new LibreOffice calc spreadsheet and save as
xlsx, tmp.xlsx, I get this:
> 
>> unique(read_xlsx("tmp.xlsx", col_types = "text"))
> # A tibble: 18 x 1
>   NOWARN 
>   <chr>  
> 1 NA     
> 2 14     
> 3 8,12,14
> 4 13     
> 5 58     
> 6 9      
> 7 9.13   
> 8 11     
> 9 11.14  
> 10 10     
> 11 10.14  
> 12 9.14   
> 13 13.14  
> 14 9 ,13  
> 15 9.11   
> 16 1      
> 17 1.14   
> 18 1, 14  
> 
> Exactly what I think I should be seeing. I was working in Rstudio but get
exactly the same in a new R terminal session with only readxl loaded so I
don't think this is any weird environment or other clash.
> 
> Obviously I can, though not terribly easily for a fully generic fix, catch
these weird rounding errors and correct them, I am sure can also report this as
a suspected bug to the maintainer through the github issues system but I wanted
to check here whether anyone could see something I'm missing as I'm
really a (clinically retired) therapist and doctor, now full time researcher and
I'm not a professional statistician or programmer.
> 
> TIA,
> 
> Chris
> 
> 
> 
> -- 
> Chris Evans <chris at psyctc.org> Visiting Professor, University of
Sheffield <chris.evans at sheffield.ac.uk>
> I do some consultation work for the University of Roehampton
<chris.evans at roehampton.ac.uk> and other places
> but <chris at psyctc.org> remains my main Email address.  I have a
work web site at:
>   https://www.psyctc.org/psyctc/
> and a site I manage for CORE and CORE system trust at:
>   http://www.coresystemtrust.org.uk/
> I have "semigrated" to France, see: 
>   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
> That page will also take you to my blog which started with earlier joys in
France and Spain!
> 
> If you want to book to talk, I am trying to keep that to Thursdays and my
diary is at:
>   https://www.psyctc.org/pelerinage2016/ceworkdiary/
> Beware: French time, generally an hour ahead of UK.
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

Chris Evans

2020-Feb-04 13:56 UTC

head link

[R] read_xlsx(readxl) apparently mangling some data input

This list can be priceless (and has taught me so much over, hm, over a decade
certainly now!)

Thanks both: makes perfect sense (of course) and shows my naivety in the way I
was thinking about this.
I'm intrigued that it's LibreOffice actually using lower precision that
avoids the issue that was
puzzling me, again, makes perfect sense.

Further, Peter's 
   ix <- !(is.na(xn)) & xn%%1 != 0
is delicious and exactly the sort of thing I don't see unaided.  I know I
would have done something
horribly more clumsy.  It's also the sort of little revelation about the
potential power of %% that
I think I _will_ remember and no doubt find myself using again in the future.

Thanks both, huge help to me and, as I suspected, a wasteful github issue report
prevented!

Chris

----- Original Message -----> From: "PIKAL Petr" <petr.pikal at precheza.cz>
> To: "Chris Evans" <chrishold at psyctc.org>, "R-help
Mailing List" <r-help at r-project.org>
> Sent: Tuesday, 4 February, 2020 13:39:31
> Subject: RE: read_xlsx(readxl) apparently mangling some data input
> Hi
> 
> Floating point representation
> 
> I prepared excel file with arbitrary first row and second row
> 
> 45.65 and 45.65/5
> 
> The division result should be 9.13 (exactly), but based on floation point
> representation in binary computers (FAQ 7.31) it results in 9.129999999...
> However Excel shows exact value (9.13) although internally it stores this
> 9.129999.. Probably they do not want to disturb its audience.
> 
> Therefore read_xlsx reads it correctly
> 
>> temp <- read_xlsx(file.choose())
>> temp
> # A tibble: 2 x 2
>     a1 a2
>  <dbl> <chr>
> 1  12   8,8,10
> 2  45.6 9.129999999999999
>> as.data.frame(temp)
>     a1                a2
> 1 12.00            8,8,10
> 2 45.65 9.129999999999999
> 
> Cheers
> Petr
> 
>> -----Original Message-----
>> From: R-help <r-help-bounces at r-project.org> On Behalf Of Chris
Evans
>> Sent: Tuesday, February 4, 2020 1:07 PM
>> To: R-help Mailing List <r-help at r-project.org>
>> Subject: [R] read_xlsx(readxl) apparently mangling some data input
>> 
>> This is a very odd error I'm hitting using read_xlsx from the
readxl
> package
>> (version 1.3.1) with R version 3.6.2 (2019-12-12) , platform
> x86_64-pc-linux-
>> gnu (and updated Ubuntu 18.04). I have some largeish Excel spreadsheets
>> that contain clinical data. I can't share the entire raw data but I
think
> I can
>> share the specific problem columns as Excel files, but not via the list
as
> I'm
>> sure it rightly rejects such attachments.
>> 
>> The particular column contains entries like
>> 1
>> 1, 14
>> 
>> 1.14
>> 
>> That's to say it's a column that can have empty cells, or
entries which
> should
>> be integers (a limited range of them) but cells may have multiple
integers
>> and the data entry means that people use various separators, commas,
full
>> stops and occasionally semi-colons or colons and all with or without
> various
>> amounts of space.
>> 
>> I thought this would be easy to handle but this illustrates the issue
I'm
>> hitting:
>> 
>> > unique(read_xlsx("Book1.xlsx", col_types =
"text"))
>> # A tibble: 18 x 1
>>    NOWARN
>>    <chr>
>>  1 NA
>>  2 14
>>  3 8,12,14
>>  4 13
>>  5 58
>>  6 9
>>  7 9.1300000000000008
>>  8 11
>>  9 11.14
>> 10 10
>> 11 10.14
>> 12 9.14
>> 13 13.14
>> 14 9 ,13
>> 15 9.11
>> 16 1
>> 17 1.1399999999999999
>> 18 1, 14
>> 
>> That's reading from a single column, 981 row (including column
header)
>> Excel xlsx file in an up to date Windoze 10 Professional running in a
VM
> on
>> the Ubuntu machine.
>> 
>> I created that file (which I can share) by copying the data from the
full
> file to
>> a new Excel spreadsheet (M$ Orifice "Professional Plus 2019"
"Version
> 1912"
>> "Build 12325.20344 Click-to-run" to an empty new Excel file
and using the
>> default save_as.  The clinical data files were created in, and updated
in,
>> versions of Excel that I can't access but the file was certainly
created
> first
>> between two years and three months before now so probably with
different
>> versions of Excel and probably in a Spanish or Catalan M$ locale.
>> 
>> The weird thing is that looking at the Excel cells that created those
>> "9.1300000000000008" and "1.1399999999999999"
entries they show "9.13"
>> and "1.14" (respectively!).  They continue to show those
values plus many
>> trailing zeroes if I use Excel formatting to ask for 20 decimal places
(I
> get less
>> of course, but no arbitrary terminal rounding digit).
>> 
>> It appears to me that read_xlxs() is only applying the "col_types
> "text""
>> argument _after_ reading the column freely, reading each cell guessing
the
>> type by its contents and so ending up with numeric values for
"9.13" and
>> "1.14" which are then picking up rounding errors and being
forced to
>> character after that.  I say that the reading would appear to be free
> across all
>> cells in the column as there are entries of "8, 12, 14"
coming before
> these
>> problem entries:
>> 
>> > tmp <- read_xlsx("Book1.xlsx", col_types =
"text")
>> > grep("1.1399999999999999", tmp$NOWARN, fixed = TRUE)
>> [1] 932 948 954
>> > grep("9.1300000000000008", tmp$NOWARN, fixed = TRUE)
>>  [1]  73 189 190 271 272 390 511 645 686 710 744 830 899
>> > tmp$NOWARN[20]
>> [1] "8,12,14"
>> 
>> This seems completely bizarre to me.  I find it very hard to believe
that
>> read_xlsx() would guess content class (type) freely by for each
individual
>> entry and only apply the col_types argument after doing that as that
would
>> seem likely to be incredibly inefficient for really big spreadsheets.
It
> seems
>> equally hard to believe that it would then create rounding errors (for
> some
>> guessed numerics like 9.13 and 1.14 but not for others like 11.4).
> However,
>> my guess would appear to fit the results and I am only guessing because
> I'm
>> sure my programming comprehension isn't good enough to read into
the
>> sources to actually work out how the function works.
>> 
>> To make things more interesting, and to suggest that at least some of
the
>> problem is with Excel is that when I use LibreOffice (in Ubuntu)
created a
>> Excel file in the same way, i.e. open the clinical Excel file but in
> LibreOffice,
>> copy and paste the same column into a new LibreOffice calc spreadsheet
>> and save as xlsx, tmp.xlsx, I get this:
>> 
>> > unique(read_xlsx("tmp.xlsx", col_types =
"text"))
>> # A tibble: 18 x 1
>>    NOWARN
>>    <chr>
>>  1 NA
>>  2 14
>>  3 8,12,14
>>  4 13
>>  5 58
>>  6 9
>>  7 9.13
>>  8 11
>>  9 11.14
>> 10 10
>> 11 10.14
>> 12 9.14
>> 13 13.14
>> 14 9 ,13
>> 15 9.11
>> 16 1
>> 17 1.14
>> 18 1, 14
>> 
>> Exactly what I think I should be seeing. I was working in Rstudio but
get
>> exactly the same in a new R terminal session with only readxl loaded so
I
>> don't think this is any weird environment or other clash.
>> 
>> Obviously I can, though not terribly easily for a fully generic fix,
catch
> these
>> weird rounding errors and correct them, I am sure can also report this
as
> a
>> suspected bug to the maintainer through the github issues system but I
>> wanted to check here whether anyone could see something I'm missing
as
>> I'm really a (clinically retired) therapist and doctor, now full
time
> researcher
>> and I'm not a professional statistician or programmer.
>> 
>> TIA,
>> 
>> Chris
>> 
>> 
>> 
>> --
>> Chris Evans <chris at psyctc.org> Visiting Professor, University
of Sheffield
>> <chris.evans at sheffield.ac.uk> I do some consultation work for
the
>> University of Roehampton <chris.evans at roehampton.ac.uk> and
other
>> places but <chris at psyctc.org> remains my main Email address. 
I have a
>> work web site at:
>>    https://www.psyctc.org/psyctc/
>> and a site I manage for CORE and CORE system trust at:
>>    http://www.coresystemtrust.org.uk/
>> I have "semigrated" to France, see:
>>    https://www.psyctc.org/pelerinage2016/semigrating-to-france/
>> That page will also take you to my blog which started with earlier joys
in
>> France and Spain!
>> 
>> If you want to book to talk, I am trying to keep that to Thursdays and
my
>> diary is at:
>>    https://www.psyctc.org/pelerinage2016/ceworkdiary/
>> Beware: French time, generally an hour ahead of UK.
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
> > and provide commented, minimal, self-contained, reproducible code.
-- 
Chris Evans <chris at psyctc.org> Visiting Professor, University of
Sheffield <chris.evans at sheffield.ac.uk>
I do some consultation work for the University of Roehampton <chris.evans at
roehampton.ac.uk> and other places
but <chris at psyctc.org> remains my main Email address.  I have a work
web site at:
   https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
   http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see: 
   https://www.psyctc.org/pelerinage2016/semigrating-to-france/ 
That page will also take you to my blog which started with earlier joys in
France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my diary
is at:
   https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.

R help - Feb 2020 - read_xlsx(readxl) apparently mangling some data input

[R] read_xlsx(readxl) apparently mangling some data input

[R] read_xlsx(readxl) apparently mangling some data input

[R] read_xlsx(readxl) apparently mangling some data input

[R] read_xlsx(readxl) apparently mangling some data input