Hello R users,

I have to deal with int8 (64-bit integer) data in R. AFAIK R only handles int4 (32-bit integers) with the `as.integer` function [1]. I wonder:

1. What is the best approach to handle int8? `as.character`? `as.numeric`?
2. Is there any plan to handle int8 in the future? As you might know, int4 is too small to count the Earth's population right now.

Thanks for your ideas,

int8 e.g.:

human_id
----------------------
-1311071933951566764
-4708675461424073238
-6865005668390999818
5578000650960353108
-3219674686933841021
-6469229889308771589
-606871692563545028
-8199987422425699249
-463287495999648233
7675955260644241951

reference:
1. https://www.r-bloggers.com/r-in-a-64-bit-world/

--
Nicolas PARIS
I am not on R-core, so I cannot speak to future plans to support int8 internally (though my impression is that there aren't any, at least none that are close to fruition).

The standard way of dealing with whole numbers too big to fit in an integer is to put them in a numeric (a double, down in C land). This can represent integers up to 2^53 without loss of precision (see http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double). This is how long vector indices are (currently) implemented in R. If it's good enough for indices, it's probably good enough for whatever you need them for.

Hope that helps.

~G

--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
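The limits mentioned in this reply can be checked directly in base R. The following is only a minimal sketch; the variable names are illustrative and not taken from the thread.

```r
# Largest value of R's 32-bit integer type.
int_max <- .Machine$integer.max

# Out-of-range values cannot be stored as integer: as.integer() yields NA
# (with a warning, suppressed here).
too_big <- suppressWarnings(as.integer(2^31))

# A double has a 53-bit significand, so whole numbers are exact up to 2^53.
exact_below <- (2^53 - 1) != 2^53   # still distinguishable below 2^53
lost_above  <- (2^53 + 1) == 2^53   # 2^53 + 1 is rounded back to 2^53

print(c(int_max = int_max, too_big = too_big))
print(c(exact_below = exact_below, lost_above = lost_above))
```

Both logical values come out TRUE, which is why a double is a safe container for whole numbers only while they stay within 2^53 in magnitude.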
The lack of 64-bit integer support causes lots of problems when dealing with certain types of data where the loss of precision from coercing to the 53 bits of a double is unacceptable.

Two packages were developed to deal with this: int64 and bit64.

You may need to find archived versions of these packages if they've fallen off CRAN.

Murray (mobile phone)
If these are identifiers, store them as strings. If not, what sort of calculations do you plan on doing with them?

Bill Dunlap
TIBCO Software
wdunlap tibco.com
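A minimal sketch of the string approach, assuming two hypothetical tables (`people` and `visits`, invented here for illustration) that share the identifier column:

```r
# Hypothetical tables keyed by 64-bit identifiers kept as character strings.
people <- data.frame(
  human_id = c("-1311071933951566764", "5578000650960353108"),
  name     = c("toto", "tata"),
  stringsAsFactors = FALSE
)
visits <- data.frame(
  human_id = c("5578000650960353108", "-1311071933951566764",
               "5578000650960353108"),
  visit    = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# merge() matches the keys as strings, so every digit takes part in the
# comparison and no rounding can occur.
joined <- merge(people, visits, by = "human_id")
print(joined)
```

Because the keys never pass through a numeric type, the join keeps all 19 digits of each identifier.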
On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> The lack of 64 bit integer support causes lots of problems when dealing with
> certain types of data where the loss of precision from coercing to 53 bits with
> double is unacceptable.

Hello Murray,

Do you mean that, e.g., -1311071933951566764 loses precision when converted with as.numeric(-1311071933951566764)?

Thanks,

--
Nicolas PARIS
Right, they are identifiers.

Storing them as strings has drawbacks:
- huge to store in memory
- slow to process
- huge to index (e.g. by data.table column indexes)

Why not store them as numeric?

Thanks,

--
Nicolas PARIS
Head of R&D
WIND - PACTE, Hôpital Rothschild (RTH)
Email: nicolas.paris at aphp.fr
Tel: 01 48 04 21 07
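The memory concern can be measured rather than guessed. This is a rough sketch only: the identifiers are synthetic 19-digit strings made up for the comparison, and the double vector is a placeholder of zeros, since only the storage size matters here.

```r
n <- 1e5

# Synthetic identifiers: n distinct 19-digit strings (no numeric conversion).
ids_chr <- sprintf("1311071933951%06d", seq_len(n))

# The same number of elements stored as doubles (8 bytes each).
ids_dbl <- numeric(n)

size_chr <- as.numeric(object.size(ids_chr))
size_dbl <- as.numeric(object.size(ids_dbl))

print(c(character = size_chr, double = size_dbl,
        ratio = size_chr / size_dbl))
```

The exact ratio depends on the R version and platform. The character vector is larger, but this says nothing about the precision problem of the numeric alternative.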
    2^53 == 2^53 + 1
    TRUE

This makes joining or grouping data sets with 64-bit identifiers problematic.

Murray (mobile)
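To make the grouping problem concrete, here is a small base R sketch. The three identifiers are hypothetical values chosen just above 2^53 and are not taken from the dataset in the thread.

```r
# Three distinct integer identifiers just above 2^53, built with double arithmetic.
ids <- c(2^53, 2^53 + 1, 2^53 + 2)

# 2^53 + 1 has no exact double representation and is stored as 2^53,
# so two different identifiers become indistinguishable.
n_distinct <- length(unique(ids))
collapsed  <- duplicated(ids)

print(n_distinct)
print(collapsed)
```

Only two distinct values survive, so any join or group-by on such a column would silently merge records that belong to different identifiers.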
Well, I definitely cannot use them as numeric, because joining is the main purpose of these identifiers.

As for the int64 and bit64 packages, they are not a solution, because I am releasing a dataset for external users. I cannot ask them to install a package in order to use it.

I have to be very careful when releasing the data. If a user just uses the read.csv functions, they cast the identifiers to numeric by default.

$ more res.csv
"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"

> read.table("res.csv",sep=";",header=T)
           col1 col2
1 -1.311072e+18 toto
2 -1.311072e+18 tata

> sapply(read.table("res.csv",sep=";",header=T),class)
     col1      col2
"numeric"  "factor"

> read.table("res.csv",sep=";",header=T,colClasses="character")
                  col1 col2
1 -1311071933951566764 toto
2 -1311071933951566764 tata

Am I condemned to provide an R script with the data so that users can work with the dataset?

--
Nicolas PARIS
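One possible variant of the last call, sketched under the assumption that only the identifier column needs protecting: `colClasses` also accepts a named vector, so `col1` can be forced to character while the other columns keep the default type detection. The file content is supplied inline through the `text` argument to keep the sketch self-contained.

```r
# Same content as res.csv in the message above, supplied inline.
csv_text <- paste(
  '"col1";"col2"',
  '"-1311071933951566764";"toto"',
  '"-1311071933951566764";"tata"',
  sep = "\n"
)

# Force only the identifier column to character; col2 is left to the
# default type detection.
res <- read.table(text = csv_text, sep = ";", header = TRUE,
                  colClasses = c(col1 = "character"),
                  stringsAsFactors = FALSE)

print(res)
print(sapply(res, class))
```

A one-line call like this could be documented alongside the released file instead of shipping a full script, although users who ignore it would still get the numeric default.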
On Fri, Jan 20, 2017 at 6:09 PM, Murray Stokely <murray at stokely.org> wrote:
> The lack of 64 bit integer support causes lots of problems when dealing
> with certain types of data where the loss of precision from coercing to 53
> bits with double is unacceptable.
>
> Two packages were developed to deal with this: int64 and bit64.

Don't forget packages for arbitrarily large numbers, such as Rmpfr and openssl.

    x <- openssl::bignum("12345678987654321")
    x^10

The risk of storing an int64 as a double (e.g. in bit64) is that it might easily be mistaken for a completely different value via unclass() or Rf_isNumeric() or so.
To summarise this thread, there are basically three ways of handling int64 in R:

* coerce to character
* coerce to double
* store in double

There is no ideal solution, and each has pros and cons that I've attempted to summarise below.

## Coerce to character

This is the easiest approach if the data is used as identifiers. It will have some performance drawbacks when loading and will require additional memory. It should not have negative performance implications once the data has been loaded, because R has a global string pool, so string comparisons only require a single pointer comparison (assuming they have the same encoding).

## Coerce to double

This is the easiest approach if your integers are in the range [-(2^53), 2^53] or you can tolerate some minor loss of precision.

## Store in a double

This technique takes advantage of the fact that doubles and int64s are the same size, so you can store the binary representation of an int64 in a double. This will effectively be garbage if you treat the vector as if it is a double, so it requires adding an S3 class and overriding every generic function with a custom method. Not all functions are generic, and internal C code will not know about the special class, so this has the danger of code silently interpreting the data incorrectly.

This is the approach taken by the bit64 package (and, I believe, the int64 package, but since that's been archived it's not worth considering).

Hadley

--
http://hadley.nz
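The "garbage if treated as a double" point can be illustrated without any package. The sketch below is not bit64's actual code; it only mimics the idea by reinterpreting the eight bytes of the 64-bit integer 1 as a double with base R's `readBin()`.

```r
# Eight little-endian bytes holding the 64-bit integer 1.
int64_bytes <- as.raw(c(1, 0, 0, 0, 0, 0, 0, 0))

# Reinterpret those same bytes as a double, as code unaware of the
# special class would.
as_double <- readBin(int64_bytes, what = "double", n = 1, size = 8,
                     endian = "little")

print(as_double)   # a tiny subnormal number, not 1
```

Any function that sees only the underlying double, for instance after the class attribute has been dropped, would work with this meaningless value instead of the intended integer.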