Well, I definitely cannot use them as numeric, because joining is the main purpose of those identifiers.

As for the int64 and bit64 packages, they are not a solution, because I am releasing a dataset for external users. I cannot ask them to install a package in order to use it.

I have to be very careful when releasing the data. If a user just uses the read.csv functions, the identifiers are cast to numeric by default.

$ more res.csv
"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"

> read.table("res.csv",sep=";",header=T)
           col1 col2
1 -1.311072e+18 toto
2 -1.311072e+18 tata

> sapply(read.table("res.csv",sep=";",header=T),class)
     col1      col2
"numeric"  "factor"

> read.table("res.csv",sep=";",header=T,colClasses="character")
                  col1 col2
1 -1311071933951566764 toto
2 -1311071933951566764 tata

Am I condemned to provide an R script with the data in order to use the dataset?

On 20 Jan 2017 at 18:29, Murray Stokely wrote:
> 2^53 == 2^53+1
> TRUE
>
> Which makes joining or grouping data sets with 64-bit identifiers problematic.
>
> Murray (mobile)
>
> On Jan 20, 2017 9:15 AM, "Nicolas Paris" <nicolas.paris at aphp.fr> wrote:
>> On 20 Jan 2017 at 18:09, Murray Stokely wrote:
>>> The lack of 64-bit integer support causes lots of problems when dealing
>>> with certain types of data where the loss of precision from coercing to
>>> 53 bits with double is unacceptable.
>>
>> Hello Murray,
>> Do you mean that, e.g., -1311071933951566764 loses precision during the
>> as.numeric(-1311071933951566764) conversion?
>> Thanks,
>>
>>> Two packages were developed to deal with this: int64 and bit64.
>>>
>>> You may need to find archival versions of these packages if they've
>>> fallen off CRAN.
>>>
>>> Murray (mobile phone)
>>>
>>> On Jan 20, 2017 7:20 AM, "Gabriel Becker" <gmbecker at ucdavis.edu> wrote:
>>>> I am not on R-core, so I cannot speak to future plans to internally
>>>> support int8 (though my impression is that there aren't any, at least
>>>> none that are close to fruition).
>>>>
>>>> The standard way of dealing with whole numbers too big to fit in an
>>>> integer is to put them in a numeric (a double down in C land). This can
>>>> represent integers up to 2^53 without loss of precision (see
>>>> http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
>>>> This is how long vector indices are (currently) implemented in R. If
>>>> it's good enough for indices, it's probably good enough for whatever
>>>> you need them for.
>>>>
>>>> Hope that helps.
>>>>
>>>> ~G
>>>>
>>>> On Fri, Jan 20, 2017 at 6:33 AM, Nicolas Paris <nicolas.paris at aphp.fr> wrote:
>>>>> Hello R users,
>>>>>
>>>>> I have to deal with int8 data in R. AFAIK R only handles int4, with
>>>>> the `as.integer` function [1]. I wonder:
>>>>> 1. What is the better approach to handle int8? `as.character`?
>>>>>    `as.numeric`?
>>>>> 2. Is there any plan to handle int8 in the future? As you might know,
>>>>>    int4 is too small to deal with the earth's population right now.
>>>>>
>>>>> Thanks for your ideas,
>>>>>
>>>>> int8 e.g.:
>>>>>
>>>>>        human_id
>>>>> ----------------------
>>>>>  -1311071933951566764
>>>>>  -4708675461424073238
>>>>>  -6865005668390999818
>>>>>   5578000650960353108
>>>>>  -3219674686933841021
>>>>>  -6469229889308771589
>>>>>   -606871692563545028
>>>>>  -8199987422425699249
>>>>>   -463287495999648233
>>>>>   7675955260644241951
>>>>>
>>>>> reference:
>>>>> 1. https://www.r-bloggers.com/r-in-a-64-bit-world/
>>>>>
>>>>> --
>>>>> Nicolas PARIS
>>>>>
>>>>> ______________________________________________
>>>>> R-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>> --
>>>> Gabriel Becker, PhD
>>>> Associate Scientist (Bioinformatics)
>>>> Genentech Research

--
Nicolas PARIS
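The behaviour described in this message can be reproduced with base R alone. The sketch below is only an illustration: it rebuilds the res.csv example in a temporary file (the variable names `csv_path`, `as_numeric` and `as_text` are made up for this example) and compares the default read with a read that forces character columns.

```r
# Doubles carry a 53-bit significand, so neighbouring large integers collide.
collide <- 2^53 == 2^53 + 1   # the comparison quoted in the thread

# Recreate the res.csv example from the message in a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566764";"tata"'),
           csv_path)

# Default reading: col1 is converted to a double, so the exact id is lost.
as_numeric <- read.table(csv_path, sep = ";", header = TRUE)

# Forcing character columns keeps every digit, so joins on col1 stay exact.
as_text <- read.table(csv_path, sep = ";", header = TRUE,
                      colClasses = "character")

print(collide)
print(class(as_numeric$col1))
print(as_text$col1[1])
```

Reading the identifiers as character costs memory compared with a true 64-bit integer, but it needs no extra package on the user's side, only the `colClasses` argument.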
How many unique identifiers do you have?

If they are large (in terms of bytes) but you don't have that many of them (e.g. the total possible number you'll ever have is < INT_MAX), you could store them as factors. You get the speed of integers but the labeling of full "precision" strings. Factors are fast for joins.

~G

--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research
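A minimal sketch of the factor suggestion, using base R only. The data frames `left` and `right` and their columns are hypothetical, and the ids are taken from the original question. Both factors share the same `levels`, which is an assumption made here so that the integer codes mean the same thing in both tables.

```r
# Identifiers kept as strings so that no digit is lost.
ids <- c("-1311071933951566764", "-4708675461424073238", "5578000650960353108")

# Hypothetical tables keyed by the same factor levels.
left <- data.frame(human_id = factor(ids, levels = ids),
                   name = c("toto", "tata", "titi"))
right <- data.frame(human_id = factor(ids[c(3, 1)], levels = ids),
                    score = c(10, 20))

# A factor is stored as integer codes plus a table of string labels.
print(typeof(left$human_id))

# merge() joins on the factor column; the labels keep the full id.
joined <- merge(left, right, by = "human_id")
print(joined)
```

The labels are compared as strings when the factor is built, so the join never goes through a lossy double.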
For what it is worth, I would be extremely pleased to see R's integer type go to 64-bit. A signed 32-bit integer is just a bit too small to index into the ~3 billion position human genome. The "workarounds" that have arisen for this specific issue are surprisingly complex.

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
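The size argument can be checked directly in base R. In this small sketch, `genome_positions` is just the rough 3 billion figure from the message, not a measured value.

```r
int_max <- .Machine$integer.max        # largest value of R's 32-bit integer type
genome_positions <- 3e9                # rough figure from the message, held as a double

# Does the position count fit in R's integer type?
fits <- genome_positions <= int_max

# Coercion to integer fails (NA); the warning is silenced for the demo.
as_int <- suppressWarnings(as.integer(genome_positions))

# A double still counts exactly at this magnitude (well below 2^53).
exact <- (genome_positions + 1) - genome_positions == 1

print(c(fits = fits, coerced_is_na = is.na(as_int), double_is_exact = exact))
```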
Hi,

I do have < INT_MAX. This looks attractive, but since they are unique identifiers, storing them as factors is likely to be counter-productive (a string version plus an int32 for each).

I was looking at https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good fit for my needs.

Is there any chance that such an external package for int64 would be integrated into core?

--
Nicolas PARIS
Head of R&D
WIND - PACTE, Hôpital Rothschild ( RTH )
Email: nicolas.paris at aphp.fr
Tel: 01 48 04 21 07
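The memory side of this concern can be checked with a rough sketch like the one below. The ids are synthetic strings generated for the example, not real data, and the comparison says nothing about join speed.

```r
# 1000 distinct synthetic ids, all 20 characters long (made up for this example).
ids <- paste0("-13110719339515", 10000:10999)

# The same ids as a factor: integer codes plus one label per distinct id.
as_factor <- factor(ids)

# With all-unique values the factor stores every string once as a label
# and adds an integer code per row on top of that.
size_character <- as.numeric(object.size(ids))
size_factor <- as.numeric(object.size(as_factor))

print(c(character = size_character, factor = size_factor))
```

When ids repeat many times (as in a long table joined to a short one), the balance can tip the other way, since each repeated string is then replaced by a 4-byte code.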
You might want to use a data.table then. It will automatically detect that it is a 64-bit int. In that case too, though, the user will have to install the data.table package (which is a good idea anyway, in my opinion :) ).
It will then obviously allow you to join tables.

Willem
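A hedged sketch of reading such a file with data.table. It assumes the data.table package is installed and skips the read otherwise. Instead of the automatic 64-bit detection mentioned above (which relies on the bit64 package), it uses `fread`'s `integer64 = "character"` option, a deliberate substitution that avoids the extra dependency while keeping every digit. The file is written unquoted here to keep the example simple.

```r
# Recreate a small semicolon-separated file with large ids.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("col1;col2",
             "-1311071933951566764;toto",
             "-1311071933951566764;tata"),
           csv_path)

# Only attempt the read when data.table is available on this machine.
if (requireNamespace("data.table", quietly = TRUE)) {
  # integer64 = "character" asks fread to return large integers as strings.
  dt <- data.table::fread(csv_path, sep = ";", header = TRUE,
                          integer64 = "character")
  print(class(dt$col1))
  print(dt$col1[1])
}
```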