Dear List,

I frequently use the data() function to load csv files (with separator ";") into an R session; typically

    data(myfile)

loads myfile.csv from my working/data directory into R.

Now, in version 1.4.0, everything works as expected, but with one difference: values that older versions read in "num" mode are now read in "int" mode, and every value larger than 2147483647 (2^{31}-1) is converted into that value.

This has a consequence when reading data of the following kind:

<example>

File alerts.csv looks like:

"IMSI";"DialedDigits";"Cnt";"Pri";"Dur"
"230020100010125";"+28491628975809";3;332;2391
"230020100010125";"+28491723744868";1;12;75
etc...

with the first row being the colnames of the resulting data frame.

<R-1.3.1>
In a 1.3.1 session:

> data(alerts); str(alerts$IMSI)

gives

 num [1:2793] 2.3e+14 2.3e+14 2.3e+14 2.3e+14 2.3e+14 ...

> str(as.character(alerts$IMSI))

gives

 chr [1:2793] "230020100010125" "230020100010125" "230020100010125" ...

and

> n <- length(unique(alerts$IMSI)); n

gives 125 (i.e. the data are read as they are).

</R-1.3.1>

<R-1.4.0>

while the same on 1.4.0 gives

 int [1:2793] 2147483647 2147483647 2147483647 ...

and

> n <- length(unique(alerts$IMSI)); n

gives 1 (i.e. it reflects the conversion of the data to int mode, which destroys the information about the IMSI numbers, which are always 15-digit numbers).

</R-1.4.0>
</example>

I was unable to find any comment on this new behaviour of data() in http://cran.r-project.org/src/base/NEWS. What I found was:

---
read.table() has new arguments `nrows' and `colClasses'. If the
latter is NA (the default), conversion is attempted to
logical, integer, numeric or complex, not just to numeric
---

Should I use read.table() with colClasses specified (instead of data())?

Why not, but this involves lots of "hand-made" changes to my R scripts, which is unpleasant and carries the risk of typos and so on.

Is there some more "systematic" way to solve this problem?

> version

platform i386-pc-mingw32
arch     x86
os       Win32
system   x86, Win32
status
major    1
minor    4.0
year     2001
month    12
day      19
language R

Thanks In Advance,
Jan

-------------------------------------------------
designed for _monospaced_ font
-------------------------------------------------
/- Jan Svatos, PhD          Sokolovska 855/225 -/
/- Data Analyst,            Prague 9           -/
/- Eurotel Praha            190 00             -/
/- jan_svatos at eurotel.cz  Czechia            -/
-------------------------------------------------

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
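The limit mentioned in the message can be checked directly. The following is only a minimal sketch, assuming a current R installation; the variable names are illustrative and not from the original post. It shows that a 15-digit IMSI lies far above the 32-bit integer range but still fits exactly in a double.

```r
# Largest value an R integer can hold: 2^31 - 1.
int_max <- .Machine$integer.max

# A 15-digit IMSI taken from the example file, kept as text.
imsi <- "230020100010125"

# Converted to a double, the value is above the integer range ...
imsi_num <- as.numeric(imsi)
imsi_num > int_max

# ... but below 2^53, so the double represents it exactly and
# printing it without an exponent gives back the original digits.
imsi_num < 2^53
sprintf("%.0f", imsi_num) == imsi
```

This is why the "num" mode of 1.3.1 preserved the distinct values while an "int" conversion cannot.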
This is nothing to do with data(). data() uses read.table to read .csv files, and that *is* in its help file! Also, these fields are neither numeric nor integer but strings, so you can't expect the standard methods to make sense of them.

What `Writing R Extensions' recommends you should do is to read them in once, correctly, then dump them as .rda files. *Then* data() will work as you expected. If you use compression the files might be much smaller, too.

I'm not clear why type.convert is not objecting to overflowing integers, but that will depend on the implementation of strtol on your platform. We might manage to improve it. But in any case I think you ought to read these fields as character.

On Thu, 3 Jan 2002 Jan_Svatos at eurotel.cz wrote:

> [...]
> Should I use read.table() with colClasses specified (instead of data())?
> [...]
> Is there some more "systematic" way to solve this problem?

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
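The advice in the reply above (read the fields once as character, then store the result as an .rda file) might look like the following minimal sketch. The temporary file paths, the two-row sample, and the choice of "integer" for the Cnt, Pri and Dur columns are assumptions made for illustration; load() is used here in place of data() so that the sketch does not depend on where data() searches for files.

```r
# Write a two-row sample in the layout of alerts.csv
# (rows copied from the example in the original message).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"IMSI";"DialedDigits";"Cnt";"Pri";"Dur"',
  '"230020100010125";"+28491628975809";3;332;2391',
  '"230020100010125";"+28491723744868";1;12;75'
), csv_path)

# Read once, forcing the two identifier columns to character.
# The classes of the remaining columns are an assumption.
alerts <- read.table(csv_path, header = TRUE, sep = ";",
                     colClasses = c("character", "character",
                                    "integer", "integer", "integer"))

# Store the data frame as a compressed .rda file.
rda_path <- tempfile(fileext = ".rda")
save(alerts, file = rda_path, compress = TRUE)

# Later sessions restore the object under its original name,
# with the column types already fixed.
rm(alerts)
load(rda_path)
str(alerts$IMSI)
```

Once the .rda file exists, scripts no longer need per-call colClasses arguments, which addresses the concern about hand-made changes.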
Thanks to Prof. Ripley for the quick and useful answer.

Yes, I will either transform the data-acquiring tool to get the columns as numbers, not character, or read them as character and then manage them with as.factor().

Jan

- - - Original message: - - -
From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
Send: 1/3/02 1:20:11 PM
To: <Jan_Svatos at eurotel.cz> <r-help at stat.math.ethz.ch>
Subject: Re: [R] Different behaviour of data()

[...] But in any case I think you ought to read these fields as character. [...]
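The second option in the follow-up (read as character, then handle with as.factor()) could be sketched as below. This is an illustration only: the third IMSI value is hypothetical, added so that the sample contains more than one distinct identifier.

```r
# Three IMSI strings; the third is a hypothetical value.
imsi <- c("230020100010125", "230020100010125", "230020100010126")

# As a factor, each distinct IMSI becomes one level.
imsi_f <- as.factor(imsi)

# Counting distinct identifiers works on the strings or on the levels.
n <- length(unique(imsi))
nlevels(imsi_f) == n

# For contrast: values clamped to the integer maximum, as in the
# 1.4.0 output quoted in the thread, collapse to a single value.
clamped <- rep(.Machine$integer.max, length(imsi))
length(unique(clamped))
```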