A package that I develop (xcms) sometimes needs to read and process vectors several hundred megabytes in size. (They only represent parts of large data sets, which can approach 100GB.) Unfortunately, R sometimes hits the 2GB memory limit of Win32.

To help cut the memory footprint in half, I'm implementing a "float" class as a subclass of "raw". Because almost all the computation on the large vectors is done in C code, having a somewhat limited single-precision data type is okay.

I've run into a limitation with the .C() function where it does not handle raw vectors, which it will do in 2.2.0. In the meantime, I'm using the .Call() function to access the raw vectors. However, there don't seem to be any macros for handling raw vectors in Rdefines.h. I've made a guess at what those macros would be and was wondering whether my guesses were correct and/or might be included in 2.2.0:

#define NEW_RAW(n) allocVector(RAWSXP,n)
#define RAW_POINTER(x) (RAW(x))
#define AS_RAW(x) coerceVector(x,RAWSXP)

I'm not sure whether coerceVector(x,RAWSXP) will actually work. Also, there isn't an Rf_isRaw() function, which would be useful for an IS_RAW(x) macro.

Another issue with the "float" class is that it will run into endian issues if it ever gets saved to disk and moved cross-platform. I don't really anticipate that happening but it might be nice to incorporate serialization hooks if possible. Are there any facilities in R for doing that?

Thanks for any feedback or suggestions.

-Colin

http://abagyan.scripps.edu/~csmith/float.R
http://abagyan.scripps.edu/~csmith/float.c
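To make the endian concern above concrete, here is a minimal sketch in plain C (no R headers) of what a cross-platform fix-up could look like for single-precision values kept in a byte buffer. The function names host_is_little_endian and swap_float_bytes are hypothetical and do not come from float.c or from R; the sketch also assumes a 4-byte float, which it checks at compile time.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Assumption: single-precision values occupy exactly 4 bytes. */
static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1;
}

/* Hypothetical helper: reverse the byte order of each 4-byte element
   in place. Applying it twice restores the original buffer. */
void swap_float_bytes(unsigned char *buf, size_t n_floats)
{
    for (size_t i = 0; i < n_floats; i++) {
        unsigned char *p = buf + 4 * i;
        unsigned char t;
        t = p[0]; p[0] = p[3]; p[3] = t;
        t = p[1]; p[1] = p[2]; p[2] = t;
    }
}
```

A reader of the file would call swap_float_bytes only when the byte order recorded at write time differs from host_is_little_endian() on the reading machine; how that flag is recorded is left open here.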
Prof Brian Ripley
2005-Aug-22 10:38 UTC
[Rd] Implementing a single-precision class with raw
On Fri, 19 Aug 2005, Colin A. Smith wrote:

> A package that I develop (xcms) sometimes needs to read and process
> vectors several hundreds of megabytes in size. (They only represent
> parts of a large data sets which can approach nearly 100GB.)
> Unfortunately, R sometimes hits the 2GB memory limit of Win32.

The rw-FAQ explains why that is _not_ the limit!

> To help cut the memory footprint in half, I'm implementing a "float"
> class as a subclass of "raw".

Why via "raw"? I believe the intention is that this sort of thing be done via external references, but as float and int are the same size on all current platforms, I would have considered R integers for storage. Then, for example, subsetting would work and you would have a 4x larger size limit on 64-bit platforms. (You would also get automatic handling of endianness.)

> Because almost all the computation on the large vectors is done in C
> code, having a somewhat limited single-precision data type is okay.
>
> I've run into a limitation with the .C() function where it does not
> handle raw vectors, which it will do in 2.2.0.

That is just not true!

> In the meantime, I'm using the .Call() function to access the raw
> vectors. However, there don't seem to be any macros for handling raw
> vectors in Rdefines.h.

So? We recommend using Rinternals.h: Rdefines.h is a compatibility wrapper for macros from S4. The raw type has not attempted to be compatible with S4, and we are not aware of any user who has compiled S4 code using raw vectors that (s)he wishes to port to R.

(The R-exts.texi manual has been rather too optimistic about Rdefines.h: as you need to use SET_STRING_ELT and SET_VECTOR_ELT in R, you are rather limited as to what you can do in S4 style. This has been so since R 1.2.0 and Rdefines.h has hardly been updated since.)

> I've made a guess at what those macros would be and was wondering
> whether my guesses were correct and/or might be included in 2.2.0:
>
> #define NEW_RAW(n) allocVector(RAWSXP,n)
> #define RAW_POINTER(x) (RAW(x))
> #define AS_RAW(x) coerceVector(x,RAWSXP)
>
> I'm not sure whether coerceVector(x,RAWSXP) will actually work.

You should have read the code to find out (people answering your comment would have had to). It will `actually work', but it may not do whatever it is that you expect. (It interprets its input as integer (decimal if a string) representations of the bytes.)

This is in contrast to S, where I have no idea precisely what AS_RAW is supposed to do and no code to read. (as(, "raw") seems to do weird and unpredictable things, though, and the Green Book suggests that coercion probably is not intended to work.)

For completeness I have added my (informed) guesses to Rdefines.h in R-devel.

> Also, there isn't an Rf_isRaw() function, which would be useful for an
> IS_RAW(x) macro.

Why would this be necessary? TYPEOF(x) == RAWSXP is all that is needed.

> Another issue with the "float" class is that it will run into endian
> issues if it ever gets saved to disk and moved cross-platform. I don't
> really anticipate that happening but it might be nice to incorporate
> serialization hooks if possible. Are there any facilities in R for
> doing that?

See the comment above.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
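The suggestion in the reply to use integer storage, on the grounds that float and int are the same size, might look roughly like the following in plain C. This is only a sketch under stated assumptions: it uses int32_t as a stand-in for the element type of an R integer vector, needs no R headers, and the names floats_to_ints and ints_to_floats are hypothetical rather than anything from xcms or R. memcpy copies the bit patterns without converting the values.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The whole approach rests on this size equality, so check it. */
static_assert(sizeof(float) == sizeof(int32_t),
              "float and int32_t must be the same size");

/* Hypothetical helper: store the bit pattern of each float in an
   integer slot (no numeric conversion takes place). */
void floats_to_ints(const float *src, int32_t *dst, size_t n)
{
    memcpy(dst, src, n * sizeof(float));
}

/* Hypothetical helper: recover the floats from integer storage. */
void ints_to_floats(const int32_t *src, float *dst, size_t n)
{
    memcpy(dst, src, n * sizeof(float));
}
```

One caveat with this layout: the bit pattern of -0.0f is the same as INT_MIN, which R uses for NA_integer_, so such elements would show up as NA if the vector were inspected at the R level.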