Hi,

To speed up reading of large (few million lines) CSV files I am writing custom read functions (in C). By timing various approaches I figured out that one of the bottlenecks in reading character fields is the mkChar() function which on each call incurs a lot of garbage-collection-related overhead.

I wonder if there is a "vectorized" version of mkChar, say mkChar2(char **, int length) that converts an array of C strings to a string vector, which somehow amortizes the gc overhead over the entire array?

If no such function exists, I'd appreciate any hint as to how to write it.

Thanks,
Vadim

"Vadim Ogranovich" <vograno at evafunds.com> writes:

> Hi,
>
> To speed up reading of large (few million lines) CSV files I am writing
> custom read functions (in C). By timing various approaches I figured out
> that one of the bottlenecks in reading character fields is the mkChar()
> function which on each call incurs a lot of garbage-collection-related
> overhead.
>
> I wonder if there is a "vectorized" version of mkChar, say mkChar2(char
> **, int length) that converts an array of C strings to a string vector,
> which somehow amortizes the gc overhead over the entire array?
>
> If no such function exists, I'd appreciate any hint as to how to write
> it.

The real issue here is that character vectors are implemented as generic vectors of little R objects (CHARSXP type) that each hold one string. Allocating all those objects is probably what does you in.

The reason behind the implementation is probably that doing it that way allows the mechanics of the garbage collector to be applied directly (CHARSXPs are just vectors of bytes), but it is obviously wasteful in terms of total allocation. If you can think up something better, please say so (but remember that the memory management issues are nontrivial).

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907

On Tue, 8 Jun 2004 12:23:58 -0700, "Vadim Ogranovich" <vograno at evafunds.com> wrote :

>Hi,
>
>To speed up reading of large (few million lines) CSV files I am writing
>custom read functions (in C). By timing various approaches I figured out
>that one of the bottlenecks in reading character fields is the mkChar()
>function which on each call incurs a lot of garbage-collection-related
>overhead.
>
>I wonder if there is a "vectorized" version of mkChar, say mkChar2(char
>**, int length) that converts an array of C strings to a string vector,
>which somehow amortizes the gc overhead over the entire array?
>
>If no such function exists, I'd appreciate any hint as to how to write
>it.

It's not easy. Internally R strings always have a header at the front, so you need to allocate memory and move C strings to get R to understand them.

Duncan Murdoch
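
The header-plus-data layout described above can be modelled in plain C. This is only a toy sketch: `toy_string` and `toy_string_new` are hypothetical names invented here, and the struct is not R's actual CHARSXP definition. It is meant to show why an existing C string cannot simply be adopted in place: the header and the bytes live in one block, so the bytes have to be copied into freshly allocated memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a string object: a fixed header
   immediately followed by the character data.
   This is NOT R's real CHARSXP layout. */
typedef struct {
    size_t length;   /* number of bytes, excluding the terminator */
    char   data[];   /* flexible array member: bytes follow the header */
} toy_string;

/* Allocate header and data as one block, then copy the C string in.
   The length has to be known before the allocation is made.
   Returns NULL if malloc fails; the caller frees the result. */
toy_string *toy_string_new(const char *cstr)
{
    assert(cstr != NULL);
    size_t len = strlen(cstr);
    toy_string *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->length = len;
    memcpy(s->data, cstr, len + 1);   /* includes the '\0' */
    return s;
}
```

Each call costs one allocation plus one copy, which is the per-string overhead the thread is about.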
I am no expert in memory management in R so it's hard for me to tell
what is and what is not doable. From reading the code of allocVector()
in memory.c I think that the critical part is to vectorize
CLASS_GET_FREE_NODE and use the vectorized version along the lines of
the code fragment below (taken from memory.c).
	if (node_class < NUM_SMALL_NODE_CLASSES) {
	    CLASS_GET_FREE_NODE(node_class, s); 

If this is possible then the rest is just a matter of code refactoring.

By vectorizing I mean writing a macro CLASS_GET_FREE_NODE2(node_class,
s, n) which in one go allocates n little objects of class node_class and
"inscribes" them into the elements of vector s, which is assumed to be
long enough to hold these objects.

If this is doable then the only missing piece would be a new function
setChar(CHARSXP rstr, const char * cstr) which copies 'cstr' into 'rstr'
and (re)allocates the heap memory if necessary. Here setChar() is safe
since the s[i] are all brand new and thus are not shared with any other
object.
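
As a rough illustration of the "take n nodes in one go" idea, here is a toy free-list pool in plain C. It says nothing about how R's allocator in memory.c actually works; `toy_node`, `toy_pool`, `toy_pool_init` and `toy_pool_get_n` are hypothetical names made up for this sketch, and garbage collection is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size node; R's real nodes look nothing like this. */
typedef struct toy_node {
    struct toy_node *next;   /* link used only while the node is free */
    char payload[32];
} toy_node;

typedef struct {
    toy_node *free_list;
    size_t    free_count;
} toy_pool;

/* Thread every element of caller-provided storage onto the free list. */
void toy_pool_init(toy_pool *pool, toy_node *storage, size_t n)
{
    assert(pool != NULL && (storage != NULL || n == 0));
    pool->free_list = NULL;
    pool->free_count = 0;
    for (size_t i = 0; i < n; i++) {
        storage[i].next = pool->free_list;
        pool->free_list = &storage[i];
        pool->free_count++;
    }
}

/* Take n nodes in one call and store them in out[0..n-1].
   All-or-nothing: returns 1 on success, or 0 (pool left untouched)
   if fewer than n nodes are free. */
int toy_pool_get_n(toy_pool *pool, toy_node **out, size_t n)
{
    assert(pool != NULL && (out != NULL || n == 0));
    if (n > pool->free_count)
        return 0;
    toy_node *node = pool->free_list;
    for (size_t i = 0; i < n; i++) {
        out[i] = node;
        node = node->next;
    }
    pool->free_list = node;
    pool->free_count -= n;
    return 1;
}
```

The batch call still walks the list once per node, so the saving is mostly in bookkeeping. A real allocator would also have to decide what to do when the free list runs short, which this sketch sidesteps by simply refusing the request.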
Hello everyone,

This is my first message to the list and I believe the question I am including is a simple one. I have a matrix where I need to calculate ANOVA for the rows, as each column represents a different treatment. I would like to know if there is a command or a series of commands that I can enter to do that. At the moment I have an external script that extracts each row from the matrix, transforms it into a column, adds another factor column, and passes the resulting text file to Rterm --vanilla.

Any help is appreciated.

Thanks a lot,
Paulo Nuin
Thank you for the lead, Peter. It may be useful for other packages I write.

As to the strings, I think I have to take what is already there. I agree that strings would be better managed in malloc-style fashion (probably with a reference counter) and not by gc(). However, I don't want to have a system with two different string classes; such close relatives seldom coexist peacefully.

BTW, the slowness of mkChar explains why R is so slow when it needs to compute names for long vectors.

Thank you for an interesting discussion,
Vadim

> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk]
> Sent: Tuesday, June 08, 2004 3:35 PM
> To: Vadim Ogranovich
> Cc: R-Help
> Subject: Re: [R] fast mkChar
>
> "Vadim Ogranovich" <vograno at evafunds.com> writes:
>
> > I am no expert in memory management in R so it's hard for me to tell
> > what is and what is not doable. From reading the code of allocVector()
> > in memory.c I think that the critical part is to vectorize
> > CLASS_GET_FREE_NODE and use the vectorized version along the lines of
> > the code fragment below (taken from memory.c).
> >
> > 	if (node_class < NUM_SMALL_NODE_CLASSES) {
> > 	    CLASS_GET_FREE_NODE(node_class, s);
> >
> > If this is possible then the rest is just a matter of code refactoring.
> >
> > By vectorizing I mean writing a macro CLASS_GET_FREE_NODE2(node_class,
> > s, n) which in one go allocates n little objects of class node_class
> > and "inscribes" them into the elements of vector s, which is assumed
> > to be long enough to hold these objects.
> >
> > If this is doable then the only missing piece would be a new function
> > setChar(CHARSXP rstr, const char * cstr) which copies 'cstr' into 'rstr'
> > and (re)allocates the heap memory if necessary. Here the setChar()
> > macro is safe since s[i]-s are all brand new and thus are not shared
> > with any other object.
>
> I had a similar idea initially, but I don't think it can fly:
>
> First, allocating n objects at once is not likely to be much
> faster than allocating them one-by-one, especially when you
> consider the implications of having to deal with
> near-out-of-memory conditions.
>
> Second, you have to know the string lengths when allocating,
> since the structure of a vector object (CHARSXP) is a header
> immediately followed by the data.
>
> A more interesting line to pursue is that - depending on what
> it really is that you need - you might be able to create a
> different kind of object that could "walk and quack" like a
> character vector, but is stored differently internally. E.g.
> you could set up a representation that is just a block of
> pointers, pointing to strings that are being maintained in
> malloc-style.
>
> Have a look at External pointers and finalization.
>
> --
>    O__  ---- Peter Dalgaard             Blegdamsvej 3
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
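
The "block of pointers to malloc-style strings" representation suggested in the quoted message could be sketched as follows, in plain C with no R headers. The names `str_block`, `str_block_new` and `str_block_free` are hypothetical. The assumption here is that, in an actual package, a pointer to such a block would be wrapped in an R external pointer and `str_block_free` would be called from its finalizer; that wiring is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: n malloc'ed C strings behind one pointer block. */
typedef struct {
    size_t  n;
    char  **items;
} str_block;

/* Release every string, the pointer block, and the container itself.
   Safe on NULL and on a partially filled block. This is the routine
   a finalizer would call (assumption: finalizer wiring not shown). */
void str_block_free(str_block *b)
{
    if (b == NULL)
        return;
    if (b->items != NULL) {
        for (size_t i = 0; i < b->n; i++)
            free(b->items[i]);        /* free(NULL) is a no-op */
    }
    free(b->items);
    free(b);
}

/* Copy n C strings into a new block. Returns NULL on allocation failure. */
str_block *str_block_new(const char **src, size_t n)
{
    assert(src != NULL || n == 0);
    str_block *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->n = n;
    /* calloc zeroes the slots so a partial block can be freed safely */
    b->items = calloc(n ? n : 1, sizeof *b->items);
    if (b->items == NULL) {
        free(b);
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(src[i]);
        b->items[i] = malloc(len + 1);
        if (b->items[i] == NULL) {
            str_block_free(b);
            return NULL;
        }
        memcpy(b->items[i], src[i], len + 1);
    }
    return b;
}
```

This keeps the strings out of the garbage collector's node pool, at the price Vadim mentions: it is a second kind of string container that ordinary character-vector code would not understand without conversion.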