thr3ads.net - R devel - [Rd] Some R questions [Oct 2006]

Vladimir Dergachev

2006-Oct-31 19:24 UTC

[Rd] Some R questions

Hi all, 

   I am working with some large data sets (1-4 GB) and have some questions
that I hope someone can help me with:

   1.  Is there a way to turn off the garbage collector from within the C
       interface? What I am trying to do is pull data from MySQL (using my
       own C functions), and I see that allocating each column (with about
       1-4 million items) takes between 0.5 and 1 seconds. My first thought
       was that it would be nice to turn off the garbage collector, allocate
       all the data, copy the values, and then turn the garbage collector
       back on.

   2.  For creating a STRSXP, should I be using mkChar() or mkString() to
       create element values? Is there a way to do it without allocating a
       cons cell? (Otherwise a single STRSXP of length 1e6 slows down the
       garbage collector.)

   3.  Is the "row.names" attribute required for data frames and, if so, can
       I use some other type besides STRSXP?

   4.  While poking around to find out why some of my code is excessively
       slow, I came upon the definition of `[.data.frame`, the subscripting
       operator for data frames, which appears to be written in R. I am
       wondering whether I am looking at the right place and whether anyone
       would be interested in a piece of C code optimizing it; in
       particular, extraction of a single element is quite slow (i.e. calls
       like T[i, j]).

                   thank you very much !

                                 Vladimir Dergachev

miguel manese

2006-Nov-01 02:30 UTC


[Rd] Some R questions

Hi,

I had some experience with this while working on SQLiteDF...

On 11/1/06, Vladimir Dergachev <vdergachev at rcgardis.com> wrote:
> Hi all,
>
>    I am working with some large data sets (1-4 GB) and have some questions
> that I hope someone can help me with:
>
>    1.  Is there a way to turn off the garbage collector from within the C
>        interface? What I am trying to do is pull data from MySQL (using my
>        own C functions), and I see that allocating each column (with about
>        1-4 million items) takes between 0.5 and 1 seconds. My first thought
>        was that it would be nice to turn off the garbage collector, allocate
>        all the data, copy the values, and then turn the garbage collector
>        back on.

I believe not. FWIW, a numeric() vector is a chunk of memory with a
VECTOR_SEXPREC header followed by your data, contiguously allocated. If
you are desperate enough, and assuming the garbage collector is indeed the
culprit, you may want to implement your own lightweight allocVector
(the function that NEW_NUMERIC(), etc. expand to).
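
As a rough illustration of the "header followed by contiguous data" layout
mentioned above, here is a plain C sketch. The toy_vector type and the
toy_alloc/toy_fill helpers are hypothetical names made up for this example;
this is not R's actual header layout or its allocVector, and it does none of
the bookkeeping the real allocator does for the garbage collector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a vector header; NOT R's real layout. */
typedef struct {
    size_t length;   /* number of elements in the payload */
    double data[];   /* payload stored directly after the header */
} toy_vector;

/* One malloc covers header and payload, so a whole column is a single
 * allocation. Returns NULL on overflow or allocation failure. */
toy_vector *toy_alloc(size_t n)
{
    if (n > (SIZE_MAX - sizeof(toy_vector)) / sizeof(double))
        return NULL;
    toy_vector *v = malloc(sizeof(toy_vector) + n * sizeof(double));
    if (v == NULL)
        return NULL;
    v->length = n;
    return v;
}

/* Set every element of the payload to the same value. */
void toy_fill(toy_vector *v, double value)
{
    assert(v != NULL);
    for (size_t i = 0; i < v->length; i++)
        v->data[i] = value;
}
```

The point of the sketch is only that the cost of a column is one allocation
of header plus n elements; whether the garbage collector is really what makes
the real allocation slow would still need to be measured.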

>    2.  For creating a STRSXP, should I be using mkChar() or mkString() to
>        create element values? Is there a way to do it without allocating a
>        cons cell? (Otherwise a single STRSXP of length 1e6 slows down the
>        garbage collector.)

A string vector (STRSXP) is composed of CHARSXPs. mkChar makes a
CHARSXP, and mkString makes a STRSXP with one CHARSXP; it is more like
shorthand for

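SEXP str = PROTECT(NEW_CHARACTER(1));   /* protect: mkChar() allocates and may trigger GC */
SET_STRING_ELT(str, 0, mkChar("foo"));
UNPROTECT(1);   /* balance the PROTECT before returning str */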

>    3.  Is the "row.names" attribute required for data frames and, if so, can
>        I use some other type besides STRSXP?

It is required. It can be an integer vector as of R 2.4.0.

>    4.  While poking around to find out why some of my code is excessively
>        slow, I came upon the definition of `[.data.frame`, the subscripting
>        operator for data frames, which appears to be written in R. I am
>        wondering whether I am looking at the right place and whether anyone
>        would be interested in a piece of C code optimizing it; in
>        particular, extraction of a single element is quite slow (i.e. calls
>        like T[i, j]).

[.data.frame is such a pain to implement because there are just too
many ways to index a data frame. You may want to write a specialized
indexer that handles only the indexing styles you use. But I think
you are just not vectorizing enough. If you have to access your data
frames like that, then it must be inside some loop, which would kill
your social life.
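
To make the two suggestions above concrete, here is a minimal plain C
sketch of a specialized indexer next to a whole-column operation. The
toy_frame type and both functions are hypothetical and assume an
all-numeric table stored as an array of column pointers; a real data frame
is an R list with attributes, so this only shows the shape of the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical column-major table: an array of pointers to numeric columns. */
typedef struct {
    size_t nrow;
    size_t ncol;
    double **cols;   /* cols[j] points to nrow doubles */
} toy_frame;

/* Single-element lookup with 1-based (i, j), mirroring T[i, j] in R.
 * Only this one indexing style is supported, which keeps it trivial. */
double toy_frame_get(const toy_frame *t, size_t i, size_t j)
{
    assert(i >= 1 && i <= t->nrow);
    assert(j >= 1 && j <= t->ncol);
    return t->cols[j - 1][i - 1];
}

/* Column sum: one pass over a contiguous column, the C analogue of
 * working on a whole column at once instead of element by element. */
double toy_frame_colsum(const toy_frame *t, size_t j)
{
    assert(j >= 1 && j <= t->ncol);
    const double *col = t->cols[j - 1];
    double sum = 0.0;
    for (size_t i = 0; i < t->nrow; i++)
        sum += col[i];
    return sum;
}
```

Supporting a single indexing style removes the dispatch over all the other
ways a data frame can be indexed, and the column-wise function avoids paying
any per-element lookup cost inside a loop.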

<pimp-my-project>
Or, you may just use (and pour your effort into improving) SQLiteDF
http://cran.r-project.org/src/contrib/Descriptions/SQLiteDF.html
</pimp-my-project>

M. Manese
