thr3ads.net - R devel - [Rd] data frame subset patch, take 2 [Dec 2006]

If this information is useful, please help other people find it:
Share via:

Vladimir Dergachev

2006-Dec-06 18:33 UTC

[Rd] data frame subset patch, take 2

Hi Robert,

Here is the second iteration of data frame subset patch.
It now passes make check on both 2.4.0 and 2.5.0 (svn as of a few days ago).
Same speedup as before.

Changes:

	* Introduced two new functions .subassign2 and .subassign that are 
complimentary to .subset2 and .subset.

	* Changed x[[j]]<- assignment to x<-.subassign2(x, j, ..) to fix the
problem
with the previous patch.

                               thank you !

                                    Vladimir Dergachev

Marcus G. Daniels

2006-Dec-12 16:05 UTC

head link

[Rd] data frame subset patch, take 2

Vladimir Dergachev wrote:> Here is the second iteration of data frame subset patch.
> It now passes make check on both 2.4.0 and 2.5.0 (svn as of a few days
ago).
> Same speedup as before.
>   Hi,

I was wondering if this patch would make it into the next release.  I 
don't see it in SVN, but it's hard to be sure because the mailing list 
apparently strips attachments.  If it isn't in, or going to be in, is 
this patch available somewhere else?

Thanks,

Marcus

Vladimir Dergachev

2006-Dec-13 18:03 UTC

head link

[Rd] data frame subset patch, take 2

On Wednesday 13 December 2006 6:01 am, Martin Maechler
wrote:>
> - Vladimir, have you verified your 'take2' against recent versions
>   of R-devel?
Yes. 
>
> - If they still work, could you re-post them to R-devel, this
>   time using a proper MIME type,
>   i.e. most probably one of
>      application/x-tar
>      application/x-compressed-tar
>      application/x-gzip
>
>   In case you don't know how to achieve this,
>   I'd be interested to get it by "private" e-mail.
No problem. The old e-mail did have a mime type: "text/x-diff".
I am resending the patch - now compressed, hopefully it will get pass whatever 
filters are in place.

With regard to speedups in R, here is my wish list - I would greatly 
appreciate comments on what makes sense here or not, etc:

    1. I greatly miss equivalents of Tcl append and lappend commands - not
the function performed by these commands but efficiency (they are O(1) on 
average). Tcl easily handles lists with 1e6 components and strings of 10s of 
megabytes in length.

    2. It would be nice to have true hashed arrays in R (i.e. O(1) access 
times). So far I have used named lists for this, but they are O(n):
> L<-list(); system.time(for(i in 1:10000)L[[paste(i)]]<-i);
[1] 2.864 0.004 2.868 0.000 0.000> L<-list(); system.time(for(i in 1:20000)L[[paste(i)]]<-i);[1] 11.789  0.216 12.004  0.000  0.000

    3. Efficient manipulation of large numbers of strings. The big reason 
character row.names are slow is because they require a large number of string 
objects that slow down garbage collector.

    This is possibly not a problem that has an easy solution, here are a 
couple of approaches I have considered:

        a) Inline strings - use a structure like
		union {
			struct { 
				unsigned char size;
				char body[15];
				} inlined_string;  /* use this when size<16 */

			struct {
				unsigned char flag;  
				char reserved[7]; /* for 64 bit */
				CHRSXP ptr;
				} indirect_string; /* use this when flag=255 */
			}

             This basically turns small strings into an enum type stored 
within a 128-bit integer. This would greatly decrease required number of 
CHRSXP in many common cases (in particular for many rownames). 

             The biggest disadvantage is more complicated access to string 
data. Also this does not solve the issue of how to deal with 1e6 long 
strings - though I feel like 15 characters should be good enough for most 
uses.

	b) CHRSXPs are always leaf nodes. One could implement true reference counting
and create a separate garbage collector pool for them. This way one can rely 
on reference counting to free string objects during normal operation, but 
also keep track of the number of referenced strings during garbage collector 
passes - and trigger string garbage collection passes (with a warning) when 
the number of referenced strings is much smaller the number of objects in 
string pool.

            This gets rid of overhead that strings impose on garbage 
collector. The disadvantage are very large changes to R code.

                            best

                                Vladimir Dergachev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: subset.patch.2.diff.gz
Type: application/x-gzip
Size: 1474 bytes
Desc: not available
Url :
https://stat.ethz.ch/pipermail/r-devel/attachments/20061213/33598b1d/attachment.gz

Marcus G. Daniels

2006-Dec-13 18:23 UTC

head link

[Rd] data frame subset patch, take 2

Vladimir Dergachev wrote:
>     2. It would be nice to have true hashed arrays in R (i.e. O(1) access 
> times). So far I have used named lists for this, but they are O(n):
>   new.env(hash=TRUE) with get/assign/exists works ok.  But I suspect its 
just too easy to use named lists because it is easy, and that has bad 
performance ramifications for user code (perhaps the R developers are 
more vigilant about this for the R code itself).

Vladimir Dergachev

2006-Dec-13 19:11 UTC

head link

[Rd] data frame subset patch, take 2

On Wednesday 13 December 2006 1:23 pm, Marcus G. Daniels
wrote:> Vladimir Dergachev wrote:
> >     2. It would be nice to have true hashed arrays in R (i.e. O(1)
access
> > times). So far I have used named lists for this, but they are O(n):
>
> new.env(hash=TRUE) with get/assign/exists works ok.  But I suspect its
> just too easy to use named lists because it is easy, and that has bad
> performance ramifications for user code (perhaps the R developers are
> more vigilant about this for the R code itself).
Cool, thank you ! 

I wonder whether environments could be extended to allow names() to work 
(altough I see that ls() does the same function) and to allow for(i in E) 
loops.

                       thank you

                               Vladimir Dergachev

Reasonably Related Threads

Search for more reasonably related threads

R devel - Dec 2006 - data frame subset patch, take 2

[Rd] data frame subset patch, take 2

[Rd] data frame subset patch, take 2

[Rd] data frame subset patch, take 2

[Rd] data frame subset patch, take 2

[Rd] data frame subset patch, take 2

Reasonably Related Threads