thr3ads.net - R devel - [Rd] Function 'factor' issues [Mar 2018]

If this information is useful, please help other people find it:
Share via:

Suharto Anggono Suharto Anggono

2018-Mar-24 00:52 UTC

[Rd] Function 'factor' issues

I am trying once again.

By just changing
f <- match(xlevs[f], nlevs)
to
f <- match(xlevs, nlevs)[f]
, function 'factor' in R devel could be made more consistent and
back-compatible. Why not picking it?

--------------------------------------------
On Sat, 25/11/17, Suharto Anggono Suharto Anggono <suharto_anggono at
yahoo.com> wrote:

 Subject: Re: [Rd] Function 'factor' issues
 To: r-devel at r-project.org
 Date: Saturday, 25 November, 2017, 6:03 PM
>From commits to R devel, I saw attempts to speed up subsetting and
'match', and to cache results of conversion of small nonnegative
integers to character string. That's good.
I am sorry for pushing, still.

Is the partial new behavior of function 'factor' with respect to NA
really worthy?

match(xlevs, nlevs)[f]? looks nice, too.

- Using
f <- match(xlevs, nlevs)[f]
instead of
f <- match(xlevs[f], nlevs)
for remapping
- Remapping only if length(nlevs) differs from length(xlevs)
Applying changes similar to above to function 'levels<-.factor' will
not change 'levels<-.factor' result at all. So, the corresponding
part of functions 'factor' and 'levels<-.factor' can be kept
in sync.

--------------------------------------------
On Sun, 22/10/17, Suharto Anggono Suharto Anggono <suharto_anggono at
yahoo.com> wrote:

Subject: Re: [Rd] Function 'factor' issues
To: r-devel at r-project.org
Date: Sunday, 22 October, 2017, 6:43 AM

My idea (like in https://bugs.r-project.org/bugzilla/attachment.cgi?id=1540 ):
- For remapping, use
f <- match(xlevs, nlevs)[f]
instead of
f <- match(xlevs[f], nlevs)
(I have mentioned it).
- Remap only if length(nlevs) differs from length(xlevs) .

[snip]

--------------------------------------------
On Wed, 18/10/17, Martin Maechler <maechler at stat.math.ethz.ch> wrote:

Subject: Re: [Rd] Function 'factor' issues

Cc: r-devel at r-project.org
Date: Wednesday, 18 October, 2017, 11:54 PM
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at
r-project.org>
>>>>>? ? on Sun, 15 Oct 2017 16:03:48 +0000 writes:

? ? > In R devel, function 'factor' has been changed, allowing and
merging duplicated 'labels'.

Indeed.? That had been asked for and discussed a bit on this
list from June 14 to June 23, starting at
? https://stat.ethz.ch/pipermail/r-devel/2017-June/074451.html

? ? > Issue 1: Handling of specified 'labels' without duplicates is
slower than before.
? ? > Example:
? ? > x <- rep(1:26, 40000)
? ? > system.time(factor(x, levels=1:26, labels=letters))

? ? > Function 'factor' is already rather slow because of conversion
to character. Please don't add slowdown.

Indeed, I doo see a ~ 20%? performance loss for the example
above, and I may get to look into this.
However, in R-devel there have been important internal
changes (ALTREP additions) some of which are currently giving
some performance losses in some cases (but they have the
potential to give big performance _gains_ e.g. for simple
indexing into large vectors which may apply here !).
For factor(), these C level "ALTREP" changes may not be the reason at
all for the slow down;
I may find time to investigate further.

{{ For the ALTREP-change slowdowns I've noticed in some
? indexing/subset operations, we'll definitely have time to look into
? before R-devel is going to be released next spring... and as mentioned,
? these operations may even become considerably faster *thanks*
? to ALTREP ... }}

? ? > Issue 2: While default 'labels' is 'levels', not
specifying 'labels' may be different from specifying 'labels' to
be the same as 'levels'.

? ? > Example 1:
? ? > as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude = NULL))
? ? > is different from
? ? > as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2, NA),
exclude = NULL))

You are right.? But this is not so exceptional and part of the new feature of
'labels' allowing to "fix up" things in such cases.? While it
would be nice if this was not the case the same phenomenon
happens in other functions as well because of lazy evaluation.
I think I had noticed that already and at the time found
"not easy" to work around.
(There are many aspects about changing such important base functions:
1. not breaking back compatibility ((unless in rare
? ? border cases, where we are sure it's worth))
2. Keeping code relatively transparent
3. Keep the semantics "simple" to document and as intuitive as
possible
)

? ? > File reg-tests-1d.R indicates that 'factor' behavior with NA is
slightly changed, for the better. NA entry (because it is unmatched to
'levels' argument or is in 'exclude') is absorbed into NA in
"levels" attribute (comes from 'labels' argument), if any. The
issue is that it happens only when 'labels' is specified.

I'm not sure anymore, but I think I had noticed that also in
June, considered to change it and found that such a changed
factor() would be too different from what it has "always been".
So, yes, IIRC, this current behavior is on purpose, if only for back
compatibility.

? ? > Function 'factor' could use match(xlevs, nlevs)[f]. It
doesn't match NA to NA level. When 'f' is long enough, longer than
'xlevs', it is faster than match(xlevs[f], nlevs).

? ? > Example 2:
? ? > With
? ? > levs <- c("A","A")? ,
? ? > factor(levs, levels=levs)
? ? > gives error, but
? ? > factor(levs, levels=levs, labels=levs)
? ? > doesn't.

yes, again that is a consequence of what you said above (before
'Example 1')

? ? > Note: In theory, if function 'factor' merged duplicated
'labels' in all cases, at least in
? ? > factor(c(sqrt(2)^2, 2))? ,
? ? > function 'factor' could do matching on original 'x'
(without conversion to character), as in R before version 2.10.0. If function
'factor' did it,
? ? > factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels =
c("sqrt(2)^2", "2"))
? ? > could take sqrt(2)^2 and 2 as distinct.

Well, that may be interesting.. but I doubt if that's somewhere
we should go, easily, because? factor() has been documented to do
what it does now (with very slightly rounding such numbers via as.character(.))
and hence such a change would typically lead to much work for
too many people.

I do see that indeed the? as.character(.) inside factor() takes
most of the CPU time used in largish factor() examples [as your
first], and indeed, for the case of integer 'x', we really could
be much faster in factor construction.? 

[snip]

Martin Maechler

2018-Mar-24 19:23 UTC

head link

[Rd] Function 'factor' issues

>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at
r-project.org>
>>>>>     on Sat, 24 Mar 2018 00:52:02 +0000 writes:
    > I am trying once again.

    > By just changing
    > f <- match(xlevs[f], nlevs)
    > to
    > f <- match(xlevs, nlevs)[f]
    > , function 'factor' in R devel could be made more consistent
and back-compatible. Why not picking it?

Thank you for persevering,

I'll have a hard look...  You have been right before ;-)
So I will check this small change for both  `factor`  and `levels<-.factor`

Martin

    > --------------------------------------------
    > On Sat, 25/11/17, Suharto Anggono Suharto Anggono <suharto_anggono
at yahoo.com> wrote:

    > Subject: Re: [Rd] Function 'factor' issues
    > To: r-devel at r-project.org
    > Date: Saturday, 25 November, 2017, 6:03 PM

    >> From commits to R devel, I saw attempts to speed up subsetting and
'match', and to cache results of conversion of small nonnegative
integers to character string. That's good.

    > I am sorry for pushing, still.

    > Is the partial new behavior of function 'factor' with respect
to NA really worthy?

    > match(xlevs, nlevs)[f]? looks nice, too.

    > - Using
    > f <- match(xlevs, nlevs)[f]
    > instead of
    > f <- match(xlevs[f], nlevs)
    > for remapping
    > - Remapping only if length(nlevs) differs from length(xlevs)
    > Applying changes similar to above to function
'levels<-.factor' will not change 'levels<-.factor' result
at all. So, the corresponding part of functions 'factor' and
'levels<-.factor' can be kept in sync.

    > --------------------------------------------
    > On Sun, 22/10/17, Suharto Anggono Suharto Anggono <suharto_anggono
at yahoo.com> wrote:

    > Subject: Re: [Rd] Function 'factor' issues
    > To: r-devel at r-project.org
    > Date: Sunday, 22 October, 2017, 6:43 AM

    > My idea (like in
https://bugs.r-project.org/bugzilla/attachment.cgi?id=1540 ):
    > - For remapping, use
    > f <- match(xlevs, nlevs)[f]
    > instead of
    > f <- match(xlevs[f], nlevs)
    > (I have mentioned it).
    > - Remap only if length(nlevs) differs from length(xlevs) .


    > [snip]

    > --------------------------------------------
    > On Wed, 18/10/17, Martin Maechler <maechler at stat.math.ethz.ch>
wrote:

    > Subject: Re: [Rd] Function 'factor' issues

    > Cc: r-devel at r-project.org
    > Date: Wednesday, 18 October, 2017, 11:54 PM
>>>>> Suharto Anggono Suharto Anggono via R-devel <r-devel at
r-project.org>    >>>>>> ? ? on Sun, 15 Oct 2017 16:03:48 +0000 writes:


    > ? ? > In R devel, function 'factor' has been changed,
allowing and merging duplicated 'labels'.

    > Indeed.? That had been asked for and discussed a bit on this
    > list from June 14 to June 23, starting at
    > ? https://stat.ethz.ch/pipermail/r-devel/2017-June/074451.html

    > ? ? > Issue 1: Handling of specified 'labels' without
duplicates is slower than before.
    > ? ? > Example:
    > ? ? > x <- rep(1:26, 40000)
    > ? ? > system.time(factor(x, levels=1:26, labels=letters))

    > ? ? > Function 'factor' is already rather slow because of
conversion to character. Please don't add slowdown.

    > Indeed, I doo see a ~ 20%? performance loss for the example
    > above, and I may get to look into this.
    > However, in R-devel there have been important internal
    > changes (ALTREP additions) some of which are currently giving
    > some performance losses in some cases (but they have the
    > potential to give big performance _gains_ e.g. for simple
    > indexing into large vectors which may apply here !).
    > For factor(), these C level "ALTREP" changes may not be the
reason at
    > all for the slow down;
    > I may find time to investigate further.

    > {{ For the ALTREP-change slowdowns I've noticed in some
    > ? indexing/subset operations, we'll definitely have time to look
into
    > ? before R-devel is going to be released next spring... and as
mentioned,
    > ? these operations may even become considerably faster *thanks*
    > ? to ALTREP ... }}

    > ? ? > Issue 2: While default 'labels' is 'levels',
not specifying 'labels' may be different from specifying
'labels' to be the same as 'levels'.

    > ? ? > Example 1:
    > ? ? > as.integer(factor(c(NA,2,3), levels = c(2, NA), exclude =
NULL))
    > ? ? > is different from
    > ? ? > as.integer(factor(c(NA,2,3), levels = c(2, NA), labels = c(2,
NA), exclude = NULL))

    > You are right.? But this is not so exceptional and part of the new
feature of
    > 'labels' allowing to "fix up" things in such cases.?
While it
    > would be nice if this was not the case the same phenomenon
    > happens in other functions as well because of lazy evaluation.
    > I think I had noticed that already and at the time found
    > "not easy" to work around.
    > (There are many aspects about changing such important base functions:
    > 1. not breaking back compatibility ((unless in rare
    > ? ? border cases, where we are sure it's worth))
    > 2. Keeping code relatively transparent
    > 3. Keep the semantics "simple" to document and as intuitive
as possible
    > )

    > ? ? > File reg-tests-1d.R indicates that 'factor' behavior
with NA is slightly changed, for the better. NA entry (because it is unmatched
to 'levels' argument or is in 'exclude') is absorbed into NA in
"levels" attribute (comes from 'labels' argument), if any. The
issue is that it happens only when 'labels' is specified.

    > I'm not sure anymore, but I think I had noticed that also in
    > June, considered to change it and found that such a changed
    > factor() would be too different from what it has "always
been".
    > So, yes, IIRC, this current behavior is on purpose, if only for back
compatibility.


    > ? ? > Function 'factor' could use match(xlevs, nlevs)[f]. It
doesn't match NA to NA level. When 'f' is long enough, longer than
'xlevs', it is faster than match(xlevs[f], nlevs).

    > ? ? > Example 2:
    > ? ? > With
    > ? ? > levs <- c("A","A")? ,
    > ? ? > factor(levs, levels=levs)
    > ? ? > gives error, but
    > ? ? > factor(levs, levels=levs, labels=levs)
    > ? ? > doesn't.

    > yes, again that is a consequence of what you said above (before
    > 'Example 1')

    > ? ? > Note: In theory, if function 'factor' merged
duplicated 'labels' in all cases, at least in
    > ? ? > factor(c(sqrt(2)^2, 2))? ,
    > ? ? > function 'factor' could do matching on original
'x' (without conversion to character), as in R before version 2.10.0. If
function 'factor' did it,
    > ? ? > factor(c(sqrt(2)^2, 2), levels = c(sqrt(2)^2, 2), labels =
c("sqrt(2)^2", "2"))
    > ? ? > could take sqrt(2)^2 and 2 as distinct.

    > Well, that may be interesting.. but I doubt if that's somewhere
    > we should go, easily, because? factor() has been documented to do
    > what it does now (with very slightly rounding such numbers via
as.character(.))
    > and hence such a change would typically lead to much work for
    > too many people.

    > I do see that indeed the? as.character(.) inside factor() takes
    > most of the CPU time used in largish factor() examples [as your
    > first], and indeed, for the case of integer 'x', we really
could
    > be much faster in factor construction.? 

    > [snip]

    > ______________________________________________
    > R-devel at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-devel

Apparently Analagous Threads

Search for more possibly parallel threads

R devel - Mar 2018 - Function 'factor' issues

[Rd] Function 'factor' issues

[Rd] Function 'factor' issues

Apparently Analagous Threads