thr3ads.net - R devel - [Rd] A different error in sample() [Sep 2018]

lmo

2018-Sep-19 23:53 UTC

[Rd] A different error in sample()

Although it seems pretty weird to pass a non-integer numeric vector of length
one as the first argument to sample(), the results do not seem to match what
is documented in the manual. In addition, the results below do not support the
use of "round" rather than "truncate" in the documentation. Consider the code
below.

The first sentence in the Details section says: "If x has length 1, is
numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes
place from 1:x."

In the console:

> 1:2.001
[1] 1 2
> 1:2.9
[1] 1 2

Truncation:

> trunc(2.9)
[1] 2

So, this seems to support the quote from previous emails: "Non-integer
positive numerical values of n or x will be truncated to the next smallest
integer, which has to be no larger than .Machine$integer.max."
However, again in the console:

> set.seed(123)
> table(sample(2.001, 10000, replace=TRUE))

   1    2    3 
5052 4941    7

So, neither rounding nor truncation is occurring. Next, define a sequence.

> x <- seq(2.001, 2.51, length.out=20)

Now, grab all of the threes from sample()-ing this sequence.

> set.seed(123)
> threes <- sapply(x, function(y) table(sample(y, 10000, replace=TRUE))[3])

Check for NAs (I cheated here and found a nice seed).

> any(is.na(threes))
[1] FALSE

Now, the (to me) disturbing result.

> is.unsorted(threes)
[1] FALSE

or equivalently

> all(diff(threes) > 0)
[1] TRUE

So the number of threes grows monotonically as x moves from 2.001 to 2.51. As
I hinted above, the monotonic growth is not assured. My guess is that the
growth is stochastic and relates to some "probability weighting" based on how
close the element of x is to 3. Perhaps this has been brought up before, but it
seems relevant to the current discussion.

A potential aid to this issue would be something like

if(length(x) == 1 && !isTRUE(all.equal(x, as.integer(x))))
    warning("It is a bad idea to use vectors of length 1 in the x argument that are not integers.")
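
A minimal sketch of such a guard, written as a wrapper rather than a change to sample() itself. The name checked_sample() and the warning text are hypothetical and not part of base R. Because all.equal() returns a character description rather than FALSE when its arguments differ, the comparison is wrapped in isTRUE():

```r
# Hypothetical wrapper around sample(); the name and message are illustrative only.
checked_sample <- function(x, ...) {
    # all.equal() yields TRUE or a character vector, never FALSE,
    # so isTRUE() is needed before negating the result.
    if (length(x) == 1 && is.numeric(x) && x >= 1 &&
        !isTRUE(all.equal(x, as.integer(x)))) {
        warning("x has length 1 and is not a whole number")
    }
    sample(x, ...)
}
```

With this sketch, a call such as checked_sample(2.001, 5, replace = TRUE) would signal the warning and then sample as usual, while checked_sample(3, 1) stays silent.
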
Hope that helps,
luke

Emil Bode

2018-Sep-20 08:17 UTC

But do we treat this as an error in what sample() does, or in how it is
documented?
I think what is done now would be best described as "ceilinged", i.e.
what ceiling() does. But is there an English word to describe this?
Or just use "converted to the next smallest integer"?

But then again, what happens is that the answer is ceilinged, not the input.
I guess the rationale is that multiplying by any integer and then dividing
should give the same results:
ceiling(sample(n * x, size=1e6, replace = TRUE) / x) gives the same
distribution as sample(n, size=1e6, replace = TRUE) for any integers n and x,
and it is nice that this also holds for non-integer n.

The most important question is why people would use sample() with a
non-integer x; I don't see many use cases.
So I agree with Luke that a warning would be best, regardless of what the docs
say.
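
The scaling argument above can be checked without any random draws. The sketch below enumerates the equally likely values 1..(n * x), applies ceiling(. / x), and compares the result with the probabilities implied by the floor(runif(1, 0, n)) + 1 model mentioned later in this thread. The values n = 2.5 and x = 2 are arbitrary choices made for illustration:

```r
n <- 2.5   # non-integer "population size" (illustrative value)
x <- 2     # integer multiplier, chosen so that n * x is a whole number

# sample(n * x) draws uniformly from 1..(n * x); enumerate those values
# and apply the ceiling(. / x) transformation to each of them.
outcomes <- ceiling(seq_len(n * x) / x)
probs_scaled <- tabulate(outcomes) / (n * x)

# Probabilities implied by floor(runif(1, 0, n)) + 1
k <- floor(n)
probs_direct <- c(rep(1 / n, k), (n - k) / n)

print(probs_scaled)   # 0.4 0.4 0.2
print(probs_direct)   # 0.4 0.4 0.2
```
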

Best regards, 
Emil Bode

Joris Meys

2018-Sep-20 08:31 UTC

To be more clear: I do NOT state that the function "round" is used. I read
the documentation as "non integer positive numerical values will be
replaced by the next smallest integer", the important part being the NEXT
smallest integer, i.e. how ceiling() does it. So that's exactly what I
would expect. If "replaced by" causes less confusion than "rounded to" or
"truncated to", then use that.

I do agree that this wording would still indicate that this happens prior
to the sampling, whereas the output indicates that this is done after the
sampling. I can reproduce the sample() outcome using runif() as follows:

> table(ceiling(runif(10000,0,2.1)))

   1    2    3
4774 4756  470
> table(ceiling(runif(10000,0,3)))

   1    2    3
3273 3440 3287

I don't know if that's the intended behaviour, but there is some logic in
it. It's up to the R core team to decide if this is OK and rephrase the
help page so it becomes more clear what actually happens, or simply add
something like

if( (x%%1) != 0) x <- ceiling(x)

prior to the sampling algorithm.
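
A minimal sketch of what that change would amount to at the R level. It assumes a hypothetical wrapper, sample_ceiling(), which is not an existing function, and it calls sample.int() once the size has been made whole:

```r
# Hypothetical wrapper: round a non-integer length-1 x up before sampling.
sample_ceiling <- function(x, size, replace = FALSE) {
    stopifnot(is.numeric(x), length(x) == 1, x >= 1)
    if ((x %% 1) != 0) x <- ceiling(x)
    sample.int(x, size, replace = replace)
}

set.seed(1)
draws <- sample_ceiling(2.1, 10000, replace = TRUE)
table(draws)   # 1, 2 and 3 now each appear in roughly a third of the draws
```

Under this variant a value of 2.1 behaves exactly like 3, so the small share of threes seen in the runif() reproduction would no longer occur.
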

Cheers
Joris

-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

peter dalgaard

2018-Sep-20 08:42 UTC

Yup, that is a bug, at least in the documentation. Probably a clearer example is

x <- seq(2.001, 2.999, length.out=999)
threes <- sapply(x, function(y) table(sample(y, 10000, replace=TRUE))[3])
plot(threes, type="l")
curve(10000*(x-2)/x, add=TRUE, col="red")

which is entirely consistent with what you'd expect from floor(runif(10000,
0, y)) + 1, and as far as I can tell from the source, that is what is happening
internally.
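
The red curve follows from that model: under floor(runif(1, 0, y)) + 1, each of the values 1..floor(y) has probability 1/y, and floor(y) + 1 has probability (y - floor(y))/y. A small sketch follows, in which expected_probs() is a hypothetical helper introduced only for illustration and the simulation uses runif() directly rather than sample():

```r
# Hypothetical helper: outcome probabilities of floor(runif(1, 0, y)) + 1
expected_probs <- function(y) {
    k <- floor(y)
    frac <- y - k
    p <- rep(1 / y, k)
    if (frac > 0) p <- c(p, frac / y)   # extra outcome k + 1 for non-integer y
    p
}

y <- 2.5
expected_probs(y)        # 0.4 0.4 0.2
10000 * (y - 2) / y      # expected number of threes in 10000 draws: 2000

# Simulate the model itself (not sample()) and compare the frequencies
set.seed(42)
sim <- floor(runif(10000, 0, y)) + 1
observed <- tabulate(sim, nbins = 3) / 10000
round(observed - expected_probs(y), 3)
```
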

(Strict monotonicity is a bit of a red herring; it is just a matter of having
spaced the y so far apart that the probability of an order reversal becomes
negligible.)

So either we should do what the documentation says we do, or the documentation
should not say that we do what we do not actually do...

The suspect code is this snippet from do_sample:

            int n = (int) dn;
            .....

            if (replace || k < 2) {
                for (int i = 0; i < k; i++) iy[i] = (int)(R_unif_index(dn) + 1);
            } else {
                int *x = (int *)R_alloc(n, sizeof(int));
                for (int i = 0; i < n; i++) x[i] = i;
                for (int i = 0; i < k; i++) {
                    int j = (int)(R_unif_index(n));
                    iy[i] = x[j] + 1;
                    x[j] = x[--n];
                }
            }

(notice the arguments to R_unif_index: the untruncated double dn in the first
branch, the truncated int n in the second)
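
To make the difference between the two branches concrete, here is a rough R transcription of the quoted snippet. It is not R's actual implementation: it assumes that R_unif_index(d) behaves like floor(d * u) for a uniform u on (0, 1), and emulate_do_sample() is a hypothetical name used only for this sketch.

```r
# Rough R transcription of the quoted C snippet; assumes R_unif_index(d)
# behaves like floor(d * runif(1)). emulate_do_sample() is hypothetical.
emulate_do_sample <- function(dn, k, replace = FALSE) {
    unif_index <- function(d) floor(d * runif(1))
    n <- as.integer(dn)                  # mirrors `int n = (int) dn;` (truncation)
    if (!replace && k > n) stop("cannot take more draws than n without replacement")
    iy <- integer(k)
    if (replace || k < 2) {
        # This branch passes the untruncated double dn
        for (i in seq_len(k)) iy[i] <- as.integer(unif_index(dn) + 1)
    } else {
        # This branch passes the truncated integer n
        x <- seq_len(n) - 1L             # 0-based values, as in the C code
        for (i in seq_len(k)) {
            j <- as.integer(unif_index(n))   # 0-based index into x
            iy[i] <- x[j + 1L] + 1L
            x[j + 1L] <- x[n]            # move the last remaining value into slot j
            n <- n - 1L
        }
    }
    iy
}
```

Under these assumptions, emulate_do_sample(2.9, 10000, replace = TRUE) can return 3, whereas emulate_do_sample(2.9, 2) only ever returns a permutation of 1:2.
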

-pd
-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
