Meyners, Michael
2015-Jun-08 10:51 UTC
[R] mismatch between match and unique causing ecdf (well, approxfun) to fail
Aehm, adding on this: I incorrectly *assumed* without testing that rounding would help; it doesn't:

ecdf(round(test2,0)) # a rounding that is way too rough for my application...
#Error in xy.coords(x, y) : 'x' and 'y' lengths differ

Digging deeper: The initially mentioned call to unique() is not very helpful, as test2 is a data frame, so I get what I deserve, an unchanged data frame with 1 row. Still, the issue remains and can even be simplified further:

> ecdf(data.frame(a=3, b=4))
Empirical CDF
Call: ecdf(data.frame(a = 3, b = 4))
 x[1:2] = 3, 4

works ok, but

> ecdf(data.frame(a=3, b=3))
Error in xy.coords(x, y) : 'x' and 'y' lengths differ

doesn't (same for a=b=1 or 2, so likely the same for any a=b). Instead,

> ecdf(c(a=3, b=3))
Empirical CDF
Call: ecdf(c(a = 3, b = 3))
 x[1:1] = 3

does the trick. From ?ecdf, I get that x should be a numeric vector - apparently, I misused the function by applying it to a row of a data frame (i.e. a data frame with one row). In all my other (dozens of) cases that worked ok, though, but not for this particular one. A simple unlist() helps:

> ecdf(unlist(data.frame(a=3, b=3)))
Empirical CDF
Call: ecdf(unlist(data.frame(a = 3, b = 3)))
 x[1:1] = 3

Yet, I'm even more confused than before: in my other data, there were also duplicated values in the vector (1-row-data frame), and it never caused any issue. For this particular example, it does. I must be missing something fundamental...

Michael

> -----Original Message-----
> From: Meyners, Michael
> Sent: Monday, 8 June 2015 12:02
> To: 'r-help at r-project.org'
> Subject: mismatch between match and unique causing ecdf (well, approxfun) to fail
>
> All,
>
> I encountered the following issue with ecdf which was originally on a vector
> of length 10,000, but I have been able to reduce it to a minimal reproducible
> example (just to avoid questions why I'd want to do this for a vector of
> length 2...):
>
> test2 = structure(list(X817 = 3.39824670255344, X4789 = 3.39824670255344),
>     .Names = c("X817", "X4789"), row.names = 74L, class = "data.frame")
> ecdf(test2)
>
> # Error in xy.coords(x, y) : 'x' and 'y' lengths differ
>
> In an attempt to track this down, it turns out that
>
> unique(test2)
> #       X817    X4789
> #74 3.398247 3.398247
>
> while
>
> match(test2, unique(test2))
> #[1] 1 1
>
> matches both values to the first one. This causes a hiccup in the call to ecdf,
> as this uses (an equivalent to) a call to approxfun with x = test2 and
> y = cumsum(tabulate(match(test2, unique(test2)))), the latter now containing
> one entry less than the former, so xy.coords fails.
>
> I understand that the issue should be somehow related to FAQ 7.31, but I
> would have hoped that unique and match would be using the same precision
> and hence both or neither would consider the two values identical, rather
> than match doing so while unique doesn't.
>
> Last but not least, it doesn't really cause an issue on my end (other than
> breaking my code and hence out of a loop in the first place...); rounding will help
> w/o noteworthy changes to the outcome, so no need to propose a
> workaround :-) I'd rather like to raise the issue and learn whether there is a
> purpose for this behavior, and/or whether there is a generic fix to this, or
> whether I am completely missing something.
>
> Version info (under Windows 7):
> R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
> Platform: x86_64-w64-mingw32/x64 (64-bit)
>
> Cheers, Michael
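To make the difference between the data frame case and the vector case concrete, here is a minimal sketch using the values from the example above. It relies only on documented base R behavior: unique() on a data frame drops duplicated rows (not duplicated values), and length() of a data frame counts columns. The rebuilt `test2` omits the original row name 74, which is assumed not to matter here.

```r
# Rebuild the example from the post (row name 74 omitted; assumed irrelevant)
test2 <- data.frame(X817 = 3.39824670255344, X4789 = 3.39824670255344)

# unique() on a data frame removes duplicated *rows*, so the single row
# survives unchanged and both columns are still there
u <- unique(test2)
dim(u)      # 1 row, 2 columns
length(u)   # 2, because length() of a data frame counts columns

# Flattening the row to a plain numeric vector gives the intended behavior
x <- unlist(test2)
length(unique(x))   # 1 distinct value

# ecdf() on the numeric vector works as documented
Fn <- ecdf(x)
Fn(3.39)   # 0: below the only observed value
Fn(3.40)   # 1: at or above it
```

So the number of "unique" entries is 2 for the data frame but 1 for the unlisted vector, which is consistent with the length mismatch reported by xy.coords.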
Martin Maechler
2015-Jun-08 14:42 UTC
[R] mismatch between match and unique causing ecdf (well, approxfun) to fail
> Aehm, adding on this: I incorrectly *assumed* without testing that rounding would help; it doesn't:
> ecdf(round(test2,0)) # a rounding that is way too rough for my application...
> #Error in xy.coords(x, y) : 'x' and 'y' lengths differ
>
> Digging deeper: The initially mentioned call to unique() is not very helpful, as test2 is a data frame, so I get what I deserve, an unchanged data frame with 1 row. Still, the issue remains and can even be simplified further:
>
> > ecdf(data.frame(a=3, b=4))
> Empirical CDF
> Call: ecdf(data.frame(a = 3, b = 4))
>  x[1:2] = 3, 4
>
> works ok, but
>
> > ecdf(data.frame(a=3, b=3))
> Error in xy.coords(x, y) : 'x' and 'y' lengths differ
>
> doesn't (same for a=b=1 or 2, so likely the same for any a=b). Instead,
>
> > ecdf(c(a=3, b=3))
> Empirical CDF
> Call: ecdf(c(a = 3, b = 3))
>  x[1:1] = 3
>
> does the trick. From ?ecdf, I get that x should be a numeric vector - apparently, I misused the function by applying it to a row of a data frame (i.e. a data frame with one row). In all my other (dozens of) cases that worked ok, though, but not for this particular one. A simple unlist() helps:

You were lucky. To use a one-row data frame instead of a numerical vector will typically *not* work unless ... well, you are lucky.

No, do *not* pass data frame rows instead of numeric vectors.

>
> > ecdf(unlist(data.frame(a=3, b=3)))
> Empirical CDF
> Call: ecdf(unlist(data.frame(a = 3, b = 3)))
>  x[1:1] = 3
>
> Yet, I'm even more confused than before: in my other data, there were also duplicated values in the vector (1-row-data frame), and it never caused any issue. For this particular example, it does. I must be missing something fundamental...
>

well.. I'm confused about why you are confused, but if you are thinking about passing rows of data frames as numeric vectors, this means you are sure that your data frame only contains "classical numbers" (no factors, no 'Date's, no...).

In such a case, transform your data frame to a numerical matrix *once*, preferably using data.matrix(<d.fr>) instead of just as.matrix(<d.fr>), but in this case it should not matter.
Then *check* the result and then work with that matrix from then on.

All other things probably will continue to leave you confused .. ;-)

Martin Maechler,
ETH Zurich
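The advice above could be followed along these lines. This is a sketch only: `sims` is a made-up stand-in for the simulation results, and `row_ecdfs` is a hypothetical name, neither of which appears in the thread.

```r
# Hypothetical stand-in for the simulation results: all columns numeric
sims <- data.frame(X1 = c(3.4, 1.2), X2 = c(3.4, 2.5), X3 = c(0.7, 2.5))

# Convert once to a numeric matrix, then check the result
m <- data.matrix(sims)
stopifnot(is.matrix(m), is.numeric(m), !anyNA(m))

# Each row m[i, ] is now a plain numeric vector, which is what ecdf() expects
row_ecdfs <- lapply(seq_len(nrow(m)), function(i) ecdf(m[i, ]))

# Row 1 holds 3.4 twice and 0.7 once
row_ecdfs[[1]](0.7)   # 1/3
row_ecdfs[[1]](3.4)   # 1
```

Indexing a matrix row drops to a numeric vector by default, so duplicated values within a row are handled by unique() and match() in the same way.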
Meyners, Michael
2015-Jun-09 11:38 UTC
[R] mismatch between match and unique causing ecdf (well, approxfun) to fail
Thanks Martin. Yep, I understand it is documented and my code wasn't as it should've been -- the confusion comes from the fact that it worked ok for hundreds of situations that seem very much alike, but one situation breaks. I agree that you typically can't be sure about having only numerical data in the data frame, but I was sure I had by design (numeric results of simulations, so no factors or anything else) and was then sloppy in passing the rows of the data frame to ecdf. So wondering what makes this situation different from all the others I had...

Anyway, point taken and working solution found, so all fine :-)

Cheers, Michael
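One possible answer to the open question of why only some one-row data frames fail, offered as a hedged sketch rather than a confirmed diagnosis. It assumes, as reported earlier in the thread for R 3.2.0, that unique() leaves a one-row data frame unchanged while match() still maps a duplicated value to the position of its first occurrence. Under that assumption the length of tabulate()'s result equals the largest matched position, so a duplicate only shortens it when it involves the last (largest) value. The code mimics this with plain numeric vectors; `counts_len` is a hypothetical helper.

```r
# Mimic the data-frame case with plain vectors: 'vals' plays the role of a
# unique(x) that did NOT drop duplicates (assumption taken from the thread)
counts_len <- function(x) {
  x <- sort(x)
  vals <- x                         # stands in for the unchanged data frame
  length(tabulate(match(x, vals)))  # tabulate() stops at the largest index
}

# Duplicate below the maximum: match gives 1 2 2 4, tabulate has 4 bins,
# so the lengths agree and no length error would be raised
len_ok <- counts_len(c(1, 2, 2, 3))

# Duplicate at the maximum: match gives 1 2 3 3, tabulate has only 3 bins,
# one fewer than the 4 'x' values, which is the reported kind of mismatch
len_bad <- counts_len(c(1, 2, 3, 3))

c(len_ok = len_ok, len_bad = len_bad)
```

If this reading is right, the earlier cases with duplicates would have worked whenever the largest value in the row was not itself duplicated, while a = b = 3 fails because the duplicate is the maximum.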