On Sun, 25 Feb 2018, Iuri Gavronski <iuri at proxima.adm.br> wrote:

Hi,

Why does sum() on a 10-item vector produce a different value than its
counterpart on a 2-item vector? I understand the problems related to
arithmetic precision when storing decimal numbers in binary format, but
shouldn't the errors be equal regardless of the method used?

See my example:

> options(digits=22)
> x=rep(.1,10)
> x
 [1] 0.10000000000000001 0.10000000000000001 0.10000000000000001
 [4] 0.10000000000000001 0.10000000000000001 0.10000000000000001
 [7] 0.10000000000000001 0.10000000000000001 0.10000000000000001
[10] 0.10000000000000001
> sum(x)
[1] 1
> y=0
> for (i in 1:10) {y=sum(y+x[i])}
> y
[1] 0.99999999999999989
> y*10^6
[1] 999999.99999999988
> sum(x)*10^6
[1] 1e+06
> z=.1+.1+.1+.1+.1+.1+.1+.1+.1+.1
> z
[1] 0.99999999999999989
ir. Thierry Onkelinx <thierry.onkelinx at inbo.be> (Team Biometrics &
Quality Assurance, Research Institute for Nature and Forest / INBO,
Brussels) replied:

This is described in R FAQ 7.31.
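The practical upshot of FAQ 7.31 is to compare floating point results with a tolerance (in R, `all.equal()`) rather than with `==`. A minimal illustration, written in Python only because the idiom is the same in any IEEE 754 language; `math.isclose` plays the role of R's `all.equal` here:

```python
import math

# Ten additions of 0.1 leave a tiny rounding error in the total.
total = 0.0
for _ in range(10):
    total += 0.1

print(total == 1.0)              # False: exact comparison sees the error
print(math.isclose(total, 1.0))  # True: tolerance-based comparison, as the FAQ advises
```

In R the equivalent check is `isTRUE(all.equal(total, 1))`, which also returns TRUE.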
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On Sun, 25 Feb 2018, Jeff Newmiller replied:

> Why does sum() on a 10-item vector produce a different value than its
> counterpart on a 2-item vector? I understand the problems related to
> arithmetic precision when storing decimal numbers in binary format, but
> shouldn't the errors be equal regardless of the method used?

Since you understand the problems, why are you asking? More to the point,
why are you asking as if this were a property of R rather than a property
of all typical IEEE 754 floating point implementations?

For others Googling this: the short answer is, don't make your program's
behavior depend on the least significant digits... they are not reliable.
Read about the difference between double precision (53 bits, or about 15
decimal digits [1]) and extended precision (64 bits, or about 18 decimal
digits, for Intel math coprocessors [2]), and ask questions about this
topic in an appropriate forum (e.g. [3]), since this issue can crop up in
practically any programming language.

> See my example:

Examples are good. Diving into numerical analysis theory, not so much.

>> options(digits=22)

Don't be misleading... you are not going to get 22 digits out of standard
numeric types. Note that you get 17 digits below, and two of those are
artifacts pulled from the artificially-expanded extended-precision values
in the coprocessor... the numbers in memory are not that precise. The
best that can be said for printing 17-digit values is that they give you
the best shot at reproducing the actual 15 digits of double precision in
a different program, or in another instance of R, if you keep them.

>> x=rep(.1,10)
>> x
>  [1] 0.10000000000000001 0.10000000000000001 0.10000000000000001
>  [4] 0.10000000000000001 0.10000000000000001 0.10000000000000001
>  [7] 0.10000000000000001 0.10000000000000001 0.10000000000000001
> [10] 0.10000000000000001

Sitting in memory (not extended precision!)

>> sum(x)
> [1] 1

sum() is vectorized (compact compiled C code)... the intermediate values
can stay in the coprocessor at extended precision. Only the result is
trimmed back to double precision.

>> y=0
>> for (i in 1:10) {y=sum(y+x[i])}

Why are you calling the sum function on a scalar value? The addition has
already taken place by the time sum() is called... The actual
intermediate values in this for loop are stored in RAM at double
precision, re-expanded to extended precision each time an addition
occurs, and rounded again when the addition is done. That is a lot of
rounding compared to the sum function.

>> y
> [1] 0.99999999999999989
>> y*10^6
> [1] 999999.99999999988

Note that 10 = 2*5 = b'0010' * b'0101' is not just a power of 2, so both
the binary mantissa and the binary exponent get modified when you
multiply by a power of 10; the rounding may therefore differ for
different powers of 10.

>> sum(x)*10^6
> [1] 1e+06
>> z=.1+.1+.1+.1+.1+.1+.1+.1+.1+.1

More double precision arithmetic.

>> z
> [1] 0.99999999999999989

Similar to the for loop. Don't rely on the last few digits to be stable.
Take a course in numerical analysis if you just cannot take my word for
it that those last few digits are going to wander off a bit no matter
what you do.

[1] en.wikipedia.org/wiki/IEEE_754
[2] en.wikipedia.org/wiki/Extended_precision
[3] cs.stackexchange.com/search?q=ieee+754

---------------------------------------------------------------------------
Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
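Jeff's point that this is a property of IEEE 754 arithmetic rather than of R is easy to check: the same session reproduces in Python, shown here purely as an illustration that the behavior is language-independent:

```python
x = [0.1] * 10

# Accumulate in a loop: each partial sum is rounded back to a 64-bit
# double, so a little rounding error is injected at every step.
y = 0.0
for v in x:
    y = y + v

print(f"{0.1:.17f}")  # 0.10000000000000001 -- the 17-digit value R printed
print(repr(y))        # 0.9999999999999999  -- same shortfall as R's for loop
```

The doubles involved are bit-for-bit identical to the ones R stores; only the default display width differs between the two interpreters.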
Bill Dunlap (TIBCO Software) replied:

In the R expression x[1] + x[2] the result must be stored as a double
precision number, because that is what R "numerics" are. sum() does not
have to keep its intermediate results as doubles: it can use quad
precision or Kahan's summation algorithm (both methods involve more than
a simple double to keep track of the running sum) to give more accurate
results when summing vectors.
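Kahan's compensated summation, which Bill mentions, carries a second double that captures the low-order bits lost at each addition. A minimal sketch, in Python for illustration (R's own sum() is implemented in C and is not required to use this exact algorithm):

```python
def kahan_sum(values):
    """Compensated summation: recover the bits rounded away at each step."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - c            # apply the accumulated correction
        t = total + y        # big + small: low-order digits of y are lost...
        c = (t - total) - y  # ...but the lost part is recovered algebraically
        total = t
    return total

x = [0.1] * 10
print(sum(x))        # 0.9999999999999999 -- naive left-to-right summation
print(kahan_sum(x))  # 1.0 -- compensated summation lands on the rounded exact sum
```

The trick in `c = (t - total) - y` works because, in round-to-nearest arithmetic, that pair of subtractions exactly reproduces the part of `y` that was discarded when it was added to the much larger `total`.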