On 8/22/20 9:33 PM, Jeroen Ooms wrote:
> On Sat, Aug 22, 2020 at 9:10 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
>> On 8/22/20 8:26 PM, Tomas Kalibera wrote:
>>> On 8/22/20 7:58 PM, Jeroen Ooms wrote:
>>>> On Sat, Aug 22, 2020 at 8:39 AM Tomas Kalibera
>>>> <tomas.kalibera at gmail.com> wrote:
>>>>> On 8/21/20 11:45 PM, m19tdn+9alxwj7d2bmk--- via R-devel wrote:
>>>>>> Ah yes, this is related. I reported v2010 below, but it looks like
>>>>>> I was updated to this Insider Build overnight without my knowledge,
>>>>>> and conflated it with the new installation R v4 this morning.
>>>>>>
>>>>>> I will continue to look into the issue with the methods Tomas
>>>>>> mentioned.
>>>>> It is interesting that a rare 5 years old problem would re-appear on
>>>>> current Insider builds. Which build of Windows are you running exactly?
>>>>> I've seen another report about a crash on 20190.1000. It'd be nice to
>>>>> know if it is present also in newer builds, i.e. in 20197.
>>>> I installed the latest 20197 build in a vm, and I can indeed reproduce
>>>> this problem.
>>>>
>>>> What seems to be happening is that R triggers an infinite recursion in
>>>> Windows unwinding mechanism, and eventually dies with a stack
>>>> overflow. Attached a backtrace of the initial 100 frames of the main
>>>> thread (the pattern in the top ~30 frames continues forever).
>>>>
>>>> The microsoft blog doesn't mention anything related to exception
>>>> handling has changed in recent versions:
>>>> https://docs.microsoft.com/en-us/windows-insider/at-home/active-dev-branch
>>>>
>>> Thanks, unfortunately that does not ring any bells (except below), I
>>> can't guess from this what is the underlying cause of the problem.
>>> There may be something wrong in how we use setjmp/longjmp or how
>>> setjmp/longjmp works on Windows.
>>>
>>> It reminds me of a problem I was debugging a few days ago, where the
>>> longjmp implementation segfaults on Windows 10 (recent but not
>>> Insider build) probably soon after unwinding the stack, but only with
>>> GCC 10 / MinGW 7 and only in one of the no-segfault tests, and only
>>> with -O3 (not -O2, not with -O3 -fno-split-loops). The problem
>>> was sensitive to these optimization options, interestingly, on the call
>>> site of longjmp (do_abs), even when it was not an immediate caller
>>> of the longjmp. I've not tracked this down yet, it will require
>>> looking at the assembly level, and I was suspecting a compiler error
>>> causing the compiler to generate code that messes with the stack or
>>> registers in a way that impacts the upcoming jump. But now as we have
>>> this other problem with setjmp/longjmp, the compiler may not be the top
>>> suspect anymore.
>>>
>>> I may not be able to work on this in the next few days or a week, so
>>> if anyone gets there first, please let me know what you find out.
>> Btw could you please try out if the UCRT build of R crashes as well in
>> the Insider Windows build?
> Yes, it hangs in exactly the same way, except that the backtrace shows
>
> ucrtbase!.intrinsic_setjmpex () from C:\WINDOWS\System32\ucrtbase.dll
>
> Instead of msvcrt!_setjmpex (as expected of course).

Thanks. I found what is causing the problem I observed with GCC10/stock
Windows 10, I expect this is the same one as in the Insider build.
I will investigate further,

Tomas
On 8/25/20 6:14 PM, Tomas Kalibera wrote:
> Thanks. I found what is causing the problem I observed with
> GCC10/stock Windows 10, I expect this is the same one as in the
> Insider build.
> I will investigate further,
>
> Tomas

It seems the problem is between MinGW-W64 and Windows, and it really does
cause both the reported crashes in an Insider build (I tested in 20197)
and the crash in my GCC 10 builds in a single "no-segfault" test. setjmp is
implemented using the Windows call _setjmpex, which has a second argument
that MinGW sets differently based on the GCC version. When I set this
argument the way MinGW-W64 did for early versions of GCC, to mingw_getsp(),
it fixes/hides the problems on my systems. Perl5 uses a similar workaround,
but otherwise there is no solid base (documentation, specification, etc.)
I am aware of for this change, so this may take some more time to be
properly fixed. Still, if anyone experiments with this workaround and finds
a problem, please let me know. In particular, I am curious whether it works
on earlier versions of Windows (at least with check-all, including
recommended packages).

Thanks
Tomas

-------------- next part --------------
A non-text attachment was scrubbed...
Name: setjmp.diff
Type: text/x-patch
Size: 570 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20200826/a4ff18ba/attachment.bin>
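For readers less familiar with the mechanism under discussion, here is a minimal, portable sketch of how setjmp/longjmp gives a non-local exit for error handling. It uses only standard C, so it does not reproduce the Windows-specific second argument of _setjmpex or the mingw_getsp() workaround. The names error_context, checked_divide and try_divide are hypothetical and chosen for illustration; this is not R's actual code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical names throughout; an illustration, not R's code. */

/* Saved execution context that longjmp() will return to. */
static jmp_buf error_context;

/* Deeper in the call chain: report an error by jumping back to the
 * setjmp() site instead of returning normally. */
static int checked_divide(int num, int den)
{
    if (den == 0)
        longjmp(error_context, 1);   /* does not return */
    return num / den;
}

/* Top level: returns 0 and stores the quotient in *out on success,
 * or -1 (leaving *out untouched) if an error was signalled. */
int try_divide(int num, int den, int *out)
{
    assert(out != NULL);

    /* setjmp() returns 0 when called directly, and the value passed
     * to longjmp() when control comes back through the jump. */
    if (setjmp(error_context) != 0)
        return -1;

    *out = checked_divide(num, den);
    return 0;
}
```

With this sketch, try_divide(10, 2, &q) returns 0 and sets q to 5, while try_divide(1, 0, &q) returns -1 and leaves q unchanged. On Windows the jump back also involves the system unwinding machinery, which is where the crashes reported in this thread show up.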
On Wed, Aug 26, 2020 at 7:54 PM Tomas Kalibera <tomas.kalibera at gmail.com> wrote:
> It seems the problem is between MinGW-W64 and Windows, and really it
> causes both the reported crashes in an Insider build (I tested in 20197)
> and in my GCC 10 builds in a single "no-segfault" test. setjmp is
> implemented using Windows call _setjmpex, which has a second
> argument, which is set differently by MinGW based on GCC version. When I
> set this argument as MinGW-W64 did on early versions of GCC,
> mingw_getsp(), it fixes/hides the problems on my systems. Perl5 uses a
> similar workaround, but otherwise there is no solid base (documentation,
> specification, etc) I am aware of for this change, so this may take some
> more time to be properly fixed. Still, if anyone experiments with this
> workaround and finds a problem, please let me know. In particular, I am
> curious whether it works on earlier versions of Windows (at least with
> check-all, including recommended packages).

FYI, the problem has disappeared on Windows dev build 20201 (released
yesterday), so it may have been a Windows bug. That is not to say there is
no bug on the R/mingw side, but at least the current and past releases of R
are working again on the latest versions of Windows, which is a big relief.