Toomas Soome
2018-Oct-22 09:27 UTC
head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated
> On 22 Oct 2018, at 06:30, Warner Losh <imp at bsdimp.com> wrote:
>
> On Sun, Oct 21, 2018 at 9:28 PM Warner Losh <imp at bsdimp.com> wrote:
>
>> On Sun, Oct 21, 2018 at 8:57 PM Mark Millard via freebsd-stable <freebsd-stable at freebsd.org> wrote:
>>
>>> [I built based on WITHOUT_ZFS= for other reasons. But,
>>> after installing the build, Hyper-V based boots are
>>> working.]
>>>
>>> On 2018-Oct-20, at 2:09 AM, Mark Millard <marklmi at yahoo.com> wrote:
>>>
>>>> On 2018-Oct-20, at 1:39 AM, Mark Millard <marklmi at yahoo.com> wrote:
>>>>
>>>>> I attempted to jump from head -r334014 to -r339076
>>>>> on a threadripper 1950X board and the boot fails.
>>>>> This is both native booting and under Hyper-V,
>>>>> same machine and root file system in both cases.
>>>>
>>>> I did my investigation under Hyper-V after seeing
>>>> a boot failure native.
>>>>
>>>> Looks like the native failure is even earlier,
>>>> before db> is even possible, possibly during
>>>> early loader activity.
>>>>
>>>> So this report is really for running under
>>>> Hyper-V: -r338804 boots and -r338810 does
>>>> not. By contrast -r334804 does not boot native.
>>>> (But I've little information for that context.)
>>>>
>>>> Sorry for the confusion. I rushed the report
>>>> in hopes of getting to sleep. It was not to be.
>>>>
>>>>> It fails just after the FreeBSD/SMP lines,
>>>>> reporting "kernel trap 9 with interrupts disabled".
>>>>>
>>>>> It fails in pmap_force_invalidate_cache_range at
>>>>> a clflush (%rax) instruction that produces a
>>>>> "Fatal trap 9: general protection fault while
>>>>> in kernel mode". cpuid=0 apic id= 00
>>>>>
>>>>> I used kernel.txz files from:
>>>>>
>>>>> https://artifact.ci.freebsd.org/snapshot/head/r*/amd64/amd64/
>>>>>
>>>>> to narrow the range of kernel builds for working -> failing
>>>>> and got:
>>>>>
>>>>> -r338804 boots fine
>>>>> (no amd64 kernel builds between to try)
>>>>> -r338810+ fails (any that I tried, anyway)
>>>>>
>>>>> In that range is -r338807 :
>>>>>
>>>>> QUOTE
>>>>> Author: kib
>>>>> Date: Wed Sep 19 19:35:02 2018
>>>>> New Revision: 338807
>>>>> URL: https://svnweb.freebsd.org/changeset/base/338807
>>>>>
>>>>> Log:
>>>>> Convert x86 cache invalidation functions to ifuncs.
>>>>>
>>>>> This simplifies the runtime logic and reduces the number of
>>>>> runtime-constant branches.
>>>>>
>>>>> Reviewed by: alc, markj
>>>>> Sponsored by: The FreeBSD Foundation
>>>>> Approved by: re (gjb)
>>>>> Differential revision: https://reviews.freebsd.org/D16736
>>>>>
>>>>> Modified:
>>>>> head/sys/amd64/amd64/pmap.c
>>>>> head/sys/amd64/include/pmap.h
>>>>> head/sys/dev/drm2/drm_os_freebsd.c
>>>>> head/sys/dev/drm2/i915/intel_ringbuffer.c
>>>>> head/sys/i386/i386/pmap.c
>>>>> head/sys/i386/i386/vm_machdep.c
>>>>> head/sys/i386/include/pmap.h
>>>>> head/sys/x86/iommu/intel_utils.c
>>>>> END QUOTE
>>>>>
>>>>> There do seem to be changes associated with
>>>>> clflush(...) use. Looking at:
>>>>>
>>>>> https://svnweb.freebsd.org/base/head/sys/amd64/amd64/pmap.c?annotate=339432
>>>>>
>>>>> it appears that pmap_force_invalidate_cache_range has not
>>>>> changed since -r338807.
>>>>>
>>>>> It seems that -r338806 and -r338810 would be unlikely
>>>>> contributors.
>>>
>>> I went after my native-boot loader problem first because I
>>> could switch kernels via the loader for booting FreeBSD under
>>> Hyper-V. Switching loaders is more of a problem.
>>>
>>> In order to avoid the loader-time crash I switched to building
>>> and installing based on WITHOUT_ZFS= . I've had no active use of
>>> ZFS in years. (The old official-build loaders that worked were
>>> non-ZFS ones.)
>>>
>>> This took care of the native-boot loader-crash --and, to my
>>> surprise, also the Hyper-V-boot kernel-time crash.
>>>
>>> My private builds now boot the 1950X in both contexts just
>>> fine.
>>>
>>> During my early investigation I did pick up specific changes
>>> from after -r339076 that seemed to be tied to Ryzen and such.
>>> (They made no difference to the boot problems at the time
>>> but I saw no reason to remove them.)
>>>
>>> # uname -apKU
>>> FreeBSD FBSDFSSD 12.0-ALPHA8 FreeBSD 12.0-ALPHA8 #5 r339076:339432M: Sun
>>> Oct 21 16:44:25 PDT 2018 markmi at FBSDFSSD:/usr/obj/amd64_clang/amd64.amd64/usr/src/amd64.amd64/sys/GENERIC-NODBG
>>> amd64 amd64 1200084 1200084
>
> (stupid gmail)
>
> The phrase "no active use" bothers me. What does that mean? Are there any
> ZFS pools or any disks that have any whiff of ZFSish thing on it at all?
> Clearly, there's something in the zfs boot loader that's freaking out by
> something on your system, but absent that information I can't help you.

It would help to get output from loader lsdev -v command.

Also if you could test boot loader with UEFI - for example get to loader prompt via usb/cd boot and then get the same lsdev -v output.

I would be interested to see the sector size information and if the UEFI loader does also have issues.

If it does, I'd like to see the outputs from commands:

zpool status
zpool import

thanks,
toomas
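For context on the quoted -r338807 log: an ifunc lets a resolver pick one implementation of a function once, early, so that later calls skip the per-call feature test. The sketch below only imitates that pattern with a plain function pointer in portable C. It is not the kernel's code, and every name in it (cache_flush_init, cache_flush_range, the two counters) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Counters so the effect of the selection can be observed (illustration only). */
int clflush_calls;
int wbinvd_calls;

/* Stand-in for a per-line flush; a real one would issue CLFLUSH per cache line. */
static void
flush_by_line(const void *addr, size_t len)
{
	(void)addr;
	(void)len;
	clflush_calls++;
}

/* Stand-in for a whole-cache flush; a real one would issue WBINVD. */
static void
flush_whole_cache(const void *addr, size_t len)
{
	(void)addr;
	(void)len;
	wbinvd_calls++;
}

/* Chosen once by cache_flush_init(); defaults to the conservative path. */
static void (*flush_impl)(const void *, size_t) = flush_whole_cache;

/* Plays the role of the resolver: the CPU feature is tested here, once. */
void
cache_flush_init(bool cpu_has_clflush)
{
	flush_impl = cpu_has_clflush ? flush_by_line : flush_whole_cache;
}

/* Callers no longer branch on the CPU feature; they call the selected code. */
void
cache_flush_range(const void *addr, size_t len)
{
	assert(flush_impl != NULL);
	flush_impl(addr, len);
}
```

The design point is that the feature test lives in one place, which is what the log means by fewer runtime-constant branches. A real ifunc does the selection during symbol resolution rather than through an explicit init call.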
Mark Millard
2018-Oct-22 10:58 UTC
head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated
On 2018-Oct-22, at 2:27 AM, Toomas Soome <tsoome at me.com> wrote:

> It would help to get output from loader lsdev -v command.

That turned out to be very interesting: The non-ZFS loader
crashes during the listing, during disk8, which shows
"x 0" instead of "x 512".
Hand transcribed from pictures:

OK lsdev -v
disk devices
    disk0: BIOS drive C (937703088 x 512):
        disk0p1: FreeBSD boot    512K
        disk0p2: FreeBSD UFS     356G
        disk0p3: FreeBSD swap    15G
        disk0p4: FreeBSD swap    76G
    disk1: BIOS drive D (16514064 x 512):
        disk1s1: Linux           2048KB
        disk1s2: Unknown         952GB
    disk2: BIOS drive E (16514064 x 512):
        disk2p1: Unknown         128MB
    disk3: BIOS drive F (16514064 x 512):
        disk3p1: Unknown         128MB
    disk4: BIOS drive G (16434495 x 512):
        disk4p1: Unknown         128MB
        disk4p2: DOS/Windows     1716GB
    disk5: BIOS drive H (16434495 x 512):
        disk5p1: FreeBSD boot    512K
        disk5p2: FreeBSD UFS     176G
        disk5p3: FreeBSD swap    193G
        disk5p4: FreeBSD swap    15G
    disk6: BIOS drive I (16434495 x 512):
        disk6p1: Unknown         499MB
        disk6p2: EFI             99MB
        disk6p3: Unknown         16MB
        disk6p4: DOS/Windows     886G
    disk7: BIOS drive H (16434495 x 512):
        disk7p1: FreeBSD boot    512K
        disk7p2: FreeBSD UFS     953G
    disk8: BIOS drive K (262144 x 0):

int=00000000  err=00000000  efl=00010246  eip=000286bd
eax=00000000  ebx=72b50430  ecx=00000000  edx=00000000
esi=00000000  edi=00092080  ebp=00091eec  esp=00091ea8
cs=002b  ds=0033  es=0033  fs=0033  gs=0033  ss=0033
cs:eip=f7 f1 89 c1 85 d2 0f 85-d8 01 00 00 6a 05 58 85
       f6 0f 88 75 01 00 00 89-cb c1 fb 1f 89 ca 03 55
ss:esp=09 00 00 00 00 00 00 00-0a 00 00 00 02 00 00 00
       00 00 00 00 00 00 00 00-78 1f 09 00 33 45 04 00
BTX halted

I expect that "disk8" is what gpart show -p from a native
boot showed as:

=>       1  60062499    da1  MBR  (29G)
         1        31         - free -  (16K)
        32  60062468  da1s1  fat32lba  (29G)

(That gpart show -p output is in another of the list
messages.)

> Also if you could test boot loader with UEFI - for example get to loader prompt via usb/cd boot and then get the same lsdev -v output.

Still true given the above crash? Or, going the other way,
should "disk8" be left as it is in order to be sure to do
this test with the drive present?

If I do this test later, it will take a bit to get media
to do it with. (It is about 4 AM and I've yet to get to
sleep.)
Note: I've never tried a UEFI based boot of FreeBSD on this
machine (but the Windows 10 Pro x64 is EFI based). The only
FreeBSD context using an EFI partition to boot that I have
used is on an arm aarch64 Cortex-A57 system.

> I would be interested to see the sector size information and if the UEFI loader does also have issues.

Understood.

> If it does, I'd like to see the outputs from commands:
> zpool status
> zpool import

Independent of the UEFI test . . .

I do have a -r331924 head version on another one of the
devices and can native-boot that. It still has its ZFS
software (but a default loader without ZFS). Trying from
that context, hand transcribed:

# zpool status
ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
no pools available
# zpool import
#

[That was based on the old (default) loader being a
non-ZFS one.]

==
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
Lev Serebryakov
2018-Oct-23 10:53 UTC
loader lsdev crashes loader (Was: head -r338804 boots threadripper 1950X fine; head -r338810+ do not; -r338807 seems implicated)
On 22.10.2018 12:27, Toomas Soome wrote:

> It would help to get output from loader lsdev -v command.

Current loader crashes on "lsdev" for me:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=232483

(it is not threadripper-related, my hardware is Intel Atom).

-- 
// Lev Serebryakov