David Wolfskill
2006-May-31 17:31 UTC
6.1-STABLE; Fatal trap 12: page fault while in kernel mode; kgdb isn't working??!?
In testing a vendor's product, I managed (as I had been warned might
happen) to crash the machine on which the product was running.

It's a moderately-recent 6.1-STABLE:

mx-out05# uname -a
FreeBSD mx-out05.lab.example.org 6.1-STABLE FreeBSD 6.1-STABLE #3: Sun May 7 10:06:44 PDT 2006 dhw@mx-out05.lab.example.org:/usr/obj/usr/src/sys/SMP_PAE i386
mx-out05#

Hardware-wise, it's a dual 3 GHz Xeon box with 4 GB RAM.

In case it's relevant:

mx-out05# mount; df; swapinfo
/dev/aacd0s2a on / (ufs, local, soft-updates)
devfs on /dev (devfs, local)
/dev/aacd0s2d on /usr (ufs, local, soft-updates)
/dev/aacd0s3d on /home (ufs, local, soft-updates)
/dev/aacd0s3e on /var (ufs, local, soft-updates)
/dev/aacd1s1d on /var/spool (ufs, local, noatime)
devfs on /var/named/dev (devfs, local)
/dev/md0 on /tmp (ufs, local, soft-updates)
Filesystem     1K-blocks    Used    Avail Capacity  Mounted on
/dev/aacd0s2a     507630   37008   430012     8%    /
devfs                  1       1        0   100%    /dev
/dev/aacd0s2d    2280880 1676226   422184    80%    /usr
/dev/aacd0s3d    5077038   50950  4619926     1%    /home
/dev/aacd0s3e    7270492  949650  5739204    14%    /var
/dev/aacd1s1d   34678048   14136 31889670     0%    /var/spool
devfs                  1       1        0   100%    /var/named/dev
/dev/md0         9159102      16  8426358     0%    /tmp
Device          1K-blocks     Used    Avail Capacity
/dev/aacd0s3b    16777216        0 16777216     0%
mx-out05#

Yes, swap is ridiculously huge (but note that /tmp is swap-backed).
So are a few other allocations (huge, that is); in general, I prefer
to avoid exhausting resources. :-}

The crash appears to be quite reproducible by using
ports/benchmarks/postal. It's fairly likely that I need to configure
some resource-consumption constraints so the application doesn't go
completely berserk. I note that running postal using the same
parameters against a similar box running Postfix just chugs along, no
problem at all.

Here's a typical complaint as extracted from /var/log/messages:

May 31 16:02:13 mx-out05 kernel: Fatal trap 12: page fault while in kernel mode
May 31 16:02:13 mx-out05 kernel: cpuid = 0; apic id = 00
May 31 16:02:13 mx-out05 kernel: fault virtual address
May 31 16:02:13 mx-out05 kernel: = 0x0
May 31 16:02:13 mx-out05 kernel: fault code = supervisor read, page not present
May 31 16:02:13 mx-out05 kernel: instruction pointer = 0x20:0x0
May 31 16:02:13 mx-out05 kernel: stack pointer = 0x28:0xf06f8b98
May 31 16:02:13 mx-out05 kernel: frame pointer = 0x28:0xf06f8bcc
May 31 16:02:13 mx-out05 kernel: code segment = base 0x0, limit 0xf
May 31 16:02:13 mx-out05 kernel: f

I did manage to set things up to get a kernel crash dump, and I'm about
as certain as I can be that the kernel, userland, and crash dump are all
in sync.

Still, when I

cd /usr/obj/usr/src/sys/SMP_PAE/ && kgdb kernel.debug /var/crash/vmcore.0

I get a repeating:
kgdb: kvm_read: invalid address (0xc9ff5624)
kgdb: kvm_read: invalid address (0xc9ff8600)
kgdb: kvm_read: invalid address (0xc9ff5624)
kgdb: kvm_read: invalid address (0xc9ff8600)

The pattern repeats until I interrupt it.

Now, this box is in a lab; it is for testing (at this time), so I have
rather more flexibility than I might for a production system. The
product was built for FreeBSD 5.x; I have the ports/misc/compat-5x port
installed, and the product does run -- at least, until I start
stress-testing it. :-}

I could bring the box up to a more recent -STABLE fairly easily; for that
matter, I could probably bring it up to -CURRENT fairly easily, but I
have no intent to be running a production service on -CURRENT. (My
laptop? Sometimes. A production box in a colo? Uhh... maybe I'm just
not sufficiently daring, but no thanks. :-})

I'd appreciate suggestions (or pointers to same) as to how I might
proceed to determine what I can do to get the product to run reliably
in a FreeBSD environment.
(The vendor has suggested either Red Hat or
Suse Linux as more stable platforms, and has complained about an
inability to get debugging information from FreeBSD. I have pointed out
that there's been some progress of late on getting DTrace ported to
FreeBSD, and they've seemed at least somewhat interested, but in the
meantime....)

Anyway, I'll plan on summarizing off-list responses that are relevant.

Thanks!

Peace,
david
-- 
David H. Wolfskill				david@catwhisker.org
Doing business with spammers only encourages them. Please boycott spammers.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
Scott Long
2006-May-31 20:10 UTC
6.1-STABLE; Fatal trap 12: page fault while in kernel mode; kgdb isn't working??!?
David Wolfskill wrote:
> [...]
> May 31 16:02:13 mx-out05 kernel: Fatal trap 12: page fault while in kernel mode
> [...]
> Still, when I
>
> cd /usr/obj/usr/src/sys/SMP_PAE/ && kgdb kernel.debug /var/crash/vmcore.0
>
> I get a repeating:
> kgdb: kvm_read: invalid address (0xc9ff5624)
> kgdb: kvm_read: invalid address (0xc9ff8600)
> [...]

kgdb seems to be more broken than not. Could you enable KDB+DDB and at
least get a stack trace from the fault?

Scott
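
For readers following along, here is a minimal sketch of what enabling KDB+DDB might look like. It assumes a custom kernel configuration file such as the SMP_PAE one named in the uname output above; the exact contents of that file are not shown in this thread, so treat the placement and comments as illustrative rather than as the poster's actual setup.

```
# Hypothetical additions to the kernel config file (e.g. SMP_PAE).
# Rebuild and install the kernel after adding them.

makeoptions	DEBUG=-g	# build kernel.debug with debugging symbols
options 	KDB		# kernel debugger framework
options 	DDB		# interactive in-kernel debugger backend

# With these in place, a fatal trap should normally drop the console
# to a "db>" prompt instead of rebooting straight away.  At that
# prompt, the stack trace Scott asks for can be requested with:
#
#   db> trace
```

The output of `trace` has to be captured from the console (a serial console makes this much easier), since it is produced before any crash dump is written.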