----- John's Original Message -----> Hi Folks,
>
> I'm seeing a panic with the 9.2-PRERELEASE code. The system
> will stay up for anywhere from a couple of seconds to a few hours
> and then panic.
I have continued to look at this problem and believe I have made
some progress but still have no solution... Just questions :-)
I have updated my source to r255r73 which includes all the recent
MFCs to cam/xpt/scsi. The problem still persists.
I have attempted to simplify the problem and have been able to
duplicate part of the problem with just two shelves attached serially.
I have the following debugging diffs:
Index: scsi/scsi_xpt.c
==================================================================---
scsi/scsi_xpt.c (revision 255573)
+++ scsi/scsi_xpt.c (working copy)
@@ -2445,6 +2445,15 @@
case CDAI_TYPE_PHYS_PATH:
if (cdai->flags & CDAI_FLAG_STORE) {
if (device->physpath != NULL) {
+ /* In theory the following should not be the same - see
ses_setphyspath_callback() */
+ printf("incoming %d
%.*s\n",(int)cdai->bufsiz,(int)cdai->bufsiz,cdai->buf);
+ printf("current %d %.*s
%*s\n",(int)device->physpath_len,(int)device->physpath_len,device->physpath,device->serial_num_len,device->serial_num);
+
free(device->physpath, M_CAMXPT);
device->physpath = NULL;
}
and to ses_devids_iter():
Index: scsi/scsi_enc_ses.c
==================================================================---
scsi/scsi_enc_ses.c (revision 255573)
+++ scsi/scsi_enc_ses.c (working copy)
@@ -849,6 +849,7 @@
devid->length = sizeof(struct scsi_vpd_id_naa_ieee_reg);
memcpy(devid->identifier, phy_addr, devid->length);
+ printf("callback: i=%d device=%08x\n",i,*phy_addr);
callback(enc, elm, devid, callback_arg);
}
}
The first patch shows the currently stored versus the about
to be stored physpath. The 2nd patch adds some debug information
to show which phys (i value) is being processed.
What I find is a constant flip/flop between the device ids of
the 2 shelves being stored into what I believe is the first shelves
descriptor.
Sep 16 17:01:22 testsys1 kernel: callback: i=0 device=00000050
Sep 16 17:01:22 testsys1 last message repeated 23 times
Sep 16 17:01:22 testsys1 kernel: callback: i=0 device=00000020
Sep 16 17:01:22 testsys1 kernel: callback: i=1 device=0000002e
Sep 16 17:01:22 testsys1 kernel: callback: i=2 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=3 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=4 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=5 device=000000ff
Sep 16 17:01:22 testsys1 kernel: callback: i=6 device=00000000
Sep 16 17:01:22 testsys1 kernel: incoming 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:01:22 testsys1 kernel: current 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
Sep 16 17:01:22 testsys1 kernel: callback: i=7 device=00000000
5C7237P403 is the serial of the 1st shelf:
# camcontrol inquiry ses0
pass29: <HP D2700 SAS AJ941A 0147> Fixed Enclosure Services SCSI-5 device
pass29: Serial Number 5C7237P403
pass29: 600.000MB/s transfers, Command Queueing Enabled
# camcontrol inquiry ses1
pass55: <HP D2700 SAS AJ941A 0147> Fixed Enclosure Services SCSI-5 device
pass55: Serial Number 5C7236P2T5
pass55: 600.000MB/s transfers, Command Queueing Enabled
A sample grep to show the flip/flop'ing...
Sep 16 17:00:22 testsys1 kernel: incoming 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:00:22 testsys1 kernel: current 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
Sep 16 17:00:23 testsys1 kernel: incoming 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:00:23 testsys1 kernel: current 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
Sep 16 17:01:22 testsys1 kernel: incoming 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:01:22 testsys1 kernel: current 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
Sep 16 17:01:23 testsys1 kernel: incoming 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:01:23 testsys1 kernel: current 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
Sep 16 17:02:22 testsys1 kernel: incoming 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:02:22 testsys1 kernel: current 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
Sep 16 17:02:23 testsys1 kernel: incoming 60 id1,enc at n5001438023441700/type
at 9/slot at 1/elmdesc at 900080050002
Sep 16 17:02:23 testsys1 kernel: current 60 id1,enc at n5001438023478280/type
at 9/slot at 1/elmdesc at 900080050002 5C7237P403
I'm not sure why this shows up for i=6...
After alot of reading & experimenting, I saw the following
when running the: sg_ses -p ed ses0 command...
# sg_ses -p ed ses0
HP D2700 SAS AJ941A 0147
Primary enclosure logical identifier (hex): 5001438023478280
Element Descriptor In diagnostic page:
generation code: 0x0
element descriptor by type list
Element type: Array device slot, subenclosure id: 0 [ti=0]
Overall descriptor: <empty>
Element 0 descriptor: <empty>
...
Element 23 descriptor: <empty>
Element 24 descriptor: <empty>
Element type: SAS connector, subenclosure id: 0 [ti=1]
Overall descriptor: <empty>
Element 0 descriptor: <empty>
...
Element 23 descriptor: <empty>
Element 24 descriptor: <empty>
Element type: vendor specific [0xff], subenclosure id: 0 [ti=2]
Overall descriptor: <empty>
Element 0 descriptor: 5C7237P403
Element type: Power supply, subenclosure id: 0 [ti=3]
Overall descriptor: <empty>
Element 0 descriptor: 20LiteOn 5ANLE0CLL3K6VI
Element 1 descriptor: 20LiteOn 5ANLE0CLL3K6V7
Element type: Cooling, subenclosure id: 0 [ti=4]
Overall descriptor: B PAVCA0B9V3E4I0
Element 0 descriptor: <empty>
Element 1 descriptor: <empty>
Element type: Cooling, subenclosure id: 0 [ti=5]
Overall descriptor: B PAVCA0B9V3E4HH
Element 0 descriptor: <empty>
Element 1 descriptor: <empty>
Element type: Enclosure services controller electronics, subenclosure id: 0
[ti=6]
Overall descriptor: <empty>
Element 0 descriptor: 014701B0PAZAV0BTM3F17A 0026
...
Could the enclosure services controller at ti=6 somehow be
related to what I'm seeing?
If anyone has any ideas I'd appreciate any ideas, pointers,etc.
Thanks,
John
ps: full output from smp_discover for both shelves..
# smp_discover ses0
phy 0:D:attached:[500003944822dd2a:00 t(SSP)] 6 Gbps
phy 1:D:attached:[500003944822dc96:00 t(SSP)] 6 Gbps
phy 2:D:attached:[500003944822dcbe:00 t(SSP)] 6 Gbps
phy 3:D:attached:[500003944822f8a2:00 t(SSP)] 6 Gbps
phy 4:D:attached:[500003944822d876:00 t(SSP)] 6 Gbps
phy 5:D:attached:[500003944822d776:00 t(SSP)] 6 Gbps
phy 6:D:attached:[5000039448232726:00 t(SSP)] 6 Gbps
phy 7:D:attached:[500003944823b8aa:00 t(SSP)] 6 Gbps
phy 8:D:attached:[500003944823b846:00 t(SSP)] 6 Gbps
phy 9:D:attached:[500003944822d7ae:00 t(SSP)] 6 Gbps
phy 10:D:attached:[500003944823b2f2:00 t(SSP)] 6 Gbps
phy 11:D:attached:[500003944823b04a:00 t(SSP)] 6 Gbps
phy 12:D:attached:[5000039448239ff2:00 t(SSP)] 6 Gbps
phy 13:D:attached:[500003944823db72:00 t(SSP)] 6 Gbps
phy 14:D:attached:[500003944823b7a6:00 t(SSP)] 6 Gbps
phy 15:D:attached:[500003944823dcc6:00 t(SSP)] 6 Gbps
phy 16:D:attached:[500003944823787a:00 t(SSP)] 6 Gbps
phy 17:D:attached:[500003944823b30a:00 t(SSP)] 6 Gbps
phy 18:D:attached:[500003944823b0de:00 t(SSP)] 6 Gbps
phy 19:D:attached:[500003944823b506:00 t(SSP)] 6 Gbps
phy 20:D:attached:[500003944823e7b6:00 t(SSP)] 6 Gbps
phy 21:D:attached:[5000039448232e7a:00 t(SSP)] 6 Gbps
phy 22:D:attached:[500003944823cfde:00 t(SSP)] 6 Gbps
phy 23:D:attached:[500003944823dc52:00 t(SSP)] 6 Gbps
phy 24:D:attached:[500003944823c9e6:00 t(SSP)] 6 Gbps
phy 25:S:attached:[500605b0050a7660:03 i(SSP+STP+SMP)] 6 Gbps
phy 26:S:attached:[500605b0050a7660:02 i(SSP+STP+SMP)] 6 Gbps
phy 27:S:attached:[500605b0050a7660:01 i(SSP+STP+SMP)] 6 Gbps
phy 28:S:attached:[500605b0050a7660:00 i(SSP+STP+SMP)] 6 Gbps
phy 29:T:attached:[500143802344173f:25 exp i(SMP) t(SMP)] 6 Gbps
phy 30:T:attached:[500143802344173f:26 exp i(SMP) t(SMP)] 6 Gbps
phy 31:T:attached:[500143802344173f:27 exp i(SMP) t(SMP)] 6 Gbps
phy 32:T:attached:[500143802344173f:28 exp i(SMP) t(SMP)] 6 Gbps
phy 33: inaccessible (phy vacant)
phy 34: inaccessible (phy vacant)
phy 35: inaccessible (phy vacant)
phy 36:D:attached:[50014380234782be:36 V i(SSP) t(SSP)] 6 Gbps
# smp_discover ses1
phy 0:D:attached:[500003944823cbfa:00 t(SSP)] 6 Gbps
phy 1:D:attached:[500003944823ddbe:00 t(SSP)] 6 Gbps
phy 2:D:attached:[500003944823d9ee:00 t(SSP)] 6 Gbps
phy 3:D:attached:[500003944823c2f6:00 t(SSP)] 6 Gbps
phy 4:D:attached:[500003944823c2ca:00 t(SSP)] 6 Gbps
phy 5:D:attached:[500003944823d8c2:00 t(SSP)] 6 Gbps
phy 6:D:attached:[500003944823cc3e:00 t(SSP)] 6 Gbps
phy 7:D:attached:[500003944823d3e2:00 t(SSP)] 6 Gbps
phy 8:D:attached:[500003944823d8da:00 t(SSP)] 6 Gbps
phy 9:D:attached:[500003944823c39a:00 t(SSP)] 6 Gbps
phy 10:D:attached:[500003944823d896:00 t(SSP)] 6 Gbps
phy 11:D:attached:[500003944823dbb6:00 t(SSP)] 6 Gbps
phy 12:D:attached:[500003944823c6a2:00 t(SSP)] 6 Gbps
phy 13:D:attached:[500003944823dc86:00 t(SSP)] 6 Gbps
phy 14:D:attached:[500003944823e7e2:00 t(SSP)] 6 Gbps
phy 15:D:attached:[500003944823cae6:00 t(SSP)] 6 Gbps
phy 16:D:attached:[500003944823d44a:00 t(SSP)] 6 Gbps
phy 17:D:attached:[500003944823d27a:00 t(SSP)] 6 Gbps
phy 18:D:attached:[5000039448234122:00 t(SSP)] 6 Gbps
phy 19:D:attached:[500003944823b17e:00 t(SSP)] 6 Gbps
phy 20:D:attached:[500003944823c6fe:00 t(SSP)] 6 Gbps
phy 21:D:attached:[500003944823435e:00 t(SSP)] 6 Gbps
phy 22:D:attached:[500003944823c776:00 t(SSP)] 6 Gbps
phy 23:D:attached:[50000394482341da:00 t(SSP)] 6 Gbps
phy 24:D:attached:[500003944823cc62:00 t(SSP)] 6 Gbps
phy 25:S:attached:[50014380234782bf:29 exp i(SMP) t(SMP)] 6 Gbps
phy 26:S:attached:[50014380234782bf:30 exp i(SMP) t(SMP)] 6 Gbps
phy 27:S:attached:[50014380234782bf:31 exp i(SMP) t(SMP)] 6 Gbps
phy 28:S:attached:[50014380234782bf:32 exp i(SMP) t(SMP)] 6 Gbps
phy 33: inaccessible (phy vacant)
phy 34: inaccessible (phy vacant)
phy 35: inaccessible (phy vacant)
phy 36:D:attached:[500143802344173e:36 V i(SSP) t(SSP)] 6 Gbps
# camcontrol smpmaninfo ses0
Report Manufacturer Information
Expander Change count: 1815
SAS 1.1 Format: Yes
<HP D2700 SAS AJ941A 0147>
Component Vendor: PMCSIERA
Component ID: 0x8005
Component Revision: 0x2
Vendor Specific: 0x0000000000000000
> Fatal trap 12: page fault while in kernel mode
> cpuid = 31; apic id = 2f
> fault virtual address = 0x0
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff80d2b018
> stack pointer = 0x28:0xffffffbfd0fea080
> frame pointer = 0x28:0xffffffbfd0fea0b0
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 25 (enc_daemon7)
>
> and:
>
> db:0:kdb.enter.default> show pcpu
> cpuid = 31
> dynamic pcpu = 0xffffff807f203880
> curthread = 0xfffffe0032f53920: pid 25 "enc_daemon7"
> curpcb = 0xffffffbfd0feabc0
> fpcurthread = none
> idlethread = 0xfffffe002600b920: tid 100034 "idle: cpu31"
> curpmap = 0xffffffff8141b510
> tssp = 0xffffffff81489e98
> commontssp = 0xffffffff81489e98
> rsp0 = 0xffffffbfd0feabc0
> gs32p = 0xffffffff81487fd0
> ldt = 0xffffffff81488010
> tss = 0xffffffff81488000
>
>
>
> This looks like a bug I started tracing down a while back with
> the new enclosure services (r246437 and later). I added witness
> into the kernel and received the following LOR:
>
>
> Kernel page fault with the following non-sleepable locks held:
> exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0
(0xffffff8003c851b8) locked @ cam/cam_periph.h:192
> KDB: stack backtrace:
> db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame
0xffffffbfd0f3cb20
> kdb_backtrace() at kdb_backtrace+0x37/frame 0xffffffbfd0f3cbe0
> _witness_debugger() at _witness_debugger+0x2c/frame 0xffffffbfd0f3cc00
> witness_warn() at witness_warn+0x2d2/frame 0xffffffbfd0f3cd40
> trap_pfault() at trap_pfault+0x6a/frame 0xffffffbfd0f3cdd0
> trap() at trap+0x344/frame 0xffffffbfd0f3cfd0
> calltrap() at calltrap+0x8/frame 0xffffffbfd0f3cfd0
> --- trap 0xc, rip = 0xffffffff80ca8478, rsp = 0xffffffbfd0f3d090, rbp =
0xffffffbfd0f3d0c0 ---
> memcpy() at memcpy+0x8/frame 0xffffffbfd0f3d0c0
> ses_setphyspath_callback() at ses_setphyspath_callback+0xb3/frame
0xffffffbfd0f3d1d0
> ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame
0xffffffbfd0f3d770
> ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffbfd0f3d7f0
> ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffbfd0f3d810
> ses_publish_physpaths() at ses_publish_physpaths+0x264/frame
0xffffffbfd0f3da40
> enc_daemon() at enc_daemon+0x2a4/frame 0xffffffbfd0f3daa0
> fork_exit() at fork_exit+0x11d/frame 0xffffffbfd0f3daf0
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffbfd0f3daf0
> --- trap 0, rip = 0, rsp = 0xffffffbfd0f3dbb0, rbp = 0 ---
>
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 8; apic id = 08
> fault virtual address = 0x0
> fault code = supervisor read data, page not present
> instruction pointer = 0x20:0xffffffff80ca8478
> stack pointer = 0x28:0xffffffbfd0f3d090
> frame pointer = 0x28:0xffffffbfd0f3d0c0
> code segment = base 0x0, limit 0xfffff, type 0x1b
> = DPL 0, pres 1, long 1, def32 0, gran 1
> processor eflags = interrupt enabled, resume, IOPL = 0
> current process = 30 (enc_daemon12)
> lock order reversal: (Giant after non-sleepable)
> 1st 0xffffff8003c851b8 MPT2SAS lock (MPT2SAS lock) @ cam/cam_periph.h:192
> 2nd 0xffffffff8139bc80 Giant (Giant) @ dev/usb/input/ukbd.c:1942
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> I'm wondering if there is a bad interaction here.
>
>
> The system has 8 DS2700 shelves dual attached to a pair
> of LSI 8e cards, thus the kernel configuration with an increased
> msgbuf size.
>
> Kernel conf:
>
> include GENERIC
> ident ZFS
> options DDB
> options KDB
> options WITNESS
> options MSGBUF_SIZE=(32768*16)
>
> And some ddb output:
>
> db:0:kdb.enter.default> run lockinfo
> db:1:lockinfo> show locks
> exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0
(0xffffff8003c851b8) locked @ cam/cam_periph.h:192
> db:1:locks> show alllocks
> Process 30 (enc_daemon12) thread 0xfffffe003421a000 (100155)
> exclusive sleep mutex MPT2SAS lock (MPT2SAS lock) r = 0
(0xffffff8003c851b8) locked @ cam/cam_periph.h:192
> db:1:alllocks> show lockedvnods
> Locked vnodes
> db:0:kdb.enter.default> show pcpu
> cpuid = 8
> dynamic pcpu = 0xffffff807f1e4800
> curthread = 0xfffffe003421a000: pid 30 "enc_daemon12"
> curpcb = 0xffffffbfd0f3dbc0
> fpcurthread = none
> idlethread = 0xfffffe0021ffe490: tid 100011 "idle: cpu8"
> curpmap = 0xffffffff81399590
> tssp = 0xffffffff815a5640
> commontssp = 0xffffffff815a5640
> rsp0 = 0xffffffbfd0f3dbc0
> gs32p = 0xffffffff815a3778
> ldt = 0xffffffff815a37b8
> tss = 0xffffffff815a37a8
> spin locks held:
> db:0:kdb.enter.default> bt
> Tracing pid 30 tid 100155 td 0xfffffe003421a000
> memcpy() at memcpy+0x8/frame 0xffffffbfd0f3d0c0
> ses_setphyspath_callback() at ses_setphyspath_callback+0xb3/frame
0xffffffbfd0f3d1d0
> ses_path_iter_devid_callback() at ses_path_iter_devid_callback+0x1c6/frame
0xffffffbfd0f3d770
> ses_devids_iter() at ses_devids_iter+0xb1/frame 0xffffffbfd0f3d7f0
> ses_paths_iter() at ses_paths_iter+0x20/frame 0xffffffbfd0f3d810
> ses_publish_physpaths() at ses_publish_physpaths+0x264/frame
0xffffffbfd0f3da40
> enc_daemon() at enc_daemon+0x2a4/frame 0xffffffbfd0f3daa0
> fork_exit() at fork_exit+0x11d/frame 0xffffffbfd0f3daf0
> fork_trampoline() at fork_trampoline+0xe/frame 0xffffffbfd0f3daf0
> --- trap 0, rip = 0, rsp = 0xffffffbfd0f3dbb0, rbp = 0 ---
>
> Any thoughts/ideas are appreciated. I've reviewed the code and
> don't see anything obvious.
>
> Thanks,
> John
>
>
>