Hello,
We're setting up a new mail server infrastructure and decided to run it
on ZFS. On an E220R with a D1000, I've set up a storage pool with four
mirrors:
--------------------------------------------------------------
root@newponit # zpool status
  pool: pool0
 state: ONLINE
 scrub: none requested
config:
        NAME         STATE     READ WRITE CKSUM
        pool0        ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t0d0   ONLINE       0     0     0
            c5t8d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t1d0   ONLINE       0     0     0
            c5t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t2d0   ONLINE       0     0     0
            c5t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t3d0   ONLINE       0     0     0
            c5t11d0  ONLINE       0     0     0
errors: No known data errors
--------------------------------------------------------------
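For reference, a pool laid out like this is built with a single zpool
create; the command below is illustrative rather than an exact transcript
of what we ran:
--------------------------------------------------------------
# four two-way mirrors, pairing a disk from each SCSI bus
zpool create pool0 \
    mirror c1t0d0 c5t8d0 \
    mirror c1t1d0 c5t9d0 \
    mirror c1t2d0 c5t10d0 \
    mirror c1t3d0 c5t11d0
--------------------------------------------------------------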
Before installing any software on it, we decided to see how ZFS behaves
when something goes wrong, so we pulled out a disk while a mkfile was
running. What happened was not what we expected: the system hung for more
than an hour and finally it panicked.
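The load itself was nothing special; roughly the following, where the
file size and path are just examples (the disk we pulled was c5t9d0 in
the D1000):
--------------------------------------------------------------
# keep a large sequential write going on the pool ...
mkfile 10240m /pool0/bigfile &
# ... then physically pull c5t9d0 while mkfile is still running
--------------------------------------------------------------
The console messages leading up to the panic: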
--------------------------------------------------------------
Jan 23 18:49:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
got SCSI bus reset
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 18:50:36 newponit 	Cmd (0x60000a3ed10) dump for Target 1 Lun 0:
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 18:50:36 newponit 		cdb=[ 0x2a 0x0 0x2 0x1b 0x2c 0x93 0x0 0x0 0x1
0x0 ]
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 18:50:36 newponit 	pkt_flags=0xc000 pkt_statistics=0x60 pkt_state=0x7
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 18:50:36 newponit 	pkt_scbp=0x0 cmd_flags=0x1860
Jan 23 18:50:36 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit 	Disconnected tagged cmd(s) (1) timeout for
Target 1.0
Jan 23 18:50:36 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
fault detected in device; service still available
Jan 23 18:50:36 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
Disconnected tagged cmd(s) (1) timeout for Target 1.0
Jan 23 18:50:36 newponit glm: [ID 401478 kern.warning] WARNING:
ID[SUNWpd.glm.cmd_timeout.6018]
Jan 23 18:50:36 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit 	got SCSI bus reset
Jan 23 18:50:36 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
fault detected in device; service still available
Jan 23 18:50:36 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
got SCSI bus reset
Jan 23 18:50:36 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@3/sd@1,0 (sd0):
Jan 23 18:50:36 newponit 	SCSI transport failed: reason
'timeout': giving up
Jan 23 18:50:36 newponit md: [ID 312844 kern.warning] WARNING: md: state
database commit failed
Jan 23 18:50:36 newponit last message repeated 1 time
Jan 23 18:51:38 newponit unix: [ID 836849 kern.notice]
Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
lack of DiskSuite state
Jan 23 18:51:38 newponit  database replicas. Fewer than 50% of the total
were available,
Jan 23 18:51:38 newponit  so panic to ensure data integrity.
Jan 23 18:51:38 newponit unix: [ID 100000 kern.notice]
Jan 23 18:51:38 newponit genunix: [ID 723222 kern.notice]
000002a1003c1230 md:mddb_commitrec_wrapper+a8 (a, 30000e81600, 18e9250,
12ecc00, 18e9000, 1)
Jan 23 18:51:38 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000000030 0000000000000000 0000000000000002 0000060000a8e6c8
Jan 23 18:51:38 newponit   %l4-7: 0000000000000000 00000000012ecf48
0000000000000002 00000000012ecc00
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice]
000002a1003c12e0 md_mirror:mirror_mark_resync_region+290 (0, 0,
600008dacc0, 600008da980, 0, 1)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000000000 00000600008e9e80 0000000000000001 0000000000000000
Jan 23 18:51:39 newponit   %l4-7: 0000000000000001 0000000000000000
000000000183d400 0000000000000002
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice]
000002a1003c1390 md_mirror:mirror_write_strategy+5c0 (60000885108, 0, 0,
0, 600008dad20, 0)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000000000 00000300000c33b8 00000600008f5e08 0000000000008000
Jan 23 18:51:39 newponit   %l4-7: 00000000018e9a28 00000600008da980
0000000000000000 0000000000000000
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice]
000002a1003c1440 md:mdstrategy+d4 (60000885108, 1, 18e9800, 2000101,
2000101, 18e4400)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3:
00000000018e4800 00000000018e4888 0000000002000101 000000000130f500
Jan 23 18:51:39 newponit   %l4-7: 00000000018eafa8 00000600008c3dc8
0000000000000008 0000000000000001
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice]
000002a1003c14f0 ufs:ldl_strategy+1a0 (1, 60002d92bc0, 12382a0, 18bcb58,
0, 0)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3:
00000300003e3e00 0000060000883f20 0000000000000400 0000000000000370
Jan 23 18:51:39 newponit   %l4-7: 0000000000000000 000002a1003c15d0
000002a1003c15d8 0000060000885108
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice]
000002a1003c15e0 ufs:push_dirty_bp+c (60000859840, 60002d92bc0, 0, 0, 0,
30000e8e000)
Jan 23 18:51:40 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000000000 000000000006dc00 000000000000036e 0000000000000000
Jan 23 18:51:40 newponit   %l4-7: 0000030000e8e000 0000000000000000
000000000183d400 00000000018aa698
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice]
000002a1003c1690 ufs:logmap_commit+8c (a, 32, 33, 16, 600008595c0,
60000859840)
Jan 23 18:51:40 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000001770 0000000000000000 00000000018a6800 0000000045b636bd
Jan 23 18:51:40 newponit   %l4-7: 0000000045b636bc 00000000018a5800
0000000000000000 0000000000000000
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice]
000002a1003c17b0 ufs:top_end_sync+e4 (0, 2a1003c193c, 60000859960, 32,
60000859840, 600008596a0)
Jan 23 18:51:40 newponit genunix: [ID 179002 kern.notice]   %l0-3:
00000600008595c0 0000000000000000 00000600008596b0 0000000000000000
Jan 23 18:51:40 newponit   %l4-7: 0000000000000000 0000000000000033
0000000000000003 0000000000000001
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice]
000002a1003c1880 ufs:ufs_update+2f4 (0, 60000a8e6c8, 1883a68,
300003e3e00, 1d, 2a1003c193c)
Jan 23 18:51:41 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000000030 0000000000000000 0000000000000002 0000060000a8e6c8
Jan 23 18:51:41 newponit   %l4-7: 000000000000ffbf 0000000000000000
0000060001b02000 0000000000000000
Jan 23 18:51:41 newponit genunix: [ID 723222 kern.notice]
000002a1003c1940 ufs:ufs_sync+2c (0, 1, 0, 1, 185de00, 1000000000000)
Jan 23 18:51:41 newponit genunix: [ID 179002 kern.notice]   %l0-3:
0000000000001497 000000000002ecfb 0000000000000000 000000000014c2e1
Jan 23 18:51:41 newponit   %l4-7: 00000000008bbccc 000000000000213f
00000000008bde0b 0000000000000001
Jan 23 18:51:41 newponit genunix: [ID 723222 kern.notice]
000002a1003c1a00 genunix:fsflush+4e0 (2, 18a5800, 1864f20, 1080,
185de28, 8000)
Jan 23 18:51:41 newponit genunix: [ID 179002 kern.notice]   %l0-3:
000000000185dda8 0000000000000bb8 0000000000000001 00000000018ad2a8
Jan 23 18:51:41 newponit   %l4-7: 000000000185dea8 000000000185ee28
0000000000000100 0000000000000100
Jan 23 18:51:41 newponit unix: [ID 100000 kern.notice]
Jan 23 18:51:41 newponit genunix: [ID 672855 kern.notice] syncing file
systems...
Jan 23 18:51:41 newponit genunix: [ID 904073 kern.notice]  done
Jan 23 18:51:42 newponit genunix: [ID 111219 kern.notice] dumping to
/dev/md/dsk/d1, offset 215220224, content: kernel
Jan 23 18:52:10 newponit genunix: [ID 409368 kern.notice] ^M100% done:
51733 pages dumped, compression ratio 9.67,
Jan 23 18:52:10 newponit genunix: [ID 851671 kern.notice] dump succeeded
--------------------------------------------------------------
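The dump succeeded, so there is a crash dump to look at; a quick
post-mortem triage sketch, assuming savecore writes the files to the
default /var/crash/newponit directory:
--------------------------------------------------------------
cd /var/crash/newponit
mdb unix.0 vmcore.0
> ::status        # dump type, panic string, OS release
> ::panicinfo     # panic summary and register state
> ::msgbuf        # kernel messages leading up to the panic
> $c              # stack trace of the panicking thread
> $q
--------------------------------------------------------------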
After the system booted up, one disk was missing from the ZFS pool. But
what was really interesting was this:
--------------------------------------------------------------
root@newponit # metastat
d3: Mirror
    Submirror 0: d13
      State: Needs maintenance
    Submirror 1: d23
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 4198392 blocks (2.0 GB)
d13: Submirror of d3
    State: Needs maintenance
    Invoke: metasync d3
    Size: 4198392 blocks (2.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s3          0     No            Okay   Yes
d23: Submirror of d3
    State: Needs maintenance
    Invoke: metasync d3
    Size: 4198392 blocks (2.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s3          0     No            Okay   Yes
d1: Mirror
    Submirror 0: d11
      State: Needs maintenance
    Submirror 1: d21
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 2101552 blocks (1.0 GB)
d11: Submirror of d1
    State: Needs maintenance
    Invoke: metasync d1
    Size: 2101552 blocks (1.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s1          0     No            Okay   Yes
d21: Submirror of d1
    State: Needs maintenance
    Invoke: metasync d1
    Size: 2101552 blocks (1.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s1          0     No            Okay   Yes
d0: Mirror
    Submirror 0: d10
      State: Needs maintenance
    Submirror 1: d20
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 29035344 blocks (13 GB)
d10: Submirror of d0
    State: Needs maintenance
    Invoke: metasync d0
    Size: 29035344 blocks (13 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s0          0     No            Okay   Yes
d20: Submirror of d0
    State: Needs maintenance
    Invoke: metasync d0
    Size: 29035344 blocks (13 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s0          0     No            Okay   Yes
Device Relocation Information:
Device   Reloc  Device ID
c0t1d0   Yes    id1,sd@SSEAGATE_ST318305LSUN18G_3JKQ57TD000023076LHH
c0t0d0   Yes    id1,sd@SSEAGATE_ST318404LSUN18G_3BT2PJ950000220439SK
--------------------------------------------------------------
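The metastat output itself names the recovery step; once the box was back
up, bringing the submirrors back should presumably just be a matter of:
--------------------------------------------------------------
metasync d0
metasync d1
metasync d3
metastat | grep -i state   # wait until the submirrors report Okay again
--------------------------------------------------------------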
Why did the system hang for such a long time, why did I get a problem
with the internal disks, and why did the system panic?
Full messages: http://ihsan.dogan.ch/files/messages
format:
--------------------------------------------------------------
       0. c0t0d0 <SEAGATE-ST318404LSUN18G-4203 cyl 7506 alt 2 hd 19 sec 248>
          /pci@1f,4000/scsi@3/sd@0,0
       1. c0t1d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
          /pci@1f,4000/scsi@3/sd@1,0
       2. c1t0d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@0,0
       3. c1t1d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@1,0
       4. c1t2d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@2,0
       5. c1t3d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@3,0
       6. c5t8d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@8,0
       7. c5t9d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@9,0
       8. c5t10d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@a,0
       9. c5t11d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@b,0
--------------------------------------------------------------
Ihsan
-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/
			http://gallery.dogan.ch/

Ihsan Dogan wrote:
> Hello,
>
> We're setting up a new mail server infrastructure and decided to run it
> on ZFS. On an E220R with a D1000, I've set up a storage pool with four
> mirrors:
>
> --------------------------------------------------------------
> root@newponit # zpool status
>   pool: pool0
>  state: ONLINE
>  scrub: none requested
> config:

[...]

> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
> lack of DiskSuite state
> Jan 23 18:51:38 newponit  database replicas. Fewer than 50% of the total
> were available,
> Jan 23 18:51:38 newponit  so panic to ensure data integrity.

this message shows (and the rest of the stack prove) that your panic
happened in SVM. It has NOTHING to do with zfs. So either you pulled the
wrong disk, or the disk you pulled also contained SVM volumes (next to
ZFS).

-- 
Michael Schuster	Sun Microsystems, Inc.
Recursion, n.: see 'Recursion'

> Hello,
>
> We're setting up a new mail server infrastructure and decided to run it
> on ZFS. On an E220R with a D1000, I've set up a storage pool with four
> mirrors:

Good morning Ihsan ...

I see that you have everything mirrored here, that's excellent.

When you pulled a disk, was it a disk that was containing a metadevice or
was it a disk in the zpool ? In the case of a metadevice, as you know, the
system should have kept running fine. We have probably both done this over
and over at various sites to demonstrate SVM to people.

If you pulled out a device in the zpool, well now we are in a whole new
world and I had heard that there was some *feature* in Solaris now that
will protect the ZFS file system integrity by simply causing a system to
panic if the last device in some redundant component was compromised.

I think you hit a major bug in ZFS personally.

Dennis

Afternoon,

The panic looks due to the fact that your SVM state databases aren't
all there, so when we came to update one of them we found there was
<= 50% of the state databases and crashed.

This doesn't look like anything to do with ZFS.
I'd check the output from metadb and see if it looks like
you've got a SVM database on a disk that's also in use by ZFS.

> Jan 23 18:50:36 newponit 	SCSI transport failed: reason 'timeout':
> giving up
> Jan 23 18:50:36 newponit md: [ID 312844 kern.warning] WARNING: md: state
> database commit failed
> Jan 23 18:50:36 newponit last message repeated 1 time
> Jan 23 18:51:38 newponit unix: [ID 836849 kern.notice]
> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
> lack of DiskSuite state
> Jan 23 18:51:38 newponit  database replicas. Fewer than 50% of the total
> were available,
> Jan 23 18:51:38 newponit  so panic to ensure data integrity.
>

Regards,
Jason

Hello Michael,

Am 24.1.2007 14:36 Uhr, Michael Schuster schrieb:

>> --------------------------------------------------------------
>> root@newponit # zpool status
>>   pool: pool0
>>  state: ONLINE
>>  scrub: none requested
>> config:
>
> [...]
>
>> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
>> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
>> lack of DiskSuite state
>> Jan 23 18:51:38 newponit  database replicas. Fewer than 50% of the total
>> were available,
>> Jan 23 18:51:38 newponit  so panic to ensure data integrity.
>
> this message shows (and the rest of the stack prove) that your panic
> happened in SVM. It has NOTHING to do with zfs. So either you pulled the
> wrong disk, or the disk you pulled also contained SVM volumes (next to
> ZFS).

I noticed that the panic was in SVM and I'm wondering why the machine
was hanging. SVM is only running on the internal disks (c0) and I pulled
a disk from the D1000:

Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@5/sd@9,0 (sd50):
Jan 23 17:24:14 newponit 	SCSI transport failed: reason 'incomplete':
retrying command
Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@5/sd@9,0 (sd50):
Jan 23 17:24:14 newponit 	disk not responding to selection
Jan 23 17:24:18 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@5/sd@9,0 (sd50):
Jan 23 17:24:18 newponit 	disk not responding to selection

This is clearly the disk with ZFS on it: SVM has nothing to do with this
disk. A minute later, the trouble started with the internal disks:

Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 17:25:26 newponit 	Cmd (0x60000a3ed10) dump for Target 0 Lun 0:
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 17:25:26 newponit 	cdb=[ 0x28 0x0 0x0 0x78 0x6 0x30 0x0 0x0 0x10
0x0 ]
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 17:25:26 newponit 	pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
(glm0):
Jan 23 17:25:26 newponit 	pkt_scbp=0x0 cmd_flags=0x860
Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit 	Disconnected tagged cmd(s) (1) timeout for
Target 0.0
Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
fault detected in device; service still available
Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
Disconnected tagged cmd(s) (1) timeout for Target 0.0
Jan 23 17:25:26 newponit glm: [ID 401478 kern.warning] WARNING:
ID[SUNWpd.glm.cmd_timeout.6018]
Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING:
/pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit 	got SCSI bus reset
Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
fault detected in device; service still available

The SVM and ZFS disks are on separate SCSI buses, so theoretically there
shouldn't be any impact on the SVM disks when I pull out a ZFS disk.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/
			http://gallery.dogan.ch/

> Hello Michael,
>
> Am 24.1.2007 14:36 Uhr, Michael Schuster schrieb:
>
>>> --------------------------------------------------------------
>>> root@newponit # zpool status
>>>   pool: pool0
>>>  state: ONLINE
>>>  scrub: none requested
>>> config:
>>
>> [...]
>>
>>> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
>>> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
>>> lack of DiskSuite state
>>> Jan 23 18:51:38 newponit  database replicas. Fewer than 50% of the total
>>> were available,
>>> Jan 23 18:51:38 newponit  so panic to ensure data integrity.
>>
>> this message shows (and the rest of the stack prove) that your panic
>> happened in SVM. It has NOTHING to do with zfs. So either you pulled the
>> wrong disk, or the disk you pulled also contained SVM volumes (next to
>> ZFS).
>
> I noticed that the panic was in SVM and I'm wondering why the machine
> was hanging. SVM is only running on the internal disks (c0) and I pulled
> a disk from the D1000:

so the device that was affected had nothing to do with SVM at all.

fine ... I have the exact same config here. Internal SVM and then
external ZFS on two disk arrays on two controllers.

> Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@5/sd@9,0 (sd50):
> Jan 23 17:24:14 newponit 	SCSI transport failed: reason 'incomplete':
> retrying command
> Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@5/sd@9,0 (sd50):
> Jan 23 17:24:14 newponit 	disk not responding to selection
> Jan 23 17:24:18 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@5/sd@9,0 (sd50):
> Jan 23 17:24:18 newponit 	disk not responding to selection
>
> This is clearly the disk with ZFS on it: SVM has nothing to do with this
> disk. A minute later, the trouble started with the internal disks:

Okay .. so are we back to looking at ZFS, or ZFS and the SVM components,
or some interaction between these kernel modules. At this point I have to
be careful not to fall into a pit of blind ignorance as I grope for the
answer.

Perhaps some data would help. Was there a core file in /var/crash/newponit ?

> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit 	Cmd (0x60000a3ed10) dump for Target 0 Lun 0:
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit 	cdb=[ 0x28 0x0 0x0 0x78 0x6 0x30 0x0 0x0 0x10
> 0x0 ]
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit 	pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit 	pkt_scbp=0x0 cmd_flags=0x860
> Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3 (glm0):
> Jan 23 17:25:26 newponit 	Disconnected tagged cmd(s) (1) timeout for
> Target 0.0

so a pile of scsi noise above there .. one would expect that from a
suddenly missing scsi device.

> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
> fault detected in device; service still available
> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
> Disconnected tagged cmd(s) (1) timeout for Target 0.0

NCR scsi controllers .. what OS revision is this ? Solaris 10 u 3 ?
Solaris Nevada snv_55b ?

> Jan 23 17:25:26 newponit glm: [ID 401478 kern.warning] WARNING:
> ID[SUNWpd.glm.cmd_timeout.6018]
> Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3 (glm0):
> Jan 23 17:25:26 newponit 	got SCSI bus reset
> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
> fault detected in device; service still available
>
> The SVM and ZFS disks are on separate SCSI buses, so theoretically there
> shouldn't be any impact on the SVM disks when I pull out a ZFS disk.

I still feel that you hit a bug in ZFS somewhere. Under no circumstances
should a Solaris server panic and crash simply because you pulled out a
single disk that was totally mirrored. In fact .. I will reproduce those
conditions here and then see what happens for me.

Dennis

Hello,

Am 24.1.2007 14:40 Uhr, Dennis Clarke schrieb:

>> We're setting up a new mail server infrastructure and decided to run it
>> on ZFS. On an E220R with a D1000, I've set up a storage pool with four
>> mirrors:
>
> Good morning Ihsan ...
>
> I see that you have everything mirrored here, that's excellent.
>
> When you pulled a disk, was it a disk that was containing a metadevice or
> was it a disk in the zpool ? In the case of a metadevice, as you know, the
> system should have kept running fine. We have probably both done this over
> and over at various sites to demonstrate SVM to people.
>
> If you pulled out a device in the zpool, well now we are in a whole new
> world and I had heard that there was some *feature* in Solaris now that
> will protect the ZFS file system integrity by simply causing a system to
> panic if the last device in some redundant component was compromised.

The disk was in a zpool. The SVM disks are on a separate SCSI bus, so
they can't disturb each other.

> I think you hit a major bug in ZFS personally.

For me it also looks like a bug.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/
			http://gallery.dogan.ch/

Ihsan Dogan wrote:
>> I think you hit a major bug in ZFS personally.
>
> For me it also looks like a bug.

I think we don't have enough information to judge. If you have a supported
version of Solaris, open a case and supply all the data (crash dump!) you
have.

HTH
-- 
Michael Schuster	Sun Microsystems, Inc.
Recursion, n.: see 'Recursion'
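Capturing the crash dump Michael asks for means making sure the dump is
actually saved after the panic; a minimal sketch, assuming the default
dump configuration on Solaris 10:
--------------------------------------------------------------
dumpadm                          # verify dump device and savecore directory
mkdir -p /var/crash/newponit
savecore -v /var/crash/newponit  # extract unix.N/vmcore.N if not done at boot
--------------------------------------------------------------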

Hello,

Am 24.1.2007 14:49 Uhr, Jason Banham schrieb:

> The panic looks due to the fact that your SVM state databases aren't
> all there, so when we came to update one of them we found there
> was <= 50% of the state databases and crashed.

The metadbs are fine. I haven't touched them at all:

root@newponit # metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s7
     a    p  luo        16              8192            /dev/dsk/c0t1d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s7

> This doesn't look like anything to do with ZFS.
> I'd check the output from metadb and see if it looks like
> you've got a SVM database on a disk that's also in use by ZFS.

The question is still: why is the system panicking? I have now pulled out
a different disk, which is for sure on ZFS and not on SVM. The system
still runs, but I can't log in anymore and the console doesn't work at
all anymore. Even if it has nothing to do with ZFS, I don't think this is
normal behavior.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/
			http://gallery.dogan.ch/

Am 24.1.2007 14:59 Uhr, Dennis Clarke schrieb:

>> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
>> fault detected in device; service still available
>> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
>> Disconnected tagged cmd(s) (1) timeout for Target 0.0
>
> NCR scsi controllers .. what OS revision is this ? Solaris 10 u 3 ?
> Solaris Nevada snv_55b ?

root@newponit # cat /etc/release
Solaris 10 11/06 s10s_u3wos_10 SPARC
Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
Use is subject to license terms.
Assembled 14 November 2006
root@newponit # uname -a
SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

>> The SVM and ZFS disks are on separate SCSI buses, so theoretically there
>> shouldn't be any impact on the SVM disks when I pull out a ZFS disk.
>
> I still feel that you hit a bug in ZFS somewhere. Under no circumstances
> should a Solaris server panic and crash simply because you pulled out a
> single disk that was totally mirrored. In fact .. I will reproduce those
> conditions here and then see what happens for me.

And Solaris should not hang at all.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/
			http://gallery.dogan.ch/

> Ihsan Dogan wrote:
>
>>> I think you hit a major bug in ZFS personally.
>>
>> For me it also looks like a bug.
>
> I think we don't have enough information to judge. If you have a supported
> version of Solaris, open a case and supply all the data (crash dump!) you
> have.

I agree we need data. Everything else is just speculation and wild
conjecture.

I am going to create the same conditions here but with snv_55b and then
yank a disk from my zpool. If I get a similar response then I will *hope*
for a crash dump.

You must be kidding about the "open a case" however. This is OpenSolaris.

Dennis

> Am 24.1.2007 14:59 Uhr, Dennis Clarke schrieb:
>
>>> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
>>> fault detected in device; service still available
>>> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
>>> Disconnected tagged cmd(s) (1) timeout for Target 0.0
>>
>> NCR scsi controllers .. what OS revision is this ? Solaris 10 u 3 ?
>> Solaris Nevada snv_55b ?
>
> root@newponit # cat /etc/release
> Solaris 10 11/06 s10s_u3wos_10 SPARC
> Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
> Use is subject to license terms.
> Assembled 14 November 2006
> root@newponit # uname -a
> SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

oh dear. that's not Solaris Nevada at all. That is production Solaris 10.

>>> The SVM and ZFS disks are on separate SCSI buses, so theoretically there
>>> shouldn't be any impact on the SVM disks when I pull out a ZFS disk.
>>
>> I still feel that you hit a bug in ZFS somewhere. Under no circumstances
>> should a Solaris server panic and crash simply because you pulled out a
>> single disk that was totally mirrored. In fact .. I will reproduce those
>> conditions here and then see what happens for me.
>
> And Solaris should not hang at all.

I agree. We both know this. You just recently patched a blastwave server
that was running for over 700 days in production and *this* sort of
behavior just does not happen in Solaris.

Let me see if I can reproduce your config here :

bash-3.2# metastat -p
d0 -m /dev/md/rdsk/d10 /dev/md/rdsk/d20 1
d10 1 1 /dev/rdsk/c0t1d0s0
d20 1 1 /dev/rdsk/c0t0d0s0
d1 -m /dev/md/rdsk/d11 1
d11 1 1 /dev/rdsk/c0t1d0s1
d4 -m /dev/md/rdsk/d14 1
d14 1 1 /dev/rdsk/c0t1d0s7
d5 -m /dev/md/rdsk/d15 1
d15 1 1 /dev/rdsk/c0t1d0s5
d21 1 1 /dev/rdsk/c0t0d0s1
d24 1 1 /dev/rdsk/c0t0d0s7
d25 1 1 /dev/rdsk/c0t0d0s5
bash-3.2# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s4
     a    p  luo        16              8192            /dev/dsk/c0t1d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s4
bash-3.2# zpool status -v zfs0
  pool: zfs0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfs0        ONLINE       0     0     0
          c1t9d0    ONLINE       0     0     0
          c1t10d0   ONLINE       0     0     0
          c1t11d0   ONLINE       0     0     0
          c1t12d0   ONLINE       0     0     0
          c1t13d0   ONLINE       0     0     0
          c1t14d0   ONLINE       0     0     0

errors: No known data errors
bash-3.2#

I will add in mirrors to that zpool from another array on another
controller and then yank a disk. However this machine is on snv_52 at
the moment.

Dennis
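For what it's worth, turning those single-disk vdevs into mirrors is one
zpool attach per disk; a sketch, where the c2 device names are only
placeholders for the second array:
--------------------------------------------------------------
zpool attach zfs0 c1t9d0  c2t9d0     # c2t9d0 becomes the mirror of c1t9d0
zpool attach zfs0 c1t10d0 c2t10d0
zpool attach zfs0 c1t11d0 c2t11d0
zpool status zfs0                    # let the resilver finish before pulling disks
--------------------------------------------------------------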

Dennis Clarke wrote:
>> Ihsan Dogan wrote:
>>
>>>> I think you hit a major bug in ZFS personally.
>>> For me it also looks like a bug.
>> I think we don't have enough information to judge. If you have a supported
>> version of Solaris, open a case and supply all the data (crash dump!) you
>> have.
>
> I agree we need data. Everything else is just speculation and wild
> conjecture.
>
> I am going to create the same conditions here but with snv_55b and then
> yank a disk from my zpool. If I get a similar response then I will *hope*
> for a crash dump.
>
> You must be kidding about the "open a case" however. This is OpenSolaris.

no, I'm not. That's why I said "If you have a supported version of
Solaris". Also, Ihsan seems to disagree about OpenSolaris:

> root@newponit # uname -a
> SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

Michael
-- 
Michael Schuster	Sun Microsystems, Inc.
Recursion, n.: see 'Recursion'

Am 24.1.2007 15:49 Uhr, Michael Schuster schrieb:

>> I am going to create the same conditions here but with snv_55b and
>> then yank a disk from my zpool. If I get a similar response then I
>> will *hope* for a crash dump.
>>
>> You must be kidding about the "open a case" however. This is
>> OpenSolaris.
>
> no, I'm not. That's why I said "If you have a supported version of
> Solaris". Also, Ihsan seems to disagree about OpenSolaris:

I opened a case this morning. Let's see what the support guys are saying.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/
			http://gallery.dogan.ch/

Ihsan,

If you are running Solaris 10 then you are probably hitting:

6456939 sd_send_scsi_SYNCHRONIZE_CACHE_biodone() can issue TUR which
        calls biowait() and deadlock/hangs host

This was fixed in opensolaris (build 48) but a patch is not yet available
for Solaris 10.

Thanks,
George

Ihsan Dogan wrote:
> Am 24.1.2007 15:49 Uhr, Michael Schuster schrieb:
>
>>> I am going to create the same conditions here but with snv_55b and
>>> then yank a disk from my zpool. If I get a similar response then I
>>> will *hope* for a crash dump.
>>>
>>> You must be kidding about the "open a case" however. This is
>>> OpenSolaris.
>>
>> no, I'm not. That's why I said "If you have a supported version of
>> Solaris". Also, Ihsan seems to disagree about OpenSolaris:
>
> I opened a case this morning. Let's see what the support guys are saying.
>
> Ihsan
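One way to check whether a hang matches that bug's signature is to look
in the crash dump for threads blocked in biowait() under the sd
synchronize-cache path; a sketch, assuming the dump was saved as
unix.0/vmcore.0:
--------------------------------------------------------------
cd /var/crash/newponit
echo "::walk thread | ::findstack -v" | mdb unix.0 vmcore.0 | \
    egrep "biowait|sd_send_scsi_SYNCHRONIZE_CACHE"
--------------------------------------------------------------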