Hello,

We're setting up a new mail server infrastructure and decided to run it
on ZFS. On an E220R with a D1000, I've set up a storage pool with four
mirrors:

--------------------------------------------------------------
root@newponit # zpool status
  pool: pool0
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pool0        ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t0d0   ONLINE       0     0     0
            c5t8d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t1d0   ONLINE       0     0     0
            c5t9d0   ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t2d0   ONLINE       0     0     0
            c5t10d0  ONLINE       0     0     0
          mirror     ONLINE       0     0     0
            c1t3d0   ONLINE       0     0     0
            c5t11d0  ONLINE       0     0     0

errors: No known data errors
--------------------------------------------------------------

Before we start to install any software on it, we wanted to see how ZFS
behaves when something goes wrong. So we pulled out a disk while a
mkfile was running. What happened then was not what we expected: the
system hung for more than an hour and finally panicked:

--------------------------------------------------------------
Jan 23 18:49:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: got SCSI bus reset
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit    Cmd (0x60000a3ed10) dump for Target 1 Lun 0:
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit    cdb=[ 0x2a 0x0 0x2 0x1b 0x2c 0x93 0x0 0x0 0x1 0x0 ]
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit    pkt_flags=0xc000 pkt_statistics=0x60 pkt_state=0x7
Jan 23 18:50:36 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit    pkt_scbp=0x0 cmd_flags=0x1860
Jan 23 18:50:36 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit    Disconnected tagged cmd(s) (1) timeout for Target 1.0
Jan 23 18:50:36 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
Jan 23 18:50:36 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: Disconnected tagged cmd(s) (1) timeout for Target 1.0
Jan 23 18:50:36 newponit glm: [ID 401478 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6018]
Jan 23 18:50:36 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@3 (glm0):
Jan 23 18:50:36 newponit    got SCSI bus reset
Jan 23 18:50:36 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
Jan 23 18:50:36 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: got SCSI bus reset
Jan 23 18:50:36 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@3/sd@1,0 (sd0):
Jan 23 18:50:36 newponit    SCSI transport failed: reason 'timeout': giving up
Jan 23 18:50:36 newponit md: [ID 312844 kern.warning] WARNING: md: state database commit failed
Jan 23 18:50:36 newponit last message repeated 1 time
Jan 23 18:51:38 newponit unix: [ID 836849 kern.notice]
Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to lack of DiskSuite state
Jan 23 18:51:38 newponit database replicas. Fewer than 50% of the total were available,
Jan 23 18:51:38 newponit so panic to ensure data integrity.
Jan 23 18:51:38 newponit unix: [ID 100000 kern.notice]
Jan 23 18:51:38 newponit genunix: [ID 723222 kern.notice] 000002a1003c1230 md:mddb_commitrec_wrapper+a8 (a, 30000e81600, 18e9250, 12ecc00, 18e9000, 1)
Jan 23 18:51:38 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000030 0000000000000000 0000000000000002 0000060000a8e6c8
Jan 23 18:51:38 newponit   %l4-7: 0000000000000000 00000000012ecf48 0000000000000002 00000000012ecc00
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice] 000002a1003c12e0 md_mirror:mirror_mark_resync_region+290 (0, 0, 600008dacc0, 600008da980, 0, 1)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000000 00000600008e9e80 0000000000000001 0000000000000000
Jan 23 18:51:39 newponit   %l4-7: 0000000000000001 0000000000000000 000000000183d400 0000000000000002
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice] 000002a1003c1390 md_mirror:mirror_write_strategy+5c0 (60000885108, 0, 0, 0, 600008dad20, 0)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000000 00000300000c33b8 00000600008f5e08 0000000000008000
Jan 23 18:51:39 newponit   %l4-7: 00000000018e9a28 00000600008da980 0000000000000000 0000000000000000
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice] 000002a1003c1440 md:mdstrategy+d4 (60000885108, 1, 18e9800, 2000101, 2000101, 18e4400)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3: 00000000018e4800 00000000018e4888 0000000002000101 000000000130f500
Jan 23 18:51:39 newponit   %l4-7: 00000000018eafa8 00000600008c3dc8 0000000000000008 0000000000000001
Jan 23 18:51:39 newponit genunix: [ID 723222 kern.notice] 000002a1003c14f0 ufs:ldl_strategy+1a0 (1, 60002d92bc0, 12382a0, 18bcb58, 0, 0)
Jan 23 18:51:39 newponit genunix: [ID 179002 kern.notice]   %l0-3: 00000300003e3e00 0000060000883f20 0000000000000400 0000000000000370
Jan 23 18:51:39 newponit   %l4-7: 0000000000000000 000002a1003c15d0 000002a1003c15d8 0000060000885108
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice] 000002a1003c15e0 ufs:push_dirty_bp+c (60000859840, 60002d92bc0, 0, 0, 0, 30000e8e000)
Jan 23 18:51:40 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000000 000000000006dc00 000000000000036e 0000000000000000
Jan 23 18:51:40 newponit   %l4-7: 0000030000e8e000 0000000000000000 000000000183d400 00000000018aa698
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice] 000002a1003c1690 ufs:logmap_commit+8c (a, 32, 33, 16, 600008595c0, 60000859840)
Jan 23 18:51:40 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000001770 0000000000000000 00000000018a6800 0000000045b636bd
Jan 23 18:51:40 newponit   %l4-7: 0000000045b636bc 00000000018a5800 0000000000000000 0000000000000000
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice] 000002a1003c17b0 ufs:top_end_sync+e4 (0, 2a1003c193c, 60000859960, 32, 60000859840, 600008596a0)
Jan 23 18:51:40 newponit genunix: [ID 179002 kern.notice]   %l0-3: 00000600008595c0 0000000000000000 00000600008596b0 0000000000000000
Jan 23 18:51:40 newponit   %l4-7: 0000000000000000 0000000000000033 0000000000000003 0000000000000001
Jan 23 18:51:40 newponit genunix: [ID 723222 kern.notice] 000002a1003c1880 ufs:ufs_update+2f4 (0, 60000a8e6c8, 1883a68, 300003e3e00, 1d, 2a1003c193c)
Jan 23 18:51:41 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000000030 0000000000000000 0000000000000002 0000060000a8e6c8
Jan 23 18:51:41 newponit   %l4-7: 000000000000ffbf 0000000000000000 0000060001b02000 0000000000000000
Jan 23 18:51:41 newponit genunix: [ID 723222 kern.notice] 000002a1003c1940 ufs:ufs_sync+2c (0, 1, 0, 1, 185de00, 1000000000000)
Jan 23 18:51:41 newponit genunix: [ID 179002 kern.notice]   %l0-3: 0000000000001497 000000000002ecfb 0000000000000000 000000000014c2e1
Jan 23 18:51:41 newponit   %l4-7: 00000000008bbccc 000000000000213f 00000000008bde0b 0000000000000001
Jan 23 18:51:41 newponit genunix: [ID 723222 kern.notice] 000002a1003c1a00 genunix:fsflush+4e0 (2, 18a5800, 1864f20, 1080, 185de28, 8000)
Jan 23 18:51:41 newponit genunix: [ID 179002 kern.notice]   %l0-3: 000000000185dda8 0000000000000bb8 0000000000000001 00000000018ad2a8
Jan 23 18:51:41 newponit   %l4-7: 000000000185dea8 000000000185ee28 0000000000000100 0000000000000100
Jan 23 18:51:41 newponit unix: [ID 100000 kern.notice]
Jan 23 18:51:41 newponit genunix: [ID 672855 kern.notice] syncing file systems...
Jan 23 18:51:41 newponit genunix: [ID 904073 kern.notice] done
Jan 23 18:51:42 newponit genunix: [ID 111219 kern.notice] dumping to /dev/md/dsk/d1, offset 215220224, content: kernel
Jan 23 18:52:10 newponit genunix: [ID 409368 kern.notice] ^M100% done: 51733 pages dumped, compression ratio 9.67,
Jan 23 18:52:10 newponit genunix: [ID 851671 kern.notice] dump succeeded
--------------------------------------------------------------

After the system booted up, one disk was missing on the ZFS side. But
the really interesting part was this:

--------------------------------------------------------------
root@newponit # metastat
d3: Mirror
    Submirror 0: d13
      State: Needs maintenance
    Submirror 1: d23
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 4198392 blocks (2.0 GB)

d13: Submirror of d3
    State: Needs maintenance
    Invoke: metasync d3
    Size: 4198392 blocks (2.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s3          0     No            Okay   Yes

d23: Submirror of d3
    State: Needs maintenance
    Invoke: metasync d3
    Size: 4198392 blocks (2.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s3          0     No            Okay   Yes

d1: Mirror
    Submirror 0: d11
      State: Needs maintenance
    Submirror 1: d21
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 2101552 blocks (1.0 GB)

d11: Submirror of d1
    State: Needs maintenance
    Invoke: metasync d1
    Size: 2101552 blocks (1.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s1          0     No            Okay   Yes

d21: Submirror of d1
    State: Needs maintenance
    Invoke: metasync d1
    Size: 2101552 blocks (1.0 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s1          0     No            Okay   Yes

d0: Mirror
    Submirror 0: d10
      State: Needs maintenance
    Submirror 1: d20
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 29035344 blocks (13 GB)

d10: Submirror of d0
    State: Needs maintenance
    Invoke: metasync d0
    Size: 29035344 blocks (13 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t0d0s0          0     No            Okay   Yes

d20: Submirror of d0
    State: Needs maintenance
    Invoke: metasync d0
    Size: 29035344 blocks (13 GB)
    Stripe 0:
        Device     Start Block  Dbase        State Reloc Hot Spare
        c0t1d0s0          0     No            Okay   Yes

Device Relocation Information:
Device   Reloc  Device ID
c0t1d0   Yes    id1,sd@SSEAGATE_ST318305LSUN18G_3JKQ57TD000023076LHH
c0t0d0   Yes    id1,sd@SSEAGATE_ST318404LSUN18G_3BT2PJ950000220439SK
--------------------------------------------------------------

Why did the system hang for such a long time, why did I get a problem
with the internal disks, and why did the system panic?
Full messages: http://ihsan.dogan.ch/files/messages

format output:
--------------------------------------------------------------
       0. c0t0d0 <SEAGATE-ST318404LSUN18G-4203 cyl 7506 alt 2 hd 19 sec 248>
          /pci@1f,4000/scsi@3/sd@0,0
       1. c0t1d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
          /pci@1f,4000/scsi@3/sd@1,0
       2. c1t0d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@0,0
       3. c1t1d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@1,0
       4. c1t2d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@2,0
       5. c1t3d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,2000/scsi@1/sd@3,0
       6. c5t8d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@8,0
       7. c5t9d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@9,0
       8. c5t10d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@a,0
       9. c5t11d0 <SEAGATE-ST318305LSUN18G-0641-16.87GB>
          /pci@1f,4000/scsi@5/sd@b,0
--------------------------------------------------------------

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/		http://gallery.dogan.ch/
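For reference, the four-mirror layout shown above would typically be
created with a single zpool create. A minimal sketch, using the device
names from the format output:

--------------------------------------------------------------
# Build pool0 as four two-way mirrors, pairing each c1 disk with a
# c5 disk so the two halves of every mirror sit on different SCSI buses.
zpool create pool0 \
    mirror c1t0d0 c5t8d0 \
    mirror c1t1d0 c5t9d0 \
    mirror c1t2d0 c5t10d0 \
    mirror c1t3d0 c5t11d0

# Verify the layout
zpool status pool0
--------------------------------------------------------------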
Ihsan Dogan wrote:
> Hello,
>
> We're setting up a new mail server infrastructure and decided to run it
> on ZFS. On an E220R with a D1000, I've set up a storage pool with four
> mirrors:
>
> --------------------------------------------------------------
> root@newponit # zpool status
>   pool: pool0
>  state: ONLINE
>  scrub: none requested
> config:

[...]

> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
> lack of DiskSuite state
> Jan 23 18:51:38 newponit database replicas. Fewer than 50% of the total
> were available,
> Jan 23 18:51:38 newponit so panic to ensure data integrity.

This message shows (and the rest of the stack proves) that your panic
happened in SVM. It has NOTHING to do with ZFS. So either you pulled the
wrong disk, or the disk you pulled also contained SVM volumes (next to
ZFS).

-- 
Michael Schuster
Sun Microsystems, Inc.
Recursion, n.: see 'Recursion'
> Hello,
>
> We're setting up a new mail server infrastructure and decided to run it
> on ZFS. On an E220R with a D1000, I've set up a storage pool with four
> mirrors:

Good morning Ihsan ...

I see that you have everything mirrored here, that's excellent.

When you pulled a disk, was it a disk that was containing a metadevice or
was it a disk in the zpool? In the case of a metadevice, as you know, the
system should have kept running fine. We have probably both done this over
and over at various sites to demonstrate SVM to people.

If you pulled out a device in the zpool, well, now we are in a whole new
world. I had heard that there was some *feature* in Solaris now that will
protect ZFS file system integrity by simply causing a system to panic if
the last device in some redundant component was compromised.

I think you hit a major bug in ZFS, personally.

Dennis
Afternoon,

The panic looks to be due to the fact that your SVM state databases
aren't all there: when we came to update one of them, we found there
were <= 50% of the state databases available, and crashed.

This doesn't look like anything to do with ZFS.
I'd check the output from metadb and see if it looks like
you've got an SVM database on a disk that's also in use by ZFS.

> Jan 23 18:50:36 newponit    SCSI transport failed: reason 'timeout':
> giving up
> Jan 23 18:50:36 newponit md: [ID 312844 kern.warning] WARNING: md: state
> database commit failed
> Jan 23 18:50:36 newponit last message repeated 1 time
> Jan 23 18:51:38 newponit unix: [ID 836849 kern.notice]
> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
> lack of DiskSuite state
> Jan 23 18:51:38 newponit database replicas. Fewer than 50% of the total
> were available,
> Jan 23 18:51:38 newponit so panic to ensure data integrity.

Regards,
Jason
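One quick way to do that cross-check, assuming the metadb and zpool
status output formats shown elsewhere in this thread, is to reduce both
listings to bare cNtNdN names and compare:

--------------------------------------------------------------
# Disks carrying SVM state database replicas (the device path is the
# last field of each metadb line after the header)
metadb | awk 'NR > 1 { print $NF }' \
       | sed 's,.*/\(c[0-9]*t[0-9]*d[0-9]*\).*,\1,' | sort -u

# Disks that belong to the ZFS pool
zpool status pool0 | awk '$1 ~ /^c[0-9]/ { print $1 }' | sort -u

# Any name that shows up in both lists is a disk shared between
# SVM and ZFS -- exactly the situation to rule out.
--------------------------------------------------------------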
Hello Michael,

Am 24.1.2007 14:36 Uhr, Michael Schuster schrieb:

>> --------------------------------------------------------------
>> root@newponit # zpool status
>>   pool: pool0
>>  state: ONLINE
>>  scrub: none requested
>> config:
>
> [...]
>
>> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
>> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
>> lack of DiskSuite state
>> Jan 23 18:51:38 newponit database replicas. Fewer than 50% of the total
>> were available,
>> Jan 23 18:51:38 newponit so panic to ensure data integrity.
>
> This message shows (and the rest of the stack proves) that your panic
> happened in SVM. It has NOTHING to do with ZFS. So either you pulled the
> wrong disk, or the disk you pulled also contained SVM volumes (next to
> ZFS).

I noticed that the panic was in SVM, and I'm wondering why the machine
was hanging. SVM is only running on the internal disks (c0), and I pulled
a disk from the D1000:

Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5/sd@9,0 (sd50):
Jan 23 17:24:14 newponit    SCSI transport failed: reason 'incomplete': retrying command
Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5/sd@9,0 (sd50):
Jan 23 17:24:14 newponit    disk not responding to selection
Jan 23 17:24:18 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@5/sd@9,0 (sd50):
Jan 23 17:24:18 newponit    disk not responding to selection

This is clearly the disk with ZFS on it: SVM has nothing to do with this
disk. A minute later, the trouble started with the internal disks:

Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit    Cmd (0x60000a3ed10) dump for Target 0 Lun 0:
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit    cdb=[ 0x28 0x0 0x0 0x78 0x6 0x30 0x0 0x0 0x10 0x0 ]
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit    pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit    pkt_scbp=0x0 cmd_flags=0x860
Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit    Disconnected tagged cmd(s) (1) timeout for Target 0.0
Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available
Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0: Disconnected tagged cmd(s) (1) timeout for Target 0.0
Jan 23 17:25:26 newponit glm: [ID 401478 kern.warning] WARNING: ID[SUNWpd.glm.cmd_timeout.6018]
Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING: /pci@1f,4000/scsi@3 (glm0):
Jan 23 17:25:26 newponit    got SCSI bus reset
Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0: fault detected in device; service still available

The SVM and ZFS disks are on separate SCSI buses, so there should be no
impact on the SVM disks when I pull out a ZFS disk.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/	http://gallery.dogan.ch/
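One way to back up the separate-bus claim is to look at the physical
paths behind the logical device names; a sketch using two of the disks
from this thread:

--------------------------------------------------------------
# The cNtNdN names are symlinks into the physical device tree, so the
# controller each disk hangs off is visible directly:
ls -l /dev/dsk/c0t0d0s0   # internal: .../pci@1f,4000/scsi@3/sd@0,0:a
ls -l /dev/dsk/c5t9d0s0   # D1000:    .../pci@1f,4000/scsi@5/sd@9,0:a
--------------------------------------------------------------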
> Hello Michael,
>
> Am 24.1.2007 14:36 Uhr, Michael Schuster schrieb:
>
>>> --------------------------------------------------------------
>>> root@newponit # zpool status
>>>   pool: pool0
>>>  state: ONLINE
>>>  scrub: none requested
>>> config:
>>
>> [...]
>>
>>> Jan 23 18:51:38 newponit ^Mpanic[cpu2]/thread=30000e81600:
>>> Jan 23 18:51:38 newponit unix: [ID 268973 kern.notice] md: Panic due to
>>> lack of DiskSuite state
>>> Jan 23 18:51:38 newponit database replicas. Fewer than 50% of the total
>>> were available,
>>> Jan 23 18:51:38 newponit so panic to ensure data integrity.
>>
>> This message shows (and the rest of the stack proves) that your panic
>> happened in SVM. It has NOTHING to do with ZFS. So either you pulled the
>> wrong disk, or the disk you pulled also contained SVM volumes (next to
>> ZFS).
>
> I noticed that the panic was in SVM, and I'm wondering why the machine
> was hanging. SVM is only running on the internal disks (c0), and I pulled
> a disk from the D1000:

so the device that was affected had nothing to do with SVM at all.
fine ... I have the exact same config here: internal SVM and then external
ZFS on two disk arrays on two controllers.

> Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@5/sd@9,0 (sd50):
> Jan 23 17:24:14 newponit    SCSI transport failed: reason 'incomplete':
> retrying command
> Jan 23 17:24:14 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@5/sd@9,0 (sd50):
> Jan 23 17:24:14 newponit    disk not responding to selection
> Jan 23 17:24:18 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@5/sd@9,0 (sd50):
> Jan 23 17:24:18 newponit    disk not responding to selection
>
> This is clearly the disk with ZFS on it: SVM has nothing to do with this
> disk. A minute later, the trouble started with the internal disks:

Okay .. so are we back to looking at ZFS, or ZFS and the SVM components,
or some interaction between these kernel modules? At this point I have to
be careful not to fall into a pit of blind ignorance as I grope for the
answer. Perhaps some data would help. Was there a core file in
/var/crash/newponit?

> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit    Cmd (0x60000a3ed10) dump for Target 0 Lun 0:
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit    cdb=[ 0x28 0x0 0x0 0x78 0x6 0x30 0x0 0x0 0x10
> 0x0 ]
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit    pkt_flags=0x4000 pkt_statistics=0x60 pkt_state=0x7
> Jan 23 17:25:26 newponit scsi: [ID 365881 kern.info] /pci@1f,4000/scsi@3
> (glm0):
> Jan 23 17:25:26 newponit    pkt_scbp=0x0 cmd_flags=0x860
> Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3 (glm0):
> Jan 23 17:25:26 newponit    Disconnected tagged cmd(s) (1) timeout for
> Target 0.0

so a pile of SCSI noise above there .. one would expect that from a
suddenly missing SCSI device.

> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
> fault detected in device; service still available
> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
> Disconnected tagged cmd(s) (1) timeout for Target 0.0

NCR SCSI controllers .. what OS revision is this? Solaris 10 U3?
Solaris Nevada snv_55b?

> Jan 23 17:25:26 newponit glm: [ID 401478 kern.warning] WARNING:
> ID[SUNWpd.glm.cmd_timeout.6018]
> Jan 23 17:25:26 newponit scsi: [ID 107833 kern.warning] WARNING:
> /pci@1f,4000/scsi@3 (glm0):
> Jan 23 17:25:26 newponit    got SCSI bus reset
> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
> fault detected in device; service still available
>
> The SVM and ZFS disks are on separate SCSI buses, so there should be no
> impact on the SVM disks when I pull out a ZFS disk.

I still feel that you hit a bug in ZFS somewhere. Under no circumstances
should a Solaris server panic and crash simply because you pulled out a
single disk that was fully mirrored. In fact .. I will reproduce those
conditions here and then see what happens for me.

Dennis
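If the dump made it to disk, pulling the basics out of it looks roughly
like this; a minimal sketch, assuming savecore wrote the usual
unix.0/vmcore.0 pair:

--------------------------------------------------------------
cd /var/crash/newponit
mdb -k unix.0 vmcore.0      # load the saved kernel crash dump

> ::status                  # panic string and dump summary
> ::stack                   # stack of the panicking thread
> ::msgbuf                  # console messages leading up to the panic
--------------------------------------------------------------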
Hello,

Am 24.1.2007 14:40 Uhr, Dennis Clarke schrieb:

>> We're setting up a new mail server infrastructure and decided to run it
>> on ZFS. On an E220R with a D1000, I've set up a storage pool with four
>> mirrors:
>
> Good morning Ihsan ...
>
> I see that you have everything mirrored here, that's excellent.
>
> When you pulled a disk, was it a disk that was containing a metadevice or
> was it a disk in the zpool? In the case of a metadevice, as you know, the
> system should have kept running fine. We have probably both done this over
> and over at various sites to demonstrate SVM to people.
>
> If you pulled out a device in the zpool, well, now we are in a whole new
> world. I had heard that there was some *feature* in Solaris now that will
> protect ZFS file system integrity by simply causing a system to panic if
> the last device in some redundant component was compromised.

The disk was in a zpool. The SVM disks are on a separate SCSI bus, so
they can't disturb each other.

> I think you hit a major bug in ZFS, personally.

For me it also looks like a bug.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/	http://gallery.dogan.ch/
Ihsan Dogan wrote:

>> I think you hit a major bug in ZFS, personally.
>
> For me it also looks like a bug.

I think we don't have enough information to judge. If you have a supported
version of Solaris, open a case and supply all the data (crash dump!) you
have.

HTH

-- 
Michael Schuster
Sun Microsystems, Inc.
Recursion, n.: see 'Recursion'
Hello,

Am 24.1.2007 14:49 Uhr, Jason Banham schrieb:

> The panic looks to be due to the fact that your SVM state databases
> aren't all there: when we came to update one of them, we found there
> were <= 50% of the state databases available, and crashed.

The metadbs are fine. I haven't touched them at all:

root@newponit # metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s7
     a    p  luo        16              8192            /dev/dsk/c0t1d0s7
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s7

> This doesn't look like anything to do with ZFS.
> I'd check the output from metadb and see if it looks like
> you've got an SVM database on a disk that's also in use by ZFS.

The question is still: why did the system panic? I have now pulled out a
different disk, which is for sure on ZFS and not on SVM. The system
still runs, but I can't log in anymore and the console doesn't work at
all anymore. Even if it has nothing to do with ZFS, I don't think this
is normal behavior.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/	http://gallery.dogan.ch/
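A side note on that layout: with all four replicas on just two disks,
losing either disk leaves exactly 50% of the replicas, which is enough
to keep running but not the majority SVM wants at boot. A common
precaution is to spread replicas over a third disk; a sketch, where
c0t2d0s7 is a hypothetical free slice:

--------------------------------------------------------------
# Add two more replicas on a third disk so that losing any single
# disk still leaves a majority of replicas intact.
metadb -a -c 2 c0t2d0s7

# Verify the new replica layout
metadb
--------------------------------------------------------------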
Am 24.1.2007 14:59 Uhr, Dennis Clarke schrieb:

>> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
>> fault detected in device; service still available
>> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
>> Disconnected tagged cmd(s) (1) timeout for Target 0.0
>
> NCR SCSI controllers .. what OS revision is this? Solaris 10 U3?
> Solaris Nevada snv_55b?

root@newponit # cat /etc/release
                        Solaris 10 11/06 s10s_u3wos_10 SPARC
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                            Assembled 14 November 2006
root@newponit # uname -a
SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

>> The SVM and ZFS disks are on separate SCSI buses, so there should be no
>> impact on the SVM disks when I pull out a ZFS disk.
>
> I still feel that you hit a bug in ZFS somewhere. Under no circumstances
> should a Solaris server panic and crash simply because you pulled out a
> single disk that was fully mirrored. In fact .. I will reproduce those
> conditions here and then see what happens for me.

And Solaris should not hang at all.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/	http://gallery.dogan.ch/
> Ihsan Dogan wrote:
>
>>> I think you hit a major bug in ZFS, personally.
>>
>> For me it also looks like a bug.
>
> I think we don't have enough information to judge. If you have a supported
> version of Solaris, open a case and supply all the data (crash dump!) you
> have.

I agree we need data. Everything else is just speculation and wild
conjecture.

I am going to create the same conditions here but with snv_55b and then
yank a disk from my zpool. If I get a similar response then I will *hope*
for a crash dump.

You must be kidding about the "open a case" part, however. This is
OpenSolaris.

Dennis
> Am 24.1.2007 14:59 Uhr, Dennis Clarke schrieb:
>
>>> Jan 23 17:25:26 newponit genunix: [ID 408822 kern.info] NOTICE: glm0:
>>> fault detected in device; service still available
>>> Jan 23 17:25:26 newponit genunix: [ID 611667 kern.info] NOTICE: glm0:
>>> Disconnected tagged cmd(s) (1) timeout for Target 0.0
>>
>> NCR SCSI controllers .. what OS revision is this? Solaris 10 U3?
>> Solaris Nevada snv_55b?
>
> root@newponit # cat /etc/release
>                         Solaris 10 11/06 s10s_u3wos_10 SPARC
>            Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
>                         Use is subject to license terms.
>                             Assembled 14 November 2006
> root@newponit # uname -a
> SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

oh dear. that's not Solaris Nevada at all. That is production Solaris 10.

>>> The SVM and ZFS disks are on separate SCSI buses, so there should be no
>>> impact on the SVM disks when I pull out a ZFS disk.
>>
>> I still feel that you hit a bug in ZFS somewhere. Under no circumstances
>> should a Solaris server panic and crash simply because you pulled out a
>> single disk that was fully mirrored. In fact .. I will reproduce those
>> conditions here and then see what happens for me.
>
> And Solaris should not hang at all.

I agree. We both know this. You just recently patched a Blastwave server
that had been running for over 700 days in production, and *this* sort of
behavior just does not happen in Solaris.

Let me see if I can reproduce your config here:

bash-3.2# metastat -p
d0 -m /dev/md/rdsk/d10 /dev/md/rdsk/d20 1
d10 1 1 /dev/rdsk/c0t1d0s0
d20 1 1 /dev/rdsk/c0t0d0s0
d1 -m /dev/md/rdsk/d11 1
d11 1 1 /dev/rdsk/c0t1d0s1
d4 -m /dev/md/rdsk/d14 1
d14 1 1 /dev/rdsk/c0t1d0s7
d5 -m /dev/md/rdsk/d15 1
d15 1 1 /dev/rdsk/c0t1d0s5
d21 1 1 /dev/rdsk/c0t0d0s1
d24 1 1 /dev/rdsk/c0t0d0s7
d25 1 1 /dev/rdsk/c0t0d0s5
bash-3.2# metadb
        flags           first blk       block count
     a m  p  luo        16              8192            /dev/dsk/c0t0d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t0d0s4
     a    p  luo        16              8192            /dev/dsk/c0t1d0s4
     a    p  luo        8208            8192            /dev/dsk/c0t1d0s4
bash-3.2# zpool status -v zfs0
  pool: zfs0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zfs0        ONLINE       0     0     0
          c1t9d0    ONLINE       0     0     0
          c1t10d0   ONLINE       0     0     0
          c1t11d0   ONLINE       0     0     0
          c1t12d0   ONLINE       0     0     0
          c1t13d0   ONLINE       0     0     0
          c1t14d0   ONLINE       0     0     0

errors: No known data errors
bash-3.2#

I will add mirrors to that zpool from another array on another controller
and then yank a disk. However, this machine is on snv_52 at the moment.

Dennis
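Turning that flat pool into mirrors is just one zpool attach per vdev; a
sketch, where the c2tNd0 names stand in for whatever the disks on the
second controller are really called:

--------------------------------------------------------------
# Attach a disk from the second array to each existing device,
# converting every single-disk vdev into a two-way mirror.
zpool attach zfs0 c1t9d0  c2t9d0
zpool attach zfs0 c1t10d0 c2t10d0
zpool attach zfs0 c1t11d0 c2t11d0
zpool attach zfs0 c1t12d0 c2t12d0
zpool attach zfs0 c1t13d0 c2t13d0
zpool attach zfs0 c1t14d0 c2t14d0

# Let the resilver finish before yanking anything
zpool status -v zfs0
--------------------------------------------------------------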
Dennis Clarke wrote:

>> Ihsan Dogan wrote:
>>
>>>> I think you hit a major bug in ZFS, personally.
>>> For me it also looks like a bug.
>> I think we don't have enough information to judge. If you have a supported
>> version of Solaris, open a case and supply all the data (crash dump!) you
>> have.
>
> I agree we need data. Everything else is just speculation and wild
> conjecture.
>
> I am going to create the same conditions here but with snv_55b and then
> yank a disk from my zpool. If I get a similar response then I will *hope*
> for a crash dump.
>
> You must be kidding about the "open a case" part, however. This is
> OpenSolaris.

no, I'm not. That's why I said "If you have a supported version of
Solaris". Also, Ihsan seems to disagree about OpenSolaris:

> root@newponit # uname -a
> SunOS newponit 5.10 Generic_118833-33 sun4u sparc SUNW,Ultra-60

Michael

-- 
Michael Schuster
Sun Microsystems, Inc.
Recursion, n.: see 'Recursion'
Am 24.1.2007 15:49 Uhr, Michael Schuster schrieb:

>> I am going to create the same conditions here but with snv_55b and then
>> yank a disk from my zpool. If I get a similar response then I will *hope*
>> for a crash dump.
>>
>> You must be kidding about the "open a case" part, however. This is
>> OpenSolaris.
>
> no, I'm not. That's why I said "If you have a supported version of
> Solaris". Also, Ihsan seems to disagree about OpenSolaris:

I opened a case this morning. Let's see what the support guys say.

Ihsan

-- 
ihsan@dogan.ch		http://ihsan.dogan.ch/	http://gallery.dogan.ch/
Ihsan,

If you are running Solaris 10 then you are probably hitting:

  6456939 sd_send_scsi_SYNCHRONIZE_CACHE_biodone() can issue TUR which
          calls biowait() and deadlock/hangs host

This was fixed in OpenSolaris (build 48) but a patch is not yet available
for Solaris 10.

Thanks,
George

Ihsan Dogan wrote:
> Am 24.1.2007 15:49 Uhr, Michael Schuster schrieb:
>
>>> I am going to create the same conditions here but with snv_55b and then
>>> yank a disk from my zpool. If I get a similar response then I will *hope*
>>> for a crash dump.
>>>
>>> You must be kidding about the "open a case" part, however. This is
>>> OpenSolaris.
>> no, I'm not. That's why I said "If you have a supported version of
>> Solaris". Also, Ihsan seems to disagree about OpenSolaris:
>
> I opened a case this morning. Let's see what the support guys say.
>
> Ihsan
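A quick way to tell which side of that fix a given machine is on is the
kernel version string; a sketch, assuming only what uname normally
reports:

--------------------------------------------------------------
# On Nevada/OpenSolaris the version field carries the build number,
# so snv_48 or later includes the fix for 6456939:
uname -v      # e.g. "snv_55b"

# On Solaris 10 the same field shows the patch kernel instead,
# e.g. "Generic_118833-33" -- no patch for this bug yet, per George.
uname -v
--------------------------------------------------------------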