Vasile Dumitrescu
2008-Oct-01 09:20 UTC
[zfs-discuss] zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Hi, I am running snv90. I have a pool that is 6x1TB, configured as raidz. After a computer crash (root is NOT on the pool - only data) the pool showed FAULTED status. I exported it and tried to reimport it, with the following result:

===============
# zpool import
  pool: ztank
    id: 12125153257763159358
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        ztank       FAULTED  corrupted data
          raidz1    ONLINE
            c1t6d0  ONLINE
            c1t5d0  ONLINE
            c1t4d0  ONLINE
            c1t3d0  ONLINE
            c1t2d0  ONLINE
            c1t1d0  ONLINE
===============

I searched Google and ran zdb -l on every pool device. Results follow below... To me it appears that all disks are OK and zdb can see the zpool structure on each of them (at least that is how I interpret the messages), yet zpool still reports corrupt pool metadata :-(

Any ideas as to what I might be able to do to salvage the data? Restoring from backup is not an option (yes, I know :() - as this is a personal project I hoped the raidz would be enough :-(

The output for each of the disks is more or less identical; all labels are accessible.
# zdb -l /dev/dsk/c1t6d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=10
    name='ztank'
    state=0
    txg=207161
    pool_guid=12125153257763159358
    hostid=628051022
    hostname='zfssrv'
    top_guid=763279656890868029
    guid=10947029755543026189
    vdev_tree
        type='raidz'
        id=0
        guid=763279656890868029
        nparity=1
        metaslab_array=14
        metaslab_shift=35
        ashift=9
        asize=6001149345792
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=10947029755543026189
                path='/dev/dsk/c1t1d0s0'
                devid='id1,sd@f0000000048455c81000880330000/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
                whole_disk=1
                DTL=193
        children[1]
                type='disk'
                id=1
                guid=2640926618230776740
                path='/dev/dsk/c1t2d0s0'
                devid='id1,sd@f0000000048455c81000992690001/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
                whole_disk=1
                DTL=192
        children[2]
                type='disk'
                id=2
                guid=8982722125061616789
                path='/dev/dsk/c1t3d0s0'
                devid='id1,sd@f0000000048455c81000ae8610002/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
                whole_disk=1
                DTL=191
        children[3]
                type='disk'
                id=3
                guid=7263648809970512976
                path='/dev/dsk/c1t4d0s0'
                devid='id1,sd@f0000000048455c81000bb2cf0003/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
                whole_disk=1
                DTL=190
        children[4]
                type='disk'
                id=4
                guid=5275414937202266822
                path='/dev/dsk/c1t5d0s0'
                devid='id1,sd@f0000000048455c81000ca3c40004/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
                whole_disk=1
                DTL=189
        children[5]
                type='disk'
                id=5
                guid=8503895341004279533
                path='/dev/dsk/c1t6d0s0'
                devid='id1,sd@f0000000048455c81000d49220005/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
                whole_disk=1
                DTL=188
--------------------------------------------
LABEL 1
--------------------------------------------
    (identical to LABEL 0)
--------------------------------------------
LABEL 2
--------------------------------------------
    (identical to LABEL 0)
--------------------------------------------
LABEL 3
--------------------------------------------
    (identical to LABEL 0)
===============
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-01 10:42 UTC
[zfs-discuss] zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
An update to the above: I tried to run zdb -e on the pool id, and here's the result:

# zdb -e 12125153257763159358
zdb: can't open 12125153257763159358: I/O error

NB: zdb seems to recognize the ID, because running it with an incorrect ID gives a different error:

# zdb -e 12125153257763159354
zdb: can't open 12125153257763159354: No such file or directory

Also, zdb -e with the ID of the syspool works:

# zdb -e 8843238790372298114
Uberblock

        magic = 0000000000bab10c
        version = 10
        txg = 317369
        guid_sum = 14131844542001965925
        timestamp = 1222857640 UTC = Wed Oct  1 12:40:40 2008

Dataset mos [META], ID 0, cr_txg 4, 2.76M, 244 objects
Dataset 8843238790372298114/export/home [ZPL], ID 60, cr_txg 721, 1.21G, 55 objects
Dataset 8843238790372298114/export [ZPL], ID 54, cr_txg 718, 19.0K, 5 objects
Dataset 8843238790372298114/swap [ZVOL], ID 28, cr_txg 15, 519M, 3 objects
Dataset 8843238790372298114/ROOT/snv_90 [ZPL], ID 48, cr_txg 710, 6.85G, 254748 objects
Dataset 8843238790372298114/ROOT [ZPL], ID 22, cr_txg 12, 18.0K, 4 objects
Dataset 8843238790372298114/dump [ZVOL], ID 34, cr_txg 18, 512M, 3 objects
Dataset 8843238790372298114 [ZPL], ID 5, cr_txg 4, 39.5K, 13 objects
etc. etc.
============

Any ideas? Could this be a hardware problem? I have no idea what to do next :-(

Thanks for your help!
Vasile
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-01 18:24 UTC
[zfs-discuss] one step forward - pinging Lukas Karwacki (kangurek) - pool: ztank
On the advice of Okana in the freenode.net #opensolaris channel I booted the latest OpenSolaris livecd and tried to import the pool there. No luck. However, I then tried the trick in Lukas's post that allowed him to import his pool, and had the beginning of some luck. After doing the mdb wizardry he indicated, I was able to run zpool import with the following result:

  pool: ztank
    id: whatever
 state: ONLINE
status: The pool was last accessed by another system.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        ztank       ONLINE
          raidz1    ONLINE
            c4t0d0  ONLINE
            c4t1d0  ONLINE
            c4t2d0  ONLINE
            c4t3d0  ONLINE
            c4t4d0  ONLINE
            c4t5d0  ONLINE

HOWEVER: when I attempt again to examine the pool using zdb -e ztank, I still get

zdb: can't open ztank: I/O error

And zpool import -f, whilst it starts and seems to access the disks sequentially, stops at the 3rd one (not sure which one precisely - it spins it up and the process stops right there), and the system will not reboot when asked to (shutdown -g0 -y -i5).

So there's some slight progress here. I would really appreciate ideas from you guys!

Thanks
Vasile
-- 
This message posted from opensolaris.org
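For anyone finding this thread in the archives: the "mdb wizardry" is not spelled out above, and the commands below are my reconstruction of it, not a verified copy of Lukas's post. The usual variant sets two unsupported kernel recovery knobs before attempting the import - treat it as a sketch for data recovery only, never for normal operation:

```shell
# Run as root on the system that will attempt the import.
# aok=1 downgrades failed kernel assertions to warnings;
# zfs_recover=1 lets the pool open past certain metadata inconsistencies.
# Both are undocumented recovery tunables; reboot afterwards to clear them.
echo 'aok/W 1'         | mdb -kw
echo 'zfs_recover/W 1' | mdb -kw

zpool import -f ztank
```

If the import then succeeds, copy the data off and rebuild the pool rather than continuing to run with these knobs set.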
Martin Uhl
2008-Oct-02 14:37 UTC
[zfs-discuss] one step forward - pinging Lukas Karwacki (kangurek) - pool: ztank
> When I attempt again to examine the pool using zdb -e ztank
> I still get zdb: can't open ztank: I/O error
> and zpool import -f, whilst it starts and seems to
> access the disks sequentially, it stops at the 3rd
> one (not sure which precisely - it spins it up and the
> process stops right there, and the system will not
> reboot when asked to (shutdown -g0 -y -i5)
> so there's some slight progress here.

How about just removing that disk and trying the import?
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-02 20:32 UTC
[zfs-discuss] one step forward - pinging Lukas Karwacki (kangurek) - pool: ztank
Thanks Martin,

Yeah, tried it, but no luck :-( In fact I tried removing every disk, one by one, with no luck each time - this is why I think it is not in fact a hardware problem...

Kind regards
Vasile
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-03 14:42 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hi folks,

I just wanted to share the end of my "adventure" here, and especially to take the time to thank Victor for helping me out of this mess.

I will let him explain the technical details (I am out of my depth here), but bottom line: he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced ZFS to revert to an earlier state that was consistent.

The machine is now in the process of doing a full scrub, and the first order of business tomorrow will be to do a full backup :-)

According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and was not shut down properly when the Debian server did a controlled shutdown following a UPS event. The Solaris machine was abruptly shut down, but because it was not in control of the entire chain down to the bare hardware, it appears that some writes were in fact still held by Debian when Solaris thought they had been safely executed. This left the zpool in question in a state that even raidz1 did not help with.

Anyway, again, lots and lots of thanks to Victor!!!

kind regards
Vasile
-- 
This message posted from opensolaris.org
Darren J Moffat
2008-Oct-03 14:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Vasile Dumitrescu wrote:
> Hi folks,
>
> I just wanted to share the end of my "adventure" here and especially take the time to thank Victor for helping me out of this mess.
>
> I will let him explain the technical details (I am out of my depth here) but bottom line he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced zfs to revert to an earlier state that was consistent.
>
> The machine is now in the process of doing a full scrub and the first order of business tomorrow will be to do a full backup :-)
>
> According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and it was not shut down properly when the Debian server did a controlled shutdown following a UPS event.

Which VM solution was this? VMware, VirtualBox, Xen, other? How were the "disks" presented to the guest? What are the "disks" in the host - real disks, files, something else?

-- 
Darren J Moffat
Vasile Dumitrescu
2008-Oct-03 15:37 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> Which VM solution was this? VMware, VirtualBox, Xen, other? How were
> the "disks" presented to the guest? What are the "disks" in the host,
> real disks, files, something else?
>
> -- 
> Darren J Moffat

VMware 6.0.4 running on Debian unstable,
Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux

Solaris is vanilla snv_90 installed with no GUI.

Here is the content of the .vmx file in question:

===============================================
#!/usr/bin/vmware
config.version = "8"
virtualHW.version = "6"
scsi0.present = "TRUE"
scsi0.virtualDev = "lsilogic"
memsize = "4096"
MemAllowAutoScaleDown = "FALSE"
MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
sched.mem.minsize = "3062"
sched.mem.max = "7000"
sched.mem.maxmemctl = "0"
sched.mem.shares = "100000"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "/home/vasile/vmware/solsrv/OpenSolaris64.vmdk"
ide1:0.present = "TRUE"
ide1:0.autodetect = "TRUE"
ide1:0.deviceType = "cdrom-image"
floppy0.startConnected = "FALSE"
floppy0.autodetect = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.wakeOnPcktRcv = "TRUE"
sound.present = "FALSE"
sound.fileName = "-1"
sound.autodetect = "TRUE"
svga.autodetect = "FALSE"
pciBridge0.present = "TRUE"
displayName = "zfssrv"
guestOS = "solaris10-64"
nvram = "Solaris 10 64-bit.nvram"
deploymentPlatform = "windows"
virtualHW.productCompatibility = "hosted"
RemoteDisplay.vnc.port = "0"
tools.upgrade.policy = "useGlobal"
floppy0.fileName = "/dev/fd0"
extendedConfigFile = "Solaris 10 64-bit.vmxf"
ide1:0.fileName = ""
floppy0.present = "FALSE"
gui.powerOnAtStartup = "TRUE"
ide1:0.startConnected = "TRUE"
ethernet0.addressType = "generated"
uuid.location = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
uuid.bios = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
scsi0.pciSlotNumber = "16"
ethernet0.pciSlotNumber = "32"
sound.pciSlotNumber = "-1"
ethernet0.generatedAddress = "00:0c:29:bb:c4:94"
ethernet0.generatedAddressOffset = "0"
tools.syncTime = "FALSE"
svga.maxWidth = "1024"
svga.maxHeight = "768"
svga.vramSize = "3145728"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "ztank-sda.vmdk"
scsi0:1.mode = "independent-persistent"
scsi0:1.deviceType = "rawDisk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "ztank-sdb.vmdk"
scsi0:2.mode = "independent-persistent"
scsi0:2.deviceType = "rawDisk"
scsi0:3.present = "TRUE"
scsi0:3.fileName = "ztank-sdc.vmdk"
scsi0:3.mode = "independent-persistent"
scsi0:3.deviceType = "rawDisk"
scsi0:4.present = "TRUE"
scsi0:4.fileName = "ztank-sdd.vmdk"
scsi0:4.mode = "independent-persistent"
scsi0:4.deviceType = "rawDisk"
scsi0:5.present = "TRUE"
scsi0:5.fileName = "ztank-sde.vmdk"
scsi0:5.mode = "independent-persistent"
scsi0:5.deviceType = "rawDisk"
scsi0:6.present = "TRUE"
scsi0:6.fileName = "ztank-sdf.vmdk"
scsi0:6.mode = "independent-persistent"
scsi0:6.deviceType = "rawDisk"
scsi0:1.redo = ""
scsi0:2.redo = ""
scsi0:3.redo = ""
scsi0:4.redo = ""
scsi0:5.redo = ""
scsi0:6.redo = ""
isolation.tools.dnd.disable = "TRUE"
snapshot.disabled = "TRUE"
scsi0:0.mode = "independent-persistent"
isolation.tools.copy.disable = "FALSE"
isolation.tools.paste.disable = "FALSE"
tools.remindInstall = "TRUE"
===============================================

In summary: physical disks, assigned 100% to the VM.

HTH
kind regards
Vasile
-- 
This message posted from opensolaris.org
Fajar A. Nugraha
2008-Oct-04 07:19 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
<vasiledumitrescu@gmail.com> wrote:
> VMware 6.0.4 running on Debian unstable,
> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>
> Solaris is vanilla snv_90 installed with no GUI.
>
> in summary: physical disks, assigned 100% to the VM

That's weird. I thought one of the points of using physical disks
instead of files was to avoid problems caused by caching on the host/dom0?
Darren J Moffat
2008-Oct-06 09:39 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Fajar A. Nugraha wrote:
> On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
> <vasiledumitrescu@gmail.com> wrote:
>
>> VMware 6.0.4 running on Debian unstable,
>> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>>
>> Solaris is vanilla snv_90 installed with no GUI.
>>
>> in summary: physical disks, assigned 100% to the VM
>
> That's weird. I thought one of the points of using physical disks
> instead of files was to avoid problems caused by caching on host/dom0?

The data still flows through the host/dom0 device drivers and is thus at the mercy of the commands they issue to the physical devices.

-- 
Darren J Moffat
.
2008-Oct-09 09:53 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> His explanation: he invalidated the incorrect
> uberblocks and forced zfs to revert to an earlier
> state that was consistent.

Would someone be willing to document the steps required to do this, please? I have a disk in a similar state:

# zpool import
  pool: tank
    id: 13234439337856002730
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        tank        FAULTED  corrupted data
          c7d0      ONLINE

This happened after I foolishly began trusting zfs-fuse with some large but relatively unimportant data on a big, empty, single-disk zpool in my home machine, and then suffered a power cut before I got around to backing it up. OpenSolaris can't import the pool either, so the drive is sat on a shelf waiting until a method for fixing it is published.

While it's clearly my own fault for taking the risks I did, it's still pretty frustrating knowing that all my data is likely still intact and nicely checksummed on the disk, but that none of it is accessible due to some tiny filesystem inconsistency. With pretty much any other FS I think I could get most of it back.

Clearly such a small number of occurrences in what were admittedly precarious configurations isn't going to be a particularly convincing motivator for a general solution, but I'd feel a whole lot better about using ZFS if I knew that there were some documented steps or a tool (zfsck? ;) that could help recover from this kind of metadata corruption in the unlikely event of it happening.

cheers,
Rob
-- 
This message posted from opensolaris.org
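While waiting for a proper write-up, a hedged, read-only starting point (flag spellings are from the snv-era zdb and may differ on other builds) is to look at what Victor would have been working with. A rollback of the kind he did amounts to making an older, still-consistent txg the active one, and the candidates are visible on disk:

```shell
# Dump the four vdev labels from the disk; each label carries the pool
# config and an array of recent uberblocks (one per committed txg).
zdb -l /dev/dsk/c7d0s0

# Show the active uberblock (txg, timestamp) of the exported pool.
# -e examines on-disk state rather than /etc/zfs/zpool.cache.
zdb -e -u tank
```

Anything beyond inspection (invalidating an uberblock so an older one wins) rewrites label areas and should only be attempted on dd copies of the disk.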
Mike Gerdts
2008-Oct-09 11:37 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 4:53 AM, . <osl@boymonkey.com> wrote:
> While it's clearly my own fault for taking the risks I did, it's
> still pretty frustrating knowing that all my data is likely still
> intact and nicely checksummed on the disk but that none of it is
> accessible due to some tiny filesystem inconsistency. With pretty
> much any other FS I think I could get most of it back.
>
> Clearly such a small number of occurrences in what were admittedly
> precarious configurations aren't going to be particularly convincing
> motivators to provide a general solution, but I'd feel a whole lot
> better about using ZFS if I knew that there were some documented
> steps or a tool (zfsck? ;) that could help to recover from this kind
> of metadata corruption in the unlikely event of it happening.

Well said. You have hit on my #1 concern with deploying ZFS.

FWIW, I believe that I have hit the same type of bug as the OP in the
following combinations:

- T2000, LDoms 1.0, various builds of Nevada in control and guest
  domains.
- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
  build 97 guest

In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years. With other file systems I
can almost always get some data back. With ZFS I can't get any back.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Wilkinson, Alex
2008-Oct-09 11:46 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 09, 2008 at 06:37:23AM -0500, Mike Gerdts wrote:
> FWIW, I believe that I have hit the same type of bug as the OP in the
> following combinations:
>
> - T2000, LDoms 1.0, various builds of Nevada in control and guest
>   domains.
> - Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
>   build 97 guest
>
> In the past year I've lost more ZFS file systems than I have any other
> type of file system in the past 5 years. With other file systems I
> can almost always get some data back. With ZFS I can't get any back.

That's scary to hear!

 -aW

IMPORTANT: This email remains the property of the Australian Defence Organisation and is subject to the jurisdiction of section 70 of the CRIMES ACT 1914. If you have received this email in error, you are requested to contact the sender and delete the email.
Ahmed Kamal
2008-Oct-09 12:44 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> > In the past year I've lost more ZFS file systems than I have any other
> > type of file system in the past 5 years. With other file systems I
> > can almost always get some data back. With ZFS I can't get any back.
>
> That's scary to hear!

I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear!
Mike Gerdts
2008-Oct-09 13:22 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
<email.ahmedkamal@googlemail.com> wrote:
> > > In the past year I've lost more ZFS file systems than I have any other
> > > type of file system in the past 5 years. With other file systems I
> > > can almost always get some data back. With ZFS I can't get any back.
> >
> > That's scary to hear!
>
> I am really scared now! I was the one trying to quantify ZFS reliability,
> and that is surely bad to hear!

The circumstances where I have lost data have been when ZFS has not
handled a layer of redundancy. However, I am not terribly optimistic
about the prospects of ZFS on any device that hasn't committed writes
that ZFS thinks are committed. Mirrors and raidz would also be
vulnerable to such failures.

I have also run into other failures that have gone unanswered on the
lists. It makes me wary of using ZFS without a support contract
that allows me to escalate to engineering. Patching-only support
won't help.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
  Hang only after I mirrored the zpool; no response on the list.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
  I think this is fixed around snv_98, but the zfs-discuss list was
  surprisingly silent on acknowledging it as a problem - I had no
  idea it was being worked on until I saw the commit. The panic
  seemed to be caused by dtrace - core developers of dtrace
  were quite interested in the kernel crash dump.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
  Panic during ON build. Pool was lost; no response from list.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Timh Bergström
2008-Oct-09 14:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Unfortunately, I can only agree with the doubts about running ZFS in production environments. I've lost ditto blocks, I've gotten corrupted pools and a bunch of other failures, even in mirror/raidz/raidz2 setups, with or without hardware mirrors/raid5/6. Plus there is the insecurity that a sudden crash/reboot may corrupt or even destroy the pools, with "restore from backup" as the only advice. I've been lucky so far about getting my pools back, thanks to people like Victor.

What would be needed is a proper fsck for ZFS which can resolve "minor" data corruption. Tools for rebuilding, resizing and moving the data about on pools are also needed, even recovery of data from faulted pools, like there is for ext2/3/ufs/ntfs.

All in all, a great FS, but not production ready until the tools are in place or it gets really, really resilient to minor failures and/or crashes in both software and hardware. For now I'll stick to XFS/UFS and sw/hw-raid and live with the restrictions of such filesystems.

//T

2008/10/9 Mike Gerdts <mgerdts@gmail.com>:
> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed. Mirrors and raidz would also be
> vulnerable to such failures.
> [...]
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
Greg Shaw
2008-Oct-09 15:10 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Perhaps I mis-understand, but the below issues are all based on Nevada, not Solaris 10. Nevada isn''t production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6). In the last 2 years, I''ve stored everything in my environment (home directory, builds, etc.) on ZFS on multiple types of storage subsystems without issues. All of this has been on Solaris 10, however. Btw, I completely agree on the panic issue. If I have a large DB server with many pools, and one inconsequential pool fails, I lose the entire DB server. I''d really like to see an option at the zpool level directing what to do in a panic for a particular pool. Perhaps this is in the latest bits; if so, sorry, I''m running old stuff. :-) I also run ZFS on my mac. While not production quality, some of the panic errors dealing with external (firewire, usb, esata) are very irritating. A hiccup due to a jostled cable, and the entire box panics. That''s frustrating. Timh Bergstr?m wrote:> Unfortunely I can only agree to the doubts about running ZFS in > production environments, i''ve lost ditto-blocks, i''''ve gotten > corrupted pools and a bunch of other failures even in > mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6. > Plus the insecurity of a sudden crash/reboot will corrupt or even > destroy the pools with "restore from backup" as the only advice. I''ve > been lucky so far about getting my pools back thanks to people like > Victor. > > What would be needed is a proper fsck for ZFS which can resolv "minor" > data corruptions, tools for rebuilding, resizing and moving the data > about on pools is also needed, even recover of data from faulted > pools, like there is for ext2/3/ufs/ntfs. > > All in all, great FS but not production ready until the tools are in > place or it gets really really resillient to minor failures and/or > crashes in both software and hardware. 
For now i''ll stick to XFS/UFS > and sw/hw-raid and live with the restrictions of such fs. > > //T > > 2008/10/9 Mike Gerdts <mgerdts at gmail.com>: > >> On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal >> <email.ahmedkamal at googlemail.com> wrote: >> >>> > >>> >In the past year I''ve lost more ZFS file systems than I have any other >>> >type of file system in the past 5 years. With other file systems I >>> >can almost always get some data back. With ZFS I can''t get any back. >>> >>> >>>> Thats scary to hear! >>>> >>>> >>> I am really scared now! I was the one trying to quantify ZFS reliability, >>> and that is surely bad to hear! >>> >> The circumstances where I have lost data have been when ZFS has not >> handled a layer of redundancy. However, I am not terribly optimistic >> of the prospects of ZFS on any device that hasn''t committed writes >> that ZFS thinks are committed. Mirrors and raidz would also be >> vulnerable to such failures. >> >> I also have run into other failures that have gone unanswered on the >> lists. It makes me wary about using zfs without a support contract >> that allows me to escalate to engineering. Patching only support >> won''t help. >> >> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html >> Hang only after I mirrored the zpool, no response on the list >> >> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html >> I think this is fixed around snv_98, but the zfs-discuss list was >> surprisingly silent on acknowledging it as a problem - I had no >> idea that it was being worked until I saw the commit. The panic >> seemed to be caused by dtrace - core developers of dtrace >> were quite interested in the kernel crash dump. >> >> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html >> Panic during ON build. Pool was lost, no response from list. 
>>
>> --
>> Mike Gerdts
>> http://mgerdts.blogspot.com/
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Mike Gerdts
2008-Oct-09 15:18 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg.Shaw at sun.com> wrote:
> Nevada isn't production code. For real ZFS testing, you must use a
> production release, currently Solaris 10 (update 5, soon to be update 6).

I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration; I pushed for and got a fix. However, that pool was still lost.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Miles Nordin
2008-Oct-09 18:38 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
>>>>> "gs" == Greg Shaw <Greg.Shaw at Sun.COM> writes:

    gs> Nevada isn't production code. For real ZFS testing, you must
    gs> use a production release, currently Solaris 10 (update 5, soon
    gs> to be update 6).

based on list feedback, my impression is that the results of a ``test'' confined to s10, particularly s10u4 (the latest available during most of Mike's experience), would be worse than the Nevada experience over the same period. but I doubt either matches UFS+SVM or ext3+LVM2. The on-disk format with ``ditto blocks'' and ``always consistent'' may be fantastic, but the code for reading it is not.

Maybe the code is stellar, and the problem really is underlying storage stacks that fail to respect write barriers. If so, ZFS needs to include a storage stack qualification tool. For me it doesn't strain credibility to believe these problems might be rampant in VM stacks and SANs, nor do I find it unacceptable if ZFS is vastly more sensitive to them than any other filesystem. If this speculation turns out to really be the case, I imagine the two going together: the problems are rampant because they don't bother other filesystems too catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed.

To my view, the above is *IN ADDITION* to developing a recovery/forensic/``fsck'' tool, not either/or. The pools should not be getting corrupt in the first place, and pulling the cord should not mean you have to settle for best effort. None of the modern filesystems demand an fsck after unclean shutdown.

The current procedure for qualifying a platform seems to be: (1) subject it to heavy write activity, (2) pull the cord, (3) repeat.
Ahmed, maybe you should use that test to ``quantify'' filesystem reliability. You can try it with ZFS, then reinstall the machine with CentOS and try the same test with ext3+LVM2 or xfs+areca. The numbers you get are how many times you can pull the cord before you lose something, and how much you lose. Here's a really old test of that sort comparing Linux filesystems which is something like what I have in mind:

https://www.redhat.com/archives/fedora-list/2004-July/msg00418.html

so you see he got two sets of numbers---number of reboots and amount of corruption. For reiserfs and JFS he lost their equivalent of ``the whole pool'', and for ext3 and XFS he got corruption but never lost the pool. It's not clear to me the filesystems ever claimed to prevent corruption in his test scenario (was he calling fsync() after each log write? syslog does that sometimes, and if so, they do claim it, but if he's just writing with some silly script they don't), but definitely they do all claim you won't lose the whole pool in a power outage, and only two out of four delivered on that. I base my choice of Linux filesystem on this test, and wish I'd done such a test before converting things to ZFS.
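[Editor's note: the pull-the-cord measurement described above can be mocked up in a few lines. This is a toy model, not a real test harness -- `Disk`, `pull_the_cord_trial`, and the `honest_sync` flag are invented names, and a real qualification tool would drive physical hardware. The sketch only makes the two numbers concrete: records the application believes are durable versus records that actually survive the power cut.]

```python
class Disk:
    """Toy model of a drive with a volatile write cache. When
    honest_sync is False the drive acknowledges the synchronize-cache
    command but does nothing -- the lying behaviour discussed in this
    thread -- so a power cut discards whatever is still cached."""

    def __init__(self, honest_sync=True):
        self.media = []   # records on stable storage
        self.cache = []   # records in the volatile write cache
        self.honest_sync = honest_sync

    def write(self, rec):
        self.cache.append(rec)

    def sync(self):
        # The synchronize-cache command: an honest drive flushes,
        # a lying one just says "got it, boss".
        if self.honest_sync:
            self.media.extend(self.cache)
            self.cache.clear()

    def power_cut(self):
        self.cache.clear()  # the volatile cache is lost


def pull_the_cord_trial(disk, n=100):
    """Write n records, syncing after each, then pull the cord.
    Returns the number of acknowledged-but-lost records."""
    acked = 0
    for i in range(n):
        disk.write(i)
        disk.sync()   # fsync() returns; the app believes record i is safe
        acked += 1
    disk.power_cut()
    return acked - len(disk.media)


print(pull_the_cord_trial(Disk(honest_sync=True)))   # 0: nothing acked is lost
print(pull_the_cord_trial(Disk(honest_sync=False)))  # 100: the whole cache vanishes
```

The point of the model: with an honest drive the count of acknowledged-but-lost records is zero by construction; with a lying drive it is bounded only by the cache size, which is why the amount lost per cord-pull is the number worth measuring.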
Bob Friesenhahn
2008-Oct-09 19:06 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, 9 Oct 2008, Miles Nordin wrote:
> catastrophically. If this is really the situation, then ZFS needs to
> give the sysadmin a way to isolate and fix the problems
> deterministically before filling the pool with data, not just blame
> the sysadmin based on nebulous speculatory hindsight gremlins.
>
> And if it's NOT the case, the ZFS problems need to be acknowledged and
> fixed.

Can you provide any supportive evidence that ZFS is as fragile as you describe?

From recent opinions expressed here, properly-designed ZFS pools must be inexplicably permanently cratering each and every day.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Mike Gerdts
2008-Oct-10 03:33 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail.com> wrote:
> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg.Shaw at sun.com> wrote:
>> Nevada isn't production code. For real ZFS testing, you must use a
>> production release, currently Solaris 10 (update 5, soon to be update 6).
>
> I misstated before in my LDoms case. The corrupted pool was on
> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
> zpool there showed no problems. I got into a panic loop with dangling
> dbufs. My understanding is that this was caused by a bug in the LDoms
> manager 1.0 code that has been fixed in a later release. It was a
> supported configuration, I pushed for and got a fix. However, that
> pool was still lost.

Or maybe it wasn't fixed yet. I see that this was committed just today.

6684721 file backed virtual i/o should be synchronous
http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

--
Mike Gerdts
http://mgerdts.blogspot.com/
Timh Bergström
2008-Oct-10 07:38 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
2008/10/9 Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>
>> catastrophically. If this is really the situation, then ZFS needs to
>> give the sysadmin a way to isolate and fix the problems
>> deterministically before filling the pool with data, not just blame
>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>
>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>> fixed.
>
> Can you provide any supportive evidence that ZFS is as fragile as you
> describe?

The hundreds of sysadmins seeing their pools go bye-bye after normal operations in a production environment is evidence enough. And the number of times people like Victor have saved our asses.

> From recent opinions expressed here, properly-designed ZFS pools must
> be inexplicably permanently cratering each and every day.
>
> Bob

--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
Jeff Bonwick
2008-Oct-10 08:26 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say "got it, boss", yet the data is still not on stable storage. Why do they do this? Because "it performs better". Well, duh -- you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is *fraud* to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry.

Now: the uberblock ring buffer in ZFS gives us a way to cope with this, as long as we don't reuse freed blocks for a few transaction groups. The basic idea: if we can't read the pool starting from the most recent uberblock, then we should be able to use the one before it, or the one before that, etc., as long as we haven't yet reused any blocks that were freed in those earlier txgs. This allows us to use the normal load on the pool, plus the passage of time, as a displacement flush for disk caches that ignore the sync command. If we go back far enough in (txg) time, we will eventually find an uberblock all of whose dependent data blocks have made it to disk.
I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough.

Jeff
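[Editor's note: a rough sketch of the fallback idea, for illustration only. `Uberblock` and `pick_uberblock` are made-up names, not the actual ZFS data structures, and real recovery would verify checksums rather than set membership -- but the walk-backwards logic is the one described above.]

```python
class Uberblock:
    def __init__(self, txg, blocks):
        self.txg = txg          # transaction group number
        self.blocks = blocks    # block ids this txg's tree depends on


def pick_uberblock(ring, on_disk):
    """Walk the uberblock ring from the newest txg backwards and return
    the first uberblock whose entire dependent block set actually made
    it to stable storage. If a dropped cache flush tore the newest txg,
    fall back to an older, complete one -- valid only as long as blocks
    freed in those recent txgs have not been reused."""
    for ub in sorted(ring, key=lambda u: u.txg, reverse=True):
        if ub.blocks <= on_disk:    # all dependent blocks present?
            return ub
    return None


# txg 207161 references block "z", which was still in the disk's cache
# at the power cut; txg 207160's tree is intact on the media.
on_disk = {"a", "b", "c"}
ring = [Uberblock(207160, {"a", "b"}), Uberblock(207161, {"a", "b", "z"})]
print(pick_uberblock(ring, on_disk).txg)   # 207160
```

The "don't reuse freed blocks for a few txgs" condition is what makes this safe: if block "c" above had been freed in txg 207160 and already rewritten, the older uberblock's tree would no longer be trustworthy even though every pointer resolves.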
Ross
2008-Oct-10 09:29 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
That sounds like a great idea for a tool, Jeff. Would it be possible to build that in as a "zpool recover" command? Being able to run a tool like that and see just how bad the corruption is, but know it's possible to recover an older version, would be great. Is there any chance of outputting details so the sysadmin can know roughly how much was lost?

My thoughts are going to be very rough (I don't know much about ZFS internals), but I'm wondering if something like this would work, where all bad blocks are reported, along with the latest 3 good ones:

**************************************
# zpool recover <pool>
......... pool details ...........
Finding and testing uberblocks...
1. block a   date/time: xxxxx/xxxx   CORRUPTED
2. block b   date/time: yyyyy/yyyy   CORRUPTED
3. block c   date/time: zzzzz/zzzz   Appears OK
4. block d   date/time: zzzzz/zzzz   Appears OK
5. block e   date/time: zzzzz/zzzz   Appears OK
**************************************

Victor was talking in another thread about using zdb to check the pool before doing an import of a damaged pool. Might it be possible for the next stage of the recovery process to give the user an option of testing or importing the pool for any particular uberblock? It does sound like testing can take a long time, so this would need to be something that can be cancelled, and you would also need a way to mark uberblocks as bad should problems be found with either the test or import.

This would be a great addition to ZFS though, and would hopefully save Victor a bit of time ;-)

Ross
--
This message posted from opensolaris.org
Ricardo M. Correia
2008-Oct-10 09:48 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hi Jeff,

On Sex, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote:
>> The circumstances where I have lost data have been when ZFS has not
>> handled a layer of redundancy. However, I am not terribly optimistic
>> of the prospects of ZFS on any device that hasn't committed writes
>> that ZFS thinks are committed.
>
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.

It's not just about ignoring the synchronize-cache command; there's also another weak spot. ZFS is quite resilient against so-called phantom writes, provided that they occur sporadically - let's say, if the disk decides to _randomly_ ignore writes 10% of the time, ZFS could probably survive that pretty well even on single-vdev pools, due to ditto blocks.

However, it is not so resilient when the storage system suffers hiccups which cause phantom writes to occur continuously, even if for a small period of time (say less than 10 seconds), and then return to normal. This could happen for several reasons, including network problems, bugs in software or even firmware, etc.

I think in this case, going back to a previous uberblock could also be enough to recover from such a scenario most of the time, unless perhaps the error occurred too long ago and the unwritten metadata got flushed out of the ARC and didn't have a chance to get rewritten.

In any case, a more generic solution to repair all kinds of metadata corruption, such as (e.g.) space map corruption, would be very desirable, as I think everyone can agree.

Best regards,
Ricardo
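[Editor's note: the sporadic-versus-continuous distinction above can be illustrated with a toy simulation. This is hypothetical code, not ZFS internals: each metadata block gets two "ditto" copies issued back to back, and we compare dropping 10% of writes at random against dropping the same number of writes in one contiguous outage window.]

```python
import random

def surviving(blocks, lost_writes):
    """Count ditto-protected blocks for which at least one copy was
    actually written (i.e. was not a phantom write)."""
    return sum(1 for copies in blocks
               if any(c not in lost_writes for c in copies))

random.seed(1)

# 1000 metadata blocks, each with two ditto copies issued back to back
# as write ids 2i and 2i+1.
blocks = [(2 * i, 2 * i + 1) for i in range(1000)]
writes = range(2000)

# Sporadic failure: each write is independently dropped 10% of the time,
# so losing BOTH copies of one block happens only ~1% of the time.
sporadic = {w for w in writes if random.random() < 0.10}

# Continuous hiccup: the same 10% of writes, but as one contiguous
# outage window -- both copies of a block land inside it together.
start = random.randrange(1800)
burst = set(range(start, start + 200))

print(surviving(blocks, sporadic))  # nearly all survive via the second copy
print(surviving(blocks, burst))     # blocks fully inside the window lose both copies
```

Same number of dropped writes in both runs; only their clustering differs. That is why ditto blocks handle random phantom writes well but a few seconds of continuous failure can still take out metadata.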
Marcelo Leal
2008-Oct-10 13:15 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hello all,

I think the problem here is ZFS's capacity for recovery from a failure. Forgive me, but in aiming to create code "without failures", maybe the hackers forgot that other people can make mistakes (even if they can't).

- "ZFS does not need fsck."
Ok, that's a great statement, but I think ZFS needs one. Really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options.

- "I have 90% of something I think is your filesystem, do you want it?"
I think software is as good as its ability to recover from failures. And I don't want to know who failed; I'm not going to send anyone to jail, I'm not a lawyer. I agree with Jeff, really do, but that is "another" problem... The solution Jeff is working on, I think, is really great, since it is NOT "all or nothing" again...

I don't know about you, but A LOT of times I was saved by the "Lost and Found" directory! All the beauty of a UNIX system is "rm /etc/passwd" after having edited it, and getting the whole file back by doing a "cat /dev/mem". ;-)

I think there are a lot of parts of the ZFS design that remind me of when you see something left on the floor at home, so you ask your son why he did not pick it up, and he says "it was not me".

peace.

Leal.
Miles Nordin
2008-Oct-10 17:58 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
>>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes:
>>>>> "rmc" == Ricardo M Correia <Ricardo.M.Correia at Sun.COM> writes:

    jb> We need a little more Code of Hammurabi in the storage
    jb> industry.

It seems like most of the work people have to do now is cleaning up after the sloppiness of others. At least it takes the longest. You could always mention which disks you found ignoring the command---wouldn't that help the overall problem? I understand there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but I don't understand where it comes from.

http://www.ferris.edu/news/jimcrow/tom/

    jb> displacement flush for disk caches that ignore the sync
    jb> command.

Sounds like a good idea, but:

(1) won't this break the NFS guarantees you were just saying should never be broken? I get it, someone else is breaking a standard so how can ZFS be expected to yadda yadda yadda. But I fear it will just push ``blame the sysadmin'' one step further out. ex.,

 Q. ``with ZFS all my NFS clients become unstable after the server reboots,'' or ``I'm getting silent corruption with NFS''.

 A. ``your drives might have gremlins in them, no way to know,'' and ``well what do you expect without a single integrity domain and TCP's weak checksums. / no i'm using a crossover cable, and FCS is not weak. / ZFS managing a layer of redundancy it is probably your RAM or corruption on the uh, between the Ethernet MAC chip and the PCI slot''

(1a) I'm concerned about how it'll be reported when it happens.

 (a) if it's not reported at all, then ZFS is hiding the fact that fsync() is not working. Also, other journaling filesystems sometimes report when they find ``unexpected'' corruption, which is useful for finding both hardware and software problems.
I'm already concerned ZFS is not reporting enough, like when it says a vdev component is ONLINE, but 'zpool offline pool <component>' says 'no valid replicas'; then after a scrub there is no change to zpool status, but zpool offline works again. ZFS should not ``simplify'' the user interface to the point that it's hiding problems with itself and its environment for the sake of avoiding discussion.

 (b) if it is reported, then whenever the reporter-blob raises its hand it will have the effect of exonerating ZFS in most people's minds, like the stupid CKSUM column does right now. ``ZFS-FEED-B33F error? oh yeah that's the new ueberblock search code. that means your disks are ignoring the SYNCHRONIZE CACHE command. thank GOD you have ZFS with ANY OTHER FILESYSTEM all bets would be totally off. lucky you. / I have tried ten different models from all four brands. / yeah sucks don't it? flagrant violation of the standard, industry wide. / my linux testing tool says they're obeying the command fine / linux is crap / i added a patch to solaris to block the SYNC CACHE command and the disks got faster so I think it's not being ignored / well the stack is complicated and flushing happens at many levels, like think about controller performance, and that's completely unsupported you are doing something REALLY UNSAFE there you should NOT DO THAT it is STUPID'' and so on, stalling the actual fix literally for years.

The right way to exonerate ZFS is to make a diagnosis tool for the disks which proves they're broken, and then don't buy those disks; not to make a new class of ZFS fault report that could potentially capture all kinds of problems, then hazily assign blame to an untestable quantity.

(2) disks are probably not the only thing dropping the write barriers. So far, we're also suspecting (unproven!) iSCSI targets/initiators, particularly around a TCP reconnection event or target reboot, and VM stacks, both VirtualBox and the HVM in UltraSPARC T1.
probably other stuff, too. I'm concerned that assumptions you'll find safe to make about disks after you get started, like nothing is more than 1s stale, or send a CDB to size the on-disk cache and imagine it's a FIFO and it'll be no worse than that, or ``you can get an fsync by pausing reads for 500ms'' or whatever, will add robustness for current and future broken disks but won't apply to other types of broken storage layer.

    rmc> However, it is not so resilient when the storage system
    rmc> suffers hiccups which cause phantom writes to occur
    rmc> continuously, even if for a small period of time (say less
    rmc> than 10 seconds), and then return to normal.

ha! that is a great idea. temporal ditto blocks: important writes should be written, aged in RAM for 1 minute, then rewritten. :) This will help with latent sector errors caused by power sag/vibration, too.

but... even I will admit at some point you have to give up and let the filesystem get corrupted. Actually I'm more in the camp of making ZFS fragile to incorrect storage stacks, and offering an offline recovery tool that treats the corrupt pool as read-only and copies it into a new filesystem (so you need a second same-size empty pool to use the tool). I like this painful way better than fsck-like things, and much better than silent workarounds. but I'm probably in the wrong camp on this one.

My reasoning is, we will not be ultimately happy with a filesystem where fsync() is broken, and that's the best you can do. To compete with Netapp, we need to bang on this thing until it's actually working. So far I think sysadmins are receptive to the idea that they need to fix <...> about their setup, or make purchases with extreme care, or do testing before production. We are not lazy and do not expect an appliance-on-a-CD. It's just that pass-the-buck won't ever deliver something useful. When ext3 was corrupting filesystems on laptops, ext3 got blamed, and ext3 was not at the root of the problem.
But no one _accepted_ that ext3 was correctly coded until the overall problem was fixed. (IIRC it was: you need to send drives a stop-unit command before sending the ACPI powerdown, because even if they ignore synchronize-cache they do still flush when told to stop-unit.)

It's proper to have a strict separation between ``unclean shutdown'' and ``recovery from corruption''. UFS does have the separation between log-rolling and fsck-ing, but ZFS could detect the difference between unclean shutdown and corruption a lot better than UFS, and that's good. Currently ZFS seems to detect it by telling you ``pool's corrupt. <shrug>, destroy it.''---the fact that the recovery tool is entirely absent isn't good, but keeping recovery actions like this ueberblock-search strictly separate makes delivering something truly correct on the ``unclean shutdown'' front more likely.

I think, if iSCSI target/initiator combinations are silently discarding 10sec worth of writes (ex., when they drop and reconnect their TCP session), then this needs to be proven and their implementations can be and need to be corrected, not speculated on and then worked around. And I bet this same beefing-up of performance numbers by discarding cache flushes is as rampant in the virtualization game as in the hard disk game.
Eric Schrock
2008-Oct-10 18:23 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
> - "ZFS does not need fsck".
> Ok, that's a great statement, but i think ZFS needs one. Really does.
> And in my opinion a enhanced zdb would be the solution. Flexibility.
> Options.

About 99% of the problems reported as "I need ZFS fsck" can be summed up by two ZFS bugs:

1. If a toplevel vdev fails to open, we should be able to pull
   information from necessary ditto blocks to open the pool and make
   what progress we can. Right now, the root vdev code assumes "can't
   open = faulted pool," which results in failure scenarios that are
   perfectly recoverable most of the time. This needs to be fixed
   so that pool failure is only determined by the ability to read
   critical metadata (such as the root of the DSL).

2. If an uberblock ends up with an inconsistent view of the world (due
   to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
   to go back to previous uberblocks to find a good view of our pool.
   This is the failure mode described by Jeff.

These are both bugs in ZFS and will be fixed. The other 1% of the complaints are usually of the form "I created my pool on top of my old one" or "I imported a LUN on two different systems at the same time". It's unclear what a 'fsck' tool could do in this scenario, if anything. Due to a variety of reasons (hierarchical nature of ZFS, variable block sizes, RAID-Z, compression, etc.), it's difficult to even *identify* a ZFS block, let alone determine its validity and associate it with some larger construct.

There are some interesting possibilities for limited forensic tools - in particular, I like the idea of an mdb backend for reading and writing ZFS pools[1]. But I haven't actually heard a reasonable proposal for what a fsck-like tool (i.e. one that could "repair" things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?) where the standard ZFS infrastructure fails.
- Eric

[1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html

--
Eric Schrock, Fishworks            http://blogs.sun.com/eschrock
Victor Latushkin
2008-Oct-10 19:48 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Eric Schrock wrote:
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
>> - "ZFS does not need fsck".
>> Ok, that's a great statement, but i think ZFS needs one. Really does.
>> And in my opinion a enhanced zdb would be the solution. Flexibility.
>> Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.

I've mostly seen (2), because despite all the best practices out there, single-vdev pools are quite common. In all such cases that I had my hands on, it was possible to recover the pool by going back one or two txgs.

> These are both bugs in ZFS and will be fixed. The other 1% of the
> complaints are usually of the form "I created my pool on top of my old
> one" or "I imported a LUN on two different systems at the same time".

Of these two, the former is not easy because it requires searching through the entire disk space for root block candidates and trying each of them. The latter is not catastrophic in case there was little to no activity from one system. In this case one of the first things to suffer is the pool config object, and corruption of it prevents pool open.
Fortunately enough, after the putback of

6733970 assertion failure in dbuf_dirty() via spa_sync_nvlist()

in build 99, a corrupted pool config object is written during open in such a way that prevents reading in the old corrupted copy, and in most cases this allows one to import the pool and save most of the data. zdb is useful to understand how much is corrupted and how much is recovered. If nothing else is corrupted, then the pool may be available for further use without recreation. Again, in every case I had my hands on it was possible to either recover the pool completely or at least save most of the data.

> It's unclear what a 'fsck' tool could do in this scenario, if anything.
> Due to a variety of reasons (hierarchical nature of ZFS, variable block
> sizes, RAID-Z, compression, etc), it's difficult to even *identify* a
> ZFS block, let alone determine its validity and associate it in some
> larger construct.

Indeed. In the "more ZFS recovery" case involving a 42TB pool with about 8TB used, zdb -bv alone took several hours to walk the block tree and verify consistency of block pointers, and zdb -bcv took a couple of days to verify all user data blocks as well. And different checksums and gang blocks, in addition to all the other dynamic features mentioned, complicate the task of identifying ZFS blocks and linking those blocks into a tree, and make it really time- (and space-) consuming.

> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1]. But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

There are a number of bugs and RFEs to improve the usefulness of zdb for field use, e.g.
6720637 want zdb -l option to dump uberblock arrays as well
6709782 issues running zdb with -p and -e options
6736356 zdb -R needs to work with exported pools
6720907 zdb should handle errors while dumping datasets and objects
6746101 zdb command to search for ZFS labels in a device
6757444 want zdb -R to support decompression, checksumming and raid-z
6757430 want an option for zdb to disable space map loading and leak tracking

Hth,
Victor

> - Eric
>
> [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
Timh Bergström
2008-Oct-10 19:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
2008/10/10 Richard Elling <Richard.Elling at sun.com>:
> Timh Bergström wrote:
>> 2008/10/9 Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>>> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>>>
>>>> catastrophically. If this is really the situation, then ZFS needs to
>>>> give the sysadmin a way to isolate and fix the problems
>>>> deterministically before filling the pool with data, not just blame
>>>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>>>
>>>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>>>> fixed.
>>>
>>> Can you provide any supportive evidence that ZFS is as fragile as you
>>> describe?
>>
>> The hundreds of sysadmins seeing their pools go bye-bye after normal
>> operations in a production environment is evidence enough. And the
>> number of times people like Victor have saved our asses.
>
> Hundreds? Do you have evidence of this?

One is one too many; I don't need evidence of hundreds - that is hopefully an exaggeration.

//T

> -- richard
Marcelo Leal
2008-Oct-10 20:29 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
>> - "ZFS does not need fsck".
>> Ok, that's a great statement, but i think ZFS needs one. Really does.
>> And in my opinion a enhanced zdb would be the solution. Flexibility.
>> Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.
>
> These are both bugs in ZFS and will be fixed.

That's it! It's 100% for me! ;-) One is the "all-or-nothing" problem, and the other is about guilt... ;-))

> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1].

In my opinion it would be great to have the whole functionality in zdb. It's simple, and the concepts are clear in the tool. mdb is a debugger, and needs concepts that I think are different from a tool for reading/fixing filesystems. Just an opinion... which does not mean we cannot have both. Like I said: flexibility, options... ;-)

> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool

I think we must NOT get stuck on the word "fsck"; I have used it just as an example (Lost and Found). And I think other users used it just as an example too.
The important is the two points you have described very *well*. (i.e. one that could "repair" things> automatically) would > actually *do*, let alone how it would work in the > variety of situations > it needs to (compressed RAID-Z?) where the standard > ZFS infrastructure > fails. > > - Eric > > [1] > http://mbruning.blogspot.com/2008/08/recovering-remove > d-file-on-zfs-disk.html > > -- > Eric Schrock, Fishworks > http://blogs.sun.com/eschrock > ________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ssMany thanks for your answer! Leal. -- This message posted from opensolaris.org
Ricardo M. Correia
2008-Oct-10 20:42 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote:

> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

I'd say an fsck-like tool for ZFS should not worry much about compression, checksums, RAID-Z and whatnot. In essence, it would try to do what an fsck tool does for a typical filesystem, and so would be mostly oblivious to the layout or encoding of the blocks, perhaps treating blocks with failed checksums as blocks full of zeros.

Here's how it could work (of course, this is all easier said than done):

1) Open all the devices specified by the user. Optionally, take just a pool name/guid and scan for the right devices in /dev/[r]dsk.

2) Verify whether the pool configuration read from the devices is sane -- if not, try to generate a consistent configuration. Some elements of the pool configuration, such as the correct pool version, could be checked in later steps, depending on features that were found.

3) Starting from the last uberblock, fully traverse a few levels down the tree. If less than 100% of the blocks could be read without errors, do the same for previous uberblocks and offer the user the choice of which uberblock to use, or if running non-interactively, choose the one with the best success rate.

4) Traverse the list/tree of filesystems, snapshots and clones. Make sure that they are well-connected. For each filesystem, try to replay the ZILs, then clean them out.

5) Now fully traverse the pool. Compute the space maps and FS space usage on the go, as blocks are read.

6) For each metadata block read, check whether the fields are sane; fix them/zero them out if they're not. Basically we're assuming here that we may have corrupted metadata with correct checksums. If some metadata block can not be read due to a failed checksum, assume the block is full of zeros, and fix it. By the way, this includes every field of every kind of metadata block, including ZAPs, ACLs, FID maps, znode fields, everything. For fields that reference other objects, make sure that the object they reference is of the correct type and that the object itself is correct. For objects that are missing, create empty ones if necessary.

7) Check that every object is referenced somewhere and link unreferenced objects to /lost+found/object-type/, or similar.

8) Probably do other things that I'm forgetting.

9) In the end, check whether the space maps are consistent with the ones computed, and write correct ones if not. Check that space usage/reservations/quotas are correct.

Essentially, the goal is that at the end of this process the pool should contain consistent information, should have as much data as could be recovered, and should never cause any further errors in ZFS due to invalid metadata/fields -- either when importing it, reading from it, or writing/modifying it (except that it would still return EIO errors when trying to read corrupted file data blocks, of course).

Now, a problem with fsck-like tools, and perhaps especially with ZFS, is that some of these steps may either require lots of memory or multiple filesystem/pool traversals. I'd say having such a tool, even if it required additional temporary storage for operation (hopefully not a very large fraction of the pool size), would be *very* useful and would clear up any worries that people currently have.

Kind regards,
Ricardo
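[Editorial aside] Step 3 of the proposal above -- fall back through older uberblocks and score each candidate root by how much of the tree it can read -- can be sketched roughly as follows. Everything here is invented for illustration: a "block tree" is just a nested dict, an unreadable block is `None`, and real uberblocks, block pointers, and checksums look nothing like this.

```python
# Illustrative sketch of step 3: try uberblocks newest-first and pick the
# one whose block tree yields the highest read success rate.  The "on-disk
# format" is fake: each uberblock is the root of a nested dict tree, and a
# None child stands for a block that fails its checksum.

def traverse(block):
    """Return (readable, total) block counts for a tree."""
    if block is None:              # failed checksum / unreadable block
        return (0, 1)
    readable, total = 1, 1
    for child in block.get("children", []):
        r, t = traverse(child)
        readable += r
        total += t
    return (readable, total)

def pick_uberblock(uberblocks):
    """uberblocks: newest-first list of root blocks.
    Return (index, success_rate) of the best root, preferring newer
    ones and stopping early at the first fully readable tree."""
    best = (-1, -1.0)
    for i, ub in enumerate(uberblocks):
        readable, total = traverse(ub)
        rate = readable / total
        if rate == 1.0:
            return (i, rate)       # newest fully consistent view wins
        if rate > best[1]:
            best = (i, rate)
    return best

# The newest uberblock references a damaged tree; the previous txg is intact.
newest = {"children": [None, {"children": []}]}
previous = {"children": [{"children": []}, {"children": []}]}
print(pick_uberblock([newest, previous]))  # -> (1, 1.0)
```

A real tool would of course bound how many levels it traverses per candidate (a full traversal per uberblock would be prohibitively slow on a large pool), which is why the proposal says "a few levels down the tree".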
Richard Elling
2008-Oct-10 22:38 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Timh Bergström wrote:
> 2008/10/10 Richard Elling <Richard.Elling at sun.com>:
>> Timh Bergström wrote:
>>> 2008/10/9 Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>>>> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>>>> catastrophically.  If this is really the situation, then ZFS needs to
>>>>> give the sysadmin a way to isolate and fix the problems
>>>>> deterministically before filling the pool with data, not just blame
>>>>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>>>>
>>>>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>>>>> fixed.
>>>>
>>>> Can you provide any supportive evidence that ZFS is as fragile as you
>>>> describe?
>>>
>>> The hundreds of sysadmins seeing their pools go byebye after normal
>>> operations in a production environment is evidence enough. And the
>>> number of times people like Victor have saved our asses.
>>
>> Hundreds?  Do you have evidence of this?
>
> One is one too many, I don't need evidence of hundreds - that is
> hopefully an exaggeration.

Don't show up to a data fight without data :-/

Yes, we do track this information and guys like me analyze it. The ratio of installed base to problem reports for ZFS is quite high. When we see a trend, we adjust priorities to address it. This is just part of our overall quality program.

Which brings me to the required mantra: if you don't file a bug or make a service call, the problem doesn't get tracked. Please make the effort so that we can prioritize the use of our limited resources. Posting a fine whine on this (or any) forum is not guaranteed to result in an entry in our problem tracking system -- someone has to put in the extra effort, or it will fall into the silent complainant category. Please help us to improve the quality of our systems, thanks.
 -- richard
David Magda
2008-Oct-11 01:55 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Oct 10, 2008, at 15:48, Victor Latushkin wrote:

> I've mostly seen (2), because despite all the best practices out there,
> single vdev pools are quite common. In all such cases that I had my
> hands on it was possible to recover the pool by going back by one or two
> txgs.

For better or worse this is the case where I work. Most of our storage is on SANs (EMC and NetApp), and so if we need more space we ask for it and we get a giant LUN given to us (usually multi-pathed). We also have a lot of Veritas VxVM and VxFS for Oracle, and so even if we're running Solaris 10, we're not using ZFS in that case.

SAN space is also allocated to Windows and VMware ESX machines as well, so it's not like we can ask for the disks in the SAN to be exported raw, as that would mess up managing of things with the other OSes. (We have a very small global storage / back up team, and I really don't want to add more to their workload.)

If someone finds themselves in this position, what advice can be followed to minimize risks? For example, is having checksums enabled a good idea? If you have no redundancy and an error occurs, the system will panic by default (configurable in newer builds of OpenSolaris, but not in Solaris 'proper' yet). But if the system is ignoring checksums, you're no worse off than most other file systems (but still get all the other features of ZFS).

Or is there a way to mitigate a checksum error on a non-redundant zpool?
Jeff Bonwick
2008-Oct-11 02:14 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> Or is there a way to mitigate a checksum error on a non-redundant zpool?

It's just like the difference between non-parity, parity, and ECC memory. Most filesystems don't have checksums (non-parity), so they don't even know when they're returning corrupt data. ZFS without any replication can detect errors, but can't fix them (like parity memory). ZFS with mirroring or RAID-Z can both detect and correct (like ECC memory).

Note: even in a single-device pool, ZFS metadata is replicated via ditto blocks at two or three different places on the device, so that a localized media failure can be both detected and corrected. If you have two or more devices, even without any mirroring or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) across those devices.

Jeff
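[Editorial aside] The parity/ECC analogy can be made concrete with a toy model. This is purely illustrative -- real ZFS stores 256-bit checksums in parent block pointers, not a CRC32 next to the data -- but it shows the mechanics: one checksummed copy can detect corruption yet not repair it, while a second ditto copy lets the reader fall back to the surviving replica.

```python
import zlib

def store(data: bytes, copies: int):
    """Store `copies` replicas of data, each with its own checksum."""
    return [{"data": data, "cksum": zlib.crc32(data)} for _ in range(copies)]

def read(blocks):
    """Return (data, status).  Detection needs one checksummed copy;
    correction needs a second (ditto) copy to fall back on."""
    for blk in blocks:
        if zlib.crc32(blk["data"]) == blk["cksum"]:
            return blk["data"], "ok"
    return None, "unrecoverable (detected, but no good copy left)"

payload = b"important metadata"

# Single copy (parity-memory analogy): corruption is detected, not fixed.
single = store(payload, copies=1)
single[0]["data"] = b"important metadat4"   # silent corruption on disk
print(read(single))

# Two ditto copies (ECC analogy): the intact replica satisfies the read.
ditto = store(payload, copies=2)
ditto[0]["data"] = b"important metadat4"
print(read(ditto))  # -> (b'important metadata', 'ok')
```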
Mike Gerdts
2008-Oct-11 03:59 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:

> Note: even in a single-device pool, ZFS metadata is replicated via
> ditto blocks at two or three different places on the device, so that
> a localized media failure can be both detected and corrected.
> If you have two or more devices, even without any mirroring
> or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
> across those devices.

And in the event that you have a pool that is mostly not very important but some of it is important, you can have data mirrored on a per-dataset level via copies=n.

If we can avoid losing an entire pool by rolling back a txg or two, the biggest source of data loss and frustration is taken care of. Ditto blocks for metadata should take care of most other cases that would result in widespread loss. Normal bit rot that causes you to lose blocks here and there is somewhat likely to take out a small minority of files and spit warnings along the way. If there are some files that are more important to you than others (e.g. losing files in rpool/home may have more impact than rpool/ROOT), copies=2 can help there. And for those places where losing a txg or two is a mortal sin, don't use flaky hardware, and allow zfs to handle a layer of redundancy.

This gets me thinking that it may be worthwhile to have a small (<100 MB x 2) rescue boot environment with copies=2 (as well as rpool/boot/) so that "pkg repair" could be used to deal with cases that prevent your normal (>4 GB) boot environment from booting.

--
Mike Gerdts
http://mgerdts.blogspot.com/
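[Editorial aside] A back-of-the-envelope simulation shows why copies=2 on a small important dataset changes the odds against scattered bit rot. The numbers here (8 blocks per file, 1% chance a given replica rots, replicas failing independently) are made up for illustration and are not measured ZFS behavior; the point is only the shape of the result: doubling copies squares the per-block failure probability.

```python
import random

def survives(blocks_per_file, copies, p_bad, rng):
    """A file survives if, for every logical block, at least one of its
    `copies` replicas avoids bit rot (each replica fails independently
    with probability p_bad)."""
    return all(
        any(rng.random() > p_bad for _ in range(copies))
        for _ in range(blocks_per_file)
    )

rng = random.Random(42)           # fixed seed for a repeatable run
N_FILES, BLOCKS, P_BAD = 10_000, 8, 0.01

results = {}
for copies in (1, 2):
    alive = sum(survives(BLOCKS, copies, P_BAD, rng) for _ in range(N_FILES))
    results[copies] = alive
    print(f"copies={copies}: {alive}/{N_FILES} files intact")
```

With these assumed numbers, copies=1 loses roughly 1 file in 13 (per-file survival about 0.99^8), while copies=2 loses well under 1 in 1000 -- at double the space cost, which is why it makes sense per-dataset rather than pool-wide.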
Juergen Nickelsen
2008-Oct-11 18:06 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
"Timh Bergstr?m" <timh.bergstrom at diino.net> writes:> Unfortunely I can only agree to the doubts about running ZFS in > production environments, i''ve lost ditto-blocks, i''''ve gotten > corrupted pools and a bunch of other failures even in > mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6. > Plus the insecurity of a sudden crash/reboot will corrupt or even > destroy the pools with "restore from backup" as the only advice. I''ve > been lucky so far about getting my pools back thanks to people like > Victor.With which release was that? Solaris 10 or OpenSolaris? Regards, Juergen.
Keith Bierman
2008-Oct-12 02:36 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Oct 10, 2008, at 7:55 PM, David Magda wrote:

> If someone finds themselves in this position, what advice can be
> followed to minimize risks?

Can you ask for two LUNs on different physical SAN devices and have an expectation of getting it?

--
Keith H. Bierman   khbkhb at gmail.com  | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
Wade.Stuart at fallon.com
2008-Oct-13 15:46 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
zfs-discuss-bounces at opensolaris.org wrote on 10/11/2008 09:36:02 PM:

> On Oct 10, 2008, at 7:55 PM, David Magda wrote:
>
>> If someone finds themselves in this position, what advice can be
>> followed to minimize risks?
>
> Can you ask for two LUNs on different physical SAN devices and have
> an expectation of getting it?

Better yet, also ask for multiple paths over different SAN infrastructure to each. Then again, I would hope you don't need to ask your SAN folks for that?

-Wade
Mike Gerdts
2008-Oct-13 16:58 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts <mgerdts at gmail.com> wrote:

> On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail.com> wrote:
>> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg.Shaw at sun.com> wrote:
>>> Nevada isn't production code. For real ZFS testing, you must use a
>>> production release, currently Solaris 10 (update 5, soon to be update 6).
>>
>> I misstated before in my LDoms case.  The corrupted pool was on
>> Solaris 10, with LDoms 1.0.  The control domain was SX*E, but the
>> zpool there showed no problems.  I got into a panic loop with dangling
>> dbufs.  My understanding is that this was caused by a bug in the LDoms
>> manager 1.0 code that has been fixed in a later release.  It was a
>> supported configuration, I pushed for and got a fix.  However, that
>> pool was still lost.
>
> Or maybe it wasn't fixed yet.  I see that this was committed just today.
>
> 6684721 file backed virtual i/o should be synchronous
>
> http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

The related information from the LDoms Manager 1.1 Early Access release notes (820-4914-10):

Data Might Not Be Written Immediately to the Virtual Disk Backend If Virtual I/O Is Backed by a File or Volume

Bug ID 6684721: When a file or volume is exported as a virtual disk, then the service domain exporting that file or volume is acting as a storage cache for the virtual disk. In that case, data written to the virtual disk might get cached into the service domain memory instead of being immediately written to the virtual disk backend. Data are not cached if the virtual disk backend is a physical disk or slice, or if it is a volume device exported as a single-slice disk.

Workaround: If the virtual disk backend is a file or a volume device exported as a full disk, then you can prevent data from being cached into the service domain memory and have data written immediately to the virtual disk backend by adding the following line to the /etc/system file on the service domain:

    set vds:vd_file_write_flags = 0

Note - Setting this tunable flag does have an impact on performance when writing to a virtual disk, but it does ensure that data are written immediately to the virtual disk backend.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Miles Nordin
2008-Oct-13 17:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes: >>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes: >>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:dm> If you have no redundancy and an error occurs, the system will dm> panic by default (configurable in newer builds of OpenSolaris, dm> but not in Solaris ''proper'' yet). But if the system is dm> ignoring checksums, you''re no worse off than most other file dm> systems It''s not safe to assume the checksum errors are silent corruption. Most or all of the checksum errors I''ve seen on my system come from ZFS failing to fully resilver a temporarily-broken mirror. It''s not safe to assume failmode=<!panic> will stop your box from freezing. Problems with one zpool can cause problems with other unaffected pools. Problems at the storage driver level can cause one bad disk to freeze other good disks. Problems with the user interface generally make it impossible to offline a known-bad device because the user interface is frozen, or you get some catchall error like ``no valid replicas'''' because who-knows-what, or ``I/O error'''' because the user interface can''t mark the failed drive as offline in the copy of the label stored on the failed drive---if metastat behaved that way?! I''ve also had problems with iscsiadm and format pausing for minutes because a discovery-address is not responding, which could turn into hours if I had a hundred iSCSI targets---if I could just edit a damned text file like on a real Unix, I wouldn''t have to put up with these needlessly-complex state machines and multiplicative timeouts. NFS can freeze entirely if any exported filesystem has problems. Yes, some of the panics reported may come from failmode, but if you look through bugs.opensolaris.org and the list you''ll see many different kinds of assertion-failure panics that aren''t controlled by the failmode knob, usually panic-on-import or freeze-on-import, but sometimes other kinds. 
To my view, the good news for ZFS is that most other things suck almost as much, so there is only a little catching-up to do before it''s competitive. OTOH it looks like an unworkable disaster w.r.t. the promised future environment where pools have hundreds of disks, always some of them failing. The exception handling is a mess, the timers are attached to accidental hodge-podge ``layered'''' state machines for which no one will accept ultimate responsibility, and the locking of various user interfaces and subsystems is coarse because it''s built either for correctness/simplicity/deadlines, or for a mistaken, outdated goal: high-performance, assuming-a-fully-working-system, otherwise-fix-your-hardware. jb> ditto blocks mg> copies=n. neither of which applies to the situations Victor helped recover from. It''s possible ditto blocks are quietly helping people, but I''ve not read on the list of one scenario where something bad happened and the resolution was ``you should have used copies=n''''. The OP is asking about best practices that mitigate known problems, not a repeat of the standard list of bullet point features and their hypothetical virtues. mg> And for those places where losing a txg or two is a mortal mg> sin, don''t use flaky hardware and allow zfs to handle a layer mg> of redundancy. It is a mortal sin for a filesystem in all places. It''s just much less bad than losing the entire pool. To be a safe backing-store for databases or email, ZFS needs to have implementable best-practices that stop this from happening, not just recover from it. Whatever recovery there is, certainly should not be silent and maybe should not be automatic. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081013/c7d4f126/attachment.bin>
Gino
2008-Nov-29 11:49 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> About 99% of the problems reported as "I need ZFS > fsck" can be summed up > by two ZFS bugs: > > 1. If a toplevel vdev fails to open, we should be > able to pull > information from necessary ditto blocks to open > the pool and make > what progress we can. Right now, the root vdev > code assumes "can''t > open = faulted pool," which results in failure > scenarios that are > perfectly recoverable most of the time. This needs > to be fixed > so that pool failure is only determined by the > ability to read > critical metadata (such as the root of the DSL). > . If an uberblock ends up with an inconsistent view > of the world (due > to failure of DKIOCFLUSHWRITECACHE, for example), > we should be able > to go back to previous uberblocks to find a good > view of our pool. > This is the failure mode described by Jeff. > [b]These are both bugs in ZFS and will be fixed. [/b]I totally agree these covers most of the corruptions we had in past. Any news about that bugs in recent Nevada release? Anyone can provide us a detailed procedure to "go back to previous uberblocks to find a good view of our pool" as described by Jeff? Thanks gino -- This message posted from opensolaris.org
Ray Clark
2008-Nov-30 16:22 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
It would be extremely helpful to know what brands/models of disks lie and which don't. This information could be provided diplomatically simply as threads documenting problems you are working on, stating the facts. Use of a specific string of words would make searching for it easy. There should be no liability, since you are simply documenting compatibility with zfs. Or perhaps, if the lawyers let you, you could simply publish a compatibility/incompatibility list. These ARE facts.

If there is a way to make a detection tool, that would be very useful too, although after the purchase is made, it could be hard to send it back. However, that info could be fed into the database as that drive/model being incompatible with zfs. As Solaris / zfs gains ground, this could become a strong driver in the industry.

Re: "I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough."

So go back three - we are using zfs because we want absolute reliability (or at least as close as we can get).

--Ray
--
This message posted from opensolaris.org
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes: >>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:tt> I think we have to assume Anton was joking - otherwise his tt> measure is uselessly unscientific. I think it''s rude to talk about someone who''s present in the third person, especially when you''re trying to minimize his view. Were you joking, Anton? :) 0. The reports I read were not useless in the way some have stated, because for example Mike sampled his own observations: mg> In the past year I''ve lost more ZFS file systems than I have mg> any other type of file system in the past 5 years. With other mg> file systems I can almost always get some data back. With ZFS mg> I can''t get any back. It''s not just bloggers and pundits sampling mailing list traffic. I thought there was at least one other post like this but could not find it. 1. I don''t think your impressions nor Anton''s and mine are ``useless'''' 2. I don''t think your positive impression is any more scientific than his and my skeptical one. 3. I''m in general troubled by reports of corruption that aren''t well-investigated, because this will stop young, fragile filesystems from becoming old and robust. BUT.... 4. I''m less troubled by (3) because a few of the corruption reports were well-investigated by Victor, and he recovered them manually and posted a summary here: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html and how the exprience might inform ZFS improvements: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051667.html 5. I''m more troubled again because everyone seems to have forgotten (4). Mike, Victor, and others can''t necessarily repeat themselves every time this thread''s resurrected. If yapping mailing list monkeys like me don''t remember this experience, invested-wishing and marketing white papers will drown out the experience we''re getting. 
I''ve pointed straight at an unfixed corruption problem that''s biting ZFS users, and the discussion about where to place the blame and how to fix it. It is not fixed now, yet pundits on-list and all over the Interweb like here: http://www.kev009.com/wp/2008/11/on-file-systems/ talk about corruption bugs hazily and say ``most of all that''s been fixed'''' when it''s not so hazy and hasn''t been, then focus on theoretical unrealized capabilities of the on-disk format and mimimize this clear experience into ghostly distant-past rumor. I don''t see when the single-LUN SAN corruption problems were fixed. I think the supposed ``silent FC bit flipping'''' basis for the ``use multiple SAN LUN''s'''' best-practice is revoltingly dishonest, that we _know_ better. I''m not saying devices aren''t guilty---Sun''s sun4v IO virtualizer was documented as guilty of ignoring cache flushes to inflate performance just like the loomingly-unnamed models of lying SATA drives: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html Is a storage-stack-related version this problem the cause of lost single-LUN SAN pools? maybe, maybe not, but either way we need an end-to-end solution. I don''t currently see an end-to-end solution to this pervasive blame-the-device mantra every time a pool goes bad. I keep digging through the archives to post messages like this because I feel like everyone only wants to have happy memories, and that it''s going to bring about a sad end. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081212/f666444a/attachment.bin>
Johan Hartzenberg
2008-Dec-12 20:38 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, Dec 12, 2008 at 10:10 PM, Miles Nordin <carton at ivy.net> wrote:

> 0. The reports I read were not useless in the way some have stated,
> because for example Mike sampled his own observations:

[snip]

> I don't see when the single-LUN SAN corruption problems were fixed. I
> think the supposed ``silent FC bit flipping'' basis for the ``use
> multiple SAN LUN's'' best-practice is revoltingly dishonest, that we
> _know_ better. I'm not saying devices aren't guilty---Sun's sun4v IO
> virtualizer was documented as guilty of ignoring cache flushes to
> inflate performance just like the loomingly-unnamed models of lying
> SATA drives:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html
>
> Is a storage-stack-related version of this problem the cause of lost
> single-LUN SAN pools? Maybe, maybe not, but either way we need an
> end-to-end solution. I don't currently see an end-to-end solution to
> this pervasive blame-the-device mantra every time a pool goes bad.
>
> I keep digging through the archives to post messages like this because
> I feel like everyone only wants to have happy memories, and that it's
> going to bring about a sad end.

Thank you.

There are so many unsupported claims and so much noise on both sides that everybody is sounding like a bunch of fanboys.

The only bit that I understand about why HW raid "might" be bad is that if ZFS had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it will probably be able to re-construct that data. This is at the cost of doing the parity calculations on a general-purpose CPU, and then sending that parity data, as well as the data to write, across the wire. Some of that cost may be offset against RAID-Z's optimizations over RAID-5 in some situations, but all of this is pretty much if-then-maybe type situations.

I also understand that HW raid arrays have some vulnerabilities and weaknesses, but those seem to be offset against ZFS' notorious instability during error conditions. I say notorious because of all the open bug reports and reports on the list of I/O hanging and/or systems panicking while waiting for ZFS to realize that something has gone wrong.

I think if this last point can be addressed - make ZFS respond MUCH faster to failures - then it will go a long way to making ZFS be more readily adopted.

--
Any sufficiently advanced technology is indistinguishable from magic.
   Arthur C. Clarke

My blog: http://initialprogramload.blogspot.com
On 12-Dec-08, at 3:10 PM, Miles Nordin wrote:

>>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>>>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:
>
> tt> I think we have to assume Anton was joking - otherwise his
> tt> measure is uselessly unscientific.
>
> I think it's rude to talk about someone who's present in the third
> person, especially when you're trying to minimize his view. Were you
> joking, Anton? :)
> ....
>
> 1. I don't think your impressions nor Anton's and mine are ``useless''

Alright, I agree I should retract the 'useless' but I would keep the 'unscientific'.

--Toby
On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:

> ...
> The only bit that I understand about why HW raid "might" be bad is
> that if it had access to the disks behind a HW RAID LUN, then _IF_
> zfs were to encounter corrupted data in a read, it will probably be
> able to re-construct that data. This is at the cost of doing the
> parity calculations on a general purpose CPU,

Except that it's not just parity - ZFS checksums where RAID-N does not (although I've heard that some RAID systems checksum "somewhere" - not end-to-end, of course).

Call me a fanboy if you will, but ZFS is different from hw RAID. I am not an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more revolution than evolution. It's software. We only need be patient while it matures. :)

--Toby
Bob Friesenhahn
2008-Dec-12 21:11 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, 12 Dec 2008, Toby Thain wrote:

>> 1. I don't think your impressions nor Anton's and mine are ``useless''
>
> Alright, I agree I should retract the 'useless' but I would keep the
> 'unscientific'.

There is no need to retract the 'useless'. By the same useless measure, George Bush Jr has done a fantastic job at dealing with world terror since there has not been a serious attack on US soil by islamic terrorists since 2002. One might think that this impression is significant, yet it is not, since the previous attack on US soil was in 1993, which was a gap of about 9 years, and we have only gone about 6 thus far. By statistical measures, George Bush Jr could have done absolutely nothing and it is likely that nothing bad would have happened at all. There is insufficient evidence to suggest one conclusion vs another.

This example shows the dangers of using illogical thinking to presumably reach a logical conclusion. It is particularly dangerous to exhibit illogical thinking in public where everyone can see.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2008-Dec-12 21:16 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, 12 Dec 2008, Toby Thain wrote:

> Except that it's not just parity - ZFS checksums where RAID-N does not
> (although I've heard that some RAID systems checksum "somewhere" - not
> end-to-end, of course).

It will soon be quite easy to build a RAID system like this using OpenSolaris and a sub-project known as COMSTAR. The checksums will be done using a storage technology called ZFS.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Johan Hartzenberg wrote:> There is so much unsupported claims and noise on both sides that > everybody is sounding like a bunch of fanboys.I don''t think there are two sides. Anyone who has been around computing for any length of time has lost data due to various failures. The question isn''t about losing data, it is about how to proceed when your data is damaged.> > The only bit that I understand about why HW raid "might" be bad is > that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs > were to encounter corrupted data in a read, it will probably be able > to re-construct that data. This is at the cost of doing the parity > calculations on a general purpose CPU, and then sending that parity > data, as well as the data to write, across the wire. Some of that > cost may be offset against Raid-Z''s optimizations over raid-5 in some > situations, but all of this is pretty much if-then-maybe type situations.OK, repeat after me: there is no such thing as hardware RAID, there is no such thing as hardware RAID, there is no such thing as hardware RAID. There is only software RAID. If you believe any software is infallible, then you will be hurt. Even beyond RAID, there is quite sophisticated software on your disks, and anyone who has had to upgrade disk firmware will attest that disk firmware is not infallible.> I also understand that HW raid arrays have some vulnerabilities and > weaknesses, but those seem to be offset against ZFS'' notorious > instability during error conditions. I say notorious, because of all > the open bug reports and reports on the list of I/O hanging and/or > systems panicing while waiting for ZFS to realize that something has > gone wrong. > > I think if this last point can be addressed - make ZFS respond MUCH > faster to failures, then it will go a long way to make ZFS be more > readily adopted.However, you can''t respond too fast -- something which seems to get lost in these conversations. 
If you declare a disk dead too fast, then you get caught in a bind by things like Seagate disks which "freeze" for a few seconds. It may be much better to ride through such things than initiate a reconfiguration action (as described in the article below).
http://blogs.zdnet.com/storage/?p=369&tag=nl.e539

Note: as of b97, it is now possible to set per-device retries in the sd and ssd drivers. This is a good start towards satisfying those who are fed up with the default sd/ssd retry logic. See sd(7d)
http://opensolaris.org/os/community/arc/caselog/2007/505/
-- richard
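For readers who have not seen the b97 change Richard mentions, a per-device retry tunable is set through the sd driver's configuration file. The fragment below is a hedged sketch only: the vendor/product string is a made-up example, and the exact property name and accepted values should be verified against sd(7d) on your build before use.

```
# /etc/driver/drv/sd.conf -- illustrative sketch, not a tested config.
# Shortens the command-retry window for one hypothetical drive model;
# VID/PID string and "retries-timeout" value are assumptions to verify
# against sd(7d) for your OpenSolaris build.
sd-config-list = "SEAGATE ST31000340NS", "retries-timeout:3";
```

The point of the tunable is exactly the trade-off discussed above: a smaller retry budget fails a sick disk sooner, at the cost of declaring dead a disk that was merely "frozen" for a few seconds.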
On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics.com.au> wrote:

> On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:
>
> ...
> The only bit that I understand about why HW raid "might" be bad is that if
> it had access to the disks behind a HW RAID LUN, then _IF_ zfs were to
> encounter corrupted data in a read, it will probably be able to re-construct
> that data. This is at the cost of doing the parity calculations on a
> general purpose CPU,
>
> Except that it's *not just parity* - ZFS checksums where RAID-N does not
> (although I've heard that some RAID systems checksum "somewhere" - not
> end-to-end of course).
>
> Call me a fanboy if you will, but ZFS is different from hw RAID. I am not
> an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more
> *revolution* than evolution. It's software. We only need be patient while
> it matures. :)
>
> --Toby

I'm going to pitch in here as devil's advocate and say this is hardly revolution. 99% of what zfs is attempting to do is something NetApp and WAFL have been doing for 15+ years. Regardless of the merits of their patents and prior art, etc., this is not something revolutionarily new. It may be "revolution" in the sense that it's the first time it's come to open source software and been given away, but it's hardly "revolutionary" in file systems as a whole.

--Tim
Tim wrote:

> On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics.com.au
> <mailto:toby at telegraphics.com.au>> wrote:
>
>     On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:
>
>>     ...
>>     The only bit that I understand about why HW raid "might" be bad
>>     is that if it had access to the disks behind a HW RAID LUN, then
>>     _IF_ zfs were to encounter corrupted data in a read, it will
>>     probably be able to re-construct that data. This is at the cost
>>     of doing the parity calculations on a general purpose CPU,
>
>     Except that it's /not just parity/ - ZFS checksums where RAID-N
>     does not (although I've heard that some RAID systems checksum
>     "somewhere" - not end-to-end of course).
>
>     Call me a fanboy if you will, but ZFS is different from hw RAID. I
>     am not an "automatic denier" of ZFS bugs or flaws, but I do
>     acknowledge it's more /revolution/ than evolution. It's software.
>     We only need be patient while it matures. :)
>
>     --Toby
>
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp
> and WAFL have been doing for 15 years+.

The ideas aren't new, but the combination of the ideas is. NetApp is still a box at the end of a bit of wire that the OS has to blindly trust.

-- Ian.
On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome.com> wrote:

> The ideas aren't new, but the combination of the ideas is. NetApp is
> still a box at the end of a bit of wire that the OS has to blindly trust.
>
> --
> Ian.

I'm not aware of many, if any, large shops that are moving to a model of "all internal disk with applications running on them". The sun box will just be "a box at the end of the wire", a la storage 7000 when it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*.

--Tim
Tim wrote:

> On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome.com
> <mailto:ian at ianshome.com>> wrote:
>
>     The ideas aren't new, but the combination of the ideas is. NetApp is
>     still a box at the end of a bit of wire that the OS has to blindly
>     trust.
>
> I'm not aware of many, if any, large shops that are moving to a model
> of "all internal disk with applications running on them". The sun box
> will just be "a box at the end of the wire", a la storage 7000 when
> it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*.

Maybe, but I'm sure that will change as the performance of storage subsystems continues to exceed the performance of the bit of wire. That's where the revolution bit comes in; applications can now coexist with NetApp-quality storage management.

-- Ian.
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>>>>> "jh" == Johan Hartzenberg <jhartzen at gmail.com> writes:

    nw> If you can fully trust the SAN then there's no reason not to
    nw> run ZFS on top of it with no ZFS mirrors and no RAID-Z.

The best practice I understood is currently to use zpool-layer redundancy especially with SAN, even more so than with single-spindle local storage, because of (1) the new corruption problems people are having with ZFS on single-LUN SANs that they didn't have when using UFS and vxfs on the same SAN, and (2) the new severity of the problem: losing the whole pool instead of the few files you lose to UFS corruption or that you're supposed to lose to random bit flips on ZFS.

The problems do not sound like random bit-flips. They're corruption of every ueberblock. The best-guess explanation AIUI is not FC checksum gremlins---it's that write access to the SAN is lost and then comes back---ex. if the SAN target loses power or fabric access but the ZFS host doesn't reboot---and either the storage stack is misreporting the failure or ZFS isn't correctly responding to the errors. See the posts I referenced. Apparently the layering is not as simple in practice as one might imagine.

Even if you ignore the post-mortem analysis of the corrupt pools and look only at the symptom: if it were random corruption from DRAM and FC checksum gremlins, we should see mostly reports of a few files lost to checksum errors on single-LUN SANs and reported in 'zpool status', much more often than whole zpools lost, yet exactly the opposite is happening.
    jh> The only bit that I understand about why HW raid "might" be
    jh> bad is that if it had access to the disks behind a HW RAID
    jh> LUN, then _IF_ zfs were to encounter corrupted data in a read,

In at least one case it's certain there are no reported latent sector errors from the SAN on the corrupt LUN---'dd if=<..lun..> of=/dev/null' worked for at least one person who lost a single-LUN zpool. It doesn't sound to me like random bit-flips causing the problem, since all copies of the ueberblock are corrupt, and that's a bit far-fetched to happen randomly on a LUN that scrubs almost clean when mounted with the second-newest ueberblock.

    jh> ZFS' notorious instability during error conditions.

Right, availability is a reason to use RAID below the ZFS layer. It might or might not be related to the SAN problems. Maybe yes, if the corruption happens during a path failover or a temporary connectivity interruption. But the symptom is different from the timeout/availability thread: a corrupt, unmountable pool. The hang discussion was about frozen systems where the pool imports fine after reboot, which is a different symptom.
Nicolas Williams
2008-Dec-12 22:49 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, Dec 12, 2008 at 05:31:37PM -0500, Miles Nordin wrote:

>    nw> If you can fully trust the SAN then there's no reason not to
>    nw> run ZFS on top of it with no ZFS mirrors and no RAID-Z.
>
> The best practice I understood is currently to use zpool-layer
> redundancy especially with SAN even moreso than with single-spindle

Yes, but I believe this whole thread is about ZFS with no zpool-layer redundancy, with RAID done in the SAN.

> local storage, because of (1) the new corruption problems people are

Your thesis is that all corruption problems observed with ZFS on SANs are: a) phantom writes that never reached the rotating rust, b) not bit rot, corruption in the I/O paths, ... Correct?

> The problems do not sound like random bit-flips. They're corruption
> of every ueberblock. The best-guess explanation AIUI, is not FC

Some of the earlier problems of type (2) were triggered by checksum verification failures on pools with no redundancy, where ZFS would just panic (IIRC). These were due to bit-rot issues, not cache flush failures.

> checksum gremlins---it's that write access to the SAN is lost and then
> comes back---ex. if the SAN target loses power or fabric access but
> the ZFS host doesn't reboot---and either the storage stack is
> misreporting the failure or ZFS isn't correctly responding to the
> errors. see the posts I referenced.

It's possible that ZFS could, periodically (in the background) and/or at pool import time (synchronously), validate the consistency on disk of every transaction going backwards from the last until one is found that is consistent, or until it runs out of past überblocks, or it goes too far into the past. (Does ZFS have an option to do that? It might be a useful option to have for dealing with lying SANs.)

>    jh> ZFS' notorious instability during error conditions.
>
> right, availability is a reason to use RAID below ZFS layer.
> It might

ZFS handles device errors better when ZFS does redundancy at the zpool layer, as opposed to when redundancy is left to the SAN. That's well established, so why do you say the opposite?

Nico
--
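Nicolas Williams's suggestion above (walk backwards through past überblocks until a self-consistent transaction is found) can be sketched as a toy model. This is purely illustrative, assuming a simple list of (txg, valid) candidates; it does not model the real ZFS label layout, where each vdev label carries an array of uberblocks.

```python
# Toy model of "roll back to the newest consistent uberblock".
# Each candidate is (txg, is_consistent); is_consistent stands in for a
# full walk of the metadata tree that the uberblock points at. All names
# here are illustrative, not ZFS's on-disk structures.

def newest_consistent(uberblocks):
    """Return the highest-txg candidate whose tree validates, else None."""
    for txg, ok in sorted(uberblocks, key=lambda u: u[0], reverse=True):
        if ok:
            return txg
    return None

# Newest txg (207161) fails validation, so we fall back one transaction.
candidates = [(207159, True), (207160, True), (207161, False)]
print(newest_consistent(candidates))  # -> 207160
```

The design choice being debated is exactly this fallback: accepting the loss of the last few seconds of writes in exchange for an importable pool, rather than refusing to import at all.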
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp and
> WAFL have been doing for 15 years+. Regardless of the merits of their
> patents and prior art, etc., this is not something revolutionarily new. It
> may be "revolution" in the sense that it's the first time it's come to open
> source software and been given away, but it's hardly "revolutionary" in file
> systems as a whole.

"99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:

        end-to-end checksums
        unlimited snapshots and clones
        O(1) snapshot creation
        O(delta) snapshot deletion
        O(delta) incremental generation
        transactionally safe RAID without NVRAM
        variable blocksize
        block-level compression
        dynamic striping
        intelligent prefetch with automatic length and stride detection
        ditto blocks to increase metadata replication
        delegated administration
        scalability to many cores
        scalability to huge datasets
        hybrid storage pools (flash/disk mix) that optimize price/performance

How many of those does NetApp have? I believe the correct answer is 0%.

Jeff
I find this thread both interesting and disturbing. I'm fairly new to this list, so please excuse me if my comments/opinions are simplistic or just incorrect.

I think there's been too much FC SAN bashing, so let me change the example. What if you buy a 7000 Series server (complete with zfs) and set up an IP SAN. You create a LUN and share it out to a Solaris 10 host. On the Solaris host you create a ZFS pool with that iSCSI LUN. Now my understanding is that you will not be able to correct errors on the zpool of the Solaris 10 machine, because zfs on the Solaris 10 machine is not doing the raid.

Another example would be if you were sharing out a LUN to a vmware server, from your iSCSI SAN or FC SAN, and creating Solaris 10 virtual machines, with zfs booting. Another example would be Solaris 10 booting a zfs filesystem from a hardware mirrored pair of drives. Now these are examples of standard implementations of machines in a datacenter, specifically ones I have installed.

From following this thread I now feel that if I have uncorrectable "data errors" on the zfs pools there will be no way to easily repair the pool. I see no reason why, if I do detect errors as I scrub the zfs pool, I should not be able to run a simple utility to fix the pool as I would a ufs filesystem and then recover the corrupted files from tape.

I believe that for zfs to be used as a general purpose filesystem there has to be support built into zfs for these standard data center implementations; otherwise it will just become a specialized filesystem, like Netapp's WAFL, and there are a lot more servers than storage appliances in the datacenter.

I think this thread has put zfs in a negative light. I don't actually believe that I will experience many of these problems in an Enterprise class data center, but still I don't look forward to having to deal with the consequences of encountering these types of problems. Maybe zfs is not ready to be considered a general purpose filesystem.
-- Ed Spencer
[sigh, here we go again... isn't this in a FAQ somewhere, it certainly is in the archives...]

Ed Spencer wrote:

> I find this thread both interesting and disturbing. I'm fairly new to
> this list so please excuse me if my comments/opinions are simplistic or
> just incorrect.
>
> I think there's been to much FC SAN bashing so let me change the
> example.
>
> What if you buy a 7000 Series server (complete with zfs) and setup an IP
> SAN. You create a LUN and share it out to a Solaris 10 host.
> On the solaris host you create a ZFS pool with that iscsi LUN.

You are certainly able to implement ZFS redundancy on the Solaris 10 host.

> Now my undersatnding is that you will not be able to correct errors on
> the zpool of the Solaris10 machine because zfs on the solaris 10 machine
> is not doing the raid.

No, this is not a completely true statement (more below).

> Another example would be if you were sharing out a lun to a vmware
> server, from your iscsi san or fc san, and creating solaris 10 virtual
> machines, with zfs booting.

You are certainly able to implement ZFS redundancy on the Solaris 10 VM.

> Another example would be Solaris 10 booting a zfs filesystem from a
> hardware mirrored pair of drives.

You are certainly able to implement ZFS redundancy on the mirrored pair of drives.

> Now these are examples of standard implementations of machines in a
> datacenter, specifically ones I have installed.

I presume you are saying that you implemented only the default ZFS data protection for a single vdev. You have more options, including copies, mirroring, raidz, etc.

> From following this thread I now feel that if I have uncorrectable "data
> errors" on the zfs pools there will be no way to easily repair the pool.

Untrue. ZFS will attempt to repair what it can repair. More below.

> I see no reason that if I do detect errors as I scrub the zfs pool that
> I should be able to run a simple utility to fix the pools as I would a
> ufs filesystem and then recover the corrupted files from tape.

There is no utility for UFS which will repair corrupted data. UFS is blissfully unaware of data corruption. fsck will attempt to reconcile metadata problems, which were very common before logging was added, because UFS does not have an always-consistent on-disk format (ZFS does).

By default, ZFS uses copies=2 for metadata. Uberblocks are 4x redundant. If data corruption is detected in a file, zpool status -x will show exactly which files are corrupted, which will allow you to decide how you want to handle the broken file. IMHO, you are getting hung up on the fact that if data corruption is detected in a file and ZFS does not have a way to repair the file, then you will probably want to do something about it manually. With UFS, you'll never know, though you might see some symptoms like your apps crashing or your spreadsheet having the wrong numbers.

> I believe that for zfs to be used as a general purpose filesystem that
> there has to be support built into zfs to support these standard data
> center implementations, otherwise it will just become a specialized
> filesystem, like Netapp's WAFL, and there are alot more servers than
> storage appliances in the datacenter.

I disagree. ZFS will be the preferred boot file system for Solaris systems -- it already is the only boot file system available for OpenSolaris. Features like snapshots (that actually work, unlike UFS snapshots in many cases) and cloning are extremely useful for managing OSes, patches, and upgrades. ZFS is the future general purpose file system for Solaris; UFS is not (which will become readily apparent when you buy a 1.5 TByte disk).

> I think this thread has put zfs in a negative light. I don't actually
> believe that I will experience many of these problems in an Enterprise
> class data center, but still I don't look forward to having to deal with
> the consequences of encountering these types of problems.

One reason you may have never experienced data corruption with UFS (which I find hard to believe, having used UFS for 20+ years) is that UFS has no way to detect data corruption. Are you trying to kill the canary? :-)

> Maybe zfs is not ready to be considered a general purpose filesystem.

I'd say maybe UFS is not ready to be considered a general purpose file system, by today's standards :-)
-- richard
Richard,

I have been glancing through the posts and saw more hardware RAID vs ZFS discussion, some of it very useful. However, as you advised me the other day, we should think about the overall solution architecture, not just the feature itself.

I believe the spirit of ZFS snapshot is more significant than what has been discussed: the rapid (though I don't know if it is stateful today) application migration capabilities that enhance overall business continuity, hopefully fulfilling enterprise availability requirements. I really don't think any hardware RAID with embedded snapshot can do such, and I am never IMHO.

One example: ZFS is used to both capture the guest from a snapshot and move the compressed snapshot between servers, not limited to the Sun xVM hypervisor; the same approach could be used with respect to hosting Solaris Zones or Sun Logical Domains.

Best,
z
On Fri, Dec 12, 2008 at 8:16 PM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:

> > I'm going to pitch in here as devil's advocate and say this is hardly
> > revolution. 99% of what zfs is attempting to do is something NetApp and
> > WAFL have been doing for 15 years+. Regardless of the merits of their
> > patents and prior art, etc., this is not something revolutionarily new. It
> > may be "revolution" in the sense that it's the first time it's come to open
> > source software and been given away, but it's hardly "revolutionary" in file
> > systems as a whole.
>
> "99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:
>
>        end-to-end checksums
>        unlimited snapshots and clones
>        O(1) snapshot creation
>        O(delta) snapshot deletion
>        O(delta) incremental generation
>        transactionally safe RAID without NVRAM
>        variable blocksize
>        block-level compression
>        dynamic striping
>        intelligent prefetch with automatic length and stride detection
>        ditto blocks to increase metadata replication
>        delegated administration
>        scalability to many cores
>        scalability to huge datasets
>        hybrid storage pools (flash/disk mix) that optimize price/performance
>
> How many of those does NetApp have? I believe the correct answer is 0%.
>
> Jeff

Seriously? Do you know anything about the NetApp platform? I'm hoping this is a genuine question...

Off the top of my head, nearly all of them. Some of them have artificial limitations because they learned the hard way that if you give customers enough rope they'll hang themselves. For instance "unlimited snapshots". Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? "Why can't I get my space back?" Oh, just do a snapshot list and figure out which one is still holding the data. What? Your console locks up for 8 hours when you try to list out the snapshots? Huh... that's weird.

It's sort of like that whole "unlimited filesystems" thing. Just don't ever reboot your server, right? Or "you can have 40pb in one pool!!!". How do you back it up? Oh, just mirror it to another system? And when you hit a bug that toasts both of them you can just start restoring from tape for the next 8 years, right? Or if by some luck we get a zfsiron, you can walk the metadata for the next 5 years.

NVRAM has been replaced by flash drives in a ZFS world to get any kind of performance... so you're trading one high-priced storage for another. Your snapshot creation and deletion is identical. Your incremental generation is identical. End-to-end checksums? Yup.

Let's see... they don't have block-level compression; they chose dedup instead, which nets better results. "Hybrid storage pool" is achieved through PAM modules. Outside of that... I don't see ANYTHING in your list they didn't do first.

--Tim
> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".

Oh, that's precious! It's not an arbitrary limit, it's a safety feature!

> Outside of that... I don't see ANYTHING in your list they didn't do first.

Then you don't know ANYTHING about either platform. Constant-time snapshots, for example. ZFS has them; NetApp's are O(N), where N is the total number of blocks, because that's how big their bitmaps are. If you think O(1) is not a revolutionary improvement over O(N), then not only do you not know much about either snapshot algorithm, you don't know much about computing.

Sorry, everyone else, for feeding the troll. Chum the water all you like, I'm done with this thread.

Jeff
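The O(1)-vs-O(N) point in Jeff's message can be illustrated with a toy model. Neither class below models WAFL's or ZFS's actual on-disk structures (both names and mechanisms are illustrative assumptions); it only shows where the creation-time work lands: a bitmap-based snapshot must touch an entry per block, while a birth-time (copy-on-write) snapshot just records the current transaction number.

```python
# Toy contrast of snapshot-creation cost; illustrative only, not the
# real WAFL or ZFS data structures.

class BitmapSnapshots:
    """O(N) creation: a per-block reference bitmap must be copied."""
    def __init__(self, nblocks):
        self.active = [1] * nblocks       # one bit per block in the volume
    def snapshot(self):
        return list(self.active)          # touches every block entry

class BirthTimeSnapshots:
    """O(1) creation: remember the current transaction group number."""
    def __init__(self):
        self.txg = 0
    def snapshot(self):
        return self.txg                   # constant work, any pool size

bitmap_pool = BitmapSnapshots(10)
print(len(bitmap_pool.snapshot()))        # work grows with block count
cow_pool = BirthTimeSnapshots()
print(cow_pool.snapshot())                # work independent of pool size
```

Scale the bitmap model to billions of blocks and the difference stops being academic, which is the substance of the disagreement above.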
> Seriously? Do you know anything about the NetApp platform? I'm hoping this
> is a genuine question...
>
> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?" Oh, just do a snapshot list and figure out
> which one is still holding the data. What? Your console locks up for 8
> hours when you try to list out the snapshots? Huh... that's weird.
>
> It's sort of like that whole "unlimited filesystems" thing. Just don't ever
> reboot your server, right? Or "you can have 40pb in one pool!!!". How do
> you back it up? Oh, just mirror it to another system? And when you hit a
> bug that toasts both of them you can just start restoring from tape for the
> next 8 years, right? Or if by some luck we get a zfsiron, you can walk the
> metadata for the next 5 years.
>
> NVRAM has been replaced by flash drives in a ZFS world to get any kind of
> performance... so you're trading one high priced storage for another. Your
> snapshot creation and deletion is identical. Your incremental generations
> is identical. End-to-end checksums? Yup.
>
> Let's see... they don't have block-level compression, they chose dedup
> instead which nets better results. "Hybrid storage pool" is achieved
> through PAM modules. Outside of that... I don't see ANYTHING in your list
> they didn't do first.

Wow -- I've spoken to many NetApp partisans over the years, but you might just take the cake. Of course, most of the people I talk to are actually _using_ NetApp's technology, a practice that tends to leave even the most stalwart proponents realistic about the (many) limitations of NetApp's technology...

For example, take the PAM. Do you actually have one of these, or are you basing your thoughts on reading whitepapers? I ask because (1) they are horrifically expensive, (2) they don't perform that well (especially considering that they're DRAM!), (3) they're grossly undersized (a 6000 series can still only max out at a paltry 96G -- and that's with virtually no slots left for I/O), and (4) they're not selling well. So if you actually bought a PAM, that already puts you in a razor-thin minority of NetApp customers (most of whom see through the PAM and recognize it for the kludge that it is); if you bought a PAM and think that it's somehow a replacement for the ZFS hybrid storage pool (which has an order of magnitude more cache), then I'm sure NetApp loves you: you must be the dumbest, richest customer that ever fell in their lap!

        - Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Sun Microsystems Fishworks.       http://blogs.sun.com/bmc
Bob Friesenhahn
2008-Dec-13 16:03 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Sat, 13 Dec 2008, Tim wrote:

> Seriously? Do you know anything about the NetApp platform? I'm hoping this
> is a genuine question...

I believe that esteemed Sun engineers like Jeff are quite familiar with the NetApp platform. Besides NetApp being one of the primary storage competitors, it is a virtual minefield out there and one must take great care not to step on other companies' patents.

> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?" Oh, just do a snapshot list and figure out
> which one is still holding the data. What? Your console locks up for 8
> hours when you try to list out the snapshots? Huh... that's weird.

I suggest that you retire to the safety of the rubber room while the rest of us enjoy these zfs features. By the same measure, you would advocate that people should never be allowed to go outside due to the wide open spaces. Perhaps people will wander outside their homes and forget how to make it back. Or perhaps there will be gravity failure and some of the people outside will be lost in space.

There is some activity off the starboard bow, perhaps you should check it out ...

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Hi Bob, Tim, Jeff, you are all my friends, and you all know what you are talking about. As a friend, and trusting your personal integrity, I ask you: please don't get mad, enjoy the open discussion.

(ok, ok, O(N) is revolutionary in tech thinking, just not revolutionary in end customer value. And safety features are important in risk management for enterprises.)

I have friends at NetApp, and there are people there that I don't give a damn about. I am an enterprise architect; I don't care about the little environments that can be fulfilled most effectively by any one operating environment's applications. They are not enterprises, and that business model is risky in economic downturns.

In that spirit, and looking at the NetApp virtual server support architecture, I would say -- as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it would make more sense to utilize the file system capabilities with kernel integration to hypervisors, in virtual server deployments, instead of promoting a storage-device-based file system and data management solution (more proprietary at the solution level). So, in my position, NetApp PiT is not as good as ZFS PiT, because it is too far from the hypervisor.

You can support me or attack me with more technical details (if you know NetApp is developing an API for all server hypervisors, I don't). And don't worry, I have the biggest eagle, but so far, no one has been able to hurt that. ;-)

Best,
z

----- Original Message -----
From: "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us>
To: "Tim" <tim at tcsac.net>
Cc: <zfs-discuss at opensolaris.org>
Sent: Saturday, December 13, 2008 11:03 AM
Subject: Re: [zfs-discuss] Split responsibility for data with ZFS

> On Sat, 13 Dec 2008, Tim wrote:
>>
>> Seriously? Do you know anything about the NetApp platform? I'm hoping
>> this is a genuine question...
>
> I believe that esteemed Sun engineers like Jeff are quite familiar
> with the NetApp platform. Besides NetApp being one of the primary
> storage competitors, it is a virtual minefield out there and one must
> take great care not to step on other company's patents.
>
>> Off the top of my head nearly all of them. Some of them have artificial
>> limitations because they learned the hard way that if you give customers
>> enough rope they'll hang themselves. For instance "unlimited snapshots".
>> Do I even need to begin to tell you what a horrible, HORRIBLE idea that
>> is? "Why can't I get my space back?" Oh, just do a snapshot list and
>> figure out which one is still holding the data. What? Your console locks
>> up for 8 hours when you try to list out the snapshots? Huh... that's weird.
>
> I suggest that you retire to the safety of the rubber room while the
> rest of us enjoy these zfs features. By the same measures, you would
> advocate that people should never be allowed to go outside due to the
> wide open spaces. Perhaps people will wander outside their homes and
> forget how to make it back. Or perhaps there will be gravity failure
> and some of the people outside will be lost in space.
>
> There is some activity off the starboard bow, perhaps you should check
> it out ...
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Bob Friesenhahn
2008-Dec-14 05:45 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Sat, 13 Dec 2008, Joseph Zhou wrote:

> In that spirit, and looking at the NetApp virtual server support
> architecture, I would say --
> as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it
> would make more sense to utilize the file system capabilities with kernal
> integration to hypervisors, in virtual server deployments, instead of
> promoting a storage-device-based file system and data management solution
> (more proprietary at the solution level).

I am not an enterprise architect, but I do agree that when multiple client OSes are involved it is still useful if storage looks like a legacy disk drive. Luckily, Solaris already offers iSCSI in Solaris 10, and OpenSolaris is now able to offer high-performance fibre channel target and fibre channel over ethernet layers on top of reliable ZFS. The full benefit of ZFS is not provided, but the storage is successfully divorced from the client, with a higher degree of data reliability and performance than is available from current firmware-based RAID arrays.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I wasn't joking, though as is well known, the plural of anecdote is not data.

Both UFS and ZFS, in common with all file systems, have design flaws and bugs. To lose an entire UFS file system (barring the loss of the entire underlying storage) requires a great deal of corruption; there are multiple copies of the superblock, cylinder headers and their inodes are stored in a regular pattern and easily found by recovery tools, and the UFS file system check utility, while not perfect, can repair almost any corruption. There are third-party tools which can perform much more analysis and recovery in a worst-case scenario. A single bad block typically costs only a file or two.

To lose an entire ZFS pool requires that the most recent uberblock, or one of the top-level blocks to which it points, be damaged. There are currently no recovery tools (at least, none of which I am aware).

I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable. Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.

As usual, the disclaimer: I now work for another storage company, and while I've been on the teams developing and maintaining a number of commercial file systems (including two of Sun's), ZFS has not been one of them.
-- This message posted from opensolaris.org
Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise.

End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here.

ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.'
-- 
This message posted from opensolaris.org
Anton B. Rang wrote:

> I find it naïve to imagine that Sun customers "expect" their UFS (or
> other) file systems to be unrecoverable.

OK, I'll bite. If we believe the disk vendors, who rate their disks as having an unrecoverable error rate of 1 bit per 10^14 bits read, and knowing that UFS has absolutely no data protection of its data, why would you think it naive to expect that a disk system with UFS can lose data? Rather, I would say it has a distinctly calculable probability. Similarly, for ZFS, the checksum is not perfect, so there is a calculable probability that the ZFS checksum will not detect an unrecoverable (read) error. The difference is that the probability that ZFS will not detect an error is considerably smaller than that of UFS (or FAT, or HSFS, or ...)

> Any case where fsck failed quickly became an escalation to the
> sustaining engineering organization. Restoring from backup is almost
> never a satisfactory answer for a commercial enterprise.

I agree. However, I've personally experienced well over 100 fsck failures over the years, and while I was always unsatisfied, I didn't always lose data[1]. When I did lose data, perhaps it was data I could live without, but that was my call. Would you rather that ZFS simply say, "hey, you lost some data, but we won't tell you where..."?

[1] Once upon a time, I used a [vendor-name-elided] disk for a 2,300-user e-mail message store. I upgraded the OS, which implemented some new SCSI options. The disk's firmware didn't handle those options properly and would wait about 7 hours before corrupting the UFS file system containing the message store, requiring a full restore. So, how many shifts do you think it took to fail, recover, and ultimately resolve the disk firmware issue? Hint: the firmware rev arrived via UPS.

Personally, I'm very glad that a file system has come along that verifies data... and that feature seems to be catching on, as other file systems seem to be doing the same.
Hopefully, in a few years silent data corruption will be a footnote in the lore of computing.
 -- richard
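Richard's "distinctly calculable probability" can be made concrete with a back-of-the-envelope sketch. This assumes the vendor's 10^-14 unrecoverable-error-per-bit figure and treats bit errors as independent, both simplifications:

```python
import math

URE_RATE = 1e-14        # unrecoverable errors per bit read (vendor spec)
DISK_BYTES = 1e12       # reading a full 1 TB disk once
bits_read = DISK_BYTES * 8

# Probability of at least one unrecoverable read error over the whole pass:
# 1 - (1 - p)^n, computed via log1p/exp to avoid floating-point underflow.
p_at_least_one = 1 - math.exp(bits_read * math.log1p(-URE_RATE))

print(f"{p_at_least_one:.3f}")   # roughly 0.077, i.e. ~8% per full read
```

With no checksums, UFS has no way to even notice such an error, whereas ZFS detects it with probability limited only by checksum collision odds.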
Anton B. Rang wrote:

> Some RAID systems compare checksums on reads, though this is usually
> only for RAID-4 configurations (e.g. DataDirect) because of the
> performance hit otherwise.

For the record, Solaris had a (mirrored) RAID system which would compare data from both sides of the mirror upon read. It never achieved significant market penetration and was subsequently scrapped. Many of the reasons that the market did not accept it are solved by the method used by ZFS, which is far superior.

> End-to-end checksums are not yet common. The SCSI committee recently
> ratified T10 DIF, which allows either an operating system or
> application to supply checksums and have them stored and retrieved
> with data. So one could say that ZFS is ahead of the curve here.

Oracle also has data checksumming enabled by default for later releases. I look forward to any field data analysis they may publish :-)

> ZFS is not particularly revolutionary: software RAID has been around
> since the invention of the term; end-to-end checksums to disk have
> been used since the 1960s. It will be more useful when it has a more
> complete disaster recovery model than 'restore from backup.'

If you wish to implement a disaster recovery model, then you should look far beyond what ZFS (or any file system) can provide. Effective disaster recovery requires significant attention to process.
 -- richard
I think the problem for me is not that there's a risk of data loss if a pool becomes corrupt, but that there are no recovery tools available. With UFS, people expect that if the worst happens, fsck will be able to recover their data in most cases. With ZFS you have no such tools, yet Victor has on at least two occasions shown that it's quite possible to recover pools that were completely unusable (I believe by making use of old / backup copies of the uberblock).

My concern is that ZFS has all this information on disk, it has the ability to know exactly what is and isn't corrupted, and it should (at least for a system with snapshots) have many, many potential uberblocks to try. It should be far, far better than UFS at recovering from these things, but for a certain class of faults, when it hits a problem it just stops dead.

That's what frustrates me - knowing that there's potential to have all my data there, stored safely away, but having it completely inaccessible due to a lack of recovery tools.
-- 
This message posted from opensolaris.org
Casper.Dik at Sun.COM
2008-Dec-15 10:30 UTC
[zfs-discuss] Split responsibility for data with ZFS
>I think the problem for me is not that there's a risk of data loss if
>a pool becomes corrupt, but that there are no recovery tools
>available. With UFS, people expect that if the worst happens, fsck
>will be able to recover their data in most cases.

Except, of course, that fsck lies. It "fixes" the metadata, and the quality of the rest is unknown.

Anyone using UFS knows that UFS file corruptions are common; specifically, when using a "UFS root" and the system panics when trying to install a device driver, there's a good chance that some files in /etc are corrupt. Some were application problems (some code used fsync(fileno(fp)); fclose(fp); which doesn't guarantee anything).

>With ZFS you have no such tools, yet Victor has on at least two occasions
>shown that it's quite possible to recover pools that were completely unusable
>(I believe by making use of old / backup copies of the uberblock).

True; and certainly ZFS should be able to backtrack. But it's much more likely to happen "automatically" than by using a recovery tool.

See, fsck could only be written because specific corruptions, and the patterns they take, are known. With ZFS, you can only back up to a certain uberblock, and the pattern will be a surprise.

Casper
Forgive me for not understanding the details, but couldn't you also work backwards through the blocks with ZFS and attempt to recreate the uberblock? So if you lost the uberblock, could you (memory and time allowing) start scanning the disk, looking for orphan blocks that aren't referenced anywhere else, and piece together the top of the tree?

Or roll back to a previous uberblock (or a snapshot uberblock), and then look to see what blocks are on the disk but not referenced anywhere. Is there any way to intelligently work out where those blocks would be linked by looking at how they interact with the known data?

Of course, rolling back to a previous uberblock would still be a massive step forward, and something I think would do much to improve the perception of ZFS as a tool to reliably store data. You cannot overstate the difference to the end user between a file system that on boot says: "Sorry, can't read your data pool." and one that says: "Whoops, the uberblock and all the backups are borked. Would you like to roll back to a backup uberblock, or leave the filesystem offline to repair manually?"

As much as anything else, a simple statement explaining *why* a pool is inaccessible, and saying just how badly things have gone wrong, helps tons. Being able to recover anything after that is just the icing on the cake, especially if it can be done automatically.

Ross

PS. Sorry for the duplicate Casper, I forgot to cc the list.

On Mon, Dec 15, 2008 at 10:30 AM, <Casper.Dik at sun.com> wrote:
>
>>I think the problem for me is not that there's a risk of data loss if
>>a pool becomes corrupt, but that there are no recovery tools
>>available. With UFS, people expect that if the worst happens, fsck
>>will be able to recover their data in most cases.
>
> Except, of course, that fsck lies. It "fixes" the metadata and the
> quality of the rest is unknown.
>
> Anyone using UFS knows that UFS file corruptions are common; specifically,
> when using a "UFS root" and the system panics when trying to
> install a device driver, there's a good chance that some files in
> /etc are corrupt. Some were application problems (some code used
> fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything)
>
>>With ZFS you have no such tools, yet Victor has on at least two occasions
>>shown that it's quite possible to recover pools that were completely unusable
>>(I believe by making use of old / backup copies of the uberblock).
>
> True; and certainly ZFS should be able to backtrack. But it's
> much more likely to happen "automatically" than using a recovery
> tool.
>
> See, fsck could only be written because specific corruptions are known
> and the patterns they have. With ZFS, you can only back up to
> a certain uberblock and the pattern will be a surprise.
>
> Casper
>
Bob Friesenhahn
2008-Dec-15 18:34 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, 15 Dec 2008, Ross wrote:

> My concern is that ZFS has all this information on disk, it has the
> ability to know exactly what is and isn't corrupted, and it should
> (at least for a system with snapshots) have many, many potential
> uberblocks to try. It should be far, far better than UFS at
> recovering from these things, but for a certain class of faults,
> when it hits a problem it just stops dead.

While ZFS knows if a data block is retrieved correctly from disk, a correctly retrieved data block does not indicate that the pool isn't "corrupted". A block written in the wrong order is a form of corruption.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
I'm not sure I follow how that can happen, I thought ZFS writes were designed to be atomic? They either commit properly on disk or they don't?

On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 15 Dec 2008, Ross wrote:
>
>> My concern is that ZFS has all this information on disk, it has the
>> ability to know exactly what is and isn't corrupted, and it should (at least
>> for a system with snapshots) have many, many potential uberblocks to try.
>> It should be far, far better than UFS at recovering from these things, but
>> for a certain class of faults, when it hits a problem it just stops dead.
>
> While ZFS knows if a data block is retrieved correctly from disk, a
> correctly retrieved data block does not indicate that the pool isn't
> "corrupted". A block written in the wrong order is a form of corruption.
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
Bob Friesenhahn
2008-Dec-15 19:36 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, 15 Dec 2008, Ross Smith wrote:

> I'm not sure I follow how that can happen, I thought ZFS writes were
> designed to be atomic? They either commit properly on disk or they
> don't?

Yes, this is true. One reason why people complain about corrupted ZFS pools is because they have hardware which writes data in a different order than what was requested. Some hardware claims to have written the data but instead it has been secretly cached for later (or perhaps for never), and data blocks get written in some other order. It seems that ZFS is capable of working reliably with "cheap" hardware but not with wrongly designed hardware.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Nicolas Williams
2008-Dec-15 19:46 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, Dec 15, 2008 at 01:36:46PM -0600, Bob Friesenhahn wrote:
> On Mon, 15 Dec 2008, Ross Smith wrote:
>
> > I'm not sure I follow how that can happen, I thought ZFS writes were
> > designed to be atomic? They either commit properly on disk or they
> > don't?
>
> Yes, this is true. One reason why people complain about corrupted ZFS
> pools is because they have hardware which writes data in a different
> order than what was requested. Some hardware claims to have written
> the data but instead it has been secretly cached for later (or perhaps
> for never) and data blocks get written in some other order. It seems
> that ZFS is capable of working reliably with "cheap" hardware but not
> with wrongly designed hardware.

Order of writes matters between transactions, not inside transactions, and at the boundary is a cache flush. Thus what matters really isn't write order so much as whether the devices lie about cache flushes.
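The ordering Nicolas describes can be sketched as a toy model (this is illustrative Python, not ZFS source; all names here are invented). Within a transaction group the drive may reorder writes freely; correctness depends only on two flush barriers, one before and one after the uberblock write:

```python
# Toy model of a copy-on-write transaction-group commit. The only ordering
# that matters: all blocks of txg N must be durable (cache flushed) before
# the uberblock naming txg N is written, and that uberblock must itself be
# flushed before txg N counts as committed.

class Disk:
    def __init__(self):
        self.cache = {}   # writes the drive has acknowledged but not persisted
        self.media = {}   # what is actually on the platter

    def write(self, addr, data):
        self.cache[addr] = data   # drive ACKs immediately, order not guaranteed

    def flush(self):              # an honest SYNCHRONIZE CACHE
        self.media.update(self.cache)
        self.cache.clear()

def commit_txg(disk, txg, blocks, root):
    for addr, data in blocks.items():
        disk.write(addr, data)
    disk.flush()                          # barrier: tree of txg N is on media
    disk.write("uberblock", (txg, root))
    disk.flush()                          # barrier: txg N is now committed

disk = Disk()
commit_txg(disk, 207161, {10: b"data", 11: b"more"}, b"root")
assert disk.media["uberblock"] == (207161, b"root")
```

A drive that acknowledges `flush()` without persisting the cache breaks the barrier, which is exactly the "lying about cache flushes" failure mode discussed in this thread.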
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> Your thesis is that all corruption problems observed with ZFS
    nw> on SANs are: a) phantom writes that never reached the rotating
    nw> rust, b) not bit rot, corruption in the I/O paths, ...
    nw> Correct?

yeah. By ``all'' I mean the several single-LUN pools that were recovered by using an older set of ueberblocks. Of course I don't mean ``all'' as in all pools imaginable, including this one 10 years ago on an unnamed Major Vendor's RAID shelf that gave you a scar just above the ankle. But it is really sounding so far like just one major problem with single-LUN ZFSs on SANs? Or am I wrong - are there lots of pools which can't be recovered with old ueberblocks?

Remember, the problem is losing pools. It is not, ``for weeks I kept losing files. I would get errors reported in 'zpool status', and it would tell me the filename 'blah' has uncorrectable errors. This went on for a while, then one day we lost the whole pool.'' I've heard zero reports like that.

    nw> Some of the earlier problems of type (2) were triggered by
    nw> checksum verification failures on pools with no redundancy, but

checksum failures aren't caused just by bitrot in ZFS. I get hundreds of them after half of my iSCSI mirror bounces because of the incomplete-resilvering bug. I don't know the on-disk format well, but maybe the checksum was wrong because the label pointed to a block that wasn't an ueberblock. Maybe the checksum is functioning in lieu of a commit sector: maybe all four ueberblocks were written incompletely because there is some bug or missing workaround in the way ZFS flushes and schedules the ueberblock writes, so with some written sectors and some unwritten sectors the overall block checksum is wrong. Maybe this is a downside to the filesystem-level checksum.
For integrity it's an upside, but the NetApp block-level checksum, where you checksum just the data plus the block number at the RAID layer, should narrow down checksum failures to disk bit flips only, and thus be better for tracking down problems and building statistics comparable with other systems. We already know the 'zpool status' CKSUM column isn't so selective, and can catch out-of-date data too.

The overall point, what I'd rather have as my ``thesis,'' is that you can't allow ZFS to exonerate itself with an error message. Losing the whole pool in a situation where UFS would (or _might_ - it is not even proven beyond doubt that it _would_) have corrupted a bit of data isn't an advantage just because ZFS can printf a warning that says ``loss of entire pool detected. must be corruption outside ZFS!''
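The block-number-in-checksum idea Miles attributes to NetApp can be sketched in a few lines (a toy model with invented names; real implementations store a CRC in a per-block trailer rather than a SHA-256 in a dict). Binding the block number into the checksum means a block that lands at the wrong address fails verification even though its data is intact:

```python
import hashlib

def block_checksum(data: bytes, block_no: int) -> bytes:
    # Checksum covers the data *and* where it is supposed to live, so a
    # misdirected write is detectable, not just a bit flip.
    return hashlib.sha256(data + block_no.to_bytes(8, "big")).digest()

disk = {}  # block_no -> (data, stored checksum)

def write_block(block_no: int, data: bytes) -> None:
    disk[block_no] = (data, block_checksum(data, block_no))

def read_block(block_no: int) -> bytes:
    data, cksum = disk[block_no]
    if block_checksum(data, block_no) != cksum:
        raise IOError(f"checksum mismatch at block {block_no}")
    return data

write_block(7, b"payload")
assert read_block(7) == b"payload"

# Simulate a misdirected write: block 7's sector (data + checksum) lands at 8.
disk[8] = disk[7]
try:
    read_block(8)
except IOError:
    print("misdirected write detected")
```

A checksum over the data alone, by contrast, verifies fine at the wrong address, which is one reason the CKSUM counter can't distinguish bit rot from stale or misplaced data.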
>>>>> "bc" == Bryan Cantrill <bmc at eng.sun.com> writes:
>>>>> "jz" == Joseph Zhou <jz at excelsioritsolutions.com> writes:

    bc> most of the people I talk to are actually _using_ NetApp's
    bc> technology, a practice that tends to leave even the most
    bc> stalwart proponents realistic about the (many) limitations of
    bc> NetApp's

same applies to ZFS pundits! As Tim said, the one-filesystem-per-user thing is not working out. O(1) for number of filesystems would be great but isn't there. Maybe the format allows unlimited O(1) snapshots, but it's at best only O(1) to take them. All over the place it's probably O(n) or worse to _have_ them: to boot with them, to scrub with them. I think the winning snapshot architecture is more like source code revision control: take infinitely-granular snapshots, a continuous line, and run a cron service to trim the line into a series of points.

The management can be delegated, but inspection commands are not safe and can lock the whole filesystem, and 'zfs recv'ing certain streams panics the whole box, so backup cannot really be safely delegated either. The panic-on-import problems are bad for delegation because you can't safely let users mount things, which to my view is where delegated administration begins. It's too unstable to think of delegating anything - it's all just UI baloney until the panics are fixed and failures are contained within one pool.

The scalability-to-multiple-cores goals are admirable, but only certain things are parallelized. You can only replace one device at a time, which some day will not be enough to keep up with natural failure rates. I think 'zfs send' does not use multiple cores well, right? AIUI people are getting non-scaling performance in send/recv while the ordinary filesystem performance does scale, and thus getting painted into a corner. Yeah, there's compression, but as Tim said people are getting more savings from dedup, which goes naturally with writeable clones too.
Also, the NetApp dedup is a background thread, while the ZFS compression is synchronous with writing, as well as not scaling to multiple cores and seeming to have some bugs in the gzip version. Yeah, there is some hierarchical storage in it, but after half a year still a slog cannot be removed? In general I think ZFS pundits compliment the architecture and not the implementation. The big compliment I have for it is just that the ZFS piece is free software, even though large chunks of OpenSolaris aren't. That's a gigantic advantage, especially over NetApp, which probably has about as much long-term future as Lisp.

    jz> As a friend, and trusting your personal integrity, I ask you,
    jz> please, don't get mad, enjoy the open discussion.

Joseph, I don't see the problem and think it's fine to get excited so long as actual information comes out. There's nothing ad-hominem in the discussion yet, and being ordered not to get mad will make any normal person furious, especially if you base the order on ``trust'' and ``personal integrity'' - why bring up such things at all? I almost feel like you're baiting them! I know it's normal for sysadmins to be dry and menial, but it's still a technical discussion, so I hope it doesn't upset anyone because it's not boring.
Nicolas Williams
2008-Dec-15 22:12 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, Dec 15, 2008 at 05:04:03PM -0500, Miles Nordin wrote:
> As Tim said, the one-filesystem-per-user thing is not working out.

For NFSv3 clients that truncate MOUNT protocol answers (and v4 clients that still rely on the MOUNT protocol), yes, one-filesystem-per-user is a problem. For NFSv4 clients that support mirror mounts it's not a problem at all.

You're not required to go with one-filesystem-per-user though! That's only if you want to approximate quotas.

> O(1) for number of filesystems would be great but isn't there.

It is O(1) for filesystems (parts of the system could be parallelized more, but the on-disk data format is O(1) for filesystem creation and mounting, just like it is for snapshots and clones).

> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them: to boot with them, to scrub with them.

It's NOT O(N) to boot because of snapshots, nor to scrub. Scrub and resilver are O(N) where N is the amount of space used (as opposed to O(N) where N is the size of the volume, as for HW RAID and the like).

Nico
--
> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them: to boot with them, to scrub with them.

Why would a scrub be O(n snapshots)? The O(n filesystems) effects reported from time to time in OpenSolaris seem due to code that iterates over them. The new ability to create huge numbers of them puts stress on assumptions valid in more traditional UNIX configurations, right?

--Toby
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> For NFSv4 clients that support mirror mounts it's not a problem
    nw> at all.

no, 3000 - 10,000 users is common for a large campus, and according to posters here, sometimes that many users actually can fit into the bandwidth of a single pool. But ZFS is not usable with that many filesystems: booting, 'zfs create', 'zfs list' all take hours. See the list archives. If the on-disk format is theoretically capable of achieving O(1) for number of filesystems, that's nice! It's just not an advantage over NetApp when it's not working yet. And, with any project, sometimes the last 5% of the work never gets done. So I'm making a desperate call to start basing punditry on experience rather than white papers and optimistic architecture documents.

OpenSolaris could have an advantage here - it's much easier to get experience with Solaris than NetApp because it's not (a) expensive and (b) locked behind a bunch of licenses, agreements and contracts, unshareable documentation, private censored web forums (NOW site), usw., so OpenSolaris punditry could one day become a lot more trustworthy than NetApp punditry.

    nw> You're not required to go with one-filesystem-per-user though!

It was pitched as an architectural advantage, but never fully delivered, and worse, used to justify removing traditional Unix quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS rather than an evolution, because of over-focusing on the virtues of the architecture rather than the delivered implementation. I don't use quotas and don't care, but it's a good example of broken advocacy.

    nw> It's NOT O(N) to boot because of snapshots, nor to scrub.

I think it is. Try it and see. :/ That was Tim's point as I read it.
Jeff claimed ``unlimited snapshots and clones'' as a ZFS advantage over NetApp, and Tim said open bugs or subtle limitations make the supposed advantage a fantasy, even a liability:

    ``"unlimited snapshots". Do I even need to begin to tell you what a
    horrible, HORRIBLE idea that is? "Why can't I get my space back?"
    Oh, just do a snapshot list and figure out which one is still
    holding the data. What? Your console locks up for 8 hours when you
    try to list out the snapshots? Huh... that's weird.''

...and to add to that, the snapshot list in ZFS does a better job of showing which one's using the space if there are fewer snapshots. With hundreds of snapshots, 'zfs list' shows a USED column full of zeroes - correctly, because you won't save any space by deleting just one; you have to delete a range of snapshots to get some space back. Of course that's not the same thing as being O(N), that's just annoying. And I don't know that it's really O(N) - it could be better or worse than O(N). It's not O(1), though, to boot, list, or scrub snapshots.

And if it's not O(1) because of some unnecessary high-level ioctl accidentally called in some obscure, abstract library by the ``simple'' user interface, it's still not O(1)! For practical users, that library could remain suboptimal for the next two years, and I don't want to spend those two years enduring a bunch of blogging about nonexistent O(1) snapshots just because the on-disk format theoretically doesn't impede delivering them.
John Kaitschuck
2008-Dec-16 19:22 UTC
[zfs-discuss] Split responsibility for data with ZFS
Miles Nordin wrote:

>     nw> You're not required to go with one-filesystem-per-user though!
>
> It was pitched as an architectural advantage, but never fully
> delivered, and worse, used to justify removing traditional Unix
> quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS
> rather than an evolution, because of over-focusing on the virtues of
> the architecture rather than the delivered implementation.

Precisely. The issues with per-user quotas in ZFS were pointed out several years ago at FAST, when some of the Sun folks showed up to discuss ZFS in a late evening meeting. A file-system-per-user approach is not very viable when you have tens of thousands of users. It was my hope that Sun would have gotten that message by now, as I consider it one of the major problems with ZFS.
Gino
2009-Feb-07 13:54 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.
> Why do they do this? Because "it performs better". Well, duh --
> you can make stuff *really* fast if it doesn't have to be correct.
>
> The uberblock ring buffer in ZFS gives us a way to cope with this,
> as long as we don't reuse freed blocks for a few transaction groups.
> The basic idea: if we can't read the pool starting from the most
> recent uberblock, then we should be able to use the one before it,
> or the one before that, etc., as long as we haven't yet reused any
> blocks that were freed in those earlier txgs. This allows us to
> use the normal load on the pool, plus the passage of time, as a
> displacement flush for disk caches that ignore the sync command.
>
> If we go back far enough in (txg) time, we will eventually find an
> uberblock all of whose dependent data blocks have made it to disk.
> I'll run tests with known-broken disks to determine how far back we
> need to go in practice -- I'll bet one txg is almost always enough.
>
> Jeff

Hi Jeff,
we just lost 2 pools on snv91. Any news about your workaround to recover pools by discarding the last txg?

Thanks,
gino
-- 
This message posted from opensolaris.org
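Jeff's basic idea - walk back through the uberblock ring until you reach a txg whose entire dependent tree made it to disk - might be sketched like this (a toy model; the names and the `intact_txgs` stand-in are assumptions for illustration, not ZFS source):

```python
# Toy model: a label holds a ring of uberblocks. Recovery tries them
# newest-txg-first until the block tree rooted at one of them verifies.

def tree_verifies(uberblock, intact_txgs):
    # Stand-in for "every block reachable from this uberblock passes its
    # checksum"; here we just consult a set of txgs known to be intact.
    return uberblock["txg"] in intact_txgs

def find_importable_uberblock(ring, intact_txgs):
    for ub in sorted(ring, key=lambda u: u["txg"], reverse=True):
        if tree_verifies(ub, intact_txgs):
            return ub
    return None   # no importable state found in the ring

ring = [{"txg": t} for t in (207158, 207159, 207160, 207161)]
# The drive's cache lied: blocks for the last two txgs never hit the media.
ub = find_importable_uberblock(ring, intact_txgs={207158, 207159})
assert ub["txg"] == 207159   # roll back two txgs and import from there
```

The "don't reuse freed blocks for a few transaction groups" condition is what makes the older uberblocks safe to use: their trees are only valid while none of the blocks they reference have been overwritten.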