Vasile Dumitrescu
2008-Oct-01 09:20 UTC
[zfs-discuss] zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
Hi, I am running snv90. I have a pool that is 6x1TB, configured as raidz. After a computer crash (root is NOT on the pool - only data) the pool showed FAULTED status. I exported it and tried to reimport it, with the following result:

===============
# zpool import
  pool: ztank
    id: 12125153257763159358
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        ztank       FAULTED  corrupted data
          raidz1    ONLINE
            c1t6d0  ONLINE
            c1t5d0  ONLINE
            c1t4d0  ONLINE
            c1t3d0  ONLINE
            c1t2d0  ONLINE
            c1t1d0  ONLINE
===============

I searched Google and ran zdb -l on every pool device. Results follow below... To me it appears that all disks are OK and zdb can see the zpool structure on each of them (at least that is how I interpret the messages), yet zpool still reports corrupt pool metadata :-(

Any ideas as to what I might be able to do to salvage the data? Restoring from backup is not an option (yes, I know :() - as this is a personal project I hoped the raidz would be enough :-(

The output for each of the disks is more or less identical; all labels are accessible.
# zdb -l /dev/dsk/c1t6d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=10
    name='ztank'
    state=0
    txg=207161
    pool_guid=12125153257763159358
    hostid=628051022
    hostname='zfssrv'
    top_guid=763279656890868029
    guid=10947029755543026189
    vdev_tree
        type='raidz'
        id=0
        guid=763279656890868029
        nparity=1
        metaslab_array=14
        metaslab_shift=35
        ashift=9
        asize=6001149345792
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=10947029755543026189
                path='/dev/dsk/c1t1d0s0'
                devid='id1,sd@f0000000048455c81000880330000/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@1,0:a'
                whole_disk=1
                DTL=193
        children[1]
                type='disk'
                id=1
                guid=2640926618230776740
                path='/dev/dsk/c1t2d0s0'
                devid='id1,sd@f0000000048455c81000992690001/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@2,0:a'
                whole_disk=1
                DTL=192
        children[2]
                type='disk'
                id=2
                guid=8982722125061616789
                path='/dev/dsk/c1t3d0s0'
                devid='id1,sd@f0000000048455c81000ae8610002/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@3,0:a'
                whole_disk=1
                DTL=191
        children[3]
                type='disk'
                id=3
                guid=7263648809970512976
                path='/dev/dsk/c1t4d0s0'
                devid='id1,sd@f0000000048455c81000bb2cf0003/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@4,0:a'
                whole_disk=1
                DTL=190
        children[4]
                type='disk'
                id=4
                guid=5275414937202266822
                path='/dev/dsk/c1t5d0s0'
                devid='id1,sd@f0000000048455c81000ca3c40004/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@5,0:a'
                whole_disk=1
                DTL=189
        children[5]
                type='disk'
                id=5
                guid=8503895341004279533
                path='/dev/dsk/c1t6d0s0'
                devid='id1,sd@f0000000048455c81000d49220005/a'
                phys_path='/pci@0,0/pci1000,30@10/sd@6,0:a'
                whole_disk=1
                DTL=188
--------------------------------------------
LABEL 1
--------------------------------------------
    (identical to LABEL 0)
--------------------------------------------
LABEL 2
--------------------------------------------
    (identical to LABEL 0)
--------------------------------------------
LABEL 3
--------------------------------------------
    (identical to LABEL 0)
===============
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-01 10:42 UTC
[zfs-discuss] zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems
An update to the above: I tried to run zdb -e on the pool id, and here's the result:

# zdb -e 12125153257763159358
zdb: can't open 12125153257763159358: I/O error

NB: zdb seems to recognize the ID, because running it with an incorrect ID gives a different error:

# zdb -e 12125153257763159354
zdb: can't open 12125153257763159354: No such file or directory

Also, zdb -e with the ID of the syspool works:

# zdb -e 8843238790372298114
Uberblock

        magic = 0000000000bab10c
        version = 10
        txg = 317369
        guid_sum = 14131844542001965925
        timestamp = 1222857640 UTC = Wed Oct  1 12:40:40 2008

Dataset mos [META], ID 0, cr_txg 4, 2.76M, 244 objects
Dataset 8843238790372298114/export/home [ZPL], ID 60, cr_txg 721, 1.21G, 55 objects
Dataset 8843238790372298114/export [ZPL], ID 54, cr_txg 718, 19.0K, 5 objects
Dataset 8843238790372298114/swap [ZVOL], ID 28, cr_txg 15, 519M, 3 objects
Dataset 8843238790372298114/ROOT/snv_90 [ZPL], ID 48, cr_txg 710, 6.85G, 254748 objects
Dataset 8843238790372298114/ROOT [ZPL], ID 22, cr_txg 12, 18.0K, 4 objects
Dataset 8843238790372298114/dump [ZVOL], ID 34, cr_txg 18, 512M, 3 objects
Dataset 8843238790372298114 [ZPL], ID 5, cr_txg 4, 39.5K, 13 objects
etc. etc.
============

Any ideas? Could this be a hardware problem? I have no idea what to do next :-(

Thanks for your help!
Vasile
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-01 18:24 UTC
[zfs-discuss] one step forward - pinging Lukas Karwacki (kangurek) - pool: ztank
On the advice of Okana in the freenode.net #opensolaris channel I booted the latest OpenSolaris livecd and tried to import the pool there. No luck. However, I then tried the trick in Lukas's post that allowed him to import his pool, and had the beginning of some luck. After doing the mdb wizardry he indicated, I was able to run zpool import with the following result:

  pool: ztank
    id: whatever
 state: ONLINE
status: The pool was last accessed by another system.
   see: http://www.sun.com/msg/ZFS-8000-EY
config:

        ztank       ONLINE
          raidz1    ONLINE
            c4t0d0  ONLINE
            c4t1d0  ONLINE
            c4t2d0  ONLINE
            c4t3d0  ONLINE
            c4t4d0  ONLINE
            c4t5d0  ONLINE

HOWEVER: when I attempt again to examine the pool using zdb -e ztank, I still get

zdb: can't open ztank: I/O error

And zpool import -f, whilst it starts and seems to access the disks sequentially, stops at the 3rd one (not sure which one precisely - it spins it up and the process stops right there), and the system will not reboot when asked to (shutdown -g0 -y -i5).

So there's some slight progress here. I would really appreciate ideas from you guys!

Thanks
Vasile
-- 
This message posted from opensolaris.org
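For anyone finding this thread in the archives: the "mdb wizardry" is not spelled out above, and the commands below are my reconstruction of it, not a verified copy of Lukas's post. The usual variant sets two unsupported kernel recovery knobs before attempting the import - treat it as a sketch for data recovery only, never for normal operation:

```shell
# Run as root on the system that will attempt the import.
# aok=1 downgrades failed kernel assertions to warnings;
# zfs_recover=1 lets the pool open past certain metadata inconsistencies.
# Both are undocumented recovery tunables; reboot afterwards to clear them.
echo 'aok/W 1'         | mdb -kw
echo 'zfs_recover/W 1' | mdb -kw

zpool import -f ztank
```

If the import then succeeds, copy the data off and rebuild the pool rather than continuing to run with these knobs set.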
Martin Uhl
2008-Oct-02 14:37 UTC
[zfs-discuss] one step forward - pinging Lukas Karwacki (kangurek) - pool: ztank
> When I attempt again to examine the pool using zdb -e ztank
> I still get zdb: can't open ztank: I/O error
> and zpool import -f, whilst it starts and seems to
> access the disks sequentially, it stops at the 3rd
> one (not sure which precisely - it spins it up and the
> process stops right there, and the system will not
> reboot when asked to (shutdown -g0 -y -i5)
> so there's some slight progress here.

How about just removing that disk and trying the import?
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-02 20:32 UTC
[zfs-discuss] one step forward - pinging Lukas Karwacki (kangurek) - pool: ztank
Thanks Martin,

Yeah, tried it, but no luck :-( In fact I tried removing every disk, one by one, with no luck each time - this is why I think it is not in fact a hardware problem...

Kind regards
Vasile
-- 
This message posted from opensolaris.org
Vasile Dumitrescu
2008-Oct-03 14:42 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hi folks,

I just wanted to share the end of my "adventure" here, and especially to take the time to thank Victor for helping me out of this mess.

I will let him explain the technical details (I am out of my depth here), but bottom line: he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced ZFS to revert to an earlier state that was consistent.

The machine is now in the process of doing a full scrub, and the first order of business tomorrow will be to do a full backup :-)

According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and was not shut down properly when the Debian server did a controlled shutdown following a UPS event. The Solaris machine was abruptly shut down, but because it was not in control of the entire chain down to the bare hardware, it appears that some writes were in fact still held by Debian when Solaris thought they had been safely executed. This left the zpool in question in a state that even raidz1 did not help with.

Anyway, again, lots and lots of thanks to Victor!!!

kind regards
Vasile
-- 
This message posted from opensolaris.org
Darren J Moffat
2008-Oct-03 14:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Vasile Dumitrescu wrote:
> Hi folks,
>
> I just wanted to share the end of my "adventure" here and especially take the time to thank Victor for helping me out of this mess.
>
> I will let him explain the technical details (I am out of my depth here) but bottom line he spent a couple of hours with me on the machine and sorted me out. His explanation: he invalidated the incorrect uberblocks and forced zfs to revert to an earlier state that was consistent.
>
> The machine is now in the process of doing a full scrub and the first order of business tomorrow will be to do a full backup :-)
>
> According to his explanation, the reason for the troubles I had was that Solaris was running in a VM on my Debian server and it was not shut down properly when the Debian server did a controlled shutdown following a UPS event.

Which VM solution was this? VMware, VirtualBox, Xen, other? How were the "disks" presented to the guest? What are the "disks" in the host - real disks, files, something else?

-- 
Darren J Moffat
Vasile Dumitrescu
2008-Oct-03 15:37 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> Which VM solution was this? VMware, VirtualBox, Xen, other? How were
> the "disks" presented to the guest? What are the "disks" in the host,
> real disks, files, something else?
>
> -- 
> Darren J Moffat

VMware 6.0.4 running on Debian unstable,
Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux

Solaris is vanilla snv_90 installed with no GUI.

Here is the content of the .vmx file in question:

===============================================
#!/usr/bin/vmware
config.version = "8"
virtualHW.version = "6"
scsi0.present = "TRUE"
scsi0.virtualDev = "lsilogic"
memsize = "4096"
MemAllowAutoScaleDown = "FALSE"
MemTrimRate = "0"
sched.mem.pshare.enable = "FALSE"
sched.mem.minsize = "3062"
sched.mem.max = "7000"
sched.mem.maxmemctl = "0"
sched.mem.shares = "100000"
scsi0:0.present = "TRUE"
scsi0:0.fileName = "/home/vasile/vmware/solsrv/OpenSolaris64.vmdk"
ide1:0.present = "TRUE"
ide1:0.autodetect = "TRUE"
ide1:0.deviceType = "cdrom-image"
floppy0.startConnected = "FALSE"
floppy0.autodetect = "TRUE"
ethernet0.present = "TRUE"
ethernet0.virtualDev = "e1000"
ethernet0.wakeOnPcktRcv = "TRUE"
sound.present = "FALSE"
sound.fileName = "-1"
sound.autodetect = "TRUE"
svga.autodetect = "FALSE"
pciBridge0.present = "TRUE"
displayName = "zfssrv"
guestOS = "solaris10-64"
nvram = "Solaris 10 64-bit.nvram"
deploymentPlatform = "windows"
virtualHW.productCompatibility = "hosted"
RemoteDisplay.vnc.port = "0"
tools.upgrade.policy = "useGlobal"
floppy0.fileName = "/dev/fd0"
extendedConfigFile = "Solaris 10 64-bit.vmxf"
ide1:0.fileName = ""
floppy0.present = "FALSE"
gui.powerOnAtStartup = "TRUE"
ide1:0.startConnected = "TRUE"
ethernet0.addressType = "generated"
uuid.location = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
uuid.bios = "56 4d da 02 a4 a0 78 74-2e 09 90 62 45 bb c4 94"
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
scsi0.pciSlotNumber = "16"
ethernet0.pciSlotNumber = "32"
sound.pciSlotNumber = "-1"
ethernet0.generatedAddress = "00:0c:29:bb:c4:94"
ethernet0.generatedAddressOffset = "0"
tools.syncTime = "FALSE"
svga.maxWidth = "1024"
svga.maxHeight = "768"
svga.vramSize = "3145728"
scsi0:1.present = "TRUE"
scsi0:1.fileName = "ztank-sda.vmdk"
scsi0:1.mode = "independent-persistent"
scsi0:1.deviceType = "rawDisk"
scsi0:2.present = "TRUE"
scsi0:2.fileName = "ztank-sdb.vmdk"
scsi0:2.mode = "independent-persistent"
scsi0:2.deviceType = "rawDisk"
scsi0:3.present = "TRUE"
scsi0:3.fileName = "ztank-sdc.vmdk"
scsi0:3.mode = "independent-persistent"
scsi0:3.deviceType = "rawDisk"
scsi0:4.present = "TRUE"
scsi0:4.fileName = "ztank-sdd.vmdk"
scsi0:4.mode = "independent-persistent"
scsi0:4.deviceType = "rawDisk"
scsi0:5.present = "TRUE"
scsi0:5.fileName = "ztank-sde.vmdk"
scsi0:5.mode = "independent-persistent"
scsi0:5.deviceType = "rawDisk"
scsi0:6.present = "TRUE"
scsi0:6.fileName = "ztank-sdf.vmdk"
scsi0:6.mode = "independent-persistent"
scsi0:6.deviceType = "rawDisk"
scsi0:1.redo = ""
scsi0:2.redo = ""
scsi0:3.redo = ""
scsi0:4.redo = ""
scsi0:5.redo = ""
scsi0:6.redo = ""
isolation.tools.dnd.disable = "TRUE"
snapshot.disabled = "TRUE"
scsi0:0.mode = "independent-persistent"
isolation.tools.copy.disable = "FALSE"
isolation.tools.paste.disable = "FALSE"
tools.remindInstall = "TRUE"
===============================================

In summary: physical disks, assigned 100% to the VM.

HTH
kind regards
Vasile
-- 
This message posted from opensolaris.org
Fajar A. Nugraha
2008-Oct-04 07:19 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
<vasiledumitrescu@gmail.com> wrote:
> VMware 6.0.4 running on Debian unstable,
> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>
> Solaris is vanilla snv_90 installed with no GUI.
>
> in summary: physical disks, assigned 100% to the VM

That's weird. I thought one of the points of using physical disks
instead of files was to avoid problems caused by caching on the host/dom0?
Darren J Moffat
2008-Oct-06 09:39 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Fajar A. Nugraha wrote:
> On Fri, Oct 3, 2008 at 10:37 PM, Vasile Dumitrescu
> <vasiledumitrescu@gmail.com> wrote:
>
>> VMware 6.0.4 running on Debian unstable,
>> Linux bigsrv 2.6.26-1-amd64 #1 SMP Wed Sep 24 13:59:41 UTC 2008 x86_64 GNU/Linux
>>
>> Solaris is vanilla snv_90 installed with no GUI.
>>
>> in summary: physical disks, assigned 100% to the VM
>
> That's weird. I thought one of the points of using physical disks
> instead of files was to avoid problems caused by caching on host/dom0?

The data still flows through the host/dom0 device drivers and is thus at the mercy of the commands they issue to the physical devices.

-- 
Darren J Moffat
.
2008-Oct-09 09:53 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> His explanation: he invalidated the incorrect
> uberblocks and forced zfs to revert to an earlier
> state that was consistent.

Would someone be willing to document the steps required to do this, please? I have a disk in a similar state:

# zpool import
  pool: tank
    id: 13234439337856002730
 state: FAULTED
status: The pool metadata is corrupted.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-72
config:

        tank        FAULTED  corrupted data
          c7d0      ONLINE

This happened after I foolishly began trusting zfs-fuse with some large but relatively unimportant data on a big, empty, single-disk zpool in my home machine, and then suffered a power cut before I got around to backing it up. OpenSolaris can't import the pool either, so the drive is sat on a shelf waiting until a method for fixing it is published.

While it's clearly my own fault for taking the risks I did, it's still pretty frustrating knowing that all my data is likely still intact and nicely checksummed on the disk, but that none of it is accessible due to some tiny filesystem inconsistency. With pretty much any other FS I think I could get most of it back.

Clearly such a small number of occurrences in what were admittedly precarious configurations isn't going to be a particularly convincing motivator for a general solution, but I'd feel a whole lot better about using ZFS if I knew that there were some documented steps or a tool (zfsck? ;) that could help recover from this kind of metadata corruption in the unlikely event of it happening.

cheers,
Rob
-- 
This message posted from opensolaris.org
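While waiting for a proper write-up, a hedged, read-only starting point (flag spellings are from the snv-era zdb and may differ on other builds) is to look at what Victor would have been working with. A rollback of the kind he did amounts to making an older, still-consistent txg the active one, and the candidates are visible on disk:

```shell
# Dump the four vdev labels from the disk; each label carries the pool
# config and an array of recent uberblocks (one per committed txg).
zdb -l /dev/dsk/c7d0s0

# Show the active uberblock (txg, timestamp) of the exported pool.
# -e examines on-disk state rather than /etc/zfs/zpool.cache.
zdb -e -u tank
```

Anything beyond inspection (invalidating an uberblock so an older one wins) rewrites label areas and should only be attempted on dd copies of the disk.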
Mike Gerdts
2008-Oct-09 11:37 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 4:53 AM, . <osl@boymonkey.com> wrote:
> While it's clearly my own fault for taking the risks I did, it's
> still pretty frustrating knowing that all my data is likely still
> intact and nicely checksummed on the disk but that none of it is
> accessible due to some tiny filesystem inconsistency. With pretty
> much any other FS I think I could get most of it back.
>
> Clearly such a small number of occurrences in what were admittedly
> precarious configurations aren't going to be particularly convincing
> motivators to provide a general solution, but I'd feel a whole lot
> better about using ZFS if I knew that there were some documented
> steps or a tool (zfsck? ;) that could help to recover from this kind
> of metadata corruption in the unlikely event of it happening.

Well said. You have hit on my #1 concern with deploying ZFS.

FWIW, I believe that I have hit the same type of bug as the OP in the
following combinations:

- T2000, LDoms 1.0, various builds of Nevada in control and guest
  domains.
- Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
  build 97 guest

In the past year I've lost more ZFS file systems than I have any other
type of file system in the past 5 years. With other file systems I
can almost always get some data back. With ZFS I can't get any back.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Wilkinson, Alex
2008-Oct-09 11:46 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 09, 2008 at 06:37:23AM -0500, Mike Gerdts wrote:
> FWIW, I believe that I have hit the same type of bug as the OP in the
> following combinations:
>
> - T2000, LDoms 1.0, various builds of Nevada in control and guest
>   domains.
> - Laptop, VirtualBox 1.6.2, Windows XP SP2 host, OpenSolaris 2008.05 @
>   build 97 guest
>
> In the past year I've lost more ZFS file systems than I have any other
> type of file system in the past 5 years. With other file systems I
> can almost always get some data back. With ZFS I can't get any back.

That's scary to hear!

 -aW

IMPORTANT: This email remains the property of the Australian Defence Organisation and is subject to the jurisdiction of section 70 of the CRIMES ACT 1914. If you have received this email in error, you are requested to contact the sender and delete the email.
Ahmed Kamal
2008-Oct-09 12:44 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> > In the past year I've lost more ZFS file systems than I have any other
> > type of file system in the past 5 years. With other file systems I
> > can almost always get some data back. With ZFS I can't get any back.
>
> That's scary to hear!

I am really scared now! I was the one trying to quantify ZFS reliability, and that is surely bad to hear!
Mike Gerdts
2008-Oct-09 13:22 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal
<email.ahmedkamal@googlemail.com> wrote:
> > > In the past year I've lost more ZFS file systems than I have any other
> > > type of file system in the past 5 years. With other file systems I
> > > can almost always get some data back. With ZFS I can't get any back.
> >
> > That's scary to hear!
>
> I am really scared now! I was the one trying to quantify ZFS reliability,
> and that is surely bad to hear!

The circumstances where I have lost data have been when ZFS has not
handled a layer of redundancy. However, I am not terribly optimistic
about the prospects of ZFS on any device that hasn't committed writes
that ZFS thinks are committed. Mirrors and raidz would also be
vulnerable to such failures.

I have also run into other failures that have gone unanswered on the
lists. It makes me wary of using ZFS without a support contract
that allows me to escalate to engineering. Patching-only support
won't help.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html
  Hang only after I mirrored the zpool; no response on the list.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html
  I think this is fixed around snv_98, but the zfs-discuss list was
  surprisingly silent on acknowledging it as a problem - I had no
  idea it was being worked on until I saw the commit. The panic
  seemed to be caused by dtrace - core developers of dtrace
  were quite interested in the kernel crash dump.

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html
  Panic during ON build. Pool was lost; no response from list.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
Timh Bergström
2008-Oct-09 14:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Unfortunately, I can only agree with the doubts about running ZFS in production environments. I've lost ditto blocks, I've gotten corrupted pools and a bunch of other failures, even in mirror/raidz/raidz2 setups, with or without hardware mirrors/raid5/6. Plus there is the insecurity that a sudden crash/reboot may corrupt or even destroy the pools, with "restore from backup" as the only advice. I've been lucky so far about getting my pools back, thanks to people like Victor.

What would be needed is a proper fsck for ZFS which can resolve "minor" data corruption. Tools for rebuilding, resizing and moving the data about on pools are also needed, even recovery of data from faulted pools, like there is for ext2/3/ufs/ntfs.

All in all, a great FS, but not production ready until the tools are in place or it gets really, really resilient to minor failures and/or crashes in both software and hardware. For now I'll stick to XFS/UFS and sw/hw-raid and live with the restrictions of such filesystems.

//T

2008/10/9 Mike Gerdts <mgerdts@gmail.com>:
> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed. Mirrors and raidz would also be
> vulnerable to such failures.
> [...]
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
Greg Shaw
2008-Oct-09 15:10 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Perhaps I mis-understand, but the below issues are all based on Nevada, not Solaris 10. Nevada isn''t production code. For real ZFS testing, you must use a production release, currently Solaris 10 (update 5, soon to be update 6). In the last 2 years, I''ve stored everything in my environment (home directory, builds, etc.) on ZFS on multiple types of storage subsystems without issues. All of this has been on Solaris 10, however. Btw, I completely agree on the panic issue. If I have a large DB server with many pools, and one inconsequential pool fails, I lose the entire DB server. I''d really like to see an option at the zpool level directing what to do in a panic for a particular pool. Perhaps this is in the latest bits; if so, sorry, I''m running old stuff. :-) I also run ZFS on my mac. While not production quality, some of the panic errors dealing with external (firewire, usb, esata) are very irritating. A hiccup due to a jostled cable, and the entire box panics. That''s frustrating. Timh Bergstr?m wrote:> Unfortunely I can only agree to the doubts about running ZFS in > production environments, i''ve lost ditto-blocks, i''''ve gotten > corrupted pools and a bunch of other failures even in > mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6. > Plus the insecurity of a sudden crash/reboot will corrupt or even > destroy the pools with "restore from backup" as the only advice. I''ve > been lucky so far about getting my pools back thanks to people like > Victor. > > What would be needed is a proper fsck for ZFS which can resolv "minor" > data corruptions, tools for rebuilding, resizing and moving the data > about on pools is also needed, even recover of data from faulted > pools, like there is for ext2/3/ufs/ntfs. > > All in all, great FS but not production ready until the tools are in > place or it gets really really resillient to minor failures and/or > crashes in both software and hardware. 
For now i''ll stick to XFS/UFS > and sw/hw-raid and live with the restrictions of such fs. > > //T > > 2008/10/9 Mike Gerdts <mgerdts at gmail.com>: > >> On Thu, Oct 9, 2008 at 7:44 AM, Ahmed Kamal >> <email.ahmedkamal at googlemail.com> wrote: >> >>> > >>> >In the past year I''ve lost more ZFS file systems than I have any other >>> >type of file system in the past 5 years. With other file systems I >>> >can almost always get some data back. With ZFS I can''t get any back. >>> >>> >>>> Thats scary to hear! >>>> >>>> >>> I am really scared now! I was the one trying to quantify ZFS reliability, >>> and that is surely bad to hear! >>> >> The circumstances where I have lost data have been when ZFS has not >> handled a layer of redundancy. However, I am not terribly optimistic >> of the prospects of ZFS on any device that hasn''t committed writes >> that ZFS thinks are committed. Mirrors and raidz would also be >> vulnerable to such failures. >> >> I also have run into other failures that have gone unanswered on the >> lists. It makes me wary about using zfs without a support contract >> that allows me to escalate to engineering. Patching only support >> won''t help. >> >> http://mail.opensolaris.org/pipermail/zfs-discuss/2007-December/044984.html >> Hang only after I mirrored the zpool, no response on the list >> >> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-June/048255.html >> I think this is fixed around snv_98, but the zfs-discuss list was >> surprisingly silent on acknowledging it as a problem - I had no >> idea that it was being worked until I saw the commit. The panic >> seemed to be caused by dtrace - core developers of dtrace >> were quite interested in the kernel crash dump. >> >> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-September/051109.html >> Panic during ON build. Pool was lost, no response from list. 
>>
>> --
>> Mike Gerdts
>> http://mgerdts.blogspot.com/
>> _______________________________________________
>> zfs-discuss mailing list
>> zfs-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Mike Gerdts
2008-Oct-09 15:18 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg.Shaw at sun.com> wrote:
> Nevada isn't production code. For real ZFS testing, you must use a
> production release, currently Solaris 10 (update 5, soon to be update 6).

I misstated before in my LDoms case. The corrupted pool was on Solaris 10, with LDoms 1.0. The control domain was SX*E, but the zpool there showed no problems. I got into a panic loop with dangling dbufs. My understanding is that this was caused by a bug in the LDoms manager 1.0 code that has been fixed in a later release. It was a supported configuration; I pushed for and got a fix. However, that pool was still lost.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Miles Nordin
2008-Oct-09 18:38 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
>>>>> "gs" == Greg Shaw <Greg.Shaw at Sun.COM> writes:

    gs> Nevada isn't production code. For real ZFS testing, you must
    gs> use a production release, currently Solaris 10 (update 5, soon
    gs> to be update 6).

based on list feedback, my impression is that the results of a ``test'' confined to s10, particularly s10u4 (the latest available during most of Mike's experience), would be worse than the Nevada experience over the same period. but I doubt either matches UFS+SVM or ext3+LVM2. The on-disk format with ``ditto blocks'' and ``always consistent'' may be fantastic, but the code for reading it is not.

Maybe the code is stellar, and the problem really is underlying storage stacks that fail to respect write barriers. If so, ZFS needs to include a storage stack qualification tool. For me it doesn't strain credibility to believe these problems might be rampant in VM stacks and SANs, nor do I find it unacceptable if ZFS is vastly more sensitive to them than any other filesystem. If this speculation turns out to really be the case, I imagine the two going together: the problems are rampant because they don't bother other filesystems too catastrophically. If this is really the situation, then ZFS needs to give the sysadmin a way to isolate and fix the problems deterministically before filling the pool with data, not just blame the sysadmin based on nebulous speculatory hindsight gremlins. And if it's NOT the case, the ZFS problems need to be acknowledged and fixed.

To my view, the above is *IN ADDITION* to developing a recovery/forensic/``fsck'' tool, not either/or. The pools should not be getting corrupt in the first place, and pulling the cord should not mean you have to settle for best effort. None of the modern filesystems demand an fsck after unclean shutdown.

The current procedure for qualifying a platform seems to be: (1) subject it to heavy write activity, (2) pull the cord, (3) repeat.
Ahmed, maybe you should use that test to ``quantify'' filesystem reliability. You can try it with ZFS, then reinstall the machine with CentOS and try the same test with ext3+LVM2 or xfs+areca. The numbers you get are how many times you can pull the cord before you lose something, and how much you lose. Here's a really old test of that sort comparing Linux filesystems which is something like what I have in mind:

https://www.redhat.com/archives/fedora-list/2004-July/msg00418.html

so you see he got two sets of numbers---number of reboots and amount of corruption. For reiserfs and JFS he lost their equivalent of ``the whole pool'', and for ext3 and XFS he got corruption but never lost the pool. It's not clear to me the filesystems ever claimed to prevent corruption in his test scenario (was he calling fsync() after each log write? syslog does that sometimes, and if so, they do claim it, but if he's just writing with some silly script they don't), but definitely they do all claim you won't lose the whole pool in a power outage, and only two out of four delivered on that. I base my choice of Linux filesystem on this test, and wish I'd done such a test before converting things to ZFS.
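[Editor's note: the pull-the-cord measurement described above can be mocked up in a few lines. This is a toy model, not a real test harness -- `Disk`, `pull_the_cord_trial`, and the `honest_sync` flag are invented names, and a real qualification tool would drive physical hardware. The sketch only makes the two numbers concrete: records the application believes are durable versus records that actually survive the power cut.]

```python
class Disk:
    """Toy model of a drive with a volatile write cache. When
    honest_sync is False the drive acknowledges the synchronize-cache
    command but does nothing -- the lying behaviour discussed in this
    thread -- so a power cut discards whatever is still cached."""

    def __init__(self, honest_sync=True):
        self.media = []   # records on stable storage
        self.cache = []   # records in the volatile write cache
        self.honest_sync = honest_sync

    def write(self, rec):
        self.cache.append(rec)

    def sync(self):
        # The synchronize-cache command: an honest drive flushes,
        # a lying one just says "got it, boss".
        if self.honest_sync:
            self.media.extend(self.cache)
            self.cache.clear()

    def power_cut(self):
        self.cache.clear()  # the volatile cache is lost


def pull_the_cord_trial(disk, n=100):
    """Write n records, syncing after each, then pull the cord.
    Returns the number of acknowledged-but-lost records."""
    acked = 0
    for i in range(n):
        disk.write(i)
        disk.sync()   # fsync() returns; the app believes record i is safe
        acked += 1
    disk.power_cut()
    return acked - len(disk.media)


print(pull_the_cord_trial(Disk(honest_sync=True)))   # 0: nothing acked is lost
print(pull_the_cord_trial(Disk(honest_sync=False)))  # 100: the whole cache vanishes
```

The point of the model: with an honest drive the count of acknowledged-but-lost records is zero by construction; with a lying drive it is bounded only by the cache size, which is why the amount lost per cord-pull is the number worth measuring.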
Bob Friesenhahn
2008-Oct-09 19:06 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, 9 Oct 2008, Miles Nordin wrote:
> catastrophically. If this is really the situation, then ZFS needs to
> give the sysadmin a way to isolate and fix the problems
> deterministically before filling the pool with data, not just blame
> the sysadmin based on nebulous speculatory hindsight gremlins.
>
> And if it's NOT the case, the ZFS problems need to be acknowledged and
> fixed.

Can you provide any supportive evidence that ZFS is as fragile as you describe?

From recent opinions expressed here, properly-designed ZFS pools must be inexplicably permanently cratering each and every day.

Bob
======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Mike Gerdts
2008-Oct-10 03:33 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail.com> wrote:
> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg.Shaw at sun.com> wrote:
>> Nevada isn't production code. For real ZFS testing, you must use a
>> production release, currently Solaris 10 (update 5, soon to be update 6).
>
> I misstated before in my LDoms case. The corrupted pool was on
> Solaris 10, with LDoms 1.0. The control domain was SX*E, but the
> zpool there showed no problems. I got into a panic loop with dangling
> dbufs. My understanding is that this was caused by a bug in the LDoms
> manager 1.0 code that has been fixed in a later release. It was a
> supported configuration, I pushed for and got a fix. However, that
> pool was still lost.

Or maybe it wasn't fixed yet. I see that this was committed just today.

6684721 file backed virtual i/o should be synchronous
http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

--
Mike Gerdts
http://mgerdts.blogspot.com/
Timh Bergström
2008-Oct-10 07:38 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
2008/10/9 Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>
>> catastrophically. If this is really the situation, then ZFS needs to
>> give the sysadmin a way to isolate and fix the problems
>> deterministically before filling the pool with data, not just blame
>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>
>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>> fixed.
>
> Can you provide any supportive evidence that ZFS is as fragile as you
> describe?

The hundreds of sysadmins seeing their pools go bye-bye after normal operations in a production environment is evidence enough. And the number of times people like Victor have saved our asses.

> From recent opinions expressed here, properly-designed ZFS pools must
> be inexplicably permanently cratering each and every day.
>
> Bob

--
Timh Bergström
System Administrator
Diino AB - www.diino.com
:wq
Jeff Bonwick
2008-Oct-10 08:26 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> The circumstances where I have lost data have been when ZFS has not
> handled a layer of redundancy. However, I am not terribly optimistic
> of the prospects of ZFS on any device that hasn't committed writes
> that ZFS thinks are committed.

FYI, I'm working on a workaround for broken devices. As you note, some disks flat-out lie: you issue the synchronize-cache command, they say "got it, boss", yet the data is still not on stable storage. Why do they do this? Because "it performs better". Well, duh -- you can make stuff *really* fast if it doesn't have to be correct.

Before I explain how ZFS can fix this, I need to get something off my chest: people who knowingly make such disks should be in federal prison. It is *fraud* to win benchmarks this way. Doing so causes real harm to real people. Same goes for NFS implementations that ignore sync. We have specifications for a reason. People assume that you honor them, and build higher-level systems on top of them. Change the mass of the proton by a few percent, and the stars explode. It is impossible to build a functioning civil society in a culture that tolerates lies. We need a little more Code of Hammurabi in the storage industry.

Now: the uberblock ring buffer in ZFS gives us a way to cope with this, as long as we don't reuse freed blocks for a few transaction groups. The basic idea: if we can't read the pool starting from the most recent uberblock, then we should be able to use the one before it, or the one before that, etc., as long as we haven't yet reused any blocks that were freed in those earlier txgs. This allows us to use the normal load on the pool, plus the passage of time, as a displacement flush for disk caches that ignore the sync command. If we go back far enough in (txg) time, we will eventually find an uberblock all of whose dependent data blocks have made it to disk.
I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough.

Jeff
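[Editor's note: a rough sketch of the fallback idea, for illustration only. `Uberblock` and `pick_uberblock` are made-up names, not the actual ZFS data structures, and real recovery would verify checksums rather than set membership -- but the walk-backwards logic is the one described above.]

```python
class Uberblock:
    def __init__(self, txg, blocks):
        self.txg = txg          # transaction group number
        self.blocks = blocks    # block ids this txg's tree depends on


def pick_uberblock(ring, on_disk):
    """Walk the uberblock ring from the newest txg backwards and return
    the first uberblock whose entire dependent block set actually made
    it to stable storage. If a dropped cache flush tore the newest txg,
    fall back to an older, complete one -- valid only as long as blocks
    freed in those recent txgs have not been reused."""
    for ub in sorted(ring, key=lambda u: u.txg, reverse=True):
        if ub.blocks <= on_disk:    # all dependent blocks present?
            return ub
    return None


# txg 207161 references block "z", which was still in the disk's cache
# at the power cut; txg 207160's tree is intact on the media.
on_disk = {"a", "b", "c"}
ring = [Uberblock(207160, {"a", "b"}), Uberblock(207161, {"a", "b", "z"})]
print(pick_uberblock(ring, on_disk).txg)   # 207160
```

The "don't reuse freed blocks for a few txgs" condition is what makes this safe: if block "c" above had been freed in txg 207160 and already rewritten, the older uberblock's tree would no longer be trustworthy even though every pointer resolves.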
Ross
2008-Oct-10 09:29 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
That sounds like a great idea for a tool, Jeff. Would it be possible to build that in as a "zpool recover" command? Being able to run a tool like that and see just how bad the corruption is, but know it's possible to recover an older version, would be great. Is there any chance of outputting details so the sysadmin can know roughly how much was lost?

My thoughts are going to be very rough (I don't know much about ZFS internals), but I'm wondering if something like this would work, where all bad blocks are reported, along with the latest 3 good ones:

**************************************
# zpool recover <pool>
......... pool details ...........
Finding and testing uberblocks...
1. block a   date/time: xxxxx/xxxx   CORRUPTED
2. block b   date/time: yyyyy/yyyy   CORRUPTED
3. block c   date/time: zzzzz/zzzz   Appears OK
4. block d   date/time: zzzzz/zzzz   Appears OK
5. block e   date/time: zzzzz/zzzz   Appears OK
**************************************

Victor was talking in another thread about using zdb to check the pool before doing an import of a damaged pool. Might it be possible for the next stage of the recovery process to give the user an option of testing or importing the pool for any particular uberblock? It does sound like testing can take a long time, so this would need to be something that can be cancelled, and you would also need a way to mark uberblocks as bad should problems be found with either the test or import.

This would be a great addition to ZFS though, and would hopefully save Victor a bit of time ;-)

Ross
--
This message posted from opensolaris.org
Ricardo M. Correia
2008-Oct-10 09:48 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hi Jeff,

On Sex, 2008-10-10 at 01:26 -0700, Jeff Bonwick wrote:
>> The circumstances where I have lost data have been when ZFS has not
>> handled a layer of redundancy. However, I am not terribly optimistic
>> of the prospects of ZFS on any device that hasn't committed writes
>> that ZFS thinks are committed.
>
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.

It's not just about ignoring the synchronize-cache command; there's also another weak spot. ZFS is quite resilient against so-called phantom writes, provided that they occur sporadically - let's say, if the disk decides to _randomly_ ignore writes 10% of the time, ZFS could probably survive that pretty well even on single-vdev pools, due to ditto blocks.

However, it is not so resilient when the storage system suffers hiccups which cause phantom writes to occur continuously, even if for a small period of time (say less than 10 seconds), and then return to normal. This could happen for several reasons, including network problems, bugs in software or even firmware, etc.

I think in this case, going back to a previous uberblock could also be enough to recover from such a scenario most of the time, unless perhaps the error occurred too long ago and the unwritten metadata got flushed out of the ARC and didn't have a chance to get rewritten.

In any case, a more generic solution to repair all kinds of metadata corruption, such as (e.g.) space map corruption, would be very desirable, as I think everyone can agree.

Best regards,
Ricardo
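[Editor's note: the sporadic-versus-continuous distinction above can be illustrated with a toy simulation. This is hypothetical code, not ZFS internals: each metadata block gets two "ditto" copies issued back to back, and we compare dropping 10% of writes at random against dropping the same number of writes in one contiguous outage window.]

```python
import random

def surviving(blocks, lost_writes):
    """Count ditto-protected blocks for which at least one copy was
    actually written (i.e. was not a phantom write)."""
    return sum(1 for copies in blocks
               if any(c not in lost_writes for c in copies))

random.seed(1)

# 1000 metadata blocks, each with two ditto copies issued back to back
# as write ids 2i and 2i+1.
blocks = [(2 * i, 2 * i + 1) for i in range(1000)]
writes = range(2000)

# Sporadic failure: each write is independently dropped 10% of the time,
# so losing BOTH copies of one block happens only ~1% of the time.
sporadic = {w for w in writes if random.random() < 0.10}

# Continuous hiccup: the same 10% of writes, but as one contiguous
# outage window -- both copies of a block land inside it together.
start = random.randrange(1800)
burst = set(range(start, start + 200))

print(surviving(blocks, sporadic))  # nearly all survive via the second copy
print(surviving(blocks, burst))     # blocks fully inside the window lose both copies
```

Same number of dropped writes in both runs; only their clustering differs. That is why ditto blocks handle random phantom writes well but a few seconds of continuous failure can still take out metadata.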
Marcelo Leal
2008-Oct-10 13:15 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Hello all,

I think the problem here is ZFS's capacity for recovery from a failure. Forgive me, but in aiming to create code "without failures", maybe the hackers forgot that other people can make mistakes (even if they can't).

- "ZFS does not need fsck."
Ok, that's a great statement, but I think ZFS needs one. Really does. And in my opinion an enhanced zdb would be the solution. Flexibility. Options.

- "I have 90% of something I think is your filesystem, do you want it?"
I think software is as good as its ability to recover from failures. And I don't want to know who failed; I'm not going to send anyone to jail, I'm not a lawyer. I agree with Jeff, really do, but that is "another" problem... The solution Jeff is working on, I think, is really great, since it is NOT "all or nothing" again...

I don't know about you, but A LOT of times I was saved by the "Lost and Found" directory! All the beauty of a UNIX system is "rm /etc/passwd" after having edited it, and getting the whole file back by doing a "cat /dev/mem". ;-)

I think there are a lot of parts of the ZFS design that remind me of when you see something left on the floor at home, so you ask your son why he did not pick it up, and he says "it was not me".

peace.

Leal.
Miles Nordin
2008-Oct-10 17:58 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
>>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes:
>>>>> "rmc" == Ricardo M Correia <Ricardo.M.Correia at Sun.COM> writes:

    jb> We need a little more Code of Hammurabi in the storage
    jb> industry.

It seems like most of the work people have to do now is cleaning up after the sloppiness of others. At least it takes the longest. You could always mention which disks you found ignoring the command---wouldn't that help the overall problem? I understand there's a pervasive ``i don' wan' any trouble, mistah'' attitude, but I don't understand where it comes from.

http://www.ferris.edu/news/jimcrow/tom/

    jb> displacement flush for disk caches that ignore the sync
    jb> command.

Sounds like a good idea, but:

(1) won't this break the NFS guarantees you were just saying should never be broken? I get it, someone else is breaking a standard so how can ZFS be expected to yadda yadda yadda. But I fear it will just push ``blame the sysadmin'' one step further out. ex.,

 Q. ``with ZFS all my NFS clients become unstable after the server reboots,'' or ``I'm getting silent corruption with NFS''.

 A. ``your drives might have gremlins in them, no way to know,'' and ``well what do you expect without a single integrity domain and TCP's weak checksums. / no i'm using a crossover cable, and FCS is not weak. / ZFS managing a layer of redundancy it is probably your RAM or corruption on the uh, between the Ethernet MAC chip and the PCI slot''

(1a) I'm concerned about how it'll be reported when it happens.

 (a) if it's not reported at all, then ZFS is hiding the fact that fsync() is not working. Also, other journaling filesystems sometimes report when they find ``unexpected'' corruption, which is useful for finding both hardware and software problems.
I'm already concerned ZFS is not reporting enough, like when it says a vdev component is ONLINE, but 'zpool offline pool <component>' says 'no valid replicas'; then after a scrub there is no change to zpool status, but zpool offline works again. ZFS should not ``simplify'' the user interface to the point that it's hiding problems with itself and its environment for the sake of avoiding discussion.

 (b) if it is reported, then whenever the reporter-blob raises its hand it will have the effect of exonerating ZFS in most people's minds, like the stupid CKSUM column does right now. ``ZFS-FEED-B33F error? oh yeah that's the new ueberblock search code. that means your disks are ignoring the SYNCHRONIZE CACHE command. thank GOD you have ZFS with ANY OTHER FILESYSTEM all bets would be totally off. lucky you. / I have tried ten different models from all four brands. / yeah sucks don't it? flagrant violation of the standard, industry wide. / my linux testing tool says they're obeying the command fine / linux is crap / i added a patch to solaris to block the SYNC CACHE command and the disks got faster so I think it's not being ignored / well the stack is complicated and flushing happens at many levels, like think about controller performance, and that's completely unsupported you are doing something REALLY UNSAFE there you should NOT DO THAT it is STUPID'' and so on, stalling the actual fix literally for years.

The right way to exonerate ZFS is to make a diagnosis tool for the disks which proves they're broken, and then don't buy those disks; not to make a new class of ZFS fault report that could potentially capture all kinds of problems, then hazily assign blame to an untestable quantity.

(2) disks are probably not the only thing dropping the write barriers. So far, we're also suspecting (unproven!) iSCSI targets/initiators, particularly around a TCP reconnection event or target reboot, and VM stacks, both VirtualBox and the HVM in UltraSPARC T1.
probably other stuff, too. I'm concerned that assumptions you'll find safe to make about disks after you get started, like nothing is more than 1s stale, or send a CDB to size the on-disk cache and imagine it's a FIFO and it'll be no worse than that, or ``you can get an fsync by pausing reads for 500ms'' or whatever, will add robustness for current and future broken disks but won't apply to other types of broken storage layer.

    rmc> However, it is not so resilient when the storage system
    rmc> suffers hiccups which cause phantom writes to occur
    rmc> continuously, even if for a small period of time (say less
    rmc> than 10 seconds), and then return to normal.

ha! that is a great idea. temporal ditto blocks: important writes should be written, aged in RAM for 1 minute, then rewritten. :) This will help with latent sector errors caused by power sag/vibration, too.

but... even I will admit at some point you have to give up and let the filesystem get corrupted. Actually I'm more in the camp of making ZFS fragile to incorrect storage stacks, and offering an offline recovery tool that treats the corrupt pool as read-only and copies it into a new filesystem (so you need a second same-size empty pool to use the tool). I like this painful way better than fsck-like things, and much better than silent workarounds. but I'm probably in the wrong camp on this one.

My reasoning is, we will not be ultimately happy with a filesystem where fsync() is broken, and that's the best you can do. To compete with Netapp, we need to bang on this thing until it's actually working. So far I think sysadmins are receptive to the idea that they need to fix <...> about their setup, or make purchases with extreme care, or do testing before production. We are not lazy and do not expect an appliance-on-a-CD. It's just that pass-the-buck won't ever deliver something useful. When ext3 was corrupting filesystems on laptops, ext3 got blamed, and ext3 was not at the root of the problem.
But no one _accepted_ that ext3 was correctly coded until the overall problem was fixed. (IIRC it was: you need to send drives a stop-unit command before sending the ACPI powerdown, because even if they ignore synchronize-cache they do still flush when told to stop-unit.)

It's proper to have a strict separation between ``unclean shutdown'' and ``recovery from corruption''. UFS does have the separation between log-rolling and fsck-ing, but ZFS could detect the difference between unclean shutdown and corruption a lot better than UFS, and that's good. Currently ZFS seems to detect it by telling you ``pool's corrupt. <shrug>, destroy it.''---the fact that the recovery tool is entirely absent isn't good, but keeping recovery actions like this ueberblock-search strictly separate makes delivering something truly correct on the ``unclean shutdown'' front more likely.

I think, if iSCSI target/initiator combinations are silently discarding 10sec worth of writes (ex., when they drop and reconnect their TCP session), then this needs to be proven and their implementations can be and need to be corrected, not speculated on and then worked around. And I bet this same beefing-up of performance numbers by discarding cache flushes is as rampant in the virtualization game as in the hard disk game.
Eric Schrock
2008-Oct-10 18:23 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
> - "ZFS does not need fsck".
> Ok, that's a great statement, but i think ZFS needs one. Really does.
> And in my opinion a enhanced zdb would be the solution. Flexibility.
> Options.

About 99% of the problems reported as "I need ZFS fsck" can be summed up by two ZFS bugs:

1. If a toplevel vdev fails to open, we should be able to pull
   information from necessary ditto blocks to open the pool and make
   what progress we can. Right now, the root vdev code assumes "can't
   open = faulted pool," which results in failure scenarios that are
   perfectly recoverable most of the time. This needs to be fixed
   so that pool failure is only determined by the ability to read
   critical metadata (such as the root of the DSL).

2. If an uberblock ends up with an inconsistent view of the world (due
   to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
   to go back to previous uberblocks to find a good view of our pool.
   This is the failure mode described by Jeff.

These are both bugs in ZFS and will be fixed. The other 1% of the complaints are usually of the form "I created my pool on top of my old one" or "I imported a LUN on two different systems at the same time". It's unclear what a 'fsck' tool could do in this scenario, if anything. Due to a variety of reasons (hierarchical nature of ZFS, variable block sizes, RAID-Z, compression, etc.), it's difficult to even *identify* a ZFS block, let alone determine its validity and associate it with some larger construct.

There are some interesting possibilities for limited forensic tools - in particular, I like the idea of an mdb backend for reading and writing ZFS pools[1]. But I haven't actually heard a reasonable proposal for what a fsck-like tool (i.e. one that could "repair" things automatically) would actually *do*, let alone how it would work in the variety of situations it needs to (compressed RAID-Z?) where the standard ZFS infrastructure fails.
- Eric

[1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html

--
Eric Schrock, Fishworks            http://blogs.sun.com/eschrock
Victor Latushkin
2008-Oct-10 19:48 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Eric Schrock wrote:
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
>> - "ZFS does not need fsck".
>> Ok, that's a great statement, but i think ZFS needs one. Really does.
>> And in my opinion a enhanced zdb would be the solution. Flexibility.
>> Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.

I've mostly seen (2), because despite all the best practices out there, single-vdev pools are quite common. In all such cases that I had my hands on, it was possible to recover the pool by going back one or two txgs.

> These are both bugs in ZFS and will be fixed. The other 1% of the
> complaints are usually of the form "I created my pool on top of my old
> one" or "I imported a LUN on two different systems at the same time".

Of these two, the former is not easy because it requires searching through the entire disk space for root block candidates and trying each of them. The latter is not catastrophic in case there was little to no activity from one system. In this case one of the first things to suffer is the pool config object, and corruption of it prevents pool open.
Fortunately enough, after the putback of

6733970 assertion failure in dbuf_dirty() via spa_sync_nvlist()

in build 99, a corrupted pool config object is written during open in such a way that prevents reading in the old corrupted copy, and in most cases this allows one to import the pool and save most of the data. zdb is useful to understand how much is corrupted and how much is recovered. If nothing else is corrupted, then the pool may be available for further use without recreation. Again, in every case I had my hands on it was possible to either recover the pool completely or at least save most of the data.

> It's unclear what a 'fsck' tool could do in this scenario, if anything.
> Due to a variety of reasons (hierarchical nature of ZFS, variable block
> sizes, RAID-Z, compression, etc), it's difficult to even *identify* a
> ZFS block, let alone determine its validity and associate it in some
> larger construct.

Indeed. In the "more ZFS recovery" case involving a 42TB pool with about 8TB used, zdb -bv alone took several hours to walk the block tree and verify consistency of block pointers, and zdb -bcv took a couple of days to verify all user data blocks as well. And different checksums and gang blocks, in addition to all the other dynamic features mentioned, complicate the task of identifying ZFS blocks and linking those blocks into a tree, and make it really time- (and space-) consuming.

> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1]. But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

There are a number of bugs and RFEs to improve the usefulness of zdb for field use, e.g.
6720637 want zdb -l option to dump uberblock arrays as well
6709782 issues running zdb with -p and -e options
6736356 zdb -R needs to work with exported pools
6720907 zdb should handle errors while dumping datasets and objects
6746101 zdb command to search for ZFS labels in a device
6757444 want zdb -R to support decompression, checksumming and raid-z
6757430 want an option for zdb to disable space map loading and leak tracking

Hth,
Victor

> - Eric
>
> [1] http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html
Timh Bergström
2008-Oct-10 19:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
2008/10/10 Richard Elling <Richard.Elling at sun.com>:
> Timh Bergström wrote:
>> 2008/10/9 Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>>> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>>>
>>>> catastrophically. If this is really the situation, then ZFS needs to
>>>> give the sysadmin a way to isolate and fix the problems
>>>> deterministically before filling the pool with data, not just blame
>>>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>>>
>>>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>>>> fixed.
>>>
>>> Can you provide any supportive evidence that ZFS is as fragile as you
>>> describe?
>>
>> The hundreds of sysadmins seeing their pools go bye-bye after normal
>> operations in a production environment is evidence enough. And the
>> number of times people like Victor have saved our asses.
>
> Hundreds? Do you have evidence of this?

One is one too many; I don't need evidence of hundreds - that is hopefully an exaggeration.

//T

> -- richard
Marcelo Leal
2008-Oct-10 20:29 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> On Fri, Oct 10, 2008 at 06:15:16AM -0700, Marcelo Leal wrote:
>> - "ZFS does not need fsck".
>> Ok, that's a great statement, but i think ZFS needs one. Really does.
>> And in my opinion a enhanced zdb would be the solution. Flexibility.
>> Options.
>
> About 99% of the problems reported as "I need ZFS fsck" can be summed up
> by two ZFS bugs:
>
> 1. If a toplevel vdev fails to open, we should be able to pull
>    information from necessary ditto blocks to open the pool and make
>    what progress we can. Right now, the root vdev code assumes "can't
>    open = faulted pool," which results in failure scenarios that are
>    perfectly recoverable most of the time. This needs to be fixed
>    so that pool failure is only determined by the ability to read
>    critical metadata (such as the root of the DSL).
>
> 2. If an uberblock ends up with an inconsistent view of the world (due
>    to failure of DKIOCFLUSHWRITECACHE, for example), we should be able
>    to go back to previous uberblocks to find a good view of our pool.
>    This is the failure mode described by Jeff.
>
> These are both bugs in ZFS and will be fixed.

That's it! It's 100% for me! ;-) One is the "all-or-nothing" problem, and the other is about guilt... ;-))

> There are some interesting possibilities for limited forensic tools - in
> particular, I like the idea of a mdb backend for reading and writing ZFS
> pools[1].

In my opinion it would be great to have the whole functionality in zdb. It's simple, and the concepts are clear in the tool. mdb is a debugger, and needs concepts that I think are different from a tool for reading/fixing filesystems. Just an opinion... which does not mean we cannot have both. Like I said: flexibility, options... ;-)

> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool

I think we must NOT get stuck on the word "fsck"; I have used it just as an example (Lost and Found). And I think other users used it just as an example too.
The important is the two points you have described very *well*. (i.e. one that could "repair" things> automatically) would > actually *do*, let alone how it would work in the > variety of situations > it needs to (compressed RAID-Z?) where the standard > ZFS infrastructure > fails. > > - Eric > > [1] > http://mbruning.blogspot.com/2008/08/recovering-remove > d-file-on-zfs-disk.html > > -- > Eric Schrock, Fishworks > http://blogs.sun.com/eschrock > ________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discu > ssMany thanks for your answer! Leal. -- This message posted from opensolaris.org
Ricardo M. Correia
2008-Oct-10 20:42 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, 2008-10-10 at 11:23 -0700, Eric Schrock wrote:

> But I haven't actually heard a reasonable proposal for what a
> fsck-like tool (i.e. one that could "repair" things automatically) would
> actually *do*, let alone how it would work in the variety of situations
> it needs to (compressed RAID-Z?) where the standard ZFS infrastructure
> fails.

I'd say an fsck-like tool for ZFS should not worry much about compression, checksums, RAID-Z and whatnot. In essence, it would try to do what an fsck tool does for a typical filesystem, and so would be mostly oblivious to the layout or encoding of the blocks, perhaps treating blocks with failed checksums as blocks full of zeros.

Here's how it could work (of course, this is all easier said than done):

1) Open all the devices specified by the user. Optionally, take just a pool name/guid and scan for the right devices in /dev/[r]dsk.

2) Verify whether the pool configuration read from the devices is sane -- if not, try to generate a consistent configuration. Some elements of the pool configuration, such as the correct pool version, could be checked in later steps, depending on features that were found.

3) Starting from the last uberblock, fully traverse a few levels down the tree. If less than 100% of the blocks could be read without errors, do the same for previous uberblocks and offer the user the choice of which uberblock to use, or if running non-interactively, choose the one with the best success rate.

4) Traverse the list/tree of filesystems, snapshots and clones. Make sure that they are well-connected. For each filesystem, try to replay the ZILs, then clean them out.

5) Now fully traverse the pool. Compute the space maps and FS space usage on the go, as blocks are read.

6) For each metadata block read, check whether the fields are sane; fix them/zero them out if they're not. Basically we're assuming here that we may have corrupted metadata with correct checksums. If some metadata block can not be read due to a failed checksum, assume the block is full of zeros, and fix it. By the way, this includes every field of every kind of metadata block, including ZAPs, ACLs, FID maps, znode fields, everything. For fields that reference other objects, make sure that the object they reference is of the correct type and that the object itself is correct. For objects that are missing, create empty ones if necessary.

7) Check that every object is referenced somewhere and link unreferenced objects to /lost+found/object-type/, or similar.

8) Probably do other things that I'm forgetting.

9) In the end, check whether the space maps are consistent with the ones computed, and write correct ones if not. Check that space usage/reservations/quotas are correct.

Essentially, the goal is that at the end of this process the pool should contain consistent information, should have as much data as could be recovered, and should never cause any further errors in ZFS due to invalid metadata/fields -- either when importing it, reading from it, or writing/modifying it (except that it would still return EIO errors when trying to read corrupted file data blocks, of course).

Now, a problem with fsck-like tools, and perhaps especially with ZFS, is that some of these steps may either require lots of memory or multiple filesystem/pool traversals. I'd say having such a tool, even if it required additional temporary storage for operation (hopefully not a very large fraction of the pool size), would be *very* useful and would clear up any worries that people currently have.

Kind regards,
Ricardo
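[Editorial aside] Step 3 of the proposal above -- fall back through older uberblocks and score each candidate root by how much of the tree it can read -- can be sketched roughly as follows. Everything here is invented for illustration: a "block tree" is just a nested dict, an unreadable block is `None`, and real uberblocks, block pointers, and checksums look nothing like this.

```python
# Illustrative sketch of step 3: try uberblocks newest-first and pick the
# one whose block tree yields the highest read success rate.  The "on-disk
# format" is fake: each uberblock is the root of a nested dict tree, and a
# None child stands for a block that fails its checksum.

def traverse(block):
    """Return (readable, total) block counts for a tree."""
    if block is None:              # failed checksum / unreadable block
        return (0, 1)
    readable, total = 1, 1
    for child in block.get("children", []):
        r, t = traverse(child)
        readable += r
        total += t
    return (readable, total)

def pick_uberblock(uberblocks):
    """uberblocks: newest-first list of root blocks.
    Return (index, success_rate) of the best root, preferring newer
    ones and stopping early at the first fully readable tree."""
    best = (-1, -1.0)
    for i, ub in enumerate(uberblocks):
        readable, total = traverse(ub)
        rate = readable / total
        if rate == 1.0:
            return (i, rate)       # newest fully consistent view wins
        if rate > best[1]:
            best = (i, rate)
    return best

# The newest uberblock references a damaged tree; the previous txg is intact.
newest = {"children": [None, {"children": []}]}
previous = {"children": [{"children": []}, {"children": []}]}
print(pick_uberblock([newest, previous]))  # -> (1, 1.0)
```

A real tool would of course bound how many levels it traverses per candidate (a full traversal per uberblock would be prohibitively slow on a large pool), which is why the proposal says "a few levels down the tree".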
Richard Elling
2008-Oct-10 22:38 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
Timh Bergström wrote:
> 2008/10/10 Richard Elling <Richard.Elling at sun.com>:
>> Timh Bergström wrote:
>>> 2008/10/9 Bob Friesenhahn <bfriesen at simple.dallas.tx.us>:
>>>> On Thu, 9 Oct 2008, Miles Nordin wrote:
>>>>> catastrophically.  If this is really the situation, then ZFS needs to
>>>>> give the sysadmin a way to isolate and fix the problems
>>>>> deterministically before filling the pool with data, not just blame
>>>>> the sysadmin based on nebulous speculatory hindsight gremlins.
>>>>>
>>>>> And if it's NOT the case, the ZFS problems need to be acknowledged and
>>>>> fixed.
>>>>
>>>> Can you provide any supportive evidence that ZFS is as fragile as you
>>>> describe?
>>>
>>> The hundreds of sysadmins seeing their pools go byebye after normal
>>> operations in a production environment is evidence enough. And the
>>> number of times people like Victor have saved our asses.
>>
>> Hundreds?  Do you have evidence of this?
>
> One is one too many, I don't need evidence of hundreds - that is
> hopefully an exaggeration.

Don't show up to a data fight without data :-/

Yes, we do track this information and guys like me analyze it. The ratio of installed base to problem reports for ZFS is quite high. When we see a trend, we adjust priorities to address it. This is just part of our overall quality program.

Which brings me to the required mantra: if you don't file a bug or make a service call, the problem doesn't get tracked. Please make the effort so that we can prioritize the use of our limited resources. Posting a fine whine on this (or any) forum is not guaranteed to result in an entry in our problem tracking system -- someone has to put in the extra effort, or it will fall into the silent complainant category. Please help us to improve the quality of our systems, thanks.
 -- richard
David Magda
2008-Oct-11 01:55 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Oct 10, 2008, at 15:48, Victor Latushkin wrote:

> I've mostly seen (2), because despite all the best practices out there,
> single vdev pools are quite common. In all such cases that I had my
> hands on it was possible to recover the pool by going back by one or two
> txgs.

For better or worse this is the case where I work. Most of our storage is on SANs (EMC and NetApp), and so if we need more space we ask for it and we get a giant LUN given to us (usually multi-pathed). We also have a lot of Veritas VxVM and VxFS for Oracle, and so even if we're running Solaris 10, we're not using ZFS in that case.

SAN space is also allocated to Windows and VMware ESX machines as well, so it's not like we can ask for the disks in the SAN to be exported raw, as that would mess up managing of things with the other OSes. (We have a very small global storage / back up team, and I really don't want to add more to their workload.)

If someone finds themselves in this position, what advice can be followed to minimize risks? For example, is having checksums enabled a good idea? If you have no redundancy and an error occurs, the system will panic by default (configurable in newer builds of OpenSolaris, but not in Solaris 'proper' yet). But if the system is ignoring checksums, you're no worse off than most other file systems (but still get all the other features of ZFS).

Or is there a way to mitigate a checksum error on a non-redundant zpool?
Jeff Bonwick
2008-Oct-11 02:14 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> Or is there a way to mitigate a checksum error on a non-redundant zpool?

It's just like the difference between non-parity, parity, and ECC memory. Most filesystems don't have checksums (non-parity), so they don't even know when they're returning corrupt data. ZFS without any replication can detect errors, but can't fix them (like parity memory). ZFS with mirroring or RAID-Z can both detect and correct (like ECC memory).

Note: even in a single-device pool, ZFS metadata is replicated via ditto blocks at two or three different places on the device, so that a localized media failure can be both detected and corrected. If you have two or more devices, even without any mirroring or RAID-Z, ZFS metadata is mirrored (again via ditto blocks) across those devices.

Jeff
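[Editorial aside] The parity/ECC analogy can be made concrete with a toy model. This is purely illustrative -- real ZFS stores 256-bit checksums in parent block pointers, not a CRC32 next to the data -- but it shows the mechanics: one checksummed copy can detect corruption yet not repair it, while a second ditto copy lets the reader fall back to the surviving replica.

```python
import zlib

def store(data: bytes, copies: int):
    """Store `copies` replicas of data, each with its own checksum."""
    return [{"data": data, "cksum": zlib.crc32(data)} for _ in range(copies)]

def read(blocks):
    """Return (data, status).  Detection needs one checksummed copy;
    correction needs a second (ditto) copy to fall back on."""
    for blk in blocks:
        if zlib.crc32(blk["data"]) == blk["cksum"]:
            return blk["data"], "ok"
    return None, "unrecoverable (detected, but no good copy left)"

payload = b"important metadata"

# Single copy (parity-memory analogy): corruption is detected, not fixed.
single = store(payload, copies=1)
single[0]["data"] = b"important metadat4"   # silent corruption on disk
print(read(single))

# Two ditto copies (ECC analogy): the intact replica satisfies the read.
ditto = store(payload, copies=2)
ditto[0]["data"] = b"important metadat4"
print(read(ditto))  # -> (b'important metadata', 'ok')
```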
Mike Gerdts
2008-Oct-11 03:59 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Fri, Oct 10, 2008 at 9:14 PM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:

> Note: even in a single-device pool, ZFS metadata is replicated via
> ditto blocks at two or three different places on the device, so that
> a localized media failure can be both detected and corrected.
> If you have two or more devices, even without any mirroring
> or RAID-Z, ZFS metadata is mirrored (again via ditto blocks)
> across those devices.

And in the event that you have a pool that is mostly not very important but some of it is important, you can have data mirrored on a per-dataset level via copies=n.

If we can avoid losing an entire pool by rolling back a txg or two, the biggest source of data loss and frustration is taken care of. Ditto blocks for metadata should take care of most other cases that would result in widespread loss. Normal bit rot that causes you to lose blocks here and there is somewhat likely to take out a small minority of files and spit warnings along the way. If there are some files that are more important to you than others (e.g. losing files in rpool/home may have more impact than rpool/ROOT), copies=2 can help there. And for those places where losing a txg or two is a mortal sin, don't use flaky hardware, and allow zfs to handle a layer of redundancy.

This gets me thinking that it may be worthwhile to have a small (<100 MB x 2) rescue boot environment with copies=2 (as well as rpool/boot/) so that "pkg repair" could be used to deal with cases that prevent your normal (>4 GB) boot environment from booting.

--
Mike Gerdts
http://mgerdts.blogspot.com/
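[Editorial aside] A back-of-the-envelope simulation shows why copies=2 on a small important dataset changes the odds against scattered bit rot. The numbers here (8 blocks per file, 1% chance a given replica rots, replicas failing independently) are made up for illustration and are not measured ZFS behavior; the point is only the shape of the result: doubling copies squares the per-block failure probability.

```python
import random

def survives(blocks_per_file, copies, p_bad, rng):
    """A file survives if, for every logical block, at least one of its
    `copies` replicas avoids bit rot (each replica fails independently
    with probability p_bad)."""
    return all(
        any(rng.random() > p_bad for _ in range(copies))
        for _ in range(blocks_per_file)
    )

rng = random.Random(42)           # fixed seed for a repeatable run
N_FILES, BLOCKS, P_BAD = 10_000, 8, 0.01

results = {}
for copies in (1, 2):
    alive = sum(survives(BLOCKS, copies, P_BAD, rng) for _ in range(N_FILES))
    results[copies] = alive
    print(f"copies={copies}: {alive}/{N_FILES} files intact")
```

With these assumed numbers, copies=1 loses roughly 1 file in 13 (per-file survival about 0.99^8), while copies=2 loses well under 1 in 1000 -- at double the space cost, which is why it makes sense per-dataset rather than pool-wide.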
Juergen Nickelsen
2008-Oct-11 18:06 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
"Timh Bergstr?m" <timh.bergstrom at diino.net> writes:> Unfortunely I can only agree to the doubts about running ZFS in > production environments, i''ve lost ditto-blocks, i''''ve gotten > corrupted pools and a bunch of other failures even in > mirror/raidz/raidz2 setups with or without hardware mirrors/raid5/6. > Plus the insecurity of a sudden crash/reboot will corrupt or even > destroy the pools with "restore from backup" as the only advice. I''ve > been lucky so far about getting my pools back thanks to people like > Victor.With which release was that? Solaris 10 or OpenSolaris? Regards, Juergen.
Keith Bierman
2008-Oct-12 02:36 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Oct 10, 2008, at 7:55 PM, David Magda wrote:

> If someone finds themselves in this position, what advice can be
> followed to minimize risks?

Can you ask for two LUNs on different physical SAN devices and have an expectation of getting it?

--
Keith H. Bierman   khbkhb at gmail.com  | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
Wade.Stuart at fallon.com
2008-Oct-13 15:46 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
zfs-discuss-bounces at opensolaris.org wrote on 10/11/2008 09:36:02 PM:

> On Oct 10, 2008, at 7:55 PM, David Magda wrote:
>
>> If someone finds themselves in this position, what advice can be
>> followed to minimize risks?
>
> Can you ask for two LUNs on different physical SAN devices and have
> an expectation of getting it?

Better yet, also ask for multiple paths over different SAN infrastructure to each. Then again, I would hope you don't need to ask your SAN folks for that?

-Wade
Mike Gerdts
2008-Oct-13 16:58 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
On Thu, Oct 9, 2008 at 10:33 PM, Mike Gerdts <mgerdts at gmail.com> wrote:

> On Thu, Oct 9, 2008 at 10:18 AM, Mike Gerdts <mgerdts at gmail.com> wrote:
>> On Thu, Oct 9, 2008 at 10:10 AM, Greg Shaw <Greg.Shaw at sun.com> wrote:
>>> Nevada isn't production code. For real ZFS testing, you must use a
>>> production release, currently Solaris 10 (update 5, soon to be update 6).
>>
>> I misstated before in my LDoms case.  The corrupted pool was on
>> Solaris 10, with LDoms 1.0.  The control domain was SX*E, but the
>> zpool there showed no problems.  I got into a panic loop with dangling
>> dbufs.  My understanding is that this was caused by a bug in the LDoms
>> manager 1.0 code that has been fixed in a later release.  It was a
>> supported configuration, I pushed for and got a fix.  However, that
>> pool was still lost.
>
> Or maybe it wasn't fixed yet.  I see that this was committed just today.
>
> 6684721 file backed virtual i/o should be synchronous
>
> http://hg.genunix.org/onnv-gate.hg/rev/eb40ff0c92ec

The related information from the LDoms Manager 1.1 Early Access release notes (820-4914-10):

Data Might Not Be Written Immediately to the Virtual Disk Backend If Virtual I/O Is Backed by a File or Volume

Bug ID 6684721: When a file or volume is exported as a virtual disk, then the service domain exporting that file or volume is acting as a storage cache for the virtual disk. In that case, data written to the virtual disk might get cached into the service domain memory instead of being immediately written to the virtual disk backend. Data are not cached if the virtual disk backend is a physical disk or slice, or if it is a volume device exported as a single-slice disk.

Workaround: If the virtual disk backend is a file or a volume device exported as a full disk, then you can prevent data from being cached into the service domain memory and have data written immediately to the virtual disk backend by adding the following line to the /etc/system file on the service domain:

    set vds:vd_file_write_flags = 0

Note - Setting this tunable flag does have an impact on performance when writing to a virtual disk, but it does ensure that data are written immediately to the virtual disk backend.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Miles Nordin
2008-Oct-13 17:50 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
>>>>> "dm" == David Magda <dmagda at ee.ryerson.ca> writes: >>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes: >>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:dm> If you have no redundancy and an error occurs, the system will dm> panic by default (configurable in newer builds of OpenSolaris, dm> but not in Solaris ''proper'' yet). But if the system is dm> ignoring checksums, you''re no worse off than most other file dm> systems It''s not safe to assume the checksum errors are silent corruption. Most or all of the checksum errors I''ve seen on my system come from ZFS failing to fully resilver a temporarily-broken mirror. It''s not safe to assume failmode=<!panic> will stop your box from freezing. Problems with one zpool can cause problems with other unaffected pools. Problems at the storage driver level can cause one bad disk to freeze other good disks. Problems with the user interface generally make it impossible to offline a known-bad device because the user interface is frozen, or you get some catchall error like ``no valid replicas'''' because who-knows-what, or ``I/O error'''' because the user interface can''t mark the failed drive as offline in the copy of the label stored on the failed drive---if metastat behaved that way?! I''ve also had problems with iscsiadm and format pausing for minutes because a discovery-address is not responding, which could turn into hours if I had a hundred iSCSI targets---if I could just edit a damned text file like on a real Unix, I wouldn''t have to put up with these needlessly-complex state machines and multiplicative timeouts. NFS can freeze entirely if any exported filesystem has problems. Yes, some of the panics reported may come from failmode, but if you look through bugs.opensolaris.org and the list you''ll see many different kinds of assertion-failure panics that aren''t controlled by the failmode knob, usually panic-on-import or freeze-on-import, but sometimes other kinds. 
To my view, the good news for ZFS is that most other things suck almost as much, so there is only a little catching-up to do before it''s competitive. OTOH it looks like an unworkable disaster w.r.t. the promised future environment where pools have hundreds of disks, always some of them failing. The exception handling is a mess, the timers are attached to accidental hodge-podge ``layered'''' state machines for which no one will accept ultimate responsibility, and the locking of various user interfaces and subsystems is coarse because it''s built either for correctness/simplicity/deadlines, or for a mistaken, outdated goal: high-performance, assuming-a-fully-working-system, otherwise-fix-your-hardware. jb> ditto blocks mg> copies=n. neither of which applies to the situations Victor helped recover from. It''s possible ditto blocks are quietly helping people, but I''ve not read on the list of one scenario where something bad happened and the resolution was ``you should have used copies=n''''. The OP is asking about best practices that mitigate known problems, not a repeat of the standard list of bullet point features and their hypothetical virtues. mg> And for those places where losing a txg or two is a mortal mg> sin, don''t use flaky hardware and allow zfs to handle a layer mg> of redundancy. It is a mortal sin for a filesystem in all places. It''s just much less bad than losing the entire pool. To be a safe backing-store for databases or email, ZFS needs to have implementable best-practices that stop this from happening, not just recover from it. Whatever recovery there is, certainly should not be silent and maybe should not be automatic. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081013/c7d4f126/attachment.bin>
Gino
2008-Nov-29 11:49 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> About 99% of the problems reported as "I need ZFS > fsck" can be summed up > by two ZFS bugs: > > 1. If a toplevel vdev fails to open, we should be > able to pull > information from necessary ditto blocks to open > the pool and make > what progress we can. Right now, the root vdev > code assumes "can''t > open = faulted pool," which results in failure > scenarios that are > perfectly recoverable most of the time. This needs > to be fixed > so that pool failure is only determined by the > ability to read > critical metadata (such as the root of the DSL). > . If an uberblock ends up with an inconsistent view > of the world (due > to failure of DKIOCFLUSHWRITECACHE, for example), > we should be able > to go back to previous uberblocks to find a good > view of our pool. > This is the failure mode described by Jeff. > [b]These are both bugs in ZFS and will be fixed. [/b]I totally agree these covers most of the corruptions we had in past. Any news about that bugs in recent Nevada release? Anyone can provide us a detailed procedure to "go back to previous uberblocks to find a good view of our pool" as described by Jeff? Thanks gino -- This message posted from opensolaris.org
Ray Clark
2008-Nov-30 16:22 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
It would be extremely helpful to know what brands/models of disks lie and which don't. This information could be provided diplomatically simply as threads documenting problems you are working on, stating the facts. Use of a specific string of words would make searching for it easy. There should be no liability, since you are simply documenting compatibility with zfs. Or perhaps, if the lawyers let you, you could simply publish a compatibility/incompatibility list. These ARE facts.

If there is a way to make a detection tool, that would be very useful too, although after the purchase is made, it could be hard to send it back. However, that info could be fed into the database as that drive/model being incompatible with zfs. As Solaris / zfs gains ground, this could become a strong driver in the industry.

Re: "I'll run tests with known-broken disks to determine how far back we need to go in practice -- I'll bet one txg is almost always enough."

So go back three - we are using zfs because we want absolute reliability (or at least as close as we can get).

--Ray
--
This message posted from opensolaris.org
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes: >>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:tt> I think we have to assume Anton was joking - otherwise his tt> measure is uselessly unscientific. I think it''s rude to talk about someone who''s present in the third person, especially when you''re trying to minimize his view. Were you joking, Anton? :) 0. The reports I read were not useless in the way some have stated, because for example Mike sampled his own observations: mg> In the past year I''ve lost more ZFS file systems than I have mg> any other type of file system in the past 5 years. With other mg> file systems I can almost always get some data back. With ZFS mg> I can''t get any back. It''s not just bloggers and pundits sampling mailing list traffic. I thought there was at least one other post like this but could not find it. 1. I don''t think your impressions nor Anton''s and mine are ``useless'''' 2. I don''t think your positive impression is any more scientific than his and my skeptical one. 3. I''m in general troubled by reports of corruption that aren''t well-investigated, because this will stop young, fragile filesystems from becoming old and robust. BUT.... 4. I''m less troubled by (3) because a few of the corruption reports were well-investigated by Victor, and he recovered them manually and posted a summary here: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051643.html and how the exprience might inform ZFS improvements: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051667.html 5. I''m more troubled again because everyone seems to have forgotten (4). Mike, Victor, and others can''t necessarily repeat themselves every time this thread''s resurrected. If yapping mailing list monkeys like me don''t remember this experience, invested-wishing and marketing white papers will drown out the experience we''re getting. 
I''ve pointed straight at an unfixed corruption problem that''s biting ZFS users, and the discussion about where to place the blame and how to fix it. It is not fixed now, yet pundits on-list and all over the Interweb like here: http://www.kev009.com/wp/2008/11/on-file-systems/ talk about corruption bugs hazily and say ``most of all that''s been fixed'''' when it''s not so hazy and hasn''t been, then focus on theoretical unrealized capabilities of the on-disk format and mimimize this clear experience into ghostly distant-past rumor. I don''t see when the single-LUN SAN corruption problems were fixed. I think the supposed ``silent FC bit flipping'''' basis for the ``use multiple SAN LUN''s'''' best-practice is revoltingly dishonest, that we _know_ better. I''m not saying devices aren''t guilty---Sun''s sun4v IO virtualizer was documented as guilty of ignoring cache flushes to inflate performance just like the loomingly-unnamed models of lying SATA drives: http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html Is a storage-stack-related version this problem the cause of lost single-LUN SAN pools? maybe, maybe not, but either way we need an end-to-end solution. I don''t currently see an end-to-end solution to this pervasive blame-the-device mantra every time a pool goes bad. I keep digging through the archives to post messages like this because I feel like everyone only wants to have happy memories, and that it''s going to bring about a sad end. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081212/f666444a/attachment.bin>
Johan Hartzenberg
2008-Dec-12 20:38 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, Dec 12, 2008 at 10:10 PM, Miles Nordin <carton at ivy.net> wrote:

> 0. The reports I read were not useless in the way some have stated,
> because for example Mike sampled his own observations:

[snip]

> I don't see when the single-LUN SAN corruption problems were fixed. I
> think the supposed ``silent FC bit flipping'' basis for the ``use
> multiple SAN LUN's'' best-practice is revoltingly dishonest, that we
> _know_ better. I'm not saying devices aren't guilty---Sun's sun4v IO
> virtualizer was documented as guilty of ignoring cache flushes to
> inflate performance just like the loomingly-unnamed models of lying
> SATA drives:
>
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/051735.html
>
> Is a storage-stack-related version of this problem the cause of lost
> single-LUN SAN pools? Maybe, maybe not, but either way we need an
> end-to-end solution. I don't currently see an end-to-end solution to
> this pervasive blame-the-device mantra every time a pool goes bad.
>
> I keep digging through the archives to post messages like this because
> I feel like everyone only wants to have happy memories, and that it's
> going to bring about a sad end.

Thank you.

There are so many unsupported claims and so much noise on both sides that everybody is sounding like a bunch of fanboys.

The only bit that I understand about why HW raid "might" be bad is that if ZFS had access to the disks behind a HW RAID LUN, then _IF_ zfs were to encounter corrupted data in a read, it will probably be able to re-construct that data. This is at the cost of doing the parity calculations on a general-purpose CPU, and then sending that parity data, as well as the data to write, across the wire. Some of that cost may be offset against RAID-Z's optimizations over RAID-5 in some situations, but all of this is pretty much if-then-maybe type situations.

I also understand that HW raid arrays have some vulnerabilities and weaknesses, but those seem to be offset against ZFS' notorious instability during error conditions. I say notorious because of all the open bug reports and reports on the list of I/O hanging and/or systems panicking while waiting for ZFS to realize that something has gone wrong.

I think if this last point can be addressed - make ZFS respond MUCH faster to failures - then it will go a long way to making ZFS be more readily adopted.

--
Any sufficiently advanced technology is indistinguishable from magic.
   Arthur C. Clarke

My blog: http://initialprogramload.blogspot.com
On 12-Dec-08, at 3:10 PM, Miles Nordin wrote:

>>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>>>>>> "mg" == Mike Gerdts <mgerdts at gmail.com> writes:
>
> tt> I think we have to assume Anton was joking - otherwise his
> tt> measure is uselessly unscientific.
>
> I think it's rude to talk about someone who's present in the third
> person, especially when you're trying to minimize his view. Were you
> joking, Anton? :)
> ....
>
> 1. I don't think your impressions nor Anton's and mine are ``useless''

Alright, I agree I should retract the 'useless' but I would keep the 'unscientific'.

--Toby
On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:

> ...
> The only bit that I understand about why HW raid "might" be bad is
> that if it had access to the disks behind a HW RAID LUN, then _IF_
> zfs were to encounter corrupted data in a read, it will probably be
> able to re-construct that data. This is at the cost of doing the
> parity calculations on a general purpose CPU,

Except that it's not just parity - ZFS checksums where RAID-N does not (although I've heard that some RAID systems checksum "somewhere" - not end-to-end, of course).

Call me a fanboy if you will, but ZFS is different from hw RAID. I am not an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more revolution than evolution. It's software. We only need be patient while it matures. :)

--Toby
Bob Friesenhahn
2008-Dec-12 21:11 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, 12 Dec 2008, Toby Thain wrote:

>> 1. I don't think your impressions nor Anton's and mine are ``useless''
>
> Alright, I agree I should retract the 'useless' but I would keep the
> 'unscientific'.

There is no need to retract the 'useless'. By the same useless measure, George Bush Jr has done a fantastic job at dealing with world terror since there has not been a serious attack on US soil by islamic terrorists since 2002. One might think that this impression is significant, yet it is not, since the previous attack on US soil was in 1993, which was a gap of about 9 years, and we have only gone about 6 thus far. By statistical measures, George Bush Jr could have done absolutely nothing and it is likely that nothing bad would have happened at all. There is insufficient evidence to suggest one conclusion vs another.

This example shows the dangers of using illogical thinking to presumably reach a logical conclusion. It is particularly dangerous to exhibit illogical thinking in public where everyone can see.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Bob Friesenhahn
2008-Dec-12 21:16 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, 12 Dec 2008, Toby Thain wrote:

> Except that it's not just parity - ZFS checksums where RAID-N does not
> (although I've heard that some RAID systems checksum "somewhere" - not
> end-to-end, of course).

It will soon be quite easy to build a RAID system like this using OpenSolaris and a sub-project known as COMSTAR. The checksums will be done using a storage technology called ZFS.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Johan Hartzenberg wrote:> There is so much unsupported claims and noise on both sides that > everybody is sounding like a bunch of fanboys.I don''t think there are two sides. Anyone who has been around computing for any length of time has lost data due to various failures. The question isn''t about losing data, it is about how to proceed when your data is damaged.> > The only bit that I understand about why HW raid "might" be bad is > that if it had access to the disks behind a HW RAID LUN, then _IF_ zfs > were to encounter corrupted data in a read, it will probably be able > to re-construct that data. This is at the cost of doing the parity > calculations on a general purpose CPU, and then sending that parity > data, as well as the data to write, across the wire. Some of that > cost may be offset against Raid-Z''s optimizations over raid-5 in some > situations, but all of this is pretty much if-then-maybe type situations.OK, repeat after me: there is no such thing as hardware RAID, there is no such thing as hardware RAID, there is no such thing as hardware RAID. There is only software RAID. If you believe any software is infallible, then you will be hurt. Even beyond RAID, there is quite sophisticated software on your disks, and anyone who has had to upgrade disk firmware will attest that disk firmware is not infallible.> I also understand that HW raid arrays have some vulnerabilities and > weaknesses, but those seem to be offset against ZFS'' notorious > instability during error conditions. I say notorious, because of all > the open bug reports and reports on the list of I/O hanging and/or > systems panicing while waiting for ZFS to realize that something has > gone wrong. > > I think if this last point can be addressed - make ZFS respond MUCH > faster to failures, then it will go a long way to make ZFS be more > readily adopted.However, you can''t respond too fast -- something which seems to get lost in these conversations. 
If you declare a disk dead too fast, then you get caught in a bind by things like Seagate disks which "freeze" for a few seconds. It may be much better to ride through such things than initiate a reconfiguration action (as described in the article below).
http://blogs.zdnet.com/storage/?p=369&tag=nl.e539

Note: as of b97, it is now possible to set per-device retries in the sd and ssd drivers. This is a good start towards satisfying those who are fed up with the default sd/ssd retry logic. See sd(7d)
http://opensolaris.org/os/community/arc/caselog/2007/505/
-- richard
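For readers who have not seen the b97 change Richard mentions, a per-device retry tunable is set through the sd driver's configuration file. The fragment below is a hedged sketch only: the vendor/product string is a made-up example, and the exact property name and accepted values should be verified against sd(7d) on your build before use.

```
# /etc/driver/drv/sd.conf -- illustrative sketch, not a tested config.
# Shortens the command-retry window for one hypothetical drive model;
# VID/PID string and "retries-timeout" value are assumptions to verify
# against sd(7d) for your OpenSolaris build.
sd-config-list = "SEAGATE ST31000340NS", "retries-timeout:3";
```

The point of the tunable is exactly the trade-off discussed above: a smaller retry budget fails a sick disk sooner, at the cost of declaring dead a disk that was merely "frozen" for a few seconds.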
On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics.com.au> wrote:

> On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:
>
> ...
> The only bit that I understand about why HW raid "might" be bad is that if
> it had access to the disks behind a HW RAID LUN, then _IF_ zfs were to
> encounter corrupted data in a read, it will probably be able to re-construct
> that data. This is at the cost of doing the parity calculations on a
> general purpose CPU,
>
> Except that it's *not just parity* - ZFS checksums where RAID-N does not
> (although I've heard that some RAID systems checksum "somewhere" - not
> end-to-end of course).
>
> Call me a fanboy if you will, but ZFS is different from hw RAID. I am not
> an "automatic denier" of ZFS bugs or flaws, but I do acknowledge it's more
> *revolution* than evolution. It's software. We only need be patient while
> it matures. :)
>
> --Toby

I'm going to pitch in here as devil's advocate and say this is hardly revolution. 99% of what zfs is attempting to do is something NetApp and WAFL have been doing for 15+ years. Regardless of the merits of their patents and prior art, etc., this is not something revolutionarily new. It may be "revolution" in the sense that it's the first time it's come to open source software and been given away, but it's hardly "revolutionary" in file systems as a whole.

--Tim
Tim wrote:

> On Fri, Dec 12, 2008 at 2:51 PM, Toby Thain <toby at telegraphics.com.au
> <mailto:toby at telegraphics.com.au>> wrote:
>
>     On 12-Dec-08, at 3:38 PM, Johan Hartzenberg wrote:
>
>>     ...
>>     The only bit that I understand about why HW raid "might" be bad
>>     is that if it had access to the disks behind a HW RAID LUN, then
>>     _IF_ zfs were to encounter corrupted data in a read, it will
>>     probably be able to re-construct that data. This is at the cost
>>     of doing the parity calculations on a general purpose CPU,
>
>     Except that it's /not just parity/ - ZFS checksums where RAID-N
>     does not (although I've heard that some RAID systems checksum
>     "somewhere" - not end-to-end of course).
>
>     Call me a fanboy if you will, but ZFS is different from hw RAID. I
>     am not an "automatic denier" of ZFS bugs or flaws, but I do
>     acknowledge it's more /revolution/ than evolution. It's software.
>     We only need be patient while it matures. :)
>
>     --Toby
>
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp
> and WAFL have been doing for 15 years+.

The ideas aren't new, but the combination of the ideas is. NetApp is still a box at the end of a bit of wire that the OS has to blindly trust.

-- Ian.
On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome.com> wrote:

> The ideas aren't new, but the combination of the ideas is. NetApp is
> still a box at the end of a bit of wire that the OS has to blindly trust.
>
> --
> Ian.

I'm not aware of many, if any, large shops that are moving to a model of "all internal disk with applications running on them". The sun box will just be "a box at the end of the wire", a la storage 7000 when it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*.

--Tim
Tim wrote:

> On Fri, Dec 12, 2008 at 3:36 PM, Ian Collins <ian at ianshome.com
> <mailto:ian at ianshome.com>> wrote:
>
>     The ideas aren't new, but the combination of the ideas is. NetApp is
>     still a box at the end of a bit of wire that the OS has to blindly
>     trust.
>
> I'm not aware of many, if any, large shops that are moving to a model
> of "all internal disk with applications running on them". The sun box
> will just be "a box at the end of the wire", a la storage 7000 when
> it's an nfs/cifs/iscsi target. Centralized storage is a *good thing*.

Maybe, but I'm sure that will change as the performance of storage subsystems continues to exceed the performance of the bit of wire. That's where the revolution bit comes in; applications can now coexist with NetApp-quality storage management.

-- Ian.
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>>>>> "jh" == Johan Hartzenberg <jhartzen at gmail.com> writes:

    nw> If you can fully trust the SAN then there's no reason not to
    nw> run ZFS on top of it with no ZFS mirrors and no RAID-Z.

The best practice I understood is currently to use zpool-layer redundancy especially with SAN, even more so than with single-spindle local storage, because of (1) the new corruption problems people are having with ZFS on single-LUN SANs that they didn't have when using UFS and vxfs on the same SAN, and (2) the new severity of the problem: losing the whole pool instead of the few files you lose to UFS corruption or that you're supposed to lose to random bit flips on ZFS.

The problems do not sound like random bit-flips. They're corruption of every ueberblock. The best-guess explanation AIUI is not FC checksum gremlins---it's that write access to the SAN is lost and then comes back---ex. if the SAN target loses power or fabric access but the ZFS host doesn't reboot---and either the storage stack is misreporting the failure or ZFS isn't correctly responding to the errors. See the posts I referenced. Apparently the layering is not as simple in practice as one might imagine.

Even if you ignore the post-mortem analysis of the corrupt pools and look only at the symptom: if it were random corruption from DRAM and FC checksum gremlins, we should see mostly reports of a few files lost to checksum errors on single-LUN SANs and reported in 'zpool status', much more often than whole zpools lost, yet exactly the opposite is happening.
    jh> The only bit that I understand about why HW raid "might" be
    jh> bad is that if it had access to the disks behind a HW RAID
    jh> LUN, then _IF_ zfs were to encounter corrupted data in a read,

In at least one case it's certain there are no reported latent sector errors from the SAN on the corrupt LUN---'dd if=<..lun..> of=/dev/null' worked for at least one person who lost a single-LUN zpool. It doesn't sound to me like random bit-flips causing the problem, since all copies of the ueberblock are corrupt, and that's a bit far-fetched to happen randomly on a LUN that scrubs almost clean when mounted with the second-newest ueberblock.

    jh> ZFS' notorious instability during error conditions.

Right, availability is a reason to use RAID below the ZFS layer. It might or might not be related to the SAN problems. Maybe yes, if the corruption happens during a path failover or a temporary connectivity interruption. But the symptom is different from the timeout/availability thread: a corrupt, unmountable pool. The hang discussion was about frozen systems where the pool imports fine after reboot, which is a different symptom.
Nicolas Williams
2008-Dec-12 22:49 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Fri, Dec 12, 2008 at 05:31:37PM -0500, Miles Nordin wrote:

>    nw> If you can fully trust the SAN then there's no reason not to
>    nw> run ZFS on top of it with no ZFS mirrors and no RAID-Z.
>
> The best practice I understood is currently to use zpool-layer
> redundancy especially with SAN even moreso than with single-spindle

Yes, but I believe this whole thread is about ZFS with no zpool-layer redundancy, with RAID done in the SAN.

> local storage, because of (1) the new corruption problems people are

Your thesis is that all corruption problems observed with ZFS on SANs are: a) phantom writes that never reached the rotating rust, b) not bit rot, corruption in the I/O paths, ... Correct?

> The problems do not sound like random bit-flips. They're corruption
> of every ueberblock. The best-guess explanation AIUI, is not FC

Some of the earlier problems of type (2) were triggered by checksum verification failures on pools with no redundancy, where ZFS would just panic (IIRC). These were due to bit-rot issues, not cache flush failures.

> checksum gremlins---it's that write access to the SAN is lost and then
> comes back---ex. if the SAN target loses power or fabric access but
> the ZFS host doesn't reboot---and either the storage stack is
> misreporting the failure or ZFS isn't correctly responding to the
> errors. see the posts I referenced.

It's possible that ZFS could, periodically (in the background) and/or at pool import time (synchronously), validate the consistency on disk of every transaction going backwards from the last until one is found that is consistent, or until it runs out of past überblocks, or it goes too far into the past. (Does ZFS have an option to do that? It might be a useful option to have for dealing with lying SANs.)

>    jh> ZFS' notorious instability during error conditions.
>
> right, availability is a reason to use RAID below ZFS layer.
> It might

ZFS handles device errors better when ZFS does redundancy at the zpool layer, as opposed to when redundancy is left to the SAN. That's well established, so why do you say the opposite?

Nico
--
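Nicolas Williams's suggestion above (walk backwards through past überblocks until a self-consistent transaction is found) can be sketched as a toy model. This is purely illustrative, assuming a simple list of (txg, valid) candidates; it does not model the real ZFS label layout, where each vdev label carries an array of uberblocks.

```python
# Toy model of "roll back to the newest consistent uberblock".
# Each candidate is (txg, is_consistent); is_consistent stands in for a
# full walk of the metadata tree that the uberblock points at. All names
# here are illustrative, not ZFS's on-disk structures.

def newest_consistent(uberblocks):
    """Return the highest-txg candidate whose tree validates, else None."""
    for txg, ok in sorted(uberblocks, key=lambda u: u[0], reverse=True):
        if ok:
            return txg
    return None

# Newest txg (207161) fails validation, so we fall back one transaction.
candidates = [(207159, True), (207160, True), (207161, False)]
print(newest_consistent(candidates))  # -> 207160
```

The design choice being debated is exactly this fallback: accepting the loss of the last few seconds of writes in exchange for an importable pool, rather than refusing to import at all.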
> I'm going to pitch in here as devil's advocate and say this is hardly
> revolution. 99% of what zfs is attempting to do is something NetApp and
> WAFL have been doing for 15 years+. Regardless of the merits of their
> patents and prior art, etc., this is not something revolutionarily new. It
> may be "revolution" in the sense that it's the first time it's come to open
> source software and been given away, but it's hardly "revolutionary" in file
> systems as a whole.

"99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:

        end-to-end checksums
        unlimited snapshots and clones
        O(1) snapshot creation
        O(delta) snapshot deletion
        O(delta) incremental generation
        transactionally safe RAID without NVRAM
        variable blocksize
        block-level compression
        dynamic striping
        intelligent prefetch with automatic length and stride detection
        ditto blocks to increase metadata replication
        delegated administration
        scalability to many cores
        scalability to huge datasets
        hybrid storage pools (flash/disk mix) that optimize price/performance

How many of those does NetApp have? I believe the correct answer is 0%.

Jeff
I find this thread both interesting and disturbing. I'm fairly new to this list, so please excuse me if my comments/opinions are simplistic or just incorrect.

I think there's been too much FC SAN bashing, so let me change the example. What if you buy a 7000 Series server (complete with zfs) and set up an IP SAN. You create a LUN and share it out to a Solaris 10 host. On the Solaris host you create a ZFS pool with that iSCSI LUN. Now my understanding is that you will not be able to correct errors on the zpool of the Solaris 10 machine, because zfs on the Solaris 10 machine is not doing the raid.

Another example would be if you were sharing out a LUN to a vmware server, from your iSCSI SAN or FC SAN, and creating Solaris 10 virtual machines, with zfs booting. Another example would be Solaris 10 booting a zfs filesystem from a hardware mirrored pair of drives. Now these are examples of standard implementations of machines in a datacenter, specifically ones I have installed.

From following this thread I now feel that if I have uncorrectable "data errors" on the zfs pools there will be no way to easily repair the pool. I see no reason why, if I do detect errors as I scrub the zfs pool, I should not be able to run a simple utility to fix the pool as I would a ufs filesystem and then recover the corrupted files from tape.

I believe that for zfs to be used as a general purpose filesystem there has to be support built into zfs for these standard data center implementations; otherwise it will just become a specialized filesystem, like Netapp's WAFL, and there are a lot more servers than storage appliances in the datacenter.

I think this thread has put zfs in a negative light. I don't actually believe that I will experience many of these problems in an Enterprise class data center, but still I don't look forward to having to deal with the consequences of encountering these types of problems. Maybe zfs is not ready to be considered a general purpose filesystem.
-- Ed Spencer
[sigh, here we go again... isn't this in a FAQ somewhere, it certainly is in the archives...]

Ed Spencer wrote:

> I find this thread both interesting and disturbing. I'm fairly new to
> this list so please excuse me if my comments/opinions are simplistic or
> just incorrect.
>
> I think there's been to much FC SAN bashing so let me change the
> example.
>
> What if you buy a 7000 Series server (complete with zfs) and setup an IP
> SAN. You create a LUN and share it out to a Solaris 10 host.
> On the solaris host you create a ZFS pool with that iscsi LUN.

You are certainly able to implement ZFS redundancy on the Solaris 10 host.

> Now my undersatnding is that you will not be able to correct errors on
> the zpool of the Solaris10 machine because zfs on the solaris 10 machine
> is not doing the raid.

No, this is not a completely true statement (more below).

> Another example would be if you were sharing out a lun to a vmware
> server, from your iscsi san or fc san, and creating solaris 10 virtual
> machines, with zfs booting.

You are certainly able to implement ZFS redundancy on the Solaris 10 VM.

> Another example would be Solaris 10 booting a zfs filesystem from a
> hardware mirrored pair of drives.

You are certainly able to implement ZFS redundancy on the mirrored pair of drives.

> Now these are examples of standard implementations of machines in a
> datacenter, specifically ones I have installed.

I presume you are saying that you implemented only the default ZFS data protection for a single vdev. You have more options, including copies, mirroring, raidz, etc.

> From following this thread I now feel that if I have uncorrectable "data
> errors" on the zfs pools there will be no way to easily repair the pool.

Untrue. ZFS will attempt to repair what it can repair. More below.

> I see no reason that if I do detect errors as I scrub the zfs pool that
> I should be able to run a simple utility to fix the pools as I would a
> ufs filesystem and then recover the corrupted files from tape.

There is no utility for UFS which will repair corrupted data. UFS is blissfully unaware of data corruption. fsck will attempt to reconcile metadata problems, which were very common before logging was added, because UFS does not have an always-consistent on-disk format (ZFS does).

By default, ZFS uses copies=2 for metadata. Uberblocks are 4x redundant. If data corruption is detected in a file, zpool status -x will show exactly which files are corrupted, which will allow you to decide how you want to handle the broken file. IMHO, you are getting hung up on the fact that if data corruption is detected in a file and ZFS does not have a way to repair the file, then you will probably want to do something about it manually. With UFS, you'll never know, though you might see some symptoms like your apps crashing or your spreadsheet having the wrong numbers.

> I believe that for zfs to be used as a general purpose filesystem that
> there has to be support built into zfs to support these standard data
> center implementations, otherwise it will just become a specialized
> filesystem, like Netapp's WAFL, and there are alot more servers than
> storage appliances in the datacenter.

I disagree. ZFS will be the preferred boot file system for Solaris systems -- it already is the only boot file system available for OpenSolaris. Features like snapshots (that actually work, unlike UFS snapshots in many cases) and cloning are extremely useful for managing OSes, patches, and upgrades. ZFS is the future general purpose file system for Solaris; UFS is not (which will become readily apparent when you buy a 1.5 TByte disk).

> I think this thread has put zfs in a negative light. I don't actually
> believe that I will experience many of these problems in an Enterprise
> class data center, but still I don't look forward to having to deal with
> the consequences of encountering these types of problems.

One reason you may have never experienced data corruption with UFS (which I find hard to believe, having used UFS for 20+ years) is that UFS has no way to detect data corruption. Are you trying to kill the canary? :-)

> Maybe zfs is not ready to be considered a general purpose filesystem.

I'd say maybe UFS is not ready to be considered a general purpose file system, by today's standards :-)
-- richard
Richard,

I have been glancing through the posts and saw more hardware RAID vs ZFS discussion, some of it very useful. However, as you advised me the other day, we should think about the overall solution architecture, not just the feature itself.

I believe the spirit of ZFS snapshot is more significant than what has been discussed: the rapid (though I don't know if it is stateful today) application migration capabilities that enhance overall business continuity, hopefully fulfilling enterprise availability requirements. I really don't think any hardware RAID with embedded snapshot can do such, and I am never IMHO.

One example: ZFS is used to both capture the guest from a snapshot and move the compressed snapshot between servers, not limited to the Sun xVM hypervisor; the same approach could be used with respect to hosting Solaris Zones or Sun Logical Domains.

Best,
z
On Fri, Dec 12, 2008 at 8:16 PM, Jeff Bonwick <Jeff.Bonwick at sun.com> wrote:

> > I'm going to pitch in here as devil's advocate and say this is hardly
> > revolution. 99% of what zfs is attempting to do is something NetApp and
> > WAFL have been doing for 15 years+. Regardless of the merits of their
> > patents and prior art, etc., this is not something revolutionarily new. It
> > may be "revolution" in the sense that it's the first time it's come to open
> > source software and been given away, but it's hardly "revolutionary" in file
> > systems as a whole.
>
> "99% of what ZFS is attempting to do?" Hmm, OK -- let's make a list:
>
>        end-to-end checksums
>        unlimited snapshots and clones
>        O(1) snapshot creation
>        O(delta) snapshot deletion
>        O(delta) incremental generation
>        transactionally safe RAID without NVRAM
>        variable blocksize
>        block-level compression
>        dynamic striping
>        intelligent prefetch with automatic length and stride detection
>        ditto blocks to increase metadata replication
>        delegated administration
>        scalability to many cores
>        scalability to huge datasets
>        hybrid storage pools (flash/disk mix) that optimize price/performance
>
> How many of those does NetApp have? I believe the correct answer is 0%.
>
> Jeff

Seriously? Do you know anything about the NetApp platform? I'm hoping this is a genuine question...

Off the top of my head, nearly all of them. Some of them have artificial limitations because they learned the hard way that if you give customers enough rope they'll hang themselves. For instance "unlimited snapshots". Do I even need to begin to tell you what a horrible, HORRIBLE idea that is? "Why can't I get my space back?" Oh, just do a snapshot list and figure out which one is still holding the data. What? Your console locks up for 8 hours when you try to list out the snapshots? Huh... that's weird.

It's sort of like that whole "unlimited filesystems" thing. Just don't ever reboot your server, right? Or "you can have 40pb in one pool!!!". How do you back it up? Oh, just mirror it to another system? And when you hit a bug that toasts both of them you can just start restoring from tape for the next 8 years, right? Or if by some luck we get a zfsiron, you can walk the metadata for the next 5 years.

NVRAM has been replaced by flash drives in a ZFS world to get any kind of performance... so you're trading one high-priced storage for another. Your snapshot creation and deletion is identical. Your incremental generation is identical. End-to-end checksums? Yup.

Let's see... they don't have block-level compression; they chose dedup instead, which nets better results. "Hybrid storage pool" is achieved through PAM modules. Outside of that... I don't see ANYTHING in your list they didn't do first.

--Tim
> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".

Oh, that's precious! It's not an arbitrary limit, it's a safety feature!

> Outside of that... I don't see ANYTHING in your list they didn't do first.

Then you don't know ANYTHING about either platform. Constant-time snapshots, for example. ZFS has them; NetApp's are O(N), where N is the total number of blocks, because that's how big their bitmaps are. If you think O(1) is not a revolutionary improvement over O(N), then not only do you not know much about either snapshot algorithm, you don't know much about computing.

Sorry, everyone else, for feeding the troll. Chum the water all you like, I'm done with this thread.

Jeff
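The O(1)-vs-O(N) point in Jeff's message can be illustrated with a toy model. Neither class below models WAFL's or ZFS's actual on-disk structures (both names and mechanisms are illustrative assumptions); it only shows where the creation-time work lands: a bitmap-based snapshot must touch an entry per block, while a birth-time (copy-on-write) snapshot just records the current transaction number.

```python
# Toy contrast of snapshot-creation cost; illustrative only, not the
# real WAFL or ZFS data structures.

class BitmapSnapshots:
    """O(N) creation: a per-block reference bitmap must be copied."""
    def __init__(self, nblocks):
        self.active = [1] * nblocks       # one bit per block in the volume
    def snapshot(self):
        return list(self.active)          # touches every block entry

class BirthTimeSnapshots:
    """O(1) creation: remember the current transaction group number."""
    def __init__(self):
        self.txg = 0
    def snapshot(self):
        return self.txg                   # constant work, any pool size

bitmap_pool = BitmapSnapshots(10)
print(len(bitmap_pool.snapshot()))        # work grows with block count
cow_pool = BirthTimeSnapshots()
print(cow_pool.snapshot())                # work independent of pool size
```

Scale the bitmap model to billions of blocks and the difference stops being academic, which is the substance of the disagreement above.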
> Seriously? Do you know anything about the NetApp platform? I'm hoping this
> is a genuine question...
>
> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?" Oh, just do a snapshot list and figure out
> which one is still holding the data. What? Your console locks up for 8
> hours when you try to list out the snapshots? Huh... that's weird.
>
> It's sort of like that whole "unlimited filesystems" thing. Just don't ever
> reboot your server, right? Or "you can have 40pb in one pool!!!". How do
> you back it up? Oh, just mirror it to another system? And when you hit a
> bug that toasts both of them you can just start restoring from tape for the
> next 8 years, right? Or if by some luck we get a zfsiron, you can walk the
> metadata for the next 5 years.
>
> NVRAM has been replaced by flash drives in a ZFS world to get any kind of
> performance... so you're trading one high priced storage for another. Your
> snapshot creation and deletion is identical. Your incremental generations
> is identical. End-to-end checksums? Yup.
>
> Let's see... they don't have block-level compression, they chose dedup
> instead which nets better results. "Hybrid storage pool" is achieved
> through PAM modules. Outside of that... I don't see ANYTHING in your list
> they didn't do first.

Wow -- I've spoken to many NetApp partisans over the years, but you might just take the cake. Of course, most of the people I talk to are actually _using_ NetApp's technology, a practice that tends to leave even the most stalwart proponents realistic about the (many) limitations of NetApp's technology...

For example, take the PAM. Do you actually have one of these, or are you basing your thoughts on reading whitepapers? I ask because (1) they are horrifically expensive, (2) they don't perform that well (especially considering that they're DRAM!), (3) they're grossly undersized (a 6000 series can still only max out at a paltry 96G -- and that's with virtually no slots left for I/O), and (4) they're not selling well. So if you actually bought a PAM, that already puts you in a razor-thin minority of NetApp customers (most of whom see through the PAM and recognize it for the kludge that it is); if you bought a PAM and think that it's somehow a replacement for the ZFS hybrid storage pool (which has an order of magnitude more cache), then I'm sure NetApp loves you: you must be the dumbest, richest customer that ever fell in their lap!

        - Bryan

--------------------------------------------------------------------------
Bryan Cantrill, Sun Microsystems Fishworks.       http://blogs.sun.com/bmc
Bob Friesenhahn
2008-Dec-13 16:03 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Sat, 13 Dec 2008, Tim wrote:

> Seriously? Do you know anything about the NetApp platform? I'm hoping this
> is a genuine question...

I believe that esteemed Sun engineers like Jeff are quite familiar with the NetApp platform. Besides NetApp being one of the primary storage competitors, it is a virtual minefield out there and one must take great care not to step on other companies' patents.

> Off the top of my head nearly all of them. Some of them have artificial
> limitations because they learned the hard way that if you give customers
> enough rope they'll hang themselves. For instance "unlimited snapshots".
> Do I even need to begin to tell you what a horrible, HORRIBLE idea that is?
> "Why can't I get my space back?" Oh, just do a snapshot list and figure out
> which one is still holding the data. What? Your console locks up for 8
> hours when you try to list out the snapshots? Huh... that's weird.

I suggest that you retire to the safety of the rubber room while the rest of us enjoy these zfs features. By the same measure, you would advocate that people should never be allowed to go outside due to the wide open spaces. Perhaps people will wander outside their homes and forget how to make it back. Or perhaps there will be gravity failure and some of the people outside will be lost in space.

There is some activity off the starboard bow, perhaps you should check it out ...

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Hi Bob, Tim, Jeff, you are all my friends, and you all know what you are talking about. As a friend, and trusting your personal integrity, I ask you: please don't get mad, enjoy the open discussion.

(ok, ok, O(N) is revolutionary in tech thinking, just not revolutionary in end customer value. And safety features are important in risk management for enterprises.)

I have friends at NetApp, and there are people there that I don't give a damn about. I am an enterprise architect; I don't care about the little environments that can be fulfilled most effectively by any one operating environment's applications. They are not enterprises, and that business model is risky in economic downturns.

In that spirit, and looking at the NetApp virtual server support architecture, I would say -- as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it would make more sense to utilize the file system capabilities with kernel integration to hypervisors, in virtual server deployments, instead of promoting a storage-device-based file system and data management solution (more proprietary at the solution level). So, in my position, NetApp PiT is not as good as ZFS PiT, because it is too far from the hypervisor.

You can support me or attack me with more technical details (if you know NetApp is developing an API for all server hypervisors, I don't). And don't worry, I have the biggest eagle, but so far, no one has been able to hurt that. ;-)

Best,
z

----- Original Message -----
From: "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us>
To: "Tim" <tim at tcsac.net>
Cc: <zfs-discuss at opensolaris.org>
Sent: Saturday, December 13, 2008 11:03 AM
Subject: Re: [zfs-discuss] Split responsibility for data with ZFS

> On Sat, 13 Dec 2008, Tim wrote:
>>
>> Seriously? Do you know anything about the NetApp platform? I'm hoping
>> this is a genuine question...
>
> I believe that esteemed Sun engineers like Jeff are quite familiar
> with the NetApp platform. Besides NetApp being one of the primary
> storage competitors, it is a virtual minefield out there and one must
> take great care not to step on other company's patents.
>
>> Off the top of my head nearly all of them. Some of them have artificial
>> limitations because they learned the hard way that if you give customers
>> enough rope they'll hang themselves. For instance "unlimited snapshots".
>> Do I even need to begin to tell you what a horrible, HORRIBLE idea that
>> is? "Why can't I get my space back?" Oh, just do a snapshot list and
>> figure out which one is still holding the data. What? Your console locks
>> up for 8 hours when you try to list out the snapshots? Huh... that's weird.
>
> I suggest that you retire to the safety of the rubber room while the
> rest of us enjoy these zfs features. By the same measures, you would
> advocate that people should never be allowed to go outside due to the
> wide open spaces. Perhaps people will wander outside their homes and
> forget how to make it back. Or perhaps there will be gravity failure
> and some of the people outside will be lost in space.
>
> There is some activity off the starboard bow, perhaps you should check
> it out ...
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Bob Friesenhahn
2008-Dec-14 05:45 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Sat, 13 Dec 2008, Joseph Zhou wrote:

> In that spirit, and looking at the NetApp virtual server support
> architecture, I would say --
> as much as the ONTAP/WAFL thing (even with GX integration) is elegant, it
> would make more sense to utilize the file system capabilities with kernal
> integration to hypervisors, in virtual server deployments, instead of
> promoting a storage-device-based file system and data management solution
> (more proprietary at the solution level).

I am not an enterprise architect, but I do agree that when multiple client OSes are involved it is still useful if storage looks like a legacy disk drive. Luckily, Solaris already offers iSCSI in Solaris 10, and OpenSolaris is now able to offer high-performance fibre channel target and fibre channel over ethernet layers on top of reliable ZFS. The full benefit of ZFS is not provided, but the storage is successfully divorced from the client, with a higher degree of data reliability and performance than is available from current firmware-based RAID arrays.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
I wasn't joking, though as is well known, the plural of anecdote is not data.

Both UFS and ZFS, in common with all file systems, have design flaws and bugs. To lose an entire UFS file system (barring the loss of the entire underlying storage) requires a great deal of corruption; there are multiple copies of the superblock, cylinder headers and their inodes are stored in a regular pattern and easily found by recovery tools, and the UFS file system check utility, while not perfect, can repair almost any corruption. There are third-party tools which can perform much more analysis and recovery in a worst-case scenario. A single bad block typically costs only a file or two.

To lose an entire ZFS pool requires that the most recent uberblock, or one of the top-level blocks to which it points, be damaged. There are currently no recovery tools (at least, none of which I am aware).

I find it naïve to imagine that Sun customers "expect" their UFS (or other) file systems to be unrecoverable. Any case where fsck failed quickly became an escalation to the sustaining engineering organization. Restoring from backup is almost never a satisfactory answer for a commercial enterprise.

As usual, the disclaimer: I now work for another storage company, and while I've been on the teams developing and maintaining a number of commercial file systems (including two of Sun's), ZFS has not been one of them.
-- This message posted from opensolaris.org
Some RAID systems compare checksums on reads, though this is usually only for RAID-4 configurations (e.g. DataDirect) because of the performance hit otherwise.

End-to-end checksums are not yet common. The SCSI committee recently ratified T10 DIF, which allows either an operating system or application to supply checksums and have them stored and retrieved with data. Oracle has been working to add support for this to Linux, and several array and drive vendors have committed to implementing it. So one could say that ZFS is ahead of the curve here.

ZFS is not particularly revolutionary: software RAID has been around since the invention of the term; end-to-end checksums to disk have been used since the 1960s (though more often in databases, tape, and optical media); WAFL-like file structures may pre-date NetApp. It does put these together for the first time in a widely available system, though, which is certainly innovative and useful. It will be more useful when it has a more complete disaster recovery model than 'restore from backup.'
-- 
This message posted from opensolaris.org
Anton B. Rang wrote:

> I find it naïve to imagine that Sun customers "expect" their UFS (or
> other) file systems to be unrecoverable.

OK, I'll bite. If we believe the disk vendors, who rate their disks as having an unrecoverable error rate of 1 bit per 10^14 bits read, and knowing that UFS has absolutely no data protection of its data, why would you think it naive to expect that a disk system with UFS can lose data? Rather, I would say it has a distinctly calculable probability. Similarly, for ZFS, the checksum is not perfect, so there is a calculable probability that the ZFS checksum will not detect an unrecoverable (read) error. The difference is that the probability that ZFS will not detect an error is considerably smaller than that of UFS (or FAT, or HSFS, or ...)

> Any case where fsck failed quickly became an escalation to the
> sustaining engineering organization. Restoring from backup is almost
> never a satisfactory answer for a commercial enterprise.

I agree. However, I've personally experienced well over 100 fsck failures over the years, and while I was always unsatisfied, I didn't always lose data[1]. When I did lose data, perhaps it was data I could live without, but that was my call. Would you rather that ZFS simply say, "hey, you lost some data, but we won't tell you where..."?

[1] Once upon a time, I used a [vendor-name-elided] disk for a 2,300-user e-mail message store. I upgraded the OS, which implemented some new SCSI options. The disk's firmware didn't handle those options properly and would wait about 7 hours before corrupting the UFS file system containing the message store, requiring a full restore. So, how many shifts do you think it took to fail, recover, and ultimately resolve the disk firmware issue? Hint: the firmware rev arrived via UPS.

Personally, I'm very glad that a file system has come along that verifies data... and that feature seems to be catching on, as other file systems seem to be doing the same.
Hopefully, in a few years silent data corruption will be a footnote in the lore of computing.
 -- richard
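Richard's "distinctly calculable probability" can be made concrete with a back-of-the-envelope sketch. This assumes the vendor's 10^-14 unrecoverable-error-per-bit figure and treats bit errors as independent, both simplifications:

```python
import math

URE_RATE = 1e-14        # unrecoverable errors per bit read (vendor spec)
DISK_BYTES = 1e12       # reading a full 1 TB disk once
bits_read = DISK_BYTES * 8

# Probability of at least one unrecoverable read error over the whole pass:
# 1 - (1 - p)^n, computed via log1p/exp to avoid floating-point underflow.
p_at_least_one = 1 - math.exp(bits_read * math.log1p(-URE_RATE))

print(f"{p_at_least_one:.3f}")   # roughly 0.077, i.e. ~8% per full read
```

With no checksums, UFS has no way to even notice such an error, whereas ZFS detects it with probability limited only by checksum collision odds.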
Anton B. Rang wrote:

> Some RAID systems compare checksums on reads, though this is usually
> only for RAID-4 configurations (e.g. DataDirect) because of the
> performance hit otherwise.

For the record, Solaris had a (mirrored) RAID system which would compare data from both sides of the mirror upon read. It never achieved significant market penetration and was subsequently scrapped. Many of the reasons that the market did not accept it are solved by the method used by ZFS, which is far superior.

> End-to-end checksums are not yet common. The SCSI committee recently
> ratified T10 DIF, which allows either an operating system or
> application to supply checksums and have them stored and retrieved
> with data. So one could say that ZFS is ahead of the curve here.

Oracle also has data checksumming enabled by default for later releases. I look forward to any field data analysis they may publish :-)

> ZFS is not particularly revolutionary: software RAID has been around
> since the invention of the term; end-to-end checksums to disk have
> been used since the 1960s. It will be more useful when it has a more
> complete disaster recovery model than 'restore from backup.'

If you wish to implement a disaster recovery model, then you should look far beyond what ZFS (or any file system) can provide. Effective disaster recovery requires significant attention to process.
 -- richard
I think the problem for me is not that there's a risk of data loss if a pool becomes corrupt, but that there are no recovery tools available. With UFS, people expect that if the worst happens, fsck will be able to recover their data in most cases. With ZFS you have no such tools, yet Victor has on at least two occasions shown that it's quite possible to recover pools that were completely unusable (I believe by making use of old / backup copies of the uberblock).

My concern is that ZFS has all this information on disk, it has the ability to know exactly what is and isn't corrupted, and it should (at least for a system with snapshots) have many, many potential uberblocks to try. It should be far, far better than UFS at recovering from these things, but for a certain class of faults, when it hits a problem it just stops dead.

That's what frustrates me - knowing that there's potential to have all my data there, stored safely away, but having it completely inaccessible due to a lack of recovery tools.
-- 
This message posted from opensolaris.org
Casper.Dik at Sun.COM
2008-Dec-15 10:30 UTC
[zfs-discuss] Split responsibility for data with ZFS
>I think the problem for me is not that there's a risk of data loss if
>a pool becomes corrupt, but that there are no recovery tools
>available. With UFS, people expect that if the worst happens, fsck
>will be able to recover their data in most cases.

Except, of course, that fsck lies. It "fixes" the metadata, and the quality of the rest is unknown.

Anyone using UFS knows that UFS file corruptions are common; specifically, when using a "UFS root" and the system panics when trying to install a device driver, there's a good chance that some files in /etc are corrupt. Some were application problems (some code used fsync(fileno(fp)); fclose(fp); which doesn't guarantee anything).

>With ZFS you have no such tools, yet Victor has on at least two occasions
>shown that it's quite possible to recover pools that were completely unusable
>(I believe by making use of old / backup copies of the uberblock).

True; and certainly ZFS should be able to backtrack. But it's much more likely to happen "automatically" than by using a recovery tool.

See, fsck could only be written because specific corruptions, and the patterns they take, are known. With ZFS, you can only back up to a certain uberblock, and the pattern will be a surprise.

Casper
Forgive me for not understanding the details, but couldn't you also work backwards through the blocks with ZFS and attempt to recreate the uberblock? So if you lost the uberblock, could you (memory and time allowing) start scanning the disk, looking for orphan blocks that aren't referenced anywhere else, and piece together the top of the tree?

Or roll back to a previous uberblock (or a snapshot uberblock), and then look to see what blocks are on the disk but not referenced anywhere. Is there any way to intelligently work out where those blocks would be linked by looking at how they interact with the known data?

Of course, rolling back to a previous uberblock would still be a massive step forward, and something I think would do much to improve the perception of ZFS as a tool to reliably store data. You cannot overstate the difference to the end user between a file system that on boot says: "Sorry, can't read your data pool." and one that says: "Whoops, the uberblock and all the backups are borked. Would you like to roll back to a backup uberblock, or leave the filesystem offline to repair manually?"

As much as anything else, a simple statement explaining *why* a pool is inaccessible, and saying just how badly things have gone wrong, helps tons. Being able to recover anything after that is just the icing on the cake, especially if it can be done automatically.

Ross

PS. Sorry for the duplicate Casper, I forgot to cc the list.

On Mon, Dec 15, 2008 at 10:30 AM, <Casper.Dik at sun.com> wrote:
>
>>I think the problem for me is not that there's a risk of data loss if
>>a pool becomes corrupt, but that there are no recovery tools
>>available. With UFS, people expect that if the worst happens, fsck
>>will be able to recover their data in most cases.
>
> Except, of course, that fsck lies. It "fixes" the metadata and the
> quality of the rest is unknown.
>
> Anyone using UFS knows that UFS file corruptions are common; specifically,
> when using a "UFS root" and the system panics when trying to
> install a device driver, there's a good chance that some files in
> /etc are corrupt. Some were application problems (some code used
> fsync(fileno(fp)); fclose(fp); it doesn't guarantee anything)
>
>>With ZFS you have no such tools, yet Victor has on at least two occasions
>>shown that it's quite possible to recover pools that were completely unusable
>>(I believe by making use of old / backup copies of the uberblock).
>
> True; and certainly ZFS should be able to backtrack. But it's
> much more likely to happen "automatically" than using a recovery
> tool.
>
> See, fsck could only be written because specific corruptions are known
> and the patterns they have. With ZFS, you can only back up to
> a certain uberblock and the pattern will be a surprise.
>
> Casper
>
Bob Friesenhahn
2008-Dec-15 18:34 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, 15 Dec 2008, Ross wrote:

> My concern is that ZFS has all this information on disk, it has the
> ability to know exactly what is and isn't corrupted, and it should
> (at least for a system with snapshots) have many, many potential
> uberblocks to try. It should be far, far better than UFS at
> recovering from these things, but for a certain class of faults,
> when it hits a problem it just stops dead.

While ZFS knows if a data block is retrieved correctly from disk, a correctly retrieved data block does not indicate that the pool isn't "corrupted". A block written in the wrong order is a form of corruption.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
I'm not sure I follow how that can happen, I thought ZFS writes were designed to be atomic? They either commit properly on disk or they don't?

On Mon, Dec 15, 2008 at 6:34 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Mon, 15 Dec 2008, Ross wrote:
>
>> My concern is that ZFS has all this information on disk, it has the
>> ability to know exactly what is and isn't corrupted, and it should (at least
>> for a system with snapshots) have many, many potential uberblocks to try.
>> It should be far, far better than UFS at recovering from these things, but
>> for a certain class of faults, when it hits a problem it just stops dead.
>
> While ZFS knows if a data block is retrieved correctly from disk, a
> correctly retrieved data block does not indicate that the pool isn't
> "corrupted". A block written in the wrong order is a form of corruption.
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>
Bob Friesenhahn
2008-Dec-15 19:36 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, 15 Dec 2008, Ross Smith wrote:

> I'm not sure I follow how that can happen, I thought ZFS writes were
> designed to be atomic? They either commit properly on disk or they
> don't?

Yes, this is true. One reason why people complain about corrupted ZFS pools is because they have hardware which writes data in a different order than what was requested. Some hardware claims to have written the data but instead it has been secretly cached for later (or perhaps for never), and data blocks get written in some other order. It seems that ZFS is capable of working reliably with "cheap" hardware but not with wrongly designed hardware.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Nicolas Williams
2008-Dec-15 19:46 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, Dec 15, 2008 at 01:36:46PM -0600, Bob Friesenhahn wrote:
> On Mon, 15 Dec 2008, Ross Smith wrote:
>
> > I'm not sure I follow how that can happen, I thought ZFS writes were
> > designed to be atomic? They either commit properly on disk or they
> > don't?
>
> Yes, this is true. One reason why people complain about corrupted ZFS
> pools is because they have hardware which writes data in a different
> order than what was requested. Some hardware claims to have written
> the data but instead it has been secretly cached for later (or perhaps
> for never) and data blocks get written in some other order. It seems
> that ZFS is capable of working reliably with "cheap" hardware but not
> with wrongly designed hardware.

Order of writes matters between transactions, not inside transactions, and at the boundary is a cache flush. Thus what matters really isn't write order so much as whether the devices lie about cache flushes.
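The ordering Nicolas describes can be sketched as a toy model (this is illustrative Python, not ZFS source; all names here are invented). Within a transaction group the drive may reorder writes freely; correctness depends only on two flush barriers, one before and one after the uberblock write:

```python
# Toy model of a copy-on-write transaction-group commit. The only ordering
# that matters: all blocks of txg N must be durable (cache flushed) before
# the uberblock naming txg N is written, and that uberblock must itself be
# flushed before txg N counts as committed.

class Disk:
    def __init__(self):
        self.cache = {}   # writes the drive has acknowledged but not persisted
        self.media = {}   # what is actually on the platter

    def write(self, addr, data):
        self.cache[addr] = data   # drive ACKs immediately, order not guaranteed

    def flush(self):              # an honest SYNCHRONIZE CACHE
        self.media.update(self.cache)
        self.cache.clear()

def commit_txg(disk, txg, blocks, root):
    for addr, data in blocks.items():
        disk.write(addr, data)
    disk.flush()                          # barrier: tree of txg N is on media
    disk.write("uberblock", (txg, root))
    disk.flush()                          # barrier: txg N is now committed

disk = Disk()
commit_txg(disk, 207161, {10: b"data", 11: b"more"}, b"root")
assert disk.media["uberblock"] == (207161, b"root")
```

A drive that acknowledges `flush()` without persisting the cache breaks the barrier, which is exactly the "lying about cache flushes" failure mode discussed in this thread.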
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> Your thesis is that all corruption problems observed with ZFS
    nw> on SANs are: a) phantom writes that never reached the rotating
    nw> rust, b) not bit rot, corruption in the I/O paths, ...
    nw> Correct?

yeah. By ``all'' I mean the several single-LUN pools that were recovered by using an older set of ueberblocks. Of course I don't mean ``all'' as in all pools imaginable, including this one 10 years ago on an unnamed Major Vendor's RAID shelf that gave you a scar just above the ankle. But it is really sounding so far like just one major problem with single-LUN ZFSs on SANs? Or am I wrong - are there lots of pools which can't be recovered with old ueberblocks?

Remember, the problem is losing pools. It is not, ``for weeks I kept losing files. I would get errors reported in 'zpool status', and it would tell me the filename 'blah' has uncorrectable errors. This went on for a while, then one day we lost the whole pool.'' I've heard zero reports like that.

    nw> Some of the earlier problems of type (2) were triggered by
    nw> checksum verification failures on pools with no redundancy, but

checksum failures aren't caused just by bitrot in ZFS. I get hundreds of them after half of my iSCSI mirror bounces because of the incomplete-resilvering bug. I don't know the on-disk format well, but maybe the checksum was wrong because the label pointed to a block that wasn't an ueberblock. Maybe the checksum is functioning in lieu of a commit sector: maybe all four ueberblocks were written incompletely because there is some bug or missing workaround in the way ZFS flushes and schedules the ueberblock writes, so with some written sectors and some unwritten sectors the overall block checksum is wrong. Maybe this is a downside to the filesystem-level checksum.
For integrity it's an upside, but the NetApp block-level checksum, where you checksum just the data plus the block number at the RAID layer, should narrow down checksum failures to disk bit flips only, and thus be better for tracking down problems and building statistics comparable with other systems. We already know the 'zpool status' CKSUM column isn't so selective, and can catch out-of-date data too.

The overall point, what I'd rather have as my ``thesis,'' is that you can't allow ZFS to exonerate itself with an error message. Losing the whole pool in a situation where UFS would (or _might_ - it is not even proven beyond doubt that it _would_) have corrupted a bit of data isn't an advantage just because ZFS can printf a warning that says ``loss of entire pool detected. must be corruption outside ZFS!''
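The block-number-in-checksum idea Miles attributes to NetApp can be sketched in a few lines (a toy model with invented names; real implementations store a CRC in a per-block trailer rather than a SHA-256 in a dict). Binding the block number into the checksum means a block that lands at the wrong address fails verification even though its data is intact:

```python
import hashlib

def block_checksum(data: bytes, block_no: int) -> bytes:
    # Checksum covers the data *and* where it is supposed to live, so a
    # misdirected write is detectable, not just a bit flip.
    return hashlib.sha256(data + block_no.to_bytes(8, "big")).digest()

disk = {}  # block_no -> (data, stored checksum)

def write_block(block_no: int, data: bytes) -> None:
    disk[block_no] = (data, block_checksum(data, block_no))

def read_block(block_no: int) -> bytes:
    data, cksum = disk[block_no]
    if block_checksum(data, block_no) != cksum:
        raise IOError(f"checksum mismatch at block {block_no}")
    return data

write_block(7, b"payload")
assert read_block(7) == b"payload"

# Simulate a misdirected write: block 7's sector (data + checksum) lands at 8.
disk[8] = disk[7]
try:
    read_block(8)
except IOError:
    print("misdirected write detected")
```

A checksum over the data alone, by contrast, verifies fine at the wrong address, which is one reason the CKSUM counter can't distinguish bit rot from stale or misplaced data.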
>>>>> "bc" == Bryan Cantrill <bmc at eng.sun.com> writes:
>>>>> "jz" == Joseph Zhou <jz at excelsioritsolutions.com> writes:

    bc> most of the people I talk to are actually _using_ NetApp's
    bc> technology, a practice that tends to leave even the most
    bc> stalwart proponents realistic about the (many) limitations of
    bc> NetApp's

same applies to ZFS pundits! As Tim said, the one-filesystem-per-user thing is not working out. O(1) for number of filesystems would be great but isn't there. Maybe the format allows unlimited O(1) snapshots, but it's at best only O(1) to take them. All over the place it's probably O(n) or worse to _have_ them: to boot with them, to scrub with them. I think the winning snapshot architecture is more like source code revision control: take infinitely-granular snapshots, a continuous line, and run a cron service to trim the line into a series of points.

The management can be delegated, but inspection commands are not safe and can lock the whole filesystem, and 'zfs recv'ing certain streams panics the whole box, so backup cannot really be safely delegated either. The panic-on-import problems are bad for delegation because you can't safely let users mount things, which to my view is where delegated administration begins. It's too unstable to think of delegating anything - it's all just UI baloney until the panics are fixed and failures are contained within one pool.

The scalability-to-multiple-cores goals are admirable, but only certain things are parallelized. You can only replace one device at a time, which some day will not be enough to keep up with natural failure rates. I think 'zfs send' does not use multiple cores well, right? AIUI people are getting non-scaling performance in send/recv while the ordinary filesystem performance does scale, and thus getting painted into a corner. Yeah, there's compression, but as Tim said people are getting more savings from dedup, which goes naturally with writeable clones too.
Also, the NetApp dedup is a background thread, while the ZFS compression is synchronous with writing, as well as not scaling to multiple cores and seeming to have some bugs in the gzip version. Yeah, there is some hierarchical storage in it, but after half a year still a slog cannot be removed? In general I think ZFS pundits compliment the architecture and not the implementation. The big compliment I have for it is just that the ZFS piece is free software, even though large chunks of OpenSolaris aren't. That's a gigantic advantage, especially over NetApp, which probably has about as much long-term future as Lisp.

    jz> As a friend, and trusting your personal integrity, I ask you,
    jz> please, don't get mad, enjoy the open discussion.

Joseph, I don't see the problem and think it's fine to get excited so long as actual information comes out. There's nothing ad-hominem in the discussion yet, and being ordered not to get mad will make any normal person furious, especially if you base the order on ``trust'' and ``personal integrity'' - why bring up such things at all? I almost feel like you're baiting them! I know it's normal for sysadmins to be dry and menial, but it's still a technical discussion, so I hope it doesn't upset anyone because it's not boring.
Nicolas Williams
2008-Dec-15 22:12 UTC
[zfs-discuss] Split responsibility for data with ZFS
On Mon, Dec 15, 2008 at 05:04:03PM -0500, Miles Nordin wrote:
> As Tim said, the one-filesystem-per-user thing is not working out.

For NFSv3 clients that truncate MOUNT protocol answers (and v4 clients that still rely on the MOUNT protocol), yes, one-filesystem-per-user is a problem. For NFSv4 clients that support mirror mounts it's not a problem at all.

You're not required to go with one-filesystem-per-user though! That's only if you want to approximate quotas.

> O(1) for number of filesystems would be great but isn't there.

It is O(1) for filesystems (parts of the system could be parallelized more, but the on-disk data format is O(1) for filesystem creation and mounting, just like it is for snapshots and clones).

> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them: to boot with them, to scrub with them.

It's NOT O(N) to boot because of snapshots, nor to scrub. Scrub and resilver are O(N) where N is the amount of space used (as opposed to O(N) where N is the size of the volume, as for HW RAID and the like).

Nico
--
> Maybe the format allows unlimited O(1) snapshots, but it's at best
> O(1) to take them. All over the place it's probably O(n) or worse to
> _have_ them: to boot with them, to scrub with them.

Why would a scrub be O(n snapshots)? The O(n filesystems) effects reported from time to time in OpenSolaris seem due to code that iterates over them. The new ability to create huge numbers of them puts stress on assumptions valid in more traditional UNIX configurations, right?

--Toby
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> For NFSv4 clients that support mirror mounts it's not a problem
    nw> at all.

no, 3000 - 10,000 users is common for a large campus, and according to posters here, sometimes that many users actually can fit into the bandwidth of a single pool. But ZFS is not usable with that many filesystems: booting, 'zfs create', 'zfs list' all take hours. See the list archives. If the on-disk format is theoretically capable of achieving O(1) for number of filesystems, that's nice! It's just not an advantage over NetApp when it's not working yet. And, with any project, sometimes the last 5% of the work never gets done. So I'm making a desperate call to start basing punditry on experience rather than white papers and optimistic architecture documents.

OpenSolaris could have an advantage here - it's much easier to get experience with Solaris than NetApp because it's not (a) expensive and (b) locked behind a bunch of licenses, agreements and contracts, unshareable documentation, private censored web forums (NOW site), usw., so OpenSolaris punditry could one day become a lot more trustworthy than NetApp punditry.

    nw> You're not required to go with one-filesystem-per-user though!

It was pitched as an architectural advantage, but never fully delivered, and worse, used to justify removing traditional Unix quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS rather than an evolution, because of over-focusing on the virtues of the architecture rather than the delivered implementation. I don't use quotas and don't care, but it's a good example of broken advocacy.

    nw> It's NOT O(N) to boot because of snapshots, nor to scrub.

I think it is. Try it and see. :/ That was Tim's point as I read it.
Jeff claimed ``unlimited snapshots and clones'' as a ZFS advantage over NetApp, and Tim said open bugs or subtle limitations make the supposed advantage a fantasy, even a liability:

    ``"unlimited snapshots". Do I even need to begin to tell you what a
    horrible, HORRIBLE idea that is? "Why can't I get my space back?"
    Oh, just do a snapshot list and figure out which one is still
    holding the data. What? Your console locks up for 8 hours when you
    try to list out the snapshots? Huh... that's weird.''

...and to add to that, the snapshot list in ZFS does a better job of showing which one's using the space if there are fewer snapshots. With hundreds of snapshots, 'zfs list' shows a USED column full of zeroes - correctly, because you won't save any space by deleting just one; you have to delete a range of snapshots to get some space back. Of course that's not the same thing as being O(N), that's just annoying. And I don't know that it's really O(N) - it could be better or worse than O(N). It's not O(1), though, to boot, list, or scrub snapshots.

And if it's not O(1) because of some unnecessary high-level ioctl accidentally called in some obscure, abstract library by the ``simple'' user interface, it's still not O(1)! For practical users, that library could remain suboptimal for the next two years, and I don't want to spend those two years enduring a bunch of blogging about nonexistent O(1) snapshots just because the on-disk format theoretically doesn't impede delivering them.
John Kaitschuck
2008-Dec-16 19:22 UTC
[zfs-discuss] Split responsibility for data with ZFS
Miles Nordin wrote:

>     nw> You're not required to go with one-filesystem-per-user though!
>
> It was pitched as an architectural advantage, but never fully
> delivered, and worse, used to justify removing traditional Unix
> quotas. Consequently, quota-wise, ZFS becomes a regression w.r.t. UFS
> rather than an evolution, because of over-focusing on the virtues of
> the architecture rather than the delivered implementation.

Precisely. The issues with per-user quotas in ZFS were pointed out several years ago at FAST, when some of the Sun folks showed up to discuss ZFS in a late evening meeting. A file-system-per-user approach is not very viable when you have tens of thousands of users. It was my hope that Sun would have gotten that message by now, as I consider it one of the major problems with ZFS.
Gino
2009-Feb-07 13:54 UTC
[zfs-discuss] Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
> FYI, I'm working on a workaround for broken devices. As you note,
> some disks flat-out lie: you issue the synchronize-cache command,
> they say "got it, boss", yet the data is still not on stable storage.
> Why do they do this? Because "it performs better". Well, duh --
> you can make stuff *really* fast if it doesn't have to be correct.
>
> The uberblock ring buffer in ZFS gives us a way to cope with this,
> as long as we don't reuse freed blocks for a few transaction groups.
> The basic idea: if we can't read the pool starting from the most
> recent uberblock, then we should be able to use the one before it,
> or the one before that, etc., as long as we haven't yet reused any
> blocks that were freed in those earlier txgs. This allows us to
> use the normal load on the pool, plus the passage of time, as a
> displacement flush for disk caches that ignore the sync command.
>
> If we go back far enough in (txg) time, we will eventually find an
> uberblock all of whose dependent data blocks have made it to disk.
> I'll run tests with known-broken disks to determine how far back we
> need to go in practice -- I'll bet one txg is almost always enough.
>
> Jeff

Hi Jeff,
we just lost 2 pools on snv91. Any news about your workaround to recover pools by discarding the last txg?

Thanks,
gino
-- 
This message posted from opensolaris.org
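Jeff's basic idea - walk back through the uberblock ring until you reach a txg whose entire dependent tree made it to disk - might be sketched like this (a toy model; the names and the `intact_txgs` stand-in are assumptions for illustration, not ZFS source):

```python
# Toy model: a label holds a ring of uberblocks. Recovery tries them
# newest-txg-first until the block tree rooted at one of them verifies.

def tree_verifies(uberblock, intact_txgs):
    # Stand-in for "every block reachable from this uberblock passes its
    # checksum"; here we just consult a set of txgs known to be intact.
    return uberblock["txg"] in intact_txgs

def find_importable_uberblock(ring, intact_txgs):
    for ub in sorted(ring, key=lambda u: u["txg"], reverse=True):
        if tree_verifies(ub, intact_txgs):
            return ub
    return None   # no importable state found in the ring

ring = [{"txg": t} for t in (207158, 207159, 207160, 207161)]
# The drive's cache lied: blocks for the last two txgs never hit the media.
ub = find_importable_uberblock(ring, intact_txgs={207158, 207159})
assert ub["txg"] == 207159   # roll back two txgs and import from there
```

The "don't reuse freed blocks for a few transaction groups" condition is what makes the older uberblocks safe to use: their trees are only valid while none of the blocks they reference have been overwritten.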