Russel
2009-Jul-19 01:12 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Well, I have a 10TB (5x2TB) RAIDZ on VirtualBox. Got it all working on Windows XP and Windows 7, with SMB shares back to my PC - great, managed the impossible! Copied all my data over from a load of old external disks and sorted it; all in all 15 days' work (my holiday :-)). Used raw disks in VirtualBox, so performance was quite OK. Then OpenSolaris (2009.06) crashed as I tried to shut it down, and in the end I had to power off the VirtualBox VM. Rebooted, and I then get this:

# zpool status
  pool: array1
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        array1        FAULTED      0     0     1  corrupted data
          raidz1      ONLINE       0     0     6
            c9t0d0s0  ONLINE       0     0     0
            c9t1d0s0  ONLINE       0     0     0
            c9t2d0s0  ONLINE       0     0     0
            c9t3d0s0  ONLINE       0     0     0
            c9t4d0s0  ONLINE       0     0     0

  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7d0s0    ONLINE       0     0     0

errors: No known data errors
root@storage1:/rpool/rtsmb/lost#

==============================
After 9 hours of reading many blogs and postings I am about to give up. Here's some output that may hopefully allow someone to help me (Victor?)
==============================

# zdb -u array1
zdb: can't open array1: I/O error

# zdb -l /dev/dsk/c9t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14  name='array1'  state=0  txg=336051
    pool_guid=2240875695356292882  hostid=881445  hostname='storage1'
    top_guid=2550252815929083498  guid=1431843495093629813
    vdev_tree: type='raidz'  id=0  guid=2550252815929083498  nparity=1
        metaslab_array=23  metaslab_shift=36  ashift=9
        asize=9901403013120  is_log=0
    children[0]: type='disk'  id=0  guid=1431843495093629813  path='/dev/dsk/c9t0d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VB90d7ae97-6e68097a/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@0,0:a'  whole_disk=0  DTL=44
    children[1]: type='disk'  id=1  guid=1558447330187786228  path='/dev/dsk/c9t1d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VB315f2939-fdadfa14/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@1,0:a'  whole_disk=0  DTL=43
    children[2]: type='disk'  id=2  guid=10659506225279255914  path='/dev/dsk/c9t2d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VBd9514af5-8837e2f7/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@2,0:a'  whole_disk=0  DTL=42  degraded=1
    children[3]: type='disk'  id=3  guid=2558128054346170575  path='/dev/dsk/c9t3d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VBab7f62b2-3b162694/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@3,0:a'  whole_disk=0  DTL=41
    children[4]: type='disk'  id=4  guid=13991896528691960894  path='/dev/dsk/c9t4d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VB67b9775c-3ba02834/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@4,0:a'  whole_disk=0  DTL=40
--------------------------------------------
LABEL 1, LABEL 2, LABEL 3
--------------------------------------------
    (identical to LABEL 0)

#######################
Did the same for all five disks, all OK, but here's a grep of the txg fields from all of them:

# grep txg t*
t0: txg=336051
t0: txg=336051
t0: txg=336051
t0: txg=336051
t1: txg=319963
t1: txg=319963
t1: txg=319963
t1: txg=319963
t2: txg=336051
t2: txg=336051
t2: txg=336051
t2: txg=336051
t3: txg=319963
t3: txg=319963
t3: txg=319963
t3: txg=319963
t4: txg=319963
t4: txg=319963
t4: txg=319963
t4: txg=319963

#######################
Wrote a little script to go backwards until I found a txg which would give me an uberblock; eventually found this one below:

# zdb -u -t 335425 array1
Uberblock
        magic = 0000000000bab10c
        version = 14
        txg = 335425
        guid_sum = 16544206071174628188
        timestamp = 1247514285 UTC = Mon Jul 13 20:44:45 2009

# date
Sunday, 19 July 2009 01:58:18 BST

==============================================
In fact July 13 is OK; that's when I was last adding files and moving things about, so that's not a bad point to return to....

So how do I manage to "roll back" to txg = 335425 so I can hopefully get my 10TB back? Or is the answer to return to doing HW RAID under NTFS and Windows directly? I heard people at Sun may be working on a tool that can roll back to a txg - is it about? (Yes, I understand the issue that you can't roll back if a freed block has been re-used.) But I'd happily lose one of my 6GB files to get the other 6TB back. Thanks!!

PLEASE PLEASE, any help really appreciated.

Russel
-- 
This message posted from opensolaris.org
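[Not part of the original post: for readers wondering what the "little script" above might look like, here is a minimal, hypothetical sketch that simply asks zdb for an uberblock at successively older txg values until one can be read. The pool name and step size are assumptions, the starting txg is taken from the labels above, and it assumes zdb exits non-zero when it cannot open the pool at the requested txg.]

    #!/bin/sh
    # Hypothetical sketch: walk backwards from the newest txg seen in the labels
    # and stop at the first txg for which zdb can print a valid uberblock.
    POOL=array1
    TXG=336051        # newest txg reported by the labels
    STEP=1            # how far to step back on each attempt

    while [ "$TXG" -gt 0 ]; do
        if zdb -u -t "$TXG" "$POOL" >/dev/null 2>&1; then
            echo "usable uberblock found at txg $TXG"
            zdb -u -t "$TXG" "$POOL"
            break
        fi
        TXG=`expr "$TXG" - "$STEP"`
    done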
Orvar Korvar
2009-Jul-19 01:59 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Sorry to hear that, but you do know that VirtualBox is not really stable? VirtualBox does show some instability from time to time. Haven't you read the VirtualBox forums? I would advise against VirtualBox for keeping all your data in ZFS. I would use OpenSolaris without virtualization. I hope your problem gets fixed, though.
-- 
This message posted from opensolaris.org
Russel
2009-Jul-19 02:39 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Yes, you'll find my name all over VB at the moment, but I have found it to be stable (don't install the guest additions disk for Solaris!!, use 3.0.2, and for me WinXP 32-bit and OpenSolaris 2009.06 has been rock solid). It was (or seems to be) OpenSolaris that failed, with "extract_boot_list doesn't belong to 101", but no one on the OpenSolaris side seems interested in it even though others have reported it too; probably a rare issue.

But yeah, I hope Victor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, ZFS may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away, but I thought that since the Jan 2009 patches things were meant to be a lot better, especially with Sun using it in their storage servers now....
-- 
This message posted from opensolaris.org
Brent Jones
2009-Jul-19 07:00 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sat, Jul 18, 2009 at 7:39 PM, Russel <no-reply at opensolaris.org> wrote:

> Yes, you'll find my name all over VB at the moment, but I have found it to be stable (don't install the guest additions disk for Solaris!!, use 3.0.2, and for me WinXP 32-bit and OpenSolaris 2009.06 has been rock solid). It was (or seems to be) OpenSolaris that failed, with "extract_boot_list doesn't belong to 101", but no one on the OpenSolaris side seems interested in it even though others have reported it too; probably a rare issue.
>
> But yeah, I hope Victor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, ZFS may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away, but I thought that since the Jan 2009 patches things were meant to be a lot better, especially with Sun using it in their storage servers now....

No offense, but you trusted 10TB of important data, running in OpenSolaris from inside VirtualBox (not stable) on top of Windows XP (arguably not stable, especially for production) on probably consumer grade hardware with unknown support for any of the above products?

I'd like to say this was an unfortunate circumstance, but there are many levels of fail here, and to blame ZFS seems misplaced, and the subject on this thread especially inflammatory.

-- 
Brent Jones
brent at servuhome.net
Miles Nordin
2009-Jul-19 08:24 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
>>>>> "bj" == Brent Jones <brent at servuhome.net> writes:

    bj> many levels of fail here,

pft.  Virtualbox isn't unstable in any of my experience.  It doesn't by default pass cache flushes from guest to host unless you set

  VBoxManage setextradata VMNAME
    "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0

however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter.  Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot).

    bj> to blame ZFS seems misplaced,

-1

The fact that it's a known problem doesn't make it not a problem.

    bj> the subject on this thread especially inflammatory.

so what?
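[Not part of the original message: a minimal sketch of applying the setting above to each LUN, since the "[x]" is a per-LUN index. The VM name and the number of LUNs are assumptions; "piix3ide" is the IDE controller node, and a SATA/AHCI controller reportedly uses "ahci" in its place. Run it while the VM is powered off; the value 0 means flushes are passed through rather than ignored.]

    VM="storage1-vm"                      # placeholder VM name
    for lun in 0 1 2 3; do
        VBoxManage setextradata "$VM" \
            "VBoxInternal/Devices/piix3ide/0/LUN#${lun}/Config/IgnoreFlush" 0
    done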
Markus Kovero
2009-Jul-19 08:35 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
I would be interested in how to roll back to certain txg points in case of disaster; that was what Russel was after anyway.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Miles Nordin
Sent: 19. heinäkuuta 2009 11:24
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
dick hoogendijk
2009-Jul-19 08:46 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 00:00:06 -0700 Brent Jones <brent at servuhome.net> wrote:

> No offense, but you trusted 10TB of important data, running in OpenSolaris from inside VirtualBox (not stable) on top of Windows XP (arguably not stable, especially for production) on probably consumer grade hardware with unknown support for any of the above products?

Running this kind of setup absolutely can give you NO guarantees at all. Virtualisation, OSOL/ZFS on WinXP. It's nice to play with and see it "working", but would I TRUST precious data to it? No way!

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118
+ All that's really worth doing is what we do for others (Lewis Carrol)
Ross
2009-Jul-19 08:48 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
While I agree with Brent, I think this is something that should be stressed in the ZFS documentation. Those of us with long-term experience of ZFS know that it's really designed to work with hardware meeting quite specific requirements. Unfortunately, that isn't documented anywhere, and more and more people are being bitten by quite severe data loss by virtue of the fact that ZFS is far less forgiving than other filesystems when data hasn't been properly written to disk.

As far as I can see, the ZFS Administration Guide is sorely lacking in any warning that you are risking data loss if you run on consumer grade hardware. In fact, the requirements section states nothing more than:

"ZFS Hardware and Software Requirements and Recommendations

Make sure you review the following hardware and software requirements and recommendations before attempting to use the ZFS software:

 * A SPARC® or x86 system that is running the Solaris 10 6/06 release or later release.
 * The minimum disk size is 128 Mbytes. The minimum amount of disk space required for a storage pool is approximately 64 Mbytes.
 * Currently, the minimum amount of memory recommended to install a Solaris system is 768 Mbytes. However, for good ZFS performance, at least one Gbyte or more of memory is recommended.
 * If you create a mirrored disk configuration, multiple controllers are recommended."
-- 
This message posted from opensolaris.org
dick hoogendijk
2009-Jul-19 09:00 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 01:48:40 PDT Ross <no-reply at opensolaris.org> wrote:

> As far as I can see, the ZFS Administration Guide is sorely lacking in any warning that you are risking data loss if you run on consumer grade hardware.

And yet, ZFS is not only for NON-consumer grade hardware, is it? The fact that many, many people run "normal" consumer hardware does not rule them out for ZFS, does it? The "best filesystem ever", the "end of all other filesystems" would be nothing more than a dream if that was true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware (i.e. non-ECC memory should work fine!) / mirroring is a -must-!

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118
+ All that's really worth doing is what we do for others (Lewis Carrol)
Russel
2009-Jul-19 11:12 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Guys, guys, please chill...

First, thanks for the info about the VirtualBox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)), as this was not clear to me. I use VB like others use VMware etc. to run Solaris because it's the ONLY way I can, as I can't get the drivers for most of the H/W out there in hobby land, so a virtualised system allows me to use my Silicon Image chipset to link at 3Gb/s to the SATA multi-port array I got for £160.

=== anyway, let's stop there or we will be off topic even more
=== just thought you should know why I did it; even BSD does
=== not have a driver or I would have gone there to get ZFS :-)

Anyway... my view on ZFS was quite simple: it looked after bit rot, it did self healing, and, most importantly for me as I'm running it on consumer kit, it seemed to avoid the RAID-5 write hole in the case of a crash! So if stuff falls over, e.g. Windows, VB, OpenSolaris etc., I would not suffer unknown data corruption and would just lose that one write, which was fine as the thing crashed..... so for a flaky environment ZFS sounds even more like the one you want, LOL.

Loved all the technical stuff; I have had rather good deep dives from Sun's best here in the UK/Europe (I'm lucky, as I was a very early employee of Sun, and now work for a major firm :-)). Liked the idea that you can build your own storage server etc. etc. I knew most bugs, as I saw them, were fixed in the Jan 09 patch.....

I THOUGHT/ASSUMED (yes, you should never :-()) that, given everything else, it would be blatantly obvious that when you try to mount a zpool the thing would either roll back to the last consistent state (that includes the uberblock and metadata, thank you) or have a tool like fsck which lets you do it. BUT, you know, once you start rolling back (just like clearing inodes) you're not going to be in such a good place and you'd need to scrub or something, even if it says these files are now corrupt. FINE, but then I DIDN'T lose the filesystem, just a file or two. We should never lose the filesystem. But in ZFS land it sounds like that is the most likely data-loss fault we have.

SUMMARY
=======
What I see here is the lack of the (not needed, lol) fsck-type tool. WELL, WE DO NEED it; we need to be able to roll back and recover and repair. I have lost data stored on large Sun 6790 arrays and now on my home system.

So PLEASE, anyone got a beta version of a tool to perform roll back?

Russel
(It will take me 10 days to pull my data off my little drives again, and 5 days to format with RAID-5 (H/W) and NTFS, not what I want, nor the RAID-5 hole :-))
-- 
This message posted from opensolaris.org
Ross
2009-Jul-19 12:55 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
From the experience myself and others have had, and Sun's approach with their Amber Road storage (FISHWORKS - fully integrated *hardware* and software), my feeling is very much that ZFS was designed by Sun to run on Sun's own hardware, and as such, they were able to make certain assumptions in their design.

ZFS was never designed to run on consumer hardware; it makes assumptions that devices and drivers will always be well behaved when errors occur, and in general it is quite fragile if you're running it on the wrong system. On the right hardware, I've no doubt that ZFS is incredibly reliable and easy to manage. On the wrong hardware, disk errors can hang your entire system, hot swap can down your pool, and a power cut or other error can render your entire pool inaccessible.

The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on.
-- 
This message posted from opensolaris.org
Ross
2009-Jul-19 13:05 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Heh, yes, I assumed similar things Russel. I also assumed that a faulty disk in a raid-z set wouldn't hang my entire pool indefinitely, that hot plugging a drive wouldn't reboot Solaris, and that my pool would continue working after I disconnected one half of an iSCSI mirror. I also, like yourself, assumed that if ZFS is using copy on write, then even after a really nasty crash the vast majority of my data would be accessible. And I also believed that when I had disconnected every drive from a ZFS pool, ZFS wouldn't accept writes to it any more...

Unfortunately, all of these assumptions turned out to be false. Learning ZFS has been a painful experience. I still like it, but I am very aware of its limitations, and am cautious in how I apply it these days.
-- 
This message posted from opensolaris.org
It's clear from some threads on this list that it IS possible to roll back a zpool to a previous state, and I seem to even remember reading that someone was working on a tool or tools in that direction. Is that correct - is it possible to manually roll back a zpool for crash recovery purposes, if you've got enough clue/knowledge/experience on your side in regards to the right tools?

And the follow-up question - how do I learn how to do that? Is there documentation somewhere that instructs one on the tools used for low-level ZFS troubleshooting and rollback, and how to come back from the land of crash-then-fail-to-import? Is it the troubleshooting document? I'm happy to get a RTFM - please just point me at the right manual or Sun doc too :)

Thanks!
Brian
Bob Friesenhahn
2009-Jul-19 14:47 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Ross wrote:

> The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on.

To clarify: "The success of any filesystem implementation is *very* dependent on the hardware you choose to run it on."

ZFS requires that the hardware cache sync works and is respected. Without taking advantage of the drive caches, zfs would be considerably less performant.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
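[Not part of the original message: the cache-flush behaviour Bob describes is also visible as a Solaris kernel tunable. The "evil tuning" setting below disables ZFS's cache flush requests entirely and is shown only to illustrate that the flushes are real and relied upon; setting it on hardware without non-volatile or battery-backed caches risks exactly the kind of loss discussed in this thread.]

    # /etc/system (takes effect after reboot); do NOT do this unless every
    # device in the pool has a non-volatile or battery-backed write cache:
    set zfs:zfs_nocacheflush = 1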
Toby Thain
2009-Jul-19 15:16 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 19-Jul-09, at 7:12 AM, Russel wrote:

> Guys, guys, please chill...
>
> First, thanks for the info about the VirtualBox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)), as this was not clear to me.

I posted about that insane default, six months ago. Obviously ZFS isn't the only subsystem that this breaks.

http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0

> I use VB like others use VMware etc. to run Solaris because it's the ONLY way I can,

Convenience always has a price.

--Toby

...
Ross
2009-Jul-19 15:44 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
That's only one element of it Bob. ZFS also needs devices to fail quickly and in a predictable manner. A consumer grade hard disk could lock up your entire pool as it fails. The kit Sun supplies is more likely to fail in a manner ZFS can cope with.
-- 
This message posted from opensolaris.org
Frank Middleton
2009-Jul-19 21:31 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 07/19/09 05:00 AM, dick hoogendijk wrote:

> (i.e. non-ECC memory should work fine!) / mirroring is a -must-!

Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic):

http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction

"Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory."

That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one per year per user on average. Some get more, some get less. That sounds like pretty bad odds...

"In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that. IMO you'd be nuts to run ZFS on a machine without ECC unless you don't care about losing some or all of the data. Having said that, we have yet to lose an entire pool - this is pretty hard to do! I should add that since setting copies=2 and forcing the files to be copied, there have been no more unrecoverable errors on a particularly low-end machine that was plagued with them even with mirrors (and a UPS with a bad battery :-) ).

On 19-Jul-09, at 7:12 AM, Russel wrote:

> I use VB like others use VMware etc. to run Solaris because it's the ONLY way I can,

Given that PC hardware is so cheap these days (used SPARCs even cheaper), surely it makes far more sense to build a nice robust OSOL/ZFS based file server *with* ECC. Then you can use iSCSI for your VirtualBox VMs and solve all kinds of interesting problems. But you still need to do backups. My solution for that is to replicate the server and back up to it using zfs send/recv. If a disk fails, you switch to the backup and there are no worries about the second disk of the mirror failing during a resilver. A small price to pay for peace of mind.
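[Not part of the original message: a minimal sketch of the copies=2 setup mentioned above; the dataset name is a placeholder. The property only affects blocks written after it is set, which is why existing files have to be rewritten ("forcing the files to be copied") to gain the second copy.]

    # Hypothetical example: keep two copies of every block in this dataset.
    zfs set copies=2 tank/photos

    # Existing files must be rewritten to pick up the extra copy, e.g. crudely
    # (and assuming enough free space for the temporary duplicate):
    cp /tank/photos/album.tar /tank/photos/album.tar.new && \
        mv /tank/photos/album.tar.new /tank/photos/album.tar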
Bob Friesenhahn
2009-Jul-19 21:39 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Frank Middleton wrote:

> Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic):
>
> http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
>
> "Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory."
>
> That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one per year per user on average. Some get more, some get less. That sounds like pretty bad odds...

I fail to see anything zfs-specific in the above. It does not have anything more to do with zfs than it does with any other software running on the system.

I do have a couple of Windows PCs here without ECC, but they were gifts from other people, and not hardware that I purchased, and not used for any critical application.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Miles Nordin
2009-Jul-19 21:45 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
>>>>> "r" == Ross <no-reply at opensolaris.org> writes:
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:

     r> ZFS was never designed to run on consumer hardware,

this is markedroid garbage, as well as post-facto apologetics.

Don't lower the bar.  Don't blame the victim.

    tt> I posted about that insane default, six months ago.  Obviously
    tt> ZFS isn't the only subsystem that this breaks.

yes, but remember, in this case the host did not crash, so the insane default should be irrelevant.
Richard Elling
2009-Jul-19 22:10 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Frank Middleton wrote:

> On 07/19/09 05:00 AM, dick hoogendijk wrote:
>
>> (i.e. non-ECC memory should work fine!) / mirroring is a -must-!
>
> Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic):
>
> http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
>
> "Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory."
>
> That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one per year per user on average. Some get more, some get less. That sounds like pretty bad odds...

Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Solaris scrubs memory, with a 12-hour cycle time, so memory does not remain untouched for a month. For high-end systems, memory scrubs are also performed by the memory controllers.

Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-)

> "In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that.

Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. If you do your own systems engineering, then add this to your (hopefully long) checklist.
 -- richard
Bob Friesenhahn
2009-Jul-19 23:02 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Miles Nordin wrote:

>>>>>> "r" == Ross <no-reply at opensolaris.org> writes:
>>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>
>     r> ZFS was never designed to run on consumer hardware,
>
> this is markedroid garbage, as well as post-facto apologetics.
>
> Don't lower the bar.  Don't blame the victim.

I think that the standard disclaimer "Always use protection" applies here. Victims who do not use protection should assume substantial guilt for their subsequent woes.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gavin Maltby
2009-Jul-20 00:13 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
dick hoogendijk wrote:

> true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non-ECC memory should work fine!) / mirroring is a -must-!

No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.

Gavin
David Magda
2009-Jul-20 00:29 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Jul 19, 2009, at 20:13, Gavin Maltby wrote:

> No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.

Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules.

Come on.

It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.
Bob Friesenhahn
2009-Jul-20 00:41 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, David Magda wrote:

> Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules.

The MacBooks and iMacs are only used as an execution environment for the Safari web browser. ECC is only necessary for computers which save data somewhere, so the MacBook and iMac do not need ECC.

Regardless (in order to stay on topic), it is worth mentioning that the 10TB of data lost to a failed pool was not lost due to lack of ECC. It was lost because VirtualBox intentionally broke the guest operating system.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gavin Maltby
2009-Jul-20 00:58 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Hi,

David Magda wrote:

> On Jul 19, 2009, at 20:13, Gavin Maltby wrote:
>
>> No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.
>
> Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC.

If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop.

> If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules.
>
> Come on.
>
> It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.

On a laptop zfs is a huge amount safer than other filesystems and still has all the great usability features etc., but zfs does not magically turn your laptop into a server-grade system. What you refer to as a tinfoil hat is an essential component of any server that is housing business-vital data; obviously it is just a nice-to-have on a laptop, but recognise what you're losing.

Gavin
i'm pretty sure you're just looking for the zfs rollback command. a quick google brings up a lot of information, and also man zfs.

check out this page:
http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch06.html

On Sun, Jul 19, 2009 at 10:29 AM, Brian Wilson <bfwilson at doit.wisc.edu> wrote:

> It's clear from some threads on this list that it IS possible to roll back a zpool to a previous state, and I seem to even remember reading that someone was working on a tool or tools in that direction. Is that correct - is it possible to manually roll back a zpool for crash recovery purposes, if you've got enough clue/knowledge/experience on your side in regards to the right tools?
>
> And the follow-up question - how do I learn how to do that? Is there documentation somewhere that instructs one on the tools used for low-level ZFS troubleshooting and rollback, and how to come back from the land of crash-then-fail-to-import? Is it the troubleshooting document? I'm happy to get a RTFM - please just point me at the right manual or Sun doc too :)
>
> Thanks!
> Brian
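[Not part of the original message: a minimal sketch of the dataset-level rollback Thomas is referring to; the dataset and snapshot names are placeholders. As the next reply points out, this rolls a filesystem back to one of its snapshots and does not help with a pool that will not import.]

    # Hypothetical example
    zfs list -t snapshot -r tank/data     # see which snapshots exist
    zfs rollback tank/data@monday         # discard changes made after @monday
    zfs rollback -r tank/data@monday      # same, but also destroys any snapshots
                                          # newer than @monday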
Richard Elling
2009-Jul-20 01:50 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Gavin Maltby wrote:

> David Magda wrote:
>> Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC.
>
> If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop.

I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet.

The bigger problem is that not all of Intel's CPU products do ECC... the embedded and server models do, but it is the low-margin PC market that is willing to make that cost trade-off. If people demanded ECC, like they do in the embedded and server markets, then we wouldn't be having this conversation.
 -- richard
Andre van Eyssen
2009-Jul-20 01:52 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Richard Elling wrote:

> I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet.

Richard, I think the point that Gavin was trying to make is that a sensible business would commit their valuable data back to a fileserver running on solid hardware with a solid operating system, rather than relying on their single-spindle laptops to store their valuable content - not making any statement on the actual desktop platform.

For example, I use a mixture of Windows, MacOS, Solaris and OpenBSD around here, but all the valuable data is stored on a zpool located on a SPARC server (obviously with ECC RAM) with UPS power. With Windows around, I like the fact that I don't need to think twice before reinstalling those machines.

Andre.

-- 
Andre van Eyssen.
mail: andre at purplecow.org          jabber: andre at interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix             http://pix.purplecow.org
Thomas Burgess wrote:

> On Sun, Jul 19, 2009 at 10:29 AM, Brian Wilson <bfwilson at doit.wisc.edu <mailto:bfwilson at doit.wisc.edu>> wrote:
>
>     It's clear from some threads on this list that it IS possible to roll back a zpool to a previous state, and I seem to even remember reading that someone was working on a tool or tools in that direction. Is that correct - is it possible to manually roll back a zpool for crash recovery purposes, if you've got enough clue/knowledge/experience on your side in regards to the right tools?
>
> i'm pretty sure you're just looking for the zfs rollback command.

That rolls back a filesystem, not the state of a corrupted pool.

-- 
Ian.
I don't know the details Brian, so I was waiting to see if anybody remembered more, but that doesn't seem to be the case.

There is a way to roll back pools. Victor has been very helpful to several people, and in one of the threads where he managed to recover a pool, he posted a write-up of the technique he used. I don't have a link, I'm afraid. I believe it involves using zdb and walking through the pool to find the copies of the uberblock.

And I believe the person working on recovery tools was Jeff Bonwick (although I may be wrong). Again, that was from a thread on here talking about pool recovery. I've no idea how much progress has been made, but with no announcements, I doubt there is anything available that will help you.
-- 
This message posted from opensolaris.org
Hi,

Yes, I read those threads. Wow, dd directly over blocks at some offset point..... I was hoping some tools might have been created by now..... hoping.

Russel
-- 
This message posted from opensolaris.org
Russel
2009-Jul-20 10:26 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Well I did have a UPS on the machine :-)

but the machine hung and I had to power it off... (yep, it was virtual, but that happens on direct HW too, and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but it will take ages (10 days) to restore).
-- 
This message posted from opensolaris.org
That's the stuff. I think that is probably your best bet at the moment. I've not seen even a mention of an actual tool to do that, and I'd be surprised if we saw one this side of Christmas.
-- 
This message posted from opensolaris.org
Hi.

Hm, what are you actually referring to?

On Mon, Jul 20, 2009 at 13:45, Ross <no-reply at opensolaris.org> wrote:

> That's the stuff. I think that is probably your best bet at the moment. I've not seen even a mention of an actual tool to do that, and I'd be surprised if we saw one this side of Christmas.

Alexander
-- 
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Sent from Winterthur, ZH, Switzerland
> Hm, what are you actually referring to?

Sorry, I'm not subscribed to this list, so I just replied on the forum. This segment of the discussion is what I'm replying to:
http://www.opensolaris.org/jive/message.jspa?messageID=397730#397730
-- 
This message posted from opensolaris.org
Rob Logan
2009-Jul-20 13:45 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
> the machine hung and I had to power it off.

Kinda getting off the "zpool import --txg -3" request, but "hangs" are exceptionally rare and usually a RAM or other hardware issue; Solaris usually abends on software faults.

root@pdm # uptime
  9:33am  up 1116 day(s), 21:12,  1 user,  load average: 0.07, 0.05, 0.05
root@pdm # date
Mon Jul 20 09:33:07 EDT 2009
root@pdm # uname -a
SunOS pdm 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-250

Rob
Russel
2009-Jul-20 14:45 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
OK.

So do we have a "zpool import --txg 56574 mypoolname", or help to do it (a script?)

Russel
-- 
This message posted from opensolaris.org
Toby Thain
2009-Jul-20 15:06 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 20-Jul-09, at 6:26 AM, Russel wrote:

> Well I did have a UPS on the machine :-)
>
> but the machine hung and I had to power it off... (yep, it was virtual, but that happens on direct HW too,

As has been discussed here before, the failure modes are different, as the layer stack from filesystem to disk is obviously very different.

--Toby

> and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but it will take ages (10 days) to restore).
Frank Middleton
2009-Jul-20 19:48 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 07/19/09 06:10 PM, Richard Elling wrote:

> Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds.

Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in the queue all the time. You may have a buffer in memory for 100uS, but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror, then you are toast. Good thing this never happens, right :-)

> Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-)

Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-).

>> Sun doesn't even sell machines without ECC. There's a reason for that.
>
> Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems.

Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in the documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that.

On 07/19/09 08:29 PM, David Magda wrote:

> It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.

But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file, let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread.

Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even Stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially.

There have been a couple of (to me) unconvincing explanations of how this pool was lost. Surely if there is a mechanism whereby unflushed I/Os can cause fatal metadata corruption, this should be a high priority bug, since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug.
Ross
2009-Jul-21 07:16 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
My understanding of the root cause of these issues is that the vast majority are happening with consumer grade hardware that is reporting to ZFS that writes have succeeded, when in fact they are still in the cache. When that happens, ZFS believes the data is safely written, but a power cut or crash can cause severe problems with the pool.

This is (I think) the reason for comments about this being a systems engineering, not design, problem - ZFS assumes the disks are telling the truth and has been designed this way. It is up to the administrator to engineer the server from components that accurately report their status.

However, while the majority of these cases are with consumer hardware, the BBC have reported that they hit this problem using Sun T2000 servers and commodity SATA drives, so unless somebody from Sun can say otherwise, I feel that there is still some risk of this occurring on Sun hardware.

I feel the ZFS marketing and documentation is very misleading in that it completely ignores the issue of your entire pool being at risk unless you are careful about the hardware used, leading to a lot of stories like this from enthusiasts and early adopters. I also believe ZFS needs recovery tools as a matter of urgency, to protect its reputation if nothing else.
-- 
This message posted from opensolaris.org
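[Not part of the original message: one hedged way to at least inspect the drive's volatile write cache on Solaris is the expert menu of format(1M); whether the cache menu appears depends on the drive and driver (typically sd), and disabling the cache trades performance for safety on drives that do not honour flush requests. This is an illustrative interactive transcript under those assumptions, not a recommendation.]

    # format -e
    #   (select the disk)
    #   format> cache
    #   cache> write_cache
    #   write_cache> display      # show whether the volatile write cache is on
    #   write_cache> disable      # optionally turn it off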
George Wilson
2009-Jul-21 15:53 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Russel wrote:

> OK.
>
> So do we have a "zpool import --txg 56574 mypoolname", or help to do it (a script?)
>
> Russel

We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work:

6667683 need a way to rollback to an uberblock from a previous txg

Thanks,
George
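[Not part of George's message: for readers finding this thread later, the work tracked by this CR is generally understood to have shipped as the pool-recovery options of zpool import. A hedged sketch of their use on a ZFS build that includes them follows; the pool name is a placeholder, and the 2009.06 bits discussed in this thread do not have these flags.]

    zpool import -F array1      # try to discard the last few txgs so the
                                # pool becomes importable again
    zpool import -Fn array1     # dry run: report whether -F would succeed
                                # and roughly how much would be discarded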
Richard Elling
2009-Jul-21 17:21 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Jul 20, 2009, at 12:48 PM, Frank Middleton wrote:

> On 07/19/09 06:10 PM, Richard Elling wrote:
>
>> Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds.
>
> Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in the queue all the time. You may have a buffer in memory for 100uS, but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror, then you are toast. Good thing this never happens, right :-)

I never win the lottery either :-)

>> Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems.
>
> Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in the documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that.

It is a systems engineering problem because ZFS is working as designed and VirtualBox is also working as designed. If you file a bug against either, the bug should be closed as "not a defect." That means the responsibility for making sure that the two interoperate lies at the systems level -- where systems engineers do their job.

For an analogy, guns don't kill people, bullets kill people. The gun is just a platform for directing bullets. If you shoot yourself in the foot, then the failure is not with the gun or bullet, it is one layer above -- in the system. It hurts when you do that, so don't do that.

> There have been a couple of (to me) unconvincing explanations of how this pool was lost.

It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are not consistent.

> Surely if there is a mechanism whereby unflushed I/Os can cause fatal metadata corruption, this should be a high priority bug, since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug.

It isn't a bug in ZFS or VirtualBox. They work as designed. As has been mentioned before, many times, the recovery of the data is now a forensics exercise. All ZFS knows is that the consistency is broken, and it is implementing the policy that consistency is more important than automated access.
 -- richard
Alexander Skwar
2009-Jul-21 18:32 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Hi.

Good to know! But how do we deal with that on older systems, which don't have the patch applied, once it is out?

Thanks,
Alexander

On Tuesday, July 21, 2009, George Wilson <George.Wilson at sun.com> wrote:

> We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work:
>
> 6667683 need a way to rollback to an uberblock from a previous txg
>
> Thanks,
> George

-- 
Alexander
-- 
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Russel
2009-Jul-22 11:09 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Thanks for the feedback, George. I hope we get the tools soon.

At home I have now blown the ZFS away and am creating a HW RAID-5
set :-( Hopefully in the future, when the tools are there, I will
return to ZFS.

To All: The ECC discussion was very interesting, as I had never
considered it that way! I will be buying ECC memory for my home
machine!!

Again, many many thanks to all who have replied; it has been a very
interesting and informative discussion for me.

Best regards
Russel
-- 
This message posted from opensolaris.org
Anon Y Mous
2009-Jul-22 14:31 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
I don't mean to be offensive Russel, but if you do ever return to ZFS,
please promise me that you will never, ever, EVER run it virtualized
on top of NTFS (a.k.a. worst file system ever) in a production
environment. Microsoft Windows is a horribly unreliable operating
system in situations where things like protecting against data
corruption are important. Microsoft knows this, which is why they
secretly run much of Microsoft.com, their www advertisement campaigns,
and the Microsoft Updates web sites on Akamai Linux in the data center
across the hall from the data center where I work... and the
invulnerable file system behind Microsoft's "cloud" that secretly runs
on Akamai's content delivery system is none other than ZFS's long lost
brother... Netapp WAFL!

The first time I started to catch on to this was when the Project
Mojave advertisement campaign started and lots of people were nmap
scanning the site and noticing that it was running Apache on Linux:

http://openmanifesto.blogspot.com/2008/07/mss-blunder-with-mojave-experiment-uses.html

Eventually Microsoft realized they messed up and started to edit the
header strings like they usually do to make it look like IIS:

https://lists.mayfirst.org/pipermail/nosi-discussion/2008-August/000417.html

although you could still figure it out if you were smart enough by
using telnet like this:

http://news.netcraft.com/archives/2003/08/17/wwwmicrosoftcom_runs_linux_up_to_a_point_.html

but the cat was already out of the bag. I did some investigating over
a year ago and talked to some of my long time friends who were senior
Akamai techs, and one of them eventually gave me a guided tour after
hours, gave me a quick look at the Netapp WAFL setup, and explained
how Microsoft Windows updates actually work. Very cool! These Akamai
guys are like the "Wizard of Oz" for the Internet, running everything
behind the curtains there. Whenever Microsoft Updates are down, tell
an Akamai tech! Everything will start working fine within 5 minutes of
you telling them (sure beats calling in to Microsoft Tech Support in
Mumbai India). Is apple.com or itunes running slow? Tell an Akamai
tech and it'll be fixed immediately. Cnn.com down? Jcpenny.com down?
Yup. Tell an Akamai tech and it comes right back up. It's very rare
that they have a serious problem like this one:

http://www.theregister.co.uk/2004/06/15/akamai_goes_postal/

in which case 25% of the internet (including google, yahoo, and lycos)
usually goes down with them.

So my question to you Russel is: if Microsoft can't even rely on NTFS
to run their own important infrastructure (they obviously have a
Netapp WAFL dependency), what hope can your 10TB pool possibly have?
What you're doing is the equivalent of building a 100 story tall
skyscraper out of titanium and then making the bottom-most ground
floor and basement foundation out of glue and popsicle sticks, and
then when the entire building starts to collapse, you call in to the
titanium metal fabrication corporation, blame them for the problem,
and then tell them that they are obligated to help you glue your
popsicle sticks back together because it's all their fault that the
building collapsed! Not very fair IMHO.

In the future, keep in mind that (as far as I understand it) the only
way to get the 100% full benefits of ZFS checksum protection is to run
it on bare metal with no virtualization. If you're going to virtualize
something, virtualize Microsoft Windows and Linux inside of
OpenSolaris.
I'm running ZFS in production with my OpenSolaris operating system
zpool mirrored three times over on 3 different drives, and I've never
had a problem with it. I even created a few simulated power outages to
test my setup, and pulling the plug while twelve different users were
uploading multiple files into 12 different Solaris zones definitely
didn't faze the zpool at all. Just boots right back up and everything
works. The thing is though, it only seems to work when you're not
running it virtualized on top of a closed-source proprietary file
system that's made out of glue and popsicle sticks.

Just my 2 cents. I could be wrong though.
-- 
This message posted from opensolaris.org
George Wilson
2009-Jul-22 15:55 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Once these bits are available in OpenSolaris then users will be able
to upgrade rather easily. This would allow you to take a liveCD
running these bits and recover older pools.

Do you currently have a pool which needs recovery?

Thanks,
George

Alexander Skwar wrote:

> Hi.
>
> Good to know!
>
> But how do we deal with that on older systems, which don't have the
> patch applied, once it is out?
>
> Thanks, Alexander
>
> On Tuesday, July 21, 2009, George Wilson <George.Wilson at sun.com> wrote:
>
>> Russel wrote:
>>
>> OK.
>>
>> So do we have a zpool import --txg 56574 mypoolname
>> or help to do it (script?)
>>
>> Russel
>>
>> We are working on the pool rollback mechanism and hope to have that
>> soon. The ZFS team recognizes that not all hardware is created equal
>> and thus the need for this mechanism. We are using the following CR
>> as the tracker for this work:
>>
>> 6667683 need a way to rollback to an uberblock from a previous txg
>>
>> Thanks,
>> George
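For readers wondering what "rollback to an uberblock from a previous
txg" means mechanically: each vdev label already keeps a small ring of
recent uberblocks, so recovery can deliberately import from an older
transaction group whose block tree is still intact. The sketch below is
a conceptual illustration in Python only, not ZFS source code and not
the eventual zpool syntax; the names are invented.

from collections import namedtuple

Uberblock = namedtuple("Uberblock", ["txg", "timestamp", "intact"])

def pick_import_uberblock(ring, max_txg=None):
    """Return the newest usable uberblock, optionally capped at max_txg."""
    usable = [ub for ub in ring
              if ub.intact and (max_txg is None or ub.txg <= max_txg)]
    if not usable:
        raise RuntimeError("no usable uberblock found")
    return max(usable, key=lambda ub: ub.txg)

ring = [Uberblock(336049, 1000, True),
        Uberblock(336050, 1005, True),
        Uberblock(336051, 1010, False)]   # newest txg points at lost writes

print(pick_import_uberblock(ring))          # falls back to txg 336050
print(pick_import_uberblock(ring, 336049))  # explicit rollback target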
Mario Goebbels
2009-Jul-22 16:02 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> To All : The ECC discussion was very interesting as I had never
> considered it that way! I will be buying ECC memory for my home
> machine!!

You have to make sure your mainboard, chipset and/or CPU support it,
otherwise any ECC modules will just work like regular modules. The
mainboard needs to have the necessary lanes to either the chipset that
supports ECC (in case of Intel) or the CPU (in case of AMD).

I think all Xeon chipsets do ECC, as do various consumer ones (I only
know of X38/X48, there's also some 9xx ones that do). For consumer
boards, it's hard to figure out which actually do support it. I have
an X48-DQ6 mainboard from Gigabyte, which does it.

Regards,
-mg
Miles Nordin
2009-Jul-22 19:47 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
>>>>> "aym" == Anon Y Mous <no-reply at opensolaris.org> writes:
>>>>> "mg" == Mario Goebbels <me at tomservo.cc> writes:

   aym> I don't mean to be offensive Russel, but if you do ever return
   aym> to ZFS, please promise me that you will never, ever, EVER run
   aym> it virtualized on top of NTFS

he said he was using raw disk devices IIRC.

and once again, the host did not crash, only the guest, so even if it
were NTFS rather than raw disks, the integrity characteristics of NTFS
would have been irrelevant since the host was always shut down cleanly.

   aym> the only way to get the 100% full benefits of ZFS checksum
   aym> protection is to run it on bare metal with no virtualization.

bullshit. That makes no sense at all. First, why should virtualization
have anything to do with checksums? Obviously checksums go straight
through it. The suspected problem lies elsewhere. Second,
virtualization is serious business. Problems need to be found and
fixed. At this point, you've become so aggressive with that broom,
anyone can see there's obviously an elephant under the rug.

   aym> I'm running ZFS in production with my OpenSolaris
   aym> operating system zpool mirrored three times over on 3
   aym> different drives, and I've never had a problem with it.

The idea of collecting other people's problem reports is to figure out
what's causing problems before one hits you. I hear this type of thing
all the time: ``The number of problems I've had is so close to zero,
it is zero, so by extrapolation nobody else can be having any real
problems because if I scale out my own experience the expected number
of problems in the entire world is zero.'' ---wtf? clearly bogus!

    mg> You have to make sure your mainboard, chipset and/or CPU
    mg> support it, otherwise any ECC modules will just work like
    mg> regular modules.

also scrubbing is sometimes enabled separately from plain ECC. Without
scrubbing the ECC can still correct errors, but won't do so until some
actual thread reads the flipped bit, which is probably okay but
<shrug>.

I vaguely remember something about an idle scrub thread in Solaris
where the CPU itself does the scrubbing? but at least on AMD platforms,
the memory and cache controllers will do scrubbing themselves using
only memory bandwidth, without using CPU cycles, if you ask.

On AMD you can use this script on Linux to control scrub speed and ECC
enablement if your BIOS does not support it. The script does appear to
do something on Phenom II, but I haven't tried the 10-ohm resistor
test the author suggests. I think it should be adaptable to Solaris.

http://hyvatti.iki.fi/~jaakko/sw/

now if only we could get 4GB ECC unbuffered DDR3 for similar prices to
non-ECC. :(
Frank Middleton
2009-Jul-23 23:20 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/21/09 01:21 PM, Richard Elling wrote:

> I never win the lottery either :-)

Let's see. Your chance of winning a 49 ball lottery is apparently
around 1 in 14*10^6, although it's much better than that because of
submatches (smaller payoffs for matches on less than 6 balls). There
are about 32*10^6 seconds in a year. If ZFS saves its writes for 30
seconds and batches them out, that means 1 write leaves the buffer
exposed for roughly one millionth of a year.

If you have 4GB of memory, you might get 50 errors a year, but you say
ZFS uses only 1/10 of this for writes, so that memory could see 5
errors/year. If your single write was 1/70th of that (say around 6 MB),
your chance of a hit is around (5/70) * 10^-6, or 1 in 14*10^6, so you
are correct! So if you do one 6MB write/year, your chances of a hit in
a year are about the same as that of winning a grand slam lottery.

Hopefully not every hit will trash a file or pool, but odds are that
you'll do many more writes than that, so on the whole I think a ZFS
hit is quite a bit more likely than winning the lottery each year :-).
Conversely, if you average one big write every 3 minutes or so (20%
occupancy), odds are almost certain that you'll get one hit a year. So
some SOHO users who do far fewer writes won't see any hits (say) over
a 5 year period. But some will, and they will be most unhappy --
calculate your odds and then make a decision! I daresay the PC makers
have done this calculation, which is why PCs don't have ECC, and hence
IMO make for insufficiently reliable servers.

Conclusions from what I've gleaned from all the discussions here: if
you are too cheap to opt for mirroring, your best bet is to disable
checksumming and set copies=2. If you mirror but don't have ECC then
at least set copies=2 and consider disabling checksums. Actually, set
copies=2 regardless, so that you have some redundancy if one half of
the mirror fails and you have a 10 hour resilver, in which time you
could easily get a (real) disk read error.

It seems to me some vendor is going to cotton onto the SOHO server
problem and make a bundle at the right price point. Sun's offerings
seem unfortunately mostly overkill for the SOHO market, although the
X4140 looks rather interesting... Shame there aren't any entry level
SPARCs any more :-(. Now what would doctors' front offices do if they
couldn't blame the computer for being down all the time?

> It is quite simple -- ZFS sent the flush command and VirtualBox
> ignored it. Therefore there is no guarantee that the bits on the
> persistent store are consistent.

But even on the most majestic of hardware, a flush command could be
lost, could it not? An obvious case in point is ZFS over iscsi and a
router glitch. But the discussion seems to be moot since CR 6667683 is
being addressed. Now about those writes to mirrored disks :)

Cheers -- Frank
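A quick back-of-envelope check of the single-write estimate above; the
inputs are the assumptions stated in the post (error rate, buffer
sizes), not measurements:

# Reproduces the "1 in 14 million" figure using the post's assumptions.
seconds_per_year  = 32e6
exposure_fraction = 30 / seconds_per_year   # ~30 s in the buffer ~= 1e-6 of a year
errors_per_year   = 50 * 0.10               # ~5 hits/year in the ~400 MB of write buffers
write_fraction    = 1 / 70                  # one ~6 MB write out of those buffers

p_single_write = errors_per_year * write_fraction * exposure_fraction
print(f"p(hit) for a single buffered write: {p_single_write:.1e}")
# ~7e-8, i.e. about 1 in 14 million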
roland
2009-Jul-25 11:06 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> Running this kind of setup absolutely can give you NO guarantees at
> all. Virtualisation, OSOL/zfs on WinXP. It's nice to play with and
> see it "working" but would I TRUST precious data to it? No way!

why not?

if i write some data through the virtualization layer which goes
straight through to raw disk - what's the problem?

do a snapshot and you can be sure you have a safe state. or not?

you can check if you are consistent by doing a scrub. or not?

taking buffers/caches into consideration, you could eventually lose
some seconds/minutes of work, but doesn't zfs use a transactional
design which ensures consistency?

so, how can what's being reported here happen, if zfs takes so much
care of consistency?

> When that happens, ZFS believes the data is safely written, but a
> power cut or crash can cause severe problems with the pool.

didn't i read a million times that zfs ensures an "always consistent
state" and is self healing, too?

so, if new blocks are always written at new positions - why can't we
just roll back to a point in time (for example the last snapshot)
which is known to be safe/consistent? i don't give a shit about the
last 5 minutes of work if i can recover my TB sized pool instead.
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-25 15:38 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sat, 25 Jul 2009, roland wrote:

>> When that happens, ZFS believes the data is safely written, but a
>> power cut or crash can cause severe problems with the pool.
>
> didn't i read a million times that zfs ensures an "always consistent
> state" and is self healing, too?
>
> so, if new blocks are always written at new positions - why can't we
> just roll back to a point in time (for example the last snapshot)
> which is known to be safe/consistent?

As soon as you have more than one disk in the equation, then it is
vital that the disks commit their data when requested since otherwise
the data on disk will not be in a consistent state. If the disks
simply do whatever they want then some disks will have written the
data while other disks will still have it cached. This blows the
"consistent state on disk" even though zfs wrote the data in order and
did all the right things. Any uncommitted data in disk cache will be
forgotten if the system loses power.

There is an additional problem if, when the disks finally get around
to writing the cached data, they write it in a different order than
requested while ignoring the commit request. It is common that the
disks write data in the most efficient order, but they absolutely must
commit all of the data when requested so that the checkpoint is valid.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
roland
2009-Jul-25 16:24 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> As soon as you have more than one disk in the equation, then it is
> vital that the disks commit their data when requested since otherwise
> the data on disk will not be in a consistent state.

ok, but doesn't that refer only to the most recent data?

why can i lose a whole 10TB pool including all the snapshots with the
logging/transactional nature of zfs?

isn't the data in the snapshots set to read only, so all blocks with
snapshotted data don't change over time (and thus give a secure
"entry" to a consistent point in time)?

ok, these are probably some short-sighted questions, but i'm trying to
understand how things could go wrong with zfs and how issues like
these happen.

on other filesystems, we have tools for fsck as a last resort or tools
to recover data from unmountable filesystems. with zfs i don't know
any of these, so it's that "will solaris mount my zfs after the next
crash?" question which frightens me a little bit.
-- 
This message posted from opensolaris.org
David Magda
2009-Jul-25 17:27 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 12:24, roland wrote:

> why can i lose a whole 10TB pool including all the snapshots with
> the logging/transactional nature of zfs?

Because ZFS does not (yet) have an (easy) way to go back to a previous
state. That's what this bug is about:

> need a way to rollback to an uberblock from a previous txg

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6667683

While in most cases ZFS will cleanly recover after a non-clean
shutdown, there are situations where the disks doing strange things
(like lying) have caused the ZFS data structures to become wonky. The
'broken' data structure will cause all branches underneath it to be
lost--and if it's near the top of the tree, it could mean a good
portion of the pool is inaccessible.

Fixing the above bug should hopefully allow users / sysadmins to tell
ZFS to go 'back in time' and look up previous versions of the data
structures.
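To make the "branches underneath it are lost" point concrete, here is
a toy illustration in Python (not ZFS code): in a tree of block
pointers, one damaged interior node hides its whole subtree from any
reader that has to walk down from the root, even though the child
blocks may still be intact on disk.

class Block:
    def __init__(self, name, corrupt=False, children=()):
        self.name, self.corrupt, self.children = name, corrupt, list(children)

def reachable(block):
    """Walk the tree the way a reader must: through parent pointers only."""
    if block.corrupt:
        return []                      # cannot trust or follow its pointers
    found = [block.name]
    for child in block.children:
        found += reachable(child)
    return found

leaves   = [Block(f"data{i}") for i in range(4)]
meta_ok  = Block("meta-ok",  children=leaves[:2])
meta_bad = Block("meta-bad", corrupt=True, children=leaves[2:])  # near the top
root     = Block("root", children=[meta_ok, meta_bad])

print(reachable(root))   # data2/data3 are lost to the reader, though intact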
roland
2009-Jul-25 18:17 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
thanks for the explanation!

one more question:

> there are situations where the disks doing strange things (like
> lying) have caused the ZFS data structures to become wonky. The
> 'broken' data structure will cause all branches underneath it to be
> lost--and if it's near the top of the tree, it could mean a good
> portion of the pool is inaccessible.

can snapshots also be affected by such an issue, or are they somewhat
"immune" here?
-- 
This message posted from opensolaris.org
David Magda
2009-Jul-25 18:50 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 14:17, roland wrote:

> thanks for the explanation!
>
> one more question:
>
>> there are situations where the disks doing strange things (like
>> lying) have caused the ZFS data structures to become wonky. The
>> 'broken' data structure will cause all branches underneath it to be
>> lost--and if it's near the top of the tree, it could mean a good
>> portion of the pool is inaccessible.
>
> can snapshots also be affected by such an issue, or are they
> somewhat "immune" here?

Yes, it can be affected. If the snapshot's data structure / record is
underneath the corrupted data in the tree then it won't be able to be
reached.
Frank Middleton
2009-Jul-25 19:32 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 02:50 PM, David Magda wrote:

> Yes, it can be affected. If the snapshot's data structure / record
> is underneath the corrupted data in the tree then it won't be able
> to be reached.

Can you comment on if/how mirroring or raidz mitigates this, or tree
corruption in general? I have yet to lose a pool even on a machine
with fairly pathological problems, but it is mirrored (and copies=2).

I was also wondering if you could explain why the ZIL can't repair
such damage.

Finally, a number of posters blamed VB for ignoring a flush, but
according to the evil tuning guide, without any application syncs,
ZFS may wait up to 5 seconds before issuing a synch, and there must be
all kinds of failure modes even on bare hardware where it never gets a
chance to do one at shutdown. This is interesting if you do ZFS over
iscsi because of the possibility of someone tripping over a patch cord
or a router blowing a fuse. Doesn't this mean /any/ hardware might
have this problem, albeit with much lower probability?

Thanks
Carson Gaspar
2009-Jul-25 20:30 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Frank Middleton wrote:

> Finally, a number of posters blamed VB for ignoring a flush, but
> according to the evil tuning guide, without any application syncs,
> ZFS may wait up to 5 seconds before issuing a synch, and there must
> be all kinds of failure modes even on bare hardware where it never
> gets a chance to do one at shutdown. This is interesting if you do
> ZFS over iscsi because of the possibility of someone tripping over a
> patch cord or a router blowing a fuse. Doesn't this mean /any/
> hardware might have this problem, albeit with much lower probability?

No. You'll lose unwritten data, but won't corrupt the pool, because
the on-disk state will be sane, as long as your iSCSI stack doesn't
lie about data commits or ignore cache flush commands.

Why is this so difficult for people to understand? Let me create a
simple example for you. Get yourself 4 small pieces of paper, and
number them 1 through 4.

On piece 1, write "Four" (app write disk A)
On piece 2, write "Score" (app write disk B)
Place piece 1 and piece 2 together on the side (metadata write, cache flush)
On piece 3, write "Every" (app overwrite disk A)
On piece 4, write "Good" (app overwrite disk B)
Place piece 3 and piece 4 on top of pieces 1 and 2 (metadata write, cache flush)

IFF you obeyed the instructions, the only things you could ever have
on the side are nothing, "Four Score", or "Every Good" (we assume that
side placement is atomic). You could get killed after writing
something on pieces 3 or 4, and lose them, but you could never have
garbage.

Now if you were too lazy to bother to follow the instructions
properly, we could end up with bizarre things. This is what happens
when storage lies and re-orders writes across boundaries.

-- 
Carson
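Here is a small simulation of the same protocol, purely illustrative
and not ZFS code: copy-on-write data writes plus a committed pointer
that is only advanced after a cache flush. With the barrier honored, a
crash can only expose a pointer to data that actually reached the
media; with the flush silently ignored, the device is free to destage
the pointer before the data it names.

import random

def crash_test(honor_flush, trials=20000):
    bad = 0
    for _ in range(trials):
        # logical order: write data1, FLUSH, point at 1, write data2, FLUSH, point at 2
        ops = [("data1", True), "FLUSH", ("ptr", 1),
               ("data2", True), "FLUSH", ("ptr", 2)]
        durable, cache = {}, []
        for op in ops[:random.randint(0, len(ops))]:     # crash at a random point
            if op == "FLUSH":
                if honor_flush:
                    durable.update(cache)                # barrier: queued writes are now on media
                    cache = []
            else:
                cache.append(op)                         # sits in the volatile write cache
        random.shuffle(cache)                            # device destages in its own order...
        durable.update(cache[:random.randint(0, len(cache))])   # ...and the rest never hits media
        ptr = durable.get("ptr")
        if ptr is not None and not durable.get("data%d" % ptr, False):
            bad += 1      # the committed pointer names data that was never written
    return bad

print("flush honored, broken states:", crash_test(True))    # expect 0
print("flush ignored, broken states:", crash_test(False))   # expect > 0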
Toby Thain
2009-Jul-25 23:34 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 25-Jul-09, at 3:32 PM, Frank Middleton wrote:

> On 07/25/09 02:50 PM, David Magda wrote:
>
>> Yes, it can be affected. If the snapshot's data structure / record
>> is underneath the corrupted data in the tree then it won't be able
>> to be reached.
>
> Can you comment on if/how mirroring or raidz mitigates this, or tree
> corruption in general? I have yet to lose a pool even on a machine
> with fairly pathological problems, but it is mirrored (and copies=2).
>
> I was also wondering if you could explain why the ZIL can't repair
> such damage.
>
> Finally, a number of posters blamed VB for ignoring a flush, but
> according to the evil tuning guide, without any application syncs,
> ZFS may wait up to 5 seconds before issuing a synch, and there must
> be all kinds of failure modes even on bare hardware where it never
> gets a chance to do one at shutdown. This is interesting if you do
> ZFS over iscsi because of the possibility of someone tripping over a
> patch cord or a router blowing a fuse. Doesn't this mean /any/
> hardware might have this problem, albeit with much lower probability?

The problem is assumed *ordering*. In this respect VB ignoring flushes
and real hardware are not going to behave the same.

--Toby

> Thanks
David Magda
2009-Jul-26 05:40 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 15:32, Frank Middleton wrote:

> Can you comment on if/how mirroring or raidz mitigates this, or tree
> corruption in general? I have yet to lose a pool even on a machine
> with fairly pathological problems, but it is mirrored (and copies=2).

Presumably at least one of the drives in the mirror or RAID set would
have the correct data or non-corrupted data structures.

There was a thread a while back on the risks involved in a SAN LUN
(served from something like an EMC array), and whether you could trust
the array or whether you should mirror LUNs. (I think the consensus
was it was best to mirror LUNs--even from SANs, which presumably are
more reliable than consumer SATA drives.)

> I was also wondering if you could explain why the ZIL can't repair
> such damage.

Beyond my knowledge.

> Finally, a number of posters blamed VB for ignoring a flush, but
> according to the evil tuning guide, without any application syncs,
> ZFS may wait up to 5 seconds before issuing a synch, and there

Yes, it will sync every 5 to 30 seconds, but how do you know the data
is actually synced?! If the five second timer triggers and ZFS says
"okay, time to sync", and goes through the proper procedures, what
happens if the drive lies about the sync operation? What then?

That's the whole point of this thread: what should happen, or what
should the file system do, when the drive (real or virtual) lies about
the syncing? It's just as much a problem with any other POSIX file
system (which has to deal with fsync(2))--ZFS isn't that special in
that regard. The Linux folks went through a protracted debate on a
similar issue not too long ago:

http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
http://lwn.net/Articles/322823/

> tripping over a patch cord or a router blowing a fuse. Doesn't this
> mean /any/ hardware might have this problem, albeit with much lower
> probability?

Yes, which is why it's always recommended to have redundancy in your
configuration (mirroring or RAID-Z). This way, hopefully, at least one
drive is in a consistent state.

This is also (theoretically) why a drive purchased from Sun is more
expensive than a drive purchased from your neighbourhood computer
shop: Sun (and presumably other manufacturers) takes the time and
effort to test things to make sure that when a drive says "I've synced
the data", it actually has synced the data. This testing is what
you're presumably paying for.
David Magda
2009-Jul-26 05:47 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 16:30, Carson Gaspar wrote:

> Frank Middleton wrote:
>
>> Doesn't this mean /any/ hardware might have this problem, albeit
>> with much lower probability?
>
> No. You'll lose unwritten data, but won't corrupt the pool, because
> the on-disk state will be sane, as long as your iSCSI stack doesn't
> lie about data commits or ignore cache flush commands.

But this entire thread started because Virtual Box's virtual disk
/did/ lie about data commits.

> Why is this so difficult for people to understand?

Because most people make the (not unreasonable) assumption that disks
save data the way that they're supposed to: that the data that goes in
is the data that comes out, and that when the OS tells them to empty
the buffer they actually flush it.

It's only us storage geeks that generally know the ugly truth that
this assumption is not always true. :)
Frank Middleton
2009-Jul-26 15:08 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 04:30 PM, Carson Gaspar wrote:

> No. You'll lose unwritten data, but won't corrupt the pool, because
> the on-disk state will be sane, as long as your iSCSI stack doesn't
> lie about data commits or ignore cache flush commands. Why is this
> so difficult for people to understand? Let me create a simple
> example for you.

Are you sure about this example? AFAIK metadata refers to things like
the file's name, atime, ACLs, etc., etc. Your example seems to be more
about how a journal works, which has little to do with metadata other
than to manage it.

> Now if you were too lazy to bother to follow the instructions
> properly, we could end up with bizarre things. This is what happens
> when storage lies and re-orders writes across boundaries.

On 07/25/09 07:34 PM, Toby Thain wrote:

> The problem is assumed *ordering*. In this respect VB ignoring
> flushes and real hardware are not going to behave the same.

Why? An ignored flush is ignored. It may be more likely in VB, but it
can always happen. It mystifies me that VB would in some way alter the
ordering. I wonder if the OP could tell us what actual disks and
controller he used, to see if the hardware might actually have done
out-of-order writes despite the fact that ZFS already does write
optimization. Maybe the disk didn't like the physical location of the
log relative to the data so it wrote the data first? Even then it
isn't obvious why this would cause the pool to be lost.

A traditional journalling file system should survive the loss of a
flush. Either the log entry was written or it wasn't. Even if the
disk, for some bizarre reason, writes some of the actual data before
writing the log, the repair process should undo that. If written
properly, it will use the information in the most current complete
journal entry to repair the file system. Doing syncs is devastating to
performance so usually there's an option to disable them, at the known
risk of losing a lot more data.

I've been using SPARCs and Solaris from the beginning. Ever since UFS
supported journalling, I've never lost a file unless the disk went
totally bad, and none since mirroring. Didn't miss fsck either :-)

Doesn't ZIL effectively make ZFS into a journalled file system (in
another thread, Bob Friesenhahn says it isn't, but I would submit that
the general opinion is correct that it is; "log" and "journal" have
similar semantics)? The evil tuning guide is pretty emphatic about not
disabling it!

My intuition (and this is entirely speculative) is that the ZFS ZIL
either doesn't contain everything needed to restore the
superstructure, or that if it does, the recovery process is ignoring
it. I think I read that the ZIL is per-file system, but one hopes it
doesn't rely on the superstructure recursively, or this would be
impossible to fix (maybe there's a ZIL for the ZILs :) ).

On 07/21/09 11:53 AM, George Wilson wrote:

> We are working on the pool rollback mechanism and hope to have that
> soon. The ZFS team recognizes that not all hardware is created equal
> and thus the need for this mechanism. We are using the following CR
> as the tracker for this work:
>
> 6667683 need a way to rollback to an uberblock from a previous txg

so maybe this discussion is moot :-)

-- Frank
Bob Friesenhahn
2009-Jul-26 16:24 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 26 Jul 2009, David Magda wrote:

> That's the whole point of this thread: what should happen, or what
> should the file system do, when the drive (real or virtual) lies
> about the syncing? It's just as much a problem with any other POSIX
> file system (which has to deal with fsync(2))--ZFS isn't that
> special in that regard. The Linux folks went through a protracted
> debate on a similar issue not too long ago:

Zfs is pretty darn special. RAIDed disk setups under Linux or *BSD
work differently than zfs in a rather big way. Consider that with a
normal software-based RAID setup, you use OS tools to create a virtual
RAIDed device (LUN) which appears as a large device that you can then
create (e.g. mkfs) a traditional filesystem on top of.

Zfs works quite differently in that it uses a pooled design which
incorporates several RAID strategies directly. Instead of sending the
data to a virtual device which then arranges the underlying data
according to a policy (striping, mirror, RAID5), zfs incorporates
knowledge of the vdev RAID strategy and intelligently issues data to
the disks in an ideal order, executing the disk drive commit requests
directly. Zfs removes the RAID obfuscation which exists in traditional
RAID systems.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Toby Thain
2009-Jul-26 21:10 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26-Jul-09, at 11:08 AM, Frank Middleton wrote:

> On 07/25/09 04:30 PM, Carson Gaspar wrote:
>
>> No. You'll lose unwritten data, but won't corrupt the pool, because
>> the on-disk state will be sane, as long as your iSCSI stack doesn't
>> lie about data commits or ignore cache flush commands. Why is this
>> so difficult for people to understand? Let me create a simple
>> example for you.
>
> Are you sure about this example? AFAIK metadata refers to things
> like the file's name, atime, ACLs, etc., etc. Your example seems to
> be more about how a journal works, which has little to do with
> metadata other than to manage it.
>
>> Now if you were too lazy to bother to follow the instructions
>> properly, we could end up with bizarre things. This is what happens
>> when storage lies and re-orders writes across boundaries.
>
> On 07/25/09 07:34 PM, Toby Thain wrote:
>
>> The problem is assumed *ordering*. In this respect VB ignoring
>> flushes and real hardware are not going to behave the same.
>
> Why? An ignored flush is ignored. It may be more likely in VB, but
> it can always happen.

And whenever it does: guess what happens?

> It mystifies me that VB would in some way alter the ordering.

Carson already went through a more detailed explanation. Let me try a
different one:

ZFS issues writes A, B, C, FLUSH, D, E, F.

case 1) the semantics of the flush* allow ZFS to presume that A, B, C
are all 'committed' at the point that D is issued. You can understand
that A, B, C may be done in any order, and D, E, F may be done in any
order, due to the numerous abstraction layers involved - all the way
down to the disk's internal scheduling. ANY of these layers can affect
the ordering of durable, physical writes _in the absence of a
flush/barrier_.

case 2) but if the flush does NOT occur with the necessary semantics,
the ordering of ALL SIX operations is now indeterminate, and by the
time ZFS issues D, any of the first 3 (A, B, C) may well not have been
committed at all. There is a very good chance this will violate an
integrity assumption (I haven't studied the source so I can't point
you to a specific design detail or line; rather I am working from how
I understand transactional/journaled systems to work. Assuming my
argument is valid, I am sure a ZFS engineer can cite a specific
violation).

As has already been mentioned in this context, I think by David Magda,
ordinary hardware will show this problem _if flushes are not
functioning_ (an unusual case on bare metal), while on VirtualBox this
is the default!

> ...
>
> Doesn't ZIL effectively make ZFS into a journalled file system

Of course ZFS is transactional, as are other filesystems and software
systems, such as RDBMS. But integrity of such systems depends on a
hardware flush primitive that actually works. We are getting hoarse
repeating this.

--Toby

* Essentially 'commit' semantics: Flush synchronously, operation is
complete only when data is durably stored.
Marcelo Leal
2009-Jul-27 15:18 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> That's only one element of it Bob. ZFS also needs devices to fail
> quickly and in a predictable manner.
>
> A consumer grade hard disk could lock up your entire pool as it
> fails. The kit Sun supply is more likely to fail in a manner ZFS can
> cope with.

I agree 100%. Hardware, firmware, and drivers should be fully
integrated with a mission critical app. With the wrong firmware and
consumer grade HDs, disk failures stall the entire pool. I have
experience with disks failing and taking two or three seconds for the
system to cope with (not just ZFS, but the controller, etc).

Leal.
-- 
This message posted from opensolaris.org
Ross
2009-Jul-27 17:10 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Heh, I'd kill for failures to be handled in 2 or 3 seconds. I saw the
failure of a mirrored iSCSI disk lock the entire pool for 3 minutes.

That has been addressed now, but device hangs have the potential to be
*very* disruptive.
-- 
This message posted from opensolaris.org
Eric D. Mudama
2009-Jul-27 17:27 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, Jul 26 at 1:47, David Magda wrote:

> On Jul 25, 2009, at 16:30, Carson Gaspar wrote:
>
>> Frank Middleton wrote:
>>
>>> Doesn't this mean /any/ hardware might have this problem, albeit
>>> with much lower probability?
>>
>> No. You'll lose unwritten data, but won't corrupt the pool, because
>> the on-disk state will be sane, as long as your iSCSI stack doesn't
>> lie about data commits or ignore cache flush commands.
>
> But this entire thread started because Virtual Box's virtual disk
> /did/ lie about data commits.
>
>> Why is this so difficult for people to understand?
>
> Because most people make the (not unreasonable) assumption that
> disks save data the way that they're supposed to: that the data that
> goes in is the data that comes out, and that when the OS tells them
> to empty the buffer they actually flush it.
>
> It's only us storage geeks that generally know the ugly truth that
> this assumption is not always true. :)

Can *someone* please name a single drive+firmware or RAID
controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
commands? Or worse, responds "ok" when the flush hasn't occurred?

Everyone on this list seems to blame lying hardware for ignoring
commands, but disks are relatively mature and I can't believe that
major OEMs would qualify disks or other hardware that willingly ignore
commands.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Thomas Burgess
2009-Jul-27 17:49 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
i was under the impression it was virtualbox and its default setting
that ignored the command, not the hard drive

On Mon, Jul 27, 2009 at 1:27 PM, Eric D. Mudama
<edmudama at bounceswoosh.org> wrote:

> On Sun, Jul 26 at 1:47, David Magda wrote:
>
>> On Jul 25, 2009, at 16:30, Carson Gaspar wrote:
>>
>>> Frank Middleton wrote:
>>>
>>>> Doesn't this mean /any/ hardware might have this problem, albeit
>>>> with much lower probability?
>>>
>>> No. You'll lose unwritten data, but won't corrupt the pool,
>>> because the on-disk state will be sane, as long as your iSCSI
>>> stack doesn't lie about data commits or ignore cache flush
>>> commands.
>>
>> But this entire thread started because Virtual Box's virtual disk
>> /did/ lie about data commits.
>>
>>> Why is this so difficult for people to understand?
>>
>> Because most people make the (not unreasonable) assumption that
>> disks save data the way that they're supposed to: that the data
>> that goes in is the data that comes out, and that when the OS tells
>> them to empty the buffer they actually flush it.
>>
>> It's only us storage geeks that generally know the ugly truth that
>> this assumption is not always true. :)
>
> Can *someone* please name a single drive+firmware or RAID
> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
> commands? Or worse, responds "ok" when the flush hasn't occurred?
>
> Everyone on this list seems to blame lying hardware for ignoring
> commands, but disks are relatively mature and I can't believe that
> major OEMs would qualify disks or other hardware that willingly
> ignore commands.
>
> --eric
>
> --
> Eric D. Mudama
> edmudama at mail.bounceswoosh.org
Chris Ridd
2009-Jul-27 17:54 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27 Jul 2009, at 18:49, Thomas Burgess wrote:

> i was under the impression it was virtualbox and its default setting
> that ignored the command, not the hard drive

Do other virtualization products (eg VMware, Parallels, Virtual PC)
have the same default behaviour as VirtualBox?

I've a suspicion they all behave similarly dangerously, but actual
data would be useful.

Cheers,

Chris
Adam Sherman
2009-Jul-27 17:59 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 13:54 , Chris Ridd wrote:

>> i was under the impression it was virtualbox and its default
>> setting that ignored the command, not the hard drive
>
> Do other virtualization products (eg VMware, Parallels, Virtual PC)
> have the same default behaviour as VirtualBox?
>
> I've a suspicion they all behave similarly dangerously, but actual
> data would be useful.

Also, I think it may have already been posted, but I haven't found the
option to disable VirtualBox' disk cache. Anyone have the incantation
handy?

Thanks,

A

-- 
Adam Sherman
CTO, Versature Corp.
Tel: +1.877.498.3772 x113
Mike Gerdts
2009-Jul-27 18:16 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27, 2009 at 12:54 PM, Chris Ridd <chrisridd at mac.com> wrote:

> On 27 Jul 2009, at 18:49, Thomas Burgess wrote:
>
>> i was under the impression it was virtualbox and its default
>> setting that ignored the command, not the hard drive
>
> Do other virtualization products (eg VMware, Parallels, Virtual PC)
> have the same default behaviour as VirtualBox?

I've lost a pool due to LDoms doing the same. This bug seems to be
related.

http://bugs.opensolaris.org/view_bug.do?bug_id=6684721

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
David Magda
2009-Jul-27 19:14 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, July 27, 2009 13:59, Adam Sherman wrote:

> Also, I think it may have already been posted, but I haven't found
> the option to disable VirtualBox' disk cache. Anyone have the
> incantation handy?

http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0

It tells VB not to ignore the sync/flush command. Caching is still
enabled (it wasn't the problem).
Frank Middleton
2009-Jul-27 19:44 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/27/09 01:27 PM, Eric D. Mudama wrote:

> Everyone on this list seems to blame lying hardware for ignoring
> commands, but disks are relatively mature and I can't believe that
> major OEMs would qualify disks or other hardware that willingly
> ignore commands.

You are absolutely correct, but if the cache flush command never makes
it to the disk, then it won't see it. The contention is that by not
relaying the cache flush to the disk, VirtualBox caused the OP to lose
his pool.

IMO this argument is bogus because AFAIK the OP didn't actually power
his system down, so the data would still have been in the cache, and
would presumably eventually have been written. The out-of-order writes
theory is also somewhat dubious, since he was able to write 10TB
without VB relaying the cache flushes. This is all highly hardware
dependent, and AFAIK no one ever asked the OP what hardware he had,
instead blasting him for running VB on MSWindows. Since IIRC he was
using raw disk access, it is questionable whether or not MS was to
blame, but in general it simply shouldn't be possible to lose a pool
under any conditions.

It does raise the question of what happens in general if a cache flush
doesn't happen if, for example, a system crashes in such a way that it
requires a power cycle to restart, and the cache never gets flushed.
Do disks with volatile caches attempt to flush the cache by themselves
if they detect power down? It seems that the ZFS team recognizes this
as a problem, hence the CR to address it.

It turns out (at least according to this almost 4 year old blog)

http://blogs.sun.com/perrin/entry/the_lumberjack

that the ZILs /are/ allocated recursively from the main pool. Unless
there is a ZIL for the ZILs, ZFS really isn't fully journalled, and
this could be the real explanation for all lost pools and/or file
systems. It would be great to hear from the ZFS team that writing a
ZIL, presumably a transaction in its own right, is protected somehow
(by a ZIL for the ZILs?).

Of course the ZIL isn't a journal in the traditional sense, and AFAIK
it has no undo capability the way that a DBMS usually has, but it
needs to be structured so that bizarre things that happen when
something as robust as Solaris crashes don't cause data loss. The
nightmare scenario is when one disk of a mirror begins to fail and the
system comes to a grinding halt where even stop-a doesn't respond, and
a power cycle is the only way out. Who knows what writes may or may
not have been issued or what the state of the disk cache might be at
such a time.

-- Frank
Richard Elling
2009-Jul-27 20:50 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote:

> On Sun, Jul 26 at 1:47, David Magda wrote:
>
>> On Jul 25, 2009, at 16:30, Carson Gaspar wrote:
>>
>>> Frank Middleton wrote:
>>>
>>>> Doesn't this mean /any/ hardware might have this problem, albeit
>>>> with much lower probability?
>>>
>>> No. You'll lose unwritten data, but won't corrupt the pool,
>>> because the on-disk state will be sane, as long as your iSCSI
>>> stack doesn't lie about data commits or ignore cache flush
>>> commands.
>>
>> But this entire thread started because Virtual Box's virtual disk
>> /did/ lie about data commits.
>>
>>> Why is this so difficult for people to understand?
>>
>> Because most people make the (not unreasonable) assumption that
>> disks save data the way that they're supposed to: that the data
>> that goes in is the data that comes out, and that when the OS tells
>> them to empty the buffer they actually flush it.
>>
>> It's only us storage geeks that generally know the ugly truth that
>> this assumption is not always true. :)
>
> Can *someone* please name a single drive+firmware or RAID
> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
> commands? Or worse, responds "ok" when the flush hasn't occurred?

two seconds with google shows
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush

Give it up. These things happen. Not much you can do about it, other
than design around it.
 -- richard
Adam Sherman
2009-Jul-27 20:54 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 15:14 , David Magda wrote:

>> Also, I think it may have already been posted, but I haven't found
>> the option to disable VirtualBox' disk cache. Anyone have the
>> incantation handy?
>
> http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0
>
> It tells VB not to ignore the sync/flush command. Caching is still
> enabled (it wasn't the problem).

Thanks! As Russell points out in the last post to that thread, it
doesn't seem possible to do this with virtual SATA disks? Odd.

A.

-- 
Adam Sherman
CTO, Versature Corp.
Tel: +1.877.498.3772 x113
Nigel Smith
2009-Jul-27 23:09 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
David Magda wrote:

> This is also (theoretically) why a drive purchased from Sun is more
> expensive than a drive purchased from your neighbourhood computer
> shop: Sun (and presumably other manufacturers) takes the time and
> effort to test things to make sure that when a drive says "I've
> synced the data", it actually has synced the data. This testing is
> what you're presumably paying for.

So how do you test a hard drive to check it does actually sync the
data? How would you do it in theory? And in practice?

Now say we are talking about a virtual hard drive, rather than a
physical hard drive. How would that affect the answer to the above
questions?

Thanks
Nigel
-- 
This message posted from opensolaris.org
Toby Thain
2009-Jul-28 00:34 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 3:44 PM, Frank Middleton wrote:

> On 07/27/09 01:27 PM, Eric D. Mudama wrote:
>
>> Everyone on this list seems to blame lying hardware for ignoring
>> commands, but disks are relatively mature and I can't believe that
>> major OEMs would qualify disks or other hardware that willingly
>> ignore commands.
>
> You are absolutely correct, but if the cache flush command never
> makes it to the disk, then it won't see it. The contention is that
> by not relaying the cache flush to the disk,

No - by COMPLETELY ignoring the flush.

> VirtualBox caused the OP to lose his pool.
>
> IMO this argument is bogus because AFAIK the OP didn't actually
> power his system down, so the data would still have been in the
> cache, and would presumably eventually have been written. The
> out-of-order writes theory is also somewhat dubious, since he was
> able to write 10TB without VB relaying the cache flushes.

Huh? Of course he could. The guest didn't crash while he was doing it!
The corruption occurred when the guest crashed (iirc). And the "out of
order theory" need not be the *only possible* explanation, but it *is*
sufficient.

> This is all highly hardware dependent,

Not in the least. It's a logical problem.

> and AFAIK no one ever asked the OP what hardware he had, instead
> blasting him for running VB on MSWindows.

Which is certainly not relevant to my hypothesis of what broke. I
don't care what host he is running. The argument is the same for all.

> Since IIRC he was using raw disk access, it is questionable whether
> or not MS was to blame, but in general it simply shouldn't be
> possible to lose a pool under any conditions.

How about "when flushes are ignored"?

> It does raise the question of what happens in general if a cache
> flush doesn't happen if, for example, a system crashes in such a way
> that it requires a power cycle to restart, and the cache never gets
> flushed.

Previous explanations have not dented your misunderstanding one iota.
The problem is not that an attempted flush did not complete. It was
that any and all flushes *prior to crash* were ignored. This is where
the failure mode diverges from real hardware.

Again, look:

A B C FLUSH D E F FLUSH <CRASH>

Note that it does not matter *at all* whether the 2nd flush completed.
What matters from an integrity point of view is that the *previous*
flush was completed (and synchronously). Visualise this in the two
scenarios:

1) real hardware: (barring actual defects) that A, B, C were written
was guaranteed by the first flush (otherwise D would never have been
issued). Integrity of the system is intact regardless of whether the
2nd flush completed.

2) VirtualBox: flush never happened. Integrity of the system is lost,
or at best unknown, if it depends on A, B, C all completing before D.

> ...
>
> Of course the ZIL isn't a journal in the traditional sense, and
> AFAIK it has no undo capability the way that a DBMS usually has, but
> it needs to be structured so that bizarre things that happen when
> something as robust as Solaris crashes don't cause data loss.

A lot of engineering effort has been expended in UFS and ZFS to
achieve just that. Which is why it's so nutty to undermine that by
violating semantics in lower layers.

> The nightmare scenario is when one disk of a mirror begins to fail
> and the system comes to a grinding halt where even stop-a doesn't
> respond, and a power cycle is the only way out. Who knows what
> writes may or may not have been issued or what the state of the disk
> cache might be at such a time.

Again, if the flush semantics are respected*, this is not a problem.

--Toby

* - "When this operation completes, previous writes are verifiably on
durable media**."

** - Durable media meaning physical media in a bare metal environment,
and potentially "virtual media" in a virtualised environment.
Ross
2009-Jul-28 09:19 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I think people can understand the concept of missing flushes. The big
conceptual problem is how this manages to hose an entire filesystem,
which is assumed to have rather a lot of data which ZFS has already
verified to be ok.

Hardware ignoring flushes and losing recent data is understandable, I
don't think anybody would argue with that. Losing access to your
entire pool and multiple gigabytes of data because a few writes failed
is a whole different story, and while I understand how it happens, ZFS
appears to be unique among modern filesystems in suffering such a
catastrophic failure so often.

To give a quick personal example: I can plug a fat32 usb disk into a
windows system, drag some files to it, and pull that drive at any
point. I might lose a few files, but I've never lost the entire
filesystem. Even if the absolute worst happened, I know I can run
scandisk, chkdisk, or any number of file recovery tools and get my
data back. I would never, ever attempt this with ZFS.

For a filesystem like ZFS where its integrity and stability are sold
as being way better than existing filesystems, losing your entire pool
is a bit of a shock. I know that work is going on to be able to
recover pools, and I'll sleep a lot sounder at night once it is
available.
-- 
This message posted from opensolaris.org
Rennie Allen
2009-Jul-28 23:46 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> Can *someone* please name a single drive+firmware or RAID
> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
> commands? Or worse, responds "ok" when the flush hasn't occurred?

I think it would be a shorter list if one were to name the
drives/controllers that actually implement a flush properly.

> Everyone on this list seems to blame lying hardware for ignoring
> commands, but disks are relatively mature and I can't believe that
> major OEMs would qualify disks or other hardware that willingly
> ignore commands.

It seems you have too much faith in major OEMs of storage, considering
that 99.9% of the market is personal use, and for which a 2%
throughput advantage over a competitor can make or break the profit
margin on a device. Ignoring cache requests is guaranteed to get the
best drive performance benchmarks regardless of what software is
driving the device.

For example, it is virtually impossible to find a USB drive that
honors cache sync (to do so would require that the device stop
completely until a fully synchronous USB transaction had made it to
the device and the data had been written). Can you imagine how long a
USB drive would sit on store shelves if it actually did do a proper
cache sync?

While USB is the extreme case, and it does get better the more
expensive the drive, it is still far from a given that any particular
device properly handles cache flushes.
-- 
This message posted from opensolaris.org
Rennie Allen
2009-Jul-28 23:52 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> This is also (theoretically) why a drive purchased from Sun is more
> expensive than a drive purchased from your neighbourhood computer
> shop:

It's more significant than that. Drives aimed at the consumer market
are at a competitive disadvantage if they do handle cache flush
correctly (since the popular hardware blog of the day will show that
the device is far slower than the competitors that throw away the sync
requests).

> Sun (and presumably other manufacturers) takes the time and effort
> to test things to make sure that when a drive says "I've synced the
> data", it actually has synced the data. This testing is what you're
> presumably paying for.

It wouldn't cost any more for commercial vendors to implement cache
flush properly, it is just that they are penalized by the market for
doing so.
-- 
This message posted from opensolaris.org
Eric D. Mudama
2009-Jul-29 01:34 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27 at 13:50, Richard Elling wrote:

> On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote:
>
>> Can *someone* please name a single drive+firmware or RAID
>> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
>> commands? Or worse, responds "ok" when the flush hasn't occurred?
>
> two seconds with google shows
> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush
>
> Give it up. These things happen. Not much you can do about it, other
> than design around it.
> -- richard

That example is Windows-specific, and is a software driver, where the
data integrity feature must be manually disabled by the end user. The
default behavior was always maximum data protection.

While perhaps analogous at some level, the perpetual "your hardware
must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a
sufficient explanation when things go wrong, like complete loss of a
pool.

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
James Andrewartha
2009-Jul-29 10:08 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Nigel Smith wrote:
> David Magda wrote:
>> This is also (theoretically) why a drive purchased from Sun is more
>> expensive than a drive purchased from your neighbourhood computer
>> shop: Sun (and presumably other manufacturers) takes the time and
>> effort to test things to make sure that when a drive says "I've synced
>> the data", it actually has synced the data. This testing is what
>> you're presumably paying for.
>
> So how do you test a hard drive to check it does actually sync the data?
> How would you do it in theory?
> And in practice?
>
> Now say we are talking about a virtual hard drive,
> rather than a physical hard drive.
> How would that affect the answer to the above questions?

http://brad.livejournal.com/2116715.html has a utility that can be used to test whether your systems (including virtual ones) properly sync data to disk when asked to.

--
James Andrewartha
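For anyone who wants to try Brad's script, the rough workflow is to run it as a listener on a second machine, write test data from the box under test, cut power to the test box mid-run, and then verify. The exact arguments below are assumptions and may not match the current script exactly; the script prints its own usage text, so check that first.

    # On a second machine that stays up: record which writes the test box
    # claimed were synced to stable storage
    perl diskchecker.pl -l

    # On the machine (or VM) under test, writing into the filesystem you want
    # to test; pull the power while this is still running
    perl diskchecker.pl -s server-hostname create /tank/test-file 500

    # After the test box comes back up, ask the listener whether every write
    # that was acknowledged as synced actually survived the power cut
    perl diskchecker.pl -s server-hostname verify /tank/test-file

If acknowledged-as-synced writes are missing after the power cut, something in the stack (drive, controller, or virtualization layer) is dropping or lying about flushes.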
Nigel Smith
2009-Jul-29 13:12 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi James,

Many thanks for finding & posting that link. I'm sure many people on this forum will be interested in trying out Brad Fitzpatrick's Perl script 'diskchecker.pl'. It will be interesting to hear their results. I've not yet had time to work out how Brad's script works; it would be good if others here could take a critical look at it and feed back their comments to the forum.

I'm disappointed that I've not had a reply from someone at Sun to explain how they test their hard drives. We've had a few people here quick to claim that most hard drives fail to sync/flush correctly, but AFAIK no one is saying how they know this. Have they actually tested, and if so, how? Or do they just "know" because of bad experiences having lost lots of data?

Best Regards
Nigel Smith
--
This message posted from opensolaris.org
Richard Elling
2009-Jul-29 17:55 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 28, 2009, at 6:34 PM, Eric D. Mudama wrote:
> On Mon, Jul 27 at 13:50, Richard Elling wrote:
>> On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote:
>>> Can *someone* please name a single drive+firmware or RAID
>>> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
>>> commands? Or worse, responds "ok" when the flush hasn't occurred?
>>
>> two seconds with google shows
>> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush
>>
>> Give it up. These things happen. Not much you can do about it, other
>> than design around it.
>> -- richard
>
> That example is Windows-specific, and concerns a software driver where
> the data integrity feature must be manually disabled by the end user.
> The default behavior was always maximum data protection.

I don't think you read the post. It specifically says, "Previous versions of the Promise drivers ignored the flush cache command until system power down." Promise makes RAID controllers and has a firmware fix for this. This is the kind of thing we face: some performance engineer tries to get an edge by assuming there is only one case where cache flush matters.

Another 2 seconds with google shows:
http://sunsolve.sun.com/search/document.do?assetkey=1-66-200007-1
(interestingly, for this one, fsck also fails)
http://sunsolve.sun.com/search/document.do?assetkey=1-21-103622-06-1
http://forums.seagate.com/stx/board/message?board.id=freeagent&message.id=5060&query.id=3999#M5060

But vendors also get cache flush code wrong in the opposite direction. A good example of that is the notorious Seagate 1.5 TB disk "stutter" problem.

NB, for the most part, vendors do not air their dirty laundry (eg bug reports) on the internet for those without support contracts. If you have a support contract, your search may show many more cases.

> While perhaps analogous at some level, the perpetual "your hardware
> must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a
> sufficient explanation when things go wrong, like complete loss of a
> pool.

As I said before, it is a systems engineering problem. If you do your own systems engineering, then you should make sure the components you select work as you expect.
-- richard
After all the discussion here about VB, and all the finger pointing, I raised a bug on VB about flushing.

Remember I am using RAW disks via the SATA emulation in VB; the disks are WD 2TB drives. Also remember the HOST machine NEVER crashed or stopped, BUT the guest OS OpenSolaris was hung and so I powered off the VIRTUAL host.

OK, this is what the VB engineer had to say after reading this and another thread I had pointed him to. (He missed the fact I was using RAW, not surprising as it's a rather long thread now!)

==============================
Just looked at those two threads, and from what I saw all vital information is missing - no hint whatsoever on how the user set up his disks, nothing about what errors should be dealt with and so on. So hard to say anything sensible, especially as people seem most interested in assigning blame to some product. ZFS doesn't deserve this, and VirtualBox doesn't deserve this either.

In the first place, there is absolutely no difference in how the IDE and SATA devices handle the flush command. The documentation just wasn't updated to talk about the SATA controller. Thanks for pointing this out, it will be fixed in the next major release. If you want to get the information straight away: just replace "piix3ide" with "ahci", and all other flushing behavior settings apply as well. See a bit further below for what it buys you (or not).

What I haven't mentioned is the rationale behind the current behavior. The reason for ignoring flushes is simple: the biggest competitor does it by default as well, and one gets beaten up by every reviewer if VirtualBox is just a few percent slower than you know what. Forget about arguing with reviewers.

That said, a bit about what flushing can achieve - or not. Just keep in mind that VirtualBox doesn't really buffer anything. In the IDE case every read and write request gets handed more or less straight (depending on the image format complexity) to the host OS. So there is absolutely nothing which can be lost if one assumes the host OS doesn't crash.

In the SATA case things are slightly more complicated. If you're using anything but raw disks or flat file VMDKs, the behavior is 100% identical to IDE. If you use raw disks or flat file VMDKs, we activate NCQ support in the SATA device code, which means that the guest can push through a number of commands at once, and they get handled on the host via async I/O. Again - if the host OS works reliably there is nothing to lose.

The only thing flushing can potentially improve is the behavior when the host OS crashes. But that depends on many assumptions about what the respective OS does, what the filesystems do, etc.

Hope those facts can be the basis of a real discussion. Feel free to raise any issue you have in this context, as long as it's not purely hypothetical.
==================================
--
This message posted from opensolaris.org
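For anyone who would rather turn flushing back on than argue about the default, the knob the VirtualBox engineer is describing appears to be the per-controller IgnoreFlush extradata documented in the VirtualBox manual. A minimal sketch; the VM name and LUN numbers are placeholders you would adjust for your own configuration:

    # Pass flush requests through instead of ignoring them.
    # IDE controller, primary master (LUN#0):
    VBoxManage setextradata "storage1-vm" \
      "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

    # Per the engineer's note, the same setting exists for the SATA/AHCI
    # controller; repeat for each attached port (LUN#0, LUN#1, ...):
    VBoxManage setextradata "storage1-vm" \
      "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0

As far as I know the change only takes effect the next time the VM starts, and it only helps if the guest actually issues flushes, which ZFS does.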
Thanks for following up with this, Russel. On Jul 31, 2009, at 7:11 AM, Russel wrote:> After all the discussion here about VB, and all the finger pointing > I raised a bug on VB about flushing. > > Remember I am using RAW disks via the SATA emulation in VB > the disks are WD 2TB drives. Also remember the HOST machine > NEVER crashed or stopped. BUT the guest OS OpenSolaris was > hung and so I powered off the VIRTUAL host. > > OK, this is what the VB engineer had to say after reading this and > another thread I had pointed him to. (he missed the fast I was > using RAW not supprising as its a rather long thread now!) > > ==============================> Just looked at those two threads, and from what I saw all vital > information is missing - no hint whatsoever on how the user set up > his disks, nothing about what errors should be dealt with and so on. > So hard to say anything sensible, especially as people seem most > interested in assigning blame to some product. ZFS doesn''t deserve > this, and VirtualBox doesn''t deserve this either. > > In the first place, there is absolutely no difference in how the IDE > and SATA devices handle the flush command. The documentation just > wasn''t updated to talk about the SATA controller. Thanks for > pointing this out, it will be fixed in the next major release. If > you want to get the information straight away: just replace > "piix3ide" with "ahci", and all other flushing behavior settings > apply as well. See a bit further below of what it buys you (or not). > > What I haven''t mentioned is the rationale behind the current > behavior. The reason for ignoring flushes is simple: the biggest > competitor does it by default as well, and one gets beaten up by > every reviewer if VirtualBox is just a few percent slower than you > know what. Forget about arguing with reviewers. > > That said, a bit about what flushing can achieve - or not. Just keep > in mind that VirtualBox doesn''t really buffer anything. In the IDE > case every read and write requests gets handed more or less straight > (depending on the image format complexity) to the host OS. So there > is absolutely nothing which can be lost if one assumes the host OS > doesn''t crash. > > In the SATA case things are slightly more complicated. If you''re > using anything but raw disks or flat file VMDKs, the behavior is > 100% identical to IDE. If you use raw disks or flat file VMDKs, we > activate NCQ support in the SATA device code, which means that the > guest can push through a number of commands at once, and they get > handled on the host via async I/O. Again - if the host OS works > reliably there is nothing to lose.The problem with this thought process is that since the data is not on medium, a fault that occurs between the flush request and the bogus ack goes undetected. The OS trusts when the disk said "the data is on the medium" that the data is on the medium with no errors. This problem also affects "hardware" RAID arrays which provide nonvolatile caches. If the array acks a write and flush, but the data is not yet committed to medium, then if the disk fails, the data must remain in nonvolatile cache until it can be committed to the medium. A use case may help, suppose the power goes out. Most arrays have enough battery to last for some time. But if power isn''t restored prior to the batteries discharging, then there is a risk of data loss. For ZFS, cache flush requests are not gratuitous. One critical case is the uberblock or label update. ZFS does: 1. update labels 0 and 2 2. flush 3. 
check for errors, 4. update labels 1 and 3, 5. flush, 6. check for errors.

Making flush be a nop destroys the ability to check for errors, thus breaking the trust between ZFS and the data on medium.
-- richard

> The only thing flushing can potentially improve is the behavior
> when the host OS crashes. But that depends on many assumptions about
> what the respective OS does, what the filesystems do, etc.
>
> Hope those facts can be the basis of a real discussion. Feel free to
> raise any issue you have in this context, as long as it's not purely
> hypothetical.
>
> ==================================
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
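A quick way to see the effect of that two-phase update on a pool is to compare the txg recorded in each of the four labels on every device; if a flush was silently dropped at the wrong moment, one label pair can lag behind the other. A rough sketch, using the device names from the pool in this thread (adjust for your own):

    # dump just the label headers and txg values for each raidz member
    for d in c9t0d0s0 c9t1d0s0 c9t2d0s0 c9t3d0s0 c9t4d0s0
    do
        echo "=== $d ==="
        zdb -l /dev/dsk/$d | egrep 'LABEL|txg'
    done

On a healthy, exported pool all four labels on all devices should report the same txg.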
Dave Stubbs
2009-Jul-31 22:23 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
> I don't mean to be offensive Russel, but if you do
> ever return to ZFS, please promise me that you will
> never, ever, EVER run it virtualized on top of NTFS
> (a.k.a. worst file system ever) in a production
> environment. Microsoft Windows is a horribly
> unreliable operating system in situations where
> things like protecting against data corruption are
> important. Microsoft knows this

Oh WOW! Whether or not our friend Russel virtualized on top of NTFS (he didn't - he used raw disk access), this point is amazing! System5 - based on this thread I'd say you can't really make this claim at all. Solaris suffered a crash and the ZFS filesystem lost EVERYTHING! And there aren't even any recovery tools?

HANG YOUR HEADS!!!

Recovery from the same situation is EASY on NTFS. There are piles of tools out there that will recover the file system, and failing that, locate and extract data. The key parts of the file system are stored in multiple locations on the disk just in case. It's been this way for over 10 years. I'd say it seems from this thread that my data is a lot safer on NTFS than it is on ZFS!

I can't believe my eyes as I read all these responses blaming system engineering and hiding behind ECC memory excuses and "well, you know, ZFS is intended for more Professional systems and not consumer devices, etc etc." My goodness! You DO realize that Sun has this website called opensolaris.org which actually proposes to have people use ZFS on commodity hardware, don't you? I don't see a huge warning on that site saying "ATTENTION: YOU PROBABLY WILL LOSE ALL YOUR DATA".

I recently flirted with putting several large Unified Storage 7000 systems on our corporate network. The hype about ZFS is quite compelling and I had positive experience in my lab setting. But because of not having Solaris capability on our staff we went in another direction instead. Reading this thread, I'm SO glad we didn't put ZFS in production in ANY way.

Guys, this is the real world. Stuff happens. It doesn't matter what the reason is - hardware lying about cache commits, out-of-order commits, failure to use ECC memory, whatever. It is ABSOLUTELY unacceptable for the filesystem to be entirely lost. No excuse or rationalization of any type can be justified. There MUST be at least the base suite of tools to deal with this stuff. Without it, ZFS simply isn't ready yet.

I am saving a copy of this thread to show my colleagues and also those Sun Microsystems sales people that keep calling.
--
This message posted from opensolaris.org
Richard Elling
2009-Jul-31 23:15 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
wow, talk about a knee jerk reaction... On Jul 31, 2009, at 3:23 PM, Dave Stubbs wrote:>> I don''t mean to be offensive Russel, but if you do >> ever return to ZFS, please promise me that you will >> never, ever, EVER run it virtualized on top of NTFS >> (a.k.a. worst file system ever) in a production >> environment. Microsoft Windows is a horribly >> unreliable operating system in situations where >> things like protecting against data corruption are >> important. Microsoft knows this > > Oh WOW! Whether or not our friend Russel virtualized on top of NTFS > (he didn''t - he used raw disk access) this point is amazing!This point doesn''t matter. VB sits between the guest OS and the raw disk and drops cache flush requests.> System5 - based on this thread I''d say you can''t really make this > claim at all. Solaris suffered a crash and the ZFS filesystem lost > EVERYTHING! And there aren''t even any recovery tools?As has been described many times over the past few years, there is a manual procedure.> HANG YOUR HEADS!!!> Recovery from the same situation is EASY on NTFS. There are piles > of tools out there that will recover the file system, and failing > that, locate and extract data. The key parts of the file system are > stored in multiple locations on the disk just in case. It''s been > this way for over 10 years.ZFS also has redundant metadata written at different places on the disk. ZFS, like NTFS, issues cache flush requests with the expectation that the disk honors that request.> I''d say it seems from this thread that my data is a lot safer on > NTFS than it is on ZFS!Nope. NTFS doesn''t know when data is corrupted. Until it does, it is blissfully ignorant.> > I can''t believe my eyes as I read all these responses blaming system > engineering and hiding behind ECC memory excuses and "well, you > know, ZFS is intended for more Professional systems and not consumer > devices, etc etc." My goodness! You DO realize that Sun has this > website called opensolaris.org which actually proposes to have > people use ZFS on commodity hardware, don''t you? I don''t see a huge > warning on that site saying "ATTENTION: YOU PROBABLY WILL LOSE ALL > YOUR DATA".You probably won''t lose all of your data. Statistically speaking, there are very few people who have seen this. There are many more cases where ZFS detected and repaired corruption.> I recently flirted with putting several large Unified Storage 7000 > systems on our corporate network. The hype about ZFS is quite > compelling and I had positive experience in my lab setting. But > because of not having Solaris capability on our staff we went in > another direction instead.Interesting. The 7000 systems completely shield you from the underlying OS. You administer the system via a web browser interface. There is no OS to learn with these systems, just like you don''t go around requiring Darwin knowledge to use your iPhone.> Reading this thread, I''m SO glad we didn''t put ZFS in production in > ANY way. Guys, this is the real world. Stuff happens. It doesn''t > matter what the reason is - hardware lying about cache commits, out- > of-order commits, failure to use ECC memory, whatever. It is > ABSOLUTELY unacceptable for the filesystem to be entirely lost. No > excuse or rationalization of any type can be justified. There MUST > be at least the base suite of tools to deal with this stuff. > without it, ZFS simply isn''t ready yet.At the risk of being redundant, redundant there is a procedure. 
The fine folks at Sun, like Victor Latushkin, have helped people recover such pools, as has been pointed out in this thread several times. This is not the sort of procedure easily done over an open forum; it is more efficient to recover via a service call.

Microsoft talks about NTFS in Windows 2008[*] as, "Self-healing NTFS preserves as much data as possible, based on the type of corruption detected." Regarding catastrophic failures they note, "Self-healing NTFS accepts the mount request, but if the volume is known to have some form of corruption, a repair is initiated immediately. The exception to this would be a catastrophic failure that requires an offline recovery method, such as manual recovery, to minimize the loss of data."

Do you consider that any different than the current state of ZFS?

[*] http://technet.microsoft.com/en-us/library/cc771388(WS.10).aspx
-- richard
Brian
2009-Jul-31 23:26 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
I must say this thread has also damaged the view I have of ZFS. I've been considering just getting a RAID-5 controller and going the Linux route I had planned on.
--
This message posted from opensolaris.org
David Magda
2009-Jul-31 23:50 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Jul 31, 2009, at 19:26, Brian wrote:
> I must say this thread has also damaged the view I have of ZFS. I've
> been considering just getting a RAID-5 controller and going the
> Linux route I had planned on.

It's your data, and you are responsible for it. So this thread, if nothing else, allows you to make an informed decision.

I think that where most other file systems don't detect, or simply ignore, the corner cases that have always existed (cf. CERN's data integrity study), ZFS brings them to light. To some extent it's a matter of updating some of the available tools so that ZFS can recover from some of these cases in a more graceful fashion.

It should also be noted, though, that nobody notices when things go right. :) There are people who have been running ZFS on humongous pools for a while. It's just that we always have the worst-case scenarios showing up on the list. :)
Bob Friesenhahn
2009-Jul-31 23:54 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Fri, 31 Jul 2009, Brian wrote:
> I must say this thread has also damaged the view I have of ZFS.
> I've been considering just getting a RAID-5 controller and going the
> Linux route I had planned on.

Thankfully, the ZFS users who have never lost a pool do not spend much time posting about their excitement at never losing a pool. Otherwise this list would be even more overwhelming.

I have not yet lost a pool, and this includes the one built on USB drives which might be ignoring cache sync requests.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jason A. Hoffman
2009-Aug-01 00:00 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Jul 31, 2009, at 4:54 PM, Bob Friesenhahn wrote:
> On Fri, 31 Jul 2009, Brian wrote:
>
>> I must say this thread has also damaged the view I have of ZFS. I've
>> been considering just getting a RAID-5 controller and going the
>> Linux route I had planned on.
>
> Thankfully, the ZFS users who have never lost a pool do not spend
> much time posting about their excitement at never losing a pool.
> Otherwise this list would be even more overwhelming.
>
> I have not yet lost a pool, and this includes the one built on USB
> drives which might be ignoring cache sync requests.

I have thousands and thousands and thousands of zpools. I started collecting such zpools back in 2005. None have been lost.

Best regards, Jason

------------------------------------------------------------
Jason A. Hoffman, PhD | Founder, CTO, Joyent Inc.
jason at joyent.com    http://joyent.com/
mobile: +1-415-279-6196
------------------------------------------------------------
David Magda
2009-Aug-01 00:11 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Jul 31, 2009, at 20:00, Jason A. Hoffman wrote:
> On Jul 31, 2009, at 4:54 PM, Bob Friesenhahn wrote:
>> On Fri, 31 Jul 2009, Brian wrote:
>>
>>> I must say this thread has also damaged the view I have of ZFS.
>>> I've been considering just getting a RAID-5 controller and going
>>> the Linux route I had planned on.
>>
>> Thankfully, the ZFS users who have never lost a pool do not spend
>> much time posting about their excitement at never losing a pool.
>> Otherwise this list would be even more overwhelming.
>>
>> I have not yet lost a pool, and this includes the one built on USB
>> drives which might be ignoring cache sync requests.
>
> I have thousands and thousands and thousands of zpools. I started
> collecting such zpools back in 2005. None have been lost.

Also a reminder that on-disk redundancy (RAID-5, 6, Z, etc.) is no substitute for backups. Your controller (or software RAID) can hose data in many circumstances as well. CERN's study revealed a bug in WD disk firmware (fixed in a later version), interacting with their 3Ware controllers, that caused 80% of the errors they experienced.
Toby Thain
2009-Aug-01 00:21 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On 31-Jul-09, at 7:15 PM, Richard Elling wrote:> wow, talk about a knee jerk reaction... > > On Jul 31, 2009, at 3:23 PM, Dave Stubbs wrote: > >>> I don''t mean to be offensive Russel, but if you do >>> ever return to ZFS, please promise me that you will >>> never, ever, EVER run it virtualized on top of NTFS >>> (a.k.a. worst file system ever) in a production >>> environment. Microsoft Windows is a horribly >>> unreliable operating system in situations where >>> things like protecting against data corruption are >>> important. Microsoft knows this >> >> Oh WOW! Whether or not our friend Russel virtualized on top of >> NTFS (he didn''t - he used raw disk access) this point is amazing! > > This point doesn''t matter. VB sits between the guest OS and the raw > disk and > drops cache flush requests. > >> System5 - based on this thread I''d say you can''t really make this >> claim at all. Solaris suffered a crash and the ZFS filesystem >> lost EVERYTHING! And there aren''t even any recovery tools? > > As has been described many times over the past few years, there is > a manual > procedure. > >> HANG YOUR HEADS!!! > >> Recovery from the same situation is EASY on NTFS. There are piles >> of tools out there that will recover the file system, and failing >> that, locate and extract data. The key parts of the file system >> are stored in multiple locations on the disk just in case. It''s >> been this way for over 10 years. > > ZFS also has redundant metadata written at different places on the > disk. > ZFS, like NTFS, issues cache flush requests with the expectation that > the disk honors that request.Can anyone name a widely used transactional or journaled filesystem or RDBMS that *doesn''t* need working barriers?> >> I''d say it seems from this thread that my data is a lot safer on >> NTFS than it is on ZFS! > > Nope. NTFS doesn''t know when data is corrupted. Until it does, it is > blissfully ignorant.People still choose systems that don''t even know which side of a mirror is good. Do they ever wonder what happens when you turn off a busy RAID-1? Or why checksumming and COW make a difference? This thread hasn''t shaken my preference for ZFS at all; just about everything else out there relies on nothing more than dumb luck to maintain integrity. --Toby> >> >> I can''t believe my eyes as I read all these responses blaming >> system engineering and hiding behind ECC memory excuses and "well, >> you know, ZFS is intended for more Professional systems and not >> consumer devices, etc etc." My goodness! You DO realize that Sun >> has this website called opensolaris.org which actually proposes to >> have people use ZFS on commodity hardware, don''t you? I don''t see >> a huge warning on that site saying "ATTENTION: YOU PROBABLY WILL >> LOSE ALL YOUR DATA". > > You probably won''t lose all of your data. Statistically speaking, > there > are very few people who have seen this. There are many more cases > where ZFS detected and repaired corruption. > ... > -- richard > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Ian Collins
2009-Aug-01 00:23 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
Brian wrote:
> I must say this thread has also damaged the view I have of ZFS. I've been considering just getting a RAID-5 controller and going the Linux route I had planned on.

That'll be your loss. I've never managed to lose a pool, and I've all sorts of unreliable media and all sorts of nasty ways to break them!

Whatever you choose, don't forget to back up your data.

--
Ian.
Adam Sherman
2009-Aug-01 00:58 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On 31-Jul-09, at 20:00, Jason A. Hoffman wrote:
> I have thousands and thousands and thousands of zpools. I started
> collecting such zpools back in 2005. None have been lost.
>
> Best regards, Jason
>
> ------------------------------------------------------------
> Jason A. Hoffman, PhD | Founder, CTO, Joyent Inc.

I believe I have about a TB of data on at least one of Jason's pools and it seems to still be around. ;)

A.

--
Adam Sherman
CTO, Versature Corp.
Tel: +1.877.498.3772 x113
Great to hear a few success stories! We have been experimentally running ZFS on really crappy hardware and it has never lost a pool. Running on VB with ZFS/iSCSI raw disks we have yet to see any errors at all. On sun4u with LSI SAS/SATA it is really rock solid. And we've been going out of our way to break it, because of bad experiences with NTFS, ext2 and UFS as well as many disk failures (ever had fsck run amok?).

On 07/31/09 12:11 PM, Richard Elling wrote:
> Making flush be a nop destroys the ability to check for errors
> thus breaking the trust between ZFS and the data on medium.
> -- richard

Can you comment on the issue that the underlying disks were, as far as we know, never powered down? My understanding is that disks usually try to flush their caches as quickly as possible to make room for more data, so in this scenario things were probably quiet after the guest crash, and whatever was in the cache would likely have been flushed anyway, certainly by the time the OP restarted VB and the guest.

Could you also comment on CR 6667683, which I believe is proposed as a solution for recovery in this very rare case?

I understand that the ZILs are allocated out of the general pool. Is there a ZIL for the ZILs, or does this make no sense?

As the one who started the whole ECC discussion, I don't think anyone has ever claimed that lack of ECC caused this loss of a pool, or that it could. AFAIK lack of ECC can't be a problem at all on RAIDZ vdevs, only with single drives or plain mirrors. I've suggested an RFE for the mirrored case to double-buffer the writes, but disabling checksums pretty much fixes the problem if you don't have ECC, so it isn't worth pursuing. You can disable checksums per file system, so this is an elegant solution if you don't have ECC memory but you do mirror. No mirror IMO is suicidal with any file system.

Has anyone ever actually lost a pool on Sun hardware other than by losing too many replicas or operator error? As you have so eloquently pointed out, building a reliable storage system is an engineering problem. There are a lot of folks out there who are very happy with ZFS on decent hardware. On crappy hardware you get what you pay for...

Cheers -- Frank (happy ZFS evangelist)
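For reference, both of the knobs Frank mentions are ordinary per-dataset properties; a minimal sketch, with made-up pool and filesystem names:

    # turn checksums off for one file system only (generally not recommended,
    # but this is the per-dataset switch being discussed)
    zfs set checksum=off tank/scratch

    # going the other way, keep extra copies of each data block for a dataset
    # you care about, even on a single-device pool
    zfs set copies=2 tank/photos

    # confirm the settings
    zfs get checksum,copies tank/scratch tank/photos

Both properties only affect data written after the change, not blocks already on disk.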
> I understand
> that the ZILs are allocated out of the general pool.

There is one intent log chain per dataset (file system or zvol). The head of each log is kept in the main pool. Without slog(s) we allocate (and chain) blocks from the main pool. If separate intent log(s) exist then blocks are allocated and chained there. If we fail to allocate from the slog(s) then we revert to allocating from the main pool.

> Is there a ZIL for the ZILs, or does this make no sense?

There is no ZIL for the ZILs. Note the ZIL is not a journal (like ext3 or UFS logging). It simply contains records of system calls (including data) that need to be replayed if the system crashes before those records have been committed in a transaction group.

Hope that helps: Neil.
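To make the slog case concrete: a separate intent log is just another vdev type added to the pool. A minimal sketch, with the extra device names assumed:

    # add a dedicated (ideally low-latency) device as a separate intent log
    zpool add array1 log c9t5d0

    # or a mirrored slog, so losing one log device doesn't lose recent
    # synchronous writes
    zpool add array1 log mirror c9t5d0 c9t6d0

    # the log vdev then shows up alongside the raidz vdev
    zpool status array1

Only synchronous writes go through the intent log; the regular transaction group commits still flow to the main pool vdevs.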
Bryan Allen
2009-Aug-01 02:41 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
+------------------------------------------------------------------------------ | On 2009-07-31 17:00:54, Jason A. Hoffman wrote: | | I have thousands and thousands and thousands of zpools. I started | collecting such zpools back in 2005. None have been lost. I don''t have thousands and thousands of zpools, but I do have more than would fit in a breadbox. And bigger, too. ZFS: Verifying, cuddling and wrangling my employer''s business critical data since 2007. (No bits were harmed in the production of this storage network.) (No, really. We validated their checksums.) -- bda cyberpunk is dead. long live cyberpunk.
On Fri, Jul 31, 2009 at 7:58 PM, Frank Middleton <f.middleton at apogeect.com> wrote:
> Has anyone ever actually lost a pool on Sun hardware other than
> by losing too many replicas or operator error? As you have so

Yes, I have lost a pool when running on Sun hardware.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-September/013233.html

Quite likely related to:

http://bugs.opensolaris.org/view_bug.do?bug_id=6684721

In other words, it was a buggy Sun component that didn't do the right thing with cache flushes.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Ross
2009-Aug-01 05:39 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
> wow, talk about a knee jerk reaction...

Not at all. A long thread is started where the user lost his pool, and the discussion shows it's a known problem. I love ZFS and I'm still very nervous about the risk of losing an entire pool.

> As has been described many times over the past few
> years, there is a manual procedure.

Yes, but there are a few issues with this:

1. The OP doesn't seem to have been able to get anybody to help him recover his pool. The natural assumption reading a thread like this is that ZFS pool corruption happens, and you lose your data.
2. While the procedure may have been mentioned, I've never seen a link to official documentation on it.
3. My understanding from reading Victor's threads (although I may be wrong) is that this recovery takes a significant amount of time.

> You probably won't lose all of your data. Statistically speaking, there
> are very few people who have seen this. There are many more cases
> where ZFS detected and repaired corruption.

Yes, but statistics don't matter when emotions come into play, and I'm afraid something like this is going to scare off a lot of people who read about it. It might be rare, but people don't think like that. Why do you think so many play the lottery ;-)

The other point is that system admins like to have control over their own data. It's their job on the line if things go wrong, and if they see a major problem like this without an obvious solution, and one they would have very little control over if it happens, they're going to get very nervous about implementing it.

From a psychological point of view, this issue is very damaging to ZFS. On the flip side, once the recovery tool is available, this will turn into a good positive for ZFS. I don't believe I've heard of any other bug that causes complete loss of the pool, so with a recovery tool, ZFS should have an enviable ability to safeguard data.

--
This message posted from opensolaris.org
Scott Lawson
2009-Aug-01 07:11 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
Dave Stubbs wrote:>> I don''t mean to be offensive Russel, but if you do >> ever return to ZFS, please promise me that you will >> never, ever, EVER run it virtualized on top of NTFS >> (a.k.a. worst file system ever) in a production >> environment. Microsoft Windows is a horribly >> unreliable operating system in situations where >> things like protecting against data corruption are >> important. Microsoft knows this >> > > Oh WOW! Whether or not our friend Russel virtualized on top of NTFS (he didn''t - he used raw disk access) this point is amazing! System5 - based on this thread I''d say you can''t really make this claim at all. Solaris suffered a crash and the ZFS filesystem lost EVERYTHING! And there aren''t even any recovery tools? > > HANG YOUR HEADS!!! > > Recovery from the same situation is EASY on NTFS. There are piles of tools out there that will recover the file system, and failing that, locate and extract data. The key parts of the file system are stored in multiple locations on the dYou mean the data that you don''t know you have lost yet? ZFS allows you to be very paranoid about data protection with things like copies=2,3,4 etc etc..> isk just in case. It''s been this way for over 10 years. I''d say it seems from this thread that my data is a lot safer on NTFS than it is on ZFS! > > I can''t believe my eyes as I read all these responses blaming system engineering and hiding behind ECC memory excuses and "well, you know, ZFS is intended for more Professional systems and not consumer devices, etc etc." My goodness! You DO realize that Sun has this website called opensolaris.org which actually proposes to have people use ZFS on commodity hardware, don''t you? I don''t see a huge warning on that site saying "ATTENTION: YOU PROBABLY WILL LOSE ALL YOUR DATA". > > I recently flirted with putting several large Unified Storage 7000 systems on our corporate network. The hype about ZFS is quite compelling and I had positive experience in my lab setting. But because of not having Solaris capability on our staff we went in another direction instead. >You do realize that the 7000 series machines are appliances and have no prerequisite for you to have any Solaris knowledge whatsoever? They are a supported device just like any other disk storage system that you can purchase from any vendor and have it supported as such. To use it all you need is a web browser. Thats it. This is no different than your EMC array or HP Storageworks hardware, except that the under pinnings of the storage system are there for all to see in the form of open source code contributed to the community by Sun.> Reading this thread, I''m SO glad we didn''t put ZFS in production in ANY way. Guys, this is the real world. Stuff happens. It doesn''t matter what the reason is - hardware lying about cache commits, out-of-order commits, failure to use ECC memory, whatever. It is ABSOLUTELY unacceptable for the filesystem to be entirely lost. No excuse or rationalization of any type can be justified. There MUST be at least the base suite of tools to deal with this stuff. without it, ZFS simply isn''t ready yet. >Sounds like you have no real world experience of ZFS in production environments and it''s true reliability. As many people here report there are thousands if not millions of zpools out there containing business critical environments that are happily fixing broken hardware on a daily basis. I have personally seen all sorts of pieces of hardware break and ZFS corrected and fixed things for me. 
I personally manage 50 plus ZFS zpools that are anywhere from 100GB to 30 TB. Works very, very, very well for me. I have never lost anything despite having had plenty of pieces of hardware break in some form underneath ZFS.> I am saving a copy of this thread to show my colleagues and also those Sun Microsystems sales people that keep calling. >
Brian
2009-Aug-01 09:04 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
> On Fri, 31 Jul 2009, Brian wrote:
>
> > I must say this thread has also damaged the view I have of ZFS.
> > I've been considering just getting a RAID-5 controller and going the
> > Linux route I had planned on.
>
> Thankfully, the ZFS users who have never lost a pool do not spend much
> time posting about their excitement at never losing a pool.
> Otherwise this list would be even more overwhelming.
>
> I have not yet lost a pool, and this includes the one built on USB
> drives which might be ignoring cache sync requests.
>
> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Yes, you are right, I spoke irrationally. I still intend to try it out at least for a period of time to see what I think; I'll put it through the standard tests and such. However, I am having trouble getting my motherboard to recognize 4 of the hard drives I picked (I made a post about it in the storage forum). Once that's finished I'll get this testing underway.

--
This message posted from opensolaris.org
Germano Caronni
2009-Aug-02 22:07 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Have you considered this? *Maybe* a little time travel to an old uberblock could help you? http://www.opensolaris.org/jive/thread.jspa?threadID=85794 -- This message posted from opensolaris.org
Stephen Pflaum
2009-Aug-02 23:44 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
George, I have a pool with family photos on it which needs recovery. Is there a livecd with a tool to invalidate the uberblock which will boot on a macbookpro? Steve -- This message posted from opensolaris.org
Victor Latushkin
2009-Aug-03 18:00 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On 03.08.09 03:44, Stephen Pflaum wrote:
> George,
>
> I have a pool with family photos on it which needs recovery. Is there a livecd with a tool to invalidate the uberblock which will boot on a macbookpro?

This has been recovered by rolling two txgs back. The pool is being scrubbed now. More details and some helpful hints later.

Victor
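For readers who find this thread later: the manual txg rollback Victor describes was subsequently productized as a recovery option on pool import. A rough sketch, assuming a ZFS build recent enough to have it (check your zpool(1M) man page first):

    # try a normal import first
    zpool import array1

    # if the pool is reported as corrupted, ask for recovery mode, which may
    # discard the last few transactions and rewind to an older valid uberblock
    zpool import -F array1

    # dry run: report whether recovery would succeed and roughly what would be
    # lost, without actually rewinding anything
    zpool import -nF array1

On the 2009.06 bits in this thread that option does not exist, which is why the rollback had to be done by hand.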
Roch Bourbonnais
2009-Aug-04 13:00 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 19 Jul 2009, at 16:47, Bob Friesenhahn wrote:
> On Sun, 19 Jul 2009, Ross wrote:
>
>> The success of any ZFS implementation is *very* dependent on the
>> hardware you choose to run it on.
>
> To clarify:
>
> "The success of any filesystem implementation is *very* dependent on
> the hardware you choose to run it on."
>
> ZFS requires that the hardware cache sync works and is respected.

Yes.

> Without taking advantage of the drive caches, zfs would be
> considerably less performant.

That, I'm not so sure about. When ZFS first came out, most pools were built on Thumpers with a SATA device driver that did not handle NCQ concurrency. Enabling the write cache on a drive was a necessary way to have the drive firmware handle multiple requests with small service times. Today we've got better device drivers, but we've stopped comparing performance data with the disk write caches on and off. The delta today might be a lot smaller than it used to be (and even less noticeable if one uses a slog on SSD).

-r

> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Roch Bourbonnais
2009-Aug-04 13:28 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26 Jul 2009, at 01:34, Toby Thain wrote:
>
> On 25-Jul-09, at 3:32 PM, Frank Middleton wrote:
>
>> On 07/25/09 02:50 PM, David Magda wrote:
>>
>>> Yes, it can be affected. If the snapshot's data structure / record is
>>> underneath the corrupted data in the tree then it won't be able to be
>>> reached.
>>
>> Can you comment on if/how mirroring or raidz mitigates this, or tree
>> corruption in general? I have yet to lose a pool even on a machine
>> with fairly pathological problems, but it is mirrored (and copies=2).
>>
>> I was also wondering if you could explain why the ZIL can't
>> repair such damage.
>>
>> Finally, a number of posters blamed VB for ignoring a flush, but
>> according to the evil tuning guide, without any application syncs,
>> ZFS may wait up to 5 seconds before issuing a synch, and there
>> must be all kinds of failure modes even on bare hardware where
>> it never gets a chance to do one at shutdown. This is interesting
>> if you do ZFS over iscsi because of the possibility of someone
>> tripping over a patch cord or a router blowing a fuse. Doesn't
>> this mean /any/ hardware might have this problem, albeit with much
>> lower probability?
>
> The problem is assumed *ordering*. In this respect VB ignoring
> flushes and real hardware are not going to behave the same.
>
> --Toby

I agree that no one should be ignoring cache flushes. However, the path to corruption must involve some dropped acknowledged I/Os: the uberblock I/O was issued to stable storage, but the blocks it pointed to, which had reached the disk firmware earlier, never made it to stable storage. I can see this scenario when the disk loses power, but I don't see it with cutting power to the guest.

When managing a zpool on external storage, do people export the pool before taking snapshots of the guest?

-r
Toby Thain
2009-Aug-04 16:33 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 4-Aug-09, at 9:28 AM, Roch Bourbonnais wrote:> > Le 26 juil. 09 ? 01:34, Toby Thain a ?crit : > >> >> On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: >> >>> On 07/25/09 02:50 PM, David Magda wrote: >>> >>>> Yes, it can be affected. If the snapshot''s data structure / >>>> record is >>>> underneath the corrupted data in the tree then it won''t be able >>>> to be >>>> reached. >>> >>> Can you comment on if/how mirroring or raidz mitigates this, or tree >>> corruption in general? I have yet to lose a pool even on a machine >>> with fairly pathological problems, but it is mirrored (and >>> copies=2). >>> >>> I was also wondering if you could explain why the ZIL can''t >>> repair such damage. >>> >>> Finally, a number of posters blamed VB for ignoring a flush, but >>> according to the evil tuning guide, without any application syncs, >>> ZFS may wait up to 5 seconds before issuing a synch,^^ of course this can never cause inconsistency. The issue under discussion is inconsistency - unexpected corruption of on-disk structures.>>> and there >>> must be all kinds of failure modes even on bare hardware where >>> it never gets a chance to do one at shutdown. This is interesting >>> if you do ZFS over iscsi because of the possibility of someone >>> tripping over a patch cord or a router blowing a fuse. Doesn''t >>> this mean /any/ hardware might have this problem, albeit with much >>> lower probability? >> >> The problem is assumed *ordering*. In this respect VB ignoring >> flushes and real hardware are not going to behave the same. >> >> --Toby > > I agree that noone should be ignoring cache flushes. However the > path to corruption must involve > some dropped acknowledged I/Os. The ueberblock I/O was issued to > stable storage but the blocks it pointed to, which had reached the > disk firmware earlier, > never make it to stable storage. I can see this scenerio when the > disk looses powerOr if the host O/S crashes. All this applies to virtual IDE devices alone, of course. iSCSI is a different case entirely as presumably flushes/barriers are processed normally.> but I don''t see it with cutting power to the guest.Right, in this case it''s unlikely or nearly impossible. --Toby> > When managing a zpool on external storage, do people export the > pool before taking snapshots of the guest ? > > -r > > >> >>> >>> Thanks >>> >>> _______________________________________________ >>> zfs-discuss mailing list >>> zfs-discuss at opensolaris.org >>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
So much for the "it's a consumer hardware problem" argument.

I for one gotta count it as a major drawback of ZFS that it doesn't provide you a mechanism to get something of your pool back, by reconstruction or reversion, when a failure leaves the metadata inconsistent.

A policy of data integrity taken to the extreme of blocking access to good data is not something OS users want. Users don't put up with this sort of thing from other filesystems... some sort of improvement here is sorely needed.

ZFS ought to retain enough information, and make an effort, to bring pool metadata back to a consistent state even if that means some loss of data: a file may have to revert to an older state, or a file that was undergoing changes may end up unreadable because the log was inconsistent. Even if the user has to run zpool import with a recovery-mode option or something of that nature, it beats losing a TB of data on a pool that should otherwise be intact.

--
This message posted from opensolaris.org
From what I understand, and from everything I've read by following threads here, there are ways to do it, but there is not a standardized tool yet; it's complicated and handled on a per-case basis, but people who pay for support have recovered pools. I'm sure they are working on it, and I would imagine it would be a major goal.

On Wed, Aug 5, 2009 at 1:23 AM, James Hess <no-reply at opensolaris.org> wrote:
> So much for the "it's a consumer hardware problem" argument.
>
> I for one gotta count it as a major drawback of ZFS that it doesn't provide
> you a mechanism to get something of your pool back, by reconstruction or
> reversion, when a failure leaves the metadata inconsistent.
>
> A policy of data integrity taken to the extreme of blocking access to good
> data is not something OS users want. Users don't put up with this sort of
> thing from other filesystems... some sort of improvement here is sorely
> needed.
>
> ZFS ought to retain enough information, and make an effort, to bring pool
> metadata back to a consistent state even if that means some loss of data:
> a file may have to revert to an older state, or a file that was undergoing
> changes may end up unreadable because the log was inconsistent. Even if
> the user has to run zpool import with a recovery-mode option or something
> of that nature, it beats losing a TB of data on a pool that should
> otherwise be intact.
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss