Russel
2009-Jul-19 01:12 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Well, I have a 10TB (5x2TB) RAIDZ on VirtualBox. Got it all working on Windows XP and Windows 7, with SMB shares back to my PC - great, managed the impossible! Copied all my data over from a load of old external disks and sorted it; all in all 15 days' work (my holiday :-)). Used raw disks in VirtualBox, so performance was quite OK. Then OpenSolaris (2009.06) crashed as I tried to shut it down, and in the end I had to power off the VirtualBox VM. Rebooted, and I then get this:

# zpool status
  pool: array1
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        array1        FAULTED      0     0     1  corrupted data
          raidz1      ONLINE       0     0     6
            c9t0d0s0  ONLINE       0     0     0
            c9t1d0s0  ONLINE       0     0     0
            c9t2d0s0  ONLINE       0     0     0
            c9t3d0s0  ONLINE       0     0     0
            c9t4d0s0  ONLINE       0     0     0

  pool: rpool
 state: ONLINE
status: The pool is formatted using an older on-disk format. The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the
        pool will no longer be accessible on older software versions.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c7d0s0    ONLINE       0     0     0

errors: No known data errors
root@storage1:/rpool/rtsmb/lost#

==============================
After 9 hours of reading many blogs and postings I am about to give up. Here's some output that may hopefully allow someone to help me (Victor?)
==============================

# zdb -u array1
zdb: can't open array1: I/O error

# zdb -l /dev/dsk/c9t0d0s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=14  name='array1'  state=0  txg=336051
    pool_guid=2240875695356292882  hostid=881445  hostname='storage1'
    top_guid=2550252815929083498  guid=1431843495093629813
    vdev_tree: type='raidz'  id=0  guid=2550252815929083498  nparity=1
        metaslab_array=23  metaslab_shift=36  ashift=9
        asize=9901403013120  is_log=0
    children[0]: type='disk'  id=0  guid=1431843495093629813  path='/dev/dsk/c9t0d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VB90d7ae97-6e68097a/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@0,0:a'  whole_disk=0  DTL=44
    children[1]: type='disk'  id=1  guid=1558447330187786228  path='/dev/dsk/c9t1d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VB315f2939-fdadfa14/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@1,0:a'  whole_disk=0  DTL=43
    children[2]: type='disk'  id=2  guid=10659506225279255914  path='/dev/dsk/c9t2d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VBd9514af5-8837e2f7/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@2,0:a'  whole_disk=0  DTL=42  degraded=1
    children[3]: type='disk'  id=3  guid=2558128054346170575  path='/dev/dsk/c9t3d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VBab7f62b2-3b162694/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@3,0:a'  whole_disk=0  DTL=41
    children[4]: type='disk'  id=4  guid=13991896528691960894  path='/dev/dsk/c9t4d0s0'
        devid='id1,sd@SATA_____VBOX_HARDDISK____VB67b9775c-3ba02834/a'
        phys_path='/pci@0,0/pci8086,2829@d/disk@4,0:a'  whole_disk=0  DTL=40
--------------------------------------------
LABEL 1, LABEL 2, LABEL 3
--------------------------------------------
    (identical to LABEL 0)

#######################
Did the same for all five disks, all OK, but here's a grep of the txg fields from all of them:

# grep txg t*
t0: txg=336051
t0: txg=336051
t0: txg=336051
t0: txg=336051
t1: txg=319963
t1: txg=319963
t1: txg=319963
t1: txg=319963
t2: txg=336051
t2: txg=336051
t2: txg=336051
t2: txg=336051
t3: txg=319963
t3: txg=319963
t3: txg=319963
t3: txg=319963
t4: txg=319963
t4: txg=319963
t4: txg=319963
t4: txg=319963

#######################
Wrote a little script to go backwards until I found a txg which would give me an uberblock; eventually found this one below:

# zdb -u -t 335425 array1
Uberblock
        magic = 0000000000bab10c
        version = 14
        txg = 335425
        guid_sum = 16544206071174628188
        timestamp = 1247514285 UTC = Mon Jul 13 20:44:45 2009

# date
Sunday, 19 July 2009 01:58:18 BST

==============================================
In fact July 13 is OK; that's when I was last adding files and moving things about, so that's not a bad point to return to....

So how do I manage to "roll back" to txg = 335425 so I can hopefully get my 10TB back? Or is the answer to return to doing HW RAID under NTFS and Windows directly? I heard people at Sun may be working on a tool that can roll back to a txg - is it about? (Yes, I understand the issue that you can't roll back if a freed block has been re-used.) But I'd happily lose one of my 6GB files to get the other 6TB back. Thanks!!

PLEASE PLEASE, any help really appreciated.

Russel
-- 
This message posted from opensolaris.org
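[Not part of the original post: for readers wondering what the "little script" above might look like, here is a minimal, hypothetical sketch that simply asks zdb for an uberblock at successively older txg values until one can be read. The pool name and step size are assumptions, the starting txg is taken from the labels above, and it assumes zdb exits non-zero when it cannot open the pool at the requested txg.]

    #!/bin/sh
    # Hypothetical sketch: walk backwards from the newest txg seen in the labels
    # and stop at the first txg for which zdb can print a valid uberblock.
    POOL=array1
    TXG=336051        # newest txg reported by the labels
    STEP=1            # how far to step back on each attempt

    while [ "$TXG" -gt 0 ]; do
        if zdb -u -t "$TXG" "$POOL" >/dev/null 2>&1; then
            echo "usable uberblock found at txg $TXG"
            zdb -u -t "$TXG" "$POOL"
            break
        fi
        TXG=`expr "$TXG" - "$STEP"`
    done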
Orvar Korvar
2009-Jul-19 01:59 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Sorry to hear that, but you do know that VirtualBox is not really stable? VirtualBox does show some instability from time to time. Haven't you read the VirtualBox forums? I would advise against VirtualBox for keeping all your data in ZFS. I would use OpenSolaris without virtualization. I hope your problem gets fixed, though.
-- 
This message posted from opensolaris.org
Russel
2009-Jul-19 02:39 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Yes, you'll find my name all over VB at the moment, but I have found it to be stable (don't install the guest additions disk for Solaris!!, use 3.0.2, and for me WinXP 32-bit and OpenSolaris 2009.06 has been rock solid). It was (or seems to be) OpenSolaris that failed, with "extract_boot_list doesn't belong to 101", but no one on the OpenSolaris side seems interested in it even though others have reported it too; probably a rare issue.

But yeah, I hope Victor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, ZFS may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away, but I thought that since the Jan 2009 patches things were meant to be a lot better, especially with Sun using it in their storage servers now....
-- 
This message posted from opensolaris.org
Brent Jones
2009-Jul-19 07:00 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sat, Jul 18, 2009 at 7:39 PM, Russel <no-reply at opensolaris.org> wrote:

> Yes, you'll find my name all over VB at the moment, but I have found it to be stable (don't install the guest additions disk for Solaris!!, use 3.0.2, and for me WinXP 32-bit and OpenSolaris 2009.06 has been rock solid). It was (or seems to be) OpenSolaris that failed, with "extract_boot_list doesn't belong to 101", but no one on the OpenSolaris side seems interested in it even though others have reported it too; probably a rare issue.
>
> But yeah, I hope Victor or someone will take a look. My worry is that if we can't recover from this, which a number of people (in various forms) have come across, ZFS may be in trouble. We had this happen at work about 18 months ago and lost all the data (20TB) (didn't know about zdb, nor did Sun support), so we have started to back away, but I thought that since the Jan 2009 patches things were meant to be a lot better, especially with Sun using it in their storage servers now....

No offense, but you trusted 10TB of important data, running in OpenSolaris from inside VirtualBox (not stable) on top of Windows XP (arguably not stable, especially for production) on probably consumer grade hardware with unknown support for any of the above products?

I'd like to say this was an unfortunate circumstance, but there are many levels of fail here, and to blame ZFS seems misplaced, and the subject on this thread especially inflammatory.

-- 
Brent Jones
brent at servuhome.net
Miles Nordin
2009-Jul-19 08:24 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
>>>>> "bj" == Brent Jones <brent at servuhome.net> writes:

    bj> many levels of fail here,

pft.  Virtualbox isn't unstable in any of my experience.  It doesn't by default pass cache flushes from guest to host unless you set

  VBoxManage setextradata VMNAME
    "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0

however OP does not mention the _host_ crashing, so this questionable ``optimization'' should not matter.  Yanking the guest's virtual cord is something ZFS is supposed to tolerate: remember the ``crash-consistent backup'' concept (not to mention the ``always consistent on disk'' claim, but really any filesystem even without that claim should tolerate having the guest's virtual cord yanked, or the guest's kernel crashing, without losing all its contents---the claim only means no time-consuming fsck after reboot).

    bj> to blame ZFS seems misplaced,

-1

The fact that it's a known problem doesn't make it not a problem.

    bj> the subject on this thread especially inflammatory.

so what?
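[Not part of the original message: a minimal sketch of applying the setting above to each LUN, since the "[x]" is a per-LUN index. The VM name and the number of LUNs are assumptions; "piix3ide" is the IDE controller node, and a SATA/AHCI controller reportedly uses "ahci" in its place. Run it while the VM is powered off; the value 0 means flushes are passed through rather than ignored.]

    VM="storage1-vm"                      # placeholder VM name
    for lun in 0 1 2 3; do
        VBoxManage setextradata "$VM" \
            "VBoxInternal/Devices/piix3ide/0/LUN#${lun}/Config/IgnoreFlush" 0
    done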
Markus Kovero
2009-Jul-19 08:35 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
I would be interested in how to roll back to certain txg points in case of disaster; that was what Russel was after anyway.

Yours
Markus Kovero

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Miles Nordin
Sent: 19. heinäkuuta 2009 11:24
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
dick hoogendijk
2009-Jul-19 08:46 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 00:00:06 -0700 Brent Jones <brent at servuhome.net> wrote:

> No offense, but you trusted 10TB of important data, running in OpenSolaris from inside VirtualBox (not stable) on top of Windows XP (arguably not stable, especially for production) on probably consumer grade hardware with unknown support for any of the above products?

Running this kind of setup absolutely can give you NO guarantees at all. Virtualisation, OSOL/ZFS on WinXP. It's nice to play with and see it "working", but would I TRUST precious data to it? No way!

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118
+ All that's really worth doing is what we do for others (Lewis Carrol)
Ross
2009-Jul-19 08:48 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
While I agree with Brent, I think this is something that should be stressed in the ZFS documentation. Those of us with long-term experience of ZFS know that it's really designed to work with hardware meeting quite specific requirements. Unfortunately, that isn't documented anywhere, and more and more people are being bitten by quite severe data loss by virtue of the fact that ZFS is far less forgiving than other filesystems when data hasn't been properly written to disk.

As far as I can see, the ZFS Administration Guide is sorely lacking in any warning that you are risking data loss if you run on consumer grade hardware. In fact, the requirements section states nothing more than:

"ZFS Hardware and Software Requirements and Recommendations

Make sure you review the following hardware and software requirements and recommendations before attempting to use the ZFS software:

 * A SPARC® or x86 system that is running the Solaris 10 6/06 release or later release.
 * The minimum disk size is 128 Mbytes. The minimum amount of disk space required for a storage pool is approximately 64 Mbytes.
 * Currently, the minimum amount of memory recommended to install a Solaris system is 768 Mbytes. However, for good ZFS performance, at least one Gbyte or more of memory is recommended.
 * If you create a mirrored disk configuration, multiple controllers are recommended."
-- 
This message posted from opensolaris.org
dick hoogendijk
2009-Jul-19 09:00 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009 01:48:40 PDT Ross <no-reply at opensolaris.org> wrote:

> As far as I can see, the ZFS Administration Guide is sorely lacking in any warning that you are risking data loss if you run on consumer grade hardware.

And yet, ZFS is not only for NON-consumer grade hardware, is it? The fact that many, many people run "normal" consumer hardware does not rule them out for ZFS, does it? The "best filesystem ever", the "end of all other filesystems" would be nothing more than a dream if that was true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware (i.e. non-ECC memory should work fine!) / mirroring is a -must-!

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / OpenSolaris 2010.02 B118
+ All that's really worth doing is what we do for others (Lewis Carrol)
Russel
2009-Jul-19 11:12 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Guys, guys, please chill...

First, thanks for the info about the VirtualBox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)), as this was not clear to me. I use VB like others use VMware etc. to run Solaris because it's the ONLY way I can, as I can't get the drivers for most of the H/W out there in hobby land, so a virtualised system allows me to use my Silicon Image chipset to link at 3Gb/s to the SATA multi-port array I got for £160.

=== anyway, let's stop there or we will be off topic even more
=== just thought you should know why I did it; even BSD does
=== not have a driver or I would have gone there to get ZFS :-)

Anyway... my view on ZFS was quite simple: it looked after bit rot, it did self healing, and, most importantly for me as I'm running it on consumer kit, it seemed to avoid the RAID-5 write hole in the case of a crash! So if stuff falls over, e.g. Windows, VB, OpenSolaris etc., I would not suffer unknown data corruption and would just lose that one write, which was fine as the thing crashed..... so for a flaky environment ZFS sounds even more like the one you want, LOL.

Loved all the technical stuff; I have had rather good deep dives from Sun's best here in the UK/Europe (I'm lucky, as I was a very early employee of Sun, and now work for a major firm :-)). Liked the idea that you can build your own storage server etc. etc. I knew most bugs, as I saw them, were fixed in the Jan 09 patch.....

I THOUGHT/ASSUMED (yes, you should never :-()) that, given everything else, it would be blatantly obvious that when you try to mount a zpool the thing would either roll back to the last consistent state (that includes the uberblock and metadata, thank you) or have a tool like fsck which lets you do it. BUT, you know, once you start rolling back (just like clearing inodes) you're not going to be in such a good place and you'd need to scrub or something, even if it says these files are now corrupt. FINE, but then I DIDN'T lose the filesystem, just a file or two. We should never lose the filesystem. But in ZFS land it sounds like that is the most likely data-loss fault we have.

SUMMARY
=======
What I see here is the lack of the (not needed, lol) fsck-type tool. WELL, WE DO NEED it; we need to be able to roll back and recover and repair. I have lost data stored on large Sun 6790 arrays and now on my home system.

So PLEASE, anyone got a beta version of a tool to perform roll back?

Russel
(It will take me 10 days to pull my data off my little drives again, and 5 days to format with RAID-5 (H/W) and NTFS, not what I want, nor the RAID-5 hole :-))
-- 
This message posted from opensolaris.org
Ross
2009-Jul-19 12:55 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
From the experience myself and others have had, and Sun's approach with their Amber Road storage (FISHWORKS - fully integrated *hardware* and software), my feeling is very much that ZFS was designed by Sun to run on Sun's own hardware, and as such, they were able to make certain assumptions in their design.

ZFS was never designed to run on consumer hardware; it makes assumptions that devices and drivers will always be well behaved when errors occur, and in general it is quite fragile if you're running it on the wrong system. On the right hardware, I've no doubt that ZFS is incredibly reliable and easy to manage. On the wrong hardware, disk errors can hang your entire system, hot swap can down your pool, and a power cut or other error can render your entire pool inaccessible.

The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on.
-- 
This message posted from opensolaris.org
Ross
2009-Jul-19 13:05 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Heh, yes, I assumed similar things Russel. I also assumed that a faulty disk in a raid-z set wouldn't hang my entire pool indefinitely, that hot plugging a drive wouldn't reboot Solaris, and that my pool would continue working after I disconnected one half of an iSCSI mirror. I also, like yourself, assumed that if ZFS is using copy on write, then even after a really nasty crash the vast majority of my data would be accessible. And I also believed that when I had disconnected every drive from a ZFS pool, ZFS wouldn't accept writes to it any more...

Unfortunately, all of these assumptions turned out to be false. Learning ZFS has been a painful experience. I still like it, but I am very aware of its limitations, and am cautious in how I apply it these days.
-- 
This message posted from opensolaris.org
It's clear from some threads on this list that it IS possible to roll back a zpool to a previous state, and I seem to even remember reading that someone was working on a tool or tools in that direction. Is that correct - is it possible to manually roll back a zpool for crash recovery purposes, if you've got enough clue/knowledge/experience on your side in regards to the right tools?

And the follow-up question - how do I learn how to do that? Is there documentation somewhere that instructs one on the tools used for low-level ZFS troubleshooting and rollback, and how to come back from the land of crash-then-fail-to-import? Is it the troubleshooting document? I'm happy to get a RTFM - please just point me at the right manual or Sun doc too :)

Thanks!
Brian
Bob Friesenhahn
2009-Jul-19 14:47 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Ross wrote:

> The success of any ZFS implementation is *very* dependent on the hardware you choose to run it on.

To clarify: "The success of any filesystem implementation is *very* dependent on the hardware you choose to run it on."

ZFS requires that the hardware cache sync works and is respected. Without taking advantage of the drive caches, zfs would be considerably less performant.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
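[Not part of the original message: the cache-flush behaviour Bob describes is also visible as a Solaris kernel tunable. The "evil tuning" setting below disables ZFS's cache flush requests entirely and is shown only to illustrate that the flushes are real and relied upon; setting it on hardware without non-volatile or battery-backed caches risks exactly the kind of loss discussed in this thread.]

    # /etc/system (takes effect after reboot); do NOT do this unless every
    # device in the pool has a non-volatile or battery-backed write cache:
    set zfs:zfs_nocacheflush = 1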
Toby Thain
2009-Jul-19 15:16 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 19-Jul-09, at 7:12 AM, Russel wrote:

> Guys, guys, please chill...
>
> First, thanks for the info about the VirtualBox option to bypass the cache (I don't suppose you can give me a reference for that info? (I'll search the VB site :-)), as this was not clear to me.

I posted about that insane default, six months ago. Obviously ZFS isn't the only subsystem that this breaks.

http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0

> I use VB like others use VMware etc. to run Solaris because it's the ONLY way I can,

Convenience always has a price.

--Toby

...
Ross
2009-Jul-19 15:44 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
That's only one element of it Bob. ZFS also needs devices to fail quickly and in a predictable manner. A consumer grade hard disk could lock up your entire pool as it fails. The kit Sun supplies is more likely to fail in a manner ZFS can cope with.
-- 
This message posted from opensolaris.org
Frank Middleton
2009-Jul-19 21:31 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 07/19/09 05:00 AM, dick hoogendijk wrote:

> (i.e. non-ECC memory should work fine!) / mirroring is a -must-!

Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic):

http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction

"Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory."

That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one per year per user on average. Some get more, some get less. That sounds like pretty bad odds...

"In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that. IMO you'd be nuts to run ZFS on a machine without ECC unless you don't care about losing some or all of the data. Having said that, we have yet to lose an entire pool - this is pretty hard to do! I should add that since setting copies=2 and forcing the files to be copied, there have been no more unrecoverable errors on a particularly low-end machine that was plagued with them even with mirrors (and a UPS with a bad battery :-) ).

On 19-Jul-09, at 7:12 AM, Russel wrote:

> I use VB like others use VMware etc. to run Solaris because it's the ONLY way I can,

Given that PC hardware is so cheap these days (used SPARCs even cheaper), surely it makes far more sense to build a nice robust OSOL/ZFS based file server *with* ECC. Then you can use iSCSI for your VirtualBox VMs and solve all kinds of interesting problems. But you still need to do backups. My solution for that is to replicate the server and back up to it using zfs send/recv. If a disk fails, you switch to the backup and there are no worries about the second disk of the mirror failing during a resilver. A small price to pay for peace of mind.
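[Not part of the original message: a minimal sketch of the copies=2 setup mentioned above; the dataset name is a placeholder. The property only affects blocks written after it is set, which is why existing files have to be rewritten ("forcing the files to be copied") to gain the second copy.]

    # Hypothetical example: keep two copies of every block in this dataset.
    zfs set copies=2 tank/photos

    # Existing files must be rewritten to pick up the extra copy, e.g. crudely
    # (and assuming enough free space for the temporary duplicate):
    cp /tank/photos/album.tar /tank/photos/album.tar.new && \
        mv /tank/photos/album.tar.new /tank/photos/album.tar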
Bob Friesenhahn
2009-Jul-19 21:39 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Frank Middleton wrote:

> Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic):
>
> http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
>
> "Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory."
>
> That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one per year per user on average. Some get more, some get less. That sounds like pretty bad odds...

I fail to see anything zfs-specific in the above. It does not have anything more to do with zfs than it does with any other software running on the system.

I do have a couple of Windows PCs here without ECC, but they were gifts from other people, and not hardware that I purchased, and not used for any critical application.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Miles Nordin
2009-Jul-19 21:45 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
>>>>> "r" == Ross <no-reply at opensolaris.org> writes:
>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:

     r> ZFS was never designed to run on consumer hardware,

this is markedroid garbage, as well as post-facto apologetics.

Don't lower the bar.  Don't blame the victim.

    tt> I posted about that insane default, six months ago.  Obviously
    tt> ZFS isn't the only subsystem that this breaks.

yes, but remember, in this case the host did not crash, so the insane default should be irrelevant.
Richard Elling
2009-Jul-19 22:10 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Frank Middleton wrote:

> On 07/19/09 05:00 AM, dick hoogendijk wrote:
>
>> (i.e. non-ECC memory should work fine!) / mirroring is a -must-!
>
> Yes, mirroring is a must, although it doesn't help much if you have memory errors (see several other threads on this topic):
>
> http://en.wikipedia.org/wiki/Dynamic_random_access_memory#Errors_and_error_correction
>
> "Tests [ecc] give widely varying error rates, but about 10^-12 error/bit·h is typical, roughly one bit error, per month, per gigabyte of memory."
>
> That's roughly 1 per week in 4GB. If 1 error in 50 results in a ZFS hit, that's one per year per user on average. Some get more, some get less. That sounds like pretty bad odds...

Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds. Solaris scrubs memory, with a 12-hour cycle time, so memory does not remain untouched for a month. For high-end systems, memory scrubs are also performed by the memory controllers.

Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-)

> "In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications." Sun doesn't even sell machines without ECC. There's a reason for that.

Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems. If you do your own systems engineering, then add this to your (hopefully long) checklist.
 -- richard
Bob Friesenhahn
2009-Jul-19 23:02 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Miles Nordin wrote:

>>>>>> "r" == Ross <no-reply at opensolaris.org> writes:
>>>>>> "tt" == Toby Thain <toby at telegraphics.com.au> writes:
>
>     r> ZFS was never designed to run on consumer hardware,
>
> this is markedroid garbage, as well as post-facto apologetics.
>
> Don't lower the bar.  Don't blame the victim.

I think that the standard disclaimer "Always use protection" applies here. Victims who do not use protection should assume substantial guilt for their subsequent woes.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gavin Maltby
2009-Jul-20 00:13 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
dick hoogendijk wrote:

> true. Furthermore, much so-called consumer hardware is very good these days. My guess is ZFS should work quite reliably on that hardware. (i.e. non-ECC memory should work fine!) / mirroring is a -must-!

No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.

Gavin
David Magda
2009-Jul-20 00:29 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Jul 19, 2009, at 20:13, Gavin Maltby wrote:

> No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.

Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules.

Come on.

It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.
Bob Friesenhahn
2009-Jul-20 00:41 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, David Magda wrote:

> Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC. If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules.

The MacBooks and iMacs are only used as an execution environment for the Safari web browser. ECC is only necessary for computers which save data somewhere, so the MacBook and iMac do not need ECC.

Regardless (in order to stay on topic), it is worth mentioning that the 10TB of data lost to a failed pool was not lost due to lack of ECC. It was lost because VirtualBox intentionally broke the guest operating system.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Gavin Maltby
2009-Jul-20 00:58 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Hi,

David Magda wrote:

> On Jul 19, 2009, at 20:13, Gavin Maltby wrote:
>
>> No, ECC memory is a must too. ZFS checksumming verifies and corrects data read back from a disk, but once it is read from disk it is stashed in memory for your application to use - without ECC you erode confidence that what you read from memory is correct.
>
> Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC.

If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop.

> If it's so necessary we might as well have any kernel that has ZFS in it only allow 'zpool create' to be run if the kernel detects ECC modules.
>
> Come on.
>
> It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.

On a laptop zfs is a huge amount safer than other filesystems and still has all the great usability features etc., but zfs does not magically turn your laptop into a server-grade system. What you refer to as a tinfoil hat is an essential component of any server that is housing business-vital data; obviously it is just a nice-to-have on a laptop, but recognise what you're losing.

Gavin
i'm pretty sure you're just looking for the zfs rollback command. a quick google brings up a lot of information, and also man zfs.

check out this page:
http://docs.huihoo.com/opensolaris/solaris-zfs-administration-guide/html/ch06.html

On Sun, Jul 19, 2009 at 10:29 AM, Brian Wilson <bfwilson at doit.wisc.edu> wrote:

> It's clear from some threads on this list that it IS possible to roll back a zpool to a previous state, and I seem to even remember reading that someone was working on a tool or tools in that direction. Is that correct - is it possible to manually roll back a zpool for crash recovery purposes, if you've got enough clue/knowledge/experience on your side in regards to the right tools?
>
> And the follow-up question - how do I learn how to do that? Is there documentation somewhere that instructs one on the tools used for low-level ZFS troubleshooting and rollback, and how to come back from the land of crash-then-fail-to-import? Is it the troubleshooting document? I'm happy to get a RTFM - please just point me at the right manual or Sun doc too :)
>
> Thanks!
> Brian
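[Not part of the original message: a minimal sketch of the dataset-level rollback Thomas is referring to; the dataset and snapshot names are placeholders. As the next reply points out, this rolls a filesystem back to one of its snapshots and does not help with a pool that will not import.]

    # Hypothetical example
    zfs list -t snapshot -r tank/data     # see which snapshots exist
    zfs rollback tank/data@monday         # discard changes made after @monday
    zfs rollback -r tank/data@monday      # same, but also destroys any snapshots
                                          # newer than @monday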
Richard Elling
2009-Jul-20 01:50 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Gavin Maltby wrote:

> David Magda wrote:
>> Right, because once (say) Apple incorporates ZFS into Mac OS X they'll also start shipping MacBooks and iMacs with ECC.
>
> If customers were committing valuable business data to MacBooks and iMacs then ECC would be a requirement. I don't know of terribly many customers running their business off of a laptop.

I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet.

The bigger problem is that not all of Intel's CPU products do ECC... the embedded and server models do, but it is the low-margin PC market that is willing to make that cost trade-off. If people demanded ECC, like they do in the embedded and server markets, then we wouldn't be having this conversation.
 -- richard
Andre van Eyssen
2009-Jul-20 01:52 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Sun, 19 Jul 2009, Richard Elling wrote:

> I do, even though I have a small business. Neither InDesign nor Illustrator will be ported to Linux or OpenSolaris in my lifetime... besides, iTunes rocks and it is the best iPhone developer's environment on the planet.

Richard, I think the point that Gavin was trying to make is that a sensible business would commit their valuable data back to a fileserver running on solid hardware with a solid operating system, rather than relying on their single-spindle laptops to store their valuable content - not making any statement on the actual desktop platform.

For example, I use a mixture of Windows, MacOS, Solaris and OpenBSD around here, but all the valuable data is stored on a zpool located on a SPARC server (obviously with ECC RAM) with UPS power. With Windows around, I like the fact that I don't need to think twice before reinstalling those machines.

Andre.

-- 
Andre van Eyssen.
mail: andre at purplecow.org          jabber: andre at interact.purplecow.org
purplecow.org: UNIX for the masses http://www2.purplecow.org
purplecow.org: PCOWpix             http://pix.purplecow.org
Thomas Burgess wrote:

> On Sun, Jul 19, 2009 at 10:29 AM, Brian Wilson <bfwilson at doit.wisc.edu <mailto:bfwilson at doit.wisc.edu>> wrote:
>
>     It's clear from some threads on this list that it IS possible to roll back a zpool to a previous state, and I seem to even remember reading that someone was working on a tool or tools in that direction. Is that correct - is it possible to manually roll back a zpool for crash recovery purposes, if you've got enough clue/knowledge/experience on your side in regards to the right tools?
>
> i'm pretty sure you're just looking for the zfs rollback command.

That rolls back a filesystem, not the state of a corrupted pool.

-- 
Ian.
I don't know the details Brian, so I was waiting to see if anybody remembered more, but that doesn't seem to be the case.

There is a way to roll back pools. Victor has been very helpful to several people, and in one of the threads where he managed to recover a pool, he posted a write-up of the technique he used. I don't have a link, I'm afraid. I believe it involves using zdb and walking through the pool to find the copies of the uberblock.

And I believe the person working on recovery tools was Jeff Bonwick (although I may be wrong). Again, that was from a thread on here talking about pool recovery. I've no idea how much progress has been made, but with no announcements, I doubt there is anything available that will help you.
-- 
This message posted from opensolaris.org
Hi,

Yes, I read those threads. Wow, dd directly over blocks at some offset point..... I was hoping some tools might have been created by now..... hoping.

Russel
-- 
This message posted from opensolaris.org
Russel
2009-Jul-20 10:26 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Well I did have a UPS on the machine :-)

but the machine hung and I had to power it off... (yep, it was virtual, but that happens on direct HW too, and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but it will take ages (10 days) to restore).
-- 
This message posted from opensolaris.org
That's the stuff. I think that is probably your best bet at the moment. I've not seen even a mention of an actual tool to do that, and I'd be surprised if we saw one this side of Christmas.
-- 
This message posted from opensolaris.org
Hi.

Hm, what are you actually referring to?

On Mon, Jul 20, 2009 at 13:45, Ross <no-reply at opensolaris.org> wrote:

> That's the stuff. I think that is probably your best bet at the moment. I've not seen even a mention of an actual tool to do that, and I'd be surprised if we saw one this side of Christmas.

Alexander
-- 
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Sent from Winterthur, ZH, Switzerland
> Hm, what are you actually referring to?

Sorry, I'm not subscribed to this list, so I just replied on the forum. This segment of the discussion is what I'm replying to:
http://www.opensolaris.org/jive/message.jspa?messageID=397730#397730
-- 
This message posted from opensolaris.org
Rob Logan
2009-Jul-20 13:45 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
> the machine hung and I had to power it off.

Kinda getting off the "zpool import --txg -3" request, but "hangs" are exceptionally rare and usually a RAM or other hardware issue; Solaris usually abends on software faults.

root@pdm # uptime
  9:33am  up 1116 day(s), 21:12,  1 user,  load average: 0.07, 0.05, 0.05
root@pdm # date
Mon Jul 20 09:33:07 EDT 2009
root@pdm # uname -a
SunOS pdm 5.9 Generic_112233-12 sun4u sparc SUNW,Ultra-250

Rob
Russel
2009-Jul-20 14:45 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
OK.

So do we have a "zpool import --txg 56574 mypoolname", or help to do it (a script?)

Russel
-- 
This message posted from opensolaris.org
Toby Thain
2009-Jul-20 15:06 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 20-Jul-09, at 6:26 AM, Russel wrote:

> Well I did have a UPS on the machine :-)
>
> but the machine hung and I had to power it off... (yep, it was virtual, but that happens on direct HW too,

As has been discussed here before, the failure modes are different, as the layer stack from filesystem to disk is obviously very different.

--Toby

> and virtualisation is the happening thing at Sun and elsewhere! I have a version of the data backed up, but it will take ages (10 days) to restore).
Frank Middleton
2009-Jul-20 19:48 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On 07/19/09 06:10 PM, Richard Elling wrote:

> Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds.

Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in the queue all the time. You may have a buffer in memory for 100uS, but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror, then you are toast. Good thing this never happens, right :-)

> Beware, if you go down this path of thought for very long, you'll soon be afraid to get out of bed in the morning... wait... most people actually die in beds, so perhaps you'll be afraid to go to bed instead :-)

Not at all. As with any rational business, my servers all have ECC, and getting up and out isn't a problem :-). Maybe I've had too many disks go bad, so I have ECC, mirrors, and backup to a system with ECC and mirrors (and copies=2, as well). Maybe I've read too many of your excellent blogs :-).

>> Sun doesn't even sell machines without ECC. There's a reason for that.
>
> Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems.

Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in the documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that.

On 07/19/09 08:29 PM, David Magda wrote:

> It's a nice-to-have, but at some point we're getting into the tinfoil hat-equivalent of data protection.

But it is going to happen! Sun sells only machines with ECC because that is the only way to ensure reliability. Someone who spends weeks building a media server at home isn't going to be happy if they lose one media file, let alone a whole pool. At least they should be warned that without ECC at some point they will lose files. I'm not convinced that there is any reasonable scenario for losing an entire pool though, which was the original complaint in this thread.

Even trusty old SPARCs occasionally hang without a panic (in my experience especially when a disk is about to go bad). If this happens, and you have to power cycle because even Stop-A doesn't respond, are you all saying that there is a risk of losing a pool at that point? Surely the whole point of a journalled file system is that it is pretty much proof against any catastrophe, even the one described initially.

There have been a couple of (to me) unconvincing explanations of how this pool was lost. Surely if there is a mechanism whereby unflushed I/Os can cause fatal metadata corruption, this should be a high priority bug, since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug.
Ross
2009-Jul-21 07:16 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
My understanding of the root cause of these issues is that the vast majority are happening with consumer grade hardware that is reporting to ZFS that writes have succeeded, when in fact they are still in the cache. When that happens, ZFS believes the data is safely written, but a power cut or crash can cause severe problems with the pool.

This is (I think) the reason for comments about this being a systems engineering, not design, problem - ZFS assumes the disks are telling the truth and has been designed this way. It is up to the administrator to engineer the server from components that accurately report their status.

However, while the majority of these cases are with consumer hardware, the BBC have reported that they hit this problem using Sun T2000 servers and commodity SATA drives, so unless somebody from Sun can say otherwise, I feel that there is still some risk of this occurring on Sun hardware.

I feel the ZFS marketing and documentation is very misleading in that it completely ignores the issue of your entire pool being at risk unless you are careful about the hardware used, leading to a lot of stories like this from enthusiasts and early adopters. I also believe ZFS needs recovery tools as a matter of urgency, to protect its reputation if nothing else.
-- 
This message posted from opensolaris.org
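[Not part of the original message: one hedged way to at least inspect the drive's volatile write cache on Solaris is the expert menu of format(1M); whether the cache menu appears depends on the drive and driver (typically sd), and disabling the cache trades performance for safety on drives that do not honour flush requests. This is an illustrative interactive transcript under those assumptions, not a recommendation.]

    # format -e
    #   (select the disk)
    #   format> cache
    #   cache> write_cache
    #   write_cache> display      # show whether the volatile write cache is on
    #   write_cache> disable      # optionally turn it off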
George Wilson
2009-Jul-21 15:53 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Russel wrote:

> OK.
>
> So do we have a "zpool import --txg 56574 mypoolname", or help to do it (a script?)
>
> Russel

We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work:

6667683 need a way to rollback to an uberblock from a previous txg

Thanks,
George
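[Not part of George's message: for readers finding this thread later, the work tracked by this CR is generally understood to have shipped as the pool-recovery options of zpool import. A hedged sketch of their use on a ZFS build that includes them follows; the pool name is a placeholder, and the 2009.06 bits discussed in this thread do not have these flags.]

    zpool import -F array1      # try to discard the last few txgs so the
                                # pool becomes importable again
    zpool import -Fn array1     # dry run: report whether -F would succeed
                                # and roughly how much would be discarded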
Richard Elling
2009-Jul-21 17:21 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
On Jul 20, 2009, at 12:48 PM, Frank Middleton wrote:

> On 07/19/09 06:10 PM, Richard Elling wrote:
>
>> Not that bad. Uncommitted ZFS data in memory does not tend to live that long. Writes are generally out to media in 30 seconds.
>
> Yes, but memory hits are instantaneous. On a reasonably busy system there may be buffers in the queue all the time. You may have a buffer in memory for 100uS, but it only takes 1nS for that buffer to be clobbered. If that happened to be metadata about to be written to both sides of a mirror, then you are toast. Good thing this never happens, right :-)

I never win the lottery either :-)

>> Yes, but all of the discussions in this thread can be classified as systems engineering problems, not product design problems.
>
> Not sure I follow. We've had this discussion before. OSOL+ZFS lets you build enterprise class systems on cheap hardware that has errors. ZFS gives the illusion of being fragile because it, uniquely, reports these errors. Running OSOL as a VM in VirtualBox using MSWanything as a host is a bit like building on sand, but there's nothing in the documentation anywhere to even warn folks that they shouldn't rely on software to get them out of trouble on cheap hardware. ECC is just one (but essential) part of that.

It is a systems engineering problem because ZFS is working as designed and VirtualBox is also working as designed. If you file a bug against either, the bug should be closed as "not a defect." That means the responsibility for making sure that the two interoperate lies at the systems level -- where systems engineers do their job.

For an analogy, guns don't kill people, bullets kill people. The gun is just a platform for directing bullets. If you shoot yourself in the foot, then the failure is not with the gun or bullet, it is one layer above -- in the system. It hurts when you do that, so don't do that.

> There have been a couple of (to me) unconvincing explanations of how this pool was lost.

It is quite simple -- ZFS sent the flush command and VirtualBox ignored it. Therefore the bits on the persistent store are not consistent.

> Surely if there is a mechanism whereby unflushed I/Os can cause fatal metadata corruption, this should be a high priority bug, since this can happen on /any/ hardware; it is just more likely if the foundations are shaky, so the explanation must require more than that if it isn't a bug.

It isn't a bug in ZFS or VirtualBox. They work as designed. As has been mentioned before, many times, the recovery of the data is now a forensics exercise. All ZFS knows is that the consistency is broken, and it is implementing the policy that consistency is more important than automated access.
 -- richard
Alexander Skwar
2009-Jul-21 18:32 UTC
[zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Hi.

Good to know! But how do we deal with that on older systems, which don't have the patch applied, once it is out?

Thanks,
Alexander

On Tuesday, July 21, 2009, George Wilson <George.Wilson at sun.com> wrote:

> We are working on the pool rollback mechanism and hope to have that soon. The ZFS team recognizes that not all hardware is created equal and thus the need for this mechanism. We are using the following CR as the tracker for this work:
>
> 6667683 need a way to rollback to an uberblock from a previous txg
>
> Thanks,
> George

-- 
Alexander
-- 
[[ http://zensursula.net ]]
[ Soc. => http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr => http://zyb.com/alexws77 ]
[ Chat => Jabber: alexws77 at jabber80.com | Google Talk: a.skwar at gmail.com ]
[ Mehr => AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Russel
2009-Jul-22 11:09 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Thanks for the feedback, George. I hope we get the tools soon.

At home I have now blown the ZFS away and am creating a HW RAID-5
set :-( Hopefully in the future, when the tools are there, I will
return to ZFS.

To All: The ECC discussion was very interesting, as I had never
considered it that way! I will be buying ECC memory for my home
machine!!

Again, many many thanks to all who have replied; it has been a very
interesting and informative discussion for me.

Best regards
Russel
-- 
This message posted from opensolaris.org
Anon Y Mous
2009-Jul-22 14:31 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
I don't mean to be offensive Russel, but if you do ever return to ZFS,
please promise me that you will never, ever, EVER run it virtualized
on top of NTFS (a.k.a. worst file system ever) in a production
environment. Microsoft Windows is a horribly unreliable operating
system in situations where things like protecting against data
corruption are important. Microsoft knows this, which is why they
secretly run much of Microsoft.com, their www advertisement campaigns,
and the Microsoft Updates web sites on Akamai Linux in the data center
across the hall from the data center where I work... and the
invulnerable file system behind Microsoft's "cloud" that secretly runs
on Akamai's content delivery system is none other than ZFS's long lost
brother... Netapp WAFL!

The first time I started to catch on to this was when the Project
Mojave advertisement campaign started and lots of people were nmap
scanning the site and noticing that it was running Apache on Linux:

http://openmanifesto.blogspot.com/2008/07/mss-blunder-with-mojave-experiment-uses.html

Eventually Microsoft realized they messed up and started to edit the
header strings like they usually do to make it look like IIS:

https://lists.mayfirst.org/pipermail/nosi-discussion/2008-August/000417.html

although you could still figure it out if you were smart enough by
using telnet like this:

http://news.netcraft.com/archives/2003/08/17/wwwmicrosoftcom_runs_linux_up_to_a_point_.html

but the cat was already out of the bag. I did some investigating over
a year ago and talked to some of my long time friends who were senior
Akamai techs, and one of them eventually gave me a guided tour after
hours, gave me a quick look at the Netapp WAFL setup, and explained
how Microsoft Windows updates actually work. Very cool! These Akamai
guys are like the "Wizard of Oz" for the Internet, running everything
behind the curtains there. Whenever Microsoft Updates are down, tell
an Akamai tech! Everything will start working fine within 5 minutes of
you telling them (sure beats calling in to Microsoft Tech Support in
Mumbai India). Is apple.com or itunes running slow? Tell an Akamai
tech and it'll be fixed immediately. Cnn.com down? Jcpenny.com down?
Yup. Tell an Akamai tech and it comes right back up. It's very rare
that they have a serious problem like this one:

http://www.theregister.co.uk/2004/06/15/akamai_goes_postal/

in which case 25% of the internet (including google, yahoo, and lycos)
usually goes down with them.

So my question to you Russel is: if Microsoft can't even rely on NTFS
to run their own important infrastructure (they obviously have a
Netapp WAFL dependency), what hope can your 10TB pool possibly have?
What you're doing is the equivalent of building a 100 story tall
skyscraper out of titanium and then making the bottom-most ground
floor and basement foundation out of glue and popsicle sticks, and
then when the entire building starts to collapse, you call in to the
titanium metal fabrication corporation, blame them for the problem,
and then tell them that they are obligated to help you glue your
popsicle sticks back together because it's all their fault that the
building collapsed! Not very fair IMHO.

In the future, keep in mind that (as far as I understand it) the only
way to get the 100% full benefits of ZFS checksum protection is to run
it on bare metal with no virtualization. If you're going to virtualize
something, virtualize Microsoft Windows and Linux inside of
OpenSolaris.
I'm running ZFS in production with my OpenSolaris operating system
zpool mirrored three times over on 3 different drives, and I've never
had a problem with it. I even created a few simulated power outages to
test my setup, and pulling the plug while twelve different users were
uploading multiple files into 12 different Solaris zones definitely
didn't faze the zpool at all. Just boots right back up and everything
works. The thing is though, it only seems to work when you're not
running it virtualized on top of a closed-source proprietary file
system that's made out of glue and popsicle sticks.

Just my 2 cents. I could be wrong though.
-- 
This message posted from opensolaris.org
George Wilson
2009-Jul-22 15:55 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Once these bits are available in OpenSolaris then users will be able
to upgrade rather easily. This would allow you to take a liveCD
running these bits and recover older pools.

Do you currently have a pool which needs recovery?

Thanks,
George

Alexander Skwar wrote:

> Hi.
>
> Good to know!
>
> But how do we deal with that on older systems, which don't have the
> patch applied, once it is out?
>
> Thanks, Alexander
>
> On Tuesday, July 21, 2009, George Wilson <George.Wilson at sun.com> wrote:
>
>> Russel wrote:
>>
>> OK.
>>
>> So do we have a zpool import --txg 56574 mypoolname
>> or help to do it (script?)
>>
>> Russel
>>
>> We are working on the pool rollback mechanism and hope to have that
>> soon. The ZFS team recognizes that not all hardware is created equal
>> and thus the need for this mechanism. We are using the following CR
>> as the tracker for this work:
>>
>> 6667683 need a way to rollback to an uberblock from a previous txg
>>
>> Thanks,
>> George
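For readers wondering what "rollback to an uberblock from a previous
txg" means mechanically: each vdev label already keeps a small ring of
recent uberblocks, so recovery can deliberately import from an older
transaction group whose block tree is still intact. The sketch below is
a conceptual illustration in Python only, not ZFS source code and not
the eventual zpool syntax; the names are invented.

from collections import namedtuple

Uberblock = namedtuple("Uberblock", ["txg", "timestamp", "intact"])

def pick_import_uberblock(ring, max_txg=None):
    """Return the newest usable uberblock, optionally capped at max_txg."""
    usable = [ub for ub in ring
              if ub.intact and (max_txg is None or ub.txg <= max_txg)]
    if not usable:
        raise RuntimeError("no usable uberblock found")
    return max(usable, key=lambda ub: ub.txg)

ring = [Uberblock(336049, 1000, True),
        Uberblock(336050, 1005, True),
        Uberblock(336051, 1010, False)]   # newest txg points at lost writes

print(pick_import_uberblock(ring))          # falls back to txg 336050
print(pick_import_uberblock(ring, 336049))  # explicit rollback target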
Mario Goebbels
2009-Jul-22 16:02 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> To All : The ECC discussion was very interesting as I had never
> considered it that way! I will be buying ECC memory for my home
> machine!!

You have to make sure your mainboard, chipset and/or CPU support it,
otherwise any ECC modules will just work like regular modules. The
mainboard needs to have the necessary lanes to either the chipset that
supports ECC (in case of Intel) or the CPU (in case of AMD).

I think all Xeon chipsets do ECC, as do various consumer ones (I only
know of X38/X48, there's also some 9xx ones that do). For consumer
boards, it's hard to figure out which actually do support it. I have
an X48-DQ6 mainboard from Gigabyte, which does it.

Regards,
-mg
Miles Nordin
2009-Jul-22 19:47 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
>>>>> "aym" == Anon Y Mous <no-reply at opensolaris.org> writes:
>>>>> "mg" == Mario Goebbels <me at tomservo.cc> writes:

   aym> I don't mean to be offensive Russel, but if you do ever return
   aym> to ZFS, please promise me that you will never, ever, EVER run
   aym> it virtualized on top of NTFS

he said he was using raw disk devices IIRC.

and once again, the host did not crash, only the guest, so even if it
were NTFS rather than raw disks, the integrity characteristics of NTFS
would have been irrelevant since the host was always shut down cleanly.

   aym> the only way to get the 100% full benefits of ZFS checksum
   aym> protection is to run it on bare metal with no virtualization.

bullshit. That makes no sense at all. First, why should virtualization
have anything to do with checksums? Obviously checksums go straight
through it. The suspected problem lies elsewhere. Second,
virtualization is serious business. Problems need to be found and
fixed. At this point, you've become so aggressive with that broom,
anyone can see there's obviously an elephant under the rug.

   aym> I'm running ZFS in production with my OpenSolaris
   aym> operating system zpool mirrored three times over on 3
   aym> different drives, and I've never had a problem with it.

The idea of collecting other people's problem reports is to figure out
what's causing problems before one hits you. I hear this type of thing
all the time: ``The number of problems I've had is so close to zero,
it is zero, so by extrapolation nobody else can be having any real
problems because if I scale out my own experience the expected number
of problems in the entire world is zero.'' ---wtf? clearly bogus!

    mg> You have to make sure your mainboard, chipset and/or CPU
    mg> support it, otherwise any ECC modules will just work like
    mg> regular modules.

also scrubbing is sometimes enabled separately from plain ECC. Without
scrubbing the ECC can still correct errors, but won't do so until some
actual thread reads the flipped bit, which is probably okay but
<shrug>.

I vaguely remember something about an idle scrub thread in Solaris
where the CPU itself does the scrubbing? but at least on AMD platforms,
the memory and cache controllers will do scrubbing themselves using
only memory bandwidth, without using CPU cycles, if you ask.

On AMD you can use this script on Linux to control scrub speed and ECC
enablement if your BIOS does not support it. The script does appear to
do something on Phenom II, but I haven't tried the 10-ohm resistor
test the author suggests. I think it should be adaptable to Solaris.

http://hyvatti.iki.fi/~jaakko/sw/

now if only we could get 4GB ECC unbuffered DDR3 for similar prices to
non-ECC. :(
Frank Middleton
2009-Jul-23 23:20 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/21/09 01:21 PM, Richard Elling wrote:

> I never win the lottery either :-)

Let's see. Your chance of winning a 49 ball lottery is apparently
around 1 in 14*10^6, although it's much better than that because of
submatches (smaller payoffs for matches on less than 6 balls). There
are about 32*10^6 seconds in a year. If ZFS saves its writes for 30
seconds and batches them out, that means 1 write leaves the buffer
exposed for roughly one millionth of a year.

If you have 4GB of memory, you might get 50 errors a year, but you say
ZFS uses only 1/10 of this for writes, so that memory could see 5
errors/year. If your single write was 1/70th of that (say around 6 MB),
your chance of a hit is around (5/70) * 10^-6, or 1 in 14*10^6, so you
are correct! So if you do one 6MB write/year, your chances of a hit in
a year are about the same as that of winning a grand slam lottery.

Hopefully not every hit will trash a file or pool, but odds are that
you'll do many more writes than that, so on the whole I think a ZFS
hit is quite a bit more likely than winning the lottery each year :-).
Conversely, if you average one big write every 3 minutes or so (20%
occupancy), odds are almost certain that you'll get one hit a year. So
some SOHO users who do far fewer writes won't see any hits (say) over
a 5 year period. But some will, and they will be most unhappy --
calculate your odds and then make a decision! I daresay the PC makers
have done this calculation, which is why PCs don't have ECC, and hence
IMO make for insufficiently reliable servers.

Conclusions from what I've gleaned from all the discussions here: if
you are too cheap to opt for mirroring, your best bet is to disable
checksumming and set copies=2. If you mirror but don't have ECC then
at least set copies=2 and consider disabling checksums. Actually, set
copies=2 regardless, so that you have some redundancy if one half of
the mirror fails and you have a 10 hour resilver, in which time you
could easily get a (real) disk read error.

It seems to me some vendor is going to cotton onto the SOHO server
problem and make a bundle at the right price point. Sun's offerings
seem unfortunately mostly overkill for the SOHO market, although the
X4140 looks rather interesting... Shame there aren't any entry level
SPARCs any more :-(. Now what would doctors' front offices do if they
couldn't blame the computer for being down all the time?

> It is quite simple -- ZFS sent the flush command and VirtualBox
> ignored it. Therefore there is no guarantee that the bits on the
> persistent store are consistent.

But even on the most majestic of hardware, a flush command could be
lost, could it not? An obvious case in point is ZFS over iscsi and a
router glitch. But the discussion seems to be moot since CR 6667683 is
being addressed. Now about those writes to mirrored disks :)

Cheers -- Frank
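A quick back-of-envelope check of the single-write estimate above; the
inputs are the assumptions stated in the post (error rate, buffer
sizes), not measurements:

# Reproduces the "1 in 14 million" figure using the post's assumptions.
seconds_per_year  = 32e6
exposure_fraction = 30 / seconds_per_year   # ~30 s in the buffer ~= 1e-6 of a year
errors_per_year   = 50 * 0.10               # ~5 hits/year in the ~400 MB of write buffers
write_fraction    = 1 / 70                  # one ~6 MB write out of those buffers

p_single_write = errors_per_year * write_fraction * exposure_fraction
print(f"p(hit) for a single buffered write: {p_single_write:.1e}")
# ~7e-8, i.e. about 1 in 14 million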
roland
2009-Jul-25 11:06 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> Running this kind of setup absolutely can give you NO guarantees at
> all. Virtualisation, OSOL/zfs on WinXP. It's nice to play with and
> see it "working" but would I TRUST precious data to it? No way!

why not?

if i write some data through the virtualization layer which goes
straight through to raw disk - what's the problem?

do a snapshot and you can be sure you have a safe state. or not?

you can check if you are consistent by doing a scrub. or not?

taking buffers/caches into consideration, you could eventually lose
some seconds/minutes of work, but doesn't zfs use a transactional
design which ensures consistency?

so, how can what's being reported here happen, if zfs takes so much
care of consistency?

> When that happens, ZFS believes the data is safely written, but a
> power cut or crash can cause severe problems with the pool.

didn't i read a million times that zfs ensures an "always consistent
state" and is self healing, too?

so, if new blocks are always written at new positions - why can't we
just roll back to a point in time (for example the last snapshot)
which is known to be safe/consistent? i don't give a shit about the
last 5 minutes of work if i can recover my TB sized pool instead.
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2009-Jul-25 15:38 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sat, 25 Jul 2009, roland wrote:

>> When that happens, ZFS believes the data is safely written, but a
>> power cut or crash can cause severe problems with the pool.
>
> didn't i read a million times that zfs ensures an "always consistent
> state" and is self healing, too?
>
> so, if new blocks are always written at new positions - why can't we
> just roll back to a point in time (for example the last snapshot)
> which is known to be safe/consistent?

As soon as you have more than one disk in the equation, then it is
vital that the disks commit their data when requested since otherwise
the data on disk will not be in a consistent state. If the disks
simply do whatever they want then some disks will have written the
data while other disks will still have it cached. This blows the
"consistent state on disk" even though zfs wrote the data in order and
did all the right things. Any uncommitted data in disk cache will be
forgotten if the system loses power.

There is an additional problem if, when the disks finally get around
to writing the cached data, they write it in a different order than
requested while ignoring the commit request. It is common that the
disks write data in the most efficient order, but they absolutely must
commit all of the data when requested so that the checkpoint is valid.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
roland
2009-Jul-25 16:24 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> As soon as you have more than one disk in the equation, then it is
> vital that the disks commit their data when requested since otherwise
> the data on disk will not be in a consistent state.

ok, but doesn't that refer only to the most recent data?

why can i lose a whole 10TB pool including all the snapshots with the
logging/transactional nature of zfs?

isn't the data in the snapshots set to read only, so all blocks with
snapshotted data don't change over time (and thus give a secure
"entry" to a consistent point in time)?

ok, these are probably some short-sighted questions, but i'm trying to
understand how things could go wrong with zfs and how issues like
these happen.

on other filesystems, we have tools for fsck as a last resort or tools
to recover data from unmountable filesystems. with zfs i don't know
any of these, so it's that "will solaris mount my zfs after the next
crash?" question which frightens me a little bit.
-- 
This message posted from opensolaris.org
David Magda
2009-Jul-25 17:27 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 12:24, roland wrote:

> why can i lose a whole 10TB pool including all the snapshots with
> the logging/transactional nature of zfs?

Because ZFS does not (yet) have an (easy) way to go back to a previous
state. That's what this bug is about:

> need a way to rollback to an uberblock from a previous txg

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6667683

While in most cases ZFS will cleanly recover after a non-clean
shutdown, there are situations where the disks doing strange things
(like lying) have caused the ZFS data structures to become wonky. The
'broken' data structure will cause all branches underneath it to be
lost--and if it's near the top of the tree, it could mean a good
portion of the pool is inaccessible.

Fixing the above bug should hopefully allow users / sysadmins to tell
ZFS to go 'back in time' and look up previous versions of the data
structures.
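To make the "branches underneath it are lost" point concrete, here is
a toy illustration in Python (not ZFS code): in a tree of block
pointers, one damaged interior node hides its whole subtree from any
reader that has to walk down from the root, even though the child
blocks may still be intact on disk.

class Block:
    def __init__(self, name, corrupt=False, children=()):
        self.name, self.corrupt, self.children = name, corrupt, list(children)

def reachable(block):
    """Walk the tree the way a reader must: through parent pointers only."""
    if block.corrupt:
        return []                      # cannot trust or follow its pointers
    found = [block.name]
    for child in block.children:
        found += reachable(child)
    return found

leaves   = [Block(f"data{i}") for i in range(4)]
meta_ok  = Block("meta-ok",  children=leaves[:2])
meta_bad = Block("meta-bad", corrupt=True, children=leaves[2:])  # near the top
root     = Block("root", children=[meta_ok, meta_bad])

print(reachable(root))   # data2/data3 are lost to the reader, though intact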
roland
2009-Jul-25 18:17 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
thanks for the explanation!

one more question:

> there are situations where the disks doing strange things (like
> lying) have caused the ZFS data structures to become wonky. The
> 'broken' data structure will cause all branches underneath it to be
> lost--and if it's near the top of the tree, it could mean a good
> portion of the pool is inaccessible.

can snapshots also be affected by such an issue, or are they somewhat
"immune" here?
-- 
This message posted from opensolaris.org
David Magda
2009-Jul-25 18:50 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 14:17, roland wrote:

> thanks for the explanation!
>
> one more question:
>
>> there are situations where the disks doing strange things (like
>> lying) have caused the ZFS data structures to become wonky. The
>> 'broken' data structure will cause all branches underneath it to be
>> lost--and if it's near the top of the tree, it could mean a good
>> portion of the pool is inaccessible.
>
> can snapshots also be affected by such an issue, or are they
> somewhat "immune" here?

Yes, it can be affected. If the snapshot's data structure / record is
underneath the corrupted data in the tree then it won't be able to be
reached.
Frank Middleton
2009-Jul-25 19:32 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 02:50 PM, David Magda wrote:

> Yes, it can be affected. If the snapshot's data structure / record
> is underneath the corrupted data in the tree then it won't be able
> to be reached.

Can you comment on if/how mirroring or raidz mitigates this, or tree
corruption in general? I have yet to lose a pool even on a machine
with fairly pathological problems, but it is mirrored (and copies=2).

I was also wondering if you could explain why the ZIL can't repair
such damage.

Finally, a number of posters blamed VB for ignoring a flush, but
according to the evil tuning guide, without any application syncs,
ZFS may wait up to 5 seconds before issuing a synch, and there must be
all kinds of failure modes even on bare hardware where it never gets a
chance to do one at shutdown. This is interesting if you do ZFS over
iscsi because of the possibility of someone tripping over a patch cord
or a router blowing a fuse. Doesn't this mean /any/ hardware might
have this problem, albeit with much lower probability?

Thanks
Carson Gaspar
2009-Jul-25 20:30 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Frank Middleton wrote:

> Finally, a number of posters blamed VB for ignoring a flush, but
> according to the evil tuning guide, without any application syncs,
> ZFS may wait up to 5 seconds before issuing a synch, and there must
> be all kinds of failure modes even on bare hardware where it never
> gets a chance to do one at shutdown. This is interesting if you do
> ZFS over iscsi because of the possibility of someone tripping over a
> patch cord or a router blowing a fuse. Doesn't this mean /any/
> hardware might have this problem, albeit with much lower probability?

No. You'll lose unwritten data, but won't corrupt the pool, because
the on-disk state will be sane, as long as your iSCSI stack doesn't
lie about data commits or ignore cache flush commands.

Why is this so difficult for people to understand? Let me create a
simple example for you. Get yourself 4 small pieces of paper, and
number them 1 through 4.

On piece 1, write "Four" (app write disk A)
On piece 2, write "Score" (app write disk B)
Place piece 1 and piece 2 together on the side (metadata write, cache flush)
On piece 3, write "Every" (app overwrite disk A)
On piece 4, write "Good" (app overwrite disk B)
Place piece 3 and piece 4 on top of pieces 1 and 2 (metadata write, cache flush)

IFF you obeyed the instructions, the only things you could ever have
on the side are nothing, "Four Score", or "Every Good" (we assume that
side placement is atomic). You could get killed after writing
something on pieces 3 or 4, and lose them, but you could never have
garbage.

Now if you were too lazy to bother to follow the instructions
properly, we could end up with bizarre things. This is what happens
when storage lies and re-orders writes across boundaries.

-- 
Carson
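Here is a small simulation of the same protocol, purely illustrative
and not ZFS code: copy-on-write data writes plus a committed pointer
that is only advanced after a cache flush. With the barrier honored, a
crash can only expose a pointer to data that actually reached the
media; with the flush silently ignored, the device is free to destage
the pointer before the data it names.

import random

def crash_test(honor_flush, trials=20000):
    bad = 0
    for _ in range(trials):
        # logical order: write data1, FLUSH, point at 1, write data2, FLUSH, point at 2
        ops = [("data1", True), "FLUSH", ("ptr", 1),
               ("data2", True), "FLUSH", ("ptr", 2)]
        durable, cache = {}, []
        for op in ops[:random.randint(0, len(ops))]:     # crash at a random point
            if op == "FLUSH":
                if honor_flush:
                    durable.update(cache)                # barrier: queued writes are now on media
                    cache = []
            else:
                cache.append(op)                         # sits in the volatile write cache
        random.shuffle(cache)                            # device destages in its own order...
        durable.update(cache[:random.randint(0, len(cache))])   # ...and the rest never hits media
        ptr = durable.get("ptr")
        if ptr is not None and not durable.get("data%d" % ptr, False):
            bad += 1      # the committed pointer names data that was never written
    return bad

print("flush honored, broken states:", crash_test(True))    # expect 0
print("flush ignored, broken states:", crash_test(False))   # expect > 0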
Toby Thain
2009-Jul-25 23:34 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 25-Jul-09, at 3:32 PM, Frank Middleton wrote:

> On 07/25/09 02:50 PM, David Magda wrote:
>
>> Yes, it can be affected. If the snapshot's data structure / record
>> is underneath the corrupted data in the tree then it won't be able
>> to be reached.
>
> Can you comment on if/how mirroring or raidz mitigates this, or tree
> corruption in general? I have yet to lose a pool even on a machine
> with fairly pathological problems, but it is mirrored (and copies=2).
>
> I was also wondering if you could explain why the ZIL can't repair
> such damage.
>
> Finally, a number of posters blamed VB for ignoring a flush, but
> according to the evil tuning guide, without any application syncs,
> ZFS may wait up to 5 seconds before issuing a synch, and there must
> be all kinds of failure modes even on bare hardware where it never
> gets a chance to do one at shutdown. This is interesting if you do
> ZFS over iscsi because of the possibility of someone tripping over a
> patch cord or a router blowing a fuse. Doesn't this mean /any/
> hardware might have this problem, albeit with much lower probability?

The problem is assumed *ordering*. In this respect VB ignoring flushes
and real hardware are not going to behave the same.

--Toby

> Thanks
David Magda
2009-Jul-26 05:40 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 15:32, Frank Middleton wrote:

> Can you comment on if/how mirroring or raidz mitigates this, or tree
> corruption in general? I have yet to lose a pool even on a machine
> with fairly pathological problems, but it is mirrored (and copies=2).

Presumably at least one of the drives in the mirror or RAID set would
have the correct data or non-corrupted data structures.

There was a thread a while back on the risks involved in a SAN LUN
(served from something like an EMC array), and whether you could trust
the array or whether you should mirror LUNs. (I think the consensus
was it was best to mirror LUNs--even from SANs, which presumably are
more reliable than consumer SATA drives.)

> I was also wondering if you could explain why the ZIL can't repair
> such damage.

Beyond my knowledge.

> Finally, a number of posters blamed VB for ignoring a flush, but
> according to the evil tuning guide, without any application syncs,
> ZFS may wait up to 5 seconds before issuing a synch, and there

Yes, it will sync every 5 to 30 seconds, but how do you know the data
is actually synced?! If the five second timer triggers and ZFS says
"okay, time to sync", and goes through the proper procedures, what
happens if the drive lies about the sync operation? What then?

That's the whole point of this thread: what should happen, or what
should the file system do, when the drive (real or virtual) lies about
the syncing? It's just as much a problem with any other POSIX file
system (which has to deal with fsync(2))--ZFS isn't that special in
that regard. The Linux folks went through a protracted debate on a
similar issue not too long ago:

http://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
http://lwn.net/Articles/322823/

> tripping over a patch cord or a router blowing a fuse. Doesn't this
> mean /any/ hardware might have this problem, albeit with much lower
> probability?

Yes, which is why it's always recommended to have redundancy in your
configuration (mirroring or RAID-Z). This way, hopefully, at least one
drive is in a consistent state.

This is also (theoretically) why a drive purchased from Sun is more
expensive than a drive purchased from your neighbourhood computer
shop: Sun (and presumably other manufacturers) takes the time and
effort to test things to make sure that when a drive says "I've synced
the data", it actually has synced the data. This testing is what
you're presumably paying for.
David Magda
2009-Jul-26 05:47 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 25, 2009, at 16:30, Carson Gaspar wrote:

> Frank Middleton wrote:
>
>> Doesn't this mean /any/ hardware might have this problem, albeit
>> with much lower probability?
>
> No. You'll lose unwritten data, but won't corrupt the pool, because
> the on-disk state will be sane, as long as your iSCSI stack doesn't
> lie about data commits or ignore cache flush commands.

But this entire thread started because Virtual Box's virtual disk
/did/ lie about data commits.

> Why is this so difficult for people to understand?

Because most people make the (not unreasonable) assumption that disks
save data the way that they're supposed to: that the data that goes in
is the data that comes out, and that when the OS tells them to empty
the buffer they actually flush it.

It's only us storage geeks that generally know the ugly truth that
this assumption is not always true. :)
Frank Middleton
2009-Jul-26 15:08 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/25/09 04:30 PM, Carson Gaspar wrote:

> No. You'll lose unwritten data, but won't corrupt the pool, because
> the on-disk state will be sane, as long as your iSCSI stack doesn't
> lie about data commits or ignore cache flush commands. Why is this
> so difficult for people to understand? Let me create a simple
> example for you.

Are you sure about this example? AFAIK metadata refers to things like
the file's name, atime, ACLs, etc., etc. Your example seems to be more
about how a journal works, which has little to do with metadata other
than to manage it.

> Now if you were too lazy to bother to follow the instructions
> properly, we could end up with bizarre things. This is what happens
> when storage lies and re-orders writes across boundaries.

On 07/25/09 07:34 PM, Toby Thain wrote:

> The problem is assumed *ordering*. In this respect VB ignoring
> flushes and real hardware are not going to behave the same.

Why? An ignored flush is ignored. It may be more likely in VB, but it
can always happen. It mystifies me that VB would in some way alter the
ordering. I wonder if the OP could tell us what actual disks and
controller he used, to see if the hardware might actually have done
out-of-order writes despite the fact that ZFS already does write
optimization. Maybe the disk didn't like the physical location of the
log relative to the data so it wrote the data first? Even then it
isn't obvious why this would cause the pool to be lost.

A traditional journalling file system should survive the loss of a
flush. Either the log entry was written or it wasn't. Even if the
disk, for some bizarre reason, writes some of the actual data before
writing the log, the repair process should undo that. If written
properly, it will use the information in the most current complete
journal entry to repair the file system. Doing syncs is devastating to
performance so usually there's an option to disable them, at the known
risk of losing a lot more data.

I've been using SPARCs and Solaris from the beginning. Ever since UFS
supported journalling, I've never lost a file unless the disk went
totally bad, and none since mirroring. Didn't miss fsck either :-)

Doesn't ZIL effectively make ZFS into a journalled file system (in
another thread, Bob Friesenhahn says it isn't, but I would submit that
the general opinion is correct that it is; "log" and "journal" have
similar semantics)? The evil tuning guide is pretty emphatic about not
disabling it!

My intuition (and this is entirely speculative) is that the ZFS ZIL
either doesn't contain everything needed to restore the
superstructure, or that if it does, the recovery process is ignoring
it. I think I read that the ZIL is per-file system, but one hopes it
doesn't rely on the superstructure recursively, or this would be
impossible to fix (maybe there's a ZIL for the ZILs :) ).

On 07/21/09 11:53 AM, George Wilson wrote:

> We are working on the pool rollback mechanism and hope to have that
> soon. The ZFS team recognizes that not all hardware is created equal
> and thus the need for this mechanism. We are using the following CR
> as the tracker for this work:
>
> 6667683 need a way to rollback to an uberblock from a previous txg

so maybe this discussion is moot :-)

-- Frank
Bob Friesenhahn
2009-Jul-26 16:24 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, 26 Jul 2009, David Magda wrote:

> That's the whole point of this thread: what should happen, or what
> should the file system do, when the drive (real or virtual) lies
> about the syncing? It's just as much a problem with any other POSIX
> file system (which has to deal with fsync(2))--ZFS isn't that
> special in that regard. The Linux folks went through a protracted
> debate on a similar issue not too long ago:

Zfs is pretty darn special. RAIDed disk setups under Linux or *BSD
work differently than zfs in a rather big way. Consider that with a
normal software-based RAID setup, you use OS tools to create a virtual
RAIDed device (LUN) which appears as a large device that you can then
create (e.g. mkfs) a traditional filesystem on top of.

Zfs works quite differently in that it uses a pooled design which
incorporates several RAID strategies directly. Instead of sending the
data to a virtual device which then arranges the underlying data
according to a policy (striping, mirror, RAID5), zfs incorporates
knowledge of the vdev RAID strategy and intelligently issues data to
the disks in an ideal order, executing the disk drive commit requests
directly. Zfs removes the RAID obfuscation which exists in traditional
RAID systems.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Toby Thain
2009-Jul-26 21:10 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26-Jul-09, at 11:08 AM, Frank Middleton wrote:

> On 07/25/09 04:30 PM, Carson Gaspar wrote:
>
>> No. You'll lose unwritten data, but won't corrupt the pool, because
>> the on-disk state will be sane, as long as your iSCSI stack doesn't
>> lie about data commits or ignore cache flush commands. Why is this
>> so difficult for people to understand? Let me create a simple
>> example for you.
>
> Are you sure about this example? AFAIK metadata refers to things
> like the file's name, atime, ACLs, etc., etc. Your example seems to
> be more about how a journal works, which has little to do with
> metadata other than to manage it.
>
>> Now if you were too lazy to bother to follow the instructions
>> properly, we could end up with bizarre things. This is what happens
>> when storage lies and re-orders writes across boundaries.
>
> On 07/25/09 07:34 PM, Toby Thain wrote:
>
>> The problem is assumed *ordering*. In this respect VB ignoring
>> flushes and real hardware are not going to behave the same.
>
> Why? An ignored flush is ignored. It may be more likely in VB, but
> it can always happen.

And whenever it does: guess what happens?

> It mystifies me that VB would in some way alter the ordering.

Carson already went through a more detailed explanation. Let me try a
different one:

ZFS issues writes A, B, C, FLUSH, D, E, F.

case 1) the semantics of the flush* allow ZFS to presume that A, B, C
are all 'committed' at the point that D is issued. You can understand
that A, B, C may be done in any order, and D, E, F may be done in any
order, due to the numerous abstraction layers involved - all the way
down to the disk's internal scheduling. ANY of these layers can affect
the ordering of durable, physical writes _in the absence of a
flush/barrier_.

case 2) but if the flush does NOT occur with the necessary semantics,
the ordering of ALL SIX operations is now indeterminate, and by the
time ZFS issues D, any of the first 3 (A, B, C) may well not have been
committed at all. There is a very good chance this will violate an
integrity assumption (I haven't studied the source so I can't point
you to a specific design detail or line; rather I am working from how
I understand transactional/journaled systems to work. Assuming my
argument is valid, I am sure a ZFS engineer can cite a specific
violation).

As has already been mentioned in this context, I think by David Magda,
ordinary hardware will show this problem _if flushes are not
functioning_ (an unusual case on bare metal), while on VirtualBox this
is the default!

> ...
>
> Doesn't ZIL effectively make ZFS into a journalled file system

Of course ZFS is transactional, as are other filesystems and software
systems, such as RDBMS. But integrity of such systems depends on a
hardware flush primitive that actually works. We are getting hoarse
repeating this.

--Toby

* Essentially 'commit' semantics: Flush synchronously, operation is
complete only when data is durably stored.
Marcelo Leal
2009-Jul-27 15:18 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> That's only one element of it Bob. ZFS also needs devices to fail
> quickly and in a predictable manner.
>
> A consumer grade hard disk could lock up your entire pool as it
> fails. The kit Sun supply is more likely to fail in a manner ZFS can
> cope with.

I agree 100%. Hardware, firmware, and drivers should be fully
integrated with a mission critical app. With the wrong firmware and
consumer grade HDs, disk failures stall the entire pool. I have
experience with disks failing and taking two or three seconds for the
system to cope with (not just ZFS, but the controller, etc).

Leal.
-- 
This message posted from opensolaris.org
Ross
2009-Jul-27 17:10 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Heh, I'd kill for failures to be handled in 2 or 3 seconds. I saw the
failure of a mirrored iSCSI disk lock the entire pool for 3 minutes.

That has been addressed now, but device hangs have the potential to be
*very* disruptive.
-- 
This message posted from opensolaris.org
Eric D. Mudama
2009-Jul-27 17:27 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Sun, Jul 26 at 1:47, David Magda wrote:

> On Jul 25, 2009, at 16:30, Carson Gaspar wrote:
>
>> Frank Middleton wrote:
>>
>>> Doesn't this mean /any/ hardware might have this problem, albeit
>>> with much lower probability?
>>
>> No. You'll lose unwritten data, but won't corrupt the pool, because
>> the on-disk state will be sane, as long as your iSCSI stack doesn't
>> lie about data commits or ignore cache flush commands.
>
> But this entire thread started because Virtual Box's virtual disk
> /did/ lie about data commits.
>
>> Why is this so difficult for people to understand?
>
> Because most people make the (not unreasonable) assumption that
> disks save data the way that they're supposed to: that the data that
> goes in is the data that comes out, and that when the OS tells them
> to empty the buffer they actually flush it.
>
> It's only us storage geeks that generally know the ugly truth that
> this assumption is not always true. :)

Can *someone* please name a single drive+firmware or RAID
controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
commands? Or worse, responds "ok" when the flush hasn't occurred?

Everyone on this list seems to blame lying hardware for ignoring
commands, but disks are relatively mature and I can't believe that
major OEMs would qualify disks or other hardware that willingly ignore
commands.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Thomas Burgess
2009-Jul-27 17:49 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
i was under the impression it was virtualbox and its default setting
that ignored the command, not the hard drive

On Mon, Jul 27, 2009 at 1:27 PM, Eric D. Mudama
<edmudama at bounceswoosh.org> wrote:

> On Sun, Jul 26 at 1:47, David Magda wrote:
>
>> On Jul 25, 2009, at 16:30, Carson Gaspar wrote:
>>
>>> Frank Middleton wrote:
>>>
>>>> Doesn't this mean /any/ hardware might have this problem, albeit
>>>> with much lower probability?
>>>
>>> No. You'll lose unwritten data, but won't corrupt the pool,
>>> because the on-disk state will be sane, as long as your iSCSI
>>> stack doesn't lie about data commits or ignore cache flush
>>> commands.
>>
>> But this entire thread started because Virtual Box's virtual disk
>> /did/ lie about data commits.
>>
>>> Why is this so difficult for people to understand?
>>
>> Because most people make the (not unreasonable) assumption that
>> disks save data the way that they're supposed to: that the data
>> that goes in is the data that comes out, and that when the OS tells
>> them to empty the buffer they actually flush it.
>>
>> It's only us storage geeks that generally know the ugly truth that
>> this assumption is not always true. :)
>
> Can *someone* please name a single drive+firmware or RAID
> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
> commands? Or worse, responds "ok" when the flush hasn't occurred?
>
> Everyone on this list seems to blame lying hardware for ignoring
> commands, but disks are relatively mature and I can't believe that
> major OEMs would qualify disks or other hardware that willingly
> ignore commands.
>
> --eric
>
> --
> Eric D. Mudama
> edmudama at mail.bounceswoosh.org
Chris Ridd
2009-Jul-27 17:54 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27 Jul 2009, at 18:49, Thomas Burgess wrote:

> i was under the impression it was virtualbox and its default setting
> that ignored the command, not the hard drive

Do other virtualization products (eg VMware, Parallels, Virtual PC)
have the same default behaviour as VirtualBox?

I've a suspicion they all behave similarly dangerously, but actual
data would be useful.

Cheers,

Chris
Adam Sherman
2009-Jul-27 17:59 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 13:54 , Chris Ridd wrote:

>> i was under the impression it was virtualbox and its default
>> setting that ignored the command, not the hard drive
>
> Do other virtualization products (eg VMware, Parallels, Virtual PC)
> have the same default behaviour as VirtualBox?
>
> I've a suspicion they all behave similarly dangerously, but actual
> data would be useful.

Also, I think it may have already been posted, but I haven't found the
option to disable VirtualBox' disk cache. Anyone have the incantation
handy?

Thanks,

A

-- 
Adam Sherman
CTO, Versature Corp.
Tel: +1.877.498.3772 x113
Mike Gerdts
2009-Jul-27 18:16 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27, 2009 at 12:54 PM, Chris Ridd <chrisridd at mac.com> wrote:

> On 27 Jul 2009, at 18:49, Thomas Burgess wrote:
>
>> i was under the impression it was virtualbox and its default
>> setting that ignored the command, not the hard drive
>
> Do other virtualization products (eg VMware, Parallels, Virtual PC)
> have the same default behaviour as VirtualBox?

I've lost a pool due to LDoms doing the same. This bug seems to be
related.

http://bugs.opensolaris.org/view_bug.do?bug_id=6684721

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
David Magda
2009-Jul-27 19:14 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, July 27, 2009 13:59, Adam Sherman wrote:

> Also, I think it may have already been posted, but I haven't found
> the option to disable VirtualBox' disk cache. Anyone have the
> incantation handy?

http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0

It tells VB not to ignore the sync/flush command. Caching is still
enabled (it wasn't the problem).
Frank Middleton
2009-Jul-27 19:44 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 07/27/09 01:27 PM, Eric D. Mudama wrote:

> Everyone on this list seems to blame lying hardware for ignoring
> commands, but disks are relatively mature and I can't believe that
> major OEMs would qualify disks or other hardware that willingly
> ignore commands.

You are absolutely correct, but if the cache flush command never makes
it to the disk, then it won't see it. The contention is that by not
relaying the cache flush to the disk, VirtualBox caused the OP to lose
his pool.

IMO this argument is bogus because AFAIK the OP didn't actually power
his system down, so the data would still have been in the cache, and
would presumably eventually have been written. The out-of-order writes
theory is also somewhat dubious, since he was able to write 10TB
without VB relaying the cache flushes. This is all highly hardware
dependent, and AFAIK no one ever asked the OP what hardware he had,
instead blasting him for running VB on MSWindows. Since IIRC he was
using raw disk access, it is questionable whether or not MS was to
blame, but in general it simply shouldn't be possible to lose a pool
under any conditions.

It does raise the question of what happens in general if a cache flush
doesn't happen if, for example, a system crashes in such a way that it
requires a power cycle to restart, and the cache never gets flushed.
Do disks with volatile caches attempt to flush the cache by themselves
if they detect power down? It seems that the ZFS team recognizes this
as a problem, hence the CR to address it.

It turns out (at least according to this almost 4 year old blog)

http://blogs.sun.com/perrin/entry/the_lumberjack

that the ZILs /are/ allocated recursively from the main pool. Unless
there is a ZIL for the ZILs, ZFS really isn't fully journalled, and
this could be the real explanation for all lost pools and/or file
systems. It would be great to hear from the ZFS team that writing a
ZIL, presumably a transaction in its own right, is protected somehow
(by a ZIL for the ZILs?).

Of course the ZIL isn't a journal in the traditional sense, and AFAIK
it has no undo capability the way that a DBMS usually has, but it
needs to be structured so that bizarre things that happen when
something as robust as Solaris crashes don't cause data loss. The
nightmare scenario is when one disk of a mirror begins to fail and the
system comes to a grinding halt where even stop-a doesn't respond, and
a power cycle is the only way out. Who knows what writes may or may
not have been issued or what the state of the disk cache might be at
such a time.

-- Frank
Richard Elling
2009-Jul-27 20:50 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote:

> On Sun, Jul 26 at 1:47, David Magda wrote:
>
>> On Jul 25, 2009, at 16:30, Carson Gaspar wrote:
>>
>>> Frank Middleton wrote:
>>>
>>>> Doesn't this mean /any/ hardware might have this problem, albeit
>>>> with much lower probability?
>>>
>>> No. You'll lose unwritten data, but won't corrupt the pool,
>>> because the on-disk state will be sane, as long as your iSCSI
>>> stack doesn't lie about data commits or ignore cache flush
>>> commands.
>>
>> But this entire thread started because Virtual Box's virtual disk
>> /did/ lie about data commits.
>>
>>> Why is this so difficult for people to understand?
>>
>> Because most people make the (not unreasonable) assumption that
>> disks save data the way that they're supposed to: that the data
>> that goes in is the data that comes out, and that when the OS tells
>> them to empty the buffer they actually flush it.
>>
>> It's only us storage geeks that generally know the ugly truth that
>> this assumption is not always true. :)
>
> Can *someone* please name a single drive+firmware or RAID
> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
> commands? Or worse, responds "ok" when the flush hasn't occurred?

two seconds with google shows
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush

Give it up. These things happen. Not much you can do about it, other
than design around it.
 -- richard
Adam Sherman
2009-Jul-27 20:54 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 15:14 , David Magda wrote:

>> Also, I think it may have already been posted, but I haven't found
>> the option to disable VirtualBox' disk cache. Anyone have the
>> incantation handy?
>
> http://forums.virtualbox.org/viewtopic.php?f=8&t=13661&start=0
>
> It tells VB not to ignore the sync/flush command. Caching is still
> enabled (it wasn't the problem).

Thanks! As Russell points out in the last post to that thread, it
doesn't seem possible to do this with virtual SATA disks? Odd.

A.

-- 
Adam Sherman
CTO, Versature Corp.
Tel: +1.877.498.3772 x113
Nigel Smith
2009-Jul-27 23:09 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
David Magda wrote:

> This is also (theoretically) why a drive purchased from Sun is more
> expensive than a drive purchased from your neighbourhood computer
> shop: Sun (and presumably other manufacturers) takes the time and
> effort to test things to make sure that when a drive says "I've
> synced the data", it actually has synced the data. This testing is
> what you're presumably paying for.

So how do you test a hard drive to check it does actually sync the
data? How would you do it in theory? And in practice?

Now say we are talking about a virtual hard drive, rather than a
physical hard drive. How would that affect the answer to the above
questions?

Thanks
Nigel
-- 
This message posted from opensolaris.org
Toby Thain
2009-Jul-28 00:34 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 27-Jul-09, at 3:44 PM, Frank Middleton wrote:

> On 07/27/09 01:27 PM, Eric D. Mudama wrote:
>
>> Everyone on this list seems to blame lying hardware for ignoring
>> commands, but disks are relatively mature and I can't believe that
>> major OEMs would qualify disks or other hardware that willingly
>> ignore commands.
>
> You are absolutely correct, but if the cache flush command never
> makes it to the disk, then it won't see it. The contention is that
> by not relaying the cache flush to the disk,

No - by COMPLETELY ignoring the flush.

> VirtualBox caused the OP to lose his pool.
>
> IMO this argument is bogus because AFAIK the OP didn't actually
> power his system down, so the data would still have been in the
> cache, and would presumably eventually have been written. The
> out-of-order writes theory is also somewhat dubious, since he was
> able to write 10TB without VB relaying the cache flushes.

Huh? Of course he could. The guest didn't crash while he was doing it!
The corruption occurred when the guest crashed (iirc). And the "out of
order theory" need not be the *only possible* explanation, but it *is*
sufficient.

> This is all highly hardware dependent,

Not in the least. It's a logical problem.

> and AFAIK no one ever asked the OP what hardware he had, instead
> blasting him for running VB on MSWindows.

Which is certainly not relevant to my hypothesis of what broke. I
don't care what host he is running. The argument is the same for all.

> Since IIRC he was using raw disk access, it is questionable whether
> or not MS was to blame, but in general it simply shouldn't be
> possible to lose a pool under any conditions.

How about "when flushes are ignored"?

> It does raise the question of what happens in general if a cache
> flush doesn't happen if, for example, a system crashes in such a way
> that it requires a power cycle to restart, and the cache never gets
> flushed.

Previous explanations have not dented your misunderstanding one iota.
The problem is not that an attempted flush did not complete. It was
that any and all flushes *prior to crash* were ignored. This is where
the failure mode diverges from real hardware.

Again, look:

A B C FLUSH D E F FLUSH <CRASH>

Note that it does not matter *at all* whether the 2nd flush completed.
What matters from an integrity point of view is that the *previous*
flush was completed (and synchronously). Visualise this in the two
scenarios:

1) real hardware: (barring actual defects) that A, B, C were written
was guaranteed by the first flush (otherwise D would never have been
issued). Integrity of the system is intact regardless of whether the
2nd flush completed.

2) VirtualBox: flush never happened. Integrity of the system is lost,
or at best unknown, if it depends on A, B, C all completing before D.

> ...
>
> Of course the ZIL isn't a journal in the traditional sense, and
> AFAIK it has no undo capability the way that a DBMS usually has, but
> it needs to be structured so that bizarre things that happen when
> something as robust as Solaris crashes don't cause data loss.

A lot of engineering effort has been expended in UFS and ZFS to
achieve just that. Which is why it's so nutty to undermine that by
violating semantics in lower layers.

> The nightmare scenario is when one disk of a mirror begins to fail
> and the system comes to a grinding halt where even stop-a doesn't
> respond, and a power cycle is the only way out. Who knows what
> writes may or may not have been issued or what the state of the disk
> cache might be at such a time.

Again, if the flush semantics are respected*, this is not a problem.

--Toby

* - "When this operation completes, previous writes are verifiably on
durable media**."

** - Durable media meaning physical media in a bare metal environment,
and potentially "virtual media" in a virtualised environment.
Ross
2009-Jul-28 09:19 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
I think people can understand the concept of missing flushes. The big
conceptual problem is how this manages to hose an entire filesystem,
which is assumed to have rather a lot of data which ZFS has already
verified to be ok.

Hardware ignoring flushes and losing recent data is understandable, I
don't think anybody would argue with that. Losing access to your
entire pool and multiple gigabytes of data because a few writes failed
is a whole different story, and while I understand how it happens, ZFS
appears to be unique among modern filesystems in suffering such a
catastrophic failure so often.

To give a quick personal example: I can plug a fat32 usb disk into a
windows system, drag some files to it, and pull that drive at any
point. I might lose a few files, but I've never lost the entire
filesystem. Even if the absolute worst happened, I know I can run
scandisk, chkdisk, or any number of file recovery tools and get my
data back. I would never, ever attempt this with ZFS.

For a filesystem like ZFS where its integrity and stability are sold
as being way better than existing filesystems, losing your entire pool
is a bit of a shock. I know that work is going on to be able to
recover pools, and I'll sleep a lot sounder at night once it is
available.
-- 
This message posted from opensolaris.org
Rennie Allen
2009-Jul-28 23:46 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> Can *someone* please name a single drive+firmware or RAID
> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
> commands? Or worse, responds "ok" when the flush hasn't occurred?

I think it would be a shorter list if one were to name the
drives/controllers that actually implement a flush properly.

> Everyone on this list seems to blame lying hardware for ignoring
> commands, but disks are relatively mature and I can't believe that
> major OEMs would qualify disks or other hardware that willingly
> ignore commands.

It seems you have too much faith in major OEMs of storage, considering
that 99.9% of the market is personal use, and for which a 2%
throughput advantage over a competitor can make or break the profit
margin on a device. Ignoring cache requests is guaranteed to get the
best drive performance benchmarks regardless of what software is
driving the device.

For example, it is virtually impossible to find a USB drive that
honors cache sync (to do so would require that the device stop
completely until a fully synchronous USB transaction had made it to
the device and the data had been written). Can you imagine how long a
USB drive would sit on store shelves if it actually did do a proper
cache sync?

While USB is the extreme case, and it does get better the more
expensive the drive, it is still far from a given that any particular
device properly handles cache flushes.
-- 
This message posted from opensolaris.org
Rennie Allen
2009-Jul-28 23:52 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
> This is also (theoretically) why a drive purchased from Sun is more
> expensive than a drive purchased from your neighbourhood computer
> shop:

It's more significant than that. Drives aimed at the consumer market
are at a competitive disadvantage if they do handle cache flush
correctly (since the popular hardware blog of the day will show that
the device is far slower than the competitors that throw away the sync
requests).

> Sun (and presumably other manufacturers) takes the time and effort
> to test things to make sure that when a drive says "I've synced the
> data", it actually has synced the data. This testing is what you're
> presumably paying for.

It wouldn't cost any more for commercial vendors to implement cache
flush properly, it is just that they are penalized by the market for
doing so.
-- 
This message posted from opensolaris.org
Eric D. Mudama
2009-Jul-29 01:34 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Mon, Jul 27 at 13:50, Richard Elling wrote:

> On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote:
>
>> Can *someone* please name a single drive+firmware or RAID
>> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
>> commands? Or worse, responds "ok" when the flush hasn't occurred?
>
> two seconds with google shows
> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush
>
> Give it up. These things happen. Not much you can do about it, other
> than design around it.
> -- richard

That example is Windows-specific, and is a software driver, where the
data integrity feature must be manually disabled by the end user. The
default behavior was always maximum data protection.

While perhaps analogous at some level, the perpetual "your hardware
must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a
sufficient explanation when things go wrong, like complete loss of a
pool.

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
James Andrewartha
2009-Jul-29 10:08 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Nigel Smith wrote:
> David Magda wrote:
>> This is also (theoretically) why a drive purchased from Sun is more
>> expensive than a drive purchased from your neighbourhood computer
>> shop: Sun (and presumably other manufacturers) takes the time and
>> effort to test things to make sure that when a drive says "I've synced
>> the data", it actually has synced the data. This testing is what
>> you're presumably paying for.
>
> So how do you test a hard drive to check it does actually sync the data?
> How would you do it in theory?
> And in practice?
>
> Now say we are talking about a virtual hard drive,
> rather than a physical hard drive.
> How would that affect the answer to the above questions?

http://brad.livejournal.com/2116715.html has a utility that can be used to test whether your systems (including virtual ones) properly sync data to disk when asked to.

--
James Andrewartha
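For anyone who wants to try Brad's script, the rough workflow is to run it as a listener on a second machine, write test data from the box under test, cut power to the test box mid-run, and then verify. The exact arguments below are assumptions and may not match the current script exactly; the script prints its own usage text, so check that first.

    # On a second machine that stays up: record which writes the test box
    # claimed were synced to stable storage
    perl diskchecker.pl -l

    # On the machine (or VM) under test, writing into the filesystem you want
    # to test; pull the power while this is still running
    perl diskchecker.pl -s server-hostname create /tank/test-file 500

    # After the test box comes back up, ask the listener whether every write
    # that was acknowledged as synced actually survived the power cut
    perl diskchecker.pl -s server-hostname verify /tank/test-file

If acknowledged-as-synced writes are missing after the power cut, something in the stack (drive, controller, or virtualization layer) is dropping or lying about flushes.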
Nigel Smith
2009-Jul-29 13:12 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Hi James,

Many thanks for finding & posting that link. I'm sure many people on this forum will be interested in trying out Brad Fitzpatrick's Perl script 'diskchecker.pl'. It will be interesting to hear their results. I've not yet had time to work out how Brad's script works; it would be good if others here could take a critical look at it and feed back their comments to the forum.

I'm disappointed that I've not had a reply from someone at Sun to explain how they test their hard drives. We've had a few people here quick to claim that most hard drives fail to sync/flush correctly, but AFAIK no one is saying how they know this. Have they actually tested, and if so, how? Or do they just "know" because of bad experiences having lost lots of data?

Best Regards
Nigel Smith
--
This message posted from opensolaris.org
Richard Elling
2009-Jul-29 17:55 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On Jul 28, 2009, at 6:34 PM, Eric D. Mudama wrote:
> On Mon, Jul 27 at 13:50, Richard Elling wrote:
>> On Jul 27, 2009, at 10:27 AM, Eric D. Mudama wrote:
>>> Can *someone* please name a single drive+firmware or RAID
>>> controller+firmware that ignores FLUSH CACHE / FLUSH CACHE EXT
>>> commands? Or worse, responds "ok" when the flush hasn't occurred?
>>
>> two seconds with google shows
>> http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=183771&NewLang=en&Hilite=cache+flush
>>
>> Give it up. These things happen. Not much you can do about it, other
>> than design around it.
>> -- richard
>
> That example is Windows-specific, and concerns a software driver where
> the data integrity feature must be manually disabled by the end user.
> The default behavior was always maximum data protection.

I don't think you read the post. It specifically says, "Previous versions of the Promise drivers ignored the flush cache command until system power down." Promise makes RAID controllers and has a firmware fix for this. This is the kind of thing we face: some performance engineer tries to get an edge by assuming there is only one case where cache flush matters.

Another 2 seconds with google shows:
http://sunsolve.sun.com/search/document.do?assetkey=1-66-200007-1
(interestingly, for this one, fsck also fails)
http://sunsolve.sun.com/search/document.do?assetkey=1-21-103622-06-1
http://forums.seagate.com/stx/board/message?board.id=freeagent&message.id=5060&query.id=3999#M5060

But vendors also get cache flush code wrong in the opposite direction. A good example of that is the notorious Seagate 1.5 TB disk "stutter" problem.

NB, for the most part, vendors do not air their dirty laundry (eg bug reports) on the internet for those without support contracts. If you have a support contract, your search may show many more cases.

> While perhaps analogous at some level, the perpetual "your hardware
> must be crappy/cheap/not-as-expensive-as-mine" doesn't seem to be a
> sufficient explanation when things go wrong, like complete loss of a
> pool.

As I said before, it is a systems engineering problem. If you do your own systems engineering, then you should make sure the components you select work as you expect.
-- richard
After all the discussion here about VB, and all the finger pointing, I raised a bug on VB about flushing.

Remember I am using RAW disks via the SATA emulation in VB; the disks are WD 2TB drives. Also remember the HOST machine NEVER crashed or stopped, BUT the guest OS OpenSolaris was hung and so I powered off the VIRTUAL host.

OK, this is what the VB engineer had to say after reading this and another thread I had pointed him to. (He missed the fact I was using RAW, not surprising as it's a rather long thread now!)

==============================
Just looked at those two threads, and from what I saw all vital information is missing - no hint whatsoever on how the user set up his disks, nothing about what errors should be dealt with and so on. So hard to say anything sensible, especially as people seem most interested in assigning blame to some product. ZFS doesn't deserve this, and VirtualBox doesn't deserve this either.

In the first place, there is absolutely no difference in how the IDE and SATA devices handle the flush command. The documentation just wasn't updated to talk about the SATA controller. Thanks for pointing this out, it will be fixed in the next major release. If you want to get the information straight away: just replace "piix3ide" with "ahci", and all other flushing behavior settings apply as well. See a bit further below for what it buys you (or not).

What I haven't mentioned is the rationale behind the current behavior. The reason for ignoring flushes is simple: the biggest competitor does it by default as well, and one gets beaten up by every reviewer if VirtualBox is just a few percent slower than you know what. Forget about arguing with reviewers.

That said, a bit about what flushing can achieve - or not. Just keep in mind that VirtualBox doesn't really buffer anything. In the IDE case every read and write request gets handed more or less straight (depending on the image format complexity) to the host OS. So there is absolutely nothing which can be lost if one assumes the host OS doesn't crash.

In the SATA case things are slightly more complicated. If you're using anything but raw disks or flat file VMDKs, the behavior is 100% identical to IDE. If you use raw disks or flat file VMDKs, we activate NCQ support in the SATA device code, which means that the guest can push through a number of commands at once, and they get handled on the host via async I/O. Again - if the host OS works reliably there is nothing to lose.

The only thing flushing can potentially improve is the behavior when the host OS crashes. But that depends on many assumptions about what the respective OS does, what the filesystems do, etc.

Hope those facts can be the basis of a real discussion. Feel free to raise any issue you have in this context, as long as it's not purely hypothetical.
==================================
--
This message posted from opensolaris.org
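For anyone who would rather turn flushing back on than argue about the default, the knob the VirtualBox engineer is describing appears to be the per-controller IgnoreFlush extradata documented in the VirtualBox manual. A minimal sketch; the VM name and LUN numbers are placeholders you would adjust for your own configuration:

    # Pass flush requests through instead of ignoring them.
    # IDE controller, primary master (LUN#0):
    VBoxManage setextradata "storage1-vm" \
      "VBoxInternal/Devices/piix3ide/0/LUN#0/Config/IgnoreFlush" 0

    # Per the engineer's note, the same setting exists for the SATA/AHCI
    # controller; repeat for each attached port (LUN#0, LUN#1, ...):
    VBoxManage setextradata "storage1-vm" \
      "VBoxInternal/Devices/ahci/0/LUN#0/Config/IgnoreFlush" 0

As far as I know the change only takes effect the next time the VM starts, and it only helps if the guest actually issues flushes, which ZFS does.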
Thanks for following up with this, Russel. On Jul 31, 2009, at 7:11 AM, Russel wrote:> After all the discussion here about VB, and all the finger pointing > I raised a bug on VB about flushing. > > Remember I am using RAW disks via the SATA emulation in VB > the disks are WD 2TB drives. Also remember the HOST machine > NEVER crashed or stopped. BUT the guest OS OpenSolaris was > hung and so I powered off the VIRTUAL host. > > OK, this is what the VB engineer had to say after reading this and > another thread I had pointed him to. (he missed the fast I was > using RAW not supprising as its a rather long thread now!) > > ==============================> Just looked at those two threads, and from what I saw all vital > information is missing - no hint whatsoever on how the user set up > his disks, nothing about what errors should be dealt with and so on. > So hard to say anything sensible, especially as people seem most > interested in assigning blame to some product. ZFS doesn''t deserve > this, and VirtualBox doesn''t deserve this either. > > In the first place, there is absolutely no difference in how the IDE > and SATA devices handle the flush command. The documentation just > wasn''t updated to talk about the SATA controller. Thanks for > pointing this out, it will be fixed in the next major release. If > you want to get the information straight away: just replace > "piix3ide" with "ahci", and all other flushing behavior settings > apply as well. See a bit further below of what it buys you (or not). > > What I haven''t mentioned is the rationale behind the current > behavior. The reason for ignoring flushes is simple: the biggest > competitor does it by default as well, and one gets beaten up by > every reviewer if VirtualBox is just a few percent slower than you > know what. Forget about arguing with reviewers. > > That said, a bit about what flushing can achieve - or not. Just keep > in mind that VirtualBox doesn''t really buffer anything. In the IDE > case every read and write requests gets handed more or less straight > (depending on the image format complexity) to the host OS. So there > is absolutely nothing which can be lost if one assumes the host OS > doesn''t crash. > > In the SATA case things are slightly more complicated. If you''re > using anything but raw disks or flat file VMDKs, the behavior is > 100% identical to IDE. If you use raw disks or flat file VMDKs, we > activate NCQ support in the SATA device code, which means that the > guest can push through a number of commands at once, and they get > handled on the host via async I/O. Again - if the host OS works > reliably there is nothing to lose.The problem with this thought process is that since the data is not on medium, a fault that occurs between the flush request and the bogus ack goes undetected. The OS trusts when the disk said "the data is on the medium" that the data is on the medium with no errors. This problem also affects "hardware" RAID arrays which provide nonvolatile caches. If the array acks a write and flush, but the data is not yet committed to medium, then if the disk fails, the data must remain in nonvolatile cache until it can be committed to the medium. A use case may help, suppose the power goes out. Most arrays have enough battery to last for some time. But if power isn''t restored prior to the batteries discharging, then there is a risk of data loss. For ZFS, cache flush requests are not gratuitous. One critical case is the uberblock or label update. ZFS does: 1. update labels 0 and 2 2. flush 3. 
check for errors, 4. update labels 1 and 3, 5. flush, 6. check for errors.

Making flush be a nop destroys the ability to check for errors, thus breaking the trust between ZFS and the data on medium.
-- richard

> The only thing flushing can potentially improve is the behavior
> when the host OS crashes. But that depends on many assumptions about
> what the respective OS does, what the filesystems do, etc.
>
> Hope those facts can be the basis of a real discussion. Feel free to
> raise any issue you have in this context, as long as it's not purely
> hypothetical.
>
> ==================================
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
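A quick way to see the effect of that two-phase update on a pool is to compare the txg recorded in each of the four labels on every device; if a flush was silently dropped at the wrong moment, one label pair can lag behind the other. A rough sketch, using the device names from the pool in this thread (adjust for your own):

    # dump just the label headers and txg values for each raidz member
    for d in c9t0d0s0 c9t1d0s0 c9t2d0s0 c9t3d0s0 c9t4d0s0
    do
        echo "=== $d ==="
        zdb -l /dev/dsk/$d | egrep 'LABEL|txg'
    done

On a healthy, exported pool all four labels on all devices should report the same txg.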
Dave Stubbs
2009-Jul-31 22:23 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
> I don't mean to be offensive Russel, but if you do
> ever return to ZFS, please promise me that you will
> never, ever, EVER run it virtualized on top of NTFS
> (a.k.a. worst file system ever) in a production
> environment. Microsoft Windows is a horribly
> unreliable operating system in situations where
> things like protecting against data corruption are
> important. Microsoft knows this

Oh WOW! Whether or not our friend Russel virtualized on top of NTFS (he didn't - he used raw disk access), this point is amazing! System5 - based on this thread I'd say you can't really make this claim at all. Solaris suffered a crash and the ZFS filesystem lost EVERYTHING! And there aren't even any recovery tools?

HANG YOUR HEADS!!!

Recovery from the same situation is EASY on NTFS. There are piles of tools out there that will recover the file system, and failing that, locate and extract data. The key parts of the file system are stored in multiple locations on the disk just in case. It's been this way for over 10 years. I'd say it seems from this thread that my data is a lot safer on NTFS than it is on ZFS!

I can't believe my eyes as I read all these responses blaming system engineering and hiding behind ECC memory excuses and "well, you know, ZFS is intended for more Professional systems and not consumer devices, etc etc." My goodness! You DO realize that Sun has this website called opensolaris.org which actually proposes to have people use ZFS on commodity hardware, don't you? I don't see a huge warning on that site saying "ATTENTION: YOU PROBABLY WILL LOSE ALL YOUR DATA".

I recently flirted with putting several large Unified Storage 7000 systems on our corporate network. The hype about ZFS is quite compelling and I had positive experience in my lab setting. But because of not having Solaris capability on our staff we went in another direction instead. Reading this thread, I'm SO glad we didn't put ZFS in production in ANY way.

Guys, this is the real world. Stuff happens. It doesn't matter what the reason is - hardware lying about cache commits, out-of-order commits, failure to use ECC memory, whatever. It is ABSOLUTELY unacceptable for the filesystem to be entirely lost. No excuse or rationalization of any type can be justified. There MUST be at least the base suite of tools to deal with this stuff. Without it, ZFS simply isn't ready yet.

I am saving a copy of this thread to show my colleagues and also those Sun Microsystems sales people that keep calling.
--
This message posted from opensolaris.org
Richard Elling
2009-Jul-31 23:15 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
wow, talk about a knee jerk reaction... On Jul 31, 2009, at 3:23 PM, Dave Stubbs wrote:>> I don''t mean to be offensive Russel, but if you do >> ever return to ZFS, please promise me that you will >> never, ever, EVER run it virtualized on top of NTFS >> (a.k.a. worst file system ever) in a production >> environment. Microsoft Windows is a horribly >> unreliable operating system in situations where >> things like protecting against data corruption are >> important. Microsoft knows this > > Oh WOW! Whether or not our friend Russel virtualized on top of NTFS > (he didn''t - he used raw disk access) this point is amazing!This point doesn''t matter. VB sits between the guest OS and the raw disk and drops cache flush requests.> System5 - based on this thread I''d say you can''t really make this > claim at all. Solaris suffered a crash and the ZFS filesystem lost > EVERYTHING! And there aren''t even any recovery tools?As has been described many times over the past few years, there is a manual procedure.> HANG YOUR HEADS!!!> Recovery from the same situation is EASY on NTFS. There are piles > of tools out there that will recover the file system, and failing > that, locate and extract data. The key parts of the file system are > stored in multiple locations on the disk just in case. It''s been > this way for over 10 years.ZFS also has redundant metadata written at different places on the disk. ZFS, like NTFS, issues cache flush requests with the expectation that the disk honors that request.> I''d say it seems from this thread that my data is a lot safer on > NTFS than it is on ZFS!Nope. NTFS doesn''t know when data is corrupted. Until it does, it is blissfully ignorant.> > I can''t believe my eyes as I read all these responses blaming system > engineering and hiding behind ECC memory excuses and "well, you > know, ZFS is intended for more Professional systems and not consumer > devices, etc etc." My goodness! You DO realize that Sun has this > website called opensolaris.org which actually proposes to have > people use ZFS on commodity hardware, don''t you? I don''t see a huge > warning on that site saying "ATTENTION: YOU PROBABLY WILL LOSE ALL > YOUR DATA".You probably won''t lose all of your data. Statistically speaking, there are very few people who have seen this. There are many more cases where ZFS detected and repaired corruption.> I recently flirted with putting several large Unified Storage 7000 > systems on our corporate network. The hype about ZFS is quite > compelling and I had positive experience in my lab setting. But > because of not having Solaris capability on our staff we went in > another direction instead.Interesting. The 7000 systems completely shield you from the underlying OS. You administer the system via a web browser interface. There is no OS to learn with these systems, just like you don''t go around requiring Darwin knowledge to use your iPhone.> Reading this thread, I''m SO glad we didn''t put ZFS in production in > ANY way. Guys, this is the real world. Stuff happens. It doesn''t > matter what the reason is - hardware lying about cache commits, out- > of-order commits, failure to use ECC memory, whatever. It is > ABSOLUTELY unacceptable for the filesystem to be entirely lost. No > excuse or rationalization of any type can be justified. There MUST > be at least the base suite of tools to deal with this stuff. > without it, ZFS simply isn''t ready yet.At the risk of being redundant, redundant there is a procedure. 
The fine folks at Sun, like Victor Latushkin, have helped people recover such pools, as has been pointed out in this thread several times. This is not the sort of procedure easily done over an open forum; it is more efficient to recover via a service call.

Microsoft talks about NTFS in Windows 2008[*] as, "Self-healing NTFS preserves as much data as possible, based on the type of corruption detected." Regarding catastrophic failures they note, "Self-healing NTFS accepts the mount request, but if the volume is known to have some form of corruption, a repair is initiated immediately. The exception to this would be a catastrophic failure that requires an offline recovery method, such as manual recovery, to minimize the loss of data."

Do you consider that any different than the current state of ZFS?

[*] http://technet.microsoft.com/en-us/library/cc771388(WS.10).aspx
-- richard
Brian
2009-Jul-31 23:26 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
I must say this thread has also damaged the view I have of ZFS. I've been considering just getting a RAID-5 controller and going the Linux route I had planned on.
--
This message posted from opensolaris.org
David Magda
2009-Jul-31 23:50 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Jul 31, 2009, at 19:26, Brian wrote:
> I must say this thread has also damaged the view I have of ZFS. I've
> been considering just getting a RAID-5 controller and going the
> Linux route I had planned on.

It's your data, and you are responsible for it. So this thread, if nothing else, allows you to make an informed decision.

I think that where most other file systems don't detect, or simply ignore, the corner cases that have always existed (cf. CERN's data integrity study), ZFS brings them to light. To some extent it's a matter of updating some of the available tools so that ZFS can recover from some of these cases in a more graceful fashion.

It should also be noted, though, that nobody notices when things go right. :) There are people who have been running ZFS on humongous pools for a while. It's just that we always have the worst-case scenarios showing up on the list. :)
Bob Friesenhahn
2009-Jul-31 23:54 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Fri, 31 Jul 2009, Brian wrote:
> I must say this thread has also damaged the view I have of ZFS.
> I've been considering just getting a RAID-5 controller and going the
> Linux route I had planned on.

Thankfully, the ZFS users who have never lost a pool do not spend much time posting about their excitement at never losing a pool. Otherwise this list would be even more overwhelming.

I have not yet lost a pool, and this includes the one built on USB drives which might be ignoring cache sync requests.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Jason A. Hoffman
2009-Aug-01 00:00 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Jul 31, 2009, at 4:54 PM, Bob Friesenhahn wrote:
> On Fri, 31 Jul 2009, Brian wrote:
>
>> I must say this thread has also damaged the view I have of ZFS. I've
>> been considering just getting a RAID-5 controller and going the
>> Linux route I had planned on.
>
> Thankfully, the ZFS users who have never lost a pool do not spend
> much time posting about their excitement at never losing a pool.
> Otherwise this list would be even more overwhelming.
>
> I have not yet lost a pool, and this includes the one built on USB
> drives which might be ignoring cache sync requests.

I have thousands and thousands and thousands of zpools. I started collecting such zpools back in 2005. None have been lost.

Best regards, Jason

------------------------------------------------------------
Jason A. Hoffman, PhD | Founder, CTO, Joyent Inc.
jason at joyent.com    http://joyent.com/
mobile: +1-415-279-6196
------------------------------------------------------------
David Magda
2009-Aug-01 00:11 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On Jul 31, 2009, at 20:00, Jason A. Hoffman wrote:
> On Jul 31, 2009, at 4:54 PM, Bob Friesenhahn wrote:
>> On Fri, 31 Jul 2009, Brian wrote:
>>
>>> I must say this thread has also damaged the view I have of ZFS.
>>> I've been considering just getting a RAID-5 controller and going
>>> the Linux route I had planned on.
>>
>> Thankfully, the ZFS users who have never lost a pool do not spend
>> much time posting about their excitement at never losing a pool.
>> Otherwise this list would be even more overwhelming.
>>
>> I have not yet lost a pool, and this includes the one built on USB
>> drives which might be ignoring cache sync requests.
>
> I have thousands and thousands and thousands of zpools. I started
> collecting such zpools back in 2005. None have been lost.

Also a reminder that on-disk redundancy (RAID-5, 6, Z, etc.) is no substitute for backups. Your controller (or software RAID) can hose data in many circumstances as well. CERN's study revealed a bug in WD disk firmware (fixed in a later version), interacting with their 3Ware controllers, that caused 80% of the errors they experienced.
Toby Thain
2009-Aug-01 00:21 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On 31-Jul-09, at 7:15 PM, Richard Elling wrote:> wow, talk about a knee jerk reaction... > > On Jul 31, 2009, at 3:23 PM, Dave Stubbs wrote: > >>> I don''t mean to be offensive Russel, but if you do >>> ever return to ZFS, please promise me that you will >>> never, ever, EVER run it virtualized on top of NTFS >>> (a.k.a. worst file system ever) in a production >>> environment. Microsoft Windows is a horribly >>> unreliable operating system in situations where >>> things like protecting against data corruption are >>> important. Microsoft knows this >> >> Oh WOW! Whether or not our friend Russel virtualized on top of >> NTFS (he didn''t - he used raw disk access) this point is amazing! > > This point doesn''t matter. VB sits between the guest OS and the raw > disk and > drops cache flush requests. > >> System5 - based on this thread I''d say you can''t really make this >> claim at all. Solaris suffered a crash and the ZFS filesystem >> lost EVERYTHING! And there aren''t even any recovery tools? > > As has been described many times over the past few years, there is > a manual > procedure. > >> HANG YOUR HEADS!!! > >> Recovery from the same situation is EASY on NTFS. There are piles >> of tools out there that will recover the file system, and failing >> that, locate and extract data. The key parts of the file system >> are stored in multiple locations on the disk just in case. It''s >> been this way for over 10 years. > > ZFS also has redundant metadata written at different places on the > disk. > ZFS, like NTFS, issues cache flush requests with the expectation that > the disk honors that request.Can anyone name a widely used transactional or journaled filesystem or RDBMS that *doesn''t* need working barriers?> >> I''d say it seems from this thread that my data is a lot safer on >> NTFS than it is on ZFS! > > Nope. NTFS doesn''t know when data is corrupted. Until it does, it is > blissfully ignorant.People still choose systems that don''t even know which side of a mirror is good. Do they ever wonder what happens when you turn off a busy RAID-1? Or why checksumming and COW make a difference? This thread hasn''t shaken my preference for ZFS at all; just about everything else out there relies on nothing more than dumb luck to maintain integrity. --Toby> >> >> I can''t believe my eyes as I read all these responses blaming >> system engineering and hiding behind ECC memory excuses and "well, >> you know, ZFS is intended for more Professional systems and not >> consumer devices, etc etc." My goodness! You DO realize that Sun >> has this website called opensolaris.org which actually proposes to >> have people use ZFS on commodity hardware, don''t you? I don''t see >> a huge warning on that site saying "ATTENTION: YOU PROBABLY WILL >> LOSE ALL YOUR DATA". > > You probably won''t lose all of your data. Statistically speaking, > there > are very few people who have seen this. There are many more cases > where ZFS detected and repaired corruption. > ... > -- richard > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Ian Collins
2009-Aug-01 00:23 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
Brian wrote:
> I must say this thread has also damaged the view I have of ZFS. I've been considering just getting a RAID-5 controller and going the Linux route I had planned on.

That'll be your loss. I've never managed to lose a pool, and I've all sorts of unreliable media and all sorts of nasty ways to break them!

Whatever you choose, don't forget to back up your data.

--
Ian.
Adam Sherman
2009-Aug-01 00:58 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On 31-Jul-09, at 20:00, Jason A. Hoffman wrote:
> I have thousands and thousands and thousands of zpools. I started
> collecting such zpools back in 2005. None have been lost.
>
> Best regards, Jason
>
> ------------------------------------------------------------
> Jason A. Hoffman, PhD | Founder, CTO, Joyent Inc.

I believe I have about a TB of data on at least one of Jason's pools and it seems to still be around. ;)

A.

--
Adam Sherman
CTO, Versature Corp.
Tel: +1.877.498.3772 x113
Great to hear a few success stories! We have been experimentally running ZFS on really crappy hardware and it has never lost a pool. Running on VB with ZFS/iSCSI raw disks we have yet to see any errors at all. On sun4u with LSI SAS/SATA it is really rock solid. And we've been going out of our way to break it, because of bad experiences with NTFS, ext2 and UFS as well as many disk failures (ever had fsck run amok?).

On 07/31/09 12:11 PM, Richard Elling wrote:
> Making flush be a nop destroys the ability to check for errors
> thus breaking the trust between ZFS and the data on medium.
> -- richard

Can you comment on the issue that the underlying disks were, as far as we know, never powered down? My understanding is that disks usually try to flush their caches as quickly as possible to make room for more data, so in this scenario things were probably quiet after the guest crash, and whatever was in the cache would likely have been flushed anyway, certainly by the time the OP restarted VB and the guest.

Could you also comment on CR 6667683, which I believe is proposed as a solution for recovery in this very rare case?

I understand that the ZILs are allocated out of the general pool. Is there a ZIL for the ZILs, or does this make no sense?

As the one who started the whole ECC discussion, I don't think anyone has ever claimed that lack of ECC caused this loss of a pool, or that it could. AFAIK lack of ECC can't be a problem at all on RAIDZ vdevs, only with single drives or plain mirrors. I've suggested an RFE for the mirrored case to double-buffer the writes, but disabling checksums pretty much fixes the problem if you don't have ECC, so it isn't worth pursuing. You can disable checksums per file system, so this is an elegant solution if you don't have ECC memory but you do mirror. No mirror IMO is suicidal with any file system.

Has anyone ever actually lost a pool on Sun hardware other than by losing too many replicas or operator error? As you have so eloquently pointed out, building a reliable storage system is an engineering problem. There are a lot of folks out there who are very happy with ZFS on decent hardware. On crappy hardware you get what you pay for...

Cheers -- Frank (happy ZFS evangelist)
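For reference, both of the knobs Frank mentions are ordinary per-dataset properties; a minimal sketch, with made-up pool and filesystem names:

    # turn checksums off for one file system only (generally not recommended,
    # but this is the per-dataset switch being discussed)
    zfs set checksum=off tank/scratch

    # going the other way, keep extra copies of each data block for a dataset
    # you care about, even on a single-device pool
    zfs set copies=2 tank/photos

    # confirm the settings
    zfs get checksum,copies tank/scratch tank/photos

Both properties only affect data written after the change, not blocks already on disk.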
> I understand
> that the ZILs are allocated out of the general pool.

There is one intent log chain per dataset (file system or zvol). The head of each log is kept in the main pool. Without slog(s) we allocate (and chain) blocks from the main pool. If separate intent log(s) exist then blocks are allocated and chained there. If we fail to allocate from the slog(s) then we revert to allocating from the main pool.

> Is there a ZIL for the ZILs, or does this make no sense?

There is no ZIL for the ZILs. Note the ZIL is not a journal (like ext3 or UFS logging). It simply contains records of system calls (including data) that need to be replayed if the system crashes before those records have been committed in a transaction group.

Hope that helps: Neil.
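To make the slog case concrete: a separate intent log is just another vdev type added to the pool. A minimal sketch, with the extra device names assumed:

    # add a dedicated (ideally low-latency) device as a separate intent log
    zpool add array1 log c9t5d0

    # or a mirrored slog, so losing one log device doesn't lose recent
    # synchronous writes
    zpool add array1 log mirror c9t5d0 c9t6d0

    # the log vdev then shows up alongside the raidz vdev
    zpool status array1

Only synchronous writes go through the intent log; the regular transaction group commits still flow to the main pool vdevs.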
Bryan Allen
2009-Aug-01 02:41 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
+------------------------------------------------------------------------------ | On 2009-07-31 17:00:54, Jason A. Hoffman wrote: | | I have thousands and thousands and thousands of zpools. I started | collecting such zpools back in 2005. None have been lost. I don''t have thousands and thousands of zpools, but I do have more than would fit in a breadbox. And bigger, too. ZFS: Verifying, cuddling and wrangling my employer''s business critical data since 2007. (No bits were harmed in the production of this storage network.) (No, really. We validated their checksums.) -- bda cyberpunk is dead. long live cyberpunk.
On Fri, Jul 31, 2009 at 7:58 PM, Frank Middleton <f.middleton at apogeect.com> wrote:
> Has anyone ever actually lost a pool on Sun hardware other than
> by losing too many replicas or operator error? As you have so

Yes, I have lost a pool when running on Sun hardware.

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-September/013233.html

Quite likely related to:

http://bugs.opensolaris.org/view_bug.do?bug_id=6684721

In other words, it was a buggy Sun component that didn't do the right thing with cache flushes.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Ross
2009-Aug-01 05:39 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
> wow, talk about a knee jerk reaction...

Not at all. A long thread is started where the user lost his pool, and the discussion shows it's a known problem. I love ZFS and I'm still very nervous about the risk of losing an entire pool.

> As has been described many times over the past few
> years, there is a manual procedure.

Yes, but there are a few issues with this:

1. The OP doesn't seem to have been able to get anybody to help him recover his pool. The natural assumption reading a thread like this is that ZFS pool corruption happens, and you lose your data.
2. While the procedure may have been mentioned, I've never seen a link to official documentation on it.
3. My understanding from reading Victor's threads (although I may be wrong) is that this recovery takes a significant amount of time.

> You probably won't lose all of your data. Statistically speaking, there
> are very few people who have seen this. There are many more cases
> where ZFS detected and repaired corruption.

Yes, but statistics don't matter when emotions come into play, and I'm afraid something like this is going to scare off a lot of people who read about it. It might be rare, but people don't think like that. Why do you think so many play the lottery ;-)

The other point is that system admins like to have control over their own data. It's their job on the line if things go wrong, and if they see a major problem like this without an obvious solution, and one they would have very little control over if it happens, they're going to get very nervous about implementing it.

From a psychological point of view, this issue is very damaging to ZFS. On the flip side, once the recovery tool is available, this will turn into a good positive for ZFS. I don't believe I've heard of any other bug that causes complete loss of the pool, so with a recovery tool, ZFS should have an enviable ability to safeguard data.

--
This message posted from opensolaris.org
Scott Lawson
2009-Aug-01 07:11 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
Dave Stubbs wrote:>> I don''t mean to be offensive Russel, but if you do >> ever return to ZFS, please promise me that you will >> never, ever, EVER run it virtualized on top of NTFS >> (a.k.a. worst file system ever) in a production >> environment. Microsoft Windows is a horribly >> unreliable operating system in situations where >> things like protecting against data corruption are >> important. Microsoft knows this >> > > Oh WOW! Whether or not our friend Russel virtualized on top of NTFS (he didn''t - he used raw disk access) this point is amazing! System5 - based on this thread I''d say you can''t really make this claim at all. Solaris suffered a crash and the ZFS filesystem lost EVERYTHING! And there aren''t even any recovery tools? > > HANG YOUR HEADS!!! > > Recovery from the same situation is EASY on NTFS. There are piles of tools out there that will recover the file system, and failing that, locate and extract data. The key parts of the file system are stored in multiple locations on the dYou mean the data that you don''t know you have lost yet? ZFS allows you to be very paranoid about data protection with things like copies=2,3,4 etc etc..> isk just in case. It''s been this way for over 10 years. I''d say it seems from this thread that my data is a lot safer on NTFS than it is on ZFS! > > I can''t believe my eyes as I read all these responses blaming system engineering and hiding behind ECC memory excuses and "well, you know, ZFS is intended for more Professional systems and not consumer devices, etc etc." My goodness! You DO realize that Sun has this website called opensolaris.org which actually proposes to have people use ZFS on commodity hardware, don''t you? I don''t see a huge warning on that site saying "ATTENTION: YOU PROBABLY WILL LOSE ALL YOUR DATA". > > I recently flirted with putting several large Unified Storage 7000 systems on our corporate network. The hype about ZFS is quite compelling and I had positive experience in my lab setting. But because of not having Solaris capability on our staff we went in another direction instead. >You do realize that the 7000 series machines are appliances and have no prerequisite for you to have any Solaris knowledge whatsoever? They are a supported device just like any other disk storage system that you can purchase from any vendor and have it supported as such. To use it all you need is a web browser. Thats it. This is no different than your EMC array or HP Storageworks hardware, except that the under pinnings of the storage system are there for all to see in the form of open source code contributed to the community by Sun.> Reading this thread, I''m SO glad we didn''t put ZFS in production in ANY way. Guys, this is the real world. Stuff happens. It doesn''t matter what the reason is - hardware lying about cache commits, out-of-order commits, failure to use ECC memory, whatever. It is ABSOLUTELY unacceptable for the filesystem to be entirely lost. No excuse or rationalization of any type can be justified. There MUST be at least the base suite of tools to deal with this stuff. without it, ZFS simply isn''t ready yet. >Sounds like you have no real world experience of ZFS in production environments and it''s true reliability. As many people here report there are thousands if not millions of zpools out there containing business critical environments that are happily fixing broken hardware on a daily basis. I have personally seen all sorts of pieces of hardware break and ZFS corrected and fixed things for me. 
I personally manage 50 plus ZFS zpools that are anywhere from 100GB to 30 TB. Works very, very, very well for me. I have never lost anything despite having had plenty of pieces of hardware break in some form underneath ZFS.> I am saving a copy of this thread to show my colleagues and also those Sun Microsystems sales people that keep calling. >
Brian
2009-Aug-01 09:04 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
> On Fri, 31 Jul 2009, Brian wrote:
>
> > I must say this thread has also damaged the view I have of ZFS.
> > I've been considering just getting a RAID-5 controller and going the
> > Linux route I had planned on.
>
> Thankfully, the ZFS users who have never lost a pool do not spend much
> time posting about their excitement at never losing a pool.
> Otherwise this list would be even more overwhelming.
>
> I have not yet lost a pool, and this includes the one built on USB
> drives which might be ignoring cache sync requests.
>
> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Yes, you are right, I spoke irrationally. I still intend to try it out at least for a period of time to see what I think; I'll put it through the standard tests and such. However, I am having trouble getting my motherboard to recognize 4 of the hard drives I picked (I made a post about it in the storage forum). Once that's finished I'll get this testing underway.

--
This message posted from opensolaris.org
Germano Caronni
2009-Aug-02 22:07 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
Have you considered this? *Maybe* a little time travel to an old uberblock could help you? http://www.opensolaris.org/jive/thread.jspa?threadID=85794 -- This message posted from opensolaris.org
Stephen Pflaum
2009-Aug-02 23:44 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
George, I have a pool with family photos on it which needs recovery. Is there a livecd with a tool to invalidate the uberblock which will boot on a macbookpro? Steve -- This message posted from opensolaris.org
Victor Latushkin
2009-Aug-03 18:00 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40
On 03.08.09 03:44, Stephen Pflaum wrote:
> George,
>
> I have a pool with family photos on it which needs recovery. Is there a livecd with a tool to invalidate the uberblock which will boot on a macbookpro?

This has been recovered by rolling two txgs back. The pool is being scrubbed now. More details and some helpful hints later.

Victor
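For readers who find this thread later: the manual txg rollback Victor describes was subsequently productized as a recovery option on pool import. A rough sketch, assuming a ZFS build recent enough to have it (check your zpool(1M) man page first):

    # try a normal import first
    zpool import array1

    # if the pool is reported as corrupted, ask for recovery mode, which may
    # discard the last few transactions and rewind to an older valid uberblock
    zpool import -F array1

    # dry run: report whether recovery would succeed and roughly what would be
    # lost, without actually rewinding anything
    zpool import -nF array1

On the 2009.06 bits in this thread that option does not exist, which is why the rollback had to be done by hand.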
Roch Bourbonnais
2009-Aug-04 13:00 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 19 Jul 2009, at 16:47, Bob Friesenhahn wrote:
> On Sun, 19 Jul 2009, Ross wrote:
>
>> The success of any ZFS implementation is *very* dependent on the
>> hardware you choose to run it on.
>
> To clarify:
>
> "The success of any filesystem implementation is *very* dependent on
> the hardware you choose to run it on."
>
> ZFS requires that the hardware cache sync works and is respected.

Yes.

> Without taking advantage of the drive caches, zfs would be
> considerably less performant.

That, I'm not so sure about. When ZFS first came out, most pools were built on Thumpers with a SATA device driver that did not handle NCQ concurrency. Enabling the write cache on a drive was a necessary way to have the drive firmware handle multiple requests with small service times. Today we've got better device drivers, but we've stopped comparing performance data with the disk write caches on and off. The delta today might be a lot smaller than it used to be (and even less noticeable if one uses a slog on SSD).

-r

> Bob
> --
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Roch Bourbonnais
2009-Aug-04 13:28 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 26 Jul 2009, at 01:34, Toby Thain wrote:
>
> On 25-Jul-09, at 3:32 PM, Frank Middleton wrote:
>
>> On 07/25/09 02:50 PM, David Magda wrote:
>>
>>> Yes, it can be affected. If the snapshot's data structure / record is
>>> underneath the corrupted data in the tree then it won't be able to be
>>> reached.
>>
>> Can you comment on if/how mirroring or raidz mitigates this, or tree
>> corruption in general? I have yet to lose a pool even on a machine
>> with fairly pathological problems, but it is mirrored (and copies=2).
>>
>> I was also wondering if you could explain why the ZIL can't
>> repair such damage.
>>
>> Finally, a number of posters blamed VB for ignoring a flush, but
>> according to the evil tuning guide, without any application syncs,
>> ZFS may wait up to 5 seconds before issuing a synch, and there
>> must be all kinds of failure modes even on bare hardware where
>> it never gets a chance to do one at shutdown. This is interesting
>> if you do ZFS over iscsi because of the possibility of someone
>> tripping over a patch cord or a router blowing a fuse. Doesn't
>> this mean /any/ hardware might have this problem, albeit with much
>> lower probability?
>
> The problem is assumed *ordering*. In this respect VB ignoring
> flushes and real hardware are not going to behave the same.
>
> --Toby

I agree that no one should be ignoring cache flushes. However, the path to corruption must involve some dropped acknowledged I/Os: the uberblock I/O was issued to stable storage, but the blocks it pointed to, which had reached the disk firmware earlier, never made it to stable storage. I can see this scenario when the disk loses power, but I don't see it with cutting power to the guest.

When managing a zpool on external storage, do people export the pool before taking snapshots of the guest?

-r
Toby Thain
2009-Aug-04 16:33 UTC
[zfs-discuss] Another user looses his pool (10TB) in this case and 40 days work
On 4-Aug-09, at 9:28 AM, Roch Bourbonnais wrote:> > Le 26 juil. 09 ? 01:34, Toby Thain a ?crit : > >> >> On 25-Jul-09, at 3:32 PM, Frank Middleton wrote: >> >>> On 07/25/09 02:50 PM, David Magda wrote: >>> >>>> Yes, it can be affected. If the snapshot''s data structure / >>>> record is >>>> underneath the corrupted data in the tree then it won''t be able >>>> to be >>>> reached. >>> >>> Can you comment on if/how mirroring or raidz mitigates this, or tree >>> corruption in general? I have yet to lose a pool even on a machine >>> with fairly pathological problems, but it is mirrored (and >>> copies=2). >>> >>> I was also wondering if you could explain why the ZIL can''t >>> repair such damage. >>> >>> Finally, a number of posters blamed VB for ignoring a flush, but >>> according to the evil tuning guide, without any application syncs, >>> ZFS may wait up to 5 seconds before issuing a synch,^^ of course this can never cause inconsistency. The issue under discussion is inconsistency - unexpected corruption of on-disk structures.>>> and there >>> must be all kinds of failure modes even on bare hardware where >>> it never gets a chance to do one at shutdown. This is interesting >>> if you do ZFS over iscsi because of the possibility of someone >>> tripping over a patch cord or a router blowing a fuse. Doesn''t >>> this mean /any/ hardware might have this problem, albeit with much >>> lower probability? >> >> The problem is assumed *ordering*. In this respect VB ignoring >> flushes and real hardware are not going to behave the same. >> >> --Toby > > I agree that noone should be ignoring cache flushes. However the > path to corruption must involve > some dropped acknowledged I/Os. The ueberblock I/O was issued to > stable storage but the blocks it pointed to, which had reached the > disk firmware earlier, > never make it to stable storage. I can see this scenerio when the > disk looses powerOr if the host O/S crashes. All this applies to virtual IDE devices alone, of course. iSCSI is a different case entirely as presumably flushes/barriers are processed normally.> but I don''t see it with cutting power to the guest.Right, in this case it''s unlikely or nearly impossible. --Toby> > When managing a zpool on external storage, do people export the > pool before taking snapshots of the guest ? > > -r > > >> >>> >>> Thanks >>> >>> _______________________________________________ >>> zfs-discuss mailing list >>> zfs-discuss at opensolaris.org >>> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
So much for the "it's a consumer hardware problem" argument.

I for one gotta count it as a major drawback of ZFS that it doesn't provide you a mechanism to get something of your pool back, by reconstruction or reversion, when a failure leaves the metadata inconsistent.

A policy of data integrity taken to the extreme of blocking access to good data is not something OS users want. Users don't put up with this sort of thing from other filesystems... some sort of improvement here is sorely needed.

ZFS ought to retain enough information, and make an effort, to bring pool metadata back to a consistent state even if that means some loss of data: a file may have to revert to an older state, or a file that was undergoing changes may end up unreadable because the log was inconsistent. Even if the user has to run zpool import with a recovery-mode option or something of that nature, it beats losing a TB of data on a pool that should otherwise be intact.

--
This message posted from opensolaris.org
From what I understand, and from everything I've read by following threads here, there are ways to do it, but there is not a standardized tool yet; it's complicated and handled on a per-case basis, but people who pay for support have recovered pools. I'm sure they are working on it, and I would imagine it would be a major goal.

On Wed, Aug 5, 2009 at 1:23 AM, James Hess <no-reply at opensolaris.org> wrote:
> So much for the "it's a consumer hardware problem" argument.
>
> I for one gotta count it as a major drawback of ZFS that it doesn't provide
> you a mechanism to get something of your pool back, by reconstruction or
> reversion, when a failure leaves the metadata inconsistent.
>
> A policy of data integrity taken to the extreme of blocking access to good
> data is not something OS users want. Users don't put up with this sort of
> thing from other filesystems... some sort of improvement here is sorely
> needed.
>
> ZFS ought to retain enough information, and make an effort, to bring pool
> metadata back to a consistent state even if that means some loss of data:
> a file may have to revert to an older state, or a file that was undergoing
> changes may end up unreadable because the log was inconsistent. Even if
> the user has to run zpool import with a recovery-mode option or something
> of that nature, it beats losing a TB of data on a pool that should
> otherwise be intact.
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss