Hi,

After working for a month with ZFS on two external USB drives, I have found the all-new ZFS to be the most unreliable filesystem I have ever seen.

Since starting with ZFS, I have lost data from:

1 x 80 GB external drive
1 x 1 TB external drive

It is a shame that ZFS has no filesystem repair tools, e.g. nothing able to repair errors like these:

        NAME        STATE     READ WRITE CKSUM
        usbhdd1     ONLINE       0     0     8
          c3t0d0s0  ONLINE       0     0     8

errors: Permanent errors have been detected in the following files:

        usbhdd1:<0x0>

It is indeed very disappointing that moving USB zpools between computers ends, in 90% of cases, with a massive loss of data.

This is due to the unreliable command zfs umount <poolname>: even when the output of mount shows that the pool is no longer mounted and has been removed from mnttab, it only works once or twice. Moving the device back to the other machine, the pool is either not recognized at all or the error above occurs.

Or suddenly you find this message in your logs: "Fault tolerance of the pool may be compromised."

However, I just want to state a warning that ZFS is far from what it promises, and from my experience so far I cannot recommend using ZFS on a professional system at all.

Regards,

Dave.
On 09 February, 2009 - D. Eckert sent me these 1,5K bytes:

> This is due to the unreliable command zfs umount <poolname>: even when
> the output of mount shows that the pool is no longer mounted and has
> been removed from mnttab [...]

You don't move a pool with 'zfs umount'; that only unmounts a single ZFS
filesystem within the pool, but the pool is still active. 'zpool export'
releases the pool from the OS, then 'zpool import' brings it up on the
other machine.

> It only works once or twice. Moving the device back to the other
> machine, the pool is either not recognized at all or the error above
> occurs.
>
> However, I just want to state a warning that ZFS is far from what it
> promises, and from my experience so far I cannot recommend using ZFS
> on a professional system at all.

You're basically yanking disks from a live filesystem. If you don't do
that, filesystems are happier.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Casper.Dik at Sun.COM
2009-Feb-09 09:56 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> However, I just want to state a warning that ZFS is far from what it
> promises, and from my experience so far I cannot recommend using ZFS
> on a professional system at all.

Or, perhaps, you've given ZFS disks which are so broken that they are
really unusable; it is USB, after all.

And certainly, on Solaris you'd get the same errors with UFS or PCFS,
but you would not be able to detect any corruption. You may have seen
Al's post about moving a spinning 1 TB hard disk.

Before we can judge what went wrong, we would need a bit more
information, such as:

	- the motherboard and the USB controller
	- the USB enclosure which holds the disk(s)
	- the type of the disks themselves
	- any messages recorded in /var/adm/messages (for the time you
	  used the disks)
	- and how did you remove the disks from the system?

Unfortunately, you cannot be sure that when the USB enclosure says that
all the data is safe, it is actually written to the disk.

Casper
Hi Casper,

thanks for your reply.

I completely disagree with your opinion that it is USB. And it seems I am not the only one with this opinion of ZFS.

However, the hardware used is:

1 Sun Fire 280R, Solaris 10 10/08, latest updates
1 Lenovo T61 notebook running Solaris 10 10/08, latest updates
1 Sony VGN-NR38Z

Hard drives in use: Trekstore 1 TB, Seagate Momentus 7,200 rpm 2.5" 80 GB.

The hard drives used are brand new, as is the Sony notebook.

Even when I did zfs umount poolname, waited 30 seconds and then unplugged, data corruption occurred.

For testing purposes, on a Sun Fire 280R completely set up with ZFS I tried hot-swapping an HDD. The same thing happened. It is a big administrative burden to get such a ZFS drive back to life.

So how can I get my zpools back to life?

Regards,

Dave.
D. Eckert wrote:
> Even when I did zfs umount poolname, waited 30 seconds and then
> unplugged, data corruption occurred.

You don't zfs umount poolname, you zpool export it.

--
Ian.
Casper.Dik at Sun.COM
2009-Feb-09 10:25 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> However, the hardware used is:
>
> 1 Sun Fire 280R, Solaris 10 10/08, latest updates
> 1 Lenovo T61 notebook running Solaris 10 10/08, latest updates
> 1 Sony VGN-NR38Z
>
> Hard drives in use: Trekstore 1 TB, Seagate Momentus 7,200 rpm 2.5" 80 GB.

(Is that the Trekstore with 2 x 500 GB?)

> The hard drives used are brand new, as is the Sony notebook.
>
> Even when I did zfs umount poolname, waited 30 seconds and then
> unplugged, data corruption occurred.

Did you EXPORT the pool? "Unmount" is not sufficient. You need to use:

	zpool export poolname

How exactly did you remove the disk from the 280R? And what exact problem
did you get? You need to "off-line" the disk before actually removing it
physically.

Casper
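For reference, a minimal sketch of the export/import sequence Casper and Tomas describe, using the pool name from this thread; the exact device paths and any need for -f depend on the state of the pool:

	# On the machine the disk is currently attached to: release the pool.
	# 'zpool export' unmounts every filesystem in the pool and marks the
	# pool as no longer in use by this host.
	zpool export usbhdd1

	# Physically disconnect the USB disk only after the export returns.

	# On the other machine: scan attached devices for importable pools,
	# then import the one we want (add -f only if the pool was never
	# exported cleanly, e.g. after a crash).
	zpool import
	zpool import usbhdd1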
> > "Unmount" is not sufficient. >Well, umount is not the "right" way to do it, so he''d be simulating a power-loss/system-crash. That still doesn''t explain why massive data loss would occur ? I would understand the last txg being lost, but 90% according to OP ?! -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090209/2f4a658a/attachment.html>
Casper.Dik at Sun.COM
2009-Feb-09 11:00 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> Well, umount is not the "right" way to do it, so he'd be simulating a
> power loss / system crash. That still doesn't explain why massive data
> loss would occur. I would understand the last txg being lost, but 90%
> according to the OP?!

On USB or?

I think he was trying to properly unmount the USB devices.

One of the known issues with USB devices is that they may not work
properly. A typical disk will properly "flush write cache" when it is
instructed to do so. However, when you connect the device through a USB
controller and a USB enclosure, we're less certain that "flush write
cache" will make it to the drive, because:

	- was the command sent to the enclosure (e.g., if you needed to
	  configure the device with "reduced-cmd-support=true", then all
	  bets are off)
	- when the enclosure responds, did it send a "flush write cache"
	  to the disk?
	- and when it responds, did it wait until the disk completed the
	  command?

It is one of the reasons why I'd recommend against USB for disks.
Too many variables.

Casper
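As a side note on write caches: on a directly attached SATA or SCSI disk, Solaris lets you inspect and toggle the drive's volatile write cache from format(1M) in expert mode. Whether that menu is reachable at all through a given USB bridge is exactly the uncertainty Casper describes, so treat the following as a sketch of what to look for, not a guarantee (exact prompts may differ by driver):

	# format -e              (expert mode; then select the disk)
	format> cache
	cache> write_cache
	write_cache> display     # shows whether the drive's write cache is enabled
	write_cache> disable     # optionally disable it while testing (slower, safer)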
OK, so far so good, but how can I get my pool up and running?

Following output (originally in German; messages translated):

bash-3.00# zfs get all usbhdd1
NAME     PROPERTY        VALUE                  SOURCE
usbhdd1  type            filesystem             -
usbhdd1  creation        Thu Dec 25 23:36 2008  -
usbhdd1  used            34,3G                  -
usbhdd1  available       39,0G                  -
usbhdd1  referenced      34,3G                  -
usbhdd1  compressratio   1.00x                  -
usbhdd1  mounted         no                     -
usbhdd1  quota           none                   default
usbhdd1  reservation     none                   default
usbhdd1  recordsize      128K                   default
usbhdd1  mountpoint      /usbhdd1               default
usbhdd1  sharenfs        off                    default
usbhdd1  checksum        on                     local
usbhdd1  compression     off                    default
usbhdd1  atime           on                     default
usbhdd1  devices         on                     default
usbhdd1  exec            on                     default
usbhdd1  setuid          on                     default
usbhdd1  readonly        off                    default
usbhdd1  zoned           off                    default
usbhdd1  snapdir         hidden                 default
usbhdd1  aclmode         groupmask              default
usbhdd1  aclinherit      restricted             default
usbhdd1  canmount        on                     default
usbhdd1  shareiscsi      off                    default
usbhdd1  xattr           on                     default
usbhdd1  copies          1                      default
internal error: unable to get version property
internal error: unable to get utf8only property
internal error: unable to get normalization property
internal error: unable to get casesensitivity property
usbhdd1  vscan           off                    default
usbhdd1  nbmand          off                    default
usbhdd1  sharesmb        off                    default
usbhdd1  refquota        none                   default
usbhdd1  refreservation  none                   default

bash-3.00# zpool status -xv usbhdd1
  pool: usbhdd1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        usbhdd1     ONLINE       0     0    16
          c3t0d0s0  ONLINE       0     0    16

errors: Permanent errors have been detected in the following files:

        usbhdd1:<0x0>

bash-3.00# zpool list
NAME      SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
storage  48,8G  16,3G  32,4G   33%  ONLINE  -
usbdrv1   484M  2,79M   481M    0%  ONLINE  -
usbhdd1  74,5G  34,3G  40,2G   46%  ONLINE  -

I don't understand that I get status information about the pool, e.g. cap, size, health, but I cannot mount it on the system:

bash-3.00# zfs mount usbhdd1
cannot mount 'usbhdd1': I/O error
bash-3.00#

Any suggestion for help?

Thanks and regards,

dave.
James C. McPherson
2009-Feb-09 11:59 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Mon, 09 Feb 2009 03:10:21 -0800 (PST)
"D. Eckert" <contact at desystems.cc> wrote:

> OK, so far so good, but how can I get my pool up and running?

I can't help you with this bit ....

> bash-3.00# zpool status -xv usbhdd1
>   pool: usbhdd1
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption. Applications may be affected.
> action: Restore the file in question if possible. Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         usbhdd1     ONLINE       0     0    16
>           c3t0d0s0  ONLINE       0     0    16
>
> errors: Permanent errors have been detected in the following files:
>
>         usbhdd1:<0x0>
>
> I don't understand that I get status information about the pool, e.g.
> cap, size, health, but I cannot mount it on the system:
>
> bash-3.00# zfs mount usbhdd1
> cannot mount 'usbhdd1': I/O error
> bash-3.00#

You have checksum errors on a non-replicated pool. This is not something
that can be ignored.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp	http://www.jmcp.homeunix.com/blog
James,

on UFS or reiserfs such errors could be corrected.

It is grossly negligent to develop a filesystem without proper repair tools. More and more it becomes clear that it was just a marketing slogan by Sun to state that ZFS needs no repair tools because it heals itself.

In this particular case we are talking about a loss of at least 35 GB (!!!!) of data. As long as the ZFS developers are more focused on marketing claims that have been proven wrong, I can't recommend ZFS at all in a professional setting, and I am thinking about making this issue clear at the Sun conference we have in Germany in March this year.

It is not good practice to tell someone who just lost that much data "I am sorry, but I can't help you," even if you don't understand why it happened.

Good practice would be to care first about proper documentation. Nothing in the man pages states that, if USB zpools are used, zfs mount/unmount is NOT recommended and zpool export should be used instead.

Having a facility to override checksumming, to get even a pool tagged as corrupted mounted so the data can be rescued, shouldn't just be a dream. It should be a MUST HAVE.

I agree that it is always good - regardless of the filesystem type - to have a proper backup facility in place. But given the use cases ZFS was designed for - very big pools - that also becomes a cost question.

And it would be good practice for Sun, given internet boards full of people complaining about losing data just because they used ZFS, to CARE ABOUT THAT.

Regards,

DE
> on UFS or reiserfs such errors could be corrected.

I think some of these people are assuming your hard drive is broken. I'm not sure what you're assuming, but if the hard drive is broken, I don't think ANY filesystem can do anything about that.

At best, if the disk were in a RAID 5 array and the other disks worked, then the parity from the working disks could correct the broken data on the broken drive... but you only have a single disk, not a mirror or a RAID 5, so this fix can't be done.

I think this might be a case of ZFS reporting errors that other filesystems don't notice. Your hard drive might have been broken for months without you knowing it until now. In that case the errors aren't the fault of ZFS. They are the fault of the broken drive, and the fault of the other filesystems for not knowing when data is corrupted. See what I mean?
> bash-3.00# zfs mount usbhdd1
> cannot mount 'usbhdd1': I/O error
> bash-3.00#

Why is there an I/O error? Is there any information logged to /var/adm/messages when this I/O error is reported? E.g. timeout errors for the USB storage device?
Casper.Dik at Sun.COM
2009-Feb-09 13:58 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> James,
>
> on UFS or reiserfs such errors could be corrected.

That's not true. That depends on the nature of the error. I've seen quite
a few problems on UFS with corrupted file contents; such filesystems are
always "clean". Yet the filesystems are corrupted. And no tool can fix
those filesystems.

> It is grossly negligent to develop a filesystem without proper repair
> tools.

Repairing to what state?

One of the reasons why there's a "ufs fsck" is because its on-disk state
is nearly always "corrupted". The log only allows you to repair the
metadata, NEVER the data. And I've seen the corrupted files many times.
(Specifically, when you upgrade a driver and it's buggy, you would
typically end up with a broken driver_aliases, name_to_major, etc.,
though I added a few fsyncs in update_drv and ilk and it is better.)

Fsck does not "fix" UFS filesystems. Fsck can only repair known faults:
known discrepancies in the metadata. Since ZFS doesn't have such known
discrepancies, there's nothing to repair.

> More and more it becomes clear that it was just a marketing slogan by
> Sun to state that ZFS needs no repair tools because it heals itself.

If it can repair, then it does. But if you only have one copy of the
data, then you cannot repair the missing data.

> In this particular case we are talking about a loss of at least 35 GB
> (!!!!) of data.

> Good practice would be to care first about proper documentation.
> Nothing in the man pages states that, if USB zpools are used, zfs
> mount/unmount is NOT recommended and zpool export should be used
> instead.

You have a live pool and you yank it out of the system? Where does it
say that you can do that?

> Having a facility to override checksumming, to get even a pool tagged
> as corrupted mounted so the data can be rescued, shouldn't just be a
> dream. It should be a MUST HAVE.

Depends on how much of the data is corrupted and which parts they are.

> I agree that it is always good - regardless of the filesystem type -
> to have a proper backup facility in place. But given the use cases ZFS
> was designed for - very big pools - that also becomes a cost question.
>
> And it would be good practice for Sun, given internet boards full of
> people complaining about losing data just because they used ZFS, to
> CARE ABOUT THAT.

I've not seen a lot of people complaining; or perhaps I don't look
carefully (I'm not in ZFS development).

What I have seen are some issues with weird BIOSes (taking part of a
disk), and connecting a zpool to different systems at the same time,
including what you may have done by having the zpool "imported" on both
systems.

Casper
Too many words wasted, but not a single word on how to restore the data.

I have read the man pages carefully. But again: nothing there says that on USB drives zfs umount pool is not allowed. So how on earth should a simple user know that, if he knows that filesystems are properly unmounted using the umount command??

And again: why should a 2-week-old Seagate HDD suddenly be damaged, if there was no shock, hit or any other event like that?

It is of course easier to blame the stupid user than to have proper documentation and emergency tools to handle this.

The list of malfunctions of SNV builds gets longer with every version released. E.g. on SNV 107:

- the installation script is unable to write the boot blocks for grub properly
- you choose the German locale, but get an American keyboard layout in GNOME (since SNV 103)
- in SNV 107, adding these lines to xorg.conf:

  Option "XkbRules" "xorg"
  Option "XkbModel" "pc105"
  Option "XkbLayout" "de"

  (which worked in SNV 103) crashes the X server
- the latest Nvidia driver (version 180) for the GeForce 8400M doesn't work with OpenSolaris SNV 107
- nwam and iwk0: not solved, no DHCP responses

It seems better to stay focused on having a colourful GUI with hundreds of functions no one needs instead of providing a stable core.

I am looking forward to the day I boot OpenSolaris and see a greeting Windows XP logo surrounded by the blue bubbles of OpenSolaris.....

Cheers,

D.
Hi Dave,

Having read through the whole thread, I think there are several things that could all be adding to your problems, at least some of which are not related to ZFS at all.

You mentioned the ZFS docs not warning you about this, and yet I know the docs explicitly tell you that:

1. While a ZFS pool that has no redundancy (mirroring or parity), like yours, can still *detect* errors in the data read from the drive, it can't *repair* those errors. Repairing errors requires that ZFS be performing (at least the top-most level of) the mirroring or parity functions. Since you have no mirroring or parity, ZFS cannot automatically recover this data.

2. As others have said, a zpool can contain many filesystems. 'zfs umount' only unmounts a single filesystem. Removing a full pool from a machine requires a 'zpool export', no matter what disk technology is being used (USB, SCSI, SATA, FC, etc.). On the new system you would use 'zpool import' to bring the pool into the new system.

I'm sure this next one is documented by Sun also, though not in the ZFS docs; probably in some other part of the documentation dealing with removable devices:

3. In addition, according to Casper's message you need to 'off-line' USB (and probably other types of) storage in Solaris (just like in Windows) before pulling the plug. This has nothing to do with ZFS. It would have corrupted most other filesystems also, possibly even past the point of repair.

Still, I had an idea on something you might try. I don't know how long it's been since you pulled the drive, or what else you've done since. Which machine is reporting the errors you've shown us: the machine you pulled the drives from, or the machine you moved them to? Were you successful in running 'zpool import' on the other machine? This idea might work either way, but if you haven't successfully imported the pool into another machine there's probably more of a chance.

If the output is from the machine you pulled them out of, then basically that machine still thinks the pool is connected to it, and it thinks the one and only disk in the pool is now not responding. In this case the errors you see in the tables are the errors from trying to contact a drive that no longer exists.

Have you reconnected the disk to the original machine yet? If not, I'd attempt a 'zpool export' now (though that may not work), then shut the machine down fully and connect the disk. Then boot it all up.

Depending on what you've tried to do with this disk to fix the problem since it happened, I have no idea exactly how the machine will come up. If you couldn't do the 'zpool export', then the machine will try to mount the filesystems in the pool on boot. This may or may not work. If you were successful in doing the export with the disk disconnected, then it won't try, and you'll need to 'zpool import' the pool after the machine is booted.

Depending on how the import goes, you might still see errors in the 'zpool status' output. If so, I know a 'zpool clear' will clear those errors, and I doubt it can make the situation any worse than it is now. You'd have to tell us what the machine says after this before I can advise you further. But (and the experts can correct me if I'm wrong) this might 'just work(tm)'.

My theory here is that ZFS may have been successful in keeping the state of the (meta)data on the disk consistent after all. The checksum and I/O errors listed may be from ZFS trying to access the non-existent drive after you removed it.
Which (in theory) are all bogus errors that don't really point to errors in the data on the drive.

Of course there are many things that all have to be true for this theory to hold. Depending on what has happened to the machines and the disks since they were originally unplugged from each other, all bets might be off. And then there's the possibility that my idea could never work at all. People much more expert than I am can chime in on that.

-Kyle
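Putting Kyle's suggestion into commands, a rough sketch of that recovery sequence (pool name from the thread; the final scrub is an extra verification step, and none of this is guaranteed to succeed):

	# With the disk still disconnected, tell the original machine to
	# release the pool (this step may fail if the device is missing):
	zpool export usbhdd1

	# Shut down, reconnect the disk, boot, then try to bring it back:
	zpool import usbhdd1      # add -f if it complains the pool is in use

	# If the import works but stale READ/WRITE/CKSUM counters remain,
	# clear them and re-verify every block that is still readable:
	zpool clear usbhdd1
	zpool scrub usbhdd1
	zpool status -v usbhdd1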
Christian Wolff
2009-Feb-09 15:25 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
First: it sucks to lose data. That's very uncool... BUT I don't know how ZFS should be able to recover data with no mirror to copy from. If you have some kind of RAID level you can easily recover your data. I have seen that several times, without any problems and even with nearly no performance impact on a production machine.

No offense, but you must admit that you are flaming a filesystem without even knowing the right commands, and then blame us for not recovering your data?! C'mon!

Regards,
Chris
Casper.Dik at Sun.COM
2009-Feb-09 15:27 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> Too many words wasted, but not a single word on how to restore the
> data.
>
> I have read the man pages carefully. But again: nothing there says
> that on USB drives zfs umount pool is not allowed.

You cannot unmount a pool. You can only unmount a filesystem. That the
default name of the pool's filesystem is the same as the name of the
pool is an artifact of the implementation.

Surely, you can unmount the filesystem. That is not illegal. But you've
removed a live pool WITHOUT exporting it. I can understand that you make
that mistake because you take what you know from other filesystems and
apply it to ZFS.

> So how on earth should a simple user know that, if he knows that
> filesystems are properly unmounted using the umount command??

By reading the documentation. The zpool and zfs commands are easy to use
and perhaps this stops you and others from reading, e.g.,

	http://docs.sun.com/app/docs/doc/819-5461/gavwn?l=en&a=view

And before you use ZFS you must understand some of the basic concepts;
rather than having a device which you can mount with "mount", you have a
"pool", and that "pool" is owned by the system.

> And again: why should a 2-week-old Seagate HDD suddenly be damaged, if
> there was no shock, hit or any other event like that?

If you removed the device from a live pool, moved it to another system
and then moved it back, then yes, you could have problems. Though I'd
suppose that the pool shouldn't go online without requiring an import
(-f).

> It is of course easier to blame the stupid user than to have proper
> documentation and emergency tools to handle this.

The documentation explains that you must use export in order to move
pools from one system to another. I'm not sure how the system can
prevent this; there's no "lock" on your USB slots.

As for the other problems with nv107: in each build we change a lot of
software, and sometimes we change important parts of the software; e.g.,
in nv107 we changed to a newer version of Xorg. The cutting-edge builds
vary in quality.

Casper
Full of sympathy, I still feel you might as well relax a bit. It is the XkbVariant that starts X without any chance to return. But look at the many "boot stops after the third line" reports, and, from my side, the non-working network settings, even without nwam. The worst part was a so-called engineer stating that one simply can't expect a host to connect to the same gateway through two different paths properly. But it would be wrong to admonish the individuals, and my apologies to those I treated with contempt. The problem cannot be solved in this forum. The issue needs to be addressed elsewhere.

When adoption (migration) is the objective, in the first place the kernel needs to boot, whatever the hardware, even if graceful degradation is unavoidable. Second, a network setting must be possible, and not simply do nothing, or require a dead NIC to be added just to boot. As much as I was grateful to be helped, of course an X server needs to fall back to sane behaviour at all times. And sendmail loses mail. All this is sick. But priorities need to come from managers, or the community, not from the coders. In OpenSolaris Sun insists on calling the shots, so it will be managers in this case.

I myself am very unhappy with ZFS; not because it has failed me, but because, for a third-party, cold-eyed review, the man pages and the concepts and (arcane) commands by now far surpass the sequence of logical steps needed to partition (fdisk) and format (newfs) a drive. Pools, tanks, scrubs, imports, exports and whatnot; I don't think this was the original intention. And just as bad as the network engineer further up is the statement about 'USB hard disks not suitable for ZFS' or similar.

Do not get me wrong, OpenSolaris is still my preferred desktop. I love its stability, and - laugh at me - it is the only one that always allows me to kill an application gone sour (Ubuntu usually fails here). I consider it elegant and helpful in my daily work. *If* it is up, *if* it boots. Alas, this is by far the more difficult part.

And here I agree with you: USB hard disks need a proper, clear way to be attached and removed, no more involved than the old mount/umount sequence. Try running a hard disk test. Let us also compare: I never lost an ext3 drive that would pass the hardware test. On the contrary, at times I could recover data from one that failed. So let us take that as the measure: as long as the drive is not flagged 'corrupt' by the disk test utility, the filesystem surely must not lose any data (aside from 'rm'). My honest and curious question: does ZFS pass this test?

Uwe
D. Eckert wrote:
> Too many words wasted, but not a single word on how to restore the
> data.
>
> I have read the man pages carefully. But again: nothing there says
> that on USB drives zfs umount pool is not allowed.

It is allowed. But it's not enough. You need to read both the 'zpool' and 'zfs' manpages. The 'zpool' manpage will tell you that the way to move the whole pool to another machine is to run 'zpool export <poolname>'. The 'zpool export' will actually run the 'zfs umount' for you, though it's not a problem if that's already been done.

Note, this isn't USB specific; you won't see anything in the docs about USB. This condition applies to SCSI and others too. You need to export the pool to move it to another machine. If the machine crashed before you could export it, 'zpool import -f' on the new machine can help import it anyway.

With USB, there are probably other commands you'll also need to use to notify Solaris that you are going to unplug the drive, just like the 'Safely remove hardware' tool on Windows. Or you need to remove it only when the system is shut down. These commands will be documented somewhere else, not in the ZFS docs, because they don't apply to just ZFS.

> So how on earth should a simple user know that, if he knows that
> filesystems are properly unmounted using the umount command??

You need to understand that the filesystems are all contained in a 'pool' (more than one filesystem can share the disk space in the same pool). Unmounting a filesystem *does not* prepare the *pool* to be moved from one machine to another.

> And again: why should a 2-week-old Seagate HDD suddenly be damaged, if
> there was no shock, hit or any other event like that?

Who knows? Some hard drives are manufactured with problems. Remember that ZFS is designed to catch problems that even the ECC on the drive doesn't catch. So it's not impossible for it to catch errors even the manufacturer's QA tests missed.

> It is of course easier to blame the stupid user than to have proper
> documentation and emergency tools to handle this.

I believe that between the man pages, the administration docs on the web, the best-practices pages, and all the other blogs and web pages, ZFS is documented well enough. It's not like other filesystems, so there is more to learn, and you need to review all the docs, not just the ones that cover the operations (like unmount) that you're familiar with. Understanding pools (and the commands that manage pools) is also important. Man pages and command references are good when you understand the architecture and need to learn the details of a command you know you need to use. It's the other documentation that fills you in on how the parts of the system work together, and advises you on the best way to set up or do what you want.

As I said in my other email, ZFS can't repair errors without a way to reconstruct the data. It needs mirroring, parity (or the copies=x setting) to be able to repair the data. By setting up a pool with no redundancy, you gave it no way to do that. So your email subject line is a little backwards, since any 'professional' usage would incorporate redundancy (mirroring, parity, etc.). What you're trying to do is more 'home/hobbyist' usage, though most home/hobbyist users decide to incorporate redundancy for any data they really care about.

> The list of malfunctions of SNV builds gets longer with every version
> released.

I'm sure new things are added every release, but many are also fixed. sNV is pre-release software, after all.
Overall the problems found aren't around long, and I believe the list gets shorter as often as it gets longer. If you want production-level Solaris, ZFS is available in Solaris 10.

> E.g. on SNV 107:
>
> - the installation script is unable to write the boot blocks for grub
>   properly
> - you choose the German locale, but get an American keyboard layout in
>   GNOME (since SNV 103)
> - in SNV 107, adding these lines to xorg.conf:
>
>   Option "XkbRules" "xorg"
>   Option "XkbModel" "pc105"
>   Option "XkbLayout" "de"
>
>   (which worked in SNV 103) crashes the X server
> - the latest Nvidia driver (version 180) for the GeForce 8400M doesn't
>   work with OpenSolaris SNV 107
> - nwam and iwk0: not solved, no DHCP responses

Yes, there was a major update of the X server sources to catch up to the latest(?) X.org release. Workarounds are known, and I bet this will be working again in b108 (or not long after).

> It seems better to stay focused on having a colourful GUI with
> hundreds of functions no one needs instead of providing a stable core.

The core of Solaris is much more stable than anything else I've used. The windowing system is not part of the core of an operating system in my book.

> I am looking forward to the day I boot OpenSolaris and see a greeting
> Windows XP logo surrounded by the blue bubbles of OpenSolaris.....

<roll-eyes>

Note that sNV (aka SXCE - Solaris eXpress Community Edition) isn't really OpenSolaris, though they are related. OpenSolaris is based off specific snapshots of sNV (the last one being b101, I think) and is updated much less often than sNV. sNV is mainly targeted at those who want to develop Solaris itself, and those who want to try out the latest builds.

-Kyle
David Champion
2009-Feb-09 16:00 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> Too many words wasted, but not a single word on how to restore the
> data.
>
> I have read the man pages carefully. But again: nothing there says
> that on USB drives zfs umount pool is not allowed.

You misunderstand. This particular point has nothing to do with USB; it's the same for any ZFS environment. You're allowed to do a zfs umount on a filesystem; there's no problem with that. But remember that ZFS is not just a filesystem in the way that reiserfs and UFS are filesystems. It's an integrated storage pooling system and filesystem. When you umount a filesystem, you're not taking any storage offline, you're just removing the filesystem's presence in the VFS hierarchy.

You umounted a ZFS filesystem, not touching the pool, then removed the device. This is analogous to preparing an external hardware RAID, creating one or more filesystems on it, using them a while, umounting one of them, and powering down the RAID. You did nothing to protect the other filesystems or the RAID's r/w cache. Everything on the RAID is now inconsistent and suspect. But since your "RAID" was a single striped volume, there's no mirror or parity information with which to reconstruct the data.

ZFS is capable of detecting these problems where other filesystems often are not. But no filesystem can tell what the data should have been when the only copy of the data is damaged.

This is documented for ZFS. It's not about USB; it's just that USB devices can be more vulnerable to this kind of treatment than other kinds of storage are.

> And again: why should a 2-week-old Seagate HDD suddenly be damaged, if
> there was no shock, hit or any other event like that?

It happens all the time. We just don't always know about it.

--
-D.	dgc at uchicago.edu	NSIT	University of Chicago
Andrew Gabriel
2009-Feb-09 16:01 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Kyle McDonald wrote:
> With USB, there are probably other commands you'll also need to use to
> notify Solaris that you are going to unplug the drive, just like the
> 'Safely remove hardware' tool on Windows. Or you need to remove it
> only when the system is shut down. These commands will be documented
> somewhere else, not in the ZFS docs, because they don't apply to just
> ZFS.

That would be cfgadm(1M). It's also used for hot-swappable SATA drives (and probably other things).

--
Andrew
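A hedged sketch of what that looks like for a USB disk; the attachment point ID (usb0/3 here) is made up, so check the output of cfgadm -al on your own system first:

	cfgadm -al                      # list attachment points; find the USB disk's Ap_Id
	zpool export usbhdd1            # release the pool before touching the device
	cfgadm -c unconfigure usb0/3    # off-line the device (Ap_Id is hypothetical)
	# now it is safe to unplug the drive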
Bob Friesenhahn
2009-Feb-09 17:05 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Mon, 9 Feb 2009, D. Eckert wrote:
> Good practice would be to care first about proper documentation.
> Nothing in the man pages states that, if USB zpools are used, zfs
> mount/unmount is NOT recommended and zpool export should be used
> instead.

I have been using USB mirrored disks for backup purposes for about eight months now. No data loss, or even any reported uncorrectable read failures. These disks have been shared between two different systems (x86 and SPARC). The documentation said that I should use zpool export/import, and so that is what I have done, with no problems.

While these USB disks seem to be working reliably, it is certainly possible to construct a USB arrangement which does not work reliably, since most USB hardware is cheap junk. My USB disks are direct-attached and don't go through a USB bridge.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
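For completeness, a minimal sketch of the kind of redundant USB setup Bob describes; the pool and device names here are placeholders, not his actual configuration:

	# Create a mirrored pool across two USB disks; with redundancy ZFS
	# can repair checksum errors, not merely detect them.
	zpool create usbbackup mirror c3t0d0 c4t0d0

	# Periodically verify every block against its checksum:
	zpool scrub usbbackup
	zpool status -v usbbackup

	# Always release the pool before moving the disks to another system:
	zpool export usbbackup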
Seagate7,

You are not using ZFS correctly; you have misunderstood how it is used. If you don't follow the manual (which you haven't), then any filesystem will cause problems and corruption, even ZFS or NTFS or FAT32. You must use ZFS correctly. Start by reading the manual.

For ZFS to be able to repair errors, you must use two drives or more. This is clearly written in the manual. If you only use one drive, ZFS can only detect errors, not repair them. This is also clearly written in the manual.

And when you pull out a disk, you must use the "zpool export" command. This is also clearly written in the manual. If you pull out a drive without announcing that you will do so (via zpool export), then ZFS will not work.

If you don't follow the manual, any software will cause problems, even Windows. You are not using ZFS as it is intended to be used. I suggest that, in the future, you stay with Windows, which you know. If you use Unix without knowing it or without reading the manual, you will have problems. You know Windows; stay with Windows.
* Orvar Korvar (knatte_fnatte_tjatte at yahoo.com) wrote:
> For ZFS to be able to repair errors, you must use two drives or more.
> This is clearly written in the manual. If you only use one drive, ZFS
> can only detect errors, not repair them. This is also clearly written
> in the manual.

Or, you can set copies > 1 on your ZFS filesystems. This at least protects you in cases of data corruption on a single drive, but not if the entire drive goes belly up.

Cheers,

--
Glenn
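A small sketch of that knob; the child filesystem name is hypothetical, and note that copies only applies to data written after the property is set and does not help if the whole device dies:

	# Keep two copies of every data block, even on a single USB disk:
	zfs set copies=2 usbhdd1

	# Or create a new filesystem with it from the start:
	zfs create -o copies=2 usbhdd1/photos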
>>>>> "ok" == Orvar Korvar <knatte_fnatte_tjatte at yahoo.com> writes:ok> You are not using ZFS correctly. ok> You have misunderstood how it is used. If you dont follow the ok> manual (which you havent) then any filesystem will cause ok> problems and corruption, even ZFS or ntfs or FAT32, etc. You ok> must use ZFS correctly. Start by reading the manual. Before writing a reply dripping with condescention, why don''t you start by reading the part of the ``manual'''' where it says ``always consistent on disk''''? Please, lay off the kool-aid, or else drink more of it: Unclean dismounts are *SUPPORTED*. This is a great supposed ZFS feature BUT cord-yanking is not supposed to cause loss of the entire filesystem, not on _any_ modern filesystem such as: UFS, FFS, ext3, xfs, hfs+. There is a real problem here. Maybe not all of the problem is in ZFS, but some of it is. If ZFS is going to be vastly more sensitive to discarded SYNCHRONIZE CACHE commands than competing filesystems to the point that it trashes entire pools on an unclean dismount, then it will have to include a storage stack qualification tool, not just a row of defensive pundits ready to point their fingers at hard drives which are guilty until proven innocent, and lack an innocence-proving tool. And I''m not convinced that''s the only problem. Even if it is, the write barrier problem is pervasive. Linux LVM2 throws them away, and many OS''s that _do_ implement fdatasync() for the userland including Linux-without-LVM2 only sync part way down, don''t propogate it all the way down the storage stack to the drive, so file-backed pools (as you might use for testing, backup, or virtual guests) are not completely safe. Aside from these examples, note that, AIUI, Sun''s sun4v I/O virtualizer, VirtualBox software, and iSCSI initiator and target were all caught guilty of this write barrier problem, too, so it''s not only, or even mostly, a consumer-grade problem or an other-tent problem. If this is really the problem trashing everyone''s pools, it doesn''t make me feel better because the problem is pretty hard to escape once you do the slightest meagerly-creative thing with your storage. Even if the ultimate problem turns out not to be in ZFS, the ZFS camp will probably have to persecute the many fixes since they''re the ones so unusually vulnerable to it. also there are worse problems with some USB NAND FLASH sticks according to Linux MTD/UBI folks: http://www.linux-mtd.infradead.org/doc/ubifs.html#L_raw_vs_ftl We have heard reports that MMC and SD cards corrupt and loose data if power is cut during writing. Even the data which was there long time before may corrupt or disappear. This means that they have bad FTL which does not do things properly. But again, this does not have to be true for all MMCs and SDs - there are many different vendors. But again, you should be careful. Of course this doesn''t apply to any spinning hard drives nor to all sticks, only to some sticks. The ubifs camp did an end-to-end test for their filesystem''s integrity using a networked power strip to do automated cord-yanking. I think ZFS needs an easier, faster test though, something everyone can do before loading data into a pool. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090209/56cc52cb/attachment.bin>
On 9-Feb-09, at 6:17 PM, Miles Nordin wrote:
> Aside from these examples, note that, AIUI, Sun's sun4v I/O
> virtualizer, VirtualBox software, and iSCSI initiator and target were
> all caught guilty of this write barrier problem too, so it's not only,
> or even mostly, a consumer-grade problem or an other-tent problem.

YES! I recently discovered that VirtualBox apparently defaults to ignoring flushes, which would, if true, introduce a failure mode generally absent from real hardware (and eventually result in consistency problems quite unexpected by the user who carefully configured her journaled filesystem or transactional RDBMS!). It seems as though I'll have to dive into the source code to prove it, though:

  http://forums.virtualbox.org/viewtopic.php?p=59123#59123

There is no substitute for cord-yank tests - many and often. The weird part is, the ZFS design team simulated millions of them. So the full explanation remains to be uncovered?

--Toby
> There is no substitute for cord-yank tests - many and often. The
> weird part is, the ZFS design team simulated millions of them.
> So the full explanation remains to be uncovered?

We simulated power failure; we did not simulate disks that simply blow off write ordering. Any disk that you'd ever deploy in an enterprise or storage appliance context gets this right.

The good news is that ZFS is getting popular enough on consumer-grade hardware. The bad news is that said hardware has a different set of failure modes, so it takes a bit of work to become resilient to them. This is pretty high on my short list.

Jeff
> The good news is that ZFS is getting popular enough on consumer-grade
> hardware. The bad news is that said hardware has a different set of
> failure modes, so it takes a bit of work to become resilient to them.
> This is pretty high on my short list.

So does this basically mean ZFS rolls back to the latest on-disk consistent state before any failure, even if it means (minor) data loss?

Is there any bug report I can follow so I would know when the fix for this is committed?

Regards
> We simulated power failure; we did not simulate disks that simply
> blow off write ordering. Any disk that you'd ever deploy in an
> enterprise or storage appliance context gets this right.
>
> The good news is that ZFS is getting popular enough on consumer-grade
> hardware. The bad news is that said hardware has a different set of
> failure modes, so it takes a bit of work to become resilient to them.
> This is pretty high on my short list.

Jeff, we lost many zpools with multimillion-dollar EMC, NetApp and HDS arrays just by simulating FC switch power failures. The problem is that ZFS can't properly recover itself.

How can we even think of adopting ZFS with >100 TB pools if a simple FC switch failure can make a pool totally inaccessible? I know UFS fsck can only repair metadata, but that is much better than losing all your data! We all know how long it would take to restore 100 TB of data from backup.

ZFS should at least be able to recover pools by discarding the last txg, as you suggested months ago. Any news about that?

thanks
gino
dick hoogendijk
2009-Feb-10 16:28 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Mon, 09 Feb 2009 01:46:01 PST
"D. Eckert" <contact at desystems.cc> wrote:

> After working for a month with ZFS on two external USB drives, I have
> found the all-new ZFS to be the most unreliable filesystem I have ever
> seen.
>
> Since starting with ZFS, I have lost data from:
>
> 1 x 80 GB external drive
> 1 x 1 TB external drive
>
> It is a shame that ZFS has no filesystem repair tools, e.g. nothing
> able to repair errors like these:
>
>         NAME        STATE     READ WRITE CKSUM
>         usbhdd1     ONLINE       0     0     8
>           c3t0d0s0  ONLINE       0     0     8
>
> errors: Permanent errors have been detected in the following files:
>
>         usbhdd1:<0x0>

What filesystem likes it when disks are pulled out from under a LIVE filesystem? Try that on UFS and you're f** up too.

Your problem is that you have not read the manual well! Using the wrong command gets you into trouble. So be it.

Maybe zpool export/import does what you want?

--
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS sxce snv105 ++
+ All that's really worth doing is what we do for others (Lewis Carrol)
> What filesystem likes it when disks are pulled out from under a LIVE
> filesystem? Try that on UFS and you're f** up too.
>
> Your problem is that you have not read the manual well!
> Using the wrong command gets you into trouble.
>
> Maybe zpool export/import does what you want?

Dick, Dave made a mistake pulling out the drives without exporting them first. For sure UFS/XFS/ext4/... don't like that kind of operation either, but only with ZFS do you risk losing ALL your data. That's the point!

gino
Mattias Pantzare
2009-Feb-10 16:55 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> What filesystem likes it when disks are pulled out from under a LIVE
> filesystem? Try that on UFS and you're f** up too.

Pulling a disk from a live filesystem is the same as pulling the power from the computer. All modern filesystems can handle that just fine. UFS with logging on does not even need fsck.

Now, if you have a disk that lies and doesn't write to the disk when it should, all bets are off.
Peter Schuller
2009-Feb-10 18:00 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> > However, I just want to state a warning that ZFS is far from what it
> > promises, and from my experience so far I cannot recommend using ZFS
> > on a professional system at all.
>
> Or, perhaps, you've given ZFS disks which are so broken that they are
> really unusable; it is USB, after all.

I had a cheap-o USB enclosure that definitely did ignore such commands. On every txg commit I'd get a warning in dmesg (this was on FreeBSD) about the device not implementing the relevant SCSI command. This of course would affect filesystems other than ZFS as well.

What is worse, I was unable to completely disable write caching either, because that, too, did not actually propagate to the underlying device when attempted. (I could not say for certain whether this was fundamental to the device or in combination with a FreeBSD issue.)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Charles Binford
2009-Feb-10 18:03 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Jeff, what do you mean by "disks that simply blow off write ordering"? My experience is that most enterprise disks are some flavor of SCSI, and host SCSI drivers almost ALWAYS use simple queue tags, implying the target is free to re-order the commands for performance. Are you talking about something else, or does ZFS request ordered queue tags on certain commands?

Charles

Jeff Bonwick wrote:
> We simulated power failure; we did not simulate disks that simply
> blow off write ordering. Any disk that you'd ever deploy in an
> enterprise or storage appliance context gets this right.
Peter Schuller
2009-Feb-10 18:05 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> YES! I recently discovered that VirtualBox apparently defaults to
> ignoring flushes, which would, if true, introduce a failure mode
> generally absent from real hardware (and eventually result in
> consistency problems quite unexpected by the user who carefully
> configured her journaled filesystem or transactional RDBMS!)

I recommend everyone be extremely hesitant to assume that any particular storage setup actually honors write barriers and cache flushes. This is a recommendation I would give even when you purchase non-cheap battery-backed hardware RAID controllers (I won't mention any names or details, to avoid bashing, as I'm sure it's not specific to the particular vendor I had problems with most recently).

You need the underlying device to do the right thing, the driver to do the right thing, and the operating system in general to do the right thing (which includes the filesystem, the block device layer if any, etc. - for example, if you use md on Linux with RAID5/6 you're toast).

So again, I cannot stress this enough: do not assume things behave in a non-broken fashion with respect to write barriers and flushes.

I can't speak to expensive integrated hardware solutions; I HOPE, though at this point my level of paranoia does not allow me to assume, that if you buy boxed systems from companies like Sun/HP/etc. you get decent stuff. But I can definitely say that paying non-trivial amounts of money for hardware is no guarantee that you won't get completely broken behavior.

<speculation> I think it boils down to the fact that 99% of customers that aren't doing integration of the individual components into overall packages probably don't care/understand/bother with it, so as long as the benchmarks say it's "fast", it sells. </speculation>

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
Peter Schuller
2009-Feb-10 18:07 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> And again: Why should a 2 weeks old Seagate HDD suddenly be damaged, if there was no shock, hit or any other event like that?I have no information about your particular situation, but you have to remember that ZFS uncovers problems that otherwise go unnoticed. Just personally on my private hardware (meaning a very limited set), I have seen silent corruption issues several times. The most recent one I discovered almost immediately because of ZFS. If it weren''t for ZFS, I would have been highly likely to have transferred my entire system without noticing and suffer weird problems a couple of weeks later. While I don''t know what is going on in your case, blaming some problem on the introduction of a piece of software/hardware/procedure, without identifying a causal relationship, is a common mistake to make. -- / Peter Schuller PGP userID: 0xE9758B7D or ''Peter Schuller <peter.schuller at infidyne.com>'' Key retrieval: Send an E-Mail to getpgpkey at scode.org E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/7b2c24db/attachment-0013.bin>
Peter Schuller
2009-Feb-10 18:13 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> on a UFS or reiserfs such errors could be corrected.In general, UFS has zero capability to actually fix real corruption in any reliable way. What you normally do with fsck is repairing *expected* inconsistencies that the file system was *designed* to produce in the event of e.g. a sudden reboot or a crash. This is entirely different from repairing arbitrary corruption. If ZFS says that a file has a checksum error, that can very well be because there is a bug in ZFS. But it can also be the case that there *is* actual on-disk (or in-transit) corruption that ZFS has detected, and given I/O errors back to an application instead of producing bad data. Now it is probably entirely true that once you *do* have broken hardware or there is some other reason for corruption beyond that which you can design for, ZFS is probably less mature than traditional file systems in terms of the availability of tools and procedures to salvage whatever might actually be salvageable. That is a valid criticism. But you *have* to realize the distinction between "repairing" inconsistencies that are fully expected as part of regular operation in the event of a crash/power outage, and problems arising from misbehaving hardware or bugs in software. ZFS cannot magically overcome such problems, nor can UFS/reiserfs/xfs/whatever else. -- / Peter Schuller PGP userID: 0xE9758B7D or ''Peter Schuller <peter.schuller at infidyne.com>'' Key retrieval: Send an E-Mail to getpgpkey at scode.org E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/9bfd7b95/attachment-0013.bin>
>>>>> "jb" == Jeff Bonwick <Jeff.Bonwick at sun.com> writes:jb> We simulated power failure; we did not simulate disks that jb> simply blow off write ordering. Any disk that you''d ever jb> deploy in an enterprise or storage appliance context gets this jb> right. Did you simulate power failure of iSCSI/FC target shelves without power failure of the head node? How about power failure of iscsitadm-style iSCSI targets? How about rebooting the master domain in sun4v---what is it called? I''ve not had any sun4v but heard the I/O domain, the kernel that contains all the disk drivers, can be rebooted without rebooting the guest-domain kernels which have virtual-disk-drivers, and that sounds like a great opportunity to lose a batch of writes. Do you consider sun4v virtual I/O or iscsitadm as well-fitted to an ``enterprise'''' context, or are they not ready for deploying in the Enterprise yet? :) IMHO it''d really be fantastic if almost all the lost ZFS pools turned out to be just this one write cache problem, and ZFS the canary---not in terms of a checksum canary this time, but in terms of shitting itself when write barriers are violated. Then it''ll be almost a blessing that ZFS is so vulnerable to it, because maybe there will be enough awareness and pressure that it''ll finally become practical to build an end-to-end system without this problem. Suddenly having a database-friendly filesystem everywhere, including trees mounted over NFS/cifs/lustre/whatevers-next, might change some of our assumptions about which MUA''s have fragile message stores and what programs need to store things on ``a local drive''''. I''m ordering a big batch of crappy peecee hardware tomorrow so I can finally start testing and quit ranting. I''ll see if this old post can serve as the qualification tool I keep wanting: http://code.sixapart.com/svn/tools/trunk/diskchecker.pl He used the tool on Linux, I think, and he used it end-to-end, to check fsync() from user-level. which is odd, because I thought I remember reading Linux does _not_ propogate fsync() all the way to the disk, and they''re trying to fix it. In its internal storage stack, Linux has separate ideas of ``cache flush'''' and ``write barrier'''' while my impression is that physical disks have only the latter, so they sort of rely sometimes on things happening ``soon'''', but this guy is saying whether fsync() works or not, on Linux ext3, is determined almost entirely by the disk. possibly the tool can be improved---someone on this list had the interesting idea to write backwards, to provoke the drive into wanting to reorder writes across a barrier since even the dumbest drive will want to write in the direction the platter''s spinning. I''m not sure that backwards-writing will provoke misbehavior inside iSCSI stacks though. In the end the obvious mtd/ubi-style test of writing to a zpool and trying to destroy it by yanking cords might be the best test. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/0cd8e479/attachment-0013.bin>
On 10-Feb-09, at 1:03 PM, Charles Binford wrote:> Jeff, what do you mean by "disks that simply blow off write > ordering."? > My experience is that most enterprise disks are some flavor of > SCSI, and > host SCSI drivers almost ALWAYS use simple queue tags, implying the > target is free to re-order the commands for performance.That''s right; I/O is reordered in many unpredictable ways on the way to the disk. So a flush or barrier enforces ordering at certain critical points. Transactional and journaling systems normally *require* a *functioning* flush/barrier for integrity. --Toby> Are talking > about something else, or does ZFS request Order Queue Tags on certain > commands? > > Charles > > Jeff Bonwick wrote: >>> There is no substitute for cord-yank tests - many and often. The >>> weird part is, the ZFS design team simulated millions of them. >>> So the full explanation remains to be uncovered? >>> >> >> We simulated power failure; we did not simulate disks that simply >> blow off write ordering. Any disk that you''d ever deploy in an >> enterprise or storage appliance context gets this right. >> >> The good news is that ZFS is getting popular enough on consumer-grade >> hardware. The bad news is that said hardware has a different set of >> failure modes, so it takes a bit of work to become resilient to them. >> This is pretty high on my short list. >> >> Jeff >> _______________________________________________ >> zfs-discuss mailing list >> zfs-discuss at opensolaris.org >> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >> > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>>>>> "g" == Gino <dandr.ch at gmail.com> writes:g> we lost many zpools with multimillion$ EMC, Netapp and g> HDS arrays just simulating fc switches power fails. g> The problem is that ZFS can''t properly recover itself. I don''t like what you call ``the problem''''---I think it assumes too much. You mistake *A* fix for *THE* problem, before we can even agree for sure on, what is the problem. The problem may be in the solaris FC initiator, in a corner case of the FC protocol itself, or in ZFS''s exception handling when a ``SYNCHRONIZE CACHE'''' command returns failure. It''s likely other filesystems are affected by ``the problem'''' as I define it, just much less so. If that''s the case, it''d be much better IMHO to fix the real problem once and for all, and find it so that it stays fixed, than to make ZFS work around it by losing a tiny bit of data instead of the whole pool. I don''t think ZFS should feel entitled to brag about protection from Silent Corruption, if it were at the same time willing to silently boot without a slog, or silently rollback to an earlier ueberblock, or if it acts like a cheap USB stick when an FC switch reboots (by quietly losing things that were written long ago). That''s something else to think of: if what''s happening is what we think is happening, then you may be having ``the problem'''' at other times when you do not lose pools! I''m a fan of availability and not of ZFS''s lazy panics and peppering of assertions, but I''m starting to come around a little bit: I don''t want to miss an opportunity to raise everyone''s expectations of their storage stacks, to finally hold cheating disk vendors, cheating virtualization software vendors, and lazy iSCSI programmers accountable, and to make the exception handling in ZFS actually capable of dealing with modern storage instead of hanging status commands, hanging NFS stacks, and inability to replay writes. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/3064e127/attachment-0013.bin>
>>>>> "ps" == Peter Schuller <peter.schuller at infidyne.com> writes:ps> This is a recommendation I would give even when you purchase ps> non-cheap battery backed hardware RAID controllers (I won''t ps> mention any names or details to avoid bashing as I''m sure it''s ps> not specific to the particular vendor I had problems with most ps> recently). This again? If you''re sure the device is broken, then I think others would like to know it, even if all devices are broken. but, fine. Anyway, how did you determine the device was broken? At least you can tell us that much without fear of retaliation (whether baseless or founded), and maybe others can use the same test to independently discover what you did which would be both fair and safe for you. This is the real problem as I see it---a bunch of FUD, without any actual resolution beyond ``it''s working, I _think_, and in any case the random beatings have stopped so D''OH-NT TOUCH *ANY*THING! THAR BE DEMONZ IN THE BOWELS O DIS DISK SHELF!'''' If anyone asks questions, they get no actual information, but a huge amount of blame heaped on the sysadmin. Your post is a great example of the typical way this problem is handled because it does both: deny information and blame the sysadmin. Though I''m really picking on you way too much here. Hopefully everyone''s starting to agree, though, we do need a real way out of this mess! -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/eb7c64c4/attachment-0013.bin>
(..) Dave made a mistake pulling out the drives with out exporting them first. For sure also UFS/XFS/EXT4/.. doesn''t like that kind of operations but only with ZFS you risk to loose ALL your data. that''s the point! (...) I did that many times after performing the umount cmd with ufs/reiserfs filesystems on USB external drives. And they never complainted or got corrupted. -- This message posted from opensolaris.org
I disagree, see posting above. ZFS just accepts it 2 or 3 times. after that, your data are passed away to nirvana for no reason. And it should be legal, to have an external USB drive with a ZFS. with all respect, why should a user always care for redundancy, e. g. setup a mirror on a single HDD between the slices?? This reduces half your available space you have on your drive. -- This message posted from opensolaris.org
On 2/10/2009 2:50 PM, D. Eckert wrote:> (..) > Dave made a mistake pulling out the drives with out exporting them first. > For sure also UFS/XFS/EXT4/.. doesn''t like that kind of operations but only with ZFS you risk to loose ALL your data. > that''s the point! > (...) > > I did that many times after performing the umount cmd with ufs/reiserfs filesystems on USB external drives. And they never complainted or got corrupted. >Possibly so. But if you had that ufs/reiserfs on a LVM or on a RAID0 spanning removable drives, you probably wouldn''t have been so lucky. Just because you only create a single ZFS filesystem inside your zpool doesn''t mean that when that single filesystem is unmounted it is safe to remove the drive. When you consider the extra layer of the zpool (like LVM or sw RAID) it''s not surprising there are other things you have to do before you remove the disk. -Kyle
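In concrete terms, the procedure being described for moving a single-disk pool between machines is a two-step affair. A minimal sketch, using the pool name that appears earlier in the thread (adapt it to your own setup):

   # on the first machine: quiesce the pool and release it from the OS
   zpool export usbhdd1
   # only now is it safe to disconnect the USB drive

   # on the second machine
   zpool import              # lists pools available for import
   zpool import usbhdd1      # imports the pool and mounts its filesystems

Running only ''zfs umount usbhdd1'' removes the filesystem from the mount table, but the pool itself stays imported and the kernel keeps using the device.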
(...) If anyone asks questions, they get no actual information, but a huge amount of blame heaped on the sysadmin. Your post is a great example of the typical way this problem is handled because it does both: deny information and blame the sysadmin. Though I''m really picking on you way too much here. Hopefully everyone''s starting to agree, though, we do need a real way out of this mess! (...) THANK YOU! It''s precisely walking in my shoes. or with a different expression: THE STUPID USER. -- This message posted from opensolaris.org
Roman Shaposhnik
2009-Feb-10 20:00 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Feb 9, 2009, at 7:06 PM, Jeff Bonwick wrote:>>There is no substitute for cord-yank tests - many and often. The >>weird part is, the ZFS design team simulated millions of them. >>So the full explanation remains to be uncovered? > >We simulated power failure; we did not simulate disks that simply >blow off write ordering. Any disk that you''d ever deploy in an >enterprise or storage appliance context gets this right. > >The good news is that ZFS is getting popular enough on consumer-grade >hardware. The bad news is that said hardware has a different set of >failure modes, so it takes a bit of work to become resilient to them. >This is pretty high on my short list.Speaking of "modes of failure": historically fsck has been used for slightly different (although related) purposes: 0. as a tool capable of restoring consistency in a FS that didn''t guarantee an always consistent on-disk state 1. as a forensics tool that would let you retrieve as much information as possible from a physically ill device Thank goodness, ZFS doesn''t need fsck for #0. That still leaves #1. So far all we have in that department is zdb/mdb. These two can do wonders when used by professionals, yet still fall into the "don''t try that at home" category for everybody else. Does such a tool - a supported forensics/recovery utility that ordinary administrators could safely use - sound reasonable? Does it have a chance of ever showing up on your list? Thanks, Roman. -- This message posted from opensolaris.org
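For what it is worth, zdb can already do a fair amount of read-only inspection today, even though it is nowhere near a friendly recovery tool. A rough sketch, with the device and pool names from the thread as placeholders (available options vary between builds, as noted later in the thread):

   zdb -l /dev/rdsk/c3t0d0s0    # dump the vdev labels, including the uberblock arrays
   zdb -uuu usbhdd1             # print the active uberblock of an imported pool
   zdb -e -bb usbhdd1           # traverse an exported pool and summarize its blocks

None of these modify the pool, which makes them reasonable first steps before any attempt at salvage.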
On 2/10/2009 2:54 PM, D. Eckert wrote:> I disagree, see posting above. > > ZFS just accepts it 2 or 3 times. after that, your data are passed away to nirvana for no reason. > > And it should be legal, to have an external USB drive with a ZFS. with all respect, why should a user always care for redundancy, e. g. setup a mirror on a single HDD between the slices?? > >You don''t have to have redundancy. But if you don''t then I don''t know how you can expect the ''repair'' features of ZFS to bail you out when something bad happens.> This reduces half your available space you have on your drive. >Mirroring between slices does more than that: it will ruin your performance as well. It''d be much better to set ''copies=2'', though that will still reduce your space by half. -Kyle
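For completeness, enabling that is a single property change, and it only protects blocks written after the property is set; existing data keeps one copy. A sketch, with the dataset name as a placeholder:

   zfs set copies=2 usbhdd1     # store two copies of each newly written block
   zfs get copies usbhdd1       # verify the setting

On a single physical disk this guards against isolated bad sectors, not against losing the whole device.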
Carsten Aulbert
2009-Feb-10 20:02 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Hi, I''ve followed this thread a bit and I think there are some correct points on all sides of the discussion, but here I see a misconception (at least I think it is one): D. Eckert schrieb:> (..) > Dave made a mistake pulling out the drives with out exporting them first. > For sure also UFS/XFS/EXT4/.. doesn''t like that kind of operations but only with ZFS you risk to loose ALL your data. > that''s the point! > (...) > > I did that many times after performing the umount cmd with ufs/reiserfs filesystems on USB external drives. And they never complainted or got corrupted.Think of ZFS as an entity which cannot live without the underlying ZPOOL. You can have reiserfs, jfs, ext?, xfs - you name it - on any logical device as it will only live on this one and when you umount it, it''s safe to power it off, yank the disk out, whatever, since there is no other layer between the file system and the logical disk partition/slice/... However, as soon as you add another layer (say RAID, which in this analogy is somehow the ZPOOL) you might also lose data when you have a RAID0 setup and umount reiserfs/ufs/whatever and take a disc out of the RAID and destroy it or change a few sectors on it. When you then mount the file system again, it''s utterly broken and lost. Or - which might be worse - you might end up with a "silent" data corruption you will never notice unless you try to open the data block which is damaged. However, in your case you have some checksum error in the file system on a single hard disk which might have been caused by some accident. ZFS is good in the respect that it can tell you that something''s broken, but without a mirror or parity device it won''t be able to fix the data out of thin air. I cannot claim to fully understand what happened to your devices, so please take my written stuff with a grain of salt. Cheers Carsten
On Tue, Feb 10, 2009 at 12:46 PM, Miles Nordin <carton at ivy.net> wrote:> > It''s likely other filesystems are affected by ``the problem'''' as I > define it, just much less so. If that''s the case, it''d be much better > IMHO to fix the real problem once and for all, and find it so that it > stays fixed, than to make ZFS work around it by losing a tiny bit of > data instead of the whole pool. I don''t think ZFS should feel > entitled to brag about protection from Silent Corruption, if it were > at the same time willing to silently boot without a slog, or silently > rollback to an earlier ueberblock, or if it acts like a cheap USB > stick when an FC switch reboots (by quietly losing things that were > written long ago). >I agree, silently rolling back would be a *BAD THING*. HOWEVER, not giving you the option to easily roll back *AT ALL* is a *WORSE THING*. I don''t think zfs should brag about anything if my pool can be down for hours or days because I''m not given the option to roll back to a consistent state when I *KNOW* it''s what I want to do. Of course, making that easy wouldn''t sell support contracts, would it? --Tim -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/9f6d8fbe/attachment-0012.html>
>>>>> "rs" == Roman Shaposhnik <rvs at sun.com> writes:rs> 1. as a forensics tool that would let you retrieve as much rs> information as possible from a physically ill device a nit, but I''ve never foudn fsck alone useful for this. Maybe for ``a filesystem trashed by bad RAM/CPU/bugs'''' it is useful, but for a physically bad disk I''ve always had to use dd_rescue or ''dd bs=512 conv=noerror,sync'' onto a good disk before pulling out the fsck. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/c2d82a6b/attachment-0010.bin>
(...) You don''t move a pool with ''zfs umount'', that only unmounts a single zfs filesystem within a pool, but the pool is still active.. ''zpool export'' releases the pool from the OS, then ''zpool import'' on the other machine. (...) with all respect: I never read such a non logic ridiculous . I have a single zpool set up over the entire available disk space on an external USB drive without any other filesystems inside this particular pool. so how on earth should I be sure, that the pool is still a live pool inside the operating system if the output of ''mount'' cmd tells me, the pool is no longer attached to the root FS???? this doesn''t make sense at all and it is a vulnerability of ZFS. so if the output of the mount cmd tells you the FS / ZPOOL is not mounted I can''t face any reason why the filesystem should be still up and running, because I just unmounted the only one available ZPOOL. And by the way: After performing: ''zpool umount usbhdd1'' I can NOT access any single file inside /usbhdd1. What else should be released from the OS FS than a single zpool containing no other sub Filesystems? Why? The answer is quite simple: The pool is unmounted and no longer hooked up to the system''s filesystem. so what should me prevent from unplugging the usb wire? Regards, DE -- This message posted from opensolaris.org
(...) Possibly so. But if you had that ufs/reiserfs on a LVM or on a RAID0 spanning removable drives, you probably wouldn''t have been so lucky. (...) we are not talking about a RAID 5 array or an LVM. We are talking about a single FS setup as a zpool over the entire available disk space on an external USB HDD. I decided to do so due to the read/write speed performance of zfs comparing to UFS/ReiserFS. Regards, DE. -- This message posted from opensolaris.org
Nicolas Williams
2009-Feb-10 20:38 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Tue, Feb 10, 2009 at 12:31:05PM -0800, D. Eckert wrote:> (...) > You don''t move a pool with ''zfs umount'', that only unmounts a single zfs > filesystem within a pool, but the pool is still active.. ''zpool export'' > releases the pool from the OS, then ''zpool import'' on the other machine. > (...) > > with all respect: I never read such a non logic ridiculous .It''s not "logic" -- it''s what ZFS does. It lets you have N filesystems in one pool. The price you pay is that unmounting one such filesystem is insufficient to quiesce the pool in which that filesystem lives: you must export the pool in order to quiesce it. Perhaps what you want to argue is that unmounting the root filesystem of a pool should cause the pool to be exported. Nico --
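The distinction is easy to see on a live system. A short sketch, with the pool name from the thread:

   zfs umount usbhdd1      # the filesystem disappears from ''mount'' and mnttab...
   zpool list usbhdd1      # ...but the pool is still imported
   zpool status usbhdd1    # and the kernel still holds the underlying device open
   zpool export usbhdd1    # only after this is it safe to pull the cable

In other words, ''mount'' only reports on filesystems; it says nothing about the state of the pool.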
D. Eckert wrote:> (...) > Possibly so. But if you had that ufs/reiserfs on a LVM or on a RAID0 > spanning removable drives, you probably wouldn''t have been so lucky. > (...) > > we are not talking about a RAID 5 array or an LVM. We are talking about a single FS setup as a zpool over the entire available disk space on an external USB HDD. > >You are missing the point. A ZFS filesystem is not the same as a UFS filesystem on a device, the extra layer of the pool makes it closer to a RAID volume. You have to halt the pool before removing the device. These posts do sound like someone who is blaming their parents after breaking a new toy before reading the instructions. -- Ian.
D. Eckert wrote:> (...) > You don''t move a pool with ''zfs umount'', that only unmounts a single zfs > filesystem within a pool, but the pool is still active.. ''zpool export'' > releases the pool from the OS, then ''zpool import'' on the other machine. > (...) > > with all respect: I never read such a non logic ridiculous .You are not listening and you are not learning. You do not seem to understand the fundamentals of ZFS.> > I have a single zpool set up over the entire available disk space on an external USB drive without any other filesystems inside this particular pool. > > so how on earth should I be sure, that the pool is still a live pool inside the operating system if the output of ''mount'' cmd tells me, the pool is no longer attached to the root FS???? > > this doesn''t make sense at all and it is a vulnerability of ZFS.''mount'' is not designed to know anything about the storage *pools*. Yes, you unmounted the filesystem and mount shows it is not mounted. This does not mean the zpool is not still imported and active.> > so if the output of the mount cmd tells you the FS / ZPOOL is not mounted I can''t face any reason why the filesystem should be still up and running, because I just unmounted the only one available ZPOOL.No, you did not unmount the zpool.> And by the way: After performing: ''zpool umount usbhdd1'' I can NOT access any single file inside /usbhdd1.There is no ''zpool unmount'' command.> > What else should be released from the OS FS than a single zpool containing no other sub Filesystems?Again, you have not ''released'' the zpool.> > Why? The answer is quite simple: The pool is unmounted and no longer hooked up to the system''s filesystem. so what should me prevent from unplugging the usb wire? >Again, you are not understanding the fundamentals of ZFS. You may have unmounted the *filesystem*, but not the zpool. You yanked a disk containing a live, imported zpool. Since the advice and information offered to you in this thread has been completely disregarded, the only thing left to say is: RTFM.
On 10-Feb-09, at 1:05 PM, Peter Schuller wrote:>> YES! I recently discovered that VirtualBox apparently defaults to >> ignoring flushes, which would, if true, introduce a failure mode >> generally absent from real hardware (and eventually resulting in >> consistency problems quite unexpected to the user who carefully >> configured her journaled filesystem or transactional RDBMS!) > > I recommend everyone to be extremely hesitant to assume that any > particular storage setup actually honors write barriers and cache > flushes. ...+1.> > You need the underlying device to do the right thing, the driver to do > the right thing, the operating system in general to do the right thing > (which includes the file system, block device layer if any etc - for > example, if use md on Linux with RAID5/6 you''re toast).Absolutely.> > So again I cannot stress enough - do not assume things behave in a > non-broken fashion with respect to write barriers and flushes.That''s why I believe there is no substitute for pull-plug tests, and I would perform quite a few on a loaded system before being confident about it. The last time I did that in anger was against a Sun X2200 + LVM mirror + Ubuntu + reiser3fs + MySQL InnoDB, and it performed flawlessly (although I agree there may be a weak link in LVM; not my choice. I''d have chosen Solaris+ZFS).> I can''t > speak to expensive integrated hardware solutions; I HOPE, though at > this point my level of paranoid does not allow me to assume, that if > you buy boxed systems from companies like Sun/HP/etc you get decent > stuff. But I can definitely say that paying non-trivial amounts of > money for hardware is not a guarantee that you won''t get completely > broken behavior.+1. --Toby> ... > > -- > / Peter Schuller > > PGP userID: 0xE9758B7D or ''Peter Schuller > <peter.schuller at infidyne.com>'' > Key retrieval: Send an E-Mail to getpgpkey at scode.org > E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org >
Mario Goebbels
2009-Feb-10 20:57 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> The good news is that ZFS is getting popular enough on consumer-grade > hardware. The bad news is that said hardware has a different set of > failure modes, so it takes a bit of work to become resilient to them. > This is pretty high on my short list.One thing I''d like to see is an _easy_ option to fall back onto older uberblocks when the zpool went belly up for a silly reason. Something that doesn''t involve esoteric parameters supplied to zdb. -mg -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 225 bytes Desc: OpenPGP digital signature URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/ca942bc6/attachment-0007.bin>
Charles Binford
2009-Feb-10 21:13 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
DE - could you please post the output of your ''zpool umount usbhdd1'' command? I believe the output will prove useful to the point being discussed below. Charles D. Eckert wrote:> (...) > You don''t move a pool with ''zfs umount'', that only unmounts a single zfs > filesystem within a pool, but the pool is still active.. ''zpool export'' > releases the pool from the OS, then ''zpool import'' on the other machine. > (...) > > with all respect: I never read such a non logic ridiculous . > > I have a single zpool set up over the entire available disk space on an external USB drive without any other filesystems inside this particular pool. > > so how on earth should I be sure, that the pool is still a live pool inside the operating system if the output of ''mount'' cmd tells me, the pool is no longer attached to the root FS???? > > this doesn''t make sense at all and it is a vulnerability of ZFS. > > so if the output of the mount cmd tells you the FS / ZPOOL is not mounted I can''t face any reason why the filesystem should be still up and running, because I just unmounted the only one available ZPOOL. > > And by the way: After performing: ''zpool umount usbhdd1'' I can NOT access any single file inside /usbhdd1. > > What else should be released from the OS FS than a single zpool containing no other sub Filesystems? > > Why? The answer is quite simple: The pool is unmounted and no longer hooked up to the system''s filesystem. so what should me prevent from unplugging the usb wire? > > Regards, > DE >
I think you are not reading carefully enough, and I can trace from your reply a typically American arrogant behavior. WE, THE PROUDEST AND infallibles on earth DID NEVER MAKE a mistake. It is just the stupid user who did not read the fucking manual carefully enough. ???? Hello? Did you already recognized the sound of the shot?? No, you didn''t. If you would, than you''d know, that we are not talking about HOW TO PREVENT SUCH EVENTS IN FUTURE but of recovering the data. I learned my lesson well, and in future this won''t happen again, because we will no longer use zfs, but we have a legal interest, to get back our data we stored in trust on a non reap Filesystem developed and introduced by Sun. And that Sun has a big problem regarding version numbers and supported options is not a secret. e. g.: On Solaris 10 generic 10-2008, latest updates, running zfs Version 10 the ''t'' option in zdb is missing. But on SNV 107, same zfs Version 10 ''t'' option of zdb is available. AND: it is not acceptable, that having on 2 systems the same zfs version running, that the output of zdb -u <pool> differs. Even if a UFS/ReiserFS is corrupted, you have chances to access even a part of the date. On ZFS you can''t. You are lost inside the castle someone has the key just thrown away. And the key just seems to held by the developers of Sun. If you have any idea of IT Security, you should know well the expression and meaning of "The key of the kingdom". And as more postings we have to read in the sound of yours as more we are thinking to raise a court trail against Sun just to stop that american arrogance and to withhold technologies and methods to recover a filesystem. However, just tell me how to get the data back from the hard drive zfs just messed up with, and you are the king, and we are happy, and this issue us closed. I hope I''ve made myself very clear. Regards from Germany. DE. -- This message posted from opensolaris.org
Peter Schuller
2009-Feb-10 21:26 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> ps> This is a recommendation I would give even when you purchase > ps> non-cheap battery backed hardware RAID controllers (I won''t > ps> mention any names or details to avoid bashing as I''m sure it''s > ps> not specific to the particular vendor I had problems with most > ps> recently). > > This again? If you''re sure the device is broken, then I think others > would like to know it, even if all devices are broken.The problem is that I even had help from the vendor in question, and it was not for me personally but for a company, and I don''t want to use information obtained that way to do any public bashing. But I have no particular indication that there is any problem with the vendor in general; it was a combination of choices made by Linux kernel developers and the behavior of the RAID controller. My interpretation was that no one was there looking at the big picture, and the end result was that if you followed the instructions specifically given by the vendor, you would have a setup whereby you would loose correctness whenever the BBU was overheated/broken/disabled. The alternative was to get completely piss-poor performance by not being able to take advantage of the battery backed nature of the cache at all (which defeats most of the purpose of having the controller, if you use it in any kind of transactional database environment or similar).> but, fine. Anyway, how did you determine the device was broken?By performing timing tests as mentioned in the other post that you answered separately, and after detecting the problem confirming the status with respect to caching at the different levels as claimed by the administrative tool for the controller. While timing tests cannot conclusively prove correct behavior, it can definitely proove incorrect behavior in cases where your timings are simply theoretically impossible given the physical nature of the underlying drives.> At > least you can tell us that much without fear of retaliation (whether > baseless or founded), and maybe others can use the same test to > independently discover what you did which would be both fair and safe > for you.The test was trivial; in my case a ~10 line Python script or something along those lines. Perhaps I should just go ahead and release something which non-programmers can easily run and draw conclusions from.> This is the real problem as I see it---a bunch of FUD, without any > actual resolution beyond ``it''s working, I _think_, and in any case > the random beatings have stopped so D''OH-NT TOUCH *ANY*THING! THAR BE > DEMONZ IN THE BOWELS O DIS DISK SHELF!''''I''d love to go on a public rant, because I think the whole situation was a perfect example of a case where a single competent person who actually cares about correctness could have pinpointed this problem trivially. But instead you have different camps doing their own stuff and not considering the big picture.> If anyone asks questions, they get no actual information, but a huge > amount of blame heaped on the sysadmin. Your post is a great example > of the typical way this problem is handled because it does both: deny > information and blame the sysadmin. Though I''m really picking on you > way too much here. Hopefully everyone''s starting to agree, though, we > do need a real way out of this mess!I''m not quite sure what you''re referring to here. I''m not blaming any sysadmin. I was trying to point out *TO* sysadmins, to help them, that I recommend being paranoid about correctness. 
If you mean the original poster in the thread having issues, I am not blaming him *at all* in the post you responded to. It was strictly meant as a comment in response to the poster who noted that he discovered, to his surprise, the problems with VirtualBox. I wanted to make the point that while I completely understand his surprise, I have come to expect that these things are broken by default (regardless of whether you''re using virtualbox or not, or vendor X or Y etc), and that care should be taken if you do want to have correctness when it comes to write barriers and/or honoring fsync(). However, that said, as I stated in another post I wouldn''t be surprised if it turns out the USB device was ignoring sync commands. But I have no idea what the case was for the original poster, nor have I even followed the thread in detail enough to know if that would even be a possible explanation for his problems. -- / Peter Schuller PGP userID: 0xE9758B7D or ''Peter Schuller <peter.schuller at infidyne.com>'' Key retrieval: Send an E-Mail to getpgpkey at scode.org E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 196 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/21e73ccf/attachment-0007.bin>
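For those who want to try this themselves, a crude version of such a timing test needs nothing more than dd. The sketch below assumes a GNU dd (as on Linux) for its oflag=dsync option and a scratch file on the storage being tested; the path and counts are arbitrary:

   # rewrite the same 4 KB block 1000 times, each write synchronous
   time sh -c 'i=0; while [ $i -lt 1000 ]; do dd if=/dev/zero of=/mnt/test/probe bs=4k count=1 oflag=dsync conv=notrunc 2>/dev/null; i=$((i+1)); done'
   # a bare 7200 rpm disk that truly waits for the platter manages on the
   # order of a hundred such writes per second; much higher rates mean a
   # cache is absorbing the syncs, and then the question is whether that
   # cache is non-volatile

Plausible numbers prove nothing, but implausibly high numbers on plain disks with no non-volatile cache are exactly the kind of conclusive evidence of broken behavior referred to above.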
Marcelo H Majczak
2009-Feb-10 21:29 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
I''ll make a meta comment on the thread itself, not on the ZFS issue. There is more bashing and broad accusations than it would normally happen on a "professional usage" situation. Maybe a board admin can run a script on the ip addresses logged and find a more subtle meaning... I don''t know, I''m just a bit skeptical by nature. -- This message posted from opensolaris.org
if you are interested in my IP address: no problem: 83.236.164.80 It just confirms my assumption that it''s best and easier for someone - if he''s in the right position - to stick a big plaster over someone''s mouth to avoid hearing legitimate criticism, instead of discussing the problem openly to find a proper solution. My honest congratulations! -- This message posted from opensolaris.org
>>>>> "de" == D Eckert <contact at desystems.cc> writes:de> from your reply a typically American arrogant behavior. de> WE, THE PROUDEST AND infallibles on earth DID NEVER MAKE a de> mistake. Maybe I should speak up since I defended you at the start. To my view: REASONABLE: * expect that ZFS lose almost nothing when yanking the power cord, or when uncleanly dismounting. * expect that ``always consistent on disk'''' mean something in practice, even given the real hardware and the non-ZFS parts of the storage stack which exist right now. * where the first two are impossible have both a real answer as to why, and a workable way forward, rather than obstructionist FUD. Especially when cord-yanking/unclean-dismount causes ZFS to lose more than other filesystems. not point to dragons and FUD and blame whatever is difficult to exhonerate, especially hindsight surrounding inexpensive devices, and the sysadmin himself. UNREASONABLE: * say ``any filesystem will lose arbitrary amounts of data when uncleanly dismounted because filesystems do not `like'' that. you were `asking'' for it.'''' This is flatly untrue of every non-Microsoft filesystem, even very old ones. Also it directly contradicts the most central claims made by the ZFS kool-aid pushers. * say that the central claims don''t apply to single-vdev pools. * belief in ''copies=2'' * be outraged that ZFS maintenance commands differ from other filesystems. Refuse to listen when the reasonable, easily-described, and documented differences in this mount/umount import/export interface are repeatedly explained to you. You seem to think the mere fact that new commands exist at all means they are badly-designed, and this is way too conservative. This is lazy and boring and unconvincing, especially to me who feels maybe too much has already been sacrificed to the mantra of keeping the user interface simple. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090210/09e9d055/attachment-0007.bin>
Roman V. Shaposhnik
2009-Feb-10 21:48 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, 2009-02-11 at 09:49 +1300, Ian Collins wrote:> These posts do sound like someone who is blaming their parents after > breaking a new toy before reading the instructions.It looks like there''s a serious denial of the fact that "bad things do happen to even the best of people" on this thread. Thanks, Roman.
Richard Elling
2009-Feb-10 21:53 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Mario Goebbels wrote:>> The good news is that ZFS is getting popular enough on consumer-grade >> hardware. The bad news is that said hardware has a different set of >> failure modes, so it takes a bit of work to become resilient to them. >> This is pretty high on my short list. >> > > One thing I''d like to see is an _easy_ option to fall back onto older > uberblocks when the zpool went belly up for a silly reason. Something > that doesn''t involve esoteric parameters supplied to zdb. >This is CR 6667683 http://bugs.opensolaris.org/view_bug.do?bug_id=6667683 -- richard
dick hoogendijk
2009-Feb-10 21:59 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Tue, 10 Feb 2009 13:14:57 PST "D. Eckert" <contact at desystems.cc> wrote:> Hello? Did you already recognized the sound of the shot??> I learned my lesson well, and in future this won''t happen > again, because we will no longer use zfs, but we have a legal > interest, to get back our data we stored in trust on a non reap > Filesystem developed and introduced by Sun. > > And that Sun has a big problem regarding version numbers and > supported options is not a secret.It''s time we learn ours too. I can understand that you want your data back. You can''t. You made a big mistake. Soi. Also, your messages are full of anti-SUN, anti-ZFS, anti-ALL (but you). I''m convinced you won''t learn. You just did what you intended to do. Kick some (sun)ash. If you don''t like SUN/ZFS, then don''t use it. If you -DO-, learn to use it right. You sound too much like a troll at times. If you are, just say so. If you''re not, then read the advice you''ve been given more carefully. Otherwise, you''re just wasting people''s time and energy. -- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D + http://nagual.nl/ | SunOS sxce snv107 ++ + All that''s really worth doing is what we do for others (Lewis Carroll)
Roman V. Shaposhnik wrote:> On Wed, 2009-02-11 at 09:49 +1300, Ian Collins wrote: > >> These posts do sound like someone who is blaming their parents after >> breaking a new toy before reading the instructions. >> > > It looks like there''s a serious denial of the fact that "bad things > do happen to even the best of people" on this thread. > >Sure. I think most here would agree that some form of recovery tool for ZFS is long overdue. I''ve rebuilt a UFS filesystem after it was damaged by an exploding power supply and it was a strangely rewarding experience. I''m not sure how ZFS would survive this type of failure and I doubt I''d be able to recover a broken pool without help. It''s also clear that the OP has failed to grasp the principles of ZFS and he appears reluctant to acknowledge this. USB removable devices are not the most reliable storage media. I have been using USB sticks as a high speed data link between home and office for over a year now and I''ve never had any corruption, but I have had two sticks fail. If I''m travelling, I always back up to a two stick ZFS mirror. -- Ian.
David Champion
2009-Feb-10 22:29 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
DE: I think that a big part of the reason you''re getting the responses you do is not arrogance from Sun or us kool-aid drinkers, but your own tone and attitude. You didn''t ask for help in your initial message at all. The entire post was a diatribe against Sun and ZFS which was based on your experience of using ZFS in a way that the ZFS documentation tells you not to use it. You have some legitimate concerns, but you began by insulting a lot of people''s work instead of by asking questions. Since then you''ve asked for help, but your tone has only gotten angrier. In your most recent post you even threatened legal action against Sun. Where I work, as soon as someone makes a legal threat, we move a support case from technical staff directly to our lawyers. If Sun is like us, that means you can expect no more free, voluntary support from Sun''s engineering team; it will be mediated by counsel, if at all. This is not good for you. I apologize if I seem arrogant, but I think you need to reconsider your approach. All in all I think the people in this forum who work at Sun have treated you very well. -- -D. dgc at uchicago.edu NSIT University of Chicago
On February 10, 2009 1:14:57 PM -0800 "D. Eckert" <contact at desystems.cc> wrote:> > I hope I''ve made myself very clear. >Very. Rarely has the adage "what one says reveals more about the speaker than the subject" been more evident.> And as more postings we have to read in the sound of yours as more we are > thinking to raise a court trail against Sun just to stop that > american arrogance and to withhold technologies and methods to recover > a filesystem.Comments like this are especially laughable (and revealing). In spite of your arrogant tone (perhaps amplified by translation, but still clearly present), many here have tried to be helpful. However you have already made your decision and aren''t listening. The validity (or not) of your problem is overshadowed by the presentation. Are you sure D Eckert isn''t a pseudonym for Al Viro? From your original post:> after working for 1 month with ZFS on 2 external USB drives I have > experienced, that the all new zfs filesystem is the most unreliable FS I > have ever seen.To the contrary, after working with ZFS for a few years (since it has been publicly available), I have found that it is the most reliable FS ever known. Well, who am I anyway. Just my 0.02. Of course it has some warts -- all complex software does -- and you have revealed a big one. But you would choose to throw the baby out with the bathwater. The problem you have experienced is mitigated in the real world by the fact that data you actually care about requires replication. -frank
We have seen some unfortunate miscommunication and misinterpretation here. This extends into differences of culture. One of the vocal people in here is surely not ''Anti-xyz''; rather I sense his intense desire to further progress by pointing his finger at some potential wounds. May I repeat my request to run a hardware diagnosis on the drives concerned (being aware of the ambiguities involved). If the hardware passes with flying colours, we need to look deeper into the underlying matter. Many in here administrate professional systems, with SCSI, RAID and whatnot. If ZFS does a great service to them, we are happy. On the other hand, though, and, again, management decisions come into perspective, OpenSolaris tries to appeal to the mass market and enter the end-user scene. Then remarks that one had to RTFM the man pages of zfs and zpool, up and down, are out of place. USB disk drives are common, ubiquitous even. To discourage their use is out of the question. To add another layer to ''mount'' likewise. Now we are in heavy seas: ZFS might lose all data irrecoverably? Not fine, but what''s the alternative? UFS is sparsely supported elsewhere (and probably considered ''legacy'' by SUN), extn is supported read-only. The only and last other file system is vfat/pcfs. Alas, when I wrote in about finding it failing on a larger drive, I was told (search the archives) that it was a ''hack'' built into the kernel only. Now what? vfat is the crappiest of all, UFS is obsolete and not widely available, and ZFS is currently being discussed as losing all data irreversibly on USB drives. I repeat that I have never lost a single drive - despite usually using cheapo crap outside of my production boxes - in the last 10 years, aside from complete hardware failure. All my other drives, ext2, ext3, ffs, have always allowed me to salvage some stuff and recover the larger part of the data, despite some of my users yanking out drives at the most inconvenient moments. Back to where I started from, with some questions: 1. Can the relevant people confirm that drives might turn dead when leaving a pool at unfortunate moments, despite complete physical integrity? [I''d really appreciate an answer here, because this is what I am starting to implement here: ZFS on USB drives.] 2. Are those drives in an unrecoverable state passing their integrity/diagnosis tests (r/w)? 3. If, as has been mentioned, a pool is an entity like RAID in between, and hurting the pool might also destroy data - if this is the case, can this destruction of a pool not also happen within the confines of a server, without any physical yanking of the drive, by a dying controller? Thanks, Uwe -- This message posted from opensolaris.org
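On the Solaris side, a first pass at the hardware diagnosis asked for above could look roughly like this, with the device and pool names from the thread as placeholders (smartmontools, where installed, can dig deeper into the drive itself):

   iostat -En c3t0d0         # cumulative soft/hard/transport error counters per device
   zpool scrub usbhdd1       # re-read every block in the pool and verify its checksum
   zpool status -v usbhdd1   # after the scrub: error counts plus any damaged files

A drive that passes its own diagnostics can still sit behind a USB bridge that ignores cache flushes, so a clean result here does not by itself settle question 1.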
Fredrich Maney
2009-Feb-11 05:44 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Tue, Feb 10, 2009 at 4:14 PM, D. Eckert <contact at desystems.cc> wrote:> I think you are not reading carefully enough, and I > can trace from your reply a typically American > arrogant behavior. > > WE, THE PROUDEST AND infallibles on earth DID NEVER MAKE > a mistake. It is just the stupid user who did not read the > fucking manual carefully enough. > > ????Ah... an illiterate AND idiotic bigot. Have you even read the manual or *ANY* of the replies to your posts? *YOU* caused the situation that resulted in your data being corrupted. Not Sun, not OpenSolaris, not ZFS and not anyone on this list. Yet you feel the need to blame ZFS and insult the people that have been trying to help you understand what happened and why you shouldn''t do what you did. ZFS is not a filesystem like UFS or Reiserfs, nor is it an LVM like SVM or VxVM. It is both a filesystem and a logical volume manager. As such, like all LVM solutions, there are two steps that you must perform to safely remove a disk: unmount the filesystem and quiesce the volume. That means you *MUST*, in the case of ZFS, issue ''umount filesystem'' *AND* ''zpool export'' before you yank the USB stick out of the machine. Effectively what you did was create a one-sided mirrored volume with one filesystem on it, then put your very important (but not important enough to bother mirroring or backing up) data on it. Then you unmounted the filesystem and ripped the active volume out of the machine. You got away with it a couple of times because just how good of a job the ZFS developers did at idiot proofing it, but when it finally got to the point where you lost your data, you came here to bitch and point fingers at everyone but the responsible party (hint, it''s you). When your ignorance (and fault) was pointed out to you, you then resorted to personal attacks and slurs. Nice. Very professional. Welcome to the bit-bucket. fpsm
Fredrich Maney
2009-Feb-11 05:56 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Good. It looks like this thread can finally die. I received the following in response to my message below: This is an automatically generated Delivery Status Notification Delivery to the following recipient failed permanently: contact at desystems.cc Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 553 553 5.3.0 <contact at desystems.cc>... Your spam was rejected! (state 14). On Wed, Feb 11, 2009 at 12:44 AM, Fredrich Maney <fredrichmaney at gmail.com> wrote:> On Tue, Feb 10, 2009 at 4:14 PM, D. Eckert <contact at desystems.cc> wrote: >> I think you are not reading carefully enough, and I >> can trace from your reply a typically American >> arrogant behavior. >> >> WE, THE PROUDEST AND infallibles on earth DID NEVER MAKE >> a mistake. It is just the stupid user who did not read the >> fucking manual carefully enough. >> >> ???? > > Ah... an illiterate AND idiotic bigot. Have you even read the manual > or *ANY* of the replies to your posts? *YOU* caused the situation that > resulted in your data being corrupted. Not Sun, not OpenSolaris, not > ZFS and not anyone on this list. Yet you feel the need to blame ZFS > and insult the people that have been trying to help you understand > what happened and why you shouldn''t do what you did. > > ZFS is not a filesystem like UFS or Reiserfs, nor is it an LVM like > SVM or VxVM. It is both a filesystem and a logical volume manager. As > such, like all LVM solutions, there are two steps that you must > perform to safely remove a disk: unmount the filesystem and quiesce > the volume. That means you *MUST*, in the case of ZFS, issue ''umount > filesystem'' *AND* ''zpool export'' before you yank the USB stick out of > the machine. > > Effectively what you did was create a one-sided mirrored volume with > one filesystem on it, then put your very important (but not important > enough to bother mirroring or backing up) data on it. Then you > unmounted the filesystem and ripped the active volume out of the > machine. You got away with it a couple of times because just how good > of a job the ZFS developers did at idiot proofing it, but when it > finally got to the point where you lost your data, you came here to > bitch and point fingers at everyone but the responsible party (hint, > it''s you). When your ignorance (and fault) was pointed out to you, you > then resorted to personal attacks and slurs. Nice. Very professional. > Welcome to the bit-bucket. > > fpsm >
Jan.Dreyer at bertelsmann.de
2009-Feb-11 07:35 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
In other words: Dont feed the troll. Greets Jan Dreyer zfs-discuss-bounces at opensolaris.org <> wrote :> Good. It looks like this thread can finally die. I received the > following in response to my message below: > > > > > This is an automatically generated Delivery Status Notification > > Delivery to the following recipient failed permanently: > > contact at desystems.cc > > Technical details of permanent failure: > Google tried to deliver your message, but it was rejected by the > recipient domain. We recommend contacting the other email provider for > further information about the cause of this error. The error that the > other server returned was: 553 553 5.3.0 <contact at desystems.cc>... > Your spam was rejected! (state 14). > > > > > On Wed, Feb 11, 2009 at 12:44 AM, Fredrich Maney > <fredrichmaney at gmail.com> wrote: >> On Tue, Feb 10, 2009 at 4:14 PM, D. Eckert > <contact at desystems.cc> wrote: >>> I think you are not reading carefully enough, and I >>> can trace from your reply a typically American >>> arrogant behavior. >>> >>> WE, THE PROUDEST AND infallibles on earth DID NEVER MAKE >>> a mistake. It is just the stupid user who did not read the >>> fucking manual carefully enough. >>> >>> ???? >> >> Ah... an illiterate AND idiotic bigot. Have you even read the manual >> or *ANY* of the replies to your posts? *YOU* caused the situation >> that resulted in your data being corrupted. Not Sun, not >> OpenSolaris, not ZFS and not anyone on this list. Yet you feel the >> need to blame ZFS and insult the people that have been trying to >> help you understand what happened and why you shouldn''t do what you >> did. >> >> ZFS is not a filesystem like UFS or Reiserfs, nor is it an LVM like >> SVM or VxVM. It is both a filesystem and a logical volume manager. As >> such, like all LVM solutions, there are two steps that you must >> perform to safely remove a disk: unmount the filesystem and quiesce >> the volume. That means you *MUST*, in the case of ZFS, issue ''umount >> filesystem'' *AND* ''zpool export'' before you yank the USB stick out >> of the machine. >> >> Effectively what you did was create a one-sided mirrored volume with >> one filesystem on it, then put your very important (but not important >> enough to bother mirroring or backing up) data on it. Then you >> unmounted the filesystem and ripped the active volume out of the >> machine. You got away with it a couple of times because just how good >> of a job the ZFS developers did at idiot proofing it, but when it >> finally got to the point where you lost your data, you came here to >> bitch and point fingers at everyone but the responsible party (hint, >> it''s you). When your ignorance (and fault) was pointed out to you, >> you then resorted to personal attacks and slurs. Nice. Very >> professional. Welcome to the bit-bucket. >> >> fpsm >> > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> Fsck can only repair known faults; known discrepancies in the meta data.
> Since ZFS doesn't have such known discrepancies, there's nothing to repair.

I'm rather tired of hearing this mantra. If ZFS detects an error in part of its data structures, then there is clearly something to repair. The choice ZFS presently makes is effectively to prune the entire pool hierarchy from the point of error downward. If the error found is near the root of the pool, this renders all files inaccessible. This is rather as if fsck, when finding a corrupted UFS directory, removed all of the files within it instead of either (a) trying to repair the directory, or (b) placing them in lost+found; or, when it found a doubly-allocated block, chose to reformat the filesystem. ZFS could do *much* better here both in on-line and off-line operation.

It's misdirection to say that, because ZFS is intended to keep its pool always consistent, there are no inconsistencies possible, and no way to repair them. Almost every file system has adopted journaling for at least its metadata, which is a time-honored way to keep consistency; but almost every file system has a repair utility for when the journal is damaged or the file system is damaged in some other way. I haven't heard of a NetApp box (with its tree-structured WAFL system) suddenly making all of its data permanently inaccessible because of a disk error or software bug, but I have heard of them requiring file system repair on rare occasions.

I've described before a number of checks which ZFS could perform, and the repair operations possible. I'll add a couple more. ZFS could keep track of where its internal nodes are stored, perhaps using a bitmap journaled in a traditional way or perhaps using the ZIL; this would make recovery of individual files much easier in the event of total file system loss. ZFS could segregate data and metadata sufficiently to make it easy to identify its metadata, or use self-checksums in additional areas, which would allow much of a filesystem to be reconstructed even if top-level metadata were corrupted.

Every file system needs a repair utility, even if the only expected use case is for the elephant tripping over the fibre cables.
-- This message posted from opensolaris.org
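For readers following along, the online checking that does exist today is driven entirely from zpool(1M); this is only the detection and self-healing path the poster above finds insufficient, not the offline repair utility being argued for. The pool name usbhdd1 is just the example used earlier in the thread:

    zpool scrub usbhdd1        # walk every block and verify checksums, repairing from redundancy where possible
    zpool status -v usbhdd1    # list any files with permanent (unrepairable) errors
    zpool clear usbhdd1        # reset the error counters once the cause has been dealt with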
Uwe Dippel wrote:
> We have seen some unfortunate miscommunication here, and misinterpretation. This extends into differences of culture. One of the vocal persons in here is surely not 'Anti-xyz'; rather I sense his intense desire to further the progress by pointing his finger to some potential wounds.

I really don't have a dog in this fight, but I think what we've seen here is the behavior of a person who is too lazy to read the manual, unable to understand the technology they are working with, and unwilling to face the consequences of their own behavior. As the Solaris user base increases, though, the number of people like this will increase. The general population do not read the manuals, nor do they care how the magic box works; they just want it to work. This is entirely appropriate for a business user who is using the computer as a means to an end. They have their area of expertise, which isn't computers. Of course, it really isn't appropriate for a system administrator, so I can't generate a lot of sympathy for DE personally, especially after the manner in which he has behaved in this thread.

Turning Solaris into something that can be used with the same amount of thought as a toaster is one of the challenges facing Sun and the community in the future. Designing guards to prevent the ignorant from harming themselves is a challenge (see quote below).

"There are 2 things that are infinite in this world, the universe and human stupidity. I'm not sure about the first one" - Albert Einstein

Regards,
Greg
> I'm rather tired of hearing this mantra.
> [...]
> Every file system needs a repair utility

Hey, wait a minute -- that's a mantra too!

I don't think there's actually any substantive disagreement here -- stating that one doesn't need a separate program called /usr/sbin/fsck is not the same as saying that filesystems don't need error detection and recovery. There's quite a bit of that in the current code, and more in the works. Like performance, it is never really "done" -- we can always do better.

> I've described before a number of checks which ZFS could perform [...]

Well, ZFS is open source. I would love to see your passion for this topic directed at the source code. Seriously.

Jeff
> Mario Goebbels wrote:
> >> The good news is that ZFS is getting popular enough on consumer-grade
> >> hardware. The bad news is that said hardware has a different set of
> >> failure modes, so it takes a bit of work to become resilient to them.
> >> This is pretty high on my short list.
> >
> > One thing I'd like to see is an _easy_ option to fall back onto older
> > uberblocks when the zpool went belly up for a silly reason. Something
> > that doesn't involve esoteric parameters supplied to zdb.
>
> This is CR 6667683
> http://bugs.opensolaris.org/view_bug.do?bug_id=6667683

I think that would solve 99% of ZFS corruption problems! Is there any ETA for this patch?

tnx
gino
-- This message posted from opensolaris.org
> >>>>> "g" == Gino <dandr.ch at gmail.com> writes:
>
> g> we lost many zpools with multimillion$ EMC, Netapp and
> g> HDS arrays just simulating fc switches power fails.
> g> The problem is that ZFS can't properly recover itself.
>
> I don't like what you call ``the problem''---I think it assumes too
> much. You mistake *A* fix for *THE* problem, before we can even agree
> for sure on, what is the problem. The problem may be in the solaris
> FC initiator, in a corner case of the FC protocol itself, or in ZFS's
> exception handling when a ``SYNCHRONIZE CACHE'' command returns
> failure.
>
> It's likely other filesystems are affected by ``the problem'' as I
> define it, just much less so. If that's the case, it'd be much better
> IMHO to fix the real problem once and for all, and find it so that it
> stays fixed, than to make ZFS work around it by losing a tiny bit of
> data instead of the whole pool. I don't think ZFS should feel
> entitled to brag about protection from Silent Corruption, if it were
> at the same time willing to silently boot without a slog, or silently
> rollback to an earlier ueberblock, or if it acts like a cheap USB
> stick when an FC switch reboots (by quietly losing things that were
> written long ago).

I agree, but I'd like to point out that the MAIN problem with ZFS is that because of a corruption you'll lose ALL your data and there is no way to recover it. Consider an example where you have 100TB of data and an FC switch fails or some other hardware problem happens during I/O on a single file. With UFS you'll probably get corruption on that single file. With ZFS you'll lose all your data. I totally agree that ZFS is theoretically much much much much much better than UFS, but in real world applications having the risk of losing access to an entire pool is not acceptable.

gino
-- This message posted from opensolaris.org
> > This is CR 6667683
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6667683
>
> I think that would solve 99% of ZFS corruption problems!

Based on the reports I've seen to date, I think you're right.

> Is there any ETA for this patch?

Well, because of this thread, this has gone from "on my list" to "I'm currently working on it." And I'd like to take a moment to thank everyone who's weighed in, because it really does make a difference in setting priorities.

As for a date, I would estimate "weeks, not months".

Jeff
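The work tracked by CR 6667683 eventually surfaced, in builds later than the ones discussed here, as a recovery mode on 'zpool import'. A rough sketch of that interface, assuming a build that includes it; the pool name is again just the example from this thread:

    zpool import -F -n usbhdd1   # dry run: report how far the pool would be rewound, change nothing
    zpool import -F usbhdd1      # discard the last few transactions and import from an older, consistent uberblock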
[Still waiting for answers on my earlier questions]

So I take it that ZFS solves one problem perfectly well: integrity of data blocks. It uses checksums and atomic writes for this purpose, and as far as I can follow this list, nobody has ever had any problems in this respect.

However, it also - at least to me - looks like there is a chance that you end up holding a disk with 100% correct data blocks but no way to retrieve a single one, under the unfortunate circumstance that the semantics of these blocks is lost. From what I can gather here, and correct me if I am wrong, the problem lies not so much with the individual file system to which these 100% correct blocks belong as with the overall structure of those filesystems.

If this is the case, a copy/mirror like the one used in FAT32 might be one solution, though maybe not the most elegant one. Could another approach be to provide each file system with a (virtual) self-contained, basic pool to which it belongs, and from which it could be recovered? A pool that is over-ruled by the existence of a consistent higher-level pool (the one that the user has created and interacts with)? I concede that these might be impossible one way or another, but conceptually at least, a fall-back pool is thinkable.

Nobody expects consistency of a file that sees the drive yanked while writing is going on. But an 'atomic' update before and after could be useful; one that propagates through to the upper level, so that the state of the pool is consistent at any moment, with or without the changes of the underlying file system.
-- This message posted from opensolaris.org
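Worth noting alongside the idea above: ZFS already keeps multiple "ditto" copies of its own metadata, and the copies property extends that idea to file data on a single disk. This is only a partial answer -- it guards against localized block damage, not against losing the whole device or the top-level pool metadata -- and the dataset name below is just an example:

    zfs create -o copies=2 usbhdd1/important   # keep two copies of every data block for this dataset
    zfs get copies usbhdd1/important           # confirm the setting
    zpool scrub usbhdd1                        # verification pass that can heal from the extra copies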
> > > This is CR 6667683
> > > http://bugs.opensolaris.org/view_bug.do?bug_id=6667683
> >
> > I think that would solve 99% of ZFS corruption problems!
>
> Based on the reports I've seen to date, I think you're right.
>
> > Is there any ETA for this patch?
>
> Well, because of this thread, this has gone from "on my list" to
> "I'm currently working on it." And I'd like to take a moment to
> thank everyone who's weighed in, because it really does make a
> difference in setting priorities.
>
> As for a date, I would estimate "weeks, not months".

Excellent news!
-- This message posted from opensolaris.org
On 2/10/2009 3:37 PM, D. Eckert wrote:
> (...)
> Possibly so. But if you had that ufs/reiserfs on a LVM or on a RAID0
> spanning removable drives, you probably wouldn't have been so lucky.
> (...)
>
> we are not talking about a RAID 5 array or an LVM. We are talking about a single FS setup as a zpool over the entire available disk space on an external USB HDD.

Ok, then the parallel on Linux would still be something like running reiserfs on a single-disk LVM (which I think Red Hat still installs with by default?). And my real point is that with ZFS, even though you only want a single FS on a single disk, you can't treat it like the LVM/RAID level of software isn't there just because you only have one disk. It is still there, and you need to understand its commands and how to use them when you want to disconnect the disk.

> I decided to do so due to the read/write speed performance of zfs compared to UFS/ReiserFS.

That's fine. If you have reasons to use a single disk that option is still available. Again, that doesn't mean you can treat it like a FS on a raw device.

-Kyle

> Regards,
>
> DE.
On 2/10/2009 4:48 PM, Roman V. Shaposhnik wrote:
> On Wed, 2009-02-11 at 09:49 +1300, Ian Collins wrote:
>> These posts do sound like someone who is blaming their parents after
>> breaking a new toy before reading the instructions.
>
> It looks like there's a serious denial of the fact that "bad things
> do happen to even the best of people" on this thread.

No one is denying that that can happen. However, there are many things that were done here that increased the chance (or things that weren't done that could have decreased the chance) of this happening. I'm not saying the OP should have known better. Everyone learns from mistakes. I'm just trying to explain to him both why what happened might have happened, and what he could have done that might have avoided it.

Is it still possible that something like this could have happened? Sure. Should there be a better way to handle it when it does? You bet!

-Kyle

> Thanks,
> Roman.
dick hoogendijk
2009-Feb-11 14:42 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Tue, 10 Feb 2009 21:43:00 PST Uwe Dippel <udippel at gmail.com> wrote:
> Back to where I started from, with some questions:
> 1. Can the relevant people confirm that drives might turn dead when
> leaving a pool at unfortunate moments? Despite of complete physical
> integrity?

I have not experienced this. I -DID- experience a dead UFS-formatted (USB) drive once when I unplugged it without unmounting it first. (Shit can happen.) The filesystem was beyond repair. I had to reformat the drive. Never complained though. It -was- my fault ;-) With ZFS I mirror all my drives.

-- Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS sxce snv107 ++
+ All that's really worth doing is what we do for others (Lewis Carroll)
David Dyer-Bennet
2009-Feb-11 15:08 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Tue, February 10, 2009 23:43, Uwe Dippel wrote:
> 1. Can the relevant people confirm that drives might turn dead when
> leaving a pool at unfortunate moments? Despite of complete physical
> integrity? [I'd really appreciate an answer here, because this is what I
> am starting to implement here: ZFS on USB drives.]
> 2. Are those drives in unrecoverable state passing their
> integrity/diagnosis tests (r/w)?
> 3. If what has been mentioned, that a pool is an entity like RAID in
> between and hurting the pool might as well destruct data, if this is the
> case, can this destruction of a pool not also happen within the confines
> of a server, without any physical yanking of the drive, by a dying
> controller?

Seems like a power failure, controller failure, or processor failure could all produce the equivalent of yanking a USB cable. As could a cat knocking an external drive off the desk :-). All of those things are real-world issues that we must contend with. The two hardware failures are entirely possible even in a top-end commercial machine-room installation. (For that matter, the power failure is, too; I've seen places where the UPS came on fine when the power failed, and the generator cut in fine before the UPS failed... and then the automatic fail-BACK failed, and everything went dark when the generator ran out of fuel).

I confess to not being adequately reassured right now that my external USB backup disks are reasonably secure.

This all-or-nothing behavior of ZFS pools is kinda scary. Turns out I'd rather have 99% of my data than 0% -- who knew? :-) I'd much rather have 100.00% than either of course, and I'm running ZFS with mirroring, and doing regular backups, because of that.

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On 11-Feb-09, at 10:08 AM, David Dyer-Bennet wrote:
> On Tue, February 10, 2009 23:43, Uwe Dippel wrote:
>> 1. Can the relevant people confirm that drives might turn dead when
>> leaving a pool at unfortunate moments? Despite of complete physical
>> integrity? [I'd really appreciate an answer here, because this is what I
>> am starting to implement here: ZFS on USB drives.]
>> 2. Are those drives in unrecoverable state passing their
>> integrity/diagnosis tests (r/w)?
>> 3. If what has been mentioned, that a pool is an entity like RAID in
>> between and hurting the pool might as well destruct data, if this is the
>> case, can this destruction of a pool not also happen within the confines
>> of a server, without any physical yanking of the drive, by a dying
>> controller?
>
> Seems like a power failure, controller failure, or processor failure could
> all produce the equivalent of yanking a USB cable. As could a cat
> knocking an external drive off the desk :-). All of those things are
> real-world issues that we must contend with.

And journaled/transactional systems are designed to deal with that just fine. The exception was clearly noted by Jeff.

> The two hardware failures
> are entirely possible even in a top-end commercial machine-room
> installation. (For that matter, the power failure is, too; I've seen
> places where the UPS came on fine when the power failed, and the generator
> cut in fine before the UPS failed... and then the automatic fail-BACK
> failed, and everything went dark when the generator ran out of fuel).

Yes, this happens in *every* data centre eventually. Data centres are also subject to many of the usual human errors.

--Toby
On Tue, Feb 10, 2009 at 11:44 PM, Fredrich Maney <fredrichmaney at gmail.com> wrote:
> Ah... an illiterate AND idiotic bigot. Have you even read the manual
> or *ANY* of the replies to your posts? *YOU* caused the situation that
> resulted in your data being corrupted. Not Sun, not OpenSolaris, not
> ZFS and not anyone on this list. Yet you feel the need to blame ZFS
> and insult the people that have been trying to help you understand
> what happened and why you shouldn't do what you did.

#1 English is clearly not his native tongue. Calling someone idiotic and illiterate when they're doing as well as he is in a second language is not only inaccurate, it's "idiotic".

> ZFS is not a filesystem like UFS or Reiserfs, nor is it an LVM like
> SVM or VxVM. It is both a filesystem and a logical volume manager. As
> such, like all LVM solutions, there are two steps that you must
> perform to safely remove a disk: unmount the filesystem and quiesce
> the volume. That means you *MUST*, in the case of ZFS, issue 'umount
> filesystem' *AND* 'zpool export' before you yank the USB stick out of
> the machine.
>
> Effectively what you did was create a one-sided mirrored volume with
> one filesystem on it, then put your very important (but not important
> enough to bother mirroring or backing up) data on it. Then you
> unmounted the filesystem and ripped the active volume out of the
> machine. You got away with it a couple of times because of just how good
> a job the ZFS developers did at idiot proofing it, but when it
> finally got to the point where you lost your data, you came here to
> bitch and point fingers at everyone but the responsible party (hint,
> it's you). When your ignorance (and fault) was pointed out to you, you
> then resorted to personal attacks and slurs. Nice. Very professional.
> Welcome to the bit-bucket.

All that and yet the fact remains: I've never "ejected" a USB drive from OS X or Windows, I simply pull it and go, and I've never once lost data, or had it become unrecoverable or even corrupted.

And yes, I do keep checksums of all the data sitting on them and periodically check it. So, for all of your ranting and raving, the fact remains even a *crappy* filesystem like fat32 manages to handle a hot unplug without any prior notice without going belly up.

--Tim
Tim;

The proper procedure for ejecting a USB drive in Windows is to right-click the device icon and eject the appropriate listed device. I've done this before without ejecting and lost data.

My personal experience with ZFS is that it is a very reliable FS. I've not lost data on it yet, even after several hardware upgrades, abrupt failures and recently an unofficial, unsanctioned expansion technique.

The folks at Sun who developed this earnestly believe in their product. Sometimes, these beliefs can translate to an uneven reply. For my own reasons, I too believe wholeheartedly in ZFS. (I don't work at Sun nor do I own any shares in Sun.)

Perhaps we can all work together and find the proper solution here. Logic dictates that ZFS can survive an abrupt failure far better than a traditional VM/FS combination. The end-to-end checksumming simply does not exist in traditional methodologies.

Could you describe in detail the kind of IO access you were generating prior to pulling out the USB?

Warmest Regards
Steven Sim

Tim wrote:
[...]
On Wed, Feb 11, 2009 at 10:33 AM, Steven Sim <unixandme at gmail.com> wrote:
> Tim;
>
> The proper procedure for ejecting a USB drive in Windows is to right-click
> the device icon and eject the appropriate listed device.

I'm well aware of what the proper procedure is. My point is, I've done it for years without, for various reasons, and never lost data.

> I've done this before without ejecting and lost data.

Congratulations? You're honestly the first person I've *EVER* heard of losing data from it. Now if we're talking Windows 98 with its beta support of USB, that's another story entirely. But anything from XP on... that takes an awful lot of work.

> My personal experience with ZFS is that it is a very reliable FS. I've not
> lost data on it yet, even after several hardware upgrades, abrupt failures
> and recently an unofficial, unsanctioned expansion technique.
>
> The folks at Sun who developed this earnestly believe in their product.
> Sometimes, these beliefs can translate to an uneven reply.
>
> For my own reasons, I too believe wholeheartedly in ZFS. (I don't work at
> Sun nor do I own any shares in Sun.)
>
> Perhaps we can all work together and find the proper solution here.
>
> Logic dictates that ZFS can survive an abrupt failure far better than a
> traditional VM/FS combination. The end-to-end checksumming simply does not
> exist in traditional methodologies.

But it doesn't, and that's the problem.

> Could you describe in detail the kind of IO access you were generating
> prior to pulling out the USB?

I personally wouldn't even think of putting ZFS on a USB drive. There's someone posting here weekly about losing data to ZFS on a USB solution, no thanks. Not only that, the complete lack of cross-platform support makes it essentially useless in my world. I would like to believe it has more to do with Solaris's support of USB than ZFS, but the fact remains it's a pretty glaring deficiency in 2009, no matter which part of the stack is at fault.

--Tim
(...) Good. It looks like this thread can finally die. I received the following in response to my message below: (...)

I apologize that your eMail could not be delivered. This is because either the mail server you use is considered to be part of a dynamic IP pool, or your mail server is blacklisted somewhere on official lists. Please check the IP of your mail server, e.g. with SpamCop or Spamhaus.

Regards.
-- This message posted from opensolaris.org
Bob Friesenhahn
2009-Feb-11 16:49 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, 11 Feb 2009, David Dyer-Bennet wrote:
> This all-or-nothing behavior of ZFS pools is kinda scary. Turns out I'd
> rather have 99% of my data than 0% -- who knew? :-) I'd much rather have
> 100.00% than either of course, and I'm running ZFS with mirroring, and
> doing regular backups, because of that.

It seems to me that this level of terror is getting out of hand. I am glad to see that you made it to work today since statistics show that you might have gotten into a deadly automobile accident on the way to the office and would no longer care about your data. In fact, quite a lot of people get in serious automobile accidents, yet we rarely hear such levels of terror regarding taking a drive in an automobile.

Most people are far more afraid of taking a plane flight than taking a drive in their car, even though taking a drive in their car is far more risky.

It is best to put risks in perspective. People are notoriously poor at evaluating risks and paranoia is often the result.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
(...) Ah... an illiterate AND idiotic bigot. (...)

I apologize for my poor English. Yes, it's not my mother tongue, but I have no doubt at all that this discussion could be continued in German as well.

But just to make it clear: in the end I understood very well where I went wrong. But it wasn't something I expected. Because I was using a single zpool with no other filesystems inside, I thought that unmounting it with the command 'zfs umount usbhdd1' and checking that usbhdd1 was no longer shown in the output of 'mount' (it wasn't) meant the pool was cleanly unmounted and there was no risk in yanking the USB wire.

Even from a logical point of view: if 'zpool export usbhdd1' releases the entire pool from the system, then 'zfs umount usbhdd1' should do the same when no other filesystem exists inside this particular pool. If the output of the mount command no longer shows your zfs pool, what else should be left to unmount?

This is just what caused confusion on my side, and that's human, but I have learned for the future.

Regards.
-- This message posted from opensolaris.org
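For anyone who lands on this thread with the same expectation, a minimal sketch of moving a single-disk pool between machines (pool name as in the earlier posts):

    # On the machine the disk is leaving:
    zpool export usbhdd1      # unmounts every dataset in the pool AND releases the pool itself
    # ... only now is it safe to unplug the USB cable ...

    # On the machine the disk arrives at:
    zpool import              # list pools that are visible but not yet imported
    zpool import usbhdd1      # import (and mount) the pool
    zpool status -v usbhdd1   # sanity check after the move

'zfs umount' on its own only removes the mount point; the pool stays imported and owned by the first host.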
Bob Friesenhahn
2009-Feb-11 17:21 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, 11 Feb 2009, Tim wrote:
> All that and yet the fact remains: I've never "ejected" a USB drive from OS
> X or Windows, I simply pull it and go, and I've never once lost data, or had
> it become unrecoverable or even corrupted.
>
> And yes, I do keep checksums of all the data sitting on them and
> periodically check it. So, for all of your ranting and raving, the fact
> remains even a *crappy* filesystem like fat32 manages to handle a hot unplug
> without any prior notice without going belly up.

This seems like another one of your trolls. Any one of us who has used USB drives under OS-X or Windows knows that the OS complains quite a lot if you just unplug the drive, so we all learn how to do things properly.

You must have very special data if you compute independent checksums for each one of your files, and it leaves me wondering why you think that data is correct due to being checksummed. Checksumming incorrect data does not make that data correct.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 11-Feb-09, at 11:19 AM, Tim wrote:
> ...
> And yes, I do keep checksums of all the data sitting on them and
> periodically check it. So, for all of your ranting and raving, the
> fact remains even a *crappy* filesystem like fat32 manages to
> handle a hot unplug without any prior notice without going belly up.

By chance, certainly not design.

--Toby
On 2/11/2009 12:35 PM, Toby Thain wrote:
> On 11-Feb-09, at 11:19 AM, Tim wrote:
>> ...
>> And yes, I do keep checksums of all the data sitting on them and
>> periodically check it. So, for all of your ranting and raving, the
>> fact remains even a *crappy* filesystem like fat32 manages to handle
>> a hot unplug without any prior notice without going belly up.
>
> By chance, certainly not design.

Yep. I've never unplugged a USB drive on purpose, but I have left a drive plugged into the docking station, hibernated Windows XP Professional, undocked the laptop, and then woken it up later undocked. It routinely would pop up windows saying that a 'delayed write' was not successful on the now missing drive.

I've always counted myself lucky that any new data written to that drive was written long, long before I hibernated, because I have yet to find any problems with that data (but I don't read it very often, if at all). But it is luck only!

-Kyle
David Dyer-Bennet
2009-Feb-11 18:11 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 11:21, Bob Friesenhahn wrote:
> On Wed, 11 Feb 2009, Tim wrote:
>> All that and yet the fact remains: I've never "ejected" a USB drive from OS
>> X or Windows, I simply pull it and go, and I've never once lost data, or had
>> it become unrecoverable or even corrupted.
>>
>> And yes, I do keep checksums of all the data sitting on them and
>> periodically check it. So, for all of your ranting and raving, the fact
>> remains even a *crappy* filesystem like fat32 manages to handle a hot unplug
>> without any prior notice without going belly up.
>
> This seems like another one of your trolls. Any one of us who has
> used USB drives under OS-X or Windows knows that the OS complains
> quite a lot if you just unplug the drive, so we all learn how to do
> things properly.

Then again, I've never lost data during the learning period, nor on the rare occasions where I just get it wrong. This is good; not quite remembering to eject a USB memory stick is *so* easy.

We do all know why violating protocols here works so much of the time, right? It's because Windows is using very simple, old-fashioned strategies to write to the USB devices. Write caching is nonexistent, or of very short duration, for example. So if IO to the device has quiesced, and it's been several seconds since the last IO, it's nearly certain to be safe to just pull it. Nearly.

ZFS is applying much more modern, much more aggressive, optimizing strategies. This is entirely good; ZFS is intended for a space where that's important a lot of the time. But one tradeoff is that those rules become more important.

> You must have very special data if you compute independent checksums
> for each one of your files, and it leaves me wondering why you think
> that data is correct due to being checksummed. Checksumming incorrect
> data does not make that data correct.

Can't speak for him, but I have par2 checksums and redundant data for lots of my old photos on disk. I created them before writing archival optical disks of the data, to give me some additional hope of recovering the data in the long run.

I don't, in fact, know that most of those photos are actually valid data; only the ones I've viewed after creating the par2 checksums (and I can't rule out weird errors that don't corrupt the whole rest of the image even then). Still, once I've got the checksum on file, I can at least determine that I've had a disk error in many cases (not quite identical to determining that the data is still valid; after all, the data and the checksum could have been corrupted in such a way that I get a false positive on the checksum).

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
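For the curious, a rough sketch of the kind of par2 workflow described above, assuming the stock par2cmdline tool; the file names and the 10% redundancy figure are just examples, not what David actually used:

    par2 create -r10 album.par2 *.jpg   # store checksums plus ~10% recovery blocks next to the photos
    par2 verify album.par2              # later: detect silent corruption in any of the files
    par2 repair album.par2              # attempt reconstruction from the recovery blocks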
David Dyer-Bennet
2009-Feb-11 18:12 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 11:35, Toby Thain wrote:
> On 11-Feb-09, at 11:19 AM, Tim wrote:
>> ...
>> And yes, I do keep checksums of all the data sitting on them and
>> periodically check it. So, for all of your ranting and raving, the
>> fact remains even a *crappy* filesystem like fat32 manages to
>> handle a hot unplug without any prior notice without going belly up.
>
> By chance, certainly not design.

No, I do think it's by design -- it's because the design isn't aggressively exploiting possible performance.

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
David Dyer-Bennet
2009-Feb-11 18:21 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 10:49, Bob Friesenhahn wrote:
> On Wed, 11 Feb 2009, David Dyer-Bennet wrote:
>> This all-or-nothing behavior of ZFS pools is kinda scary. Turns out I'd
>> rather have 99% of my data than 0% -- who knew? :-) I'd much rather have
>> 100.00% than either of course, and I'm running ZFS with mirroring, and
>> doing regular backups, because of that.
>
> It seems to me that this level of terror is getting out of hand. I am
> glad to see that you made it to work today since statistics show that
> you might have gotten into a deadly automobile accident on the way to
> the office and would no longer care about your data. In fact, quite a
> lot of people get in serious automobile accidents yet we rarely hear
> such levels of terror regarding taking a drive in an automobile.
>
> Most people are far more afraid of taking a plane flight than taking a
> drive in their car, even though taking a drive in their car is far
> more risky.
>
> It is best to put risks in perspective. People are notoriously poor
> at evaluating risks and paranoia is often the result.

All true (and I'm certainly glad I made it to work myself; I did drive, which is one of the most dangerous things most people do). I think you're overstating my terror level, though; I'd say I'm at yellow; not even orange.

I've spent $2000 on hardware and, by now, hundreds of hours of my time trying to get and keep a ZFS-based home NAS working. Because it's the only affordable modern practice, my backups are on external drives (USB drives because that's "the" standard for consumer external drives; they were much cheaper when I bought them than any that supported Firewire at the 1TB size). So hearing how easy it is to muck up a ZFS pool on USB is leading me, again, to doubt this entire enterprise.

Am I really better off than I would be with an Infrant ReadyNAS, or a Drobo? I'm certainly far behind financially and with my time.

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
Bob Friesenhahn
2009-Feb-11 18:23 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, 11 Feb 2009, David Dyer-Bennet wrote:
> Then again, I've never lost data during the learning period, nor on the
> rare occasions where I just get it wrong. This is good; not quite
> remembering to eject a USB memory stick is *so* easy.

With Windows and OS-X, it is up to the *user* to determine if they have lost data. This is because they are designed to be user-friendly operating systems. If the disk can be loaded at all, Windows and OS-X will just go with what is left. If Windows and OS-X started to tell users that they lost some data, then those users would be in a panic (just like we see here).

The whole notion of "journaling" is to intentionally lose data by rolling back to a known good point. More data might be lost than if the task was left to a tool like 'fsck', but the journaling approach is much faster. Windows and OS-X are highly unlikely to inform you that some data was lost due to the filesystem being rolled back.

Your comments about write caching being a factor seem reasonable.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
David Dyer-Bennet
2009-Feb-11 18:38 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 12:23, Bob Friesenhahn wrote:
> On Wed, 11 Feb 2009, David Dyer-Bennet wrote:
>> Then again, I've never lost data during the learning period, nor on the
>> rare occasions where I just get it wrong. This is good; not quite
>> remembering to eject a USB memory stick is *so* easy.
>
> With Windows and OS-X, it is up to the *user* to determine if they
> have lost data. This is because they are designed to be user-friendly
> operating systems. If the disk can be loaded at all, Windows and OS-X
> will just go with what is left. If Windows and OS-X started to tell
> users that they lost some data, then those users would be in a panic
> (just like we see here).

I don't carry much on my memory stick -- mostly stuff in transit from one place to another. Two things that live there constantly are my encrypted password database, and some private keys (encrypted under passphrases). So the stuff on the memory stick tends to get looked at, and the stuff that lives there is in a format where corruption is very likely to get noticed.

So while I can't absolutely swear that I never lost data I didn't notice losing, I'm fairly confident that no data was lost. And I'm absolutely sure no data THAT I CARED ABOUT was lost, which is all that really matters.

> The whole notion of "journaling" is to intentionally lose data by
> rolling back to a known good point. More data might be lost than if
> the task was left to a tool like 'fsck', but the journaling approach is
> much faster. Windows and OS-X are highly unlikely to inform you that
> some data was lost due to the filesystem being rolled back.

True about journaling. This applies to NTFS disks for Windows, but not to FAT systems (which aren't journaled); and memory sticks for me are always FAT systems.

Databases have something of an all-or-nothing problem as well, for that matter, and for something of the same reasons.

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On February 11, 2009 12:21:03 PM -0600 David Dyer-Bennet <dd-b at dd-b.net> wrote:
> I've spent $2000 on hardware and, by now, hundreds of hours of my time
> trying to get and keep a ZFS-based home NAS working. Because it's the
> only affordable modern practice, my backups are on external drives (USB
> drives because that's "the" standard for consumer external drives, they
> were much cheaper when I bought them than any that supported Firewire at
> the 1TB size). So hearing how easy it is to muck up a ZFS pool on USB is
> leading me, again, to doubt this entire enterprise.

Same here, except I have no doubts. As I only use the USB for backup, I'm quite happy with it. I have a 4-disk enclosure that accepts SATA drives. My main storage is a 12-bay SAS/SATA enclosure.

After my own experience with USB (I still have the problem that I cannot create new pools while another USB drive is present with a zpool on it, whether or not that zpool is active ... no response on that thread yet and I expect never), I'm not thrilled with it and suspect some of the problem lies in the way that USB is handled differently than other physical connections (can't use 'format', e.g.). Anyway, to get back to the point, I wouldn't want to use it for primary storage, even if it were only 2 drives. That's unfortunate, but in line with Solaris' hardware support, historically.

-frank
Fredrich Maney
2009-Feb-11 19:32 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, Feb 11, 2009 at 11:19 AM, Tim <tim at tcsac.net> wrote:
> On Tue, Feb 10, 2009 at 11:44 PM, Fredrich Maney <fredrichmaney at gmail.com> wrote:
>> Ah... an illiterate AND idiotic bigot. Have you even read the manual
>> or *ANY* of the replies to your posts? *YOU* caused the situation that
>> resulted in your data being corrupted. Not Sun, not OpenSolaris, not
>> ZFS and not anyone on this list. Yet you feel the need to blame ZFS
>> and insult the people that have been trying to help you understand
>> what happened and why you shouldn't do what you did.
>
> #1 English is clearly not his native tongue. Calling someone idiotic and
> illiterate when they're doing as well as he is in a second language is not
> only inaccurate, it's "idiotic".

I have a great deal of respect for his command of more than one language. What I don't have any respect for is his complete unwillingness to actually read the dozens of responses that have all said the same thing, namely that his problems are self-inflicted due to his refusal to read the documentation. I refrained from calling him an idiot until after he proved himself one by spewing his blind bigotry against the US. All in all, I'd say he got far better treatment than he gave and infinitely better than he deserved.

>> ZFS is not a filesystem like UFS or Reiserfs, nor is it an LVM like
>> SVM or VxVM. It is both a filesystem and a logical volume manager. As
>> such, like all LVM solutions, there are two steps that you must
>> perform to safely remove a disk: unmount the filesystem and quiesce
>> the volume. That means you *MUST*, in the case of ZFS, issue 'umount
>> filesystem' *AND* 'zpool export' before you yank the USB stick out of
>> the machine.
>>
>> Effectively what you did was create a one-sided mirrored volume with
>> one filesystem on it, then put your very important (but not important
>> enough to bother mirroring or backing up) data on it. Then you
>> unmounted the filesystem and ripped the active volume out of the
>> machine. You got away with it a couple of times because of just how good
>> a job the ZFS developers did at idiot proofing it, but when it
>> finally got to the point where you lost your data, you came here to
>> bitch and point fingers at everyone but the responsible party (hint,
>> it's you). When your ignorance (and fault) was pointed out to you, you
>> then resorted to personal attacks and slurs. Nice. Very professional.
>> Welcome to the bit-bucket.
>
> All that and yet the fact remains: I've never "ejected" a USB drive from OS
> X or Windows, I simply pull it and go, and I've never once lost data, or had
> it become unrecoverable or even corrupted.

You've been lucky then. I've lost data and had corrupted filesystems on USB sticks on both of those OSes, as well as several Linux and BSD variants, from doing just that.

[...]

fpsm
On February 11, 2009 2:07:47 AM -0800 Gino <dandr.ch at gmail.com> wrote:
> I agree but I'd like to point out that the MAIN problem with ZFS is that
> because of a corruption you'll lose ALL your data and there is no way to
> recover it. Consider an example where you have 100TB of data and an FC
> switch fails or some other hardware problem happens during I/O on a single
> file. With UFS you'll probably get corruption on that single file. With
> ZFS you'll lose all your data. I totally agree that ZFS is theoretically
> much much much much much better than UFS but in real world applications
> having the risk of losing access to an entire pool is not acceptable.

if you have 100TB of data, wouldn't you have a completely redundant storage network -- dual FC switches on different electrical supplies, etc. i've never designed or implemented a storage network before but such designs seem common in the literature and well supported by Solaris. i have done such designs with data networks and such redundancy is quite common.

i mean, that's a lot of data to go missing due to a single device failing -- which it will.

not to say it's not a problem with zfs, just that in the real world, it should be mitigated since your storage network design would overcome a single failure *anyway* -- regardless of zfs.

-frank
David Dyer-Bennet wrote:
> I've spent $2000 on hardware and, by now, hundreds of hours of my time
> trying to get and keep a ZFS-based home NAS working.

Hundreds of hours doing what? I just plugged in the drives, built the pool and left the box in a corner for the past couple of years. It's been upgraded twice, from build 62 to 72 to get the SATA framework and then to b101 for CIFS.

-- Ian.
Thommy M. Malmström
2009-Feb-11 20:01 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> after working for 1 month with ZFS on 2 external USB
> drives I have experienced, that the all new zfs
> filesystem is the most unreliable FS I have ever
> seen.

Troll.
-- This message posted from opensolaris.org
David Dyer-Bennet
2009-Feb-11 20:01 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 13:45, Ian Collins wrote:
> David Dyer-Bennet wrote:
>> I've spent $2000 on hardware and, by now, hundreds of hours of my time
>> trying to get and keep a ZFS-based home NAS working.
>
> Hundreds of hours doing what? I just plugged in the drives, built the
> pool and left the box in a corner for the past couple of years. It's
> been upgraded twice, from build 62 to 72 to get the SATA framework and
> then to b101 for CIFS.

Well, good for you. It took me a lot of work to get it working in the first place (and then with only 4 of my 8 hot-swap bays and 4 of my 6 eSATA connections on the motherboard working). Before that, I'd spent quite a lot of time trying to get VMWare to run Solaris, which it wouldn't back then. I did manage to get Parallels, I think it was, to let me create a Solaris system and then a ZFS pool to play with (this was back before OpenSolaris and before any sort of LiveCD I could find).

Then I had a series of events starting in December of last year that, in hindsight, I think were mainly or entirely one memory SIMM going bad, which caused me to upgrade to 2008.11 and also to restore my main pool from backup. Oh, and I converted from using Samba to using CIFS. I'm just now getting close to having things up and working again usably and stably, and am still working on backup. I do still have some problems with file access permissions, I know, due to the new different handling of ACLs I guess.

And I wasn't a Solaris admin to begin with. I guess SunOS back when was the first Unix I had root on, but since then I've mostly worked with Linux (including my time as news admin for a local ISP, and my years as an engineer with Sun, where I was in the streaming video server group). In some ways a completely UNfamiliar system might have been easier :-).

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On Wed, Feb 11, 2009 at 11:46 AM, Kyle McDonald <KMcDonald at egenera.com> wrote:
> Yep. I've never unplugged a USB drive on purpose, but I have left a drive
> plugged into the docking station, hibernated Windows XP Professional,
> undocked the laptop, and then woken it up later undocked. It routinely would
> pop up windows saying that a 'delayed write' was not successful on the now
> missing drive.
>
> I've always counted myself lucky that any new data written to that drive
> was written long, long before I hibernated, because I have yet to find any
> problems with that data (but I don't read it very often, if at all). But it
> is luck only!
>
> -Kyle

Right, except the OP stated he unmounted the filesystem in question, and it was the *ONLY* one on the drive, meaning there is absolutely 0 chance of there being pending writes. There's nothing to write to.

I don't know what exactly it is you put on your USB drives, but I'm certainly aware of whether or not things on mine are in use before pulling the drive out. If a picture is open and in an editor, I'm obviously not going to save it then pull the drive mid-save.

--Tim
On Wed, Feb 11, 2009 at 1:36 PM, Frank Cusack <fcusack at fcusack.com> wrote:
> if you have 100TB of data, wouldn't you have a completely redundant
> storage network -- dual FC switches on different electrical supplies,
> etc. i've never designed or implemented a storage network before but
> such designs seem common in the literature and well supported by
> Solaris. i have done such designs with data networks and such
> redundancy is quite common.
>
> i mean, that's a lot of data to go missing due to a single device
> failing -- which it will.
>
> not to say it's not a problem with zfs, just that in the real world,
> it should be mitigated since your storage network design would overcome
> a single failure *anyway* -- regardless of zfs.

It's hardly uncommon for an entire datacenter to go down, redundant power or not. When it does, if it means I have to restore hundreds of terabytes if not petabytes from tape instead of just restoring the files that were corrupted or running an fsck, we've got issues.

--Tim
On February 11, 2009 3:02:48 PM -0600 Tim <tim at tcsac.net> wrote:
> On Wed, Feb 11, 2009 at 1:36 PM, Frank Cusack <fcusack at fcusack.com> wrote:
>> if you have 100TB of data, wouldn't you have a completely redundant
>> storage network -- dual FC switches on different electrical supplies,
>> etc. i've never designed or implemented a storage network before but
>> such designs seem common in the literature and well supported by
>> Solaris. i have done such designs with data networks and such
>> redundancy is quite common.
>>
>> i mean, that's a lot of data to go missing due to a single device
>> failing -- which it will.
>>
>> not to say it's not a problem with zfs, just that in the real world,
>> it should be mitigated since your storage network design would overcome
>> a single failure *anyway* -- regardless of zfs.
>
> It's hardly uncommon for an entire datacenter to go down, redundant power
> or not. When it does, if it means I have to restore hundreds of
> terabytes if not petabytes from tape instead of just restoring the files
> that were corrupted or running an fsck, we've got issues.

Isn't this easily worked around by having UPS power in addition to whatever the data center supplies? I've been there with entire data center shutdown (or partial, but entire as far as my gear is concerned), but for really critical stuff we've had our own UPS.

I don't know if that really works for 100TB and up though. That's a lot of disk == a lot of UPS capacity. And again, I'm not trying to take away from the fact that this is a significant zfs problem.

-frank
Bob Friesenhahn
2009-Feb-11 21:52 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, 11 Feb 2009, Tim wrote:
> Right, except the OP stated he unmounted the filesystem in question, and it
> was the *ONLY* one on the drive, meaning there is absolutely 0 chance of
> there being pending writes. There's nothing to write to.

This is an interesting assumption leading to a wrong conclusion. If the file is updated and the filesystem is "unmounted", it is still possible for there to be uncommitted data in the pool. If you pay closer attention you will see that "mounting" the filesystem basically just adds a logical path mapping since the filesystem is already available under /poolname/filesystemname regardless. So doing the mount makes /poolname/filesystemname available as /filesystemname, or whatever mount path you specify.

Bob
=====================================
Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
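The distinction Bob is drawing is easy to see from the CLI; a small sketch, again using the pool name from earlier in the thread:

    zfs umount usbhdd1      # the dataset disappears from 'mount' output...
    zpool list usbhdd1      # ...but the pool is still imported and active
    zpool status usbhdd1    # and still owned by this host, possibly with uncommitted state
    zpool export usbhdd1    # only this flushes, unmounts and releases the pool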
David Dyer-Bennet
2009-Feb-11 22:44 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 15:51, Frank Cusack wrote:
> On February 11, 2009 3:02:48 PM -0600 Tim <tim at tcsac.net> wrote:
>> It's hardly uncommon for an entire datacenter to go down, redundant power
>> or not. When it does, if it means I have to restore hundreds of
>> terabytes if not petabytes from tape instead of just restoring the files
>> that were corrupted or running an fsck, we've got issues.
>
> Isn't this easily worked around by having UPS power in addition to
> whatever the data center supplies?

Well, that covers some of the cases (it does take a fairly hefty UPS to deal with 100TB levels of redundant disk).

> I've been there with entire data center shutdown (or partial, but entire
> as far as my gear is concerned), but for really critical stuff we've had
> our own UPS.

I knew people once who had pretty careful power support; UPS where needed, then backup generator that would cut in automatically, and cut back when power was restored. Unfortunately, the cut back failed to happen automatically. On a weekend. So things sailed along fine until the generator ran out of fuel, and then shut down MOST uncleanly. Best laid plans of mice and men gang aft agley, or some such (from memory, and the spelling seems unlikely).

Sure, human error was a factor. But human error is a MAJOR factor in the real world, and one of the things we're trying to protect our data from. Certainly, if a short power glitch on the normal mains feed (to lapse into Brit for a second) brings down your data server in an uncontrolled fashion, you didn't do a very good job of protecting it. My home NAS is protected to the point of one UPS, anyway. But real-world problems a few steps more severe can produce the same power cut, practically anywhere, just not as often.

> I don't know if that really works for 100TB and up though. That's a lot
> of disk == a lot of UPS capacity. And again, I'm not trying to take away
> from the fact that this is a significant zfs problem.

We've got this UPS in our server room that's about, oh, 4 washing machines in size. It's wired into building power, and powers the outlets the servers are plugged into, and the floor outlets out here the development PCs are plugged into also.

I never got the tour, but I heard about the battery backup system at the old data center Northwest Airlines had back when they ran their own reservations system. Enough lead-acid batteries to keep an IBM mainframe running for three hours.

One can certainly do it if one wants to badly enough, which one should if the data is important. I can't imagine anybody investing in 100TB of enterprise-grade storage if the data WASN'T important!

-- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
David Dyer-Bennet
2009-Feb-11 22:52 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 15:52, Bob Friesenhahn wrote:
> On Wed, 11 Feb 2009, Tim wrote:
>> Right, except the OP stated he unmounted the filesystem in question, and
>> it was the *ONLY* one on the drive, meaning there is absolutely 0 chance
>> of there being pending writes. There's nothing to write to.
>
> This is an interesting assumption leading to a wrong conclusion. If
> the file is updated and the filesystem is "unmounted", it is still
> possible for there to be uncommitted data in the pool. If you pay
> closer attention you will see that "mounting" the filesystem basically
> just adds a logical path mapping, since the filesystem is already
> available under /poolname/filesystemname regardless. So doing the
> mount makes /poolname/filesystemname available as /filesystemname, or
> whatever mount path you specify.

As a practical matter, it seems unreasonable to me that there would be uncommitted data in the pool after some quite short period of time when there's no new IO activity to the pool (not just the filesystem). 5 or 10 seconds, maybe? (Possibly excepting if there was a HUGE spike of IO for a while just before this; there could be considerable stuff in the ZIL not yet committed then, I would think.)

That is, if I plug in a memory stick with ZFS on it, read and write for a while, then when I'm done and IO appears to have quiesced, observe that the IO light on the drive is inactive for several seconds, I'd be kinda disappointed if I got actual corruption if I pulled it. Complaints about not being exported next time I tried to import it, sure. Maybe other complaints. I wouldn't do this deliberately (other than for testing).

But it seems wrong to leave things uncommitted significantly longer than necessary (seconds are huge time units to a computer, after all), and if the device is sitting there not doing IO, there's no reason it shouldn't already have written anything uncommitted.

Conversely, anybody who is pulling disks / memory sticks off while IO is visibly incomplete really SHOULD expect to lose everything on them, even if sometimes they'll be luckier than that. I suppose we're dealing with people who didn't work with floppies here, where that lesson got pretty solidly beaten into people :-.

--
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Bob Friesenhahn
2009-Feb-11 23:25 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, 11 Feb 2009, David Dyer-Bennet wrote:
> As a practical matter, it seems unreasonable to me that there would be
> uncommitted data in the pool after some quite short period of time when
> there's no new IO activity to the pool (not just the filesystem). 5 or 10
> seconds, maybe? (Possibly excepting if there was a HUGE spike of IO for a
> while just before this; there could be considerable stuff in the ZIL not
> yet committed then, I would think.)

I agree. ZFS apparently syncs uncommitted writes every 5 seconds. If there has been no filesystem I/O (including read I/O due to atime) for at least 10 seconds, and there has not been more data burst-written into RAM than can be written to disk in 10 seconds, then there should be nothing remaining to write.

Regardless, it seems that the ZFS problems with crummy hardware are primarily due to the crummy hardware writing the data to the disk in a different order than expected. ZFS expects that after a sync all pending writes are committed.

The lesson is that unprofessional hardware may prove to be unreliable for professional usage.

Bob

======================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
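If you want more than an activity LED to go on, one low-tech check is to watch the pool's own I/O counters settle before touching the hardware. A sketch (the pool name is an example):

    zpool iostat usbhdd1 1    # per-second read/write activity for the pool
    # wait until the write columns sit at 0 for a good number of intervals
    sync                      # then ask the OS to push anything still dirty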
I need to disappoint you here: an LED that is inactive for a few seconds is a very bad indicator of pending writes. I used to see this with a stick on Ubuntu, which was silent until the 'umount' and then started writing for some 10 seconds.

On the other hand, you are spot-on w.r.t. 'umount'. Once the command has returned, no more writes are to be expected, and if there were, it would be a serious bug. So this 'umount'ed filesystem needs to be in a perfectly consistent state. (Which is why I wrote further up that the structure above the file system, that is the pool, is probably the culprit for all this misery.)

"Conversely, anybody who is pulling disks / memory sticks off while IO is visibly incomplete really SHOULD expect to lose everything on them"

I hope you don't mean this. Not in a filesystem so hyped and so advanced. Of course we expect corruption of any files whose 'write' was rudely interrupted. But I, for one, expect the metadata of all other files to remain readily available. Something like, at the next use: "You idiot removed the plug while files were still being written. Don't expect those to be available now. Here is the list of all other files: [list of all files not being written at the time]"

Uwe
--
This message posted from opensolaris.org
On 11-Feb-09, at 5:52 PM, David Dyer-Bennet wrote:> > On Wed, February 11, 2009 15:52, Bob Friesenhahn wrote: >> On Wed, 11 Feb 2009, Tim wrote: >>> >>> Right, except the OP stated he unmounted the filesystem in >>> question, and >>> it >>> was the *ONLY* one on the drive, meaning there is absolutely 0 >>> chance of >>> their being pending writes. There''s nothing to write to. >> >> This is an interesting assumption leading to a wrong conclusion. If >> the file is updated and the filesystem is "unmounted", it is still >> possible for there to be uncommitted data in the pool. ... > > As a practical matter, it seems unreasonable to me that there would be > uncommitted data in the pool after some quite short period of time ... > > That is, if I plug in a memory stick with ZFS on it, read and write > for a > while, then when I''m done and IO appears to have quiesced, observe > that > the IO light on the drive is inactive for several seconds, I''d be > kinda > disappointed if I got actual corrution if I pulled it.Absolutely. You should never get "actual corruption" (inconsistency) at any time *except* in the case Jeff Bonwick explained: i.e. faulty/ misbehaving hardware! (That''s one meaning of "always consistent on disk".) I think this is well understood, is it not? Write barriers are not a new concept, and nor is the necessity. For example, they are a clearly described feature of DEC''s MSCP protocol*, long before ATA or SCSI - presumably so that transactional systems could actually be built at all. Devices were held to a high standard of conformance since DEC''s customers (like Sun''s) were traditionally those whose data was of very high value. Storage engineers across the industry were certainly implementing them long before MSCP. --Toby * - The related patent that I am looking at is #4,449,182, filed 5 Oct, 1981. "Interface between a pair of processors, such as host and peripheral- controlling processors in data processing systems." Also the MSCP document released with the UDA50 mass storage subsystem, dated April 1982: "4.5 Command Categories and Execution Order ... Sequential commands are those commands that, for the same unit, must be executed in precise order. ... All sequential commands for a particular unit that are received on the same connection must be executed in the exact order that the MSCP server receives them. The execution of a sequential command may not be interleaved with the execution of any other sequential or non-sequential commands for the same unit. Furthermore, any non-sequential commands received before and on the same connection as a particular sequential command must be completed before execution of that sequential command begins, and any non-sequential commands received after and on the same conection as a particular sequential command must not begin execution until after that sequential command is completed. Sequential commands are, in effect, a barrier than non-sequential commands cannot pass or penetrate. Non-sequential commands are those commands that controllers may re-order so as to optimize performance. Controllers may furthermore interleave the execution of several non-sequential commands among themselves, ..."> Complaints about > not being exported next time I tried to import it, sure. Maybe other > complaints. I wouldn''t do this deliberately (other than for testing). > ... 
> > -- > David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ > Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ > Photos: http://dd-b.net/photography/gallery/ > Dragaera: http://dragaera.info > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On 11-Feb-09, at 7:16 PM, Uwe Dippel wrote:> I need to disappoint you here, LED inactive for a few seconds is a > very bad indicator of pending writes. Used to experience this on a > stick on Ubuntu, which was silent until the ''umount'' and then it > started to write for some 10 seconds. > > On the other hand, you are spot-on w.r.t. ''umount''. Once the > command is through, there is no more write to be expected. And if > there was, it would be a serious bug.Yes; though at the risk of repetition - the bug here can be in the drive...> So this ''umount''ed system needs to be in perfectly consistent > states. (Which is why I wrote further up that the structure above > the file system, that is the pool, is probably the culprit for all > this misery.) > > [i]Conversely, anybody who is pulling disks / memory sticks off > while IO is > visibly incomplete really SHOULD expect to lose everything on them[/i] > I hope you don''t mean this. Not in a filesystem much hyped and much > advanced. Of course, we expect corruption of all files whose > ''write'' has been boldly interrupted. But I for one, expect the > metadata of all other files to be readily available. Kind of, at > the next use, telling me:"You idiot removed the plug last, while > files were still in the process of writing. Don''t expect them to be > available now. Here is the list of all other files: [list of all > files not being written then]"That hope is a little naive. AIUI, it cannot be known, thanks to the many indeterminacies of the I/O path, which ''files'' were partially written (since a whole slew of copy-on-writes to many objects could have been in flight, and absent a barrier it cannot be known post facto which succeeded). What is known, is the last checkpoint. Hence the feasible recovery mode is a partial, automatic rollback to a past consistent state. Somebody correct me if I am wrong. --Toby> > Uwe > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Toby, sad that you fall for the last resort of the marketing droids here. All manufactures (and there are only a few left) will sue the hell out of you if you state that their drives don''t ''sync''. And each and every drive I have ever used did. So the talk about a distinct borderline between ''enterprise'' and ''home'' is just cheap and not sustainable. Also, if you were correct, and ZFS allowed for compromising the metadata of dormant files (folders) by writing metadata for other files (folders), we would not have advanced beyond FAT, and ZFS would be but a short episode in the history of file systems. Or am I the last to notice that atomic writes have been dropped? Especially with atomic writes you either have the last consistent state of the file structure, or the updated one. So what would be the meaning of ''always consistent on the drive'' if metadata were allowed to hang in between; in an inconsistent state? You write "What is known, is the last checkpoint." Exactly, and here a contradiction shows: the last checkpoint of all untouched files (plus those read only) does contain exactly all untouched files. How could one allow to compromise the last checkpoint by writing a new one? You are correct with "the feasible recovery mode is a partial". Though here we have heard some stories of total loss. Nobody has questioned that the recovery of an interrupted ''write'' must necessarily be partial. What is questioned is the complete loss of semantics. Uwe -- This message posted from opensolaris.org
David Dyer-Bennet
2009-Feb-12 02:49 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 17:25, Bob Friesenhahn wrote:> Regardless, it seems that the ZFS problems with crummy hardware are > primarily due to the crummy hardware writting the data to the disk in > a different order than expected. ZFS expects that after a sync that > all pending writes are committed.Which is something Unix has been claiming (or pretending) to provide for some time now, yes.> The lesson is that unprofessional hardware may prove to be unreliable > for professional usage.Or any other usage. And the question is how can we tell them apart? -- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
David Dyer-Bennet
2009-Feb-12 02:53 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 18:25, Toby Thain wrote:> > Absolutely. You should never get "actual corruption" (inconsistency) > at any time *except* in the case Jeff Bonwick explained: i.e. faulty/ > misbehaving hardware! (That''s one meaning of "always consistent on > disk".) > > I think this is well understood, is it not?Perhaps. I think the consensus seems to be settling down this direction (as I filter for reliability of people posting, not by raw count :-)). The shocker is how much hardware that doesn''t behave to spec in this area seems to be out there -- or so people claim; the other problem is that we can''t sort out which is which.> Write barriers are not a new concept, and nor is the necessity. For > example, they are a clearly described feature of DEC''s MSCP > protocol*, long before ATA or SCSI - presumably so that transactional > systems could actually be built at all. Devices were held to a high > standard of conformance since DEC''s customers (like Sun''s) were > traditionally those whose data was of very high value. Storage > engineers across the industry were certainly implementing them long > before MSCP. > > --Toby > > > * - The related patent that I am looking at is #4,449,182, filed 5 > Oct, 1981. > "Interface between a pair of processors, such as host and peripheral- > controlling processors in data processing systems."While I was working for LCG in Marlboro, in fact. (Not on hardware, nowhere near that work.) -- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On 11-Feb-09, at 9:30 PM, Uwe Dippel wrote:> Toby, > > sad that you fall for the last resort of the marketing droids here. > All manufactures (and there are only a few left) will sue the hell > out of you if you state that their drives don''t ''sync''. And each > and every drive I have ever used did. So the talk about a distinct > borderline between ''enterprise'' and ''home'' is just cheap and not > sustainable.They have existed. This thread has shown a motive to verify COTS drives for this property, if the data is valuable.> > Also, if you were correct, and ZFS allowed for compromising the > metadata of dormant files (folders) by writing metadata for other > files (folders), we would not have advanced beyond FAT, and ZFS > would be but a short episode in the history of file systems. Or am > I the last to notice that atomic writes have been dropped? > Especially with atomic writes you either have the last consistent > state of the file structure, or the updated one. So what would be > the meaning of ''always consistent on the drive'' if metadata were > allowed to hang in between; in an inconsistent state? You write > "What is known, is the last checkpoint." Exactly, and here a > contradiction shows: the last checkpoint of all untouched files > (plus those read only) does contain exactly all untouched files. > How could one allow to compromise the last checkpoint by writing a > new one?ZFS claims that the last checkpoint (my term, sorry, not an official one) is fully consistent (metadata *and* data! Unlike other filesystems). Since consistency is achievable by thousands of other transactional systems I have no reason to doubt that it is achieved by ZFS.> You are correct with "the feasible recovery mode is a partial". > Though here we have heard some stories of total loss. Nobody has > questioned that the recovery of an interrupted ''write'' must > necessarily be partial. What is questioned is the complete loss of > semantics.Only an incomplete transaction would be lost, AIUI. That is the ''atomic'' property of all journaled and transactional systems. (All of it, or none of it.) --Toby> > Uwe > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
May I doubt that there are drives that don't 'sync'? That would mean a good chance of corrupted data at every normal 'reboot', or just at a 'umount' (leaving ZFS aside here). And may I doubt the marketing line that you need to buy USCSI or whatnot to get a functional 'sync' at shutdown or umount? There are millions if not billions of drives out there that come up with consistent data structures after a clean shutdown. That means a proper 'umount' flushes everything on those drives, so we should expect neither corrupted data nor further writes. That was the point further up that I tried to answer, as well as the notion that a file system which encounters interrupted writes may well, and legitimately, be completely unreadable. That is what I refuted, nothing else.

Uwe
--
This message posted from opensolaris.org
David Dyer-Bennet
2009-Feb-12 15:16 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Wed, February 11, 2009 18:16, Uwe Dippel wrote:> I need to disappoint you here, LED inactive for a few seconds is a very > bad indicator of pending writes. Used to experience this on a stick on > Ubuntu, which was silent until the ''umount'' and then it started to write > for some 10 seconds.Yikes, that''s bizarre.> On the other hand, you are spot-on w.r.t. ''umount''. Once the command is > through, there is no more write to be expected. And if there was, it would > be a serious bug. So this ''umount''ed system needs to be in perfectly > consistent states. (Which is why I wrote further up that the structure > above the file system, that is the pool, is probably the culprit for all > this misery.)Yeah, once it''s unmounted it really REALLY should be in a consistent state.> [i]Conversely, anybody who is pulling disks / memory sticks off while IO > is > visibly incomplete really SHOULD expect to lose everything on them[/i] > I hope you don''t mean this. Not in a filesystem much hyped and much > advanced. Of course, we expect corruption of all files whose ''write'' has > been boldly interrupted. But I for one, expect the metadata of all other > files to be readily available. Kind of, at the next use, telling me:"You > idiot removed the plug last, while files were still in the process of > writing. Don''t expect them to be available now. Here is the list of all > other files: [list of all files not being written then]"It''s good to have hopes, certainly. I''m just kinda cynical. -- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
After all the statements read here, I just want to highlight another issue regarding ZFS. It has been recommended many times here to set copies=2. Installing Solaris 10 10/2008 or snv_107, you can choose either UFS or ZFS. If you choose ZFS, the rpool is created by default with 'copies=1'. If nobody mentions this, and you have a hung system with no way to access it or shut it down properly, and no choice but to hold down the power button of your notebook, couldn't the same thing happen there that happened to my external USB drive? It is the same sudden power-off event that seems to have damaged my pool, and it would be nice if ZFS could handle it.

Another issue I miss in this thread: ZFS is a layer on top of an EFI label. What about that in the case of a sudden power-off event?

Regards,

Dave.
--
This message posted from opensolaris.org
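For reference, the copies setting can be inspected and raised per dataset. A quick sketch (the dataset names are only examples; note that copies=2 affects only blocks written after the change, and it is no substitute for a mirrored pool):

    zfs get copies rpool
    zfs set copies=2 rpool/export/home   # only newly written blocks get the extra copy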
> All that and yet the fact remains: I've never "ejected" a USB drive from
> OS X or Windows, I simply pull it and go, and I've never once lost data,
> or had it become unrecoverable or even corrupted.
>
> And yes, I do keep checksums of all the data sitting on them and
> periodically check it. So, for all of your ranting and raving, the fact
> remains even a *crappy* filesystem like fat32 manages to handle a hot
> unplug without any prior notice without going belly up.
>
> --Tim

Just wanted to chime in with my 2c here. I've also *never* unmounted a USB drive from Windows, and have been using them regularly since memory sticks became available. So that's 2-3 years of experience, and I've never lost work on a memory stick, nor had a file corrupted.

I can also state with confidence that very, very few of the 100 staff working here will even be aware that it's possible to unmount a USB volume in Windows. They will all just pull the plug when their work is saved, and since they all come to me when they have problems, I think I can safely say that pulling USB devices really doesn't tend to corrupt filesystems in Windows. Everybody I know just waits for the light on the device to go out.

And while this isn't really what ZFS is designed for, I do think it should be able to cope. First of all, some kind of ZFS recovery tool is needed. There's going to be an awful lot of good data on that disk; making all of it inaccessible just because the last write failed isn't really on. It's a copy-on-write filesystem, and "zpool import" really should be able to take advantage of that when recovering pools! I don't know the technicalities of how it works on disk, but my feeling is that the last successful mount point should be saved, and the last few uberblocks should also be available, so barring complete hardware failure, some kind of pool should be available for mounting.

Also, if a drive is removed while writes are pending, some kind of error or warning is needed, either on the console or in the GUI. It should be possible to prompt the user to re-insert the device so that the remaining writes can be completed. Recovering the pool in that situation should be easy: you can keep the location of the uberblock you're using in memory, and just re-write everything.

Of course, that does assume that devices are being truthful when they say that data has been committed, but a little data loss from badly designed hardware is, I feel, acceptable, so long as ZFS can have a go at recovering corrupted pools when it does happen, instead of giving up completely like it does now. Yes, these problems happen more often with consumer-level hardware, but recovery tools like this are going to be very much appreciated by anybody who encounters problems like this on a server!
--
This message posted from opensolaris.org
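The builds discussed in this thread have no supported rewind or recovery option (a rewind option for "zpool import" did appear in later OpenSolaris builds). What can be done today is to at least inspect the on-disk labels. A sketch, with a made-up device path:

    zdb -l /dev/dsk/c3t0d0s0    # dump the four vdev labels ZFS keeps on the device
    zpool import                 # list whatever pools the system can still see on attached devices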
Robert Milkowski
2009-Feb-12 16:44 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Hello Bob,

Wednesday, February 11, 2009, 11:25:12 PM, you wrote:

BF> I agree. ZFS apparently syncs uncommitted writes every 5 seconds.
BF> If there has been no filesystem I/O (including read I/O due to atime)
BF> for at least 10 seconds, and there has not been more data
BF> burst-written into RAM than can be written to disk in 10 seconds, then
BF> there should be nothing remaining to write.

That's not entirely true. After recent changes, writes can be delayed by up to 30s by default.

--
Best regards,
Robert Milkowski
http://milek.blogspot.com
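The effective transaction-group interval can be checked on a live system; assuming the kernel tunable in your build is still named zfs_txg_timeout (a name not confirmed anywhere in this thread), something like:

    echo zfs_txg_timeout/D | mdb -k    # print the txg timeout, in seconds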
Ross wrote:> I can also state with confidence that very, very few of the 100 staff working here will even be aware that it''s possible to unmount a USB volume in windows. They will all just pull the plug when their work is saved, and since they all come to me when they have problems, I think I can safely say that pulling USB devices really doesn''t tend to corrupt filesystems in Windows. Everybody I know just waits for the light on the device to go out. >The key here is that Windows does not cache writes to the USB drive unless you go in and specifically enable them. It caches reads but not writes. If you enable them you will lose data if you pull the stick out before all the data is written. This is the type of safety measure that needs to be implemented in ZFS if it is to support the average user instead of just the IT professionals. Regards, Greg
David Dyer-Bennet
2009-Feb-12 17:31 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Thu, February 12, 2009 10:10, Ross wrote:> Of course, that does assume that devices are being truthful when they say > that data has been committed, but a little data loss from badly designed > hardware is I feel acceptable, so long as ZFS can have a go at recovering > corrupted pools when it does happen, instead of giving up completely like > it does now.Well; not "acceptable" as such. But I''d agree it''s outside ZFS''s purview. The blame for data lost due to hardware actively lying and not working to spec goes to the hardware vendor, not to ZFS. If ZFS could easily and reliably warn about such hardware I''d want it to, but the consensus seems to be that we don''t have a reliable qualification procedure. In terms of upselling people to a Sun storage solution, having ZFS diagnose problems with their cheap hardware early is clearly desirable :-). -- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
On Thu, Feb 12, 2009 at 11:31 AM, David Dyer-Bennet <dd-b at dd-b.net> wrote:
> On Thu, February 12, 2009 10:10, Ross wrote:
>
> > Of course, that does assume that devices are being truthful when they say
> > that data has been committed, but a little data loss from badly designed
> > hardware is I feel acceptable, so long as ZFS can have a go at recovering
> > corrupted pools when it does happen, instead of giving up completely like
> > it does now.
>
> Well; not "acceptable" as such. But I'd agree it's outside ZFS's purview.
> The blame for data lost due to hardware actively lying and not working to
> spec goes to the hardware vendor, not to ZFS.
>
> If ZFS could easily and reliably warn about such hardware I'd want it to,
> but the consensus seems to be that we don't have a reliable qualification
> procedure. In terms of upselling people to a Sun storage solution, having
> ZFS diagnose problems with their cheap hardware early is clearly desirable
> :-).

Right, well I can't imagine it's impossible to write a small app that can test whether or not drives are honoring cache flushes correctly, by issuing a commit and immediately reading back to see if it was indeed committed or not. Like a "zfs test cXtX". Of course, then you can't just blame the hardware every time something in zfs breaks ;)

--Tim
On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote:> Ross wrote: > >I can also state with confidence that very, very few of the 100 staff > >working here will even be aware that it''s possible to unmount a USB volume > >in windows. They will all just pull the plug when their work is saved, > >and since they all come to me when they have problems, I think I can > >safely say that pulling USB devices really doesn''t tend to corrupt > >filesystems in Windows. Everybody I know just waits for the light on the > >device to go out. > > > The key here is that Windows does not cache writes to the USB drive > unless you go in and specifically enable them. It caches reads but not > writes. If you enable them you will lose data if you pull the stick out > before all the data is written. This is the type of safety measure that > needs to be implemented in ZFS if it is to support the average user > instead of just the IT professionals.That implies that ZFS will have to detect removable devices and treat them differently than fixed devices. It might have to be an option that can be enabled for higher performance with reduced data security. -- -Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Mattias Pantzare
2009-Feb-12 20:45 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
> Right, well I can't imagine it's impossible to write a small app that can
> test whether or not drives are honoring cache flushes correctly, by issuing
> a commit and immediately reading back to see if it was indeed committed or
> not. Like a "zfs test cXtX". Of course, then you can't just blame the
> hardware every time something in zfs breaks ;)

A read of data that is in the disk cache will be served from the disk cache. You can't tell the disk to ignore its cache and read directly from the platter.

The only way to test this is to write and then remove the power from the disk. Not easy in software.
That would be the ideal, but really I''d settle for just improved error handling and recovery for now. In the longer term, disabling write caching by default for USB or Firewire drives might be nice. On Thu, Feb 12, 2009 at 8:35 PM, Gary Mills <mills at cc.umanitoba.ca> wrote:> On Thu, Feb 12, 2009 at 11:53:40AM -0500, Greg Palmer wrote: >> Ross wrote: >> >I can also state with confidence that very, very few of the 100 staff >> >working here will even be aware that it''s possible to unmount a USB volume >> >in windows. They will all just pull the plug when their work is saved, >> >and since they all come to me when they have problems, I think I can >> >safely say that pulling USB devices really doesn''t tend to corrupt >> >filesystems in Windows. Everybody I know just waits for the light on the >> >device to go out. >> > >> The key here is that Windows does not cache writes to the USB drive >> unless you go in and specifically enable them. It caches reads but not >> writes. If you enable them you will lose data if you pull the stick out >> before all the data is written. This is the type of safety measure that >> needs to be implemented in ZFS if it is to support the average user >> instead of just the IT professionals. > > That implies that ZFS will have to detect removable devices and treat > them differently than fixed devices. It might have to be an option > that can be enabled for higher performance with reduced data security. > > -- > -Gary Mills- -Unix Support- -U of M Academic Computing and Networking- >
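If the sd driver exposes it for a given device, the drive's write cache can already be toggled by hand from format's expert mode. A rough sketch (whether the cache menu appears at all depends on the device and driver, so treat this as something to verify, not a given):

    format -e
    # at the "format>" prompt:      cache
    # at the "cache>" prompt:       write_cache
    # at the "write_cache>" prompt: disable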
bdebelius at intelesyscorp.com
2009-Feb-12 21:44 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Is this the crux of the problem?

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6424510

"For usb devices, the driver currently ignores DKIOCFLUSHWRITECACHE. This can cause catastrophic data corruption in the event of power loss, even for filesystems like ZFS that are designed to survive it. Dropping a flush-cache command is just as bad as dropping a write. It violates the interface that software relies on to use the device."
--
This message posted from opensolaris.org
That does look like the issue being discussed. It''s a little alarming that the bug was reported against snv54 and is still not fixed :( Does anyone know how to push for resolution on this? USB is pretty common, like it or not for storage purposes - especially amongst the laptop-using dev crowd that OpenSolaris apparently targets. On Thu, Feb 12, 2009 at 4:44 PM, bdebelius at intelesyscorp.com <bdebelius at intelesyscorp.com> wrote:> Is this the crux of the problem? > > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6424510 > > ''For usb devices, the driver currently ignores DKIOCFLUSHWRITECACHE. > This can cause catastrophic data corruption in the event of power loss, > even for filesystems like ZFS that are designed to survive it. > Dropping a flush-cache command is just as bad as dropping a write. > It violates the interface that software relies on to use the device.'' > -- > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >
David Dyer-Bennet
2009-Feb-12 22:38 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Thu, February 12, 2009 14:02, Tim wrote:> > Right, well I can''t imagine it''s impossible to write a small app that can > test whether or not drives are honoring correctly by issuing a commit and > immediately reading back to see if it was indeed committed or not. Like a > "zfs test cXtX". Of course, then you can''t just blame the hardware > everytime something in zfs breaks ;) >I can imagine it fairly easily. All you''ve got to work with is what the drive says about itself, and how fast, and the what we''re trying to test is whether it lies. It''s often very hard to catch it out on this sort of thing. We need somebody who really understands the command sets available to send to modern drives (which is not me) to provide a test they think would work, and people can argue or try it. My impression, though, is that the people with the expertise are so far consistently saying it''s not possible. I think at this point somebody who thinks it''s possible needs to do the work to at least propose a specific test, or else we have to give up on the idea. I''m still hoping for at least some kind of qualification procedure involving manual intervention (hence not something that could be embodied in a simple command you just typed), but we''re not seeing even this so far. Of course, the other side of this is that, if people "know" that drives have these problems, there must in fact be some way to demonstrate it, or they wouldn''t know. -- David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/ Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/ Photos: http://dd-b.net/photography/gallery/ Dragaera: http://dragaera.info
bdebelius at intelesyscorp.com
2009-Feb-12 22:47 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
I just tried putting a pool on a USB flash drive, writing a file to it, and then yanking it. I did not lose any data or the pool, but I had to reboot before I could get any zpool command to complete without freezing. I also had the OS reboot once on its own when I tried to issue a zpool command to the pool. The OS did not notice the disk had been yanked until I tried to status it.
--
This message posted from opensolaris.org
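For anyone wanting to repeat the experiment, the sequence was roughly the following (the device and pool names here are invented, and the yank itself is obviously a physical step):

    zpool create -f usbtest c5t0d0p0      # your flash drive's device node will differ
    cp /usr/dict/words /usbtest/
    sync
    # physically pull the stick, then:
    zpool status usbtest                  # in the report above, this is where things hung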
Bill Sommerfeld
2009-Feb-12 22:57 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Thu, 2009-02-12 at 17:35 -0500, Blake wrote:> That does look like the issue being discussed. > > It''s a little alarming that the bug was reported against snv54 and is > still not fixed :(bugs.opensolaris.org''s information about this bug is out of date. It was fixed in snv_54: changeset: 3169:1dea14abfe17 user: phitran date: Sat Nov 25 11:05:17 2006 -0800 files: usr/src/uts/common/io/scsi/targets/sd.c 6424510 usb ignores DKIOCFLUSHWRITECACHE - Bill
On 12-Feb-09, at 3:02 PM, Tim wrote:> > > On Thu, Feb 12, 2009 at 11:31 AM, David Dyer-Bennet <dd-b at dd-b.net> > wrote: > > On Thu, February 12, 2009 10:10, Ross wrote: > > > Of course, that does assume that devices are being truthful when > they say > > that data has been committed, but a little data loss from badly > designed > > hardware is I feel acceptable, so long as ZFS can have a go at > recovering > > corrupted pools when it does happen, instead of giving up > completely like > > it does now. > > Well; not "acceptable" as such. But I''d agree it''s outside ZFS''s > purview. > The blame for data lost due to hardware actively lying and not > working to > spec goes to the hardware vendor, not to ZFS. > > If ZFS could easily and reliably warn about such hardware I''d want > it to, > but the consensus seems to be that we don''t have a reliable > qualification > procedure. In terms of upselling people to a Sun storage solution, > having > ZFS diagnose problems with their cheap hardware early is clearly > desirable > :-). > > > > Right, well I can''t imagine it''s impossible to write a small app > that can test whether or not drives are honoring correctly by > issuing a commit and immediately reading back to see if it was > indeed committed or not.You do realise that this is not as easy as it looks? :) For one thing, the drive will simply serve the read from cache. It''s hard to imagine a test that doesn''t involve literally pulling plugs; even better, a purpose built hardware test harness. Nonetheless I hope that someone comes up with a brilliant test. But if the ZFS team hasn''t found one yet... it looks grim :) --Toby> Like a "zfs test cXtX". Of course, then you can''t just blame the > hardware everytime something in zfs breaks ;) > > --Tim > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090212/78b32510/attachment.html>
I''m sure it''s very hard to write good error handling code for hardware events like this. I think, after skimming this thread (a pretty wild ride), we can at least decide that there is an RFE for a recovery tool for zfs - something to allow us to try to pull data from a failed pool. That seems like a reasonable tool to request/work on, no? On Thu, Feb 12, 2009 at 6:03 PM, Toby Thain <toby at telegraphics.com.au> wrote:> > On 12-Feb-09, at 3:02 PM, Tim wrote: > > > On Thu, Feb 12, 2009 at 11:31 AM, David Dyer-Bennet <dd-b at dd-b.net> wrote: >> >> On Thu, February 12, 2009 10:10, Ross wrote: >> >> > Of course, that does assume that devices are being truthful when they >> > say >> > that data has been committed, but a little data loss from badly designed >> > hardware is I feel acceptable, so long as ZFS can have a go at >> > recovering >> > corrupted pools when it does happen, instead of giving up completely >> > like >> > it does now. >> >> Well; not "acceptable" as such. But I''d agree it''s outside ZFS''s purview. >> The blame for data lost due to hardware actively lying and not working to >> spec goes to the hardware vendor, not to ZFS. >> >> If ZFS could easily and reliably warn about such hardware I''d want it to, >> but the consensus seems to be that we don''t have a reliable qualification >> procedure. In terms of upselling people to a Sun storage solution, having >> ZFS diagnose problems with their cheap hardware early is clearly desirable >> :-). >> > > > Right, well I can''t imagine it''s impossible to write a small app that can > test whether or not drives are honoring correctly by issuing a commit and > immediately reading back to see if it was indeed committed or not. > > You do realise that this is not as easy as it looks? :) For one thing, the > drive will simply serve the read from cache. > It''s hard to imagine a test that doesn''t involve literally pulling plugs; > even better, a purpose built hardware test harness. > Nonetheless I hope that someone comes up with a brilliant test. But if the > ZFS team hasn''t found one yet... it looks grim :) > --Toby > > Like a "zfs test cXtX". Of course, then you can''t just blame the hardware > everytime something in zfs breaks ;) > > --Tim > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >
Eric D. Mudama
2009-Feb-13 00:02 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Thu, Feb 12 at 21:45, Mattias Pantzare wrote:>A read of data in the disk cache will be read from the disk cache. You >can''t tell the disk to ignore its cache and read directly from the >plater. > > The only way to test this is to write and the remove the power from >the disk. Not easy in software.Not true with modern SATA drives that support NCQ, as there is a FUA bit that can be set by the driver on NCQ reads. If the device implements the spec, any overlapped write cache data will be flushed, invalidated, and a fresh read done from the non-volatile media for the FUA read command. --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
Blake wrote:> I''m sure it''s very hard to write good error handling code for hardware > events like this. > > I think, after skimming this thread (a pretty wild ride), we can at > least decide that there is an RFE for a recovery tool for zfs - > something to allow us to try to pull data from a failed pool. That > seems like a reasonable tool to request/work on, no? >The ability to force a roll back to an older uberblock in order to be able to access the pool (in the case of corrupt current uberblock) should be ZFS developer''s very top priority, IMO. I''d offer to do it myself, but I have nowhere near the ability to do so. -- Dave
On 12-Feb-09, at 7:02 PM, Eric D. Mudama wrote:> On Thu, Feb 12 at 21:45, Mattias Pantzare wrote: >> A read of data in the disk cache will be read from the disk cache. >> You >> can''t tell the disk to ignore its cache and read directly from the >> plater. >> >> The only way to test this is to write and the remove the power from >> the disk. Not easy in software. > > Not true with modern SATA drives that support NCQ, as there is a FUA > bit that can be set by the driver on NCQ reads. If the device > implements the spec,^^ Spec compliance is what we''re testing for... We wouldn''t know if this special variant is working correctly either. :) --T> any overlapped write cache data will be flushed, > invalidated, and a fresh read done from the non-volatile media for the > FUA read command. > > --eric > > > > -- > Eric D. Mudama > edmudama at mail.bounceswoosh.org > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Blake, On Thu, Feb 12, 2009 at 05:35:14PM -0500, Blake wrote:> That does look like the issue being discussed. > > It''s a little alarming that the bug was reported against snv54 and is > still not fixed :(Looks like the bug-report is out of sync. I see that the bug has been fixed in B54. Here is the link to source gate which shows that the fix is in the gate : http://src.opensolaris.org/source/search?q=&defs=&refs=&path=&hist=6424510&project=%2Fonnv And here are the diffs : http://src.opensolaris.org/source/diff/onnv/onnv-gate/usr/src/uts/common/io/scsi/targets/sd.c?r2=%2Fonnv%2Fonnv-gate%2Fusr%2Fsrc%2Futs%2Fcommon%2Fio%2Fscsi%2Ftargets%2Fsd.c%403169&r1=%2Fonnv%2Fonnv-gate%2Fusr%2Fsrc%2Futs%2Fcommon%2Fio%2Fscsi%2Ftargets%2Fsd.c%403138 Thanks and regards, Sanjeev.> > Does anyone know how to push for resolution on this? USB is pretty > common, like it or not for storage purposes - especially amongst the > laptop-using dev crowd that OpenSolaris apparently targets. > > > > On Thu, Feb 12, 2009 at 4:44 PM, bdebelius at intelesyscorp.com > <bdebelius at intelesyscorp.com> wrote: > > Is this the crux of the problem? > > > > http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6424510 > > > > ''For usb devices, the driver currently ignores DKIOCFLUSHWRITECACHE. > > This can cause catastrophic data corruption in the event of power loss, > > even for filesystems like ZFS that are designed to survive it. > > Dropping a flush-cache command is just as bad as dropping a write. > > It violates the interface that software relies on to use the device.'' > > -- > > This message posted from opensolaris.org > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss-- ---------------- Sanjeev Bagewadi Solaris RPE Bangalore, India
bcirvin, you proposed "something to allow us to try to pull data from a failed pool". Yes and no. 'Yes' as a pragmatic solution; 'no' in terms of what ZFS was 'sold' as: the last filesystem mankind would need. It was conceived as a filesystem that does not need recovery, thanks to its guaranteed consistent state on the/any drive, or better: at any moment. If that were truly the case, a recovery program would not be needed, and I don't think Sun would like one either. It is also far from optimal to prevent caching, as others have proposed; that is nothing but a very ugly hack.

Again, and I have yet to receive comments on this: the original poster claimed to have done a proper flush/sync and to have left a 100% consistent file system behind on his drive. At reboot, the pool, the higher-level entity, failed miserably. Of course, one can now conceive of a program that scans the whole drive, like in the good old days on ancient file systems, to recover all those 100% correct file system(s). Or one could, as proposed, add an uberblock copy, like the FAT mirror we had in the last millennium. The alternative, and the much better solution engineering-wise, would be to fix the weakness at the contextual or semantic level: the situation where 100% consistent file systems cannot be reached by the operating system. This, so it seems, is (still) a shortcoming of the ZFS concept. It might be solved by yesterday's means, I agree. Or by putting more work into the volume-management level, the pools.

Without claiming to have the solution, conceptually I would propose doing away with the static, look-up-table-like structure of the pool as stored in a mirror or uberblock. Could pools be associated dynamically? Could the filesystems in a pool create a (new) handle each time they are updated to a consistent state, so that when the drive is plugged in or powered on, the software simply collects the handles of all file systems on that drive? Then export/import would still be possible, but no longer required, since the filesystems would form their own entities. They could still have associated contextual/semantic (stored) structures into which they are 'plugged' once the drive is up, if one wanted that ('logical volume'). But with or without those, the pool would self-configure when the drive starts, by picking up all the file system handles.

Uwe
--
This message posted from opensolaris.org
On February 12, 2009 1:44:34 PM -0800 bdebelius at intelesyscorp.com wrote:> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6424510...> Dropping a flush-cache command is just as bad as dropping a write.Not that it matters, but it seems obvious that this is wrong or anyway an exaggeration. Dropping a flush-cache just means that you have to wait until the device is quiesced before the data is consistent. Dropping a write is much much worse. -frank
I am wondering if the usb storage device is not reliable for ZFS usage, can the situation be improved if I put the intent log on internal sata disk to avoid corruption and utilize the convenience of usb storage at the same time? -- This message posted from opensolaris.org
Huh? But that loses the convenience of USB.

I've used USB drives without any problems at all; just remember to "zpool export" them before you unplug.
--
This message posted from opensolaris.org
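For completeness, the safe shuffle between machines looks like this (the pool name is an example):

    zpool export usbpool     # flushes, unmounts and marks the pool as exported
    # unplug the drive, attach it to the other machine, then:
    zpool import             # shows pools found on the attached devices
    zpool import usbpool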
While mobility would be lost, USB storage still has the advantage of being cheap and easy to install compared to adding internal disks to a PC. So if I just want to use it to provide ZFS storage space for a home file server, can a small intent log located on an internal SATA disk prevent the pool corruption caused by a power cut?
--
This message posted from opensolaris.org
On 2/13/2009 5:58 AM, Ross wrote:
> Huh? But that loses the convenience of USB.
>
> I've used USB drives without any problems at all; just remember to
> "zpool export" them before you unplug.

I think there is a subcommand of cfgadm you should run to notify Solaris that you intend to unplug the device. I don't use USB, and my familiarity with cfgadm (for FC and SCSI) is limited.

-Kyle
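Roughly like this (the attachment-point id below is only an example; check your own listing):

    cfgadm -l                       # list attachment points; USB ones appear as usbN/M
    cfgadm -c unconfigure usb0/1    # tell Solaris the device at that ap_id is going away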
Having a separate intent log on good hardware will not prevent corruption on a pool with bad hardware. By "good" I mean hardware that correctly flushes its write caches when requested.

Note, a pool is always consistent (again, when using good hardware). The function of the intent log is not to provide consistency (like a journal), but to speed up synchronous requests like fsync and O_DSYNC.

Neil.

On 02/13/09 06:29, Jiawei Zhao wrote:
> While mobility would be lost, USB storage still has the advantage of being
> cheap and easy to install compared to adding internal disks to a PC. So if
> I just want to use it to provide ZFS storage space for a home file server,
> can a small intent log located on an internal SATA disk prevent the pool
> corruption caused by a power cut?
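For anyone who still wants a slog for the performance reason Neil describes, adding one is a one-liner (the pool and device names below are examples only, and the caveat above stands: it will not save a pool on hardware that drops cache flushes):

    zpool add tank log c1t2d0s0    # dedicate a slice/disk as the separate intent log
    zpool status tank              # the "logs" section now lists the slog device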
Eric D. Mudama
2009-Feb-13 16:53 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, Feb 13 at 9:14, Neil Perrin wrote:> Having a separate intent log on good hardware will not prevent corruption > on a pool with bad hardware. By "good" I mean hardware that correctly > flush their write caches when requested.Can someone please name a specific piece of bad hardware? --eric -- Eric D. Mudama edmudama at mail.bounceswoosh.org
Eric D. Mudama
2009-Feb-13 17:09 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Thu, Feb 12 at 19:43, Toby Thain wrote:> ^^ Spec compliance is what we''re testing for... We wouldn''t know if this > special variant is working correctly either. :)Time the difference between NCQ reads with and without FUA in the presence of overlapped cached write data. That should have a significant performance penalty, compared to a device servicing the reads from a volatile buffer cache. FYI, there are semi-commonly-available power control units that take serial port or USB as an input, and have a whole bunch of SATA power connectors on them. These are the sorts of things that drive vendors use to bounce power unexpectedly in their testing, if you need to perform that same validation, it makes sense to invest in that bit of infrastructure. Something like this: http://www.ulinktech.com/products/hw_power_hub.html or just roll your own in a few days like this guy did for his printer: http://chezphil.org/slugpower/ It should be pretty trivial to perform a few thousand cached writes, issue a flush cache ext, and turn off power immediately after that command completes. Then go back and figure out how many of those writes were successfully written as the device claimed. -- Eric D. Mudama edmudama at mail.bounceswoosh.org
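In ZFS terms, the loop Eric describes might look something like the sketch below. "powerctl" is a stand-in for whatever serial/USB power switch you actually own (there is no such standard utility), and the pool/port names are invented; the rest is stock Solaris userland.

    i=1
    while [ $i -le 100 ]; do
        dd if=/dev/urandom of=/testpool/f.$i bs=128k count=64 2>/dev/null
        i=`expr $i + 1`
    done
    sync                      # ask for everything to reach stable storage
    powerctl off port3        # hypothetical helper: cut drive power right after sync returns
    sleep 10
    powerctl on port3
    zpool clear testpool      # let ZFS reopen the device
    zpool scrub testpool      # if the drive honoured the flush, every block should verify
    zpool status -v testpool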
>>>>> "gm" == Gary Mills <mills at cc.umanitoba.ca> writes:gm> That implies that ZFS will have to detect removable devices gm> and treat them differently than fixed devices. please, no more of this garbage, no more hidden unchangeable automatic condescending behavior. The whole format vs rmformat mess is just ridiculous. And software and hardware developers alike have both proven themselves incapable of settling on a definition of ``removeable'''' that fits with actual use-cases like: FC/iSCSI; hot-swappable SATA; adapters that have removeable sockets on both ends like USB-to-SD, firewire CD-ROM''s, SATA/SAS port multipliers, and so on. As we''ve said many times, if the devices are working properly, then they can be unplugged uncleanly without corrupting the pool, and without corrupting any other non-Microsoft filesystem. This is an old, SOLVED, problem. It''s ridiculous hypocricy to make whole filesystems DSYNC, to even _invent the possibility for the filesystem to be DSYNC_, just because it is possible to remove something. Will you do the same thing because it is possible for your laptop''s battery to run out? just, STOP! If the devices are broken, the problem is that they''re broken, not that they''re removeable. personally, I think everything with a broken write cache should be black-listed in the kernel and attach read-only by default, whether it''s a USB bridge or a SATA disk. This will not be perfect because USB bridges, RAID layers and iSCSI targets, will often hide the identity of the SATA drive behind them, and of course people will demand a way to disable it. but if you want to be ``safe'''', then for the sake of making the point, THIS is the right way to do it, not muck around with these overloaded notions of ``removeable''''. Also, the so-far unacknowledged ``iSCSI/FC Write Hole'''' should be fixed so that a copy of all written data is held in the initiator''s buffer cache until it''s verified as *on the physical platter/NVRAM* so that it can be replayed if necessary, and SYNC CACHE commands are allowed to fail far enough that even *things which USE the initiator, like ZFS* will understand what it means when SYNC CACHE fails, and bounced connections are handled correctly---otherwise, when connections bounce or SYNC CACHE returns failure, correctness requires that the initiator pretend like its plug was pulled and panic. Short of that the initiator system must forcibly unmount all filesystems using that device and kill all processes that had files open on those filesystems. And sysadmins should have and know how to cleverly use a tool that tests for both functioning barriers and working SYNC CACHE, end-to-end. NO more ``removeable'''' attributes, please! You are just pretending to solve a much bigger problem, and making things clumsy and disgusting in the process. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090213/92ad0204/attachment.bin>
>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes:

 >> Dropping a flush-cache command is just as bad as dropping a
 >> write.

    fc> Not that it matters, but it seems obvious that this is wrong
    fc> or anyway an exaggeration. Dropping a flush-cache just means
    fc> that you have to wait until the device is quiesced before the
    fc> data is consistent.

    fc> Dropping a write is much much worse.

Backwards, I think. Dropping a flush-cache is WORSE than dropping the flush-cache plus all writes after the flush-cache. The problem that causes loss of whole pools, rather than loss of recently-written data, isn't that you're writing too little. It's that you're dropping the barrier and misordering the writes. Consequently you lose *everything you've ever written*, which is much worse than losing some recent writes, even a lot of them.
>>>>> "t" == Tim <tim at tcsac.net> writes:t> I would like to believe it has more to do with Solaris''s t> support of USB than ZFS, but the fact remains it''s a pretty t> glaring deficiency in 2009, no matter which part of the stack t> is at fault. maybe, but for this job I don''t much mind glaring deficiencies, as long as it''s possible to assemble a working system without resorting to trial-and-error, and possible to know it''s working before loading data on it. Right now, by following the ``best practices'''', you don''t know what to buy, and after you receive the hardware you don''t know if it works until you lose a pool, at which time someone will tell you ``i guess it wasn''t ever working.'''' Even if you order sun4v or an expensive FC disk shelf, you still don''t know if it works. (though, I''m starting to suspect, ni the case of FC or iSCSI the answer is always ``it does not work'''') The only thing you know for sure is, if you lose a pool, someone will blame it on hardware bugs surroudning cache flushes, or else try to conflate the issue with a bunch of inapplicable garbage about checksums and wire corruption. This is unworkable. I''m not saying glaring 2009 deficiencies are irrelevant---on my laptop I do mind because I got out of a multi-year abusive relationship with NetBSD/hpcmips, and now want all parts of my laptop to have drivers. And I guess it applies to that neat timeslider / home-base--USB-disk case we were talking about a month ago. but for what I''m doing I will actually accept the advice ``do not ever put ZFS on USB because ZFS is a canary in the mine of USB bugs''''---it''s just, that advice is not really good enough to settle the whole issue. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090213/12016a5b/attachment.bin>
>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes:fc> if you have 100TB of data, wouldn''t you have a completely fc> redundant storage network If you work for a ponderous leaf-eating brontosorous maybe. If your company is modern I think having such an oddly large amount of data in one pool means you''d more likely have 70 whitebox peecees using motherboard ethernet/sata only, connected to a mesh of unmanaged L2 switches (of some peculiar brand that happens to work well.) There will always be one or two peecees switched off, and constantly something will be resilvering. The home user case is not really just for home users. I think a lot of people are tired of paying quadruple for stuff that still breaks, even serious people. fc> Isn''t this easily worked around by having UPS power in fc> addition to whatever the data center supplies? In NYC over the last five years the power has been more reliable going into my UPS than coming out of it. The main reason for having a UPS is wiring maintenance. And the most important part of the UPS is the externally-mounted bypass switch because the UPS also needs maintenance. UPS has never _solved_ anything, it always just helps. so in the end we have to count on the software''s graceful behavior, not on absolutes. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090213/3a36c742/attachment.bin>
Miles Nordin wrote:
>     gm> That implies that ZFS will have to detect removable devices
>     gm> and treat them differently than fixed devices.
>
> please, no more of this garbage, no more hidden unchangeable automatic
> condescending behavior. The whole format vs rmformat mess is just
> ridiculous. And software and hardware developers alike have both
> proven themselves incapable of settling on a definition of
> ``removable'' that fits with actual use-cases like: FC/iSCSI;
> hot-swappable SATA; adapters that have removable sockets on both ends
> like USB-to-SD; firewire CD-ROMs; SATA/SAS port multipliers; and so
> on.
>
Since this discussion is taking place in the context of someone removing a USB stick, I think you're confusing the issue by dragging in other technologies. Let's keep this in the context of the posts preceding it, which is how USB devices are treated.

I would argue that one of the first design goals in an environment where you can expect people who are not computer professionals to be interfacing with computers is to make sure that the appropriate safeties are in place and that the system does not behave in a manner which a reasonable person might find unexpected. This is common practice for any sort of professional engineering effort. As an example, you aren't going to go out there and find yourself a chainsaw being sold new without a guard. It might be removable, but the default is to include it. Why? Well, because there is a considerable chance of damage to the user without it.

Likewise with a file system on a device which might cache a data write for as long as thirty seconds while being easily removable. In this case, the user may write the file and seconds later remove the device. Many folks out there behave in this manner. It really doesn't matter to them that they have a copy of the last save they did two hours ago; what they want and expect is that the most recent data they saved actually be on the USB stick for them to retrieve. What you are suggesting is that it is better to lose that data when it could have been avoided.

I would personally suggest that it is better to have default behavior which is not surprising, along with more advanced behavior for those who have bothered to read the manual. In Windows' case, the write cache can be turned on, it is not "unchangeable", and those who have educated themselves use it. I seldom turn it on unless I'm doing heavy I/O to a USB hard drive, otherwise the performance difference is just not that great.

Regards,
Greg
On February 13, 2009 12:20:21 PM -0500 Miles Nordin <carton at Ivy.NET> wrote:>>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes: > > >> Dropping a flush-cache command is just as bad as dropping a > >> write. > > fc> Not that it matters, but it seems obvious that this is wrong > fc> or anyway an exaggeration. Dropping a flush-cache just means > fc> that you have to wait until the device is quiesced before the > fc> data is consistent. > > fc> Dropping a write is much much worse. > > backwards i think. Dropping a flush-cache is WORSE than dropping the > flush-cache plus all writes after the flush-cache. The problem that > causes loss of whole pools rather than loss of recently-written data > isn''t that you''re writing too little. It''s that you''re dropping the > barrier and misordering the writes. consequently you lose *everything > you''ve ever written,* which is much worse than losing some recent > writes, even a lot of them.Who said dropping a flush-cache means dropping any subsequent writes, or misordering writes? If you''re misordering writes isn''t that a completely different problem? Even then, I don''t see how it''s worse than DROPPING a write. The data eventually gets to disk, and at that point in time, the disk is consistent. When dropping a write, the data never makes it to disk, ever. In the face of a power loss, of course these result in the same problem, but even without a power loss the drop of a write is "catastrophic". -frank
On February 13, 2009 12:10:08 PM -0500 Miles Nordin <carton at Ivy.NET> wrote:> please, no more of this garbage, no more hidden unchangeable automatic > condescending behavior. The whole format vs rmformat mess is just > ridiculous.thank you.
On February 13, 2009 12:41:12 PM -0500 Miles Nordin <carton at Ivy.NET> wrote:>>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes: > > fc> if you have 100TB of data, wouldn''t you have a completely > fc> redundant storage network > > If you work for a ponderous leaf-eating brontosorous maybe. If your > company is modern I think having such an oddly large amount of data in > one pool means you''d more likely have 70 whitebox peecees using > motherboard ethernet/sata only, connected to a mesh of unmanaged L2 > switches (of some peculiar brand that happens to work well.) There > will always be one or two peecees switched off, and constantly > something will be resilvering. The home user case is not really just > for home users. I think a lot of people are tired of paying quadruple > for stuff that still breaks, even serious people.oh i dunno. i recently worked for a company that practically defines modern and we had multiples of 100TB of data. Like you said, not all in one place, but any given piece was fully redundant (well, if you count RAID-5 as "fully" ... but I''m really referring to the infrastructure). I can''t imagine it any other way ... the cost of not having redundancy in the face of a failure is so much higher compared to the cost of building in that redundancy. Also I''m not sure how you get 1 pool with more than 1 peecee as zfs is not a cluster fs. So what you are talking about is multiple pools, and in that case if you do lose one (not redundant for whatever reason) you only have to restore a fraction of the 100TB from backup.> fc> Isn''t this easily worked around by having UPS power in > fc> addition to whatever the data center supplies? > > In NYC over the last five years the power has been more reliable going > into my UPS than coming out of it. The main reason for having a UPS > is wiring maintenance. And the most important part of the UPS is the > externally-mounted bypass switch because the UPS also needs > maintenance. UPS has never _solved_ anything, it always just helps. > so in the end we have to count on the software''s graceful behavior, > not on absolutes.I can''t say I agree about the UPS, however I''ve already been pretty forthright that UPS, etc. isn''t the answer to the problem, just a mitigating factor to the root problem. -frank
Dick Hoogendijk
2009-Feb-13 18:09 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009 17:53:00 +0100, Eric D. Mudama <edmudama at bounceswoosh.org> wrote:
> On Fri, Feb 13 at 9:14, Neil Perrin wrote:
>> Having a separate intent log on good hardware will not prevent
>> corruption on a pool with bad hardware. By "good" I mean hardware
>> that correctly flushes its write caches when requested.
>
> Can someone please name a specific piece of bad hardware?

Or better still, name a few -GOOD- ones.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | SunOS sxce snv107++
+ All that's really worth doing is what we do for others (Lewis Carroll)
>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes:fc> If you''re misordering writes fc> isn''t that a completely different problem? no. ignoring the flush cache command causes writes to be misordered. fc> Even then, I don''t see how it''s worse than DROPPING a write. fc> The data eventually gets to disk, and at that point in time, fc> the disk is consistent. When dropping a write, the data never fc> makes it to disk, ever. If you drop the flush cache command and every write after the flush cache command, yeah yeah it''s bad, but in THAT case, the disk is still always consistent because no writes have been misordered. fc> In the face of a power loss, of course these result in the fc> same problem, no, it''s completely different in a power loss, which is exactly the point. If you pull the cord while the disk is inconsistent, you may lose the entire pool. If the disk is never inconsistent because you''ve never misordered writes, you will only lose recent write activity. Losing everything you''ve ever written is usually much worse than losing what you''ve written recently. yeah yeah some devil''s advocate will toss in, ``i *need* some consistency promises or else it''s better that the pool its hand and say `broken, restore backup please'' even if the hand-raising comes in the form of losing the entire pool,'''' well in that case neither one is acceptable. But if your requirements are looser, then dropping a flush cache command plus every write after the flush cache command is much better than just ignoring the flush cache command. of course, that is a weird kind of failure that never happens. I described it just to make a point, to argue against this overly-simple idea ``every write is precious. let''s do them as soon as possible because there could be Valuable Business Data inside the writes! we don''t want to lose anything Valuable!'''' The part of SYNC CACHE that''s causing people to lose entire pools isn''t the ``hurry up! write faster!'''' part of the command, such that without it you still get your precious writes, just a little slower. NO. It''s the ``control the order of writes'''' part that''s important for integrity on a single-device vdev. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090213/d9427a4c/attachment.bin>
On February 13, 2009 1:10:55 PM -0500 Miles Nordin <carton at Ivy.NET> wrote:>>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes: > > fc> If you''re misordering writes > fc> isn''t that a completely different problem? > > no. ignoring the flush cache command causes writes to be misordered.oh. can you supply a reference or if you have the time, some more explanation? (or can someone else confirm this.) my understanding (weak, admittedly) is that drives will reorder writes on their own, and this is generally considered normal behavior. so to guarantee consistency *in the face of some kind of failure like a power loss*, we have write barriers. flush-cache is a stronger kind of write barrier. now that i think more, i suppose yes if you ignore the flush cache, then writes before and after the flush cache could be misordered, however it''s the same as if there were no flush cache at all, and again as long as the drive has power and you can quiesce it then the data makes it to disk, and all is consistent and well. yes? whereas if you drop a write, well it''s gone off into a black hole.> fc> Even then, I don''t see how it''s worse than DROPPING a write. > fc> The data eventually gets to disk, and at that point in time, > fc> the disk is consistent. When dropping a write, the data never > fc> makes it to disk, ever. > > If you drop the flush cache command and every write after the flush > cache command, yeah yeah it''s bad, but in THAT case, the disk is still > always consistent because no writes have been misordered.why would dropping a flush cache imply dropping every write after the flush cache?> fc> In the face of a power loss, of course these result in the > fc> same problem, > > no, it''s completely different in a power loss, which is exactly the point. > > If you pull the cord while the disk is inconsistent, you may lose the > entire pool. If the disk is never inconsistent because you''ve never > misordered writes, you will only lose recent write activity. Losing > everything you''ve ever written is usually much worse than losing what > you''ve written recently.yeah, as soon as i wrote that i realized my error, so thank you and i agree on that point. *in the event of a power loss* being inconsistent is a worse problem. -frank
On February 13, 2009 10:29:05 AM -0800 Frank Cusack <fcusack at fcusack.com> wrote:> On February 13, 2009 1:10:55 PM -0500 Miles Nordin <carton at Ivy.NET> wrote: >>>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes: >> >> fc> If you''re misordering writes >> fc> isn''t that a completely different problem? >> >> no. ignoring the flush cache command causes writes to be misordered. > > oh. can you supply a reference or if you have the time, some more > explanation? (or can someone else confirm this.)uhh ... that question can be ignored as i answered it myself below. sorry if i''m must being noisy now.> my understanding (weak, admittedly) is that drives will reorder writes > on their own, and this is generally considered normal behavior. so > to guarantee consistency *in the face of some kind of failure like a > power loss*, we have write barriers. flush-cache is a stronger kind > of write barrier. > > now that i think more, i suppose yes if you ignore the flush cache, > then writes before and after the flush cache could be misordered, > however it''s the same as if there were no flush cache at all, and > again as long as the drive has power and you can quiesce it then > the data makes it to disk, and all is consistent and well. yes?-frank
>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes:fc> why would dropping a flush cache imply dropping every write fc> after the flush cache? it wouldn''t and probably never does. It was an imaginary scenario invented to argue with you and to agree with the guy in the USB bug who said ``dropping a cache flush command is as bad as dropping a write.'''' fc> oh. can you supply a reference or if you have the time, some fc> more explanation? (or can someone else confirm this.) I posted something long a few days ago that I need to revisit. The problem is, I don''t actually understand how the disk commands work, so I was talking out my ass. Although I kept saying, ``I''m not sure it actually works this way,'''' my saying so doesn''t help anyone who spends the time to read it and then gets a bunch of mistaken garbage stuck in his head, which people who actually recognize as garbage are too busy to correct. It''d be better for everyone if I didn''t do that. On the other hand, I think there''s some worth to dreaming up several possibilities of what I fantisize the various commands might mean or do, rather than simply reading one of the specs to get the one right answer, because from what people in here say it soudns as though implementors of actual systems based on the SCSI commandset live in this same imaginary world of fantastic and multiple realities without any meaningful review or accountability that I do. (disks, bridges, iSCSI targets and initiators, VMWare/VBox storage, ...) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20090213/c0f56fd4/attachment.bin>
Superb news, thanks Jeff. Having that will really raise ZFS up a notch, and align it much better with people's expectations. I assume it'll work via zpool import, and let the user know what's gone wrong?

If you think back to this case, imagine how different the user's response would have been if, instead of being unable to mount the pool, ZFS had turned around and said: "This pool was not unmounted cleanly, and data has been lost. Do you want to restore your pool to the last viable state: (timestamp goes here)?"

Something like that will have people praising ZFS' ability to safeguard their data, and the way it recovers even after system crashes or when hardware has gone wrong. You could even have a "common causes of this are..." message, or a link to an online help article if you wanted people to be really impressed.
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2009-Feb-13 19:41 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Ross wrote:> > Something like that will have people praising ZFS'' ability to > safeguard their data, and the way it recovers even after system > crashes or when hardware has gone wrong. You could even have a > "common causes of this are..." message, or a link to an online help > article if you wanted people to be really impressed.I see a career in politics for you. Barring an operating system implementation bug, the type of problem you are talking about is due to improperly working hardware. Irreversibly reverting to a previous checkpoint may or may not obtain the correct data. Perhaps it will produce a bunch of checksum errors. There are already people praising ZFS'' ability to safeguard their data, and the way it recovers even after system crashes or when hardware has gone wrong. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Nicolas Williams
2009-Feb-13 20:00 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, Feb 13, 2009 at 10:29:05AM -0800, Frank Cusack wrote:
> On February 13, 2009 1:10:55 PM -0500 Miles Nordin <carton at Ivy.NET> wrote:
> >>>>>> "fc" == Frank Cusack <fcusack at fcusack.com> writes:
> >
> >  fc> If you're misordering writes
> >  fc> isn't that a completely different problem?
> >
> > no. ignoring the flush cache command causes writes to be misordered.
>
> oh. can you supply a reference or if you have the time, some more
> explanation? (or can someone else confirm this.)

Ordering matters for atomic operations, and filesystems are full of those. Now, if ordering is broken but the writes all eventually hit the disk, then no one will notice. But if power fails, or a partition occurs (cables get pulled, network partitions affect an iSCSI connection, ...), then bad things happen.

For ZFS the easiest way to ameliorate this is the txg fallback fix that Jeff Bonwick has said is now a priority. And if ZFS guarantees no block re-use until N txgs pass after a block is freed, then the fallback can be of up to N txgs, which gives you a decent chance that you'll recover your pool in the face of buggy devices; but for each discarded txg you lose that transaction's writes, so you lose data incrementally. (The larger N is, the better your chance that the oldest of the last N txgs' writes will all have hit the disk in spite of the disk's lousy cache behaviors.)

The next question is how to do the fallback, UI-wise. Should it ever be automatic? A pool option for that would be nice (I'd use it on all-USB pools). If/when not automatic, how should the user/admin be informed of the failure to open the pool and of the option to fall back on an older txg (with data loss)? (For non-removable pools imported at boot time the answer is that the service will fail, causing sulogin to be invoked so you can fix the problem on console. For removable pools there should be a GUI.)

Nico
-- 
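(As a thought experiment, here is a minimal sketch of what that fallback could look like. It is my own toy code under invented names -- not Jeff's fix and not how the real import path is written: try the newest uberblock, verify the tree it references by checksum, and step back one txg at a time, accepting that each step discards that txg's writes.)

import zlib

def cksum(data):
    return zlib.crc32(data)

def tree_verifies(uberblock, blocks):
    # an uberblock is usable only if every block it references is present
    # and matches the checksum recorded for it
    for addr, want in uberblock["refs"].items():
        data = blocks.get(addr)
        if data is None or cksum(data) != want:
            return False
    return True

def open_pool(uberblocks, blocks, max_fallback=3):
    # newest txg first; each step backwards discards that txg's writes
    candidates = sorted(uberblocks, key=lambda u: u["txg"], reverse=True)
    for ub in candidates[:max_fallback + 1]:
        if tree_verifies(ub, blocks):
            return ub
    raise IOError("no usable uberblock within the fallback window")

# example: txg 102's metadata never left the drive's cache, txg 101's did
blocks = {"m1": b"metadata v101", "d1": b"data v101"}
uberblocks = [
    {"txg": 101, "refs": {"m1": cksum(b"metadata v101"), "d1": cksum(b"data v101")}},
    {"txg": 102, "refs": {"m2": cksum(b"metadata v102")}},   # m2 was lost
]
print("pool opened at txg", open_pool(uberblocks, blocks)["txg"])   # -> 101

Note that this only works if the blocks txg 101 references have not been recycled in the meantime, which is exactly where the no-reuse-for-N-txgs guarantee comes in.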
On Fri, Feb 13, 2009 at 7:41 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Fri, 13 Feb 2009, Ross wrote: >> >> Something like that will have people praising ZFS'' ability to safeguard >> their data, and the way it recovers even after system crashes or when >> hardware has gone wrong. You could even have a "common causes of this >> are..." message, or a link to an online help article if you wanted people to >> be really impressed. > > I see a career in politics for you. Barring an operating system > implementation bug, the type of problem you are talking about is due to > improperly working hardware. Irreversibly reverting to a previous > checkpoint may or may not obtain the correct data. Perhaps it will produce > a bunch of checksum errors.Yes, the root cause is improperly working hardware (or an OS bug like 6424510), but with ZFS being a copy on write system, when errors occur with a recent write, for the vast majority of the pools out there you still have huge amounts of data that is still perfectly valid and should be accessible. Unless I''m misunderstanding something, reverting to a previous checkpoint gets you back to a state where ZFS knows it''s good (or at least where ZFS can verify whether it''s good or not). You have to consider that even with improperly working hardware, ZFS has been checksumming data, so if that hardware has been working for any length of time, you *know* that the data on it is good. Yes, if you have databases or files there that were mid-write, they will almost certainly be corrupted. But at least your filesystem is back, and it''s in as good a state as it''s going to be given that in order for your pool to be in this position, your hardware went wrong mid-write. And as an added bonus, if you''re using ZFS snapshots, now your pool is accessible, you have a bunch of backups available so you can probably roll corrupted files back to working versions. For me, that is about as good as you can get in terms of handling a sudden hardware failure. Everything that is known to be saved to disk is there, you can verify (with absolute certainty) whether data is ok or not, and you have backup copies of damaged files. In the old days you''d need to be reverting to tape backups for both of these, with potentially hours of downtime before you even know where you are. Achieving that in a few seconds (or minutes) is a massive step forwards.> There are already people praising ZFS'' ability to safeguard their data, and > the way it recovers even after system crashes or when hardware has gone > wrong.Yes there are, but the majority of these are praising the ability of ZFS checksums to detect bad data, and to repair it when you have redundancy in your pool. I''ve not seen that many cases of people praising ZFS'' recovery ability - uberblock problems seem to have a nasty habit of leaving you with tons of good, checksummed data on a pool that you can''t get to, and while many hardware problems are dealt with, others can hang your entire pool.> > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > >
David Collier-Brown
2009-Feb-13 20:23 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Bob Friesenhahn wrote:> On Fri, 13 Feb 2009, Ross wrote: >> >> Something like that will have people praising ZFS'' ability to >> safeguard their data, and the way it recovers even after system >> crashes or when hardware has gone wrong. You could even have a >> "common causes of this are..." message, or a link to an online help >> article if you wanted people to be really impressed. > > I see a career in politics for you. Barring an operating system > implementation bug, the type of problem you are talking about is due to > improperly working hardware. Irreversibly reverting to a previous > checkpoint may or may not obtain the correct data. Perhaps it will > produce a bunch of checksum errors.Actually that''s a lot like FMA replies when it sees a problem, telling the person what happened and pointing them to a web page which can be updated with the newest information on the problem. That''s a good spot for "This pool was not unmounted cleanly due to a hardware fault and data has been lost. The "<name of timestamp>" line contains the date which can be recovered to. Use the command # zfs reframbulocate <this> <that> -t <timestamp> to revert to <timestamp> --dave -- David Collier-Brown | Always do right. This will gratify Sun Microsystems, Toronto | some people and astonish the rest davecb at sun.com | -- Mark Twain cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
Bob Friesenhahn
2009-Feb-13 20:24 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Ross Smith wrote:
>
> You have to consider that even with improperly working hardware, ZFS
> has been checksumming data, so if that hardware has been working for
> any length of time, you *know* that the data on it is good.

You only know this if the data has previously been read.

Assume that the device temporarily stops physically writing, but otherwise responds normally to ZFS. Then the device starts writing again (including a recent uberblock), but with a large gap in the writes. Then the system loses power, or crashes. What happens then?

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Fri, 13 Feb 2009, Ross Smith wrote: >> >> You have to consider that even with improperly working hardware, ZFS >> has been checksumming data, so if that hardware has been working for >> any length of time, you *know* that the data on it is good. > > You only know this if the data has previously been read. > > Assume that the device temporarily stops pysically writing, but otherwise > responds normally to ZFS. Then the device starts writing again (including a > recent uberblock), but with a large gap in the writes. Then the system > loses power, or crashes. What happens then?Well in that case you''re screwed, but if ZFS is known to handle even corrupted pools automatically, when that happens the immediate response on the forums is going to be "something really bad has happened to your hardware", followed by troubleshooting to find out what. Instead of the response now, where we all know there''s every chance the data is ok, and just can''t be gotten to without zdb. Also, that''s a pretty extreme situation since you''d need a device that is being written to but not read from to fail in this exact way. It also needs to have no scrubbing being run, so the problem has remained undetected. However, even in that situation, if we assume that it happened and that these recovery tools are available, ZFS will either report that your pool is seriously corrupted, indicating a major hardware problem (and ZFS can now state this with some confidence), or ZFS will be able to open a previous uberblock, mount your pool and begin a scrub, at which point all your missing writes will be found too and reported. And then you can go back to your snapshots. :-D> > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > >
Richard Elling
2009-Feb-13 20:47 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Greg Palmer wrote:> Miles Nordin wrote: >> gm> That implies that ZFS will have to detect removable devices >> gm> and treat them differently than fixed devices. >> >> please, no more of this garbage, no more hidden unchangeable automatic >> condescending behavior. The whole format vs rmformat mess is just >> ridiculous. And software and hardware developers alike have both >> proven themselves incapable of settling on a definition of >> ``removeable'''' that fits with actual use-cases like: FC/iSCSI; >> hot-swappable SATA; adapters that have removeable sockets on both ends >> like USB-to-SD, firewire CD-ROM''s, SATA/SAS port multipliers, and so >> on. > Since this discussion is taking place in the context of someone > removing a USB stick I think you''re confusing the issue by dragging in > other technologies. Let''s keep this in the context of the posts > preceding it which is how USB devices are treated. I would argue that > one of the first design goals in an environment where you can expect > people who are not computer professionals to be interfacing with > computers is to make sure that the appropriate safeties are in place > and that the system does not behave in a manner which a reasonable > person might find unexpected.It has been my experience that USB sticks use FAT, which is an ancient file system which contains few of the features you expect from modern file systems. As such, it really doesn''t do any write caching. Hence, it seems to work ok for casual users. I note that neither NTFS, ZFS, reiserfs, nor many of the other, high performance file systems are used by default for USB devices. Could it be that anyone not using FAT for USB devices is straining against architectural limits? -- richard
On Fri, Feb 13, 2009 at 8:24 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 13 Feb 2009, Ross Smith wrote:
>> You have to consider that even with improperly working hardware, ZFS
>> has been checksumming data, so if that hardware has been working for
>> any length of time, you *know* that the data on it is good.
>
> You only know this if the data has previously been read.
>
> Assume that the device temporarily stops physically writing, but otherwise
> responds normally to ZFS. Then the device starts writing again (including a
> recent uberblock), but with a large gap in the writes. Then the system
> loses power, or crashes. What happens then?

Hey Bob,

Thinking about this a bit more, you've given me an idea: would it be worth ZFS occasionally reading previous uberblocks from the pool, just to check they are there and working ok?

I wonder if you could do this after a few uberblocks have been written. It would seem to be a good way of catching devices that aren't writing correctly early on, as well as a way of guaranteeing that previous uberblocks are available to roll back to should a write go wrong.

I also wonder what the upper limit for this kind of write failure is going to be. I've seen 30 second delays mentioned in this thread. How often are uberblocks written? Is there any guarantee that we'll always have more than 30 seconds' worth of uberblocks on a drive? Should ZFS be set so that it keeps either a given number of uberblocks, or 5 minutes' worth of uberblocks, whichever is the larger?

Ross
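(A rough, back-of-envelope answer to the retention question, under two assumptions of my own: the labels hold a 128-slot uberblock ring, as Richard Elling describes in a later message, and a txg -- hence a new uberblock -- is written somewhere between every 5 and every 30 seconds depending on load and tunables.)

SLOTS = 128                      # assumed uberblock ring size per label
for txg_interval_s in (5, 30):   # assumed range of txg sync intervals
    history_min = SLOTS * txg_interval_s / 60.0
    print("txg every %2ds -> ring covers roughly %3.0f minutes"
          % (txg_interval_s, history_min))
# txg every  5s -> ring covers roughly  11 minutes
# txg every 30s -> ring covers roughly  64 minutes

If those assumptions hold, the ring already spans well beyond the 5-minute target; the open question is whether the recent entries point at trees that actually made it to the platter.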
Bob Friesenhahn
2009-Feb-13 20:57 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Ross Smith wrote:> > Also, that''s a pretty extreme situation since you''d need a device that > is being written to but not read from to fail in this exact way. It > also needs to have no scrubbing being run, so the problem has remained > undetected.On systems with a lot of RAM, 100% write is a pretty common situation since reads are often against data which are already cached in RAM. This is common when doing bulk data copies from one device to another (e.g. a backup from an "internal" pool to a USB-based pool) since the necessary filesystem information for the destination filesystem can be cached in memory for quick access rather than going to disk. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn
2009-Feb-13 20:59 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Ross Smith wrote:> Thinking about this a bit more, you''ve given me an idea: Would it be > worth ZFS occasionally reading previous uberblocks from the pool, just > to check they are there and working ok?That sounds like a good idea. However, how do you know for sure that the data returned is not returned from a volatile cache? If the hardware is ignoring cache flush requests, then any data returned may be from a volatile cache. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Nicolas Williams
2009-Feb-13 21:09 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, Feb 13, 2009 at 02:00:28PM -0600, Nicolas Williams wrote:> Ordering matters for atomic operations, and filesystems are full of > those.Also, note that ignoring barriers is effectively as bad as dropping writes if there''s any chance that some writes will never hit the disk because of, say, power failures. Imagine 100 txgs, but some writes from the first txg never hitting the disk because the drive keeps them in the cache without flushing them for too long, then you pull out the disk, or power fails -- in that case not even fallback to older txgs will help you, there''d be nothing that ZFS could do to help you. Of course, presumably even with most lousy drives you''d still have to be quite unlucky to lose writes written more than N txgs ago, for some value of N. But the point stands; what you lose will be a matter of chance (and it could well be whole datasets) given the kinds of devices we''ve been discussing. Nico --
Richard Elling wrote:> Greg Palmer wrote: >> Miles Nordin wrote: >>> gm> That implies that ZFS will have to detect removable devices >>> gm> and treat them differently than fixed devices. >>> >>> please, no more of this garbage, no more hidden unchangeable automatic >>> condescending behavior. The whole format vs rmformat mess is just >>> ridiculous. And software and hardware developers alike have both >>> proven themselves incapable of settling on a definition of >>> ``removeable'''' that fits with actual use-cases like: FC/iSCSI; >>> hot-swappable SATA; adapters that have removeable sockets on both ends >>> like USB-to-SD, firewire CD-ROM''s, SATA/SAS port multipliers, and so >>> on. >> Since this discussion is taking place in the context of someone >> removing a USB stick I think you''re confusing the issue by dragging >> in other technologies. Let''s keep this in the context of the posts >> preceding it which is how USB devices are treated. I would argue that >> one of the first design goals in an environment where you can expect >> people who are not computer professionals to be interfacing with >> computers is to make sure that the appropriate safeties are in place >> and that the system does not behave in a manner which a reasonable >> person might find unexpected. > > It has been my experience that USB sticks use FAT, which is an ancient > file system which contains few of the features you expect from modern > file systems. As such, it really doesn''t do any write caching. Hence, it > seems to work ok for casual users. I note that neither NTFS, ZFS, > reiserfs, > nor many of the other, high performance file systems are used by default > for USB devices. Could it be that anyone not using FAT for USB devices > is straining against architectural limits?I''d follow that up by saying that those of us who do use something other that FAT with USB devices have a reasonable understanding of the limitations of those devices. Using ZFS is non-trivial from a typical user''s perspective. The device has to be identified and the pool created. When a USB device is connected, the pool has to be manually imported before it can be used. Import/export could be fully integrated with gnome. Once that is in place, using a ZFS formatted USB stick should be just as "safe" as a FAT formatted one. -- Ian.
You don''t, but that''s why I was wondering about time limits. You have to have a cut off somewhere, but if you''re checking the last few minutes of uberblocks that really should cope with a lot. It seems like a simple enough thing to implement, and if a pool still gets corrupted with these checks in place, you can absolutely, positively blame it on the hardware. :D However, I''ve just had another idea. Since the uberblocks are pretty vital in recovering a pool, and I believe it''s a fair bit of work to search the disk to find them. Might it be a good idea to allow ZFS to store uberblock locations elsewhere for recovery purposes? This could be as simple as a USB stick plugged into the server, a separate drive, or a network server. I guess even the ZIL device would work if it''s separate hardware. But knowing the locations of the uberblocks would save yet more time should recovery be needed. On Fri, Feb 13, 2009 at 8:59 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Fri, 13 Feb 2009, Ross Smith wrote: > >> Thinking about this a bit more, you''ve given me an idea: Would it be >> worth ZFS occasionally reading previous uberblocks from the pool, just >> to check they are there and working ok? > > That sounds like a good idea. However, how do you know for sure that the > data returned is not returned from a volatile cache? If the hardware is > ignoring cache flush requests, then any data returned may be from a volatile > cache. > > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ > >
Richard Elling wrote:> Greg Palmer wrote: >> Miles Nordin wrote: >>> gm> That implies that ZFS will have to detect removable devices >>> gm> and treat them differently than fixed devices. >>> >>> please, no more of this garbage, no more hidden unchangeable automatic >>> condescending behavior. The whole format vs rmformat mess is just >>> ridiculous. And software and hardware developers alike have both >>> proven themselves incapable of settling on a definition of >>> ``removeable'''' that fits with actual use-cases like: FC/iSCSI; >>> hot-swappable SATA; adapters that have removeable sockets on both ends >>> like USB-to-SD, firewire CD-ROM''s, SATA/SAS port multipliers, and so >>> on. >> Since this discussion is taking place in the context of someone >> removing a USB stick I think you''re confusing the issue by dragging >> in other technologies. Let''s keep this in the context of the posts >> preceding it which is how USB devices are treated. I would argue that >> one of the first design goals in an environment where you can expect >> people who are not computer professionals to be interfacing with >> computers is to make sure that the appropriate safeties are in place >> and that the system does not behave in a manner which a reasonable >> person might find unexpected. > > It has been my experience that USB sticks use FAT, which is an ancient > file system which contains few of the features you expect from modern > file systems. As such, it really doesn''t do any write caching. Hence, it > seems to work ok for casual users. I note that neither NTFS, ZFS, > reiserfs, > nor many of the other, high performance file systems are used by default > for USB devices. Could it be that anyone not using FAT for USB devices > is straining against architectural limits? > -- richardThe default disabling of caching with Windows I mentioned is the same for either FAT or NTFS file systems. My personal guess would be that it''s purely an effort to prevent software errors in the interface between the chair and keyboard. :-) I think a lot of users got trained in how to use a floppy disc and once they were trained, when they encountered the USB stick, they continued to treat it as an instance of the floppy class. This rubbed off on those around them. I can''t tell you how many users have given me a blank stare and told me "But the light was out" when I saw them yank a USB stick out and mentioned it was a bad idea. Regards, Greg
Bob Friesenhahn
2009-Feb-13 22:21 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Ross Smith wrote:> However, I''ve just had another idea. Since the uberblocks are pretty > vital in recovering a pool, and I believe it''s a fair bit of work to > search the disk to find them. Might it be a good idea to allow ZFS to > store uberblock locations elsewhere for recovery purposes?Perhaps it is best to leave decisions on these issues to the ZFS designers who know how things work. Previous descriptions from people who do know how things work didn''t make it sound very difficult to find the last 20 uberblocks. It sounded like they were at known points for any given pool. Those folks have surely tired of this discussion by now and are working on actual code rather than reading idle discussion between several people who don''t know the details of how things work. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Fri, 13 Feb 2009, Ross Smith wrote:
>> However, I've just had another idea. Since the uberblocks are pretty
>> vital in recovering a pool, and I believe it's a fair bit of work to
>> search the disk to find them. Might it be a good idea to allow ZFS to
>> store uberblock locations elsewhere for recovery purposes?
>
> Perhaps it is best to leave decisions on these issues to the ZFS designers
> who know how things work.
>
> Previous descriptions from people who do know how things work didn't make
> it sound very difficult to find the last 20 uberblocks. It sounded like
> they were at known points for any given pool.
>
> Those folks have surely tired of this discussion by now and are working on
> actual code rather than reading idle discussion between several people who
> don't know the details of how things work.

People who "don't know how things work" often aren't tied down by the baggage of knowing how things work, which leads to creative solutions those who are weighed down didn't think of. I don't think it hurts in the least to throw out some ideas. If they aren't valid, it's not hard to ignore them and move on. It surely isn't a waste of anyone's time to spend 5 minutes reading a response and weighing whether the idea is valid or not.

--Tim
Richard Elling
2009-Feb-13 23:09 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Tim wrote:
> On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn
> <bfriesen at simple.dallas.tx.us> wrote:
>
>     On Fri, 13 Feb 2009, Ross Smith wrote:
>
>         However, I've just had another idea. Since the uberblocks are pretty
>         vital in recovering a pool, and I believe it's a fair bit of work to
>         search the disk to find them. Might it be a good idea to allow ZFS to
>         store uberblock locations elsewhere for recovery purposes?
>
>     Perhaps it is best to leave decisions on these issues to the ZFS
>     designers who know how things work.
>
>     Previous descriptions from people who do know how things work
>     didn't make it sound very difficult to find the last 20
>     uberblocks. It sounded like they were at known points for any
>     given pool.
>
>     Those folks have surely tired of this discussion by now and are
>     working on actual code rather than reading idle discussion between
>     several people who don't know the details of how things work.
>
> People who "don't know how things work" often aren't tied down by the
> baggage of knowing how things work. Which leads to creative solutions
> those who are weighed down didn't think of. I don't think it hurts in
> the least to throw out some ideas. If they aren't valid, it's not
> hard to ignore them and move on. It surely isn't a waste of anyone's
> time to spend 5 minutes reading a response and weighing if the idea is
> valid or not.

OTOH, anyone who has followed this discussion the last few times, looked at the on-disk format documents, or reviewed the source code would know that the uberblocks are kept in a 128-entry circular queue which is 4x redundant, with 2 copies each at the beginning and end of the vdev. Other metadata, by default, is 2x redundant and spatially diverse.

Clearly, the failure mode being hashed out here has resulted in the defeat of those protections. The only real question is how fast Jeff can roll out the feature to allow reverting to previous uberblocks. The procedure for doing this by hand has long been known, and was posted on this forum -- though it is tedious.
-- richard
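(For anyone trying to match that description against the on-disk format document, the sketch below reflects my reading of the spec, so treat the exact constants as assumptions rather than gospel: four 256 KiB labels, two at the front of each leaf vdev and two at the end, each finishing with a 128 KiB array of 1 KiB uberblock slots on 512-byte-sector devices -- that is the 128-entry ring, and the active uberblock is simply the slot with the highest txg that still checksums correctly.)

KiB = 1024
LABEL_SIZE   = 256 * KiB   # L0..L3, per my reading of the on-disk spec
UB_ARRAY_OFF = 128 * KiB   # after blank space, boot header and nvlist pairs
UB_SLOT_SIZE = 1 * KiB     # assuming 512-byte sectors
UB_SLOTS     = 128

def label_offsets(vdev_size):
    # two labels at the front of the device, two at the very end
    return [0, LABEL_SIZE, vdev_size - 2 * LABEL_SIZE, vdev_size - LABEL_SIZE]

def uberblock_offsets(vdev_size, txg):
    # byte offsets of the 4 copies of the uberblock written for this txg
    slot = txg % UB_SLOTS
    return [off + UB_ARRAY_OFF + slot * UB_SLOT_SIZE
            for off in label_offsets(vdev_size)]

vdev = 1024 ** 4                                             # a 1 TiB leaf vdev
print([hex(o) for o in label_offsets(vdev)])
print([hex(o) for o in uberblock_offsets(vdev, txg=4242)])   # slot 4242 % 128 == 18

So a recovery tool never has to hunt across the whole disk: every recent uberblock has four copies at well-known offsets, which is why the hand-recovery procedure Richard mentions is tedious but mechanical.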
Bob Friesenhahn
2009-Feb-14 01:58 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Tim wrote:> I don''t think it hurts in the least to throw out some ideas. If > they aren''t valid, it''s not hard to ignore them and move on. It > surely isn''t a waste of anyone''s time to spend 5 minutes reading a > response and weighing if the idea is valid or not.Today I sat down at 9:00 AM to read the new mail for the day and did not catch up until five hours later. Quite a lot of the reading was this (now) useless discussion thread. It is now useless since after five hours of reading, there were no ideas expressed that had not been expressed before. With this level of overhead, I am surprise that there is any remaining development motion on ZFS at all. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On February 13, 2009 7:58:51 PM -0600 Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> With this level of overhead, I am surprise that there is any remaining > development motion on ZFS at all.come on now. with all due respect, you are attempting to stifle relevant discussion and that is, well, bordering on ridiculous. i sure have learned a lot from this thread. now of course that is meaningless because i don''t and almost certainly never will contribute to zfs, but i assume there are others who have learned from this thread. that''s definitely a good thing. this thread also appears to be the impetus to change priorities on zfs development.> Today I sat down at 9:00 AM to read the new mail for the day and did not > catch up until five hours later. Quite a lot of the reading was this > (now) useless discussion thread. It is now useless since after five > hours of reading, there were no ideas expressed that had not been > expressed before.lastly, WOW! if this thread is worthless to you, learn to use the delete button. especially if you read that slowly. i know i certainly couldn''t keep up with all my incoming mail if i read everything. i''m sorry to berate you, as you do make very valuable contributions to the discussion here, but i take offense at your attempts to limit discussion simply because you know everything there is to know about the subject. great, now i am guilty of being "overhead". -frank
James C. McPherson
2009-Feb-14 05:27 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Hi Bob, On Fri, 13 Feb 2009 19:58:51 -0600 (CST) Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:> On Fri, 13 Feb 2009, Tim wrote: > > > I don''t think it hurts in the least to throw out some ideas. If > > they aren''t valid, it''s not hard to ignore them and move on. It > > surely isn''t a waste of anyone''s time to spend 5 minutes reading a > > response and weighing if the idea is valid or not. > > Today I sat down at 9:00 AM to read the new mail for the day and did > not catch up until five hours later. Quite a lot of the reading was > this (now) useless discussion thread. It is now useless since after > five hours of reading, there were no ideas expressed that had not > been expressed before.I''ve found this thread to be like watching a car accident, and also really frustrating due to the inability to use search engines on the part of many posters.> With this level of overhead, I am surprise that there is any > remaining development motion on ZFS at all.Good thing the ZFS developers have mail filters :-) cheers, James -- Senior Kernel Software Engineer, Solaris Sun Microsystems http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Bob Friesenhahn
2009-Feb-14 18:00 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
On Fri, 13 Feb 2009, Frank Cusack wrote:> > i''m sorry to berate you, as you do make very valuable contributions to > the discussion here, but i take offense at your attempts to limit > discussion simply because you know everything there is to know about > the subject.The point is that those of us in the chattering class (i.e. people like you and me) clearly know very little about the subject, and continuting to chatter among ourselves is soon no longer rewarding. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Hey guys, I''ll let this die in a sec, but I just wanted to say that I''ve gone and read the on disk document again this morning, and to be honest Richard, without the description you just wrote, I really wouldn''t have known that uberblocks are in a 128 entry circular queue that''s 4x redundant. Please understand that I''m not asking for answers to these notes, this post is purely to illustrate to you ZFS guys that much as I appreciate having the ZFS docs available, they are very tough going for anybody who isn''t a ZFS developer. I consider myself well above average in IT ability, and I''ve really spent quite a lot of time in the past year reading around ZFS, but even so I would definitely have come to the wrong conclusion regarding uberblocks. Richard''s post I can understand really easily, but in the on disk format docs, that information is spread over 7 pages of really quite technical detail, and to be honest, for a user like myself raises as many questions as it answers: On page 6 I learn that labels are stored on each vdev, as well as each disk. So there will be a label on the pool, mirror (or raid group), and disk. I know the disk ones are at the start and end of the disk, and it sounds like the mirror vdev is in the same place, but where is the root vdev label? The example given doesn''t mention its location at all. Then, on page 7 it sounds like the entire label is overwriten whenever on-disk data is updated - "any time on-disk data is overwritten, there is potential for error". To me, it sounds like it''s not a 128 entry queue, but just a group of 4 labels, all of which are overwritten as data goes to disk. Then finally, on page 12 the uberblock is mentioned (although as an aside, the first time I read these docs I had no idea what the uberblock actually was). It does say that only one uberblock is active at a time, but with it being part of the label I''d just assume these were overwritten as a group.. And that''s why I''ll often throw ideas out - I can either rely on my own limited knowledge of ZFS to say if it will work, or I can take advantage of the excellent community we have here, and post the idea for all to see. It''s a quick way for good ideas to be improved upon, and bad ideas consigned to the bin. I''ve done it before in my rather lengthly ''zfs availability'' thread. My thoughts there were thrashed out nicely, with some quite superb additions (namely the concept of lop sided mirrors which I think are a great idea). Ross PS. I''ve also found why I thought you had to search for these blocks, it was after reading this thread where somebody used mdb to search a corrupt pool to try to recover data: http://opensolaris.org/jive/message.jspa?messageID=318009 On Fri, Feb 13, 2009 at 11:09 PM, Richard Elling <richard.elling at gmail.com> wrote:> Tim wrote: >> >> >> On Fri, Feb 13, 2009 at 4:21 PM, Bob Friesenhahn >> <bfriesen at simple.dallas.tx.us <mailto:bfriesen at simple.dallas.tx.us>> wrote: >> >> On Fri, 13 Feb 2009, Ross Smith wrote: >> >> However, I''ve just had another idea. Since the uberblocks are >> pretty >> vital in recovering a pool, and I believe it''s a fair bit of >> work to >> search the disk to find them. Might it be a good idea to >> allow ZFS to >> store uberblock locations elsewhere for recovery purposes? >> >> >> Perhaps it is best to leave decisions on these issues to the ZFS >> designers who know how things work. 
>> >> Previous descriptions from people who do know how things work >> didn''t make it sound very difficult to find the last 20 >> uberblocks. It sounded like they were at known points for any >> given pool. >> >> Those folks have surely tired of this discussion by now and are >> working on actual code rather than reading idle discussion between >> several people who don''t know the details of how things work. >> >> >> >> People who "don''t know how things work" often aren''t tied down by the >> baggage of knowing how things work. Which leads to creative solutions those >> who are weighed down didn''t think of. I don''t think it hurts in the least >> to throw out some ideas. If they aren''t valid, it''s not hard to ignore them >> and move on. It surely isn''t a waste of anyone''s time to spend 5 minutes >> reading a response and weighing if the idea is valid or not. > > OTOH, anyone who followed this discussion the last few times, has looked > at the on-disk format documents, or reviewed the source code would know > that the uberblocks are kept in an 128-entry circular queue which is 4x > redundant with 2 copies each at the beginning and end of the vdev. > Other metadata, by default, is 2x redundant and spatially diverse. > > Clearly, the failure mode being hashed out here has resulted in the defeat > of those protections. The only real question is how fast Jeff can roll out > the > feature to allow reverting to previous uberblocks. The procedure for doing > this by hand has long been known, and was posted on this forum -- though > it is tedious. > -- richard > >
On Fri, Feb 13, 2009 at 9:47 PM, Richard Elling <richard.elling at gmail.com> wrote:
> It has been my experience that USB sticks use FAT, which is an ancient
> file system which contains few of the features you expect from modern
> file systems. As such, it really doesn't do any write caching. Hence, it
> seems to work ok for casual users. I note that neither NTFS, ZFS, reiserfs,
> nor many of the other, high performance file systems are used by default
> for USB devices. Could it be that anyone not using FAT for USB devices
> is straining against architectural limits?

There are no architectural limits. USB sticks can be used with whatever you throw at them. On sticks I use to interchange data with Windows machines I have NTFS; on others, different filesystems: ZFS, ext4, btrfs, often encrypted at the block level. USB sticks are generally very simple -- no discard commands or other fancy stuff, but overall they are block devices just like discs, arrays, SSDs...

-- 
Tomasz Torcz
xmpp: zdzichubg at chrome.pl
Mario Goebbels wrote:
> One thing I'd like to see is an _easy_ option to fall back onto older
> uberblocks when the zpool went belly up for a silly reason. Something
> that doesn't involve esoteric parameters supplied to zdb.

Between uberblock updates, there may be many write operations to a data file, each requiring a copy-on-write operation. Some of those operations may reuse blocks that were metadata blocks pointed to by the previous uberblock, in which case the old uberblock points to a metadata tree full of garbage.

Jeff, you must have some idea of how to overcome this in your bugfix; would you care to share?

--Joe
Robert Milkowski
2009-Feb-24 19:41 UTC
[zfs-discuss] ZFS: unreliable for professional usage?
Hello Joe,

Monday, February 23, 2009, 7:23:39 PM, you wrote:

MJ> Mario Goebbels wrote:
>> One thing I'd like to see is an _easy_ option to fall back onto older
>> uberblocks when the zpool went belly up for a silly reason. Something
>> that doesn't involve esoteric parameters supplied to zdb.

MJ> Between uberblock updates, there may be many write operations to
MJ> a data file, each requiring a copy on write operation. Some of
MJ> those operations may reuse blocks that were metadata blocks
MJ> pointed to by the previous uberblock.
MJ> In which case the old uberblock points to a metadata tree full of garbage.

MJ> Jeff, you must have some idea on how to overcome this in your bugfix, would you care to share?

As was suggested on the list before, ZFS could keep a list of the blocks freed in the last N txgs, and as long as other blocks are still available it would not allocate any of those from the last N transactions.

-- 
Best regards,
Robert Milkowski
http://milek.blogspot.com
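(A minimal sketch of that idea, as a toy allocator of my own invention rather than anything in ZFS's metaslab code: blocks freed in txg T only become allocatable again at txg T + N, so every uberblock from the last N txgs still points at intact metadata and remains a valid rollback target.)

from collections import deque

class DeferredFreeAllocator:
    def __init__(self, nblocks, defer_txgs=3):
        self.free = set(range(nblocks))   # immediately allocatable blocks
        self.deferred = deque()           # (txg_freed, block)
        self.defer_txgs = defer_txgs

    def sync(self, txg):
        # called at the end of each txg: release frees that are old enough
        while self.deferred and txg - self.deferred[0][0] >= self.defer_txgs:
            _, blk = self.deferred.popleft()
            self.free.add(blk)

    def alloc(self):
        if not self.free:
            raise RuntimeError("out of space (deferred blocks not yet reusable)")
        return self.free.pop()

    def free_block(self, blk, txg):
        # not reusable until defer_txgs more txgs have synced
        self.deferred.append((txg, blk))

alloc = DeferredFreeAllocator(nblocks=8, defer_txgs=3)
b = alloc.alloc()
alloc.free_block(b, txg=100)     # an old metadata block freed in txg 100
alloc.sync(101); alloc.sync(102)
print(b in alloc.free)           # False: txgs 100-102 can still be rolled back to
alloc.sync(103)
print(b in alloc.free)           # True: the block is fair game again

With something along these lines, a txg fallback like the one sketched earlier in the thread can always step back up to defer_txgs transactions without landing on recycled metadata.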