Casper.Dik at Sun.COM wrote:
> I would suggest that you follow my recipe: not check the boot-archive
> during a reboot. And then report back. (I'm assuming that that will take
> several weeks)

We are back at square one; or, at the subject line.
I did a zpool status -v, everything was hunky dory.
Next, a power failure, 2 hours later, and this is what zpool status -v
thinks:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          c1d0s0    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        //etc/svc/repository-boot-20090419_174236

I know, the hard-core defenders of ZFS will repeat for the umpteenth
time that I should be grateful that ZFS can NOTICE and inform about the
problem.
Others might want to repeat that this is not supposed to happen in the
first place.

Reliability at power failure? That was my question, and I had to learn
that the answer is 'no'.
How about my proposal to always have a proper snapshot available? And
after some 4 days without any CKSUM error, how can yanking the power
cord mess up boot-stuff?

Uwe
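(For reference, the recovery path the ZFS-8000-8A message points at looks
roughly like the following. This is a hedged sketch only: it assumes the
damaged file is expendable, which seems plausible here since, as noted
later in the thread, the repository-boot-* file is created fresh at each
boot.)

   # restore the file from backup, or simply remove it if it is expendable
   rm //etc/svc/repository-boot-20090419_174236
   # re-verify the pool; once the bad blocks are gone, a scrub plus a
   # clear will normally drop the stale error record
   zpool scrub rpool
   zpool status -v rpool
   zpool clear rpool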
Casper.Dik at Sun.COM
2009-Apr-19 12:16 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
>We are back at square one; or, at the subject line.
>I did a zpool status -v, everything was hunky dory.
>Next, a power failure, 2 hours later, and this is what zpool status -v
>thinks:
>
>zpool status -v
>  pool: rpool
> state: ONLINE
>status: One or more devices has experienced an error resulting in data
>        corruption.  Applications may be affected.
>action: Restore the file in question if possible.  Otherwise restore the
>        entire pool from backup.
>   see: http://www.sun.com/msg/ZFS-8000-8A
> scrub: none requested
>config:
>
>        NAME        STATE     READ WRITE CKSUM
>        rpool       ONLINE       0     0     0
>          c1d0s0    ONLINE       0     0     0
>
>errors: Permanent errors have been detected in the following files:
>
>        //etc/svc/repository-boot-20090419_174236
>
>I know, the hard-core defenders of ZFS will repeat for the umpteenth
>time that I should be grateful that ZFS can NOTICE and inform about the
>problem.

:-)

The file is created on boot and I assume this was created directly after
the boot after the power-failure.

Am I correct in thinking that:
	the last boot happened on 2009/04/19_17:42:36
	the system hasn't rebooted since that time

>Others might want to repeat that this is not supposed to happen in the
>first place.

ZFS guarantees that this cannot happen, unless the hardware is bad.  Bad
means here "the hardware doesn't promise what ZFS believes the hardware
promises".

But anything can cause this:

	hardware problems:
		- bad memory
		- bad disk
		- bad disk controller
		- bad power supply

	software problems:
		- memory corruption through any odd driver
		- any part of the ZFS stack

My money would still be on a hardware problem.  I remember a particular
case where ZFS continuously found checksum errors; replacing the power
supply fixed that.

Casper
Casper.Dik at Sun.COM wrote:
>> We are back at square one; or, at the subject line.
>> I did a zpool status -v, everything was hunky dory.
>> Next, a power failure, 2 hours later, and this is what zpool status -v
>> thinks:
>>
>> zpool status -v
>>   pool: rpool
>>  state: ONLINE
>> status: One or more devices has experienced an error resulting in data
>>         corruption.  Applications may be affected.
>> action: Restore the file in question if possible.  Otherwise restore the
>>         entire pool from backup.
>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>  scrub: none requested
>> config:
>>
>>         NAME        STATE     READ WRITE CKSUM
>>         rpool       ONLINE       0     0     0
>>           c1d0s0    ONLINE       0     0     0
>>
>> errors: Permanent errors have been detected in the following files:
>>
>>         //etc/svc/repository-boot-20090419_174236
>>
>> I know, the hard-core defenders of ZFS will repeat for the umpteenth
>> time that I should be grateful that ZFS can NOTICE and inform about the
>> problem.
>>
>
> :-)
>
> The file is created on boot and I assume this was created directly after
> the boot after the power-failure.
>
> Am I correct in thinking that:
>	the last boot happened on 2009/04/19_17:42:36
>	the system hasn't rebooted since that time
>

Good guess, but wrong. Another two to go ... :)

>
>> Others might want to repeat that this is not supposed to happen in the
>> first place.
>>
>
> ZFS guarantees that this cannot happen, unless the hardware is bad.  Bad
> means here "the hardware doesn't promise what ZFS believes the hardware
> promises".
>
> But anything can cause this:
>
>	hardware problems:
>		- bad memory
>		- bad disk
>		- bad disk controller
>		- bad power supply
>
>	software problems:
>		- memory corruption through any odd driver
>		- any part of the ZFS stack
>
> My money would still be on a hardware problem.  I remember a particular
> case where ZFS continuously found checksum errors; replacing the power
> supply fixed that.
>

Chances are. Yet the Ubuntu double boot here never finds anything wrong,
crashes, etc.
And again, someone will inform me that this is the beauty of ZFS: that I
know of the corruption.

After a scrub, what I see is:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: scrub completed after 0h48m with 1 errors on Sun Apr 19 19:09:26 2009
config:

        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     1
          c1d0s0    ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

        <0xa6>:<0x4f002>

Which file to replace?
Seriously, what would a normal user be expected to do here? No, I don't
have a backup of a file that has only recently been created, true, at
17:42 on April 19th.
Reinstall? While everything was okay 12 hours ago, after some 30 crashes
due to power failures, that were - until recently - rectified with
crashes at boot, Failsafe, reboot.
A system that has been going up and down without much hassle for 1.5
years, both on OpenSolaris on UFS and Ubuntu?

(Let's not forget the thread started with my question "Why do I have to
Failsafe so frequently after a power failure, to correct a corrupted
boot archive?")

Uwe
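(A side note on the <0xa6>:<0x4f002> notation: the two hex numbers are the
dataset ID and the object number within that dataset, and they can often
be mapped back with zdb. A hedged sketch only - zdb output varies between
builds, the dataset name below is a placeholder, and 0xa6 is 166, 0x4f002
is 323586 in decimal.)

   # list the pool's datasets together with their IDs; look for ID 166
   zdb -d rpool
   # dump that object's details, including its path if it still exists
   zdb -dddd rpool/<dataset-with-ID-166> 323586

(If the object has already been freed - likely for a transient boot-time
file - zdb will find nothing, and the error entry typically disappears
once the blocks are released and a later scrub completes.)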
dick hoogendijk
2009-Apr-19 14:58 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Sun, 19 Apr 2009 18:15:31 +0800
Uwe Dippel <udippel at gmail.com> wrote:

> Reliability at power failure? That was my question, and I had to
> learn that the answer is 'no'.

Sorry Uwe, but the answer is yes. Assuming that your hardware is in
order. I've read quite some messages from you here recently and all of
them make me think you're no fan of ZFS at all. Why don't you quit using
it and focus a little more on installing SunStudio (which isn't that hard
to do; at least not as hard as you want us to believe it is in another
thread). All I ever had to do was start the installer (in a GUI) and
-all- software was placed where it was supposed to go.

> And after some 4 days without any CKSUM error, how can yanking the
> power cord mess boot-stuff?

Maybe because on the fifth day some hardware failure occurred? ;-)

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / opensolaris
+ All that's really worth doing is what we do for others (Lewis Carroll)
On 19-Apr-09, at 10:38 AM, Uwe Dippel wrote:

> Casper.Dik at Sun.COM wrote:
>>> We are back at square one; or, at the subject line.
>>> I did a zpool status -v, everything was hunky dory.
>>> Next, a power failure, 2 hours later, and this is what zpool
>>> status -v thinks:
>>>
>>> zpool status -v
>>>   pool: rpool
>>>  state: ONLINE
>>> status: One or more devices has experienced an error resulting in
>>>         data corruption.  Applications may be affected.
>>> action: Restore the file in question if possible.  Otherwise
>>>         restore the entire pool from backup.
>>>    see: http://www.sun.com/msg/ZFS-8000-8A
>>>  scrub: none requested
>>> config:
>>>
>>>         NAME        STATE     READ WRITE CKSUM
>>>         rpool       ONLINE       0     0     0
>>>           c1d0s0    ONLINE       0     0     0
>>>
>>> errors: Permanent errors have been detected in the following files:
>>>
>>>         //etc/svc/repository-boot-20090419_174236
>>>
>>> I know, the hard-core defenders of ZFS will repeat for the
>>> umpteenth time that I should be grateful that ZFS can NOTICE and
>>> inform about the problem.
>>>
>>
>> :-)
>>
>> The file is created on boot and I assume this was created directly
>> after the boot after the power-failure.
>>
>> Am I correct in thinking that:
>> 	the last boot happened on 2009/04/19_17:42:36
>> 	the system hasn't rebooted since that time
>>
>
> Good guess, but wrong. Another two to go ... :)
>
>>
>>> Others might want to repeat that this is not supposed to happen
>>> in the first place.
>>>
>>
>> ZFS guarantees that this cannot happen, unless the hardware is
>> bad.  Bad means here "the hardware doesn't promise what ZFS
>> believes the hardware promises".
>>
>> But anything can cause this:
>>
>> 	hardware problems:
>> 		- bad memory
>> 		- bad disk
>> 		- bad disk controller
>> 		- bad power supply
>>
>> 	software problems:
>> 		- memory corruption through any odd driver
>> 		- any part of the ZFS stack
>>
>> My money would still be on a hardware problem.  I remember a
>> particular case where ZFS continuously found checksum errors;
>> replacing the power supply fixed that.
>>
>
> Chances are. Yet the Ubuntu double boot here never finds anything
> wrong, crashes, etc.

Why should it? It isn't designed to do so.

> And again, someone will inform me that this is the beauty of ZFS:
> that I know of the corruption.
>
> After a scrub, what I see is:
>
> zpool status -v
>   pool: rpool
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise
>         restore the entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: scrub completed after 0h48m with 1 errors on Sun Apr 19
>         19:09:26 2009
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         rpool       ONLINE       0     0     1
>           c1d0s0    ONLINE       0     0     2
>
> errors: Permanent errors have been detected in the following files:
>
>         <0xa6>:<0x4f002>
>
> Which file to replace?

Have you thoroughly checked your hardware?

Why are you running a non-redundant pool?

--Toby

>
> Seriously, what would a normal user be expected to do here? No, I
> don't have a backup of a file that has only recently been created,
> true, at 17:42 on April 19th.
> Reinstall? While everything was okay 12 hours ago, after some 30
> crashes due to power failures, that were - until recently -
> rectified with crashes at boot, Failsafe, reboot.
> A system that has been going up and down without much hassle for
> 1.5 years, both on OpenSolaris on UFS and Ubuntu?
>
> (Let's not forget the thread started with my question "Why do I
> have to Failsafe so frequently after a power failure, to correct a
> corrupted boot archive?")
>
> Uwe
>
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
dick hoogendijk wrote:
> Why don't you quit using it
> and focus a little more on installing SunStudio (which isn't that hard
> to do; at least not as hard as you want us to believe it is in another
> thread). All I ever had to do was start the installer (in a GUI) and
> -all- software was placed where it was supposed to go.
>

Lucky you. So you doubt that I ran the ./installer, the GUI came up, and
in the end NetBeans wasn't there? Why should I make that up?? It took me
until here
http://docs.sun.com/app/docs/doc/820-2972/gabcd?a=view
to find the solution, even though the title didn't fit.

> Maybe because on the fifth day some hardware failure occurred? ;-)
>

That would be which? The system works and is up and running beautifully.
OpenSolaris, as of now.
Ah, you're hinting at a rare hardware glitch as the underlying problem?
AFAIU, it is a proclaimed feature of ZFS that writes are atomic, done
and over with.

Uwe,
who is a big fan of a ZFS that fulfills all of its promises. Snapshots
and luupgrade have yet to fail me on it. And a few other beautiful
things. It is the reliability that makes me wonder if UFS/FFS/ext3 are
not better choices in this respect. Blaming standard, off-the-shelf
hardware as 'too cheap' is too slippery a slope, btw.
Toby Thain wrote:
>
>> Chances are. Yet the Ubuntu double boot here never finds anything
>> wrong, crashes, etc.
>
> Why should it? It isn't designed to do so.

I knew this would inevitably creep up. :)

>
> Why are you running a non-redundant pool?

Because.
90+% of normal desktop users will run a non-redundant pool, and expect
their filesystems not to add operational failures, but to come back
after a yanked power cord without fail.

Uwe
Dennis Clarke
2009-Apr-19 15:55 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
>> And after some 4 days without any CKSUM error, how can yanking the
>> power cord mess boot-stuff?
>
> Maybe because on the fifth day some hardware failure occurred? ;-)

ha ha! Sorry .. that was pretty funny.

-- 
Dennis
Bob Friesenhahn
2009-Apr-19 16:24 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Sun, 19 Apr 2009, Uwe Dippel wrote:
>>
>> Why are you running a non-redundant pool?
>
> Because.
> 90+% of normal desktop users will run a non-redundant pool, and expect
> their filesystems not to add operational failures, but to come back
> after a yanked power cord without fail.

OpenSolaris desktop users are surely less than 0.5% of the desktop
population. Are the 90+% of normal desktop users you are talking about
the Microsoft Windows users, which is indeed something like 90%? If you
really want to be part of the majority, perhaps you installed the wrong
operating system. If you want to be included in the 0.5% of the desktop
population who are smart enough to run OpenSolaris, maybe you should add
a mirror drive.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
dick hoogendijk
2009-Apr-19 16:38 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Sun, 19 Apr 2009 11:24:26 -0500 (CDT)
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:

> If you want to be included in the 0.5% of the desktop population who
> are smart enough to run OpenSolaris, maybe you should add a mirror
> drive.

You took the words right out of my mouth.
I often see/read messages from people who seem to think (Open)Solaris
is some kind of Windows or even Linux. The latter is famous for running
on cheap and often even very old hardware. That's OK, because Linux is
not only a modern system, it's also a geek system. And Windows runs on
almost everything, BSODs included.
Solaris/OpenSolaris is not a system for everyone. It has hardware
demands; that is, if you want to run it safely. I know people who run
ZFS on a 32-bit system and that often goes well. Until the system comes
under heavy load and strange errors appear.
Although mirroring existed in hardware, and software "solutions" were
present in the OSes I ran before Solaris, it's only since I run my
systems on ZFS (S10/nevada/OpenSolaris) that all my drives are mirrored.
Prices are cheap (and I mean CHEAP). If you still run ZFS on a single
drive (or worse: on a part of one) you don't follow the "rules". That's
not "professional" and even for home users it's not wise.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / opensolaris
+ All that's really worth doing is what we do for others (Lewis Carroll)
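(In practical terms, a single-disk root pool can usually be turned into a
mirror after the fact by attaching a second disk of at least the same
size. A hedged sketch, where c2d0s0 is a hypothetical spare slice,
already labelled and sized like the existing one:)

   # attach the second device; ZFS resilvers the existing data onto it
   zpool attach rpool c1d0s0 c2d0s0
   # watch the resilver and confirm both sides end up ONLINE
   zpool status rpool
   # for an x86 root pool, also put the boot blocks on the new disk
   installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c2d0s0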
Bob Friesenhahn wrote:
>
> OpenSolaris desktop users are surely less than 0.5% of the desktop
> population. Are the 90+% of normal desktop users you are talking
> about the Microsoft Windows users, which is indeed something like 90%?
> If you really want to be part of the majority, perhaps you installed
> the wrong operating system. If you want to be included in the 0.5% of
> the desktop population who are smart enough to run OpenSolaris, maybe
> you should add a mirror drive.

Thanks for the advice, Bob! Though I don't insist on belonging to that
less than 0.5% of the population who are smart enough to run OpenSolaris
and add a mirror drive, I'd still like to run OpenSolaris, and without a
mirror drive. Where does that put me?

Uwe
dick hoogendijk
2009-Apr-19 16:52 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Mon, 20 Apr 2009 00:41:49 +0800
Uwe Dippel <udippel at gmail.com> wrote:

> I'd still like to run OpenSolaris, and without a mirror drive.
> Where does that put me?

Somewhere I wouldn't want to be. NOT if I were running production
servers, that is. Systems to play with are OK of course. You need
redundancy and you don't get that on a single drive. A sound use of ZFS
needs it. Otherwise the system is "crippled" before you start using it.
The only place I run ZFS on single drives is in an xVM guest ;-)

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / opensolaris
+ All that's really worth doing is what we do for others (Lewis Carroll)
Bob Friesenhahn
2009-Apr-19 17:14 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Sun, 19 Apr 2009, Eric D. Mudama wrote:
>
> Additionally, over the last few months I'm pretty sure I've seen this
> same discussion and report of corruption when the person *did* have
> mirrored boot and had an unsafe power fail. I'll have to dig to find
> it though.

You are right that there have been reports of boot archive corruption,
but that corruption happens at a higher level than ZFS.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Eric D. Mudama
2009-Apr-19 17:14 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Sun, Apr 19 at 18:38, dick hoogendijk wrote:
>On Sun, 19 Apr 2009 11:24:26 -0500 (CDT)
>Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
>
>> If you want to be included in the 0.5% of the desktop population who
>> are smart enough to run OpenSolaris, maybe you should add a mirror
>> drive.
>
>You took the words right out of my mouth.
>I often see/read messages from people who seem to think (Open)Solaris
>is some kind of Windows or even Linux. The latter is famous for running
>on cheap and often even very old hardware. That's OK, because Linux is
>not only a modern system, it's also a geek system. And Windows runs on
>almost everything, BSODs included.

Just to play devil's advocate, those new Sun blades have a single flash
DIMM per processing node as storage.

Additionally, over the last few months I'm pretty sure I've seen this
same discussion and report of corruption when the person *did* have
mirrored boot and had an unsafe power fail. I'll have to dig to find
it though.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
Oscar del Rio
2009-Apr-19 17:17 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
Uwe Dippel wrote:
> Next, a power failure, 2 hours later, and this is what zpool status -v
> thinks:

> Reliability at power failure? That was my question, and I had to learn

Your question should be about HARDWARE reliability after power failure.
Some (cheap) hardware is very unreliable, either the HDD or the PSU or
both. Many systems (Linux, Windows, whatever) silently become corrupted
until the day they no longer boot, and an HDD surface scan usually finds
several bad sectors.

Just this week I had to low-level reformat a box - the partition table
became unreadable/unwritable after a dirty shutdown. (A desktop machine,
not a server, and the HDD showed only ONE bad sector, so replacing the
HDD was not justifiable in this case.)
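(For anyone who wants to rule the hardware in or out on
Solaris/OpenSolaris before blaming the filesystem, a few read-only
checks - a sketch, with device names as placeholders for your own:)

   # per-device soft/hard/transport error counters and vendor info
   iostat -En
   # recent disk and driver complaints logged by the kernel
   dmesg | egrep -i 'error|retry|timeout'
   # a non-destructive surface scan can be run from format(1M):
   #   format -> select the disk -> analyze -> read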
Mario Goebbels
2009-Apr-19 17:38 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
>> Because.
>> 90+% of normal desktop users will run a non-redundant pool, and
>> expect their filesystems not to add operational failures, but to come
>> back after a yanked power cord without fail.
>
> OpenSolaris desktop users are surely less than 0.5% of the desktop
> population. Are the 90+% of normal desktop users you are talking
> about the Microsoft Windows users, which is indeed something like 90%?
> If you really want to be part of the majority, perhaps you installed the
> wrong operating system. If you want to be included in the 0.5% of the
> desktop population who are smart enough to run OpenSolaris, maybe you
> should add a mirror drive.

Not to be a party pooper, but once the Apple brigade gets their filthy
hands on ZFS (post-Snow Leopard?), it will be an issue.

Personally, I run a mirror.

-mg
Richard Elling
2009-Apr-19 20:56 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
Uwe Dippel wrote:
> Casper.Dik at Sun.COM wrote:
>>
>> I would suggest that you follow my recipe: not check the boot-archive
>> during a reboot. And then report back. (I'm assuming that that will
>> take several weeks)
>>
>
> We are back at square one; or, at the subject line.
> I did a zpool status -v, everything was hunky dory.
> Next, a power failure, 2 hours later, and this is what zpool status -v
> thinks:
>
> zpool status -v
>   pool: rpool
>  state: ONLINE
> status: One or more devices has experienced an error resulting in data
>         corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>         entire pool from backup.
>    see: http://www.sun.com/msg/ZFS-8000-8A
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         rpool       ONLINE       0     0     0
>           c1d0s0    ONLINE       0     0     0
>
> errors: Permanent errors have been detected in the following files:
>
>         //etc/svc/repository-boot-20090419_174236

This file is created at boot time, not when power has failed.
So the fault likely occurred during the boot.  With this knowledge,
the rest of your argument makes no sense.
 -- richard

>
> I know, the hard-core defenders of ZFS will repeat for the umpteenth
> time that I should be grateful that ZFS can NOTICE and inform about
> the problem.
> Others might want to repeat that this is not supposed to happen in the
> first place.
>
> Reliability at power failure? That was my question, and I had to learn
> that the answer is 'no'.
> How about my proposal to always have a proper snapshot available? And
> after some 4 days without any CKSUM error, how can yanking the power
> cord mess up boot-stuff?
>
> Uwe
>
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Marion Hakanson
2009-Apr-19 21:17 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
udippel at gmail.com said:
> dick at nagual.nl wrote:
>> Maybe because on the fifth day some hardware failure occurred? ;-)
>
> That would be which? The system works and is up and running beautifully.
> OpenSolaris, as of now.

Running beautifully as long as the power stays on?  Is it hard to believe
hardware might glitch at power-failure (or power-on-after-failure)?

> Ah, you're hinting at a rare hardware glitch as the underlying problem?
> AFAIU, it is a proclaimed feature of ZFS that writes are atomic, done
> and over with.

Not only does ZFS advertise atomic updates, it also _depends_ on them,
and checks for them having happened, likely more so than other
filesystems.  Is it hard to believe that ZFS is exercising and/or
checking up on your hardware in ways that Linux does not?

> Uwe,
> who is a big fan of a ZFS that fulfills all of its promises. Snapshots
> and luupgrade have yet to fail me on it. And a few other beautiful
> things. It is the reliability that makes me wonder if UFS/FFS/ext3 are
> not better choices in this respect. Blaming standard, off-the-shelf
> hardware as 'too cheap' is too slippery a slope, btw.

Sorry to hear you're still having this issue.  I can only offer anecdotal
experience:  Running Solaris 10 here, non-mirrored ZFS root/boot since
last December (other ZFS filesystems, mirrored and non-mirrored, for 2
years prior), on a standard off-the-shelf PC, slightly more than 5 years
old.  This system has been through multiple power failures, never with
any corruption.  Same goes for a 2-year-old Dell desktop PC at work, with
mirrored ZFS root/boot; multiple power failures, never any reported
checksum errors or other corruption.

We also have Solaris 10 systems at work, non-ZFS-boot, but with ZFS
running without redundancy on non-Sun fibre-channel RAID gear.  These
have had power failures and other SAN outages without causing corruption
of ZFS filesystems.

We have experienced a number of times where systems failed to boot after
a power failure because the boot archive was out of date.  Not corrupted,
just out of date.  Annoying and inconvenient for production systems, but
nothing at all to do with ZFS.

So, I personally have not found ZFS to be any less reliable in the
presence of power failures than Solaris 10/UFS or Linux on the same
hardware.  I wonder what it is that's unique or rare about your
situation, that OpenSolaris and/or ZFS is uncovering?  I also wonder how
hard it might be to make ZFS resilient to whatever unique/rare
circumstances you have, as compared to finding/fixing/avoiding those
circumstances.

Regards,

Marion
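(The out-of-date boot archive case Marion describes is normally repaired
without failsafe gymnastics by rebuilding the archive; a brief sketch
using bootadm, whose exact output varies by release:)

   # report whether the boot archive is stale, without changing anything
   bootadm update-archive -vn
   # rebuild it (a clean 'reboot' or 'init 6' does this automatically)
   bootadm update-archive -v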
David Magda
2009-Apr-19 21:56 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Apr 19, 2009, at 12:52, dick hoogendijk wrote:

> You need redundancy and you don't get that on a single drive. A
> sound use of ZFS needs it.

Not quite the same, but...

	"zfs set copies=2 myzfsfs" ?
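(A note on what copies=2 does and does not buy you, as a sketch with a
placeholder dataset name: it stores two copies of each block on the same
disk, which helps against isolated bad sectors and checksum errors but
not against losing the whole drive, and it only affects data written
after the property is set.)

   # keep two copies of every newly written block in this dataset
   zfs set copies=2 rpool/export/home
   # verify the setting
   zfs get copies rpool/export/home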
dick hoogendijk
2009-Apr-19 22:17 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
On Sun, 19 Apr 2009 17:56:54 -0400
David Magda <dmagda at ee.ryerson.ca> wrote:

> On Apr 19, 2009, at 12:52, dick hoogendijk wrote:
>
> > You need redundancy and you don't get that on a single drive. A
> > sound use of ZFS needs it.
>
> Not quite the same, but...
>
> "zfs set copies=2 myzfsfs" ?

Like you say: not quite the same. If your drive fails, you're scr**d.

-- 
Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
+ http://nagual.nl/ | nevada / opensolaris
+ All that's really worth doing is what we do for others (Lewis Carroll)
Richard Elling wrote:
>>
>> //etc/svc/repository-boot-20090419_174236
>
> This file is created at boot time, not when power has failed.
> So the fault likely occurred during the boot.  With this knowledge,
> the rest of your argument makes no sense.

reboot    system boot                   Sun Apr 19 17:46
reboot    system down                   Sun Apr 19 17:45
reboot    system boot                   Sun Apr 19 17:44
reboot    system down                   Sun Apr 19 17:44
reboot    system boot                   Sun Apr 19 17:43
reboot    system down                   Sat Apr 18 15:09

The result that you saw was the one after the last boot at 17:46. You
are probably correct with your statement that the fault occurred at
boot time.

Uwe
Robert Thurlow
2009-Apr-20 16:27 UTC
[zfs-discuss] [on-discuss] Reliability at power failure?
dick hoogendijk wrote:
> Sorry Uwe, but the answer is yes. Assuming that your hardware is in
> order. I've read quite some messages from you here recently and all of
> them make me think you're no fan of ZFS at all. Why don't you quit
> using it and focus a little more on installing SunStudio

I would really like to NOT chase people away from ZFS for any reason.
There's no need.

ZFS is currently a little too expert-friendly.  I'm used to ZFS, so when
it shows me messages, I know what it's saying.  But when I read them a
second time, I always wonder if we could word them to be more
approachable without losing the precision.  I would like to see
alternate wordings suggested in RFEs, since I think some folks had good
suggestions.  As an example of wording that needs an upgrade:

> errors: Permanent errors have been detected in the following files:
>
>         <0xa6>:<0x4f002>

Could we not offer a clue that this was in metadata, even if it is
darned hard to print a meaningful path name?

Obligatory positive message:  I was rewiring my monitors yesterday to
get them all on a switchable power bar, and bumped a power switch
briefly.  The old dual-Opteron machine hosting my storage pool did not
power up again after that.  I had an external FireWire case the pool had
been destined for, so I removed the drives, put them in the external
case, and plugged the case into my Sun Blade 2500.  'zpool import -f'
went nicely, and I didn't lose a thing.  I don't think any other
filesystem or OS would make a recovery operation like this any easier.

Oh yeah, this was after a mostly effortless ZFS-accelerated Live Upgrade
from snv_91 to snv_112 (almost a year) on another box.

Rob T
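(For anyone wanting to repeat that migration, the general pattern looks
roughly like the following - a sketch, with 'tank' as a placeholder pool
name; -f is only needed because the dead machine never got to export the
pool cleanly:)

   # on the old host, if it is still alive:
   zpool export tank
   # on the new host, after connecting the disks:
   zpool import            # lists pools found on the attached devices
   zpool import -f tank    # -f overrides the "in use elsewhere" check
   zpool status tank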