Hi everyone,

I did a clean install of OpenSolaris and put two 500 GB SATA disks in a zpool mirror. Everything went well: my mirror, called datatank, is shared over CIFS and I can access it from my MacBook and PC. With this setup in place I started copying my files onto it, but now I notice this:

      pool: datatank
     state: DEGRADED
    status: One or more devices has experienced an error resulting in data
            corruption. Applications may be affected.
    action: Restore the file in question if possible. Otherwise restore the
            entire pool from backup.
       see: http://www.sun.com/msg/ZFS-8000-8A
     scrub: scrub in progress for 0h1m, 38.50% done, 0h3m to go
    config:

            NAME        STATE     READ WRITE CKSUM
            datatank    DEGRADED     0     0     0
              mirror    DEGRADED     0     0     0
                c5d0    DEGRADED     0     0     0  too many errors
                c7d0    DEGRADED     0     0     0  too many errors

    errors: Permanent errors have been detected in the following files:

            datatank:<0x3df2>

It seems that files are getting corrupted, and when I delete them the pool stays degraded. I ran the clear command and things were fine for a while, but after I copied files over from my MacBook, some of them were corrupted again.

I am afraid to put my production files on this server; it doesn't seem reliable. What can I do? Anybody have any clues?

I also noticed an error on my boot disk (the boot disk is a 20 GB ATA drive):

    gzip: kernel/misc/qlc/qlc_fw_2400: I/O error

Thanks in advance!

Best regards,
Y

This message posted from opensolaris.org
On Tue, Jul 8, 2008 at 2:56 PM, BG <ben at syn3.net> wrote:
> [original report snipped]

Might want to provide some basics: What build of OpenSolaris are you running? What version of ZFS?

--Tim
I removed the files that were corrupted, scrubbed the datatank mirror, and then ran 'zpool status -v datatank', which gave me this:

      pool: datatank
     state: DEGRADED
    status: One or more devices has experienced an unrecoverable error. An
            attempt was made to correct the error. Applications are unaffected.
    action: Determine if the device needs to be replaced, and clear the errors
            using 'zpool clear' or replace the device with 'zpool replace'.
       see: http://www.sun.com/msg/ZFS-8000-9P
     scrub: scrub completed after 0h4m with 0 errors on Tue Jul 8 22:17:26 2008
    config:

            NAME        STATE     READ WRITE CKSUM
            datatank    DEGRADED     0     0     4
              mirror    DEGRADED     0     0     4
                c5d0    DEGRADED     0     0     8  too many errors
                c7d0    DEGRADED     0     0     8  too many errors

    errors: No known data errors

Then I ran 'zpool clear datatank', and this is the status output afterwards:

      pool: datatank
     state: ONLINE
     scrub: scrub completed after 0h4m with 0 errors on Tue Jul 8 22:17:26 2008
    config:

            NAME        STATE     READ WRITE CKSUM
            datatank    ONLINE       0     0     0
              mirror    ONLINE       0     0     0
                c5d0    ONLINE       0     0     0
                c7d0    ONLINE       0     0     0

    errors: No known data errors

But is this really OK? I can't go through these steps every time I get a file corruption. By the way, the files aren't corrupt on my MacBook, so how do they get corrupted in transit or on the mirror? And is the error I mentioned on my boot disk of some value? I searched the net and found that the package is related to iSCSI, but I only have SATA and ATA disks.
Hi,

I'm no ZFS or Solaris expert, but with no replies from anybody else I'll give you my thoughts. I strongly suspect you've got a hardware or driver fault on that server. ZFS almost certainly isn't corrupting the files itself; it's simply reporting that it's finding corruption. You may well have a memory error, or a bad controller or driver. I'd imagine you'd simply be getting silent corruption if you were running anything other than ZFS on there.

Personally I'd check the server carefully, possibly trying those disks in another machine if you can.
Hi, thanks for your help. In the forum I got an answer too, and I am going to try that, but your suggestion is also an angle I will investigate. Is there maybe some diagnostic tool in OpenSolaris I can use, or should I use the Solaris bootable CD that checks whether my hardware is fully compatible?

Thanks!
There's nothing I know of, I'm afraid; I'm too new to Solaris to have looked into things that deeply. If you have access to any spare parts, the easiest way to test is to swap things over and see if the problem is reproducible. It could even be something as simple as a struggling power supply. Running a compatibility check does sound like a good first step, though.
Hi,

I too strongly suspect that some hardware component is failing. It is rare to see all drives (in your case both drives in the mirror, plus the boot drive) reporting errors at the same time. 'zpool clear' just resets the error counters; you still have errors in there.

Start with the following components (in this order):

1. Memory: run memtest86+ (it is on almost any live CD; it is very common)
2. Power supply: search the forums, it is a very common culprit
3. Your mobo/disk controller: try another one, maybe

Have you also experienced any kernel panics or strange random software crashes on this box?
Hi,

I'm running all kinds of tools now, even a diagnostic tool for my hard drive from WD, so we will see what the results are. I ordered another mobo this morning, and if that doesn't fix it I will ask a fellow sysop to put my disks in his Solaris array.

No, I didn't notice any kernel panics. The only thing I noticed was this line popping up when I did a shutdown:

    gzip: kernel/misc/qlc/qlc_fw_2400: I/O error

The machine itself is just used as a storage array; nothing else is running on it, and I use CIFS to share, which works great.

I'll keep you posted. Thanks for everything already :)
Trying the disks in another machine is a great step; it will eliminate those quickly. Use your own cables too, so you can eliminate them from suspicion.

If this is hardware related, from my own experience I would say it's most likely to be (in order):

- Power supply
- Memory (especially if ever handled without anti-static precautions)
- Bad driver / disk controller
- Bad CPU / motherboard
- Other components

When you get your new board, just set it up for troubleshooting with the bare minimum of components:

- Power supply
- Motherboard
- CPU
- Memory
- Disks
- Power button

Don't even connect the reset switch or the case LEDs. It's by far the quickest way to eliminate items from suspicion.
So we finally got around the problem: after replacing almost everything, it turned out the memory was the devil. I pulled it out and replaced it with ECC memory, and everything has now been working fine for 14 days already. Knowing this, I will never put non-ECC memory in my boxes again.

Thanks for all the help!!
Heh, yup, memory errors are among the worst to diagnose. Glad you got to the bottom of it, and it's good to see ZFS again catching faults that would otherwise have gone unnoticed.
Indeed, that's one of the nice things: ZFS is picky about data and alerts you immediately. Before, some files became corrupt and one was left wondering what happened and how it was possible, since everything had seemed fine for months :)

The more I use Solaris, the more I love it :)
> Knowing this, I will never put non-ECC memory in my boxes again.

What's your mainboard and CPU? I've looked up the thread on the forum and there's no hardware information there. Don't be fooled just because the RAM is ECC: the mainboard (and the CPU, in the case of AMD) have to support it too. There are two factors: the chipset (Intel/older AMDs) or the CPU (AMDs with an integrated memory controller) has to support ECC, and the mainboard has to have the additional memory lanes etched into the PCB.

If the chipset/CPU supports ECC but the lanes are missing, you'll end up with lots of ECC errors if the chipset can't sense that; if it does sense it, the RAM will run as regular RAM. If the chipset/CPU doesn't support ECC at all, it'll also run as regular RAM.

Regards,
-mg
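To make the ECC discussion concrete, here is a toy Hamming(7,4) code in Python. It shows the single-error-correcting principle that ECC memory relies on; real ECC DIMMs apply a wider SECDED code over 64-bit words in hardware, so this is purely an illustration, not how any particular chipset implements it.

```python
# Toy Hamming(7,4): encodes 4 data bits into 7 bits so that any single
# flipped bit can be located and corrected on read-back.

def encode(d):
    """d: list of 4 data bits -> 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                 # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4                 # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4                 # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def correct(c):
    """c: 7-bit codeword, possibly with one flipped bit -> 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]    # recheck the three parity groups
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 = clean, else 1-based error position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]   # extract the data bits

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                          # simulate a single-bit memory error
assert correct(word) == data          # the "ECC" recovers the original data
```

Non-ECC memory has no such redundancy, which is why the bit flips in this thread went straight into the files.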
On Mon, 28 Jul 2008, BG wrote:
> Indeed, that's one of the nice things: ZFS is picky about data and
> alerts you immediately. Before, some files became corrupt and one was
> wondering what happened and how this was possible, since everything
> seemed fine for months :)

Unfortunately, ZFS does not detect or correct memory errors. Memory reliability is currently an Achilles' heel for ZFS, which blows MTTDL models that are based on disk media reliability alone. Consider that in servers, the ZFS ARC (containing a copy of often-accessed data) will often grow to consume most of the system RAM. This growth mostly occurs after server daemons have been successfully started. Large servers can include lots of RAM, most of which is used for caching.

All of the data read or written passes through RAM. Any error which corrupts data before it has been checksummed by ZFS will cause silent data corruption on disk. If data in RAM becomes corrupt between the time that it is checksummed and the time it is written to disk, then ZFS will detect the problem, but the data will be corrupt and unrecoverable. Likewise, if the ZFS ARC returns corrupted data to an application, the application may then write a (possibly) modified version of this corrupted data back to disk.

Reliable memory is imperative, and without ECC, memory read errors can go undetected for a long time. Even with ECC it is possible to experience undetected/uncorrected memory errors, but Solaris includes a very good fault management system for ECC memory, so that it avoids memory chips which produce many detectable read errors.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
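Bob's point about the checksum timing window can be sketched in a few lines of Python, with SHA-256 standing in for the ZFS block checksum (the real checksum algorithm differs; this is just an illustration of the window, not of ZFS internals):

```python
# A bit flip *before* the checksum is computed gets sealed into the block
# and verifies as "good" forever (silent corruption). A flip *after*
# checksumming is caught on read-back, but the data is already lost.

import hashlib

def checksum(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def flip_bit(block: bytes, bit: int = 0) -> bytes:
    """Simulate a single-bit memory error in the first byte."""
    return bytes([block[0] ^ (1 << bit)]) + block[1:]

good = b"production data"

# Case 1: RAM corrupts the block before ZFS checksums it.
bad = flip_bit(good)
stored = (bad, checksum(bad))            # corrupt data, matching checksum
assert checksum(stored[0]) == stored[1]  # read-back verifies: silent corruption

# Case 2: RAM corrupts the block after checksumming, before the write.
stored = (flip_bit(good), checksum(good))
assert checksum(stored[0]) != stored[1]  # ZFS detects it, but cannot repair it
```

Case 2 is roughly what the original poster saw: both mirror sides held data whose checksums no longer matched, so the pool reported errors it could not heal.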
Bob Friesenhahn wrote:
> Unfortunately, ZFS does not detect or correct memory errors. Memory
> reliability is currently an Achilles' heel for ZFS, which blows MTTDL
> models that are based on disk media reliability alone.

We can (and do) model systems complete with the data path from CPU to memory to PCI* to HBA to disk and back. Basically, the results show that you want ECC memory and PCI-Express as major technology components. FWIW, Sun no longer sells computers without ECC memory.

But ZFS can do better. I filed CR 6674679, which basically says that if redundant copies of data have the same, wrong checksum, then ZFS should issue an ereport to that effect. This will allow you to move suspicion away from the disks as a root cause towards a common cause, like memory, a shared HBA or bus, etc. It won't be able to recover the data, but it can help debug the system.
 -- richard
On Mon, 28 Jul 2008, Richard Elling wrote:
> But ZFS can do better. I filed CR 6674679, which basically says
> that if redundant copies of data have the same, wrong checksum,
> then ZFS should issue an ereport to that effect. This will allow
> you to move suspicion away from the disks as a root cause towards
> a common cause, like memory, a shared HBA or bus, etc. It won't
> be able to recover the data, but it can help debug the system.

A rather obvious thing to do is to have a low-priority task running which validates the checksums of memory in the ZFS ARC. That way, memory content which is somehow altered (due to a memory glitch or kernel bug) will be detected so someone can fix the problem. Even ECC memory will not fix the problem when an adapter card writes to the wrong location, or a device driver does something wrong.

Bob
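Bob's proposal amounts to keeping a digest beside each cached block and having a low-priority pass re-verify the cache. A rough sketch, with all names invented for illustration (the real ARC is kernel C code and is structured nothing like this):

```python
# Sketch of a checksummed cache with a background-style scrub pass, so
# in-memory bit rot is flagged instead of being served to applications.

import hashlib

class ChecksummedCache:
    def __init__(self):
        self._blocks = {}  # key -> (data, sha256 digest taken at insert time)

    def put(self, key, data: bytes):
        self._blocks[key] = (data, hashlib.sha256(data).digest())

    def get(self, key) -> bytes:
        data, digest = self._blocks[key]
        if hashlib.sha256(data).digest() != digest:
            raise IOError(f"cached block {key!r} corrupted in memory")
        return data

    def scrub(self):
        """Low-priority pass: return keys whose bytes no longer match."""
        return [k for k, (d, dig) in self._blocks.items()
                if hashlib.sha256(d).digest() != dig]

cache = ChecksummedCache()
cache.put("blk0", b"hello")
assert cache.scrub() == []            # healthy cache

# Simulate a memory glitch flipping bits behind the cache's back.
data, digest = cache._blocks["blk0"]
cache._blocks["blk0"] = (b"hellp", digest)
assert cache.scrub() == ["blk0"]      # the scrubber flags the damaged block
```

The design choice mirrored here is detection without repair: like ZFS's own checksums, the scrub can only say the cached copy is bad; healing would require re-reading a known-good copy from the pool.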
Mainboard: KFN4-DRE; more info here:
http://www.asus.com/products.aspx?l1=9&l2=39&l3=174&l4=0&model=1844&modelmenu=2

CPU: 2x AMD Opteron 2350, 2.0 GHz, HT, 4 MB, SF

The memory was cheap non-ECC stuff; I replaced it with Kingston ECC memory (KVR667D2D8P5/2G). In the meantime we have 4x500 GB in two mirror sets.
Bob Friesenhahn wrote:
> A rather obvious thing to do is to have a low-priority task running
> which validates the checksums of memory in the ZFS ARC. That way, memory
> content which is somehow altered (due to a memory glitch or kernel bug)
> will be detected so someone can fix the problem. Even ECC memory will
> not fix the problem when an adapter card writes to the wrong location,
> or a device driver does something wrong.

We already have memory scrubbers which check memory. Actually, we've had these for about 10 years, but they only work for ECC memory... if you have only parity memory, then you can't fix anything at the hardware level, and the best you can hope for is that FMA will do the right thing.

It is not clear to me where ARC validation occurs. Perhaps someone who deals with the ARC code could shed some light.
 -- richard
On Mon, 28 Jul 2008, Richard Elling wrote:
> It is not clear to me where ARC validation occurs. Perhaps someone
> who deals with the ARC code could shed some light.

More than likely, ARC data is not stored using the original filesystem blocks, so the existing filesystem block checksums are not useful. Sun hardware is very well made, with ECC memory, but many OpenSolaris users will be using PCs with bargain memory.

Bob
> Mainboard: KFN4-DRE
> more info here: http://www.asus.com/products.aspx?l1=9&l2=39&l3=174&l4=0&model=1844&modelmenu=2
>
> CPU: 2x AMD Opteron 2350, 2.0 GHz, HT, 4 MB, SF

You'll be fine with that. Just had to make sure.

Regards,
-mg
> We already have memory scrubbers which check memory. Actually,
> we've had these for about 10 years, but they only work for ECC
> memory... if you have only parity memory, then you can't fix anything
> at the hardware level, and the best you can hope for is that FMA will
> do the right thing.

In Solaris, these only work with supported Sun hardware, right? Because I don't see how I could get Solaris to cooperate with the X48 chipset that's managing my ECC RAM.

Regards,
-mg
Mario Goebbels wrote:
> In Solaris, these only work with supported Sun hardware, right?
> Because I don't see how I could get Solaris to cooperate with the X48
> chipset that's managing my ECC RAM.

There are different methods of scrubbing. Some memory controllers will do it for you, so we defer to them. If that is not the case, then you might see scrubs occurring as a (software) kernel thread, memscrubber. For x86/AMD, you can see some of the logic here:

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/memscrub.c

To get all of the details for each platform, you'll need to look at the various architecture-specific sources. Note that there are reasons why a scrub won't do much for you, particularly if you are not using ECC memory.
 -- richard