thr3ads.net - freebsd stable - filesystem corruption with 1TB filesystem, 4.9-STABLE, twe [Apr 2004]

Ollie Cook

2004-Apr-18 14:19 UTC

filesystem corruption with 1TB filesystem, 4.9-STABLE, twe

Hi,

I am experiencing filesystem corruption while using a 1TB (appx.) partition
under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
device driver). The RAID set comprises 5x250GB ATA disks.

The kernel logs such messages as:

Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks
Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks
Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks

The operations it was performing at the time involved copying a lot of small
(email messages) files from a busy NFS mount to the RAID5 array. A number of
processes were all copying different files and the throughput was around 3MB/s
to disk.

As far as I can tell from sys/ufs/ffs/ffs_alloc.c this error indicates that a
kernel data structure contains unexpected data, but I'm not confident enough to
be able to tell what might be causing that.
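
To make the messages above easier to sanity-check, the following Python sketch pulls the mount point, inode number, and block count out of each "free inode" line and converts the count to a size. The regular expression, the function name, and the 512-byte unit for the block count are assumptions for illustration; none of them is taken from the kernel source.

```python
import re

# Assumed layout, inferred from the three log lines quoted above:
#   "... free inode <mountpoint>/<inode number> had <count> blocks"
FREE_INODE_RE = re.compile(r"free inode (\S*)/(\d+) had (-?\d+) blocks")

# Assumption: the block count is expressed in 512-byte units.
BLOCK_BYTES = 512


def parse_free_inode(line):
    """Return (mountpoint, inode, blocks), or None if the line does not match."""
    match = FREE_INODE_RE.search(line)
    if match is None:
        return None
    mountpoint, inode, blocks = match.groups()
    return mountpoint, int(inode), int(blocks)


if __name__ == "__main__":
    log = [
        "Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks",
        "Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks",
        "Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks",
    ]
    for line in log:
        parsed = parse_free_inode(line)
        if parsed is not None:
            mountpoint, inode, blocks = parsed
            gib = blocks * BLOCK_BYTES / 2**30
            print(f"{mountpoint} inode {inode}: {blocks} blocks = {gib:.1f} GiB")
```

Under the 512-byte assumption, counts of this size would imply single e-mail files occupying tens to hundreds of gigabytes, which fits the reading that the inode held unexpected data.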

After such messages, if I cleanly unmount the filesystem and run fsck, errors
are detected. Such errors are:

  directory corrupted
  directory contains empty blocks
  unallocated inode
  wrong link counts

There are many more distinct error messages, but those are the ones I recall.
After a number of passes through fsck, the filesystem is eventually marked
clean but quite a number of files wind up in lost+found.

Has anyone seen behaviour similar to this with twe RAID sets or large
partitions in the past? I've not been able to find reports of similar symptoms
using Google.

Can anyone offer advice on how I might further debug this problem?

Yours,

Ollie

Apr 16 11:34:12 heman /kernel: twe0: <3ware Storage Controller> port 0xc800-0xc80f mem 0xfe000000-0xfe7fffff,0xfe8ffc00-0xfe8ffc0f irq 10 at device 4.0 on pci3
Apr 16 11:34:12 heman /kernel: twe0: 8 ports, Firmware FE7X 1.05.00.065, BIOS BE7X 1.08.00.048
Apr 16 11:34:12 heman /kernel: twed0: <Unit 0, JBOD, Normal> on twe0
Apr 16 11:34:12 heman /kernel: twed0: 4126MB (8452080 sectors)
Apr 16 11:34:12 heman /kernel: twed1: <Unit 1, RAID5, Normal> on twe0
Apr 16 11:34:12 heman /kernel: twed1: 953896MB (1953580032 sectors)
Apr 16 11:34:12 heman /kernel: twe0: command interrupt
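
As a consistency check on the unit sizes reported above, this short Python sketch converts the sector counts to megabytes and estimates each disk's share of the RAID5 unit. It assumes 512-byte sectors and that one disk's worth of capacity goes to parity; the function names are illustrative.

```python
SECTOR_BYTES = 512  # assumption: the driver reports 512-byte sectors


def sectors_to_mb(sectors):
    """Size in whole megabytes (2**20 bytes), rounded down."""
    return sectors * SECTOR_BYTES // 2**20


def raid5_data_sectors_per_disk(unit_sectors, disks):
    """Data sectors contributed by each disk, assuming one disk's worth of parity."""
    return unit_sectors // (disks - 1)


if __name__ == "__main__":
    print(sectors_to_mb(8452080))       # twed0, the JBOD unit
    print(sectors_to_mb(1953580032))    # twed1, the RAID5 unit
    per_disk = raid5_data_sectors_per_disk(1953580032, 5)
    print(per_disk * SECTOR_BYTES / 1e9)  # decimal gigabytes per disk
```

Both megabyte figures agree with the twed0 and twed1 lines, and the per-disk share comes to roughly 250 GB, consistent with the 5x250GB set described earlier.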

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               +44 20 7903 3065

Ollie Cook

2004-Apr-19 07:05 UTC

On Sun, Apr 18, 2004 at 10:18:53PM +0100, Ollie Cook wrote:
> Hi,
> 
> I am experiencing filesystem corruption while using a 1TB (appx.) partition
> under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
> device driver). The RAID set comprises 5x250GB ATA disks.
> 
> The kernel logs such messages as:
> 
> Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks
> Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks
> Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks
*snip*

I have some further details which I hope might shed some more light on this
problem. Accessing some files which appear (from a directory listing for
example) to have been stored correctly results in 'Bad file descriptor'. This
is with a freshly checked and clean filesystem.

I say 'clean', but after fsck declares it clean, another pass through fsck will
diagnose further errors. This is without mounting the filesystem between
passes.

I ran a few simple tests and was able to ascertain that the open(2) and read(2)
system calls don't return errors but fstat(2) does return EBADF.

su-2.05b# ls | grep 1071701821.78602.aether.uk.clara.net
1071701821.78602.aether.uk.clara.net
su-2.05b# ls  1071701821.78602.aether.uk.clara.net
ls: 1071701821.78602.aether.uk.clara.net: Bad file descriptor
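
The test described above can be approximated with a short script that issues each call in turn and records which one fails. The sketch below is in Python, whose os.open, os.read, and os.fstat invoke the corresponding system calls; the function name probe and the 512-byte read size are illustrative choices, not what was actually run on this machine.

```python
import errno
import os


def probe(path, nbytes=512):
    """Try open(2), read(2) and fstat(2) on path.

    Returns a dict mapping each attempted call to "ok" or to the
    symbolic errno name (for example "EBADF"). If open fails, the
    other two calls are not attempted.
    """
    results = {}
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        results["open"] = errno.errorcode.get(exc.errno, str(exc.errno))
        return results
    results["open"] = "ok"
    try:
        calls = (
            ("read", lambda: os.read(fd, nbytes)),
            ("fstat", lambda: os.fstat(fd)),
        )
        for name, call in calls:
            try:
                call()
                results[name] = "ok"
            except OSError as exc:
                results[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
    return results


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(path, probe(path))
```

On a healthy file all three entries should read "ok"; the behaviour reported here would show up as "ok" for open and read with "EBADF" for fstat.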

Any assistance in diagnosing this would be greatly appreciated.

Yours,

Ollie

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               +44 20 7903 3065

Vinod Kashyap

2004-Apr-19 10:54 UTC

Please check the customer advisory at
http://www.3ware.com/support/index.asp
and make sure you don't have a known problem configuration.

Also, I would recommend using the "FreeBSD 4.8 Beta Driver"
on the 3ware website, or still better, use the driver at the
tip of RELENG_4 at:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/twe/?only_with_tag=RELENG_4

If you need any help with the 3ware product, please contact
support@3ware.com.


Thanks,

Vinod.

> -----Original Message-----
> From: owner-freebsd-stable@freebsd.org
> [mailto:owner-freebsd-stable@freebsd.org]On Behalf Of Ollie Cook
> Sent: Monday, April 19, 2004 10:06 AM
> To: Matthew Seaman; freebsd-stable@freebsd.org
> Subject: Re: filesystem corruption with 1TB filesystem, 4.9-STABLE, twe
> 
> 
> On Mon, Apr 19, 2004 at 04:51:16PM +0100, Matthew Seaman wrote:
> > Can you rule out hardware problems by substituting in a different
> > known good 3-ware card?  Last time I saw anything like this was an IO
> > controller chip going marginal through overheating under heavy load
> > and flipping occasional bits on disk block addresses.
> 
> Hi,
> 
> This is the first time we've tried the 3ware ATA RAID cards so I don't have
> another 'known good' one. Usually we use SCSI RAID cards. If I were to replace
> it I still wouldn't be able to guarantee that it was 'known good'. I may be
> able to arrange a part replacement, just to test that theory, in any case.
> 
> Are you aware of any utilities which would be able to diagnose what you
> describe? I guess what I'm asking is how did you determine that that was what
> happened last time you saw these symptoms?
> 
> Thanks,
> 
> Ollie
> 
> -- 
> Oliver Cook    Systems Administrator, Claranet UK
> ollie@uk.clara.net               +44 20 7903 3065

Doug White

2004-Apr-30 10:28 UTC

On Sun, 18 Apr 2004, Ollie Cook wrote:
> I am experiencing filesystem corruption while using a 1TB (appx.) partition
> under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
> device driver). The RAID set comprises 5x250GB ATA disks.
[...]

The type of corruption you're seeing would be consistent with one of the
disks not accepting writes or some other sort of array corruption. I
realize it'll take forever, but can you run an array verify?  I wonder if
the BIOS isn't picking up a disk failure since it isn't throwing errors,
but isn't doing any useful work either.
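
A filesystem-level check can complement an array verify: write many small files with reproducible contents, then read them back and compare. The Python sketch below is one way to do that under stated assumptions; the file naming scheme, the 4096-byte default size, and the SHA-256 based payload generator are all illustrative choices. It is not a 3ware utility and does not replace the controller's own verify.

```python
import hashlib
import os


def payload(index, size):
    """Deterministic pseudo-random bytes for file number `index`."""
    out = b""
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{index}:{counter}".encode()).digest()
        counter += 1
    return out[:size]


def write_files(directory, count, size=4096):
    """Write `count` small files with reproducible contents, fsyncing each."""
    for index in range(count):
        path = os.path.join(directory, f"probe-{index:06d}")
        with open(path, "wb") as handle:
            handle.write(payload(index, size))
            handle.flush()
            os.fsync(handle.fileno())


def verify_files(directory, count, size=4096):
    """Return the indexes of files that are unreadable or whose contents differ."""
    bad = []
    for index in range(count):
        path = os.path.join(directory, f"probe-{index:06d}")
        try:
            with open(path, "rb") as handle:
                data = handle.read()
        except OSError:
            bad.append(index)
            continue
        if data != payload(index, size):
            bad.append(index)
    return bad
```

For the comparison to exercise the disks rather than cached data, the filesystem would probably need to be unmounted and remounted between write_files and verify_files.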

-- 
Doug White                    |  FreeBSD: The Power to Serve
dwhite@gumbysoft.com          |  www.FreeBSD.org

Matthew Reimer

2004-Apr-30 10:52 UTC

Is your card plugged into a riser card? We had similar problems (random 
corruption) with a 7506-8 card. The workaround was to set the speed for 
that PCI slot to 33MHz (rather than Auto or 66MHz). I think this tech 
note describes our problem:

http://www.3ware.com/kb/article.aspx?id=10848

(Read the PDF file attached to the tech note.)

Now the box is as solid as a rock.

Matt

Doug White wrote:
> On Sun, 18 Apr 2004, Ollie Cook wrote:
>
>> I am experiencing filesystem corruption while using a 1TB (appx.) partition
>> under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
>> device driver). The RAID set comprises 5x250GB ATA disks.
>
> [...]
>
> The type of corruption you're seeing would be consistent with one of the
> disks not accepting writes or some other sort of array corruption. I
> realize it'll take forever, but can you run an array verify?  I wonder if
> the BIOS isn't picking up a disk failure since it isn't throwing errors,
> but isn't doing any useful work either.
