Ollie Cook
2004-Apr-18 14:19 UTC
filesystem corruption with 1TB filesystem, 4.9-STABLE, twe
Hi,

I am experiencing filesystem corruption while using a 1TB (appx.) partition
under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
device driver). The RAID set comprises 5x250GB ATA disks.

The kernel logs such messages as:

Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks
Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks
Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks

The operations it was performing at the time involved copying a lot of small
(email messages) files from a busy NFS mount to the RAID5 array. A number of
processes were all copying different files and the throughput was around 3MB/s
to disk.

As far as I can tell from sys/ufs/ffs/ffs_alloc.c this error indicates that a
kernel data structure contains unexpected data, but I'm not confident enough to
be able to tell what might be causing that.

After such messages, if I cleanly unmount the filesystem and run fsck, errors
are detected. Such errors are:

        directory corrupted
        directory contains empty blocks
        unallocated inode
        wrong link counts

There are many more distinct error messages, but those are the ones I recall.
After a number of passes through fsck, the filesystem is eventually marked
clean but quite a number of files wind up in lost+found.

Has anyone seen behaviour similar to this with twe RAID sets or large
partitions in the past? I've not been able to find reports of similar symptoms
using Google.

Can anyone offer advice on how I might further debug this problem?

Yours,

Ollie

Apr 16 11:34:12 heman /kernel: twe0: <3ware Storage Controller> port 0xc800-0xc80f mem 0xfe000000-0xfe7fffff,0xfe8ffc00-0xfe8ffc0f irq 10 at device 4.0 on pci3
Apr 16 11:34:12 heman /kernel: twe0: 8 ports, Firmware FE7X 1.05.00.065, BIOS BE7X 1.08.00.048
Apr 16 11:34:12 heman /kernel: twed0: <Unit 0, JBOD, Normal> on twe0
Apr 16 11:34:12 heman /kernel: twed0: 4126MB (8452080 sectors)
Apr 16 11:34:12 heman /kernel: twed1: <Unit 1, RAID5, Normal> on twe0
Apr 16 11:34:12 heman /kernel: twed1: 953896MB (1953580032 sectors)
Apr 16 11:34:12 heman /kernel: twe0: command interrupt

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               +44 20 7903 3065
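As a cross-check on the boot messages above, the sizes the driver reports can be recomputed from the sector counts. This is only a sketch of the arithmetic: the 512-byte sector size and the reading of "250GB" as 250 * 10^9 bytes are assumptions, and the helper names are made up for illustration.

```python
# Assumption: 512-byte sectors; the thread does not state the sector size.
SECTOR_BYTES = 512


def sectors_to_mib(sectors: int) -> int:
    """Convert a sector count to whole MiB, rounding down."""
    return sectors * SECTOR_BYTES // (1024 * 1024)


def raid5_usable_bytes(disks: int, disk_bytes: int) -> int:
    """RAID5 spends one disk's worth of space on parity: usable = (n - 1) disks."""
    return (disks - 1) * disk_bytes


# Sector counts copied from the twed0/twed1 boot messages.
units = {"twed0": 8452080, "twed1": 1953580032}

for name, sectors in units.items():
    print(name, sectors_to_mib(sectors), "MB")

# Assumption: "250GB" means 250 * 10**9 bytes per disk.
array_bytes = units["twed1"] * SECTOR_BYTES
print("array bytes:       ", array_bytes)
print("4 x 250GB (RAID5): ", raid5_usable_bytes(5, 250 * 10**9))

# Does the sector count still fit in a signed 32-bit integer?
print("below 2**31 sectors:", units["twed1"] < 2**31)
```

Under those assumptions the figures agree with the driver output (4126MB and 953896MB), and the RAID5 unit comes out slightly above four disks' worth of space, as expected for a five-disk set. The last line only shows that the sector count is below 2^31; whether a limit of that kind is relevant to this kernel is not established anywhere in the thread.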
Ollie Cook
2004-Apr-19 07:05 UTC
filesystem corruption with 1TB filesystem, 4.9-STABLE, twe
On Sun, Apr 18, 2004 at 10:18:53PM +0100, Ollie Cook wrote:
> Hi,
>
> I am experiencing filesystem corruption while using a 1TB (appx.) partition
> under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
> device driver). The RAID set comprises 5x250GB ATA disks.
>
> The kernel logs such messages as:
>
> Apr 17 16:25:37 heman /kernel: free inode /clara/170175645 had 137391860 blocks
> Apr 17 17:18:29 heman /kernel: free inode /clara/169969279 had 1803039330 blocks
> Apr 17 18:06:38 heman /kernel: free inode /clara/171086221 had 544501359 blocks

*snip*

I have some further details which I hope might shed some more light on this
problem.

Accessing some files which appear (from a directory listing for example) to
have been stored correctly results in 'Bad file descriptor'. This is with a
freshly checked and clean filesystem. I say 'clean', but after fsck declares it
clean, another pass through fsck will diagnose further errors. This is without
mounting the filesystem between passes.

I ran a few simple tests and was able to ascertain that the open(2) and read(2)
system calls don't return errors but fstat(2) does return EBADF.

su-2.05b# ls | grep 1071701821.78602.aether.uk.clara.net
1071701821.78602.aether.uk.clara.net
su-2.05b# ls 1071701821.78602.aether.uk.clara.net
ls: 1071701821.78602.aether.uk.clara.net: Bad file descriptor

Any assistance in diagnosing this would be greatly appreciated.

Yours,

Ollie

-- 
Oliver Cook    Systems Administrator, Claranet UK
ollie@uk.clara.net               +44 20 7903 3065
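The message above does not show the test programs that were used. A minimal sketch of that kind of per-syscall probe, assuming a Python interpreter is available on the machine, might look like the following; the function name `probe` is hypothetical, and the script is not the one the poster ran.

```python
import errno
import os
import sys


def probe(path: str) -> dict:
    """Try open, read and fstat on path; record 'ok' or the errno name for each."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # If open itself fails there is nothing further to test.
        return {"open": errno.errorcode.get(exc.errno, str(exc.errno))}

    results = {"open": "ok"}
    steps = (
        ("read", lambda: os.read(fd, 4096)),
        ("fstat", lambda: os.fstat(fd)),
    )
    try:
        for step, call in steps:
            try:
                call()
                results[step] = "ok"
            except OSError as exc:
                results[step] = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        os.close(fd)
    return results


if __name__ == "__main__":
    # Usage: python probe.py FILE [FILE ...]
    for name in sys.argv[1:]:
        print(name, probe(name))
```

On a file showing the symptom described above, one would expect open and read to report "ok" and fstat to report EBADF. Running it over a whole directory would give a rough count of how many files are affected.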
Vinod Kashyap
2004-Apr-19 10:54 UTC
filesystem corruption with 1TB filesystem, 4.9-STABLE, twe
Please check the customer advisory at http://www.3ware.com/support/index.asp
and make sure you don't have a known problem configuration.

Also, I would recommend using the "FreeBSD 4.8 Beta Driver" on the 3ware
website, or still better, use the driver at the tip of RELENG_4 at:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/dev/twe/?only_with_tag=RELENG_4

If you need any help with the 3ware product, please contact support@3ware.com.

Thanks,

Vinod.

> -----Original Message-----
> From: owner-freebsd-stable@freebsd.org
> [mailto:owner-freebsd-stable@freebsd.org]On Behalf Of Ollie Cook
> Sent: Monday, April 19, 2004 10:06 AM
> To: Matthew Seaman; freebsd-stable@freebsd.org
> Subject: Re: filesystem corruption with 1TB filesystem,
> 4.9-STABLE, twe
>
> On Mon, Apr 19, 2004 at 04:51:16PM +0100, Matthew Seaman wrote:
> > Can you rule out hardware problems by substituting in a different
> > known good 3-ware card? Last time I saw anything like this was an IO
> > controller chip going marginal through overheating under heavy load
> > and flipping occasional bits on disk block addresses.
>
> Hi,
>
> This is the first time we've tried the 3ware ATA RAID cards so I don't have
> another 'known good' one. Usually we use SCSI RAID cards. If I were to replace
> it I still wouldn't be able to guarantee that it was 'known good'. I may be
> able to arrange a part replacement, just to test that theory, in any case.
>
> Are you aware of any utilities which would be able to diagnose what you
> describe? I guess what I'm asking is how did you determine that that was what
> happened last time you saw these symptoms?
>
> Thanks,
>
> Ollie
>
> --
> Oliver Cook    Systems Administrator, Claranet UK
> ollie@uk.clara.net               +44 20 7903 3065
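The quoted question about diagnostic utilities is not answered in this thread. One generic approach, offered only as a sketch and not as the method anyone here used, is to write files with known checksums to a scratch directory on the array and re-read them later. The function names and the file naming scheme below are made up for illustration.

```python
import hashlib
import os


def write_pattern_files(directory: str, count: int, size: int) -> dict:
    """Write `count` files of `size` random bytes; return {path: sha256 hex digest}."""
    digests = {}
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        data = os.urandom(size)
        path = os.path.join(directory, f"pattern-{i:06d}.bin")
        with open(path, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the data out to the device
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_pattern_files(digests: dict) -> list:
    """Re-read every file; return the paths that are unreadable or no longer match."""
    bad = []
    for path, expected in digests.items():
        try:
            with open(path, "rb") as fh:
                actual = hashlib.sha256(fh.read()).hexdigest()
        except OSError:
            bad.append(path)
            continue
        if actual != expected:
            bad.append(path)
    return bad
```

For the re-read to exercise the disks rather than the buffer cache, the filesystem would need to be unmounted and remounted (or the machine rebooted) between the write and verify steps, with the digests kept somewhere off the array. A clean result does not rule out a hardware fault; a mismatch on a filesystem that fsck considers clean would point below the filesystem layer.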
Doug White
2004-Apr-30 10:28 UTC
filesystem corruption with 1TB filesystem, 4.9-STABLE, twe
On Sun, 18 Apr 2004, Ollie Cook wrote:

> I am experiencing filesystem corruption while using a 1TB (appx.) partition
> under 4.9-STABLE (sources from Mar 17) and an 8-port 3ware ATA RAID card (twe
> device driver). The RAID set comprises 5x250GB ATA disks.

[...]

The type of corruption you're seeing would be consistent with one of the
disks not accepting writes or some other sort of array corruption. I
realize it'll take forever, but can you run an array verify? I wonder if
the BIOS isn't picking up a disk failure since it isn't throwing errors,
but isn't doing any useful work either.

-- 
Doug White                    |  FreeBSD: The Power to Serve
dwhite@gumbysoft.com          |  www.FreeBSD.org
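One inexpensive thing to look at alongside an array verify is the raw byte content of the block counts in the "free inode ... had N blocks" kernel messages from the original report. The sketch below assumes the count is a 32-bit field stored in little-endian (x86) byte order; neither is stated in the thread, so treat the output as a hint rather than a diagnosis.

```python
import struct


def raw_bytes(value: int) -> bytes:
    """Pack value as a 32-bit little-endian unsigned integer (assumed layout)."""
    return struct.pack("<I", value)


def printable(data: bytes) -> bool:
    """True if every byte is a printable ASCII character (space through '~')."""
    return all(0x20 <= b < 0x7F for b in data)


# Block counts copied from the three kernel messages in the original report.
counts = [137391860, 1803039330, 544501359]

for c in counts:
    data = raw_bytes(c)
    print(f"{c:>10}  0x{c:08x}  {data!r}  printable={printable(data)}")
```

Under those assumptions, two of the three values decode to four printable ASCII characters. That would be consistent with file contents ending up where inode metadata was expected, for example through misdirected writes, but three samples are far too few to prove it.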
Matthew Reimer
2004-Apr-30 10:52 UTC
filesystem corruption with 1TB filesystem, 4.9-STABLE, twe
Is your card plugged into a riser card? We had similar problems (random
corruption) with a 7506-8 card. The workaround was to set the speed for
that PCI slot to 33MHz (rather than Auto or 66MHz). I think this tech note
describes our problem:

http://www.3ware.com/kb/article.aspx?id=10848

(Read the PDF file attached to the tech note.)

Now the box is as solid as a rock.

Matt