Hi list,

I'm looking for advice/help in tracking down a problem with a new system I've purchased.

I have a beige box server with a Gigabyte GA-M51GM-S2G motherboard. It has the nVidia MCP51 SATA controller with three 250 GB Western Digital hard drives attached to it.

It seems that when doing a considerable amount of file writing, the filesystem will become read-only. See attached dmesg output.

I started looking for help on the nvnews forums, and found a suggestion to set the pci=nommconf kernel parameter. This did not help. Aside from that, there have only been suggestions such as "it's likely faulty hardware".

The kernel is 2.6.17-10-generic #2 SMP running on Ubuntu Edgy Eft, amd64 version, but the same problem showed up on Fedora Core 6, both x86_64 and i386.

I checked to see if it was perhaps bad memory by running memtest86+, but after 14 hours no errors were found. I've run badblocks on the disk that contains the ext3 partition and no errors were found. Aside from badblocks, I'm not aware of any disk tools I could use to test further. smartmontools reports that all three of the disks are OK.

The bulk of the data being sent to the machine comes in over the network via the application Unison, version 2.13.16, if that makes any difference.

I haven't tried another suggestion, setting the kernel parameter idle=poll, but since nothing else has worked so far, I don't see that making much difference. I also haven't tried installing Windows to isolate the "faulty hardware" suggestion; bad hardware would suggest that Windows would see problems too, right?

Any help would be greatly appreciated. I'm at the end of my rope on this one.

Thanks in advance,
Tim

[Attachment scrubbed: dmesg.output, http://listman.redhat.com/archives/ext3-users/attachments/20070128/28150c28/attachment.ksh]
[Attachment scrubbed: lspci.output, http://listman.redhat.com/archives/ext3-users/attachments/20070128/28150c28/attachment-0001.ksh]
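The checks and the kernel-parameter experiment described above would look roughly like the following; a minimal sketch, assuming the ext3 partition sits on /dev/sda and the other drives are /dev/sdb and /dev/sdc (hypothetical device names, not taken from the original report):

    # Read-only surface scan of the disk holding the ext3 partition;
    # -s shows progress, -v is verbose. The default mode only reads,
    # so it is safe on a disk that is in use.
    badblocks -sv /dev/sda

    # Full SMART attribute dump for each drive (from smartmontools);
    # "smartctl -H" alone prints just the overall health verdict.
    smartctl -a /dev/sda
    smartctl -a /dev/sdb
    smartctl -a /dev/sdc

    # The pci=nommconf experiment: append the parameter to the kernel line
    # in /boot/grub/menu.lst (GRUB legacy on Ubuntu Edgy) and reboot.
    # The root= value below is a placeholder, not from the report:
    # kernel /boot/vmlinuz-2.6.17-10-generic root=/dev/sda1 ro pci=nommconf

Note that badblocks also has a destructive read-write mode (-w) that would wipe the disk; the read-only default shown here is the one that matches the non-destructive testing described above.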
On Sun, Jan 28, 2007 at 04:38:12PM -0600, Tim Rupp wrote:
> I'm looking for advice/help in tracking down a problem with a new system
> I've purchased.
>
> I have a beige box server with a Gigabyte GA-M51GM-S2G motherboard. It
> has the nVidia MCP51 SATA controller with three 250 GB Western Digital
> hard drives attached to it.
>
> It seems that when doing a considerable amount of file writing, the
> filesystem will become read-only. See attached dmesg output.

According to the dmesg output, the filesystem is getting remounted read-only because the kernel detected an inconsistency in the block allocation bitmaps. Basically, a block that was in use and getting freed (due to a file getting deleted) was found to be already marked as not in use in the block bitmap.

This is very dangerous, since a corrupted block allocation bitmap can result in data loss when a block gets used by two different files, and the contents of part of the first file get overwritten by the second. Hence, ext3 remounted the filesystem read-only in order to protect your data from getting (more) corrupted.

The question then is why this is happening. If you run e2fsck and it finds nothing wrong, that means it was the in-core memory that was corrupted: the data was correct on disk, but when it was read from disk into memory, it got corrupted somehow (another good reason for ext3 to mark the filesystem read-only, to prevent the corrupted data from being written back to disk).

In any case, given that you've checked the memory, it does rather seem to narrow it down to either the SATA cables, the disk drives, or the SATA controller, roughly in that order of probability. The SATA cables are probably the cheapest to try replacing first.

I suppose there is a chance that it's a hardware device driver or kernel issue. You might want to ask on LKML or on the Ubuntu support forums if there are any known issues with the nVidia SATA controller driver.

Good luck,

						- Ted
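A minimal sketch of the e2fsck run Ted describes, assuming the affected filesystem is /dev/sda1 mounted at /mnt/data (both hypothetical names; substitute the real partition and mount point):

    # e2fsck must not run on a filesystem mounted read-write, so unmount it
    # first (or check from a rescue/live environment if it is the root fs).
    umount /mnt/data

    # -f forces a full check even if the superblock claims the filesystem
    # is clean; -v prints verbose statistics. A clean result here means the
    # on-disk bitmaps are fine and the corruption happened in memory.
    e2fsck -f -v /dev/sda1

Conversely, if e2fsck does report bitmap errors, the on-disk state really was damaged, which points more toward the cable/drive/controller chain Ted lists.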
On Jan 28, 2007 16:38 -0600, Tim Rupp wrote:
> I checked to see if it was perhaps bad memory by running memtest86+, but
> after 14 hours no errors were found.

I've heard in the past that you need to run memtest86 for at least a day or two to be sure about that.

Another option (if you have multiple sticks of RAM) is to take half of them out and see if the problem still happens (when running with your reproducer), repeating until you've isolated it to one or more sticks of RAM.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.