Hello all,

I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house backups. The backups are run via rsync/rsnapshot and are large in terms of the number of files: over 10 million each.

Now, the machine is not particularly powerful: a 64-bit machine, dual-core CPU, 3 GB RAM. So perhaps this is a factor in the following problem: once in a while that XFS partition starts generating multiple I/O errors, files that had content become 0 bytes, directories disappear, etc. Every time, a reboot fixes it. So far I've looked at the logs but could not find a cause or precipitating event.

Hence the question: has anyone experienced anything along those lines? What could be the cause of this?

Thanks.

Boris.
On Sun, Jan 22, 2012 at 9:06 AM, Boris Epstein <borepstein at gmail.com> wrote:

> Hello all,
>
> I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
> backups. The backups are run via rsync/rsnapshot and are large in terms of
> the number of files: over 10 million each.
>
> Now the machine is not particularly powerful: it is a 64-bit machine, dual
> core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
> following problem: once in a while that XFS partition starts generating
> multiple I/O errors, files that had content become 0 byte, directories
> disappear, etc. Every time a reboot fixes that, however. So far I've looked
> at logs but could not find a cause or precipitating event.
>
> Hence the question: has anyone experienced anything along those lines?
> What could be the cause of this?
>
> Thanks.
>
> Boris.

A correction to the above: the XFS partition is 26 TB, not 16 TB (not that it should matter in the context of this particular situation).

Also, here's something else I have discovered. Apparently there is potential intermittent RAID disk trouble. At least I found the following in the system log:

Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D): Source drive error occurred:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B): Rebuild paused:unit=0.
...
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B): Rebuild started:unit=0.

Even if a disk is misbehaving, in a RAID6 that should not be causing I/O errors.
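For anyone wanting to sift these 3ware controller events out of the log themselves, a small grep/sed pipeline along these lines would do it. This is just a sketch: on CentOS 5 the real input would be /var/log/messages; a sample excerpt is written to /tmp here so the pipeline is self-contained.

```shell
# Extract 3ware (3w-9xxx) AEN events from a log and count them by
# severity.  A sample excerpt stands in for /var/log/messages.
cat > /tmp/sample_messages <<'EOF'
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026): Drive ECC error reported:port=4, unit=0.
Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004): Rebuild failed:unit=0.
Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING (0x04:0x000F): SMART threshold exceeded:port=9.
EOF

grep '3w-9xxx' /tmp/sample_messages \
  | sed 's/.*AEN: \([A-Z]*\) .*/\1/' \
  | sort | uniq -c
```

On the sample above this prints a count of 2 ERROR and 1 WARNING lines; pointed at the real log it gives a quick picture of how often the controller is complaining.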
Plus, why does the problem never appear straight after a reboot, and why is it always fixed by a reboot? Be that as it may, I am still puzzled.

Boris.
> Now the machine is not particularly powerful: it is a 64-bit machine, dual
> core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
> following problem: once in a while that XFS partition starts generating
> multiple I/O errors, files that had content become 0 byte, directories
> disappear, etc. Every time a reboot fixes that, however. So far I've looked
> at logs but could not find a cause or precipitating event.

Is the CentOS you are running a 64-bit one? The reason I ask is that the use of XFS under a 32-bit OS is NOT recommended. If you search this list's archives you will find some discussion of this subject.
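The distinction matters because a 64-bit CPU can happily run a 32-bit kernel. A quick way to check which one is actually running, as a sketch:

```shell
# `uname -m` reports the architecture of the *running kernel*, which is
# what matters for XFS on a volume this large; the CPU being 64-bit
# capable is not enough.
arch=$(uname -m)
if [ "$arch" = "x86_64" ]; then
    echo "64-bit kernel ($arch)"
else
    echo "non-64-bit kernel ($arch): XFS on a volume this size is not recommended"
fi
```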
> I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
> backups. The backups are run via rsync/rsnapshot and are large in terms of
> the number of files: over 10 million each.
>
> Now the machine is not particularly powerful: it is a 64-bit machine, dual
> core CPU, 3 GB RAM. So perhaps this is a factor in why I am having the
> following problem: once in a while that XFS partition starts generating
> multiple I/O errors, files that had content become 0 byte, directories
> disappear, etc. Every time a reboot fixes that, however. So far I've looked
> at logs but could not find a cause or precipitating event.
>
> Hence the question: has anyone experienced anything along those lines? What
> could be the cause of this?

In every situation like this that I have seen, it was hardware that never had adequate memory provisioned. Another consideration: you almost certainly won't be able to run a repair on that fs with so little RAM. Finally, it would be interesting to know how you architected the storage hardware. Hardware RAID, BBC, drive cache status, barrier status, etc.
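On the memory point: the two numbers worth gathering before attempting an xfs_repair are how many inodes are actually in use and how much RAM the box has. A minimal sketch for collecting both (the mount point "/" below is a stand-in; the poster would substitute the actual XFS backup volume):

```shell
# Print inodes in use on a filesystem and total system RAM -- the two
# inputs to any xfs_repair memory estimate.  "/" is an assumed
# placeholder; substitute the real XFS mount point.
fs=/
df -i "$fs" | awk 'NR==2 {print "inodes in use:", $3}'
awk '/^MemTotal/ {print "total RAM:", $2, "kB"}' /proc/meminfo
```

With over 10 million files per backup run and only 3 GB of RAM, those two figures are likely to be badly out of proportion on this machine.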