On May 8, 2019, at 9:59 AM, Michelle Sullivan <michelle at sorbs.net> wrote:

> Paul Mather wrote:
>>> due to lack of space.  Interestingly have had another drive die in the
>>> array - and it doesn't just have one or two sectors down it has a *lot*
>>> - which was not noticed by the original machine - I moved the drive to
>>> a byte copier which is where it's reporting 100's of sectors damaged...
>>> could this be compounded by zfs/mfi driver/hba not picking up errors
>>> like it should?
>>
>>
>> Did you have regular pool scrubs enabled?  It would have picked up
>> silent data corruption like this.  It does for me.
>
> Yes, every month (once a month because (1) the data doesn't change much
> (new data is added, old is not touched), and (2) because to complete it
> took 2 weeks.)

Do you also run sysutils/smartmontools to monitor S.M.A.R.T. attributes?
Although imperfect, it can sometimes signal trouble brewing with a drive
(e.g., increasing Reallocated_Sector_Ct and Current_Pending_Sector counts)
that can lead to proactive remediation before catastrophe strikes.

Unless you have been gathering periodic drive metrics, you have no way of
knowing whether these hundreds of bad sectors have happened suddenly or
slowly over a period of time.

Cheers,

Paul.
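
(For reference, a minimal sketch of the kind of attribute check suggested
above, assuming smartmontools is installed.  The device names below are
placeholders, and drives behind a RAID HBA such as mfi(4) may need
smartctl's -d option; see smartctl(8).)

  #!/bin/sh
  # Sketch: report the two attributes mentioned above for a set of drives.
  # /dev/ada0 and /dev/ada1 are example device names only.
  for dev in /dev/ada0 /dev/ada1; do
      echo "=== ${dev} ==="
      smartctl -A "${dev}" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector'
  done

(Run by hand or from cron and compare the counts over time; a steady climb
in either attribute is a reasonable cue to replace the drive.)
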
Paul Mather wrote:
> On May 8, 2019, at 9:59 AM, Michelle Sullivan <michelle at sorbs.net> wrote:
>
>> Paul Mather wrote:
>>>> due to lack of space.  Interestingly have had another drive die in
>>>> the array - and it doesn't just have one or two sectors down it has
>>>> a *lot* - which was not noticed by the original machine - I moved
>>>> the drive to a byte copier which is where it's reporting 100's of
>>>> sectors damaged... could this be compounded by zfs/mfi driver/hba
>>>> not picking up errors like it should?
>>>
>>>
>>> Did you have regular pool scrubs enabled?  It would have picked up
>>> silent data corruption like this.  It does for me.
>> Yes, every month (once a month because (1) the data doesn't change
>> much (new data is added, old is not touched), and (2) because to
>> complete it took 2 weeks.)
>
>
> Do you also run sysutils/smartmontools to monitor S.M.A.R.T.
> attributes?  Although imperfect, it can sometimes signal trouble
> brewing with a drive (e.g., increasing Reallocated_Sector_Ct and
> Current_Pending_Sector counts) that can lead to proactive remediation
> before catastrophe strikes.

Not automatically.

>
> Unless you have been gathering periodic drive metrics, you have no way
> of knowing whether these hundreds of bad sectors have happened
> suddenly or slowly over a period of time.

No, it's something I have thought about but been unable to spend the time
on.

-- 
Michelle Sullivan
http://www.mhix.org/
On Wed, 8 May 2019, Paul Mather wrote:

> On May 8, 2019, at 9:59 AM, Michelle Sullivan <michelle at sorbs.net> wrote:
>
>> Paul Mather wrote:
>>>> due to lack of space.  Interestingly have had another drive die in the
>>>> array - and it doesn't just have one or two sectors down it has a *lot* -
>>>> which was not noticed by the original machine - I moved the drive to a
>>>> byte copier which is where it's reporting 100's of sectors damaged...
>>>> could this be compounded by zfs/mfi driver/hba not picking up errors like
>>>> it should?
>>>
>>>
>>> Did you have regular pool scrubs enabled?  It would have picked up silent
>>> data corruption like this.  It does for me.
>> Yes, every month (once a month because (1) the data doesn't change much
>> (new data is added, old is not touched), and (2) because to complete it
>> took 2 weeks.)
>
>
> Do you also run sysutils/smartmontools to monitor S.M.A.R.T. attributes?
> Although imperfect, it can sometimes signal trouble brewing with a drive
> (e.g., increasing Reallocated_Sector_Ct and Current_Pending_Sector counts)
> that can lead to proactive remediation before catastrophe strikes.
>
> Unless you have been gathering periodic drive metrics, you have no way of
> knowing whether these hundreds of bad sectors have happened suddenly or
> slowly over a period of time.
>

+1

Use `smartctl` from a cron script to do regular (say, weekly) *long*
self-tests of hard drives, and also log (say, daily) all the SMART
information from each drive.  Then if a drive fails, you can at least check
the logs for whether SMART noticed symptoms, and (if so) check the other
drives for symptoms too.  Or enhance this with a slightly longer script
that watches the logs for symptoms and alerts you (a rough sketch follows
below).

(My experience is that SMART's *long* self-test checks the entire disk for
read errors with neither downside of `zpool scrub`: it does a fast,
sequential read of the HD, including free space.  That makes it a nice test
for failing disk hardware, though not a replacement for `zpool scrub`.)

> Cheers,
>
> Paul.
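
(A minimal sketch of the cron-driven routine described above.  The device
list, log directory, and script name are assumptions; adjust for the actual
system, and note that drives behind a RAID HBA such as mfi(4) may also need
smartctl's -d option.)

  #!/bin/sh
  # smart-check.sh (hypothetical name): log SMART data daily and start a
  # long self-test once a week, as described in the message above.

  DRIVES="/dev/ada0 /dev/ada1"   # placeholder device names
  LOGDIR="/var/log/smart"        # assumed log location
  mkdir -p "${LOGDIR}"

  for dev in ${DRIVES}; do
      name=$(basename "${dev}")

      # Daily: append the full SMART report so trends (e.g. a growing
      # Reallocated_Sector_Ct) can be reviewed after a failure.
      date >> "${LOGDIR}/${name}.log"
      smartctl -a "${dev}" >> "${LOGDIR}/${name}.log" 2>&1

      # Saturdays: kick off a long self-test; its result appears in
      # subsequent `smartctl -a` output.
      if [ "$(date +%u)" -eq 6 ]; then
          smartctl -t long "${dev}" >> "${LOGDIR}/${name}.log" 2>&1
      fi
  done

(Invoked once a day from root's crontab, e.g. "0 3 * * *
/usr/local/sbin/smart-check.sh"; the path is only an example.  Scanning the
logs for rising attribute counts and mailing an alert is a small extension
of the same loop.)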