Jim Klimov
2012-Jun-10 13:07 UTC
[zfs-discuss] A disk on Thumper giving random CKSUM error counts
Hello all,

As some of you might remember, there is a Sun Fire X4500 (Thumper) server that I was asked to help upgrade to modern disks. It is still in a testing phase, and the one UltraStar 3TB currently available to the server's owners is humming in the server, with one small partition at its tail that earlier replaced a dead 250GB disk in the pool. The OS is still SXCE snv_117 so far.

Early tests, which filled the UltraStar with data in a couple of single-partition pools on it, showed that writes and scrubs yielded 0 errors. However, now that this disk works as part of a larger old pool (9*5-vdev raidz1 sets), it shows CKSUM errors on every scrub. They tried to reseat the disk in the same position (and will soon try the former position of another failed disk), but reseating did not help.

For some reason, while the early high CKSUM counts led ZFS to degrade the pool, this no longer happens (and my questionable script to clear the errors during scrub is not in use anymore).

The numbers of CKSUM errors vary, in no dependable pattern: 1852, 317, 146, 83, 32, 1063, 6, 163, 4, 1, 8, 50... I cannot say that there is a trend suggesting that "once some intermediate errors are cleansed, the counts will remain zero".

Hence the question: what can be wrong, considering (hoping) that there are high-quality components in play, in a cooled datacenter room with UPSes powering the box? Is there some way to reliably test and blame or rule out:

* HDD itself (media, chips, connectors) as a black box
* OS version or aging X4500 hardware, including:
  * backplane connectors
  * Marvell controllers
  * power source
  * ECC RAM
  * CPUs

So far, two disks have failed on this server in positions c1t2 and c5t6, and the replacement disk is currently running in position c1t2.
Other disks have not reported errors over the 4 years that the server has been in 24*7*365 service, so I doubt that this is a systemwide problem (CPU, RAM, power), or even a controller/backplane-wide problem. I am more inclined to suspect the connectors or particular lanes on the controller.

Any better ideas? Perhaps someone has had similar experiences?

Thanks,
//Jim Klimov
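
For anyone wanting to log the per-device CKSUM counts after each scrub and see whether they follow the disk or the slot, here is a minimal sketch in Python. It is not the script mentioned in the post. It assumes the usual `NAME STATE READ WRITE CKSUM` column layout printed by `zpool status` and Solaris-style cXtYdZ device names; the sample text, the pool name `pond`, and the function names are hypothetical and made up for illustration.

```python
import re

# Hypothetical sample in the layout printed by `zpool status`.
# Pool name, device names and counts are invented for illustration.
SAMPLE = """\
  pool: pond
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        pond        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0   163
            c4t1d0  ONLINE       0     0     0

errors: No known data errors
"""

# Matches rows for leaf devices named like c1t2d0, c1t2d0s1 or c1t2d0p1.
# Captures the device name and the fifth column (CKSUM).
DEVICE_ROW = re.compile(
    r"^\s+(c\d+t\d+d\d+(?:[sp]\d+)?)\s+\S+\s+\S+\s+\S+\s+(\S+)"
)


def cksum_by_device(status_text):
    """Return {device: CKSUM count} for every leaf device row found."""
    counts = {}
    for line in status_text.splitlines():
        match = DEVICE_ROW.match(line)
        if match:
            value = match.group(2)
            # Keep the raw string if the count is not a plain integer.
            counts[match.group(1)] = int(value) if value.isdigit() else value
    return counts


def devices_with_errors(counts):
    """Return only the devices whose CKSUM count is not zero."""
    return {dev: n for dev, n in counts.items() if n != 0}


if __name__ == "__main__":
    print(devices_with_errors(cksum_by_device(SAMPLE)))
```

In practice the text would come from the saved output of `zpool status` after each scrub rather than from the embedded sample, and the resulting counts could be appended to a log together with the slot the disk was in at the time.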