This morning we got a fault management message from one of our production
servers stating that a fault in one of our pools had been detected and fixed.
Looking into the error using fmdump gives:
fmdump -v -u 90ea244e-1ea9-4bd6-d2be-e4e7a021f006
TIME                 UUID                                 SUNW-MSG-ID
Oct 22 09:29:05.3448 90ea244e-1ea9-4bd6-d2be-e4e7a021f006 FMD-8000-4M Repaired
  100%  fault.fs.zfs.device
        Problem in: zfs://pool=vol02/vdev=179e471c0732582
           Affects:   zfs://pool=vol02/vdev=179e471c0732582
               FRU: -
          Location: -
My question is: how do I relate the vdev name above (179e471c0732582) with an
actual drive? I've checked these ids against the device ids
(cXtYdZ - obviously no match) and against all of the disk serial numbers.
I've also tried all of the "zpool list" and "zpool status" options with no
luck.
I'm sure I'm missing something obvious here, but if anyone can
point me in the right direction I'd appreciate it!
-- 
This message posted from opensolaris.org
Hi Sean,
A better way probably exists, but I use fmdump -eV to identify the
pool and the device information (vdev_path), which is listed like this:
# fmdump -eV | more
.
.
.
         pool = test
         pool_guid = 0x6de45047d7bde91d
         pool_context = 0
         pool_failmode = wait
         vdev_guid = 0x2ab2d3ba9fc1922b
         vdev_type = disk
         vdev_path = /dev/dsk/c0t6d0s0
Then you can match the vdev_path device to the device in your storage
pool.
You can also review the date/time stamps in this output to see how long
the device has had a problem.
It's probably a good idea to run a zpool scrub on this pool too.
Cindy
On 10/23/09 12:04, sean walmsley wrote:
> My question is: how do I relate the vdev name above (179e471c0732582) with
> an actual drive?
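For anyone who wants to script the lookup Cindy describes, here is a minimal Python sketch. It assumes the "fmdump -eV" text has already been captured into a string and that each vdev_guid line is followed later in the same record by its vdev_path line, as in the excerpt above. The function name vdev_paths_by_guid and the SAMPLE constant are made up for illustration; they are not part of any Solaris tool.

```python
import re

# Sample text copied from the excerpt above. Real "fmdump -eV" output
# contains many more fields per error report.
SAMPLE = """\
         pool = test
         pool_guid = 0x6de45047d7bde91d
         pool_context = 0
         pool_failmode = wait
         vdev_guid = 0x2ab2d3ba9fc1922b
         vdev_type = disk
         vdev_path = /dev/dsk/c0t6d0s0
"""

# Matches "name = value" lines, ignoring surrounding whitespace.
FIELD = re.compile(r"^\s*(\w+)\s*=\s*(.+?)\s*$")


def vdev_paths_by_guid(text):
    """Map each vdev_guid (as an int) to the vdev_path that follows it."""
    mapping = {}
    current_guid = None
    for line in text.splitlines():
        m = FIELD.match(line)
        if not m:
            continue
        key, value = m.groups()
        if key == "vdev_guid":
            current_guid = int(value, 16)  # int() accepts the "0x" prefix
        elif key == "vdev_path" and current_guid is not None:
            mapping[current_guid] = value
            current_guid = None
    return mapping


if __name__ == "__main__":
    for guid, path in vdev_paths_by_guid(SAMPLE).items():
        print(f"{guid:#x} -> {path}")
```

This only helps while the error log still holds the relevant ereports; if the log is empty there is nothing to parse.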
Thanks for this information.
We have a weekly scrub schedule, but I ran another just to be sure :-) It
completed with 0 errors.
Running fmdump -eV gives:
TIME                           CLASS
fmdump: /var/fm/fmd/errlog is empty
Dumping the faultlog (no -e) does give some output, but again there are no
"human readable" identifiers:
... (some stuff omitted)
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x4fcdc2c9d60a5810
                        vdev = 0x179e471c0732582
                (end asru)
                resource = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x4fcdc2c9d60a5810
                        vdev = 0x179e471c0732582
                (end resource)
        (end fault-list[0])
So, I'm still stumped.
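The pool and vdev values in the fault log are GUIDs in hex. As a rough sketch, the Python below pulls them out of the "asru" blocks so they can be compared with other tools' output. It assumes the text layout is exactly as pasted above; asru_guids and SAMPLE are illustrative names only.

```python
import re

# Trimmed copy of the fault-list entry shown above.
SAMPLE = """\
        (start fault-list[0])
        nvlist version: 0
                version = 0x0
                class = fault.fs.zfs.device
                certainty = 0x64
                asru = (embedded nvlist)
                nvlist version: 0
                        version = 0x0
                        scheme = zfs
                        pool = 0x4fcdc2c9d60a5810
                        vdev = 0x179e471c0732582
                (end asru)
        (end fault-list[0])
"""

GUID_LINE = re.compile(r"(pool|vdev) = (0x[0-9a-fA-F]+)$")


def asru_guids(text):
    """Return (pool_guid, vdev_guid) int pairs found inside asru blocks."""
    pairs = []
    in_asru = False
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("asru = "):
            in_asru, fields = True, {}
        elif line == "(end asru)":
            # Blocks without a vdev entry are skipped in this sketch.
            if in_asru and "pool" in fields and "vdev" in fields:
                pairs.append((fields["pool"], fields["vdev"]))
            in_asru = False
        elif in_asru:
            m = GUID_LINE.match(line)
            if m:
                fields[m.group(1)] = int(m.group(2), 16)
    return pairs


if __name__ == "__main__":
    for pool, vdev in asru_guids(SAMPLE):
        # Print both bases, since different tools report different ones.
        print(f"pool {pool:#x} ({pool})  vdev {vdev:#x} ({vdev})")
```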
I'm stumped too. Someone with more FM* experience needs to comment.

Cindy

On 10/23/09 14:52, sean walmsley wrote:
> So, I'm still stumped.
On 10/23/09 15:05, Cindy Swearingen wrote:
> I'm stumped too. Someone with more FM* experience needs to comment.

Looks like your errlog may have been rotated out of existence - see if
there is a .X or .gz version in /var/fm/fmd/errlog*. The list.suspect
fault should include a location field that would contain the
human-readable name for the vdev, but this work (extending the libtopo
scheme to support enumeration and label properties) hasn't yet been done.
There is also a small change that needs to be made to fmd to support
location for non-FRUs.

You should be able to do "echo ::spa -c | mdb -k" and look for that vdev
id, assuming the vdev is still active on the system.

- Eric

--
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On Oct 23, 2009, at 3:19 PM, Eric Schrock wrote:
> You should be able to do "echo ::spa -c | mdb -k" and look for that vdev
> id, assuming the vdev is still active on the system.

These are the guids, correct? If so, then "zdb -C" will show them.
Conversion from hex to decimal, or vice versa, is an exercise for the
reader.

-richard
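For readers who would rather skip the exercise, a small Python sketch of the conversion in both directions follows. The function names are invented for this example. Python integers are arbitrary precision, so 64-bit GUIDs convert without overflow.

```python
def guid_hex_to_dec(hex_guid):
    """Convert a hex GUID string (leading '0x' optional) to a decimal string."""
    return str(int(hex_guid, 16))


def guid_dec_to_hex(dec_guid):
    """Convert a decimal GUID string to a '0x'-prefixed hex string."""
    return hex(int(dec_guid))


if __name__ == "__main__":
    # GUIDs taken from the fault log earlier in the thread.
    for name, value in (("pool", "0x4fcdc2c9d60a5810"),
                        ("vdev", "0x179e471c0732582")):
        print(name, value, "=", guid_hex_to_dec(value))
```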
Eric and Richard - thanks for your responses.

I tried both:

    echo ::spa -c | mdb -k
    zdb -C (not much of a man page for this one!)

and was able to match the POOL id from the log (hex 4fcdc2c9d60a5810) with
both outputs. As Richard pointed out, I needed to convert the hex value to
decimal to get a match with the zdb output.

In neither case, however, was I able to get a match with the disk vdev id
from the fmdump output.

It turns out that a disk in this machine was replaced about a month ago,
and sure enough the vdev that was complaining at the time was the
0x179e471c0732582 vdev that is now missing.

What's confusing is that the fmd message I posted about is dated Oct 22,
whereas the original error and replacement happened back in September. An
"fmadm faulty" on the machine currently doesn't return any issues.

After physically replacing the bad drive and issuing the "zpool replace"
command, I think that we probably issued the "fmadm repair <uuid>" command
in line with what Sun has asked us to do in the past. In our experience, if
you don't do this then fmd will re-issue duplicate complaints regarding
hardware failures after every reboot until you do. In this case, perhaps a
"repair" wasn't really the appropriate command since we actually replaced
the drive. Would a "fmadm flush" have been better? Perhaps a clean reboot
is in order?

So, it looks like the root problem here is that fmd is confused rather than
there being a real issue with ZFS. Despite this, we're happy to know that
we can now match vdevs against physical devices using either the mdb trick
or zdb.

We've followed Eric's work on ZFS device enumeration for the Fishworks
project with great interest - hopefully this will eventually get extended
to the fmdump output as suggested.

Sean Walmsley
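The matching step can be sketched in Python as below. Please treat the input format as an assumption: SAMPLE_CONFIG is a hypothetical, hand-written stand-in for "zdb -C" output with made-up GUIDs and a made-up device path, not a capture from a real system. The sketch accepts both "key=value" and "key: value" lines because the exact layout may differ between builds. find_vdev_path is an illustrative name.

```python
import re

# HYPOTHETICAL sample: shaped like a pool configuration dump, but every
# value here (GUIDs, device path) is invented for the example.
SAMPLE_CONFIG = """\
vol02
    name='vol02'
    pool_guid=100
    vdev_tree
        type='root'
        id=0
        guid=100
        children[0]
            type='disk'
            id=0
            guid=4660
            path='/dev/dsk/c9t9d9s0'
"""

# Only lines whose key is exactly "guid" or "path" are of interest;
# "pool_guid" and "phys_path" do not match because of the anchoring.
LINE = re.compile(r"^\s*(guid|path)\s*[=:]\s*(.*?)\s*$")


def find_vdev_path(config_text, hex_guid):
    """Return the path listed after the vdev whose decimal guid equals hex_guid."""
    wanted = int(hex_guid, 16)
    current = None
    for line in config_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        key, value = m.group(1), m.group(2).strip("'")
        if key == "guid":
            current = int(value) if value.isdigit() else None
        elif key == "path" and current == wanted:
            return value
    return None  # vdev not present in this configuration


if __name__ == "__main__":
    print(find_vdev_path(SAMPLE_CONFIG, "0x1234"))
    print(find_vdev_path(SAMPLE_CONFIG, "0x179e471c0732582"))
```

A None result corresponds to the situation described above, where the vdev named in the fault is no longer part of the pool.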
On 10/23/09 16:56, sean walmsley wrote:
> What's confusing is that the fmd message I posted about is dated Oct 22
> whereas the original error and replacement happened back in September. An
> "fmadm faulty" on the machine currently doesn't return any issues.

That message indicates that a previous problem was repaired, not a new
diagnosis.

> After physically replacing the bad drive and issuing the "zpool replace"
> command, I think that we probably issued the "fmadm repair <uuid>" command
> in line with what Sun has asked us to do in the past. In our experience,
> if you don't do this then fmd will re-issue duplicate complaints regarding
> hardware failures after every reboot until you do. In this case, perhaps a
> "repair" wasn't really the appropriate command since we actually replaced
> the drive. Would a "fmadm flush" have been better? Perhaps a clean reboot
> is in order?
>
> So, it looks like the root problem here is that fmd is confused rather
> than there being a real issue with ZFS. Despite this, we're happy to know
> that we can now match vdevs against physical devices using either the mdb
> trick or zdb.

This is fixed in build 127 via:

6889827 ZFS retire agent needs to do a better job of staying in sync

> We've followed Eric's work on ZFS device enumeration for the Fishworks
> project with great interest - hopefully this will eventually get extended
> to the fmdump output as suggested.

Yep, we're working on it ;-)

- Eric