I just noticed this today:

# zpool status -v
  pool: space
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        space       ONLINE       0     0     0
          c0t1d0    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        space/dcc:<0x11e887>
        space/dcc:<0xba25aa>

The device here is a hardware mirror of two 146-gig SAS drives.
How can ZFS detect errors when it has no redundancy? How do I
determine what files these are? Will a scrub fix it? This is a
production system, so I want to be careful.

It's running Solaris 10 5/09 s10x_u7wos_08 X86.

-- 
-Gary Mills-    -Unix Group-    -Computer and Network Services-
Hi Gary,

To answer your questions, the hardware read some data, and ZFS detected
a problem with the checksums in this dataset and reported it. ZFS
checksums every block and verifies the checksum when the block is read,
so it can detect corruption regardless of ZFS redundancy; redundancy
only determines whether it can also repair the damage.

I don't think a scrub will fix these permanent errors, but it depends
on the corruption. If it's data that is not redundant and copies=2 is
not set, then probably not. If it's metadata, then multiple copies
exist, but it depends on the extent of the corruption.

If space/dcc is a dataset, is it mounted? ZFS might not be able to
print the filenames if the dataset is not mounted, but I'm not sure
if this is why only object numbers are displayed.

The zpool status -v command will generally print out filenames, dnode
object numbers, or identify metadata corruption problems. Because these
numbers are large, they look like object numbers rather than metadata
objects, but an expert will have to comment.

You might be able to identify these object numbers with zdb, but
I'm not sure how to do that.

I would also check fmdump -eV to see how often the hardware
has had problems.

Cindy

On 12/04/09 12:19, Gary Mills wrote:
> The device here is a hardware mirror of two 146-gig SAS drives.
> How can ZFS detect errors when it has no redundancy? How do I
> determine what files these are? Will a scrub fix it? This is a
> production system, so I want to be careful.
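Since fmdump -eV prints every error report in full, it can help to get a per-class count first and only then read the verbose records. This is a minimal sketch, assuming the non-verbose fmdump -e output has one event per line with the event class (for example ereport.fs.zfs.checksum) as the last field; summarize_ereports is a hypothetical helper name, not a Solaris command.

```sh
#!/bin/sh
# Hypothetical helper (not part of Solaris): count FMA error reports by class.
# Assumes the class name is the last whitespace-separated field of each event
# line printed by "fmdump -e". Header and blank lines are skipped because
# their last field does not start with "ereport.".
summarize_ereports() {
    awk '$NF ~ /^ereport\./ { n[$NF]++ }
         END { for (c in n) print c, n[c] }' | LC_ALL=C sort
}

# Usage on a live system:
#   fmdump -e | summarize_ereports
```

A burst of ereport.fs.zfs.checksum events clustered around one date, as opposed to a steady trickle, is the kind of pattern this makes easy to spot before digging into fmdump -eV.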
On Dec 5, 2009, at 0:52, Cindy Swearingen <Cindy.Swearingen at Sun.COM> wrote:

> The zpool status -v command will generally print out filenames, dnode
> object numbers, or identify metadata corruption problems. These look
> like object numbers, because they are large, rather than metadata
> objects, but an expert will have to comment.

Yes, these are object numbers, and the most likely reason they are not
turned into filenames is that the corresponding files no longer exist.
So I'd run a scrub again: if the files are gone and there is no other
corruption, the scrub will reset the error log and zpool status should
become clean.

> You might be able to identify these object numbers with zdb, but
> I'm not sure how to do that.

You can try to use zdb this way to check whether these objects still
exist:

    zdb -d space/dcc 0x11e887 0xba25aa

Victor

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
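To feed the entries from zpool status -v into the zdb -d command Victor suggests, a small filter can pull out the dataset:<0x...> pairs. This is a sketch under stated assumptions: list_damaged_objects is a hypothetical helper, it only handles the dataset:<0xNNN> form seen in Gary's output (entries already resolved to path names, and <metadata> entries, are ignored), and it also prints each object number in decimal, which is handy when comparing against zdb's object listings, which show object numbers in decimal.

```sh
#!/bin/sh
# Hypothetical helper (not part of Solaris): read "zpool status -v" output on
# stdin and print "dataset object-hex object-decimal" for every error entry of
# the form dataset:<0xNNN>. Lines whose dataset part starts with "<" (such as
# <metadata>:<0x0>) and plain path names are skipped.
list_damaged_objects() {
    sed -n 's/^[[:space:]]*\([^[:space:]:<][^[:space:]:]*\):<\(0x[0-9a-fA-F]*\)>[[:space:]]*$/\1 \2/p' |
    while read -r ds obj; do
        # printf's %d accepts a 0x-prefixed argument and prints it in decimal.
        printf '%s %s %d\n' "$ds" "$obj" "$obj"
    done
}

# Usage: run zdb -d once per damaged object, e.g.
#   zpool status -v space | list_damaged_objects |
#   while read -r ds hex dec; do zdb -d "$ds" "$hex"; done
```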
On Fri, Dec 04, 2009 at 02:52:47PM -0700, Cindy Swearingen wrote:
>
> If space/dcc is a dataset, is it mounted? ZFS might not be able to
> print the filenames if the dataset is not mounted, but I'm not sure
> if this is why only object numbers are displayed.

Yes, it's mounted and is quite an active filesystem.

> I would also check fmdump -eV to see how frequent the hardware
> has had problems.

That shows ZFS checksum errors in July, but nothing since that time.
There were also DIMM errors before that, starting in June. We
replaced the failed DIMMs, also in July. This is an X4450 with ECC
memory. There were no disk errors reported. I suppose we can blame
the memory.
On Sat, Dec 05, 2009 at 01:52:12AM +0300, Victor Latushkin wrote:
> On Dec 5, 2009, at 0:52, Cindy Swearingen <Cindy.Swearingen at Sun.COM>
> wrote:
>
> >The zpool status -v command will generally print out filenames, dnode
> >object numbers, or identify metadata corruption problems. These look
> >like object numbers, because they are large, rather than metadata
> >objects, but an expert will have to comment.
>
> Yes, these are object numbers, and the most likely reason they are not
> turned into filenames is that the corresponding files no longer exist.

That seems to be the case:

# zdb -d space/dcc 0x11e887 0xba25aa
Dataset space/dcc [ZPL], ID 21, cr_txg 19, 20.5G, 3672408 objects

> So I'd run a scrub again: if the files are gone and there is no other
> corruption, the scrub will reset the error log and zpool status should
> become clean.

That worked. After the scrub, there are no errors reported.
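On a production system it can be worth scripting the scrub-then-check step Victor describes, so the error log is inspected only after the scrub has finished. This is a minimal sketch, assuming zpool status contains the text "scrub in progress" while a scrub runs and "errors: No known data errors" when the error log is clean (the exact wording of the scrub line varies between releases). The function names and the 60-second poll interval are hypothetical choices, not anything from the thread.

```sh
#!/bin/sh
# Hypothetical helpers; both read "zpool status" output on stdin.
# Output is discarded instead of using grep -q, for older greps without -q.
scrub_running() {
    grep 'scrub in progress' >/dev/null
}

no_known_errors() {
    grep 'errors: No known data errors' >/dev/null
}

# Start a scrub, poll every 60 seconds until it is no longer in progress,
# then return 0 only if the pool reports no known data errors.
scrub_and_check() {
    pool=$1
    zpool scrub "$pool" || return 1
    while zpool status "$pool" | scrub_running; do
        sleep 60
    done
    zpool status -v "$pool" | no_known_errors
}

# Usage:
#   scrub_and_check space && echo "space: no known data errors"
```

Keeping the two text checks as separate functions makes them easy to try against saved zpool status output before pointing scrub_and_check at a live pool.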