[Lustre v1.8.3 servers on CentOS 5.4]

We had a piece of hardware fail and it killed all data on an OST. On the MDS I ran:

    lctl --device 25 deactivate
    lctl conf_param sanvol06-OST0013.osc.active=0

Lustre is back up and running and now we're in cleanup mode. I'm now trying to get a list of the files that are now corrupt. On one of the Lustre clients I'm running:

    lfs find --obd sanvol06-OST0013_UUID <my lustre mount point>

It starts to list files, then a few minutes later it runs into an error and stops:

    cb_find_init: IOC_LOV_GETINFO on <filename> failed: Input/output error.

In dmesg I see:

    LustreError: 13926:0:(file.c:1053:ll_glimpse_size()) obd_enqueue returned rc -5, returning -EIO

The file that gets that "Input/output error" cannot be deleted or removed from the file system. How can I get around this? At the end of the day I need to get a list of the files that were on the bad OST and then be able to remove them.

Thanks for your help,

Scott Barber
iMemories.com
Senior Systems Administrator
On 2010-06-02, at 11:54, Scott Barber wrote:
> I'm now trying to get a list of files that are now corrupt. On one of
> the lustre clients I'm running:
> lfs find --obd sanvol06-OST0013_UUID <my lustre mount point>
>
> It starts to list files and then a few minutes later it runs into an
> error and stops:
> cb_find_init: IOC_LOV_GETINFO on <filename> failed: Input/output error.
>
> In dmesg I see:
> LustreError: 13926:0:(file.c:1053:ll_glimpse_size()) obd_enqueue
> returned rc -5, returning -EIO
>
> The file that gets that "Input/output error" cannot be deleted or
> removed from the file system. How can I get around this?

There is a bug in "lfs find": it tries to get the file size unnecessarily. You can use "lfs getstripe --obd ..." instead, and it should work even if the OST is down.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
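For a large file system it may help to capture the output of that command in a file rather than watching it scroll by. The following is only a minimal sketch in Python 3, not part of the original advice: the helper names, the mount point /mnt/lustre and the output file name are hypothetical, and the --obd and --recursive options are assumed to behave as listed in the lfs usage text for 1.8 (check "lfs help getstripe" on your version).

```python
import subprocess


def getstripe_command(ost_uuid, mount_point):
    """Build the argv for 'lfs getstripe' restricted to one OST.

    Assumes the --obd and --recursive options as listed in the lfs
    usage text; verify them against the installed lfs before relying
    on this.
    """
    return ["lfs", "getstripe", "--obd", ost_uuid, "--recursive", mount_point]


def save_getstripe_output(ost_uuid, mount_point, output_path):
    """Run the command and write its raw output to output_path.

    Returns the exit status of lfs. The output is not parsed here,
    because its exact layout differs between Lustre versions.
    """
    with open(output_path, "w") as out:
        return subprocess.call(getstripe_command(ost_uuid, mount_point),
                               stdout=out)


if __name__ == "__main__":
    # Hypothetical mount point and output file, for illustration only.
    status = save_getstripe_output("sanvol06-OST0013_UUID", "/mnt/lustre",
                                   "files-on-OST0013.txt")
    print("lfs exited with status", status)
```

The raw output is kept as-is so that nothing is lost if the format differs from what a parser would expect.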
Great - exactly what I needed.

Thanks,
Scott

On Wed, Jun 2, 2010 at 12:58 PM, Andreas Dilger <andreas.dilger at oracle.com> wrote:
> There is a bug in "lfs find": it tries to get the file size unnecessarily. You can use "lfs getstripe --obd ..." instead, and it should work even if the OST is down.
On Wednesday 02 June 2010, Andreas Dilger wrote:
> On 2010-06-02, at 11:54, Scott Barber wrote:
> > I'm now trying to get a list of files that are now corrupt. On one of
> > the lustre clients I'm running:
> > lfs find --obd sanvol06-OST0013_UUID <my lustre mount point>
> >
> > It starts to list files and then a few minutes later it runs into an
> > error and stops:
> > cb_find_init: IOC_LOV_GETINFO on <filename> failed: Input/output error.
> >
> > In dmesg I see:
> > LustreError: 13926:0:(file.c:1053:ll_glimpse_size()) obd_enqueue
> > returned rc -5, returning -EIO
> >
> > The file that gets that "Input/output error" cannot be deleted or
> > removed from the file system. How can I get around this?
>
> There is a bug in "lfs find" that it tries to get the file size
> unnecessarily. You can use "lfs getstripe --obd ..." instead, and it
> should work even if the OST is down.

Hmm, yes and no. In principle I like the idea that 'lfs find' tries to figure out the file size. A couple of years ago I had to deal with a 3-disk failure on a RAID6 array, and although we tried to clone the third failing disk, in the end we lost that OST. A stripe size of 4M and a stripe count of 4 were configured. When I then ran 'lfs find' to find the files located on that OST, it reported lots of files that *would* have had data on that OST if they had been sufficiently large. But lots of those files were smaller than 1M, so it would have been wrong to delete them.

It turned out that 'lfs find' was rather useless for us, and I simply had to read each file: if the read succeeded all was fine, if it failed I moved the file into a dedicated subdirectory. The missing OST was recreated later on (that was easier back then with 1.4 than it is nowadays) and we only lost a small part of the files, definitely much less than what 'lfs find' suggested.

So if 'lfs find' now used the file size to determine whether a file really has data on an OST, that would be an improvement.
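The over-reporting described above follows from the striping arithmetic. As a rough sketch in Python 3, assuming plain round-robin (RAID0) striping where chunk k of a file goes to stripe k modulo the stripe count, the stripes that can hold any data for a file of a given size can be worked out as below. The function name is made up for illustration, and the mapping from stripe index to an actual OST still has to come from the file's layout as reported by 'lfs getstripe'.

```python
MIB = 1024 * 1024


def stripes_with_data(file_size, stripe_size, stripe_count):
    """Return the 0-based stripe indices that can hold data for a file.

    Assumes plain round-robin (RAID0) striping. The file size is only
    an upper bound: a sparse file may have holes, so a stripe listed
    here can still be empty.
    """
    if file_size < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("sizes must be non-negative and counts positive")
    # Number of stripe_size chunks needed to cover the file (ceiling division).
    chunks = -(-file_size // stripe_size)
    # Chunks are dealt out round-robin, so at most stripe_count stripes are used.
    return list(range(min(chunks, stripe_count)))


if __name__ == "__main__":
    # A file under 1M with a 4M stripe size and a stripe count of 4
    # only ever touches the first stripe.
    print(stripes_with_data(MIB - 1, 4 * MIB, 4))
    # A 100M file touches all four stripes.
    print(stripes_with_data(100 * MIB, 4 * MIB, 4))
```

With a stripe size of 4M and a stripe count of 4, any file of 4M or less has data only on its first stripe, so a lost OST that holds the second, third or fourth object of such a file costs no data.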
Of course, if 'lfs find' fails outright with an I/O error, it is not useful either ;)

Cheers,
Bernd

--
Bernd Schubert
DataDirect Networks
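The read-every-file triage described in this message could be scripted roughly as follows. This is a minimal Python 3 sketch under stated assumptions, not the procedure that was actually used: the function names and the directory layout are hypothetical, and the quarantine directory is assumed to be on the same file system so that os.rename() only touches metadata and never has to copy the damaged data.

```python
import os


def read_ok(path, chunk_size=1 << 20):
    """Return True if the whole file can be read, False on an I/O error."""
    try:
        with open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except OSError:
        return False


def quarantine_unreadable(root, quarantine_dir, reader=read_ok):
    """Move every regular file under root that cannot be read.

    Files are renamed into quarantine_dir, keeping their path relative
    to root. Returns the list of original paths that were moved.
    """
    quarantine_abs = os.path.abspath(quarantine_dir)
    os.makedirs(quarantine_abs, exist_ok=True)
    moved = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the quarantine directory itself.
        dirnames[:] = [d for d in dirnames
                       if os.path.abspath(os.path.join(dirpath, d))
                       != quarantine_abs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if reader(path):
                continue
            target = os.path.join(quarantine_abs, os.path.relpath(path, root))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Rename only; this fails across file systems by design.
            os.rename(path, target)
            moved.append(path)
    return moved
```

Reading every file is slow on a large file system, but it only flags files that really lost data, which is the point made above about small files on a striped layout.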