Hi all,

What does this mean?

# cat /proc/fs/lustre/mds/storage-MDT0000/recovery_status
status: INACTIVE

Yesterday evening there was a kernel panic at mount time on both of our metadata servers. I replaced the kernel on them, going from 2.6.18-53.1.21 to 2.6.18-53.1.13. I didn't touch the Lustre version; it's 1.6.5.1.

The cluster seems to be working at the moment, but I don't think this can be healthy. What trouble could this cause, and how can I fix it? In the docs I can only see this procfs entry used for monitoring the recovery procedure.

Thank you,
tamas

2008/9/30 Papp Tamás <tompos at martos.bme.hu>:
> What does this mean?
>
> # cat /proc/fs/lustre/mds/storage-MDT0000/recovery_status
> status: INACTIVE

It's normal; it just means recovery is not running (because it has finished, been aborted, or whatever).

James Braid wrote:
> 2008/9/30 Papp Tamás <tompos at martos.bme.hu>:
>> What does this mean?
>>
>> # cat /proc/fs/lustre/mds/storage-MDT0000/recovery_status
>> status: INACTIVE
>
> It's normal, it just means recovery is not running (because it's
> finished or been aborted or whatever)

This is another one. This one has never had any problems. Shouldn't it look like this?

# cat /proc/fs/lustre/mds/archive-MDT0000/recovery_status
status: COMPLETE
recovery_start: 1221309004
recovery_end: 1221309469
recovered_clients: 1
unrecovered_clients: 0
last_transno: 895550056
replayed_requests: 0

How can I force recovery, just as a test?

tamas
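
For anyone who wants to script against this file, here is a minimal Python sketch that parses recovery_status text into a dictionary and works out how long recovery took. The function names are made up for illustration, and it assumes every line has the "key: value" form seen in the output above; the sample text is copied from that output.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines from recovery_status text into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        # Numeric fields (timestamps, counters) become ints; others stay strings.
        fields[key.strip()] = int(value) if value.isdigit() else value
    return fields


def recovery_duration(fields):
    """Return recovery time in seconds, or None if the timestamps are absent."""
    if "recovery_start" in fields and "recovery_end" in fields:
        return fields["recovery_end"] - fields["recovery_start"]
    return None


# Sample copied from the archive-MDT0000 output quoted in this thread.
SAMPLE = """\
status: COMPLETE
recovery_start: 1221309004
recovery_end: 1221309469
recovered_clients: 1
unrecovered_clients: 0
last_transno: 895550056
replayed_requests: 0
"""

status = parse_recovery_status(SAMPLE)
print(status["status"], recovery_duration(status))  # COMPLETE 465
```

For the INACTIVE case the file shows only the status line, so recovery_duration() returns None there.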

Hi,

COMPLETE means that this particular OST was in recovery and that recovery has now finished.

To force recovery, just unmount the OST and then mount it again. If the unmounted OST had any clients connected, then after being mounted again it will start the recovery process to let all the clients reconnect to it. While an OST is in recovery it refuses all new connections from clients, which means that the file system this OST is part of will not be accessible until recovery finishes.

Recovery finishes either when all previously connected clients have reconnected or when it times out after a certain amount of time. If one of the clients that was connected to the OST crashes, loses power, etc. before it gets a chance to reconnect, then recovery will have to time out. If you know that the OST will not recover all previously connected clients because one of them isn't there any more, you can avoid waiting for the timeout and abort recovery manually:

lctl --device <OST_device_number> abort_recovery

You can find OST_device_number by running the 'lctl dl' command. You will see a line like this:

7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159

Number 7 is the number of the OST device.

All this is in the Lustre operations manual, so please read it.

Cheers,
Wojciech

Papp Tamás wrote:
> James Braid wrote:
>> 2008/9/30 Papp Tamás <tompos at martos.bme.hu>:
>>> What does this mean?
>>>
>>> # cat /proc/fs/lustre/mds/storage-MDT0000/recovery_status
>>> status: INACTIVE
>>
>> It's normal, it just means recovery is not running (because it's
>> finished or been aborted or whatever)
>
> This is another one. This one has no problems ever. Don't should it look
> like this?
>
> # cat /proc/fs/lustre/mds/archive-MDT0000/recovery_status
> status: COMPLETE
> recovery_start: 1221309004
> recovery_end: 1221309469
> recovered_clients: 1
> unrecovered_clients: 0
> last_transno: 895550056
> replayed_requests: 0
>
> How can I force the recovery just for test?
>
> tamas
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss

-- 
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
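
The lookup described in this message (find the device number in 'lctl dl' output, then pass it to abort_recovery) could be sketched in Python roughly as follows. This is only an illustration: the helper names are hypothetical, the column layout is assumed to match the single sample line quoted in the message, and the script only builds the command, it does not run it.

```python
def find_device_number(lctl_dl_output, target_name):
    """Return the device number (first column) for target_name, or None."""
    for line in lctl_dl_output.splitlines():
        columns = line.split()
        # Assumed layout, taken from the sample line in this thread:
        # number, state, type, name, UUID, and a trailing count.
        if len(columns) >= 4 and columns[3] == target_name:
            return int(columns[0])
    return None


def abort_recovery_command(device_number):
    """Build the argument list for the abort_recovery command quoted above."""
    return ["lctl", "--device", str(device_number), "abort_recovery"]


# Sample line copied from the message; real 'lctl dl' output has more rows.
SAMPLE = "  7 UP obdfilter ddn_data-OST0009 ddn_data-OST0009_UUID 1159\n"

number = find_device_number(SAMPLE, "ddn_data-OST0009")
print(" ".join(abort_recovery_command(number)))  # lctl --device 7 abort_recovery
```

Building the command as a list rather than a string keeps it ready for something like subprocess.run() on the server, should you decide to execute it after checking the device number by eye.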

Wojciech Turek wrote:
> To force recovery just unmount OST and then mount it again.
> [...]
> All this is in the lustre operation manual, so please read it.

Of course I've read the manual, many times. The problem is not with the COMPLETE recovery_status but with INACTIVE. I haven't found any information about it in the manual.

I wanted to force recovery without unmounting the OST, just to give it a try. Anyway, it seems to be working right now; I hope it's OK.

Thanks,
tamas

Papp Tamas wrote:

I think something is wrong with the list server; I only sent one email :)

tamas

Hi,

INACTIVE status in the recovery_status file for a particular Lustre target device means that the device didn't go into recovery after it was started (mounted). A Lustre device only goes into recovery if there were clients connected to it when the device was stopped. So INACTIVE is a pretty normal status for Lustre targets.

I don't think you can force a Lustre target device into recovery without unmounting it.

Regards,
Wojciech
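
To summarise the status values discussed in this thread, here is a small hedged Python sketch that turns the status line into a plain-language explanation. The INACTIVE and COMPLETE descriptions paraphrase the replies above; RECOVERING does not appear in this thread and is included on the assumption that it is what the file reports while recovery is in progress. The function name is made up for illustration.

```python
STATUS_MEANINGS = {
    # Paraphrased from the replies in this thread.
    "INACTIVE": "target did not go into recovery after it was mounted",
    "COMPLETE": "target was in recovery and recovery has finished",
    # Assumption: not mentioned in the thread.
    "RECOVERING": "recovery is in progress; new client connections are refused",
}


def explain_status(text):
    """Return 'STATUS: meaning' for the status line in recovery_status text."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            status = value.strip()
            meaning = STATUS_MEANINGS.get(status, "unknown status")
            return f"{status}: {meaning}"
    return "no status line found"


print(explain_status("status: INACTIVE\n"))
```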