Hi all,

after failure of a server contributing two OSTs to our Lustre fs, I'm
having trouble either getting rid of these OSTs for good or re-introducing
them. (It's a test system; the data on it may be thrown away at any time if
necessary.) The system is running Debian Etch, kernel 2.6.20, Lustre 1.6.0.1.

Trying to mount the OSTs invariably gives me

kernel: LustreError: Trying to start OBD testfs1-OST000b_UUID using the
wrong disk . Were the /dev/ assignments rearranged?
kernel: LustreError: 7792:0:(filter.c:1008:filter_prep()) cannot read
last_rcvd: rc = -22
...

The log messages following these lines are a consequence of these, I guess.
Although I'm not sure what the /dev/ assignments might refer to, nothing has
been changed on this machine - just a reboot (and maybe a damaged partition,
of course). I also haven't found the meaning of the error code -22.

I went on trying to unregister these OSTs on the MGS. The Lustre manual says

$ mgs> lctl conf_param testfs-OST0001.osc.active=0

This doesn't work, nor do most of the examples given in
http://manual.lustre.org/manual/LustreManual16_HTML - for which Lustre
version was this manual written? 'man lctl' tells me that the --device
option may be missing. On the MGS, I got

$ mgs> lctl dl
...
19 UP osc testfs1-OST000a-osc testfs1-mdtlov_UUID 5
20 UP osc testfs1-OST000b-osc testfs1-mdtlov_UUID 5

(Something else that I'm missing painfully in all the Lustre documentation:
an explanation of the output of commands!)
My guess was that the correct name for my OSTs is given in the fourth field,
so I tried

$ mgs> lctl --device testfs1-OST000a-osc conf_param testfs1-OST000b.osc.active=0

This at least didn't give me an error. The output of 'lctl dl' did not
change, however; 19 and 20 are still there and UP.

$ mgs> lctl --device testfs1-OST000a-osc deactivate

had the same result.

Still, I went on to the OSS and tried

$ oss> tunefs.lustre --erase-params --fsname=testfs1 --ost --mgsnode=MGS@tcp0 /dev/sdb1

which doesn't work because of

tunefs.lustre: cannot change the name of a registered target
tunefs.lustre: exiting with 1 (Operation not permitted)

$ oss> tunefs.lustre --writeconf --erase-params --fsname=testfs1 --ost --mgsnode=MGS@tcp0 /dev/sdb1

works fine, but mounting the partition results in exactly the same error
messages in the syslog as before.

So far I have not tried reformatting these partitions. But I think I should
ask the experts here about all the mistakes I made.

Many thanks,
Thomas

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 2126
Fax: +49-6159-71 2986

Gesellschaft für Schwerionenforschung mbH
Planckstraße 1
D-64291 Darmstadt
www.gsi.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Darmstadt
Handelsregister: Amtsgericht Darmstadt, HRB 1528
Geschäftsführer: Professor Dr. Walter F. Henning, Dr. Alexander Kurz
Vorsitzende des Aufsichtsrates: Dr. Beatrix Vierkorn-Rudolph,
Stellvertreter: Ministerialdirigent Dr. Rolf Bernhardt
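(For comparison, a sketch of how these two steps are usually written in 1.6:
conf_param is run bare on the MGS, without --device, while deactivate takes
the numeric device index from the first column of 'lctl dl' and is run on
the MDS node. The index 20 below is taken from the listing above; the mds>
prompt is only an assumption about which node holds that osc device.)

$ mgs> lctl conf_param testfs1-OST000b.osc.active=0
$ mds> lctl --device 20 deactivate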
Nathaniel Rutman
2007-Aug-08 16:10 UTC
[Lustre-discuss] recovery after OSS failure, error -22
Thomas Roth wrote:
> Hi all,
>
> after failure of a server contributing two OSTs to our Lustre fs, I'm
> having trouble either getting rid of these OSTs for good or
> re-introducing them. (It's a test system; the data on it may be thrown
> away at any time if necessary.) The system is running Debian Etch,
> kernel 2.6.20, Lustre 1.6.0.1.
>
> Trying to mount the OSTs invariably gives me
>
> kernel: LustreError: Trying to start OBD testfs1-OST000b_UUID using the
> wrong disk . Were the /dev/ assignments rearranged?
> kernel: LustreError: 7792:0:(filter.c:1008:filter_prep()) cannot read
> last_rcvd: rc = -22
> ...

"the wrong disk ." -- the missing disk name implies the last_rcvd file has
been corrupted. (The -22 EINVAL is a consequence of that.) You could try
mounting the disk as type ldiskfs, then erasing the last_rcvd file - this
should cause the OST to regenerate it.
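(A minimal sketch of that recovery path, assuming /dev/sdb1 is one of the
affected OSTs and that /mnt/ostfix and /mnt/ost are scratch mount points of
your choosing; moving last_rcvd aside instead of deleting it keeps a copy in
case it is needed later:)

$ oss> mkdir -p /mnt/ostfix
$ oss> mount -t ldiskfs /dev/sdb1 /mnt/ostfix
$ oss> mv /mnt/ostfix/last_rcvd /root/last_rcvd.OST000b.bak   # remove the corrupted last_rcvd
$ oss> umount /mnt/ostfix
$ oss> mount -t lustre /dev/sdb1 /mnt/ost                     # OST should recreate last_rcvd on start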