Jeff Johnson
2010-Dec-06 23:50 UTC
[Lustre-discuss] OST errors caused by residual client info?
Greetings,

Is it possible that the error below could be caused by a client that has not been rebooted, and has not had its Lustre kernel modules reloaded, across a period during which a few test file systems were built and mounted?

LustreError: 12967:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (-19) req@ffff81032dd2d000 x1348952525350751/t0 o8-><?>@<?>:0/0 lens 368/0 e 0 to 0 dl 1291669076 ref 1 fl Interpret:/0/0 rc -19/0
LustreError: 12967:0:(ldlm_lib.c:1914:target_send_reply_msg()) Skipped 55 previous similar messages
LustreError: 137-5: UUID 'fs-OST0058_UUID' is not available for connect (no target)

Normally this would point to a back-end storage issue. In this case, though, the OSS where this error is logged doesn't have an OST "OST0058"; it has an OST "OST006d". Regardless of the OST name, the back-end RAID is healthy with no hardware errors, and no other hardware errors are present on the OSS node (e.g. MCE, panic, IB/ethernet failures, etc.).

Previous test incarnations of this filesystem were built without assigning the OST index (e.g. OSTFFFF), and the index was assigned upon first mount and connection to the MDS. Is it possible that some clients have residual pointers or config data about the previously built file systems?

Thanks!

--Jeff
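P.S. For reference, one way to confirm which OSTs an OSS is actually serving (the device path below is a placeholder for the real back-end device):

  # list the Lustre devices attached on this OSS and their target names
  lctl dl

  # print the on-disk configuration (fsname, target name, index) of one OST
  tunefs.lustre --print /dev/mapper/ostNN | grep -Ei 'target|index'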
Oleg Drokin
2010-Dec-06 23:55 UTC
[Lustre-discuss] OST errors caused by residual client info?
Hello!

On Dec 6, 2010, at 6:50 PM, Jeff Johnson wrote:
> Previous test incarnations of this filesystem were built where ost name
> was not assigned (e.g.: OSTFFFF) and was assigned upon first mount and
> connection to the mds. Is it possible that some clients have residual
> pointers or config data about the previously built file systems?

If you did not unmount the clients from the previous incarnation of the filesystem, those clients would still continue to try to contact the servers they know about, even after the servers themselves go away and are repurposed (since there is no way for the client to know about this).

Bye,
    Oleg
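P.S. One quick check on a suspect client: list its attached Lustre devices. An osc or mdc entry that still references old targets (for example something like fs-OST0058-osc-*) would mean a mount from a previous incarnation is still active.

  # on the client: list attached Lustre devices
  lctl dl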
Jeff Johnson
2010-Dec-07 00:05 UTC
[Lustre-discuss] OST errors caused by residual client info?
On 12/6/10 3:55 PM, Oleg Drokin wrote:
> Hello!
>
> On Dec 6, 2010, at 6:50 PM, Jeff Johnson wrote:
>> Previous test incarnations of this filesystem were built where ost name
>> was not assigned (e.g.: OSTFFFF) and was assigned upon first mount and
>> connection to the mds. Is it possible that some clients have residual
>> pointers or config data about the previously built file systems?
> If you did not unmount clients from the previous incarnation of the filesystem,
> those clients would still continue to try to contact the servers they know about even
> after the servers themselves go away and are repurposed (since there is no way for the
> client to know about this).

All clients were unmounted, but the Lustre kernel modules were never removed/reloaded, nor were the clients rebooted.

Is it odd that this error would name an OST that is not present on that OSS? Should an OSS only report this error about its own OST devices? As I said, the OSS where the error came from only has an OST006c and an OST006d. It does not have an OST0058, although it may have back when the filesystem was made from a simple test CSV that did not explicitly assign index numbers as part of the mkfs.lustre process; the indices were assigned later, randomly, when the OSTs were first mounted and connected to the MDS.

Do you think it is possible for a client to retain this information even though a umount/mount of the filesystem took place?

--Jeff
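P.S. To clarify how the test targets were built: roughly like the first line below, with no explicit index, so the index was only pinned down when each OST first registered with the MGS/MDS. The fsname, MGS NID, and device path are illustrative, not the exact values used.

  # no --index: the MGS assigns the OST index at first mount/registration
  mkfs.lustre --ost --fsname=fs --mgsnode=10.0.0.1@o2ib /dev/mapper/ostNN

  # versus pinning the index explicitly at format time (109 = 0x6d)
  mkfs.lustre --ost --fsname=fs --mgsnode=10.0.0.1@o2ib --index=109 /dev/mapper/ostNN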
Oleg Drokin
2010-Dec-07 00:15 UTC
[Lustre-discuss] OST errors caused by residual client info?
Hello!

On Dec 6, 2010, at 7:05 PM, Jeff Johnson wrote:
>>> Previous test incarnations of this filesystem were built where ost name
>>> was not assigned (e.g.: OSTFFFF) and was assigned upon first mount and
>>> connection to the mds. Is it possible that some clients have residual
>>> pointers or config data about the previously built file systems?
>> If you did not unmount clients from the previous incarnation of the filesystem,
>> those clients would still continue to try to contact the servers they know about even
>> after the servers themselves go away and are repurposed (since there is no way for the
>> client to know about this).
> All clients were unmounted but the lustre kernel mods were never removed/reloaded nor were the clients rebooted.

If the clients were unmounted, then there is no information left in the kernel about those now-vanished mountpoints.

> Is it odd that this error would occur naming an ost that is not present on that oss? Should an oss only report this error about its own ost devices? As I said,

An OSS will report such an error if a client contacts it trying to access an OST that is not present on that OSS. This could be because a client holds stale information about services (because it was not unmounted from a previous incarnation of the filesystem), or because a failover pair is set up that names this OSS as a possible NID for a failover target.

> Do you think it is possible for a client to retain this information even though a umount/mount of the filesystem took place?

If the clients unmounted cleanly, I don't think there is anywhere such info could be stored. You could go back to the clients sending these requests (identify them by the error messages in their logs; they'd complain about error -19 connecting to OSTs) and see what's wrong with them, what they have mounted, and so on.

Bye,
    Oleg
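P.S. Something along these lines on each client should narrow it down (the grep patterns are just examples):

  # look for the -19 (ENODEV) connect errors in the client kernel log
  dmesg | grep LustreError | grep -- '-19'

  # and see which Lustre filesystems the client currently has mounted
  mount -t lustre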