> > Running that program on all nodes concurrently, with each job
> > reading a separate input file, locks up the file system - any access
> > ('ls', 'cd', 'df', whatever) hangs indefinitely. A hard reboot of
> > the meta-data server and clients is necessary to get the file
> > system back. Running on 4 - 8 nodes slows the file system to a
> > crawl.
>
> Do you know if any of the nodes are hung solid, or if you can still
> login to them and run commands? If so, it would be extremely valuable
> to get stack traces from SysRq-T. If not, it would help to know which
> nodes are hung solid, and exactly what is on the screen at the end.

I don't have "SysRq" enabled in the kernel. I'll set it up and try.
The nodes themselves don't hang. I can still log into them and
run any command that doesn't touch the Lustre file system, mounted
as '/vol/10' there. I haven't examined the console output on a
troubled node, but 'syslog' there typically reports something like

Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@edd46200 x984232/t0 o36->mds1_UUID@NID_fgrate1_UUID:12 lens 168/584 ref 1 fl RPC:/0/0 rc 0/0
Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT mds1_UUID MDC_eee002.grid.gs.washington.edu_mds1_MNT_feee002 NID_fgrate1_UUID: -2; check /proc/sys/lustre/lustre_upcall

four or five minutes after the program starts. Usually just those
two messages, and nothing else. At that point, the file system is
locked.

> If we send you a complimentary copy of Lustre 1.2.2, would you be
> willing and able to try to reproduce it with the new code?

Sure, I'll give it a go.

> > 'Syslog' on the object and meta-data servers reports errors
> > like
> >
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1275:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1301:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1323:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28dc025:7de29b69
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1347:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_012.fragment5.tmp.cut2.temp_adjust = 28e0025:7de29b65
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1343:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_008.fragment5.tmp.cut1.temp_adjust = 28d4023:7de29b67
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1341:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_007.fragment5.tmp.cut1.temp_adjust = 28d0023:7de29b6a
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1327:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d4023:7de29b67
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1331:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28e0025:7de29b65
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1330:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d0023:7de29b6a
> >
> > which I don't know how to interpret.
>
> Of all of them, only the first two are somewhat worrying -- they
> indicate that a client is looking for an object which the metadata
> server thinks should exist, but the object server does not have.
> The others are normal messages, related to files which have been unlinked
> but not yet closed, and are then finally closed (and thus removed from
> the disk). Those messages have been removed from the console output in
> 1.2.x.
>
> Are these the only messages which appear before you reboot? How long do
> you wait before rebooting? Unless the nodes are completely wedged, you
> should see some messages about recovery on at least the clients, but
> maybe also some servers.

I've waited up to an hour or so after the file system hangs before
rebooting. The error messages I see on the servers are like

Jun 10 09:42:05 grate1 kernel: LustreError: 1799:(filter.c:1767:filter_destroy()) destroying non-existent object 300310
Jun 10 09:42:05 grate1 kernel: LustreError: 1800:(filter.c:1767:filter_destroy()) destroying non-existent object 300310
Jun 10 09:42:05 grate1 kernel: LustreError: 1810:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 300149
Jun 10 09:42:05 grate1 kernel: LustreError: 1790:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 300149

Other error messages appear when I try to shut down Lustre on the
meta-data server

Jun 10 11:26:33 grate1 kernel: Lustre: 1639:(socknal_cb.c:1547:ksocknal_process_receive()) [ece0d000] EOF from 0xc0a80105 ip 192.168.1.5:32773
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:321:class_cleanup()) OBD device 5 (f929e024) has refcount 55
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03400 cf09d_MNT_feln001_1e93ad6d32 50 0 00000000
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03c00 e1195_MNT_feln001_f6dc0ac823 18 0 00000000
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: eca97e00 3775d_MNT_feln001_ed581c22ff 1 0 00000000
...
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: ebf6e000 mds1_UUID 2 0 00000000
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:239:class_detach()) OBD device 5 still set up
Jun 10 11:29:07 grate1 kernel: LustreError: 3984:(obd_config.c:321:class_cleanup()) OBD device 2 (f929cf08) has refcount 50
...
Jun 10 11:29:08 grate1 kernel: LustreError: 4050:(obd_config.c:239:class_detach()) OBD device 0 still set up
Jun 10 11:29:54 grate1 kernel: LustreError: 1645:(socknal_cb.c:2534:ksocknal_find_timed_out_conn()) Timed out TX to 0xc0a80005 3624 ebff6000 192.168.0.5
Jun 10 11:29:54 grate1 kernel: LustreError: 1645:(socknal_cb.c:2566:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80005 ip 192.168.0.5:988
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@dfd44800 x211509/t0 o400->ost1_UUID@NID_fgrate2_UUID:6 lens 64/64 ref 1 fl Complete:N/0/0 rc 0/0
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT ost1_UUID OSC_grate1.grid.gs.washington.edu_ost1_mds1 NID_fgrate2_UUID: -2; check /proc/sys/lustre/lustre_upcall
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@eda28400 x211511/t0 o400->ost3_UUID@NID_fgrate2_UUID:6 lens 64/64 ref 1 fl Complete:N/0/0 rc 0/0
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT ost3_UUID OSC_grate1.grid.gs.washington.edu_ost3_mds1 NID_fgrate2_UUID: -2; check /proc/sys/lustre/lustre_upcall
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@dd1aac00 x211513/t0 o400->ost5_UUID@NID_fgrate2_UUID:6 lens 64/64 ref 1 fl Complete:N/0/0 rc 0/0
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT ost5_UUID OSC_grate1.grid.gs.washington.edu_ost5_mds1 NID_fgrate2_UUID: -2; check /proc/sys/lustre/lustre_upcall

Note that I can't stop Lustre on the meta-data server in this
circumstance. I have to reset the machine to get it back.
Also, only the meta-data server and clients have to be reset;
the object server keeps running. (Though maybe I should restart
Lustre there?)

> Thanks again--
>
> -Phil

Thanks for your help.

David S.
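The trailing "-2" on the failed-upcall lines above is -ENOENT, which here most likely means the kernel could not find the upcall script it was asked to run. A quick, non-invasive check of what each node is actually configured with; both paths below are taken verbatim from the log messages, so nothing here is guessed:

    # What will the kernel try to exec on a FAILED_IMPORT event,
    # and does that script actually exist on this node?
    cat /proc/sys/lustre/lustre_upcall
    ls -l /usr/lib/lustre/lustre_upcall

If the second command reports "No such file or directory", that matches the -2 above and would explain why no recovery action is being taken after the timeouts.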
On Fri, 2004-06-11 at 19:33, David S. wrote:
>
> I don't have "SysRq" enabled in the kernel. I'll set it up and try.
> The nodes themselves don't hang. I can still log into them and
> run any command that doesn't touch the Lustre file system, mounted
> as '/vol/10' there. I haven't examined the console output on a
> troubled node, but 'syslog' there typically reports something like
>
> Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@edd46200 x984232/t0 o36->mds1_UUID@NID_fgrate1_UUID:12 lens 168/584 ref 1 fl RPC:/0/0 rc 0/0
> Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT mds1_UUID MDC_eee002.grid.gs.washington.edu_mds1_MNT_feee002 NID_fgrate1_UUID: -2; check /proc/sys/lustre/lustre_upcall

That is a client node, right?

> four or five minutes after the program starts. Usually just those
> two messages, and nothing else. At that point, the file system is
> locked.

Given all of this new information, I think SysRq is not so important.

Lustre 1.0.x does not automatically reconnect to the server if an
upcall script is not provided. You need to reconnect the MDC or OSC
manually with lctl (see knowledge base issue #2396:
https://bugzilla.lustre.org/show_bug.cgi?id=2396). A rough sketch of
such an upcall script follows after this message.

That should get things moving after the timeout, at least, but it
doesn't explain why the timeout occurred in the first place. Perhaps
this will be fixed already when you try 1.2.x. In any case, it will
reconnect automatically, and we can debug the timeout separately.

> Other error messages appear when I try to shut down Lustre on the
> meta-data server
>
> Jun 10 11:26:33 grate1 kernel: Lustre: 1639:(socknal_cb.c:1547:ksocknal_process_receive()) [ece0d000] EOF from 0xc0a80105 ip 192.168.1.5:32773
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:321:class_cleanup()) OBD device 5 (f929e024) has refcount 55
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03400 cf09d_MNT_feln001_1e93ad6d32 50 0 00000000
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03c00 e1195_MNT_feln001_f6dc0ac823 18 0 00000000
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: eca97e00 3775d_MNT_feln001_ed581c22ff 1 0 00000000
>
> Note that I can't stop Lustre on the meta-data server in this
> circumstance. I have to reset the machine to get it back.
> Also, only the meta-data server and clients have to be reset;
> the object server keeps running. (Though maybe I should restart
> Lustre there?)

You are seeing these messages because the client has timed out waiting
for the metadata server to respond, but the metadata server believes
that the client is still connected. If a server still has connected
clients, it won't clean up unless you provide --force. If your MDS
still won't clean up, even with --force, then that is a separate bug.

Thanks--

-Phil
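Judging from the FAILED_IMPORT lines in the logs above, the upcall is invoked roughly as "/usr/lib/lustre/lustre_upcall FAILED_IMPORT <target UUID> <local MDC/OSC name> <connection UUID>". Below is a minimal, untested sketch of what such a script might look like; the "lctl --device %<name> recover" form is an assumption about the 1.0.x command set, so verify the exact syntax against the knowledge base entry Phil cites before relying on it:

    #!/bin/sh
    # /usr/lib/lustre/lustre_upcall -- sketch only, not a supported script.
    # Arguments, as seen in the syslog lines above:
    #   $1 = event type (e.g. FAILED_IMPORT)
    #   $2 = UUID of the target that timed out
    #   $3 = name of the local MDC/OSC device to reconnect
    #   $4 = connection UUID / NID
    LCTL=/usr/sbin/lctl        # adjust if lctl is installed elsewhere

    if [ "$1" = "FAILED_IMPORT" ]; then
        # Ask the named import to reconnect to its server.
        $LCTL --device %"$3" recover
    fi

The script needs to be executable and to sit at the path shown in /proc/sys/lustre/lustre_upcall; the -2 (ENOENT) in the earlier logs suggests it is currently missing altogether.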
Hi David--

Thank you very much for reporting this bug. Your job sounds very
similar to loads that our other customers run, so I am confident that
we can help you out.

On Fri, 2004-06-11 at 17:10, David S. wrote:
>
> Running that program on all nodes concurrently, with each job
> reading a separate input file, locks up the file system - any access
> ('ls', 'cd', 'df', whatever) hangs indefinitely. A hard reboot of
> the meta-data server and clients is necessary to get the file
> system back. Running on 4 - 8 nodes slows the file system to a
> crawl.

Do you know if any of the nodes are hung solid, or if you can still
login to them and run commands? If so, it would be extremely valuable
to get stack traces from SysRq-T. If not, it would help to know which
nodes are hung solid, and exactly what is on the screen at the end.
(A short recipe for capturing SysRq-T traces is sketched after this
message.)

If we send you a complimentary copy of Lustre 1.2.2, would you be
willing and able to try to reproduce it with the new code?

> 'Syslog' on the object and meta-data servers reports errors
> like
>
> Jun 10 16:57:44 grate1 kernel: LustreError: 1275:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> Jun 10 16:57:44 grate1 kernel: LustreError: 1301:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> Jun 10 16:57:44 grate1 kernel: Lustre: 1323:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28dc025:7de29b69
> Jun 10 16:57:44 grate1 kernel: LustreError: 1347:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_012.fragment5.tmp.cut2.temp_adjust = 28e0025:7de29b65
> Jun 10 16:57:44 grate1 kernel: LustreError: 1343:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_008.fragment5.tmp.cut1.temp_adjust = 28d4023:7de29b67
> Jun 10 16:57:44 grate1 kernel: LustreError: 1341:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_007.fragment5.tmp.cut1.temp_adjust = 28d0023:7de29b6a
> Jun 10 16:57:44 grate1 kernel: Lustre: 1327:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d4023:7de29b67
> Jun 10 16:57:44 grate1 kernel: Lustre: 1331:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28e0025:7de29b65
> Jun 10 16:57:44 grate1 kernel: Lustre: 1330:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d0023:7de29b6a
>
> which I don't know how to interpret.

Of all of them, only the first two are somewhat worrying -- they
indicate that a client is looking for an object which the metadata
server thinks should exist, but the object server does not have. The
others are normal messages, related to files which have been unlinked
but not yet closed, and are then finally closed (and thus removed from
the disk). Those messages have been removed from the console output in
1.2.x.

Are these the only messages which appear before you reboot? How long do
you wait before rebooting? Unless the nodes are completely wedged, you
should see some messages about recovery on at least the clients, but
maybe also some servers.

Thanks again--

-Phil
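Since the nodes stay up and accept logins (as David's reply above notes), the SysRq-T traces Phil asks for can be captured from an ordinary session rather than the console. A minimal recipe, assuming a kernel built with CONFIG_MAGIC_SYSRQ; if /proc/sysrq-trigger is not available on this kernel, pressing Alt-SysRq-T on the console has the same effect:

    # Enable the magic SysRq key at runtime.
    echo 1 > /proc/sys/kernel/sysrq

    # Dump stack traces of every task into the kernel log
    # (equivalent to Alt-SysRq-T on the console), then save them.
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-t.`hostname`.txt

The same traces also land in syslog, so grabbing /var/log/messages from a hung client right after triggering this is usually enough if the dmesg ring buffer wraps.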
I'm using Lustre 1.0.4 on a ~40 node cluster with two servers, one
strictly an object server and the other a combined object and
meta-data server. (I can provide more details on the hardware and
configuration if necessary.)

I'm running a program that seems to be giving the file system a great
deal of difficulty. That program reads a large data file in 400 KB
chunks, does some processing, then writes 10 - 20 output files up to
400 KB in size. That's repeated ~7000 times, and then the output files
are merged into four files.

Running that program on all nodes concurrently, with each job
reading a separate input file, locks up the file system - any access
('ls', 'cd', 'df', whatever) hangs indefinitely. A hard reboot of
the meta-data server and clients is necessary to get the file
system back. Running on 4 - 8 nodes slows the file system to a
crawl.

'Syslog' on the object and meta-data servers reports errors like

Jun 10 16:57:44 grate1 kernel: LustreError: 1275:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
Jun 10 16:57:44 grate1 kernel: LustreError: 1301:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
Jun 10 16:57:44 grate1 kernel: Lustre: 1323:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28dc025:7de29b69
Jun 10 16:57:44 grate1 kernel: LustreError: 1347:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_012.fragment5.tmp.cut2.temp_adjust = 28e0025:7de29b65
Jun 10 16:57:44 grate1 kernel: LustreError: 1343:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_008.fragment5.tmp.cut1.temp_adjust = 28d4023:7de29b67
Jun 10 16:57:44 grate1 kernel: LustreError: 1341:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_007.fragment5.tmp.cut1.temp_adjust = 28d0023:7de29b6a
Jun 10 16:57:44 grate1 kernel: Lustre: 1327:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d4023:7de29b67
Jun 10 16:57:44 grate1 kernel: Lustre: 1331:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28e0025:7de29b65
Jun 10 16:57:44 grate1 kernel: Lustre: 1330:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d0023:7de29b6a

which I don't know how to interpret. I'd be grateful for any advice.
(A rough reproducer sketch of this workload appears after this
message.)

David S.
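For anyone trying to reproduce this on a test setup (for instance against the 1.2.2 build offered elsewhere in the thread), here is a rough sketch that approximates the I/O pattern described above: 400 KB reads from one large per-node input file, a batch of small output files per step, ~7000 steps, then a merge into four files. The path names, the 15-file batch size, and the loop counts are placeholders, not details of the actual program:

    #!/bin/sh
    # Rough reproducer sketch for the workload described above.
    # INPUT, OUTDIR, and the counts below are placeholders -- adjust as
    # needed, and run one copy per client node with a distinct INPUT file.
    INPUT=/vol/10/input.`hostname`.dat
    OUTDIR=/vol/10/scratch.`hostname`
    mkdir -p "$OUTDIR"

    i=0
    while [ $i -lt 7000 ]; do
        # read one 400 KB chunk of the input file
        dd if="$INPUT" of=/dev/null bs=400k count=1 skip=$i 2>/dev/null

        # write a batch of small output files (the program writes 10 - 20,
        # each up to 400 KB; 15 is used here as a stand-in)
        j=0
        while [ $j -lt 15 ]; do
            dd if=/dev/zero of="$OUTDIR/out.$i.$j" bs=400k count=1 2>/dev/null
            j=$((j + 1))
        done
        i=$((i + 1))
    done

    # finally, merge all the outputs into four files, round-robin
    k=0
    for f in "$OUTDIR"/out.*; do
        cat "$f" >> "$OUTDIR/merged.$((k % 4))"
        k=$((k + 1))
    done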