> > Running that program on all nodes concurrently, with each job
> > reading a separate input file, locks up the file system - any access
> > ('ls', 'cd', 'df', whatever) hangs indefinitely. A hard reboot of
> > the meta-data server and clients is necessary to get the file
> > system back. Running on 4 - 8 nodes slows the file system to a
> > crawl.
>
> Do you know if any of the nodes are hung solid, or if you can still
> login to them and run commands? If so, it would be extremely valuable
> to get stack traces from SysRq-T. If not, it would help to know which
> nodes are hung solid, and exactly what is on the screen at the end.

I don't have "SysRq" enabled in the kernel. I'll set it up and try.
The nodes themselves don't hang. I can still log into them and
run any command that doesn't touch the Lustre file system, mounted
as '/vol/10' there. I haven't examined the console output on a
troubled node, but 'syslog' there typically reports something like

Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@edd46200 x984232/t0 o36->mds1_UUID@NID_fgrate1_UUID:12 lens 168/584 ref 1 fl RPC:/0/0 rc 0/0
Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT mds1_UUID MDC_eee002.grid.gs.washington.edu_mds1_MNT_feee002 NID_fgrate1_UUID: -2; check /proc/sys/lustre/lustre_upcall

four or five minutes after the program starts. Usually just those
two messages, and nothing else. At that point, the file system is
locked.

> If we send you a complimentary copy of Lustre 1.2.2, would you be
> willing and able to try to reproduce it with the new code?

Sure, I'll give it a go.

> > 'Syslog' on the object and meta-data servers reports errors
> > like
> >
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1275:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1301:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1323:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28dc025:7de29b69
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1347:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_012.fragment5.tmp.cut2.temp_adjust = 28e0025:7de29b65
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1343:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_008.fragment5.tmp.cut1.temp_adjust = 28d4023:7de29b67
> > Jun 10 16:57:44 grate1 kernel: LustreError: 1341:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_007.fragment5.tmp.cut1.temp_adjust = 28d0023:7de29b6a
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1327:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d4023:7de29b67
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1331:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28e0025:7de29b65
> > Jun 10 16:57:44 grate1 kernel: Lustre: 1330:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d0023:7de29b6a
> >
> > which I don't know how to interpret.
>
> Of all of them, only the first two are somewhat worrying -- they
> indicate that a client is looking for an object which the metadata
> server thinks should exist, but the object server does not have.
> The others are normal messages, related to files which have been unlinked
> but not yet closed, and are then finally closed (and thus removed from
> the disk). Those messages have been removed from the console output in
> 1.2.x.
>
> Are these the only messages which appear before you reboot? How long do
> you wait before rebooting? Unless the nodes are completely wedged, you
> should see some messages about recovery on at least the clients, but
> maybe also some servers.

I've waited up to an hour or so after the file system hangs before
rebooting. The error messages I see on the servers are like

Jun 10 09:42:05 grate1 kernel: LustreError: 1799:(filter.c:1767:filter_destroy()) destroying non-existent object 300310
Jun 10 09:42:05 grate1 kernel: LustreError: 1800:(filter.c:1767:filter_destroy()) destroying non-existent object 300310
Jun 10 09:42:05 grate1 kernel: LustreError: 1810:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 300149
Jun 10 09:42:05 grate1 kernel: LustreError: 1790:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 300149

Other error messages appear when I try to shut down Lustre on the
meta-data server

Jun 10 11:26:33 grate1 kernel: Lustre: 1639:(socknal_cb.c:1547:ksocknal_process_receive()) [ece0d000] EOF from 0xc0a80105 ip 192.168.1.5:32773
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:321:class_cleanup()) OBD device 5 (f929e024) has refcount 55
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03400 cf09d_MNT_feln001_1e93ad6d32 50 0 00000000
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03c00 e1195_MNT_feln001_f6dc0ac823 18 0 00000000
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: eca97e00 3775d_MNT_feln001_ed581c22ff 1 0 00000000
...
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: ebf6e000 mds1_UUID 2 0 00000000
Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:239:class_detach()) OBD device 5 still set up
Jun 10 11:29:07 grate1 kernel: LustreError: 3984:(obd_config.c:321:class_cleanup()) OBD device 2 (f929cf08) has refcount 50
...
Jun 10 11:29:08 grate1 kernel: LustreError: 4050:(obd_config.c:239:class_detach()) OBD device 0 still set up
Jun 10 11:29:54 grate1 kernel: LustreError: 1645:(socknal_cb.c:2534:ksocknal_find_timed_out_conn()) Timed out TX to 0xc0a80005 3624 ebff6000 192.168.0.5
Jun 10 11:29:54 grate1 kernel: LustreError: 1645:(socknal_cb.c:2566:ksocknal_check_peer_timeouts()) Timeout out conn->0xc0a80005 ip 192.168.0.5:988
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@dfd44800 x211509/t0 o400->ost1_UUID@NID_fgrate2_UUID:6 lens 64/64 ref 1 fl Complete:N/0/0 rc 0/0
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT ost1_UUID OSC_grate1.grid.gs.washington.edu_ost1_mds1 NID_fgrate2_UUID: -2; check /proc/sys/lustre/lustre_upcall
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@eda28400 x211511/t0 o400->ost3_UUID@NID_fgrate2_UUID:6 lens 64/64 ref 1 fl Complete:N/0/0 rc 0/0
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT ost3_UUID OSC_grate1.grid.gs.washington.edu_ost3_mds1 NID_fgrate2_UUID: -2; check /proc/sys/lustre/lustre_upcall
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@dd1aac00 x211513/t0 o400->ost5_UUID@NID_fgrate2_UUID:6 lens 64/64 ref 1 fl Complete:N/0/0 rc 0/0
Jun 10 11:30:35 grate1 kernel: LustreError: 1652:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT ost5_UUID OSC_grate1.grid.gs.washington.edu_ost5_mds1 NID_fgrate2_UUID: -2; check /proc/sys/lustre/lustre_upcall

Note that I can't stop Lustre on the meta-data server in this
circumstance. I have to reset the machine to get it back.
Also, only the meta-data server and clients have to be reset;
the object server keeps running. (Though maybe I should restart
Lustre there?)

> Thanks again--
>
> -Phil

Thanks for your help.

David S.
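The trailing "-2" on the failed-upcall lines above is -ENOENT, which here most likely means the kernel could not find the upcall script it was asked to run. A quick, non-invasive check of what each node is actually configured with; both paths below are taken verbatim from the log messages, so nothing here is guessed:

    # What will the kernel try to exec on a FAILED_IMPORT event,
    # and does that script actually exist on this node?
    cat /proc/sys/lustre/lustre_upcall
    ls -l /usr/lib/lustre/lustre_upcall

If the second command reports "No such file or directory", that matches the -2 above and would explain why no recovery action is being taken after the timeouts.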
On Fri, 2004-06-11 at 19:33, David S. wrote:
>
> I don't have "SysRq" enabled in the kernel. I'll set it up and try.
> The nodes themselves don't hang. I can still log into them and
> run any command that doesn't touch the Lustre file system, mounted
> as '/vol/10' there. I haven't examined the console output on a
> troubled node, but 'syslog' there typically reports something like
>
> Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(client.c:812:ptlrpc_expire_one_request()) @@@ timeout req@edd46200 x984232/t0 o36->mds1_UUID@NID_fgrate1_UUID:12 lens 168/584 ref 1 fl RPC:/0/0 rc 0/0
> Jun 10 10:53:20 eee002 kernel: LustreError: 24503:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT mds1_UUID MDC_eee002.grid.gs.washington.edu_mds1_MNT_feee002 NID_fgrate1_UUID: -2; check /proc/sys/lustre/lustre_upcall

That is a client node, right?

> four or five minutes after the program starts. Usually just those
> two messages, and nothing else. At that point, the file system is
> locked.

Given all of this new information, I think SysRq is not so important.

Lustre 1.0.x does not automatically reconnect to the server if an
upcall script is not provided. You need to reconnect the MDC or OSC
manually with lctl (see knowledge base issue #2396:
https://bugzilla.lustre.org/show_bug.cgi?id=2396). A rough sketch of
such an upcall script follows after this message.

That should get things moving after the timeout, at least, but it
doesn't explain why the timeout occurred in the first place. Perhaps
this will be fixed already when you try 1.2.x. In any case, it will
reconnect automatically, and we can debug the timeout separately.

> Other error messages appear when I try to shut down Lustre on the
> meta-data server
>
> Jun 10 11:26:33 grate1 kernel: Lustre: 1639:(socknal_cb.c:1547:ksocknal_process_receive()) [ece0d000] EOF from 0xc0a80105 ip 192.168.1.5:32773
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:321:class_cleanup()) OBD device 5 (f929e024) has refcount 55
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03400 cf09d_MNT_feln001_1e93ad6d32 50 0 00000000
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: e3d03c00 e1195_MNT_feln001_f6dc0ac823 18 0 00000000
> Jun 10 11:29:07 grate1 kernel: LustreError: 3951:(obd_config.c:273:dump_exports()) mds1: eca97e00 3775d_MNT_feln001_ed581c22ff 1 0 00000000
>
> Note that I can't stop Lustre on the meta-data server in this
> circumstance. I have to reset the machine to get it back.
> Also, only the meta-data server and clients have to be reset;
> the object server keeps running. (Though maybe I should restart
> Lustre there?)

You are seeing these messages because the client has timed out waiting
for the metadata server to respond, but the metadata server believes
that the client is still connected. If a server still has connected
clients, it won't clean up unless you provide --force. If your MDS
still won't clean up, even with --force, then that is a separate bug.

Thanks--

-Phil
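Judging from the FAILED_IMPORT lines in the logs above, the upcall is invoked roughly as "/usr/lib/lustre/lustre_upcall FAILED_IMPORT <target UUID> <local MDC/OSC name> <connection UUID>". Below is a minimal, untested sketch of what such a script might look like; the "lctl --device %<name> recover" form is an assumption about the 1.0.x command set, so verify the exact syntax against the knowledge base entry Phil cites before relying on it:

    #!/bin/sh
    # /usr/lib/lustre/lustre_upcall -- sketch only, not a supported script.
    # Arguments, as seen in the syslog lines above:
    #   $1 = event type (e.g. FAILED_IMPORT)
    #   $2 = UUID of the target that timed out
    #   $3 = name of the local MDC/OSC device to reconnect
    #   $4 = connection UUID / NID
    LCTL=/usr/sbin/lctl        # adjust if lctl is installed elsewhere

    if [ "$1" = "FAILED_IMPORT" ]; then
        # Ask the named import to reconnect to its server.
        $LCTL --device %"$3" recover
    fi

The script needs to be executable and to sit at the path shown in /proc/sys/lustre/lustre_upcall; the -2 (ENOENT) in the earlier logs suggests it is currently missing altogether.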
Hi David--

Thank you very much for reporting this bug. Your job sounds very
similar to loads that our other customers run, so I am confident that
we can help you out.

On Fri, 2004-06-11 at 17:10, David S. wrote:
>
> Running that program on all nodes concurrently, with each job
> reading a separate input file, locks up the file system - any access
> ('ls', 'cd', 'df', whatever) hangs indefinitely. A hard reboot of
> the meta-data server and clients is necessary to get the file
> system back. Running on 4 - 8 nodes slows the file system to a
> crawl.

Do you know if any of the nodes are hung solid, or if you can still
login to them and run commands? If so, it would be extremely valuable
to get stack traces from SysRq-T. If not, it would help to know which
nodes are hung solid, and exactly what is on the screen at the end.
(A short recipe for capturing SysRq-T traces is sketched after this
message.)

If we send you a complimentary copy of Lustre 1.2.2, would you be
willing and able to try to reproduce it with the new code?

> 'Syslog' on the object and meta-data servers reports errors
> like
>
> Jun 10 16:57:44 grate1 kernel: LustreError: 1275:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> Jun 10 16:57:44 grate1 kernel: LustreError: 1301:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
> Jun 10 16:57:44 grate1 kernel: Lustre: 1323:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28dc025:7de29b69
> Jun 10 16:57:44 grate1 kernel: LustreError: 1347:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_012.fragment5.tmp.cut2.temp_adjust = 28e0025:7de29b65
> Jun 10 16:57:44 grate1 kernel: LustreError: 1343:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_008.fragment5.tmp.cut1.temp_adjust = 28d4023:7de29b67
> Jun 10 16:57:44 grate1 kernel: LustreError: 1341:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_007.fragment5.tmp.cut1.temp_adjust = 28d0023:7de29b6a
> Jun 10 16:57:44 grate1 kernel: Lustre: 1327:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d4023:7de29b67
> Jun 10 16:57:44 grate1 kernel: Lustre: 1331:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28e0025:7de29b65
> Jun 10 16:57:44 grate1 kernel: Lustre: 1330:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d0023:7de29b6a
>
> which I don't know how to interpret.

Of all of them, only the first two are somewhat worrying -- they
indicate that a client is looking for an object which the metadata
server thinks should exist, but the object server does not have. The
others are normal messages, related to files which have been unlinked
but not yet closed, and are then finally closed (and thus removed from
the disk). Those messages have been removed from the console output in
1.2.x.

Are these the only messages which appear before you reboot? How long do
you wait before rebooting? Unless the nodes are completely wedged, you
should see some messages about recovery on at least the clients, but
maybe also some servers.

Thanks again--

-Phil
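Since the nodes stay up and accept logins (as David's reply above notes), the SysRq-T traces Phil asks for can be captured from an ordinary session rather than the console. A minimal recipe, assuming a kernel built with CONFIG_MAGIC_SYSRQ; if /proc/sysrq-trigger is not available on this kernel, pressing Alt-SysRq-T on the console has the same effect:

    # Enable the magic SysRq key at runtime.
    echo 1 > /proc/sys/kernel/sysrq

    # Dump stack traces of every task into the kernel log
    # (equivalent to Alt-SysRq-T on the console), then save them.
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-t.`hostname`.txt

The same traces also land in syslog, so grabbing /var/log/messages from a hung client right after triggering this is usually enough if the dmesg ring buffer wraps.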
I'm using Lustre 1.0.4 on a ~40 node cluster with two servers, one
strictly an object server and the other a combined object and
meta-data server. (I can provide more details on the hardware and
configuration if necessary.)

I'm running a program that seems to be giving the file system a great
deal of difficulty. That program reads a large data file in 400 KB
chunks, does some processing, then writes 10 - 20 output files up to
400 KB in size. That's repeated ~7000 times, and then the output files
are merged into four files.

Running that program on all nodes concurrently, with each job
reading a separate input file, locks up the file system - any access
('ls', 'cd', 'df', whatever) hangs indefinitely. A hard reboot of
the meta-data server and clients is necessary to get the file
system back. Running on 4 - 8 nodes slows the file system to a
crawl.

'Syslog' on the object and meta-data servers reports errors like

Jun 10 16:57:44 grate1 kernel: LustreError: 1275:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
Jun 10 16:57:44 grate1 kernel: LustreError: 1301:(filter.c:1383:__filter_oa2dentry()) filter_getattr on non-existent object: 335019
Jun 10 16:57:44 grate1 kernel: Lustre: 1323:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28dc025:7de29b69
Jun 10 16:57:44 grate1 kernel: LustreError: 1347:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_012.fragment5.tmp.cut2.temp_adjust = 28e0025:7de29b65
Jun 10 16:57:44 grate1 kernel: LustreError: 1343:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_008.fragment5.tmp.cut1.temp_adjust = 28d4023:7de29b67
Jun 10 16:57:44 grate1 kernel: LustreError: 1341:(mds_reint.c:1403:mds_add_link_orphan()) pending destroy of 1x open file chr10_007.fragment5.tmp.cut1.temp_adjust = 28d0023:7de29b6a
Jun 10 16:57:44 grate1 kernel: Lustre: 1327:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d4023:7de29b67
Jun 10 16:57:44 grate1 kernel: Lustre: 1331:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28e0025:7de29b65
Jun 10 16:57:44 grate1 kernel: Lustre: 1330:(mds_open.c:1015:mds_mfd_close()) destroying orphan object 28d0023:7de29b6a

which I don't know how to interpret. I'd be grateful for any advice.
(A rough reproducer sketch of this workload appears after this
message.)

David S.
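For anyone trying to reproduce this on a test setup (for instance against the 1.2.2 build offered elsewhere in the thread), here is a rough sketch that approximates the I/O pattern described above: 400 KB reads from one large per-node input file, a batch of small output files per step, ~7000 steps, then a merge into four files. The path names, the 15-file batch size, and the loop counts are placeholders, not details of the actual program:

    #!/bin/sh
    # Rough reproducer sketch for the workload described above.
    # INPUT, OUTDIR, and the counts below are placeholders -- adjust as
    # needed, and run one copy per client node with a distinct INPUT file.
    INPUT=/vol/10/input.`hostname`.dat
    OUTDIR=/vol/10/scratch.`hostname`
    mkdir -p "$OUTDIR"

    i=0
    while [ $i -lt 7000 ]; do
        # read one 400 KB chunk of the input file
        dd if="$INPUT" of=/dev/null bs=400k count=1 skip=$i 2>/dev/null

        # write a batch of small output files (the program writes 10 - 20,
        # each up to 400 KB; 15 is used here as a stand-in)
        j=0
        while [ $j -lt 15 ]; do
            dd if=/dev/zero of="$OUTDIR/out.$i.$j" bs=400k count=1 2>/dev/null
            j=$((j + 1))
        done
        i=$((i + 1))
    done

    # finally, merge all the outputs into four files, round-robin
    k=0
    for f in "$OUTDIR"/out.*; do
        cat "$f" >> "$OUTDIR/merged.$((k % 4))"
        k=$((k + 1))
    done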