Hello,

I am not sure this is the right place to post this message, but here goes. I am seeing this error message intermittently during my tests:

thanos kernel: LustreError: 2456:(niobuf.c:746:ptlrpc_link_svc_me()) LBUG

I am working with the 1.0.1 release. Any ideas as to what is causing the bug? I tried looking through the code to figure it out, but I am not too familiar with it.

Thanks,
Jai
Franco Broi wrote:
> Even though I did a recover after this error I can no longer access some
> of the data, and the df is screwed up.
>
> This is a correct df running from another client:
>
> [franco@charlie19]$ df /data22
> Filesystem           1K-blocks       Used  Available Use% Mounted on
> data22              8063461792 1463253368 6190608456  20% /data22
>
> and this on the client which had the LBUG error:
>
> Filesystem           1K-blocks       Used  Available Use% Mounted on
> data22              4031730896  731684656 3095246256  20% /data22

In general, once a node has had an LBUG, it should be rebooted. LBUGs are assertions, errors which we think should really not happen. As a result, they often happen with locks held, or structures in unknown states, etc, and cannot be gracefully cleaned up.

We're still trying to work out exactly how to prevent this LBUG. If you can reproduce (without an LBUG) the "half of the data is missing" state, which is not resolved by a recover, then please let me know. We would track that as a separate issue.

Thanks--

-Phil
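Since the advice is to reboot any node that has hit an LBUG, it can help to check a node's kernel log for one. The following is only a rough sketch: the regular expression is inferred from the LustreError lines quoted in this thread and is not an official description of the log format, and the function name find_lbugs is made up for illustration.

```python
import re

# Assumed line shape, inferred from the messages in this thread:
#   LustreError: <pid>:(<file>:<line>:<function>()) <message>
LUSTRE_ERROR = re.compile(
    r"LustreError: (?P<pid>\d+):"
    r"\((?P<file>[^:()]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)


def find_lbugs(log_text):
    """Return (pid, file, line, function) for each LustreError whose message is an LBUG."""
    hits = []
    for raw in log_text.splitlines():
        match = LUSTRE_ERROR.search(raw)
        # Only count lines whose message itself starts with LBUG, so that
        # later messages that merely mention an LBUG are not double-counted.
        if match and match.group("msg").strip().startswith("LBUG"):
            hits.append((
                int(match.group("pid")),
                match.group("file"),
                int(match.group("line")),
                match.group("func"),
            ))
    return hits


if __name__ == "__main__":
    sample = (
        "thanos kernel: LustreError: 2456:(niobuf.c:746:ptlrpc_link_svc_me()) LBUG\n"
        "LustreError: 27041:(niobuf.c:97:ptl_send_buf()) PtlMDBind failed: 2\n"
    )
    for pid, src, line, func in find_lbugs(sample):
        print(f"LBUG in {func}() at {src}:{line} (pid {pid}); node should be rebooted")
```

Any non-empty result would mean the node is a candidate for a reboot, per the guidance above.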
Even though I did a recover after this error I can no longer access some of the data, and the df is screwed up.

This is a correct df running from another client:

[franco@charlie19]$ df /data22
Filesystem           1K-blocks       Used  Available Use% Mounted on
data22              8063461792 1463253368 6190608456  20% /data22

and this on the client which had the LBUG error:

Filesystem           1K-blocks       Used  Available Use% Mounted on
data22              4031730896  731684656 3095246256  20% /data22

I'm running 1.0.1 on the client and 1.0.2 on the OST/MDS. Could this be causing these strange problems?

LustreError: 27041:(../../portals/include/portals/lib-p30.h:194:lib_md_alloc()) PORTALS: out of memory at ../../portals/include/portals/lib-p30.h:194 (tried to alloc 'md' from slab 'ptl_md_slab')
LustreError: 27041:(../../portals/include/portals/lib-p30.h:194:lib_md_alloc()) PORTALS: 3616216 total bytes allocated by portals
LustreError: 27041:(niobuf.c:97:ptl_send_buf()) PtlMDBind failed: 2
LustreError: 27041:(../../portals/include/portals/lib-p30.h:194:lib_md_alloc()) PORTALS: out of memory at ../../portals/include/portals/lib-p30.h:194 (tried to alloc 'md' from slab 'ptl_md_slab')
LustreError: 27041:(../../portals/include/portals/lib-p30.h:194:lib_md_alloc()) PORTALS: 3612924 total bytes allocated by portals
LustreError: 27041:(niobuf.c:656:ptl_send_rpc()) PtlMDAttach failed: 2
LustreError: 27041:(niobuf.c:658:ptl_send_rpc()) LBUG
LustreError: dumping log to /tmp/lustre-log-charlie10.1074742446 ... writing ...
LustreError: 27041:(debug.c:894:portals_run_upcall()) Error -2 invoking portals upcall /usr/lib/lustre/portals_upcall LBUG,niobuf.c,ptl_send_rpc,658; check /proc/sys/portals/upcall
LustreError: wrote 5242880 bytes
LustreError: 27041:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout req@f4076800 x151832/t0 o3->md0_UUID@NID_160.0.40.5_UUID:6 lens 288/240 ref 1 fl RPC:N/0/0 rc 0
LustreError: 27041:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT md0_UUID OSC_charlie10_md0_MNT_charlie10 NID_160.0.40.5_UUID: -2; check /proc/sys/lustre/lustre_upcall
LustreError: 27041:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout req@cfc7da00 x151833/t0 o3->md3_UUID@NID_160.0.40.5_UUID:6 lens 288/240 ref 1 fl RPC:N/0/0 rc 0
LustreError: 27041:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT md3_UUID OSC_charlie10_md3_MNT_charlie10 NID_160.0.40.5_UUID: -2; check /proc/sys/lustre/lustre_upcall
LustreError: 13022:(import.c:128:ptlrpc_connect_import()) reconnected to echo5-mds_UUID@NID_160.0.40.5_UUID after partition
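For anyone wanting to compare the two df outputs above numerically, here is a small sketch. It assumes the usual whitespace-separated df data line (filesystem, 1K-blocks, used, available, ...); the helper names parse_df_line and compare_df are made up for illustration. It only computes ratios and says nothing about why the affected client reports a smaller filesystem.

```python
def parse_df_line(line):
    """Parse one df data line into a dict of integer 1K-block counts."""
    fields = line.split()
    return {
        "filesystem": fields[0],
        "blocks": int(fields[1]),
        "used": int(fields[2]),
        "available": int(fields[3]),
    }


def compare_df(reference_line, suspect_line):
    """Return suspect/reference ratios for total, used and available blocks."""
    ref = parse_df_line(reference_line)
    sus = parse_df_line(suspect_line)
    return {key: sus[key] / ref[key] for key in ("blocks", "used", "available")}


if __name__ == "__main__":
    # Data lines copied from the two df outputs in this message.
    good = "data22 8063461792 1463253368 6190608456 20% /data22"
    bad = "data22 4031730896 731684656 3095246256 20% /data22"
    for key, ratio in compare_df(good, bad).items():
        print(f"{key}: {ratio:.5f}")
```

With the figures from this message, the total size on the affected client comes out at exactly half of the reference, and used and available are within a small fraction of half.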