We have been running the config below on three different Lustre file
systems since early January and, for the most part, things have been
pretty stable. We are now experiencing frequent hangs on some
clients - particularly our interactive login nodes. All processes
get blocked behind Lustre I/O requests. When this happens there are
*no* messages in either dmesg or syslog on the clients. They seem
unaware of a problem. However, on the MDS we see the following...

1. A ton of lustre-log.M.N files get dumped into /tmp in a short
   period of time. Most of them appear to be full of garbage and
   unprintable characters rather than thread stack traces. Many of
   them are also zero length.

2. Lots of dmesg output similar to that appended (see below).

3. Pretty much the same in syslog.

4. The frequency/period of these events seems to be consistent with
   the timeouts associated with the following messages...

   Lustre: 4534:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 4534 disabled after 499.9514s

We have been adjusting lru_size on the clients but so far it has made
no difference. We have "options mds mds_num_threads=512" and our
system timeout is 1000 (sure, go ahead and flame me, but if we don't
do that we get tons of "endpoint transport failures" on the clients
and no, there are no connectivity issues). :)

We are open to suggestion and wondering if we should update the MDSs
to 1.6.5. Can we do that safely without also upgrading the clients
and OSTs?

Our config is as below.

Thanks,

Charlie Taylor
UF HPC Center

=======================================================================
Config:

Lustre 1.6.4.2
CentOS 5.0
Kernel 2.6.18-8.1.14
OFED 1.2
320 o2ib clients
80 tcp clients

dmesg output:
=======================================================================
LustreError: dumping log to /tmp/lustre-log.1213385571.4240
ll_mdt_20 S ffff81023d2c9700 0 4070 1 4071 4069 (L-TLB)
 ffff81023d2c9700 ffff81023d2c9630 ffff81023d2c9630 000000000000000a
 ffff81023fa09100 ffff8101438ca7a0 0002bb03681325fb 0000000000008f21
 ffff81023fa092e8 0000000000000001 ffffffff8866eb51 ffffffffffffffff
Call Trace:
 [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
 [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
 [<ffffffff80092c5e>] process_timeout+0x0/0x5
 [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
 [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
 [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
 [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
 [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
 [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
 [<ffffffff885b11c4>] :ksocklnd:ksocknal_alloc_tx+0x1c4/0x270
 [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
 [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
 [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
 [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
 [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
 [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
 [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
 [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
 [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
 [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff800608e8>] thread_return+0x0/0xea
 [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
 [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
 [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8005be25>] child_rip+0xa/0x11
 [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005be1b>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1213385600.4070
Lustre: 4338:0:(ldlm_lib.c:519:target_handle_reconnect()) hpcdata-MDT0000: a9f365a3-8746-6ed5-e45e-cd292891ece2 reconnecting
Lustre: 4338:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 6 previous similar messages
Lustre: 4338:0:(ldlm_lib.c:747:target_handle_connect()) hpcdata-MDT0000: refuse reconnection from a9f365a3-8746-6ed5-e45e-cd292891ece2@10.13.25.198@o2ib to 0xffff8101cdbd6000; still busy with 2 active RPCs
Lustre: 4338:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 6 previous similar messages
ll_mdt_501 S ffff81023ca51700 0 4551 1 4552 4550 (L-TLB)
 ffff81023ca51700 ffff81023ca51630 ffff81023ca51630 000000000000000a
 ffff81023c9ce860 ffff81014395c820 0002bb22256f1724 0000000000009eb2
 ffff81023c9cea48 0000000000000002 ffffffff8866eb51 ffffffffffffffff
Call Trace:
 [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
 [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
 [<ffffffff80092c5e>] process_timeout+0x0/0x5
 [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
 [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
 [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
 [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
 [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
 [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
 [<ffffffff8858fd7c>] :ko2iblnd:kiblnd_send+0x88c/0x8d0
 [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
 [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
 [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
 [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
 [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
 [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
 [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
 [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
 [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
 [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff800608e8>] thread_return+0x0/0xea
 [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
 [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
 [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8005be25>] child_rip+0xa/0x11
 [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005be1b>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1213385732.4551
ll_mdt_13 S ffff81023d2a7700 0 4063 1 4064 4062 (L-TLB)
 ffff81023d2a7700 ffff81023d2a7630 ffff81023d2a7630 000000000000000a
 ffff81023e0577a0 ffff8101438ca7a0 0002bb23b76dc267 00000000000104cb
 ffff81023e057988 0000000000000001 ffffffff8866eb51 ffffffffffffffff
Call Trace:
 [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
 [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
 [<ffffffff80092c5e>] process_timeout+0x0/0x5
 [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
 [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
 [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
 [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
 [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
 [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
 [<ffffffff8858fd7c>] :ko2iblnd:kiblnd_send+0x88c/0x8d0
 [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
 [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
 [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
 [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
 [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
 [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
 [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
 [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
 [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
 [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff800608e8>] thread_return+0x0/0xea
 [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
 [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
 [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8005be25>] child_rip+0xa/0x11
 [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005be1b>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1213385739.4063
ll_mdt_115 S ffff81023d1a9700 0 4165 1 4166 4164 (L-TLB)
 ffff81023d1a9700 ffff81023d1a9630 ffff81023d1a9630 000000000000000a
 ffff81023ee5b860 ffff8101438ca7a0 0002bb28f86558dc 000000000001089d
 ffff81023ee5ba48 0000000000000001 ffffffff8866eb51 ffffffffffffffff
Call Trace:
 [<ffffffff8866eb51>] :ptlrpc:ldlm_run_cp_ast_work+0x161/0x1f0
 [<ffffffff88686da0>] :ptlrpc:ldlm_expired_completion_wait+0x0/0x250
 [<ffffffff800611f7>] schedule_timeout+0x8a/0xad
 [<ffffffff80092c5e>] process_timeout+0x0/0x5
 [<ffffffff886881fd>] :ptlrpc:ldlm_completion_ast+0x35d/0x6a0
 [<ffffffff886705e9>] :ptlrpc:ldlm_lock_enqueue+0x559/0x5c0
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8866ce6a>] :ptlrpc:ldlm_lock_addref_internal_nolock+0x3a/0x90
 [<ffffffff88684bb0>] :ptlrpc:ldlm_blocking_ast+0x0/0x2d0
 [<ffffffff88685e24>] :ptlrpc:ldlm_cli_enqueue_local+0x454/0x510
 [<ffffffff888d5e87>] :mds:mds_fid2locked_dentry+0x1d7/0x2a0
 [<ffffffff88687ea0>] :ptlrpc:ldlm_completion_ast+0x0/0x6a0
 [<ffffffff888d6647>] :mds:mds_getattr_lock+0x6f7/0xc70
 [<ffffffff885b11c4>] :ksocklnd:ksocknal_alloc_tx+0x1c4/0x270
 [<ffffffff888d7191>] :mds:mds_intent_policy+0x5d1/0xbe0
 [<ffffffff8854dca7>] :lnet:lnet_prep_send+0x67/0xb0
 [<ffffffff88673786>] :ptlrpc:ldlm_resource_putref+0x1b6/0x3b0
 [<ffffffff88670183>] :ptlrpc:ldlm_lock_enqueue+0xf3/0x5c0
 [<ffffffff8866dbbd>] :ptlrpc:ldlm_lock_create+0x98d/0x9c0
 [<ffffffff88690660>] :ptlrpc:ldlm_server_completion_ast+0x0/0x570
 [<ffffffff8868cda0>] :ptlrpc:ldlm_handle_enqueue+0xd90/0x1410
 [<ffffffff88690bd0>] :ptlrpc:ldlm_server_blocking_ast+0x0/0x690
 [<ffffffff888e0cad>] :mds:mds_handle+0x46dd/0x58ff
 [<ffffffff8860bcb2>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff886a7280>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff886a4e35>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff886aca8b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff800608e8>] thread_return+0x0/0xea
 [<ffffffff8006b165>] do_gettimeofday+0x50/0x92
 [<ffffffff8851b056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003cee3>] lock_timer_base+0x1b/0x3c
 [<ffffffff886af4cc>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80086a74>] default_wake_function+0x0/0xe
 [<ffffffff8005be25>] child_rip+0xa/0x11
 [<ffffffff886aecf0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005be1b>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1213385762.4165
Lustre: 4135:0:(ldlm_lib.c:519:target_handle_reconnect()) hpcdata-MDT0000: a9f365a3-8746-6ed5-e45e-cd292891ece2 reconnecting
Lustre: 4135:0:(ldlm_lib.c:519:target_handle_reconnect()) Skipped 11 previous similar messages
Lustre: 4135:0:(ldlm_lib.c:747:target_handle_connect()) hpcdata-MDT0000: refuse reconnection from a9f365a3-8746-6ed5-e45e-cd292891ece2@10.13.25.198@o2ib to 0xffff8101cdbd6000; still busy with 2 active RPCs
Lustre: 4135:0:(ldlm_lib.c:747:target_handle_connect()) Skipped 11 previous similar messages
LustreError: 30076:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810143b27450 x4795081/t0 o101-><?>@<?>:-1 lens 232/0 ref 0 fl Interpret:/0/0 rc -107/0
LustreError: 30076:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 4938 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:210:waiting_locks_callback()) ### lock callback timer expired: evicting client ac5fac72-d4bf-8ff6-9e28-664d7c46678b@NET_0x500000a0d1972_UUID nid 10.13.25.114@o2ib ns: mds-hpcdata-MDT0000_UUID lock: ffff8101207a9d00/0x3d8831a1452e3fc0 lrc: 1/0,0 mode: CR/CR res: 3673528/1497616058 bits 0x3 rrc: 44 type: IBT flags: 4000020 remote: 0xa943eb2122dba12d expref: 11 pid 4407
Lustre: 4534:0:(watchdog.c:312:lcw_update_time()) Expired watchdog for pid 4534 disabled after 499.9514s
Lustre: 4534:0:(watchdog.c:312:lcw_update_time()) Skipped 40 previous similar messages
LustreError: 4039:0:(mgs_handler.c:515:mgs_handle()) lustre_mgs: operation 101 on unconnected MGS
LustreError: 4039:0:(mgs_handler.c:515:mgs_handle()) Skipped 4927 previous similar messages
LustreError: 4557:0:(handler.c:1499:mds_handle()) operation 101 on unconnected MDS from 12345-10.13.25.114@o2ib
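[For reference, a minimal sketch of how the tunables mentioned in the post above are typically set on a Lustre 1.6.x system. The /proc paths assume the standard 1.6 layout, and the lru_size value is only a placeholder since the post does not say what value was tried:

    # MDS module option (as quoted above), e.g. in /etc/modprobe.conf on the MDS
    options mds mds_num_threads=512

    # obd timeout of 1000s, set at runtime on servers and clients
    echo 1000 > /proc/sys/lustre/timeout

    # adjust the DLM lock LRU size on a client; 1200 is just an example value
    for ns in /proc/fs/lustre/ldlm/namespaces/*; do
        echo 1200 > $ns/lru_size
    done
]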
On Jun 13, 2008 16:03 -0400, Charles Taylor wrote:
> We have been running the config below on three different Lustre file
> systems since early January and, for the most part, things have been
> pretty stable. We are now experiencing frequent hangs on some
> clients - particularly our interactive login nodes. All processes
> get blocked behind Lustre I/O requests. When this happens there are
> *no* messages in either dmesg or syslog on the clients. They seem
> unaware of a problem.

This is likely due to "client statahead" problems. Please disable this
with "echo 0 > /proc/fs/lustre/llite/*/statahead_max" on the clients.
This should also be fixed in 1.6.5.

> 1. A ton of lustre-log.M.N files get dumped into /tmp in a short
> period of time. Most of them appear to be full of garbage and
> unprintable characters rather than thread stack traces. Many of them
> are also zero length.

The lustre-log files are not stack traces. They are dumped Lustre debug
logs.

> We have been adjusting lru_size on the clients but so far it has made
> no difference. We have "options mds mds_num_threads=512" and our
> system timeout is 1000 (sure, go ahead and flame me, but if we don't do
> that we get tons of "endpoint transport failures" on the clients and
> no, there are no connectivity issues). :)
>
> We are open to suggestion and wondering if we should update the MDSs
> to 1.6.5. Can we do that safely without also upgrading the clients
> and OSTs?

In general the MDS and OSS nodes should run the same level of software,
as that is what we test, but there isn't a hard requirement for it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
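[A minimal sketch of rolling the suggested statahead change out to every client, assuming the standard 1.6 /proc path and pdsh for fan-out (the "clients" host group is an assumption; any remote-shell tool works). Since this is a runtime setting, it would need to be reapplied after a client remounts or reboots:

    # on one client, for every mounted Lustre file system
    for f in /proc/fs/lustre/llite/*/statahead_max; do
        echo 0 > $f
    done

    # fan the same change out to all clients
    pdsh -g clients 'for f in /proc/fs/lustre/llite/*/statahead_max; do echo 0 > $f; done'
]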
On Jun 13, 2008, at 5:46 PM, Andreas Dilger wrote:
> On Jun 13, 2008 16:03 -0400, Charles Taylor wrote:
>> We have been running the config below on three different Lustre file
>> systems since early January and, for the most part, things have been
>> pretty stable. We are now experiencing frequent hangs on some
>> clients - particularly our interactive login nodes. All processes
>> get blocked behind Lustre I/O requests. When this happens there are
>> *no* messages in either dmesg or syslog on the clients. They seem
>> unaware of a problem.
>
> This is likely due to "client statahead" problems. Please disable this
> with "echo 0 > /proc/fs/lustre/llite/*/statahead_max" on the clients.
> This should also be fixed in 1.6.5.

This seems to have done the trick. Odd, though, that we've been running
this way for several months and it didn't seem to be an issue until now.
We saw the discussions of this go by at one point and we should have just
taken care of it then, whether we were seeing it or not. Thanks for
reminding us of it.

>> 1. A ton of lustre-log.M.N files get dumped into /tmp in a short
>> period of time. Most of them appear to be full of garbage and
>> unprintable characters rather than thread stack traces. Many of them
>> are also zero length.
>
> The lustre-log files are not stack traces. They are dumped Lustre debug
> logs.

Got it.

>> We have been adjusting lru_size on the clients but so far it has made
>> no difference. We have "options mds mds_num_threads=512" and our
>> system timeout is 1000 (sure, go ahead and flame me, but if we don't do
>> that we get tons of "endpoint transport failures" on the clients and
>> no, there are no connectivity issues). :)
>>
>> We are open to suggestion and wondering if we should update the MDSs
>> to 1.6.5. Can we do that safely without also upgrading the clients
>> and OSTs?
>
> In general the MDS and OSS nodes should run the same level of software,
> as that is what we test, but there isn't a hard requirement for it.

Would it be reasonable, then, to upgrade the MDSs and OSSs but leave the
clients at 1.6.4.2, or is that asking for trouble? I think this comes up
a lot and I'm pretty sure people have said they do it successfully. I'm
just wondering if it is a *design* goal that is architected in or just
something that happens to work most of the time.

Thanks again,

Charlie Taylor
UF HPC Center
On Jun 14, 2008 08:06 -0400, Charles Taylor wrote:
>>> 1. A ton of lustre-log.M.N files get dumped into /tmp in a short
>>> period of time. Most of them appear to be full of garbage and
>>> unprintable characters rather than thread stack traces. Many of them
>>> are also zero length.
>>
>> The lustre-log files are not stack traces. They are dumped Lustre debug
>> logs.
>
> Got it.

Just to mention - you can decode these files using the command:

    lctl df <logfile> <textfile>

>>> We are open to suggestion and wondering if we should update the MDSs
>>> to 1.6.5. Can we do that safely without also upgrading the clients
>>> and OSTs?
>>
>> In general the MDS and OSS nodes should run the same level of software,
>> as that is what we test, but there isn't a hard requirement for it.
>
> Would it be reasonable, then, to upgrade the MDSs and OSSs but leave the
> clients at 1.6.4.2, or is that asking for trouble? I think this comes up
> a lot and I'm pretty sure people have said they do it successfully. I'm
> just wondering if it is a *design* goal that is architected in or just
> something that happens to work most of the time.

The Lustre upgrade process is always planned to allow 1.X.Y to work for
all 'Y' values, and between 1.X.Y_latest and 1.X_next.Z. We test these
combinations for Lustre releases, for both interoperability and upgrade.
In the vast majority of cases 1.X_next will work with any 1.X release,
but we can't test all combinations so we don't make such claims.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
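[For example, one of the dumps from the MDS /tmp above could be decoded and skimmed roughly like this (the file name is taken from the dmesg output earlier in the thread; the grep pattern is only an illustration):

    lctl df /tmp/lustre-log.1213385600.4070 /tmp/lustre-log.1213385600.4070.txt
    grep -i ldlm /tmp/lustre-log.1213385600.4070.txt | less
]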
On Jun 16, 2008, at 1:01 AM, Andreas Dilger wrote:
>
> Just to mention - you can decode these files using the command:
>
>     lctl df <logfile> <textfile>

Awesome.

>> Would it be reasonable, then, to upgrade the MDSs and OSSs but leave the
>> clients at 1.6.4.2, or is that asking for trouble? I think this comes up
>> a lot and I'm pretty sure people have said they do it successfully. I'm
>> just wondering if it is a *design* goal that is architected in or just
>> something that happens to work most of the time.
>
> The Lustre upgrade process is always planned to allow 1.X.Y to work for
> all 'Y' values, and between 1.X.Y_latest and 1.X_next.Z. We test these
> combinations for Lustre releases, for both interoperability and upgrade.
> In the vast majority of cases 1.X_next will work with any 1.X release,
> but we can't test all combinations so we don't make such claims.

Crazy awesome!

Charlie Taylor
UF HPC Center