Greetings. I'm starting to investigate Lustre to see if it would work in my situation or if it's a solution in search of a problem. I currently have 136 Sun V20zs running RHEL 5.2 and 26TB split between two other Sun Opteron systems. I'm adding 4 Sun x4500s at 48TB each and 129 8-way Xeon systems to the cluster.

Doing some testing with one of the x4500s as the OSS with 6 OSTs and one of the v20zs acting as MGS/MDT and the test client, I keep getting stack dumps. What I am attempting to do, as a load test, is to rsync the entire set of NFS-exported folders from the existing 26TB file servers to the Lustre file system. After copying about 500GB, I get a stack dump on the v20z and the rsync hangs (it can be killed). I know that with RHEL 4.2 on the v20z, if I ran the SMP kernel and used it as an NFS server, it would do a full kernel panic under load. That was reproducible across the entire cluster, but I have not had that problem yet with RHEL 5.2.

Thank you for your time and any words of advice.

Bob Healey

Excerpt from dmesg:

Jul 17 16:32:42 compute-4-10 kernel: LustreError: 3834:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL) failed
Jul 17 16:32:42 compute-4-10 kernel: LustreError: 3834:0:(tracefile.c:432:libcfs_assertion_failed()) LBUG
Jul 17 16:32:42 compute-4-10 kernel: Lustre: 3834:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 3834
Jul 17 16:32:42 compute-4-10 kernel: ldlm_cn_06    R  running task     0  3834     1          3835  3833 (L-TLB)
Jul 17 16:32:42 compute-4-10 kernel: ffff810034917e50 0000000000000046 ffff810034da55c8 ffffffff8006b6c9
Jul 17 16:32:42 compute-4-10 kernel: ffff810040ee99c0 ffffffff88626771 ffff810034da5400 ffff810034da54e0
Jul 17 16:32:42 compute-4-10 kernel: ffff81003f5c1d40 ffffffff88624456 ffff810034da5588 0000000000000000
Jul 17 16:32:42 compute-4-10 kernel: Call Trace:
Jul 17 16:32:42 compute-4-10 kernel: [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92
Jul 17 16:32:42 compute-4-10 kernel: [<ffffffff88624456>] :libcfs:lcw_update_time+0x16/0x100
Jul 17 16:32:42 compute-4-10 kernel: [<ffffffff800868b0>] __wake_up_common+0x3e/0x68
Jul 17 16:32:42 compute-4-10 kernel: [<ffffffff887770bc>] :ptlrpc:ptlrpc_main+0xdcc/0xf50
Jul 17 16:32:42 compute-4-10 kernel: [<ffffffff80088432>] default_wake_function+0x0/0xe
Jul 17 16:32:42 compute-4-10 kernel: [<ffffffff8005bfb1>] child_rip+0xa/0x11
Jul 17 16:32:43 compute-4-10 kernel: [<ffffffff887762f0>] :ptlrpc:ptlrpc_main+0x0/0xf50
Jul 17 16:32:43 compute-4-10 kernel: [<ffffffff8005bfa7>] child_rip+0x0/0x11
Jul 17 16:32:43 compute-4-10 kernel:
Jul 17 16:32:43 compute-4-10 kernel: LustreError: dumping log to /tmp/lustre-log.1216326763.3834

--
Bob Healey
Systems Administrator
Physics Department, RPI
healer at rpi.edu
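For anyone wanting to reproduce a similar test, here is a rough sketch of that kind of setup and load test using Lustre 1.6-style commands; the device names, NIDs, and paths below are placeholders, not Bob's actual configuration:

    # On the v20z acting as combined MGS/MDT (placeholder device /dev/sdb):
    mkfs.lustre --fsname=testfs --mgs --mdt /dev/sdb
    mkdir -p /mnt/mdt && mount -t lustre /dev/sdb /mnt/mdt

    # On the x4500 acting as OSS, repeated for each of the 6 OSTs
    # (placeholder devices /dev/sdc ... /dev/sdh, placeholder MGS NID 10.0.0.1@tcp0):
    mkfs.lustre --fsname=testfs --ost --mgsnode=10.0.0.1@tcp0 /dev/sdc
    mkdir -p /mnt/ost0 && mount -t lustre /dev/sdc /mnt/ost0

    # On the test client: mount the file system, then run the rsync load test
    # against the existing NFS-exported tree (placeholder source path):
    mkdir -p /mnt/lustre && mount -t lustre 10.0.0.1@tcp0:/testfs /mnt/lustre
    rsync -aH /nfs/export/ /mnt/lustre/export/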
On Fri, 2008-07-18 at 16:47 -0400, Robert Healey wrote:
>
> Excerpt from dmesg:
> Jul 17 16:32:42 compute-4-10 kernel: LustreError:
> 3834:0:(ldlm_lock.c:430:__ldlm_handle2lock()) ASSERTION(lock->l_resource != NULL)
> failed

This looks like bug 15269.

b.
Is anyone running, or does anyone know of someone running, Lustre on an Altix 4700 (or other large Itanium SMP system)? I was wondering if there are any quirks to getting very large aggregate performance to a single node (1024+ cores).

Thanks,
Craig

--
Craig Tierney (craig.tierney at noaa.gov)
On Jul 24, 2008 11:52 -0600, Craig Tierney wrote:
> Is anyone running, or does anyone know of someone
> running Lustre on an Altix 4700 (or other large
> Itanium SMP system)? I was wondering if there
> are any quirks to getting very large aggregate
> performance to a single node (1024+ cores).

I believe there were some patches added to CVS (not sure if they are in 1.6.5 or not) that addressed allocation problems with per-CPU data structures that were hit on a 128-node system.

There are also patches in bug 11817 that address issues on many-core SMP clients, but there is likely still work to be done in this area.

What kind of network do you have on such a system? Do all of the cores have equal access to the external network? If not, it would be good to e.g. bind the ptlrpcd thread to one of the IO nodes for better performance.

There hasn't been any effort yet to e.g. have multiple ptlrpcd threads (one per IO node) to handle RPC requests from a thousand other cores. If that became a bottleneck, I suspect it wouldn't be too hard to bind multiple ptlrpcd threads to multiple IO nodes, each having a ptlrpcd_pc list, and ptlrpc_add_set() could get some kind of smarts about locality for deciding which ptlrpcd_pc to add the outgoing request to.

There have been tests in the past that got 2GB/s+ from clients with good networks and 32 IA64 CPUs, but depending on what kind of throughput you are looking for there may still be a bunch of work to be done.

We'd be very interested to get feedback about any issues you hit on such a large system, because we don't get much chance to test on a single system with so many cores.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
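As a rough illustration of the thread-binding idea above, something along these lines could pin ptlrpcd from userspace. This assumes the thread shows up under the name "ptlrpcd" in the process list, and the CPU list 0-3 is a placeholder for whichever CPUs are actually local to the IO node:

    # Pin the ptlrpcd kernel thread to the CPUs local to the IO node.
    # The CPU list below is a placeholder -- check the machine topology
    # first (e.g. /sys/devices/system/node) to find the right CPUs.
    PID=$(pgrep ptlrpcd | head -n 1)
    taskset -pc 0-3 "$PID"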
Andreas Dilger wrote:
> On Jul 24, 2008 11:52 -0600, Craig Tierney wrote:
>> Is anyone running, or does anyone know of someone
>> running Lustre on an Altix 4700 (or other large
>> Itanium SMP system)?
> [...]
> We'd be very interested to get feedback about any issues you hit on
> such a large system, because we don't get much chance to test on a
> single system with so many cores.

I don't have any Altix SMP systems, but I know some others that do. They are having some issues where a very large scratch space could be very helpful. We just brought in a DDN 9900 with Lustre 1.6.5.1, and I have been extraordinarily happy with its performance so far. So I was asking to understand whether the setup we have could be helpful in other settings with the Altix systems.

Thanks,
Craig

--
Craig Tierney (craig.tierney at noaa.gov)