Hello,

I see some oddities on Lustre clients running under Xen DomU. I get messages like this:

Jul 29 14:35:23 quark8-1 kernel: Lustre: Request x2674628 sent from stable-OST0001-osc-ffff8801a72a5000 to NID 147.251.9.9@tcp 100s ago has timed out (limit 100s).
Jul 29 14:35:23 quark8-1 kernel: Lustre: stable-OST0001-osc-ffff8801a72a5000: Connection to service stable-OST0001 via nid 147.251.9.9@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 29 14:35:23 quark8-1 kernel: LustreError: 128:0:(ldlm_request.c:1033:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
Jul 29 14:35:23 quark8-1 kernel: LustreError: 128:0:(ldlm_request.c:1622:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Jul 29 14:35:23 quark8-1 kernel: Lustre: stable-OST0001-osc-ffff8801a72a5000: Connection restored to service stable-OST0001 using nid 147.251.9.9@tcp.

The network is OK all the time. I tried both 1.6.x and 1.8.x Lustre; all the same.

Moreover, from time to time, the Lustre fs gets stuck in:

 [<ffffffff882c5ef3>] :mdc:mdc_close+0x1e3/0x7a0
 [<ffffffff88332f53>] :lustre:ll_close_inode_openhandle+0x1e3/0x650
 [<ffffffff88333a05>] :lustre:ll_mdc_real_close+0x115/0x370
 [<ffffffff883691e1>] :lustre:ll_mdc_blocking_ast+0x1d1/0x570
 [<ffffffff88186720>] :ptlrpc:ldlm_cancel_callback+0x50/0xd0
 [<ffffffff881a0721>] :ptlrpc:ldlm_cli_cancel_local+0x61/0x350
 [<ffffffff881a2025>] :ptlrpc:ldlm_cancel_lru_local+0x165/0x340
 [<ffffffff881a14c7>] :ptlrpc:ldlm_cli_cancel_list+0xf7/0x380
 [<ffffffff881a2263>] :ptlrpc:ldlm_cancel_lru+0x63/0x1b0
 [<ffffffff881b62d7>] :ptlrpc:ldlm_cli_pool_shrink+0xf7/0x240
 [<ffffffff881b365d>] :ptlrpc:ldlm_pool_shrink+0x2d/0xe0
 [<ffffffff881b48fb>] :ptlrpc:ldlm_pools_shrink+0x25b/0x330
 [<ffffffff8025c705>] shrink_slab+0xe2/0x15a

when the DomU is being suspended (most memory and CPU is stolen by another DomU).

Is this something known or unsupported? (I.e., running Lustre under Xen with domain preemption.)

--
Lukáš Hejtmánek
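The stack above shows the hang inside the ldlm pool shrinker (ldlm_pools_shrink called from shrink_slab), so one workaround to try is capping the client-side lock LRU so the shrinker has less lock-cancellation work to do under memory pressure. A hedged sketch; the namespace glob and the value 100 are examples for illustration, not tuned recommendations:

```shell
# On the Lustre client: cap the number of cached DLM locks per OSC namespace
# so memory pressure triggers fewer cancel RPCs (value is illustrative only).
lctl set_param ldlm.namespaces.*osc*.lru_size=100

# Inspect the current RPC timeout -- the "limit 100s" seen in the log above.
lctl get_param timeout
```

Setting lru_size to a fixed value disables the dynamic LRU sizing, trading cache hit rate for predictability when memory is reclaimed.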
On 2010-11-30, at 12:14, Lukas Hejtmanek wrote:
> I see some oddities on Lustre clients running under Xen DomU ...
> when the DomU is being suspended (most memory and CPU is stolen by another DomU).
>
> Is this something known or unsupported? (I.e., running Lustre under Xen with domain preemption)

Lustre servers expect to always be able to communicate with the clients, and expect the clients to be responsive to their requests; otherwise the clients are evicted by the server to keep the filesystem usable. This makes sense in an HPC environment, where the filesystem is a shared resource, clients fail regularly (due to the huge number of clients), and it is more important for the rest of the clients to continue.

There is a mode in which the servers will NOT expect the clients to be responsive, but then it is the client's responsibility to act accordingly and check for pending server requests before it uses any saved state. That mode is used by the "liblustre" code, but is not implemented for the normal Linux client.

Combining these two modes, for use in VM systems where the Linux client may be unresponsive for long periods of time, might make sense, though it would also add a lot of complexity. I don't think anyone is planning to work on this in the near future.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On Tue, Nov 30, 2010 at 02:27:18PM -0700, Andreas Dilger wrote:
> Combining these two modes, for use in VM systems where the Linux client may
> be unresponsive for long periods of time might make sense, though it can
> also add a lot of complexity. I don't think anyone is planning to work on
> this in the near future.

Thanks. However, the client is not completely unresponsive; it still has CPU and about 500MB RAM, so it should be OK from the Lustre point of view. Also, the problem arises as soon as I begin to shrink memory, e.g., from 8GB to 0.5GB. Sometimes the Lustre client hangs. Is there any conceptual problem with Lustre and the memory shrinker in Xen?

--
Lukáš Hejtmánek
Hello!

On Dec 1, 2010, at 3:12 AM, Lukas Hejtmanek wrote:
> On Tue, Nov 30, 2010 at 02:27:18PM -0700, Andreas Dilger wrote:
>> Combining these two modes, for use in VM systems where the Linux client may
>> be unresponsive for long periods of time might make sense, though it can
>> also add a lot of complexity. I don't think anyone is planning to work on
>> this in the near future.
> Thanks, however, the client is not completely unresponsive, it still has CPU
> and about 500MB RAM so it should be OK from Lustre point of view.
> Also, the problem arises as soon as I begin to shrink memory, e.g., from 8GB
> to 0.5GB. Sometimes, Lustre client hangs. Is there any conceptual problem with
> Lustre and memory shrinker in Xen?

It all depends on how Xen does the shrinking. If it blocks kernel code from execution for long periods of time in the process, it's essentially the same as if the node is suspended for some time. Even if it were to block only certain threads that are trying to access the to-be-shrunk memory, and those happen to be certain Lustre threads, that would still spell trouble if the blocking persists for significant amounts of time.

I personally use Xen without memory shrinking for my testing and it seems to be working just fine.

Bye,
Oleg
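Following Oleg's suggestion of running without memory shrinking, one way to sketch this is to pin the DomU's allocation so ballooning never takes memory away while Lustre is mounted. A hedged example; the domain name "quark8-1" is taken from the log earlier in the thread, the 8192MB figure is illustrative, and the syntax assumes the classic xm toolstack of that era:

```shell
# Set the DomU's memory target equal to its current allocation so the
# balloon driver has nothing to reclaim (domain name and size are examples).
xm mem-set quark8-1 8192
```

Equivalently, keeping `memory` and `maxmem` equal in the domain's config file prevents the balloon from ever being inflated in that guest.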
On Thu, Dec 02, 2010 at 10:18:51AM -0500, Oleg Drokin wrote:
> It all depends on how Xen does the shrinking. If it blocks kernel code from execution for long periods of time in process,
> it's the same as if the node is suspended for some time essentially.
> Even if it would block only certain threads that are trying to access to-be-shrunk memory and it happens to be certain lustre threads,
> that would still spell trouble if the blocking persists for significant amounts of time.
>
> I personally use XEN without memory shrinking for my testing and it seems to be working just fine.

It could stop all the processes. And it pages the most-used memory out into swap. But maybe there are some problems resulting from unlocked pages in Lustre threads?

--
Lukáš Hejtmánek
Hello!

On Dec 2, 2010, at 10:50 AM, Lukas Hejtmanek wrote:
> On Thu, Dec 02, 2010 at 10:18:51AM -0500, Oleg Drokin wrote:
>> It all depends on how Xen does the shrinking. If it blocks kernel code from execution for long periods of time in process,
>> it's the same as if the node is suspended for some time essentially.
>> Even if it would block only certain threads that are trying to access to-be-shrunk memory and it happens to be certain lustre threads,
>> that would still spell trouble if the blocking persists for significant amounts of time.
> It could stop all the processes. And it pages out most used memory into swap.
> But maybe there are some problems resulting from unlocked pages in lustre
> threads?

Well, I assume Xen plays nice with others and does not just unlock pages it did not lock, because that would upset everybody, not just Lustre. In the case of Lustre, I am sure you'd see tons of tripped assertions. Lustre does not have any important processes running in userspace, but there are some important kernel threads that should not be deprived of CPU for too long.

Bye,
Oleg
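Since the kernel threads Oleg mentions must not be starved of CPU when another DomU is busy, one hedged mitigation sketch is to raise the Lustre client DomU's weight in Xen's credit scheduler so it keeps winning CPU time under contention. The domain name and weight value below are illustrative only, and the syntax again assumes the xm toolstack:

```shell
# Give the Lustre-client DomU a higher credit-scheduler weight (default 256),
# so its kernel service threads continue to get CPU under Dom0/DomU contention.
xm sched-credit -d quark8-1 -w 512
```

This does not address memory ballooning, only CPU preemption; the two effects are independent and both can stall a Lustre client long enough to trigger server-side eviction.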