Hi all,

What can cause a client to receive a "o2iblnd no resources" message
from an OSS?

---------------------------------------------------------------------------
Feb  1 15:24:24 node-5-8 kernel: LustreError: 1893:0:(o2iblnd_cb.c:2448:kiblnd_rejected()) 10.10.60.3@o2ib rejected: o2iblnd no resources
---------------------------------------------------------------------------

I suspect an out-of-memory problem, and indeed the OSS logs are filled
with the following:

---------------------------------------------------------------------------
ib_cm/3: page allocation failure. order:4, mode:0xd0

Call Trace:<ffffffff8015c847>{__alloc_pages+777} <ffffffff801727e9>{alloc_page_interleave+61}
       <ffffffff8015c8e0>{__get_free_pages+11} <ffffffff8015facd>{kmem_getpages+36}
       <ffffffff80160262>{cache_alloc_refill+609} <ffffffff8015ff30>{__kmalloc+123}
       <ffffffffa014ee75>{:ib_mthca:mthca_alloc_qp_common+668}
       <ffffffffa014f42d>{:ib_mthca:mthca_alloc_qp+178} <ffffffffa0153e3a>{:ib_mthca:mthca_create_qp+311}
       <ffffffffa00d5b1b>{:ib_core:ib_create_qp+20} <ffffffffa021a5f9>{:rdma_cm:rdma_create_qp+43}
       <ffffffff8024b7b5>{dma_pool_free+245} <ffffffffa014b257>{:ib_mthca:mthca_init_cq+1073}
       <ffffffffa01540cf>{:ib_mthca:mthca_create_cq+282} <ffffffff801727e9>{alloc_page_interleave+61}
       <ffffffffa0400c10>{:ko2iblnd:kiblnd_cq_completion+0}
       <ffffffffa0400d50>{:ko2iblnd:kiblnd_cq_event+0} <ffffffffa00d5cc1>{:ib_core:ib_create_cq+33}
       <ffffffffa03f56bd>{:ko2iblnd:kiblnd_create_conn+3565}
       <ffffffffa0276f38>{:libcfs:cfs_alloc+40} <ffffffffa03fe457>{:ko2iblnd:kiblnd_passive_connect+2215}
       <ffffffffa00d8595>{:ib_core:ib_find_cached_gid+244}
       <ffffffffa021a278>{:rdma_cm:cma_acquire_dev+293} <ffffffffa03ff540>{:ko2iblnd:kiblnd_cm_callback+64}
       <ffffffffa03ff500>{:ko2iblnd:kiblnd_cm_callback+0}
       <ffffffffa021b19a>{:rdma_cm:cma_req_handler+863} <ffffffff801e8427>{alloc_layer+67}
       <ffffffff801e8645>{idr_get_new_above_int+423} <ffffffffa00fa0ab>{:ib_cm:cm_process_work+101}
       <ffffffffa00faa57>{:ib_cm:cm_req_handler+2398} <ffffffffa00fae3c>{:ib_cm:cm_work_handler+0}
       <ffffffffa00fae6a>{:ib_cm:cm_work_handler+46} <ffffffff80146fca>{worker_thread+419}
       <ffffffff80133566>{default_wake_function+0} <ffffffff801335b7>{__wake_up_common+67}
       <ffffffff80133566>{default_wake_function+0} <ffffffff8014ad18>{keventd_create_kthread+0}
       <ffffffff80146e27>{worker_thread+0} <ffffffff8014ad18>{keventd_create_kthread+0}
       <ffffffff8014acef>{kthread+200} <ffffffff80110de3>{child_rip+8}
       <ffffffff8014ad18>{keventd_create_kthread+0} <ffffffff8014ac27>{kthread+0}
       <ffffffff80110ddb>{child_rip+0}
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty

Free pages:       35336kB (0kB HighMem)
Active:534156 inactive:127091 dirty:1072 writeback:0 unstable:0 free:8834 slab:146612 mapped:26222 pagetables:1035
Node 0 DMA free:9832kB min:52kB low:64kB high:76kB active:0kB inactive:0kB present:16384kB pages_scanned:37 all_unreclaimable? yes
protections[]: 0 510200 510200
Node 0 Normal free:25504kB min:16328kB low:20408kB high:24492kB active:2136624kB inactive:508364kB present:4964352kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 2*4096kB = 9832kB
Node 0 Normal: 1284*4kB 2290*8kB 126*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 25504kB
Node 0 HighMem: empty
Swap cache: add 111, delete 111, find 23/36, race 0+0
Free swap:       4096360kB
1245184 pages of RAM
235840 reserved pages
659867 pages shared
0 pages swap cached
---------------------------------------------------------------------------

IB links are up and working on both the client and the OSS:

---------------------------------------------------------------------------
client# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0005:ad00:0008:af71
        base lid:        0x83
        sm lid:          0x130
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

oss# ibstatus
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0005:ad00:0008:cb11
        base lid:        0x126
        sm lid:          0x130
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)
---------------------------------------------------------------------------

And the Subnet Manager doesn't expose any unusual errors or skyrocketing
counters (I use OFED 1.2, kernel 2.6.9-55.0.9.EL_lustre.1.6.4.1smp).

What I don't really get is that most clients can access files on this
OSS with no issue, and besides, my limited understanding of the kernel
memory mechanisms tends to lead me to believe that this OSS is not out
of memory:

---------------------------------------------------------------------------
# cat /proc/meminfo
MemTotal:      4037380 kB
MemFree:         31688 kB
Buffers:       1333536 kB
Cached:        1231900 kB
SwapCached:          0 kB
Active:        2138948 kB
Inactive:       507720 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:      4037380 kB
LowFree:         31688 kB
SwapTotal:     4096564 kB
SwapFree:      4096360 kB
Dirty:            6868 kB
Writeback:           0 kB
Mapped:         106984 kB
Slab:           588200 kB
CommitLimit:   6115252 kB
Committed_AS:   860508 kB
PageTables:       4304 kB
VmallocTotal: 536870911 kB
VmallocUsed:    274788 kB
VmallocChunk: 536596091 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
---------------------------------------------------------------------------

This only appeared recently, after several weeks of continuous use of
the filesystem without any problem. Is there something like a memory
leak somewhere? Any help diagnosing the problem would be greatly
appreciated.

Thanks!
--
Kilian
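A couple of checks bear on an order-4 failure like the one above: order:4 means the kernel needed 16 physically contiguous pages (64 kB) in low memory, and mode:0xd0 is an ordinary GFP_KERNEL allocation, so the question is fragmentation and slab usage rather than total free memory. The commands below are generic procfs/procps suggestions, not output captured from this system, and slabtop is only present where procps ships it:

---------------------------------------------------------------------------
# Free blocks per allocation order and zone; the failing request needs an
# entry in the order-4 (16-page) column of the Normal zone.
oss# cat /proc/buddyinfo

# The slab figure in the meminfo above (Slab: 588200 kB) is also worth a
# look; slabtop shows which caches account for it.
oss# slabtop -o | head -20
---------------------------------------------------------------------------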
Hi Kilian,

I think it's because o2iblnd uses fragmented RDMA by default (up to 256
fragments), so we have to set max_send_wr to (concurrent_sends * (256 + 1))
when creating the QP with rdma_create_qp(). That takes a lot of resources
and can sometimes push a busy server out of memory.

To resolve this problem, we have to use FMR to map the fragmented buffer
to a virtually contiguous I/O address; that way there is always only one
fragment per RDMA.

Here is a patch for this problem (using FMR in o2iblnd):
https://bugzilla.lustre.org/attachment.cgi?id=15144

Regards
Liang

Kilian CAVALOTTI wrote:
> Hi all,
>
> What can cause a client to receive a "o2iblnd no resources" message
> from an OSS?
> [...]
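To put rough numbers on the sizing Liang describes (illustrative only: concurrent_sends=8 below is an assumed example value, not one read from this system, and whether your ko2iblnd build exports it as a module parameter depends on the Lustre version):

---------------------------------------------------------------------------
# Each o2iblnd connection asks rdma_create_qp() for roughly
# concurrent_sends * (max_frags + 1) send work requests.
oss# modinfo ko2iblnd | grep -i concurrent

# e.g. with concurrent_sends=8 and up to 256 RDMA fragments per message:
oss# echo $((8 * (256 + 1)))
2056
---------------------------------------------------------------------------

Sizing every queue pair for thousands of work-queue entries is consistent with mthca attempting the large contiguous kmalloc() that fails inside mthca_alloc_qp_common() in the trace above.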
On Sat, Feb 02, 2008 at 03:39:09PM +0800, Liang Zhen wrote:
> Hi Kilian,
> I think it's because o2iblnd uses fragmented RDMA by default (up to 256
> fragments), so we have to set max_send_wr to (concurrent_sends * (256 + 1))
> when creating the QP with rdma_create_qp(). That takes a lot of resources
> and can sometimes push a busy server out of memory.
> To resolve this problem, we have to use FMR to map the fragmented buffer
> to a virtually contiguous I/O address; that way there is always only one
> fragment per RDMA.
> Here is a patch for this problem (using FMR in o2iblnd):
> https://bugzilla.lustre.org/attachment.cgi?id=15144

This is an experimental patch - nodes with the patch applied are not
interoperable with those without it. Please don't propagate the patch
to production systems.

Isaac
On Saturday 02 February 2008 00:42:47 Isaac Huang wrote:
> > Here is a patch for this problem (using FMR in o2iblnd):
> > https://bugzilla.lustre.org/attachment.cgi?id=15144
>
> This is an experimental patch - nodes with the patch applied are not
> interoperable with those without it. Please don't propagate the patch
> to production systems.

Thanks for the explanation. Since the problem indeed occurs on a
production system, I'd rather keep experimental patches out of the way.

I assume that adding more RAM to the OSSes is likely to solve this
problem, right? If that's the case, I'd probably go this way before the
FMR patch lands.

Thanks,
--
Kilian
Hi Liang,

On Friday 01 February 2008 23:39:09 you wrote:
> I think it's because o2iblnd uses fragmented RDMA by default (up to 256
> fragments), so we have to set max_send_wr to (concurrent_sends * (256 + 1))
> when creating the QP with rdma_create_qp(). That takes a lot of resources
> and can sometimes push a busy server out of memory.

By the way, is there a way to free some of this memory to resolve the
problem temporarily, without having to restart the OSS?

Thanks,
--
Kilian
On Sat, Feb 02, 2008 at 12:29:07PM -0800, Kilian CAVALOTTI wrote:
> On Saturday 02 February 2008 00:42:47 Isaac Huang wrote:
> > > Here is a patch for this problem (using FMR in o2iblnd):
> > > https://bugzilla.lustre.org/attachment.cgi?id=15144
> >
> > This is an experimental patch - nodes with the patch applied are not
> > interoperable with those without it. Please don't propagate the patch
> > to production systems.
>
> Thanks for the explanation. Since the problem indeed occurs on a
> production system, I'd rather keep experimental patches out of the way.
>
> I assume that adding more RAM to the OSSes is likely to solve this
> problem, right? If that's the case, I'd probably go this way before the
> FMR patch lands.

It depends on the architecture of the OSSes - o2iblnd, and I believe
OFED too, can't use memory in ZONE_HIGHMEM. For example, on x86_64,
where ZONE_HIGHMEM is empty, adding more RAM will certainly help.

Isaac
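A quick way to confirm which case a given node falls into is to look at the HighMem counters in /proc/meminfo; the dump posted earlier in the thread already shows this OSS has no HighMem at all, as expected on a 64-bit kernel:

---------------------------------------------------------------------------
oss# grep '^High' /proc/meminfo
HighTotal:           0 kB
HighFree:            0 kB
---------------------------------------------------------------------------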
On Sunday 03 February 2008 06:30:16 am Isaac Huang wrote:
> It depends on the architecture of the OSSes - o2iblnd, and I believe
> OFED too, can't use memory in ZONE_HIGHMEM. For example, on x86_64,
> where ZONE_HIGHMEM is empty, adding more RAM will certainly help.

Good to know, thanks.

On the strange side, this "no resources" message only appears on one
client. It gets it from pretty much all of our 8 OSSes, while all the
other 276 clients can still access the filesystem (and hence all 8
OSSes) without a single problem. Rebooting the problematic client
doesn't help either.

Does that sound like something logic can explain? I would assume that
if the OSS were out of memory, it would affect all clients
indiscriminately, right?

Thanks,
--
Kilian