Hi all,
What can cause a client to receive a "o2iblnd no resources" message
from an OSS?
---------------------------------------------------------------------------
Feb 1 15:24:24 node-5-8 kernel: LustreError:
1893:0:(o2iblnd_cb.c:2448:kiblnd_rejected()) 10.10.60.3@o2ib rejected:
o2iblnd no resources
---------------------------------------------------------------------------
I suspect an out-of-memory problem, and indeed the OSS logs are filled
up with the following:
---------------------------------------------------------------------------
ib_cm/3: page allocation failure. order:4, mode:0xd0
Call Trace:<ffffffff8015c847>{__alloc_pages+777}
<ffffffff801727e9>{alloc_page_interleave+61}
<ffffffff8015c8e0>{__get_free_pages+11}
<ffffffff8015facd>{kmem_getpages+36}
<ffffffff80160262>{cache_alloc_refill+609}
<ffffffff8015ff30>{__kmalloc+123}
<ffffffffa014ee75>{:ib_mthca:mthca_alloc_qp_common+668}
<ffffffffa014f42d>{:ib_mthca:mthca_alloc_qp+178}
<ffffffffa0153e3a>{:ib_mthca:mthca_create_qp+311}
<ffffffffa00d5b1b>{:ib_core:ib_create_qp+20}
<ffffffffa021a5f9>{:rdma_cm:rdma_create_qp+43}
<ffffffff8024b7b5>{dma_pool_free+245}
<ffffffffa014b257>{:ib_mthca:mthca_init_cq+1073}
<ffffffffa01540cf>{:ib_mthca:mthca_create_cq+282}
<ffffffff801727e9>{alloc_page_interleave+61}
<ffffffffa0400c10>{:ko2iblnd:kiblnd_cq_completion+0}
<ffffffffa0400d50>{:ko2iblnd:kiblnd_cq_event+0}
<ffffffffa00d5cc1>{:ib_core:ib_create_cq+33}
<ffffffffa03f56bd>{:ko2iblnd:kiblnd_create_conn+3565}
<ffffffffa0276f38>{:libcfs:cfs_alloc+40}
<ffffffffa03fe457>{:ko2iblnd:kiblnd_passive_connect+2215}
<ffffffffa00d8595>{:ib_core:ib_find_cached_gid+244}
<ffffffffa021a278>{:rdma_cm:cma_acquire_dev+293}
<ffffffffa03ff540>{:ko2iblnd:kiblnd_cm_callback+64}
<ffffffffa03ff500>{:ko2iblnd:kiblnd_cm_callback+0}
<ffffffffa021b19a>{:rdma_cm:cma_req_handler+863}
<ffffffff801e8427>{alloc_layer+67}
<ffffffff801e8645>{idr_get_new_above_int+423}
<ffffffffa00fa0ab>{:ib_cm:cm_process_work+101}
<ffffffffa00faa57>{:ib_cm:cm_req_handler+2398}
<ffffffffa00fae3c>{:ib_cm:cm_work_handler+0}
<ffffffffa00fae6a>{:ib_cm:cm_work_handler+46}
<ffffffff80146fca>{worker_thread+419}
<ffffffff80133566>{default_wake_function+0}
<ffffffff801335b7>{__wake_up_common+67}
<ffffffff80133566>{default_wake_function+0}
<ffffffff8014ad18>{keventd_create_kthread+0}
<ffffffff80146e27>{worker_thread+0}
<ffffffff8014ad18>{keventd_create_kthread+0}
<ffffffff8014acef>{kthread+200}
<ffffffff80110de3>{child_rip+8}
<ffffffff8014ad18>{keventd_create_kthread+0}
<ffffffff8014ac27>{kthread+0}
<ffffffff80110ddb>{child_rip+0}
Mem-info:
Node 0 DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
cpu 2 hot: low 2, high 6, batch 1
cpu 2 cold: low 0, high 2, batch 1
cpu 3 hot: low 2, high 6, batch 1
cpu 3 cold: low 0, high 2, batch 1
Node 0 Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
cpu 2 hot: low 32, high 96, batch 16
cpu 2 cold: low 0, high 32, batch 16
cpu 3 hot: low 32, high 96, batch 16
cpu 3 cold: low 0, high 32, batch 16
Node 0 HighMem per-cpu: empty
Free pages: 35336kB (0kB HighMem)
Active:534156 inactive:127091 dirty:1072 writeback:0 unstable:0 free:8834
slab:146612 mapped:26222 pagetables:1035
Node 0 DMA free:9832kB min:52kB low:64kB high:76kB active:0kB inactive:0kB
present:16384kB pages_scanned:37 all_unreclaimable? yes
protections[]: 0 510200 510200
Node 0 Normal free:25504kB min:16328kB low:20408kB high:24492kB active:2136624kB
inactive:508364kB present:4964352kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB
present:0kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 1*64kB 0*128kB 0*256kB 1*512kB 1*1024kB
0*2048kB 2*4096kB = 9832kB
Node 0 Normal: 1284*4kB 2290*8kB 126*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB
0*1024kB 0*2048kB 0*4096kB = 25504kB
Node 0 HighMem: empty
Swap cache: add 111, delete 111, find 23/36, race 0+0
Free swap: 4096360kB
1245184 pages of RAM
235840 reserved pages
659867 pages shared
0 pages swap cached
---------------------------------------------------------------------------
IB links are up and working on both the client and the OSS:
---------------------------------------------------------------------------
client# ibstatus
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0005:ad00:0008:af71
base lid: 0x83
sm lid: 0x130
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
oss# ibstatus
Infiniband device 'mthca0' port 1 status:
default gid: fe80:0000:0000:0000:0005:ad00:0008:cb11
base lid: 0x126
sm lid: 0x130
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 20 Gb/sec (4X DDR)
---------------------------------------------------------------------------
And the Subnet Manager doesn't show any unusual errors or skyrocketing
counters (I use OFED 1.2, kernel 2.6.9-55.0.9.EL_lustre.1.6.4.1smp).
What I don't really get is that most clients can access files on this
OSS without any issue; besides, my limited understanding of kernel
memory management leads me to believe that this OSS is not out of
memory:
---------------------------------------------------------------------------
# cat /proc/meminfo
MemTotal: 4037380 kB
MemFree: 31688 kB
Buffers: 1333536 kB
Cached: 1231900 kB
SwapCached: 0 kB
Active: 2138948 kB
Inactive: 507720 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 4037380 kB
LowFree: 31688 kB
SwapTotal: 4096564 kB
SwapFree: 4096360 kB
Dirty: 6868 kB
Writeback: 0 kB
Mapped: 106984 kB
Slab: 588200 kB
CommitLimit: 6115252 kB
Committed_AS: 860508 kB
PageTables: 4304 kB
VmallocTotal: 536870911 kB
VmallocUsed: 274788 kB
VmallocChunk: 536596091 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB
---------------------------------------------------------------------------
This only appeared recently, after several weeks of continuous use of the
filesystem without any problem. Is there anything like a memory leak
somewhere? Any help diagnosing the problem would be greatly appreciated.
Thanks!
--
Kilian
Hi Kilian,

I think it's because o2iblnd uses fragmented RDMA by default (up to 256
fragments), so we have to set max_send_wr to (concurrent_sends * (256 + 1))
when creating the QP with rdma_create_qp(). That takes a lot of resources
and can sometimes drive a busy server out of memory.

To resolve this, we have to use FMR to map the fragmented buffers to a
virtually contiguous I/O address; that way there is always only one
fragment per RDMA.

Here is a patch for this problem (using FMR in o2iblnd):
https://bugzilla.lustre.org/attachment.cgi?id=15144

Regards,
Liang
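P.S. To give a rough idea of where that cost goes, here is a minimal sketch
of the QP sizing (an illustration only, not the actual o2iblnd code;
IBLND_MAX_RDMA_FRAGS, concurrent_sends and the function name are stand-ins
for the real symbols and module parameters):
---------------------------------------------------------------------------
/*
 * Sketch only -- not the actual o2iblnd code.  IBLND_MAX_RDMA_FRAGS and
 * 'concurrent_sends' stand in for the real o2iblnd symbols/module options.
 */
#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>

#define IBLND_MAX_RDMA_FRAGS 256	/* assumed: max fragments per bulk RDMA */

static int sketch_create_qp(struct rdma_cm_id *cmid, struct ib_pd *pd,
			    struct ib_cq *cq, int concurrent_sends)
{
	struct ib_qp_init_attr attr = {
		.send_cq     = cq,
		.recv_cq     = cq,
		.qp_type     = IB_QPT_RC,
		.sq_sig_type = IB_SIGNAL_REQ_WR,
	};

	/*
	 * One work request per fragment plus one for the message itself:
	 * 257 send WRs per in-flight RDMA.  The HCA driver (mthca here)
	 * backs the work queue with a large contiguous kernel allocation,
	 * which is where the order:4 kmalloc in the trace above fails.
	 */
	attr.cap.max_send_wr  = concurrent_sends * (IBLND_MAX_RDMA_FRAGS + 1);
	attr.cap.max_recv_wr  = concurrent_sends;	/* illustrative only */
	attr.cap.max_send_sge = 1;
	attr.cap.max_recv_sge = 1;

	/*
	 * With FMR the fragments are mapped to one virtually contiguous
	 * I/O address, so a single WR per RDMA is enough and the queues
	 * (and the allocation behind them) become much smaller.
	 */
	return rdma_create_qp(cmid, pd, &attr);
}
---------------------------------------------------------------------------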
On Sat, Feb 02, 2008 at 03:39:09PM +0800, Liang Zhen wrote:
> Here is a patch for this problem (using FMR in o2iblnd):
> https://bugzilla.lustre.org/attachment.cgi?id=15144

This is an experimental patch - nodes with the patch applied are not
interoperable with those without it. Please don't propagate the patch to
production systems.

Isaac
On Saturday 02 February 2008 00:42:47 Isaac Huang wrote:
> > Here is a patch for this problem (using FMR in o2iblnd):
> > https://bugzilla.lustre.org/attachment.cgi?id=15144
>
> This is an experimental patch - nodes with the patch applied are not
> interoperable with those without it. Please don't propagate the patch to
> production systems.

Thanks for the explanation. Since the problem indeed occurs on a production
system, I'd rather keep experimental patches out of the way.

I assume that adding more RAM to the OSSes is likely to solve this problem,
right? If that's the case, I'd probably go this way until the FMR patch is
landed.

Thanks,
--
Kilian
Hi Liang,

On Friday 01 February 2008 23:39:09 you wrote:
> I think it's because o2iblnd uses fragmented RDMA by default (up to 256
> fragments), so we have to set max_send_wr to (concurrent_sends * (256 + 1))
> when creating the QP with rdma_create_qp(). That takes a lot of resources
> and can sometimes drive a busy server out of memory.

By the way, is there a way to free some of this memory to resolve the
problem temporarily, without having to restart the OSS?

Thanks,
--
Kilian
On Sat, Feb 02, 2008 at 12:29:07PM -0800, Kilian CAVALOTTI wrote:
> I assume that adding more RAM to the OSSes is likely to solve this problem,
> right? If that's the case, I'd probably go this way until the FMR patch is
> landed.

It depends on the architecture of the OSSes - o2iblnd, and I believe OFED
too, can't use memory in ZONE_HIGHMEM. For example, on x86_64, where
ZONE_HIGHMEM is empty, adding more RAM will certainly help.

Isaac
On Sunday 03 February 2008 06:30:16 am Isaac Huang wrote:
> It depends on the architecture of the OSSes - o2iblnd, and I believe OFED
> too, can't use memory in ZONE_HIGHMEM. For example, on x86_64, where
> ZONE_HIGHMEM is empty, adding more RAM will certainly help.

Good to know, thanks.

On the strange side, this "no resources" message only appears on one client.
It gets it from pretty much all of our 8 OSSes, while all the other 276
clients can still access the filesystem (and hence all 8 OSSes) without a
single problem. Rebooting the problematic client doesn't help either.

Does that sound like something logic can explain? I would assume that if the
OSSes were out of memory, this would affect all the clients indiscriminately,
right?

Thanks,
--
Kilian