Hello.

The short story: while setting up an InfiniBand connection between two servers, one of which is a Xen dom0, I cannot complete the RDMA latency test. The test crashes and even breaks the ssh connection to the dom0. I am asking here in the hope that Xen gurus will notice something that is actually related to Xen.

The long story. The first server runs Xen 4.4 with Ubuntu 14.04 as dom0 (hostname xen). The second is an ordinary server running Ubuntu 14.04 (hostname node3). Both have Mellanox MT25208 HCAs connected through an IB switch. Both have all the kernel modules loaded and OpenSM installed. IPoIB works fine. A bare ibping works in both directions, xen -> node3 and node3 -> xen.

The problem occurs when I run the ib_rdma_lat test. Here are the steps that lead to the ib_rdma_lat crash and the subsequent sshd crash on xen:

1. On xen I run ib_rdma_lat.
2. On node3 I run ib_rdma_lat xen.
3. The ssh connection to xen closes.
4. This is the output on xen before the ssh connection closes:

root@xen:~/tmp/22# ib_rdma_lat
local address: LID 0x03 QPN 0x10406 PSN 0x9f903b RKey 0x40004000 VAddr 0x000000017e4001
remote address: LID 0x01 QPN 0x10406 PSN 0xd8c16e RKey 0x20004000 VAddr 0x000000013fd001
Connection to xen closed by remote host.
Connection to xen closed.

I googled, and the only thing I found to try was tuning the ib_mthca module parameters num_mtt and log_mtts_per_seg, as described in the article http://community.mellanox.com/docs/DOC-1120. I set them on both servers to num_mtt=4194304 and log_mtts_per_seg=4. I arrived at these values by experimenting until the ib_mthca module loaded correctly. But this didn't help: ib_rdma_lat still crashes on xen.
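For reference, here is a rough sketch of what those two values work out to. It assumes the usual MTT sizing rule (registrable memory = number of MTT segments x MTT entries per segment x page size) applies to ib_mthca, and it assumes 4 KiB pages. The function name is mine, for illustration only.

```python
def max_registrable_bytes(num_mtt: int, log_mtts_per_seg: int, page_size: int = 4096) -> int:
    """Upper bound on registrable memory under the assumed sizing rule.

    num_mtt          -- number of MTT segments (module parameter)
    log_mtts_per_seg -- log2 of MTT entries per segment (module parameter)
    page_size        -- assumed to be 4 KiB
    """
    return num_mtt * (1 << log_mtts_per_seg) * page_size


# Values I set on both servers.
limit = max_registrable_bytes(num_mtt=4194304, log_mtts_per_seg=4)
print(f"{limit / 2**30:.0f} GiB")
```

If the rule holds, the limit comes out at 256 GiB, which is above the RAM size of either machine, so the MTT limit itself should not be what runs out.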
Here's the log:

Aug 4 00:12:52 localhost kernel: [ 4011.170180] ib_rdma_lat invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
Aug 4 00:12:52 localhost kernel: [ 4011.170189] ib_rdma_lat cpuset=/ mems_allowed=0
Aug 4 00:12:52 localhost kernel: [ 4011.170195] CPU: 0 PID: 2889 Comm: ib_rdma_lat Tainted: G B W 3.13.0-32-generic #57-Ubuntu
Aug 4 00:12:52 localhost kernel: [ 4011.170198] Hardware name: Supermicro X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
Aug 4 00:12:52 localhost kernel: [ 4011.170202] 0000000000000000 ffff880f175ebc68 ffffffff8171bcb4 ffff880f1ae02fe0
Aug 4 00:12:52 localhost kernel: [ 4011.170209] ffff880f175ebcf0 ffffffff817165ef ffff880f1a96afe0 0000000000000000
Aug 4 00:12:52 localhost kernel: [ 4011.170213] 00000000016ad5c1 ffff880f1a96afe0 ffffffff817246aa ffffffff8172417b
Aug 4 00:12:52 localhost kernel: [ 4011.170217] Call Trace:
Aug 4 00:12:52 localhost kernel: [ 4011.170236] [<ffffffff8171bcb4>] dump_stack+0x45/0x56
Aug 4 00:12:52 localhost kernel: [ 4011.170242] [<ffffffff817165ef>] dump_header+0x7f/0x1f1
Aug 4 00:12:52 localhost kernel: [ 4011.170248] [<ffffffff817246aa>] ? error_exit+0x2a/0x60
Aug 4 00:12:52 localhost kernel: [ 4011.170253] [<ffffffff8172417b>] ? retint_restore_args+0x5/0x6
Aug 4 00:12:52 localhost kernel: [ 4011.170260] [<ffffffff81151bfe>] oom_kill_process+0x1ce/0x330
Aug 4 00:12:52 localhost kernel: [ 4011.170269] [<ffffffff812d3ac5>] ? security_capable_noaudit+0x15/0x20
Aug 4 00:12:52 localhost kernel: [ 4011.170273] [<ffffffff81152334>] out_of_memory+0x414/0x450
Aug 4 00:12:52 localhost kernel: [ 4011.170278] [<ffffffff811523df>] pagefault_out_of_memory+0x6f/0x80
Aug 4 00:12:52 localhost kernel: [ 4011.170284] [<ffffffff81714c38>] mm_fault_error+0x8e/0x180
Aug 4 00:12:52 localhost kernel: [ 4011.170289] [<ffffffff81727f01>] __do_page_fault+0x4a1/0x560
Aug 4 00:12:52 localhost kernel: [ 4011.170299] [<ffffffff81111116>] ? __acct_update_integrals+0x76/0xe0
Aug 4 00:12:52 localhost kernel: [ 4011.170305] [<ffffffff8111155c>] ? acct_account_cputime+0x1c/0x20
Aug 4 00:12:52 localhost kernel: [ 4011.170312] [<ffffffff8109d7db>] ? account_user_time+0x8b/0xa0
Aug 4 00:12:52 localhost kernel: [ 4011.170316] [<ffffffff8109ddf4>] ? vtime_account_user+0x54/0x60
Aug 4 00:12:52 localhost kernel: [ 4011.170320] [<ffffffff81727fda>] do_page_fault+0x1a/0x70
Aug 4 00:12:52 localhost kernel: [ 4011.170324] [<ffffffff81724448>] page_fault+0x28/0x30
Aug 4 00:12:52 localhost kernel: [ 4011.170326] Mem-Info:
Aug 4 00:12:52 localhost kernel: [ 4011.170329] Node 0 DMA per-cpu:
Aug 4 00:12:52 localhost kernel: [ 4011.170334] CPU 0: hi: 0, btch: 1 usd: 0
Aug 4 00:12:52 localhost kernel: [ 4011.170336] Node 0 DMA32 per-cpu:
Aug 4 00:12:52 localhost kernel: [ 4011.170339] CPU 0: hi: 186, btch: 31 usd: 135
Aug 4 00:12:52 localhost kernel: [ 4011.170341] Node 0 Normal per-cpu:
Aug 4 00:12:52 localhost kernel: [ 4011.170344] CPU 0: hi: 186, btch: 31 usd: 124
Aug 4 00:12:52 localhost kernel: [ 4011.170351] active_anon:7920 inactive_anon:23 isolated_anon:0
Aug 4 00:12:52 localhost kernel: [ 4011.170351] active_file:20177 inactive_file:37521 isolated_file:0
Aug 4 00:12:52 localhost kernel: [ 4011.170351] unevictable:8 dirty:0 writeback:0 unstable:0
Aug 4 00:12:52 localhost kernel: [ 4011.170351] free:15211440 slab_reclaimable:4583 slab_unreclaimable:8427
Aug 4 00:12:52 localhost kernel: [ 4011.170351] mapped:4644 shmem:408 pagetables:993 bounce:0
Aug 4 00:12:52 localhost kernel: [ 4011.170351] free_cma:0
Aug 4 00:12:52 localhost kernel: [ 4011.170358] Node 0 DMA free:15888kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Aug 4 00:12:52 localhost kernel: [ 4011.170367] lowmem_reserve[]: 0 1980 60135 60135
Aug 4 00:12:52 localhost kernel: [ 4011.170372] Node 0 DMA32 free:2017364kB min:1032kB low:1288kB high:1548kB active_anon:992kB inactive_anon:4kB active_file:2596kB inactive_file:5756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045472kB managed:2031128kB mlocked:0kB dirty:0kB writeback:0kB mapped:692kB shmem:32kB slab_reclaimable:428kB slab_unreclaimable:472kB kernel_stack:40kB pagetables:132kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Aug 4 00:12:52 localhost kernel: [ 4011.170381] lowmem_reserve[]: 0 0 58154 58154
Aug 4 00:12:52 localhost kernel: [ 4011.170386] Node 0 Normal free:58812508kB min:30348kB low:37932kB high:45520kB active_anon:30688kB inactive_anon:88kB active_file:78112kB inactive_file:144328kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:60853112kB managed:59550432kB mlocked:32kB dirty:0kB writeback:0kB mapped:17884kB shmem:1600kB slab_reclaimable:17904kB slab_unreclaimable:33236kB kernel_stack:1704kB pagetables:3840kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Aug 4 00:12:52 localhost kernel: [ 4011.170394] lowmem_reserve[]: 0 0 0 0
Aug 4 00:12:52 localhost kernel: [ 4011.170398] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15888kB
Aug 4 00:12:52 localhost kernel: [ 4011.170416] Node 0 DMA32: 1*4kB (M) 12*8kB (UEM) 7*16kB (UE) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (E) 1*1024kB (E) 2*2048kB (ER) 491*4096kB (M) = 2017364kB
Aug 4 00:12:52 localhost kernel: [ 4011.170434] Node 0 Normal: 67*4kB (UM) 34*8kB (UEM) 16*16kB (UEM) 38*32kB (UM) 26*64kB (UM) 22*128kB (UEM) 15*256kB (UEM) 2*512kB (M) 1*1024kB (M) 3*2048kB (UEM) 14354*4096kB (MR) 58812508kB
Aug 4 00:12:52 localhost kernel: [ 4011.170468] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug 4 00:12:52 localhost kernel: [ 4011.170470] 58105 total pagecache pages
Aug 4 00:12:52 localhost kernel: [ 4011.170473] 0 pages in swap cache
Aug 4 00:12:52 localhost kernel: [ 4011.170476] Swap cache stats: add 0, delete 0, find 0/189
Aug 4 00:12:52 localhost kernel: [ 4011.170478] Free swap = 33517564kB
Aug 4 00:12:52 localhost kernel: [ 4011.170480] Total swap = 33517564kB
Aug 4 00:12:52 localhost kernel: [ 4011.170482] 15728639 pages RAM
Aug 4 00:12:52 localhost kernel: [ 4011.170483] 0 pages HighMem/MovableOnly
Aug 4 00:12:52 localhost kernel: [ 4011.170485] 325670 pages reserved
Aug 4 00:12:52 localhost kernel: [ 4011.170487] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Aug 4 00:12:52 localhost kernel: [ 4011.170496] [ 375] 0 375 4935 228 14 0 0 upstart-udev-br
Aug 4 00:12:52 localhost kernel: [ 4011.170501] [ 384] 0 384 12927 485 28 0 -1000 systemd-udevd
Aug 4 00:12:52 localhost kernel: [ 4011.170505] [ 571] 102 571 9887 391 23 0 0 dbus-daemon
Aug 4 00:12:52 localhost kernel: [ 4011.170509] [ 590] 101 590 63961 318 27 0 0 rsyslogd
Aug 4 00:12:52 localhost kernel: [ 4011.170513] [ 596] 0 596 4823 373 14 0 0 bluetoothd
Aug 4 00:12:52 localhost kernel: [ 4011.170516] [ 606] 0 606 18680 893 40 0 0 cupsd
Aug 4 00:12:52 localhost kernel: [ 4011.170520] [ 614] 0 614 5870 106 16 0 0 rpc.idmapd
Aug 4 00:12:52 localhost kernel: [ 4011.170523] [ 622] 0 622 10863 454 26 0 0 systemd-logind
Aug 4 00:12:52 localhost kernel: [ 4011.170528] [ 702] 0 702 3984 308 13 0 0 upstart-file-br
Aug 4 00:12:52 localhost kernel: [ 4011.170531] [ 877] 0 877 5855 275 18 0 0 rpcbind
Aug 4 00:12:52 localhost kernel: [ 4011.170534] [ 898] 111 898 5386 347 15 0 0 rpc.statd
Aug 4 00:12:52 localhost kernel: [ 4011.170538] [ 901] 0 901 3848 184 13 0 0 upstart-socket-
Aug 4 00:12:52 localhost kernel: [ 4011.170541] [ 1300] 105 1300 7861 513 21 0 0 ntpd
Aug 4 00:12:52 localhost kernel: [ 4011.170545] [ 1374] 0 1374 5268 237 13 0 0 getty
Aug 4 00:12:52 localhost kernel: [ 4011.170548] [ 1378] 0 1378 5268 235 13 0 0 getty
Aug 4 00:12:52 localhost kernel: [ 4011.170551] [ 1384] 0 1384 5268 237 13 0 0 getty
Aug 4 00:12:52 localhost kernel: [ 4011.170555] [ 1385] 0 1385 5268 238 13 0 0 getty
Aug 4 00:12:52 localhost kernel: [ 4011.170558] [ 1388] 0 1388 5268 238 13 0 0 getty
Aug 4 00:12:52 localhost kernel: [ 4011.170561] [ 1427] 0 1427 15341 762 33 0 -1000 sshd
Aug 4 00:12:52 localhost kernel: [ 4011.170564] [ 1443] 0 1443 5914 257 17 0 0 cron
Aug 4 00:12:52 localhost kernel: [ 4011.170568] [ 1554] 0 1554 2750 242 11 0 0 xenstored
Aug 4 00:12:52 localhost kernel: [ 4011.170571] [ 1566] 0 1566 22752 261 19 0 0 xenconsoled
Aug 4 00:12:52 localhost kernel: [ 4011.170575] [ 1613] 0 1613 73631 1045 48 0 0 polkitd
Aug 4 00:12:52 localhost kernel: [ 4011.170578] [ 1885] 113 1885 7052 249 18 0 0 dnsmasq
Aug 4 00:12:52 localhost kernel: [ 4011.170581] [ 2004] 0 2004 148275 997 39 0 0 console-kit-dae
Aug 4 00:12:52 localhost kernel: [ 4011.170585] [ 2166] 0 2166 23985 237 21 0 0 xl
Aug 4 00:12:52 localhost kernel: [ 4011.170589] [ 2303] 0 2303 5268 237 13 0 0 getty
Aug 4 00:12:52 localhost kernel: [ 4011.170592] [ 2378] 0 2378 82712 784 23 0 0 opensm
Aug 4 00:12:52 localhost kernel: [ 4011.170595] [ 2379] 0 2379 65942 358 22 0 0 opensm
Aug 4 00:12:52 localhost kernel: [ 4011.170598] [ 2450] 106 2450 91259 1269 74 0 0 whoopsie
Aug 4 00:12:52 localhost kernel: [ 4011.170602] [ 2453] 0 2453 93762 3220 114 0 0 libvirtd
Aug 4 00:12:52 localhost kernel: [ 4011.170605] [ 2634] 0 2634 26407 1058 54 0 0 sshd
Aug 4 00:12:52 localhost kernel: [ 4011.170608] [ 2671] 1000 2671 26407 501 52 0 0 sshd
Aug 4 00:12:52 localhost kernel: [ 4011.170612] [ 2672] 1000 2672 7041 1040 17 0 0 bash
Aug 4 00:12:52 localhost kernel: [ 4011.170615] [ 2749] 0 2749 17566 547 36 0 0 sudo
Aug 4 00:12:52 localhost kernel: [ 4011.170618] [ 2750] 0 2750 7063 1074 16 0 0 bash
Aug 4 00:12:52 localhost kernel: [ 4011.170622] [ 2889] 0 2889 3732 213 12 0 0 ib_rdma_lat
Aug 4 00:12:52 localhost kernel: [ 4011.170625] Out of memory: Kill process 2453 (libvirtd) score 0 or sacrifice child
Aug 4 00:12:52 localhost kernel: [ 4011.170729] Killed process 2453 (libvirtd) total-vm:375048kB, anon-rss:4748kB, file-rss:8132kB

xen (dom0) has 60GB of RAM, and node3 has 180GB of RAM.

Here are some logs and command outputs that I collected to diagnose the problem:

1. dmesg on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/dmesg.xen.log
2. xl dmesg on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/xl-dmesg.xen.log
3. parameters of the loaded ib_mthca on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ib_mthca.xen.log
4. ibhosts on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibhosts.xen.log
5. ibstat on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibstat.xen.log
6. ibstatus on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibstatus.xen.log
7. lsmod | grep rdma on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/lsmod-rdma.xen.log
8. lspci -s 04:00.0 -k on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/lspci.xen.log
9. a cut from /var/log/syslog after the ib_rdma_lat crash on xen https://dl.dropboxusercontent.com/u/8057759/ib_mthca/syslog.xen

Can anyone give me any advice, please?

--
Best regards,
Grigory Ptashko

+7 (916) 1489766
grigory.ptashko@gmail.com

_______________________________________________
Xen-users mailing list
Xen-users@lists.xen.org
http://lists.xen.org/xen-users
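A quick cross-check of the Mem-Info numbers from the OOM log, as a sketch that assumes 4 KiB pages. The values are copied by hand from the log and the variable names are mine.

```python
PAGE_KB = 4  # assumed page size in kB

# "free:" from the Mem-Info summary, counted in pages.
free_pages = 15211440

# Per-zone "free:" values from the "Node 0 <zone>" lines, in kB.
zone_free_kb = {"DMA": 15888, "DMA32": 2017364, "Normal": 58812508}

total_free_kb = sum(zone_free_kb.values())

# The per-zone figures and the page count should describe the same amount.
print(total_free_kb == free_pages * PAGE_KB)
print(f"{total_free_kb / 2**20:.1f} GiB free")
```

The two figures agree, and they put free memory at roughly 58 GiB with swap untouched at the moment the OOM killer fired, which makes me suspect this is not a genuine memory shortage in dom0.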