bjohanso at psc.edu
2007-Sep-12 15:41 UTC
[Lustre-devel] [Bug 13607] New: lnet router RDMA too fragmented: 128/256 src 128/256 dst frags
Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=13607
Client: Catamount using liblustre 1.4.11
Router: kptllnd - ko2iblnd
Running b_eff_io
(http://www.hlrs.de/organization/par/services/models/mpi/b_eff_io/index_v1.1.html)
fails when attempting to write a chunk size of 1048832 bytes. The same
write succeeds on the local Lustre filesystem on the XT3 (all contained
within Cray Portals).
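As a quick check on the failing size: 1048832 bytes is 1 MiB plus 256 bytes, so it does not divide into whole pages. A minimal sketch of the arithmetic, assuming a 4096-byte page size (not stated in this report); the helper name pages_spanned is hypothetical:

```python
PAGE_SIZE = 4096   # assumption: the page size is not given in the report
MIB = 1 << 20

def pages_spanned(nbytes, offset=0, page_size=PAGE_SIZE):
    """Pages touched by nbytes that start `offset` bytes into the first page."""
    return (offset + nbytes + page_size - 1) // page_size

chunk = 1048832    # in-memory chunk size of the failing pattern
disk_chunk = 32776 # on-disk chunk size of the same pattern

print("bytes beyond 1 MiB:", chunk - MIB)
print("pages for 1 MiB:   ", pages_spanned(MIB))
print("pages for chunk:   ", pages_spanned(chunk))
print("disk chunks per call:", chunk // disk_chunk, "remainder", chunk % disk_chunk)
```

Under that assumption the 1 MiB patterns fit in 256 pages while the failing chunk needs 257, and the in-memory chunk is exactly 32 on-disk chunks of 32776 bytes (32 KiB plus 8 bytes), so the on-disk pieces start at offsets that are not multiples of 32 KiB.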
A successful run on the local filesystem:
-----+---+------+----+--------+----------+----------+----------------+-------+-------------+-------+-------+------+-----+-----+-----+-----+----------------
num. acc-| pat- |pat-|scheduled chunk| chunk|filename | repeat| transferred| meas-|=sum of|time of meas.calls|last |last | measured
of |ess| tern |tern| time | size| size| | factor| MB of this| ured | I/O |barr- |bcast|file-| I/O |barr.| bandwidth
PEs | | type | | [sec] | on disk| in memory| | | pattern| time | | ier | | sync|call |+bcst| of this pattern
-----+---+------+----+--------+----------+----------+----------------+-------+-------------+-------+-------+------+-----+-----+-----+-----+----------------
n=1 a=0 type=0 p= 0 Tp= 0.00 l= 1048576 L= 1048576 i00_001_0 r=  1 S  1.049 MB t= 0.01 = 0.011+ 0.000+0.000+0.000 0.011 0.000 bw=  92.353 MB/s
n=1 a=0 type=0 p= 1 Tp= 0.40 l= 4194304 L= 4194304 i00_001_0 r= 14 S 58.720 MB t= 0.38 = 0.379+ 0.000+0.000+0.000 0.027 0.000 bw= 154.821 MB/s
n=1 a=0 type=0 p= 2 Tp= 0.40 l= 1048576 L= 2097152 i00_001_0 r= 22 S 46.137 MB t= 0.39 = 0.390+ 0.000+0.000+0.000 0.019 0.000 bw= 118.293 MB/s
n=1 a=0 type=0 p= 3 Tp= 0.40 l= 1048576 L= 1048576 i00_001_0 r= 32 S 33.554 MB t= 0.39 = 0.393+ 0.000+0.000+0.000 0.011 0.000 bw=  85.332 MB/s
n=1 a=0 type=0 p= 4 Tp= 0.20 l=   32768 L= 1048576 i00_001_0 r= 17 S 17.826 MB t= 0.19 = 0.195+ 0.000+0.000+0.000 0.011 0.000 bw=  91.425 MB/s
n=1 a=0 type=0 p= 5 Tp= 0.20 l=    1024 L= 1048576 i00_001_0 r= 17 S 17.826 MB t= 0.19 = 0.192+ 0.000+0.000+0.000 0.011 0.000 bw=  92.684 MB/s
n=1 a=0 type=0 p= 6 Tp= 0.20 l=   32776 L= 1048832 i00_001_0 r= 13 S 13.635 MB t= 0.20 = 0.195+ 0.000+0.000+0.000 0.015 0.000 bw=  69.834 MB/s
n=1 a=0 type=0 p= 7 Tp= 0.20 l=    1032 L= 1056768 i00_001_0 r= 12 S 12.681 MB t= 0.19 = 0.185+ 0.000+0.000+0.000 0.015 0.000 bw=  68.509 MB/s
n=1 a=0 type=0 p= 8 Tp= 0.20 l= 1048584 L= 1048584 i00_001_0 r= 14 S 14.680 MB t= 0.20 = 0.198+ 0.000+0.000+0.000 0.012 0.000 bw=  74.275 MB/s
total pattern type: S=216.108 MB t=2.14 t_op=0.00 t_cl=0.00 wbw= 101.238 MB/s b_eff_io_write_scatter= 101.007 MB/s
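In the successful run, the transferred-MB column appears to equal the repeat factor (r) times the in-memory chunk size (L), in units of 10^6 bytes. The sketch below checks that against the reported rows. The r, L, and MB values are copied from the output; the relation itself is inferred from the numbers, not taken from the b_eff_io documentation, and the function name is hypothetical.

```python
# (pattern, repeat factor r, in-memory chunk size L in bytes, reported MB)
rows = [
    (0, 1, 1048576, 1.049),
    (1, 14, 4194304, 58.720),
    (2, 22, 2097152, 46.137),
    (3, 32, 1048576, 33.554),
    (4, 17, 1048576, 17.826),
    (5, 17, 1048576, 17.826),
    (6, 13, 1048832, 13.635),
    (7, 12, 1056768, 12.681),
    (8, 14, 1048584, 14.680),
]

def transferred_mb(repeat, chunk_bytes):
    """Assumed relation: MB transferred = r * L / 10^6."""
    return repeat * chunk_bytes / 1e6

total = 0.0
for pattern, r, L, reported in rows:
    mb = transferred_mb(r, L)
    total += mb
    print(f"p={pattern}: computed {mb:.3f} MB, reported {reported:.3f} MB")
print(f"total: {total:.3f} MB")
```

Every row agrees with the report to three decimals, and the sum matches the S=216.108 MB total, so pattern 6 completed all 13 repeats of the 1048832-byte chunk on the local filesystem.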
The routed filesystem fails on pattern 6:
-----+---+------+----+--------+----------+----------+---------------
num. acc-| pat- |pat-|scheduled chunk| chunk|filename
of |ess| tern |tern| time | size| size|
PEs | | type | | [sec] | on disk| in memory|
-----+---+------+----+--------+----------+----------+--------------
n=1 a=0 type=0 p= 6 Tp= 0.20 l= 32776 L= 1048832 i00_001_0
Router debug:
0000400:2000000:0:1189540683.206990:0:16915:0:(api-ni.c:1082:lnet_startup_lndnis()) Added LNI 2571 at ptl [8/768]
0000400:2000000:0:1189540683.265861:0:16915:0:(api-ni.c:1082:lnet_startup_lndnis()) Added LNI 10.10.101.105 at o2ib [512/1024]
0000800:020000:0:1189540707.150075:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189540707.150086:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.40 at o2ib: -90
0000800:000400:0:1189540733.893046:0:16918:0:(ptllnd_peer.c:1142:kptllnd_tx_launch()) Refusing to create a new connection to U3-2 at ptl (non-kernel peer)
0000800:020000:0:1189540914.127475:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189540914.127486:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.36 at o2ib: -90
0000800:020000:0:1189609614.644581:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189609614.644594:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34 at o2ib: -90
0000800:020000:0:1189609724.267097:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189609724.267110:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34 at o2ib: -90
0000800:020000:0:1189610024.296970:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189610024.296981:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34 at o2ib: -90
0000800:020000:0:1189610324.325446:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189610324.325458:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34 at o2ib: -90
0000800:020000:0:1189610624.352945:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189610624.352957:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34 at o2ib: -90
Debug log: 17 lines, 17 kept, 0 dropped.
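
One possible reading of "128/256 src 128/256 dst frags" is that the router ran out of RDMA work requests halfway through both fragment lists because source and destination fragment boundaries did not line up, so each fragment needed two work requests instead of one. The toy model below illustrates that reading only. It is not the ko2iblnd code: the function name, the 4096-byte fragment size, the 2048-byte misalignment, and the limit of 256 work requests are all assumptions chosen for illustration. On Linux, errno 90 is EMSGSIZE, which fits the -90 in the log.

```python
def pair_fragments(src, dst, max_wrq=256):
    """Toy model: each work request copies up to the nearer fragment boundary.

    src and dst are lists of fragment sizes in bytes. Returns
    (ok, srcidx, dstidx, nwrq); ok is False if max_wrq (an assumed limit)
    is used up before either list is exhausted.
    """
    srcidx = dstidx = nwrq = 0
    src_left, dst_left = src[0], dst[0]
    while srcidx < len(src) and dstidx < len(dst):
        if nwrq == max_wrq:
            return False, srcidx, dstidx, nwrq   # "too fragmented"
        n = min(src_left, dst_left)              # bytes in this work request
        nwrq += 1
        src_left -= n
        dst_left -= n
        if src_left == 0:                        # move to next source fragment
            srcidx += 1
            if srcidx < len(src):
                src_left = src[srcidx]
        if dst_left == 0:                        # move to next destination fragment
            dstidx += 1
            if dstidx < len(dst):
                dst_left = dst[dstidx]
    return True, srcidx, dstidx, nwrq

aligned = pair_fragments([4096] * 256, [4096] * 256)
# Hypothetical misalignment: destination starts half a page into its first page.
misaligned = pair_fragments([4096] * 256, [2048] + [4096] * 255)
print("aligned:   ", aligned)
print("misaligned:", misaligned)
```

In this model the aligned case finishes in exactly 256 work requests, while the misaligned case stops at index 128 of 256 on both sides, the same indices the router reports. Whether this is what actually happens inside kiblnd_init_rdma() would need to be confirmed against the source.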