bjohanso at psc.edu
2007-Sep-12 15:41 UTC
[Lustre-devel] [Bug 13607] New: lnet router RDMA too fragmented: 128/256 src 128/256 dst frags
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=13607

Client: Catamount using liblustre 1.4.11
Router: kptllnd - ko2iblnd

Running b_eff_io
http://www.hlrs.de/organization/par/services/models/mpi/b_eff_io/index_v1.1.html
fails when attempting to write a chunk size of 1048832 bytes. The same run succeeds on the local Lustre filesystem on the XT3 (all contained within Cray Portals).

A successful run on the local filesystem:

-----+---+------+----+--------+----------+----------+----------------+-------+-------------+-------+-------+------+-----+-----+-----+-----+----------------
num. acc-| pat- |pat-|scheduled     chunk|     chunk|filename        | repeat|  transferred|  meas-|=sum of|time of meas.calls|last |last | measured
 of  |ess| tern |tern|  time  |      size|      size|                | factor|   MB of this|  ured |  I/O  |barr- |bcast|file-| I/O |barr.| bandwidth
PEs  |   | type |    | [sec]  |   on disk| in memory|                |       |      pattern|  time |       | ier  |     | sync|call |+bcst| of this pattern
-----+---+------+----+--------+----------+----------+----------------+-------+-------------+-------+-------+------+-----+-----+-----+-----+----------------
n=1 a=0 type=0 p= 0 Tp= 0.00 l= 1048576 L= 1048576 i00_001_0 r=  1 S  1.049 MB t= 0.01 = 0.011+ 0.000+0.000+0.000 0.011 0.000 bw=  92.353 MB/s
n=1 a=0 type=0 p= 1 Tp= 0.40 l= 4194304 L= 4194304 i00_001_0 r= 14 S 58.720 MB t= 0.38 = 0.379+ 0.000+0.000+0.000 0.027 0.000 bw= 154.821 MB/s
n=1 a=0 type=0 p= 2 Tp= 0.40 l= 1048576 L= 2097152 i00_001_0 r= 22 S 46.137 MB t= 0.39 = 0.390+ 0.000+0.000+0.000 0.019 0.000 bw= 118.293 MB/s
n=1 a=0 type=0 p= 3 Tp= 0.40 l= 1048576 L= 1048576 i00_001_0 r= 32 S 33.554 MB t= 0.39 = 0.393+ 0.000+0.000+0.000 0.011 0.000 bw=  85.332 MB/s
n=1 a=0 type=0 p= 4 Tp= 0.20 l=   32768 L= 1048576 i00_001_0 r= 17 S 17.826 MB t= 0.19 = 0.195+ 0.000+0.000+0.000 0.011 0.000 bw=  91.425 MB/s
n=1 a=0 type=0 p= 5 Tp= 0.20 l=    1024 L= 1048576 i00_001_0 r= 17 S 17.826 MB t= 0.19 = 0.192+ 0.000+0.000+0.000 0.011 0.000 bw=  92.684 MB/s
n=1 a=0 type=0 p= 6 Tp= 0.20 l=   32776 L= 1048832 i00_001_0 r= 13 S 13.635 MB t= 0.20 = 0.195+ 0.000+0.000+0.000 0.015 0.000 bw=  69.834 MB/s
n=1 a=0 type=0 p= 7 Tp= 0.20 l=    1032 L= 1056768 i00_001_0 r= 12 S 12.681 MB t= 0.19 = 0.185+ 0.000+0.000+0.000 0.015 0.000 bw=  68.509 MB/s
n=1 a=0 type=0 p= 8 Tp= 0.20 l= 1048584 L= 1048584 i00_001_0 r= 14 S 14.680 MB t= 0.20 = 0.198+ 0.000+0.000+0.000 0.012 0.000 bw=  74.275 MB/s
total pattern type: S=216.108 MB t=2.14 t_op=0.00 t_cl=0.00 wbw= 101.238 MB/s
b_eff_io_write_scatter= 101.007 MB/s

The routed filesystem fails on pattern 6:

-----+---+------+----+--------+----------+----------+---------------
num. acc-| pat- |pat-|scheduled     chunk|     chunk|filename
 of  |ess| tern |tern|  time  |      size|      size|
PEs  |   | type |    | [sec]  |   on disk| in memory|
-----+---+------+----+--------+----------+----------+--------------
n=1 a=0 type=0 p= 6 Tp= 0.20 l=   32776 L= 1048832 i00_001_0

Router debug:

0000400:2000000:0:1189540683.206990:0:16915:0:(api-ni.c:1082:lnet_startup_lndnis()) Added LNI 2571@ptl [8/768]
0000400:2000000:0:1189540683.265861:0:16915:0:(api-ni.c:1082:lnet_startup_lndnis()) Added LNI 10.10.101.105@o2ib [512/1024]
0000800:020000:0:1189540707.150075:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189540707.150086:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.40@o2ib: -90
0000800:000400:0:1189540733.893046:0:16918:0:(ptllnd_peer.c:1142:kptllnd_tx_launch()) Refusing to create a new connection to U3-2@ptl (non-kernel peer)
0000800:020000:0:1189540914.127475:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189540914.127486:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.36@o2ib: -90
0000800:020000:0:1189609614.644581:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189609614.644594:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34@o2ib: -90
0000800:020000:0:1189609724.267097:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189609724.267110:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34@o2ib: -90
0000800:020000:0:1189610024.296970:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189610024.296981:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34@o2ib: -90
0000800:020000:0:1189610324.325446:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189610324.325458:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34@o2ib: -90
0000800:020000:0:1189610624.352945:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189610624.352957:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34@o2ib: -90

Debug log: 17 lines, 17 kept, 0 dropped.
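
For anyone triaging a longer router log, a minimal sketch of how the two o2iblnd messages above could be tallied per peer follows. It is not part of the report or of Lustre: the function name `summarize` and the regular expressions are hypothetical, written only against the message text shown in the router debug log. The four integers in "RDMA too fragmented" are captured as printed, without interpreting them. The return code -90 corresponds to -EMSGSIZE ("Message too long") on Linux.

```python
import re
from collections import Counter

# Linux errno value for EMSGSIZE; the log above reports its negation (-90).
LINUX_EMSGSIZE = 90

# Patterns written against the message text in the router debug log.
# NIDs look like "addr@net"; list archives may render "@" as " at ",
# and the apostrophe in "Can't" may appear doubled, so both are accepted.
FRAG_RE = re.compile(r"RDMA too fragmented: (\d+)/(\d+) src (\d+)/(\d+) dst frags")
PUT_RE = re.compile(r"Can'+t setup rdma for PUT to (\S+?)(?:@| at )(\S+): (-?\d+)")


def summarize(log_text):
    """Tally fragmentation errors and failed PUTs in a debug log (hypothetical helper)."""
    # Count each distinct quadruple of numbers as printed in the message.
    frags = Counter(FRAG_RE.findall(log_text))
    failures = Counter()
    codes = set()
    for addr, net, rc in PUT_RE.findall(log_text):
        failures[f"{addr}@{net}"] += 1
        codes.add(int(rc))
    return frags, failures, codes


# Three message pairs copied from the router debug log in this report.
SAMPLE = """\
0000800:020000:0:1189540707.150075:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189540707.150086:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.40@o2ib: -90
0000800:020000:0:1189540914.127475:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189540914.127486:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.36@o2ib: -90
0000800:020000:0:1189609614.644581:0:16918:0:(o2iblnd_cb.c:1207:kiblnd_init_rdma()) RDMA too fragmented: 128/256 src 128/256 dst frags
0000800:020000:0:1189609614.644594:0:16918:0:(o2iblnd_cb.c:449:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.10.101.34@o2ib: -90
"""

if __name__ == "__main__":
    frags, failures, codes = summarize(SAMPLE)
    print("fragmentation messages:", sum(frags.values()))
    for nid, count in sorted(failures.items()):
        print(f"failed PUTs to {nid}: {count}")
    print("all failures are -EMSGSIZE:", codes == {-LINUX_EMSGSIZE})
```

Feeding the full router log through `summarize` instead of `SAMPLE` would show which o2ib peers are affected and whether every failure carries the same return code.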