Hi all,

we have a testbed running Lustre 1.8.3 with an RTT of ~4 ms (10GbE network cards everywhere) for a ping between client and servers. If I have read the iozone source code correctly, iozone reports bandwidth in KB/s and includes the time for the open() call, but not for close().

If I write a single 64-byte file (with one I/O request), iozone reports something like '295548', which means ~295 MB/s. Dividing the file size by the bandwidth gives the time that was needed to create the file and write the 64 bytes (with a single request). In this case, that time is about 0.2 microseconds, which is way below the RTT.

That would mean that, on a Lustre file system, if I create a file and write 64 bytes, the client sends two(?) RPCs to the server and does not wait for their completion. Is this correct? But it will wait for the completion of both RPCs when the file is closed?

The numbers look different when I disable the client-side cache by setting max_dirty_mb to 0.

Regards, Michael

-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge@tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
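A quick sanity check of that arithmetic in Python (a sketch only; it assumes iozone's "KB" means 1024 bytes, and the numbers are the ones quoted above):

    # Derive the apparent I/O time from the bandwidth iozone reports.
    FILE_SIZE_BYTES = 64        # size of the test file
    REPORTED_KB_PER_S = 295548  # iozone result, in KB/s
    RTT_S = 4e-3                # measured ping RTT, ~4 ms

    bandwidth_bytes_per_s = REPORTED_KB_PER_S * 1024
    apparent_time_s = FILE_SIZE_BYTES / bandwidth_bytes_per_s

    print(f"apparent I/O time: {apparent_time_s * 1e6:.2f} us")   # ~0.21 us
    print(f"fraction of one RTT: {apparent_time_s / RTT_S:.1e}")  # ~5.3e-05

So the reported bandwidth does imply roughly 0.2 microseconds per file, about five orders of magnitude below one network round trip.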
> we have a testbed running with Lustre 1.8.3 and a RTT of ~4ms (10GE > network cards everywhere) for a ping between client and servers. If I > have read the iozone source code correctly, iozone reports bandwidth in > KB/s and includes the time for the open() call, but not for close(). IfTo include the close time you need to add the -c flag. The -e should include fsync/fflush in the time.> I write a single 64 byte file (with one I/O request), iozone tells me > something like ''295548'', which means ~295 MB/s. Dividing the file size > by the bandwidth, I get the time that was needed to create the file and > write the 64 bytes (with a single request). In this case, the time is > about 0,2 micro seconds which is way below the RTT.That seems oddly fast for such a small file over any latency. Since you shouldn''t even be able to lock the file in that time.> That mean for a Lustre file system, if I create a file and write 64 > bytes, the client sends two(?) RPCs to the server and does not wait for > the completion. Is this correct? But it will wait for the completion of > both RPCs when the file ist closed?You can see what Lustre is doing if the client isn''t doing any other activity and you enable some tracing. "sysctl -w lnet.debug=''+rpctrace vfstrace''" should allow you to see the VFS ops ll_file_open, ll_file_writev/ll_file_aio_write, ll_file_release, along with any RPCs generated by them. You should see an RPC for the file open which will be a 101 opcode for requesting the lock and you should see a reply AFAIK before the client actually attempts to write any data. So that should bring your time upto at least 4 ms for 1 RTT. The initial write should request a lock from the first stripe followed by a OST write RPC (opcode 4) followed by a file close (opcode 35). I ran a test over 4 ms latency so you can see what I''m referring to. I thought that there was a patch in Lustre a few months back that forced a flush before a file close, but this is from a 1.8.5 client so I''m guessing that isn''t how it works because between when I closed the file and the end I had to "sync" for the OST write to show up. 
00000100:00100000:3:1299677923.962151:0:31223:0:(client.c:2087:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706288:192.168.66.2@tcp8:101
00000100:00100000:3:1299677923.966512:0:31223:0:(client.c:2192:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706288:192.168.66.2@tcp8:101
00000080:00200000:3:1299677923.966562:0:31223:0:(file.c:419:ll_file_open()) VFS Op:inode=219/2880174574(ffff8801e1a68550), flags 101101
00000100:00100000:3:1299677923.966615:0:31223:0:(client.c:2087:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706292:192.168.66.3@tcp8:101
00000100:00100000:3:1299677923.970994:0:31223:0:(client.c:2192:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706292:192.168.66.3@tcp8:101
00000080:00200000:3:1299677923.971058:0:31223:0:(file.c:256:ll_file_release()) VFS Op:inode=219/2880174574(ffff8801e1a68550)
00000100:00100000:3:1299677923.971078:0:31223:0:(client.c:2087:ptlrpc_queue_wait()) Sending RPC pname:cluuid:pid:xid:nid:opc bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706296:192.168.66.2@tcp8:35
00000100:00100000:3:1299677923.975447:0:31223:0:(client.c:2192:ptlrpc_queue_wait()) Completed RPC pname:cluuid:pid:xid:nid:opc bash:052a94ea-8d57-ac3e-829e-890e0bb65639:31223:x1362774702706296:192.168.66.2@tcp8:35
00000100:00100000:0:1299677927.817739:0:2894:0:(client.c:1177:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:052a94ea-8d57-ac3e-829e-890e0bb65639:2894:x1362774702706304:192.168.66.3@tcp8:4
00000100:00100000:0:1299677927.826642:0:2894:0:(client.c:1444:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:052a94ea-8d57-ac3e-829e-890e0bb65639:2894:x1362774702706304:192.168.66.3@tcp8:4

> The numbers look different when I disable the client-side cache by
> setting max_dirty_mb to 0.

Without any grant I think all RPCs have to be synchronous, so you'll see a huge performance hit over latency.

Jeremy
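The per-RPC latencies can be read straight out of a trace like this one. A hypothetical throwaway parser (a sketch; it assumes the dump was saved to a file, e.g. with "lctl dk /tmp/trace" after the run, and that the lines look like the ones above):

    #!/usr/bin/env python3
    # Pair "Sending RPC" / "Completed RPC" trace lines by xid and print
    # the round-trip time of each RPC together with its opcode.
    import re
    import sys

    # The timestamp is the 4th colon-separated field; xid, NID and
    # opcode are the last fields of the line: ...:x<xid>:<nid>:<opcode>
    pat = re.compile(
        r'^\w+:\w+:\d+:(?P<ts>\d+\.\d+):.*'
        r'(?P<ev>Sending|Completed) RPC .*:x(?P<xid>\d+):(?P<nid>\S+):(?P<opc>\d+)\s*$'
    )

    sent = {}  # xid -> (send timestamp, opcode)
    for line in open(sys.argv[1]):
        m = pat.search(line)
        if not m:
            continue  # e.g. the VFS-op lines
        if m['ev'] == 'Sending':
            sent[m['xid']] = (float(m['ts']), m['opc'])
        elif m['xid'] in sent:
            t0, opc = sent.pop(m['xid'])
            dt_ms = (float(m['ts']) - t0) * 1e3
            print(f"opc {opc:>3}  xid x{m['xid']}: {dt_ms:.2f} ms")

On the trace above it would show the two lock RPCs (opcode 101) and the close (opcode 35) each costing roughly one 4 ms RTT, while the OST write (opcode 4) is not even sent until almost four seconds later, after the "sync".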
Hi Jeremy,

> > If I write a single 64-byte file (with one I/O request), iozone reports
> > something like '295548', which means ~295 MB/s. Dividing the file size
> > by the bandwidth gives the time that was needed to create the file and
> > write the 64 bytes (with a single request). In this case, that time is
> > about 0.2 microseconds, which is way below the RTT.
>
> That seems oddly fast for such a small file over any latency, since you
> shouldn't even be able to lock the file in that time.

OK, thanks. So I should see at least a latency of one RTT. I think I need to dig through the data again. I might have made a mistake in one of the formulas ...

> > That would mean that, on a Lustre file system, if I create a file and
> > write 64 bytes, the client sends two(?) RPCs to the server and does not
> > wait for their completion. Is this correct? But it will wait for the
> > completion of both RPCs when the file is closed?
>
> You can see what Lustre is doing if the client isn't doing any other
> activity and you enable some tracing. "sysctl -w lnet.debug='+rpctrace
> vfstrace'" should let you see the VFS ops ll_file_open,
> ll_file_writev/ll_file_aio_write and ll_file_release, along with any
> RPCs generated by them. You should see an RPC for the file open, which
> will be opcode 101 for requesting the lock, and AFAIK you should see
> the reply before the client actually attempts to write any data. So
> that alone should bring your time up to at least 4 ms for one RTT. The
> initial write should request a lock from the first stripe, followed by
> an OST write RPC (opcode 4), followed by a file close (opcode 35). I
> ran a test over 4 ms latency so you can see what I'm referring to. I
> thought there was a patch in Lustre a few months back that forced a
> flush before a file close, but this is from a 1.8.5 client, so I'm
> guessing that isn't how it works, because between when I closed the
> file and the end I had to "sync" for the OST write to show up.

OK, understood.

> > The numbers look different when I disable the client-side cache by
> > setting max_dirty_mb to 0.
>
> Without any grant I think all RPCs have to be synchronous, so you'll see
> a huge performance hit over latency.

These numbers do look different :) I am still trying to make sense of a couple of measurements and to put some useful data into charts for the LUG this year.

Michael

-- 
Michael Kluge, M.Sc.

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:    (+49) 351 463-37773
e-mail: michael.kluge@tu-dresden.de
WWW:    http://www.tu-dresden.de/zih
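For what it's worth, a rough lower-bound model of what a fully synchronous 64-byte create/write/close should cost over this link (a sketch under the assumptions discussed in the thread, not a measurement; the RPC counts are taken from Jeremy's trace):

    # With max_dirty_mb = 0 there is no grant, so each RPC in the
    # open -> lock -> write -> close chain must complete before the next
    # one starts; the file lifetime is then bounded below by n_rpcs * RTT.
    RTT_S = 4e-3            # ~4 ms, from the ping measurement
    FILE_SIZE_BYTES = 64

    for n_rpcs in (2, 3, 4):  # plausible counts, per the trace above
        t_s = n_rpcs * RTT_S
        kb_per_s = FILE_SIZE_BYTES / 1024 / t_s
        print(f"{n_rpcs} sequential RPCs: >= {t_s * 1e3:.0f} ms, "
              f"so iozone should report <= {kb_per_s:.1f} KB/s")

Anything far above a few KB/s for this workload would therefore have to come from client-side caching.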