Scott,

> Please let me know if you are still not back full time. If so, I can
> post this to the devel list.

Yes; I'm back :)

> We are working on cleaning up MX on client reconnect (so MXLND can
> cancel outstanding txs and rxs). At the same time, we are merging our
> MX-2G and MX-10G code branches and we may have introduced a bug in MX
> that I am seeing in MXLND. I am using your ltest scripts and I have
> not seen these messages before:
>
> Aug 29 07:58:59 172.31.164.34 kernel: LustreError: 5642:0:(debug.c:167:block_debug_check()) echo: id 0x3 offset 556720128 end off: 0x212de000 != 0x212ee000
> Aug 29 07:58:59 172.31.164.34 kernel: LustreError: 5642:0:(debug.c:155:block_debug_check()) echo: id 0x3 offset 556724224 off: 0x212df000 != 0x212ef000
>
> I assume that LNET is checking the bulk read and writes to ensure the
> correct data was transferred. What is it checking specifically?

Yes indeed. Both client and server check a few locations at the start
and end of each page of bulk data. Look for block_debug_setup() and
block_debug_check(), and where obdecho and echo_client call them to
set/check the object number and file offset in each block (page) of
bulk I/O. It looks like data for the previous block got into this one.

Cheers,
Eric

Eric Barton
Barton Software
9 York Gardens, Clifton, Bristol BS8 4LL, United Kingdom
Tel: +44 (117) 330 1575
Mobile: +44 (7909) 680 356
Fax: call first
E-Mail: eeb@bartonsoftware.com
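To make the check described above concrete, here is a minimal sketch of the general technique: stamp the file offset and object id at the start and end of each page before the transfer, then compare them afterwards. This is not the Lustre code. The names stamp_page(), check_page() and struct page_stamp, the stamp layout, and the 4096-byte page size are all assumptions made for illustration; the real logic lives in block_debug_setup() and block_debug_check().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096u /* assumed page size, not taken from the thread */

/* Hypothetical stamp layout: file offset followed by object id. */
struct page_stamp {
    uint64_t offset;
    uint64_t id;
};

/* Write the same stamp at the start and at the end of one page. */
static void stamp_page(unsigned char *page, uint64_t offset, uint64_t id)
{
    struct page_stamp s = { offset, id };

    assert(page != NULL);
    memcpy(page, &s, sizeof(s));
    memcpy(page + PAGE_BYTES - sizeof(s), &s, sizeof(s));
}

/*
 * Compare both stamps against the expected values.
 * Returns 0 when everything matches, otherwise a bitmask:
 *   1 = start offset, 2 = start id, 4 = end offset, 8 = end id.
 */
static int check_page(const unsigned char *page, uint64_t offset, uint64_t id)
{
    struct page_stamp head, tail;
    int bad = 0;

    assert(page != NULL);
    memcpy(&head, page, sizeof(head));
    memcpy(&tail, page + PAGE_BYTES - sizeof(tail), sizeof(tail));

    if (head.offset != offset) bad |= 1;
    if (head.id != id)         bad |= 2;
    if (tail.offset != offset) bad |= 4;
    if (tail.id != id)         bad |= 8;
    return bad;
}
```

With a scheme like this, a page that ends up holding another page's data fails the offset comparison at both ends while the object id still matches, which is the pattern in the two log lines quoted above ("off" and "end off" mismatches for id 0x3).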
On Aug 29, 2006, at 9:13 AM, Eric Barton wrote:

> Scott,
>
>> Please let me know if you are still not back full time. If so, I can
>> post this to the devel list.
>
> yes; I'm back :)

Glad to hear it.

>> We are working on cleaning up MX on client reconnect (so MXLND can
>> cancel outstanding txs and rxs). At the same time, we are merging our
>> MX-2G and MX-10G code branches and we may have introduced a bug in MX
>> that I am seeing in MXLND. I am using your ltest scripts and I have
>> not seen these messages before:
>>
>> Aug 29 07:58:59 172.31.164.34 kernel: LustreError: 5642:0:(debug.c:167:block_debug_check()) echo: id 0x3 offset 556720128 end off: 0x212de000 != 0x212ee000
>> Aug 29 07:58:59 172.31.164.34 kernel: LustreError: 5642:0:(debug.c:155:block_debug_check()) echo: id 0x3 offset 556724224 off: 0x212df000 != 0x212ef000
>>
>> I assume that LNET is checking the bulk read and writes to ensure the
>> correct data was transferred. What is it checking specifically?
>
> yes indeed. Both client and server do checks on a few locations at the
> start and end of each page of bulk data. Look for block_debug_setup()
> and block_debug_check(), and where obdecho and echo_client call them to
> set/check the object number and file offset in each block (page) of
> bulk I/O. It looks like data for the previous block got into this one.
>
> Cheers,
> Eric

That's what I was afraid of. I am looking through the debug code now,
and I have added more information to the block to help us identify the
problem.

Thanks,

Scott
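As a quick aid for reading the two log lines, the sketch below computes how far apart the two hex values in each message are, in pages. It assumes a 4096-byte page, and it assumes the right-hand value is the expected offset, since in both messages it equals the decimal offset (556720128 is 0x212ee000 and 556724224 is 0x212ef000). The helper name page_delta() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u /* assumed page size */

/*
 * Signed distance, in pages, from the expected offset to the offset
 * actually found in the block. Negative means the data found belongs
 * to an earlier position in the file.
 */
static int64_t page_delta(uint64_t found, uint64_t expected)
{
    /* Both values in the log are page-aligned; this sketch relies on that. */
    assert(found % PAGE_BYTES == 0);
    assert(expected % PAGE_BYTES == 0);
    return ((int64_t)found - (int64_t)expected) / (int64_t)PAGE_BYTES;
}
```

For both messages, page_delta(0x212de000, 0x212ee000) and page_delta(0x212df000, 0x212ef000), the difference is 0x10000 bytes, so under these assumptions the stamps found sit 64 KiB (16 pages) before the expected offsets.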