Hello All,

Recently we have been noticing some errors on our Lustre installation at our university. We noticed:

LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before arrival at OST: from 192.168.0.3@tcp inum (somenumber)/(somenumber) object (some number)/0 extent [0-4095]

It's actually coming from 2 particular hosts (1 OSS), another from 1 particular client.

I also see:

@@@ redo for unrecoverable error req@fff8xxxxxxxxxxxxxxxxxxxx

Any thoughts on how I can get rid of these messages?

We are using 1.6.5.52 (OSS/OST) and 1.6.6 (client).
On Wed, 2008-11-12 at 07:17 -0500, Mag Gam wrote:
> Any thoughts how can I get rid of these messages?

Assuming it's not a bug in Lustre, fix whatever is mangling the data before it arrives at the OST. Do you have errors on your networking fabric, or on the interfaces of the hosts on either end of the transaction?

b.
On Nov 12, 2008 08:10 -0500, Brian J. Murrell wrote:
> Assuming it's not a bug in Lustre, fix whatever is mangling the data
> before it arrives at the OST. Do you have errors on your networking
> fabric, or on the interfaces of the hosts on either end of the
> transaction?

Note that a similar error can also happen when an application is doing mmap IO. The Linux kernel does not prevent the application from modifying a page even while it is being RDMA'd over the network, so it is hard for Lustre to provide a reliable checksum for it.

The client would have printed a message like the following in that case:

"BAD WRITE CHECKSUM: changed in transit AND doesn't match the original - likely false positive due to mmap IO (bug 11742)"

If the client's copy of the data has not changed, and the checksum is still correct, then it points to data corruption on the network (probably in the NIC itself if it is specific to one node).

Note that since the NIC is doing the TCP checksumming itself, this kind of error won't be caught by TCP packet checksums, because the data is already corrupted in the NIC memory before the TCP checksum is computed.

This specific problem was actually hit by a customer and is one of the reasons why Lustre does its own data checksum instead of depending on the TCP layer to deliver the data without any errors.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
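To make the offload argument concrete, here is a minimal simulation in Python. It is only a sketch, not Lustre or kernel code: the function names are made up for illustration, and CRC32 stands in for both the TCP checksum and Lustre's own data checksum. The point it shows is that a checksum computed by the NIC over already-corrupted NIC memory still verifies at the receiver, while a checksum taken from the application's buffer does not.

```python
import zlib


def send_via_nic(payload: bytes, nic_corrupts: bool):
    """Model a sender whose NIC computes the transport checksum (offload)."""
    # End-to-end checksum taken from the application's own buffer,
    # before the data is handed to the NIC.
    end_to_end_csum = zlib.crc32(payload)

    # Copy into NIC memory; a faulty NIC may flip a bit at this point.
    nic_buffer = bytearray(payload)
    if nic_corrupts and nic_buffer:
        nic_buffer[0] ^= 0x01

    # With offload, the transport checksum is computed over NIC memory,
    # i.e. over data that may already be corrupted.
    transport_csum = zlib.crc32(bytes(nic_buffer))
    return bytes(nic_buffer), transport_csum, end_to_end_csum


def verify_at_receiver(wire_data: bytes, transport_csum: int, end_to_end_csum: int):
    """Return (transport_ok, end_to_end_ok) as seen by the receiver."""
    received = zlib.crc32(wire_data)
    return received == transport_csum, received == end_to_end_csum
```

With nic_corrupts=True the receiver sees a valid transport checksum and a failed end-to-end checksum, which is the situation the BAD WRITE CHECKSUM message reports.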
This makes sense. I do see RX dropped packets. I want to figure this out first.

Also, I had bonding mode enabled, but it seems I need to go to independent modes. Does Lustre do bonding at the filesystem level, or is it preferred to go with the OS?

TIA
OK.

It seems Lustre FS is dropping the packets. I did multiple FTPs of very large files (10GB each) and saw no packet drops; however, when I copy the files onto Lustre I get these packet drops.

I am using Intel-based NICs and took the Rx params into consideration, but I am still dropping packets very heavily. Is there anything else I should look into?

BTW, the network guy sees no packet drops on the router side...
On Thu, 2008-11-13 at 21:32 -0500, Mag Gam wrote:
> OK.
>
> It seems Lustre FS is dropping the packets.

No. Nobody said anything about packets being dropped. They are failing checksum.

> I did multiple FTPs and
> they were very large files (10GB each), and no packet drops

Did you verify that the contents of what you ftp'd matched the original? Are you using the same machines in your ftp tests that are reporting checksum failures with Lustre?

You might want to look in our test suite and see if there is a checksum unit test. I'd be surprised if there is not. Maybe run that and see what the results are. I'm afraid I don't have a Lustre source tree very handy at the moment to check for you.

b.
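Since drops and checksum failures keep getting mixed up in this thread, here is a small Python sketch for watching the two separately on a Linux host. It assumes the per-interface counters that Linux exposes under /sys/class/net/<iface>/statistics/; the helper names are invented for this example. A dropped packet is normally retransmitted by TCP and does not by itself corrupt data, so a rising rx_dropped is a different symptom from a Lustre checksum error.

```python
from pathlib import Path

# Counter files found under /sys/class/net/<iface>/statistics/ on Linux.
RX_COUNTERS = ("rx_packets", "rx_dropped", "rx_errors",
               "rx_crc_errors", "rx_missed_errors")


def read_rx_counters(stats_dir):
    """Read receive-side counters from a sysfs statistics directory."""
    counters = {}
    for name in RX_COUNTERS:
        path = Path(stats_dir) / name
        if path.is_file():
            counters[name] = int(path.read_text().strip())
    return counters


def rx_delta(before, after):
    """Per-counter difference between two snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}
```

One possible use: take a snapshot with read_rx_counters("/sys/class/net/eth0/statistics") before and after a large copy onto Lustre (eth0 is just an example name), then compare rx_delta against the number of checksum errors logged in the same window.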
Brian,

Thanks for getting back to me.

Yes. The contents matched, but I am getting the RX drops, which is kind of scary. I am using the same machine when doing the test.

I have already looked at the LNET tests:

http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html#50642990_pgfId-1290255

For some reason, "lst add_group servers ipaddrs_of_OSS_and_MDS" gets me an RPC error, but it seems my 5 servers get added. Weird. Is there better documentation, or perhaps an example, for the LNET tests? I am curious to try it.

BTW, I am very happy to see this:

http://manual.lustre.org/manual/LustreManual16_HTML/LustreTuning.html#50642992_24952

(last section, regarding CRC). Where can I read more about this?

Keep in mind, I am using e1000 NICs, and I think there is some tuning I should be doing (but I am not certain if I am doing the right tuning).

TIA
I have done the tuning but still occasionally get a CSUM error, about 200 per day. Considering that we probably transfer close to 500G to 1TB of data a day, that is not that bad.

I did the tuning on the e1000 card, but I am not sure what else to do. The network guys find nothing wrong with their switch, and the cables are fine (we even got them replaced).

Since Lustre has its own checksumming, I suppose I am in good shape...
I have previously observed cases where a NIC with RX checksum offload would pass packets up to Linux as "good" if the Ethernet CRC was valid, even though the UDP checksum failed (for some reason it appeared that something (the sender?) was corrupting a byte in the payload after calculating the UDP csum, but before the Ethernet CRC was calculated).

So disable any NIC offloading on both sides (ethtool) and see if the Lustre csum errors go away.

Also note that if you are using mmap'd files, it is _expected_ that the csum might not match, as the page can be modified between when the csum is calculated by Lustre and when the page is actually transmitted.

Kevin
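Putting the mmap case together with the client-side message Andreas described earlier in the thread, the triage can be sketched roughly as below. This is illustrative Python only, not how Lustre implements it: the function name and return strings are invented, and CRC32 is used as a stand-in checksum.

```python
import zlib


def classify_write_checksum(csum_at_send: int, client_page_now: bytes,
                            csum_at_server: int) -> str:
    """Triage a write based on three views of the same page.

    csum_at_send    -- checksum the client computed before sending
    client_page_now -- the client's current copy of the page
    csum_at_server  -- checksum the server computed on what it received
    """
    if csum_at_server == csum_at_send:
        return "ok"
    if zlib.crc32(client_page_now) != csum_at_send:
        # The client's own copy no longer matches: the page was modified
        # after it was checksummed, as can happen with mmap IO.
        return "likely false positive (page changed on client, e.g. mmap IO)"
    # The client's copy is intact, so the data changed after leaving it.
    return "corrupted in transit (suspect NIC or network)"
```

The second branch corresponds to the "doesn't match the original" message; the last branch is the case worth chasing with the offload settings.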
Kevin:

Thanks for the response.

What do I need to change using ethtool? BTW, I am using Ethernet bonding to increase bandwidth. I suspect this could be causing the problem...

I am not sure if my applications are using mmap(). I am not aware of an easy way to determine if they are.
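On the mmap() question, one rough way to check a running process on Linux is to look at /proc/<pid>/maps for shared, writable mappings of files on the Lustre mount. The Python sketch below does that; the function name and the /mnt/lustre mount point in the usage note are assumptions for illustration. It only sees mappings that exist at the moment the file is read, so a short-lived mapping can be missed.

```python
def shared_file_mappings(maps_text: str, mount_prefix: str):
    """Return paths under mount_prefix that are mapped shared and writable.

    maps_text is the content of /proc/<pid>/maps, whose lines look like:
        address perms offset dev inode pathname
    """
    prefix = mount_prefix.rstrip("/") + "/"
    found = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        perms, path = fields[1], fields[5]
        # perms is e.g. "rw-s": 'w' = writable, trailing 's' = MAP_SHARED.
        if "w" in perms and perms.endswith("s") and path.startswith(prefix):
            if path not in found:
                found.append(path)
    return found
```

For example, shared_file_mappings(open("/proc/1234/maps").read(), "/mnt/lustre") would list the files that process 1234 could be modifying through mmap (1234 is a placeholder pid). An empty list for all your jobs makes the mmap explanation less likely.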
Use the manpage, as it varies somewhat across releases, but basically disable the TX and RX checksum offload, and the TCP segmentation offload:

    ethtool -K eth0 rx off
    ethtool -K eth0 tx off
    ethtool -K eth0 tso off

"ethtool -k eth0" should report them all off. Repeat for all interfaces, since you are doing bonding.

If you do that on all the clients and servers, and if the problem goes away, turn them back on one at a time to see which is causing your problems.

Kevin
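Since the settings have to be repeated on every slave of the bond, a small helper can generate the commands. The Python sketch below is one way to do it, assuming the Linux bonding driver's /sys/class/net/<bond>/bonding/slaves file (a space-separated list of interface names); the function names and the bond0 name in the usage note are hypothetical. It only builds the three ethtool invocations shown above and leaves running them to you.

```python
OFFLOAD_FEATURES = ("rx", "tx", "tso")


def parse_bond_slaves(slaves_text: str):
    """Split the content of /sys/class/net/<bond>/bonding/slaves into names."""
    return slaves_text.split()


def offload_commands(interfaces, state="off"):
    """Build one 'ethtool -K <iface> <feature> <state>' argv per feature."""
    return [["ethtool", "-K", iface, feature, state]
            for iface in interfaces
            for feature in OFFLOAD_FEATURES]
```

A possible use, as root: read /sys/class/net/bond0/bonding/slaves, pass the text through parse_bond_slaves, and hand each list from offload_commands to subprocess.run(cmd, check=True). Calling offload_commands with state="on" for a single feature list is a simple way to turn things back on one at a time.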