thr3ads.net - Lustre discuss - [Lustre-discuss] ost_brw

Mag Gam

2008-Nov-12 12:17 UTC

[Lustre-discuss] ost_brw_write()

Hello All,

Recently, we have been noticing some errors on our Lustre installation
at our university.

We noticed:
LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before
arrival at OST: from 192.168.0.3@tcp inum (somenumber)/(somenumber)
object (some number)/0 extent [0-4095]

It's actually coming from 2 particular hosts: one is an OSS, the other
is a particular client.

I also see @@@ redo for unrecoverable error req@fff8xxxxxxxxxxxxxxxxxxxx

Any thoughts on how I can get rid of these messages?

Using:
1.6.5.52 (OSS/OST)
1.6.6 (client)

Brian J. Murrell

2008-Nov-12 13:10 UTC

[Lustre-discuss] ost_brw_write()

On Wed, 2008-11-12 at 07:17 -0500, Mag Gam wrote:
> We noticed:
> LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before
> arrival at OST: from 192.168.0.3@tcp inum (somenumber)/(somenumber)
> object (some number)/0 extent [0-4095]
>
> It's actually coming from 2 particular hosts: one is an OSS, the
> other is a particular client.
>
> I also see @@@ redo for unrecoverable error req@fff8xxxxxxxxxxxxxxxxxxxx
>
> Any thoughts on how I can get rid of these messages?

Assuming it's not a bug in Lustre, fix whatever is mangling the data
before it arrives at the OST.  Do you have errors on your networking
fabric, or on the interfaces of the hosts on either end of the
transaction?

b.

Andreas Dilger

2008-Nov-12 21:32 UTC

[Lustre-discuss] ost_brw_write()

On Nov 12, 2008  08:10 -0500, Brian J. Murrell wrote:
> On Wed, 2008-11-12 at 07:17 -0500, Mag Gam wrote:
> > We noticed:
> > LustreError: 132-0: BAD WRITE CHECKSUM: changed in transit before
> > arrival at OST: from 192.168.0.3@tcp inum (somenumber)/(somenumber)
> > object (some number)/0 extent [0-4095]
> >
> > It's actually coming from 2 particular hosts: one is an OSS, the
> > other is a particular client.
> >
> > I also see @@@ redo for unrecoverable error
> > req@fff8xxxxxxxxxxxxxxxxxxxx
> >
> > Any thoughts on how I can get rid of these messages?
>
> Assuming it's not a bug in Lustre, fix whatever is mangling the data
> before it arrives at the OST.  Do you have errors on your networking
> fabric, or on the interfaces of the hosts on either end of the
> transaction?

Note that a similar error can also happen in the case of an application
doing mmap IO, which the Linux kernel does not prevent from modifying
the page even while it is being RDMA'd over the network, so it is hard
for Lustre to provide a checksum for.

The client would have printed a message like the following in that case:

	"BAD WRITE CHECKSUM: changed in transit AND doesn't match the
	 original - likely false positive due to mmap IO (bug 11742)"

If the client's copy of the data has not changed, and the checksum
is still correct, then it points to data corruption on the network
(probably in the NIC itself if it is specific to one node).
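
The distinction drawn above can be written as a small decision sketch. It is illustrative only and is not Lustre's code: the function name `classify_mismatch` and its return strings are hypothetical, and `zlib.crc32` merely stands in for whichever checksum Lustre actually uses.

```python
import zlib


def classify_mismatch(original_csum: int, client_page: bytes, server_csum: int) -> str:
    """Classify a write-checksum mismatch reported by the server.

    original_csum: checksum the client computed before sending
    client_page:   the client's current copy of the data
    server_csum:   checksum the server computed over what it received
    """
    if server_csum == original_csum:
        return "ok"
    # Re-checksum the client's copy to see whether it changed after sending.
    if zlib.crc32(client_page) != original_csum:
        return "client page changed (possible mmap false positive)"
    return "client copy unchanged (corrupted after checksumming, e.g. NIC or network)"
```

Only the last case points at the hardware path between the two hosts.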

Note that since the NIC is doing the TCP checksumming itself, this kind
of error won't be caught by TCP packet checksums because the data is
already corrupted in the NIC memory before the TCP checksum is computed.

This specific problem was actually hit by a customer and is one of the
reasons why Lustre does its own data checksum, instead of depending on
the TCP layer to deliver the data without any errors.
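
A toy model of that failure mode, as a sketch under stated assumptions: `send_with_offload` and `receive` are hypothetical names, the "TCP checksum" is a plain RFC 1071 style 16-bit sum over the payload only (real TCP also covers a pseudo-header), and `zlib.crc32` stands in for the end-to-end checksum computed by the host before the data reaches the NIC.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def send_with_offload(payload: bytes, corrupt_in_nic: bool):
    """Model a sender whose NIC computes the TCP checksum (TX offload)."""
    app_csum = zlib.crc32(payload)        # end-to-end checksum, computed by the host
    nic_buf = bytearray(payload)          # payload copied into NIC memory
    if corrupt_in_nic:
        nic_buf[0] ^= 0x01                # hypothetical bit flip inside the NIC
    tcp_csum = internet_checksum(bytes(nic_buf))  # computed AFTER the corruption
    return bytes(nic_buf), tcp_csum, app_csum


def receive(data: bytes, tcp_csum: int, app_csum: int):
    """Return (tcp_checksum_ok, end_to_end_checksum_ok)."""
    return internet_checksum(data) == tcp_csum, zlib.crc32(data) == app_csum


print(receive(*send_with_offload(b"x" * 4096, corrupt_in_nic=True)))  # (True, False)
```

The transport-level check passes because it was computed over the already damaged bytes; only the checksum taken before the data entered the NIC notices the change.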

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Mag Gam

2008-Nov-13 02:53 UTC

[Lustre-discuss] ost_brw_write()

This makes sense. I do see RX dropped packets.

I first want to figure this out.

Also, I had bonding mode enabled, but it seems I need to go to
independent modes. Does Lustre do bonding at the filesystem level, or is
it preferred to do it in the OS?
TIA


Mag Gam

2008-Nov-14 02:32 UTC

[Lustre-discuss] ost_brw_write()

OK.

It seems Lustre FS is dropping the packets. I did multiple FTPs and
they were very large files (10GB each), and no packet drops; however,
when I copy the files onto Lustre I get these packet drops.
I am using Intel-based NICs and took the RX parameters into consideration,
but I am still dropping packets very heavily. Is there anything else I
should look into?

BTW, the network guy sees no packet drops on the router side...
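
One way to watch the host-side RX drop and error counters during such a copy is to sample `/proc/net/dev` before and after. The sketch below only parses the text; the function name `parse_rx_counters` is hypothetical, and it assumes the usual Linux column order (receive bytes, packets, errs, drop, ...).

```python
def parse_rx_counters(proc_net_dev: str) -> dict:
    """Return {iface: {"rx_errs": int, "rx_drop": int}} from /proc/net/dev text."""
    counters = {}
    for line in proc_net_dev.splitlines()[2:]:      # skip the two header lines
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        # Receive columns: bytes packets errs drop fifo frame compressed multicast
        counters[name.strip()] = {
            "rx_errs": int(values[2]),
            "rx_drop": int(values[3]),
        }
    return counters
```

Calling it on `open("/proc/net/dev").read()` before and after a transfer and subtracting the two results shows which interface, if any, accumulated drops during the copy.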



Brian J. Murrell

2008-Nov-14 12:11 UTC

[Lustre-discuss] ost_brw_write()

On Thu, 2008-11-13 at 21:32 -0500, Mag Gam wrote:
> OK.
>
> It seems Lustre FS is dropping the packets.

No.  Nobody said anything about packets being dropped.  They are failing
checksum.

> I did multiple FTPs and
> they were very large files (10GB each), and no packet drops

Did you verify the contents of what you ftp'd matched the original?
Are you using the same machines in your ftp tests that are reporting
checksum failures with Lustre?
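
A minimal sketch of such a content check, assuming both the original and the transferred copy are readable as local paths; the helper names `file_digest` and `same_content` are hypothetical, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that 10GB files do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_content(src: str, dst: str) -> bool:
    """True if both files hash to the same digest."""
    return file_digest(src) == file_digest(dst)
```

Comparing digests computed on the sending and the receiving host separately avoids having one machine read both copies, which could otherwise mask a problem that is specific to one node.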

You might want to look in our test suite and see if there is a checksum
unit test.  I'd be surprised if there is not.  Maybe run that and see
what the results are.  I'm afraid I don't have a Lustre source tree very
handy at the moment to check for you.

b.

Mag Gam

2008-Nov-15 15:59 UTC

[Lustre-discuss] ost_brw_write()

Brian, thanks for getting back to me.

Yes. The contents matched, but getting the RX drops is kind of
scary. I am using the same machine when doing the test.

I have already looked at the LNET tests:

http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html#50642990_pgfId-1290255

For some reason, "lst add_group servers ipaddrs_of_OSS_and_MDS" gets
me an RPC error, but it seems my 5 servers get added. Weird. Is there
better documentation, or perhaps an example, for the LNET tests? I am
curious to try it.

BTW, I am very happy to see this:
http://manual.lustre.org/manual/LustreManual16_HTML/LustreTuning.html#50642992_24952
(last section, regarding CRC). Where can I read more about this?

Keep in mind, I am using e1000 NICs, and I think there is some tuning
I should be doing (but I am not certain if I am doing the right
tuning).

TIA

Mag Gam

2008-Dec-31 03:06 UTC

[Lustre-discuss] ost_brw_write()

I have done the tuning but still occasionally get a CSUM error, about
200 per day.  Considering we probably transfer close to 500G to 1TB
of data a day, that is not that bad.
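
To put those figures in proportion, here is a rough calculation. It assumes each reported error covers a single 4096-byte extent (as in the `[0-4095]` log line at the top of the thread) and decimal units for 500G and 1TB; both are assumptions, and `pages_per_error` is a hypothetical helper.

```python
def pages_per_error(bytes_per_day: float, errors_per_day: int, page_size: int = 4096) -> float:
    """Average number of page-sized extents written per reported checksum error."""
    return (bytes_per_day / page_size) / errors_per_day


for label, volume in (("500 GB/day", 500e9), ("1 TB/day", 1e12)):
    print(f"{label}: about 1 bad extent in {pages_per_error(volume, 200):,.0f}")
```

Under those assumptions the rate works out to roughly one bad 4 KiB extent in every 0.6 to 1.2 million written.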

I did the tuning on the e1000 card but I am not sure what else to do.
The network guys found nothing wrong with their switch, and the cables
are fine (we even got them replaced).

Since Lustre has its own checksumming, I suppose I am in good shape...




Kevin Van Maren

2008-Dec-31 17:34 UTC

[Lustre-discuss] ost_brw_write()

I have previously observed cases where the RX checksum offload NIC would
pass packets up to Linux as "good" if the Ethernet CRC was valid, even
though the UDP checksum failed (for some reason it appeared that
something (the sender?) was corrupting a byte in the payload after
calculating the UDP csum, but before the Ethernet CRC was calculated).

So disable any NIC offloading on both sides (ethtool) and see if the
Lustre csum errors go away.

Also note that if you are using mmap files, it is _expected_ that the
csum might not match, as the page can be modified between when the csum
is calculated by Lustre and when the page is actually transmitted.
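
The mmap race can be shown with a deterministic toy model. This is only a sketch: `checksum_then_send` is a hypothetical name, the "application write" is simulated inline rather than by a second thread, and `zlib.crc32` stands in for Lustre's checksum.

```python
import zlib


def checksum_then_send(page: bytearray, modify_before_send: bool):
    """Return (checksum computed by the client, checksum the server computes)."""
    client_csum = zlib.crc32(page)     # client checksums the page first
    if modify_before_send:
        page[0] ^= 0xFF                # application writes through its mapping
    wire_data = bytes(page)            # page contents are read again at transmit time
    return client_csum, zlib.crc32(wire_data)


print(checksum_then_send(bytearray(4096), modify_before_send=True))
```

With `modify_before_send=True` the two values differ even though nothing was corrupted in transit, which is why this case is treated as a likely false positive.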

Kevin


Mag Gam

2008-Dec-31 20:25 UTC

[Lustre-discuss] ost_brw_write()

Kevin:

Thanks for the response.

What do I need to change using ethtool? BTW, I am using Ethernet
bonding to increase bandwidth. I suspect this could be causing the
problem...

I am not sure if my applications are using mmap(). I am not aware of
an easy way to determine if they are.
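
One possible way to check, sketched here under assumptions: on Linux, `/proc/<pid>/maps` lists a process's current mappings, and a shared, writable, file-backed mapping shows permissions such as `rw-s`. The function name `shared_file_mappings` and the mount point `/mnt/lustre` used in the usage note are hypothetical.

```python
def shared_file_mappings(maps_text: str, mount_prefix: str) -> list:
    """Return paths under mount_prefix that are mapped shared and writable."""
    prefix = mount_prefix.rstrip("/") + "/"
    found = []
    for line in maps_text.splitlines():
        parts = line.split(None, 5)    # address perms offset dev inode [pathname]
        if len(parts) < 6:
            continue                   # anonymous mapping, no pathname
        perms, path = parts[1], parts[5]
        if "w" in perms and perms.endswith("s") and path.startswith(prefix):
            found.append(path)
    return found
```

For example, `shared_file_mappings(open("/proc/1234/maps").read(), "/mnt/lustre")` would inspect a process with PID 1234. This is only a snapshot: a file that is mapped and unmapped quickly will not appear.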



Kevin Van Maren

2008-Dec-31 22:09 UTC

[Lustre-discuss] ost_brw_write()

Check the man page, as it varies somewhat across releases, but basically
disable the TX and RX checksum offload, and the large TCP segmentation
offload:

ethtool -K eth0 rx off
ethtool -K eth0 tx off
ethtool -K eth0 tso off

"ethtool -k eth0" should report them all off.  Repeat for all
interfaces, since you are doing bonding.

If you do that on all the clients and servers, and if the problem goes
away, turn them back on one at a time to see which is causing your problems.
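
With several bonded interfaces on many hosts, the command list can be generated rather than typed. The sketch below only builds the command lines and does not run anything; the helper names and the interface names `eth0` and `eth1` are assumptions for illustration.

```python
OFFLOADS = ("rx", "tx", "tso")


def offload_commands(interfaces, state="off", features=OFFLOADS):
    """Build the `ethtool -K` command lines for every interface and feature."""
    return [["ethtool", "-K", iface, feat, state]
            for iface in interfaces for feat in features]


def reenable_plan(interfaces, features=OFFLOADS):
    """One step per feature: turn it back on everywhere, then watch for errors."""
    return [offload_commands(interfaces, "on", (feat,)) for feat in features]


for cmd in offload_commands(["eth0", "eth1"]):
    print(" ".join(cmd))
```

Each inner list could be passed to `subprocess.run(cmd, check=True)` as root. `reenable_plan` mirrors the suggestion above: apply one step, wait to see whether checksum errors return, then move on to the next feature.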

Kevin

