Hi,
as I wrote yesterday, I applied all the patches. Unfortunately, it did not
bring the desired results. The same node crashed again today with very similar
messages. I have also attached the messages of the other node, which stayed
alive.
I should also mention that in the meantime I swapped out all of the hardware
to make sure it is not a hardware problem. At first glance it looks like a
network problem to me, but as I already wrote, the two nodes are IBM blades
in the same BladeCenter, directly connected by the BladeCenter's internal
switch. All the other blades in the same BladeCenter show no problems.
I am really at my wits' end and hope you can still help me.
Thanks very much,
- Rainer
+----------------------------------------------+
| These are the messages of the crashing node: |
+----------------------------------------------+
Dec 5 12:58:14 webhost2 kernel: o2net: no longer connected to node webhost1
(num 0) at 10.2.0.70:7777
Dec 5 12:58:14 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395
ERROR: status = -112
Dec 5 12:58:14 webhost2 kernel: (14860,2):dlm_send_remote_convert_request:395
ERROR: status = -112
Dec 5 12:58:14 webhost2 kernel: (14860,2):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:14 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:14 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395
ERROR: status = -112
Dec 5 12:58:14 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:20 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:20 webhost2 kernel: (14860,3):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:20 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:20 webhost2 kernel: (14860,3):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:20 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:20 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:25 webhost2 kernel: (10409,1):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:25 webhost2 kernel: (14860,3):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:25 webhost2 kernel: (8536,0):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:25 webhost2 kernel: (14860,3):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:25 webhost2 kernel: (8536,0):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:25 webhost2 kernel: (10409,1):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:30 webhost2 kernel: (14860,2):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:30 webhost2 kernel: (10409,0):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:30 webhost2 kernel: (8536,1):dlm_send_remote_convert_request:395
ERROR: status = -107
Dec 5 12:58:30 webhost2 kernel: (10409,0):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:30 webhost2 kernel: (8536,1):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
Dec 5 12:58:30 webhost2 kernel: (14860,2):dlm_wait_for_node_death:374
225202289F954729807AACECEBB2D2AC: waiting 5000ms for notification of death of
node 0
+---------------------------------------------------------------------------+
| During that crash, the other (stable) node shows the following messages:   |
+---------------------------------------------------------------------------+
Dec 5 12:58:15 webhost1 kernel: o2net: connection to node webhost2 (num 1) at
10.2.0.71:7777 has been idle for 10 seconds, shutting it down.
Dec 5 12:58:15 webhost1 kernel: (0,2):o2net_idle_timer:1313 here are some times
that might help debug the situation: (tmr 1196859485.13835 now 1196859495.12881
dr 1196859485.13824 adv 1196859485.13837:1196859485.13838 func (434028bd:504)
1196859485.12053:1196859485.12057)
Dec 5 12:58:15 webhost1 kernel: o2net: no longer connected to node webhost2
(num 1) at 10.2.0.71:7777
Dec 5 12:58:15 webhost1 kernel: (8511,2):dlm_send_proxy_ast_msg:457 ERROR:
status = -112
Dec 5 12:58:15 webhost1 kernel: (8511,2):dlm_flush_asts:589 ERROR: status =
-112
Dec 5 12:58:55 webhost1 kernel: (11011,3):ocfs2_replay_journal:1184 Recovering
node 1 from slot 0 on device (147,0)
--------------------------------------------------------------------------------------------
On Monday, December 3, 2007 7:18:12 PM Mark Fasheh wrote:
On Mon, Dec 03, 2007 at 04:45:01AM -0800, rain c wrote:
> Thanks very much for your answer.
> My problem is that I cannot really use kernel 2.6.22, because I also need
> the OpenVZ patch, which is not available in a stable version for 2.6.22. Is
> there a way to backport ocfs2-Retry-if-it-returns-EAGAIN to 2.6.18?
Attached is a pair of patches which applied more cleanly. Basically it
includes another tcp.c fix which the -EAGAIN fix built on top of. Both would
be good for you to have one way or the other. Fair warning though - I don't
really have the ability to test 2.6.18 fixes right now, so you're going to
have to be a bit of a beta tester ;) That said, they look pretty clean to me
so I have a relatively high confidence that they should work.
Be sure to apply them in order:
$ cd linux-2.6.18
$ patch -p1 < 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch
$ patch -p1 < 0002-ocfs2-Backport-sendpage-fix-to-2.6.18.patch
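If you want to confirm beforehand that a patch applies cleanly without
actually modifying the tree, GNU patch's --dry-run option should do the trick:
$ patch -p1 --dry-run < 0001-ocfs2-Backport-message-locking-fix-to-2.6.18.patch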
> Further, I wonder why only one (and always the same one) of my nodes is so
> unstable.
I'm not sure why it would always be one node and not the other. We'd
probably need more detailed information about what's going on to figure
that out. Maybe some combination of user application + cluster stack
conspires to put a larger messaging load on it?
Are there any other ocfs2 messages in your logs for that node?
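Assuming your kernel messages end up in /var/log/messages, something like
this should pull out everything relevant:
$ grep -E 'o2net|ocfs2|dlm' /var/log/messages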
> Are you sure that it cannot be any other problem?
No, not 100% sure. My first hunch was the -EAGAIN bug because your messages
looked exactly like what I saw there. Looking a bit deeper, it seems that your
value (when turned into a signed integer) is -32, which would actually make
it -EPIPE.
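As a purely illustrative aside (not part of the patches below), a few lines
of C are enough to translate these negative status values back into errno
descriptions; on Linux, 32 is EPIPE, and the -107 and -112 in the logs above
correspond to ENOTCONN and EHOSTDOWN:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* o2net/dlm log failures as "status = -NN", i.e. a negated errno.
	 * Strip the sign and ask libc for the matching description. */
	int codes[] = { 32, 107, 112 };	/* EPIPE, ENOTCONN, EHOSTDOWN */
	unsigned int i;

	for (i = 0; i < sizeof(codes) / sizeof(codes[0]); i++)
		printf("status = -%d -> %s\n", codes[i], strerror(codes[i]));

	return 0;
}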
-EPIPE gets returned from several places in the tcp code, in particular
do_tcp_sendpages() and sk_stream_wait_memory(). If you look at the 1st patch
that's attached, you'll see that it fixes some races that occurred when
sending outgoing messages, including when those functions were called. While
I'm not 100% sure these patches will fix it, I definitely think it's the 1st
thing we should try.
By the way, while you're doing this it might be a good idea to also apply
some of the other patches we backported to 2.6.18 a long time ago:
http://www.kernel.org/pub/linux/kernel/people/mfasheh/ocfs2/backports/2.6.18/
If the two patches here work for you, I'll probably just add them to that
directory for others to use.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh@oracle.com
-----Inline Attachment Follows-----
From 42318a6658696711baf25d8bd17e3d2827472d66 Mon Sep 17 00:00:00 2001
From: Zhen Wei <zwei@novell.com>
Date: Tue, 23 Jan 2007 17:19:59 -0800
Subject: ocfs2: Backport message locking fix to 2.6.18
Untested fix, apply at your own risk.
Original commit message follows.
ocfs2: introduce sc->sc_send_lock to protect outbound messages
When there is a lot of multithreaded I/O usage, two threads can collide
while sending out a message to the other nodes. This is due to the lack of
locking between threads while sending out the messages.
When a connected TCP send(), sendto(), or sendmsg() arrives in the Linux
kernel, it eventually comes through tcp_sendmsg(). tcp_sendmsg() protects
itself by acquiring a lock at invocation by calling lock_sock().
tcp_sendmsg() then loops over the buffers in the iovec, allocating
associated sk_buff's and cache pages for use in the actual send. As it does
so, it pushes the data out to tcp for actual transmission. However, if one
of those allocations fails (because a large number of large sends is being
processed, for example), it must wait for memory to become available. It
does so by jumping to wait_for_sndbuf or wait_for_memory, both of which
eventually cause a call to sk_stream_wait_memory(). sk_stream_wait_memory()
contains a code path that calls sk_wait_event(). Finally, sk_wait_event()
contains the call to release_sock(), which drops the socket lock taken at
the start of tcp_sendmsg() and lets a second thread enter tcp_sendmsg() and
interleave its data with the first thread's message.
The following patch adds a lock to the socket container in order to
properly serialize outbound requests.
From: Zhen Wei <zwei@novell.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
---
fs/ocfs2/cluster/tcp.c | 8 ++++++++
fs/ocfs2/cluster/tcp_internal.h | 2 ++
2 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index b650efa..3c5bf4d 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -520,6 +520,8 @@ static void o2net_register_callbacks(struct sock *sk,
sk->sk_data_ready = o2net_data_ready;
sk->sk_state_change = o2net_state_change;
+ mutex_init(&sc->sc_send_lock);
+
write_unlock_bh(&sk->sk_callback_lock);
}
@@ -818,10 +820,12 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
ssize_t ret;
+ mutex_lock(&sc->sc_send_lock);
ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
virt_to_page(kmalloced_virt),
(long)kmalloced_virt & ~PAGE_MASK,
size, MSG_DONTWAIT);
+ mutex_unlock(&sc->sc_send_lock);
if (ret != size) {
mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT
" failed with %zd\n", size, SC_NODEF_ARGS(sc), ret);
@@ -936,8 +940,10 @@ int o2net_send_message_vec(u32 msg_type, u32 key, struct kvec *caller_vec,
/* finally, convert the message header to network byte-order
* and send */
+ mutex_lock(&sc->sc_send_lock);
ret = o2net_send_tcp_msg(sc->sc_sock, vec, veclen,
sizeof(struct o2net_msg) + caller_bytes);
+ mutex_unlock(&sc->sc_send_lock);
msglog(msg, "sending returned %d\n", ret);
if (ret < 0) {
mlog(0, "error returned from o2net_send_tcp_msg=%d\n", ret);
@@ -1068,8 +1074,10 @@ static int o2net_process_message(struct o2net_sock_container *sc,
out_respond:
/* this destroys the hdr, so don't use it after this */
+ mutex_lock(&sc->sc_send_lock);
ret = o2net_send_status_magic(sc->sc_sock, hdr, syserr,
handler_status);
+ mutex_unlock(&sc->sc_send_lock);
hdr = NULL;
mlog(0, "sending handler status %d, syserr %d returned %d\n",
handler_status, syserr, ret);
diff --git a/fs/ocfs2/cluster/tcp_internal.h b/fs/ocfs2/cluster/tcp_internal.h
index ff9e2e2..008fcf9 100644
--- a/fs/ocfs2/cluster/tcp_internal.h
+++ b/fs/ocfs2/cluster/tcp_internal.h
@@ -142,6 +142,8 @@ struct o2net_sock_container {
struct timeval sc_tv_func_stop;
u32 sc_msg_key;
u16 sc_msg_type;
+
+ struct mutex sc_send_lock;
};
struct o2net_msg_handler {
--
1.5.3.4
-----Inline Attachment Follows-----
From 355053cdec5205ff35398d78f5c93a59eeb502ce Mon Sep 17 00:00:00 2001
From: Sunil Mushran <sunil.mushran@oracle.com>
Date: Mon, 30 Jul 2007 11:02:50 -0700
Subject: ocfs2: Backport sendpage() fix to 2.6.18
Untested fix, apply at your own risk.
Original commit message follows.
ocfs2: Retry sendpage() if it returns EAGAIN
Instead of treating EAGAIN, returned from sendpage(), as an error, this
patch retries the operation.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com>
---
fs/ocfs2/cluster/tcp.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 3c5bf4d..29554e5 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -819,17 +819,25 @@ static void o2net_sendpage(struct o2net_sock_container *sc,
struct o2net_node *nn = o2net_nn_from_num(sc->sc_node->nd_num);
ssize_t ret;
-
- mutex_lock(&sc->sc_send_lock);
- ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
- virt_to_page(kmalloced_virt),
- (long)kmalloced_virt & ~PAGE_MASK,
- size, MSG_DONTWAIT);
- mutex_unlock(&sc->sc_send_lock);
- if (ret != size) {
+ while (1) {
+ mutex_lock(&sc->sc_send_lock);
+ ret = sc->sc_sock->ops->sendpage(sc->sc_sock,
+ virt_to_page(kmalloced_virt),
+ (long)kmalloced_virt & ~PAGE_MASK,
+ size, MSG_DONTWAIT);
+ mutex_unlock(&sc->sc_send_lock);
+ if (ret == size)
+ break;
+ if (ret == (ssize_t)-EAGAIN) {
+ mlog(0, "sendpage of size %zu to " SC_NODEF_FMT
+ " returned EAGAIN\n", size, SC_NODEF_ARGS(sc));
+ cond_resched();
+ continue;
+ }
mlog(ML_ERROR, "sendpage of size %zu to " SC_NODEF_FMT
" failed with %zd\n", size, SC_NODEF_ARGS(sc), ret);
o2net_ensure_shutdown(nn, sc, 0);
+ break;
}
}
--
1.5.3.4