Hello all,

last week we saw our first attempt to run something like a real-world
environment on glusterfs fail.

Nevertheless we managed to get a working combination of _one_ server and _one_
client (using a replicate setup with a missing second server). This setup
worked for about 4 days, so yesterday we tried to enable the second server.
Within minutes the first one crashed. Well, really we do not know if it
crashed in the true sense of the word; the situation looked like this:

- server was ping'able
- glusterfsd was disconnected by the client because of missing ping-pong
- no login possible
- no fs action (no lights on the hd-stack)
- no screen (was blank, stayed blank)

This could also be a user-space hang or the CPU busy/looping. We don't know.

The really interesting part is that the server worked for days on its own, but
as soon as dual-server fs action (obviously in combination with self healing)
started, it did not survive 10 minutes. Of course the second server kept
going, but we had to stop the whole thing because the data was not completely
healed, so it made no sense to continue with old copies.

This was glusterfs 2.0.6 with a minimal server setup (storage/posix,
features/locks, performance/io-threads) on a Linux 2.6.25.2 kernel.

Is there someone out there who has experienced something like this? Any ideas?

--
Regards,
Stephan
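For context, a minimal sketch of what a 2.0.x server volfile with only the
translators Stephan names typically looks like. The export directory, thread
count and auth rule are placeholders (not his actual configuration), and a
protocol/server volume is assumed on top so a client can connect at all:

  volume posix
    type storage/posix
    option directory /data/export        # placeholder export path
  end-volume

  volume locks
    type features/locks
    subvolumes posix
  end-volume

  volume iothreads
    type performance/io-threads
    option thread-count 8                # placeholder value
    subvolumes locks
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.iothreads.allow *   # placeholder: allows everyone
    subvolumes iothreads
  end-volume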
Yep, I experience this exact lock-up state on the 2.x train of GlusterFS with
two servers, each with a local client, and have so far given up testing :(

I run 1.3 in production, which still has problems when one of the servers goes
down, and was hoping to move up to 2.x quickly, but can't at the moment. Every
time a new version comes out I update, hoping it will be solved.

Because the machine that hangs, hangs so completely that one can't ssh in and
can't get a proper dump from the process, and any DEBUG log enabled has no
information in it either, I haven't been able to provide anything useful to
the team to work from :(

On 7 Sep 2009, at 15:46, Stephan von Krawczynski wrote:
> [...]
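One way to get anything out of a box that hangs this hard (no ssh, no usable
local console) is to have the kernel log leave the machine over the network
before the hang happens. A rough sketch using the stock netconsole module;
the interface, addresses and MAC below are placeholders, not a tested setup:

  # on the box that hangs: send kernel messages as UDP to a collector
  modprobe netconsole netconsole=@/eth0,6666@192.168.0.10/00:11:22:33:44:55
  # raise the console log level so oops/soft-lockup traces actually go out
  dmesg -n 8

  # on the collector box (192.168.0.10): just record the UDP stream
  nc -u -l 6666 | tee kernel-console.log    # older netcat: nc -u -l -p 6666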
> - server was ping'able
> - glusterfsd was disconnected by the client because of missing ping-pong
> - no login possible
> - no fs action (no lights on the hd-stack)
> - no screen (was blank, stayed blank)

This is very similar to what I have seen many times (even back on 1.3), and
have also commented on the list. It seems that we have quite a few ACKs on
this, or similar problems.

The only thing different in my scenario is that the console doesn't stay
blank. When attempting to log in I get the last-login message and nothing
more, no prompt ever. Also, I can see that other processes are still listening
on sockets etc., so it seems like the kernel just can't grab new FDs.

I too found the hang happens more easily if a downed node from a replicate
pair re-joins after some time.

Following suggestions that this is all kernel related, I have just moved up to
RHEL 5.4 in the hope that the new kernel will help. This fix stood out as
potentially related for me:
https://bugzilla.redhat.com/show_bug.cgi?id=445433

We also have a Broadcom network card, which had reports of hangs under load;
the kernel has a patch for that too. If I still run into the hangs, I'll try
xfs.

Thanks,
Jeff.
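If the suspicion is that the kernel can no longer hand out file descriptors,
that is at least cheap to check while the box is still responsive and again
shortly before it locks up. A quick sketch, assuming glusterfsd is the
process of interest:

  # system-wide: allocated fds, free fds, and the fs.file-max limit
  cat /proc/sys/fs/file-nr

  # per-process: how many fds glusterfsd currently holds open
  ls /proc/$(pidof glusterfsd)/fd | wc -l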
> > Which actually reinforces the point that glusterfs has very little to
> > do with this kernel lockup. It is not even performing the special fuse
> > protocol communication with the kernel in question. Just plain vanilla
> > POSIX system calls on disk filesystem and send/recv on TCP/IP sockets.
>
> this does not reinforce anything special, gluster may be eating
> resources and not releasing them or calling system functions with
> bad arguments and the system may run out of them or enter some race
> condition and produce the lock.

Instead of guessing and contemplating and using your brain cycles to figure
out the cause, have you taken the effort to post the kernel backtraces you
have to the linux-kernel mailing list yet? All you need to do is compose an
email with the attachment you have already posted here previously and shoot it
out to LKML.

> Just note that a user has pointed in another message part of code not
> testing for null pointers, so the code could be plenty of similar things
> that can produce undesirable and/or unknown side effects.

Now you are clearly proving that you have no clue about what you just spoke
of, nor have you been reading my previous explanations. You have no clue what
a NULL pointer in a userland app can and cannot do. And you talk about unknown
and undesirable effects of such programming bugs of a userland application
without understanding fundamental operating system concepts of kernel memory
isolation and the system call mechanisms between processes and the kernel. A
missing NULL check can result in a segfault of glusterfsd. A userspace
application has a limit to the damage it can cause, and that limitation is by
virtue of it being a userspace app in the first place.

Is glusterfsd eating up and not releasing resources? It could be. It may not
be. That being the trigger for the kernel lockup is one among very many
possibilities. At the outset, looking at the backtrace, it does not appear to
have anything to do with resource leaks or with glusterfs. To find out what
the soft lockup is all about, write to LKML and ask.

A soft lockup is a kernel bug whether you personally like it or not. Whether
glusterfs is triggering this soft lockup is not clear. Let's say it indeed is.
It could be doing something like an extended attribute call with a 2^n+1 byte
buffer which triggered an off-by-one bug in the kernel. Or maybe it sent data
in packet sizes which resulted in a certain pattern of fragmentation leading
to what not. Or maybe it allocated and freed memory regions in a particular
order. What kind of debugging would you like to see added to glusterfs? Would
you like glusterfs to check for itself whether it is performing system calls
too frequently? Or after an odd number of jiffy intervals? Or whether it
allocates a prime number of bytes for memory allocation? It is those kinds of
races and equally weird corner cases which result in soft lockups. Do you
expect every userland application that has ever triggered a kernel soft lockup
to implement such instrumentation?

In the end, whether glusterfs has such instrumentation or not, the path to the
answer you are looking for is in that very kernel backtrace. Your approach to
debugging this kernel soft lockup is extremely inefficient for both you and
us. glusterfs misbehavior (definitely not null pointer access!) is one of the
possibilities, though a very unlikely one from how your kernel backtrace
appears. Work on the evidence you already have.
Do you want me to post your backtraces on LKML on your behalf? Those
developers are going to tell you whether it is that unlikely case of an
application leaking resources which caused this lockup, or whether it is a
programming bug within the kernel itself.

Without this initial groundwork on the primary evidence you already have in
hand, please do not expect any further assistance on this list for debugging
the soft lockup until you have an indication from the kernel developers that
the cause is a misbehaving user app. No other (user space) project will help
you with such lockups either. There have been cases where rsync triggers a
soft lockup but scp does not. Do you blame rsync for not having sufficient
instrumentation and debugging techniques, or accuse it of eating up resources
-- all this even before you post the kernel backtrace on LKML? You really
would be making a fool of yourself with this approach on other project lists,
let alone receiving such patient replies and advice.

This apart, we have no hard feelings and are just as keen on resolving your
other glusterfs bug reports.

Avati
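For completeness, the usual way to produce the kind of backtrace LKML will ask
for is the magic SysRq interface. A sketch; this only helps if the console (or
a serial/netconsole) still reacts, and whether SysRq is enabled depends on the
kernel build:

  # enable all SysRq functions for this boot
  echo 1 > /proc/sys/kernel/sysrq

  # dump a stack trace of every task into the kernel ring buffer
  # (equivalent to pressing Alt+SysRq+t on the local keyboard)
  echo t > /proc/sysrq-trigger

  # save what the kernel printed, to attach to the report
  dmesg > softlockup-backtrace.txt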
We have logged the bug at
http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=272
You can track the progress there.

Basically, when the backend FS hangs because of lockups, further access to
that FS from the glusterfs mount point will return an error (instead of
hanging).

Thanks
Krishna

On Mon, Sep 7, 2009 at 8:16 PM, Stephan von Krawczynski <skraw at ithnet.com> wrote:
> [...]
> [root at wcarh033]~# ps -ef | grep gluster
> root      1548     1  0 21:00 ?        00:00:00 /opt/glusterfs/sbin/glusterfsd -f /etc/glusterfs/glusterfsd.vol
> root      1861     1  0 21:00 ?        00:00:00 /opt/glusterfs/sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/tools.vol /gluster/tools
> root      1874  1861  0 21:00 ?        00:00:00 /bin/mount -i -f -t fuse.glusterfs -o rw,allow_other,default_permissions,max_read=131072 /etc/glusterfs/tools.vol /gluster/tools
> root      2426  2395  0 21:02 pts/2    00:00:00 grep gluster
> [root at wcarh033]~# ls /gluster/tools
> ^C^C
>
> Yep - all three nodes locked up. All it took was a simultaneous reboot
> of all three machines.
>
> After I kill -9 1874 (kill 1874 without -9 has no effect) from a
> different ssh session, I get:
>
> ls: cannot access /gluster/tools: Transport endpoint is not connected
>
> After this, mount works (unmount is not necessary, it turns out).
>
> I am unable to strace -p the mount -t fuse process without it freezing
> up. I can pstack it, but it returns 0 lines of output fairly quickly.
>
> The symptoms are identical on all three machines: 3-way replication,
> each server has both a server exposing one volume and a client, with
> cluster/replication and a preferred read of the local server.

This is a strange hang. I have a few more questions -

1. Is this off the glusterfs.git master branch or release-2.0? If this is
   master, there have been heavy un-QA'ed modifications to get rid of the
   libfuse dependency.

2. What happens if you try to start the three daemons together now, when the
   system is not booting? Is this hang somehow related to the system booting?

3. Can you provide dmesg output and glusterfs trace-level logs of this
   scenario?

Avati
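For whoever hits this next, the data asked for in (3) can be gathered roughly
like this. The flag names follow the 2.0.x command line visible in the ps
output above, but treat them and the log path as assumptions and check
glusterfs --help on your own build:

  # kernel side: save the ring buffer right after reproducing the hang
  dmesg > dmesg-after-hang.txt

  # glusterfs side: remount the client with trace-level logging
  glusterfs --log-level=TRACE \
            --log-file=/var/log/glusterfs/tools-trace.log \
            --volfile=/etc/glusterfs/tools.vol /gluster/tools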