Hello all,

I really wonder whether any of you run rsync onto glusterfs. I can hardly believe it, because 2.0.6, like every single release before it, breaks mtimes on non-empty directories. Am I really the only one who has noticed this fundamental flaw?
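A quick way to see what I mean, as a sketch (paths are examples; assume the glusterfs volume is mounted at /mnt/glusterfs):

    # Give a non-empty source directory a known mtime, rsync it over,
    # and compare the directory mtimes on both sides.
    mkdir -p /tmp/src/dir && touch /tmp/src/dir/file
    touch -d "2009-01-01 00:00" /tmp/src/dir
    rsync -a /tmp/src/ /mnt/glusterfs/dst/
    stat -c '%y %n' /tmp/src/dir /mnt/glusterfs/dst/dir

With -a, rsync preserves directory mtimes, so the two lines should match; here the destination directory comes back with a fresh timestamp instead.

--
Regards,
Stephan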
----- "Stephan von Krawczynski" <skraw at ithnet.com> wrote:

> Hello all,
>
> I really wonder whether any of you run rsync onto glusterfs. I can
> hardly believe it, because 2.0.6, like every single release before
> it, breaks mtimes on non-empty directories. Am I really the only one
> who has noticed this fundamental flaw?

This issue is being tracked and will be fixed in the 2.1 release.

From http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=137:

> Deferring a fix until the protocol and FOP prototypes are changed to
> include stat info of parent directories (for 2.1).

Vikas
--
Engineer - http://gluster.com/
On Tue, 18 Aug 2009 15:01:46 +0200 Stephan von Krawczynski <skraw at ithnet.com> wrote:

> Hello all,
>
> I really wonder whether any of you run rsync onto glusterfs. I can
> hardly believe it, because 2.0.6, like every single release before
> it, breaks mtimes on non-empty directories. Am I really the only one
> who has noticed this fundamental flaw?
> --
> Regards,
> Stephan

And I forgot to mention: it took me around one hour of bonnie to crash a server in a classical distribute setup... Please stop the featurism and start working on reliability.
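For the record, the load is nothing exotic; roughly this on each of the two clients in parallel (a sketch -- mountpoint, size and file count are examples, and we used bonnie++):

    # One bonnie++ instance per client against the glusterfs mountpoint.
    bonnie++ -d /mnt/glusterfs/test -s 8g -n 64 -u root

--
Regards,
Stephan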
Stephan,

Please find replies below. I am merging the thread back to the ML.

> > Stephan, we need some more info. I think we are a lot closer to
> > diagnosing this issue now. The hang is being caused by an io-thread
> > getting stuck, either in a deadlock inside the glusterfsd process
> > code or blocked on disk access for an excessively long time. The
> > following details will be _extremely_ useful for us.
> >
> > 1. What is your backend FS, kernel version and distro running on
> > server2? Is the backend FS on a local disk or some kind of SAN or
> > iSCSI?
>
> The backend FS is reiserfs3, kernel version 2.6.30.5, distro openSuSE
> 11.1. The backend FS resides on a local Areca RAID system. See the
> attached output from the former email.
>
> > 2. Was the glusterfsd on server2 taking 100% CPU at the time of the
> > hang?
>
> I can only try to remember that from the time I took the strace
> logs. I am not 100% sure, but from typing and looking I would say the
> load was very low, probably next to zero.
>
> > 3. On server2, now that you have killed it with -11, you should
> > have a core file in /. Can you get the backtrace from all the
> > threads? Please use the following commands -
> >
> > sh# gdb /usr/sbin/glusterfsd -c /core.X
> >
> > and then at the gdb prompt
> >
> > (gdb) thread apply all bt
> >
> > This should output the backtraces from all the threads.
>
> The bad news is this: we were not able to shut down the box normally
> because the local (exported) fs hung completely, so shutdown did not
> work. We had to hard-reset it. When examining the box a few minutes
> ago we found that all logs (and likely the core dump) were lost. I
> have seen this kind of behaviour before; it originates from reiserfs3
> and is not really unusual. This means we will redo the test and hope
> we can force the problem again. Then we will take all possible logs,
> dmesg output and cores off the server before rebooting it. I am very
> sorry we lost the important part of the information...

Stephan,

This clearly points to the root cause of the bonnie hangs you have been facing on every release of 2.0.x: the hanging reiserfs export. When the backend FS misbehaves, this is the expected behaviour of GlusterFS. Not only will you see this in all versions of GlusterFS, you will face the same hangs even with NFS, or even when running bonnie directly on the backend FS. All the IO calls get queued and blocked in the IO thread that touches the disk, while the main FS thread keeps responding to ping-pong requests, thus keeping the server "alive".

All of us on this ML could have spent far fewer cycles if the initial description of the problem had included a note that one of the servers' backend reiserfs3 was already known to freeze in this environment. When someone reports a hang on the glusterfs mountpoint, the first thing we developers do is look for code paths with what we call "missing frames" (technically a syscall leak, somewhat like a memory leak), and that is very demanding and time-consuming debugging for us. All the information you can provide will only help us debug the issue faster.

All,

The reason I merged this thread back to the ML is that we want to request anybody reporting issues to give as much information as possible upfront. In the interest of all of us, the developers' and, more importantly, the community's interest in getting quicker releases, good bug reports are the best thing you can offer us.
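Incidentally, if this happens again and the core gets lost, the backtraces can also be taken from the live (hung) glusterfsd before the reboot; something like this, as a sketch (the output path is just an example):

    # Attach gdb to the running glusterfsd and dump all thread
    # backtraces to a file that can be copied off the box first.
    pid=$(pidof glusterfsd)
    gdb -p "$pid" -batch -ex 'thread apply all bt' > /var/tmp/glusterfsd-bt.txt 2>&1

    # And make sure core dumps are enabled for the next run:
    ulimit -c unlimited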
Please describe the FS configuration, environment and application, and give the steps to reproduce the issue, with versions, configs and logs of every relevant component. And if you can, report all this directly on our bug tracking site (http://bugs.gluster.com), keeping the MLs for discussions as much as possible; that would be the best thing you can do for us.
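As a rough sketch of the kind of data that helps (the volfile path and output file name are examples; adjust them to your setup):

    # Gather versions, configs and recent kernel messages for a bug report.
    {
      uname -a
      glusterfs --version
      cat /etc/glusterfs/*.vol
      dmesg | tail -n 100
    } > gluster-bug-report.txt 2>&1

Thank you for all the support!

Avati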
----- "David Saez Padros" <david at ols.es> wrote:

> in the problem we have, the server also hung to the point that there
> was no way to access it, and we ended up rebooting the server to gain
> access to it

Do you mean you were unable to log in to the machine over the network? Unable to get a responsive console shell? The machine would not respond to ICMP on the network? Do you still have the logfiles and volfiles, and can you describe the steps to reproduce in a bug report?

As a rule of thumb, if your server hangs to the degree of not even having a usable shell, it means that heavy IO via glusterfs triggered some bug in the operating system. Try to get kernel output via dmesg or console logs if you have any. glusterfsd only issues system calls and does not do anything funky with the server; think of an application local to the server causing such a hang. glusterfsd is no different in that respect.
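If the machine hangs so hard that nothing makes it to disk, netconsole can push kernel messages to another box over UDP; a sketch, with placeholder addresses, interface and MAC that must be replaced with your own:

    # On the hanging server: stream kernel output to 192.168.0.10:6666.
    modprobe netconsole netconsole=@/eth0,6666@192.168.0.10/00:11:22:33:44:55

    # On the receiving machine: log whatever arrives.
    nc -l -u -p 6666 | tee server-console.log

Avati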
Replies inline.

> back to our original problem of all-hanging glusterfs servers and
> clients. Today we got another hang with the same look and feel, but
> this time we got something in the logs; please read and tell us how
> to proceed further. The configuration is as before. I am sending the
> whole log since boot; the crash is visible at the end. We did the
> same testing as before, running two bonnies on two clients.
>
> Linux version 2.6.30.5 (root at linux-tnpx) (gcc version 4.3.2
> [gcc-4_3-branch revision 141291] (SUSE Linux) ) #1 SMP Tue Aug 18 12:06:06 CEST 2009
> general protection fault: 0000 [#1] SMP
> last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
> CPU 2
> Modules linked in: fuse loop i2c_i801 i2c_core e100 e1000e
> Pid: 3833, comm: glusterfsd Not tainted 2.6.30.5 #1 empty
> RIP: 0010:[<ffffffff80244305>]  [<ffffffff80244305>] __wake_up_bit+0xc/0x2d
> RSP: 0018:ffff88011fc51a98  EFLAGS: 00010292
> RAX: 8dfd233fe2300848 RBX: ffffe20000220058 RCX: 0000000000000040
> RDX: 0000000000000000 RSI: ffffe20000220058 RDI: 8dfd233fe2300840
> RBP: ffff8800b3be03b0 R08: b000000000000000 R09: ffffe20000220058
> R10: ffffffffb3be03b1 R11: 0000000000000001 R12: 00000000000021a4
> R13: 00000000021a4000 R14: ffff8800b3be03b0 R15: 00000000000021a4
> FS: 00007f684127f950(0000) GS:ffff880028052000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fff5373eb78 CR3: 000000011fc79000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process glusterfsd (pid: 3833, threadinfo ffff88011fc50000, task ffff880126ad53e0)
> Stack:
>  ffffffff8048eb40 ffffffff00000000 ffffe20000220058 ffffffff8025fa3f
>  ffffffff8048eb40 0000000000004000 00000000ffffffff ffffffff8025fc36
>  000000d0b3be0298 ffffffff8048eb40 0000000000004000 00000000fffffff4
> Call Trace:
>  [<ffffffff8025fa3f>] ? find_lock_page+0x43/0x55
>  [<ffffffff8025fc36>] ? grab_cache_page_write_begin+0x3b/0xa1
>  [<ffffffff802d34ef>] ? reiserfs_write_begin+0x81/0x1dc
>  [<ffffffff802d5505>] ? reiserfs_get_block+0x0/0xeb5
>  [<ffffffff8026054a>] ? generic_file_buffered_write+0x12c/0x2fa
>  [<ffffffff80260bf7>] ? __generic_file_aio_write_nolock+0x349/0x37d
>  [<ffffffff8024d33f>] ? futex_wait+0x41a/0x42f
>  [<ffffffff802613ea>] ? generic_file_aio_write+0x64/0xc4
>  [<ffffffff80261386>] ? generic_file_aio_write+0x0/0xc4
>  [<ffffffff80286f09>] ? do_sync_readv_writev+0xc0/0x107
>  [<ffffffff8024d486>] ? futex_wake+0xc8/0xd9
>  [<ffffffff80244348>] ? autoremove_wake_function+0x0/0x2e
>  [<ffffffff8024e5e8>] ? do_futex+0xa9/0x8b3
>  [<ffffffff80286d95>] ? rw_copy_check_uvector+0x6d/0xe4
>  [<ffffffff80287581>] ? do_readv_writev+0xb2/0x18b
>  [<ffffffff80249bb7>] ? getnstimeofday+0x55/0xaf
>  [<ffffffff80246b62>] ? ktime_get_ts+0x21/0x49
>  [<ffffffff8028775b>] ? sys_writev+0x45/0x6e
>  [<ffffffff8020ae6b>] ? system_call_fastpath+0x16/0x1b
> Code: 00 48 29 f8 2b 8a d0 e7 5d 80 4c 01 c0 48 d3 e8 48 6b c0 18 48
> 03 82 c0 e7 5d 80 5a 5b 5d c3 48 83 ec 18 48 8d 47 08 89 54 24 08 <48>
> 39 47 08 74 16 48 89 34 24 48 89 e1 ba 01 00 00 00 be 03 00
> RIP [<ffffffff80244305>] __wake_up_bit+0xc/0x2d
> RSP <ffff88011fc51a98>
> ---[ end trace 10a1fa47d70a1dc4 ]---

The right place to post this backtrace is reiserfs-devel at vger.kernel.org. You could do them a favor by mentioning the closest pair of kernel versions in which this issue is not seen and then appears, if you have the time to do that for them.
For all you know, it might already be fixed in a newer kernel version, but you will find the right answer on that ML.
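If you do find the time, the mechanical way to pin it down is to bisect the kernel between a known-good and a known-bad release; roughly like this (the tree URL and tags are examples, and every step means building and booting that kernel and rerunning the bonnie load):

    # Sketch of bisecting the kernel between an assumed-good and a bad tag.
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
    cd linux-2.6
    git bisect start
    git bisect bad v2.6.30        # a version that shows the oops
    git bisect good v2.6.27       # assumed last version that did not
    # build and boot the suggested revision, rerun the load, then mark it:
    git bisect good               # or: git bisect bad

Avati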