Alessio Checcucci
2012-Mar-12 07:31 UTC
[Gluster-users] Gluster Volume hangs (version 3.2.5)
Dear All,
we are facing a problem in our computer room: we have 6 servers that act as bricks for GlusterFS, configured in the following way:

OS: CentOS 6.2 x86_64
Kernel: 2.6.32-220.4.2.el6.x86_64

Gluster RPM packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-geo-replication-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64

Each one contributes an XFS filesystem to the global volume, and the transport mechanism is RDMA:

gluster volume create HPC_data transport rdma pleiades01:/data pleiades02:/data pleiades03:/data pleiades04:/data pleiades05:/data pleiades06:/data

Each server mounts the volume through the FUSE driver on a dedicated mount point, according to the following fstab entry:

pleiades01:/HPC_data        /HPCdata        glusterfs       defaults,_netdev 0 0

We are running MongoDB on top of the Gluster volume for performance testing, and the speed is definitely high. Unfortunately, when we run a large mongoimport job, the GlusterFS volume hangs completely shortly after the job starts and becomes inaccessible from every node. The following error is logged after some time:

Mar  8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
Mar  8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  8 08:16:03 pleiades03 kernel: mongod        D 0000000000000007     0  5508      1 0x00000000
Mar  8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
Mar  8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
Mar  8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
Mar  8 08:16:03 pleiades03 kernel: Call Trace:
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Any attempt to access the volume from any node is fruitless until the MongoDB process is killed; the session accessing the /HPCdata path freezes. Even then, a complete stop (force) and start of the volume is needed to make it operational again. The situation can be reproduced at will.

Is there anybody able to help us? Could we collect more information to help diagnose the problem?

Thanks a lot
Alessio
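[Editorial note on the last question: a minimal sketch of diagnostics that could be gathered while the hang is in progress is shown below. The log file names are assumptions based on GlusterFS's usual naming convention (client logs named after the mount point, brick logs named after the brick path) and may differ on these systems.

# Volume layout and peer state as glusterd sees them
gluster volume info HPC_data
gluster peer status

# Client-side (FUSE mount) log on the node that ran mongoimport
tail -n 200 /var/log/glusterfs/HPCdata.log

# Brick-side log on each server
tail -n 200 /var/log/glusterfs/bricks/data.log

# If sysrq is enabled, dump all blocked kernel tasks to /var/log/messages
# while the hang is active (shows where mongod and the glusterfs client wait)
echo w > /proc/sysrq-trigger
]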
Alessio Checcucci
2012-Mar-15 02:33 UTC
[Gluster-users] Gluster Volume hangs (version 3.2.5)
Dear All,
we are facing a problem in our computer room: we have 6 servers that act as bricks for GlusterFS, configured in the following way:

OS: CentOS 6.2 x86_64
Kernel: 2.6.32-220.4.2.el6.x86_64

Gluster RPM packages:
glusterfs-core-3.2.5-2.el6.x86_64
glusterfs-rdma-3.2.5-2.el6.x86_64
glusterfs-geo-replication-3.2.5-2.el6.x86_64
glusterfs-fuse-3.2.5-2.el6.x86_64

Each one contributes an XFS filesystem to the global volume, and the transport mechanism is RDMA:

gluster volume create HPC_data transport rdma pleiades01:/data pleiades02:/data pleiades03:/data pleiades04:/data pleiades05:/data pleiades06:/data

Each server mounts the volume through the FUSE driver on a dedicated mount point, according to the following fstab entry:

pleiades01:/HPC_data        /HPCdata        glusterfs       defaults,_netdev 0 0

We are running MongoDB on top of the Gluster volume for performance testing, and the speed is definitely high. Unfortunately, when we run a large mongoimport job, the GlusterFS volume hangs completely shortly after the job starts and becomes inaccessible from every node. The following error is logged after some time in /var/log/messages:

Mar  8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
Mar  8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  8 08:16:03 pleiades03 kernel: mongod        D 0000000000000007     0  5508      1 0x00000000
Mar  8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
Mar  8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
Mar  8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
Mar  8 08:16:03 pleiades03 kernel: Call Trace:
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
Mar  8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Any attempt to access the volume from any node is fruitless until the MongoDB process is killed; the sessions accessing the /HPCdata path freeze on every node. Even then, a complete stop (force) and start of the volume is needed to make it operational again. The situation can be reproduced at will.

Is there anybody able to help us? Could we collect more information to help diagnose the problem?

Thanks a lot
Alessio
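[Editorial note: for reference, the recovery sequence described above (force-stop the volume, then start it again) might look roughly as follows. Lazily unmounting the FUSE clients and remounting them afterwards is an assumption added here for completeness, not something stated in the post.

# On the node running the import: kill the process stuck in fsync/msync
pkill -9 mongod

# On every client: detach the hung FUSE mount
umount -l /HPCdata

# On any one of the servers: force-stop and restart the volume
# (gluster asks for confirmation before stopping)
gluster volume stop HPC_data force
gluster volume start HPC_data

# On every client: remount using the existing fstab entry
mount /HPCdata
]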
Bryan Whitehead
2012-Mar-15 21:23 UTC
[Gluster-users] Gluster Volume hangs (version 3.2.5)
I found running gluster with InfiniBand to be faster and less CPU intensive when using IPoIB (IP over InfiniBand). I'd give that a shot; a rough sketch of what such a setup could look like is appended after the quoted message below.

On Mon, Mar 12, 2012 at 12:31 AM, Alessio Checcucci
<alessio.checcucci at gmail.com> wrote:
> Dear All,
> we are facing a problem in our computer room: we have 6 servers that act as
> bricks for GlusterFS, configured in the following way:
>
> OS: CentOS 6.2 x86_64
> Kernel: 2.6.32-220.4.2.el6.x86_64
>
> Gluster RPM packages:
> glusterfs-core-3.2.5-2.el6.x86_64
> glusterfs-rdma-3.2.5-2.el6.x86_64
> glusterfs-geo-replication-3.2.5-2.el6.x86_64
> glusterfs-fuse-3.2.5-2.el6.x86_64
>
> Each one contributes an XFS filesystem to the global volume, and the
> transport mechanism is RDMA:
>
> gluster volume create HPC_data transport rdma pleiades01:/data
> pleiades02:/data pleiades03:/data pleiades04:/data pleiades05:/data
> pleiades06:/data
>
> Each server mounts the volume through the FUSE driver on a dedicated mount
> point, according to the following fstab entry:
>
> pleiades01:/HPC_data        /HPCdata        glusterfs       defaults,_netdev 0 0
>
> We are running MongoDB on top of the Gluster volume for performance testing,
> and the speed is definitely high. Unfortunately, when we run a large
> mongoimport job, the GlusterFS volume hangs completely shortly after the job
> starts and becomes inaccessible from every node. The following error is
> logged after some time:
>
> Mar  8 08:16:03 pleiades03 kernel: INFO: task mongod:5508 blocked for more than 120 seconds.
> Mar  8 08:16:03 pleiades03 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Mar  8 08:16:03 pleiades03 kernel: mongod        D 0000000000000007     0  5508      1 0x00000000
> Mar  8 08:16:03 pleiades03 kernel: ffff881709b95de8 0000000000000086 0000000000000000 0000000000000008
> Mar  8 08:16:03 pleiades03 kernel: ffff881709b95d68 ffffffff81090a7f ffff8816b6974cc0 0000000000000000
> Mar  8 08:16:03 pleiades03 kernel: ffff8817fdd81af8 ffff881709b95fd8 000000000000f4e8 ffff8817fdd81af8
> Mar  8 08:16:03 pleiades03 kernel: Call Trace:
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a7f>] ? wake_up_bit+0x2f/0x40
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090d7e>] ? prepare_to_wait+0x4e/0x80
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112c6b5>] fuse_set_nowrite+0xa5/0xe0 [fuse]
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fd48>] fuse_fsync_common+0xa8/0x180 [fuse]
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffffa112fe30>] fuse_fsync+0x10/0x20 [fuse]
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a52d1>] vfs_fsync_range+0xa1/0xe0
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff811a537d>] vfs_fsync+0x1d/0x20
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff81144421>] sys_msync+0x151/0x1e0
> Mar  8 08:16:03 pleiades03 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>
> Any attempt to access the volume from any node is fruitless until the
> MongoDB process is killed; the session accessing the /HPCdata path freezes.
> Even then, a complete stop (force) and start of the volume is needed to make
> it operational again. The situation can be reproduced at will.
>
> Is there anybody able to help us? Could we collect more information to help
> diagnose the problem?
>
> Thanks a lot
> Alessio
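[Editorial note: a minimal sketch of the IPoIB-based setup suggested above is shown below. The interface name ib0, the 10.10.10.x addresses, and hostnames such as pleiades01-ib are illustrative assumptions, not values from this thread, and the sketch assumes the existing RDMA volume is stopped and deleted first.

# 1. Configure the IPoIB interface on every node (CentOS 6 style);
#    use a unique IPADDR per node.
cat > /etc/sysconfig/network-scripts/ifcfg-ib0 <<'EOF'
DEVICE=ib0
TYPE=InfiniBand
BOOTPROTO=static
IPADDR=10.10.10.1
NETMASK=255.255.255.0
ONBOOT=yes
EOF
ifup ib0

# 2. Retire the RDMA volume and recreate it with TCP transport over the
#    IPoIB hostnames (which must resolve to the ib0 addresses; the peers
#    may also need to be probed under those names first).
gluster volume stop HPC_data force
gluster volume delete HPC_data
gluster volume create HPC_data transport tcp \
    pleiades01-ib:/data pleiades02-ib:/data pleiades03-ib:/data \
    pleiades04-ib:/data pleiades05-ib:/data pleiades06-ib:/data
gluster volume start HPC_data

# 3. Point the fstab entry at an IPoIB hostname and remount, e.g.:
# pleiades01-ib:/HPC_data   /HPCdata   glusterfs   defaults,_netdev 0 0
mount /HPCdata
]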