Roman Hlynovskiy
2008-Sep-08  08:45 UTC
[Gluster-users] gluster(1.3.10) becomes unstable after some time
Hello all,

I have a setup of 4 identical servers. Each of them exports 2 data
bricks and 1 namespace brick. The first brick of each server is AFR'ed
with the second brick of the previous server, so this configuration
gives some service redundancy if one of the servers fails.
All the namespace bricks are also AFR'ed into one.

Below you can find my configuration from the first server. As you can
see, in the client part I used this server's local bricks (brick1,
brick2, brickns) instead of its network-exported counterparts
(brick01, brick02, brick01ns) to improve read i/o. The second server
likewise uses its local brick1, brick2 and brickns instead of brick03,
brick04 and brick02ns, and so on.
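To make the pairing concrete, with server 1 = .11, server 2 = .21,
server 3 = .31 and server 4 = .41, the spec below works out to:

afr01 = brick2 (local, server 1) + brick03 (first brick of server 2)
afr02 = brick04 (second brick of server 2) + brick05 (first brick of server 3)
afr03 = brick06 (second brick of server 3) + brick07 (first brick of server 4)
afr04 = brick08 (second brick of server 4) + brick1 (local, server 1)
afrns = brickns (local) + brick02ns + brick03ns + brick04ns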
The first problem I saw: after about 20 minutes of some basic
file-copying tests, the gluster mount on all servers became
unavailable. I see the following errors in the log:
2008-09-08 14:26:36 W [client-protocol.c:205:call_bail] brick03ns:
activating bail-out. pending frames = 1. last sent = 2008-09-08
14:19:43. last received = 2008-09-08 14:19:43 transport-timeout = 42
2008-09-08 14:26:36 C [client-protocol.c:212:call_bail] brick03ns:
bailing transport
2008-09-08 14:26:36 E [tcp.c:124:tcp_except] brick03ns: shutdown () -
error: Transport endpoint is not connected
2008-09-08 14:26:36 W [client-protocol.c:205:call_bail] brick05:
activating bail-out. pending frames = 1. last sent = 2008-09-08
14:19:43. last received = 2008-09-08 14:19:43 transport-timeout = 42
2008-09-08 14:26:36 C [client-protocol.c:212:call_bail] brick05:
bailing transport
2008-09-08 14:26:36 E [tcp.c:124:tcp_except] brick05: shutdown () -
error: Transport endpoint is not connected
2008-09-08 14:26:36 W [client-protocol.c:205:call_bail] brick06:
activating bail-out. pending frames = 1. last sent = 2008-09-08
14:19:43. last received = 2008-09-08 14:19:43 transport-timeout = 42
2008-09-08 14:26:36 C [client-protocol.c:212:call_bail] brick06:
bailing transport
2008-09-08 14:26:36 E [tcp.c:124:tcp_except] brick06: shutdown () -
error: Transport endpoint is not connected
2008-09-08 14:26:41 W [client-protocol.c:205:call_bail] brick08:
activating bail-out. pending frames = 1. last sent = 2008-09-08
14:19:43. last received = 2008-09-08 14:19:43 transport-timeout = 42
2008-09-08 14:26:41 C [client-protocol.c:212:call_bail] brick08:
bailing transport
2008-09-08 14:26:41 E [tcp.c:124:tcp_except] brick08: shutdown () -
error: Transport endpoint is not connected
2008-09-08 14:26:41 W [client-protocol.c:205:call_bail] brick04ns:
activating bail-out. pending frames = 1. last sent = 2008-09-08
14:19:43. last received = 2008-09-08 14:19:43 transport-timeout = 42
2008-09-08 14:26:41 C [client-protocol.c:212:call_bail] brick04ns:
bailing transport
2008-09-08 14:26:41 E [tcp.c:124:tcp_except] brick04ns: shutdown () -
error: Transport endpoint is not connected
2008-09-08 14:26:41 W [client-protocol.c:205:call_bail] brick07:
activating bail-out. pending frames = 1. last sent = 2008-09-08
14:19:43. last received = 2008-09-08 14:19:43 transport-timeout = 42
2008-09-08 14:26:41 C [client-protocol.c:212:call_bail] brick07:
bailing transport
2008-09-08 14:26:41 E [tcp.c:124:tcp_except] brick07: shutdown () -
error: Transport endpoint is not connected
The second problem I see: even with 'option
alu.read-only-subvolumes' set, gluster keeps writing to the volumes
specified as read-only. What could be the reason for this?
----------------------
volume posix1
        type storage/posix
        option directory /mnt/os1/export
end-volume
volume locks1
        type features/posix-locks
        subvolumes posix1
        option mandatory on
end-volume
volume brick1
        type performance/io-threads
        option thread-count 4
        option cache-size 32MB
        subvolumes locks1
end-volume
volume posix2
        type storage/posix
        option directory /mnt/os2/export
end-volume
volume locks2
        type features/posix-locks
        subvolumes posix2
        option mandatory on
end-volume
volume brick2
        type performance/io-threads
        option thread-count 4
        option cache-size 32MB
        subvolumes locks2
end-volume
volume brickns
        type storage/posix
        option directory /mnt/ms
end-volume
volume server
        type protocol/server
        subvolumes brick1 brick2 brickns
        option transport-type tcp/server
        option auth.ip.brick1.allow *
        option auth.ip.brick2.allow *
        option auth.ip.brickns.allow *
end-volume
volume brick01
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.11
 option remote-subvolume brick1
end-volume
volume brick02
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.11
 option remote-subvolume brick2
end-volume
volume brick01ns
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.11
 option remote-subvolume brickns
end-volume
volume brick03
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.21
 option remote-subvolume brick1
end-volume
volume brick04
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.21
 option remote-subvolume brick2
end-volume
volume brick02ns
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.21
 option remote-subvolume brickns
end-volume
volume brick05
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.31
 option remote-subvolume brick1
end-volume
volume brick06
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.31
 option remote-subvolume brick2
end-volume
volume brick03ns
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.31
 option remote-subvolume brickns
end-volume
volume brick07
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.41
 option remote-subvolume brick1
end-volume
volume brick08
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.41
 option remote-subvolume brick2
end-volume
volume brick04ns
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.41
 option remote-subvolume brickns
end-volume
volume afr01
 type cluster/afr
 subvolumes brick2 brick03
 option read-subvolume brick2
end-volume
volume afr02
 type cluster/afr
 subvolumes brick04 brick05
end-volume
volume afr03
 type cluster/afr
 subvolumes brick06 brick07
end-volume
volume afr04
 type cluster/afr
 subvolumes brick08 brick1
 option read-subvolume brick1
end-volume
volume afrns
 type cluster/afr
 subvolumes brickns brick02ns brick03ns brick04ns
 option read-subvolume brickns
end-volume
volume unify
 type cluster/unify
 subvolumes afr01 afr02 afr03 afr04
 option namespace afrns
 option scheduler alu
 option alu.read-only-subvolumes afr02,afr03
 option alu.limits.min-free-disk  5%
 option alu.stat-refresh.interval 10sec
 option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
 option alu.disk-usage.entry-threshold 1024M
 option alu.disk-usage.exit-threshold 32M
end-volume
---------------------
-- 
...WBR, Roman Hlynovskiy
Amar S. Tumballi
2008-Sep-22  22:39 UTC
[Gluster-users] gluster(1.3.10) becomes unstable after some time
Hi Roman,

Sorry for the delay in response.

> The first problem I saw: After 20 minutes of some basic tests with
> file copying gluster mount on all servers became unavailable.

Do you see any '/core*' files? The log means the calls are bailing
out, and there are three possible reasons:

i) Because of heavy disk i/o the responses are getting delayed, so the
default 'transport-timeout' is not enough. Try higher values like 120.

ii) A glusterfs process died, so the clients couldn't connect to the
corresponding server process (unlikely in your case, as a new
connection is made again after a call bail).

iii) A bug in glusterfs itself. In that case we would like you to try
1.3.12 (the latest 1.3.x release), or wait another 10 days for the
next pre-release of the 1.4 branch, which should work fine IMO.

> The second problem I see - even with 'option
> alu.read-only-subvolumes' gluster remains writing to the specified as
> read-only volumes. what could be the reason for this?

The reason is that the 'read-only-subvolumes' option only makes sure
new files are not created on those two subvolumes; if a file already
exists on them, it continues to grow. If you don't want any writes to
happen at all, you need to use the filter translator.

Regards,
Amar

-- 
Amar Tumballi
Gluster/GlusterFS Hacker
[bulde on #gluster/irc.gnu.org]
http://www.zresearch.com - Commoditizing Super Storage!
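For reference, option (i) above would mean adding a higher
'transport-timeout' to each protocol/client volume in the client part
of the spec file. A minimal sketch, using the brick01 volume from
Roman's config (the same line would be repeated in every
protocol/client volume):

----------------------
volume brick01
 type protocol/client
 option transport-type tcp/client
 option remote-host 192.168.252.11
 option remote-subvolume brick1
 # raise from the default (the log shows transport-timeout = 42)
 # so that slow disk i/o does not trigger call bail-outs
 option transport-timeout 120
end-volume
----------------------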
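And a rough sketch of the filter approach Amar mentions: load a
features/filter volume on top of each subvolume that must stay
untouched and hand that to unify instead. The read-only option name
shown here is an assumption; it has varied between releases, so check
the features/filter documentation for the version actually installed:

----------------------
volume afr02-ro
 type features/filter
 # assumed read-only switch; verify the exact option name for your release
 option read-only on
 subvolumes afr02
end-volume
----------------------

unify would then list afr02-ro (and a matching afr03-ro) in its
subvolumes in place of afr02 and afr03.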