Hello all,

I am trying to set up a file replication scheme for our cluster of about 2000 nodes. I'm not sure if what I am doing is actually feasible, so I'll best just start from the beginning and maybe one of you knows an even better way to do this.

Basically we have a farm of 2000 machines running a certain application that, during start up, reads about 300 MB of data (out of a 6 GB repository) of program libraries, geometry data etc., and it does this 8 times per node, once per core on every machine. The data is not modified by the program, so it can be regarded as read only. When the application is launched, it is launched on all nodes simultaneously, and especially now during debugging this is done very often (within minutes).

Up until now we have been using NFS to export the software from a central server, but we are not really happy with it. The performance is lousy, NFS keeps crashing the repository server, and when it does not crash, it regularly claims that random files do not exist, while they are obviously there because all the other nodes start up just fine.

Fortunately each of the nodes came with a hard drive which is not used right now, so I thought about replicating the software to each node and just reading everything in from there -> enter GlusterFS.

The setup I have created now looks like this:

            repository server (Node A)
              (AFR to nodes below)
                       /|\
                      / | \
                     /  |  \
                 ----   |   ----
                /       |       \
               /        |        \
          (Node B)    .....    50 x Node B
          (AFR to nodes below)
                       /|\
                      / | \
                     /  |  \
          (Node C)    ....    40 x Node C
              |                   |
           Local FS on Compute Nodes

The idea is that the repository server mounts a GlusterFS volume that is AFRed to the 50 nodes of type B, and each of the B type nodes has 40 of the compute nodes below it, AFRing again to the 40 nodes below it. This way, whenever a new release of the application is available, it just needs to be copied to the GlusterFS mount on the repository server and should be replicated to the 2000 compute nodes. If any parts of the new release need a "hotfix", they can just be modified directly on the repository server, without having to roll out the whole application to the compute nodes again. Since the data is not modified by the application, it does not interfere with the file replication scheme and I can just read it in from the locally mounted FS on the compute nodes.

Now here is the problem: whenever I start the GlusterFS client on Node A, all glusterfsd processes on the nodes of type B crash.
Log of the crash:

> 2008-12-15 11:32:28 D [tcp-server.c:145:tcp_server_notify] server: Registering socket (7) for new transport object of 10.128.2.2
> 2008-12-15 11:32:28 D [ip.c:120:gf_auth] head: allowed = "*", received ip addr = "10.128.2.2"
> 2008-12-15 11:32:28 D [server-protocol.c:5674:mop_setvolume] server: accepted client from 10.128.2.2:1023
> 2008-12-15 11:32:28 D [server-protocol.c:5717:mop_setvolume] server: creating inode table with lru_limit=1024, xlator=head
> 2008-12-15 11:32:28 D [inode.c:1163:inode_table_new] head: creating new inode table with lru_limit=1024, sizeof(inode_t)=156
> 2008-12-15 11:32:28 D [inode.c:577:__create_inode] head/inode: create inode(1)
> 2008-12-15 11:32:28 D [inode.c:367:__active_inode] head/inode: activating inode(1), lru=0/1024
> 2008-12-15 11:32:28 D [afr.c:950:afr_setxattr] head: AFRDEBUG:loc->path = /
>
> TLA Repo Revision: glusterfs--mainline--2.5--patch-797
> Time : 2008-12-15 11:32:28
> Signal Number : 11
>
> glusterfsd -f /home/rainer/sources/gluster/sw-farmctl.hlta01.vol -l /var/log/glusterfs/glusterfsd.log -L DEBUG
>
> volume server
>   type protocol/server
>   option auth.ip.head.allow *
>   option transport-type tcp/server
>   subvolumes head
> end-volume
>
> volume head
>   type cluster/afr
>   option debug on
>   subvolumes local-brick hlta0101-client
> end-volume
>
> volume hlta0101-client
>   type protocol/client
>   option remote-subvolume sw-brick
>   option remote-host hlta0101
>   option transport-type tcp/client
> end-volume
>
> volume local-brick
>   type storage/posix
>   option directory /localdisk/gluster/sw
> end-volume
>
> frame : type(1) op(19)
>
> /lib64/tls/libc.so.6[0x3d09b2e300]
> /lib64/tls/libc.so.6(memcpy+0x60)[0x3d09b725b0]
> /usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so(afr_setxattr+0x207)[0x2a95797e57]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_setxattr_resume+0xc6)[0x2a958b7846]
> /usr/lib64/libglusterfs.so.0(call_resume+0xf58)[0x3887f16af8]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_setxattr+0x2b1)[0x2a958b7b21]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_protocol_interpret+0x2f5)[0x2a958bc985]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(notify+0xef)[0x2a958bd5ef]
> /usr/lib64/libglusterfs.so.0(sys_epoll_iteration+0xe1)[0x3887f113c1]
> /usr/lib64/libglusterfs.so.0(poll_iteration+0x4a)[0x3887f10b0a]
> [glusterfs](main+0x418)[0x402658]
> /lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x3d09b1c40b]
> [glusterfs][0x401d8a]
> ---------

Just for testing I reduced the setup from a tree to a line of the kind (Node A) AFR -> (Node B) AFR -> (Node C) with just 3 servers, but it doesn't even want to run in this configuration. When I take out one of the AFRs and just put in tcp/client, it works. Interestingly enough, when GlusterFS is already running on Node A and I then start GlusterFS on Node B, everything works the way I would like to have it. But I can't really restart everything on level B whenever a new client is added to that FS.

If you have kept on reading until here, thanks for your attention and sorry for the long-winded mail. Here are the technical details of the software involved:

GlusterFS: version 1.3.12 on all nodes
OS: Linux 2.6.9-78.0.1 on Intel x86_64
Fuse: version 2.6.3-2

The interconnect between nodes is TCP/IP over Ethernet.
Thanks for your help and cheers,
Rainer

--
Rainer Schwemmer
PH Division, CERN
LHCb Experiment
CH-1211 Geneva 23
Telephone: [41] 22 767 31 25
Fax: [41] 22 767 94 25
E-mail: rainer.schwemmer at NOSPAM.cern.ch
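[For orientation, here is a minimal sketch of what the client volfile mounted on the repository server (Node A) in this scheme could look like, using the same volume-spec syntax as the Node B config quoted in the crash log above. The hostnames (hltb01, hltb02) are placeholders and only two subvolumes are shown; in the real setup the AFR volume would list all 50 B-node clients.]

# Hypothetical client spec for Node A (repository server); hostnames are placeholders.
volume b01-client
  type protocol/client
  option remote-host hltb01            # placeholder: first B node
  option remote-subvolume head         # the AFR volume exported by the B node
  option transport-type tcp/client
end-volume

volume b02-client
  type protocol/client
  option remote-host hltb02            # placeholder: second B node
  option remote-subvolume head
  option transport-type tcp/client
end-volume

# ... one protocol/client volume per B node ...

volume repo-afr
  type cluster/afr
  subvolumes b01-client b02-client     # plus the remaining B-node clients
end-volume

[Mounted with something like "glusterfs -f nodeA-client.vol /mnt/sw" (file name hypothetical), every write into /mnt/sw would fan out to the B tier and from there, via each B node's own AFR, down to the compute nodes.]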
Rainer,

Thank you for your interest in GlusterFS. I do not know of any user who has had an AFR configuration with 40-50 subvolumes, but there is no reason it shouldn't work. The write performance will obviously be quite low, but since you will not be making heavy/daily use of it (the only writes will be when you make a new release, if I understand correctly), that shouldn't be an issue.

The version of GlusterFS you're using (1.3.12) is rather old now. We have a new release, 1.4.0, in the final stages of testing. We haven't yet completely tested the AFR-over-AFR setup. You could either wait a few days (less than a week) for us to make the RC1 release with AFR-over-AFR tested, or grab the TLA repository version and give it a try.

Vikas
--
Engineer - Z Research
http://gluster.com/
> Basically we have a farm of 2000 machines which are running a certain
> application that, during start up, reads about 300 MB of data (out of a
> 6 GB repository) of program libraries, geometry data etc and this 8
> times per node. Once per core on every machine. The data is not modified
> by the program so it can be regarded as read only. When the application
> is launched it is launched on all nodes simultaneously and especially
> now during debugging this is done very often (within minutes).

You could use io-cache to help improve this situation. io-cache uses a weighted LRU for cache replacement, where weights can be assigned based on filename/wildcard pattern. This way you can 'force' this particular hot 300 MB to always be served off memory.

The other option is, as already discussed in this thread, to replicate (which comes with a write performance hit).

avati
Fwd'ing to list.

Rainer, do try our 1.4.0rc3 release for trying out io-cache.

avati
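[To make the io-cache suggestion concrete, a fragment roughly like the one below could be spliced into an existing spec. The translator type is performance/io-cache; the cache-size value, the priority patterns and the subvolume name ("head", borrowed from the config quoted earlier) are placeholders to adapt, and the exact option names should be checked against the 1.4.0rc3 documentation before use.]

volume sw-cache
  type performance/io-cache
  option cache-size 512MB              # illustrative: large enough to hold the hot ~300 MB
  option priority *.so:3,*.dat:2,*:1   # example patterns; higher values are retained in cache preferentially
  subvolumes head                      # whatever volume serves the repository at this point in the stack
end-volume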
Harald Stürzebecher
2008-Dec-15 21:24 UTC
[Gluster-users] Gluster crashes when cascading AFR
Hello!

2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> Hello all,
>
> I am trying to set up a file replication scheme for our cluster of about
> 2000 nodes. I'm not sure if what i am doing is actually feasible, so
> I'll best just start from the beginning and maybe one of you knows even
> a better way to do this.
>
> Basically we have a farm of 2000 machines which are running a certain
> application that, during start up, reads about 300 MB of data (out of a
> 6 GB repository) of program libraries, geometry data etc and this 8
> times per node. Once per core on every machine. The data is not modified
> by the program so it can be regarded as read only. When the application
> is launched it is launched on all nodes simultaneously and especially
> now during debugging this is done very often (within minutes).

[...]

> The interconnect between nodes is TCP/IP over Ethernet.

I apologize in advance for not saying much about advanced GlusterFS setups in this post. :-)

Before trying a multi-level AFR I'd rule out that a basic AFR setup would not be able to do the job. Try TSTTCPW (The Simplest Thing That Could Possibly Work) - and do some benchmarks. IMHO, anything faster than your NFS server would be an improvement.

One setup might be an AFR'd volume on node A and nodes B, exporting that to the clients like a server-side AFR (http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication). Using 20 nodes "B", each one would have ~100 clients.

Re-exporting the AFR'd GlusterFS volume over NFS would make changes to the client nodes unnecessary.

<different ideas>

When I read '2000 machines' and 'read only' I thought of this page:

http://wiki.systemimager.org/index.php/BitTorrent#Benchmark

Would it be possible to use some peer-to-peer software to distribute the program and data files to the local disks?

I don't have any experience with networks of that size, so I did some calculations using optimistic estimated values: given 300 MB data/core, 8 cores per node, 2000 nodes and one NFS server over Gigabit Ethernet estimated at a maximum of 100 MB/s, the data transfer for start up would take 3 s/core = 24 s/node = 48000 s total, ~13.3 hours. Is that anywhere near the time it really takes, or did I misread some information?

With 10 Gigabit Ethernet and an NFS server powerful enough to use it, that time might be reduced by a factor of 10 to ~1.3 hours.

Using Gigabit Ethernet and running BitTorrent on every node might download a 6 GB tar of the complete repository and unpack it to all the local disks within less than 2 hours. Using a compressed file might be a lot faster, depending on compression ratio.

Harald Stürzebecher
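[One possible reading of the server-side AFR suggestion, sketched in the same volume-spec syntax as above. The hostname (node-a), the volume names and the subvolume exported by node A (sw-brick) are placeholders, not taken from the linked wiki page.]

# Hypothetical spec for one "B" node doing server-side AFR against node A
volume local-brick
  type storage/posix
  option directory /localdisk/gluster/sw
end-volume

volume repo-client
  type protocol/client
  option remote-host node-a            # placeholder for the repository server
  option remote-subvolume sw-brick     # placeholder for the brick exported by node A
  option transport-type tcp/client
end-volume

volume sw-afr
  type cluster/afr
  subvolumes local-brick repo-client
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option auth.ip.sw-afr.allow *
  subvolumes sw-afr
end-volume

[Each group of ~100 compute nodes would then mount (or NFS-mount, if re-exported) its assigned B node and never talk to node A directly.]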
Here are my thoughts.

The value in having multiple tiers of AFR'ed nodes is simply to help aggregate bandwidth. Instead of having 2000 nodes fighting for data over the same network and disk, you're distributing that across multiple nodes.

So I think your architecture is valid. You could use client-side caching, as has been suggested, to increase performance on the clients so they're reading from local disk instead of over the network all the time.

As I understand it, there's still going to be some file access performance issues. Your client requests a file from its server. That server checks with the other 49 in its AFR config to ensure it's got the latest version of the file (or it auto-heals). Since this is actually a filesystem that it's getting from another server, that server also checks with its 49 peers. This should be fine, although what I'm not clear on is whether or not all 49 in the first tier are going to cause the same 49 checks. My guess is some of them will be cached since they'll be redundant, but I've no idea how long this will take.

Perhaps one of the devs can chime in on this aspect.

I'd think that since you're in a primarily read environment, you'll ultimately still benefit over a single NFS server, because once the AFR checks/auto-heal are done you have fewer clients competing for bandwidth, so things will ultimately be much faster. You may just have longer delays before the data starts moving if you're monitoring your port traffic on the clients, but in the long run (file request to file delivery time) you'll be better off.

At 01:24 PM 12/15/2008, Harald Stürzebecher wrote:
>Hello!
>
>2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> > Hello all,
> >
> > I am trying to set up a file replication scheme for our cluster of about
> > 2000 nodes. I'm not sure if what i am doing is actually feasible, so
> > I'll best just start from the beginning and maybe one of you knows even
> > a better way to do this.
> >
> > Basically we have a farm of 2000 machines which are running a certain
> > application that, during start up, reads about 300 MB of data (out of a
> > 6 GB repository) of program libraries, geometry data etc and this 8
> > times per node. Once per core on every machine. The data is not modified
> > by the program so it can be regarded as read only. When the application
> > is launched it is launched on all nodes simultaneously and especially
> > now during debugging this is done very often (within minutes).
>
>[...]
>
> > The interconnect between nodes is TCP/IP over Ethernet.
>
>I apologize in advance for not saying much about advanced GlusterFS
>setups in this post. :-)
>
>Before trying a multi-level AFR I'd rule out that a basic AFR setup
>would not be able to do the job. Try TSTTCPW (The Simplest Thing That
>Could Possibly Work) - and do some benchmarks. IMHO, anything faster
>than your NFS server would be an improvement.
>
>One setup might be an AFR'd volume on node A and nodes B and exporting
>that to the clients like a server side AFR.
>(http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication)
>Using 20 nodes "B", each one would have ~100 clients.
>
>Reexporting the AFR'd GlusterFS volume over NFS would make changes to
>the client nodes unnecessary.
>
><different ideas>
>
>When I read '2000 machines' and 'read only' I thought of this page:
>
>http://wiki.systemimager.org/index.php/BitTorrent#Benchmark
>
>Would it be possible to use some peer-to-peer software to distribute
>the program and data files to the local disks?
>
>I don't have any experience with networks of that size so I did some
>calculations using optimistic estimated values:
>Given 300MB data/core, 8 cores per node, 2000 nodes and one NFS server
>over Gigabit Ethernet estimated at a maximum of 100MB/s the data
>transfer for start up would take 3s/core = 24s/node = 48000s total
>~13.3 hours.
>Is that anywhere near the time it really takes or did I misread some
>information?
>
>With 10 Gigabit Ethernet and a NFS server powerful enough to use it
>that time might be reduced by a factor of 10 to ~1.3 hours.
>
>Using Gigabit Ethernet and running bittorrent on every node might
>download a 6GB tar of the complete repository and unpack it to all the
>local disks within less than 2 hours. Using a compressed file might be
>a lot faster, depending on compression ratio.
>
>Harald Stürzebecher
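[For the client-side caching mentioned above, a compute-node client spec could, for example, stack a read-ahead and an io-cache translator on top of the protocol/client connection to its assigned B node. The sketch below is illustrative only; the hostname and volume names are placeholders and translator option defaults differ between releases.]

# Hypothetical compute-node client spec with caching translators (names are placeholders)
volume b-node
  type protocol/client
  option remote-host hltb01            # placeholder: the B node serving this client
  option remote-subvolume head
  option transport-type tcp/client
end-volume

volume readahead
  type performance/read-ahead
  subvolumes b-node
end-volume

volume iocache
  type performance/io-cache
  option cache-size 512MB              # illustrative value
  subvolumes readahead
end-volume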
Hello all,

Thanks for all the suggestions so far.

We did consider BitTorrent. Unfortunately it takes quite a lot of time to index the whole repository, and it makes changing single files rather cumbersome if people need to apply a quick fix to a certain component in the repository.

I did try the replication just to the 50 nodes of the first tier, and this seems to work fairly well. In case I can't get the two-tiered setup to work, this is the solution I'll go for.

There seems to be a bit of confusion about what I'm trying to do, or I did not understand the caching part that some of you are suggesting. The plan is to use AFR to write a copy of the repository onto the local disks of each of the 2000 cluster nodes. Since GlusterFS uses the underlying ext3 file system and just puts the AFRed files onto the disks, I should be able to read the repository data directly via ext3 on the cluster nodes once replication is completed. This way I can also use the Linux built-in FS cache. I would just use the root node of the hierarchy to throw in new files to be replicated on all the cluster nodes when necessary.

Cheers,
Rainer

On Mon, 2008-12-15 at 14:09 -0800, Keith Freedman wrote:
> Here are my thoughts.
>
> The value in having multiple tiers of AFR'ed nodes is simply to help aggregate bandwidth. Instead of having 2000 nodes fighting for data over the same network and disk, you're distributing that across multiple nodes.
>
> So I think your architecture is valid. You could use client-side caching, as has been suggested, to increase performance on the clients so they're reading from local disk instead of over the network all the time.
>
> As I understand it, there's still going to be some file access performance issues. Your client requests a file from its server. That server checks with the other 49 in its AFR config to ensure it's got the latest version of the file (or it auto-heals). Since this is actually a filesystem that it's getting from another server, that server also checks with its 49 peers. This should be fine, although what I'm not clear on is whether or not all 49 in the first tier are going to cause the same 49 checks. My guess is some of them will be cached since they'll be redundant, but I've no idea how long this will take.
>
> Perhaps one of the devs can chime in on this aspect.
>
> I'd think that since you're in a primarily read environment, you'll ultimately still benefit over a single NFS server, because once the AFR checks/auto-heal are done you have fewer clients competing for bandwidth, so things will ultimately be much faster. You may just have longer delays before the data starts moving if you're monitoring your port traffic on the clients, but in the long run (file request to file delivery time) you'll be better off.
>
> At 01:24 PM 12/15/2008, Harald Stürzebecher wrote:
> >[...]
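[To make the intended two-tier data flow concrete: on each compute node (Node C), the GlusterFS side would only need to export the local disk as a plain brick, roughly as sketched below. The names and path mirror the config shown in the crash log (sw-brick, /localdisk/gluster/sw) but are otherwise illustrative; the B node above AFRs into this brick, while the application reads the replicated files straight from the underlying ext3 directory, so the Linux page cache is used and no GlusterFS client has to run locally for reads.]

# Hypothetical server spec for a compute node (Node C)
volume sw-brick
  type storage/posix
  option directory /localdisk/gluster/sw
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option auth.ip.sw-brick.allow *
  subvolumes sw-brick
end-volume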