Hello all,

I am trying to set up a file replication scheme for our cluster of about 2000 nodes. I'm not sure if what I am doing is actually feasible, so I'll best just start from the beginning and maybe one of you knows an even better way to do this.

Basically we have a farm of 2000 machines running a certain application that, during start up, reads about 300 MB of data (out of a 6 GB repository) of program libraries, geometry data etc., and it does this 8 times per node, once per core on every machine. The data is not modified by the program, so it can be regarded as read only. When the application is launched, it is launched on all nodes simultaneously, and especially now during debugging this is done very often (within minutes).

Up until now we have been using NFS to export the software from a central server, but we are not really happy with it. The performance is lousy, NFS keeps crashing the repository server, and when it does not crash, it regularly claims that random files do not exist, while they are obviously there because all the other nodes start up just fine.

Fortunately each of the nodes came with a hard drive which is not used right now, so I thought about replicating the software to each node and just reading everything in from there -> enter GlusterFS.

The setup I have created now looks like this:

            repository server (Node A)
              (AFR to nodes below)
                       /|\
                      / | \
                     /  |  \
                 ----   |   ----
                /       |       \
               /        |        \
          (Node B)    .....    50 x Node B
          (AFR to nodes below)
                       /|\
                      / | \
                     /  |  \
          (Node C)    ....    40 x Node C
              |                   |
           Local FS on Compute Nodes

The idea is that the repository server mounts a GlusterFS volume that is AFRed to the 50 nodes of type B, and each of the B type nodes has 40 of the compute nodes below it, AFRing again to the 40 nodes below it. This way, whenever a new release of the application is available, it just needs to be copied to the GlusterFS mount on the repository server and should be replicated to the 2000 compute nodes. If any parts of the new release need a "hotfix", they can just be modified directly on the repository server, without having to roll out the whole application to the compute nodes again. Since the data is not modified by the application, it does not interfere with the file replication scheme and I can just read it in from the locally mounted FS on the compute nodes.

Now here is the problem: whenever I start the GlusterFS client on Node A, all glusterfsd processes on the nodes of type B crash.
Log of the crash:

> 2008-12-15 11:32:28 D [tcp-server.c:145:tcp_server_notify] server: Registering socket (7) for new transport object of 10.128.2.2
> 2008-12-15 11:32:28 D [ip.c:120:gf_auth] head: allowed = "*", received ip addr = "10.128.2.2"
> 2008-12-15 11:32:28 D [server-protocol.c:5674:mop_setvolume] server: accepted client from 10.128.2.2:1023
> 2008-12-15 11:32:28 D [server-protocol.c:5717:mop_setvolume] server: creating inode table with lru_limit=1024, xlator=head
> 2008-12-15 11:32:28 D [inode.c:1163:inode_table_new] head: creating new inode table with lru_limit=1024, sizeof(inode_t)=156
> 2008-12-15 11:32:28 D [inode.c:577:__create_inode] head/inode: create inode(1)
> 2008-12-15 11:32:28 D [inode.c:367:__active_inode] head/inode: activating inode(1), lru=0/1024
> 2008-12-15 11:32:28 D [afr.c:950:afr_setxattr] head: AFRDEBUG:loc->path = /
>
> TLA Repo Revision: glusterfs--mainline--2.5--patch-797
> Time : 2008-12-15 11:32:28
> Signal Number : 11
>
> glusterfsd -f /home/rainer/sources/gluster/sw-farmctl.hlta01.vol -l /var/log/glusterfs/glusterfsd.log -L DEBUG
>
> volume server
>   type protocol/server
>   option auth.ip.head.allow *
>   option transport-type tcp/server
>   subvolumes head
> end-volume
>
> volume head
>   type cluster/afr
>   option debug on
>   subvolumes local-brick hlta0101-client
> end-volume
>
> volume hlta0101-client
>   type protocol/client
>   option remote-subvolume sw-brick
>   option remote-host hlta0101
>   option transport-type tcp/client
> end-volume
>
> volume local-brick
>   type storage/posix
>   option directory /localdisk/gluster/sw
> end-volume
>
> frame : type(1) op(19)
>
> /lib64/tls/libc.so.6[0x3d09b2e300]
> /lib64/tls/libc.so.6(memcpy+0x60)[0x3d09b725b0]
> /usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so(afr_setxattr+0x207)[0x2a95797e57]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_setxattr_resume+0xc6)[0x2a958b7846]
> /usr/lib64/libglusterfs.so.0(call_resume+0xf58)[0x3887f16af8]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_setxattr+0x2b1)[0x2a958b7b21]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_protocol_interpret+0x2f5)[0x2a958bc985]
> /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(notify+0xef)[0x2a958bd5ef]
> /usr/lib64/libglusterfs.so.0(sys_epoll_iteration+0xe1)[0x3887f113c1]
> /usr/lib64/libglusterfs.so.0(poll_iteration+0x4a)[0x3887f10b0a]
> [glusterfs](main+0x418)[0x402658]
> /lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x3d09b1c40b]
> [glusterfs][0x401d8a]
> ---------

Just for testing I reduced the setup from a tree to a line of the kind (Node A) AFR -> (Node B) AFR -> (Node C) with just 3 servers, but it doesn't even want to run in this configuration. When I take out one of the AFRs and just put in tcp/client, it works. Interestingly enough, when GlusterFS is already running on Node A and I then start GlusterFS on Node B, everything works the way I would like to have it. But I can't really restart everything on level B whenever a new client is added to that FS.

If you have kept on reading until here, thanks for your attention and sorry for the long-winded mail. Here are the technical details of the software involved:

GlusterFS: version 1.3.12 on all nodes
OS: Linux 2.6.9-78.0.1 on Intel x86_64
Fuse: version 2.6.3-2

The interconnect between nodes is TCP/IP over Ethernet.
Thanks for your help and cheers,
Rainer

--
Rainer Schwemmer
PH Division, CERN
LHCb Experiment
CH-1211 Geneva 23
Telephone: [41] 22 767 31 25
Fax: [41] 22 767 94 25
E-mail: rainer.schwemmer at NOSPAM.cern.ch
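[For orientation, here is a minimal sketch of what the client volfile mounted on the repository server (Node A) in this scheme could look like, using the same volume-spec syntax as the Node B config quoted in the crash log above. The hostnames (hltb01, hltb02) are placeholders and only two subvolumes are shown; in the real setup the AFR volume would list all 50 B-node clients.]

# Hypothetical client spec for Node A (repository server); hostnames are placeholders.
volume b01-client
  type protocol/client
  option remote-host hltb01            # placeholder: first B node
  option remote-subvolume head         # the AFR volume exported by the B node
  option transport-type tcp/client
end-volume

volume b02-client
  type protocol/client
  option remote-host hltb02            # placeholder: second B node
  option remote-subvolume head
  option transport-type tcp/client
end-volume

# ... one protocol/client volume per B node ...

volume repo-afr
  type cluster/afr
  subvolumes b01-client b02-client     # plus the remaining B-node clients
end-volume

[Mounted with something like "glusterfs -f nodeA-client.vol /mnt/sw" (file name hypothetical), every write into /mnt/sw would fan out to the B tier and from there, via each B node's own AFR, down to the compute nodes.]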
Rainer,

Thank you for your interest in GlusterFS. I do not know of any user who has had an AFR configuration with 40-50 subvolumes, but there is no reason it shouldn't work. The write performance will obviously be quite low, but since you will not be making heavy/daily use of it (the only writes will be when you make a new release, if I understand correctly), that shouldn't be an issue.

The version of GlusterFS you're using (1.3.12) is rather old now. We have a new release, 1.4.0, in the final stages of testing. We haven't yet completely tested the AFR-over-AFR setup. You could either wait a few days (less than a week) for us to make the RC1 release with AFR-over-AFR tested, or grab the TLA repository version and give it a try.

Vikas
--
Engineer - Z Research
http://gluster.com/
> Basically we have a farm of 2000 machines which are running a certain
> application that, during start up, reads about 300 MB of data (out of a
> 6 GB repository) of program libraries, geometry data etc and this 8
> times per node. Once per core on every machine. The data is not modified
> by the program so it can be regarded as read only. When the application
> is launched it is launched on all nodes simultaneously and especially
> now during debugging this is done very often (within minutes).

You could use io-cache to help improve this situation. io-cache uses a weighted LRU for cache replacement, where weights can be assigned based on filename/wildcard pattern. This way you can 'force' this particular hot 300 MB to always be served off memory.

The other option is, as already discussed in this thread, to replicate (which comes with a write performance hit).

avati
Fwd'ing to list.

Rainer, do try our 1.4.0rc3 release for trying out io-cache.

avati
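[To make the io-cache suggestion concrete, a fragment roughly like the one below could be spliced into an existing spec. The translator type is performance/io-cache; the cache-size value, the priority patterns and the subvolume name ("head", borrowed from the config quoted earlier) are placeholders to adapt, and the exact option names should be checked against the 1.4.0rc3 documentation before use.]

volume sw-cache
  type performance/io-cache
  option cache-size 512MB              # illustrative: large enough to hold the hot ~300 MB
  option priority *.so:3,*.dat:2,*:1   # example patterns; higher values are retained in cache preferentially
  subvolumes head                      # whatever volume serves the repository at this point in the stack
end-volume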
Harald Stürzebecher
2008-Dec-15 21:24 UTC
[Gluster-users] Gluster crashes when cascading AFR
Hello!

2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> Hello all,
>
> I am trying to set up a file replication scheme for our cluster of about
> 2000 nodes. I'm not sure if what i am doing is actually feasible, so
> I'll best just start from the beginning and maybe one of you knows even
> a better way to do this.
>
> Basically we have a farm of 2000 machines which are running a certain
> application that, during start up, reads about 300 MB of data (out of a
> 6 GB repository) of program libraries, geometry data etc and this 8
> times per node. Once per core on every machine. The data is not modified
> by the program so it can be regarded as read only. When the application
> is launched it is launched on all nodes simultaneously and especially
> now during debugging this is done very often (within minutes).

[...]

> The interconnect between nodes is TCP/IP over Ethernet.

I apologize in advance for not saying much about advanced GlusterFS setups in this post. :-)

Before trying a multi-level AFR I'd rule out that a basic AFR setup would not be able to do the job. Try TSTTCPW (The Simplest Thing That Could Possibly Work) - and do some benchmarks. IMHO, anything faster than your NFS server would be an improvement.

One setup might be an AFR'd volume on node A and nodes B, exporting that to the clients like a server-side AFR (http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication). Using 20 nodes "B", each one would have ~100 clients.

Re-exporting the AFR'd GlusterFS volume over NFS would make changes to the client nodes unnecessary.

<different ideas>

When I read '2000 machines' and 'read only' I thought of this page:

http://wiki.systemimager.org/index.php/BitTorrent#Benchmark

Would it be possible to use some peer-to-peer software to distribute the program and data files to the local disks?

I don't have any experience with networks of that size, so I did some calculations using optimistic estimated values: given 300 MB data/core, 8 cores per node, 2000 nodes and one NFS server over Gigabit Ethernet estimated at a maximum of 100 MB/s, the data transfer for start up would take 3 s/core = 24 s/node = 48000 s total, ~13.3 hours. Is that anywhere near the time it really takes, or did I misread some information?

With 10 Gigabit Ethernet and an NFS server powerful enough to use it, that time might be reduced by a factor of 10 to ~1.3 hours.

Using Gigabit Ethernet and running BitTorrent on every node might download a 6 GB tar of the complete repository and unpack it to all the local disks within less than 2 hours. Using a compressed file might be a lot faster, depending on compression ratio.

Harald Stürzebecher
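[One possible reading of the server-side AFR suggestion, sketched in the same volume-spec syntax as above. The hostname (node-a), the volume names and the subvolume exported by node A (sw-brick) are placeholders, not taken from the linked wiki page.]

# Hypothetical spec for one "B" node doing server-side AFR against node A
volume local-brick
  type storage/posix
  option directory /localdisk/gluster/sw
end-volume

volume repo-client
  type protocol/client
  option remote-host node-a            # placeholder for the repository server
  option remote-subvolume sw-brick     # placeholder for the brick exported by node A
  option transport-type tcp/client
end-volume

volume sw-afr
  type cluster/afr
  subvolumes local-brick repo-client
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option auth.ip.sw-afr.allow *
  subvolumes sw-afr
end-volume

[Each group of ~100 compute nodes would then mount (or NFS-mount, if re-exported) its assigned B node and never talk to node A directly.]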
Here are my thoughts.

The value in having multiple tiers of AFR'ed nodes is simply to help aggregate bandwidth. Instead of having 2000 nodes fighting for data over the same network and disk, you're distributing that across multiple nodes.

So I think your architecture is valid. You could use client-side caching, as has been suggested, to increase performance on the clients so they're reading from local disk instead of over the network all the time.

As I understand it, there's still going to be some file access performance issues. Your client requests a file from its server. That server checks with the other 49 in its AFR config to ensure it's got the latest version of the file (or it auto-heals). Since this is actually a filesystem that it's getting from another server, that server also checks with its 49 peers. This should be fine, although what I'm not clear on is whether or not all 49 in the first tier are going to cause the same 49 checks. My guess is some of them will be cached since they'll be redundant, but I've no idea how long this will take.

Perhaps one of the devs can chime in on this aspect.

I'd think that since you're in a primarily read environment, you'll ultimately still benefit over a single NFS server, because once the AFR checks/auto-heal are done you have fewer clients competing for bandwidth, so things will ultimately be much faster. You may just have longer delays before the data starts moving if you're monitoring your port traffic on the clients, but in the long run (file request to file delivery time) you'll be better off.

At 01:24 PM 12/15/2008, Harald Stürzebecher wrote:
>Hello!
>
>2008/12/15 Rainer Schwemmer <rainer.schwemmer at cern.ch>:
> > Hello all,
> >
> > I am trying to set up a file replication scheme for our cluster of about
> > 2000 nodes. I'm not sure if what i am doing is actually feasible, so
> > I'll best just start from the beginning and maybe one of you knows even
> > a better way to do this.
> >
> > Basically we have a farm of 2000 machines which are running a certain
> > application that, during start up, reads about 300 MB of data (out of a
> > 6 GB repository) of program libraries, geometry data etc and this 8
> > times per node. Once per core on every machine. The data is not modified
> > by the program so it can be regarded as read only. When the application
> > is launched it is launched on all nodes simultaneously and especially
> > now during debugging this is done very often (within minutes).
>
>[...]
>
> > The interconnect between nodes is TCP/IP over Ethernet.
>
>I apologize in advance for not saying much about advanced GlusterFS
>setups in this post. :-)
>
>Before trying a multi-level AFR I'd rule out that a basic AFR setup
>would not be able to do the job. Try TSTTCPW (The Simplest Thing That
>Could Possibly Work) - and do some benchmarks. IMHO, anything faster
>than your NFS server would be an improvement.
>
>One setup might be an AFR'd volume on node A and nodes B and exporting
>that to the clients like a server side AFR.
>(http://www.gluster.org/docs/index.php/Setting_up_AFR_on_two_servers_with_server_side_replication)
>Using 20 nodes "B", each one would have ~100 clients.
>
>Reexporting the AFR'd GlusterFS volume over NFS would make changes to
>the client nodes unnecessary.
>
><different ideas>
>
>When I read '2000 machines' and 'read only' I thought of this page:
>
>http://wiki.systemimager.org/index.php/BitTorrent#Benchmark
>
>Would it be possible to use some peer-to-peer software to distribute
>the program and data files to the local disks?
>
>I don't have any experience with networks of that size so I did some
>calculations using optimistic estimated values:
>Given 300MB data/core, 8 cores per node, 2000 nodes and one NFS server
>over Gigabit Ethernet estimated at a maximum of 100MB/s the data
>transfer for start up would take 3s/core = 24s/node = 48000s total
>~13.3 hours.
>Is that anywhere near the time it really takes or did I misread some
>information?
>
>With 10 Gigabit Ethernet and a NFS server powerful enough to use it
>that time might be reduced by a factor of 10 to ~1.3 hours.
>
>Using Gigabit Ethernet and running bittorrent on every node might
>download a 6GB tar of the complete repository and unpack it to all the
>local disks within less than 2 hours. Using a compressed file might be
>a lot faster, depending on compression ratio.
>
>Harald Stürzebecher
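[For the client-side caching mentioned above, a compute-node client spec could, for example, stack a read-ahead and an io-cache translator on top of the protocol/client connection to its assigned B node. The sketch below is illustrative only; the hostname and volume names are placeholders and translator option defaults differ between releases.]

# Hypothetical compute-node client spec with caching translators (names are placeholders)
volume b-node
  type protocol/client
  option remote-host hltb01            # placeholder: the B node serving this client
  option remote-subvolume head
  option transport-type tcp/client
end-volume

volume readahead
  type performance/read-ahead
  subvolumes b-node
end-volume

volume iocache
  type performance/io-cache
  option cache-size 512MB              # illustrative value
  subvolumes readahead
end-volume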
Hello all,

Thanks for all the suggestions so far.

We did consider BitTorrent. Unfortunately it takes quite a lot of time to index the whole repository, and it makes changing single files rather cumbersome if people need to apply a quick fix to a certain component in the repository.

I did try the replication just to the 50 nodes of the first tier, and this seems to work fairly well. In case I can't get the two-tiered setup to work, this is the solution I'll go for.

There seems to be a bit of confusion about what I'm trying to do, or I did not understand the caching part that some of you are suggesting. The plan is to use AFR to write a copy of the repository onto the local disks of each of the 2000 cluster nodes. Since GlusterFS uses the underlying ext3 file system and just puts the AFRed files onto the disks, I should be able to read the repository data directly via ext3 on the cluster nodes once replication is completed. This way I can also use the Linux built-in FS cache. I would just use the root node of the hierarchy to throw in new files to be replicated on all the cluster nodes when necessary.

Cheers,
Rainer

On Mon, 2008-12-15 at 14:09 -0800, Keith Freedman wrote:
> Here are my thoughts.
>
> The value in having multiple tiers of AFR'ed nodes is simply to help aggregate bandwidth. Instead of having 2000 nodes fighting for data over the same network and disk, you're distributing that across multiple nodes.
>
> So I think your architecture is valid. You could use client-side caching, as has been suggested, to increase performance on the clients so they're reading from local disk instead of over the network all the time.
>
> As I understand it, there's still going to be some file access performance issues. Your client requests a file from its server. That server checks with the other 49 in its AFR config to ensure it's got the latest version of the file (or it auto-heals). Since this is actually a filesystem that it's getting from another server, that server also checks with its 49 peers. This should be fine, although what I'm not clear on is whether or not all 49 in the first tier are going to cause the same 49 checks. My guess is some of them will be cached since they'll be redundant, but I've no idea how long this will take.
>
> Perhaps one of the devs can chime in on this aspect.
>
> I'd think that since you're in a primarily read environment, you'll ultimately still benefit over a single NFS server, because once the AFR checks/auto-heal are done you have fewer clients competing for bandwidth, so things will ultimately be much faster. You may just have longer delays before the data starts moving if you're monitoring your port traffic on the clients, but in the long run (file request to file delivery time) you'll be better off.
>
> At 01:24 PM 12/15/2008, Harald Stürzebecher wrote:
> >[...]
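[To make the intended two-tier data flow concrete: on each compute node (Node C), the GlusterFS side would only need to export the local disk as a plain brick, roughly as sketched below. The names and path mirror the config shown in the crash log (sw-brick, /localdisk/gluster/sw) but are otherwise illustrative; the B node above AFRs into this brick, while the application reads the replicated files straight from the underlying ext3 directory, so the Linux page cache is used and no GlusterFS client has to run locally for reads.]

# Hypothetical server spec for a compute node (Node C)
volume sw-brick
  type storage/posix
  option directory /localdisk/gluster/sw
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  option auth.ip.sw-brick.allow *
  subvolumes sw-brick
end-volume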