Hi All,

I'm experiencing an extremely slow auto-heal rate with GlusterFS and I want to hear from you guys whether it seems reasonable or whether something's wrong.

I did a reconfiguration of the GlusterFS setup running on our lab cluster. Originally we had 25 nodes, each providing one 1.5T SATA disk, and the data were not replicated. Now we have expanded the cluster to 66 nodes, and I decided to use all 4 disks of each node and have the data replicated 3 times on different machines. What I did was leave the original data untouched and reconfigure GlusterFS so that the original data volumes are paired up with new, empty mirror volumes. On the server side, each node exports one volume for each of the 4 disks, with an io-threads translator of 4 threads running on top of each disk and no other performance translators. On the client side, which is mounted on each of the 66 nodes, I group all the volumes into mirrors of 3 volumes each, aggregate the mirrors with one DHT, and put a write-behind translator with a 1MB window size on top of that (a rough sketch of the volfiles follows this message).

The files are all images, roughly 200K each.

To trigger auto-heal, I split a list of all the files (saved before the reconfiguration) among the 66 nodes and have 4 threads on each node running stat on the files. The overall rate is about 50 files per second, which I think is very low. And while the auto-heal is running, all operations like cd and ls on the GlusterFS client become extremely slow; each takes about a minute to finish.

On http://www.gluster.org/docs/index.php/GlusterFS_2.0_I/O_Benchmark_Results, which uses 5 servers with one RAID0 disk each and 5 threads running on 3 clients, about 61K 1MB files can be created within 10 minutes, at an average rate of 130 files/second. Meanwhile, our glusterfsd processes are each taking about 99% CPU time (we have 8 cores per node, but it seems glusterfsd is able to use only 1).

Their only advantage is that they use RAID0 disks and that their clients and servers are separate machines. Other than that, we have more servers/clients and more disks per node, I have configured io-threads and write-behind (which I don't think helps auto-heal), and our files are only about 1/5 their size. Even if I count each file as 3 because of replication, I'm only achieving throughput similar to a much smaller cluster. I don't know why, and the following are my hypotheses:

1. The overall throughput of GlusterFS doesn't scale up to this many nodes;
2. The auto-heal operation is slower than creating the files from scratch;
3. GlusterFS is CPU bound -- it seems to be the case, but I'm unwilling to accept that a filesystem is CPU bound;
4. The network bandwidth is saturated. I think we have 1-gigabit ethernet on the nodes, which are on two racks connected by a Foundry switch with 4 gigabit of aggregate inter-rack bandwidth;
5. Replication is slow.

Finally, the good news is that all the server and client processes have been running for about 2 hours under stress, which I didn't expect at the beginning. Good job, you GlusterFS people!

We are desperate for shared storage, and I'm eager to hear any suggestions you have to make GlusterFS perform better.

- Wei
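For reference, here is a minimal sketch of what the server- and client-side volfiles for the layout described above might look like. The hostnames, paths, and volume names are invented for illustration; only the 4-thread io-threads translator, the 3-way replicate / DHT layering, and the 1MB write-behind window come from the message itself, and exact option spellings may vary between GlusterFS 2.0.x releases.

  # Server side (one such chain per physical disk, 4 per node; paths and names are hypothetical)
  volume posix-disk1
    type storage/posix
    option directory /export/disk1         # backing directory on one 1.5T SATA disk
  end-volume

  volume iot-disk1
    type performance/io-threads
    option thread-count 4                  # 4 I/O threads per disk, as described above
    subvolumes posix-disk1
  end-volume

  # ... repeat the posix + io-threads pair for disk2..disk4 ...

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.iot-disk1.allow *     # one auth line per exported brick; tighten for production
    subvolumes iot-disk1                   # plus iot-disk2 iot-disk3 iot-disk4
  end-volume

  # Client side (only one 3-way mirror shown; 66 nodes x 4 disks = 264 bricks, i.e. 88 mirrors)
  volume node01-disk1
    type protocol/client
    option transport-type tcp
    option remote-host node01              # hypothetical hostname
    option remote-subvolume iot-disk1
  end-volume

  # ... node02-disk1 and node03-disk1 defined the same way ...

  volume mirror-01
    type cluster/replicate                 # 3 copies on 3 different machines
    subvolumes node01-disk1 node02-disk1 node03-disk1
  end-volume

  # ... mirror-02 .. mirror-88 ...

  volume distribute
    type cluster/distribute                # DHT across all mirrors
    subvolumes mirror-01 mirror-02         # ... through mirror-88
  end-volume

  volume writebehind
    type performance/write-behind
    option window-size 1MB                 # as stated in the message
    subvolumes distribute
  end-volume

The client volfile would back the FUSE mount on each of the 66 nodes (started with something like glusterfs -f client.vol /mnt/gluster), with writebehind as the top-most volume.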
There's one thing obviously wrong in the previous email: GlusterFS is able to use more than one CPU core, because we are running 4x4 I/O threads on each node. I got the 99% number from the ganglia output, which is not correct. Sorry for that.

- Wei

Wei Dong wrote:
> Hi All,
>
> I'm experiencing extremely slow auto-heal rate with glusterfs and I
> want to hear from you guys to see if it seems reasonable or
> something's wrong.
> [...]
On Thu, Aug 20, 2009 at 12:05 PM, Wei Dong <wdong.pku at gmail.com> wrote:
> I'm experiencing extremely slow auto-heal rate with glusterfs and I want to
> hear from you guys to see if it seems reasonable or something's wrong.
> [...]

Just as a step in debugging - can you try with only one export per server? And in the next step, run 4 servers on 4 different ports, each server exporting just one posix export. Can you tell us if both those steps result in the same low performance?

Also, do you have comparison performance numbers of a similar workload from when you had the smaller configuration?

Avati
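To make the second suggestion concrete, here is a rough sketch of what one of the four single-export server volfiles might look like, together with the matching client fragment. The names, paths, and port numbers are invented, and the port-related option spellings are an assumption that may differ between GlusterFS releases, so treat this as an illustration of the idea (one glusterfsd process per disk, each on its own port) rather than a recipe.

  # export-disk1.vol -- run by its own glusterfsd process, e.g. glusterfsd -f export-disk1.vol;
  # disks 2-4 get their own copies of this file with their own ports
  volume posix-disk1
    type storage/posix
    option directory /export/disk1
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option transport.socket.listen-port 6996   # assumed option name; e.g. disk2 -> 6997, disk3 -> 6998, disk4 -> 6999
    option auth.addr.posix-disk1.allow *
    subvolumes posix-disk1
  end-volume

  # Matching client-side fragment: each protocol/client points at the right node and port
  volume node01-disk1
    type protocol/client
    option transport-type tcp
    option remote-host node01
    option remote-port 6996                     # assumed option name; must match the listen-port above
    option remote-subvolume posix-disk1
  end-volume

The first suggestion (a single posix export per server) would just be the server volfile above without the per-port handling and with only one brick behind protocol/server on each node.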