I'm working with a new Lustre installation and I'm running into problems. I'm running iozone tests across multiple clients, and when I exceed more than 2, sometimes 3, clients, I start getting a lot of evictions from the server. An example of the error can be found at:

http://www.thedudeminds.com/lustre.txt

Right now I suspect it's possibly network related, although I can't really say this for sure based on the errors.

I'm launching iozone using c3 from a head node. Here are my options:

    cexec rlx-2-6:3-5 '/opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g -f /mnt/lustre/stripe_test/testfile.'\''$$'\'''

Any help is appreciated.

-jeremy
The errors are indicative of a client eviction by a server for lack of responsiveness, potentially due to a network error (though that's not indicated) or potentially due to a timeout within the Lustre DLM.

I suspect the intention was to incorporate the pid in the file name. In that case I think the command may not do what was intended. Replacing 'cexec' with 'ssh' and 'iozone' with 'touch', I can see on a system here that files are created with names like 'testfile.$$' (a literal '$$'). Thus on my system each client would be writing to the same file at the same location, relatively synchronously. Was this what was really intended, or perhaps cexec has different semantics?

- Tom

---
Tom Hancock, Hewlett Packard, Galway. +353-91-754765
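A minimal sketch of the quoting behavior Tom describes, assuming bash on both ends ("nodeA" is a hypothetical host; the path is from the thread):

    # The '\'' sequences re-quote $$ for the remote shell, so it is never
    # expanded -- every node touches the same literal name:
    ssh nodeA 'touch /mnt/lustre/stripe_test/testfile.'\''$$'\'''
    # -> creates testfile.$$ (a literal "$$")

    # Without the extra quoting, the remote shell expands its own pid,
    # giving a unique name per node:
    ssh nodeA 'touch /mnt/lustre/stripe_test/testfile.$$'
    # -> creates e.g. testfile.12345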
Jeremy,

I'm wondering if you were intending to gather system-wide throughput? If so, you may wish to let Iozone handle the launch of remote execution.

Iozone's -+m option is intended to be used in a cluster environment and provides stonewalling (thread-straggler elimination), unique file names, and aggregation of the results.

Example (a simple two-node system, "hostA" and "hostB", where each node has 1 Gig of RAM):

    iozone -r 64 -s 2g -t 2 -+m control_file

and the contents of control_file:

    hostA /mnt/lustre/stripe_test /opt/iozone/bin/iozone
    hostB /mnt/lustre/stripe_test /opt/iozone/bin/iozone

In this example Iozone will start two copies of itself, one instance on hostA and the other on hostB. Each instance will test with 64-Kbyte transfers, for a file size of 2 Gig. Since Iozone is coordinating the execution, the file names that Iozone uses for testing will be automatically generated and non-conflicting. These test files would be created in /mnt/lustre/stripe_test (from control_file above). Each test would be started at a barrier, and would finish when the first child is complete. Throughput would then be calculated and returned as results.

Note: Iozone also supports multiple execution with a single shared test file. You may also choose to use file locking, record locking, or no locking at all, in this mode.

Note: Iozone presents a very heavy workload. On my Lustre systems, in my lab at my house, I've also seen some evictions when running Iozone. This happens when I push the system (with Iozone) to the point where the nodes are not able to respond (due to high network loads, such as running Iozone on top of NFS, on top of Lustre, with large transfer sizes, on very large files, on a slow 100BT network). Once the congestion rises to a certain level, the nodes are unable to exchange keep-alive messages and evictions begin.
Once the load drops down, the nodes re-join, and life goes on. I believe there are some timeouts (in Lustre) that you can tweak to increase the amount of time that the nodes will wait for a keep-alive to come back. Adjusting this might reduce the evictions. It might also increase the time one would wait in the event of a real node failure... one of those trade-offs. Other things that I've noticed seem to help: more memory, a faster interconnect, more CPUs per node.

Hope that my observations are helpful...

Enjoy,
Don Capps
capps_at_iozone_dot_org
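A sketch of the tweak Don mentions, assuming a 1.4-era Lustre that exposes the global obd timeout under /proc (the exact path and default vary by version, so check your install):

    # Assumption: /proc/sys/lustre/timeout is present on this release.
    cat /proc/sys/lustre/timeout            # current RPC timeout, in seconds
    echo 300 > /proc/sys/lustre/timeout     # set on clients and servers; gives
                                            # busy nodes longer before eviction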
Perfect. I was unaware of this; I will try it asap. Although it seems that my invocation of iozone should be unrelated to the problems I've been seeing.

Thank you

-jeremy
To add to this: through a bit more testing, it seems I only get errors when running the test on a directory in which I've changed the stripe pattern to a stripe size of 4M using all available OSTs, which in this case is 30 OSTs. Running the same test using the default striping config seems to pass, but performance is very low. The default stripe config I believe is 1M and 1 OST per file.

-jeremy
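For reference, the striping change described above would look something like the following, assuming the 1.4-era positional lfs syntax (stripe size in bytes, stripe-start -1 = any OST, stripe-count -1 = all OSTs); later releases use option flags instead:

    # Assumes 1.4-era lfs; verify against your version's lfs help.
    lfs setstripe /mnt/lustre/stripe_test 4194304 -1 -1   # 4M stripes, all OSTs
    lfs getstripe /mnt/lustre/stripe_test                 # confirm the layout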
Cexec is from the c3 project. Basically you create a table of machines, and cexec will launch commands on those machines in parallel, using ssh. My intention is to run iozone tests across several clients in parallel using a different testfile on each client, hence the .$$ to make sure the files are different. So to answer your question, what I was doing was intentional.

As for network problems, this is definitely a possibility. We're in an environment where we do not control the switch we're connected to, which is changing very shortly.

Thanks

-jeremy
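Roughly what the cexec invocation amounts to, written as a plain ssh loop (node names here are hypothetical; c3 reads them from its cluster table and runs the commands concurrently):

    for node in node3 node4 node5; do
        ssh $node '/opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g \
            -f /mnt/lustre/stripe_test/testfile.$$' &
    done
    wait   # $$ expands in each remote shell, so every client gets its own file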
Don, thank you for the tips. Very useful. This is where I am at the moment.

The entire network is gigE, currently connected to a Cisco 6513. All OSSs are connected to a single blade in the 6513, and there are 30 OSSs. Using the default stripe of 1M and a single OSS per file, only 8 clients, and using your suggestion in this fashion:

    /opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g -t 8 -+m hostlist -w

What I see is the first set of results (which do not look that great):

    Children see throughput for 8 initial writers = 116711.56 KB/sec
    Min throughput per process = 14515.36 KB/sec
    Max throughput per process = 14627.10 KB/sec
    Avg throughput per process = 14588.95 KB/sec
    Min xfer = 4162560.00 KB

I would hope to see a throughput of at least 1G, and that's where it stops. The test fails to continue; it just sits. If I do this same test using a larger stripe size across all 30 OSSs, this is where I can easily get the errors to flood in and clients start getting evicted.

It feels like a networking issue, but I see no errors, dropped packets, etc. on the switch (although that doesn't convince me that the switch isn't a problem). We're replacing the Cisco with a Force10 E1200 in a few weeks, but I'm uneasy about not being able to prove definitively that it's a switch problem.

Using iozone's default -r and -s sizes, tests complete fine and give numbers more like what I'm expecting:

    /opt/iozone/bin/iozone -i 0 -i 1 -t 8 -+m hostlist -w

    Children see throughput for 8 initial writers = 1081793.05 KB/sec
    Min throughput per process = 150853.59 KB/sec
    Max throughput per process = 161713.44 KB/sec
    Avg throughput per process = 135224.13 KB/sec
    Min xfer = 512.00 KB

But this test isn't sufficient, since we're going to be working with mostly very large files, and I think the other tests, even if they max out the switch, should at least complete.

Thank you

-jeremy
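The hostlist file those commands read follows the control-file format Don gave earlier -- one line per client: hostname, working directory, path to the remote iozone binary. The client names below are hypothetical:

    for i in 1 2 3 4 5 6 7 8; do
        echo "client$i /mnt/lustre/stripe_test /opt/iozone/bin/iozone"
    done > hostlist
    /opt/iozone/bin/iozone -i 0 -i 1 -r 8k -s 4g -t 8 -+m hostlist -w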
On 11/21/05 3:54 AM, "Hancock, Tom" <Tom.Hancock@hp.com> wrote:

> Thus on my system each client would be writing to the same file at the
> same location, relatively synchronously. Was this what was really
> intended, or perhaps cexec has different semantics?

Tom, I read over your email again. The files are in fact getting the pid at the end; this I can verify. The paste of my command line may have been misleading. You do have to escape the $$'s to have the shell interpret things properly.

-jeremy