Hi, everyone;

I've been using MPI-IO on a Lustre file system to good effect for a while now in an application that has up to 32 processes writing to a shared file. However, seeking to understand the performance of our system, and to improve on it, I've recently made some changes to the ADIO Lustre code, which show some promise but need more testing. Eventually, I'd like to submit the code changes back to the mpich2 project, but that is certainly contingent upon the results of testing (and various code compliance issues for mpich2/romio/adio that I will likely need to sort out). This message is my request for volunteers to help test my code, in particular for output file correctness and shared-file write performance. If you're interested in doing shared-file I/O using MPI-IO on Lustre, please continue reading this message.

In broad terms, the changes I made are on two fronts: changing the file domain partitioning algorithm, and introducing non-blocking operations at several points. The file domain partitioning algorithm that I implemented is from the paper "Dynamically Adapting File Domain Partitioning Methods for Collective I/O Based on Underlying Parallel File System Locking Protocols" by Wei-keng Liao and Alok Choudhary. The non-blocking operations that I added allow the ADIO Lustre driver to better parallelize the data exchange and writing procedures over the multiple stripes that each process writes to a single Lustre OST.

My testing so far has been limited to four nodes and up to sixteen processes, writing to shared files on a Lustre file system with up to eight OSTs. These tests were conducted to simulate the production application for which I'm responsible, but on a different cluster, and focused only on the file output. In these rather limited tests, I've seen write performance gains of up to a factor of two or three. The new file domain partitioning algorithm is most effective when the number of processes exceeds the number of Lustre OSTs, but there are smaller gains in other cases, and I have not seen an instance in which performance has decreased. As an example, in one case using sixteen processes, MPI over Infiniband, and a file striping factor of four, the new code achieves over 800 MB/s, whereas the standard code achieves 300 MB/s. I have hints that the relative performance gains are greater when using 1Gb Ethernet rather than Infiniband for MPI message passing, but I have not completed my testing in that environment.

If you're willing to try out this code in a test environment, please let me know. I have not yet put the code into a publicly accessible repository, but will do so if there is interest out there.

--
Martin Pokorny
Software Engineer - Jansky Very Large Array
National Radio Astronomy Observatory - New Mexico Operations
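P.S. For anyone who would like a concrete picture of the I/O pattern being tested, the application's shared-file writes look roughly like the sketch below. This is only an illustration, not the production code or my benchmark; the file name, block size, and hint values are made up, but "striping_factor" and "striping_unit" are the usual ROMIO hints for requesting Lustre striping when the file is created.

    /* Illustrative sketch only (not the production application): each of
     * N processes writes one contiguous 64 MiB block of a shared file
     * using collective MPI-IO, with Lustre striping requested via hints. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int count = 64 * 1024 * 1024;      /* bytes per process (made up) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(count);
        memset(buf, rank & 0xff, count);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "4");     /* stripe over 4 OSTs */
        MPI_Info_set(info, "striping_unit", "1048576"); /* 1 MiB stripe size  */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Collective write: rank i writes bytes [i*count, (i+1)*count). */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * count, buf, count,
                              MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }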
Rob Latham
2012-May-23 18:28 UTC
[Lustre-discuss] [mpich-discuss] testing new ADIO Lustre code
On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
> Hi, everyone;
>
> I've been using MPI-IO on a Lustre file system to good effect for a
> while now in an application that has up to 32 processes writing to a
> shared file. However, seeking to understand the performance of our
> system, and to improve on it, I've recently made some changes to the
> ADIO Lustre code, which show some promise but need more testing.
> Eventually, I'd like to submit the code changes back to the mpich2
> project, but that is certainly contingent upon the results of
> testing (and various code compliance issues for mpich2/romio/adio
> that I will likely need to sort out). This message is my request for
> volunteers to help test my code, in particular for output file
> correctness and shared-file write performance. If you're interested
> in doing shared-file I/O using MPI-IO on Lustre, please continue
> reading this message.

Gosh, Martin, I really thought you'd get more attention with this post.

I'd like to see these patches: I can't aggressively test them on a Lustre system, but I'd be happy to provide another set of ROMIO eyeballs.

> In broad terms, the changes I made are on two fronts: changing the
> file domain partitioning algorithm, and introducing non-blocking
> operations at several points.

Non-blocking communication or I/O?

One concern with non-blocking I/O in this path is that often the communication and I/O networks are the same thing (e.g. Infiniband, or the BlueGene tree network in some situations).

> The file domain partitioning algorithm
> that I implemented is from the paper "Dynamically Adapting File
> Domain Partitioning Methods for Collective I/O Based on Underlying
> Parallel File System Locking Protocols" by Wei-keng Liao and Alok
> Choudhary. The non-blocking operations that I added allow the ADIO
> Lustre driver to better parallelize the data exchange and writing
> procedures over the multiple stripes that each process writes to a
> single Lustre OST.

I was hoping Wei-keng would chime in on this. I'll be sure to draw your patches to his attention.

> My testing so far has been limited to four nodes and up to sixteen
> processes, writing to shared files on a Lustre file system with up
> to eight OSTs.

Right now the only concern I have is that you may have (and without looking at the code I have no way of knowing) traded better small-scale performance for worse large-scale performance.

> These tests were conducted to simulate the production
> application for which I'm responsible, but on a different cluster,
> and focused only on the file output. In these rather limited tests,
> I've seen write performance gains of up to a factor of two or three.
> The new file domain partitioning algorithm is most effective when the
> number of processes exceeds the number of Lustre OSTs, but there are
> smaller gains in other cases, and I have not seen an instance in which
> performance has decreased. As an example, in one case using
> sixteen processes, MPI over Infiniband, and a file striping factor
> of four, the new code achieves over 800 MB/s, whereas the standard
> code achieves 300 MB/s. I have hints that the relative performance
> gains are greater when using 1Gb Ethernet rather than Infiniband for
> MPI message passing, but I have not completed my testing in that
> environment.
>
> If you're willing to try out this code in a test environment, please
> let me know. I have not yet put the code into a publicly accessible
> repository, but will do so if there is interest out there.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
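P.S. Just so we are picturing the same technique: I assume the overlap you describe is roughly the pattern sketched below, where non-blocking receives for all of a process's stripes are posted up front and each stripe is written as soon as its data arrives. This is only an illustrative fragment with made-up arguments, not the actual ADIO Lustre code.

    /* Hypothetical sketch of overlapping data exchange with stripe writes;
     * NOT the ROMIO/ADIO implementation.  Error handling is omitted. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <unistd.h>    /* pwrite */

    void exchange_and_write(int fd, int nstripes, char **stripe_buf,
                            int stripe_size, off_t *stripe_off,
                            int *src_rank, MPI_Comm comm)
    {
        MPI_Request *req = malloc(nstripes * sizeof(MPI_Request));

        /* Post all receives up front (non-blocking communication). */
        for (int i = 0; i < nstripes; i++)
            MPI_Irecv(stripe_buf[i], stripe_size, MPI_BYTE,
                      src_rank[i], /* tag = */ i, comm, &req[i]);

        /* Write each stripe as soon as its data has arrived, so the
         * communication for later stripes overlaps earlier writes. */
        for (int done = 0; done < nstripes; done++) {
            int idx;
            MPI_Waitany(nstripes, req, &idx, MPI_STATUS_IGNORE);
            pwrite(fd, stripe_buf[idx], stripe_size, stripe_off[idx]);
        }

        free(req);
    }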
Martin Pokorny
2012-May-23 19:29 UTC
[Lustre-discuss] [mpich-discuss] testing new ADIO Lustre code
Rob,

I've been in email contact with Wei-keng Liao about the changes that I've made. We have mainly discussed my implementation of the new non-blocking code; it turns out that he is currently working on a very similar set of modifications. The file domain partitioning algorithm comes from Wei-keng, and I expect that the results he published should apply to my implementation, at least approximately. He has encouraged me to do some large-scale testing using XSEDE, and I have plans to run some benchmarks in that setting soon.

A few more comments, below.

Rob Latham wrote:
> On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
>> Hi, everyone;
>>
>> I've been using MPI-IO on a Lustre file system to good effect for a
>> while now in an application that has up to 32 processes writing to a
>> shared file. However, seeking to understand the performance of our
>> system, and to improve on it, I've recently made some changes to the
>> ADIO Lustre code, which show some promise but need more testing.
>> Eventually, I'd like to submit the code changes back to the mpich2
>> project, but that is certainly contingent upon the results of
>> testing (and various code compliance issues for mpich2/romio/adio
>> that I will likely need to sort out). This message is my request for
>> volunteers to help test my code, in particular for output file
>> correctness and shared-file write performance. If you're interested
>> in doing shared-file I/O using MPI-IO on Lustre, please continue
>> reading this message.
>
> Gosh, Martin, I really thought you'd get more attention with this
> post.

Me too.

> I'd like to see these patches: I can't aggressively test them on a
> Lustre system, but I'd be happy to provide another set of
> ROMIO eyeballs.

I can send them to you now, if you want. However, I'm not finished testing and can't rule out further changes. I will go ahead and put them somewhere publicly accessible, and let everyone know when it's done.

>> In broad terms, the changes I made are on two fronts: changing the
>> file domain partitioning algorithm, and introducing non-blocking
>> operations at several points.
>
> Non-blocking communication or I/O?
>
> One concern with non-blocking I/O in this path is that often the
> communication and I/O networks are the same thing (e.g. Infiniband, or
> the BlueGene tree network in some situations).

Non-blocking in both communication and I/O. I was preparing a question for the Lustre discussion list about non-blocking I/O using the POSIX aio API, so I'll just ask it right here: is POSIX aio on a Lustre file system truly asynchronous? I expect that the implementation of aio in glibc may be asynchronous with respect to the calling thread, but I also wonder whether the system calls into the Lustre client are asynchronous or not. Can anyone help me understand? (See the sketch at the end of this message for the calls I have in mind.)

I have a little data suggesting that the aio calls do improve performance a bit, but this is a tentative conclusion.

>> The file domain partitioning algorithm
>> that I implemented is from the paper "Dynamically Adapting File
>> Domain Partitioning Methods for Collective I/O Based on Underlying
>> Parallel File System Locking Protocols" by Wei-keng Liao and Alok
>> Choudhary. The non-blocking operations that I added allow the ADIO
>> Lustre driver to better parallelize the data exchange and writing
>> procedures over the multiple stripes that each process writes to a
>> single Lustre OST.
>
> I was hoping Wei-keng would chime in on this. I'll be sure to draw
> your patches to his attention.

I've already done that.

>> My testing so far has been limited to four nodes and up to sixteen
>> processes, writing to shared files on a Lustre file system with up
>> to eight OSTs.
>
> Right now the only concern I have is that you may have (and without
> looking at the code I have no way of knowing) traded better small-scale
> performance for worse large-scale performance.

Right. As I mentioned above, I will soon be testing my code in a large-scale setting.

>> These tests were conducted to simulate the production
>> application for which I'm responsible, but on a different cluster,
>> and focused only on the file output. In these rather limited tests,
>> I've seen write performance gains of up to a factor of two or three.
>> The new file domain partitioning algorithm is most effective when the
>> number of processes exceeds the number of Lustre OSTs, but there are
>> smaller gains in other cases, and I have not seen an instance in which
>> performance has decreased. As an example, in one case using
>> sixteen processes, MPI over Infiniband, and a file striping factor
>> of four, the new code achieves over 800 MB/s, whereas the standard
>> code achieves 300 MB/s. I have hints that the relative performance
>> gains are greater when using 1Gb Ethernet rather than Infiniband for
>> MPI message passing, but I have not completed my testing in that
>> environment.
>>
>> If you're willing to try out this code in a test environment, please
>> let me know. I have not yet put the code into a publicly accessible
>> repository, but will do so if there is interest out there.
>
> ==rob

--
Martin
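P.S. To be concrete about what I mean by the POSIX aio API: the calls in question are the aio_write()/aio_suspend() family, used roughly as below. This is just an illustrative fragment with error handling trimmed, not my actual patch.

    /* Illustrative use of POSIX aio for a single write; error handling
     * trimmed.  The question above is whether the Lustre client lets
     * aio_write() return before the data is actually written. */
    #include <aio.h>
    #include <string.h>
    #include <sys/types.h>

    ssize_t aio_write_block(int fd, char *buf, size_t len, off_t off)
    {
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = off;

        if (aio_write(&cb) != 0)              /* submit the write */
            return -1;

        /* ... other work (e.g. MPI communication) could overlap here ... */

        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);           /* wait for completion */

        if (aio_error(&cb) != 0)              /* 0 on success, else an errno */
            return -1;
        return aio_return(&cb);               /* bytes written */
    }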
Andreas Dilger
2012-May-23 19:40 UTC
[Lustre-discuss] [wc-discuss] Re: [mpich-discuss] testing new ADIO Lustre code
On 2012-05-23, at 1:29 PM, Martin Pokorny wrote:
> Rob Latham wrote:
>> On Thu, May 17, 2012 at 03:02:48PM -0600, Martin Pokorny wrote:
>>> In broad terms, the changes I made are on two fronts: changing the
>>> file domain partitioning algorithm, and introducing non-blocking
>>> operations at several points.
>>
>> Non-blocking communication or I/O?
>>
>> One concern with non-blocking I/O in this path is that often the
>> communication and I/O networks are the same thing (e.g. Infiniband, or
>> the BlueGene tree network in some situations).
>
> Non-blocking in both communication and I/O. I was preparing a question
> for the Lustre discussion list about non-blocking I/O using the POSIX
> aio API, so I'll just ask it right here: is POSIX aio on a Lustre file
> system truly asynchronous? I expect that the implementation of aio in
> glibc may be asynchronous with respect to the calling thread, but I
> also wonder whether the system calls into the Lustre client are
> asynchronous or not. Can anyone help me understand?
>
> I have a little data suggesting that the aio calls do improve
> performance a bit, but this is a tentative conclusion.

I don't think the Lustre AIO calls are really async. This isn't an area that has received much usage or attention, so I suspect the calls may only be wired up but will still block internally. It might be that the 20 layers of I/O submission API abstraction in the kernel avoid this problem for us, but I haven't looked at that recently.

Cheers, Andreas
--
Andreas Dilger
Whamcloud, Inc.
Principal Lustre Engineer
http://www.whamcloud.com/
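P.S. One crude way to check the behavior on a given client would be to time the submission call against the time to completion: if the two are about the same for a large write, aio_write() is effectively blocking. A rough sketch of such a probe follows; the file path and size are placeholders, and note that client write-back caching can make even a blocking write return quickly, so a large buffer (or O_DIRECT) gives a clearer signal.

    /* Crude probe: does aio_write() return before the I/O completes?
     * Compare time spent in submission vs. time to completion.
     * Illustrative only; path and size are placeholders. */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        const size_t len = 256UL * 1024 * 1024;   /* 256 MiB */
        char *buf = malloc(len);
        memset(buf, 0xAB, len);

        int fd = open("/lustre/scratch/aio_probe", O_CREAT | O_WRONLY, 0644);

        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = len;

        double t0 = now();
        aio_write(&cb);                           /* submission */
        double t1 = now();

        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, NULL);               /* completion */
        double t2 = now();

        /* If submit ~= total, the call is effectively synchronous. */
        printf("submit: %.3f s   total: %.3f s\n", t1 - t0, t2 - t0);
        return 0;
    }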