Mark Howison
2009-Dec-10 20:55 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
Hi,

On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we have noticed excessively long return times on ftruncate calls that are issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(), for instance). Here is an I/O trace plot that shows 235GB written to a shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:

http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png

(Full details: with collective buffering enabled in the MPI-IO layer, the I/O pattern is essentially a series of 4MB writes issued from 48 nodes that have been designated as aggregator/writer nodes. The number of writer nodes matches the 48 OSTs that store the file, and the write size matches the 4MB stripe size. This sets up a pattern that *looks* to the OSTs essentially the same as if we had 48 single-stripe files and 48 nodes each writing to its own file. This has been the most effective way we have found to stage shared-file writes on Lustre.)

However, we've also seen this long ftruncate problem with several other I/O patterns in addition to collective buffering in MPI-IO: for instance, when bypassing MPI-IO in HDF5 and instead using the MPI-POSIX driver, and with unstructured 1D grids.

Any ideas on what might cause these long ftruncates? We plan on analyzing LMT data from the metadata server to determine whether it is simply contention with other users, but we are suspicious of the consistency and magnitude of these hangs.

Thanks,
Mark Howison
NERSC Analytics
mhowison at lbl.gov
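For reference, here is a minimal sketch of the write pattern described above, assuming a ROMIO-based MPI-IO stack that honors the usual collective-buffering and striping hints. The /scratch path, the single 4MB block per rank, and the final size passed to MPI_File_set_size() are illustrative only; this is not the actual VORPAL/HDF5 code path.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int rank, nprocs;
    const MPI_Offset stripe = 4 * 1024 * 1024;       /* 4 MB stripe size */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(stripe);
    memset(buf, 0, stripe);

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");  /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "48");            /* one aggregator per OST     */
    MPI_Info_set(info, "striping_factor", "48");     /* stripe count               */
    MPI_Info_set(info, "striping_unit", "4194304");  /* stripe size in bytes       */

    /* Hypothetical output path on the Lustre /scratch filesystem. */
    MPI_File_open(MPI_COMM_WORLD, "/scratch/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank contributes one stripe-aligned 4 MB block to the shared file. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * stripe, buf, (int)stripe,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    /* The call that shows up as the long ftruncate in the trace; HDF5's
     * MPI-IO driver issues a similar resize when truncating the file to
     * its allocated size. */
    MPI_File_set_size(fh, (MPI_Offset)nprocs * stripe);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}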
Andreas Dilger
2009-Dec-11 10:03 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
On 2009-12-10, at 13:55, Mark Howison wrote:
> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
> have noticed excessively long return times on ftruncate calls that are
> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size()
> for instance). Here is an I/O trace plot that shows 235GB written to a
> shared HDF5 file in 65s followed by an ftruncate that lasts about 50s:
>
> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png

To clarify - is the vertical axis for different servers, or for clients? It definitely looks like 1 or 2 of the servers are much slower than the others, as shown by the "solid" line of writes, compared to the others, which are very sparse.

Then, the single purple line at the top is presumably the truncate in progress?

Finally, at the far right, is that for reads?

My original guess would have been that all of your clients are doing a truncate at the same time and this is causing lock contention, but even that shouldn't cause such a long delay.

Another possibility is that the large file is fragmented on disk and the truncate is taking a long time, but I also find it hard to believe it would take this long.

Presumably there are no error messages during this time?

> However, we've also seen this long ftruncate problem with several
> other I/O patterns in addition to collective buffering in MPI-IO: for
> instance, when bypassing MPI-IO in HDF5 and instead using the
> MPI-POSIX driver, and with unstructured 1D grids.
>
> Any ideas on what might cause these long ftruncates? We plan on
> analyzing LMT data from the metadata server to determine whether it is
> simply contention with other users, but we are suspicious of the
> consistency and magnitude of these hangs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Mark Howison
2009-Dec-11 18:58 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
Hi Andreas,

Sorry, I should have provided a better description of the trace plot. Yes, the vertical axis shows MPI task number: the first 48 tasks (mod 4, because of quad cores) are the writers, and indeed there are a few OSTs that are an order of magnitude slower than the others. Usually we see that undesirable behavior when we have many more writers than OSTs (for instance, 1K tasks hitting 48 OSTs)... now it has started happening even when we use collective buffering to carefully match the number of writers to OSTs. It is possible that this is just contention from other users on the system; Andrew Uselton is going to help me obtain LMT data to verify that. Contention has been less of an issue since there was a large I/O hardware upgrade to Franklin last spring, though.

By any chance, have there been changes in recent releases of Lustre that would affect how stripes are assigned to OSTs? For instance, is it no longer the case that they are assigned round-robin? Maybe some type of load balancing has been introduced? That would break our 1-1 writer-to-OST pattern.

The purple indicates ftruncate, which is only called from task 0 (that is how MPI_File_set_size is implemented). The salmon at the far right actually indicates fsyncs, not reads. And the brown (hard to see) is fclose.

I will ask our systems people if they can locate logs and verify whether there were any errors at the time of the ftruncate, but I didn't receive any errors on the client side.

Thanks for helping us look into this,

Mark

On Fri, Dec 11, 2009 at 2:03 AM, Andreas Dilger <adilger at sun.com> wrote:
> On 2009-12-10, at 13:55, Mark Howison wrote:
>>
>> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
>> have noticed excessively long return times on ftruncate calls that are
>> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size()
>> for instance). Here is an I/O trace plot that shows 235GB written to a
>> shared HDF5 file in 65s followed by an ftruncate that lasts about 50s:
>>
>> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
>
> To clarify - is the vertical axis for different servers, or for clients?
> It definitely looks like 1 or 2 of the servers are much slower
> than the others, as shown by the "solid" line of writes, compared to the
> others, which are very sparse.
>
> Then, the single purple line at the top is presumably the truncate in
> progress?
>
> Finally, at the far right, is that for reads?
>
> My original guess would have been that all of your clients are doing a
> truncate at the same time and this is causing lock contention, but even
> that shouldn't cause such a long delay.
>
> Another possibility is that the large file is fragmented on disk and the
> truncate is taking a long time, but I also find it hard to believe it would
> take this long.
>
> Presumably there are no error messages during this time?
>
>> However, we've also seen this long ftruncate problem with several
>> other I/O patterns in addition to collective buffering in MPI-IO: for
>> instance, when bypassing MPI-IO in HDF5 and instead using the
>> MPI-POSIX driver, and with unstructured 1D grids.
>>
>> Any ideas on what might cause these long ftruncates? We plan on
>> analyzing LMT data from the metadata server to determine whether it is
>> simply contention with other users, but we are suspicious of the
>> consistency and magnitude of these hangs.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
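As a side note on the stripe-assignment question: one way to take the OST-selection policy out of the equation is to pre-create the shared file with an explicit layout before the parallel open, e.g. "lfs setstripe -c 48 -s 4m <file>" (and "lfs getstripe <file>" to see which OSTs were actually chosen), or programmatically with liblustreapi. A rough sketch under those assumptions, with a made-up path; this does not answer whether the allocator is still round-robin, it just makes the layout deterministic so the 1-1 writer/OST mapping can be verified.

/* Rough sketch (not the production setup): pre-create the shared file with
 * an explicit 48-stripe, 4 MB layout using liblustreapi. The path is made
 * up; link against -llustreapi. The header is <lustre/liblustreapi.h> in
 * older releases and <lustre/lustreapi.h> in newer ones. */
#include <lustre/liblustreapi.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/scratch/testfile";      /* hypothetical path               */
    int rc = llapi_file_create(path,
                               4 * 1024 * 1024,  /* stripe size: 4 MB                */
                               -1,               /* starting OST: let the MDS choose */
                               48,               /* stripe count                     */
                               0);               /* default (RAID0) pattern          */
    if (rc != 0)
        fprintf(stderr, "llapi_file_create(%s) failed: %d\n", path, rc);
    return rc != 0;
}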
Mark Howison
2009-Dec-23 21:22 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
Hi Andreas,

Our systems people looked through the error logs and didn't report any errors directly related to my job. There were some errors around the same time, associated with other jobs, however (pasted below).

Thanks,
Mark

based on your timestamps I am assuming this is the run:

Dec 7 17:00:13 nid04099 aprun[54251]: apid=1820824, Starting, user=43251, cmd_line="aprun -n 2048 ./vorpalio -a 4 -t 10 -b 80 80 80 -n 32 8 8 ", num_nodes=512, node_list=1440-1462,1470-1471,1489-1491,1493-1499,1534-1535,1537,1563-1568,1582-1600,1643-1644,1648-1650,1739-1745,1762-1776,2033-2038,2065-2067,2105-2114,2141-2168,2239-2244,2248-2255,2341-2367,2372-2374,2762-2795,2837-2864,2952-2995,3234-3245,3287-3296,3302-3305,3330-3339,3866-3909,3911-3933,3946-4023,12796-12839

No Lustre errors during the timeframe requested.

From Messages:

Dec 7 15:30:02 nid00263 kernel: LustreError: 42196:0:(llite_nfs.c:252:ll_get_parent()) failure -13 inode 356483075 get parent
Dec 7 19:22:13 nid00256 kernel: LustreError: 92924:0:(file.c:2653:ll_inode_revalidate_fini()) failure -43 inode 422182913

None in the console log either. However, just after the requested time period:

[2009-12-07 17:03:07][c4-1c0s0n3]Lustre: 5555:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1024 (4194304) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n2]Lustre: 5539:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1024 (4194304) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre: 5557:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1025 (4198400) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre: 5555:0:(rw.c:1830:ll_readpage()) Skipped 316 previous similar messages
[2009-12-07 17:03:12][c4-1c0s0n3]Lustre: 5557:0:(rw.c:1830:ll_readpage()) ino 169248729 page 1024 (4194304) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:12][c4-1c0s0n3]Lustre: 5557:0:(rw.c:1830:ll_readpage()) Skipped 1 previous similar message

but both of these nodes (1250 and 1251) were not part of this job.

On Fri, Dec 11, 2009 at 10:58 AM, Mark Howison <MHowison at lbl.gov> wrote:
> Hi Andreas,
>
> Sorry, I should have provided a better description of the trace plot.
> Yes, the vertical axis shows MPI task number: the first 48 tasks (mod 4,
> because of quad cores) are the writers, and indeed there are a few
> OSTs that are an order of magnitude slower than the others. Usually we
> see that undesirable behavior when we have many more writers than OSTs
> (for instance, 1K tasks hitting 48 OSTs)... now it has started
> happening even when we use collective buffering to carefully match the
> number of writers to OSTs. It is possible that this is just contention
> from other users on the system; Andrew Uselton is going to help me
> obtain LMT data to verify that. Contention has been less of an issue
> since there was a large I/O hardware upgrade to Franklin last spring,
> though.
>
> By any chance, have there been changes in recent releases of Lustre
> that would affect how stripes are assigned to OSTs? For instance, is
> it no longer the case that they are assigned round-robin? Maybe some
> type of load balancing has been introduced? That would break our 1-1
> writer-to-OST pattern.
>
> The purple indicates ftruncate, which is only called from task 0 (that
> is how MPI_File_set_size is implemented). The salmon at the far right
> actually indicates fsyncs, not reads. And the brown (hard to see) is
> fclose.
>
> I will ask our systems people if they can locate logs and verify
> whether there were any errors at the time of the ftruncate, but I
> didn't receive any errors on the client side.
>
> Thanks for helping us look into this,
>
> Mark
>
> On Fri, Dec 11, 2009 at 2:03 AM, Andreas Dilger <adilger at sun.com> wrote:
>> On 2009-12-10, at 13:55, Mark Howison wrote:
>>>
>>> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
>>> have noticed excessively long return times on ftruncate calls that are
>>> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size()
>>> for instance). Here is an I/O trace plot that shows 235GB written to a
>>> shared HDF5 file in 65s followed by an ftruncate that lasts about 50s:
>>>
>>> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
>>
>> To clarify - is the vertical axis for different servers, or for clients?
>> It definitely looks like 1 or 2 of the servers are much slower
>> than the others, as shown by the "solid" line of writes, compared to the
>> others, which are very sparse.
>>
>> Then, the single purple line at the top is presumably the truncate in
>> progress?
>>
>> Finally, at the far right, is that for reads?
>>
>> My original guess would have been that all of your clients are doing a
>> truncate at the same time and this is causing lock contention, but even
>> that shouldn't cause such a long delay.
>>
>> Another possibility is that the large file is fragmented on disk and the
>> truncate is taking a long time, but I also find it hard to believe it would
>> take this long.
>>
>> Presumably there are no error messages during this time?
>>
>>> However, we've also seen this long ftruncate problem with several
>>> other I/O patterns in addition to collective buffering in MPI-IO: for
>>> instance, when bypassing MPI-IO in HDF5 and instead using the
>>> MPI-POSIX driver, and with unstructured 1D grids.
>>>
>>> Any ideas on what might cause these long ftruncates? We plan on
>>> analyzing LMT data from the metadata server to determine whether it is
>>> simply contention with other users, but we are suspicious of the
>>> consistency and magnitude of these hangs.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
On Fri, Dec 11, 2009 at 10:58:19AM -0800, Mark Howison wrote:
> The purple indicates ftruncate, which is only called from task 0 (that
> is how MPI_File_set_size is implemented). The salmon at the far right
> actually indicates fsyncs, not reads. And the brown (hard to see) is
> fclose.

Just to add a bit of ROMIO MPI-IO background while you explore this: in the old days, every MPI process called ftruncate, and of course that gave brutally poor performance. The "ftruncate from one" approach has been in place since 2004. I would be both shocked and appalled if you encountered an ancient ROMIO-based MPI-IO implementation that still does ftruncate from all.

If you are NFS-exporting this Lustre volume for some reason, ROMIO will behave quite differently (and slowly). But you aren't doing that, or you'd see a very different IPM graph.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
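To make the "ftruncate from one" behavior concrete, here is a hedged illustration (not ROMIO's actual source): only rank 0 issues the truncate against the shared file, and the other ranks wait for and learn its outcome. The path and helper function name are made up.

#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Resize a shared file from rank 0 only; every rank holds its own fd for
 * the same path. Illustrative only -- ROMIO's internals differ in detail. */
static int resize_from_rank0(MPI_Comm comm, int fd, off_t newsize)
{
    int rank, err = 0;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0 && ftruncate(fd, newsize) != 0) {
        perror("ftruncate");
        err = 1;
    }
    /* Broadcast the outcome: no rank proceeds until rank 0's truncate has
     * completed, and all ranks agree on success or failure. */
    MPI_Bcast(&err, 1, MPI_INT, 0, comm);
    return err;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Hypothetical path on the shared Lustre filesystem. */
    int fd = open("/scratch/testfile", O_WRONLY | O_CREAT, 0644);
    if (fd >= 0) {
        /* ~235 GB, the size of the shared HDF5 file in the trace. */
        resize_from_rank0(MPI_COMM_WORLD, fd, (off_t)235 << 30);
        close(fd);
    }

    MPI_Finalize();
    return 0;
}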