Mark Howison
2009-Dec-10 20:55 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
Hi,

On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we have noticed excessively long return times on ftruncate calls that are issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(), for instance). Here is an I/O trace plot that shows 235GB written to a shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:

http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png

(Full details: with collective buffering enabled in the MPI-IO layer, the I/O pattern is essentially a series of 4MB writes issued from 48 nodes that have been designated as aggregator/writer nodes. The number of writer nodes matches the 48 OSTs that store the file, and the write size matches the 4MB stripe size. This sets up a pattern that *looks* to the OSTs essentially the same as if we had 48 single-stripe files and 48 nodes each writing to its own file. This has been the most effective way we have found to stage shared-file writes on Lustre.)

However, we've also seen this long ftruncate problem with several other I/O patterns in addition to collective buffering in MPI-IO: for instance, when bypassing MPI-IO in HDF5 and instead using the MPI-POSIX driver, and with unstructured 1D grids.

Any ideas on what might cause these long ftruncates? We plan on analyzing LMT data from the metadata server to determine whether it is simply contention with other users, but we are suspicious of the consistency and magnitude of these hangs.

Thanks,
Mark Howison
NERSC Analytics
mhowison at lbl.gov
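For reference, here is a minimal sketch of the write pattern described above, assuming a ROMIO-based MPI-IO stack that honors the usual collective-buffering and striping hints. The /scratch path, the single 4MB block per rank, and the final size passed to MPI_File_set_size() are illustrative only; this is not the actual VORPAL/HDF5 code path.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;
    int rank, nprocs;
    const MPI_Offset stripe = 4 * 1024 * 1024;       /* 4 MB stripe size */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    buf = malloc(stripe);
    memset(buf, 0, stripe);

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");  /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "48");            /* one aggregator per OST     */
    MPI_Info_set(info, "striping_factor", "48");     /* stripe count               */
    MPI_Info_set(info, "striping_unit", "4194304");  /* stripe size in bytes       */

    /* Hypothetical output path on the Lustre /scratch filesystem. */
    MPI_File_open(MPI_COMM_WORLD, "/scratch/testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank contributes one stripe-aligned 4 MB block to the shared file. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * stripe, buf, (int)stripe,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    /* The call that shows up as the long ftruncate in the trace; HDF5's
     * MPI-IO driver issues a similar resize when truncating the file to
     * its allocated size. */
    MPI_File_set_size(fh, (MPI_Offset)nprocs * stripe);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}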
Andreas Dilger
2009-Dec-11 10:03 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
On 2009-12-10, at 13:55, Mark Howison wrote:
> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
> have noticed excessively long return times on ftruncate calls that are
> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size()
> for instance). Here is an I/O trace plot that shows 235GB written to a
> shared HDF5 file in 65s followed by an ftruncate that lasts about 50s:
>
> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png

To clarify - is the vertical axis for different servers, or for clients? It definitely looks like 1 or 2 of the servers are much slower than the others, as shown by the "solid" line of writes, compared to the others, which are very sparse.

Then, the single purple line at the top is presumably the truncate in progress?

Finally, at the far right, is that for reads?

My original guess would have been that all of your clients are doing a truncate at the same time and this is causing lock contention, but even that shouldn't cause such a long delay.

Another possibility is that the large file is fragmented on disk and the truncate is taking a long time, but I also find it hard to believe it would take this long.

Presumably there are no error messages during this time?

> However, we've also seen this long ftruncate problem with several
> other I/O patterns in addition to collective buffering in MPI-IO: for
> instance, when bypassing MPI-IO in HDF5 and instead using the
> MPI-POSIX driver, and with unstructured 1D grids.
>
> Any ideas on what might cause these long ftruncates? We plan on
> analyzing LMT data from the metadata server to determine whether it is
> simply contention with other users, but we are suspicious of the
> consistency and magnitude of these hangs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Mark Howison
2009-Dec-11 18:58 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
Hi Andreas,

Sorry, I should have provided a better description of the trace plot. Yes, the vertical axis shows MPI task number: the first 48 tasks (mod 4, because of quad cores) are the writers, and indeed there are a few OSTs that are an order of magnitude slower than the others. Usually we see that undesirable behavior when we have many more writers than OSTs (for instance, 1K tasks hitting 48 OSTs)... now it has started happening even when we use collective buffering to carefully match the number of writers to OSTs. It is possible that this is just contention from other users on the system; Andrew Uselton is going to help me obtain LMT data to verify that. Contention has been less of an issue since there was a large I/O hardware upgrade to Franklin last spring, though.

By any chance, have there been changes in recent releases of Lustre that would affect how stripes are assigned to OSTs? For instance, is it no longer the case that they are assigned round-robin? Maybe some type of load balancing has been introduced? That would break our 1-1 writer-to-OST pattern.

The purple indicates ftruncate, which is only called from task 0 (that is how MPI_File_set_size is implemented). The salmon at the far right actually indicates fsyncs, not reads. And the brown (hard to see) is fclose.

I will ask our systems people if they can locate logs and verify whether there were any errors at the time of the ftruncate, but I didn't receive any errors on the client side.

Thanks for helping us look into this,

Mark

On Fri, Dec 11, 2009 at 2:03 AM, Andreas Dilger <adilger at sun.com> wrote:
> On 2009-12-10, at 13:55, Mark Howison wrote:
>>
>> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
>> have noticed excessively long return times on ftruncate calls that are
>> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size()
>> for instance). Here is an I/O trace plot that shows 235GB written to a
>> shared HDF5 file in 65s followed by an ftruncate that lasts about 50s:
>>
>> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
>
> To clarify - is the vertical axis for different servers, or for clients?
> It definitely looks like 1 or 2 of the servers are much slower
> than the others, as shown by the "solid" line of writes, compared to the
> others, which are very sparse.
>
> Then, the single purple line at the top is presumably the truncate in
> progress?
>
> Finally, at the far right, is that for reads?
>
> My original guess would have been that all of your clients are doing a
> truncate at the same time and this is causing lock contention, but even
> that shouldn't cause such a long delay.
>
> Another possibility is that the large file is fragmented on disk and the
> truncate is taking a long time, but I also find it hard to believe it would
> take this long.
>
> Presumably there are no error messages during this time?
>
>> However, we've also seen this long ftruncate problem with several
>> other I/O patterns in addition to collective buffering in MPI-IO: for
>> instance, when bypassing MPI-IO in HDF5 and instead using the
>> MPI-POSIX driver, and with unstructured 1D grids.
>>
>> Any ideas on what might cause these long ftruncates? We plan on
>> analyzing LMT data from the metadata server to determine whether it is
>> simply contention with other users, but we are suspicious of the
>> consistency and magnitude of these hangs.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
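As a side note on the stripe-assignment question: one way to take the OST-selection policy out of the equation is to pre-create the shared file with an explicit layout before the parallel open, e.g. "lfs setstripe -c 48 -s 4m <file>" (and "lfs getstripe <file>" to see which OSTs were actually chosen), or programmatically with liblustreapi. A rough sketch under those assumptions, with a made-up path; this does not answer whether the allocator is still round-robin, it just makes the layout deterministic so the 1-1 writer/OST mapping can be verified.

/* Rough sketch (not the production setup): pre-create the shared file with
 * an explicit 48-stripe, 4 MB layout using liblustreapi. The path is made
 * up; link against -llustreapi. The header is <lustre/liblustreapi.h> in
 * older releases and <lustre/lustreapi.h> in newer ones. */
#include <lustre/liblustreapi.h>
#include <stdio.h>

int main(void)
{
    const char *path = "/scratch/testfile";      /* hypothetical path               */
    int rc = llapi_file_create(path,
                               4 * 1024 * 1024,  /* stripe size: 4 MB                */
                               -1,               /* starting OST: let the MDS choose */
                               48,               /* stripe count                     */
                               0);               /* default (RAID0) pattern          */
    if (rc != 0)
        fprintf(stderr, "llapi_file_create(%s) failed: %d\n", path, rc);
    return rc != 0;
}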
Mark Howison
2009-Dec-23 21:22 UTC
[Lustre-discuss] abnormally long ftruncates on Cray XT4
Hi Andreas,

Our systems people looked through the error logs and didn't report any errors directly related to my job. There were some errors around the same time, associated with other jobs, however (pasted below).

Thanks,
Mark

based on your timestamps I am assuming this is the run:

Dec 7 17:00:13 nid04099 aprun[54251]: apid=1820824, Starting, user=43251, cmd_line="aprun -n 2048 ./vorpalio -a 4 -t 10 -b 80 80 80 -n 32 8 8 ", num_nodes=512, node_list=1440-1462,1470-1471,1489-1491,1493-1499,1534-1535,1537,1563-1568,1582-1600,1643-1644,1648-1650,1739-1745,1762-1776,2033-2038,2065-2067,2105-2114,2141-2168,2239-2244,2248-2255,2341-2367,2372-2374,2762-2795,2837-2864,2952-2995,3234-3245,3287-3296,3302-3305,3330-3339,3866-3909,3911-3933,3946-4023,12796-12839

No Lustre errors during the timeframe requested.

From Messages:

Dec 7 15:30:02 nid00263 kernel: LustreError: 42196:0:(llite_nfs.c:252:ll_get_parent()) failure -13 inode 356483075 get parent
Dec 7 19:22:13 nid00256 kernel: LustreError: 92924:0:(file.c:2653:ll_inode_revalidate_fini()) failure -43 inode 422182913

None in the console log either. However, just after the requested time period:

[2009-12-07 17:03:07][c4-1c0s0n3]Lustre: 5555:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1024 (4194304) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n2]Lustre: 5539:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1024 (4194304) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre: 5557:0:(rw.c:1830:ll_readpage()) ino 169248734 page 1025 (4198400) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:07][c4-1c0s0n3]Lustre: 5555:0:(rw.c:1830:ll_readpage()) Skipped 316 previous similar messages
[2009-12-07 17:03:12][c4-1c0s0n3]Lustre: 5557:0:(rw.c:1830:ll_readpage()) ino 169248729 page 1024 (4194304) not covered by a lock (mmap?). check debug logs.
[2009-12-07 17:03:12][c4-1c0s0n3]Lustre: 5557:0:(rw.c:1830:ll_readpage()) Skipped 1 previous similar message

but both of these nodes (1250 and 1251) were not part of this job.

On Fri, Dec 11, 2009 at 10:58 AM, Mark Howison <MHowison at lbl.gov> wrote:
> Hi Andreas,
>
> Sorry, I should have provided a better description of the trace plot.
> Yes, the vertical axis shows MPI task number: the first 48 tasks (mod 4,
> because of quad cores) are the writers, and indeed there are a few
> OSTs that are an order of magnitude slower than the others. Usually we
> see that undesirable behavior when we have many more writers than OSTs
> (for instance, 1K tasks hitting 48 OSTs)... now it has started
> happening even when we use collective buffering to carefully match the
> number of writers to OSTs. It is possible that this is just contention
> from other users on the system; Andrew Uselton is going to help me
> obtain LMT data to verify that. Contention has been less of an issue
> since there was a large I/O hardware upgrade to Franklin last spring,
> though.
>
> By any chance, have there been changes in recent releases of Lustre
> that would affect how stripes are assigned to OSTs? For instance, is
> it no longer the case that they are assigned round-robin? Maybe some
> type of load balancing has been introduced? That would break our 1-1
> writer-to-OST pattern.
>
> The purple indicates ftruncate, which is only called from task 0 (that
> is how MPI_File_set_size is implemented). The salmon at the far right
> actually indicates fsyncs, not reads. And the brown (hard to see) is
> fclose.
>
> I will ask our systems people if they can locate logs and verify
> whether there were any errors at the time of the ftruncate, but I
> didn't receive any errors on the client side.
>
> Thanks for helping us look into this,
>
> Mark
>
> On Fri, Dec 11, 2009 at 2:03 AM, Andreas Dilger <adilger at sun.com> wrote:
>> On 2009-12-10, at 13:55, Mark Howison wrote:
>>>
>>> On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we
>>> have noticed excessively long return times on ftruncate calls that are
>>> issued through HDF5 or the MPI-IO layer (through MPI_File_set_size()
>>> for instance). Here is an I/O trace plot that shows 235GB written to a
>>> shared HDF5 file in 65s followed by an ftruncate that lasts about 50s:
>>>
>>> http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png
>>
>> To clarify - is the vertical axis for different servers, or for clients?
>> It definitely looks like 1 or 2 of the servers are much slower
>> than the others, as shown by the "solid" line of writes, compared to the
>> others, which are very sparse.
>>
>> Then, the single purple line at the top is presumably the truncate in
>> progress?
>>
>> Finally, at the far right, is that for reads?
>>
>> My original guess would have been that all of your clients are doing a
>> truncate at the same time and this is causing lock contention, but even
>> that shouldn't cause such a long delay.
>>
>> Another possibility is that the large file is fragmented on disk and the
>> truncate is taking a long time, but I also find it hard to believe it would
>> take this long.
>>
>> Presumably there are no error messages during this time?
>>
>>> However, we've also seen this long ftruncate problem with several
>>> other I/O patterns in addition to collective buffering in MPI-IO: for
>>> instance, when bypassing MPI-IO in HDF5 and instead using the
>>> MPI-POSIX driver, and with unstructured 1D grids.
>>>
>>> Any ideas on what might cause these long ftruncates? We plan on
>>> analyzing LMT data from the metadata server to determine whether it is
>>> simply contention with other users, but we are suspicious of the
>>> consistency and magnitude of these hangs.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
On Fri, Dec 11, 2009 at 10:58:19AM -0800, Mark Howison wrote:
> The purple indicates ftruncate, which is only called from task 0 (that
> is how MPI_File_set_size is implemented). The salmon at the far right
> actually indicates fsyncs, not reads. And the brown (hard to see) is
> fclose.

Just to add a bit of ROMIO MPI-IO background while you explore this: in the old days, every MPI process called ftruncate, and of course that gave brutally poor performance. The "ftruncate from one" approach has been in place since 2004. I would be both shocked and appalled if you encountered an ancient ROMIO-based MPI-IO implementation that still does ftruncate from all.

If you are NFS-exporting this Lustre volume for some reason, ROMIO will behave quite differently (and slowly). But you aren't doing that, or you'd see a very different IPM graph.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
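To make the "ftruncate from one" behavior concrete, here is a hedged illustration (not ROMIO's actual source): only rank 0 issues the truncate against the shared file, and the other ranks wait for and learn its outcome. The path and helper function name are made up.

#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Resize a shared file from rank 0 only; every rank holds its own fd for
 * the same path. Illustrative only -- ROMIO's internals differ in detail. */
static int resize_from_rank0(MPI_Comm comm, int fd, off_t newsize)
{
    int rank, err = 0;

    MPI_Comm_rank(comm, &rank);
    if (rank == 0 && ftruncate(fd, newsize) != 0) {
        perror("ftruncate");
        err = 1;
    }
    /* Broadcast the outcome: no rank proceeds until rank 0's truncate has
     * completed, and all ranks agree on success or failure. */
    MPI_Bcast(&err, 1, MPI_INT, 0, comm);
    return err;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Hypothetical path on the shared Lustre filesystem. */
    int fd = open("/scratch/testfile", O_WRONLY | O_CREAT, 0644);
    if (fd >= 0) {
        /* ~235 GB, the size of the shared HDF5 file in the trace. */
        resize_from_rank0(MPI_COMM_WORLD, fd, (off_t)235 << 30);
        close(fd);
    }

    MPI_Finalize();
    return 0;
}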