Hi all,

in our 196-OST cluster, the previously perfect distribution of files among the OSTs has not been working for about two weeks now. The fill level of most OSTs is between 57% and 62%, but some (~10) have risen to 94%. I'm trying to fix that by deactivating these OSTs on the MDT and finding and migrating data away from them, but it seems I'm not fast enough and it's an ongoing problem - I've just deactivated another OST at a threatening 67%.

Our qos_prio_free is at the default 90%.

Our OSTs' sizes are between 2.3TB and 4.5TB. We use striping level 1, so it would be possible to fill up an OST by just creating a 2TB file. However, I'm not aware of any such gigafiles (using Robinhood to get a picture of our file system). In addition, our users' behavior should not have changed recently.

In August, the entire cluster had filled up to almost 80% in a neatly even distribution among the OSTs, so we extended the cluster by more OSTs, migrating data to even out the filling between old and new ones. This also succeeded, and up to October there was no indication of anything not working. There are no error messages in the logs that would point to some OSTs being favored ;-)

So, what could be the cause of this misdistribution?

Regards,
Thomas
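(For reference, a rough sketch of the commands involved in watching the per-OST fill levels, checking the allocator threshold, and deactivating a full OST on the MDS. The filesystem name, the OST index OST00a3, and the device number are placeholders; on older 1.6 installations the qos_prio_free value may have to be read directly from /proc/fs/lustre/lov/*/qos_prio_free rather than via lctl get_param.)

  # Per-OST usage as seen from a client
  lfs df -h /lustre

  # Free-space weighting threshold used by the MDS object allocator
  lctl get_param lov.*.qos_prio_free

  # On the MDS: find the device number of the OSC for the full OST
  # and stop new object creation on it
  lctl dl | grep OST00a3
  lctl --device <devno> deactivate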
Strangely (although I'm sure it's not related), I have seen the exact same behavior on my Lustre cluster in the last month or so. I have also never seen this before, and to the best of my knowledge there has been no change in usage patterns. I'm running 1.6.7.2 on the servers.

Ron Jerome
National Research Council Canada
Andreas Dilger
2009-Oct-31 00:12 UTC
[Lustre-discuss] Bad distribution of files among OSTs
On 2009-10-30, at 12:07, Thomas Roth wrote:
> in our 196-OST cluster, the previously perfect distribution of files
> among the OSTs has not been working for about two weeks now.
> The fill level of most OSTs is between 57% and 62%, but some (~10) have
> risen to 94%. I'm trying to fix that by deactivating these OSTs
> on the MDT and finding and migrating data away from them, but it seems
> I'm not fast enough and it's an ongoing problem - I've just deactivated
> another OST at a threatening 67%.

Is this correlated to some upgrade of Lustre? What version are you using?

> Our qos_prio_free is at the default 90%.
>
> Our OSTs' sizes are between 2.3TB and 4.5TB. We use striping level 1, so
> it would be possible to fill up an OST by just creating a 2TB file.
> However, I'm not aware of any such gigafiles (using Robinhood to get a
> picture of our file system).

To fill the smallest OST from 60% to 90% would only need a few files that
total 0.3 * 2.3TB, or 690GB. One way to find such files is to mount the
full OSTs with ldiskfs and do "find /mnt/ost/O/0 -size +100G" to list the
object IDs that are very large. In bug 21244 I've written a small
program that dumps the MDS inode number from the specified objects. You
can then use 'debugfs -c -R "ncheck {list of inode numbers}" /dev/${mdsdev}'
on the MDS to find the pathnames of those files.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
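(A rough sketch of that sequence, assuming the full OST can be mounted read-only as ldiskfs on its OSS. The mount point and device names are placeholders, and "objid2inode" merely stands in for the small program attached to bug 21244 - it is not a standard Lustre tool name.)

  # On the OSS: mount the full OST as ldiskfs and list unusually large objects
  mount -t ldiskfs -o ro /dev/ost_dev /mnt/ost
  find /mnt/ost/O/0 -size +100G

  # Map those object IDs to MDS inode numbers with the helper from bug 21244
  ./objid2inode /mnt/ost <objid> [<objid> ...]
  umount /mnt/ost

  # On the MDS: resolve the inode numbers to pathnames
  debugfs -c -R "ncheck <inode> <inode> ..." /dev/mds_dev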
Thanks, Andreas. Indeed we are running Lustre 1.6.7.2, on kernel 2.6.22, Debian Etch. But there was no upgrade involved; we moved from 1.6.7.1 to .2 in July.

The procedure you described has the slight disadvantage of having to take the OSTs in question offline. It would be nice if Robinhood did the same job on a live system - according to its manual, it can purge data on a per-OST basis if they become too full. However, I haven't yet found a way to extract just the info about these OSTs without deleting files.

In fact, I am in the process of collecting this info "manually": I have now quite a number of lists of users' data from running "lfs find --obd OST... /lustre/...", I just haven't run these lists through an "ls -lh" yet. Too busy moving the files instead of measuring them ;-)

Regards,
Thomas

--
--------------------------------------------------------------------
Thomas Roth
Gesellschaft für Schwerionenforschung
Planckstr. 1 - 64291 Darmstadt, Germany
Department: Informationstechnologie
Location: SB3 1.262
Phone: +49-6159-71 1453   Fax: +49-6159-71 2986
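(For the record, one way such a per-OST size survey might look. The OST UUID and paths are placeholders - the UUIDs can be taken from "lfs df" or "lctl dl" - and this assumes GNU xargs/ls as found on typical Linux clients.)

  # Collect all files with at least one object on the suspect OST
  lfs find --obd lustre-OST00a3_UUID /lustre > /tmp/ost00a3.files

  # Show the largest of them (the 5th column of "ls -l" is the size in bytes)
  xargs -d '\n' ls -l < /tmp/ost00a3.files | sort -n -k5 | tail -20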
Another question: Could this situation, 10 full OSTs out of 200, lead to a significant drop in performance? Before, we could usually get the full 110MB/s or so over the 1Gbit/s ethernet lines of the clients. That had dropped to about 50%, but we did not find anything else odd apart from the fill levels of the OSTs.

Regards,
Thomas
Another question I had with regard to this is: how long have your OSSs been running without a reboot? Mine have been up for 148 days, which is probably longer than ever before. And now that I've said this, it just occurred to me that one of them was rebooted about three weeks ago and all the others have been up for almost 6 months. I don't know if this has any relevance, but it's the only thing I can think of that's different.

Ron.
Andreas Dilger
2009-Nov-02 06:33 UTC
[Lustre-discuss] Bad distribution of files among OSTs
On 2009-11-01, at 02:03, Thomas Roth wrote:
> Could this situation, 10 full OSTs out of 200,
> lead to a significant drop in performance?
> Before, we could usually get the full 110MB/s
> or so over the 1Gbit/s ethernet lines of the clients.
> That had dropped to about 50%, but we did not
> find anything else odd apart from the fill levels of
> the OSTs.

Yes, this is entirely possible. If the OST is very full, then it takes longer to find free blocks.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Kevin Van Maren
2009-Nov-03 09:05 UTC
[Lustre-discuss] Bad distribution of files among OSTs
Andreas Dilger wrote:
> On 2009-11-01, at 02:03, Thomas Roth wrote:
>> Could this situation, 10 full OSTs out of 200,
>> lead to a significant drop in performance?
>> Before, we could usually get the full 110MB/s
>> or so over the 1Gbit/s ethernet lines of the clients.
>> That had dropped to about 50%, but we did not
>> find anything else odd apart from the fill levels of
>> the OSTs.
>
> Yes, this is entirely possible. If the OST is very
> full, then it takes longer to find free blocks.

Longer to find them, and less likely to allocate large contiguous free blocks, so the disk IOs are smaller with more seeks.

Kevin
[ ... ]
> Could this situation, 10 full OSTs out of 200, lead to a
> significant drop in performance?

Likely so, the major reasons being:

* If the OST spans a significant percentage of some disks, the inner tracks of disks are significantly slower than the outer tracks. This applies to any filesystem that fills up a disk. My home PC's 1TB disk can do about 100-110MB/s through the (JFS) filesystem on the outer tracks and around 50-55MB/s on the inner ones.

* The "free list" can become significantly scattered, depending on the precise allocation patterns on the disks. If there are many rewrites of small files, that can be particularly bad, even for extent-based filesystems, which suffer from it especially badly because the same file size has to be split into many more extents, increasing metadata overhead.

The two above are likely the reason why there have been other reports that speed goes down as filesystems fill up:

https://www.rz.uni-karlsruhe.de/rz/docs/Lustre/ssck_sfs_isc2007

"Performance degradation on xc2: After 6 months of production we lost half of the file system performance. The problem is under investigation by HP. We had a similar problem on xc1 which was due to fragmentation. The current solution for defragmentation is to recreate the file systems."

> Before, we could usually get the full 110MB/s or so over the
> 1Gbit/s ethernet lines of the clients. That had dropped to
> about 50%, but we did not find anything else odd apart from the
> fill levels of the OSTs.

It could just be that *all* the OSTs are filling up; it is impossible to avoid the inner-track issue on hard disks (except by limiting the top performance), and very difficult to avoid the scattering of the "free list". If you really care, some solutions are:

* Keep the filesystem no more than 60-70% full.

* Periodically reload the filesystem from backup after reformatting.

* Use just the outer 1/3 to 1/2 of the disks (which in recent years has been called "short stroking").

But looking at the absolute numbers there is something really wrong: 50MB/s out of 200 OSTs is ridiculously low. The problem is not that it is half of 110MB/s, and lower than it was before, but that it is very low in absolute terms. Each OST should be delivering at least 50MB/s with recent drives, even with mild inner-track or "free list" fragmentation issues. That you are getting 50MB/s may indicate that somehow your files are not being sliced across multiple OSTs. This can have several different reasons; IIRC there are a few discussions in the list archive on this.
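(One quick way to check that is to look at the striping of a few of the large or heavily read files. The paths below are placeholders, and the option spelling of lfs setstripe differs between Lustre versions - 1.6 also accepts an older positional form.)

  # Stripe count and OST objects of an existing file (count 1 = whole file on one OST)
  lfs getstripe /lustre/data/some_large_file

  # Default striping that new files created in a directory will inherit
  lfs getstripe /lustre/data

  # Example: stripe new files in a directory over 4 OSTs
  lfs setstripe -c 4 /lustre/data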