Greetings all,

I'm looking for some advice on improving disk performance and understanding what Lustre is doing with it. Right now I have a ~28 TB OSS with 4 OSTs on it. There are 4 clients using the native Lustre client - no NFS. If I write to the Lustre volume from the clients I get odd behavior. Typically the writes have a long pause before any data starts hitting the disks. Then 2 or 3 of the clients will write happily but one or two will not. Eventually Lustre will pump out a number of I/O-related errors such as "slow i_mutex 165 seconds, slow direct_io 32 seconds" and so on. Next, the clients that couldn't write will catch up and pass the clients that could. At some point (5 minutes or so) the jobs start failing without any errors. New jobs can be started after these fail, and the pattern repeats. Write speeds are low, around 22 MB/sec per client; the disks shouldn't have any problem handling 4 writes at this speed! This did work using NFS.

When these disks were formatted with XFS, I/O was fast: no problem at all writing 475 MB/sec sustained per RAID controller (locally, not via NFS), and no delays. After configuring for Lustre, the peak sustained write (locally) is 230 MB/sec, and it will write for about 2 minutes before logging about slow I/O. This is without any clients connected.

So far I've done the following:

1. Recompiled the SCSI driver for the RAID controller to use 1 MB blocks (up from 256k).
2. Adjusted MDS and OST thread counts.
3. Tried all I/O schedulers.
4. Tried all possible settings on the RAID controllers for caching and read-ahead.
5. Some minor stuff I forgot about!

Nothing makes a difference - same results under each configuration except for the schedulers. With the deadline scheduler the writes fail faster, with delays around 30 seconds; with all the others the delays range from 100 to 500 seconds.

The system has 4 cores and 4 GB of memory with four 7 TB OSTs. The disks are in RAID 6 split between two controllers with 2 GB of cache each. One controller also hosts the MGS/MDT. top normally shows 2/3 to 3/4 of memory utilized and 25% CPU utilization.

Suggestions?

Thank you,
Dan
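For reference while following the steps above: on 2.6 kernels the I/O scheduler and block-layer read-ahead are runtime-switchable. A small sketch, with /dev/sdb standing in as a placeholder for one of the RAID LUNs:

  # List the available elevators; the active one is shown in brackets.
  cat /sys/block/sdb/queue/scheduler

  # Switch to deadline at runtime.
  echo deadline > /sys/block/sdb/queue/scheduler

  # Query and set block-layer read-ahead, counted in 512-byte sectors.
  blockdev --getra /dev/sdb
  blockdev --setra 8192 /dev/sdb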
Hello Dan,

On Saturday 19 January 2008 01:45:13 Dan wrote:
> [...]

We usually benchmark with ldiskfs first; to figure out what you should be getting, compare ldiskfs directly against xfs:

  mount -t ldiskfs -o mballoc,extents /dev/{device name} /{favorite mount}

Now benchmark it and compare it to xfs. You may also want to play with additional options such as "data=writeback".

It would also be helpful to know which Lustre version you are using. E.g. in lustre-1.4, mballoc and extents are not enabled by default, so it's almost pure ext3, which is terribly slow compared to xfs.

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
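A minimal sketch of the comparison Bernd suggests, assuming /dev/sdb is a scratchable OST device and /mnt/ost_test an empty directory (both placeholders):

  # Mount the OST backing filesystem directly, with the allocator
  # features Lustre would use.
  mkdir -p /mnt/ost_test
  mount -t ldiskfs -o mballoc,extents /dev/sdb /mnt/ost_test

  # Sequential 8 GB write; conv=fsync makes dd flush to disk before
  # it reports throughput.
  dd if=/dev/zero of=/mnt/ost_test/bench bs=1M count=8192 conv=fsync

  umount /mnt/ost_test

Running the same dd against the device formatted as xfs gives the baseline to compare against.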
On Jan 18, 2008 16:45 -0800, Dan wrote:
> [...]

Are you using Lustre 1.4 or 1.6? Are you mounting your OSTs with "-o extents,mballoc"? We've had Lustre OSS nodes running in excess of 2GB/s with h/w RAID controllers.

Are you using partitions on your RAID device? You shouldn't - that causes unaligned IO to the device and a needless read-modify-write for each RAID stripe.

Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)? If not, then you should consider mounting your OSTs with "-o stripe={raid_stripe}", where raid_stripe = N * raid_chunksize, and N is the number of data disks (N+1 for RAID 5, N+2 for RAID 6).

You should download the lustre-iokit and use sgpdd-survey, obdfilter-survey, and PIOS to determine what is causing the performance bottleneck.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
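To make the stripe arithmetic concrete: an 8+2 RAID 6 with 128 kB chunks has a full stripe of 8 x 128 kB = 1 MB. One hedge on units: on many kernels the ext3/ldiskfs "stripe" mount option is counted in filesystem blocks (4 kB each), not kB, so a 1 MB full stripe would be stripe=256; verify this against your kernel's ext3 documentation. The device and mount point below are placeholders:

  # 8 data disks * 128 kB chunk = 1024 kB full stripe = 256 4-kB blocks.
  mount -t ldiskfs -o extents,mballoc,stripe=256 /dev/sdc /mnt/ost0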
Sorry for the long delay! I'm running Lustre 1.6.4.2.

I'm mounting with default options. When I use -o extents,mballoc it mounts and then the volume hangs. I tried to check it out with ldiskfs but had no luck; I had to reboot the machine (hard boot at that) to get the devices back. Judging by the logs, it mounts with mballoc by default.

I'm not using partitions on the RAID devices. I have two RAID controllers in the system, and all disks on each are grouped into a single RAID 6. The first controller has three volumes: one for the MGS/MDT and two OSTs. The other has just two OSTs.

I attempted using -o stripe={raid_stripe=N*raid_chunksize} but no luck: when mounting the OSTs with the stripe option they hang and never mount. I've tried a couple of stripe sizes. I was a little uncertain of the stripe size calculation, so here we go: my chunk size is 128k and there are 23 disks in the RAID 6 (one hot spare leaves 23). That means 21 data disks? Judging by your formula I take 23 * 128k, which is 2944. Is this even close to what you intended? This stripe size hangs at mount...

I've tried to test with the lustre-iokit but the tests (writes) fail on most OSTs - that is the problem I'm having, after all... frustrating.

Would it make sense to reconfigure the RAID controllers to have separate groups of disks in RAID 6? For performance, is there a recommended maximum size or number of disks for each OST? Lastly, is it worthwhile to consider putting the ext3 journal on another device exported from the RAID controller?

Thank you!!
Dan

> On Jan 18, 2008 16:45 -0800, Dan wrote:
> [...]
On Jan 30, 2008 18:32 -0800, Dan wrote:
> I was a little uncertain of the stripe size calculation, so here we go:
> my chunk size is 128k and there are 23 disks in the RAID 6 (one hot
> spare leaves 23). That means 21 data disks? Judging by your formula I
> take 23 * 128k, which is 2944. Is this even close to what you intended?
> This stripe size hangs at mount...

Hmm, I don't think the mballoc code can efficiently deal with a stripe size larger than the RPC size (which is 1MB), because this will always result in a read-modify-write of the RAID stripe - not enough data can be collected to fill a full stripe.

> I've tried to test with the lustre-iokit but the tests (writes) fail
> on most OSTs - that is the problem I'm having, after all... frustrating.
>
> Would it make sense to reconfigure the RAID controllers to have
> separate groups of disks in RAID 6? For performance, is there a
> recommended maximum size or number of disks for each OST? Lastly, is
> it worthwhile to consider putting the ext3 journal on another device
> exported from the RAID controller?

Having 21 disks in the RAID set is probably too large to be practical because of the high overhead of doing IO of such a large size. Good configurations for such a system might be 2x 8+2 + spare = 21 disks with a 128kB chunk size, or 16+2 + spare = 19 disks with a 64kB chunk size. Both result in a 1MB full stripe, which is what mballoc and Lustre are optimized for by default.

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
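A throwaway sketch for sanity-checking candidate layouts against that 1 MB target (the layouts listed are just examples):

  # full_stripe_kB = data_disks * chunk_kB; aim for 1024 to match
  # Lustre's 1 MB RPCs.
  for layout in "8 128" "16 64" "21 128"; do
      set -- $layout
      echo "RAID 6 ${1}+2, ${2} kB chunks -> full stripe $(($1 * $2)) kB"
  done

The 21-data-disk layout overshoots to 2688 kB, which is why every 1 MB RPC turns into a read-modify-write of the RAID stripe.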
Thanks Andreas. I'll reconfigure the RAID and give it another shot today. Would it be reasonable to attribute the stalled writes to this I/O mismatch?

Dan

On Thu, 2008-01-31 at 01:40 -0700, Andreas Dilger wrote:
> [...]
On Jan 31, 2008 08:25 -0800, Dan wrote:
> Thanks Andreas. I'll reconfigure the RAID and give it another shot
> today. Would it be reasonable to attribute the stalled writes to this
> I/O mismatch?

It would definitely hurt performance... Also, placing the MDT on the same RAID 6 is not very desirable. Given that you now have a few spare disks in the system, I'd also recommend a separate RAID 0+1 for the MDT device.

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
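Since the lustre-iokit comes up twice in this thread, an obdfilter-survey run of that era looked roughly like the line below; the variable names are recalled from the iokit documentation and should be checked against the header of the installed script:

  # Local-disk survey: drives the OSTs through the obdfilter layer,
  # taking clients and the network out of the picture.
  nobjhi=2 thrhi=32 size=1024 case=disk sh obdfilter-survey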