I'm attempting to run a pair of ActiveMQ java instances, using a shared Lustre filesystem mounted with flock for failover purposes. There are lots of ways to do ActiveMQ failover, and shared filesystem just happens to be the easiest.

ActiveMQ, at least the way we are using it, does a lot of small I/Os, like 600 - 800 IOPS worth of 6K I/Os. When I attempt to use Lustre as the shared filesystem, I see major I/O wait time on the CPUs, like 40 - 50%. My OSSes and MDS don't seem to be particularly busy, being 90% idle or more while this is running. If I remove Lustre from the equation and simply write to local disk OR to an iSCSI-mounted SAN disk, my ActiveMQ instances don't seem to have any problems.

The disks backing the OSSes are all 15K SAS disks in a RAID1 config. The OSSes (2 of them) each have 8 GB of memory and 4 CPU cores and are doing nothing else except being OSSes. The MDS has one CPU and 4 GB of memory and is 98% idle while under this ActiveMQ load. The network I am using for Lustre is dedicated gigabit ethernet, and there are 8 clients, two of which are these ActiveMQ clients.

So, my question is:

1. What should I be looking at to tune my Lustre FS for this type of I/O? I've tried upping the lru_size of the MDT and OST namespaces in /proc/fs/lustre/ldlm to 5000 and 2000 respectively, but I don't really see much difference. I have also ensured that striping is disabled (lfs setstripe -d) on the shared directory.

I guess I am just not experienced enough yet with Lustre to know how to track down and resolve this issue. I would think Lustre should be able to handle this load, but I must be missing something. For the record, NFS was not able to handle this load either, at least with default export settings (async improved things, but async is not an option).

- Jay
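P.S. For completeness, the settings described above amount to roughly the following client-side commands. This is only a sketch: the MGS NID, filesystem name, and mount point are placeholders, and the exact namespace wildcards may differ by Lustre version; the values are just the ones quoted above.

  # mount the shared filesystem with cluster-coherent flock support
  mount -t lustre -o flock mgs@tcp0:/lustrefs /mnt/lustre

  # raise the LDLM lock LRU sizes for the MDC and OSC namespaces on the client
  lctl set_param ldlm.namespaces.*mdc*.lru_size=5000
  lctl set_param ldlm.namespaces.*osc*.lru_size=2000

  # remove any default striping on the shared ActiveMQ directory
  lfs setstripe -d /mnt/lustre/activemq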
Hi Jay,

There are multiple ways to tune Lustre for small I/O. If you search the lustre-discuss archives, you will find many threads on the same topic. I have some suggestions below.

Jay Christopherson wrote:
> I'm attempting to run a pair of ActiveMQ java instances, using a
> shared Lustre filesystem mounted with flock for failover purposes.
> [ ... ]
> ActiveMQ, at least the way we are using it, does a lot of small I/Os,
> like 600 - 800 IOPS worth of 6K I/Os. When I attempt to use Lustre
> as the shared filesystem, I see major I/O wait time on the CPUs, like
> 40 - 50%. My OSSes and MDS don't seem to be particularly busy, being
> 90% idle or more while this is running.
> [ ... ]
> The network I am using for Lustre is dedicated gigabit ethernet, and
> there are 8 clients, two of which are these ActiveMQ clients.

First of all, I would suggest benchmarking your Lustre setup for a small-file workload. For example, use Bonnie++ in IOPS mode to create small files on Lustre. That will tell you the limit of your Lustre setup. I got about 6000 creates/sec on my 12-disk (Seagate SAS 15K RPM 300 GB) RAID10 setup.

> 1. What should I be looking at to tune my Lustre FS for this type of
> I/O? I've tried upping the lru_size of the MDT and OST namespaces in
> /proc/fs/lustre/ldlm to 5000 and 2000 respectively, but I don't really
> see much difference. I have also ensured that striping is disabled
> (lfs setstripe -d) on the shared directory.

Try disabling Lustre debug messages on all clients:

  sysctl -w lnet.debug=0

Try increasing the dirty cache on the client nodes:

  lctl set_param osc.*.max_dirty_mb=256

Also, you can bump max RPCs in flight up from 8 to 32, but given that you have a gigabit ethernet network, I don't think it will improve performance.

Cheers,
-Atul
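P.S. Putting the above together, the client-side commands would look roughly like this. Treat it as a sketch: the Bonnie++ flags and the benchmark directory are just one way of running it in "IOPS mode" (lots of small file creates, sequential throughput tests skipped), not exactly what I ran on my setup.

  # silence Lustre debug logging on each client
  sysctl -w lnet.debug=0

  # allow more dirty client-side cache per OSC
  lctl set_param osc.*.max_dirty_mb=256

  # optionally raise in-flight RPCs (unlikely to help over gigabit ethernet)
  lctl set_param osc.*.max_rpcs_in_flight=32

  # small-file create benchmark: 64*1024 files of 1-8 KB in 16 directories,
  # with the sequential throughput tests disabled (-s 0)
  bonnie++ -d /mnt/lustre/bench -s 0 -n 64:8192:1024:16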
Also, the Lustre manual includes a section on improving performance when working with small files:

http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

Sheila

Atul Vidwansa wrote:
> There are multiple ways to tune Lustre for small I/O. If you search the
> lustre-discuss archives, you will find many threads on the same topic.
> [ ... ]
The subject of this email, "[Lustre-discuss] tuning for small I/O", is a bit in the category of "tuning jackhammers to cut diamonds". Lustre has been designed for massive streaming parallel I/O, and does OK-ish for traditional ("home dir") situations. Not necessarily for shared message databases.

>>> I'm attempting to run a pair of ActiveMQ java instances,

Life will improve, I hope :-).

>>> using a shared Lustre filesystem mounted with flock for failover
>>> purposes.

The 'flock' is the key issue here, probably even more than the "small I/O" issue (see the mount-option sketch at the end of this message). Consider these threads on very similar topics:

http://lists.lustre.org/pipermail/lustre-discuss/2008-October/009001.html
  "The other alternative is "-o flock", which is coherent locking across
  all clients, but has a noticeable performance impact"

http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009690.html
  "It is both not very optimized and slower than local system since it
  needs to send network rpcs for locking (Except for the localflock
  which is same speed as for local fs)."

http://lists.lustre.org/pipermail/lustre-discuss/2004-August/000425.html
  "We faced similar issues when we tried to access/modify a single file
  concurrently from multiple processes (across multiple clients) using
  the MPI-IO interfaces. We faced similar issues with other file systems
  as well, so we resorted to implementing our own file/record-locking in
  the MPI-IO middleware (on top of file-systems)."

http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009679.html

etc.

>>> There's lots of ways to do ActiveMQ failover and shared
>>> filesystem just happens to be the easiest.

Easiest is very tempting, but perhaps not the most effective. If you really cared about getting meaningful replies, you would have provided these links, BTW:

http://activemq.apache.org/shared-file-system-master-slave.html
http://activemq.apache.org/replicated-message-store.html

Doing a bit more searching, it turns out that there are several ways to tune ActiveMQ, and this may reduce the number of barrier operations/committed transactions. Maybe. There seems to be something vaguely interesting here:

http://fusesource.com/docs/broker/5.0/persistence/persistence.pdf

Otherwise I'd use the master/slave replication feature, but this is just an impression.

>>> ActiveMQ, at least the way we are using it, does a lot of small
>>> I/Os, like 600 - 800 IOPS worth of 6K I/Os.

That seems pretty reasonable. I guess that is a few hundred/s worth of journal updates. The problem is, they will be mostly hitting the same files, thus the need for 'flock' and synchronous updates. So it matters *very much* how many of those 6K I/Os are transactional, that is, involve locking and flushing to disk. I suspect from your problems, and the later statement "async is not an option", that each of them is a transaction.

>>> When I attempt to use Lustre as the shared filesystem, I see
>>> major I/O wait time on the CPUs, like 40 - 50%.

Why do many people fixate on I/O wait? Just because it is easy to see? Bah! If there is one, what is the performance problem *on the client* in terms of *client application issues*? That's what matters.

>>> My OSSes and MDS don't seem to be particularly busy

Unsurprisingly. How many OSSes, how many OSTs per OSS, and how many disks?
(Just curiosity; it is not that important.)

>>> being 90% idle or more while this is running.

Ideally they would be 100% idle :-).

>>> If I remove Lustre from the equation and simply write to local
>>> disk OR to an iSCSI-mounted SAN disk, my ActiveMQ instances
>>> don't seem to have any problems.

And which problems do you have when running with Lustre? You haven't said. "Major I/O wait" and "90% idle" are not problems, they are statistics, and they could mean something else.

>>> The disks backing the OSSes are all 15K SAS disks in a
>>> RAID1 config.

RAID1 is nice, but how many? That would be a very important detail.

>>> 1. What should I be looking at to tune my Lustre FS for this
>>> type of I/O?

Tuning is not really the issue here. It is both a storage system problem and a network protocol problem. The "small I/O" problem is the lesser of the two; the real problem is that you have "small I/O on a shared filesystem with distributed interlocked updates to the same files", that is, a network protocol problem.

The network protocol problem is very, very difficult, because the server needs to synchronize two clients and present a current image of the files to both; that is, when one client does an update, the other client must be able to see it "immediately", which is not easy. For example, I have heard reports that when writing from a client to a Lustre server, sometimes (in a small percentage of cases) another client only sees the update dozens of seconds later (but your use of locking may help with that). I wonder if locking is enabled and used on that system, BTW.

>>> [ ... ] I have also ensured that striping is disabled (lfs
>>> setstripe -d) on the shared directory.

Unless your files are really big, that does not matter. Uhm, the message store seems to actually use a few biggish (32MB?) journal files plus (perhaps smaller) indices:

http://activemq.apache.org/persistence.html
http://activemq.apache.org/kahadb.html
http://activemq.apache.org/amq-message-store.html
http://activemq.apache.org/should-i-use-transactions.html
http://activemq.apache.org/how-lightweight-is-sending-a-message.html

So perhaps the striping does have an effect.

>>> I guess I am just not experienced enough yet with Lustre to know
>>> how to track down and resolve this issue. I would think Lustre
>>> should be able to handle this load, but I must be missing
>>> something.

Sure, it is able to handle that load -- it does, at great effort and going against the grain of what Lustre has been designed for. The basic problem is that instead of using a low-latency distributed communication system for interlocked updating of the message store, you are attempting to use the implicit one in a filesystem because "shared filesystem just happens to be the easiest", even though shared filesystems are not meant to give you a high transaction rate with low latency for a shared database. In ordinary shared filesystems, locking is provided for coarse protection and I/O is expected to be at fairly coarse granularity too, and more so for Lustre and other cluster filesystems.

>>> For the record, NFS was not able to handle this load either, at
>>> least with default export settings

It is very, very difficult to handle that workload across multiple clients in a distributed filesystem. Then NFSv3 or older have their own additional issues.

>>> (async improved things, but async is not an option).

If 'async' is not an option, you have a big problem in general, as hinted above. Also, though not very related here, the NFS client for Linux has some significant performance problems with writing.
To the point that sometimes I think Lustre could be used to replace NFS even when no clustering is desired (single OSS), simply because its protocol is better (and there is a point also to LNET).

>> First of all, I would suggest benchmarking your Lustre setup for
>> a small-file workload.

I may have misunderstood, but the original poster nowhere wrote "small file workload"; he wrote "small I/O", which is quite different. The shared message store he has set up receives small I/O transactions concurrently, but it is contained in journals of probably a few dozen MB each.

>> For example, use Bonnie++ in IOPS mode to create small files on
>> Lustre. That will tell you the limit of your Lustre setup. I got
>> about 6000 creates/sec on my 12-disk (Seagate SAS 15K RPM 300 GB)
>> RAID10 setup.

Small files and creates/sec do not seem to be what the original poster is worried about, even if 1000 metadata operations/s per disk pair is nice indeed.

>> Try disabling Lustre debug messages on all clients:
>> sysctl -w lnet.debug=0

That may help; I hadn't thought of that.

>> Try increasing the dirty cache on the client nodes:
>> lctl set_param osc.*.max_dirty_mb=256
>> Also, you can bump max RPCs in flight up from 8 to 32, but given that
>> you have a gigabit ethernet network, I don't think it will improve
>> performance.

That can be counterproductive, as the problem seems to be concurrent interlocked updates from multiple clients to the persistent database of a shared queue system (as is clear from the point that "async is not an option").

> Also, the Lustre manual includes a section on improving performance when
> working with small files:
> http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

The real problem, as hinted above, is that interlocking is of the essence in the application above, for a message store used by many distributed clients, and the message store is not made up of small files. However, there is an interesting point there about precisely the type of issue I was alluding to above about interlocking:

  "By default, Lustre enforces POSIX coherency semantics, so it results
  in lock ping-pong between client nodes if they are all writing to the
  same file at one time."

Perhaps the other advice may also be relevant:

  "Add more disks or use SSD disks for the OSTs. This dramatically
  improves the IOPS rate."

but I think it is mostly a locking latency issue. If it were a small-transaction issue, a barrier every 6K might not work well with SSDs, which have an erase page size usually around 32KiB.

Good luck ;-).
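P.S. For reference, the 'flock' vs 'localflock' distinction discussed above is just a client mount option. A sketch, with a made-up MGS NID and filesystem name:

  # coherent, cluster-wide flock: what a shared-store failover setup needs,
  # but every lock operation costs an extra round trip to the servers
  mount -t lustre -o flock mgs@tcp0:/lustrefs /mnt/lustre

  # client-local flock only: about as fast as a local filesystem, but the
  # two ActiveMQ brokers would not see each other's locks -- unsafe here
  mount -t lustre -o localflock mgs@tcp0:/lustrefs /mnt/lustre

  # (with neither option -- the default -- flock() calls are simply rejected)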
I have received some offline updates about this story:

>>> I'm attempting to run a pair of ActiveMQ java instances,
>>> using a shared Lustre filesystem mounted with flock for
>>> failover purposes.

> The 'flock' is the key issue here, probably even more than the
> "small I/O" issue. [ ... ]

>>> [ ... ] ActiveMQ, at least the way we are using it, does a
>>> lot of small I/Os, like 600 - 800 IOPS worth of 6K I/Os.

> That seems pretty reasonable. I guess that is a few hundred/s
> worth of journal updates. The problem is, they will be mostly
> hitting the same files, thus the need for 'flock' and
> synchronous updates. So it matters *very much* how many of
> those 6K I/Os are transactional, that is, involve locking and
> flushing to disk. I suspect from your problems, and the later
> statement "async is not an option", that each of them is a
> transaction.

>>> The disks backing the OSSes are all 15K SAS disks in
>>> a RAID1 config.

> RAID1 is nice, but how many? That would be a very important
> detail.

This apparently is a 14-drive RAID10 (hopefully a true RAID10 7x(1+1) rather than the RAID01 7+7 mentioned offline). That means a total rate of perhaps 100-120 6K transactions per disk, if lucky (it depends on the number of log files and their spread). The total data rate over Lustre is around 5 MB/s, and even at just 6K per operation Lustre should be able to do that, although I suspect that the achievable 'flock' rate depends more on the MDS storage system than on the OSS one.

If every write is a transaction, and (hopefully) ActiveMQ requests that every transaction be committed to stable storage, then it is both a 'flock' and an 'fsync' problem (a quick way to check this on the client is sketched below).

Then, depending on the size of the queue, I'd also look, if not already done, at using host adapters with a fairly large battery-backed buffer/cache for both the MDS and the OSSes, as the latency may come from waiting for uncached writes. Sure, the setup already works fast enough when the disks are local, which may mean that over-the-wire latencies add too much, but reducing the storage-system latency may still help, even if it is not needed in the local case.

That is purely a storage-layer issue (for both MDTs and OSTs) and has nothing to do with Lustre itself, while the 'flock' issue (and flushing from the *clients*) does have to do with Lustre (even if it too *may* be alleviated by very low-latency battery-backed buffers/caches).

Again, interlocked stable ('flock'/'fsync') storage operations between two clients via a third server are difficult to make fast, because of latency and flushing issues, in the context of remote file access, whether general-purpose like NFS or parallel bulk streaming like Lustre.
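P.S. One way to confirm how many of those small writes are actually transactional is to count the sync and lock calls the broker makes. A sketch using plain strace on the ActiveMQ JVM (the PID placeholder is of course whatever your broker's is):

  # attach to the broker JVM and count sync/lock system calls across all
  # threads; press Ctrl-C after a minute or so to get the per-call summary
  strace -f -c -e trace=fsync,fdatasync,flock -p <activemq-pid>

If the fsync/fdatasync counts are in the same ballpark as the 600-800 IOPS figure, then essentially every journal write is a commit, and what you are paying for is the per-operation lock/flush round trips over GigE rather than raw Lustre bandwidth.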