I'm attempting to run a pair of ActiveMQ java instances, using a shared Lustre filesystem mounted with flock for failover purposes. There are lots of ways to do ActiveMQ failover, and shared filesystem just happens to be the easiest.

ActiveMQ, at least the way we are using it, does a lot of small I/Os, like 600 - 800 IOPS worth of 6K I/Os. When I attempt to use Lustre as the shared filesystem, I see major I/O wait time on the CPUs, like 40 - 50%. My OSSes and MDS don't seem to be particularly busy, being 90% idle or more while this is running. If I remove Lustre from the equation and simply write to local disk OR to an iSCSI-mounted SAN disk, my ActiveMQ instances don't seem to have any problems.

The disks backing the OSSes are all 15K SAS disks in a RAID1 config. The OSSes (2 of them) each have 8 GB of memory and 4 CPU cores and are doing nothing else except being OSSes. The MDS has one CPU and 4 GB of memory and is 98% idle while under this ActiveMQ load. The network I am using for Lustre is dedicated gigabit ethernet, and there are 8 clients, two of which are these ActiveMQ clients.

So, my question is:

1. What should I be looking at to tune my Lustre FS for this type of I/O? I've tried upping the lru_size of the MDT and OST namespaces in /proc/fs/lustre/ldlm to 5000 and 2000 respectively, but I don't really see much difference. I have also ensured that striping is disabled (lfs setstripe -d) on the shared directory.

I guess I am just not experienced enough yet with Lustre to know how to track down and resolve this issue. I would think Lustre should be able to handle this load, but I must be missing something. For the record, NFS was not able to handle this load either, at least with default export settings (async improved things, but async is not an option).

- Jay
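P.S. For completeness, the settings described above amount to roughly the following client-side commands. This is only a sketch: the MGS NID, filesystem name, and mount point are placeholders, and the exact namespace wildcards may differ by Lustre version; the values are just the ones quoted above.

  # mount the shared filesystem with cluster-coherent flock support
  mount -t lustre -o flock mgs@tcp0:/lustrefs /mnt/lustre

  # raise the LDLM lock LRU sizes for the MDC and OSC namespaces on the client
  lctl set_param ldlm.namespaces.*mdc*.lru_size=5000
  lctl set_param ldlm.namespaces.*osc*.lru_size=2000

  # remove any default striping on the shared ActiveMQ directory
  lfs setstripe -d /mnt/lustre/activemq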
Hi Jay,

There are multiple ways to tune Lustre for small I/O. If you search the lustre-discuss archives, you will find many threads on the same topic. I have some suggestions below.

Jay Christopherson wrote:
> I'm attempting to run a pair of ActiveMQ java instances, using a
> shared Lustre filesystem mounted with flock for failover purposes.
> [ ... ]
> ActiveMQ, at least the way we are using it, does a lot of small I/Os,
> like 600 - 800 IOPS worth of 6K I/Os. When I attempt to use Lustre
> as the shared filesystem, I see major I/O wait time on the CPUs, like
> 40 - 50%. My OSSes and MDS don't seem to be particularly busy, being
> 90% idle or more while this is running.
> [ ... ]
> The network I am using for Lustre is dedicated gigabit ethernet, and
> there are 8 clients, two of which are these ActiveMQ clients.

First of all, I would suggest benchmarking your Lustre setup for a small-file workload. For example, use Bonnie++ in IOPS mode to create small files on Lustre. That will tell you the limit of your Lustre setup. I got about 6000 creates/sec on my 12-disk (Seagate SAS 15K RPM 300 GB) RAID10 setup.

> 1. What should I be looking at to tune my Lustre FS for this type of
> I/O? I've tried upping the lru_size of the MDT and OST namespaces in
> /proc/fs/lustre/ldlm to 5000 and 2000 respectively, but I don't really
> see much difference. I have also ensured that striping is disabled
> (lfs setstripe -d) on the shared directory.

Try disabling Lustre debug messages on all clients:

  sysctl -w lnet.debug=0

Try increasing the dirty cache on the client nodes:

  lctl set_param osc.*.max_dirty_mb=256

Also, you can bump max RPCs in flight up from 8 to 32, but given that you have a gigabit ethernet network, I don't think it will improve performance.

Cheers,
-Atul
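P.S. Putting the above together, the client-side commands would look roughly like this. Treat it as a sketch: the Bonnie++ flags and the benchmark directory are just one way of running it in "IOPS mode" (lots of small file creates, sequential throughput tests skipped), not exactly what I ran on my setup.

  # silence Lustre debug logging on each client
  sysctl -w lnet.debug=0

  # allow more dirty client-side cache per OSC
  lctl set_param osc.*.max_dirty_mb=256

  # optionally raise in-flight RPCs (unlikely to help over gigabit ethernet)
  lctl set_param osc.*.max_rpcs_in_flight=32

  # small-file create benchmark: 64*1024 files of 1-8 KB in 16 directories,
  # with the sequential throughput tests disabled (-s 0)
  bonnie++ -d /mnt/lustre/bench -s 0 -n 64:8192:1024:16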
Also, the Lustre manual includes a section on improving performance when working with small files:

http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

Sheila

Atul Vidwansa wrote:
> There are multiple ways to tune Lustre for small I/O. If you search the
> lustre-discuss archives, you will find many threads on the same topic.
> [ ... ]
The subject of this email, "[Lustre-discuss] tuning for small I/O", is a bit in the category of "tuning jackhammers to cut diamonds". Lustre has been designed for massive streaming parallel I/O, and does OK-ish for traditional ("home dir") situations. Not necessarily for shared message databases.

>>> I'm attempting to run a pair of ActiveMQ java instances,

Life will improve, I hope :-).

>>> using a shared Lustre filesystem mounted with flock for failover
>>> purposes.

The 'flock' is the key issue here, probably even more than the "small I/O" issue (see the mount-option sketch at the end of this message). Consider these threads on very similar topics:

http://lists.lustre.org/pipermail/lustre-discuss/2008-October/009001.html
  "The other alternative is "-o flock", which is coherent locking across
  all clients, but has a noticeable performance impact"

http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009690.html
  "It is both not very optimized and slower than local system since it
  needs to send network rpcs for locking (Except for the localflock
  which is same speed as for local fs)."

http://lists.lustre.org/pipermail/lustre-discuss/2004-August/000425.html
  "We faced similar issues when we tried to access/modify a single file
  concurrently from multiple processes (across multiple clients) using
  the MPI-IO interfaces. We faced similar issues with other file systems
  as well, so we resorted to implementing our own file/record-locking in
  the MPI-IO middleware (on top of file-systems)."

http://lists.lustre.org/pipermail/lustre-discuss/2009-February/009679.html

etc.

>>> There's lots of ways to do ActiveMQ failover and shared
>>> filesystem just happens to be the easiest.

Easiest is very tempting, but perhaps not the most effective. If you really cared about getting meaningful replies, you would have provided these links, BTW:

http://activemq.apache.org/shared-file-system-master-slave.html
http://activemq.apache.org/replicated-message-store.html

Doing a bit more searching, it turns out that there are several ways to tune ActiveMQ, and this may reduce the number of barrier operations/committed transactions. Maybe. There seems to be something vaguely interesting here:

http://fusesource.com/docs/broker/5.0/persistence/persistence.pdf

Otherwise I'd use the master/slave replication feature, but this is just an impression.

>>> ActiveMQ, at least the way we are using it, does a lot of small
>>> I/Os, like 600 - 800 IOPS worth of 6K I/Os.

That seems pretty reasonable. I guess that is a few hundred/s worth of journal updates. The problem is, they will be mostly hitting the same files, thus the need for 'flock' and synchronous updates. So it matters *very much* how many of those 6K I/Os are transactional, that is, involve locking and flushing to disk. I suspect from your problems, and the later statement "async is not an option", that each of them is a transaction.

>>> When I attempt to use Lustre as the shared filesystem, I see
>>> major I/O wait time on the CPUs, like 40 - 50%.

Why do many people fixate on I/O wait? Just because it is easy to see? Bah! If there is one, what is the performance problem *on the client* in terms of *client application issues*? That's what matters.

>>> My OSSes and MDS don't seem to be particularly busy

Unsurprisingly. How many OSSes, how many OSTs per OSS, and how many disks?
(Just curiosity; it is not that important.)

>>> being 90% idle or more while this is running.

Ideally they would be 100% idle :-).

>>> If I remove Lustre from the equation and simply write to local
>>> disk OR to an iSCSI-mounted SAN disk, my ActiveMQ instances
>>> don't seem to have any problems.

And which problems do you have when running with Lustre? You haven't said. "Major I/O wait" and "90% idle" are not problems, they are statistics, and they could mean something else.

>>> The disks backing the OSSes are all 15K SAS disks in a
>>> RAID1 config.

RAID1 is nice, but how many? That would be a very important detail.

>>> 1. What should I be looking at to tune my Lustre FS for this
>>> type of I/O?

Tuning is not really the issue here. It is both a storage system problem and a network protocol problem. The "small I/O" problem is the lesser of the two; the real problem is that you have "small I/O on a shared filesystem with distributed interlocked updates to the same files", that is, a network protocol problem.

The network protocol problem is very, very difficult, because the server needs to synchronize two clients and present a current image of the files to both; that is, when one client does an update, the other client must be able to see it "immediately", which is not easy. For example, I have heard reports that when writing from a client to a Lustre server, sometimes (in a small percentage of cases) another client only sees the update dozens of seconds later (but your use of locking may help with that). I wonder if locking is enabled and used on that system, BTW.

>>> [ ... ] I have also ensured that striping is disabled (lfs
>>> setstripe -d) on the shared directory.

Unless your files are really big, that does not matter. Uhm, the message store seems to actually use a few biggish (32MB?) journal files plus (perhaps smaller) indices:

http://activemq.apache.org/persistence.html
http://activemq.apache.org/kahadb.html
http://activemq.apache.org/amq-message-store.html
http://activemq.apache.org/should-i-use-transactions.html
http://activemq.apache.org/how-lightweight-is-sending-a-message.html

So perhaps the striping does have an effect.

>>> I guess I am just not experienced enough yet with Lustre to know
>>> how to track down and resolve this issue. I would think Lustre
>>> should be able to handle this load, but I must be missing
>>> something.

Sure, it is able to handle that load -- it does, at great effort and going against the grain of what Lustre has been designed for. The basic problem is that instead of using a low-latency distributed communication system for interlocked updating of the message store, you are attempting to use the implicit one in a filesystem because "shared filesystem just happens to be the easiest", even though shared filesystems are not meant to give you a high transaction rate with low latency for a shared database. In ordinary shared filesystems, locking is provided for coarse protection and I/O is expected to be at fairly coarse granularity too, and more so for Lustre and other cluster filesystems.

>>> For the record, NFS was not able to handle this load either, at
>>> least with default export settings

It is very, very difficult to handle that workload across multiple clients in a distributed filesystem. Then NFSv3 or older have their own additional issues.

>>> (async improved things, but async is not an option).

If 'async' is not an option, you have a big problem in general, as hinted above. Also, though not very related here, the NFS client for Linux has some significant performance problems with writing.
To the point that sometimes I think Lustre could be used to replace NFS even when no clustering is desired (single OSS), simply because its protocol is better (and there is a point also to LNET).

>> First of all, I would suggest benchmarking your Lustre setup for
>> a small-file workload.

I may have misunderstood, but the original poster nowhere wrote "small file workload"; he wrote "small I/O", which is quite different. The shared message store he has set up receives small I/O transactions concurrently, but it is contained in journals of probably a few dozen MB each.

>> For example, use Bonnie++ in IOPS mode to create small files on
>> Lustre. That will tell you the limit of your Lustre setup. I got
>> about 6000 creates/sec on my 12-disk (Seagate SAS 15K RPM 300 GB)
>> RAID10 setup.

Small files and creates/sec do not seem to be what the original poster is worried about, even if 1000 metadata operations/s per disk pair is nice indeed.

>> Try disabling Lustre debug messages on all clients:
>> sysctl -w lnet.debug=0

That may help; I hadn't thought of that.

>> Try increasing the dirty cache on the client nodes:
>> lctl set_param osc.*.max_dirty_mb=256
>> Also, you can bump max RPCs in flight up from 8 to 32, but given that
>> you have a gigabit ethernet network, I don't think it will improve
>> performance.

That can be counterproductive, as the problem seems to be concurrent interlocked updates from multiple clients to the persistent database of a shared queue system (as is clear from the point that "async is not an option").

> Also, the Lustre manual includes a section on improving performance when
> working with small files:
> http://manual.lustre.org/manual/LustreManual18_HTML/LustreTroubleshootingTips.html#50532481_pgfId-1291398

The real problem, as hinted above, is that interlocking is of the essence in the application above, for a message store used by many distributed clients, and the message store is not made up of small files. However, there is an interesting point there about precisely the type of issue I was alluding to above about interlocking:

  "By default, Lustre enforces POSIX coherency semantics, so it results
  in lock ping-pong between client nodes if they are all writing to the
  same file at one time."

Perhaps the other advice may also be relevant:

  "Add more disks or use SSD disks for the OSTs. This dramatically
  improves the IOPS rate."

but I think it is mostly a locking latency issue. If it were a small-transaction issue, a barrier every 6K might not work well with SSDs, which have an erase page size usually around 32KiB.

Good luck ;-).
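P.S. For reference, the 'flock' vs 'localflock' distinction discussed above is just a client mount option. A sketch, with a made-up MGS NID and filesystem name:

  # coherent, cluster-wide flock: what a shared-store failover setup needs,
  # but every lock operation costs an extra round trip to the servers
  mount -t lustre -o flock mgs@tcp0:/lustrefs /mnt/lustre

  # client-local flock only: about as fast as a local filesystem, but the
  # two ActiveMQ brokers would not see each other's locks -- unsafe here
  mount -t lustre -o localflock mgs@tcp0:/lustrefs /mnt/lustre

  # (with neither option -- the default -- flock() calls are simply rejected)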
I have received some offline updates about this story:

>>> I'm attempting to run a pair of ActiveMQ java instances,
>>> using a shared Lustre filesystem mounted with flock for
>>> failover purposes.

> The 'flock' is the key issue here, probably even more than the
> "small I/O" issue. [ ... ]

>>> [ ... ] ActiveMQ, at least the way we are using it, does a
>>> lot of small I/Os, like 600 - 800 IOPS worth of 6K I/Os.

> That seems pretty reasonable. I guess that is a few hundred/s
> worth of journal updates. The problem is, they will be mostly
> hitting the same files, thus the need for 'flock' and
> synchronous updates. So it matters *very much* how many of
> those 6K I/Os are transactional, that is, involve locking and
> flushing to disk. I suspect from your problems, and the later
> statement "async is not an option", that each of them is a
> transaction.

>>> The disks backing the OSSes are all 15K SAS disks in
>>> a RAID1 config.

> RAID1 is nice, but how many? That would be a very important
> detail.

This apparently is a 14-drive RAID10 (hopefully a true RAID10 7x(1+1) rather than the RAID01 7+7 mentioned offline). That means a total rate of perhaps 100-120 6K transactions per disk, if lucky (it depends on the number of log files and their spread). The total data rate over Lustre is around 5 MB/s, and even at just 6K per operation Lustre should be able to do that, although I suspect that the achievable 'flock' rate depends more on the MDS storage system than on the OSS one.

If every write is a transaction, and (hopefully) ActiveMQ requests that every transaction be committed to stable storage, then it is both a 'flock' and an 'fsync' problem (a quick way to check this on the client is sketched below).

Then, depending on the size of the queue, I'd also look, if not already done, at using host adapters with a fairly large battery-backed buffer/cache for both the MDS and the OSSes, as the latency may come from waiting for uncached writes. Sure, the setup already works fast enough when the disks are local, which may mean that over-the-wire latencies add too much, but reducing the storage-system latency may still help, even if it is not needed in the local case.

That is purely a storage-layer issue (for both MDTs and OSTs) and has nothing to do with Lustre itself, while the 'flock' issue (and flushing from the *clients*) does have to do with Lustre (even if it too *may* be alleviated by very low-latency battery-backed buffers/caches).

Again, interlocked stable ('flock'/'fsync') storage operations between two clients via a third server are difficult to make fast, because of latency and flushing issues, in the context of remote file access, whether general-purpose like NFS or parallel bulk streaming like Lustre.
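P.S. One way to confirm how many of those small writes are actually transactional is to count the sync and lock calls the broker makes. A sketch using plain strace on the ActiveMQ JVM (the PID placeholder is of course whatever your broker's is):

  # attach to the broker JVM and count sync/lock system calls across all
  # threads; press Ctrl-C after a minute or so to get the per-call summary
  strace -f -c -e trace=fsync,fdatasync,flock -p <activemq-pid>

If the fsync/fdatasync counts are in the same ballpark as the 600-800 IOPS figure, then essentially every journal write is a commit, and what you are paying for is the per-operation lock/flush round trips over GigE rather than raw Lustre bandwidth.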