Vincent Kéravec
2008-Nov-22 01:20 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
I just tried ZFS on one of our replication slaves and got some really bad performance.

When I started the server yesterday it was able to keep up with the main server without problems, but after two days of continuous running the server is crushed by I/O.

After running the DTrace script iopattern, I noticed that the workload is now 100% random I/O. Copying the database (140 GB) from one directory to another took more than 4 hours without any other tasks running on the server, and all the reads on tables that had been updated were random... Keeping an eye on iopattern and zpool iostat, I saw that when the system was accessing files that had not been changed the disk was reading sequentially at more than 50 MB/s, but when reading files that change often the speed dropped to 2-3 MB/s.

The server has plenty of free disk space, so it should not have that level of file fragmentation in such a short time.

For information, I'm using Solaris 10 10/08 with a mirrored root pool on two 1 TB SATA hard disks (slow with random I/O). I'm using MySQL 5.0.67 with the MyISAM engine. The ZFS recordsize is 8k, as recommended in the ZFS guide.
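For anyone who wants to watch for the same pattern, the two observation commands mentioned above look roughly like this; the pool name "tank" is an assumption, and iopattern is the script from Brendan Gregg's DTraceToolkit:

    # per-vdev physical throughput, 5-second samples
    # ("tank" is a placeholder pool name)
    zpool iostat -v tank 5

    # DTraceToolkit script reporting the split between random and
    # sequential I/O events, sampled every 5 seconds
    ./iopattern 5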
[Default] On Fri, 21 Nov 2008 17:20:48 PST, Vincent Kéravec <keravecv at gmail.com> wrote:

> Keeping an eye on iopattern and zpool iostat, I saw that when the
> system was accessing files that had not been changed the disk was
> reading sequentially at more than 50 MB/s, but when reading files
> that change often the speed dropped to 2-3 MB/s.

Good observation and analysis.

> The server has plenty of free disk space, so it should not have that
> level of file fragmentation in such a short time.

My explanation would be: whenever a block within a file changes, ZFS has to write it at another location ("copy on write"), so the previous version isn't immediately lost.

ZFS will try to keep the new version of the block close to the original one, but after several changes to the same database page things get pretty messed up, and logically sequential I/O becomes pretty much physically random indeed.

The original blocks will eventually be added to the free list and reused, so proximity can be restored, but it will never be 100% sequential again. The effect is larger when many snapshots are kept, because older block versions are not freed, or when the same block is changed very often and free-list updating has to be postponed.

That is the trade-off between "always consistent" and "fast".

> I'm using MySQL 5.0.67 with the MyISAM engine. The ZFS recordsize is
> 8k, as recommended in the ZFS guide.

I would suggest enlarging the MyISAM buffers. The InnoDB engine does copy on write within its own data files, so things might be different there.
-- 
  (  Kees Nuyt
  )
c[_]
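As a rough illustration of what "enlarge the MyISAM buffers" can mean in practice, here is a minimal sketch; the 512M figure is only a placeholder, since the right value depends on available RAM and the size of the hot indexes:

    # raise the MyISAM index cache at runtime (make it permanent in
    # my.cnf afterwards); MyISAM only caches index blocks here, data
    # blocks are left to the filesystem cache, i.e. the ZFS ARC
    mysql -e "SET GLOBAL key_buffer_size = 512*1024*1024;"

    # confirm the new value
    mysql -e "SHOW VARIABLES LIKE 'key_buffer_size';"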
Kees Nuyt wrote:
> The original blocks will eventually be added to the free list and
> reused, so proximity can be restored, but it will never be 100%
> sequential again.
> [...]
> That is the trade-off between "always consistent" and "fast".

Well, does that mean ZFS is not best suited as the underlying filesystem for database engines? With databases it will always become fragmented, hence slow?

By that reasoning it would be best suited to large file servers whose contents don't usually change frequently.

Thanks,
Tamer
ZFS works marvelously well for data warehouse and analytic DBs. For lots of small updates scattered across the breadth of the persistent working set, it's not going to work well IMO.

Note that we're using ZFS to host databases as large as 10,000 TB - that's 10 PB (!!) - on Solaris 10 U5 on X4540. That said, it's on 96 servers running Greenplum DB.

With SSD, the randomness won't matter much I expect, though the filesystem won't be helping, by virtue of this fragmentation effect of COW.

- Luke

----- Original Message -----
Sent: Sat Nov 22 16:43:53 2008
Subject: Re: [zfs-discuss] ZFS fragmentation with MySQL databases

> Well, does that mean ZFS is not best suited as the underlying
> filesystem for database engines? With databases it will always become
> fragmented, hence slow?
Bob Friesenhahn
2008-Nov-23 01:38 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
On Sun, 23 Nov 2008, Tamer Embaby wrote:
>> That is the trade-off between "always consistent" and "fast".
>
> Well, does that mean ZFS is not best suited as the underlying
> filesystem for database engines? With databases it will always become
> fragmented, hence slow?

Assuming that the filesystem block size matches the database block size, fragmentation is not so much of an issue, because databases are generally fragmented (almost by definition) due to their random-access nature. Only a freshly written database built from carefully ordered insert statements might be in a linear order, and then only for accesses in that same linear order. Database indexes could be negatively impacted, but they are likely to be cached in RAM anyway.

I understand that ZFS uses a slab allocator, so file data is reserved in larger slabs (e.g. 1 MB) and the blocks are then carved out of those. This tends to keep more of a file's data together and reduces allocation overhead.

Fragmentation has more of an impact on large files which would usually be accessed sequentially. ZFS's COW algorithm and ordered writes will always be slower than filesystems which simply overwrite existing blocks, but there is a better chance that the database will be immediately usable if someone pulls the power plug, and without needing to rely on special battery-backed hardware.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
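A concrete sketch of the block-size matching described above; the pool and dataset names are made up, and the 8k value simply mirrors what the original poster used. Note that recordsize only applies to files written after it is set, so it should be set before the database is copied in:

    # create the dataset with the desired recordsize *before* loading data
    zfs create -o recordsize=8k tank/mysql

    # verify the property
    zfs get recordsize tank/mysql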
Luke Lonergan wrote:
> ZFS works marvelously well for data warehouse and analytic DBs. For
> lots of small updates scattered across the breadth of the persistent
> working set, it's not going to work well IMO.

Actually, it does seem to work quite well when you use a read-optimized SSD for the L2ARC. In that case, "random" read workloads have very fast access, once the cache is warm.
 -- richard
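For reference, attaching a read-optimized SSD as L2ARC is a one-liner on releases that support cache devices; the pool and device names here are assumptions:

    # add an SSD as a level-2 ARC (read cache) device to an existing pool
    zpool add tank cache c2t0d0

    # the cache line in "zpool iostat -v" shows it filling as it warms
    zpool iostat -v tank 5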
> Actually, it does seem to work quite well when you use a
> read-optimized SSD for the L2ARC. In that case, "random" read
> workloads have very fast access, once the cache is warm.

One would expect so, yes. But the usefulness of this is limited to the cases where the entire working set will fit into an SSD cache.

In other words, for random access across a working set larger (by say X%) than the SSD-backed L2ARC, the cache is useless. This should asymptotically approach truth as X grows, and experience shows that X=200% is where it's about 99% true.

As time passes and SSDs get larger while many OLTP random workloads remain somewhat constrained in size, this becomes less important.

Modern DB workloads are becoming hybridized, though. A 'mixed workload' scenario is now common where there is a mix of updated working sets and indexed access alongside heavy analytical 'update rarely if ever' kinds of workloads.

- Luke
> In other words, for random access across a working set larger (by say
> X%) than the SSD-backed L2ARC, the cache is useless. This should
> asymptotically approach truth as X grows, and experience shows that
> X=200% is where it's about 99% true.

Ummm, before we throw around phrases like "useless", how about a little testing? I like a good academic argument just like the next guy, but before I dismiss something completely out of hand I'd like to see some data.

Bob
Bob Friesenhahn
2008-Nov-23 16:57 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
On Sat, 22 Nov 2008, Bob Netherton wrote:
>> In other words, for random access across a working set larger (by
>> say X%) than the SSD-backed L2ARC, the cache is useless. This should
>> asymptotically approach truth as X grows, and experience shows that
>> X=200% is where it's about 99% true.
>
> Ummm, before we throw around phrases like "useless", how about a
> little testing? I like a good academic argument just like the next
> guy, but before I dismiss something completely out of hand I'd like
> to see some data.

This argument can be proven by basic statistics without the need to resort to actual testing. A similar issue applies to non-volatile write caches.

Luckily, most data access is not completely random in nature.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> This argument can be proven by basic statistics without the need to
> resort to actual testing.

Mathematical proof <> reality of how things end up getting used.

> Luckily, most data access is not completely random in nature.

Which was my point exactly. I've never seen a purely mathematical model put in production anywhere :-)

Bob
Bob Friesenhahn
2008-Nov-23 17:51 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
On Sun, 23 Nov 2008, Bob Netherton wrote:
>> This argument can be proven by basic statistics without the need to
>> resort to actual testing.
>
> Mathematical proof <> reality of how things end up getting used.

Right. That is a good thing, since otherwise the technologies that Sun has recently deployed for "Amber Road" would be deemed virtually useless (as would most computing architectures).

It is quite trivial to demonstrate scenarios where read caches will fail, or where NV write cache devices will become swamped (regardless of capacity) and worthless. Luckily, these are not the common scenarios for most users.

For the write cache case: if the volume of writes continually exceeds the write rate of the backing store and is continually directed at new locations, then the write cache becomes useless since it will always be full. The read cache case is subject to the normal rule that the cache needs to be large enough to contain the common "working set" of data in order to be effective.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Luke Lonergan wrote:
>> Actually, it does seem to work quite well when you use a
>> read-optimized SSD for the L2ARC. In that case, "random" read
>> workloads have very fast access, once the cache is warm.
>
> One would expect so, yes. But the usefulness of this is limited to
> the cases where the entire working set will fit into an SSD cache.

Not entirely out of the question. SSDs can be purchased today with more than 500 GBytes in a 2.5" form factor. One or more of these would make a dandy L2ARC.
http://www.stecinc.com/product/mach8mlc.php

> In other words, for random access across a working set larger (by say
> X%) than the SSD-backed L2ARC, the cache is useless. This should
> asymptotically approach truth as X grows, and experience shows that
> X=200% is where it's about 99% true.
>
> As time passes and SSDs get larger while many OLTP random workloads
> remain somewhat constrained in size, this becomes less important.

You can also purchase machines with 2+ TBytes of RAM, which will do nicely for caching most OLTP databases :-)

> Modern DB workloads are becoming hybridized, though. A 'mixed
> workload' scenario is now common where there is a mix of updated
> working sets and indexed access alongside heavy analytical 'update
> rarely if ever' kinds of workloads.

Agree. We think that the hybrid storage pool architecture will work well for a variety of these workloads, but the proof will be in the pudding. No doubt we'll discover some interesting interactions along the way. Stay tuned...
 -- richard
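For readers who have not seen the hybrid storage pool layout mentioned above, a minimal sketch follows; the pool name and all device names are placeholders, and it assumes a ZFS version with support for log and cache vdevs:

    # mirrored data disks, an SSD log device for synchronous writes,
    # and a larger read-optimized SSD as L2ARC
    zpool create tank mirror c1t0d0 c1t1d0 \
        log c1t2d0 \
        cache c1t3d0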
>> One would expect so, yes. But the usefulness of this is limited to
>> the cases where the entire working set will fit into an SSD cache.
>
> Not entirely out of the question. SSDs can be purchased today
> with more than 500 GBytes in a 2.5" form factor. One or more of
> these would make a dandy L2ARC.
> http://www.stecinc.com/product/mach8mlc.php

Speaking of which, what's the current limit on L2ARC size? Gathering tidbits here and there (7000 storage line config limits, the FAST talk given by Bill Moore), there are indications that the L2ARC can only be ~500 GB. Is this the case? If so, is that a raw size limitation, a limit on the number of devices used to form the L2ARC, or something else?

I'm sure some of us can come up with examples where we really would like to use much more than a 500 GB L2ARC :)
Darren J Moffat
2008-Dec-02 11:59 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
t. johnson wrote:
> Speaking of which, what's the current limit on L2ARC size? Gathering
> tidbits here and there (7000 storage line config limits, the FAST
> talk given by Bill Moore), there are indications that the L2ARC can
> only be ~500 GB?

There is no limit on the size of the L2ARC that I could find implemented in the source code.

However, every buffer that is cached on an L2ARC device needs an ARC header in the in-memory ARC that points to it. So in practical terms there will be a limit on the size of an L2ARC based on the size of physical RAM. For example, a machine with 512 MegaBytes of RAM and a 500 GByte SSD L2ARC is probably pretty silly.

I'll leave it as an exercise to the reader to work out how much core memory is needed, based on the sizes of arc_buf_t (0x30) and arc_buf_hdr_t (0xf8).

-- 
Darren J Moffat
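Taking up that exercise with a rough back-of-envelope sketch, assuming the 8 KB recordsize discussed earlier in this thread and roughly 0xf8 = 248 bytes of in-memory header per cached buffer:

    # headers needed for a 500 GB L2ARC filled with 8 KB records:
    # 500 GB / 8 KB  = ~65.5 million cached buffers
    # 65.5M * 248 B  = ~15.5 GB of RAM just for ARC headers
    echo $(( 500 * 1024 * 1024 * 1024 / 8192 * 248 / 1024 / 1024 ))   # ~15500 (MB)

    # with the default 128 KB recordsize the overhead drops ~16x, to roughly 1 GB

In other words, with small records the practical ceiling on L2ARC size is set by RAM rather than by the SSD.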