Vincent Kéravec
2008-Nov-22 01:20 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
I just tried ZFS on one of our replication slaves and got some really bad performance.

When I started the server yesterday it was able to keep up with the main server without problems, but after two days of continuous running the server is crushed by I/O.

After running the DTrace script iopattern, I noticed that the workload is now 100% random I/O. Copying the database (140 GB) from one directory to another took more than 4 hours without any other tasks running on the server, and all the reads on tables that had been updated were random... Keeping an eye on iopattern and zpool iostat, I saw that when the system was accessing files that had not been changed the disk was reading sequentially at more than 50 MB/s, but when reading files that change often the speed dropped to 2-3 MB/s.

The server has plenty of free disk space, so it should not have that level of file fragmentation in such a short time.

For information, I'm using Solaris 10 10/08 with a mirrored root pool on two 1 TB SATA hard disks (slow with random I/O). I'm using MySQL 5.0.67 with the MyISAM engine. The ZFS recordsize is 8k, as recommended in the ZFS guide.
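For anyone who wants to watch for the same pattern, the two observation commands mentioned above look roughly like this; the pool name "tank" is an assumption, and iopattern is the script from Brendan Gregg's DTraceToolkit:

    # per-vdev physical throughput, 5-second samples
    # ("tank" is a placeholder pool name)
    zpool iostat -v tank 5

    # DTraceToolkit script reporting the split between random and
    # sequential I/O events, sampled every 5 seconds
    ./iopattern 5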
[Default] On Fri, 21 Nov 2008 17:20:48 PST, Vincent Kéravec <keravecv at gmail.com> wrote:

> Keeping an eye on iopattern and zpool iostat, I saw that when the
> system was accessing files that had not been changed the disk was
> reading sequentially at more than 50 MB/s, but when reading files
> that change often the speed dropped to 2-3 MB/s.

Good observation and analysis.

> The server has plenty of free disk space, so it should not have that
> level of file fragmentation in such a short time.

My explanation would be: whenever a block within a file changes, ZFS has to write it at another location ("copy on write"), so the previous version isn't immediately lost.

ZFS will try to keep the new version of the block close to the original one, but after several changes to the same database page things get pretty messed up, and logically sequential I/O becomes pretty much physically random indeed.

The original blocks will eventually be added to the free list and reused, so proximity can be restored, but it will never be 100% sequential again. The effect is larger when many snapshots are kept, because older block versions are not freed, or when the same block is changed very often and free-list updating has to be postponed.

That is the trade-off between "always consistent" and "fast".

> I'm using MySQL 5.0.67 with the MyISAM engine. The ZFS recordsize is
> 8k, as recommended in the ZFS guide.

I would suggest enlarging the MyISAM buffers. The InnoDB engine does copy on write within its own data files, so things might be different there.
-- 
  (  Kees Nuyt
  )
c[_]
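As a rough illustration of what "enlarge the MyISAM buffers" can mean in practice, here is a minimal sketch; the 512M figure is only a placeholder, since the right value depends on available RAM and the size of the hot indexes:

    # raise the MyISAM index cache at runtime (make it permanent in
    # my.cnf afterwards); MyISAM only caches index blocks here, data
    # blocks are left to the filesystem cache, i.e. the ZFS ARC
    mysql -e "SET GLOBAL key_buffer_size = 512*1024*1024;"

    # confirm the new value
    mysql -e "SHOW VARIABLES LIKE 'key_buffer_size';"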
Kees Nuyt wrote:
> The original blocks will eventually be added to the free list and
> reused, so proximity can be restored, but it will never be 100%
> sequential again.
> [...]
> That is the trade-off between "always consistent" and "fast".

Well, does that mean ZFS is not best suited as the underlying filesystem for database engines? With databases it will always become fragmented, hence slow?

By that reasoning it would be best suited to large file servers whose contents don't usually change frequently.

Thanks,
Tamer
ZFS works marvelously well for data warehouse and analytic DBs. For lots of small updates scattered across the breadth of the persistent working set, it's not going to work well IMO.

Note that we're using ZFS to host databases as large as 10,000 TB - that's 10 PB (!!) - on Solaris 10 U5 on X4540. That said, it's on 96 servers running Greenplum DB.

With SSD, the randomness won't matter much I expect, though the filesystem won't be helping, by virtue of this fragmentation effect of COW.

- Luke

----- Original Message -----
Sent: Sat Nov 22 16:43:53 2008
Subject: Re: [zfs-discuss] ZFS fragmentation with MySQL databases

> Well, does that mean ZFS is not best suited as the underlying
> filesystem for database engines? With databases it will always become
> fragmented, hence slow?
Bob Friesenhahn
2008-Nov-23 01:38 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
On Sun, 23 Nov 2008, Tamer Embaby wrote:
>> That is the trade-off between "always consistent" and "fast".
>
> Well, does that mean ZFS is not best suited as the underlying
> filesystem for database engines? With databases it will always become
> fragmented, hence slow?

Assuming that the filesystem block size matches the database block size, fragmentation is not so much of an issue, because databases are generally fragmented (almost by definition) due to their random-access nature. Only a freshly written database built from carefully ordered insert statements might be in a linear order, and then only for accesses in that same linear order. Database indexes could be negatively impacted, but they are likely to be cached in RAM anyway.

I understand that ZFS uses a slab allocator, so file data is reserved in larger slabs (e.g. 1 MB) and the blocks are then carved out of those. This tends to keep more of a file's data together and reduces allocation overhead.

Fragmentation has more of an impact on large files which would usually be accessed sequentially. ZFS's COW algorithm and ordered writes will always be slower than filesystems which simply overwrite existing blocks, but there is a better chance that the database will be immediately usable if someone pulls the power plug, and without needing to rely on special battery-backed hardware.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
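A concrete sketch of the block-size matching described above; the pool and dataset names are made up, and the 8k value simply mirrors what the original poster used. Note that recordsize only applies to files written after it is set, so it should be set before the database is copied in:

    # create the dataset with the desired recordsize *before* loading data
    zfs create -o recordsize=8k tank/mysql

    # verify the property
    zfs get recordsize tank/mysql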
Luke Lonergan wrote:
> ZFS works marvelously well for data warehouse and analytic DBs. For
> lots of small updates scattered across the breadth of the persistent
> working set, it's not going to work well IMO.

Actually, it does seem to work quite well when you use a read-optimized SSD for the L2ARC. In that case, "random" read workloads have very fast access, once the cache is warm.
 -- richard
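For reference, attaching a read-optimized SSD as L2ARC is a one-liner on releases that support cache devices; the pool and device names here are assumptions:

    # add an SSD as a level-2 ARC (read cache) device to an existing pool
    zpool add tank cache c2t0d0

    # the cache line in "zpool iostat -v" shows it filling as it warms
    zpool iostat -v tank 5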
> Actually, it does seem to work quite well when you use a
> read-optimized SSD for the L2ARC. In that case, "random" read
> workloads have very fast access, once the cache is warm.

One would expect so, yes. But the usefulness of this is limited to the cases where the entire working set will fit into an SSD cache.

In other words, for random access across a working set larger (by say X%) than the SSD-backed L2ARC, the cache is useless. This should asymptotically approach truth as X grows, and experience shows that X=200% is where it's about 99% true.

As time passes and SSDs get larger while many OLTP random workloads remain somewhat constrained in size, this becomes less important.

Modern DB workloads are becoming hybridized, though. A 'mixed workload' scenario is now common where there is a mix of updated working sets and indexed access alongside heavy analytical 'update rarely if ever' kinds of workloads.

- Luke
> In other words, for random access across a working set larger (by say
> X%) than the SSD-backed L2ARC, the cache is useless. This should
> asymptotically approach truth as X grows, and experience shows that
> X=200% is where it's about 99% true.

Ummm, before we throw around phrases like "useless", how about a little testing? I like a good academic argument just like the next guy, but before I dismiss something completely out of hand I'd like to see some data.

Bob
Bob Friesenhahn
2008-Nov-23 16:57 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
On Sat, 22 Nov 2008, Bob Netherton wrote:
>> In other words, for random access across a working set larger (by
>> say X%) than the SSD-backed L2ARC, the cache is useless. This should
>> asymptotically approach truth as X grows, and experience shows that
>> X=200% is where it's about 99% true.
>
> Ummm, before we throw around phrases like "useless", how about a
> little testing? I like a good academic argument just like the next
> guy, but before I dismiss something completely out of hand I'd like
> to see some data.

This argument can be proven by basic statistics without the need to resort to actual testing. A similar issue applies to non-volatile write caches.

Luckily, most data access is not completely random in nature.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
> This argument can be proven by basic statistics without the need to
> resort to actual testing.

Mathematical proof <> reality of how things end up getting used.

> Luckily, most data access is not completely random in nature.

Which was my point exactly. I've never seen a purely mathematical model put in production anywhere :-)

Bob
Bob Friesenhahn
2008-Nov-23 17:51 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
On Sun, 23 Nov 2008, Bob Netherton wrote:
>> This argument can be proven by basic statistics without the need to
>> resort to actual testing.
>
> Mathematical proof <> reality of how things end up getting used.

Right. That is a good thing, since otherwise the technologies that Sun has recently deployed for "Amber Road" would be deemed virtually useless (as would most computing architectures).

It is quite trivial to demonstrate scenarios where read caches will fail, or where NV write cache devices will become swamped (regardless of capacity) and worthless. Luckily, these are not the common scenarios for most users.

For the write cache case: if the volume of writes continually exceeds the write rate of the backing store and is continually directed at new locations, then the write cache becomes useless since it will always be full. The read cache case is subject to the normal rule that the cache needs to be large enough to contain the common "working set" of data in order to be effective.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Luke Lonergan wrote:
>> Actually, it does seem to work quite well when you use a
>> read-optimized SSD for the L2ARC. In that case, "random" read
>> workloads have very fast access, once the cache is warm.
>
> One would expect so, yes. But the usefulness of this is limited to
> the cases where the entire working set will fit into an SSD cache.

Not entirely out of the question. SSDs can be purchased today with more than 500 GBytes in a 2.5" form factor. One or more of these would make a dandy L2ARC.
http://www.stecinc.com/product/mach8mlc.php

> In other words, for random access across a working set larger (by say
> X%) than the SSD-backed L2ARC, the cache is useless. This should
> asymptotically approach truth as X grows, and experience shows that
> X=200% is where it's about 99% true.
>
> As time passes and SSDs get larger while many OLTP random workloads
> remain somewhat constrained in size, this becomes less important.

You can also purchase machines with 2+ TBytes of RAM, which will do nicely for caching most OLTP databases :-)

> Modern DB workloads are becoming hybridized, though. A 'mixed
> workload' scenario is now common where there is a mix of updated
> working sets and indexed access alongside heavy analytical 'update
> rarely if ever' kinds of workloads.

Agree. We think that the hybrid storage pool architecture will work well for a variety of these workloads, but the proof will be in the pudding. No doubt we'll discover some interesting interactions along the way. Stay tuned...
 -- richard
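For readers who have not seen the hybrid storage pool layout mentioned above, a minimal sketch follows; the pool name and all device names are placeholders, and it assumes a ZFS version with support for log and cache vdevs:

    # mirrored data disks, an SSD log device for synchronous writes,
    # and a larger read-optimized SSD as L2ARC
    zpool create tank mirror c1t0d0 c1t1d0 \
        log c1t2d0 \
        cache c1t3d0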
>> One would expect so, yes. But the usefulness of this is limited to
>> the cases where the entire working set will fit into an SSD cache.
>
> Not entirely out of the question. SSDs can be purchased today
> with more than 500 GBytes in a 2.5" form factor. One or more of
> these would make a dandy L2ARC.
> http://www.stecinc.com/product/mach8mlc.php

Speaking of which, what's the current limit on L2ARC size? Gathering tidbits here and there (7000 storage line config limits, the FAST talk given by Bill Moore), there are indications that the L2ARC can only be ~500 GB. Is this the case? If so, is that a raw size limitation, a limit on the number of devices used to form the L2ARC, or something else?

I'm sure some of us can come up with examples where we really would like to use much more than a 500 GB L2ARC :)
Darren J Moffat
2008-Dec-02 11:59 UTC
[zfs-discuss] ZFS fragmentation with MySQL databases
t. johnson wrote:
> Speaking of which, what's the current limit on L2ARC size? Gathering
> tidbits here and there (7000 storage line config limits, the FAST
> talk given by Bill Moore), there are indications that the L2ARC can
> only be ~500 GB?

There is no limit on the size of the L2ARC that I could find implemented in the source code.

However, every buffer that is cached on an L2ARC device needs an ARC header in the in-memory ARC that points to it. So in practical terms there will be a limit on the size of an L2ARC based on the size of physical RAM. For example, a machine with 512 MegaBytes of RAM and a 500 GByte SSD L2ARC is probably pretty silly.

I'll leave it as an exercise to the reader to work out how much core memory is needed, based on the sizes of arc_buf_t (0x30) and arc_buf_hdr_t (0xf8).

-- 
Darren J Moffat
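Taking up that exercise with a rough back-of-envelope sketch, assuming the 8 KB recordsize discussed earlier in this thread and roughly 0xf8 = 248 bytes of in-memory header per cached buffer:

    # headers needed for a 500 GB L2ARC filled with 8 KB records:
    # 500 GB / 8 KB  = ~65.5 million cached buffers
    # 65.5M * 248 B  = ~15.5 GB of RAM just for ARC headers
    echo $(( 500 * 1024 * 1024 * 1024 / 8192 * 248 / 1024 / 1024 ))   # ~15500 (MB)

    # with the default 128 KB recordsize the overhead drops ~16x, to roughly 1 GB

In other words, with small records the practical ceiling on L2ARC size is set by RAM rather than by the SSD.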