Greetings zfs-discuss@

I have been trying to narrow this down for quite some time. The problem
resides on a couple of osol/sxce boxes that are used as dom0 hosts. Under
high disk load on the domU guests (a backup process, for example), domU
performance is terrible. The worst thing is that iostat shows *very* high
%w numbers, while zpool iostat shows quite low numbers.

A couple of things to mention:
1. /etc/system tune: set zfs:zfs_arc_max = 524288000
2. dom0 is pinned to a dedicated CPU; its memory is capped at 1GB.
3. no hardware raid involved, raw SATA drives fed to dom0 under rpool.
4. domUs sit on top of zvols with an 8K blocksize.
5. iostat: http://pastebin.com/m4bf1c409
6. zpool iostat: http://pastebin.com/m179269e2
7. domU definition: http://pastebin.com/m48f18a76
8. dom0 bits are snv_115, snv_124, snv_126 and snv_130
9. domUs have ext3 mounted with: noatime,commit=120
10. there are ~4 domUs per dom0 host, each with dedicated cpu(s).

Any hint on where I should go from here would be appreciated.
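For reference, a minimal sketch of the kind of commands used to gather the
numbers above (exact flags may differ; rpool is the pool and domU01 is a
hypothetical zvol name):

    # device-level view: per-disk %w/%b and service times
    iostat -xnz 5

    # pool-level view: aggregate ops and bandwidth per vdev
    zpool iostat -v rpool 5

    # confirm the block size of a domU backing volume
    zfs get volblocksize rpool/domU01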
On Feb 14, 2010, at 9:24 AM, Bogdan Ćulibrk wrote:
> Greetings zfs-discuss@
>
> I have been trying to narrow this down for quite some time. The problem
> resides on a couple of osol/sxce boxes that are used as dom0 hosts. Under
> high disk load on the domU guests (a backup process, for example), domU
> performance is terrible. The worst thing is that iostat shows *very* high
> %w numbers, while zpool iostat shows quite low numbers.

Where is iostat %w measured?

> A couple of things to mention:
> 1. /etc/system tune: set zfs:zfs_arc_max = 524288000
> 2. dom0 is pinned to a dedicated CPU; its memory is capped at 1GB.
> 3. no hardware raid involved, raw SATA drives fed to dom0 under rpool.
> 4. domUs sit on top of zvols with an 8K blocksize.
> 5. iostat: http://pastebin.com/m4bf1c409

Is this data from dom0?
Looks like around 200-300 8KB random reads per second, which is about all
you can expect from 3-5 SATA disks.
 -- richard

> 6. zpool iostat: http://pastebin.com/m179269e2
> 7. domU definition: http://pastebin.com/m48f18a76
> 8. dom0 bits are snv_115, snv_124, snv_126 and snv_130
> 9. domUs have ext3 mounted with: noatime,commit=120
> 10. there are ~4 domUs per dom0 host, each with dedicated cpu(s).
>
> Any hint on where I should go from here would be appreciated.
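A back-of-envelope check of that estimate, assuming ~8-12 ms average
service time per random read on a 7200 rpm SATA disk and three data disks
in the pool:

    # per-disk ceiling ~ 1000 ms / 12..8 ms = ~80..125 random reads/s
    # three disks      ~ 250..375 reads/s, so 200-300 x 8KB reads/s is near the limit
    echo "3 * 1000 / 12; 3 * 1000 / 8" | bc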
Richard, first of all thank you for taking the time to look into this; I
appreciate it.

What are my options from here? To move onto a zvol with a greater
blocksize? 64k? 128k? Or will I get into other trouble going that way when
I have small reads coming from the domU (ext3 with its default blocksize
of 4k)?

Richard Elling wrote:
> On Feb 14, 2010, at 9:24 AM, Bogdan Ćulibrk wrote:
>
>> Greetings zfs-discuss@
>>
>> I have been trying to narrow this down for quite some time. The problem
>> resides on a couple of osol/sxce boxes that are used as dom0 hosts. Under
>> high disk load on the domU guests (a backup process, for example), domU
>> performance is terrible. The worst thing is that iostat shows *very* high
>> %w numbers, while zpool iostat shows quite low numbers.
>
> Where is iostat %w measured?
>
>> A couple of things to mention:
>> 1. /etc/system tune: set zfs:zfs_arc_max = 524288000
>> 2. dom0 is pinned to a dedicated CPU; its memory is capped at 1GB.
>> 3. no hardware raid involved, raw SATA drives fed to dom0 under rpool.
>> 4. domUs sit on top of zvols with an 8K blocksize.
>> 5. iostat: http://pastebin.com/m4bf1c409
>
> Is this data from dom0?
> Looks like around 200-300 8KB random reads per second, which is about all
> you can expect from 3-5 SATA disks.
>  -- richard
>
>> 6. zpool iostat: http://pastebin.com/m179269e2
>> 7. domU definition: http://pastebin.com/m48f18a76
>> 8. dom0 bits are snv_115, snv_124, snv_126 and snv_130
>> 9. domUs have ext3 mounted with: noatime,commit=120
>> 10. there are ~4 domUs per dom0 host, each with dedicated cpu(s).
>>
>> Any hint on where I should go from here would be appreciated.
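If I go that route, volblocksize can only be set when a zvol is created, so
it would mean building a new volume and copying the data across -- roughly
along these lines (a sketch only; the names, the 20G size and the 64K value
are just examples):

    # new backing volume with a larger block size (volblocksize is fixed
    # at creation time and cannot be changed afterwards)
    zfs create -V 20G -o volblocksize=64K rpool/domU01-new

    # copy the old volume into it from dom0 (hypothetical device paths)
    dd if=/dev/zvol/dsk/rpool/domU01 of=/dev/zvol/dsk/rpool/domU01-new bs=1024k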
Kjetil Torgrim Homme
2010-Feb-15 00:12 UTC
[zfs-discuss] ZFS slowness under domU high load
Bogdan Ćulibrk <bc at default.rs> writes:

> What are my options from here? To move onto a zvol with a greater
> blocksize? 64k? 128k? Or will I get into other trouble going that way
> when I have small reads coming from the domU (ext3 with its default
> blocksize of 4k)?

yes, definitely. have you considered using NFS rather than zvols for the
data filesystems? (keep zvols for the domU software.)

it's strange that you see so much write activity during backup -- I'd
expect that to do just reads... what's going on at the domU?

generally, the best way to improve performance is to add RAM for ARC
(512 MiB is *very* little IMHO) and an SSD for your ZIL, but that does
seem to be a poor match for your concept of many small low-cost dom0s.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
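On the dom0 side, the NFS route could look roughly like this (a sketch;
the dataset name, the domU address and the mount point are made up):

    # create a filesystem for the guest's data and export it over NFS
    zfs create rpool/data/domU01
    zfs set sharenfs=on rpool/data/domU01

    # inside the domU (Linux), mount it
    mount -t nfs -o nfsvers=3,noatime dom0-ip:/rpool/data/domU01 /data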
On 2/14/10 4:12 PM, Kjetil Torgrim Homme wrote:
> Bogdan Ćulibrk <bc at default.rs> writes:
>
>> What are my options from here? To move onto a zvol with a greater
>> blocksize? 64k? 128k? Or will I get into other trouble going that way
>> when I have small reads coming from the domU (ext3 with its default
>> blocksize of 4k)?
>
> yes, definitely. have you considered using NFS rather than zvols for the
> data filesystems? (keep zvols for the domU software.)
>
> it's strange that you see so much write activity during backup -- I'd
> expect that to do just reads... what's going on at the domU?

Most likely the cause of the whole problem is not having noatime set for
all domU and dom0 filesystems. During backups of the domUs you will be
constantly thrashing, doing writes to all your metadata for nothing.

On the dom0, if you don't have noatime on, you are constantly updating the
metadata for the domU backing store, since it generally has some read
traffic all the time.

> generally, the best way to improve performance is to add RAM for ARC
> (512 MiB is *very* little IMHO) and an SSD for your ZIL, but that does
> seem to be a poor match for your concept of many small low-cost dom0s.
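On the ZFS side, checking and disabling access-time updates is quick to do
per pool (a sketch; note that the atime property only applies to ZFS
filesystems, not to the zvols themselves):

    # see where atime updates are still enabled
    zfs get -r atime rpool

    # turn them off pool-wide (children inherit the setting)
    zfs set atime=off rpool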
On 2/14/10 7:02 PM, zfs ml wrote:
> On 2/14/10 4:12 PM, Kjetil Torgrim Homme wrote:
>> Bogdan Ćulibrk <bc at default.rs> writes:
>>
>>> What are my options from here? To move onto a zvol with a greater
>>> blocksize? 64k? 128k? Or will I get into other trouble going that way
>>> when I have small reads coming from the domU (ext3 with its default
>>> blocksize of 4k)?
>>
>> yes, definitely. have you considered using NFS rather than zvols for the
>> data filesystems? (keep zvols for the domU software.)
>>
>> it's strange that you see so much write activity during backup -- I'd
>> expect that to do just reads... what's going on at the domU?
>
> Most likely the cause of the whole problem is not having noatime set for
> all domU and dom0 filesystems. During backups of the domUs you will be
> constantly thrashing, doing writes to all your metadata for nothing.
>
> On the dom0, if you don't have noatime on, you are constantly updating
> the metadata for the domU backing store, since it generally has some
> read traffic all the time.

Sorry, scratch the above - I didn't see this:
9. domUs have ext3 mounted with: noatime,commit=120

Is the write traffic because you are backing up to the same disks that the
domUs live on?
zfs ml wrote:
> Sorry, scratch the above - I didn't see this:
> 9. domUs have ext3 mounted with: noatime,commit=120
>
> Is the write traffic because you are backing up to the same disks that
> the domUs live on?

Yes, it is.
Kjetil and Richard, thanks for this.

Kjetil Torgrim Homme wrote:
> Bogdan Ćulibrk <bc at default.rs> writes:
>
>> What are my options from here? To move onto a zvol with a greater
>> blocksize? 64k? 128k? Or will I get into other trouble going that way
>> when I have small reads coming from the domU (ext3 with its default
>> blocksize of 4k)?
>
> yes, definitely. have you considered using NFS rather than zvols for the
> data filesystems? (keep zvols for the domU software.)

That makes sense. Would it be useful to simply add a new drive to the
domU, backed by a zvol with a greater blocksize, or maybe by a vmdk file?
Does it have to be an NFS backend?

> it's strange that you see so much write activity during backup -- I'd
> expect that to do just reads... what's going on at the domU?
>
> generally, the best way to improve performance is to add RAM for ARC
> (512 MiB is *very* little IMHO) and an SSD for your ZIL, but that does
> seem to be a poor match for your concept of many small low-cost dom0s.

The writes come from the backup being packed before it is transferred to
the real backup location. Most likely this is the main reason for the
whole problem.

One more thing regarding SSD: would it be useful to throw in an additional
SAS/SATA drive to serve as L2ARC? I know an SSD is the most logical thing
to put as L2ARC, but will a conventional drive be of *any* help as L2ARC?
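For completeness, adding a cache device itself is a one-liner (c2t0d0 is a
placeholder, and log/cache devices were not supported on root pools in all
of these builds, so it may need a separate data pool):

    # add a disk (or SSD) as an L2ARC cache device
    zpool add rpool cache c2t0d0

    # watch how much of it actually gets used
    kstat -p zfs:0:arcstats:l2_size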
On Mon, Feb 15, 2010 at 01:45:57PM +0100, Bogdan Ćulibrk wrote:
> One more thing regarding SSD: would it be useful to throw in an
> additional SAS/SATA drive to serve as L2ARC? I know an SSD is the most
> logical thing to put as L2ARC, but will a conventional drive be of *any*
> help as L2ARC?

Only in very particular circumstances. L2ARC is a latency play; for it to
win, you need the l2arc device(s) to be lower latency than the primary
storage, at least for reads.

This usually translates to ssd for lower latency than disk, but it can
also work if your data pool has unusually high latency - remote iscsi,
usb, or some other odd, mostly channel-related configuration.

If the reason your disks have high latency is simply high load, l2arc on
another disk might, maybe, just work to redistribute some of that load,
but it will be a precarious balance, and it will probably need several
additional disks, perhaps roughly as many as are currently in the pool. By
that stage, you're better off just reshaping the pool to use the extra
disks to best effect: mirrors vs raidz, more vdevs, etc. Managing all that
l2arc will take memory, too.

In your case, though, a couple of extra disks dedicated to staging
whatever transform you're doing to the backup files might be worthwhile,
if it will fit. Even if they make the backup transform itself slower
(unlikely if it's predominantly sequential), removing the contention
impact from the primary service could be a net win.

-- 
Dan.
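A sketch of what that staging could look like, assuming two spare disks
(c3t0d0 and c3t1d0 are placeholders):

    # a small separate pool just for packing the backups, so the
    # transform's I/O does not compete with the domU zvols on rpool
    zpool create staging mirror c3t0d0 c3t1d0
    zfs create -o compression=on staging/backup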