I've been going through my iostat, zilstat, and other outputs all to no avail. None of my disks ever seem to show outrageous service times, the load on the box is never high, and if the darned thing is CPU bound, I'm not even sure where to look.

"(traversing DDT blocks even if in memory, etc - and kernel times indeed are above 50%) as I'm zeroing "deleted" blocks inside the "internal" pool. This took several days already, but recovered lots of space in my main pool also..."

When you say you are zeroing deleted blocks, how are you going about doing that?

Despite claims to the contrary, I can understand ZFS needing some tuning. What I can't understand are the baffling differences in performance I see. For example, after deleting a large volume my performance will suddenly skyrocket and then gradually degrade. The question is why.

I'm not running dedup. My disks seem to be largely idle. I have eight 3GHz cores that also seem to be idle. I seem to have enough memory. What is ZFS doing during this time?

Everything I've read suggests one of two possible causes: too full, or bad hardware. Is there anything else that might be an issue here? Another ZFS factor I haven't taken into account?

Space seems to be the biggest factor in my performance differences (more free space = more performance), but as my fullest disks are less than 70% full and my emptiest disks are less than 10% full, I can't understand why space is an issue.

I have a few hardware errors on one of my pool disks, but we're talking about a very small number of errors over a long period of time. I'm considering replacing this disk, but the pool is so slow at times that I'm loath to slow it down further by doing a replace unless I can be more certain that will fix the problem.
-- This message posted from opensolaris.org
Well, as I wrote in other threads, I have a pool named "pool" on physical disks, and a compressed volume in this pool which I loopback-mount over iSCSI to make another pool named "dcpool". When files in "dcpool" are deleted, the blocks are not zeroed out by current ZFS, so they remain allocated in the physical "pool". Now I'm doing essentially this to clean up the parent pool:

# dd if=/dev/zero of=/dcpool/nodedup/bigzerofile

This file is in a non-deduped dataset, so from the point of view of "dcpool" it is a huge, growing file filled with zeroes, and its referenced blocks overwrite garbage left over from older deleted files no longer referenced by "dcpool". For the parent "pool", however, each of these writes is a compressed zeroed block, which does not need to be referenced, so the "pool" releases a volume block and its referencing metadata block. This has already released over half a terabyte in my physical pool (compressed blocks filled with zeroes are a special case for ZFS and require no, or fewer-than-usual, reference metadata blocks) ;)

However, since I have millions of 4KB blocks for volume data and its metadata, I guess fragmentation is quite high, maybe even interlacing one-to-one. One way or another, this "dcpool" never saw I/O faster than, say, 15MB/s, and usually lingers in the 1-5MB/s range, while I can easily get 30-50MB/s in other datasets of the "pool" (with dynamic block sizes and lengthier contiguous data stretches). Writes had been relatively quick for the first virtual terabyte or so, but it's been doing the last 100GB for several days now, at several megabytes per minute in the "dcpool" iostat. There are several MB/sec of I/Os on the hardware disks backing this deletion and clean-up, however (as in my examples in the previous post)...

As for disks with different fill ratios, that is a commonly discussed performance problem. It seems to boil down to this: free space on all disks (actually on top-level VDEVs) is considered when round-robining writes to stripes.
Disks that have been in use for a longer time may have very fragmented free space on one hand, and not so much of it on the other, but ZFS is still trying to push bits around evenly. And while it's waiting on some disks, others may be blocked as well. Something like that...

People on this forum have seen and reported that adding a 100MB file tanked their multi-terabyte pool's performance, and that removing the file boosted it back up. I don't want to mix up other writers' findings; better to search the recent 5-10 pages of forum post headings yourself. It's within the last hundred threads, I think, maybe ;)
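The zero-fill clean-up described above boils down to one dd plus a cleanup. A minimal sketch follows; note the target path would be the one from this thread (/dcpool/nodedup/bigzerofile), but a temp file is substituted here so the sketch runs anywhere, and a count= limit is added so it terminates (in practice you let dd run until the pool is nearly full):

```shell
# Hypothetical stand-in for /dcpool/nodedup/bigzerofile, a file in a
# non-deduped dataset of the inner pool.
ZEROFILE="${ZEROFILE:-$(mktemp)}"

# Write zeroes. On a compressed backing volume, compressed zero blocks
# replace (and thus free) the leftover garbage blocks in the parent pool.
# count=8 keeps the demo small; omit it to fill the pool.
dd if=/dev/zero of="$ZEROFILE" bs=1M count=8 2>/dev/null

SIZE=$(wc -c < "$ZEROFILE")
echo "wrote $SIZE bytes of zeroes"

# Remove the file afterwards so the inner pool itself does not stay full.
rm -f "$ZEROFILE"
```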
Hung-ShengTsao (Lao Tsao) Ph.D.
2011-May-10 22:36 UTC
[zfs-discuss] Performance problem suggestions?
It is my understanding that for (fast) writes you should consider a faster HDD (or SSD) for the ZIL, and for reads a faster HDD (or SSD) for the L2ARC. There have been many discussions concluding that for a V12N (virtualization) environment, a mirror (raid1) is better than raidz.

On 5/10/2011 3:31 PM, Don wrote:
> I've been going through my iostat, zilstat, and other outputs all to no avail. None of my disks ever seem to show outrageous service times, the load on the box is never high, and if the darned thing is CPU bound, I'm not even sure where to look.
> [...]
> # dd if=/dev/zero of=/dcpool/nodedup/bigzerofile

Ahh, I misunderstood your pool layout earlier. Now I see what you were doing.

> People on this forum have seen and reported that adding a 100MB file tanked their
> multi-terabyte pool's performance, and removing the file boosted it back up.

Sadly, I think several of those posts were mine or those of coworkers.

> Disks that have been in use for a longer time may have very fragmented free
> space on one hand, and not so much of it on the other, but ZFS is still trying to push
> bits around evenly. And while it's waiting on some disks, others may be blocked as
> well. Something like that...

This could explain why performance would go up after a large delete, but I've not seen large wait times for any of my disks. The service time, percent busy, and every other metric continue to show nearly idle disks. If this is the problem, it would be nice if there were a simple zfs or dtrace query that would show it to you.
> > Disks that have been in use for a longer time may have very fragmented free
> > space on one hand, and not so much of it on the other, but ZFS is still trying to push
> > bits around evenly. And while it's waiting on some disks, others may be blocked as
> > well. Something like that...
>
> This could explain why performance would go up after a large delete but I've not
> seen large wait times for any of my disks. The service time, percent busy, and
> every other metric continue to show nearly idle disks.

I believe that in this situation the older, fuller disks would show some activity while the others can show zero or few I/Os, because ZFS has no tasks for them. It sent out a series of blocks to write from the queue; the newer disks wrote theirs and stay dormant, while the older disks seek around to fit their piece of data... When the old disks complete the writes, ZFS batches them a new set of tasks.

> If this is the problem- it would be nice if there were a simple zfs or dtrace query
> that would show it to you.

Well, it seems that the bridge between the email and web interfaces to the OpenSolaris forums has been fixed, for new posts at least, and hopefully Richard Elling or some other experts will come up with an idea for a dtrace script for your situation. I have a little non-zero hope that the experts will also come to the web forums, review the past month's posts, and give their comments on my, your, and others' questions and findings ;)

//Jim Klimov
Keep in mind zfs_vdev_max_pending. In the latest version of S10 this is set to 10; ZFS will not issue more than that many requests at a time to a LUN. Your disks may look relatively idle while ZFS has a lot of data piled up inside, just waiting to be read or written. I have tweaked this on the fly. One key indicator is if your disk queues hover around 10.

Jim

----- Original Message -----
From: jimklimov at cos.ru
To: zfs-discuss at opensolaris.org
Sent: Wednesday, May 11, 2011 3:22:19 AM GMT -08:00 US/Canada Pacific
Subject: Re: [zfs-discuss] Performance problem suggestions?
[...]
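For reference, the zfs_vdev_max_pending tunable mentioned above can be inspected live with mdb (echo zfs_vdev_max_pending/D | mdb -k) or set persistently through /etc/system. A sketch, with an illustrative value rather than a recommendation:

```
* /etc/system fragment: cap the per-vdev I/O queue depth
* (takes effect after a reboot; the value 4 is only an example)
set zfs:zfs_vdev_max_pending = 4
```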
> It sent a series of blocks to write from the queue, newer disks wrote them and stay
> dormant, while older disks seek around to fit that piece of data... When old disks
> complete the writes, ZFS batches them a new set of tasks.

The thing is, as far as I know the OS doesn't ask the disk to find a place to fit the data. Instead the OS tracks what space on the disk is free and then tells the disk where to write the data. Even if ZFS were waiting for the I/O to complete, I would expect to see that delay reflected in the disk service times. In our case we see no high service times, no busy disks, nothing. It seems like ZFS is just sitting there quietly, thinking to itself. If the processor were busy that might make sense, but even there, our processor seems largely idle.

At the same time, even a scrub on this system is a joke right now, and that's a read-intensive operation. I'm seeing a scrub speed of 400K/s but almost no I/Os to my disks.
> The thing is- as far as I know the OS doesn't ask the disk to find a place
> to fit the data. Instead the OS tracks what space on the disk is free and
> then tells the disk where to write the data.

Yes and no; I did not formulate my idea clearly enough, sorry for the confusion ;)

Yes - the disks don't care about free blocks at all. To them they are just LBA sector numbers.

No - the OS does track which sectors correlate to the logical blocks it deems suitable for a write, and asks the disk to position its mechanical head to a specific track and access a specific sector. This is a slow operation which can only be done about 180-250 times per second for very random I/O (maybe more with HDD/controller caching, queuing, and faster spindles). I'm afraid that seeking to very dispersed metadata blocks, such as traversing the tree during a scrub on a fragmented drive, may qualify as very random I/O.

This reminds me of the long-languishing "BP rewrite" project, which would allow live rearranging of ZFS data, enabling, in particular, some extent of defragmentation. More useful applications would be changes to RAIDZ levels and the number of disks, though, maybe even removal of top-level VDEVs from a sufficiently empty pool... Hopefully the Illumos team or some other developers will push this idea into reality ;)

There was a good tip from Jim Litchfield regarding VDEV queue sizing, though. The current default for zfs_vdev_max_pending is 10, which is okay (or maybe even too much) for individual drives, but is not very much for arrays of many disks hidden behind a smart controller with its own caching and queuing, be it a SAN box controller or a PCI one which intercepts and reinterprets your ZFS's calls. So maybe this is indeed a bottleneck, which you would see in "iostat -xn 1" as "actv" field numbers near the configured queue size.

//Jim
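The "actv" check above is easy to automate: filter iostat output for devices whose average active queue sits near the configured limit. A sketch, assuming the Solaris "iostat -xn" column layout (actv is the 6th field, the device name is last); the sample line piped in below is fabricated for illustration:

```shell
# awk filter: print any device whose actv is within 1 of the queue limit.
# In practice you would pipe live output through it:
#   iostat -xn 5 | awk -v limit=10 "$CHECK"
CHECK='$6 + 0 >= limit - 1 && $NF != "device" { print $NF " actv=" $6 }'

# Demonstration on one fabricated sample line (iostat -xn field order:
# r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device):
echo '  12.0  340.0  96.0 2720.0  0.0  9.8  0.0 28.8  0 97 c0t2d0' \
  | awk -v limit=10 "$CHECK"
# -> c0t2d0 actv=9.8
```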
> This is a slow operation which can only be done about 180-250 times per second
> for very random I/Os (may be more with HDD/Controller caching, queuing and
> faster spindles).
> I'm afraid that seeking to very dispersed metadata blocks, such as traversing the
> tree during a scrub on a fragmented drive, may qualify as a very random I/O.

And that's the thing: I would understand if my scrub was slow because the disks were being hammered by IOPS, but, all joking aside, my pool is almost entirely idle according to iostat -xn.
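For reference, the 180-250 ops/s figure quoted above follows directly from mechanical latency: one random I/O costs roughly one average seek plus half a platter rotation. A back-of-envelope sketch (the drive numbers are illustrative, for a 15k rpm disk; a 7200 rpm SATA drive with ~8.5 ms average seeks lands near 80 IOPS by the same arithmetic):

```shell
# Random-I/O ceiling for a 15000 rpm drive with ~3.5 ms average seek.
# Average rotational latency = half a revolution = 60000 ms / rpm / 2.
awk 'BEGIN {
  rpm = 15000; seek_ms = 3.5
  rot_ms = 60000 / rpm / 2                    # 2.0 ms
  io_ms  = seek_ms + rot_ms                   # 5.5 ms per random I/O
  printf "~%.0f random IOPS\n", 1000 / io_ms  # prints ~182
}'
```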