1. Due to the COW nature of ZFS, files on ZFS are more prone to fragmentation compared to a traditional file system. Is this statement correct?

2. If so, the common understanding is that fragmentation causes performance degradation. To what extent is ZFS performance affected by fragmentation?

3. Being a relatively new file system, has ZFS seen much adoption in large implementations?

4. Googling "zfs fragmentation" doesn't return many results. That could be because either there isn't much major adoption of ZFS, or fragmentation isn't really a problem for ZFS.

Any information is appreciated.
-- 
This message posted from opensolaris.org
On Thu, 6 Aug 2009, Hua wrote:
> 1. Due to the COW nature of ZFS, files on ZFS are more prone to
> fragmentation compared to a traditional file system. Is this
> statement correct?

Yes and no. Fragmentation is a complex issue.

ZFS uses 128K data blocks by default whereas other filesystems typically use 4K or 8K blocks. This naturally reduces the potential for fragmentation by 32X over 4K blocks.

ZFS storage pools are typically comprised of multiple "vdevs" and writes are distributed over these vdevs. This means that the first 128K of a file may go to the first vdev and the second 128K may go to the second vdev. It could be argued that this is a type of fragmentation, but since all of the vdevs can be read at once (if zfs prefetch chooses to do so), the seek time for single-user contiguous access is essentially zero, since the seeks occur while the application is already busy processing other data. When mirror vdevs are used, any device in the mirror may be used to read the data.

ZFS uses a slab allocator: it allocates large contiguous chunks from the vdev storage and then carves the 128K blocks from those large chunks. This dramatically increases the probability that related data will be very close on the same disk.

ZFS delays ordinary writes until the very last minute, according to these rules (my understanding): 7/8ths of total memory is consumed, 5 seconds of 100% write I/O has been collected, or 30 seconds have elapsed. Since quite a lot of data is written at once, zfs is able to write that data in the best possible order.

ZFS uses a copy-on-write model. Copy-on-write tends to cause fragmentation if portions of existing files are updated. If a large portion of a file is overwritten in a short period of time, the result should be reasonably fragment-free, but if parts of the file are updated over a long period of time (like a database) then the file is certain to be fragmented.
This is not such a big problem as it appears to be, since such files were typically already accessed using random access.

ZFS absolutely observes synchronous write requests (e.g. by NFS or a database). The synchronous write requests do not benefit from the long write-aggregation delay, so the result may not be written as ideally as ordinary write requests. Recently zfs has added support for using an SSD as a synchronous write log, and this allows zfs to turn synchronous writes into more ordinary writes which can be written more intelligently while returning to the user with minimal latency.

Perhaps the most significant fragmentation concern for zfs is if the pool is allowed to become close to 100% full. Similar to other filesystems, the quality of the storage allocations goes downhill fast when the pool is almost 100% full, so even files written contiguously may be written in fragments.

> 3. Being a relatively new file system, has ZFS seen much adoption in
> large implementations?

There are indeed some sites which heavily use zfs. One very large site using zfs is archive.org.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
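Bob's copy-on-write point can be sketched with a toy allocator. This is a deliberately simplified model, not ZFS's real allocator (no slabs, metaslabs, or transaction groups): overwriting in place keeps a file contiguous, while COW relocates each updated block to the current end of free space, so a file updated piecemeal over time ends up scattered.

```python
# Toy model of why copy-on-write scatters piecemeal updates.
# Illustrative only; real ZFS allocation is far more sophisticated.

def cow_update(file_blocks, block_index, next_free):
    """Relocate one logical block to a fresh on-disk address (COW)."""
    file_blocks[block_index] = next_free
    return next_free + 1

def fragment_count(file_blocks):
    """Count contiguous runs of on-disk addresses."""
    runs = 1
    for prev, cur in zip(file_blocks, file_blocks[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

# A file written contiguously as disk blocks 0..9: one extent.
blocks = list(range(10))
print(fragment_count(blocks))   # 1

# Update a few scattered logical blocks over time; each COW write
# lands at the then-current end of free space.
next_free = 10
for idx in (7, 2, 5):
    next_free = cow_update(blocks, idx, next_free)

print(blocks)                   # [0, 1, 11, 3, 4, 12, 6, 10, 8, 9]
print(fragment_count(blocks))   # 7 extents after three small updates
```

This also shows Bob's other point: rewriting the *whole* file in one batch would land all ten blocks contiguously again, which is why bulk overwrites stay reasonably fragment-free while databases do not.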
> ZFS absolutely observes synchronous write requests (e.g. by NFS or a
> database). The synchronous write requests do not benefit from the
> long write-aggregation delay, so the result may not be written as
> ideally as ordinary write requests. Recently zfs has added support
> for using an SSD as a synchronous write log, and this allows zfs to
> turn synchronous writes into more ordinary writes which can be written
> more intelligently while returning to the user with minimal latency.

Bob, since the ZIL is always used, whether on a separate device or not, won't writes to a system without a separate ZIL also be written as intelligently as with a separate ZIL?

Thanks,
Scott
-- 
This message posted from opensolaris.org
On 08/07/09 10:54, Scott Meilicke wrote:
>> ZFS absolutely observes synchronous write requests (e.g. by NFS or a
>> database). The synchronous write requests do not benefit from the
>> long write-aggregation delay, so the result may not be written as
>> ideally as ordinary write requests. Recently zfs has added support
>> for using an SSD as a synchronous write log, and this allows zfs to
>> turn synchronous writes into more ordinary writes which can be written
>> more intelligently while returning to the user with minimal latency.
>
> Bob, since the ZIL is always used, whether on a separate device or not,
> won't writes to a system without a separate ZIL also be written as
> intelligently as with a separate ZIL?

Yes. ZFS uses the same code path (intelligence?) to write out the data from NFS, regardless of whether there's a separate log (slog) or not.

> Thanks,
> Scott
On Fri, 7 Aug 2009, Scott Meilicke wrote:
> Bob, since the ZIL is always used, whether on a separate device or not,
> won't writes to a system without a separate ZIL also be written as
> intelligently as with a separate ZIL?

I don't know the answer to that. Perhaps there is no current advantage. The longer the final writes can be deferred, the more opportunity there is to write the data with a better layout, or to avoid writing some data at all.

One thing I forgot to mention in my summary is that zfs is commonly used in multi-user environments where there may be many simultaneous writers. Simultaneous writers tend to naturally fragment a filesystem unless the filesystem is willing to spread the data out in advance and take a seek hit (from one file to another) for each file write. ZFS's deferral of the writes allows the data to be written more intelligently in these multi-user environments.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
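The multi-writer point above can be illustrated with another toy allocator (again a sketch, not ZFS internals): assigning disk addresses the instant each write arrives interleaves the writers' blocks on disk, while deferring the batch and grouping it by file before allocating keeps each file's blocks together.

```python
# Toy comparison: eager allocation vs. deferred/batched allocation
# when two writers append to different files at the same time.

def allocate_immediately(write_stream):
    """Assign disk addresses in arrival order (writers interleave)."""
    layout = {}
    for addr, (file_id, _block) in enumerate(write_stream):
        layout.setdefault(file_id, []).append(addr)
    return layout

def allocate_batched(write_stream):
    """Defer the whole batch, group by file, then assign addresses."""
    ordered = sorted(write_stream, key=lambda w: w[0])  # stable sort
    return allocate_immediately(ordered)

def contiguous(addrs):
    return all(b == a + 1 for a, b in zip(addrs, addrs[1:]))

# Two writers appending to files 'a' and 'b' simultaneously.
stream = [('a', 0), ('b', 0), ('a', 1), ('b', 1), ('a', 2), ('b', 2)]

eager = allocate_immediately(stream)
lazy = allocate_batched(stream)
print(eager['a'], contiguous(eager['a']))   # [0, 2, 4] False
print(lazy['a'], contiguous(lazy['a']))     # [0, 1, 2] True
```

Reading file 'a' back in the eager layout means a seek between every block; in the batched layout it is one contiguous run, which is the benefit Bob attributes to zfs's write deferral.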
Let me give a real-life example of what I believe is a fragmented zfs pool.

Currently the pool is 2 terabytes in size (55% used) and is made of 4 SAN LUNs (512GB each). The pool has never gotten close to being full. We increase the size of the pool by adding 2 512GB LUNs about once a year or so.

The pool has been divided into 7 filesystems.

The pool is used for IMAP email data. The email system (Cyrus) has approximately 80,000 accounts, all located within the pool and evenly distributed between the filesystems.

Each account has a directory associated with it. This directory is the user's inbox. Additional mail folders are subdirectories. Mail is stored as individual files.

We receive mail at a rate of 0-20MB/second, every minute of every hour of every day of every week, etc. etc.

Users receive mail constantly over time. They read it and then either delete it or store it in a subdirectory/folder.

I imagine that my mail (located in a single subdirectory structure) is spread over the entire pool because it has been received over time. I believe the data is highly fragmented (from a file and directory perspective).

The result of this is that backup throughput of a single filesystem in this pool is about 8GB/hour. We use EMC Networker for backups.

This is a problem. There are no utilities available to evaluate this type of fragmentation. There are no utilities to fix it.

ZFS, from the mail system perspective, works great. Writes and random reads operate well. Backup is a problem, and not just because of small files, but because of small files scattered over the entire pool.

Adding another pool and copying all/some data over to it would only be a short-term solution.

I believe zfs needs a feature that operates in the background and defrags the pool to optimize sequential reads of the file and directory structure.

Ed
-- 
This message posted from opensolaris.org
On Aug 7, 2009, at 2:29 PM, Ed Spencer wrote:
> Let me give a real-life example of what I believe is a fragmented
> zfs pool.
>
> Currently the pool is 2 terabytes in size (55% used) and is made of
> 4 SAN LUNs (512GB each). The pool has never gotten close to being
> full. We increase the size of the pool by adding 2 512GB LUNs about
> once a year or so.
>
> The pool has been divided into 7 filesystems.
>
> The pool is used for IMAP email data. The email system (Cyrus) has
> approximately 80,000 accounts, all located within the pool and evenly
> distributed between the filesystems.
>
> Each account has a directory associated with it. This directory is
> the user's inbox. Additional mail folders are subdirectories. Mail is
> stored as individual files.
>
> We receive mail at a rate of 0-20MB/second, every minute of every
> hour of every day of every week, etc. etc.
>
> Users receive mail constantly over time. They read it and then
> either delete it or store it in a subdirectory/folder.
>
> I imagine that my mail (located in a single subdirectory structure)
> is spread over the entire pool because it has been received over
> time. I believe the data is highly fragmented (from a file and
> directory perspective).
>
> The result of this is that backup throughput of a single filesystem
> in this pool is about 8GB/hour. We use EMC Networker for backups.

This is very unlikely to be a "fragmentation problem." It is a scalability problem, and there may be something you can do about it in the short term. However, though I don't usually like to tease, in this case I need to: I recently completed a white paper on this exact workload and how we designed it to scale. I hope to publish that paper RSN. When the paper hits the web, I'll restart a new thread on using ZFS for large-scale email systems.

> This is a problem. There are no utilities available to evaluate this
> type of fragmentation. There are no utilities to fix it.
>
> ZFS, from the mail system perspective, works great.
> Writes and random reads operate well.
>
> Backup is a problem, and not just because of small files, but because
> of small files scattered over the entire pool.
>
> Adding another pool and copying all/some data over to it would only
> be a short-term solution.

I'll have to disagree.

> I believe zfs needs a feature that operates in the background and
> defrags the pool to optimize sequential reads of the file and
> directory structure.

This will not solve your problem, but there are other methods that can.
-- richard

> Ed
> -- 
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On Fri, 2009-08-07 at 19:33, Richard Elling wrote:
> This is very unlikely to be a "fragmentation problem." It is a
> scalability problem and there may be something you can do about it
> in the short term.

You could be right. Our test mail server has the exact same design and the same hardware (sun4v), but in a smaller configuration (less memory and 4 x 25GB SAN LUNs), and it has a backup/copy throughput of 30GB/hour. The data used for testing was "copied" from our production mail server.

> > Adding another pool and copying all/some data over to it would only
> > be a short-term solution.
>
> I'll have to disagree.

What is the point of a filesystem that can grow to such a huge size and not have functionality built in to optimize data layout? Real-world implementations of filesystems that are intended to live for years/decades need this functionality, don't they?

Our mail system works well; only the backup doesn't perform well. All the features of ZFS that make reads perform well (prefetch, ARC) have little effect.

We think backup is quite important. We do quite a few restores of months-old data. Snapshots help in the short term, but for longer-term restores we need to go to tape.

Of course, as you can tell, I'm kinda stuck on this idea that "file and directory fragmentation" is causing our issues with the backup. I don't know how to analyze the pool to better understand the problem.

If we did chop the pool up into, let's say, 7 pools (one for each current filesystem), then over time these 7 pools would grow and we would end up with the same issues. That's why it seems to me to be a short-term solution.

If our issues with zfs are scalability, then you could say zfs is not scalable. Is that true? (It certainly is if the solution is to create more pools!)
-- 
Ed
>> > Adding another pool and copying all/some data over to it would only
>> > be a short-term solution.
>>
>> I'll have to disagree.
>
> What is the point of a filesystem that can grow to such a huge size and
> not have functionality built in to optimize data layout? Real-world
> implementations of filesystems that are intended to live for
> years/decades need this functionality, don't they?
>
> Our mail system works well; only the backup doesn't perform well.
> All the features of ZFS that make reads perform well (prefetch, ARC)
> have little effect.
>
> We think backup is quite important. We do quite a few restores of
> months-old data. Snapshots help in the short term, but for longer-term
> restores we need to go to tape.

Your scalability problem may be in your backup solution. The problem is not how many GB of data you have but the number of files. It has been a while since I worked with Networker, so things may have changed.

If you are doing backups directly to tape you may have a buffering problem. By simply staging backups on disk we got a lot faster backups.

Have you configured Networker to do several simultaneous backups from your pool? You can do that by having several filesystems on the same pool, or by telling Networker to do backups one directory level down so that it thinks you have more file systems. And don't forget to play with the parallelism settings in Networker; this made a huge difference for us on VxFS.
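Why several simultaneous save streams help can be sketched with a simple wall-clock model (all figures here are assumptions for illustration — real Networker tuning happens in its parallelism settings, not in Python): when each small file costs a fixed I/O round-trip, total backup time shrinks roughly linearly with the number of concurrent streams, until the back-end disks saturate.

```python
# Simple model of a latency-bound backup split across parallel streams.
# The per-file service time and the saturation cap are assumed figures.

def backup_wall_time(num_files, seconds_per_file, streams,
                     max_useful_streams=54):
    """Wall-clock seconds to back up num_files, one file in flight
    per stream; extra streams beyond the disk-saturation cap don't help."""
    effective = min(streams, max_useful_streams)
    return num_files * seconds_per_file / effective

# 1,000,000 small files at an assumed ~10 ms per file:
one_stream = backup_wall_time(1_000_000, 0.010, streams=1)
ten_streams = backup_wall_time(1_000_000, 0.010, streams=10)
print(one_stream / 3600)    # ~2.78 hours single-stream
print(ten_streams / 3600)   # ~0.28 hours with 10 streams
```

The model ignores tape multiplexing and metadata overhead, but it captures the core point: a single serial stream leaves almost all of the spindles idle.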
On Sat, 8 Aug 2009, Ed Spencer wrote:
> What is the point of a filesystem that can grow to such a huge size and
> not have functionality built in to optimize data layout? Real-world
> implementations of filesystems that are intended to live for
> years/decades need this functionality, don't they?

Enterprise storage should work fine without needing to run a tool to optimize data layout or repair the filesystem. Well-designed software uses an approach which does not unravel through use.

> Our mail system works well; only the backup doesn't perform well.
> All the features of ZFS that make reads perform well (prefetch, ARC)
> have little effect.

It is already known that ZFS prefetch is often not aggressive enough for bulk reads, and sometimes gets lost entirely. I think that is the first issue to resolve in order to get your backups going faster. Many of us here have already tested our own systems and found that under some conditions ZFS was offering up only 30MB/second for bulk data reads, regardless of how exotic our storage pool and hardware was.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> Many of us here have already tested our own systems and found that
> under some conditions ZFS was offering up only 30MB/second for bulk
> data reads, regardless of how exotic our storage pool and hardware
> was.

Just so we are using the same units of measurement: backup/copy throughput on our development mail server is 8.5MB/sec. The people running our backups would be overjoyed with that performance. However, backup/copy throughput on our production mail server is 2.25MB/sec.

The underlying disk is 15000 RPM 146GB FC drives. Our performance may be hampered somewhat because the LUNs are on a Network Appliance accessed via iSCSI, but not to the extent that we are seeing, and that does not account for the throughput difference between the development and production pools.

When I talk about fragmentation it's not in the normal sense. I'm not talking about blocks in a file not being sequential. I'm talking about files in a single directory that end up spread across the entire filesystem/pool.

My problem right now is diagnosing the performance issues. I can't address them without understanding the underlying cause. There is a lack of tools to help in this area. There is also a lack of acceptance that I'm actually having a problem with zfs. It's frustrating.

Anyone know how to significantly increase the performance of a zfs filesystem without causing any downtime to an enterprise email system used by 30,000 intolerant people, when you don't really know what is causing the performance issues in the first place? (Yeah, it sucks to be me!)
-- 
Ed
On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:
> Your scalability problem may be in your backup solution.

We've eliminated the backup system as being involved with the performance issues.

The servers are Solaris 10 with the OS on UFS filesystems. (In zfs terms, the pool is old/mature.) Solaris has been patched to a fairly current level. Copying data from the zfs filesystem to the local UFS filesystem sees the same throughput as the backup system.

The test was simple. Create a test filesystem on the zfs pool. Restore production email data to it. Reboot the server. Back up the data (29 minutes for 15.8GB of data). Reboot the server. Copy the data from zfs to UFS using a 'cp -pr ...' command, which also took 29 minutes.

And if anyone is interested, it only took 15 minutes to restore (write) the 15.8GB of data over the network.
-- 
Ed
On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> Enterprise storage should work fine without needing to run a tool to
> optimize data layout or repair the filesystem. Well-designed software
> uses an approach which does not unravel through use.

Hmmmm, this is counter to my understanding. I always thought that to optimize sequential read performance you must store the data according to how the device will read the data.

Spinning rust reads data in a sequential fashion. In order to optimize read performance, the data has to be laid down that way. When reading files in a directory, the files need to be laid out on the physical device sequentially for optimal read performance.

I'm probably not the person to argue this point, though... Is there a DBA around?

Maybe my problems will go away once we move to the next generation of storage devices, SSDs! I'm starting to think that ZFS will really shine on SSDs.
-- 
Ed
On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>> Many of us here have already tested our own systems and found that
>> under some conditions ZFS was offering up only 30MB/second for bulk
>> data reads, regardless of how exotic our storage pool and hardware
>> was.
>
> Just so we are using the same units of measurement: backup/copy
> throughput on our development mail server is 8.5MB/sec. The people
> running our backups would be overjoyed with that performance.
>
> However, backup/copy throughput on our production mail server is
> 2.25MB/sec.
>
> The underlying disk is 15000 RPM 146GB FC drives.
> Our performance may be hampered somewhat because the LUNs are on a
> Network Appliance accessed via iSCSI, but not to the extent that we
> are seeing, and that does not account for the throughput difference
> between the development and production pools.

NetApp filers run WAFL - the Write Anywhere File Layout. Even if ZFS arranged everything perfectly (however that is defined), WAFL would undo its hard work.

Since you are using iSCSI, I assume that you have disabled the Nagle algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat. If not, go do that now.

> When I talk about fragmentation it's not in the normal sense. I'm not
> talking about blocks in a file not being sequential. I'm talking about
> files in a single directory that end up spread across the entire
> filesystem/pool.

It's tempting to think that if the files were in roughly the same area of the block device that ZFS sees, reading the files sequentially would at least trigger a read-ahead at the filer. I suspect that even a moderate amount of file creation and deletion would cause the I/O pattern to be random enough (not purely sequential) that the back-end storage would not have a reasonable chance of recognizing it as a good time for read-ahead.
Further, the backup application is probably in a loop like:

  while there are more files in the directory
      if next file mtime > last backup time
          open file
          read file contents, send to backup stream
          close file
      end if
  end while

In other words, other I/O operations are interspersed between the sequential data reads, some files are likely to be skipped, and there is latency introduced by writing to the data stream. I would be surprised to see any file system do intelligent read-ahead here. In other words, lots of small file operations make backups, and especially restores, go slowly.

More backup and restore streams will almost certainly help. Multiplex the streams so that you can keep your tapes moving at a constant speed.

Do you have statistics on network utilization to ensure that you aren't stressing it?

Have you looked at iostat data to be sure that you are seeing asvc_t + wsvc_t values that support the number of operations that you need to perform? That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a workload that waits for the completion of one I/O before issuing the next will max out at 100 iops. Presumably ZFS should hide some of this from you[1], but it does suggest that each backup stream would be limited to about 100 files per second[2]. This is because the read request for one file does not happen before the close of the previous file[3]. Since Cyrus stores each message as a separate file, this suggests that 2.5 MB/s corresponds to an average mail message size of 25 KB.

1. via metadata caching, read-ahead on file data reads, etc.
2. Assuming wsvc_t + asvc_t = 10 ms
3. Assuming that Networker is about as smart as tar, zip, cpio, etc.

> My problem right now is diagnosing the performance issues. I can't
> address them without understanding the underlying cause. There is a
> lack of tools to help in this area. There is also a lack of acceptance
> that I'm actually having a problem with zfs.
> It's frustrating.

This is a prime example of why Sun needs to sell Analytics[4][5] as an add-on to Solaris in general. This problem is just as hard to figure out on Solaris as it is on Linux, Windows, etc. If Analytics were bundled with Gold and above support contracts, it would be a very compelling reason to shell out a few extra bucks for a better support contract.

4. http://blogs.sun.com/bmc/resource/cec_analytics.pdf
5. http://blogs.sun.com/brendan/category/Fishworks

> Anyone know how to significantly increase the performance of a zfs
> filesystem without causing any downtime to an enterprise email system
> used by 30,000 intolerant people, when you don't really know what is
> causing the performance issues in the first place? (Yeah, it sucks to
> be me!)

Hopefully I've helped find a couple of places to look...
-- 
Mike Gerdts
http://mgerdts.blogspot.com/
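Mike's 100-files-per-second arithmetic is worth making explicit (the 10 ms service time and 25 KB average message size are his assumed figures from the footnotes): a serial backup loop that waits out one full I/O round-trip per file is capped by latency, not by raw disk bandwidth.

```python
# Latency-bound throughput of one serial backup stream:
# one file's I/O must complete before the next begins.

def serial_backup_throughput(service_time_s, avg_file_bytes):
    """Return (files/sec, bytes/sec) for a one-file-at-a-time reader."""
    files_per_second = 1.0 / service_time_s
    return files_per_second, files_per_second * avg_file_bytes

# Assumed: wsvc_t + asvc_t = 10 ms, average message = 25 KB (decimal).
files_per_s, bytes_per_s = serial_backup_throughput(0.010, 25_000)
print(files_per_s)          # 100.0 files/s
print(bytes_per_s / 1e6)    # 2.5 MB/s, matching the observed ~2.25 MB/s
```

Note that the cap is independent of how fast the disks can stream: halving the per-file latency (or doubling the stream count) doubles throughput, while faster sequential disk reads change nothing.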
On Sat, Aug 8, 2009 at 3:02 PM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>
>> Enterprise storage should work fine without needing to run a tool to
>> optimize data layout or repair the filesystem. Well-designed software
>> uses an approach which does not unravel through use.
>
> Hmmmm, this is counter to my understanding. I always thought that to
> optimize sequential read performance you must store the data according
> to how the device will read the data.
>
> Spinning rust reads data in a sequential fashion. In order to optimize
> read performance, the data has to be laid down that way.
>
> When reading files in a directory, the files need to be laid out on
> the physical device sequentially for optimal read performance.
>
> I'm probably not the person to argue this point, though... Is there a
> DBA around?

The DBAs that I know use files that are at least hundreds of megabytes in size. Your problem is very different.

> Maybe my problems will go away once we move to the next generation of
> storage devices, SSDs! I'm starting to think that ZFS will really
> shine on SSDs.

Your problem seems to be related to cold reads of a pretty large data set. With SSDs (L2ARC) you are likely to see a performance boost for a larger set of recently read files, but my guess is that backups will still be pretty slow. There is likely more benefit in restore speed with SSDs than there is in read speeds. However, the NVRAM on the NetApp that is backing your iSCSI LUNs is probably already giving you most of this benefit (assuming low latency on the network connections).
-- 
Mike Gerdts
http://mgerdts.blogspot.com/
On Sat, 8 Aug 2009, Ed Spencer wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>
>> Enterprise storage should work fine without needing to run a tool to
>> optimize data layout or repair the filesystem. Well-designed software
>> uses an approach which does not unravel through use.
>
> Hmmmm, this is counter to my understanding. I always thought that to
> optimize sequential read performance you must store the data according
> to how the device will read the data.

That is something I agree with. As a result, the requirement/goal of an enterprise storage system should be to ensure that data is as contiguous as possible, keeping in mind that multiple disks (LUNs) may be involved. It should not unravel and require first aid in order to work correctly (like MS-DOS FAT).

If you are using a big LUN on some other storage device, then zfs is not able to do nearly as much to optimize performance as it would if it interfaced with a JBOD array. For the big LUN, all it can do is try to write blocks associated with the current transaction group in the most contiguous order, and read-ahead cannot be as useful. With the big LUN it does not know if the data is on different physical disks, so it does not know if reading data in parallel will help reduce the read latencies.

> Maybe my problems will go away once we move to the next generation of
> storage devices, SSDs! I'm starting to think that ZFS will really
> shine on SSDs.

An SSD slog backed by a SAS 15K JBOD array should perform much better than a big iSCSI LUN.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:
> The DBAs that I know use files that are at least hundreds of
> megabytes in size. Your problem is very different.

Yes, definitely.

I'm relating records in a table to my small files because our email system treats the filesystem as a database. And in the back of my mind I'm also thinking that you have to rebuild/repair a database once in a while to improve performance. And in my case, since the filesystem is the database, I want to do that to zfs!

At least that's what I'm thinking. However, and I always come back to this, I'm not certain what is causing my problem. I need certainty before taking action on the production system.
-- 
Ed
On Sat, Aug 8, 2009 at 3:25 PM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:
>
>> The DBAs that I know use files that are at least hundreds of
>> megabytes in size. Your problem is very different.
>
> Yes, definitely.
>
> I'm relating records in a table to my small files because our email
> system treats the filesystem as a database.

Right... but ZFS doesn't understand your application. The reason that a file system would put files that are in the same directory in the same general area on a disk is to minimize seek time. I would argue that seek time doesn't matter a whole lot here - at least from the vantage point of ZFS. The LUNs that you have presented from the filer are probably RAID6 across many disks. ZFS seems to be doing a 4-way stripe (or are you mirroring or using raidz?). Assuming you are doing something like a 7+2 RAID6 on the back end, the contents would be spread across 36 drives.[1] The trick to making this perform well is to have 36 * N worker threads. Mail is a great thing to keep those spindles kinda busy while getting decent performance. A small number of sequential readers - particularly with small files, where you can't do a reasonable job with read-ahead - has little chance of keeping that number of drives busy.

1. Or you might have 4 LUNs presented from one 4+1 RAID5, in which case you may be forcing more head movement because ZFS thinks it can speed things up by striping data across the LUNs.

ZFS can recognize a database (or other application) doing a sequential read on a large file. While data located sequentially on disk can be helpful for reads, this is much less important when the pool sits across tens of disks. This is because it has the ability to spread the iops across lots of disks, potentially reading a heavily fragmented file much faster than a purely sequential file.

In either case, your backup application is competing for iops (and seeks) with other workload.
With the NetApp backend there are likely other applications on the same aggregate that are forcing head movement away from any data belonging to these LUNs.

> And in the back of my mind I'm also thinking that you have to
> rebuild/repair the database once in a while to improve performance.

Certainly. Databases become fragmented and are reorganized to fix this.

> And in my case, since the filesystem is the database, I want to do
> that to zfs!
>
> At least that's what I'm thinking. However, and I always come back to
> this, I'm not certain what is causing my problem. I need certainty
> before taking action on the production system.

Most databases are written in such a way that they can be optimized for sequential reads (table scans) and for backups, whether on raw disk or on a file system. The more advanced the database is, the more likely it is to ask the file system to get out of its way and *not* do anything fancy.

It seems that Cyrus was optimized for operations that make sense for a mail program (deliver messages, retrieve messages, delete messages) and nothing else. I would argue that any application that creates lots of tiny files is not optimized for backing up using a small number of streams.
-- 
Mike Gerdts
http://mgerdts.blogspot.com/
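Mike's "36 * N worker threads" arithmetic can be checked in a few lines (the per-disk iops figure is an assumption for a 15K FC drive, not a measurement from this system):

```python
# Back-of-the-envelope spindle math for the layout Mike hypothesizes.
luns = 4
drives_per_group = 7 + 2            # assumed 7+2 RAID6 group per LUN
spindles = luns * drives_per_group  # drives behind the pool

per_disk_iops = 150                 # assumed figure for a 15K FC drive
aggregate_iops = spindles * per_disk_iops

single_stream_iops = 100            # one latency-bound reader at ~10 ms/op
print(spindles)                                # 36
print(aggregate_iops)                          # 5400
print(aggregate_iops // single_stream_iops)    # ~54 streams to saturate
```

So one serial backup stream uses roughly 2% of the theoretical iops budget of the spindles, which is why a small number of sequential readers "has little chance of keeping that number of drives busy."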
On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:
> An SSD slog backed by a SAS 15K JBOD array should perform much better
> than a big iSCSI LUN.

Now... yes. We implemented this pool years ago. I believe, back then, the server would crash if a zfs drive failed. We decided to let the NetApp handle the disk redundancy. It's worked out well.

I've looked at those really nice Sun products adoringly. And a 7000-series appliance would also be a nice addition to our central NFS service, not to mention more cost-effective than expanding our Network Appliance. (We have researchers who are quite hungry for storage, and NFS is always our first choice.)

We now have quite an investment in the current implementation. It's difficult to move away from. The NetApp is quite a reliable product. We are quite happy with zfs and our implementation. We just need to address our backup performance and improve it just a little bit!

We were almost lynched this spring because we encountered some pretty severe zfs bugs. We are still running the IDR named "A wad of ZFS bug fixes for Solaris 10 Update 6". It took over a month to resolve the issues. I work at a university, and final exams and year end occur at the same time. I don't recommend having email problems during this time! People are intolerant of email problems.

I live in hope that a NetApp OS update, or a Solaris patch, or a zfs patch, or an iSCSI patch, or something will come along that improves our performance just a bit so our backup people get off my back!
-- 
Ed
On Sat, 2009-08-08 at 16:09, Mike Gerdts wrote:

> Right... but ZFS doesn't understand your application. The reason that
> a file system would put files that are in the same directory in the
> same general area on a disk is to minimize seek time. I would argue
> that seek time doesn't matter a whole lot here - at least from the
> vantage point of ZFS. The LUNs that you have presented from the filer
> are probably RAID6 across many disks.

Yes. RAID-DP. 16 drive arrays. 42 drives in total (one hot spare).

> ZFS seems to be doing a 4 way
> stripe (or are you mirroring or raidz?).

Here's the pool (no zfs raid):

  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        space                                    ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A644A74d0  ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A696579d0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F6B385Ad0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F664E4Fd0  ONLINE       0     0     0

errors: No known data errors

> Assuming you are doing
> something like a 7+2 RAID6 on the back end, the contents would be
> spread across 36 drives.[1] The trick to making this perform well is
> to have 36 * N worker threads. Mail is a great thing to keep those
> spindles kinda busy while getting decent performance. A small number
> of sequential readers - particularly with small files where you can't
> do a reasonable job with read-ahead - has little chance of keeping
> that number of drives busy.

The server is also a Sun T2000 (sun4v).

> 1. Or you might have 4 LUNs presented from one 4+1 RAID5 in which you
> may be forcing more head movement because ZFS thinks it can speed
> things up by striping data across the LUNs.
>
> ZFS can recognize a database (or other application) doing a sequential
> read on a large file. While data located sequentially on disk can be
> helpful for reads, this is much less important when the pool sits
> across tens of disks.
> This is because it has the ability to spread
> the iops across lots of disks, potentially reading a heavily
> fragmented file much faster than a purely sequential file.
>
> In either case, your backup application is competing for iops (and
> seeks) with other workload. With the NetApp backend there are likely
> other applications on the same aggregate that are forcing head
> movement away from any data belonging to these LUNs.

Email makes up about 98% of our IP SAN. There are only a couple of other apps on it that require block storage. We run "reallocate" jobs nightly to ensure the luns stay sequential within the netapp storage pool (aggregate) because of its COW filesystem.

> > And in the back of my mind I'm also thinking that you have to
> > rebuild/repair the database once in a while to improve performance.
>
> Certainly. Databases become fragmented and are reorganized to fix this.
>
> > And in my case, since the filesystem is the database, I want to do that
> > to zfs!
> >
> > At least that's what I'm thinking, however, and I always come back to
> > this, I'm not certain what is causing my problem. I need certainty
> > before taking action on the production system.
>
> Most databases are written in such a way that they can be optimized
> for sequential reads (table scans) and for backups, whether on raw
> disk or on a file system. The more advanced the database is, the more
> likely it is to ask the file system to get out of its way and *not* do
> anything fancy.
>
> It seems that cyrus was optimized for operations that make sense for a
> mail program (deliver messages, retrieve messages, delete messages)
> and nothing else. I would argue that any application that creates
> lots of tiny files is not optimized for backing up using a small
> number of streams.

Oh yes. Lots of small files is the backup nightmare.

-- Ed
On Sat, 2009-08-08 at 15:05, Mike Gerdts wrote:

> On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:
> >
> > On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> >> Many of us here already tested our own systems and found that under
> >> some conditions ZFS was offering up only 30MB/second for bulk data
> >> reads regardless of how exotic our storage pool and hardware was.
> >
> > Just so we are using the same units of measurement: backup/copy
> > throughput on our development mail server is 8.5MB/sec. The people
> > running our backups would be overjoyed with that performance.
> >
> > However, backup/copy throughput on our production mail server is 2.25
> > MB/sec.
> >
> > The underlying disk is 15000 RPM 146GB FC drives.
> > Our performance may be hampered somewhat because the luns are on a
> > Network Appliance accessed via iSCSI, but not to the extent that we are
> > seeing, and it does not account for the throughput difference between
> > the development and production pools.
>
> NetApp filers run WAFL - Write Anywhere File Layout. Even if ZFS
> arranged everything perfectly (however that is defined), WAFL would
> undo its hard work.
>
> Since you are using iSCSI, I assume that you have disabled the Nagle
> algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat. If not,
> go do that now.

We've tried many different iscsi parameter changes on our development server: jumbo frames, disabling Nagle. I'll double check next week on tcp_xmit_hiwat and tcp_recv_hiwat. Nothing has made any real difference. We are only using about 5% of the bandwidth on our IP SAN. We use two cisco ethernet switches on the IP SAN. The iscsi initiators use MPXIO in a round robin configuration.

> > When I talk about fragmentation it's not in the normal sense. I'm not
> > talking about blocks in a file not being sequential. I'm talking about
> > files in a single directory that end up spread across the entire
> > filesystem/pool.
> It's tempting to think that if the files were in roughly the same area
> of the block device that ZFS sees, reading the files sequentially
> would at least trigger a read-ahead at the filer. I suspect that even
> a moderate amount of file creation and deletion would cause the I/O
> pattern to be random enough (not purely sequential) that the back-end
> storage would not have a reasonable chance of recognizing it as a good
> time for read-ahead. Further, the backup application is
> probably in a loop of:
>
>     while there are more files in the directory
>         if next file mtime > last backup time
>             open file
>             read file contents, send to backup stream
>             close file
>         end if
>     end while
>
> In other words, other I/O operations are interspersed between the
> sequential data reads, some files are likely to be skipped, and there
> is latency introduced by writing to the data stream. I would be
> surprised to see any file system do intelligent read-ahead here. In
> other words, lots of small file operations make backups and especially
> restores go slowly. More backup and restore streams will almost
> certainly help. Multiplex the streams so that you can keep your tapes
> moving at a constant speed.

We backup to disk first and then put to tape later.

> Do you have statistics on network utilization to ensure that you
> aren't stressing it?
>
> Have you looked at iostat data to be sure that you are seeing asvc_t +
> wsvc_t that supports the number of operations that you need to
> perform? That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a
> workload that waits for the completion of one I/O before issuing the
> next will max out at 100 iops. Presumably ZFS should hide some of
> this from you[1], but it does suggest that each backup stream would be
> limited to about 100 files per second[2]. This is because the read
> request for one file does not happen before the close of the previous
> file[3].
> Since cyrus stores each message as a separate file, this
> suggests that 2.5 MB/s corresponds to an average mail message size of
> 25 KB.
>
> 1. via metadata caching, read-ahead on file data reads, etc.
> 2. Assuming wsvc_t + asvc_t = 10 ms
> 3. Assuming that networker is about as smart as tar, zip, cpio, etc.

There is a backup of a single filesystem in the pool going on right now:

# zpool iostat 5 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.05T   965G     97     69  5.24M  2.71M
space       1.05T   965G    113     10  6.41M   996K
space       1.05T   965G    100    112  2.87M  1.81M
space       1.05T   965G    112      8  2.35M  35.9K
space       1.05T   965G    106      3  1.76M  55.1K

Here are examples: iostat -xpn 5 5

                    extended device statistics
   r/s   w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  17.1  29.2   746.7   317.1  0.0  0.6    0.0   12.5   0  27 c4t60A98000433469764E4A2D456A644A74d0
  25.0  11.9   991.9   277.0  0.0  0.6    0.0   16.1   0  36 c4t60A98000433469764E4A2D456A696579d0
  14.9  17.9   423.0   406.4  0.0  0.3    0.0   10.2   0  21 c4t60A98000433469764E4A476D2F664E4Fd0
  20.8  17.4   588.9   361.2  0.0  0.4    0.0   11.5   0  30 c4t60A98000433469764E4A476D2F6B385Ad0

and:

   r/s   w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  11.9  43.0   528.9  1972.8  0.0  2.1    0.0   38.9   0  31 c4t60A98000433469764E4A2D456A644A74d0
  17.0  19.6   496.9  1499.0  0.0  1.4    0.0   38.8   0  39 c4t60A98000433469764E4A2D456A696579d0
  14.0  30.0   670.2  1971.3  0.0  1.7    0.0   38.0   0  34 c4t60A98000433469764E4A476D2F664E4Fd0
  19.7  28.7   985.2  1647.6  0.0  1.6    0.0   32.5   0  37 c4t60A98000433469764E4A476D2F6B385Ad0

and:

   r/s   w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  22.7  41.3   973.7   423.5  0.0  0.8    0.0   11.8   0  34 c4t60A98000433469764E4A2D456A644A74d0
  27.9  20.0  1474.7   344.0  0.0  0.8    0.0   16.7   0  42 c4t60A98000433469764E4A2D456A696579d0
  15.1  17.9  1318.7   463.7  0.0  0.6    0.0   17.7   0  19 c4t60A98000433469764E4A476D2F664E4Fd0
  22.3  19.5  1801.7   406.7  0.0  0.8    0.0   20.0   0  29 c4t60A98000433469764E4A476D2F6B385Ad0

> > My problem right now is diagnosing the performance issues.
> > I can't
> > address them without understanding the underlying cause. There is a
> > lack of tools to help in this area. There is also a lack of acceptance
> > that I'm actually having a problem with zfs. It's frustrating.
>
> This is a prime example of why Sun needs to sell Analytics[4][5] as an
> add-on to Solaris in general. This problem is just as hard to figure
> out on Solaris as it is on Linux, Windows, etc. If Analytics were
> bundled with Gold and above support contracts, it would be a very
> compelling reason to shell out a few extra bucks for a better support
> contract.
>
> 4. http://blogs.sun.com/bmc/resource/cec_analytics.pdf
> 5. http://blogs.sun.com/brendan/category/Fishworks

Oh definitely! It will also give me the opportunity to yell at my drives! Might help to relieve some stress. http://sunbeltblog.blogspot.com/2009/01/yelling-at-your-hard-drive.html

> > Anyone know how to significantly increase the performance of a zfs
> > filesystem without causing any downtime to an Enterprise email system
> > used by 30,000 intolerant people, when you don't really know what is
> > causing the performance issues in the first place? (Yeah, it sucks to
> > be me!)
>
> Hopefully I've helped find a couple places to look...

Thanx
-- Ed
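The per-file loop Mike described earlier (walk the directory, back up only files modified since the last run, one open/read/close at a time) can be sketched as a runnable shell script. The source tree and timestamp file below are throwaway stand-ins, not anything from the thread; the point is that each file completes before the next begins, so per-file latency, not bandwidth, bounds the stream.

```shell
# Hypothetical stand-ins for a mail spool and the last-backup marker.
SRC=$(mktemp -d)
LAST_BACKUP=$(mktemp)
echo "old" > "$SRC/old.msg"
sleep 1
touch "$LAST_BACKUP"          # pretend a backup ran here
sleep 1
echo "new" > "$SRC/new.msg"   # arrives after the last backup

# Only files newer than the marker are streamed; each file is opened,
# read, and closed before the next one starts.
count=0
for f in "$SRC"/*; do
    if [ "$f" -nt "$LAST_BACKUP" ]; then
        cat "$f" > /dev/null  # stand-in for "send to backup stream"
        count=$((count + 1))
    fi
done
echo "$count"                 # number of files picked up this run
```

Running it prints 1: only new.msg postdates the marker, which is exactly the skip-and-stream pattern that leaves the disks idle between small files.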
On Sat, 2009-08-08 at 17:25, Mike Gerdts wrote:

> ndd -get /dev/tcp tcp_xmit_hiwat
> ndd -get /dev/tcp tcp_recv_hiwat
> grep tcp-nodelay /kernel/drv/iscsi.conf

# ndd -get /dev/tcp tcp_xmit_hiwat
2097152
# ndd -get /dev/tcp tcp_recv_hiwat
2097152
# grep tcp-nodelay /kernel/drv/iscsi.conf
#

> While backups are running (which is probably all the time given the
> backup rate....)
>
> # look at service times
> iostat -xzn 10

Oh crap. Looks like there are no backup jobs running right now. It must have just ended.

> # is networker cpu bound?

No. The server is barely tasked by either the email system or networker.

> prstat -mL
> Some indication of how many backup jobs run concurrently would
> probably help frame any future discussion.

I'll get more info on the backups next week when the full backups run.

-- Ed
On Aug 8, 2009, at 5:02 AM, Ed Spencer wrote:

> On Fri, 2009-08-07 at 19:33, Richard Elling wrote:
>
>> This is very unlikely to be a "fragmentation problem." It is a
>> scalability problem and there may be something you can do about it
>> in the short term.
>
> You could be right.
>
> Our test mail server consists of the exact same design, same hardware
> (SUN4V) but in a smaller configuration (less memory and 4 x 25g san
> luns), and has a backup/copy throughput of 30GB/hour. Data used for
> testing was "copied" from our production mail server.
>
>>> Adding another pool and copying all/some data over to it would only
>>> be a short term solution.
>>
>> I'll have to disagree.
>
> What is the point of a filesystem that can grow to such a huge size
> and not have functionality built in to optimize data layout? Real
> world implementations of filesystems that are intended to live for
> years/decades need this functionality, don't they?
>
> Our mail system works well, only the backup doesn't perform well.
> All the features of ZFS that make reads perform well (prefetch, ARC)
> have little effect.

The best workload is one that doesn't read from disk to begin with :-)

For workloads with millions of files (eg large-scale mail servers) you will need to increase the size of the Directory Name Lookup Cache (DNLC). By default, it is way too small for such workloads. If the directory names are in cache, then they do not have to be read from disk -- a big win.

You can see how well the DNLC is working by looking at the output of "vmstat -s" and looking for the "total name lookups." You can size the DNLC by tuning the ncsize parameter, but it requires a reboot. See the Solaris Tunable Parameters Guide for details. http://docs.sun.com/app/docs/doc/817-0404/chapter2-35?a=view

I'd like to revisit the backup problem, but that is much more complicated and probably won't fit in a mail thread very easily (hence, the white paper :-)

-- richard
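Richard's "vmstat -s" check can be scripted. The sample line below is fabricated for illustration (the real numbers come from a live Solaris box, where you would pipe `vmstat -s` in instead); the awk just pulls the cache-hit percentage out of the "total name lookups" line.

```shell
# Fabricated sample in the shape "vmstat -s" prints on Solaris.
sample='  9123456 total name lookups (cache hits 92%)'

# Extract the percentage from the "(cache hits NN%)" tail.
hits=$(printf '%s\n' "$sample" | awk '/total name lookups/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /%/) { gsub(/[^0-9]/, "", $i); print $i }
}')
echo "DNLC hit rate: ${hits}%"
```

On a mail server with millions of files, a low hit rate here is the signal Richard describes: directory names falling out of the DNLC and being re-read from disk, which ncsize tuning addresses.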
On Sat, 8 Aug 2009, Ed Spencer wrote:> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 11.9 43.0 528.9 1972.8 0.0 2.1 0.0 38.9 0 31 > c4t60A98000433469764E4A2D456A644A74d0 > 17.0 19.6 496.9 1499.0 0.0 1.4 0.0 38.8 0 39 > c4t60A98000433469764E4A2D456A696579d0 > 14.0 30.0 670.2 1971.3 0.0 1.7 0.0 38.0 0 34 > c4t60A98000433469764E4A476D2F664E4Fd0 > 19.7 28.7 985.2 1647.6 0.0 1.6 0.0 32.5 0 37 > c4t60A98000433469764E4A476D2F6B385Ad0I have this in my /etc/system file: * Set device I/O maximum concurrency * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29 set zfs:zfs_vdev_max_pending = 5 This parameter may be worthwhile to look at to reduce your asvc_t. It seems that the default (35) is tuned for a true JBOD setup and not a SAN-hosted LUN. As I recall, you can use the kernel debugger to set it while the system is running and immediately see differences in iostat output. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
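Bob's kernel-debugger route looks roughly like the following (Solaris only, run as root). This is a sketch of a tuning action, not something to paste blindly: `mdb -kw` writes into a running kernel, so verify against the ZFS Evil Tuning Guide first.

```
# Check the current value (/D prints it in decimal):
echo zfs_vdev_max_pending/D | mdb -k

# Set it to 5 on the running system (0t5 is decimal 5 in mdb notation),
# then watch asvc_t in "iostat -xzn" for the effect:
echo zfs_vdev_max_pending/W0t5 | mdb -kw
```

The `set zfs:zfs_vdev_max_pending = 5` line Bob shows in /etc/system makes the same change persistent across reboots.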
On Sat, Aug 8, 2009 at 20:20, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:

> On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:
>
>> Your scalability problem may be in your backup solution.
>
> We've eliminated the backup system as being involved with the
> performance issues.
>
> The servers are Solaris 10 with the OS on UFS filesystems. (In zfs
> terms, the pool is old/mature). Solaris has been patched to a fairly
> current level.
>
> Copying data from the zfs filesystem to the local ufs filesystem enjoys
> the same throughput as the backup system.
>
> The test was simple. Create a test filesystem on the zfs pool. Restore
> production email data to it. Reboot the server. Back up the data (29
> minutes for 15.8 gig of data). Reboot the server. Copy data from zfs
> to ufs using a 'cp -pr ...' command, which also took 29 minutes.

Yes, that was expected. What happens if you run two cp -pr at the same time? I am guessing that two cp will take almost the same time as one.

If you get twice the performance from two cp then you will get twice the performance from doing two backups in parallel.
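Mattias's experiment (time one stream, then two concurrent streams, and compare) can be scripted. The directories below are tiny throwaway stand-ins for the real filesystems; on real data, a latency-bound pool should take about as long for the two-stream run as for the one-stream run, i.e. aggregate throughput roughly doubles.

```shell
# Throwaway source trees and destination (stand-ins for real data).
SRC1=$(mktemp -d); SRC2=$(mktemp -d); DST=$(mktemp -d)
echo a > "$SRC1/f"; echo b > "$SRC2/f"

# One stream.
time cp -pr "$SRC1" "$DST/one"

# Two concurrent streams; "wait" blocks until both cp jobs finish, so
# the reported time covers the slower of the two.
time sh -c "cp -pr '$SRC1' '$DST/two-a' & cp -pr '$SRC2' '$DST/two-b' & wait"
```

If the two-stream wall-clock time is close to the one-stream time, adding backup streams is the cheap win; if it doubles, the bottleneck is bandwidth, not per-file latency.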
I've come up with a better name for the concept of file and directory fragmentation, which is "Filesystem Entropy": where, over time, an active and volatile filesystem moves from an organized state to a disorganized state, resulting in backup difficulties.

Here are some stats which illustrate the issue:

First the development mail server:
=================================
(Jumbo frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to 2097152)

Small file workload (copy from zfs on iscsi network to local ufs filesystem)

# zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      3      0   247K  59.7K
space       70.5G  29.0G    136      0  8.37M      0
space       70.5G  29.0G    115      0  6.31M      0
space       70.5G  29.0G    108      0  7.08M      0
space       70.5G  29.0G    105      0  3.72M      0
space       70.5G  29.0G    135      0  3.74M      0
space       70.5G  29.0G    155      0  6.09M      0
space       70.5G  29.0G    193      0  4.85M      0
space       70.5G  29.0G    142      0  5.73M      0
space       70.5G  29.0G    159      0  7.87M      0

Large file workload (cd and dvd iso's)

# zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      3      0   224K  59.8K
space       70.5G  29.0G    462      0  57.8M      0
space       70.5G  29.0G    427      0  53.5M      0
space       70.5G  29.0G    406      0  50.8M      0
space       70.5G  29.0G    430      0  53.8M      0
space       70.5G  29.0G    382      0  47.9M      0

The production mail server:
==========================
Mail system is running with 790 imap users logged in (low imap work load). Two backup streams are running. Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set to 2097152 - we've never seen any effect from changing the iscsi transport parameters under this small file workload.
# zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.06T   955G     96     69  5.20M  2.69M
space       1.06T   955G    175    105  8.96M  2.22M
space       1.06T   955G    182     16  4.47M   546K
space       1.06T   955G    170     16  4.82M  1.85M
space       1.06T   955G    145    159  4.23M  3.19M
space       1.06T   955G    138     15  4.97M  92.7K
space       1.06T   955G    134     15  3.82M  1.71M
space       1.06T   955G    109    123  3.07M  3.08M
space       1.06T   955G    106     11  3.07M  1.34M
space       1.06T   955G    120     17  3.69M  1.74M

# prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 12438 root      12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
 27399 cyrus     15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
 20230 root     3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
 25913 root     0.5 3.3 0.0 0.0 0.0 0.0  96 0.0  22   2  1K   0 prstat/1
 20495 cyrus    1.1 0.2 0.0 0.0 0.5 0.0  98 0.0  14   3 191   0 imapd/1
  1051 cyrus    1.2 0.0 0.0 0.0 0.0 0.0  99 0.0  19   1  80   0 master/1
 24350 cyrus    0.5 0.5 0.0 0.0 1.4 0.0  98 0.0  57   1 484   0 lmtpd/1
 22645 cyrus    0.6 0.3 0.0 0.0 0.0 0.0  99 0.0  53   1 603   0 imapd/1
 24904 cyrus    0.3 0.4 0.0 0.0 0.0 0.0  99 0.0  66   0 863   0 imapd/1
 18139 cyrus    0.3 0.2 0.0 0.0 0.0 0.0  99 0.0  24   0 195   0 imapd/1
 21459 cyrus    0.2 0.3 0.0 0.0 0.0 0.0  99 0.0  54   0 635   0 imapd/1
 24891 cyrus    0.3 0.3 0.0 0.0 0.9 0.0  99 0.0  28   0 259   0 lmtpd/1
   388 root     0.2 0.3 0.0 0.0 0.0 0.0 100 0.0   1   1  48   0 in.routed/1
 21643 cyrus    0.2 0.3 0.0 0.0 0.2 0.0  99 0.0  49   7 540   0 imapd/1
 18684 cyrus    0.2 0.3 0.0 0.0 0.0 0.0 100 0.0  48   1 544   0 imapd/1
 25398 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 466   0 pop3d/1
 23724 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 540   0 imapd/1
 24909 cyrus    0.1 0.2 0.0 0.0 0.2 0.0  99 0.0  25   1 251   0 lmtpd/1
 16317 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  37   1 495   0 imapd/1
 28243 cyrus    0.1 0.3 0.0 0.0 0.0 0.0 100 0.0  32   0 289   0 imapd/1
 20097 cyrus    0.1 0.2 0.0 0.0 0.3 0.0  99 0.0  26   5 253   0 lmtpd/1
Total: 893 processes, 1125 lwps, load averages: 1.14, 1.16, 1.16

-- Ed
At a first glance, your production server's numbers are looking fairly similar to the "small file workload" results of your development server. I thought you were saying that the development server has faster performance?

Alex.

On Tue, Aug 11, 2009 at 1:33 PM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:

> I've come up with a better name for the concept of file and directory
> fragmentation which is, "Filesystem Entropy". Where, over time, an
> active and volatile filesystem moves from an organized state to a
> disorganized state resulting in backup difficulties.
>
> [stats snipped]

-- Ted Turner - "Sports is like a war without the killing." - http://www.brainyquote.com/quotes/authors/t/ted_turner.html
On Tue, Aug 11, 2009 at 7:33 AM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:> I''ve come up with a better name for the concept of file and directory > fragmentation which is, "Filesystem Entropy". Where, over time, an > active and volitile filesystem moves from an organized state to a > disorganized state resulting in backup difficulties. > > Here are some stats which illustrate the issue: > > First the development mail server: > =================================> (Jump frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to > 2097152) > > Small file workload (copy from zfs on iscsi network to local ufs > filesystem) > # zpool iostat 10 > capacity operations bandwidth > pool used avail read write read write > ---------- ----- ----- ----- ----- ----- ----- > space 70.5G 29.0G 3 0 247K 59.7K > space 70.5G 29.0G 136 0 8.37M 0 > space 70.5G 29.0G 115 0 6.31M 0 > space 70.5G 29.0G 108 0 7.08M 0 > space 70.5G 29.0G 105 0 3.72M 0 > space 70.5G 29.0G 135 0 3.74M 0 > space 70.5G 29.0G 155 0 6.09M 0 > space 70.5G 29.0G 193 0 4.85M 0 > space 70.5G 29.0G 142 0 5.73M 0 > space 70.5G 29.0G 159 0 7.87M 0So you are averaging about 6 MB/s on a small file workload. The average read size was about 44 KB. This throughput could be limited by the file creation rate on UFS. Perhaps a better command to use to judge of how fast a single stream can read is "tar cf /dev/null $dir".> Large File workload (cd and dvd iso''s) > # zpool iostat 10 > capacity operations bandwidth > pool used avail read write read write > ---------- ----- ----- ----- ----- ----- ----- > space 70.5G 29.0G 3 0 224K 59.8K > space 70.5G 29.0G 462 0 57.8M 0 > space 70.5G 29.0G 427 0 53.5M 0 > space 70.5G 29.0G 406 0 50.8M 0 > space 70.5G 29.0G 430 0 53.8M 0 > space 70.5G 29.0G 382 0 47.9M 0Here the average throughput was about 53 MB/s, with the average read size at 128 KB. Note that 128 KB is not only the largest block size that ZFS supports, it is also the default value of maxphys. 
Tuning maxphys to 1 MB may give you a performance boost, so long as the files are contiguous. Unless the files were trickled in very slowly with a lot of other IO at the same time, they are probably mostly contiguous. 1 Gbit links, they are at about 25% capacity - good. I assume you have similar load balancing at the NetApp side too. In a previous message you said that this server was seeing better backup throughput than the production server. How does the mixture of large files vs. small files compare on the two systems?> The production mail server: > ==========================> Mail system is running with 790 imap users logged in (low imap work > load). > Two backup streams are running. > Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set > to 2097152 > - we''ve never seen any effect of changing the iscsi transport > parameters > under this small file workload. > > # zpool iostat 10 > capacity operations bandwidth > pool used avail read write read write > ---------- ----- ----- ----- ----- ----- ----- > space 1.06T 955G 96 69 5.20M 2.69M > space 1.06T 955G 175 105 8.96M 2.22M > space 1.06T 955G 182 16 4.47M 546K > space 1.06T 955G 170 16 4.82M 1.85M > space 1.06T 955G 145 159 4.23M 3.19M > space 1.06T 955G 138 15 4.97M 92.7K > space 1.06T 955G 134 15 3.82M 1.71M > space 1.06T 955G 109 123 3.07M 3.08M > space 1.06T 955G 106 11 3.07M 1.34M > space 1.06T 955G 120 17 3.69M 1.74MHere your average read throughput is about 4.6 MB/s with an average read size of 47 KB. That looks a lot like the simulation in the non-production environment. I would guess that the average message size is somewhere in the 40 - 50 KB range.> > # prstat -mL > PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG > PROCESS/LWPID > 12438 root 12 6.9 0.0 0.0 0.0 0.0 81 0.1 508 84 4K 0 save/1 > 27399 cyrus 15 0.5 0.0 0.0 0.0 0.0 85 0.0 18 10 297 0 imapd/1 > 20230 root 3.9 8.0 0.0 0.0 0.0 0.0 88 0.1 393 33 2K 0 save/1[snip] The "save" process is from Networker, right? 
These process do not look CPU bound (less than 20% on CPU). In a previous message you showed iostat data at a time when backups weren''t running. I''ve reproduced below, removing the device column for sake of formatting.> iostat -xpn 5 5 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > > 17.1 29.2 746.7 317.1 0.0 0.6 0.0 12.5 0 27 > 25.0 11.9 991.9 277.0 0.0 0.6 0.0 16.1 0 36 > 14.9 17.9 423.0 406.4 0.0 0.3 0.0 10.2 0 21 > 20.8 17.4 588.9 361.2 0.0 0.4 0.0 11.5 0 30 > > and: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 11.9 43.0 528.9 1972.8 0.0 2.1 0.0 38.9 0 31 > 17.0 19.6 496.9 1499.0 0.0 1.4 0.0 38.8 0 39 > 14.0 30.0 670.2 1971.3 0.0 1.7 0.0 38.0 0 34 > 19.7 28.7 985.2 1647.6 0.0 1.6 0.0 32.5 0 37 > and: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 22.7 41.3 973.7 423.5 0.0 0.8 0.0 11.8 0 34 > 27.9 20.0 1474.7 344.0 0.0 0.8 0.0 16.7 0 42 > 15.1 17.9 1318.7 463.7 0.0 0.6 0.0 17.7 0 19 > 22.3 19.5 1801.7 406.7 0.0 0.8 0.0 20.0 0 29Service times are in the 10 - 39 ms range. In the middle set, it looks like there is some heavier than normal write activity (not more writes, just bigger writes) and this seems to impact asvc_t. Let''s look back at something I said the other day... | Have you looked at iostat data to be sure that you are seeing asvc_t | + wsvc_t that supports the number of operations that you need to | perform? That is if asvc_t + wsvc_t for a device adds up to 10 ms, | a workload that waits for the completion of one I/O before issuing | the next will max out at 100 iops. Presumably ZFS should hide some | of this from you[1], but it does suggest that each backup stream | would be limited to about 100 files per second[2]. This is because | the read request for one file does not happen before the close of | the previous file[3]. Since cyrus stores each message as a separate | file, this suggests that 2.5 MB/s corresponds to average mail | message size of 25 KB. 
It seems reasonable based on the iostat data to say that the typical asvc_t is no better than 15 ms. Since the IO for one file does not start until the previous one completed, we can get no more than: 1000 ms/sec ----------- = 67 sequential operations per second 15 ms/io By "sequential" I mean that one doesn''t start until the other finishes. There is certainly a better word, but it escapes me at the moment. At an average file size of 45 KB, that translates to about 3 MB/sec. As you run two data streams, you are seeing throughput that looks kinda like the 2 * 3 MB/sec. With 4 backup streams do you get something that looks like 4 * 3 MB/s? How does that effect iostat output? -- Mike Gerdts http://mgerdts.blogspot.com/
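Mike's back-of-envelope math can be checked directly. The 15 ms per-file service time and 45 KB average message size are his assumptions from the thread; note that integer division gives 66 files/sec where Mike rounds to about 67.

```shell
svc_ms=15    # assumed per-file round trip (asvc_t), from the thread
avg_kb=45    # assumed average message size, from the thread

# Max files/sec when each file's I/O waits for the previous one to
# complete, and the resulting single-stream throughput in MB/s.
ops=$((1000 / svc_ms))
mbps=$(awk -v o="$ops" -v k="$avg_kb" 'BEGIN { printf "%.1f", o * k / 1024 }')
echo "$ops files/sec, ~$mbps MB/s per backup stream"
```

This prints "66 files/sec, ~2.9 MB/s per backup stream", which matches the roughly 3 MB/s per stream observed on the production pool and shows why adding streams, not bandwidth, is what scales.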
On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:

> At a first glance, your production server's numbers are looking fairly
> similar to the "small file workload" results of your development
> server.
>
> I thought you were saying that the development server has faster
> performance?

The development server was running only one cp -pr command.

The production mail server was running two concurrent backup jobs and of course the mail system, with each job having the same performance throughput as if there were a single job running. The single threaded backup jobs do not conflict with each other over performance. If we ran 20 concurrent backup jobs, overall performance would scale up quite a bit (I would guess between 5 and 10 times the performance). (I just read Mike's post and will do some 'concurrency' testing.)

Users are currently evenly distributed over 5 filesystems (I previously mentioned 7 but it's really 5 filesystems for users and 1 for system data, totalling 6, and one test filesystem). We backup 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on Saturday. We backup to disk and then clone to tape. Our backup people can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is one solution, but it's a major redesign of the mail system. Adding a second server to halve the pool and thereby halve the problem is another solution (and we would also create more filesystems at the same time). Moving the pool to a FC SAN or a JBOD may also increase performance (fewer layers than those introduced by the appliance, thereby increasing performance).

I suspect that if we 'rsync' one of these filesystems to a second server/pool, we would also see a performance increase equal to what we see on the development server. (I don't know how zfs send and receive work, so I don't know if they would address this "Filesystem Entropy" or specifically reorganize the files and directories.)
However, when we created a testfs filesystem in the zfs pool on the production server, and copied data to it, we saw the same performance as the other filesystems, in the same pool. We will have to do something to address the problem. A combination of what I just listed is our probable course of action. (Much testing will have to be done to ensure our solution will address the problem because we are not 100% sure what is the cause of performance degradation). I''m also dealing with Network Appliance to see if there is anything we can do at the filer end to increase performance. But I''m holding out little hope. But please, don''t miss the point I''m trying to make. ZFS would benefit from a utility or a background process that would reorganize files and directories in the pool to optimize performance. A utility to deal with Filesystem Entropy. Currently a zfs pool will live as long as the lifetime of the disks that it is on, without reorganization. This can be a long long time. Not to mention slowly expanding the pool over time contributes to the issue. -- Ed
On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:

> On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:
>> At a first glance, your production server's numbers are looking fairly
>> similar to the "small file workload" results of your development
>> server.
>>
>> I thought you were saying that the development server has faster
>> performance?
>
> The development server was running only one cp -pr command.
>
> The production mail server was running two concurrent backup jobs and of
> course the mail system, with each job having the same performance
> throughput as if there were a single job running. The single-threaded
> backup jobs do not conflict with each other over performance.

Agree.

> If we ran 20 concurrent backup jobs, overall performance would scale up
> quite a bit. (I would guess between 5 and 10 times the performance.) (I
> just read Mike's post and will do some 'concurrency' testing.)

Yes.

> Users are currently evenly distributed over 5 filesystems (I previously
> mentioned 7 but it's really 5 filesystems for users and 1 for system
> data, totalling 6, plus one test filesystem).
>
> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
> Saturday. We back up to disk and then clone to tape. Our backup people
> can only handle doing 2 filesystems per night.
>
> Creating more filesystems to increase the parallelism of our backup is
> one solution but it's a major redesign of the mail system.

Really? I presume this is because of the way you originally allocated accounts to file systems. Creating file systems in ZFS is easy, so could you explain in a new thread?

> Adding a second server to halve the pool and thereby halve the problem is
> another solution (and we would also create more filesystems at the same
> time).

I'm not convinced this is a good idea. It is a lot of work based on the assumption that the server is the bottleneck.

> Moving the pool to an FC SAN or a JBOD may also increase performance.
> (Fewer layers, introduced by the appliance, thereby increasing
> performance.)

Disagree.

> I suspect that if we 'rsync' one of these filesystems to a second
> server/pool that we would also see a performance increase equal to what
> we see on the development server. (I don't know how zfs send and receive
> work so I don't know if they would address this "Filesystem Entropy" or
> specifically reorganize the files and directories.) However, when we
> created a testfs filesystem in the zfs pool on the production server,
> and copied data to it, we saw the same performance as the other
> filesystems, in the same pool.

Directory walkers, like NetBackup or rsync, will not scale well as the number of files increases. It doesn't matter what file system you use; the scalability will look more-or-less similar. For millions of files, ZFS send/receive works much better. More details are in my paper.

> We will have to do something to address the problem. A combination of
> what I just listed is our probable course of action. (Much testing will
> have to be done to ensure our solution will address the problem because
> we are not 100% sure what is the cause of performance degradation.) I'm
> also dealing with Network Appliance to see if there is anything we can
> do at the filer end to increase performance. But I'm holding out little
> hope.

DNLC hit rate? Also, is atime on?

> But please, don't miss the point I'm trying to make. ZFS would benefit
> from a utility or a background process that would reorganize files and
> directories in the pool to optimize performance. A utility to deal with
> Filesystem Entropy. Currently a zfs pool will live as long as the
> lifetime of the disks that it is on, without reorganization. This can be
> a long, long time. Not to mention slowly expanding the pool over time
> contributes to the issue.

This does not come "for free" in either performance or risk. It will do nothing to solve the directory walker's problem.

NB, people who use UFS don't tend to see this because UFS can't handle millions of files.
-- richard
On Tue, August 11, 2009 10:39, Ed Spencer wrote:

> I suspect that if we 'rsync' one of these filesystems to a second
> server/pool that we would also see a performance increase equal to what
> we see on the development server. (I don't know how zfs send and receive

Rsync has to traverse the entire directory tree to stat() every file to see if it's changed (and if it has, it then computes which parts of the file have been updated). Zfs send/recv, however, works at a lower level and doesn't go to each file: it can simply compare which file system blocks have changed.

So you would create a snapshot on the ZFS file system(s) of interest and send it over to wherever you want to replicate it. Later on you would create another snapshot and, with the incremental ("-i") option in zfs(1M), you could then transfer only the blocks of data that were changed since the first snapshot. ZFS will be able to figure out the block differences without having to touch every file.

Two pretty good explanations at:

http://www.markround.com/archives/38-ZFS-Replication.html
http://www.cuddletech.com/blog/pivot/entry.php?id=984

> work so I don't know if it would address this "Filesystem Entropy" or
> specifically reorganize the files and directories). However, when we
> created a testfs filesystem in the zfs pool on the production server,
> and copied data to it, we saw the same performance as the other
> filesystems, in the same pool.

Not surprising, since any file systems on any particular pool would be using the same spindles. If you want different I/O characteristics you'd need a different pool with different spindles.

> We will have to do something to address the problem. A combination of
> what I just listed is our probable course of action. (Much testing will
> have to be done to ensure our solution will address the problem because
> we are not 100% sure what is the cause of performance degradation.) I'm

Don't forget about the DTrace Toolkit, as it has many handy scripts for digging into various performance characteristics:

http://www.brendangregg.com/dtrace.html

> But please, don't miss the point I'm trying to make. ZFS would benefit
> from a utility or a background process that would reorganize files and
> directories in the pool to optimize performance. A utility to deal with

If you have a Sun support contract, call them up and ask for this enhancement. If there are enough people asking for it, the ZFS team will add it. Talking on the list is one thing, but if there's no "official" paper trail in Sun's database, it won't get the attention it may deserve.
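[Editorial note: the snapshot/incremental procedure described above can be sketched as follows. The pool, filesystem, and host names are placeholders, not details from the thread.]

```shell
# Initial replication: snapshot the filesystem and send the full stream.
# (space/mail1 and backuphost are hypothetical names for illustration.)
zfs snapshot space/mail1@mon
zfs send space/mail1@mon | ssh backuphost zfs receive -F backup/mail1

# Later: take a second snapshot and send only the blocks that changed
# between the two snapshots -- no per-file stat() walk is needed.
zfs snapshot space/mail1@tue
zfs send -i space/mail1@mon space/mail1@tue | \
    ssh backuphost zfs receive backup/mail1
```

The -F on the first receive forces the target to roll back to its most recent snapshot before applying the stream; see zfs(1M) for the details.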
On Tue, 2009-08-11 at 08:04 -0700, Richard Elling wrote:
> On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
>> I suspect that if we 'rsync' one of these filesystems to a second
>> server/pool that we would also see a performance increase equal to what
>> we see on the development server. (I don't know how zfs send and receive
>> work so I don't know if they would address this "Filesystem Entropy" or
>> specifically reorganize the files and directories). However, when we
>> created a testfs filesystem in the zfs pool on the production server,
>> and copied data to it, we saw the same performance as the other
>> filesystems, in the same pool.
>
> Directory walkers, like NetBackup or rsync, will not scale well as
> the number of files increases. It doesn't matter what file system you
> use; the scalability will look more-or-less similar. For millions of
> files, ZFS send/receive works much better. More details are in my paper.

Is there a link to this paper available?

-- 
Louis-Frédéric Feuillette <jebnor at gmail.com>
Richard Elling wrote:
> On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
>
>> On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:
>>> At a first glance, your production server's numbers are looking fairly
>>> similar to the "small file workload" results of your development
>>> server.
>>>
>>> I thought you were saying that the development server has faster
>>> performance?
>>
>> The development server was running only one cp -pr command.
>>
>> The production mail server was running two concurrent backup jobs and of
>> course the mail system, with each job having the same performance
>> throughput as if there were a single job running. The single-threaded
>> backup jobs do not conflict with each other over performance.
>
> Agree.
>
>> If we ran 20 concurrent backup jobs, overall performance would scale up
>> quite a bit. (I would guess between 5 and 10 times the performance.) (I
>> just read Mike's post and will do some 'concurrency' testing.)
>
> Yes.
>
>> Users are currently evenly distributed over 5 filesystems (I previously
>> mentioned 7 but it's really 5 filesystems for users and 1 for system
>> data, totalling 6, plus one test filesystem).
>>
>> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
>> Saturday. We back up to disk and then clone to tape. Our backup people
>> can only handle doing 2 filesystems per night.
>>
>> Creating more filesystems to increase the parallelism of our backup is
>> one solution but it's a major redesign of the mail system.
>
> Really? I presume this is because of the way you originally
> allocated accounts to file systems. Creating file systems in ZFS is
> easy, so could you explain in a new thread?

Ed,

This would be a good idea. This issue has been discussed many times on the iMS mailing list for the Sun Messaging Server, which, as far as the way it stores messages on disk goes, is very similar to Cyrus (in fact I think it was once based on the same code base).

The upshot of it is what has been explained by Mike: these types of store create millions of little files that NetBackup or Legato must walk over and back up one after another, sequentially. This does not scale very well at all, for the reasons explained by Mike.

The issue commonly discussed on the iMS list has been one of file system size. In general, the rule of thumb most people had for this was around 100 to 250 GB per file system, and lots of them, mostly to increase the parallelism in the backup process rather than for performance gains in the actual functioning of the application.

As a rule of thumb, I group my large users, who have large mailboxes (which in turn have lots of large attachments), into particular larger file systems; students, who have small quotas and generally lots of small messages (small files, in this case), go into other smaller file systems. In this case, one size really does not fit all.

To keep backups within the time allocation, a bit of filesystem monitoring is useful. In the days of UFS I used to use a command like this to help make decisions:

[root at xxx]#> df -F ufs -o i
Filesystem             iused     ifree  %iused  Mounted on
/dev/md/dsk/d0        605765   6674235      8%  /
/dev/md/dsk/d50      2387509  28198091      8%  /mail1
/dev/md/dsk/d70      2090768  30669232      6%  /mail3
/dev/md/dsk/d60      2447548  30312452      7%  /mail2

I used this to balance the inodes. My guess is that around 85-90% of the inodes in a messaging server store are files, with the remainder directories. Either way, it is a simple way to make sure the stores are reasonably balanced. I am sure there will be a good way to do this for ZFS?

>> Adding a second server to halve the pool and thereby halve the problem is
>> another solution (and we would also create more filesystems at the same
>> time).

It can be a good idea, but it really depends on how many file systems you split your message stores into. Also good for relocating message stores to if the first server fails.
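[Editorial note: the df -o i balance check above can be scripted; the awk below computes the exact inode utilisation per store from the sample figures quoted in this thread. For ZFS, where inodes are allocated dynamically, the equivalent balance check is usually done on space with zfs list instead.]

```shell
# Compute % of inodes used per filesystem from `df -o i` style output;
# the here-document reuses the sample figures quoted above
awk 'NR > 1 {
    pct = 100 * $2 / ($2 + $3)          # iused / (iused + ifree)
    printf "%-8s %5.1f%% inodes used\n", $5, pct
}' <<'EOF'
Filesystem iused ifree %iused Mounted on
/dev/md/dsk/d50 2387509 28198091 8% /mail1
/dev/md/dsk/d70 2090768 30669232 6% /mail3
/dev/md/dsk/d60 2447548 30312452 7% /mail2
EOF
```

This reports /mail1 at 7.8%, /mail3 at 6.4%, and /mail2 at 7.5%, i.e. reasonably balanced stores.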
This of course depends on your message store architecture. Easy to do with Sun Messaging; not so sure about Cyrus. But I did once run a Simeon message server for a university in London, and that was based on Cyrus and was pretty similar from recollection.

> I'm not convinced this is a good idea. It is a lot of work based on
> the assumption that the server is the bottleneck.
>
>> Moving the pool to an FC SAN or a JBOD may also increase performance.
>> (Fewer layers, introduced by the appliance, thereby increasing
>> performance.)
>
> Disagree.
>
>> I suspect that if we 'rsync' one of these filesystems to a second
>> server/pool that we would also see a performance increase equal to what
>> we see on the development server. (I don't know how zfs send and receive
>> work so I don't know if they would address this "Filesystem Entropy" or
>> specifically reorganize the files and directories.) However, when we
>> created a testfs filesystem in the zfs pool on the production server,
>> and copied data to it, we saw the same performance as the other
>> filesystems, in the same pool.
>
> Directory walkers, like NetBackup or rsync, will not scale well as
> the number of files increases. It doesn't matter what file system you
> use; the scalability will look more-or-less similar. For millions of
> files, ZFS send/receive works much better. More details are in my paper.

I look forward to reading this, Richard. I think it will be an interesting read for members of this list.

>> We will have to do something to address the problem. A combination of
>> what I just listed is our probable course of action. (Much testing will
>> have to be done to ensure our solution will address the problem because
>> we are not 100% sure what is the cause of performance degradation.) I'm
>> also dealing with Network Appliance to see if there is anything we can
>> do at the filer end to increase performance. But I'm holding out little
>> hope.
>
> DNLC hit rate?
> Also, is atime on?

Turning atime off may make a big difference for you. It certainly does for Sun Messaging Server. Maybe worth doing and reposting results?

>> But please, don't miss the point I'm trying to make. ZFS would benefit
>> from a utility or a background process that would reorganize files and
>> directories in the pool to optimize performance. A utility to deal with
>> Filesystem Entropy. Currently a zfs pool will live as long as the
>> lifetime of the disks that it is on, without reorganization. This can be
>> a long, long time. Not to mention slowly expanding the pool over time
>> contributes to the issue.
>
> This does not come "for free" in either performance or risk. It will
> do nothing to solve the directory walker's problem.

Agree. It will have little bearing on the outcome, for the reasons you mention.

> NB, people who use UFS don't tend to see this because UFS can't
> handle millions of files.

It can, but only if you have file systems smaller than about 1 TB, which is not large by ZFS standards. They do work, but with the same performance issue for directory-walker backups. Heaven help you in fsck'ing them after a system crash. Hours and hours.

> -- richard
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
_______________________________________________________________________

Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone : +64 09 968 7611
Fax : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:scott at manukau.ac.nz
http://www.manukau.ac.nz
________________________________________________________________________

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
________________________________________________________________________
Concurrency/Parallelism testing.

I have 6 different filesystems populated with email data on our mail development server. I rebooted the server before beginning the tests. The server is a T2000 (sun4v) machine, so it's ideally suited for this type of testing.

The test was to tar (to /dev/null) each of the filesystems: launch 1, gather stats, launch another, gather stats, etc.

The underlying storage system is a Network Appliance. Our only one. In production. Serving NFS, CIFS and iSCSI. Other work the appliance is doing may affect these tests, and vice versa :) . No one seemed to notice I was running these tests.

After 6 concurrent tar's are running we are probably seeing benefits of the ARC. At certain points I included load averages and traffic stats for each of the iSCSI ethernet interfaces that are configured with MPxIO.

After the first 6 jobs, I launched duplicates of the 6. Then another 6, etc. At the end I included the zfs kernel statistics.

1 job
=====
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      0      0      0      0
space       70.5G  29.0G     19      0  1.04M      0
space       70.5G  29.0G    268      0  8.71M      0
space       70.5G  29.0G    196      0  11.3M      0
space       70.5G  29.0G    171      0  11.0M      0
space       70.5G  29.0G    182      0  5.01M      0
space       70.5G  29.0G    273      0  9.71M      0
space       70.5G  29.0G    292      0  8.91M      0
space       70.5G  29.0G    279      0  15.4M      0
space       70.5G  29.0G    219      0  11.3M      0
space       70.5G  29.0G    175      0  8.67M      0

2 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    381      0  23.8M      0
space       70.5G  29.0G    422      0  28.0M      0
space       70.5G  29.0G    386      0  26.5M      0
space       70.5G  29.0G    380      0  22.9M      0
space       70.5G  29.0G    411      0  18.8M      0
space       70.5G  29.0G    393      0  20.7M      0
space       70.5G  29.0G    302      0  15.0M      0
space       70.5G  29.0G    267      0  15.6M      0
space       70.5G  29.0G    304      0  18.7M      0
space       70.5G  29.0G    534      0  19.7M      0
space       70.5G  29.0G    339      0  17.0M      0

3 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    530      0  22.9M      0
space       70.5G  29.0G    428      0  16.3M      0
space       70.5G  29.0G    439      0  16.4M      0
space       70.5G  29.0G    511      0  22.1M      0
space       70.5G  29.0G    464      0  17.9M      0
space       70.5G  29.0G    371      0  12.1M      0
space       70.5G  29.0G    447      0  16.5M      0
space       70.5G  29.0G    379      0  15.5M      0

4 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    434      0  22.0M      0
space       70.5G  29.0G    506      0  29.5M      0
space       70.5G  29.0G    424      0  21.3M      0
space       70.5G  29.0G    643      0  36.0M      0
space       70.5G  29.0G    688      0  31.1M      0
space       70.5G  29.0G    726      0  37.6M      0
space       70.5G  29.0G    652      0  24.8M      0
space       70.5G  29.0G    646      0  33.9M      0

5 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    629      0  31.1M      0
space       70.5G  29.0G    774      0  45.8M      0
space       70.5G  29.0G    815      0  39.8M      0
space       70.5G  29.0G    895      0  44.4M      0
space       70.5G  29.0G    800      0  48.1M      0
space       70.5G  29.0G    857      0  51.8M      0
space       70.5G  29.0G    725      0  47.6M      0

6 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    924      0  58.8M      0
space       70.5G  29.0G    767      0  51.8M      0
space       70.5G  29.0G    862      0  48.4M      0
space       70.5G  29.0G    977      0  43.9M      0
space       70.5G  29.0G    954      0  53.7M      0
space       70.5G  29.0G    903      0  48.3M      0

# uptime
  2:19pm  up 15 min(s),  2 users,  load average: 1.44, 1.10, 0.67
26MB (1 minute average) on each iSCSI ethernet port

12 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    868      0  48.6M      0
space       70.5G  29.0G    903      0  45.3M      0
space       70.5G  29.0G    919      0  52.4M      0
space       70.5G  29.0G  1.20K      0  73.3M      0
space       70.5G  29.0G  1.16K      0  63.3M      0
space       70.5G  29.0G  1.12K      0  71.2M      0
space       70.5G  29.0G  1.29K      0  68.8M      0

# uptime
  2:22pm  up 18 min(s),  2 users,  load average: 1.75, 1.29, 0.80
33MB (1 minute average) on each iSCSI ethernet port

18 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.31K      0  69.3M      0
space       70.5G  29.0G  1.25K      0  74.7M      0
space       70.5G  29.0G  1.23K      0  74.4M      0
space       70.5G  29.0G  1.25K      0  72.1M      0
space       70.5G  29.0G  1.34K      0  75.3M      0
space       70.5G  29.0G  1.31K      0  77.4M      0
space       70.5G  29.0G    892      0  51.8M      0
space       70.5G  29.0G  1.12K      0  69.6M      0

24 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.56K      0  84.5M      0
space       70.5G  29.0G  1.46K      0  86.3M      0
space       70.5G  29.0G  1.43K      0  75.7M      0
space       70.5G  29.0G  1.35K      0  67.6M      0
space       70.5G  29.0G  1.38K      0  72.6M      0
space       70.5G  29.0G  1.14K      0  69.8M      0
space       70.5G  29.0G  1.19K      0  66.4M      0

# uptime
  2:26pm  up 23 min(s),  2 users,  load average: 2.29, 1.89, 1.20
36MB (1 minute average) on each iSCSI ethernet port

30 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.20K      0  63.9M      0
space       70.5G  29.0G  1.76K      0  82.3M      0
space       70.5G  29.0G  1.57K      0  79.8M      0
space       70.5G  29.0G  1.82K      0  96.2M      0
space       70.5G  29.0G  1.81K      0  82.7M      0
space       70.5G  29.0G  1.55K      0  74.9M      0
space       70.5G  29.0G  1.53K      0  77.9M      0
space       70.5G  29.0G  1.50K      0  81.6M      0

# uptime
  2:29pm  up 26 min(s),  2 users,  load average: 2.57, 2.12, 1.39
40MB (1 minute average) on each iSCSI ethernet port

35 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.41K      0  69.7M      0
space       70.5G  29.0G  1.58K      0  83.0M      0
space       70.5G  29.0G  1.31K      0  69.3M      0
space       70.5G  29.0G  1.53K      0  79.5M      0
space       70.5G  29.0G  1.42K      0  73.7M      0
space       70.5G  29.0G  1.45K      0  71.3M      0

# uptime
  2:34pm  up 30 min(s),  2 users,  load average: 2.70, 2.55, 1.79
45MB (1 minute average) on each iSCSI ethernet port

# kstat zfs
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        c                               4294967296
        c_max                           4294967296
        c_min                           536870912
        crtime                          5674386.62393914
        deleted                         1484966
        demand_data_hits                8323333
        demand_data_misses              1391606
        demand_metadata_hits            1320089
        demand_metadata_misses          83372
        evict_skip                      15986
        hash_chain_max                  10
        hash_chains                     47700
        hash_collisions                 1104590
        hash_elements                   166476
        hash_elements_max               188996
        hdr_size                        29907360
        hits                            10033815
        l2_abort_lowmem                 0
        l2_cksum_bad                    0
        l2_evict_lock_retry             0
        l2_evict_reading                0
        l2_feeds                        0
        l2_free_on_write                0
        l2_hdr_size                     0
        l2_hits                         0
        l2_io_error                     0
        l2_misses                       0
        l2_rw_clash                     0
        l2_size                         0
        l2_writes_done                  0
        l2_writes_error                 0
        l2_writes_hdr_miss              0
        l2_writes_sent                  0
        memory_throttle_count           0
        mfu_ghost_hits                  56647
        mfu_hits                        1963736
        misses                          1735570
        mru_ghost_hits                  27411
        mru_hits                        7715952
        mutex_miss                      82794
        p                               1918981120
        prefetch_data_hits              3017
        prefetch_data_misses            225803
        prefetch_metadata_hits          387376
        prefetch_metadata_misses        34789
        recycle_miss                    171217
        size                            3914208576
        snaptime                        5676565.69946945

module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          5674386.6242014
        delegations                     15022
        hits                            38616
        misses                          64786
        snaptime                        5676565.7082284

-- Ed
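[Editorial note: the test procedure Ed describes can be sketched roughly as follows. The filesystem paths and sample counts are illustrative, not taken from the post.]

```shell
# Launch one tar reader per filesystem (paths are illustrative) and
# sample pool throughput between launches, as in the tests above
for fs in /space/fs1 /space/fs2 /space/fs3 /space/fs4 /space/fs5 /space/fs6
do
    tar cf /dev/null "$fs" &     # sequential read load, output discarded
    zpool iostat space 5 10      # ~50 s of 5-second bandwidth samples
done
wait                             # let any still-running tar jobs finish
```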
On Aug 11, 2009, at 1:21 PM, Ed Spencer wrote:

> Concurrency/Parallelism testing.
> I have 6 different filesystems populated with email data on our mail
> development server. I rebooted the server before beginning the tests.
> The server is a T2000 (sun4v) machine, so it's ideally suited for this
> type of testing.
> The test was to tar (to /dev/null) each of the filesystems. Launch 1,
> gather stats, launch another, gather stats, etc.
> The underlying storage system is a Network Appliance. Our only one. In
> production. Serving NFS, CIFS and iSCSI. Other work the appliance is
> doing may affect these tests, and vice versa :) . No one seemed to
> notice I was running these tests.
>
> After 6 concurrent tar's are running we are probably seeing benefits of
> the ARC.
> At certain points I included load averages and traffic stats for each of
> the iSCSI ethernet interfaces that are configured with MPxIO.
>
> After the first 6 jobs, I launched duplicates of the 6. Then another 6,
> etc.

iostat and zpool iostat measure I/O to the disks. fsstat measures I/O to the file system (hence the name ;-). A large discrepancy between the two is another indicator of filesystem caching.

While tar is slightly interesting, I would expect your normal backup workload to show a lot of lookups and attr gets. If these are cached, life will be better.
-- richard
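[Editorial note: Richard's comparison can be run side by side; a minimal sketch, where the pool name, interval, and count are examples rather than details from the thread.]

```shell
# I/O as seen at the filesystem layer vs. I/O actually issued to disk;
# a large gap between the two indicates the ARC is absorbing reads
fsstat zfs 5 12 > fsstat.log &   # per-interval VFS operation counts
zpool iostat space 5 12          # per-interval device-level throughput
wait
```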
On Tue, 2009-08-11 at 14:56, Scott Lawson wrote:
>> Also, is atime on?
> Turning atime off may make a big difference for you. It certainly does
> for Sun Messaging Server.
> Maybe worth doing and reposting results?

Yes. All these results were attained with atime=off. We made that change on all the filesystems this spring.

I'd like to thank everyone who took part in this thread. It's helped us quite a bit. I'll be re-reading this thread a few times to glean additional recommendations I missed the first time, not to mention doing some additional testing. We shall also look at tuning the DNLC on our production server.

I also did some stress testing using tar and large files and saw a sustained read rate of between 100MB and 120MB per second running 5 concurrent tar's.

-- Ed
On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:
> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
> Saturday. We back up to disk and then clone to tape. Our backup people
> can only handle doing 2 filesystems per night.
>
> Creating more filesystems to increase the parallelism of our backup is
> one solution but it's a major redesign of the mail system.

What is magical about a 1:1 mapping of backup job to file system? According to the Networker manual[1], a save set in Networker can be configured to back up certain directories. According to some random documentation about Cyrus[2], mailboxes fall under a pretty predictable hierarchy.

1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html

Assuming that the way your mailboxes get hashed falls into a structure like $fs/b/bigbird and $fs/g/grover (and not just $fs/bigbird and $fs/grover), you should be able to set a save set per top-level directory or per group of a few directories. That is, create a save set for $fs/a, $fs/b, etc., or $fs/a - $fs/d, $fs/e - $fs/h, etc. If you are able to create many smaller save sets and turn the parallelism up, you should be able to drive more throughput.

I wouldn't get too worried about ensuring that they all start at the same time[3], but it would probably make sense to prioritize the larger ones so that they start early and the smaller ones can fill in the parallelism gaps as the longer-running ones finish.

3. That is, there is sometimes benefit in having many more jobs to run than you have concurrent streams. This avoids having one save set that finishes long after all the others because of poorly balanced save sets.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
I don't know of any reason why we can't turn 1 backup job per filesystem into, say, up to 26, based on the Cyrus file and directory structure.

The Cyrus file and directory structure is designed with users located under the directories A, B, C, D, etc., to deal with the millions-of-little-files issue at the filesystem layer. Our backups will have to be changed to use this design feature. There will be a little work on the front end to create the jobs, but once done the full backups should finish in a couple of hours.

As an aside, we are currently upgrading our backup server to a sun4v machine. This architecture is well suited to running more jobs in parallel.

Thanks for all your help and advice.

Ed

On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:
> On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:
>> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
>> Saturday. We back up to disk and then clone to tape. Our backup people
>> can only handle doing 2 filesystems per night.
>>
>> Creating more filesystems to increase the parallelism of our backup is
>> one solution but it's a major redesign of the mail system.
>
> What is magical about a 1:1 mapping of backup job to file system?
> According to the Networker manual[1], a save set in Networker can be
> configured to back up certain directories. According to some random
> documentation about Cyrus[2], mailboxes fall under a pretty
> predictable hierarchy.
>
> 1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
> 2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html
>
> Assuming that the way your mailboxes get hashed falls into a
> structure like $fs/b/bigbird and $fs/g/grover (and not just
> $fs/bigbird and $fs/grover), you should be able to set a save set per
> top-level directory or per group of a few directories. That is,
> create a save set for $fs/a, $fs/b, etc., or $fs/a - $fs/d, $fs/e -
> $fs/h, etc. If you are able to create many smaller save sets and turn
> the parallelism up, you should be able to drive more throughput.
>
> I wouldn't get too worried about ensuring that they all start at the
> same time[3], but it would probably make sense to prioritize the
> larger ones so that they start early and the smaller ones can fill in
> the parallelism gaps as the longer-running ones finish.
>
> 3. That is, there is sometimes benefit in having many more jobs to run
> than you have concurrent streams. This avoids having one save set
> that finishes long after all the others because of poorly balanced
> save sets.
>
> -- 
> Mike Gerdts
> http://mgerdts.blogspot.com/

-- Ed
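[Editorial note: the 26-way split Ed describes could be generated mechanically; a sketch, where the spool path and the "saveset:" output format are placeholders, not actual NetWorker configuration syntax.]

```shell
# Emit one backup save set per top-level Cyrus hash directory so a
# full backup can run up to 26 streams in parallel (path is assumed)
FS=/var/spool/imap/user
for d in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
    printf 'saveset: %s/%s\n' "$FS" "$d"
done
```

The emitted list can then be pasted into, or fed to, whatever defines save sets in the backup software, and the same loop extends naturally to grouping several letters per job.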
On Tue, Aug 11, 2009 at 11:04 PM, Richard Elling<richard.elling at gmail.com> wrote:
> On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
>
>> I suspect that if we 'rsync' one of these filesystems to a second
>> server/pool that we would also see a performance increase equal to what
>> we see on the development server. (I don't know how zfs send and receive
>> work so I don't know if they would address this "Filesystem Entropy" or
>> specifically reorganize the files and directories.) However, when we
>> created a testfs filesystem in the zfs pool on the production server,
>> and copied data to it, we saw the same performance as the other
>> filesystems, in the same pool.
>
> Directory walkers, like NetBackup or rsync, will not scale well as
> the number of files increases. It doesn't matter what file system you
> use; the scalability will look more-or-less similar. For millions of files,
> ZFS send/receive works much better. More details are in my paper.

It would be nice if ZFS had something similar to the VxFS File Change Log. This feature is very useful for incremental backups and other directory walkers, provided they support the FCL.

Damjan
Ed Spencer wrote:> I don''t know of any reason why we can''t turn 1 backup job per filesystem > into say, up to say , 26 based on the cyrus file and directory > structure. >No reason whatsoever. Sometimes the more the better as per the rest of this thread. The key here is to test and tweak till you get the optimal arrangement of backup window time and performance. Performance tuning is a little bit of a Journey, that sooner or later has a final destination. ;)> The cyrus file and directory structure is designed with users located > under the directories A,B,C,D,etc to deal with the millions of little > files issue at the filesystem layer. >The sun messaging server actually hashes the user names into a structure which looks quite similar to a squid cache store. This has a top level of 128 directories, which each in turn contain 128 directories, which then contain a folder for each user that has been mapped into that structure by the hash algorithm on the user name. I use a wildcard mapping to split this into 16 streams to cover the 0-9, a-f of the hexadecimal directory structure names. eg. /mailstore1/users/0*> Our backups will have to be changed to use this design feature. > There will be a little work on the front end to create the jobs but > once done the full backups should finish in a couple of hours. >The nice thing about this work is it really is only a one off configuration in the backup software and then it is done. Certainly works a lot better than something like ALL_LOCAL_DRIVES in Netbackup which effectively forks one backup thread per file system.> As an aside, we are currently upgrading our backup server to a sun4v > machine. > This architecture is well suited to run more jobs in parallel. >I use a T5220 with staging to a J4500 with 48 x 1 TB disks in a zpool with 6 file systems. 
This then gets streamed to 6 LTO4 tape drives in an SL500. Needless to say, this supports a high degree of parallelism and generally finds the source server to be the bottleneck. I also take advantage of the 10 GigE capability built straight into the UltraSPARC T2. The only major bottleneck in this system is the SAS interconnect to the J4500.

> Thanx for all your help and advice.
>
> Ed
>
> On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:
>
>> On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
>>
>>> We backup 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
>>> Saturday. We back up to disk and then clone to tape. Our backup people
>>> can only handle doing 2 filesystems per night.
>>>
>>> Creating more filesystems to increase the parallelism of our backup is
>>> one solution, but it's a major redesign of the mail system.
>>
>> What is magical about a 1:1 mapping of backup job to file system?
>> According to the Networker manual[1], a save set in Networker can be
>> configured to back up certain directories. According to some random
>> documentation about Cyrus[2], mailboxes fall under a pretty
>> predictable hierarchy.
>>
>> 1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
>> 2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html
>>
>> Assuming that the way your mailboxes get hashed falls into a
>> structure like $fs/b/bigbird and $fs/g/grover (and not just
>> $fs/bigbird and $fs/grover), you should be able to set a save set per
>> top-level directory or per group of a few directories. That is,
>> create a save set for $fs/a, $fs/b, etc., or $fs/a - $fs/d, $fs/e -
>> $fs/h, etc. If you are able to create many smaller save sets and turn
>> the parallelism up, you should be able to drive more throughput.
>>
>> I wouldn't get too worried about ensuring that they all start at the
>> same time[3], but it would probably make sense to prioritize the
>> larger ones so that they start early and the smaller ones can fill in
>> the parallelism gaps as the longer-running ones finish.
>>
>> 3. That is, there is sometimes benefit in having many more jobs to run
>> than you have concurrent streams. This avoids having one save set
>> that finishes long after all the others because of poorly balanced
>> save sets.

Couldn't agree more, Mike.

>> --
>> Mike Gerdts
>> http://mgerdts.blogspot.com/

-- 
_______________________________________________________________________

Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006
Manukau City
Auckland
New Zealand

Phone : +64 09 968 7611
Fax : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:scott at manukau.ac.nz
http://www.manukau.ac.nz

________________________________________________________________________

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
________________________________________________________________________
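Mike's grouping scheme ($fs/a - $fs/d, $fs/e - $fs/h, etc.) can be sketched as a few lines of shell that carve the 26 per-letter directories into save sets of a fixed size. The $FS path and the "saveset:" output format are placeholders, not real Networker configuration:

```shell
# Sketch: group the 26 per-letter Cyrus directories into save sets of
# GROUP letters each, so many small jobs can keep the parallel backup
# streams busy. Path and output format are hypothetical placeholders.
FS=/var/spool/cyrus/mail
GROUP=4
NSETS=0
saveset=""
n=0
for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    saveset="$saveset $FS/$letter"
    n=$((n + 1))
    if [ "$n" -eq "$GROUP" ]; then
        echo "saveset:$saveset"
        NSETS=$((NSETS + 1))
        n=0
        saveset=""
    fi
done
if [ -n "$saveset" ]; then
    # leftover letters form a final, smaller save set
    echo "saveset:$saveset"
    NSETS=$((NSETS + 1))
fi
```

With GROUP=4 this yields 7 save sets (6 of four letters, one of two), and per Mike's footnote [3] it is fine that they are uneven, since having more jobs than concurrent streams lets the scheduler fill the gaps.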
There is work underway to make NDMP more efficient on highly fragmented file systems with a lot of small files. I am not a development engineer, so I don't know much, and I do not think there is any committed work. However, the ZFS engineers on the forum may comment further.

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ed Spencer
Sent: Sunday, August 09, 2009 12:14 AM
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs fragmentation

On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:
> A SSD slog backed by a SAS 15K JBOD array should perform much better
> than a big iSCSI LUN.

Now... yes. We implemented this pool years ago. I believe, back then, the server would crash if a zfs drive failed. We decided to let the NetApp handle the disk redundancy. It's worked out well.

I've looked at those really nice Sun products adoringly, and a 7000 series appliance would also be a nice addition to our central NFS service. Not to mention more cost effective than expanding our Network Appliance (we have researchers who are quite hungry for storage, and NFS is always our first choice). We now have quite an investment in the current implementation; it's difficult to move away from. The NetApp is quite a reliable product.

We are quite happy with zfs and our implementation. We just need to address our backup performance and improve it just a little bit!

We were almost lynched this spring because we encountered some pretty severe zfs bugs. We are still running the IDR named "A wad of ZFS bug fixes for Solaris 10 Update 6". It took over a month to resolve the issues. I work at a University, and final exams and year end occur at the same time. I don't recommend having email problems during this time!
People are intolerant of email problems.

I live in hope that a NetApp OS update, or a Solaris patch, or a zfs patch, or an iscsi patch, or something will come along that improves our performance just a bit so our backup people get off my back!

-- 
Ed

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss