1. Due to the COW nature of ZFS, files on ZFS are more prone to fragmentation compared to a traditional file system. Is this statement correct?

2. If so, the common understanding is that fragmentation causes performance degradation. To what extent is ZFS performance affected by fragmentation?

3. Being a relatively new file system, has ZFS seen much adoption in large implementations?

4. Googling "zfs fragmentation" doesn't return many results. That could be because either there isn't much major adoption of ZFS, or fragmentation isn't really a problem for ZFS.

Any information is appreciated.
-- 
This message posted from opensolaris.org
On Thu, 6 Aug 2009, Hua wrote:
> 1. Due to the COW nature of ZFS, files on ZFS are more prone to
> fragmentation compared to a traditional file system. Is this
> statement correct?

Yes and no. Fragmentation is a complex issue.

ZFS uses 128K data blocks by default whereas other filesystems typically use 4K or 8K blocks. This naturally reduces the potential for fragmentation by 32X over 4K blocks.

ZFS storage pools are typically comprised of multiple "vdevs" and writes are distributed over these vdevs. This means that the first 128K of a file may go to the first vdev and the second 128K may go to the second vdev. It could be argued that this is a type of fragmentation, but since all of the vdevs can be read at once (if zfs prefetch chooses to do so), the seek time for single-user contiguous access is essentially zero, since the seeks occur while the application is already busy processing other data. When mirror vdevs are used, any device in the mirror may be used to read the data.

ZFS uses a slab allocator: it allocates large contiguous chunks from the vdev storage and then carves the 128K blocks from those large chunks. This dramatically increases the probability that related data will be very close on the same disk.

ZFS delays ordinary writes until the very last minute, according to these rules (my understanding): 7/8ths of total memory is consumed, 5 seconds of 100% write I/O has been collected, or 30 seconds have elapsed. Since quite a lot of data is written at once, zfs is able to write that data in the best possible order.

ZFS uses a copy-on-write model. Copy-on-write tends to cause fragmentation if portions of existing files are updated. If a large portion of a file is overwritten in a short period of time, the result should be reasonably fragment-free, but if parts of the file are updated over a long period of time (like a database) then the file is certain to be fragmented.
This is not such a big problem as it appears to be, since such files were typically already accessed using random access.

ZFS absolutely observes synchronous write requests (e.g. by NFS or a database). The synchronous write requests do not benefit from the long write-aggregation delay, so the result may not be written as ideally as ordinary write requests. Recently zfs has added support for using an SSD as a synchronous write log, and this allows zfs to turn synchronous writes into more ordinary writes which can be written more intelligently while returning to the user with minimal latency.

Perhaps the most significant fragmentation concern for zfs is if the pool is allowed to become close to 100% full. Similar to other filesystems, the quality of the storage allocations goes downhill fast when the pool is almost 100% full, so even files written contiguously may be written in fragments.

> 3. Being a relatively new file system, has ZFS seen much adoption in
> large implementations?

There are indeed some sites which heavily use zfs. One very large site using zfs is archive.org.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
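Bob's copy-on-write point can be sketched with a toy allocator. This is a deliberately simplified model, not ZFS's real allocator (no slabs, metaslabs, or transaction groups): overwriting in place keeps a file contiguous, while COW relocates each updated block to the current end of free space, so a file updated piecemeal over time ends up scattered.

```python
# Toy model of why copy-on-write scatters piecemeal updates.
# Illustrative only; real ZFS allocation is far more sophisticated.

def cow_update(file_blocks, block_index, next_free):
    """Relocate one logical block to a fresh on-disk address (COW)."""
    file_blocks[block_index] = next_free
    return next_free + 1

def fragment_count(file_blocks):
    """Count contiguous runs of on-disk addresses."""
    runs = 1
    for prev, cur in zip(file_blocks, file_blocks[1:]):
        if cur != prev + 1:
            runs += 1
    return runs

# A file written contiguously as disk blocks 0..9: one extent.
blocks = list(range(10))
print(fragment_count(blocks))   # 1

# Update a few scattered logical blocks over time; each COW write
# lands at the then-current end of free space.
next_free = 10
for idx in (7, 2, 5):
    next_free = cow_update(blocks, idx, next_free)

print(blocks)                   # [0, 1, 11, 3, 4, 12, 6, 10, 8, 9]
print(fragment_count(blocks))   # 7 extents after three small updates
```

This also shows Bob's other point: rewriting the *whole* file in one batch would land all ten blocks contiguously again, which is why bulk overwrites stay reasonably fragment-free while databases do not.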
> ZFS absolutely observes synchronous write requests (e.g. by NFS or a
> database). The synchronous write requests do not benefit from the
> long write-aggregation delay, so the result may not be written as
> ideally as ordinary write requests. Recently zfs has added support
> for using an SSD as a synchronous write log, and this allows zfs to
> turn synchronous writes into more ordinary writes which can be written
> more intelligently while returning to the user with minimal latency.

Bob, since the ZIL is always used, whether on a separate device or not, won't writes to a system without a separate ZIL also be written as intelligently as with a separate ZIL?

Thanks,
Scott
-- 
This message posted from opensolaris.org
On 08/07/09 10:54, Scott Meilicke wrote:
>> ZFS absolutely observes synchronous write requests (e.g. by NFS or a
>> database). The synchronous write requests do not benefit from the
>> long write-aggregation delay, so the result may not be written as
>> ideally as ordinary write requests. Recently zfs has added support
>> for using an SSD as a synchronous write log, and this allows zfs to
>> turn synchronous writes into more ordinary writes which can be written
>> more intelligently while returning to the user with minimal latency.
>
> Bob, since the ZIL is always used, whether on a separate device or not,
> won't writes to a system without a separate ZIL also be written as
> intelligently as with a separate ZIL?

Yes. ZFS uses the same code path (intelligence?) to write out the data from NFS, regardless of whether there's a separate log (slog) or not.

> Thanks,
> Scott
On Fri, 7 Aug 2009, Scott Meilicke wrote:
> Bob, since the ZIL is always used, whether on a separate device or not,
> won't writes to a system without a separate ZIL also be written as
> intelligently as with a separate ZIL?

I don't know the answer to that. Perhaps there is no current advantage. The longer the final writes can be deferred, the more opportunity there is to write the data with a better layout, or to avoid writing some data at all.

One thing I forgot to mention in my summary is that zfs is commonly used in multi-user environments where there may be many simultaneous writers. Simultaneous writers tend to naturally fragment a filesystem unless the filesystem is willing to spread the data out in advance and take a seek hit (from one file to another) for each file write. ZFS's deferral of the writes allows the data to be written more intelligently in these multi-user environments.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
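The multi-writer point above can be illustrated with another toy allocator (again a sketch, not ZFS internals): assigning disk addresses the instant each write arrives interleaves the writers' blocks on disk, while deferring the batch and grouping it by file before allocating keeps each file's blocks together.

```python
# Toy comparison: eager allocation vs. deferred/batched allocation
# when two writers append to different files at the same time.

def allocate_immediately(write_stream):
    """Assign disk addresses in arrival order (writers interleave)."""
    layout = {}
    for addr, (file_id, _block) in enumerate(write_stream):
        layout.setdefault(file_id, []).append(addr)
    return layout

def allocate_batched(write_stream):
    """Defer the whole batch, group by file, then assign addresses."""
    ordered = sorted(write_stream, key=lambda w: w[0])  # stable sort
    return allocate_immediately(ordered)

def contiguous(addrs):
    return all(b == a + 1 for a, b in zip(addrs, addrs[1:]))

# Two writers appending to files 'a' and 'b' simultaneously.
stream = [('a', 0), ('b', 0), ('a', 1), ('b', 1), ('a', 2), ('b', 2)]

eager = allocate_immediately(stream)
lazy = allocate_batched(stream)
print(eager['a'], contiguous(eager['a']))   # [0, 2, 4] False
print(lazy['a'], contiguous(lazy['a']))     # [0, 1, 2] True
```

Reading file 'a' back in the eager layout means a seek between every block; in the batched layout it is one contiguous run, which is the benefit Bob attributes to zfs's write deferral.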
Let me give a real-life example of what I believe is a fragmented zfs pool.

Currently the pool is 2 terabytes in size (55% used) and is made of 4 SAN LUNs (512GB each). The pool has never gotten close to being full. We increase the size of the pool by adding 2 512GB LUNs about once a year or so.

The pool has been divided into 7 filesystems.

The pool is used for IMAP email data. The email system (Cyrus) has approximately 80,000 accounts, all located within the pool and evenly distributed between the filesystems.

Each account has a directory associated with it. This directory is the user's inbox. Additional mail folders are subdirectories. Mail is stored as individual files.

We receive mail at a rate of 0-20MB/second, every minute of every hour of every day of every week, etc. etc.

Users receive mail constantly over time. They read it and then either delete it or store it in a subdirectory/folder.

I imagine that my mail (located in a single subdirectory structure) is spread over the entire pool because it has been received over time. I believe the data is highly fragmented (from a file and directory perspective).

The result of this is that backup throughput of a single filesystem in this pool is about 8GB/hour. We use EMC Networker for backups.

This is a problem. There are no utilities available to evaluate this type of fragmentation. There are no utilities to fix it.

ZFS, from the mail system perspective, works great. Writes and random reads operate well. Backup is a problem, and not just because of small files, but because of small files scattered over the entire pool.

Adding another pool and copying all/some data over to it would only be a short-term solution.

I believe zfs needs a feature that operates in the background and defrags the pool to optimize sequential reads of the file and directory structure.

Ed
-- 
This message posted from opensolaris.org
On Aug 7, 2009, at 2:29 PM, Ed Spencer wrote:
> Let me give a real-life example of what I believe is a fragmented
> zfs pool.
>
> Currently the pool is 2 terabytes in size (55% used) and is made of
> 4 SAN LUNs (512GB each). The pool has never gotten close to being
> full. We increase the size of the pool by adding 2 512GB LUNs about
> once a year or so.
>
> The pool has been divided into 7 filesystems.
>
> The pool is used for IMAP email data. The email system (Cyrus) has
> approximately 80,000 accounts, all located within the pool and evenly
> distributed between the filesystems.
>
> Each account has a directory associated with it. This directory is
> the user's inbox. Additional mail folders are subdirectories. Mail is
> stored as individual files.
>
> We receive mail at a rate of 0-20MB/second, every minute of every
> hour of every day of every week, etc. etc.
>
> Users receive mail constantly over time. They read it and then
> either delete it or store it in a subdirectory/folder.
>
> I imagine that my mail (located in a single subdirectory structure)
> is spread over the entire pool because it has been received over
> time. I believe the data is highly fragmented (from a file and
> directory perspective).
>
> The result of this is that backup throughput of a single filesystem
> in this pool is about 8GB/hour. We use EMC Networker for backups.

This is very unlikely to be a "fragmentation problem." It is a scalability problem, and there may be something you can do about it in the short term. However, though I don't usually like to tease, in this case I need to: I recently completed a white paper on this exact workload and how we designed it to scale. I hope to publish that paper RSN. When the paper hits the web, I'll restart a new thread on using ZFS for large-scale email systems.

> This is a problem. There are no utilities available to evaluate this
> type of fragmentation. There are no utilities to fix it.
>
> ZFS, from the mail system perspective, works great.
> Writes and random reads operate well.
>
> Backup is a problem, and not just because of small files, but because
> of small files scattered over the entire pool.
>
> Adding another pool and copying all/some data over to it would only
> be a short-term solution.

I'll have to disagree.

> I believe zfs needs a feature that operates in the background and
> defrags the pool to optimize sequential reads of the file and
> directory structure.

This will not solve your problem, but there are other methods that can.
-- richard

> Ed
> -- 
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On Fri, 2009-08-07 at 19:33, Richard Elling wrote:
> This is very unlikely to be a "fragmentation problem." It is a
> scalability problem and there may be something you can do about it
> in the short term.

You could be right. Our test mail server has the exact same design and the same hardware (sun4v), but in a smaller configuration (less memory and 4 x 25GB SAN LUNs), and it has a backup/copy throughput of 30GB/hour. The data used for testing was "copied" from our production mail server.

> > Adding another pool and copying all/some data over to it would only
> > be a short-term solution.
>
> I'll have to disagree.

What is the point of a filesystem that can grow to such a huge size and not have functionality built in to optimize data layout? Real-world implementations of filesystems that are intended to live for years/decades need this functionality, don't they?

Our mail system works well; only the backup doesn't perform well. All the features of ZFS that make reads perform well (prefetch, ARC) have little effect.

We think backup is quite important. We do quite a few restores of months-old data. Snapshots help in the short term, but for longer-term restores we need to go to tape.

Of course, as you can tell, I'm kinda stuck on this idea that "file and directory fragmentation" is causing our issues with the backup. I don't know how to analyze the pool to better understand the problem.

If we did chop the pool up into, let's say, 7 pools (one for each current filesystem), then over time these 7 pools would grow and we would end up with the same issues. That's why it seems to me to be a short-term solution.

If our issues with zfs are scalability, then you could say zfs is not scalable. Is that true? (It certainly is if the solution is to create more pools!)
-- 
Ed
>> > Adding another pool and copying all/some data over to it would only
>> > be a short-term solution.
>>
>> I'll have to disagree.
>
> What is the point of a filesystem that can grow to such a huge size and
> not have functionality built in to optimize data layout? Real-world
> implementations of filesystems that are intended to live for
> years/decades need this functionality, don't they?
>
> Our mail system works well; only the backup doesn't perform well.
> All the features of ZFS that make reads perform well (prefetch, ARC)
> have little effect.
>
> We think backup is quite important. We do quite a few restores of
> months-old data. Snapshots help in the short term, but for longer-term
> restores we need to go to tape.

Your scalability problem may be in your backup solution. The problem is not how many GB of data you have but the number of files. It has been a while since I worked with Networker, so things may have changed.

If you are doing backups directly to tape you may have a buffering problem. By simply staging backups on disk we got a lot faster backups.

Have you configured Networker to do several simultaneous backups from your pool? You can do that by having several filesystems on the same pool, or by telling Networker to do backups one directory level down so that it thinks you have more file systems. And don't forget to play with the parallelism settings in Networker; this made a huge difference for us on VxFS.
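Why several simultaneous save streams help can be sketched with a simple wall-clock model (all figures here are assumptions for illustration — real Networker tuning happens in its parallelism settings, not in Python): when each small file costs a fixed I/O round-trip, total backup time shrinks roughly linearly with the number of concurrent streams, until the back-end disks saturate.

```python
# Simple model of a latency-bound backup split across parallel streams.
# The per-file service time and the saturation cap are assumed figures.

def backup_wall_time(num_files, seconds_per_file, streams,
                     max_useful_streams=54):
    """Wall-clock seconds to back up num_files, one file in flight
    per stream; extra streams beyond the disk-saturation cap don't help."""
    effective = min(streams, max_useful_streams)
    return num_files * seconds_per_file / effective

# 1,000,000 small files at an assumed ~10 ms per file:
one_stream = backup_wall_time(1_000_000, 0.010, streams=1)
ten_streams = backup_wall_time(1_000_000, 0.010, streams=10)
print(one_stream / 3600)    # ~2.78 hours single-stream
print(ten_streams / 3600)   # ~0.28 hours with 10 streams
```

The model ignores tape multiplexing and metadata overhead, but it captures the core point: a single serial stream leaves almost all of the spindles idle.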
On Sat, 8 Aug 2009, Ed Spencer wrote:
> What is the point of a filesystem that can grow to such a huge size and
> not have functionality built in to optimize data layout? Real-world
> implementations of filesystems that are intended to live for
> years/decades need this functionality, don't they?

Enterprise storage should work fine without needing to run a tool to optimize data layout or repair the filesystem. Well-designed software uses an approach which does not unravel through use.

> Our mail system works well; only the backup doesn't perform well.
> All the features of ZFS that make reads perform well (prefetch, ARC)
> have little effect.

It is already known that ZFS prefetch is often not aggressive enough for bulk reads, and sometimes gets lost entirely. I think that is the first issue to resolve in order to get your backups going faster. Many of us here have already tested our own systems and found that under some conditions ZFS was offering up only 30MB/second for bulk data reads, regardless of how exotic our storage pool and hardware was.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> Many of us here have already tested our own systems and found that
> under some conditions ZFS was offering up only 30MB/second for bulk
> data reads, regardless of how exotic our storage pool and hardware
> was.

Just so we are using the same units of measurement: backup/copy throughput on our development mail server is 8.5MB/sec. The people running our backups would be overjoyed with that performance. However, backup/copy throughput on our production mail server is 2.25MB/sec.

The underlying disk is 15000 RPM 146GB FC drives. Our performance may be hampered somewhat because the LUNs are on a Network Appliance accessed via iSCSI, but not to the extent that we are seeing, and that does not account for the throughput difference between the development and production pools.

When I talk about fragmentation it's not in the normal sense. I'm not talking about blocks in a file not being sequential. I'm talking about files in a single directory that end up spread across the entire filesystem/pool.

My problem right now is diagnosing the performance issues. I can't address them without understanding the underlying cause. There is a lack of tools to help in this area. There is also a lack of acceptance that I'm actually having a problem with zfs. It's frustrating.

Anyone know how to significantly increase the performance of a zfs filesystem without causing any downtime to an enterprise email system used by 30,000 intolerant people, when you don't really know what is causing the performance issues in the first place? (Yeah, it sucks to be me!)
-- 
Ed
On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:
> Your scalability problem may be in your backup solution.

We've eliminated the backup system as being involved with the performance issues.

The servers are Solaris 10 with the OS on UFS filesystems. (In zfs terms, the pool is old/mature.) Solaris has been patched to a fairly current level. Copying data from the zfs filesystem to the local UFS filesystem sees the same throughput as the backup system.

The test was simple. Create a test filesystem on the zfs pool. Restore production email data to it. Reboot the server. Back up the data (29 minutes for 15.8GB of data). Reboot the server. Copy the data from zfs to UFS using a 'cp -pr ...' command, which also took 29 minutes.

And if anyone is interested, it only took 15 minutes to restore (write) the 15.8GB of data over the network.
-- 
Ed
On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> Enterprise storage should work fine without needing to run a tool to
> optimize data layout or repair the filesystem. Well-designed software
> uses an approach which does not unravel through use.

Hmmmm, this is counter to my understanding. I always thought that to optimize sequential read performance you must store the data according to how the device will read the data.

Spinning rust reads data in a sequential fashion. In order to optimize read performance, the data has to be laid down that way. When reading files in a directory, the files need to be laid out on the physical device sequentially for optimal read performance.

I'm probably not the person to argue this point, though... Is there a DBA around?

Maybe my problems will go away once we move to the next generation of storage devices, SSDs! I'm starting to think that ZFS will really shine on SSDs.
-- 
Ed
On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>> Many of us here have already tested our own systems and found that
>> under some conditions ZFS was offering up only 30MB/second for bulk
>> data reads, regardless of how exotic our storage pool and hardware
>> was.
>
> Just so we are using the same units of measurement: backup/copy
> throughput on our development mail server is 8.5MB/sec. The people
> running our backups would be overjoyed with that performance.
>
> However, backup/copy throughput on our production mail server is
> 2.25MB/sec.
>
> The underlying disk is 15000 RPM 146GB FC drives.
> Our performance may be hampered somewhat because the LUNs are on a
> Network Appliance accessed via iSCSI, but not to the extent that we
> are seeing, and that does not account for the throughput difference
> between the development and production pools.

NetApp filers run WAFL - the Write Anywhere File Layout. Even if ZFS arranged everything perfectly (however that is defined), WAFL would undo its hard work.

Since you are using iSCSI, I assume that you have disabled the Nagle algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat. If not, go do that now.

> When I talk about fragmentation it's not in the normal sense. I'm not
> talking about blocks in a file not being sequential. I'm talking about
> files in a single directory that end up spread across the entire
> filesystem/pool.

It's tempting to think that if the files were in roughly the same area of the block device that ZFS sees, reading the files sequentially would at least trigger a read-ahead at the filer. I suspect that even a moderate amount of file creation and deletion would cause the I/O pattern to be random enough (not purely sequential) that the back-end storage would not have a reasonable chance of recognizing it as a good time for read-ahead.
Further, the backup application is probably in a loop like:

  while there are more files in the directory
      if next file mtime > last backup time
          open file
          read file contents, send to backup stream
          close file
      end if
  end while

In other words, other I/O operations are interspersed between the sequential data reads, some files are likely to be skipped, and there is latency introduced by writing to the data stream. I would be surprised to see any file system do intelligent read-ahead here. In other words, lots of small file operations make backups, and especially restores, go slowly.

More backup and restore streams will almost certainly help. Multiplex the streams so that you can keep your tapes moving at a constant speed.

Do you have statistics on network utilization to ensure that you aren't stressing it?

Have you looked at iostat data to be sure that you are seeing asvc_t + wsvc_t values that support the number of operations that you need to perform? That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a workload that waits for the completion of one I/O before issuing the next will max out at 100 iops. Presumably ZFS should hide some of this from you[1], but it does suggest that each backup stream would be limited to about 100 files per second[2]. This is because the read request for one file does not happen before the close of the previous file[3]. Since Cyrus stores each message as a separate file, this suggests that 2.5 MB/s corresponds to an average mail message size of 25 KB.

1. via metadata caching, read-ahead on file data reads, etc.
2. Assuming wsvc_t + asvc_t = 10 ms
3. Assuming that Networker is about as smart as tar, zip, cpio, etc.

> My problem right now is diagnosing the performance issues. I can't
> address them without understanding the underlying cause. There is a
> lack of tools to help in this area. There is also a lack of acceptance
> that I'm actually having a problem with zfs.
> It's frustrating.

This is a prime example of why Sun needs to sell Analytics[4][5] as an add-on to Solaris in general. This problem is just as hard to figure out on Solaris as it is on Linux, Windows, etc. If Analytics were bundled with Gold and above support contracts, it would be a very compelling reason to shell out a few extra bucks for a better support contract.

4. http://blogs.sun.com/bmc/resource/cec_analytics.pdf
5. http://blogs.sun.com/brendan/category/Fishworks

> Anyone know how to significantly increase the performance of a zfs
> filesystem without causing any downtime to an enterprise email system
> used by 30,000 intolerant people, when you don't really know what is
> causing the performance issues in the first place? (Yeah, it sucks to
> be me!)

Hopefully I've helped find a couple of places to look...
-- 
Mike Gerdts
http://mgerdts.blogspot.com/
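Mike's 100-files-per-second arithmetic is worth making explicit (the 10 ms service time and 25 KB average message size are his assumed figures from the footnotes): a serial backup loop that waits out one full I/O round-trip per file is capped by latency, not by raw disk bandwidth.

```python
# Latency-bound throughput of one serial backup stream:
# one file's I/O must complete before the next begins.

def serial_backup_throughput(service_time_s, avg_file_bytes):
    """Return (files/sec, bytes/sec) for a one-file-at-a-time reader."""
    files_per_second = 1.0 / service_time_s
    return files_per_second, files_per_second * avg_file_bytes

# Assumed: wsvc_t + asvc_t = 10 ms, average message = 25 KB (decimal).
files_per_s, bytes_per_s = serial_backup_throughput(0.010, 25_000)
print(files_per_s)          # 100.0 files/s
print(bytes_per_s / 1e6)    # 2.5 MB/s, matching the observed ~2.25 MB/s
```

Note that the cap is independent of how fast the disks can stream: halving the per-file latency (or doubling the stream count) doubles throughput, while faster sequential disk reads change nothing.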
On Sat, Aug 8, 2009 at 3:02 PM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>
>> Enterprise storage should work fine without needing to run a tool to
>> optimize data layout or repair the filesystem. Well-designed software
>> uses an approach which does not unravel through use.
>
> Hmmmm, this is counter to my understanding. I always thought that to
> optimize sequential read performance you must store the data according
> to how the device will read the data.
>
> Spinning rust reads data in a sequential fashion. In order to optimize
> read performance, the data has to be laid down that way.
>
> When reading files in a directory, the files need to be laid out on
> the physical device sequentially for optimal read performance.
>
> I'm probably not the person to argue this point, though... Is there a
> DBA around?

The DBAs that I know use files that are at least hundreds of megabytes in size. Your problem is very different.

> Maybe my problems will go away once we move to the next generation of
> storage devices, SSDs! I'm starting to think that ZFS will really
> shine on SSDs.

Your problem seems to be related to cold reads of a pretty large data set. With SSDs (L2ARC) you are likely to see a performance boost for a larger set of recently read files, but my guess is that backups will still be pretty slow. There is likely more benefit in restore speed with SSDs than there is in read speeds. However, the NVRAM on the NetApp that is backing your iSCSI LUNs is probably already giving you most of this benefit (assuming low latency on the network connections).
-- 
Mike Gerdts
http://mgerdts.blogspot.com/
On Sat, 8 Aug 2009, Ed Spencer wrote:
> On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
>
>> Enterprise storage should work fine without needing to run a tool to
>> optimize data layout or repair the filesystem. Well-designed software
>> uses an approach which does not unravel through use.
>
> Hmmmm, this is counter to my understanding. I always thought that to
> optimize sequential read performance you must store the data according
> to how the device will read the data.

That is something I agree with. As a result, the requirement/goal of an enterprise storage system should be to ensure that data is as contiguous as possible, keeping in mind that multiple disks (LUNs) may be involved. It should not unravel and require first aid in order to work correctly (like MS-DOS FAT).

If you are using a big LUN on some other storage device, then zfs is not able to do nearly as much to optimize performance as it would if it interfaced with a JBOD array. For the big LUN, all it can do is try to write blocks associated with the current transaction group in the most contiguous order, and read-ahead cannot be as useful. With the big LUN it does not know if the data is on different physical disks, so it does not know if reading data in parallel will help reduce the read latencies.

> Maybe my problems will go away once we move to the next generation of
> storage devices, SSDs! I'm starting to think that ZFS will really
> shine on SSDs.

An SSD slog backed by a SAS 15K JBOD array should perform much better than a big iSCSI LUN.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:
> The DBAs that I know use files that are at least hundreds of
> megabytes in size. Your problem is very different.

Yes, definitely.

I'm relating records in a table to my small files because our email system treats the filesystem as a database. And in the back of my mind I'm also thinking that you have to rebuild/repair a database once in a while to improve performance. And in my case, since the filesystem is the database, I want to do that to zfs!

At least that's what I'm thinking. However, and I always come back to this, I'm not certain what is causing my problem. I need certainty before taking action on the production system.
-- 
Ed
On Sat, Aug 8, 2009 at 3:25 PM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
> On Sat, 2009-08-08 at 15:12, Mike Gerdts wrote:
>
>> The DBAs that I know use files that are at least hundreds of
>> megabytes in size. Your problem is very different.
>
> Yes, definitely.
>
> I'm relating records in a table to my small files because our email
> system treats the filesystem as a database.

Right... but ZFS doesn't understand your application. The reason that a file system would put files that are in the same directory in the same general area on a disk is to minimize seek time. I would argue that seek time doesn't matter a whole lot here - at least from the vantage point of ZFS. The LUNs that you have presented from the filer are probably RAID6 across many disks. ZFS seems to be doing a 4-way stripe (or are you mirroring or using raidz?). Assuming you are doing something like a 7+2 RAID6 on the back end, the contents would be spread across 36 drives.[1] The trick to making this perform well is to have 36 * N worker threads. Mail is a great thing to keep those spindles kinda busy while getting decent performance. A small number of sequential readers - particularly with small files, where you can't do a reasonable job with read-ahead - has little chance of keeping that number of drives busy.

1. Or you might have 4 LUNs presented from one 4+1 RAID5, in which case you may be forcing more head movement because ZFS thinks it can speed things up by striping data across the LUNs.

ZFS can recognize a database (or other application) doing a sequential read on a large file. While data located sequentially on disk can be helpful for reads, this is much less important when the pool sits across tens of disks. This is because it has the ability to spread the iops across lots of disks, potentially reading a heavily fragmented file much faster than a purely sequential file.

In either case, your backup application is competing for iops (and seeks) with other workload.
With the NetApp backend there are likely other applications on the same aggregate that are forcing head movement away from any data belonging to these LUNs.

> And in the back of my mind I'm also thinking that you have to
> rebuild/repair the database once in a while to improve performance.

Certainly. Databases become fragmented and are reorganized to fix this.

> And in my case, since the filesystem is the database, I want to do
> that to zfs!
>
> At least that's what I'm thinking. However, and I always come back to
> this, I'm not certain what is causing my problem. I need certainty
> before taking action on the production system.

Most databases are written in such a way that they can be optimized for sequential reads (table scans) and for backups, whether on raw disk or on a file system. The more advanced the database is, the more likely it is to ask the file system to get out of its way and *not* do anything fancy.

It seems that Cyrus was optimized for operations that make sense for a mail program (deliver messages, retrieve messages, delete messages) and nothing else. I would argue that any application that creates lots of tiny files is not optimized for backing up using a small number of streams.
-- 
Mike Gerdts
http://mgerdts.blogspot.com/
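Mike's "36 * N worker threads" arithmetic can be checked in a few lines (the per-disk iops figure is an assumption for a 15K FC drive, not a measurement from this system):

```python
# Back-of-the-envelope spindle math for the layout Mike hypothesizes.
luns = 4
drives_per_group = 7 + 2            # assumed 7+2 RAID6 group per LUN
spindles = luns * drives_per_group  # drives behind the pool

per_disk_iops = 150                 # assumed figure for a 15K FC drive
aggregate_iops = spindles * per_disk_iops

single_stream_iops = 100            # one latency-bound reader at ~10 ms/op
print(spindles)                                # 36
print(aggregate_iops)                          # 5400
print(aggregate_iops // single_stream_iops)    # ~54 streams to saturate
```

So one serial backup stream uses roughly 2% of the theoretical iops budget of the spindles, which is why a small number of sequential readers "has little chance of keeping that number of drives busy."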
On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:
> An SSD slog backed by a SAS 15K JBOD array should perform much better
> than a big iSCSI LUN.

Now... yes. We implemented this pool years ago. I believe, back then, the server would crash if a zfs drive failed. We decided to let the NetApp handle the disk redundancy. It's worked out well.

I've looked at those really nice Sun products adoringly. And a 7000-series appliance would also be a nice addition to our central NFS service, not to mention more cost-effective than expanding our Network Appliance. (We have researchers who are quite hungry for storage, and NFS is always our first choice.)

We now have quite an investment in the current implementation. It's difficult to move away from. The NetApp is quite a reliable product. We are quite happy with zfs and our implementation. We just need to address our backup performance and improve it just a little bit!

We were almost lynched this spring because we encountered some pretty severe zfs bugs. We are still running the IDR named "A wad of ZFS bug fixes for Solaris 10 Update 6". It took over a month to resolve the issues. I work at a university, and final exams and year end occur at the same time. I don't recommend having email problems during this time! People are intolerant of email problems.

I live in hope that a NetApp OS update, or a Solaris patch, or a zfs patch, or an iSCSI patch, or something will come along that improves our performance just a bit so our backup people get off my back!
-- 
Ed
On Sat, 2009-08-08 at 16:09, Mike Gerdts wrote:

> Right... but ZFS doesn't understand your application. The reason that
> a file system would put files that are in the same directory in the
> same general area on a disk is to minimize seek time. I would argue
> that seek time doesn't matter a whole lot here - at least from the
> vantage point of ZFS. The LUNs that you have presented from the filer
> are probably RAID6 across many disks.

Yes. RAID-DP. 16 drive arrays. 42 drives in total (one hot spare).

> ZFS seems to be doing a 4 way
> stripe (or are you mirroring or raidz?).

Here's the pool (no zfs raid):

  pool: space
 state: ONLINE
 scrub: none requested
config:

        NAME                                     STATE     READ WRITE CKSUM
        space                                    ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A644A74d0  ONLINE       0     0     0
          c4t60A98000433469764E4A2D456A696579d0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F6B385Ad0  ONLINE       0     0     0
          c4t60A98000433469764E4A476D2F664E4Fd0  ONLINE       0     0     0

errors: No known data errors

> Assuming you are doing
> something like a 7+2 RAID6 on the back end, the contents would be
> spread across 36 drives.[1] The trick to making this perform well is
> to have 36 * N worker threads. Mail is a great thing to keep those
> spindles kinda busy while getting decent performance. A small number
> of sequential readers - particularly with small files where you can't
> do a reasonable job with read-ahead - has little chance of keeping
> that number of drives busy.

The server is also a Sun T2000 (sun4v).

> 1. Or you might have 4 LUNs presented from one 4+1 RAID5 in which you
> may be forcing more head movement because ZFS thinks it can speed
> things up by striping data across the LUNs.
>
> ZFS can recognize a database (or other application) doing a sequential
> read on a large file. While data located sequentially on disk can be
> helpful for reads, this is much less important when the pool sits
> across tens of disks.
> This is because it has the ability to spread
> the iops across lots of disks, potentially reading a heavily
> fragmented file much faster than a purely sequential file.
>
> In either case, your backup application is competing for iops (and
> seeks) with other workload. With the NetApp backend there are likely
> other applications on the same aggregate that are forcing head
> movement away from any data belonging to these LUNs.

Email makes up about 98% of our IP SAN. There are only a couple of other apps on it that require block storage. We run "reallocate" jobs nightly to ensure the luns stay sequential within the netapp storage pool (aggregate) because of its COW filesystem.

> > And in the back of my mind I'm also thinking that you have to
> > rebuild/repair the database once in a while to improve performance.
>
> Certainly. Databases become fragmented and are reorganized to fix this.
>
> > And in my case, since the filesystem is the database, I want to do that
> > to zfs!
> >
> > At least that's what I'm thinking, however, and I always come back to
> > this, I'm not certain what is causing my problem. I need certainty
> > before taking action on the production system.
>
> Most databases are written in such a way that they can be optimized
> for sequential reads (table scans) and for backups, whether on raw
> disk or on a file system. The more advanced the database is, the more
> likely it is to ask the file system to get out of its way and *not* do
> anything fancy.
>
> It seems that cyrus was optimized for operations that make sense for a
> mail program (deliver messages, retrieve messages, delete messages)
> and nothing else. I would argue that any application that creates
> lots of tiny files is not optimized for backing up using a small
> number of streams.

Oh yes. Lots of small files is the backup nightmare.

-- Ed
On Sat, 2009-08-08 at 15:05, Mike Gerdts wrote:

> On Sat, Aug 8, 2009 at 12:51 PM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:
> >
> > On Sat, 2009-08-08 at 09:17, Bob Friesenhahn wrote:
> >> Many of us here already tested our own systems and found that under
> >> some conditions ZFS was offering up only 30MB/second for bulk data
> >> reads regardless of how exotic our storage pool and hardware was.
> >
> > Just so we are using the same units of measurement: backup/copy
> > throughput on our development mail server is 8.5MB/sec. The people
> > running our backups would be overjoyed with that performance.
> >
> > However, backup/copy throughput on our production mail server is 2.25
> > MB/sec.
> >
> > The underlying disk is 15000 RPM 146GB FC drives.
> > Our performance may be hampered somewhat because the luns are on a
> > Network Appliance accessed via iSCSI, but not to the extent that we are
> > seeing, and it does not account for the throughput difference between
> > the development and production pools.
>
> NetApp filers run WAFL - Write Anywhere File Layout. Even if ZFS
> arranged everything perfectly (however that is defined), WAFL would
> undo its hard work.
>
> Since you are using iSCSI, I assume that you have disabled the Nagle
> algorithm and increased tcp_xmit_hiwat and tcp_recv_hiwat. If not,
> go do that now.

We've tried many different iscsi parameter changes on our development server: jumbo frames, disabling Nagle. I'll double check next week on tcp_xmit_hiwat and tcp_recv_hiwat. Nothing has made any real difference. We are only using about 5% of the bandwidth on our IP SAN. We use two cisco ethernet switches on the IP SAN. The iscsi initiators use MPXIO in a round robin configuration.

> > When I talk about fragmentation it's not in the normal sense. I'm not
> > talking about blocks in a file not being sequential. I'm talking about
> > files in a single directory that end up spread across the entire
> > filesystem/pool.
> It's tempting to think that if the files were in roughly the same area
> of the block device that ZFS sees, reading the files sequentially
> would at least trigger a read-ahead at the filer. I suspect that even
> a moderate amount of file creation and deletion would cause the I/O
> pattern to be random enough (not purely sequential) that the back-end
> storage would not have a reasonable chance of recognizing it as a good
> time for read-ahead. Further, the backup application is
> probably in a loop of:
>
>     while there are more files in the directory
>         if next file mtime > last backup time
>             open file
>             read file contents, send to backup stream
>             close file
>         end if
>     end while
>
> In other words, other I/O operations are interspersed between the
> sequential data reads, some files are likely to be skipped, and there
> is latency introduced by writing to the data stream. I would be
> surprised to see any file system do intelligent read-ahead here. In
> other words, lots of small file operations make backups and especially
> restores go slowly. More backup and restore streams will almost
> certainly help. Multiplex the streams so that you can keep your tapes
> moving at a constant speed.

We backup to disk first and then put to tape later.

> Do you have statistics on network utilization to ensure that you
> aren't stressing it?
>
> Have you looked at iostat data to be sure that you are seeing asvc_t +
> wsvc_t that supports the number of operations that you need to
> perform? That is, if asvc_t + wsvc_t for a device adds up to 10 ms, a
> workload that waits for the completion of one I/O before issuing the
> next will max out at 100 iops. Presumably ZFS should hide some of
> this from you[1], but it does suggest that each backup stream would be
> limited to about 100 files per second[2]. This is because the read
> request for one file does not happen before the close of the previous
> file[3].
> Since cyrus stores each message as a separate file, this
> suggests that 2.5 MB/s corresponds to an average mail message size of
> 25 KB.
>
> 1. via metadata caching, read-ahead on file data reads, etc.
> 2. Assuming wsvc_t + asvc_t = 10 ms
> 3. Assuming that networker is about as smart as tar, zip, cpio, etc.

There is a backup of a single filesystem in the pool going on right now:

# zpool iostat 5 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.05T   965G     97     69  5.24M  2.71M
space       1.05T   965G    113     10  6.41M   996K
space       1.05T   965G    100    112  2.87M  1.81M
space       1.05T   965G    112      8  2.35M  35.9K
space       1.05T   965G    106      3  1.76M  55.1K

Here are examples: iostat -xpn 5 5

                    extended device statistics
   r/s   w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  17.1  29.2   746.7   317.1  0.0  0.6    0.0   12.5   0  27 c4t60A98000433469764E4A2D456A644A74d0
  25.0  11.9   991.9   277.0  0.0  0.6    0.0   16.1   0  36 c4t60A98000433469764E4A2D456A696579d0
  14.9  17.9   423.0   406.4  0.0  0.3    0.0   10.2   0  21 c4t60A98000433469764E4A476D2F664E4Fd0
  20.8  17.4   588.9   361.2  0.0  0.4    0.0   11.5   0  30 c4t60A98000433469764E4A476D2F6B385Ad0

and:

   r/s   w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  11.9  43.0   528.9  1972.8  0.0  2.1    0.0   38.9   0  31 c4t60A98000433469764E4A2D456A644A74d0
  17.0  19.6   496.9  1499.0  0.0  1.4    0.0   38.8   0  39 c4t60A98000433469764E4A2D456A696579d0
  14.0  30.0   670.2  1971.3  0.0  1.7    0.0   38.0   0  34 c4t60A98000433469764E4A476D2F664E4Fd0
  19.7  28.7   985.2  1647.6  0.0  1.6    0.0   32.5   0  37 c4t60A98000433469764E4A476D2F6B385Ad0

and:

   r/s   w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
  22.7  41.3   973.7   423.5  0.0  0.8    0.0   11.8   0  34 c4t60A98000433469764E4A2D456A644A74d0
  27.9  20.0  1474.7   344.0  0.0  0.8    0.0   16.7   0  42 c4t60A98000433469764E4A2D456A696579d0
  15.1  17.9  1318.7   463.7  0.0  0.6    0.0   17.7   0  19 c4t60A98000433469764E4A476D2F664E4Fd0
  22.3  19.5  1801.7   406.7  0.0  0.8    0.0   20.0   0  29 c4t60A98000433469764E4A476D2F6B385Ad0

> > My problem right now is diagnosing the performance issues.
> > I can't
> > address them without understanding the underlying cause. There is a
> > lack of tools to help in this area. There is also a lack of acceptance
> > that I'm actually having a problem with zfs. It's frustrating.
>
> This is a prime example of why Sun needs to sell Analytics[4][5] as an
> add-on to Solaris in general. This problem is just as hard to figure
> out on Solaris as it is on Linux, Windows, etc. If Analytics were
> bundled with Gold and above support contracts, it would be a very
> compelling reason to shell out a few extra bucks for a better support
> contract.
>
> 4. http://blogs.sun.com/bmc/resource/cec_analytics.pdf
> 5. http://blogs.sun.com/brendan/category/Fishworks

Oh definitely! It will also give me the opportunity to yell at my drives! Might help to relieve some stress. http://sunbeltblog.blogspot.com/2009/01/yelling-at-your-hard-drive.html

> > Anyone know how to significantly increase the performance of a zfs
> > filesystem without causing any downtime to an Enterprise email system
> > used by 30,000 intolerant people, when you don't really know what is
> > causing the performance issues in the first place? (Yeah, it sucks to
> > be me!)
>
> Hopefully I've helped find a couple places to look...

Thanx
-- Ed
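The per-file loop Mike described earlier (walk the directory, back up only files modified since the last run, one open/read/close at a time) can be sketched as a runnable shell script. The source tree and timestamp file below are throwaway stand-ins, not anything from the thread; the point is that each file completes before the next begins, so per-file latency, not bandwidth, bounds the stream.

```shell
# Hypothetical stand-ins for a mail spool and the last-backup marker.
SRC=$(mktemp -d)
LAST_BACKUP=$(mktemp)
echo "old" > "$SRC/old.msg"
sleep 1
touch "$LAST_BACKUP"          # pretend a backup ran here
sleep 1
echo "new" > "$SRC/new.msg"   # arrives after the last backup

# Only files newer than the marker are streamed; each file is opened,
# read, and closed before the next one starts.
count=0
for f in "$SRC"/*; do
    if [ "$f" -nt "$LAST_BACKUP" ]; then
        cat "$f" > /dev/null  # stand-in for "send to backup stream"
        count=$((count + 1))
    fi
done
echo "$count"                 # number of files picked up this run
```

Running it prints 1: only new.msg postdates the marker, which is exactly the skip-and-stream pattern that leaves the disks idle between small files.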
On Sat, 2009-08-08 at 17:25, Mike Gerdts wrote:

> ndd -get /dev/tcp tcp_xmit_hiwat
> ndd -get /dev/tcp tcp_recv_hiwat
> grep tcp-nodelay /kernel/drv/iscsi.conf

# ndd -get /dev/tcp tcp_xmit_hiwat
2097152
# ndd -get /dev/tcp tcp_recv_hiwat
2097152
# grep tcp-nodelay /kernel/drv/iscsi.conf
#

> While backups are running (which is probably all the time given the
> backup rate....)
>
> # look at service times
> iostat -xzn 10

Oh crap. Looks like there are no backup jobs running right now. It must have just ended.

> # is networker cpu bound?

No. The server is barely tasked by either the email system or networker.

> prstat -mL
> Some indication of how many backup jobs run concurrently would
> probably help frame any future discussion.

I'll get more info on the backups next week when the full backups run.

-- Ed
On Aug 8, 2009, at 5:02 AM, Ed Spencer wrote:

> On Fri, 2009-08-07 at 19:33, Richard Elling wrote:
>
>> This is very unlikely to be a "fragmentation problem." It is a
>> scalability problem and there may be something you can do about it
>> in the short term.
>
> You could be right.
>
> Our test mail server consists of the exact same design, same hardware
> (SUN4V) but in a smaller configuration (less memory and 4 x 25g san
> luns), and has a backup/copy throughput of 30GB/hour. Data used for
> testing was "copied" from our production mail server.
>
>>> Adding another pool and copying all/some data over to it would only
>>> be a short term solution.
>>
>> I'll have to disagree.
>
> What is the point of a filesystem that can grow to such a huge size
> and not have functionality built in to optimize data layout? Real
> world implementations of filesystems that are intended to live for
> years/decades need this functionality, don't they?
>
> Our mail system works well, only the backup doesn't perform well.
> All the features of ZFS that make reads perform well (prefetch, ARC)
> have little effect.

The best workload is one that doesn't read from disk to begin with :-)

For workloads with millions of files (eg large-scale mail servers) you will need to increase the size of the Directory Name Lookup Cache (DNLC). By default, it is way too small for such workloads. If the directory names are in cache, then they do not have to be read from disk -- a big win.

You can see how well the DNLC is working by looking at the output of "vmstat -s" and looking for the "total name lookups." You can size the DNLC by tuning the ncsize parameter, but it requires a reboot. See the Solaris Tunable Parameters Guide for details. http://docs.sun.com/app/docs/doc/817-0404/chapter2-35?a=view

I'd like to revisit the backup problem, but that is much more complicated and probably won't fit in a mail thread very easily (hence, the white paper :-)

-- richard
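Richard's "vmstat -s" check can be scripted. The sample line below is fabricated for illustration (the real numbers come from a live Solaris box, where you would pipe `vmstat -s` in instead); the awk just pulls the cache-hit percentage out of the "total name lookups" line.

```shell
# Fabricated sample in the shape "vmstat -s" prints on Solaris.
sample='  9123456 total name lookups (cache hits 92%)'

# Extract the percentage from the "(cache hits NN%)" tail.
hits=$(printf '%s\n' "$sample" | awk '/total name lookups/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /%/) { gsub(/[^0-9]/, "", $i); print $i }
}')
echo "DNLC hit rate: ${hits}%"
```

On a mail server with millions of files, a low hit rate here is the signal Richard describes: directory names falling out of the DNLC and being re-read from disk, which ncsize tuning addresses.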
On Sat, 8 Aug 2009, Ed Spencer wrote:> r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 11.9 43.0 528.9 1972.8 0.0 2.1 0.0 38.9 0 31 > c4t60A98000433469764E4A2D456A644A74d0 > 17.0 19.6 496.9 1499.0 0.0 1.4 0.0 38.8 0 39 > c4t60A98000433469764E4A2D456A696579d0 > 14.0 30.0 670.2 1971.3 0.0 1.7 0.0 38.0 0 34 > c4t60A98000433469764E4A476D2F664E4Fd0 > 19.7 28.7 985.2 1647.6 0.0 1.6 0.0 32.5 0 37 > c4t60A98000433469764E4A476D2F6B385Ad0I have this in my /etc/system file: * Set device I/O maximum concurrency * http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device_I.2FO_Queue_Size_.28I.2FO_Concurrency.29 set zfs:zfs_vdev_max_pending = 5 This parameter may be worthwhile to look at to reduce your asvc_t. It seems that the default (35) is tuned for a true JBOD setup and not a SAN-hosted LUN. As I recall, you can use the kernel debugger to set it while the system is running and immediately see differences in iostat output. Bob -- Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
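Bob's kernel-debugger route looks roughly like the following (Solaris only, run as root). This is a sketch of a tuning action, not something to paste blindly: `mdb -kw` writes into a running kernel, so verify against the ZFS Evil Tuning Guide first.

```
# Check the current value (/D prints it in decimal):
echo zfs_vdev_max_pending/D | mdb -k

# Set it to 5 on the running system (0t5 is decimal 5 in mdb notation),
# then watch asvc_t in "iostat -xzn" for the effect:
echo zfs_vdev_max_pending/W0t5 | mdb -kw
```

The `set zfs:zfs_vdev_max_pending = 5` line Bob shows in /etc/system makes the same change persistent across reboots.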
On Sat, Aug 8, 2009 at 20:20, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:

> On Sat, 2009-08-08 at 08:14, Mattias Pantzare wrote:
>
>> Your scalability problem may be in your backup solution.
>
> We've eliminated the backup system as being involved with the
> performance issues.
>
> The servers are Solaris 10 with the OS on UFS filesystems. (In zfs
> terms, the pool is old/mature). Solaris has been patched to a fairly
> current level.
>
> Copying data from the zfs filesystem to the local ufs filesystem enjoys
> the same throughput as the backup system.
>
> The test was simple. Create a test filesystem on the zfs pool. Restore
> production email data to it. Reboot the server. Back up the data (29
> minutes for 15.8 gig of data). Reboot the server. Copy data from zfs
> to ufs using a 'cp -pr ...' command, which also took 29 minutes.

Yes, that was expected. What happens if you run two cp -pr at the same time? I am guessing that two cp will take almost the same time as one.

If you get twice the performance from two cp then you will get twice the performance from doing two backups in parallel.
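Mattias's experiment (time one stream, then two concurrent streams, and compare) can be scripted. The directories below are tiny throwaway stand-ins for the real filesystems; on real data, a latency-bound pool should take about as long for the two-stream run as for the one-stream run, i.e. aggregate throughput roughly doubles.

```shell
# Throwaway source trees and destination (stand-ins for real data).
SRC1=$(mktemp -d); SRC2=$(mktemp -d); DST=$(mktemp -d)
echo a > "$SRC1/f"; echo b > "$SRC2/f"

# One stream.
time cp -pr "$SRC1" "$DST/one"

# Two concurrent streams; "wait" blocks until both cp jobs finish, so
# the reported time covers the slower of the two.
time sh -c "cp -pr '$SRC1' '$DST/two-a' & cp -pr '$SRC2' '$DST/two-b' & wait"
```

If the two-stream wall-clock time is close to the one-stream time, adding backup streams is the cheap win; if it doubles, the bottleneck is bandwidth, not per-file latency.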
I've come up with a better name for the concept of file and directory fragmentation, which is "Filesystem Entropy": where, over time, an active and volatile filesystem moves from an organized state to a disorganized state, resulting in backup difficulties.

Here are some stats which illustrate the issue:

First the development mail server:
=================================
(Jumbo frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to 2097152)

Small file workload (copy from zfs on iscsi network to local ufs filesystem)

# zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      3      0   247K  59.7K
space       70.5G  29.0G    136      0  8.37M      0
space       70.5G  29.0G    115      0  6.31M      0
space       70.5G  29.0G    108      0  7.08M      0
space       70.5G  29.0G    105      0  3.72M      0
space       70.5G  29.0G    135      0  3.74M      0
space       70.5G  29.0G    155      0  6.09M      0
space       70.5G  29.0G    193      0  4.85M      0
space       70.5G  29.0G    142      0  5.73M      0
space       70.5G  29.0G    159      0  7.87M      0

Large file workload (cd and dvd iso's)

# zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      3      0   224K  59.8K
space       70.5G  29.0G    462      0  57.8M      0
space       70.5G  29.0G    427      0  53.5M      0
space       70.5G  29.0G    406      0  50.8M      0
space       70.5G  29.0G    430      0  53.8M      0
space       70.5G  29.0G    382      0  47.9M      0

The production mail server:
==========================
Mail system is running with 790 imap users logged in (low imap work load). Two backup streams are running. Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set to 2097152 - we've never seen any effect from changing the iscsi transport parameters under this small file workload.
# zpool iostat 10
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       1.06T   955G     96     69  5.20M  2.69M
space       1.06T   955G    175    105  8.96M  2.22M
space       1.06T   955G    182     16  4.47M   546K
space       1.06T   955G    170     16  4.82M  1.85M
space       1.06T   955G    145    159  4.23M  3.19M
space       1.06T   955G    138     15  4.97M  92.7K
space       1.06T   955G    134     15  3.82M  1.71M
space       1.06T   955G    109    123  3.07M  3.08M
space       1.06T   955G    106     11  3.07M  1.34M
space       1.06T   955G    120     17  3.69M  1.74M

# prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 12438 root      12 6.9 0.0 0.0 0.0 0.0  81 0.1 508  84  4K   0 save/1
 27399 cyrus     15 0.5 0.0 0.0 0.0 0.0  85 0.0  18  10 297   0 imapd/1
 20230 root     3.9 8.0 0.0 0.0 0.0 0.0  88 0.1 393  33  2K   0 save/1
 25913 root     0.5 3.3 0.0 0.0 0.0 0.0  96 0.0  22   2  1K   0 prstat/1
 20495 cyrus    1.1 0.2 0.0 0.0 0.5 0.0  98 0.0  14   3 191   0 imapd/1
  1051 cyrus    1.2 0.0 0.0 0.0 0.0 0.0  99 0.0  19   1  80   0 master/1
 24350 cyrus    0.5 0.5 0.0 0.0 1.4 0.0  98 0.0  57   1 484   0 lmtpd/1
 22645 cyrus    0.6 0.3 0.0 0.0 0.0 0.0  99 0.0  53   1 603   0 imapd/1
 24904 cyrus    0.3 0.4 0.0 0.0 0.0 0.0  99 0.0  66   0 863   0 imapd/1
 18139 cyrus    0.3 0.2 0.0 0.0 0.0 0.0  99 0.0  24   0 195   0 imapd/1
 21459 cyrus    0.2 0.3 0.0 0.0 0.0 0.0  99 0.0  54   0 635   0 imapd/1
 24891 cyrus    0.3 0.3 0.0 0.0 0.9 0.0  99 0.0  28   0 259   0 lmtpd/1
   388 root     0.2 0.3 0.0 0.0 0.0 0.0 100 0.0   1   1  48   0 in.routed/1
 21643 cyrus    0.2 0.3 0.0 0.0 0.2 0.0  99 0.0  49   7 540   0 imapd/1
 18684 cyrus    0.2 0.3 0.0 0.0 0.0 0.0 100 0.0  48   1 544   0 imapd/1
 25398 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 466   0 pop3d/1
 23724 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  47   0 540   0 imapd/1
 24909 cyrus    0.1 0.2 0.0 0.0 0.2 0.0  99 0.0  25   1 251   0 lmtpd/1
 16317 cyrus    0.2 0.2 0.0 0.0 0.0 0.0 100 0.0  37   1 495   0 imapd/1
 28243 cyrus    0.1 0.3 0.0 0.0 0.0 0.0 100 0.0  32   0 289   0 imapd/1
 20097 cyrus    0.1 0.2 0.0 0.0 0.3 0.0  99 0.0  26   5 253   0 lmtpd/1
Total: 893 processes, 1125 lwps, load averages: 1.14, 1.16, 1.16

-- Ed
At a first glance, your production server's numbers are looking fairly similar to the "small file workload" results of your development server. I thought you were saying that the development server has faster performance?

Alex.

On Tue, Aug 11, 2009 at 1:33 PM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:

> I've come up with a better name for the concept of file and directory
> fragmentation which is, "Filesystem Entropy". Where, over time, an
> active and volatile filesystem moves from an organized state to a
> disorganized state resulting in backup difficulties.
>
> [stats snipped]

-- Ted Turner - "Sports is like a war without the killing." - http://www.brainyquote.com/quotes/authors/t/ted_turner.html
On Tue, Aug 11, 2009 at 7:33 AM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:> I''ve come up with a better name for the concept of file and directory > fragmentation which is, "Filesystem Entropy". Where, over time, an > active and volitile filesystem moves from an organized state to a > disorganized state resulting in backup difficulties. > > Here are some stats which illustrate the issue: > > First the development mail server: > =================================> (Jump frames, Nagle disabled and tcp_xmit_hiwat,tcp_recv_hiwat set to > 2097152) > > Small file workload (copy from zfs on iscsi network to local ufs > filesystem) > # zpool iostat 10 > capacity operations bandwidth > pool used avail read write read write > ---------- ----- ----- ----- ----- ----- ----- > space 70.5G 29.0G 3 0 247K 59.7K > space 70.5G 29.0G 136 0 8.37M 0 > space 70.5G 29.0G 115 0 6.31M 0 > space 70.5G 29.0G 108 0 7.08M 0 > space 70.5G 29.0G 105 0 3.72M 0 > space 70.5G 29.0G 135 0 3.74M 0 > space 70.5G 29.0G 155 0 6.09M 0 > space 70.5G 29.0G 193 0 4.85M 0 > space 70.5G 29.0G 142 0 5.73M 0 > space 70.5G 29.0G 159 0 7.87M 0So you are averaging about 6 MB/s on a small file workload. The average read size was about 44 KB. This throughput could be limited by the file creation rate on UFS. Perhaps a better command to use to judge of how fast a single stream can read is "tar cf /dev/null $dir".> Large File workload (cd and dvd iso''s) > # zpool iostat 10 > capacity operations bandwidth > pool used avail read write read write > ---------- ----- ----- ----- ----- ----- ----- > space 70.5G 29.0G 3 0 224K 59.8K > space 70.5G 29.0G 462 0 57.8M 0 > space 70.5G 29.0G 427 0 53.5M 0 > space 70.5G 29.0G 406 0 50.8M 0 > space 70.5G 29.0G 430 0 53.8M 0 > space 70.5G 29.0G 382 0 47.9M 0Here the average throughput was about 53 MB/s, with the average read size at 128 KB. Note that 128 KB is not only the largest block size that ZFS supports, it is also the default value of maxphys. 
Tuning maxphys to 1 MB may give you a performance boost, so long as the files are contiguous. Unless the files were trickled in very slowly with a lot of other IO at the same time, they are probably mostly contiguous. 1 Gbit links, they are at about 25% capacity - good. I assume you have similar load balancing at the NetApp side too. In a previous message you said that this server was seeing better backup throughput than the production server. How does the mixture of large files vs. small files compare on the two systems?> The production mail server: > ==========================> Mail system is running with 790 imap users logged in (low imap work > load). > Two backup streams are running. > Not using jumbo frames, nagle enabled, tcp_xmit_hiwat,tcp_recv_hiwat set > to 2097152 > - we''ve never seen any effect of changing the iscsi transport > parameters > under this small file workload. > > # zpool iostat 10 > capacity operations bandwidth > pool used avail read write read write > ---------- ----- ----- ----- ----- ----- ----- > space 1.06T 955G 96 69 5.20M 2.69M > space 1.06T 955G 175 105 8.96M 2.22M > space 1.06T 955G 182 16 4.47M 546K > space 1.06T 955G 170 16 4.82M 1.85M > space 1.06T 955G 145 159 4.23M 3.19M > space 1.06T 955G 138 15 4.97M 92.7K > space 1.06T 955G 134 15 3.82M 1.71M > space 1.06T 955G 109 123 3.07M 3.08M > space 1.06T 955G 106 11 3.07M 1.34M > space 1.06T 955G 120 17 3.69M 1.74MHere your average read throughput is about 4.6 MB/s with an average read size of 47 KB. That looks a lot like the simulation in the non-production environment. I would guess that the average message size is somewhere in the 40 - 50 KB range.> > # prstat -mL > PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG > PROCESS/LWPID > 12438 root 12 6.9 0.0 0.0 0.0 0.0 81 0.1 508 84 4K 0 save/1 > 27399 cyrus 15 0.5 0.0 0.0 0.0 0.0 85 0.0 18 10 297 0 imapd/1 > 20230 root 3.9 8.0 0.0 0.0 0.0 0.0 88 0.1 393 33 2K 0 save/1[snip] The "save" process is from Networker, right? 
These process do not look CPU bound (less than 20% on CPU). In a previous message you showed iostat data at a time when backups weren''t running. I''ve reproduced below, removing the device column for sake of formatting.> iostat -xpn 5 5 > extended device statistics > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > > 17.1 29.2 746.7 317.1 0.0 0.6 0.0 12.5 0 27 > 25.0 11.9 991.9 277.0 0.0 0.6 0.0 16.1 0 36 > 14.9 17.9 423.0 406.4 0.0 0.3 0.0 10.2 0 21 > 20.8 17.4 588.9 361.2 0.0 0.4 0.0 11.5 0 30 > > and: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 11.9 43.0 528.9 1972.8 0.0 2.1 0.0 38.9 0 31 > 17.0 19.6 496.9 1499.0 0.0 1.4 0.0 38.8 0 39 > 14.0 30.0 670.2 1971.3 0.0 1.7 0.0 38.0 0 34 > 19.7 28.7 985.2 1647.6 0.0 1.6 0.0 32.5 0 37 > and: > r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device > 22.7 41.3 973.7 423.5 0.0 0.8 0.0 11.8 0 34 > 27.9 20.0 1474.7 344.0 0.0 0.8 0.0 16.7 0 42 > 15.1 17.9 1318.7 463.7 0.0 0.6 0.0 17.7 0 19 > 22.3 19.5 1801.7 406.7 0.0 0.8 0.0 20.0 0 29Service times are in the 10 - 39 ms range. In the middle set, it looks like there is some heavier than normal write activity (not more writes, just bigger writes) and this seems to impact asvc_t. Let''s look back at something I said the other day... | Have you looked at iostat data to be sure that you are seeing asvc_t | + wsvc_t that supports the number of operations that you need to | perform? That is if asvc_t + wsvc_t for a device adds up to 10 ms, | a workload that waits for the completion of one I/O before issuing | the next will max out at 100 iops. Presumably ZFS should hide some | of this from you[1], but it does suggest that each backup stream | would be limited to about 100 files per second[2]. This is because | the read request for one file does not happen before the close of | the previous file[3]. Since cyrus stores each message as a separate | file, this suggests that 2.5 MB/s corresponds to average mail | message size of 25 KB. 
It seems reasonable based on the iostat data to say that the typical asvc_t is no better than 15 ms. Since the IO for one file does not start until the previous one completed, we can get no more than: 1000 ms/sec ----------- = 67 sequential operations per second 15 ms/io By "sequential" I mean that one doesn''t start until the other finishes. There is certainly a better word, but it escapes me at the moment. At an average file size of 45 KB, that translates to about 3 MB/sec. As you run two data streams, you are seeing throughput that looks kinda like the 2 * 3 MB/sec. With 4 backup streams do you get something that looks like 4 * 3 MB/s? How does that effect iostat output? -- Mike Gerdts http://mgerdts.blogspot.com/
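Mike's back-of-envelope math can be checked directly. The 15 ms per-file service time and 45 KB average message size are his assumptions from the thread; note that integer division gives 66 files/sec where Mike rounds to about 67.

```shell
svc_ms=15    # assumed per-file round trip (asvc_t), from the thread
avg_kb=45    # assumed average message size, from the thread

# Max files/sec when each file's I/O waits for the previous one to
# complete, and the resulting single-stream throughput in MB/s.
ops=$((1000 / svc_ms))
mbps=$(awk -v o="$ops" -v k="$avg_kb" 'BEGIN { printf "%.1f", o * k / 1024 }')
echo "$ops files/sec, ~$mbps MB/s per backup stream"
```

This prints "66 files/sec, ~2.9 MB/s per backup stream", which matches the roughly 3 MB/s per stream observed on the production pool and shows why adding streams, not bandwidth, is what scales.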
On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:

> At a first glance, your production server's numbers are looking fairly
> similar to the "small file workload" results of your development
> server.
>
> I thought you were saying that the development server has faster
> performance?

The development server was running only one cp -pr command.

The production mail server was running two concurrent backup jobs and of course the mail system, with each job having the same performance throughput as if there were a single job running. The single threaded backup jobs do not conflict with each other over performance. If we ran 20 concurrent backup jobs, overall performance would scale up quite a bit (I would guess between 5 and 10 times the performance). (I just read Mike's post and will do some 'concurrency' testing.)

Users are currently evenly distributed over 5 filesystems (I previously mentioned 7 but it's really 5 filesystems for users and 1 for system data, totalling 6, and one test filesystem). We backup 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on Saturday. We backup to disk and then clone to tape. Our backup people can only handle doing 2 filesystems per night.

Creating more filesystems to increase the parallelism of our backup is one solution, but it's a major redesign of the mail system. Adding a second server to halve the pool and thereby halve the problem is another solution (and we would also create more filesystems at the same time). Moving the pool to a FC SAN or a JBOD may also increase performance (fewer layers than those introduced by the appliance, thereby increasing performance).

I suspect that if we 'rsync' one of these filesystems to a second server/pool, we would also see a performance increase equal to what we see on the development server. (I don't know how zfs send and receive work, so I don't know if they would address this "Filesystem Entropy" or specifically reorganize the files and directories.)
However, when we created a testfs filesystem in the zfs pool on the production server, and copied data to it, we saw the same performance as the other filesystems, in the same pool. We will have to do something to address the problem. A combination of what I just listed is our probable course of action. (Much testing will have to be done to ensure our solution will address the problem because we are not 100% sure what is the cause of performance degradation). I''m also dealing with Network Appliance to see if there is anything we can do at the filer end to increase performance. But I''m holding out little hope. But please, don''t miss the point I''m trying to make. ZFS would benefit from a utility or a background process that would reorganize files and directories in the pool to optimize performance. A utility to deal with Filesystem Entropy. Currently a zfs pool will live as long as the lifetime of the disks that it is on, without reorganization. This can be a long long time. Not to mention slowly expanding the pool over time contributes to the issue. -- Ed
On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:

> On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:
>> At a first glance, your production server's numbers are looking fairly
>> similar to the "small file workload" results of your development
>> server.
>>
>> I thought you were saying that the development server has faster
>> performance?
>
> The development server was running only one cp -pr command.
>
> The production mail server was running two concurrent backup jobs and of
> course the mail system, with each job having the same performance
> throughput as if there were a single job running. The single-threaded
> backup jobs do not conflict with each other over performance.

Agree.

> If we ran 20 concurrent backup jobs, overall performance would scale up
> quite a bit. (I would guess between 5 and 10 times the performance.) (I
> just read Mike's post and will do some 'concurrency' testing.)

Yes.

> Users are currently evenly distributed over 5 filesystems (I previously
> mentioned 7 but it's really 5 filesystems for users and 1 for system
> data, totalling 6, plus one test filesystem).
>
> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
> Saturday. We back up to disk and then clone to tape. Our backup people
> can only handle doing 2 filesystems per night.
>
> Creating more filesystems to increase the parallelism of our backup is
> one solution but it's a major redesign of the mail system.

Really? I presume this is because of the way you originally allocated accounts to file systems. Creating file systems in ZFS is easy, so could you explain in a new thread?

> Adding a second server to halve the pool and thereby halve the problem is
> another solution (and we would also create more filesystems at the same
> time).

I'm not convinced this is a good idea. It is a lot of work based on the assumption that the server is the bottleneck.

> Moving the pool to an FC SAN or a JBOD may also increase performance.
> (Fewer layers, introduced by the appliance, thereby increasing
> performance.)

Disagree.

> I suspect that if we 'rsync' one of these filesystems to a second
> server/pool that we would also see a performance increase equal to what
> we see on the development server. (I don't know how zfs send and receive
> work so I don't know if they would address this "Filesystem Entropy" or
> specifically reorganize the files and directories.) However, when we
> created a testfs filesystem in the zfs pool on the production server,
> and copied data to it, we saw the same performance as the other
> filesystems, in the same pool.

Directory walkers, like NetBackup or rsync, will not scale well as the number of files increases. It doesn't matter what file system you use; the scalability will look more-or-less similar. For millions of files, ZFS send/receive works much better. More details are in my paper.

> We will have to do something to address the problem. A combination of
> what I just listed is our probable course of action. (Much testing will
> have to be done to ensure our solution will address the problem because
> we are not 100% sure what is the cause of performance degradation.) I'm
> also dealing with Network Appliance to see if there is anything we can
> do at the filer end to increase performance. But I'm holding out little
> hope.

DNLC hit rate? Also, is atime on?

> But please, don't miss the point I'm trying to make. ZFS would benefit
> from a utility or a background process that would reorganize files and
> directories in the pool to optimize performance. A utility to deal with
> Filesystem Entropy. Currently a zfs pool will live as long as the
> lifetime of the disks that it is on, without reorganization. This can be
> a long, long time. Not to mention slowly expanding the pool over time
> contributes to the issue.

This does not come "for free" in either performance or risk. It will do nothing to solve the directory walker's problem.

NB, people who use UFS don't tend to see this because UFS can't handle millions of files.
-- richard
On Tue, August 11, 2009 10:39, Ed Spencer wrote:

> I suspect that if we 'rsync' one of these filesystems to a second
> server/pool that we would also see a performance increase equal to what
> we see on the development server. (I don't know how zfs send and receive

Rsync has to traverse the entire directory tree to stat() every file to see if it's changed (and if it has, it then computes which parts of the file have been updated). Zfs send/recv, however, works at a lower level and doesn't go to each file: it can simply compare which file system blocks have changed.

So you would create a snapshot on the ZFS file system(s) of interest and send it over to wherever you want to replicate it. Later on you would create another snapshot and, with the incremental ("-i") option in zfs(1M), you could then transfer only the blocks of data that were changed since the first snapshot. ZFS will be able to figure out the block differences without having to touch every file.

Two pretty good explanations at:

http://www.markround.com/archives/38-ZFS-Replication.html
http://www.cuddletech.com/blog/pivot/entry.php?id=984

> work so I don't know if it would address this "Filesystem Entropy" or
> specifically reorganize the files and directories). However, when we
> created a testfs filesystem in the zfs pool on the production server,
> and copied data to it, we saw the same performance as the other
> filesystems, in the same pool.

Not surprising, since any file systems on any particular pool would be using the same spindles. If you want different I/O characteristics you'd need a different pool with different spindles.

> We will have to do something to address the problem. A combination of
> what I just listed is our probable course of action. (Much testing will
> have to be done to ensure our solution will address the problem because
> we are not 100% sure what is the cause of performance degradation.) I'm

Don't forget about the DTrace Toolkit, as it has many handy scripts for digging into various performance characteristics:

http://www.brendangregg.com/dtrace.html

> But please, don't miss the point I'm trying to make. ZFS would benefit
> from a utility or a background process that would reorganize files and
> directories in the pool to optimize performance. A utility to deal with

If you have a Sun support contract, call them up and ask for this enhancement. If there are enough people asking for it, the ZFS team will add it. Talking on the list is one thing, but if there's no "official" paper trail in Sun's database, it won't get the attention it may deserve.
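[Editorial note: the snapshot/incremental procedure described above can be sketched as follows. The pool, filesystem, and host names are placeholders, not details from the thread.]

```shell
# Initial replication: snapshot the filesystem and send the full stream.
# (space/mail1 and backuphost are hypothetical names for illustration.)
zfs snapshot space/mail1@mon
zfs send space/mail1@mon | ssh backuphost zfs receive -F backup/mail1

# Later: take a second snapshot and send only the blocks that changed
# between the two snapshots -- no per-file stat() walk is needed.
zfs snapshot space/mail1@tue
zfs send -i space/mail1@mon space/mail1@tue | \
    ssh backuphost zfs receive backup/mail1
```

The -F on the first receive forces the target to roll back to its most recent snapshot before applying the stream; see zfs(1M) for the details.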
On Tue, 2009-08-11 at 08:04 -0700, Richard Elling wrote:
> On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
>> I suspect that if we 'rsync' one of these filesystems to a second
>> server/pool that we would also see a performance increase equal to what
>> we see on the development server. (I don't know how zfs send and receive
>> work so I don't know if they would address this "Filesystem Entropy" or
>> specifically reorganize the files and directories). However, when we
>> created a testfs filesystem in the zfs pool on the production server,
>> and copied data to it, we saw the same performance as the other
>> filesystems, in the same pool.
>
> Directory walkers, like NetBackup or rsync, will not scale well as
> the number of files increases. It doesn't matter what file system you
> use; the scalability will look more-or-less similar. For millions of
> files, ZFS send/receive works much better. More details are in my paper.

Is there a link to this paper available?

-- 
Louis-Frédéric Feuillette <jebnor at gmail.com>
Richard Elling wrote:
> On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
>
>> On Tue, 2009-08-11 at 07:58, Alex Lam S.L. wrote:
>>> At a first glance, your production server's numbers are looking fairly
>>> similar to the "small file workload" results of your development
>>> server.
>>>
>>> I thought you were saying that the development server has faster
>>> performance?
>>
>> The development server was running only one cp -pr command.
>>
>> The production mail server was running two concurrent backup jobs and of
>> course the mail system, with each job having the same performance
>> throughput as if there were a single job running. The single-threaded
>> backup jobs do not conflict with each other over performance.
>
> Agree.
>
>> If we ran 20 concurrent backup jobs, overall performance would scale up
>> quite a bit. (I would guess between 5 and 10 times the performance.) (I
>> just read Mike's post and will do some 'concurrency' testing.)
>
> Yes.
>
>> Users are currently evenly distributed over 5 filesystems (I previously
>> mentioned 7 but it's really 5 filesystems for users and 1 for system
>> data, totalling 6, plus one test filesystem).
>>
>> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
>> Saturday. We back up to disk and then clone to tape. Our backup people
>> can only handle doing 2 filesystems per night.
>>
>> Creating more filesystems to increase the parallelism of our backup is
>> one solution but it's a major redesign of the mail system.
>
> Really? I presume this is because of the way you originally
> allocated accounts to file systems. Creating file systems in ZFS is
> easy, so could you explain in a new thread?

Ed,

This would be a good idea. This issue has been discussed many times on the iMS mailing list for the Sun Messaging Server, which, as far as the way it stores messages on disk goes, is very similar to Cyrus (in fact I think it was once based on the same code base).

The upshot of it is what has been explained by Mike: these types of store create millions of little files that NetBackup or Legato must walk over and back up one after another, sequentially. This does not scale very well at all, for the reasons explained by Mike.

The issue commonly discussed on the iMS list has been one of file system size. In general, the rule of thumb most people had for this was around 100 to 250 GB per file system, and lots of them, mostly to increase the parallelism in the backup process rather than for performance gains in the actual functioning of the application.

As a rule of thumb, I group my large users, who have large mailboxes (which in turn have lots of large attachments), into particular larger file systems; students, who have small quotas and generally lots of small messages (small files, in this case), go into other smaller file systems. In this case, one size really does not fit all.

To keep backups within the time allocation, a bit of filesystem monitoring is useful. In the days of UFS I used to use a command like this to help make decisions:

[root at xxx]#> df -F ufs -o i
Filesystem             iused     ifree  %iused  Mounted on
/dev/md/dsk/d0        605765   6674235      8%  /
/dev/md/dsk/d50      2387509  28198091      8%  /mail1
/dev/md/dsk/d70      2090768  30669232      6%  /mail3
/dev/md/dsk/d60      2447548  30312452      7%  /mail2

I used this to balance the inodes. My guess is that around 85-90% of the inodes in a messaging server store are files, with the remainder directories. Either way, it is a simple way to make sure the stores are reasonably balanced. I am sure there will be a good way to do this for ZFS?

>> Adding a second server to halve the pool and thereby halve the problem is
>> another solution (and we would also create more filesystems at the same
>> time).

It can be a good idea, but it really depends on how many file systems you split your message stores into. Also good for relocating message stores to if the first server fails.
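[Editorial note: the df -o i balance check above can be scripted; the awk below computes the exact inode utilisation per store from the sample figures quoted in this thread. For ZFS, where inodes are allocated dynamically, the equivalent balance check is usually done on space with zfs list instead.]

```shell
# Compute % of inodes used per filesystem from `df -o i` style output;
# the here-document reuses the sample figures quoted above
awk 'NR > 1 {
    pct = 100 * $2 / ($2 + $3)          # iused / (iused + ifree)
    printf "%-8s %5.1f%% inodes used\n", $5, pct
}' <<'EOF'
Filesystem iused ifree %iused Mounted on
/dev/md/dsk/d50 2387509 28198091 8% /mail1
/dev/md/dsk/d70 2090768 30669232 6% /mail3
/dev/md/dsk/d60 2447548 30312452 7% /mail2
EOF
```

This reports /mail1 at 7.8%, /mail3 at 6.4%, and /mail2 at 7.5%, i.e. reasonably balanced stores.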
This of course depends on your message store architecture. Easy to do with Sun Messaging; not so sure about Cyrus. But I did once run a Simeon message server for a university in London, and that was based on Cyrus and was pretty similar from recollection.

> I'm not convinced this is a good idea. It is a lot of work based on
> the assumption that the server is the bottleneck.
>
>> Moving the pool to an FC SAN or a JBOD may also increase performance.
>> (Fewer layers, introduced by the appliance, thereby increasing
>> performance.)
>
> Disagree.
>
>> I suspect that if we 'rsync' one of these filesystems to a second
>> server/pool that we would also see a performance increase equal to what
>> we see on the development server. (I don't know how zfs send and receive
>> work so I don't know if they would address this "Filesystem Entropy" or
>> specifically reorganize the files and directories.) However, when we
>> created a testfs filesystem in the zfs pool on the production server,
>> and copied data to it, we saw the same performance as the other
>> filesystems, in the same pool.
>
> Directory walkers, like NetBackup or rsync, will not scale well as
> the number of files increases. It doesn't matter what file system you
> use; the scalability will look more-or-less similar. For millions of
> files, ZFS send/receive works much better. More details are in my paper.

I look forward to reading this, Richard. I think it will be an interesting read for members of this list.

>> We will have to do something to address the problem. A combination of
>> what I just listed is our probable course of action. (Much testing will
>> have to be done to ensure our solution will address the problem because
>> we are not 100% sure what is the cause of performance degradation.) I'm
>> also dealing with Network Appliance to see if there is anything we can
>> do at the filer end to increase performance. But I'm holding out little
>> hope.
>
> DNLC hit rate?
> Also, is atime on?

Turning atime off may make a big difference for you. It certainly does for Sun Messaging Server. Maybe worth doing and reposting results?

>> But please, don't miss the point I'm trying to make. ZFS would benefit
>> from a utility or a background process that would reorganize files and
>> directories in the pool to optimize performance. A utility to deal with
>> Filesystem Entropy. Currently a zfs pool will live as long as the
>> lifetime of the disks that it is on, without reorganization. This can be
>> a long, long time. Not to mention slowly expanding the pool over time
>> contributes to the issue.
>
> This does not come "for free" in either performance or risk. It will
> do nothing to solve the directory walker's problem.

Agree. It will have little bearing on the outcome, for the reasons you mention.

> NB, people who use UFS don't tend to see this because UFS can't
> handle millions of files.

It can, but only if you have file systems smaller than about 1 TB, which is not large by ZFS standards. They do work, but with the same performance issue for directory-walker backups. Heaven help you in fsck'ing them after a system crash. Hours and hours.

> -- richard
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

-- 
_______________________________________________________________________

Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone : +64 09 968 7611
Fax : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:scott at manukau.ac.nz
http://www.manukau.ac.nz
________________________________________________________________________

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
________________________________________________________________________
Concurrency/Parallelism testing.

I have 6 different filesystems populated with email data on our mail development server. I rebooted the server before beginning the tests. The server is a T2000 (sun4v) machine, so it's ideally suited for this type of testing.

The test was to tar (to /dev/null) each of the filesystems: launch 1, gather stats, launch another, gather stats, etc.

The underlying storage system is a Network Appliance. Our only one. In production. Serving NFS, CIFS and iSCSI. Other work the appliance is doing may affect these tests, and vice versa :) . No one seemed to notice I was running these tests.

After 6 concurrent tar's are running we are probably seeing benefits of the ARC. At certain points I included load averages and traffic stats for each of the iSCSI ethernet interfaces that are configured with MPxIO.

After the first 6 jobs, I launched duplicates of the 6. Then another 6, etc. At the end I included the zfs kernel statistics.

1 job
=====
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G      0      0      0      0
space       70.5G  29.0G     19      0  1.04M      0
space       70.5G  29.0G    268      0  8.71M      0
space       70.5G  29.0G    196      0  11.3M      0
space       70.5G  29.0G    171      0  11.0M      0
space       70.5G  29.0G    182      0  5.01M      0
space       70.5G  29.0G    273      0  9.71M      0
space       70.5G  29.0G    292      0  8.91M      0
space       70.5G  29.0G    279      0  15.4M      0
space       70.5G  29.0G    219      0  11.3M      0
space       70.5G  29.0G    175      0  8.67M      0

2 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    381      0  23.8M      0
space       70.5G  29.0G    422      0  28.0M      0
space       70.5G  29.0G    386      0  26.5M      0
space       70.5G  29.0G    380      0  22.9M      0
space       70.5G  29.0G    411      0  18.8M      0
space       70.5G  29.0G    393      0  20.7M      0
space       70.5G  29.0G    302      0  15.0M      0
space       70.5G  29.0G    267      0  15.6M      0
space       70.5G  29.0G    304      0  18.7M      0
space       70.5G  29.0G    534      0  19.7M      0
space       70.5G  29.0G    339      0  17.0M      0

3 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    530      0  22.9M      0
space       70.5G  29.0G    428      0  16.3M      0
space       70.5G  29.0G    439      0  16.4M      0
space       70.5G  29.0G    511      0  22.1M      0
space       70.5G  29.0G    464      0  17.9M      0
space       70.5G  29.0G    371      0  12.1M      0
space       70.5G  29.0G    447      0  16.5M      0
space       70.5G  29.0G    379      0  15.5M      0

4 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    434      0  22.0M      0
space       70.5G  29.0G    506      0  29.5M      0
space       70.5G  29.0G    424      0  21.3M      0
space       70.5G  29.0G    643      0  36.0M      0
space       70.5G  29.0G    688      0  31.1M      0
space       70.5G  29.0G    726      0  37.6M      0
space       70.5G  29.0G    652      0  24.8M      0
space       70.5G  29.0G    646      0  33.9M      0

5 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    629      0  31.1M      0
space       70.5G  29.0G    774      0  45.8M      0
space       70.5G  29.0G    815      0  39.8M      0
space       70.5G  29.0G    895      0  44.4M      0
space       70.5G  29.0G    800      0  48.1M      0
space       70.5G  29.0G    857      0  51.8M      0
space       70.5G  29.0G    725      0  47.6M      0

6 jobs
======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    924      0  58.8M      0
space       70.5G  29.0G    767      0  51.8M      0
space       70.5G  29.0G    862      0  48.4M      0
space       70.5G  29.0G    977      0  43.9M      0
space       70.5G  29.0G    954      0  53.7M      0
space       70.5G  29.0G    903      0  48.3M      0

# uptime
  2:19pm  up 15 min(s),  2 users,  load average: 1.44, 1.10, 0.67
26MB (1 minute average) on each iSCSI ethernet port

12 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G    868      0  48.6M      0
space       70.5G  29.0G    903      0  45.3M      0
space       70.5G  29.0G    919      0  52.4M      0
space       70.5G  29.0G  1.20K      0  73.3M      0
space       70.5G  29.0G  1.16K      0  63.3M      0
space       70.5G  29.0G  1.12K      0  71.2M      0
space       70.5G  29.0G  1.29K      0  68.8M      0

# uptime
  2:22pm  up 18 min(s),  2 users,  load average: 1.75, 1.29, 0.80
33MB (1 minute average) on each iSCSI ethernet port

18 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.31K      0  69.3M      0
space       70.5G  29.0G  1.25K      0  74.7M      0
space       70.5G  29.0G  1.23K      0  74.4M      0
space       70.5G  29.0G  1.25K      0  72.1M      0
space       70.5G  29.0G  1.34K      0  75.3M      0
space       70.5G  29.0G  1.31K      0  77.4M      0
space       70.5G  29.0G    892      0  51.8M      0
space       70.5G  29.0G  1.12K      0  69.6M      0

24 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.56K      0  84.5M      0
space       70.5G  29.0G  1.46K      0  86.3M      0
space       70.5G  29.0G  1.43K      0  75.7M      0
space       70.5G  29.0G  1.35K      0  67.6M      0
space       70.5G  29.0G  1.38K      0  72.6M      0
space       70.5G  29.0G  1.14K      0  69.8M      0
space       70.5G  29.0G  1.19K      0  66.4M      0

# uptime
  2:26pm  up 23 min(s),  2 users,  load average: 2.29, 1.89, 1.20
36MB (1 minute average) on each iSCSI ethernet port

30 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.20K      0  63.9M      0
space       70.5G  29.0G  1.76K      0  82.3M      0
space       70.5G  29.0G  1.57K      0  79.8M      0
space       70.5G  29.0G  1.82K      0  96.2M      0
space       70.5G  29.0G  1.81K      0  82.7M      0
space       70.5G  29.0G  1.55K      0  74.9M      0
space       70.5G  29.0G  1.53K      0  77.9M      0
space       70.5G  29.0G  1.50K      0  81.6M      0

# uptime
  2:29pm  up 26 min(s),  2 users,  load average: 2.57, 2.12, 1.39
40MB (1 minute average) on each iSCSI ethernet port

35 jobs
=======
              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
space       70.5G  29.0G  1.41K      0  69.7M      0
space       70.5G  29.0G  1.58K      0  83.0M      0
space       70.5G  29.0G  1.31K      0  69.3M      0
space       70.5G  29.0G  1.53K      0  79.5M      0
space       70.5G  29.0G  1.42K      0  73.7M      0
space       70.5G  29.0G  1.45K      0  71.3M      0

# uptime
  2:34pm  up 30 min(s),  2 users,  load average: 2.70, 2.55, 1.79
45MB (1 minute average) on each iSCSI ethernet port

# kstat zfs
module: zfs                             instance: 0
name:   arcstats                        class:    misc
        c                               4294967296
        c_max                           4294967296
        c_min                           536870912
        crtime                          5674386.62393914
        deleted                         1484966
        demand_data_hits                8323333
        demand_data_misses              1391606
        demand_metadata_hits            1320089
        demand_metadata_misses          83372
        evict_skip                      15986
        hash_chain_max                  10
        hash_chains                     47700
        hash_collisions                 1104590
        hash_elements                   166476
        hash_elements_max               188996
        hdr_size                        29907360
        hits                            10033815
        l2_abort_lowmem                 0
        l2_cksum_bad                    0
        l2_evict_lock_retry             0
        l2_evict_reading                0
        l2_feeds                        0
        l2_free_on_write                0
        l2_hdr_size                     0
        l2_hits                         0
        l2_io_error                     0
        l2_misses                       0
        l2_rw_clash                     0
        l2_size                         0
        l2_writes_done                  0
        l2_writes_error                 0
        l2_writes_hdr_miss              0
        l2_writes_sent                  0
        memory_throttle_count           0
        mfu_ghost_hits                  56647
        mfu_hits                        1963736
        misses                          1735570
        mru_ghost_hits                  27411
        mru_hits                        7715952
        mutex_miss                      82794
        p                               1918981120
        prefetch_data_hits              3017
        prefetch_data_misses            225803
        prefetch_metadata_hits          387376
        prefetch_metadata_misses        34789
        recycle_miss                    171217
        size                            3914208576
        snaptime                        5676565.69946945

module: zfs                             instance: 0
name:   vdev_cache_stats                class:    misc
        crtime                          5674386.6242014
        delegations                     15022
        hits                            38616
        misses                          64786
        snaptime                        5676565.7082284

-- Ed
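[Editorial note: the test procedure Ed describes can be sketched roughly as follows. The filesystem paths and sample counts are illustrative, not taken from the post.]

```shell
# Launch one tar reader per filesystem (paths are illustrative) and
# sample pool throughput between launches, as in the tests above
for fs in /space/fs1 /space/fs2 /space/fs3 /space/fs4 /space/fs5 /space/fs6
do
    tar cf /dev/null "$fs" &     # sequential read load, output discarded
    zpool iostat space 5 10      # ~50 s of 5-second bandwidth samples
done
wait                             # let any still-running tar jobs finish
```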
On Aug 11, 2009, at 1:21 PM, Ed Spencer wrote:

> Concurrency/Parallelism testing.
> I have 6 different filesystems populated with email data on our mail
> development server. I rebooted the server before beginning the tests.
> The server is a T2000 (sun4v) machine, so it's ideally suited for this
> type of testing.
> The test was to tar (to /dev/null) each of the filesystems. Launch 1,
> gather stats, launch another, gather stats, etc.
> The underlying storage system is a Network Appliance. Our only one. In
> production. Serving NFS, CIFS and iSCSI. Other work the appliance is
> doing may affect these tests, and vice versa :) . No one seemed to
> notice I was running these tests.
>
> After 6 concurrent tar's are running we are probably seeing benefits of
> the ARC.
> At certain points I included load averages and traffic stats for each of
> the iSCSI ethernet interfaces that are configured with MPxIO.
>
> After the first 6 jobs, I launched duplicates of the 6. Then another 6,
> etc.

iostat and zpool iostat measure I/O to the disks. fsstat measures I/O to the file system (hence the name ;-). A large discrepancy between the two is another indicator of filesystem caching.

While tar is slightly interesting, I would expect your normal backup workload to show a lot of lookups and attr gets. If these are cached, life will be better.
-- richard
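[Editorial note: Richard's comparison can be run side by side; a minimal sketch, where the pool name, interval, and count are examples rather than details from the thread.]

```shell
# I/O as seen at the filesystem layer vs. I/O actually issued to disk;
# a large gap between the two indicates the ARC is absorbing reads
fsstat zfs 5 12 > fsstat.log &   # per-interval VFS operation counts
zpool iostat space 5 12          # per-interval device-level throughput
wait
```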
On Tue, 2009-08-11 at 14:56, Scott Lawson wrote:
>> Also, is atime on?
> Turning atime off may make a big difference for you. It certainly does
> for Sun Messaging Server.
> Maybe worth doing and reposting results?

Yes. All these results were attained with atime=off. We made that change on all the filesystems this spring.

I'd like to thank everyone who took part in this thread. It's helped us quite a bit. I'll be re-reading this thread a few times to glean additional recommendations I missed the first time, not to mention doing some additional testing. We shall also look at tuning the DNLC on our production server.

I also did some stress testing using tar and large files and saw a sustained read rate of between 100MB and 120MB per second running 5 concurrent tar's.

-- Ed
On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:
> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
> Saturday. We back up to disk and then clone to tape. Our backup people
> can only handle doing 2 filesystems per night.
>
> Creating more filesystems to increase the parallelism of our backup is
> one solution but it's a major redesign of the mail system.

What is magical about a 1:1 mapping of backup job to file system? According to the Networker manual[1], a save set in Networker can be configured to back up certain directories. According to some random documentation about Cyrus[2], mailboxes fall under a pretty predictable hierarchy.

1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html

Assuming that the way your mailboxes get hashed falls into a structure like $fs/b/bigbird and $fs/g/grover (and not just $fs/bigbird and $fs/grover), you should be able to set a save set per top-level directory or per group of a few directories. That is, create a save set for $fs/a, $fs/b, etc., or $fs/a - $fs/d, $fs/e - $fs/h, etc. If you are able to create many smaller save sets and turn the parallelism up, you should be able to drive more throughput.

I wouldn't get too worried about ensuring that they all start at the same time[3], but it would probably make sense to prioritize the larger ones so that they start early and the smaller ones can fill in the parallelism gaps as the longer-running ones finish.

3. That is, there is sometimes benefit in having many more jobs to run than you have concurrent streams. This avoids having one save set that finishes long after all the others because of poorly balanced save sets.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
I don't know of any reason why we can't turn 1 backup job per filesystem into, say, up to 26, based on the Cyrus file and directory structure.

The Cyrus file and directory structure is designed with users located under the directories A, B, C, D, etc., to deal with the millions-of-little-files issue at the filesystem layer. Our backups will have to be changed to use this design feature. There will be a little work on the front end to create the jobs, but once done the full backups should finish in a couple of hours.

As an aside, we are currently upgrading our backup server to a sun4v machine. This architecture is well suited to running more jobs in parallel.

Thanks for all your help and advice.

Ed

On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:
> On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer<Ed_Spencer at umanitoba.ca> wrote:
>> We back up 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
>> Saturday. We back up to disk and then clone to tape. Our backup people
>> can only handle doing 2 filesystems per night.
>>
>> Creating more filesystems to increase the parallelism of our backup is
>> one solution but it's a major redesign of the mail system.
>
> What is magical about a 1:1 mapping of backup job to file system?
> According to the Networker manual[1], a save set in Networker can be
> configured to back up certain directories. According to some random
> documentation about Cyrus[2], mailboxes fall under a pretty
> predictable hierarchy.
>
> 1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
> 2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html
>
> Assuming that the way your mailboxes get hashed falls into a
> structure like $fs/b/bigbird and $fs/g/grover (and not just
> $fs/bigbird and $fs/grover), you should be able to set a save set per
> top-level directory or per group of a few directories. That is,
> create a save set for $fs/a, $fs/b, etc., or $fs/a - $fs/d, $fs/e -
> $fs/h, etc. If you are able to create many smaller save sets and turn
> the parallelism up, you should be able to drive more throughput.
>
> I wouldn't get too worried about ensuring that they all start at the
> same time[3], but it would probably make sense to prioritize the
> larger ones so that they start early and the smaller ones can fill in
> the parallelism gaps as the longer-running ones finish.
>
> 3. That is, there is sometimes benefit in having many more jobs to run
> than you have concurrent streams. This avoids having one save set
> that finishes long after all the others because of poorly balanced
> save sets.
>
> -- 
> Mike Gerdts
> http://mgerdts.blogspot.com/

-- Ed
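[Editorial note: the 26-way split Ed describes could be generated mechanically; a sketch, where the spool path and the "saveset:" output format are placeholders, not actual NetWorker configuration syntax.]

```shell
# Emit one backup save set per top-level Cyrus hash directory so a
# full backup can run up to 26 streams in parallel (path is assumed)
FS=/var/spool/imap/user
for d in a b c d e f g h i j k l m n o p q r s t u v w x y z
do
    printf 'saveset: %s/%s\n' "$FS" "$d"
done
```

The emitted list can then be pasted into, or fed to, whatever defines save sets in the backup software, and the same loop extends naturally to grouping several letters per job.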
On Tue, Aug 11, 2009 at 11:04 PM, Richard Elling<richard.elling at gmail.com> wrote:
> On Aug 11, 2009, at 7:39 AM, Ed Spencer wrote:
>
>> I suspect that if we 'rsync' one of these filesystems to a second
>> server/pool that we would also see a performance increase equal to what
>> we see on the development server. (I don't know how zfs send and receive
>> work so I don't know if they would address this "Filesystem Entropy" or
>> specifically reorganize the files and directories.) However, when we
>> created a testfs filesystem in the zfs pool on the production server,
>> and copied data to it, we saw the same performance as the other
>> filesystems, in the same pool.
>
> Directory walkers, like NetBackup or rsync, will not scale well as
> the number of files increases. It doesn't matter what file system you
> use; the scalability will look more-or-less similar. For millions of files,
> ZFS send/receive works much better. More details are in my paper.

It would be nice if ZFS had something similar to the VxFS File Change Log. This feature is very useful for incremental backups and other directory walkers, provided they support the FCL.

Damjan
Ed Spencer wrote:> I don''t know of any reason why we can''t turn 1 backup job per filesystem > into say, up to say , 26 based on the cyrus file and directory > structure. >No reason whatsoever. Sometimes the more the better as per the rest of this thread. The key here is to test and tweak till you get the optimal arrangement of backup window time and performance. Performance tuning is a little bit of a Journey, that sooner or later has a final destination. ;)> The cyrus file and directory structure is designed with users located > under the directories A,B,C,D,etc to deal with the millions of little > files issue at the filesystem layer. >The sun messaging server actually hashes the user names into a structure which looks quite similar to a squid cache store. This has a top level of 128 directories, which each in turn contain 128 directories, which then contain a folder for each user that has been mapped into that structure by the hash algorithm on the user name. I use a wildcard mapping to split this into 16 streams to cover the 0-9, a-f of the hexadecimal directory structure names. eg. /mailstore1/users/0*> Our backups will have to be changed to use this design feature. > There will be a little work on the front end to create the jobs but > once done the full backups should finish in a couple of hours. >The nice thing about this work is it really is only a one off configuration in the backup software and then it is done. Certainly works a lot better than something like ALL_LOCAL_DRIVES in Netbackup which effectively forks one backup thread per file system.> As an aside, we are currently upgrading our backup server to a sun4v > machine. > This architecture is well suited to run more jobs in parallel. >I use a T5220 with staging to a J4500 with 48 x 1 TB disks in a zpool with 6 file systems. 
This then gets streamed to 6 LTO4 tape drives in an SL500. Needless to say, this supports a high degree of parallelism and generally finds the source server to be the bottleneck. I also take advantage of the 10 GigE capability built straight into the UltraSPARC T2. The only major bottleneck in this system is the SAS interconnect to the J4500.

> Thanx for all your help and advice.
>
> Ed
>
> On Tue, 2009-08-11 at 22:47, Mike Gerdts wrote:
>
>> On Tue, Aug 11, 2009 at 9:39 AM, Ed Spencer <Ed_Spencer at umanitoba.ca> wrote:
>>
>>> We backup 2 filesystems on Tuesday, 2 filesystems on Thursday, and 2 on
>>> Saturday. We back up to disk and then clone to tape. Our backup people
>>> can only handle doing 2 filesystems per night.
>>>
>>> Creating more filesystems to increase the parallelism of our backup is
>>> one solution, but it's a major redesign of the mail system.
>>
>> What is magical about a 1:1 mapping of backup job to file system?
>> According to the Networker manual[1], a save set in Networker can be
>> configured to back up certain directories. According to some random
>> documentation about Cyrus[2], mailboxes fall under a pretty
>> predictable hierarchy.
>>
>> 1. http://oregonstate.edu/net/services/backups/clients/7_4/admin7_4.pdf
>> 2. http://nakedape.cc/info/Cyrus-IMAP-HOWTO/components.html
>>
>> Assuming that the way your mailboxes get hashed falls into a
>> structure like $fs/b/bigbird and $fs/g/grover (and not just
>> $fs/bigbird and $fs/grover), you should be able to set a save set per
>> top-level directory or per group of a few directories. That is,
>> create a save set for $fs/a, $fs/b, etc., or $fs/a - $fs/d, $fs/e -
>> $fs/h, etc. If you are able to create many smaller save sets and turn
>> the parallelism up, you should be able to drive more throughput.
>>
>> I wouldn't get too worried about ensuring that they all start at the
>> same time[3], but it would probably make sense to prioritize the
>> larger ones so that they start early and the smaller ones can fill in
>> the parallelism gaps as the longer-running ones finish.
>>
>> 3. That is, there is sometimes benefit in having many more jobs to run
>> than you have concurrent streams. This avoids having one save set
>> that finishes long after all the others because of poorly balanced
>> save sets.

Couldn't agree more, Mike.

>> --
>> Mike Gerdts
>> http://mgerdts.blogspot.com/

-- 
_______________________________________________________________________

Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006
Manukau City
Auckland
New Zealand

Phone : +64 09 968 7611
Fax : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:scott at manukau.ac.nz
http://www.manukau.ac.nz

________________________________________________________________________

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
________________________________________________________________________
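Mike's grouping scheme ($fs/a - $fs/d, $fs/e - $fs/h, etc.) can be sketched as a few lines of shell that carve the 26 per-letter directories into save sets of a fixed size. The $FS path and the "saveset:" output format are placeholders, not real Networker configuration:

```shell
# Sketch: group the 26 per-letter Cyrus directories into save sets of
# GROUP letters each, so many small jobs can keep the parallel backup
# streams busy. Path and output format are hypothetical placeholders.
FS=/var/spool/cyrus/mail
GROUP=4
NSETS=0
saveset=""
n=0
for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    saveset="$saveset $FS/$letter"
    n=$((n + 1))
    if [ "$n" -eq "$GROUP" ]; then
        echo "saveset:$saveset"
        NSETS=$((NSETS + 1))
        n=0
        saveset=""
    fi
done
if [ -n "$saveset" ]; then
    # leftover letters form a final, smaller save set
    echo "saveset:$saveset"
    NSETS=$((NSETS + 1))
fi
```

With GROUP=4 this yields 7 save sets (6 of four letters, one of two), and per Mike's footnote [3] it is fine that they are uneven, since having more jobs than concurrent streams lets the scheduler fill the gaps.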
There is work underway to make NDMP more efficient on highly fragmented file systems with a lot of small files. I am not a development engineer, so I don't know much, and I do not think there is any committed work. However, the ZFS engineers on the forum may comment further.

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ed Spencer
Sent: Sunday, August 09, 2009 12:14 AM
To: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] zfs fragmentation

On Sat, 2009-08-08 at 15:20, Bob Friesenhahn wrote:
> A SSD slog backed by a SAS 15K JBOD array should perform much better
> than a big iSCSI LUN.

Now... yes. We implemented this pool years ago. I believe, back then, the server would crash if a zfs drive failed. We decided to let the NetApp handle the disk redundancy. It's worked out well.

I've looked at those really nice Sun products adoringly, and a 7000 series appliance would also be a nice addition to our central NFS service. Not to mention more cost effective than expanding our Network Appliance (we have researchers who are quite hungry for storage, and NFS is always our first choice). We now have quite an investment in the current implementation; it's difficult to move away from. The NetApp is quite a reliable product.

We are quite happy with zfs and our implementation. We just need to address our backup performance and improve it just a little bit!

We were almost lynched this spring because we encountered some pretty severe zfs bugs. We are still running the IDR named "A wad of ZFS bug fixes for Solaris 10 Update 6". It took over a month to resolve the issues. I work at a University, and final exams and year end occur at the same time. I don't recommend having email problems during this time!
People are intolerant of email problems.

I live in hope that a NetApp OS update, or a Solaris patch, or a zfs patch, or an iscsi patch, or something will come along that improves our performance just a bit so our backup people get off my back!

-- 
Ed

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss