Adam,

The blog entry[1] you've made about gzip for ZFS raises a couple of
questions...

1) It would appear that a ZFS filesystem can support files using
   varying compression algorithms. If a file is compressed using
   method A but method B is now active, and I truncate the file
   and rewrite it, is A or B used?

2) The question of whether or not to use bzip2 was raised in
   the comment section of your blog. How easy would it be to
   implement a pluggable (or more generic) interface between
   ZFS and the compression algorithms it uses such that I
   can modload a bzip2 compression LKM and tell ZFS to
   use that? I suspect that doing this will take extra work
   from the Solaris side of things too...

3) Given (1), are there any thoughts about being able to specify
   different compression algorithms for different directories
   (or files) on a ZFS filesystem?

And thanks for the great work!

Cheers,
Darren

[1] blogs.sun.com/ahl/entry/gzip_for_zfs_update
Hello Darren,

Thursday, March 29, 2007, 12:01:21 AM, you wrote:

DRSC> Adam,
DRSC>
DRSC> The blog entry[1] you've made about gzip for ZFS raises a couple
DRSC> of questions...
DRSC>
DRSC> 1) It would appear that a ZFS filesystem can support files using
DRSC>    varying compression algorithms. If a file is compressed using
DRSC>    method A but method B is now active, and I truncate the file
DRSC>    and rewrite it, is A or B used?

All new blocks will be written using B. It also means that some blocks
belonging to the same file can be compressed with method A and some
with method B (and others, if compression gained less than 12%, won't
be compressed at all, unless that threshold was changed in the code).

DRSC> 2) The question of whether or not to use bzip2 was raised in
DRSC>    the comment section of your blog. How easy would it be to
DRSC>    implement a pluggable (or more generic) interface between
DRSC>    ZFS and the compression algorithms it uses such that I
DRSC>    can modload a bzip2 compression LKM and tell ZFS to
DRSC>    use that? I suspect that doing this will take extra work
DRSC>    from the Solaris side of things too...

LKM - Linux Kernel Module? :))))))

Anyway - the first problem is to find in-kernel compress/decompress
algorithms, or to port a user-land implementation to the kernel. Gzip
was easier as it was already there. So if you have an in-kernel bzip2
implementation, better yet one working on Solaris, then adding bzip2
to ZFS would be quite easy.

The last time I looked at the ZFS compression code it wasn't
dynamically extensible - the available compression algorithms have to
be compiled in.

Now, while a dynamically pluggable implementation sounds appealing, I
doubt people will actually create any such modules in reality. Not to
mention problems like: you export a pool, import it on another host
without your module, and basically you can't access your data.

DRSC> 3) Given (1), are there any thoughts about being able to specify
DRSC>    different compression algorithms for different directories
DRSC>    (or files) on a ZFS filesystem?

There was a small discussion here some time ago about such
possibilities, but I doubt anything was actually done about it. Apart
from that 12% barrier, which possibly saves CPU on decompression of
poorly compressible data (as such data won't actually be stored
compressed), I'm afraid there's nothing more. It was suggested here
that perhaps ZFS could turn compression off for specific file types,
determined either by file name extension or by magic cookie - that
would probably save some CPU on some workloads.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                       milek.blogspot.com
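As a concrete illustration of the behaviour described above, a minimal
sketch that can be run on any scratch dataset (the pool name "tank" and
the dataset/file names are only placeholders, /usr/dict/words is just a
handy compressible file, and using gzip as a value assumes a build that
already includes the gzip support from Adam's blog entry):

    zfs create tank/demo
    zfs set compression=lzjb tank/demo
    cp /usr/dict/words /tank/demo/a           # these blocks are written with lzjb
    zfs set compression=gzip tank/demo        # takes effect for new writes only
    cp /usr/dict/words /tank/demo/b           # these blocks are written with gzip
    cat /tank/demo/a /tank/demo/b > /dev/null # both still read back transparently
    zfs get compression,compressratio tank/demo

Rewriting /tank/demo/a in place would likewise re-store its blocks with
whatever algorithm is active at the time of the write.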
Robert Milkowski wrote:
> Hello Darren,
>
> Thursday, March 29, 2007, 12:01:21 AM, you wrote:
>
> DRSC> Adam,
> ...
> DRSC> 2) The question of whether or not to use bzip2 was raised in
> DRSC>    the comment section of your blog. How easy would it be to
> DRSC>    implement a pluggable (or more generic) interface between
> DRSC>    ZFS and the compression algorithms it uses such that I
> DRSC>    can modload a bzip2 compression LKM and tell ZFS to
> DRSC>    use that? I suspect that doing this will take extra work
> DRSC>    from the Solaris side of things too...
>
> LKM - Linux Kernel Module? :))))))
>
> Anyway - the first problem is to find in-kernel compress/decompress
> algorithms, or to port a user-land implementation to the kernel. Gzip
> was easier as it was already there. So if you have an in-kernel bzip2
> implementation, better yet one working on Solaris, then adding bzip2
> to ZFS would be quite easy.
>
> The last time I looked at the ZFS compression code it wasn't
> dynamically extensible - the available compression algorithms have to
> be compiled in.

I suppose what would have been nice to see, architecturally, was a way
to transform data at some point in the pipeline and to be able to
specify various types of transforms, be they compression, encryption or
something else. But maybe I'm just dreaming without understanding the
complexities of what needs to happen on the "inside", such that where
these two operations might take place is actually incompatible, and
thus there is little point in such a generalisation.

> Now, while a dynamically pluggable implementation sounds appealing, I
> doubt people will actually create any such modules in reality. Not to
> mention problems like: you export a pool, import it on another host
> without your module, and basically you can't access your data.

Maybe, but it is also a very good method for enabling people to develop
and test new compression algorithms for use with filesystems. It also
opens up a new avenue for people who want to build their own appliances
using ZFS, to create an extra thing that differentiates them from
others.

So, for example, if the interface was pluggable and Sun only wanted to
ship gzip, but I wanted to create a "better" ZFS based appliance than
one based on just OpenSolaris, I might build a bzip2 module for the
kernel and have ZFS use that by default.

Darren
Hello Darren,

Thursday, March 29, 2007, 12:55:03 AM, you wrote:

DRSC> So, for example, if the interface was pluggable and Sun only
DRSC> wanted to ship gzip, but I wanted to create a "better" ZFS
DRSC> based appliance than one based on just OpenSolaris, I might
DRSC> build a bzip2 module for the kernel and have ZFS use that
DRSC> by default.

Or better yet, implement a CAS-like (de-dup) solution (though that
would probably work better on a file basis rather than on a block
basis).

OK, I get the idea.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                       milek.blogspot.com
> I suppose what would have been nice to see, architecturally,
> was a way to transform data at some point in the pipeline and
> to be able to specify various types of transforms, be they
> compression, encryption or something else. But maybe I'm
> just dreaming without understanding the complexities of
> what needs to happen on the "inside", such that where these
> two operations might take place is actually incompatible, and
> thus there is little point in such a generalisation.

You really don't want crypto algorithms to be that pluggable. The
reason is that with crypto you need to make a very careful choice of
algorithm, key length and mode (CBC vs ECB vs CTR vs CCM vs GCM etc.),
and that isn't something you want an end admin doing. You don't want
them switching from AES-256-CCM to Blowfish-448-CBC because they see
448 is bigger than 256 and therefore "more secure".

For compression it wouldn't be such a big deal, except for an
implementation artifact of how the ZIO pipeline works: it is partly
controlled by the compress stage at the moment.

The other problem is that you basically need a global unique registry
anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is ....
etc etc. Similarly for crypto and any other transform.

BTW I actually floated the idea of a generic ZTL - ZIO Transform Layer -
about this time last year (partly in jest, because ZTL is the last
three letters on my car registration :-)).

> So, for example, if the interface was pluggable and Sun only
> wanted to ship gzip, but I wanted to create a "better" ZFS
> based appliance than one based on just OpenSolaris, I might
> build a bzip2 module for the kernel and have ZFS use that
> by default.

I have to say it, because normally I'm all for pluggable interfaces and
I don't think the answer should be "it's open source, just add it", but
in this case I think that is, for now, the safer way.

-- 
Darren J Moffat
From: "Darren J Moffat" <Darren.Moffat at Sun.COM> ...> The other problem is that you basically need a global unique registry > anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is .... > etc etc. Similarly for crypto and any other transform.I''ve two thoughts on that: 1) if there is to be a registry, it should be hosted by OpenSolaris and be open to all and 2) there should be provision for a "private number space" so that people can implement their own whatever so long as they understand that the filesystem will not work if plugged into something else. Case in point for (2), if I wanted to make a bzip2 version of ZFS at home then I should be able to and in doing so chose a number for it that I know will be safe for my playing at home. I shouldn''t have to come to zfs-discuss at opensolaris.org to "pick a number." Darren
>From: "Darren J Moffat" <Darren.Moffat at Sun.COM> >... >> The other problem is that you basically need a global unique registry >> anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is .... >> etc etc. Similarly for crypto and any other transform. > >I''ve two thoughts on that: >1) if there is to be a registry, it should be hosted by OpenSolaris > and be open to all and > >2) there should be provision for a "private number space" so that > people can implement their own whatever so long as they understand > that the filesystem will not work if plugged into something else. > >Case in point for (2), if I wanted to make a bzip2 version of ZFS at >home then I should be able to and in doing so chose a number for it >that I know will be safe for my playing at home. I shouldn''t have >to come to zfs-discuss at opensolaris.org to "pick a number."I''m not sure we really need a registry or a number space. Algorithms should have names, not numbers. The zpool should contain a table: - 1 lzjb - 2 gzip - 3 ... but it could just as well be: - 1 gzip - 2 ... - 3 lzjb the zpool would simply not load if it cannot find the algorithm(s) used to store data in the zpool (or return I/O errors on the files/metadata it can''t decompress) Global registries seem like a bad idea; names can be made arbitrarily long to make uniqueness. There''s no reason why the algorithm can''t be renamed after creating the pool might a clash occur; renumbering would be much harder. Casper
On Wed, Apr 04, 2007 at 07:57:21PM +1000, Darren Reed wrote:
> From: "Darren J Moffat" <Darren.Moffat at Sun.COM>
> ...
> > The other problem is that you basically need a global unique registry
> > anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is ....
> > etc etc. Similarly for crypto and any other transform.
>
> I've two thoughts on that:
> 1) if there is to be a registry, it should be hosted by OpenSolaris
>    and be open to all

I think there already is such a registry:

  cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/zio.h#89

Adam

-- 
Adam Leventhal, Solaris Kernel Development       blogs.sun.com/ahl
> Assuming that you may pick a specific compression algorithm,
> most algorithms can have different levels/percentages of
> deflation/inflation, which affects the time to compress
> and/or inflate wrt the CPU capacity.

Yes? I'm not sure what your point is. Are you suggesting that, rather
than hard-coding (for instance) the nine "gzip1/gzip2/.../gzip9"
alternatives, it would be useful to have a "gzip" setting with a
compression level? That might make some sense, but in practice there's
a limited number of compression algorithms and limited utility in
setting the degree of compression, so the current approach doesn't seem
to sacrifice much. (If you get into more complex compression
algorithms, there are more knobs to tweak, too; and it doesn't seem
particularly useful to expose all of those.)

> Secondly, if I can add an additional item, would anyone
> want to be able to encrypt the data vs compress

Yes, and I think Darren Moffat is working on it. Encryption &
compression are orthogonal, though. (The only constraint is that it's
far preferable to compress first, then encrypt, since compression
relies on regularity in the data stream which encryption removes.)

> Third, if data were to be compressed within a file
> object, should a reader be made aware that the data
> being read is compressed or should he just read
> garbage?

I don't understand your question here. Compression is transparent, so a
reader will get back exactly what was written. Both the compression and
decompression happen automatically. (There's a separate issue that
backup applications would like to be able to read the compressed data
directly; I haven't paid attention to see if there's an ioctl to enable
this yet.)

> Fourth, if you take 8k and expect to alloc 8k of disk
> block storage for it and compress it to 7k, are you
> really saving 1k? Or are you just creating an additional
> 1K of internal fragmentation?

You're really saving 1K, because the disk space is not allocated until
after the compression step. Remember, ZFS uses variably-sized blocks.
In your example, you'll allocate a 7K block which happens to hold 8K
worth of the user's data.

> Fifth and hopefully last, should the znode have a
> new length field that keeps the non-compressed length
> for Posix compatibility.

With this & your third question, I think you've got a fundamental
misunderstanding of what the compression in ZFS does. It is transparent
to the application. The application reads & writes uncompressed data,
it sees uncompressed files, and it doesn't even have any way to know
that the file has been compressed (except by looking at stat data &
counting the blocks used).

> Really last..., why not just compress the data stream
> before writing it out to disk? Then you can at least do
> a file(1) on it and identify the type of compression...

This is preferable when the application supports it, because it allows
you to compress the whole file at once and get better compression
ratios, choose an appropriate compression algorithm, not try to
compress incompressible data, etc. However, it's less general, since it
requires that the application do the compression. If you have existing
applications which only deal with uncompressed data, then having the
file system do the compression is useful.

This isn't exactly new. Stac did this for DOS (at the disk level, not
the file system level) in the early 1990s. File system level
compression came in around the same time (DiskDoubler and StuffIt
SpaceSaver on the Mac, for instance).
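The transparency point is easy to check for yourself; a rough sketch on
a throwaway dataset (pool, dataset and file names are placeholders, and
/usr/dict/words is just a convenient compressible file): the data reads
back bit for bit identical, ls -l reports the uncompressed length, and
only the block accounting reflects the compression.

    zfs create tank/txt
    zfs set compression=on tank/txt
    cp /usr/dict/words /tank/txt/words
    cmp /usr/dict/words /tank/txt/words && echo identical  # reads return exactly what was written
    ls -l /tank/txt/words        # full, uncompressed length
    sync                         # give the pool a moment to sync before checking space
    du -k /tank/txt/words        # (smaller) space actually allocated

Nothing in the application's view of the file changes; only stat/du
show that fewer blocks were allocated.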
Windows NTFS has built-in compression, but it compresses the whole
file, rather than individual blocks. (Better compression, but the
performance isn't as good if you're only reading a small portion of the
file.)

Anton
My two cents,

Assuming that you may pick a specific compression algorithm, most
algorithms can have different levels/percentages of deflation/inflation,
which affects the time to compress and/or inflate wrt the CPU capacity.

Secondly, if I can add an additional item, would anyone want to be able
to encrypt the data vs compress, or to be able to combine encryption
with compression?

Third, if data were to be compressed within a file object, should a
reader be made aware that the data being read is compressed or should
he just read garbage? Would/should a field in the znode be read so that
already-compressed data is transparently decompressed?

Fourth, if you take 8k and expect to alloc 8k of disk block storage for
it and compress it to 7k, are you really saving 1k? Or are you just
creating an additional 1K of internal fragmentation? It is possible
that moving 7K of data across your "SCSI" type interface may give you
faster read/write performance. But that is after the additional latency
of the compress on the async write, and it adds a real latency on the
current block read. So, what are you really gaining?

Fifth and hopefully last, should the znode have a new length field that
keeps the non-compressed length for POSIX compatibility? I am assuming
large file support, where a process that is not large-file aware should
not even be able to open the file. With the additional field
(uncompressed size) the file may lie on the boundary for the large file
open requirements.

Really last..., why not just compress the data stream before writing it
out to disk? Then you can at least do a file(1) on it and identify the
type of compression...

Mitchell Erblich
-----------------

Darren Reed wrote:
>
> From: "Darren J Moffat" <Darren.Moffat at Sun.COM>
> ...
> > The other problem is that you basically need a global unique registry
> > anyway so that compress algorithm 1 is always lzjb, 2 is gzip, 3 is ....
> > etc etc. Similarly for crypto and any other transform.
>
> I've two thoughts on that:
> 1) if there is to be a registry, it should be hosted by OpenSolaris
>    and be open to all; and
>
> 2) there should be provision for a "private number space" so that
>    people can implement their own whatever, so long as they understand
>    that the filesystem will not work if plugged into something else.
>
> Case in point for (2): if I wanted to make a bzip2 version of ZFS at
> home then I should be able to, and in doing so choose a number for it
> that I know will be safe for my playing at home. I shouldn't have to
> come to zfs-discuss at opensolaris.org to "pick a number."
>
> Darren
Erblichs wrote:
> My two cents,
>
> ...
> Secondly, if I can add an additional item, would anyone
> want to be able to encrypt the data vs compress or to
> be able to combine encryption with compression?

Yes, I might want to encrypt all of my laptop's hard drive contents,
and I might also want to have compression used prior to encryption to
maximise the utility I get from the relatively limited space.

Darren
Management here is worried about performance under ZFS because they had
a bad experience with Instant Image a number of years ago. When iiamd
was used, server performance was reduced to a crawl. Hence they want
proof, in the form of benchmarking, that zfs snapshots will not
adversely affect system performance. They suggested creating,
snapshotting, copying and generally messing about with some 1 GB files.
The system is an E450 running snv_52 with a 36 GB boot drive, a 142 GB
data drive and two 9 GB SAN partitions, one on slow disk, one on fast.
The 36 GB drive is formatted ufs, everything else zfs.

I timed mkfile'ing a 1 GB file on ufs and copying it, then did the same
thing on each zfs partition. Then I took snapshots, copied files, more
snapshots, keeping timings all the way. I could find no appreciable
performance hit.

Is this a sufficient, valid test?

This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager. This message contains confidential information and
is intended only for the individual named. If you are not the named
addressee you should not disseminate, distribute or copy this e-mail.
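A slightly more direct way to measure snapshot overhead than copying
files around is to time the same in-place rewrite with and without a
snapshot holding the old copy of the blocks; a rough sketch only, with
the pool name and sizes as placeholders (zeros are fine as test data as
long as compression is off; see the follow-ups about mkfile below):

    zfs create tank/bench
    # baseline: write, then rewrite in place, with no snapshot present
    time dd if=/dev/zero of=/tank/bench/f1 bs=1024k count=1024
    time dd if=/dev/zero of=/tank/bench/f1 bs=1024k count=1024 conv=notrunc
    # now hold the old blocks in a snapshot and rewrite again;
    # copy-on-write allocates fresh blocks either way, the snapshot only
    # keeps the old ones referenced, so the times should be comparable
    zfs snapshot tank/bench@before
    time dd if=/dev/zero of=/tank/bench/f1 bs=1024k count=1024 conv=notrunc
    zfs list -t snapshot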
zfs-discuss-bounces at opensolaris.org wrote on 04/12/2007 04:47:06 PM:

> Management here is worried about performance under ZFS because they had
> a bad experience with Instant Image a number of years ago. When iiamd
> was used, server performance was reduced to a crawl. Hence they want
> proof, in the form of benchmarking, that zfs snapshots will not
> adversely affect system performance. They suggested creating,
> snapshotting, copying and generally messing about with some 1 GB files.
> The system is an E450 running snv_52 with a 36 GB boot drive, a 142 GB
> data drive and two 9 GB SAN partitions, one on slow disk, one on fast.
> The 36 GB drive is formatted ufs, everything else zfs.
>
> I timed mkfile'ing a 1 GB file on ufs and copying it, then did the same
> thing on each zfs partition. Then I took snapshots, copied files, more
> snapshots, keeping timings all the way. I could find no appreciable
> performance hit.
>
> Is this a sufficient, valid test?

I believe mkfile is creating the file padded with zeros, and that ZFS
has short-circuits to avoid storing actual data for such empty files.
That would lead me to believe that this is an invalid test.

-Wade
On April 12, 2007 3:47:06 PM -0600 Bruce Shaw <Bruce.Shaw at gov.ab.ca> wrote:
> Management here is worried about performance under ZFS because they had
> a bad experience with Instant Image a number of years ago. When iiamd
> was used, server performance was reduced to a crawl. Hence they want
> proof, in the form of benchmarking, that zfs snapshots will not
> adversely affect system performance. They suggested creating,
> snapshotting, copying and generally messing about with some 1 GB files.
> The system is an E450 running snv_52 with a 36 GB boot drive, a 142 GB
> data drive and two 9 GB SAN partitions, one on slow disk, one on fast.
> The 36 GB drive is formatted ufs, everything else zfs.
>
> I timed mkfile'ing a 1 GB file on ufs and copying it, then did the same
> thing on each zfs partition. Then I took snapshots, copied files, more
> snapshots, keeping timings all the way. I could find no appreciable
> performance hit.
>
> Is this a sufficient, valid test?
>
> This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they
> are addressed. If you have received this email in error please notify
> the system manager. This message contains confidential information and
> is intended only for the individual named. If you are not the named
> addressee you should not disseminate, distribute or copy this e-mail.

I'm sorry, but I can't answer your questions since I refuse to accept
these terms. As I don't have any idea who the "system manager" is, I
guess I have no one to go to about this problem!

You do realize your confidential email has been gateway'd to the web
and indexed by google and other search engines?

-frank
>> This email and any files transmitted with it are confidential and
>> intended solely for the use of the individual or entity to whom they
>> are addressed. If you have received this email in error please notify
>> the system manager. This message contains confidential information and
>> is intended only for the individual named. If you are not the named
>> addressee you should not disseminate, distribute or copy this e-mail.
>
> I'm sorry, but I can't answer your questions since I refuse to accept
> these terms. As I don't have any idea who the "system manager" is, I
> guess I have no one to go to about this problem!
>
> You do realize your confidential email has been gateway'd to the web
> and indexed by google and other search engines?

Yeah, I know. It's a generic disclaimer put on all outgoing mail.

For your viewing pleasure, here it is again.

This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager. This message contains confidential information and
is intended only for the individual named. If you are not the named
addressee you should not disseminate, distribute or copy this e-mail.
On 4/12/07, Wade.Stuart at fallon.com <Wade.Stuart at fallon.com> wrote:
> zfs-discuss-bounces at opensolaris.org wrote on 04/12/2007 04:47:06 PM:
> > I timed mkfile'ing a 1 GB file on ufs and copying it, then did the same
> > thing on each zfs partition. Then I took snapshots, copied files, more
> > snapshots, keeping timings all the way. I could find no appreciable
> > performance hit.
> >
> > Is this a sufficient, valid test?
>
> I believe mkfile is creating the file padded with zeros, and that ZFS
> has short-circuits to avoid storing actual data for such empty files.
> That would lead me to believe that this is an invalid test.
>
> -Wade

You can get around this concern with

dd if=/dev/urandom of=/tmp/1meg bs=512 count=2048
i=0 ; while [ $i -lt 1024 ] ; do cat /tmp/1meg ; i=`expr $i + 1` ; done > /zfs/1gig

A smallish block size was chosen in the dd command because /dev/*random
doesn't allow huge blocks to be read. I forget what the cut-off is, but
512 bytes at a time should be fine.

Mike

-- 
Mike Gerdts
mgerdts.blogspot.com
Hello Wade,

Thursday, April 12, 2007, 11:55:49 PM, you wrote:

WSfc> zfs-discuss-bounces at opensolaris.org wrote on 04/12/2007 04:47:06 PM:

>> Management here is worried about performance under ZFS because they had
>> a bad experience with Instant Image a number of years ago. When iiamd
>> was used, server performance was reduced to a crawl. Hence they want
>> proof, in the form of benchmarking, that zfs snapshots will not
>> adversely affect system performance. They suggested creating,
>> snapshotting, copying and generally messing about with some 1 GB files.
>> The system is an E450 running snv_52 with a 36 GB boot drive, a 142 GB
>> data drive and two 9 GB SAN partitions, one on slow disk, one on fast.
>> The 36 GB drive is formatted ufs, everything else zfs.
>>
>> I timed mkfile'ing a 1 GB file on ufs and copying it, then did the same
>> thing on each zfs partition. Then I took snapshots, copied files, more
>> snapshots, keeping timings all the way. I could find no appreciable
>> performance hit.
>>
>> Is this a sufficient, valid test?

WSfc> I believe mkfile is creating the file padded with zeros, and that ZFS
WSfc> has short-circuits to avoid storing actual data for such empty files.
WSfc> That would lead me to believe that this is an invalid test.

Only if you turn compression on in ZFS.
Other than that, zeros are stored like any other data.

-- 
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                       milek.blogspot.com
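Both behaviours are easy to see side by side; a rough sketch, with the
pool and dataset names as placeholders only (give the pool a few
seconds to sync before looking at du):

    zfs create tank/nocomp
    zfs create tank/comp
    zfs set compression=on tank/comp
    mkfile 1g /tank/nocomp/zeros
    mkfile 1g /tank/comp/zeros
    sync
    ls -l /tank/nocomp/zeros /tank/comp/zeros   # both report 1 GB
    du -k /tank/nocomp/zeros /tank/comp/zeros   # roughly 1 GB vs. next to nothing

Which is why mkfile-based timings are reasonable on a dataset with
compression off, but say very little once compression is enabled.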
On 4/13/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>
> Only if you turn compression on in ZFS.
> Other than that, zeros are stored like any other data.

There is some difference, but it's marginal as the files get larger.
The disks in mtank are SATA2 ES 500GB Seagates in an Intel V5000
system. The system is a default b61 install and totally untuned.

root@sstore:~# zpool status
  pool: mtank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mtank       ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0

root@sstore:~# time mkfile 1g /mtank/file_1g        - 2048 MB/s

real    0m0.518s
user    0m0.004s
sys     0m0.513s

root@sstore:~# dd if=/dev/urandom of=/tmp/1meg bs=512 count=2048; i=0 ; time while [ $i -lt 1024 ] ; do cat /tmp/1meg ; i=`expr $i + 1` ; done > /mtank/1g_ran
2048+0 records in
2048+0 records out

real    0m6.876s
user    0m1.205s
sys     0m5.792s

root@sstore:~# time mkfile 2g /mtank/file_2g        - 182 MB/s

real    0m11.223s
user    0m0.008s
sys     0m1.178s

root@sstore:~# time mkfile 5g /mtank/file_5g        - 147 MB/s

real    0m34.721s
user    0m0.019s
sys     0m2.841s

root@sstore:~# dd if=/dev/urandom of=/tmp/1meg bs=512 count=2048; i=0 ; time while [ $i -lt 5120 ] ; do cat /tmp/1meg ; i=`expr $i + 1` ; done > /mtank/5g_ran
2048+0 records in
2048+0 records out

real    0m38.928s
user    0m6.442s
sys     0m32.911s

root@sstore:~# time mkfile 10g /mtank/file_10g

real    1m15.185s
user    0m0.037s
sys     0m5.885s

root@sstore:~# time mkfile 10g /mtank/file_10g.2    - 134 MB/s

real    1m16.490s
user    0m0.038s
sys     0m5.723s

root@sstore:~# time mkfile 50g /mtank/file_50g      - 132 MB/s

real    6m27.673s
user    0m0.178s
sys     0m27.549s

Even with 155 GB in the pool the snaps are pretty quick:

root@sstore:~# ls -l /mtank/
total 162403552
-rw-r--r--   1 root       1073741824 Apr 13 15:50 1g_ran
-rw-r--r--   1 root       5368709120 Apr 13 15:52 5g_ran
-rw-------   1 root     107374182400 Apr  9 15:18 file_100g
-rw------T   1 root      10737418240 Apr  9 15:20 file_10g
-rw------T   1 root      10737418240 Apr  9 15:22 file_10g.2
-rw------T   1 root       1073741824 Apr  9 15:19 file_1g
-rw------T   1 root       2147483648 Apr  9 15:30 file_2g
-rw------T   1 root      53687091200 Apr  9 15:29 file_50g
-rw------T   1 root       5368709120 Apr  9 15:31 file_5g
root@sstore:~# zfs list | grep mtank
mtank                  155G   759G   155G  /mtank
root@sstore:~# time zfs snapshot mtank@test

real    0m0.204s
user    0m0.004s
sys     0m0.006s

root@sstore:/mtank# time zfs clone mtank@test mtank/test

real    0m0.299s
user    0m0.004s
sys     0m0.008s

Which is 375GB/s. Much better than:

root@sstore:/mtank# time cp file_10g file_10g.b

real    2m15.705s
user    0m0.008s
sys     0m50.084s

It also takes about double the time, because I imagine it has to read
and write through the same disk I/O channel. Between pools is OK as
well:

root@sstore:/mtank# zpool status ztank
  pool: ztank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ztank       ONLINE       0     0     0
          raidz2    ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0

root@sstore:/ztank/isos# time cp sol-nv-b61-x86-dvd.iso /mtank/    - 76 MB/s

real    0m49.955s
user    0m0.003s
sys     0m19.676s
root@sstore:/ztank/isos# du -sh sol-nv-b61-x86-dvd.iso
 3.7G   sol-nv-b61-x86-dvd.iso
> I timed mkfile'ing a 1 GB file on ufs and copying it [...] then did
> the same thing on each zfs partition. Then I took snapshots, copied
> files, more snapshots, keeping timings all the way. [ ... ]
>
> Is this a sufficient, valid test?

If your applications do that -- manipulate large files, primarily
copying them -- then it may be.

If your applications have other access patterns, probably not. If
you're concerned about whether you should put ZFS into production, then
you should put it onto your test system and run your real applications
on it for a while to qualify it (just as you should for any other file
system or hardware).

Anton
Interesting results.

Note that you will lose perhaps 40% of the media speed as you go from
the outside cylinders to the inside cylinders. The max sustained media
speed for that disk is about 72 MBytes/s, so you're doing pretty well,
all things considered.
 -- richard

Nicholas Lee wrote:
> On 4/13/07, Robert Milkowski <rmilkowski at task.gda.pl> wrote:
>
>     Only if you turn compression on in ZFS.
>     Other than that, zeros are stored like any other data.
>
> There is some difference, but it's marginal as the files get larger.
> The disks in mtank are SATA2 ES 500GB Seagates in an Intel V5000
> system. The system is a default b61 install and totally untuned.
> [...]
Anton B. Rang wrote:
>> I timed mkfile'ing a 1 GB file on ufs and copying it [...] then did
>> the same thing on each zfs partition. Then I took snapshots, copied
>> files, more snapshots, keeping timings all the way. [ ... ]
>>
>> Is this a sufficient, valid test?
>
> If your applications do that -- manipulate large files, primarily
> copying them -- then it may be.
>
> If your applications have other access patterns, probably not. If
> you're concerned about whether you should put ZFS into production,
> then you should put it onto your test system and run your real
> applications on it for a while to qualify it (just as you should for
> any other file system or hardware).

I couldn't agree more.

That said, I would be extremely surprised if the presence of snapshots
or clones had any impact whatsoever on the performance of accessing a
given filesystem. I've never seen anything like that.

--matt