Anantha N. Srirama
2007-Jan-08 20:04 UTC
[zfs-discuss] Puzzling ZFS behavior with COMPRESS option
Our setup:

- E2900 (24 x 96); Solaris 10 Update 2 (aka 06/06)
- 2 x 2Gbps FC HBAs
- EMC DMX storage
- 50 x 64GB LUNs configured in 1 ZFS pool
- Many filesystems created with COMPRESS enabled; specifically, I've one that is 768GB

I'm observing the following puzzling behavior:

- We are currently creating a large (>1.4TB) and sparse dataset; most of the dataset contains repeating blanks (default/standard SAS dataset behavior).
- ls -l reports the file size as 1.4+TB, while du -sk reports the actual on-disk usage at around 65GB.
- My I/O on the system is pegged at 150+MB/s as reported by zpool iostat, and I've confirmed the same with iostat.

This is very confusing:

- ZFS is doing very good compression, as shown by the ratio of the reported file size to the on-disk usage (1.4TB vs 65GB).
- *Why on God's green earth am I observing such high I/O when ZFS is indeed compressing?* I can't believe that the program is actually generating I/O at the rate of (150MB/s * compressratio).

Any thoughts?
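[Editorial aside: the ls -l versus du -sk gap above can be reproduced with a minimal C sketch, not from the thread; the path and sizes are hypothetical. It writes a blank-filled file and compares the apparent length (st_size) with the allocated blocks (st_blocks); on a COMPRESS-enabled ZFS filesystem the second figure should come out far smaller than the first, just as the poster observed.]

/*
 * Minimal sketch: write repeating blanks (like the SAS dataset) and
 * compare apparent size with on-disk allocation. The path is a
 * placeholder; run it on a compression-enabled ZFS filesystem.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mtdc/somefs/blanks.dat";   /* hypothetical path */
    char buf[128 * 1024];
    memset(buf, ' ', sizeof (buf));                 /* repeating blanks */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 1024; i++)                  /* ~128MB of blanks */
        if (write(fd, buf, sizeof (buf)) < 0) { perror("write"); return 1; }
    close(fd);

    struct stat st;
    if (stat(path, &st) < 0) { perror("stat"); return 1; }
    printf("apparent size: %lld bytes\n", (long long)st.st_size);
    printf("on-disk size : %lld bytes\n", (long long)st.st_blocks * 512LL);
    return 0;
}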
Anantha N. Srirama
2007-Jan-08 20:21 UTC
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Quick update: since my original post, I've confirmed via DTrace (the rwtop script in the DTrace toolkit) that the application is not generating 150MB/s * compressratio of I/O. What, then, is causing this much I/O on our system?
Neil Perrin
2007-Jan-08 20:52 UTC
[zfs-discuss] Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote on 01/08/07 13:04:
> [setup details and observations snipped; see the original post above]
> Why on God's green earth am I observing such high I/O when ZFS is indeed compressing? I can't believe that the program is actually generating I/O at the rate of (150MB/s * compressratio).
>
> Any thoughts?

One possibility is that the data is being written synchronously (using O_DSYNC, fsync, etc.), in which case the ZFS Intent Log (ZIL) will write that data uncompressed to stable storage, to guard against a crash or power failure before the txg is committed.

Neil.
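[Editorial aside: a minimal sketch of the two write paths Neil contrasts. Whether SAS actually opens its datasets with O_DSYNC or calls fsync() is an assumption to be verified, for example with DTrace; the file names are placeholders.]

/*
 * Buffered writes sit in memory and are compressed when the txg is
 * written, while writes on an O_DSYNC descriptor (or writes followed
 * by fsync()) must be committed immediately through the ZIL, which
 * logs the data uncompressed.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void fill_blanks(int fd, size_t nbufs)
{
    char buf[128 * 1024];
    memset(buf, ' ', sizeof (buf));
    for (size_t i = 0; i < nbufs; i++)
        (void) write(fd, buf, sizeof (buf));
}

int main(void)
{
    /* Asynchronous path: data stays in memory until the txg commits. */
    int async_fd = open("/mtdc/somefs/async.dat",
        O_WRONLY | O_CREAT | O_TRUNC, 0644);

    /* Synchronous path: every write must reach stable storage via the ZIL. */
    int sync_fd = open("/mtdc/somefs/sync.dat",
        O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0644);

    if (async_fd < 0 || sync_fd < 0) { perror("open"); return 1; }

    fill_blanks(async_fd, 1024);    /* ~128MB, compressed at txg time      */
    fill_blanks(sync_fd, 1024);     /* ~128MB, logged uncompressed first   */

    /* fsync() on an otherwise buffered file forces the same ZIL commit. */
    (void) fsync(async_fd);

    close(async_fd);
    close(sync_fd);
    return 0;
}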
Bart Smaalders
2007-Jan-09 00:11 UTC
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote:
> Quick update: since my original post, I've confirmed via DTrace (the rwtop script in the DTrace toolkit) that the application is not generating 150MB/s * compressratio of I/O. What, then, is causing this much I/O on our system?

Are you doing random I/O? Appending or overwriting?

- Bart

--
Bart Smaalders                 Solaris Kernel Performance
barts at cyber.eng.sun.com     http://blogs.sun.com/barts
Anantha N. Srirama
2007-Jan-09 13:46 UTC
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
I'll see if I can confirm what you are suggesting. Thanks.
Anantha N. Srirama
2007-Jan-10 00:05 UTC
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
I've some important information that should shed some light on this behavior:

This evening I created a new filesystem across the very same 50 disks, also with the COMPRESS attribute. My goal was to isolate some workload to the new filesystem, so I started moving a 100GB directory tree over to the new FS. While I was copying I was averaging around 25MB/s read and 25MB/s write, as expected. *Then I opened 'vi' and wrote out a new file in the new filesystem, and what I saw was shocking: my reads remained the same, but my writes shot up to the 150+MB/s range. This abnormal I/O pattern continued until 'vi' returned from the write request.* Here is the 'zpool iostat mtdc 30' output:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mtdc         806G  2.48T     38    173  1.93M  7.52M
mtdc         806G  2.48T    188    228  15.0M  8.78M
mtdc         807G  2.48T    266    624  14.0M  16.5M
mtdc         807G  2.48T    286    670  17.1M  14.5M
mtdc         807G  2.48T    293  1.21K  18.2M  98.4M  <<-- vi activity, note mismatch in r/w rates
mtdc         808G  2.48T    457    560  35.5M  24.2M
mtdc         809G  2.48T    405    504  31.7M  26.3M
mtdc         809G  2.48T    328  1.37K  25.2M   152M  <<-- vi activity, note mismatch in r/w rates
mtdc         810G  2.48T    428    671  33.0M  48.0M
mtdc         811G  2.48T    463    500  35.9M  26.4M
mtdc         811G  2.48T    207  1.39K  16.5M   154M  <<-- vi activity, note mismatch in r/w rates
mtdc         812G  2.48T    310    878  23.9M  77.7M
mtdc         813G  2.48T    362    494  26.1M  25.3M
mtdc         813G  2.48T    381  1.05K  26.8M   103M
mtdc         814G  2.48T    347  1.33K  25.0M   135M
mtdc         815G  2.48T    288  1.38K  21.7M   150M
mtdc         815G  2.48T    425    513  32.7M  25.8M
mtdc         816G  2.47T    413    515  30.2M  25.1M
mtdc         817G  2.47T    341    512  21.9M  25.1M
mtdc         818G  2.47T    293    529  18.5M  25.5M
mtdc         818G  2.47T    344    508  23.4M  24.7M
mtdc         819G  2.47T    442    512  33.4M  24.1M
mtdc         820G  2.47T    385    483  28.3M  24.4M
mtdc         820G  2.47T    372    483  24.7M  24.7M
mtdc         821G  2.47T    347    535  23.0M  24.2M
mtdc         821G  2.47T    290    497  17.9M  24.9M
mtdc         823G  2.47T    349    517  20.0M  24.1M
mtdc         823G  2.47T    399    512  21.2M  24.5M
mtdc         824G  2.47T    383    612  19.3M  17.7M
mtdc         824G  2.47T    390    614  14.2M  17.5M
Neil Perrin
2007-Jan-10 00:30 UTC
[zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Ah, vi does an fsync. So I suspect that this is bug:

    6413510 zfs: writing to ZFS filesystem slows down fsync() on other files in the same FS

Here's a snippet from the Evaluation:

-----------
ZFS keeps a list in memory of all transactions and will push *all* of them out on an fsync. This includes those not necessarily related to the znode being "fsunk". We consciously designed it this way to avoid possible problems with dependencies between znodes.

This behaviour could also explain the extra fsync load on jurassic (see 6404018), as ZFS can do much more IO for fsyncs. However, I still don't think it's the whole problem.

So it looks like we ought to just flush those changes to the specified znode, and work out the dependencies.
------------

This has been fixed since August and will be available in s10u4. Sorry.

Anantha N. Srirama wrote on 01/09/07 17:05:
> [experiment description and 'zpool iostat mtdc 30' output snipped; see the previous message]
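[Editorial aside: a minimal sketch, not from the bug report, of the pattern Neil describes for 6413510. On affected builds, an fsync() of one small file can push all pending transactions for the filesystem, so a tiny vi-style save pays for the large buffered copy still in flight. Paths and sizes are placeholders; timing the fsync() and watching zpool iostat is left to the reader.]

/*
 * One descriptor stands in for the 100GB directory copy (buffered,
 * uncommitted data), the other for the small file vi writes out.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int big = open("/mtdc/newfs/bigcopy.dat",
        O_WRONLY | O_CREAT | O_TRUNC, 0644);
    int small = open("/mtdc/newfs/small.txt",
        O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (big < 0 || small < 0) { perror("open"); return 1; }

    char buf[128 * 1024];
    memset(buf, 'x', sizeof (buf));
    for (int i = 0; i < 4096; i++)          /* ~512MB of dirty, uncommitted data */
        (void) write(big, buf, sizeof (buf));

    (void) write(small, "hello\n", 6);

    /*
     * On builds with 6413510 this fsync() drags the big file's dirty
     * data out with it, producing a write burst far larger than 6 bytes.
     */
    (void) fsync(small);

    close(big);
    close(small);
    return 0;
}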