Thanks to the ZFS folks for completing this file system. It's amazing. My question is about performance on what old-style file systems would view as large sequential reads (backups, database full table scans, etc.). Now that the data is not laid out sequentially, what kind of hit should we expect? Specifically, should we avoid ZFS for our large data warehouses or other DSS systems? Thanks!!

-nathan
On Wed, Nov 16, 2005 at 04:31:21PM -0800, nathan wrote:
> Thanks to the ZFS folks for completing this file system. It's amazing.

Thanks. We're glad to get the monkey off our back, so to speak. :)

> My question is about performance on what old-style file systems would view
> as large sequential reads (backups, database full table scans, etc.). Now
> that the data is not laid out sequentially, what kind of hit should we
> expect? Specifically, should we avoid ZFS for our large data warehouses or
> other DSS systems?

We view any performance deficiencies as bugs, and will treat them as such.

To answer your question, the real advantage of ZFS is that a random write workload becomes sequential writes on disk (really fast). As you point out, a sequential read then becomes random reads. The thing to note, however, is that when you're doing sequential access to a file, we can easily predict the next N blocks that will be accessed and send all the I/O requests down to our scheduler to pull them off disk in an efficient and non-random fashion. So it's not as bad as you think.

And there is no way to predict the order of random writes and somehow optimize them. You could hold on to them for a while and hope to batch the I/O that way, but then you're making the worst tradeoff ever: giving up correctness for improved performance.

To help ZFS even further, large database installations have been transitioning the workload they present to disk over time. As memory has gotten larger and cheaper, the I/O the disk subsystem sees has changed from being mostly reads (about 15 years ago), to a read/write mix (about 5-10 years ago), to being mostly writes these days. So the thing you are now most interested in optimizing is random writes, which is what we do really well.

--Bill
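For anyone who wants to watch this behaviour on their own system, the standard observability tools are enough. A rough sketch (the pool name "tank" below is just a placeholder) is to run a random-write workload against a file in the pool and, in another terminal, watch how the I/O actually reaches the disks:

    zpool iostat tank 5    # pool-wide read/write bandwidth at 5 second intervals
    iostat -xn 5           # per-device view; dividing kw/s by w/s gives the average
                           # size of the writes hitting each disk, which should stay
                           # relatively large even when the application is issuing
                           # small random writes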
> > My question is about performance on what old-style file systems would view
> > as large sequential reads (backups, database full table scans, etc.). Now
> > that the data is not laid out sequentially, what kind of hit should we
> > expect? Specifically, should we avoid ZFS for our large data warehouses or
> > other DSS systems?
>
> We view any performance deficiencies as bugs, and will treat them as such.
>
> To answer your question, the real advantage of ZFS is that a random
> write workload becomes sequential writes on disk (really fast). As you
> point out, a sequential read then becomes random reads. The thing to
> note, however, is that when you're doing sequential access to a file,
> we can easily predict the next N blocks that will be accessed and send all
> the I/O requests down to our scheduler to pull them off disk in an
> efficient and non-random fashion. So it's not as bad as you think.

I should've paid more attention to the information you all have already put out on the scheduler!

> And there is no way to predict the order of random writes and somehow
> optimize them. You could hold on to them for a while and hope to batch
> the I/O that way, but then you're making the worst tradeoff ever:
> giving up correctness for improved performance.

Brilliant point. May I quote you when I propose to immediately change our datacenter to OpenSolaris b27 w/ ZFS? ;-)

> To help ZFS even further, large database installations have been
> transitioning the workload they present to disk over time. As memory
> has gotten larger and cheaper, the I/O the disk subsystem sees has
> changed from being mostly reads (about 15 years ago), to a read/write mix
> (about 5-10 years ago), to being mostly writes these days. So the thing
> you are now most interested in optimizing is random writes, which is
> what we do really well.

Unfortunately, the data warehouses I've worked with are mostly read, which it sounds like you may have already figured out. Thanks again for your response.

-nathan

In case it matters, our read workload is: from the source databases to the data warehouse's staging database, I see up to 80MB/s of large, sequential reads from the sources (that temporarily changes the OLTP sources' mix from 80% write to 80% read). After the transform and load into the DW, batch reports on the warehouse generate over 200MB/s of large sequential reads (1MB operations) over a few hours. The warehouse may be writing for the initial load, but after that it's all reads.
Does ZFS reorganize (i.e., defrag) the files over time? If it doesn't, it might not perform well in "write-little, read-much" scenarios (where read performance is much more important than write performance). Thank you.
On Thu, Nov 17, 2005 at 05:21:36AM -0800, Jim Lin wrote:
> Does ZFS reorganize (i.e., defrag) the files over time?

Not yet.

> If it doesn't, it might not perform well in "write-little, read-much"
> scenarios (where read performance is much more important than write
> performance).

As always, the correct answer is "it depends". Let's take a look at several cases:

- Random reads: Whether the data was written randomly or sequentially, random reads are random for any filesystem, regardless of its layout policy. Not much you can do to optimize these, except have the best I/O scheduler possible.

- Sequential writes, sequential reads: With ZFS, sequential writes lead to sequential layout on disk, so sequential reads will perform quite well in this case.

- Random writes, sequential reads: This is the most interesting case. With random writes, ZFS turns them into sequential writes, which go *really* fast. With sequential reads, you know which order the reads are going to be coming in, so you can kick off a bunch of prefetch reads. Again, with a good I/O scheduler (which ZFS just happens to have), you can turn this into good read performance, if not entirely as good as totally sequential.

Believe me, we've thought about this a lot. There is a lot we can do to improve performance, and we're just getting started.

--Bill
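To make the "random writes, sequential reads" case concrete, here is one crude way to set it up on a scratch pool (the pool name, file size, and block size below are arbitrary, and the loop uses bash/ksh arithmetic; the dd loop is a very slow stand-in for a real random-write benchmark, but that doesn't matter because only the read-back times are being compared):

    # write one file purely sequentially, and rewrite a second one at random offsets
    dd if=/dev/zero of=/tank/seq  bs=1024k count=1024
    dd if=/dev/zero of=/tank/rand bs=1024k count=1024
    i=0
    while [ $i -lt 8192 ]; do
        # 8192 random 128k overwrites scattered across the 1GB file
        dd if=/dev/zero of=/tank/rand bs=128k count=1 \
            seek=$((RANDOM % 8192)) conv=notrunc 2>/dev/null
        i=$((i+1))
    done
    # export/import so the reads come from disk rather than the cache,
    # then compare sequential read-back times for the two files
    zpool export tank
    zpool import tank
    time -p dd if=/tank/seq  of=/dev/null bs=1024k
    time -p dd if=/tank/rand of=/dev/null bs=1024k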
Thank you very much. That was very interesting.
Jason Ozolins
2005-Nov-23 06:52 UTC
[zfs-discuss] Re: Re: Old-style sequential read performance
> - Sequential writes, sequential reads: With ZFS, sequential writes
> lead to sequential layout on disk, so sequential reads will
> perform quite well in this case.

I was just testing sequential read performance and saw some really strange behaviour. Read performance from a pool of disk slices was much *lower* than from a raidz pool with the same number of slices, on the same disks!

Here's the setup: two pools, /tank (no redundancy) and /pool (raid-z), created from parallel slices across 4 physical disks:

-bash-3.00# zpool status
  pool: pool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz     ONLINE       0     0     0
            c1d0s7  ONLINE       0     0     0
            c2d0s7  ONLINE       0     0     0
            c3d0s7  ONLINE       0     0     0
            c4d0s7  ONLINE       0     0     0

  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          c1d0s3    ONLINE       0     0     0
          c2d0s3    ONLINE       0     0     0
          c3d0s3    ONLINE       0     0     0
          c4d0s3    ONLINE       0     0     0

/pool has a bit of data in it (<10GB), /tank was newly created.

-bash-3.00# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 122.82
user 0.03
sys 12.62
-bash-3.00# time -p dd of=/dev/null if=/tank/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 314.02
user 0.07
sys 9.22

I repeated this, and the second run took 305 seconds. Oh well, I think to myself, 52MB/sec reads is kind of okay... what's it like for the raidz pool?

-bash-3.00# time -p dd if=/dev/zero of=/pool/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 218.88
user 0.08
sys 32.73
-bash-3.00# time -p dd of=/dev/null if=/pool/tf bs=1024k count=16384
16384+0 records in
16384+0 records out
real 229.50
user 0.08
sys 21.13

So writing is slower, which I can expect, but read speed is a third *faster* than for the plain striped /tank. This seemed odd enough to be worth mentioning.

-Jason =:^)
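For reference, the throughput those timings work out to (16384 MB moved in each run, taking 1MB = 1024k as dd does):

    striped /tank, write:  16384 MB / 122.82 s  ~ 133 MB/sec
    striped /tank, read:   16384 MB / 314.02 s  ~  52 MB/sec
    raid-z  /pool, write:  16384 MB / 218.88 s  ~  75 MB/sec
    raid-z  /pool, read:   16384 MB / 229.50 s  ~  71 MB/sec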
Al Hopper
2005-Nov-23 13:25 UTC
[zfs-discuss] Re: Re: Old-style sequential read performance
On Tue, 22 Nov 2005, Jason Ozolins wrote:

> > - Sequential writes, sequential reads: With ZFS, sequential writes
> > lead to sequential layout on disk, so sequential reads will
> > perform quite well in this case.

.... reformatted ....

> I was just testing sequential read performance and saw some really
> strange behaviour. Read performance from a pool of disk slices was much
> *lower* than from a raidz pool with the same number of slices, on
> the same disks!
>
> Here's the setup: two pools, /tank (no redundancy) and /pool (raid-z),
> created from parallel slices across 4 physical disks:
>
> -bash-3.00# zpool status
>   pool: pool
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         pool        ONLINE       0     0     0
>           raidz     ONLINE       0     0     0
>             c1d0s7  ONLINE       0     0     0
>             c2d0s7  ONLINE       0     0     0
>             c3d0s7  ONLINE       0     0     0
>             c4d0s7  ONLINE       0     0     0
>
>   pool: tank
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         tank        ONLINE       0     0     0
>           c1d0s3    ONLINE       0     0     0
>           c2d0s3    ONLINE       0     0     0
>           c3d0s3    ONLINE       0     0     0
>           c4d0s3    ONLINE       0     0     0
>
> /pool has a bit of data in it (<10GB), /tank was newly created.
>
> -bash-3.00# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 122.82
> user 0.03
> sys 12.62
> -bash-3.00# time -p dd of=/dev/null if=/tank/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 314.02
> user 0.07
> sys 9.22
>
> I repeated this, and the second run took 305 seconds. Oh well, I think
> to myself, 52MB/sec reads is kind of okay... what's it like for the raidz
> pool?
>
> -bash-3.00# time -p dd if=/dev/zero of=/pool/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 218.88
> user 0.08
> sys 32.73
> -bash-3.00# time -p dd of=/dev/null if=/pool/tf bs=1024k count=16384
> 16384+0 records in
> 16384+0 records out
> real 229.50
> user 0.08
> sys 21.13
>
> So writing is slower, which I can expect, but read speed is a third
> *faster* than for the plain striped /tank. This seemed odd
> enough to be worth mentioning.

Most modern disks have a variable number of sectors per track. If you're on the cylinders close to the outer edge of the disk (platters), then you have a physically longer track than you do if you're on an inner track. The drive maintains a constant number of magnetic transitions per linear distance (bit density), so more sectors are written on the outer tracks than on the inner tracks. Usually the disk is divided into different zones, and the number of sectors per track is the same within each zone. In the old days it used to be 4 or 5 zones, but now there are many more zones (I don't have a good number - but think in terms of 25 or 30).

So the issue with the above test is that you're running it on different cylinder ranges, which have different bit densities and, therefore, will give different results. To make the test meaningful, you'd have to create pool A using some set of slices, test, destroy pool A. Then create pool B using the same set of slices and test again.

Please let us know the results.

PS: look at:
http://www.tomshardware.com/storage/20051117/new_toshiba_sata_drives_good_for_the_mainstream-06.html
and the read or write performance charts. Taking one drive (randomly), for example the Hitachi TravelStar 7K100: write performance is 53MB/sec on the outer cylinders and 26MB/sec on the inner cylinders. It's not unusual to see a (close to) 2:1 ratio in performance between the outermost/innermost cylinders.
Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
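In command terms, the apples-to-apples test suggested above would look something like the following sketch (reusing the s3 slices from the earlier post; zpool destroy/create wipes the pool, so this is only for scratch data):

    zpool destroy tank
    zpool create tank c1d0s3 c2d0s3 c3d0s3 c4d0s3           # pool A: plain stripe
    time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
    time -p dd if=/tank/tf of=/dev/null bs=1024k count=16384
    zpool destroy tank
    zpool create tank raidz c1d0s3 c2d0s3 c3d0s3 c4d0s3     # pool B: raid-z on the same slices
    time -p dd if=/dev/zero of=/tank/tf bs=1024k count=16384
    time -p dd if=/tank/tf of=/dev/null bs=1024k count=16384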
Jason Ozolins
2005-Nov-23 23:41 UTC
[zfs-discuss] Re: Re: Old-style sequential read performance
Al Hopper wrote:
> On Tue, 22 Nov 2005, Jason Ozolins wrote:
> .... reformatted ....

Oops, sorry about that - I was using the ZFS forum web page to compose the posting, and it munged the indentation on the "zpool status" output. Thanks for fixing it up!

> Most modern disks have a variable number of sectors per track. [...]
> So the issue with the above test is that you're running it on different
> cylinder ranges, which have different bit densities and, therefore, will
> give different results. To make the test meaningful, you'd have to create
> pool A using some set of slices, test, destroy pool A. Then create pool B
> using the same set of slices and test again.

Fair enough. I realised last night that I should have made explicit that slice 3 on all the disks actually covers a lower cylinder range than slice 7. Given that the component slices of /tank (striped) were all on lower-numbered cylinders than the component slices of /pool (raid-z), I would expect that I/O on /tank should see better media transfer rates than I/O on /pool. The fact that the difference went the other way is part of what surprised me.

> Please let us know the results.

I just tried a couple of experiments. First, because /tank is only 23GB, I wondered if creating a 16GB file was filling the filesystem to the point where the allocation policy might cause the file to be fragmented. Unlikely, but anyway, I removed the old file and tried again with a 4GB file. The machine only has 1GB of memory, so there's unlikely to be much difference due to cache between the 4GB and 16GB sequential read cases.

# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=4096
4096+0 records in
4096+0 records out
real 32.34
user 0.01
sys 6.31
# time -p dd if=/tank/tf bs=1024k count=4096 of=/dev/null
4096+0 records in
4096+0 records out
real 73.45
user 0.01
sys 2.28

Roughly the same read rate (56MB/sec) as for the 16GB test file (52MB/sec). Okay, let's re-create /tank as RAID-Z, using the same slices:

# zpool destroy tank
# zpool create tank raidz c1d0s3 c2d0s3 c3d0s3 c4d0s3
# time -p dd if=/dev/zero of=/tank/tf bs=1024k count=4096
4096+0 records in
4096+0 records out
real 39.48
user 0.02
sys 9.52
# time -p dd if=/tank/tf bs=1024k count=4096 of=/dev/null
4096+0 records in
4096+0 records out
real 49.28
user 0.02
sys 2.91

Now we get a read rate of ~83MB/sec, and it's definitely not down to zoned recording.

Machine details, FWIW:
Socket 754 Athlon 64 3000+ (1MB cache) on an Asus K8N-E motherboard
NForce 3-250 chipset
1GB of memory
4 * Seagate ST3160827AS 160GB SATA drives
c1d0 and c2d0 attached to the NForce controller
c3d0 and c4d0 attached to a Silicon Image 3114 controller

The machine was idle when all these tests were carried out.

> [...] It's not unusual to
> see a (close to) 2:1 ratio in performance between the outermost/innermost
> cylinders.

For sure. I knew about zoned recording (first heard of the concept in 1984 on the Commodore PET floppy drive ;-), but there's something else at work here.

Cheers,
Jason =:^)

--
Jason.Ozolins at anu.edu.au           ANU Supercomputer Facility
APAC Data Grid Program                Leonard Huxley Bldg 56, Mills Road
Ph:  +61 2 6125 5449                  Australian National University
Fax: +61 2 6125 8199                  Canberra, ACT, 0200, Australia