Scott Lovenberg
2007-Jul-07 02:32 UTC
[zfs-discuss] ZFS Performance as a function of Disk Slice
First Post! Sorry, I had to get that out of the way to break the ice... I was wondering if it makes sense to zone ZFS pools by disk slice, and if it makes a difference with RAIDZ. As I'm sure we're all aware, the end of a drive is half as fast as the beginning ([i]where the zoning stipulates that the physical outside of the platter is the beginning, and the block addresses increase as you move toward the spindle[/i]).

I usually short-stroke my drives so that the variable files on the operating system drive are at the beginning, the page/swap area is in the center (so if I'm already thrashing I'm at most half a platter's width from page), and static files are towards the end. Applying this methodology to ZFS: I partition a drive into 4 equal-sized quarters, do this to 4 drives (each on a separate SATA channel), and then create 4 pools, each holding one 'ring' of the drives. Will I then have 4 RAIDZ pools, which I can mount according to speed needs? For instance, I always put (in Linux... I'm new to Solaris) '/export/archive' all the way on the slow tracks, since I don't read or write to it often and it is almost never accessed at the same time as anything else that would force long strokes.

Ideally, I'd like to do a straight ZFS on the archive track. I move data to archive in chunks, 4 GB at a time - when I roll it in I burn 2 DVDs; one gets catalogued locally and the other goes offsite, so if I lose the data, I don't care - but ZFS gives me the ability to snapshot to archive (I assume that works across pools?). Then stripe one ring (I guess this is ZFS native?) for /usr/local (or its Solaris equivalent), for performance. Then mirror the root slice. Finally, /export would be RAIDZ or RAIDZ2 on the fastest track, holding my source code, large files, and things I want to stream over the LAN.

Does this make sense with ZFS? Is the spindle count more of a factor than stroke latency? Does ZFS balance these things out on its own via random scattering?

Reading back over this post, I've found it sounds like the ramblings of a madman. I guess I know what I want to say, but I'm not sure of the right questions to ask. I think I'm saying: will my proposed setup afford me the flexibility to zone for performance, since I have a more intimate knowledge of the data going onto the drives, or will brute force by spindle count (I'm planning 4-6 drives - a single drive per bus) and random placement be sufficient if I just add the whole drives to a single pool?

I thank you all for your time and patience as I stumble through this, and I welcome any points of view or insights (especially those from experience!) that might help me decide how to configure my storage server.
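Something like this is what I have in mind, if it helps make the question concrete - a rough sketch only, since I'm still learning the Solaris side, and the device names (c1t0d0 through c1t3d0, each sliced into s0-s3) and pool names are just placeholders:

    # fastest ring (outer tracks): raidz across slice 0 of all four disks
    zpool create export raidz c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0

    # next ring: plain stripe for /usr/local-style scratch space
    zpool create scratch c1t0d0s1 c1t1d0s1 c1t2d0s1 c1t3d0s1

    # middle ring: a mirrored pool (where I'd like the OS/root data to end up,
    # however that is actually done on Solaris)
    zpool create rootpool mirror c1t0d0s2 c1t1d0s2

    # slowest ring (inner tracks): simple pool for the archive
    zpool create archive c1t0d0s3 c1t1d0s3 c1t2d0s3 c1t3d0s3
    zfs set mountpoint=/export/archive archive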
Darren Dunham
2007-Jul-07 05:22 UTC
[zfs-discuss] ZFS Performance as a function of Disk Slice
> [...] ZFS gives me the ability to snapshot to archive (I assume it
> works across pools?).

No. Snapshots are only within a pool. Pools are independent storage arenas.

--
Darren Dunham                                ddunham at taos.com
Senior Technical Consultant         TAOS     http://www.taos.com/
Got some Dr Pepper?                          San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
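If the goal is just to get a copy of a snapshot's contents into a different pool, zfs send piped into zfs receive is the usual route. A rough sketch, with made-up pool and filesystem names (tank/export, archive/export-copy):

    # take a snapshot in the source pool, then replicate it into another pool
    zfs snapshot tank/export@tosend
    zfs send tank/export@tosend | zfs receive archive/export-copy

The received copy is an independent filesystem in the target pool, so losing the source pool does not touch it.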
Richard Elling
2007-Jul-07 14:37 UTC
[zfs-discuss] ZFS Performance as a function of Disk Slice
Scott Lovenberg wrote:
> First Post!
> Sorry, I had to get that out of the way to break the ice...

Welcome!

> I was wondering if it makes sense to zone ZFS pools by disk slice, and if it makes a difference with RAIDZ. As I'm sure we're all aware, the end of a drive is half as fast as the beginning ([i]where the zoning stipulates that the physical outside of the platter is the beginning, and the block addresses increase as you move toward the spindle[/i]).

IMHO, it makes sense to short-stroke if you are looking for the best performance. But raidz (or RAID-5) will not give you the best performance. You'd be better off mirroring for performance.

> I usually short-stroke my drives so that the variable files on the operating system drive are at the beginning, the page/swap area is in the center (so if I'm already thrashing I'm at most half a platter's width from page), and static files are towards the end. Applying this methodology to ZFS: I partition a drive into 4 equal-sized quarters, do this to 4 drives (each on a separate SATA channel), and then create 4 pools, each holding one 'ring' of the drives. Will I then have 4 RAIDZ pools, which I can mount according to speed needs? For instance, I always put (in Linux... I'm new to Solaris) '/export/archive' all the way on the slow tracks, since I don't read or write to it often and it is almost never accessed at the same time as anything else that would force long strokes.
>
> Ideally, I'd like to do a straight ZFS on the archive track. I move data to archive in chunks, 4 GB at a time - when I roll it in I burn 2 DVDs; one gets catalogued locally and the other goes offsite, so if I lose the data, I don't care - but ZFS gives me the ability to snapshot to archive (I assume that works across pools?). Then stripe one ring (I guess this is ZFS native?) for /usr/local (or its Solaris equivalent), for performance. Then mirror the root slice. Finally, /export would be RAIDZ or RAIDZ2 on the fastest track, holding my source code, large files, and things I want to stream over the LAN.
>
> Does this make sense with ZFS? Is the spindle count more of a factor than stroke latency? Does ZFS balance these things out on its own via random scattering?

Spindle count almost always wins for performance. Note: bandwidth usually isn't the source of perceived performance problems, latency is. We believe that this has implications for ZFS over time due to COW, but nobody has characterized this yet.

> Reading back over this post, I've found it sounds like the ramblings of a madman. I guess I know what I want to say, but I'm not sure of the right questions to ask. I think I'm saying: will my proposed setup afford me the flexibility to zone for performance, since I have a more intimate knowledge of the data going onto the drives, or will brute force by spindle count (I'm planning 4-6 drives - a single drive per bus) and random placement be sufficient if I just add the whole drives to a single pool?

Yes :-)  YMMV.

> I thank you all for your time and patience as I stumble through this, and I welcome any points of view or insights (especially those from experience!) that might help me decide how to configure my storage server.

KISS. There are trade-offs for space, performance, and RAS. We have models to describe these, so you might check out my blogs on the subject.
http://blogs.sun.com/relling
 -- richard
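To make the mirroring comparison concrete, this is the kind of layout choice involved, sketched with hypothetical whole-disk device names (c1t0d0 through c4t0d0); the two commands are alternatives, not steps to run together:

    # pool of two 2-way mirrors: half the raw capacity, best small random I/O
    zpool create tank mirror c1t0d0 c2t0d0 mirror c3t0d0 c4t0d0

    # single raidz set over the same disks: more usable space, but a small
    # random read touches the whole stripe, so roughly one disk's worth of IOPS
    zpool create tank raidz c1t0d0 c2t0d0 c3t0d0 c4t0d0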
Scott Lovenberg
2007-Jul-08 20:09 UTC
[zfs-discuss] ZFS Performance as a function of Disk Slice
Thank you for your quick responses! I was unable to get back to this thread on account of being stuck on a motorcycle yesterday (still can't feel my legs!).

I think the KISS principle applies to 95% of computing (keeping in mind that 90% of everything is crap ;)). I've read Relling's blogs with great interest (hey, the whole industry isn't insane!). I'm very glad that there are people out there who know so much more than I do and are willing to share that knowledge; I think that's the beauty of open source philosophies.

I agree that RAID-Z won't provide the best performance, but I'm willing to trade performance for redundancy via parity. When I go through the mental scenario of realizing that I've just lost all my source code to a failed drive, the sickening feeling that settles in outweighs the performance penalty!

However, I have one more question - do you guys think NCQ with short-stroked zones helps or hurts performance? I have this feeling (my gut, that is) that at a low queue depth it's a great win, whereas at a deeper queue it would degrade performance more than going without it. Any thoughts?
eric kustarz
2007-Jul-09 18:07 UTC
[zfs-discuss] ZFS Performance as a function of Disk Slice
> However, I have one more question - do you guys think NCQ with short-stroked zones helps or hurts performance? I have this feeling (my gut, that is) that at a low queue depth it's a great win, whereas at a deeper queue it would degrade performance more than going without it. Any thoughts?

Depends on the workload. In general NCQ helps random reads and hurts sequential reads:
http://blogs.sun.com/erickustarz/entry/ncq_performance_analysis

eric
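For what it's worth, a crude way to see which side of that trade-off a given setup falls on is to time a sequential pass against a batch of scattered single-block reads over the same large file. A rough sketch only (the file name /tank/bigfile is made up; run it under bash, since it relies on $RANDOM and on timing a compound command; the file should be several GB so the offsets land inside it and it isn't all sitting in the ARC):

    # sequential read of a large test file
    time dd if=/tank/bigfile of=/dev/null bs=128k

    # crude random-read pass: 1000 single 8k reads at offsets scattered
    # across roughly the first 4 GB of the file
    time {
        i=0
        while [ $i -lt 1000 ]; do
            dd if=/tank/bigfile of=/dev/null bs=8k count=1 \
                skip=$((RANDOM * 16)) >/dev/null 2>&1
            i=$((i + 1))
        done
    }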
You sir, are a gentleman and a scholar! Seriously, this is exactly the information I was looking for, thank you very much!

Would you happen to know if this has improved since build 63, or if the chipset has any effect one way or the other?
On Jul 9, 2007, at 11:21 AM, Scott Lovenberg wrote:

> You sir, are a gentleman and a scholar! Seriously, this is exactly the information I was looking for, thank you very much!
>
> Would you happen to know if this has improved since build 63, or if the chipset has any effect one way or the other?

Naw. Without having information on how exactly the controller/disk firmware really works, we're merely speculating that the firmware is where the problem is. Getting that information from the disk vendors is <ahem> tricky.

More investigation is needed.

eric
Thank you very much, this answers all my questions! Much appreciated!
eric kustarz wrote:
> On Jul 9, 2007, at 11:21 AM, Scott Lovenberg wrote:
>
>> You sir, are a gentleman and a scholar! Seriously, this is exactly the information I was looking for, thank you very much!
>>
>> Would you happen to know if this has improved since build 63, or if the chipset has any effect one way or the other?
>
> Naw. Without having information on how exactly the controller/disk firmware really works, we're merely speculating that the firmware is where the problem is. Getting that information from the disk vendors is <ahem> tricky.

Unfortunately, testing has not supported the theory that the problem is with the controller hardware, driver, disk, or disk firmware. So far every valid measurement comparing FPDMA READ/WRITE (NCQ) against READ/WRITE DMA EXT has shown anywhere from less than 1% improvement using NCQ up to 22% improvement. The biggest improvements are seen when the disk caches are disabled, but I have measured up to 19% improvement w.r.t. time spent waiting for I/Os to complete with the caches enabled.

> More investigation is needed.

Absolutely, more investigation is needed.

> eric
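For anyone who wants to watch the same effect on their own box while a test runs, this is roughly how I would eyeball it (a sketch, not the harness behind the numbers above): iostat's extended output breaks out per-device wait and active service times.

    # one line per busy device every 5 seconds; wsvc_t and asvc_t are the
    # average wait-queue and active service times in milliseconds, and %w/%b
    # show how often transactions are waiting or the device is busy
    iostat -xnz 5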
> eric kustarz wrote:
>> On Jul 9, 2007, at 11:21 AM, Scott Lovenberg wrote:
>>
>>> You sir, are a gentleman and a scholar! Seriously, this is exactly the information I was looking for, thank you very much!
>>>
>>> Would you happen to know if this has improved since build 63, or if the chipset has any effect one way or the other?
>>
>> Naw. Without having information on how exactly the controller/disk firmware really works, we're merely speculating that the firmware is where the problem is. Getting that information from the disk vendors is <ahem> tricky.
>
> Unfortunately, testing has not supported the theory that the problem is with the controller hardware, driver, disk, or disk firmware. So far every valid measurement comparing FPDMA READ/WRITE (NCQ) against READ/WRITE DMA EXT has shown anywhere from less than 1% improvement using NCQ up to 22% improvement. The biggest improvements are seen when the disk caches are disabled, but I have measured up to 19% improvement w.r.t. time spent waiting for I/Os to complete with the caches enabled.
>
>> More investigation is needed.
>
> Absolutely, more investigation is needed.
>
>> eric

Just a thought or two off the top of my head: is the caching daemon (bdflush, or something to that effect) running when you are performing these tests? I think it flushes every 20 or 30 seconds by default, IIRC?

I'm not sure, but this sounds like a buffering thing, where it's waiting for a full buffer before flushing the changes. Are these disks on ATA/DMA/UDMA/PIO, SATA, or SCSI interfaces? Are these disks Western Digitals? I've heard their caching algorithms aren't optimized at all (strictly hearsay).

It could be a delay on the channel if it's PATA and the other drive on the channel is being accessed...

Perhaps this is a cache coherency problem (is the arch x86, IA-32/64, SPARC, PPC... single-socket or SMP... memory timings?)?
Scott Lovenberg wrote:
>> eric kustarz wrote:
>>> On Jul 9, 2007, at 11:21 AM, Scott Lovenberg wrote:
>>>
>>>> You sir, are a gentleman and a scholar! Seriously, this is exactly the information I was looking for, thank you very much!
>>>>
>>>> Would you happen to know if this has improved since build 63, or if the chipset has any effect one way or the other?
>>>
>>> Naw. Without having information on how exactly the controller/disk firmware really works, we're merely speculating that the firmware is where the problem is. Getting that information from the disk vendors is <ahem> tricky.
>>
>> Unfortunately, testing has not supported the theory that the problem is with the controller hardware, driver, disk, or disk firmware. So far every valid measurement comparing FPDMA READ/WRITE (NCQ) against READ/WRITE DMA EXT has shown anywhere from less than 1% improvement using NCQ up to 22% improvement. The biggest improvements are seen when the disk caches are disabled, but I have measured up to 19% improvement w.r.t. time spent waiting for I/Os to complete with the caches enabled.
>>
>>> More investigation is needed.
>>
>> Absolutely, more investigation is needed.
>>
>>> eric
>
> Just a thought or two off the top of my head: is the caching daemon (bdflush, or something to that effect) running when you are performing these tests? I think it flushes every 20 or 30 seconds by default, IIRC?
>
> I'm not sure, but this sounds like a buffering thing, where it's waiting for a full buffer before flushing the changes. Are these disks on ATA/DMA/UDMA/PIO, SATA, or SCSI interfaces? Are these disks Western Digitals? I've heard their caching algorithms aren't optimized at all (strictly hearsay).

Given we are talking about NCQ, which is a SATA-only feature, we are talking about SATA controllers and disks. Also, these I/Os are being scheduled to be done immediately, and the discussion was about sequential reading from one or more ZFS files. No writes.

> It could be a delay on the channel if it's PATA and the other drive on the channel is being accessed...
>
> Perhaps this is a cache coherency problem (is the arch x86, IA-32/64, SPARC, PPC... single-socket or SMP... memory timings?)?

The vast majority of tests were done using Opteron-based machines (mostly Sun Fire X4500), but not entirely.
Erm, yeah, sorry about that (previous stupid questions) - I wrote it before having my first cup of coffee... Thanks for the details, though. If you guys have any updates, please drop a link to the new info in this thread (I'll do the same if I find out anything more), as I have it on my watch list. Thank you all again for your time!
On 18-Jul-07, at 8:38 PM, Scott Lovenberg wrote:

> Erm, yeah, sorry about that (previous stupid questions) - I wrote it before having my first cup of coffee... Thanks for the details, though. If you guys have any updates, please drop a link to the new info in this thread

I hate to be a list cop - usually they're gunning for me - but changing the subject to something generic (like "Yeah...") while at the same time removing all context has made it very difficult (for me at least) to follow your recent threads. In particular, it wouldn't be a good idea to follow up under this subject, because there is in fact no thread to follow, except by detective work.

> (I'll do the same if I find out anything more), as I have it on my watch list. Thank you all again for your time!