Using raidz or raidz2 in ZFS, do all the disks have to be the same size?

This message posted from opensolaris.org
Hello Kory,

Monday, March 19, 2007, 4:47:27 PM, you wrote:

KW> Using raidz or raidz2 in ZFS, do all the disks have to be the same size?

No, they don't have to be the same size. However, all disks will be treated as if they were the size of the smallest one, and once you replace (online) all of the disks with bigger ones, the pool size will automatically expand to the new common size.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
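A minimal sketch of the replace-and-grow workflow Robert describes (the pool and device names are assumptions; on later ZFS releases the final growth step additionally requires the autoexpand pool property):

  # Replace each smaller disk with a larger one, letting each resilver finish.
  zpool replace tank c1t0d0 c2t0d0
  zpool status tank          # wait for the resilver to complete before the next replace
  zpool replace tank c1t1d0 c2t1d0
  zpool replace tank c1t2d0 c2t2d0
  # Once the last small disk is gone, the pool reports the new, larger size.
  zpool list tank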
Hi Kory,

No, they don't have to be the same size. But the pool size will be constrained by the smallest disk, which might not be the best use of your disk space. See the output below. I'd be better off mirroring the two 136-GB disks and using the 4-GB disk for something else. :-)

Cindy

c0t0d0  =   4 GB
c1t17d0 = 136 GB
c1t18d0 = 136 GB

# zpool create rpool raidz2 c0t0d0 c1t17d0 c1t18d0
# zpool list
NAME    SIZE   USED  AVAIL  CAP  HEALTH  ALTROOT
rpool  11.9G   243K  11.9G   0%  ONLINE  -

Kory Wheatley wrote:
> Using raidz or raidz2 in ZFS, do all the disks have to be the same size?
The reason for this question is that we currently have our disks set up in a hardware RAID5 on an EMC device, and these disks are configured as a ZFS file system. Would it benefit us to have the disks set up as a raidz on top of the hardware RAID5 that is already there? Or would this double RAID slow our performance, with both a software and a hardware RAID setup? Or would a raidz setup be better than the hardware RAID5 setup?

Also, if we do set the disks up as a raidz, would it benefit us more to specify each disk in the raidz, or to create them as LUNs and then build the raidz on those?

This message posted from opensolaris.org
Hello Kory,

Tuesday, March 20, 2007, 4:38:03 PM, you wrote:

KW> The reason for this question is that we currently have our disks set up
KW> in a hardware RAID5 on an EMC device, and these disks are configured
KW> as a ZFS file system. Would it benefit us to have the disks set up
KW> as a raidz on top of the hardware RAID5 that is already there?
KW> Or would this double RAID slow our performance, with both a software
KW> and a hardware RAID setup? Or would a raidz setup be better than the
KW> hardware RAID5 setup?

KW> Also, if we do set the disks up as a raidz, would it benefit us more
KW> to specify each disk in the raidz, or to create them as LUNs and then
KW> build the raidz on those?

RAIDZ vs. HW RAID5 - generally you can expect much worse performance for parallel, small, random reads and better performance in other cases.

RAIDZ on top of HW RAID5 - well, it really depends on whether the performance hit and the storage-capacity cost are acceptable to you. Then there's the somewhat lacking hot-spare support in ZFS right now.

If raidz performance is acceptable to you, I would go with each disk presented as a LUN and then put raidz or raidz2 on top of them, plus hot spares.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
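A minimal sketch of the layout Robert suggests, assuming each EMC disk is exported to the host as its own LUN (the c4tNd0 names and the vdev width are assumptions):

  # raidz2 across six single-disk LUNs, with a seventh LUN as a hot spare
  zpool create tank raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0 spare c4t6d0
  zpool status tank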
(I'm probably not the best person to answer this, but that has never stopped me before, and I need to give Richard Elling a little more time to get the Goats, Cows and Horses fed, sip his morning coffee, and offer a proper response...)

> Would it benefit us to have the disks set up as a raidz on top of the hardware RAID5 that is already there?

Way back when, we called such configurations "plaiding", which described a host-based RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had potentially better data availability with a configuration that could survive more failure modes. Alternatively, we used the hardware RAID for the availability configuration (hardware RAID5), and used host-based RAID to stripe across hardware RAID5 LUNs for performance. Seemed to work pretty well.

In theory, a raidz pool spread across some number of underlying hardware RAID5 LUNs would offer protection against more failure modes, such as the loss of an entire RAID5 LUN. So from a failure protection/data availability point of view, it offers some benefit. Now, whether or not you experience a real, measurable benefit over time is hard to say. Each additional level of protection/redundancy has a diminishing return, oftentimes at a dramatic incremental cost (e.g. getting from "four nines" to "five nines").

> Or would this double RAID slow our performance, with both a software and a hardware RAID setup?

You will certainly pay a performance penalty - using raidz across the RAID5 LUNs will reduce deliverable IOPS from the RAID5 LUNs. Whether or not the performance trade-off is worth the RAS gain depends on your RAS and data availability requirements.

> Or would a raidz setup be better than the hardware RAID5 setup?

Assuming a robust RAID5 implementation with battery-backed NVRAM (to protect against the "write hole" and partial-stripe writes), I think a raidz zpool covers more of the datapath than a hardware RAID5 LUN, but I'll wait for Richard to elaborate here (or tell me I'm wrong).

> Also, if we do set the disks up as a raidz, would it benefit us more to specify each disk in the raidz, or to create them as LUNs and then build the raidz on those?

Isn't this the same question as the first question? I'm not sure what you're asking here...

The questions you're asking are good ones, and date back to the decades-old struggle around configuration tradeoffs for performance / availability / cost. My knee-jerk reaction is that one level of RAID, either hardware RAID5 or ZFS raidz, is sufficient for availability, and keeps things relatively simple (and simple also improves RAS). The advantage host-based RAID has always had over hardware RAID is the ability to create software LUNs (like a raidz1 or raidz2 zpool) across physical disk controllers, which may also cross SAN switches, etc. So, 'twas me, I'd go with non-hardware-RAID5 devices from the storage frame, and create raidz1 or raidz2 zpools across controllers.

But, that's me...
:^)

/jim
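A hedged sketch of the "plaid" layout Jim describes, where each device handed to ZFS is itself an entire hardware RAID5 LUN from the EMC frame (the LUN names are assumptions):

  # Each c5tNd0 below is a hardware RAID5 LUN; raidz on top lets the pool
  # survive the loss of a whole RAID5 LUN, at the cost of extra parity overhead.
  zpool create tank raidz c5t0d0 c5t1d0 c5t2d0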
Jim Mauro wrote:
> (I'm probably not the best person to answer this, but that has never stopped me
> before, and I need to give Richard Elling a little more time to get the Goats, Cows
> and Horses fed, sip his morning coffee, and offer a proper response...)

chores are done, wading through the morning e-mail...

>> Would it benefit us to have the disks set up as a raidz on top of
>> the hardware RAID5 that is already there?
> Way back when, we called such configurations "plaiding", which described a host-based
> RAID configuration that criss-crossed hardware RAID LUNs. In doing such things, we had
> potentially better data availability with a configuration that could survive more
> failure modes. Alternatively, we used the hardware RAID for the availability
> configuration (hardware RAID5), and used host-based RAID to stripe across hardware
> RAID5 LUNs for performance. Seemed to work pretty well.

Yep, there are various ways to do this and, in general, the more copies of the data you have, the better reliability you have. Space is also fairly easy to calculate. Performance can be tricky, and you may need to benchmark with your workload to see which is better, due to the difficulty in modeling such systems.

> In theory, a raidz pool spread across some number of underlying hardware RAID5 LUNs
> would offer protection against more failure modes, such as the loss of an entire RAID5
> LUN. So from a failure protection/data availability point of view, it offers some
> benefit. Now, whether or not you experience a real, measurable benefit over time is
> hard to say. Each additional level of protection/redundancy has a diminishing return,
> oftentimes at a dramatic incremental cost (e.g. getting from "four nines" to "five nines").

If money was no issue, I'm sure we could come up with an awesome solution :-)

>> Or would this double RAID slow our performance, with both a software and
>> a hardware RAID setup?
> You will certainly pay a performance penalty - using raidz across the RAID5 LUNs
> will reduce deliverable IOPS from the RAID5 LUNs. Whether or not the performance
> trade-off is worth the RAS gain depends on your RAS and data availability requirements.

Fast, inexpensive, reliable: pick two.

>> Or would a raidz setup be better than the hardware RAID5 setup?
>>
> Assuming a robust RAID5 implementation with battery-backed NVRAM (to protect against
> the "write hole" and partial-stripe writes), I think a raidz zpool covers more of the
> datapath than a hardware RAID5 LUN, but I'll wait for Richard to elaborate here
> (or tell me I'm wrong).

In general, you want the data protection in the application, or as close to the application as you can get. Since programmers tend to be lazy (Gosling said it, not me! :-) most rely on the file system and underlying constructs to ensure data protection. So, having ZFS manage the data protection will always be better than having some box at the other end of a wire managing the protection.

>> Also, if we do set the disks up as a raidz, would it benefit us more to
>> specify each disk in the raidz, or to create them as LUNs and then
>> build the raidz on those?
>>
> Isn't this the same question as the first question? I'm not sure what
> you're asking here...
>
> The questions you're asking are good ones, and date back to the decades-old struggle
> around configuration tradeoffs for performance / availability / cost.
> My knee-jerk reaction is that one level of RAID, either hardware RAID5 or ZFS raidz,
> is sufficient for availability, and keeps things relatively simple (and simple also
> improves RAS). The advantage host-based RAID has always had over hardware RAID is the
> ability to create software LUNs (like a raidz1 or raidz2 zpool) across physical disk
> controllers, which may also cross SAN switches, etc. So, 'twas me, I'd go with
> non-hardware-RAID5 devices from the storage frame, and create raidz1 or raidz2 zpools
> across controllers.

This is reasonable.

> But, that's me...
> :^)
>
> /jim

The important thing is to protect your data. You have lots of options here, so we'd need to know more precisely what the other requirements are before we could give better advice.
 -- richard
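A hedged illustration of "raidz2 zpools across controllers" (the controller and target numbers are assumptions): each raidz2 vdev takes one disk from each of four controllers, so losing an entire controller or SAN path costs the pool at most one device per vdev.

  zpool create tank \
      raidz2 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
      raidz2 c2t1d0 c3t1d0 c4t1d0 c5t1d0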
Hi Kory - Your problem came our way through other Sun folks a few days ago, and I wish I had that magic setting to help, but the reality is that I'm not aware of anything that will improve the time required to mount 12k file systems.

I would add (not that this helps) that I'm not convinced this problem is unique to ZFS, but I do not have experience or empirical data on mount time for 12k UFS, QFS, ext4, etc., file systems.

There is an RFE filed on this:
http://bugs.opensolaris.org/view_bug.do?bug_id=6478980

As I said, I wish I had a better answer.

Thanks,
/jim

Kory Wheatley wrote:
> Currently we are trying to set up ZFS file systems for all our user
> accounts under /homea /homec /homef /homei /homem /homep /homes and
> /homet. Right now on our Sun Fire V890 with 4 dual-core processors and
> 16 GB of memory we have 12,000 ZFS file systems set up, which Sun has
> promised will work, but we didn't know that it would take over an hour
> to do a reboot on this machine to mount and umount all these file
> systems. What we're trying to accomplish is the best performance along
> with the best data protection. Sun says that ZFS supports millions of
> file systems, but what they left out is how long it takes to do a
> reboot when you have thousands of file systems.
> Currently we have three LUNs on our EMC disk array on which we've created
> one ZFS storage pool, and we've created these 12,000 ZFS file systems
> in this pool.
>
> We really don't want to have to go to UFS to create our student user
> accounts. We like the flexibility of ZFS, but with the slow boot
> process it will kill us when we have to implement patches that require
> a reboot. These ZFS file systems will contain all the student data,
> so reliability and performance are key for us. Do you know of a way or
> a different setup for ZFS that would allow our system to boot up faster?
> I know each mount takes up memory, so that's part of the slowness when
> mounting and umounting. We know when the system is up that the kernel
> is using 3 GB of memory out of the 16 GB, and there's nothing else on
> this box right now but ZFS. There's no data in those thousands of file
> systems yet.
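A minimal sketch of how one might reproduce the boot-time cost being discussed, assuming a test pool named tank (the pool name, parent dataset, and loop are assumptions; the 12,000 count comes from the thread):

  zfs create tank/home
  i=1
  while [ $i -le 12000 ]; do
      zfs create tank/home/user$i
      i=`expr $i + 1`
  done
  # Unmount and remount everything to approximate the work done at boot.
  zfs umount -a
  time zfs mount -a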
I think this is a systems engineering problem, not just a ZFS problem. Few have bothered to look at mount performance in the past because most systems have only a few mounted file systems [1]. Since ZFS does file system quotas instead of user quotas, we now have situations where there can be thousands of mounts, so we do need to look at mount performance more closely. We're doing some of that work now, and looking at other possible solutions (CR 6478980).

[1] We've done some characterization of this while benchmarking Sun Cluster failovers. The time required for a UFS mount can be quite substantial, even when fsck is not required, and is also somewhat variable (from a few seconds to tens of seconds). We've made some minor changes to help improve cluster failover wrt mounts, so perhaps we can look at our characterization data again and see if there is some low-hanging fruit which would also apply more generally.
 -- richard
We've got some work going on in the NFS group to alleviate this problem. Doug McCallum has introduced the sharemgr (see http://blogs.sun.com/dougm) and I'm about to putback the In-Kernel Sharetab bits (look in http://blogs.sun.com/tdh - especially http://blogs.sun.com/tdh/entry/in_kernel_sharetab_have_a).

Doug has been doing some performance optimization to the sharemgr to allow faster loading of shares at boot, specifically for ZFS - see for example http://bugs.opensolaris.org/view_bug.do?bug_id=6491973. It is funny, he just told me a couple of hours ago that he was doing 15k entries. I know he has significantly reduced the times for 3k and 5k filesystems. We are still working on the 15k entries.

We want to combine his changes with my changes to see if we can get the 15k time down. With my changes, we remove going to disk for the sharetab and locking the file.

As you can see, this is a very hot spot for us right now. We really want these times down.

Also, for the interested, I gave a presentation at Connectathon last year which highlights some of the issues here: Scaling NFS Services (http://www.connectathon.org/talks06/haynes.pdf). I also presented an overview of Doug's and my projects at the latest Connectathon: The Management of Shares (http://www.connectathon.org/talks07/ScaleShares.pdf).

This message posted from opensolaris.org
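A hedged illustration of why the share count tracks the filesystem count here (the pool and dataset names are assumptions): the sharenfs property is inherited, so setting it once on the parent causes every one of the thousands of child home filesystems to be shared at boot, which is the large-share-count path this sharemgr and in-kernel sharetab work is trying to speed up.

  zfs set sharenfs=rw tank/home
  zfs get -r sharenfs tank/home     # every child inherits the share setting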
Richard Elling wrote:
> I think this is a systems engineering problem, not just a ZFS problem.
> Few have bothered to look at mount performance in the past because
> most systems have only a few mounted file systems [1]. Since ZFS does
> file system quotas instead of user quotas, we now have situations
> where there can be thousands of mounts, so we do need to look at
> mount performance more closely. We're doing some of that work now, and
> looking at other possible solutions (CR 6478980).
>
> [1] We've done some characterization of this while benchmarking Sun
> Cluster failovers. The time required for a UFS mount can be quite
> substantial, even when fsck is not required, and is also somewhat
> variable (from a few seconds to tens of seconds). We've made some minor
> changes to help improve cluster failover wrt mounts, so perhaps we
> can look at our characterization data again and see if there is some
> low-hanging fruit which would also apply more generally.

The problem is that in order to restrict disk usage, ZFS *requires* that you create this many filesystems. I think most in this situation would prefer not to have to do that. The two solutions I see would be to add user quotas to ZFS, or to be able to set a quota on a directory without it becoming its own filesystem.

We've ruled out using ZFS for our systems at this time due to these limitations and the fact that thousands of mounts on a host entail a very long reboot (and the fact that snapshots count toward the filesystem quota).

Any chance that user quotas will be added in the future? It would go a long way toward alleviating this problem. Ideally, snapshots would not count against user quotas if possible.

Jim
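For context, a minimal sketch of the filesystem-per-user workaround being criticized here (the pool, dataset, and user names are assumptions): disk usage can only be capped by giving each user a dedicated filesystem with its own quota.

  zfs create -o quota=2g tank/home/jsmith
  zfs create -o quota=2g tank/home/kwheatley
  zfs set quota=5g tank/home/kwheatley     # raising one user's cap later
  zfs get quota tank/home/jsmith tank/home/kwheatley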
> The problem is that in order to restrict disk usage, ZFS *requires*
> that you create this many filesystems. I think most in this situation
> would prefer not to have to do that. The two solutions I see would
> be to add user quotas to ZFS, or to be able to set a quota on a
> directory without it becoming its own filesystem.

What this really means is that ZFS filesystems need to be about as cheap as UFS quotas, and not (considerably) more expensive. As it stands now, ZFS filesystems have two serious limitations:

 - they cost ~100K of memory per fs (when mounted)
 - they are, by default, all mounted all the time

> Any chance that user quotas will be added in the future? It would go
> a long way toward alleviating this problem. Ideally, snapshots would not
> count against user quotas if possible.

If snapshots don't count against quota, then you give users an easy way to extend their quotas by (number of snapshots)-fold.

Casper
zfs-discuss-bounces at opensolaris.org wrote on 03/21/2007 11:00:43 AM:

> > The problem is that in order to restrict disk usage, ZFS *requires*
> > that you create this many filesystems. I think most in this situation
> > would prefer not to have to do that. The two solutions I see would
> > be to add user quotas to ZFS, or to be able to set a quota on a
> > directory without it becoming its own filesystem.
>
> What this really means is that ZFS filesystems need to be about as
> cheap as UFS quotas, and not (considerably) more expensive.

Well... and they need to apply to user/group restriction workflows, not just to the total size of child nodes. Many times you can create a bunch of ZFS mounts to limit usage or set reservations, but at some level this is not as flexible as user/group-level restrictions in many workflows (or even possible, given some structures and workflows). For instance, how would you limit John and Sarah in accounting from gobbling up all the space in the accounting group's folder, while allowing the managers Mike and Amy to do so, without changing their file structure to match the restrictions instead of their workflow?

User quotas have their place and are very well suited for certain tasks; ZFS quotas and reservations are more focused on protecting pooled storage from overuse/constraint. Building out a ton of fs mounts to try to work around missing user quotas does not scale well in any terms -- system cost, admin cost, or artificial filesystem layout/restructuring. The cost (CPU/memory) of ZFS mountpoints is one concern, but by no means the only or core concern.

> As it stands now, ZFS filesystems have two serious limitations:
>  - they cost ~100K of memory per fs (when mounted)
>  - they are, by default, all mounted all the time
>
> > Any chance that user quotas will be added in the future? It would go
> > a long way toward alleviating this problem. Ideally, snapshots would not
> > count against user quotas if possible.
>
> If snapshots don't count against quota, then you give users an
> easy way to extend their quotas by (number of snapshots)-fold.

I see the point and agree to some extent, and will add that when user quotas are put back and do account for snapshot space hogging, administrators will _need_ way better snapshot space usage reporting tools. I think many workflows may not want to force users to "own" the snapshot space -- for instance, a home directory where administrators set the snapshot rotation and the users have no control over when or how the snapshots will be released. This space accounting (live + snap vs. live) should be a setting.

-Wade
The fix for CR 6491973 won't have much effect on boot time since it is more specific to the act of setting the sharenfs property, but as Tom said, we are looking at anything that can reduce the time it takes to share out large numbers of shares. The time to share is separate from the mount time, since we don't start sharing until all the mounts are done. With such a large configuration, the console prompt will appear before the shares get started. The time prior to that is all in the mounts.

This message posted from opensolaris.org
Kory,

I'm sorry that you had to go through this. We're all working very hard to make ZFS better for everyone. We've noted this problem on the ZFS Best Practices wiki to try and help avoid future problems until we can get the quotas issue resolved.
 -- richard

Kory Wheatley wrote:
> Richard,
>
> I appreciate your information and insight. At this time, since ZFS is
> not capable of handling thousands of file systems and has several
> limitations, we are forced to focus our migration on using UFS, "after
> wasting time" -- Sun told us, before we thought of migrating our user
> accounts to ZFS, that everything would be fine, but they failed to
> mention the terrible slowness of the boot process. We told them we
> would be adding thousands of file systems under ZFS, and they said
> there would be no problems. Very unprofessional from my standpoint,
> since we invested so much time in ZFS. It's forced us to hold back on
> our migration and caused us to spend another $12k of maintenance on our
> current system, because we can't do our migration before the time our
> maintenance contract runs out. We have to restructure our migration
> plans around using UFS.
>
> ZFS needs to be described accurately in Sun's documentation and the
> presentations that I've looked at. Sure, it supports thousands and
> millions of file systems, but there are ramifications, resulting in a
> very slow boot process (if that had been stated, that would have been
> enough). This has cost us a considerable amount of time we've spent
> on ZFS, and now we have to turn our attention to UFS for our migration.
> From what I understand this problem was identified last year. I'm
> wondering how much time has been invested in it, since ZFS is such a key
> element for everyone migrating to or installing Solaris 10. You
> definitely would not want to use ZFS with thousands of file systems;
> it will not work for us at all at this time.
> Doug has been doing some performance optimization to the sharemgr
> to allow faster loading of shares at boot

Doug has blogged about his performance numbers here:
http://blogs.sun.com/dougm/entry/recent_performance_improvement_in_zfs

This message posted from opensolaris.org
"The important thing is to protect your data. You have lots of options here, so we''d need to know more precisely what the other requirements are before we could give better advice. -- richard" Please let me come in with a parallel need, the answer to which should contribute to this thread. -Physical details: 3-drive (plus DVD) box with Micro-ATX board, 1 on-board controller and the option for one raid card. Actual board, CPU and Memory yet-to-be-spec''d, but we''ll throw in whatever the "hardware-compatible" Micro-ATX board can handle. -Software details: OpenSolaris 2008-05, ZFS+PostgreSQL+Python. -Mission: ZFS box is to watch a Windoze box (or a MAC box) on which new files are being created and old ones changed, plus many deletions (animation system). -Objectives: (a) make periodic snapshots of animator''s box (actual copies of files) onto ZFS box, and (b) Write metadata into the PostgreSQL database to record event changes happening to key files. -Design concept: Integrate ZFS+SQL+Python into a rules-based backup device that notifies a third party elsewhere in the world about project progress (or lack thereof), and forwards key files and the SQL metadata (via internet) to some host ZFS box elsewhere. -Observations: (a) The local and the host ZFS boxes are not expected to contain the same images; indeed, many local ZFS boxes will be distributed, and one host ZFS box will be the ultimate repository of "completed" works. (b) High Performance is not an overriding consideration because this box "serves" only two users (the watched box on the local network and the host down the internet pipe). Question that relates to the on-going thread: What configuration of ZFS and the hardware would serve "reliable and cheap"? David Singer This message posted from opensolaris.org
Hello...

If I have understood well, you will have a host with EMC RAID5 disks. Is that right? You pay a lot of money to have EMC disks, and I think it is not a good idea to have another layer of *any* RAID on top of it. If you have EMC RAID5 (e.g. Symmetrix), you don't need to have a software RAID... ZFS was designed to provide a RAID solution for cheap disks! I think that is not your case, and anything that is "too much" is not good. It generates complexity and loops... :) I think ZFS can "trust" the EMC box...

Leal.

This message posted from opensolaris.org
Leal,

The entire configuration through our corporation is being defined. One of our team members is heavy into EMC - 200 TB is his "normal" operating range. However, for this need we are focused just on local "smart appliances", the purpose of which is to do more than just automatically mirror the entirety of another local computer. What is desired is "reliable and cheap", plus remotely controlled, virus-free, and easily updated by the local bone-head. We expect to have many of these appliances, each in a separate spot in the world, each serving one local computer (operated by one local bone-head), and each reporting to one common central repository via the internet. We don't expect the appliance to have (relatively) much CPU stress, but the files are rather large (video, animation, and all the underlying constructs, tracks, and undo's thereof).

We've come to the conclusion that hardware RAID of any sort is not required. Remember, the source data on Local Bone-Head's computer (not being disparaging, just being practical that an un-supervised person thousands of miles away has to be considered less-than-optimal in computer habits) is being copied to a ZFS machine (backup location number one) and then forwarded to a central repository (backup location number two) which will itself have a mirror in some distant location (backup location number three).

We'll try compression level #9. We'll set "scrub" to 30 days automatic. We'll have unique virus protection: the bootable drive will be read-only.

Here's the configuration we'll build for our first appliance:

Case: Antec NSK1380 - Micro-ATX format with one 5.25" bay, (3) 3.5" drive bays, a 350 W power supply, a 120 mm fan, plus an interior side fan (uses one PCI slot).

MOBO: still to be determined, but we are currently evaluating the ASUS P5E-VM, LGA775, Intel G35 northbridge and ICH9R southbridge. Comes with 4 memory slots of dual-channel DDR2, (2) PCIe x1, (1) PCIe x16, (1) PCI, (6) SATA 3 Gb/s, (1) IDE PATA, (1) FDD, (6) USB, (1) FireWire, 5.1 surround, HDMI, XVGA, PS/2 (mouse/keyboard), (1) gigabit LAN port, (1) coaxial S/PDIF, a RAID controller, and a flaky BIOS that needs an immediate flash update before doing anything. But wait! Most of those features will go unused (read on).

Setup:
(1) single-core CPU, minimizing heat being the most important factor
(8) GB memory (4 x 2 GB) w/heat spreader
(3) 500 GB 7,200 rpm "whatever's in stock cheapest" drives C: D: E:
100% use as ZFS tank, raidz-1. No floppy, no CD, no DVD, no boot from C: or D: or E:
Shut off sound, FireWire, S/PDIF (and anything else we can figure out how)
BOOT & run from a 4 GB USB FLASH (thumb) DRIVE "F:" - READ ONLY!
(0) monitors, (0) keyboards, (0) mice - operate remotely via the Internet

Software (all loaded onto the flash drive):
OpenSolaris
ZFS
PostgreSQL
Python
A browser
A VPN

Our patch/upgrade pipeline: FedEx a replacement read-only USB flash drive.

We think we can get a brand-new 1 TB custom appliance for about $900 US. We understand there will be a learning curve to this, but are willing to cut ourselves on the bleeding edge. We think each part (with the possible exception of the mobo) has been successfully employed - that we are just the first to assemble it all in this particular fashion.

(Many thanks, Richard - chime in if I left anything out)

David

This message posted from opensolaris.org
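A minimal sketch of the pool side of this build, assuming the three 500 GB drives show up as c1t0d0 through c1t2d0 (the device names are assumptions, and the 30-day scrub would have to be driven externally, e.g. from cron, since scrub scheduling is not a pool property):

  zpool create tank raidz c1t0d0 c1t1d0 c1t2d0
  zfs set compression=gzip-9 tank     # "compression level #9", per the plan above
  zpool scrub tank                    # run periodically, e.g. every 30 days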
Very cool! Just one comment. You said:

> We'll try compression level #9.

gzip-9 is *really* CPU-intensive, often for little gain over gzip-1. As in, it can take 100 times longer and yield just a few percent gain. The CPU cost will limit write bandwidth to a few MB/sec per core.

I'd suggest that you begin by doing a simple experiment -- create a filesystem at each compression level, copy representative identical data to each one, and compare space usage. My guess is that you'll find the knee in the cost/benefit curve well below gzip-9. Also, if you're storing jpegs or video files, those are already compressed, in which case the benefit will be zero even at gzip-9.

That said, the other consideration is how you're using the storage. If the write rate is modest and disk space is at a premium, the CPU cost may simply not matter. And note that only writes are affected: when reading data back, gzip is equally fast regardless of level.

Jeff
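A minimal sketch of the experiment Jeff suggests (the pool name, dataset names, and sample data path are assumptions):

  for level in 1 3 6 9; do
      zfs create -o compression=gzip-$level tank/comptest-gzip$level
      cp -rp /export/sample/. /tank/comptest-gzip$level/
  done
  # Compare how much space each level actually saved on this data set.
  zfs list -o name,used,compressratio -r tank | grep comptest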