Hi everyone,

Now that zfsboot is becoming available, I'm wondering how to put it to use. Imagine a system with 4 identical disks. Of course I'd like to use raidz, but zfsboot doesn't do raidz. What if I were to partition the drives, such that I have 4 small partitions that make up a zfsboot partition (4-way mirror), and the remainder of each drive becomes part of a raidz?

Do I still have the advantages of having the whole disk 'owned' by zfs, even though it's split into two parts?

Swap would probably have to go on a zvol - would that be best placed on the n-way mirror, or on the raidz?

Regards, Paul Boven.
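As a rough illustration of the layout described above, the following sketch estimates usable space when each of the 4 disks is cut into a small boot slice (4-way mirror) and a data slice (single-parity raidz). The disk size of 500 GB, the boot slice size of 16 GB, and the function name are assumptions made up for the example; the arithmetic ignores ZFS label and metadata overhead.

```python
def split_layout(disk_gb, boot_slice_gb, n_disks=4):
    """Estimate usable space for a boot mirror plus raidz on the same disks.

    Assumes every disk is cut into one boot slice and one data slice.
    An n-way mirror holds one slice's worth of data; single-parity raidz
    gives up roughly one slice's worth of space to parity.
    """
    if boot_slice_gb >= disk_gb:
        raise ValueError("boot slice must be smaller than the disk")
    data_slice_gb = disk_gb - boot_slice_gb
    return {
        "boot_mirror_gb": boot_slice_gb,
        "raidz_gb": data_slice_gb * (n_disks - 1),
    }


# Hypothetical sizes, chosen only to make the arithmetic concrete.
layout = split_layout(disk_gb=500, boot_slice_gb=16)
print(layout)
```

The point of the sketch is only that the boot mirror costs one small slice per disk, while the raidz capacity shrinks by three times the slice size.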
Hi,

> Now that zfsboot is becoming available, I'm wondering how to put it to
> use. Imagine a system with 4 identical disks. Of course I'd like to use

you lucky one :).

> raidz, but zfsboot doesn't do raidz. What if I were to partition the
> drives, such that I have 4 small partitions that make up a zfsboot
> partition (4 way mirror), and the remainder of each drive becomes part
> of a raidz?

Sounds good. Performance will suffer a bit, as ZFS thinks it has two pools with 4 spindles each, but it should still perform better than the same on a UFS basis.

You may also want to have two 2-way mirrors and keep the second for other purposes such as a scratch space for zfs migration or as spare disks for other stuff.

> Do I still have the advantages of having the whole disk
> 'owned' by zfs, even though it's split into two parts?

I'm pretty sure that this is not the case:

- ZFS has no guarantee that nobody will do something else with that other partition, so it can't assume the right to turn on disk cache for the whole disk.

- Yes, it could be smart and realize that it does have the whole disk, only split up across two pools, but then I assume that this is not your typical enterprise class configuration and so it probably didn't get implemented that way.

I'd say that not being able to benefit from the disk drive's cache is not as bad in the face of ZFS' other advantages, so you can probably live with that.

> Swap would probably have to go on a zvol - would that be best placed on
> the n-way mirror, or on the raidz?

I'd place it onto the mirror for performance reasons. Also, it feels cleaner to have all your OS stuff on one pool and all your user/app/data stuff on another. This is also recommended by the ZFS Best Practices Wiki on www.solarisinternals.com.

Now back to the 4 disk RAID-Z: Does it have to be RAID-Z? You might want to reconsider and use 2 2-way mirrors instead:

- RAID-Z is slow when writing, you basically get only one disk's bandwidth. (Yes, with variable block sizes this might be slightly better...)

- RAID-Z is _very_ slow when one disk is broken.

- Using mirrors is more convenient for growing the pool: You run out of space, you add two disks, and get better performance too. No need to buy 4 extra disks for another RAID-Z set.

- When using disks, you need to consider availability, performance and space. Of all the three, space is the cheapest. Therefore it's best to sacrifice space and you'll get better availability and better performance.

Hope this helps,
Constantin

-- 
Constantin Gonzalez                            Sun Microsystems GmbH, Germany
Platform Technology Group, Global Systems Engineering      http://www.sun.de/
Tel.: +49 89/4 60 08-25 91                   http://blogs.sun.com/constantin/

Sitz d. Ges.: Sun Microsystems GmbH, Sonnenallee 1, 85551 Kirchheim-Heimstetten
Amtsgericht Muenchen: HRB 161028
Geschaeftsfuehrer: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
Vorsitzender des Aufsichtsrates: Martin Haering
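The space-versus-availability trade-off argued above can be made concrete with a small enumeration. The sketch below is only a model under stated assumptions: 4 disks of a made-up 500 GB each, single-parity RAID-Z, and hypothetical helper names (`survives`, `summarize`). It counts how many of the possible two-disk failures each layout would survive and says nothing about performance.

```python
from itertools import combinations


def survives(layout, failed):
    """Return True if the pool keeps its data after the `failed` disks die.

    `layout` is a list of (kind, disks) vdevs; kind is "mirror" or "raidz".
    A mirror survives while at least one member is alive; single-parity
    raidz tolerates at most one failed member. The pool needs every vdev.
    """
    for kind, disks in layout:
        dead = sum(1 for d in disks if d in failed)
        if kind == "mirror" and dead == len(disks):
            return False
        if kind == "raidz" and dead > 1:
            return False
    return True


def summarize(layout, disk_gb):
    """Return (usable GB, survivable double failures, total double failures)."""
    all_disks = [d for _, disks in layout for d in disks]
    usable = 0
    for kind, disks in layout:
        usable += disk_gb if kind == "mirror" else disk_gb * (len(disks) - 1)
    pairs = list(combinations(all_disks, 2))
    ok = sum(1 for p in pairs if survives(layout, set(p)))
    return usable, ok, len(pairs)


# Disk names and the 500 GB size are illustrative assumptions.
raidz = [("raidz", ["d0", "d1", "d2", "d3"])]
mirrors = [("mirror", ["d0", "d1"]), ("mirror", ["d2", "d3"])]

for name, layout in (("raidz", raidz), ("2 x 2-way mirror", mirrors)):
    usable, ok, total = summarize(layout, disk_gb=500)
    print(f"{name}: {usable} GB usable, survives {ok}/{total} double failures")
```

Under these assumptions the mirrored layout gives up one disk's worth of space compared with RAID-Z, and in exchange survives the double failures that hit different mirror pairs.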
Constantin Gonzalez wrote:

>> Do I still have the advantages of having the whole disk
>> 'owned' by zfs, even though it's split into two parts?
>
> I'm pretty sure that this is not the case:
>
> - ZFS has no guarantee that nobody will do something else with that other
>   partition, so it can't assume the right to turn on disk cache for the whole
>   disk.

Can write-cache not be turned on manually as the user is sure that it is only ZFS that is using the entire disk?

-Manoj
Hi,

Manoj Joseph wrote:

> Can write-cache not be turned on manually as the user is sure that it is
> only ZFS that is using the entire disk?

Yes, it can be turned on. But I don't know if ZFS would then know about it.

I'd still feel more comfortable with it being turned off unless ZFS itself does it.

But maybe someone from the ZFS team can clarify this.

Cheers,
Constantin
Hello Constantin,
Wednesday, April 4, 2007, 3:34:13 PM, you wrote:
CG> - RAID-Z is slow when writing, you basically get only one disk's
CG>   bandwidth.
CG>   (Yes, with variable block sizes this might be slightly better...)

No, it's not.
It's actually very fast for writing; in many cases it would be faster
than raid-10 (both made of 4 disks).
Now doing random reads is slow...
-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
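The claim above (fast writes, slow random reads) matches a commonly cited rule of thumb, which the sketch below turns into numbers. It is a back-of-the-envelope model, not measured data: it assumes a RAID-Z group spreads every block across all its data disks, so a small random read keeps the whole group busy, while each side of a mirror can serve reads independently. The per-disk figures (100 IOPS, 60 MB/s) and the function name are invented for illustration.

```python
def rough_model(n_disks, disk_iops, disk_mbps):
    """Very rough throughput model for n_disks arranged two ways.

    raidz:   one single-parity group; a random read involves the whole
             group (about one disk's IOPS), and streaming writes fill
             n_disks - 1 data disks.
    mirrors: n_disks // 2 two-way mirrors striped together; every disk can
             serve its own random reads, and writes land on both sides of
             each pair.
    """
    raidz = {
        "random_read_iops": disk_iops,
        "stream_write_mbps": disk_mbps * (n_disks - 1),
    }
    mirrors = {
        "random_read_iops": disk_iops * n_disks,
        "stream_write_mbps": disk_mbps * (n_disks // 2),
    }
    return raidz, mirrors


# Assumed per-disk numbers, for illustration only.
raidz, mirrors = rough_model(n_disks=4, disk_iops=100, disk_mbps=60)
print("raidz:  ", raidz)
print("mirrors:", mirrors)
```

Real pools will deviate from this depending on block sizes, caching, and workload; the sketch only shows why the two layouts can rank differently for writes and for random reads.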
On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:

> - RAID-Z is _very_ slow when one disk is broken.

Do you have data on this? The reconstruction should be relatively cheap, especially when compared with the initial disk access.

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
On Wed, Apr 04, 2007 at 10:08:07AM -0700, Adam Leventhal wrote:

> On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:
> > - RAID-Z is _very_ slow when one disk is broken.
>
> Do you have data on this? The reconstruction should be relatively cheap
> especially when compared with the initial disk access.

Also, what is your definition of "broken"? Does this mean the device appears as FAULTED in the pool status, or that the drive is present and not responding? If it's the latter, this will be fixed by my upcoming FMA work.

- Eric

-- 
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
On Wed, Apr 04, 2007 at 10:08:07AM -0700, Adam Leventhal wrote:

> On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:
> > - RAID-Z is _very_ slow when one disk is broken.
>
> Do you have data on this? The reconstruction should be relatively cheap
> especially when compared with the initial disk access.

RAID-Z has to be slower when there is lots of bitrot, but it shouldn't be slower when a disk has read errors or is gone. Or are we talking about write performance (does RAID-Z wait too long for a disk that won't respond?)?
> > Can write-cache not be turned on manually as the user is sure that it is
> > only ZFS that is using the entire disk?
>
> yes it can be turned on. But I don't know if ZFS would then know about it.
>
> I'd still feel more comfortable with it being turned off unless ZFS itself
> does it.
>
> But maybe someone from the ZFS team can clarify this.

I think that it's true that ZFS would not know about the write cache and thus you wouldn't get the benefit of it. At some point, we'd like to implement code that recognizes that zfs "owns" the entire disk even though the disk has multiple slices, and turns on write caching anyway. I haven't done much looking into this though.

Some further comments on the proposed configuration (root mirrored across all four disks, the rest of each disk going into a RAIDZ pool):

1. I suggest you make your root pool big enough to hold several boot environments so that you can try out clone-and-upgrade tricks like this: http://blogs.sun.com/timf/entry/an_easy_way_to_manage

2. If root is mirrored across all four disks, that means that swapping will take place to all four disks. I'm wondering if that's a problem, or if not a problem, maybe not optimal.

Lori
Hello Adam,

Wednesday, April 4, 2007, 7:08:07 PM, you wrote:

AL> On Wed, Apr 04, 2007 at 03:34:13PM +0200, Constantin Gonzalez wrote:
>> - RAID-Z is _very_ slow when one disk is broken.

AL> Do you have data on this? The reconstruction should be relatively cheap
AL> especially when compared with the initial disk access.

If I stop all activity to an x4500 with a pool made of several raidz2 and then I issue a spare attach, I get really poor performance (1-2MB/s) on a pool with a lot of relatively small files.

-- 
Best regards,
Robert
On Wed, Apr 04, 2007 at 11:04:06PM +0200, Robert Milkowski wrote:

> If I stop all activity to an x4500 with a pool made of several raidz2 and
> then I issue a spare attach, I get really poor performance (1-2MB/s) on a
> pool with a lot of relatively small files.

Does that mean the spare is resilvering when you collect the performance data? I think a fair test would be to compare the performance of a fully functional RAID-Z stripe against one with a missing (absent) device.

Adam

-- 
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Lori Alt wrote:

>> Can write-cache not be turned on manually as the user is sure that it is
>> only ZFS that is using the entire disk?
>
> I think that it's true that ZFS would not know about the
> write cache and thus you wouldn't get the benefit of it.

Actually, all that matters is that the write cache is on -- doesn't matter whether ZFS turned it on or you did it manually. (However, make sure that the write cache doesn't turn itself back off when you reboot / lose power...)

--matt
Hello Adam,

Wednesday, April 4, 2007, 11:41:58 PM, you wrote:

AL> On Wed, Apr 04, 2007 at 11:04:06PM +0200, Robert Milkowski wrote:
>> If I stop all activity to an x4500 with a pool made of several raidz2 and
>> then I issue a spare attach, I get really poor performance (1-2MB/s) on a
>> pool with a lot of relatively small files.

AL> Does that mean the spare is resilvering when you collect the performance
AL> data? I think a fair test would be to compare the performance of a fully
AL> functional RAID-Z stripe against one with a missing (absent) device.

Sorry, I wasn't clear. I'm not talking about performance while the spare is resilvering. I'm talking about the resilver performance itself while all other I/O is absent. The resilver itself is slow (lots of files) with raidz2 here.

-- 
Best regards,
Robert
Hello Matthew,

Thursday, April 5, 2007, 1:08:25 AM, you wrote:

MA> Lori Alt wrote:
>> I think that it's true that ZFS would not know about the
>> write cache and thus you wouldn't get the benefit of it.

MA> Actually, all that matters is that the write cache is on -- doesn't
MA> matter whether ZFS turned it on or you did it manually. (However, make
MA> sure that the write cache doesn't turn itself back off when you reboot /
MA> lose power...)

So SCSI write cache flush commands will be issued regardless of whether zfs has a whole disk or only a slice, right?

-- 
Best regards,
Robert
Hi,

>>> - RAID-Z is _very_ slow when one disk is broken.
>> Do you have data on this? The reconstruction should be relatively cheap
>> especially when compared with the initial disk access.
>
> Also, what is your definition of "broken"? Does this mean the device
> appears as FAULTED in the pool status, or that the drive is present and
> not responding? If it's the latter, this will be fixed by my upcoming
> FMA work.

Sorry, the _very_ may be exaggerated, and it depends a lot on the load of the system and the config.

I'm referring to a couple of posts and anecdotal experience from colleagues. This means that indeed "slow" or "very slow" may be a mixture of reconstruction overhead and device timeout issues.

So, it's nice to see that the upcoming FMA code will fix some of the slowness issues.

Did anybody measure how much CPU overhead RAID-Z and RAID-Z2 parity computation induces, both for writes and for reads (assuming a data disk is broken)? This data would be useful when arguing for a "software RAID" scheme in front of hardware-RAID addicted customers.

Best regards,
Constantin
On 5 Apr 07, at 08:28, Robert Milkowski wrote:

> MA> Actually, all that matters is that the write cache is on --
> MA> doesn't matter whether ZFS turned it on or you did it manually.
> MA> (However, make sure that the write cache doesn't turn itself back
> MA> off when you reboot / lose power...)
>
> SCSI write cache flush commands will be issued regardless if zfs has a
> whole disk or only a slice then, right?

That's correct. The code path that issues flushes to the write cache does not check whether or not the caches are enabled.
Now, given proper I/O concurrency (like the recently improved NCQ in our drivers, or SCSI CTQ), I don't expect the write caches to provide much performance gain, if any, over the situation with write caches off. Write caches can be extremely effective when dealing with drives that do not handle concurrent requests properly.

I'd be interested to see data that shows otherwise.

-r

On 4 Apr 07, at 15:20, Constantin Gonzalez wrote:

> Hi,
>
> Manoj Joseph wrote:
>
>> Can write-cache not be turned on manually as the user is sure that it is
>> only ZFS that is using the entire disk?
>
> yes it can be turned on. But I don't know if ZFS would then know about it.
>
> I'd still feel more comfortable with it being turned off unless ZFS itself
> does it.
>
> But maybe someone from the ZFS team can clarify this.
>
> Cheers,
> Constantin
On 4 Apr 07, at 10:01, Paul Boven wrote:

> Hi everyone,
>
> Swap would probably have to go on a zvol - would that be best placed on
> the n-way mirror, or on the raidz?

From the book of Richard Elling:

Shouldn't matter. The 'existence' of a swap device is sometimes required. If the device ever becomes 'in use', you'll need to find an alternative.

-r

> Regards, Paul Boven.