Hi,

I have seen a similar question on this list in the archive but haven't seen the answer.

Can I avoid striping across top-level vdevs?

I use a zpool which is one LUN from the SAN, and when it becomes full I add a new LUN to it. But I cannot guarantee that the new LUN will not come from the same spindles on the SAN.

Can I force zpool not to stripe the data?

Thank you in advance,
Zsolt Habony
On 18/10/2010 07:44, Habony, Zsolt wrote:
> I have seen a similar question on this list in the archive but haven't
> seen the answer.
>
> Can I avoid striping across top level vdevs ?
>
> If I use a zpool which is one LUN from the SAN, and when it becomes full
> I add a new LUN to it.
>
> But I cannot guarantee that the LUN will not come from the same spindles
> on the SAN.

That sounds like a problem with your SAN config if that matters to you.

> Can I force zpool to not to stripe the data ?

You can't, but why do you care?

--
Darren J Moffat
In many large datacenters, a different storage team handles LUN requests and assignment. We ask for a LUN of a specific size, and we get one.

It might turn out that the first vdev (LUN) is at the beginning of a RAID set on the storage, and the second vdev is at the end of the same RAID set, on the same physical disks (if not at creation time, then later, when a filled zpool is grown by adding a LUN).

I worry about head thrashing. Though the memory cache of a large storage array should ease the problem, I would be happier if I could be sure that the zpool will not be handled as a stripe.

Is there a way to avoid it, or can we be sure that the problem does not exist at all?

-----Original Message-----
From: Darren J Moffat [mailto:darrenm at opensolaris.org]
Sent: 18 October 2010 10:19
To: Habony, Zsolt
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] How to avoid striping ?

On 18/10/2010 07:44, Habony, Zsolt wrote:
> I have seen a similar question on this list in the archive but haven't
> seen the answer.
>
> Can I avoid striping across top level vdevs ?
>
> If I use a zpool which is one LUN from the SAN, and when it becomes full
> I add a new LUN to it.
>
> But I cannot guarantee that the LUN will not come from the same spindles
> on the SAN.

That sounds like a problem with your SAN config if that matters to you.

> Can I force zpool to not to stripe the data ?

You can't, but why do you care?

--
Darren J Moffat
On 18 Oct 2010, at 08:44, "Habony, Zsolt" <zsolt.habony at hp.com> wrote:
> Hi,
>
> I have seen a similar question on this list in the archive but haven't seen the answer.
>
> Can I avoid striping across top level vdevs ?
>
> If I use a zpool which is one LUN from the SAN, and when it becomes full I add a new LUN to it.
> But I cannot guarantee that the LUN will not come from the same spindles on the SAN.
>
> Can I force zpool to not to stripe the data ?

No. The basic principle of the zpool is dynamic striping across vdevs, in order to ensure that all available spindles contribute to the workload. If you want or need more granular control over which data goes to which disk, then you'll need to create multiple pools.

Just create a new pool from the new SAN volume and you will segregate the IO. But then you risk having hot and cold spots in your storage, as the IO won't be striped. If the approach is to fill one vdev completely before adding a new one, that possibility exists anyway, until block rewrite arrives to redistribute existing data across the available vdevs.

Cheers,
Erik
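As a rough sketch of the two options Erik mentions (the pool and device names below are made up for illustration, not taken from the thread):

    # Hypothetical pool/LUN names.
    # Adding the new LUN to the existing pool creates a second top-level
    # vdev, and ZFS dynamically stripes new writes across both:
    zpool add apppool c3t60060E801234d1

    # Creating a separate pool from the new LUN keeps its IO segregated,
    # at the cost of managing two filesystem hierarchies:
    zpool create apppool2 c3t60060E801234d1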
> No. The basic principle of the zpool is dynamic striping across vdevs in order to ensure that all available spindles
> are contributing to the workload. If you want/need more granular control over what data goes to which disk, then
> you'll need to create multiple pools.
>
> Just create a new pool from the new SAN volume and you will segregate the IO.

That's my understanding, and that's my problem.

You have an application filesystem from one LUN. (vxfs is expensive, and ufs/svm is not really able to handle online filesystem increase. Thus we plan to use zfs for application filesystems.) When it fills up, you increase it by adding a new LUN. You have to make sure that the added LUN is from different physical disks, which might not be obvious with today's large storage arrays with thousands of LUNs.

If I can force concatenation, then I do not have to investigate where the existing parts of the filesystem are.
On 18/10/2010 09:28, Habony, Zsolt wrote:
> I worry about head thrashing. Though memory cache of large storage should make the problem

Is that really something you should be worried about, with all the other software and hardware between ZFS and the actual drives?

If it is a problem then it isn't ZFS causing it; ZFS will just be using the LUNs it was given by the SAN. An access pattern from an application on a completely different filesystem could still end up using both LUNs in that way.

> Is there a way to avoid it, or can we be sure that the problem does not exist at all ?

Grow the existing LUN rather than adding another one.

The only way to have ZFS not stripe is to not give it devices to stripe over. So stick with simple mirrors, e.g. this style of configuration:

  pool: builds
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        builds      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0

where in your configuration c7t3d0/c8t4d0 are your LUNs from the SAN. Rather than this style:

  pool: builds
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        builds      ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c8t4d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c8t5d0  ONLINE       0     0     0

--
Darren J Moffat
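A minimal sketch of how these two layouts come about; the first pair of LUN names follows Darren's example, and c9t5d0 is a hypothetical extra LUN:

    # Single mirrored top-level vdev - no striping:
    zpool create builds mirror c7t3d0 c8t4d0

    # "zpool attach" adds another side to the existing mirror,
    # so the pool still has only one top-level vdev:
    zpool attach builds c7t3d0 c9t5d0

    # "zpool add builds mirror <lunA> <lunB>" would instead create
    # mirror-1 as a second top-level vdev, and the pool would then
    # stripe across both.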
On 18/10/2010 10:01, Habony, Zsolt wrote:
> If I can force concatenation, then I do not have to investigate where the existing parts of the filesystem are.

You can't; the code for concatenation rather than striping does not exist, and there are no plans to add it.

Instead of assuming you have a problem, I'd highly recommend you go with the recommendation in my other email, or don't worry about it. Don't assume that you will have a problem with ZFS because of your experience with other systems. Striping isn't bad; it is usually good.

Or fix the root cause of the problem - which in this example case isn't ZFS - on the SAN where the LUNs are getting allocated.

--
Darren J Moffat
On Mon, Oct 18, 2010 at 1:28 AM, Habony, Zsolt <zsolt.habony at hp.com> wrote:
> Is there a way to avoid it, or can we be sure that the problem does not exist at all ?

ZFS will coalesce asynchronous writes, which should help with most of the head thrashing on writes. Using a log device will convert sync writes to async. For reads, make sure you have enough memory and a cache device.

-B

--
Brandon High : bhigh at freaks.com
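A small sketch of Brandon's suggestion, assuming a pool named apppool and hypothetical devices for the log and cache:

    # Separate intent-log (slog) device: synchronous writes are committed
    # here first and then flushed to the main LUN with the normal async writes:
    zpool add apppool log c4t0d0

    # L2ARC cache device to absorb read traffic in front of the SAN LUN:
    zpool add apppool cache c4t1d0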
Hi,

Habony, Zsolt writes:
> You have an application filesystem from one LUN. (vxfs is expensive, ufs/svm is not really able to handle online filesystem increase. Thus we plan to use zfs for application filesystems.)

What do you mean by "not really"?

Use metattach to grow a metadevice or soft partition. Use growfs to grow UFS on the grown device.

Rainer
--
Rainer J. H. Brandt
Brandt & Brandt Computer GmbH
Am Wiesenpfad 6, 53340 Meckenheim
Geschäftsführer: Rainer J. H. Brandt und Volker A. Brandt
Handelsregister: Amtsgericht Bonn, HRB 10513

RFC 5322: "Each line [...] SHOULD be no more than 78 characters"
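A brief sketch of the SVM/UFS growth path Rainer describes, with hypothetical metadevice, slice, and mount-point names:

    # Concatenate an additional slice onto the existing metadevice:
    metattach d10 c2t1d0s0

    # Grow the mounted UFS filesystem to fill the enlarged device:
    growfs -M /app /dev/md/rdsk/d10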
On 10/18/10 2:13 AM, Rainer J.H. Brandt wrote:
>
> Habony, Zsolt writes:
>> You have an application filesystem from one LUN. (vxfs is
>> expensive, ufs/svm is not really able to handle online filesystem
>> increase. Thus we plan to use zfs for application filesystems.)
>
> What do you mean by "not really"? Use metattach to grow a metadevice
> or soft partition. Use growfs to grow UFS on the grown device.

He is probably referring to the fact that growfs locks the filesystem.

--
Carson Gaspar
>> You have an application filesystem from one LUN. (vxfs is expensive, ufs/svm is not really able to handle online filesystem increase. Thus we plan to use zfs for application filesystems.)
>
> What do you mean by "not really"?
> ...
> Use growfs to grow UFS on the grown device.

I know it's off-topic, but the statement "growfs will ``write-lock'' (see lockfs(1M)) a mounted file system when expanding" has always made me uncomfortable with this online expansion. I cannot guarantee how a specific application will behave during the expansion.
>> Is there a way to avoid it, or can we be sure that the problem does not exist at all ?
>
> Grow the existing LUN rather than adding another one.
>
> The only way to have ZFS not stripe is to not give it devices to stripe
> over. So stick with simple mirrors ...

(I do not mirror, as the storage gives redundancy behind LUNs.)

Online LUN expansion seems promising, and answers my question. Thank you for that.

Zsolt
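For reference, a sketch of what the host-side part of an online LUN expansion might look like once the storage team has grown the LUN; the pool and device names are hypothetical, and the autoexpand property assumes a reasonably recent zpool version:

    # Pick up the extra capacity automatically when the device is reopened ...
    zpool set autoexpand=on apppool

    # ... or expand the single device explicitly after the LUN resize:
    zpool online -e apppool c3t60060E801234d0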
>>> You have an application filesystem from one LUN. (vxfs is expensive, ufs/svm is not really able
>>> to handle online filesystem increase. Thus we plan to use zfs for application filesystems.)
>>
>> What do you mean by "not really"?
>> ...
>> Use growfs to grow UFS on the grown device.
>
> I know it's off-topic, but the statement "growfs will ``write-lock''
> (see lockfs(1M)) a mounted file system when expanding" has always made me
> uncomfortable with this online expansion. I cannot guarantee how a
> specific application will behave during the expansion.

     -w          Write-lock (wlock) the specified file system. wlock
                 suspends writes that would modify the file system.
                 Access times are not kept while a file system is
                 write-locked.

All the applications trying to write will suspend. What would be the risk of that?

Casper
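The same write-lock can be exercised by hand with lockfs(1M); a small sketch, with a hypothetical mount point:

    # Suspend writes to the filesystem, as growfs does while expanding:
    lockfs -w /app

    # Release the write-lock:
    lockfs -u /app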
On 18 Oct 2010, at 12:40, Habony, Zsolt wrote:
>>> Is there a way to avoid it, or can we be sure that the problem does not exist at all ?
>>
>> Grow the existing LUN rather than adding another one.
>>
>> The only way to have ZFS not stripe is to not give it devices to stripe
>> over. So stick with simple mirrors ...
>
> (I do not mirror, as the storage gives redundancy behind LUNs.)

Then you lose ZFS's self-healing ability.

Sami
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Habony, Zsolt
>
> If I use a zpool which is one LUN from the SAN, and when
> it becomes full I add a new LUN to it.
> But I cannot guarantee that the LUN will not come from the same
> spindles on the SAN.
>
> Can I force zpool to not to stripe the data ?

If at all possible, you should request that your LUN team give you whole disks, JBOD, instead of a LUN slice of some other raid set. The performance and reliability benefits of ZFS raid over HW raid have been discussed here many times. Please ask if you don't already know.

If they can't do that for you ... then your question is an important one, and I have no idea of the answer.
On 10/18/2010 4:28 AM, Habony, Zsolt wrote:
>
> I worry about head thrashing.

Why?

If your SAN group gives you a LUN that is at the opposite end of the array, I would think that was because they had already assigned the space in the middle to other customers (other groups like yours, or other hosts of yours). If so, don't you think that all those other hosts and customers will be reading and writing from that array all the time anyway?

I mean, if the heads are going to 'thrash', then they'll be doing so even before you request your second LUN, right? Adding your second LUN to the mix isn't going to seriously change the workload on the disks in the array.

> Though memory cache of large storage should make the problem
> easier, I would be more happy if I can be sure that zpool will not
> be handled as a stripe.
>
> Is there a way to avoid it, or can we be sure that the problem does
> not exist at all ?

As I think the logic above suggests, if the problem exists, it exists even when you only have 1 LUN.

-Kyle
On 10/18/2010 5:40 AM, Habony, Zsolt wrote:
> (I do not mirror, as the storage gives redundancy behind LUNs.)

By not enabling redundancy (mirror or RAIDZ[123]) at the ZFS level, you are opening yourself to corruption problems that the underlying SAN storage can't protect you from. The SAN array won't even notice the problem. ZFS will notice the problem, and (if you don't give it redundancy to work with) it won't be able to repair it for you.

You'd be better off getting unprotected LUNs from the array, and letting ZFS handle the redundancy.

-Kyle

> Online LUN expansion seems promising, and answers my question.
> Thank you for that.
>
> Zsolt
Thank you all for the comments.

You should imagine a datacenter with
- standards not completely depending on me,
- a SAN for many OSs, one of which is Solaris (and not the major part),
- usually level 2 engineers doing filesystem increases,
- hundreds of physical boxes, dozens of virtuals on one physical,
- the ability to move VMs (zones) across physical boxes (by assigning LUNs to other boxes).

That probably explains why I cannot use host-based raid management; it is done by the storage as standard. I cannot assign whole disks to boxes, as I get LUNs standardized for all the other OSs, in a size optimized for small virtual machines. zfs is just used for easy expansion and snapshotting.

> If your SAN group gives you a LUN that is at the opposite end of the array, I would think that was because they had
> already assigned the space in the middle to other customers (other groups like yours, or other hosts of yours).
>
> Adding your second LUN to the mix isn't going to seriously change the workload on the disks in the array.

Though I agree that I cannot guarantee what other hosts are doing on my LUNs, I would still avoid striping over partitions on the same disk. A possible bad thing is better than an absolutely sure bad thing.

On 10/18/2010 5:40 AM, Habony, Zsolt wrote:
>> (I do not mirror, as the storage gives redundancy behind LUNs.)
>
> By not enabling redundancy (Mirror or RAIDZ[123]) at the ZFS level,
> you are opening yourself to corruption problems that the underlying
> SAN storage can't protect you from. The SAN array won't even notice
> the problem.

I cannot redefine our standards here. Maybe zfs does some things better than the storage, but having standards for all the other OSs also gives advantages, and yes, I know we sacrifice some useful zfs features.

I hope that explains it, and thank you again for all your valuable comments.

Zsolt
On Mon, Oct 18, 2010 at 3:28 AM, Habony, Zsolt <zsolt.habony at hp.com> wrote:
> In many large datacenters, a different storage team handles LUN requests
> and assignment.
> We ask a LUN in a specific size, and we get one.
>
> It might result that the first vdev (LUN) is on a beginning of a RAID set
> on the storage, and the second vdev is on the end of the same RAID set on
> the same physical disks. (If not in the creation time, then later, during
> the increase of a filled zpool, by adding a LUN)
>
> I worry about head thrashing. Though memory cache of large storage should
> make the problem easier, I would be more happy if I can be sure that
> zpool will not be handled as a stripe.
>
> Is there a way to avoid it, or can we be sure that the problem does not
> exist at all ?

It shouldn't matter if LUNs are on the same backend disk. Unless the manufacturer of the array is brain dead, their wide-striping algorithm should handle it without breaking a sweat. If the pool of disks can't service the number of IOPS, the "storage team" should be moving LUNs around; that's what they get paid to do.

Your *issue* shouldn't be an issue at all unless the backend disk is junk. I've never seen an issue with Hitachi's HDP or NetApp's aggregates.

--Tim
On 2010-Oct-18 17:45:34 +0800, Casper.Dik at Sun.COM wrote:
>      -w          Write-lock (wlock) the specified file system. wlock
>                  suspends writes that would modify the file system.
>                  Access times are not kept while a file system is
>                  write-locked.
>
> All the applications trying to write will suspend. What would be the
> risk of that?

At least some versions of the Oracle RDBMS have timeouts around I/O and will abort if I/O operations don't complete within a short period.

--
Peter Jeremy
On 18 Oct 2010, at 17:44, Habony, Zsolt wrote:
> Thank You all for the comments.
>
> You should imagine a datacenter with
> - standards not completely depending on me.
> - SAN for many OSs, one of them is Solaris, (and not the major amount)

So you get LUNs from the storage team and there is nothing you can do about it. Then just use the LUNs you get as well as you can, which means a host-based mirrored zpool.

> - usually level 2 engineers doing filesystem increases.
> - hundreds of physical boxes, dozens of virtuals on one physical
> - ability to move VMs (zones) across physical boxes. (by assigning LUNs to other boxes)

You can do that even if the raid management is done host-based with zfs.

> That probably explains, that I cannot use host based raid management, it is done by storage as standard.

No, it does not. I would still let zfs do the raid management on the host side, even if you can't stop the storage team from raiding it again on the storage box.

> I cannot assign whole disks to boxes, as I get LUNs standardized for all other OSs, and in a size optimized for
> small virtual machines.

You should still mirror across two storage boxes.

Sami
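A sketch of what Sami suggests, assuming one LUN from each of two arrays; the device names are made up:

    # Host-side mirror of two LUNs, one from each storage box, so ZFS keeps
    # its self-healing ability even though each array does RAID internally:
    zpool create apppool mirror c3t600A0B8000111111d0 c5t600A0B8000222222d0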