Hiya, I am just in the planning stages for my ZFS Home Media Server build at the moment (to replace WHS v1).

I plan to use 2x motherboard ports and 2x Supermicro AOC-SASLP-MV8 8-port SATA cards to give 17* drive connections; 2 disks (120GB SATA 2.5") will be used for the ZFS install using the motherboard ports, and the remaining 15 disks (1TB SATA) will be used for data using the 2x 8-port cards.

* = the total number of ports is 18 but I only have enough space in the chassis for 17 drives (2x 2.5" in 1x 3.5" bay and 15x 3.5" using 5-in-3 hot-swap caddies in 9x 5.25" bays). All disks are 5400RPM to keep power requirements down.

The ZFS install will be mirrored, but I am not sure how to configure the 15 data disks from a performance (inc. resilvering) vs protection vs usable space perspective;

3x 5 disk raid-z. 3 disk failures in the right scenario, 12TB storage
2x 7 disk raid-z + hot spare. 2 disk failures in the right scenario, 12TB storage
1x 15 disk raid-z2. 2 disk failures, 13TB storage
2x 7 disk raid-z2 + hot spare. 4 disk failures in the right scenario, 10TB storage

Without having a mash of different raid-z* levels I can't think of any other options. I am leaning towards the first option as it gives separation between all the disks; I would have separate Movie folders on each of them while having critical data (pictures, home videos, documents etc) stored on each set of raid-z.

Suggestions welcomed.

Thanks
--
This message posted from opensolaris.org
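For reference, the first option (3x 5-disk raid-z vdevs in a single pool) would be built roughly like this; the pool name and device names below are made up and would need to match your own c*t*d* names:

  # three 5-disk raid-z vdevs in one pool (hypothetical device names)
  zpool create mediapool \
      raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
      raidz c2t5d0 c2t6d0 c2t7d0 c2t8d0 c2t9d0 \
      raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0
  zpool status mediapool

Each raidz group becomes one top-level vdev, and writes are striped across all three.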
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> The ZFS install will be mirrored, but I am not sure how to configure the 15
> data disks from a performance (inc. resilvering) vs protection vs usable space
> perspective;
>
> 3x 5 disk raid-z. 3 disk failures in the right scenario, 12TB storage
> 2x 7 disk raid-z + hot spare. 2 disk failures in the right scenario, 12TB storage
> 1x 15 disk raid-z2. 2 disk failures, 13TB storage
> 2x 7 disk raid-z2 + hot spare. 4 disk failures in the right scenario, 10TB storage

The above all provide the highest usable space (lowest hardware cost). But if you want performance, go for mirrors (highest hardware cost).
Thanks Edward.

I'm in two minds with mirrors. I know they provide the best performance and protection, and if this was a business critical machine I wouldn't hesitate.

But as it is for a home media server, which is mainly WORM access and will be storing (legal!) DVD/Blu-ray rips, I'm not so sure I can sacrifice the space.

7x 2-way mirrors would give me 7TB usable with 1 hot spare, using 1TB disks, which is a big drop from 12TB! I could always jump to 2TB disks giving me 14TB usable, but I already have 6x 1TB disks in my WHS build which I'd like to re-use.

Hmmmmmmm....!
--
This message posted from opensolaris.org
I am assuming you will put all of the vdevs into a single pool, which is a good idea unless you have a specific reason for keeping them separate, e.g. you want to be able to destroy / rebuild a particular vdev while leaving the others intact.

Fewer disks per vdev implies more vdevs, providing better random performance, lower scrub and resilver times, and the ability to expand a vdev by replacing only the few disks in it. The downside of more vdevs is that you dedicate parity to each vdev, e.g. a RAIDZ2 would need two parity disks per vdev.

> I'm in two minds with mirrors. I know they provide
> the best performance and protection, and if this was
> a business critical machine I wouldn't hesitate.
>
> But as it is for a home media server, which is mainly
> WORM access and will be storing (legal!) DVD/Blu-ray
> rips, I'm not so sure I can sacrifice the space.

For a home media server, all accesses are essentially sequential, so random performance should not be a deciding factor.

> 7x 2-way mirrors would give me 7TB usable with 1 hot
> spare, using 1TB disks, which is a big drop from
> 12TB! I could always jump to 2TB disks giving me 14TB
> usable, but I already have 6x 1TB disks in my WHS
> build which I'd like to re-use.

I would be tempted to start with a 4+2 (six disk RAIDZ2) vdev using your current disks and plan from there. There is no reason you should feel compelled to buy more 1TB disks just because you already have some.

> Am I right in saying that single disks cannot be
> added to a raid-z* vdev so a minimum of 3 would be
> required each time. However a mirror is just 2 disks
> so if adding disks over a period of time mirrors
> would be cheaper each time.

That is not correct. You cannot ever add disks to a vdev. Well, you can add additional disks to a mirror vdev, but otherwise, once you set the geometry, a vdev is stuck for life.

However, you can add any vdev you want to an existing pool. You can take a pool with a single vdev set up as a 6x RAIDZ2 and add a single disk to that pool. The previous example is a horrible idea because it makes the entire pool dependent upon a single disk. The example also illustrates that you can add any type of vdev to a pool. Most agree it is best to make the pool from vdevs of identical geometry, but that is not enforced by zfs.
--
This message posted from opensolaris.org
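To make the vdev-versus-pool distinction concrete: growing a pool is a zpool add of a whole new vdev, not adding disks into an existing raidz. A rough sketch, with made-up pool and device names:

  # add a second 6-disk raidz2 vdev to an existing pool
  zpool add tank raidz2 c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0

  # zpool warns about a mismatched replication level if you try to add
  # something with less redundancy (e.g. a single bare disk); -f overrides
  # the warning, which is exactly the horrible idea described above

That warning (and the -f needed to bypass it) is about the only guard ZFS gives you against mixing vdev types in one pool.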
Thanks martysch.

That is what I meant about adding disks to vdevs - not adding disks to vdevs but adding vdevs to pools.

If the geometry of the vdevs should ideally be the same, it would make sense to buy one more disk now and have a 7 disk raid-z2 to start with, then buy disks as and when and create a further 7 disk raid-z2, leaving the 15th disk as a hot spare. That would 'only' give 10TB usable though.

The only thing is, I seem to remember reading that if you add vdevs to a pool long after the pool was created and data has been written to it, things aren't spread evenly - is that right? So it might actually make sense to buy all the disks now and start fresh with the final build.

Starting with only 6 disks would leave growth for another 6 disk raid-z2 (to keep matching geometry), leaving 3 disks spare, which is not ideal.
--
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> But as it is for a home media server, which is mainly WORM access and will be
> storing (legal!) DVD/Blu-ray rips, I'm not so sure I can sacrifice the space.

For your purposes, raidzN will work very well. And since you're going to sequentially write your data once initially and leave it in place, even the resilver should perform pretty well.
Thanks Edward.

In that case what 'option' would you choose - smaller raid-z vdevs or larger raid-z2 vdevs? I do like the idea of having a hot spare, so 2x 7 disk raid-z2 may be the better option rather than 3x 5 disk raid-z with no hot spare. The 2TB loss in the former could be acceptable, I suppose, for the sake of better protection.

When 4-5TB drives come to market, 2-3TB drives will drop in price so I could always upgrade them - can you do this with raid-z vdevs, in terms of autoexpand?

There might be the odd deletion here and there if a movie is truly turd, but as you say, 99% of the time it will be written and left.
--
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> In that case what 'option' would you choose - smaller raid-z vdevs or larger
> raid-z2 vdevs.

The more redundant disks you have, the more protection you get, and the smaller the available disk space. So that's entirely up to you.

> When 4-5TB
> drives come to market, 2-3TB drives will drop in price so I could always
> upgrade them - can you do this with raid-z vdevs, in terms of autoexpand?

Yup. But you won't see any increase until you replace all the drives in the vdev.

> There might be the odd deletion here and there if a movie is truly turd, but
> as you say, 99% of the time it will be written and left.

That won't matter. The thing that matters is ... file fragmentation. For example, if you run bittorrent directly onto the file server, then you're going to get terrible performance for everything, because bittorrent grabs tiny little fragments all over the place in essentially random order. But if you rip directly from disc to a file, then it'll be fine because it's all serialized. Or use bittorrent onto your laptop and then copy the file all at once to the server.

The thing that's bad for performance, especially on raidz, is when you're performing lots of small random operations. And when you come back to a large file making small random modifications after it has already been written and snapshotted...
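The in-place upgrade being described would look roughly like this; pool and device names are hypothetical, and each resilver must finish before the next disk is swapped:

  # allow the pool to grow automatically once all disks in a vdev are bigger
  zpool set autoexpand=on tank

  # replace the vdev's disks one at a time with larger ones
  zpool replace tank c3t0d0 c4t0d0
  zpool status tank     # wait for the resilver to complete, then do the next

  # the extra space only shows up after the last disk in that vdev is replaced

If autoexpand was left off, the same effect can be had afterwards with "zpool online -e" on the replaced disks.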
That's how I understood autoexpand, about not doing so until all disks have been done.

I do indeed rip from disc rather than grab torrents - to VIDEO_TS folders and not ISO - on my laptop, then copy the whole folder up to WHS in one go. So while they're not one large single file, they are lots of small .vob files, written in one hit.

This is a bit OT, but can you have one vdev that is a duplicate of another vdev? By that I mean, say you had 2x 7 disk raid-z2 vdevs, instead of them both being used in one large pool, could you have one that is a backup of the other, allowing you to destroy one of them and re-build without data loss?
--
This message posted from opensolaris.org
> This is a bit OT, but can you have one vdev that is a duplicate
> of another vdev? By that I mean, say you had 2x 7 disk raid-z2
> vdevs, instead of them both being used in one large pool, could
> you have one that is a backup of the other, allowing you to
> destroy one of them and re-build without data loss?

At least two ways I can think of: maybe you can make a mirror of raidz top-level vdevs, or simply use regular zfs send/recv syncs. This may possibly solve some of the fragmentation troubles by regrouping blocks during send/recv - but I asked about this recently on the list and did not get a definite answer.

//Jim
On Wed, Jun 15, 2011 at 8:20 AM, Lanky Doodle <lanky_doodle at hotmail.com> wrote:
> That's how I understood autoexpand, about not doing so until all disks have been done.
>
> I do indeed rip from disc rather than grab torrents - to VIDEO_TS folders and not ISO - on my laptop, then copy the whole folder up to WHS in one go. So while they're not one large single file, they are lots of small .vob files, written in one hit.

I decided on 3x 6-drive RAID-Z2s for my home media server, made up of 2TB drives (a mix of Barracuda LP 5900rpm and 5K3000); it's been quite solid so far. Performance is entirely limited by GigE.

--khd
It sounds like you are getting a good plan together.

> The only thing is, I seem to remember reading that if you add vdevs to
> a pool long after the pool was created and data has been written to it,
> things aren't spread evenly - is that right? So it might actually make
> sense to buy all the disks now and start fresh with the final build.

In this scenario, balancing would not impact your performance. You would start with the performance of a single vdev. Adding the second vdev later will only increase performance, even if horribly imbalanced. Over time it will start to balance itself. If you want it balanced, you can force zfs to start balancing by copying files then deleting the originals.

> Starting with only 6 disks would leave growth for another 6 disk
> raid-z2 (to keep matching geometry), leaving 3 disks spare, which is
> not ideal.

Maintaining identical geometry only matters if all of the disks are identical. If you later add 2TB disks, then pick whatever geometry works for you. The most important thing is to maintain consistent vdev types, e.g. all RAIDZ2.

> I do like the idea of having a hot spare

I'm not sure I agree. In my anecdotal experience, sometimes my array would offline (for whatever reason) and zfs would try to replace as many disks as it could with the hot spares. If there weren't enough hot spares for the whole array, then the pool was left irreversibly damaged, having several disks in the middle of being replaced. This has only happened once or twice, and in the panic I might have handled it incorrectly, but it has spooked me from having hot spares.

> This is a bit OT, but can you have one vdev that is a duplicate of
> another vdev? By that I mean, say you had 2x 7 disk raid-z2 vdevs,
> instead of them both being used in one large pool, could you have one
> that is a backup of the other, allowing you to destroy one of them
> and re-build without data loss?

Absolutely. I do this very thing with large, slow disks holding a backup for the main disks. My home server has an SMF service which regularly synchronizes the time-slider snapshots from each main pool to the backup pool. This has saved me when a whole pool disappeared (see above) and has allowed me to make changes to the layout of the main pools.
--
This message posted from opensolaris.org
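The copy-then-delete rebalancing mentioned above is nothing more than rewriting files so their new blocks get allocated across all vdevs; a crude sketch with a made-up path (note that existing snapshots will keep the old copies referenced, so the space only balances out once those snapshots expire):

  # rewrite a file so its blocks are re-allocated across the whole pool
  cp /tank/movies/film.vob /tank/movies/film.vob.tmp && \
      mv /tank/movies/film.vob.tmp /tank/movies/film.vob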
Hi Lanky,

If you created a mirrored pool instead of a RAIDZ pool, you could use the zpool split feature to split your mirrored pool into two identical pools. For example, if you had a 3-way mirrored pool, your primary pool would remain redundant with 2-way mirrors after the split. Then you would have a non-redundant pool as a backup. You could also attach more disks to the backup pool to make it redundant.

At the end of the week or so, destroy the non-redundant pool, re-attach the disks to your primary pool, and repeat. This is what I would do, with daily snapshots and a monthly backup.

Make sure you develop a backup strategy for any pool you build.

Thanks,

Cindy

On 06/15/11 06:20, Lanky Doodle wrote:
> That's how I understood autoexpand, about not doing so until all disks have been done.
>
> I do indeed rip from disc rather than grab torrents - to VIDEO_TS folders and not ISO - on my laptop, then copy the whole folder up to WHS in one go. So while they're not one large single file, they are lots of small .vob files, written in one hit.
>
> This is a bit OT, but can you have one vdev that is a duplicate of another vdev? By that I mean, say you had 2x 7 disk raid-z2 vdevs, instead of them both being used in one large pool, could you have one that is a backup of the other, allowing you to destroy one of them and re-build without data loss?
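The split/re-attach cycle Cindy describes would look something like the following on a mirrored pool; all pool and device names are hypothetical:

  # detach one side of every mirror into a new pool called "backup"
  zpool split tank backup

  # the new pool is left exported by default; import it to use or verify it
  zpool import backup

  # at the end of the cycle, destroy it and put the disks back into the mirrors
  zpool destroy backup
  zpool attach tank c2t0d0 c2t1d0      # repeat per mirror pair

zpool split only works on pools made entirely of mirror vdevs, which is one more factor in the mirrors-versus-raidz trade-off discussed earlier.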
> 3x 5 disk raid-z. 3 disk failures in the right scenario, 12TB storage
> 2x 7 disk raid-z + hot spare. 2 disk failures in the right scenario, 12TB storage
> 1x 15 disk raid-z2. 2 disk failures, 13TB storage
> 2x 7 disk raid-z2 + hot spare. 4 disk failures in the right scenario, 10TB storage

If paranoid, use two RAIDz2 VDEVs and a spare. If not, use a single RAIDz2 or RAIDz3 VDEV with 14-15 drives and 1-2 spares. If you choose two VDEVs, replacing the drives in one of them with bigger ones as the pool grows will be more flexible, but may lead to badly balanced pools (although I just saw that fixed in Illumos/OpenIndiana - dunno about s11ex, fbsd or other platforms).

Personally, I'm a bit paranoid, and prefer to use smaller VDEVs. With 7 drives per VDEV in RAIDz2, and a spare, you may still have sufficient space for some time. If this isn't backed up somewhere else, I'd be a wee bit paranoid indeed :)

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Thanks guys.

I have decided to bite the bullet and change to 2TB disks now rather than go through all the effort using 1TB disks and then maybe changing in 6-12 months time or whatever. The price difference between 1TB and 2TB disks is marginal and I can always re-sell my 6x 1TB disks.

I think I have also narrowed down the raid config to the following;

2x 7 disk raid-z2 with 1 hot spare - 20TB usable
3x 5 disk raid-z2 with no hot spare - 18TB usable
2x 6 disk raid-z2 with 2 hot spares - 16TB usable

with option 1 probably being preferred at the moment.

I am aware that bad batches of disks do exist, so I tend to either a) buy them in sets from different suppliers or b) use different manufacturers. How sensitive to different disks is ZFS, in terms of disk features (NCQ, RPM speed, firmware/software versions, cache etc)?

Thanks
--
This message posted from opensolaris.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> can you have one vdev that is a duplicate of another
> vdev? By that I mean, say you had 2x 7 disk raid-z2 vdevs, instead of them
> both being used in one large pool, could you have one that is a backup of the
> other, allowing you to destroy one of them and re-build without data loss?

Well, you can't make a vdev from other vdevs, so you can't make a mirror of raidz, if that's what you were hoping. As Cindy mentioned, you can split mirrors...

Or you could use zfs send | zfs receive to sync one pool to another pool. This would not care whether the architecture of the two pools is the same (the 2nd pool could have different or nonexistent redundancy). But this will be based on snapshots.
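A minimal sketch of the snapshot-based sync, assuming a main pool called "tank" and a backup pool called "backup" (both names are made up):

  # one-time full replication of everything in tank onto the backup pool
  zfs snapshot -r tank@sync-1
  zfs send -R tank@sync-1 | zfs receive -Fd backup

  # later runs only send what changed since the previous snapshot
  zfs snapshot -r tank@sync-2
  zfs send -R -i tank@sync-1 tank@sync-2 | zfs receive -Fd backup

Because the receiving side just stores whatever the stream contains, the backup pool's layout (mirror, raidz, even a single disk) is irrelevant to the copy.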
> I have decided to bite the bullet and change to 2TB disks now rather
> than go through all the effort using 1TB disks and then maybe changing
> in 6-12 months time or whatever. The price difference between 1TB and
> 2TB disks is marginal and I can always re-sell my 6x 1TB disks.
>
> I think I have also narrowed down the raid config to the following;
>
> 2x 7 disk raid-z2 with 1 hot spare - 20TB usable
> 3x 5 disk raid-z2 with no hot spare - 18TB usable
> 2x 6 disk raid-z2 with 2 hot spares - 16TB usable
>
> with option 1 probably being preferred at the moment.

I would choose option 1. I have similar configurations in production. A hot spare can be very good when a drive dies while you're not watching.

> I am aware that bad batches of disks do exist, so I tend to either a)
> buy them in sets from different suppliers or b) use different
> manufacturers. How sensitive to different disks is ZFS, in terms of
> disk features (NCQ, RPM speed, firmware/software versions, cache etc)?

For a home server, it shouldn't make much difference - the network is likely to be the bottleneck anyway. If you mix drives with different spin rates in a pool/vdev, the slower ones will probably pull down performance, so if you're considering "green" drives, use them for all the drives. Mixing Seagate, Samsung and Western Digital drives should work well for this.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
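If you do go with option 1, the spare is just another entry in the pool configuration; a sketch with hypothetical names:

  # add a hot spare that ZFS will pull in automatically when a disk faults
  zpool add tank spare c4t7d0
  zpool status tank

  # for a warm spare instead, leave the disk unconfigured and run a manual
  # "zpool replace tank <failed-disk> <spare-disk>" when something dies

The warm-spare approach avoids the aggressive auto-replacement behaviour Marty described earlier in the thread.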
On Jun 16, 2011, at 2:07 AM, Lanky Doodle wrote:
> Thanks guys.
>
> I have decided to bite the bullet and change to 2TB disks now rather than go through all the effort using 1TB disks and then maybe changing in 6-12 months time or whatever. The price difference between 1TB and 2TB disks is marginal and I can always re-sell my 6x 1TB disks.
>
> I think I have also narrowed down the raid config to the following;
>
> 2x 7 disk raid-z2 with 1 hot spare - 20TB usable
> 3x 5 disk raid-z2 with no hot spare - 18TB usable
> 2x 6 disk raid-z2 with 2 hot spares - 16TB usable
>
> with option 1 probably being preferred at the moment.

Sounds good to me.

> I am aware that bad batches of disks do exist, so I tend to either a) buy them in sets from different suppliers or b) use different manufacturers. How sensitive to different disks is ZFS, in terms of disk features (NCQ, RPM speed, firmware/software versions, cache etc)?

Actually, ZFS has no idea it is talking to a disk. ZFS uses block devices. So there is nothing in ZFS that knows about NCQ, speed, or any of those sorts of attributes. For the current disk drive market, you really don't have much choice... most vendors offer very similar disks.
 -- richard
On Thu, Jun 16, 2011 at 07:06:48PM +0200, Roy Sigurd Karlsbakk wrote:
> > I have decided to bite the bullet and change to 2TB disks now rather
> > than go through all the effort using 1TB disks and then maybe changing
> > in 6-12 months time or whatever. The price difference between 1TB and
> > 2TB disks is marginal and I can always re-sell my 6x 1TB disks.
> >
> > I think I have also narrowed down the raid config to the following;
> >
> > 2x 7 disk raid-z2 with 1 hot spare - 20TB usable
> > 3x 5 disk raid-z2 with no hot spare - 18TB usable
> > 2x 6 disk raid-z2 with 2 hot spares - 16TB usable
> >
> > with option 1 probably being preferred at the moment.
>
> I would choose option 1. I have similar configurations in
> production. A hot spare can be very good when a drive dies while
> you're not watching.

I would probably also go for option 1, with some additional considerations:

1 - are the 2 vdevs in the same pool, or two separate pools?

If the majority of your bulk data can be balanced manually or by application software across 2 filesystems/pools, this offers you the opportunity to replicate smaller, more critical data between pools (and controllers). This offers better protection against whole-pool problems (bugs, fat fingers). With careful arrangement, you could even have one pool spun down most of the time.

You mentioned something early on that implied this kind of thinking, but it seems to have gone by the wayside since. If you can, I would recommend 2 pools if you go for 2 vdevs. Conversely, in one pool, you might as well go for 15xZ3 since even this will likely cover performance needs (and see #4).

2 - disk purchase schedule

With 2 vdevs, regardless of 1 or 2 pools, you could defer purchase of half the 2Tb drives. With 2 pools, you can use the 6x 1Tb and change that later to 7x with the next purchase, with some juggling of data. You might be best to buy 1 more 1Tb to get the shape right at the start for in-place upgrades, and in a single pool this is essentially mandatory. By the time you need more space and buy the second tranche of drives, 3+Tb drives may be the better option.

3 - spare temperature

For levels raidz2 and better, you might be happier with a warm spare and manual replacement, compared to overly-aggressive automated replacement if there is a cascade of errors. See recent threads.

You may also consider a cold spare, leaving a drive bay free for disks-as-backup-tapes swapping. If you replace the 1Tb's now, repurpose them for this rather than reselling.

Whatever happens, if you have a mix of drive sizes, your spare should be of the larger size. Sorry for stating the obvious! :-)

4 - the 16th port

Can you find somewhere inside the case for an SSD as L2ARC on your last port? Could be very worthwhile for some of your other data and metadata (less so the movies).

--
Dan.
> 1 - are the 2 vdevs in the same pool, or two separate
> pools?

I was planning on having the 2 z2 vdevs in one pool. Although having 2 pools and keeping them synced sounds really good, I fear it may be overkill for the intended purpose.

> 3 - spare temperature
>
> for levels raidz2 and better, you might be happier
> with a warm spare and manual replacement, compared to
> overly-aggressive automated replacement if there is a
> cascade of errors. See recent threads.
>
> You may also consider a cold spare, leaving a drive
> bay free for disks-as-backup-tapes swapping. If you
> replace the 1Tb's now, repurpose them for this rather
> than reselling.

I have considered this. The fact I am using cheap disks inevitably means they will fail sooner and more often than enterprise equivalents, so the hot spare may need to be over-used.

Could I have different sized vdevs and still have them both in one pool - i.e. an 8 disk z2 vdev and a 7 disk z2 vdev?

> 4 - the 16th port
>
> Can you find somewhere inside the case for an SSD as
> L2ARC on your last port? Could be very worthwhile for
> some of your other data and metadata (less so the movies).

Yes! I have 10 5.25" drive bays in my case. 9 of them are occupied by the 5-in-3 hot-swap caddies, leaving 1 bay left. I was planning on using one of these http://www.scan.co.uk/products/icy-dock-mb994sp-4s-4in1-sas-sata-hot-swap-backplane-525-raid-cage in the drive bay and having 2x 2.5" SATA drives mirrored for the root pool, leaving 2 drive bays spare. For the mirrored root pool I was going to use 2 of the 6 motherboard SATA II ports so they are entirely separate to the 'data' controllers.

So I could either use the 16th port on the Supermicro controllers for an SSD, or one of the remaining motherboard ports.

What size would you recommend for the L2ARC disk? I ask as I have a 72GB SAS 10k disk spare so could use this for now (being faster than SATA), but it would have to be on the Supermicro card as this also supports SAS drives. SSDs are a bit out of range price wise at the moment so I'd wait to use one. Also, ZFS doesn't support TRIM yet, does it?

Thank you for your excellent post! :)
--
This message posted from opensolaris.org
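Adding (or later removing) a cache device is a simple per-pool operation, and the device needs no redundancy since L2ARC contents can always be re-read from the pool; a sketch with a hypothetical device name for the spare 72GB drive:

  # add the drive as an L2ARC (cache) device
  zpool add tank cache c1t4d0

  # cache devices can be removed again later, e.g. when swapping in an SSD
  zpool remove tank c1t4d0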
Thanks Richard.

How does ZFS enumerate the disks? In terms of listing them, does it do them logically, i.e;

controller #1 (motherboard)
|
|--- disk1
|--- disk2
controller #3
|
|--- disk3
|--- disk4
|--- disk5
|--- disk6
|--- disk7
|--- disk8
|--- disk9
|--- disk10
controller #4
|
|--- disk11
|--- disk12
|--- disk13
|--- disk14
|--- disk15
|--- disk16
|--- disk17
|--- disk18

or is it completely random, leaving me with some trial and error to work out what disk is on what port?
--
This message posted from opensolaris.org
> I was planning on using one of these
> http://www.scan.co.uk/products/icy-dock-mb994sp-4s-4in1-sas-sata-hot-swap-backplane-525-raid-cage

Imagine if 2.5" 2TB disks were price neutral compared to 3.5" equivalents. I could have 40 of the buggers in my system giving 80TB raw storage!!!!!

I'd happily use mirrors all the way in that scenario....
--
This message posted from opensolaris.org
> 4 - the 16th port
>
> Can you find somewhere inside the case for an SSD as
> L2ARC on your last port?

Although saying that, if we are saying hot spares may be bad in my scenario, I could ditch it and use a 3.5" SSD in the 15th drive's place?
--
This message posted from opensolaris.org
On 6/17/2011 12:55 AM, Lanky Doodle wrote:
> Thanks Richard.
>
> How does ZFS enumerate the disks? In terms of listing them, does it do them logically, i.e;
>
> [controller / disk tree snipped]
>
> or is it completely random, leaving me with some trial and error to work out what disk is on what port?

This is not a ZFS issue, this is a Solaris device driver issue. Solaris uses a location-based disk naming scheme, NOT the BSD/Linux style of simply incrementing the disk numbers. I.e. drives are usually named something like c<controller>t<target>d<disk>.

In most cases, the on-board controllers receive a lower controller number than any add-in adapters, and add-in adapters are enumerated in PCI ID order. However, there is no good explanation of exactly *what* number a given controller may be assigned. After receiving a controller number, disks are enumerated in ascending order by ATA ID, SCSI ID, SAS WWN, or FC WWN. The naming rules can get a bit complex.

--
Erik Trimble
Java Platform Group Infrastructure
Mailstop: usca22-317
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (UTC-0800)
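A few stock Solaris commands are usually enough to work out which c# corresponds to which physical controller before any data goes on the disks; nothing here is ZFS-specific:

  # list every disk the OS sees, with its c#t#d# name, then exit
  format </dev/null

  # the /dev/dsk entries are symlinks to the physical PCI device paths,
  # which shows which controller each c# number really belongs to
  ls -l /dev/dsk/c*t*d*s0

  # per-disk vendor, model and serial number, handy for matching drive labels
  iostat -En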
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Lanky Doodle
>
> or is it completely random, leaving me with some trial and error to work out
> what disk is on what port?

It's highly desirable to have drives with lights on them, so you can manually make the light blink (or stay on) just by reading the drive with dd.

Even if you dig down and quantify precisely how the drives are numbered and in which order, you would have to find labels printed on the system board or other SATA controllers and trace the spaghetti of the SATA cables, and if you make any mistake along the way, you destroy your pool. (Being dramatic, but not necessarily unrealistic.)

Lights. Good.
> Lights. Good.

Agreed. In a fit of desperation and stupidity I once enumerated disks by pulling them one by one from the array to see which zfs device faulted.

On a busy array it is hard even to use the LEDs as indicators.

It makes me wonder how large shops with thousands of spindles handle this.
--
This message posted from opensolaris.org
On 6/17/2011 6:52 AM, Marty Scholes wrote:
>> Lights. Good.
>
> Agreed. In a fit of desperation and stupidity I once enumerated disks by pulling them one by one from the array to see which zfs device faulted.
>
> On a busy array it is hard even to use the LEDs as indicators.
>
> It makes me wonder how large shops with thousands of spindles handle this.

We pay for the brand-name disk enclosures or servers where the fault-management stuff is supported by Solaris.

Including the blinky lights.

<grin>

--
Erik Trimble
Java Platform Group Infrastructure
Mailstop: usca22-317
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (UTC-0800)
Funny you say that.

My Sun v40z connected to a pair of Sun A5200 arrays running OSol 128a can't see the enclosures. The luxadm command comes up blank.

Except for that annoyance (and similar other issues) the Sun gear works well with a Sun operating system.

Sent from Yahoo! Mail on Android
2011-06-18 0:24, marty scholes wrote:
>>> It makes me wonder how large shops with thousands of spindles handle this.
>>
>> We pay for the brand-name disk enclosures or servers where the
>> fault-management stuff is supported by Solaris.
>>
>> Including the blinky lights.
>>
>> <grin>
>
> Funny you say that.
>
> My Sun v40z connected to a pair of Sun A5200 arrays running OSol 128a
> can't see the enclosures. The luxadm command comes up blank.
>
> Except for that annoyance (and similar other issues) the Sun gear
> works well with a Sun operating system.

For the sake of weekend sarcasm: Why would you wonder? That's the wrong brand name, and it is too old. Does it say "Oracle" anywhere on the label? Really, "v40z", pff! When was it made? Like, in the two-thousand-zeros, back when dinosaurs roamed the earth and Sun was high above the horizon? Is it still supported at all, let alone Solaris (not OSol, may I add)?
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Marty Scholes
>
> On a busy array it is hard even to use the LEDs as indicators.

Offline the disk. Light stays off. Use dd to read the disk. Light stays on. That should make it easy enough.

Also, depending on your HBA, lots of times you can blink an amber LED instead of the standard green one.
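The dd trick is just a sustained sequential read of the raw device; a sketch with a hypothetical disk name (on x86, p0 refers to the whole disk):

  # keep one drive's activity light on by reading it continuously
  dd if=/dev/rdsk/c3t5d0p0 of=/dev/null bs=1024k

  # interrupt with Ctrl-C once you have spotted which bay lights up

Doing this on an otherwise idle pool makes the target drive obvious; on a busy pool, offlining the disk first (so its light goes dark) works the other way around.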
On Jun 17, 2011, at 12:55 AM, Lanky Doodle wrote:
> Thanks Richard.
>
> How does ZFS enumerate the disks? In terms of listing them, does it do them logically, i.e;
>
> [controller / disk tree snipped]
>
> or is it completely random, leaving me with some trial and error to work out what disk is on what port?

For all intents and purposes, it is random. Slot locations are the responsibility of the enclosure, not the disk. Until we get a better framework integrated into illumos, you can get the bay location from a SES-compliant enclosure via the fmtopo output, lsiutil, or the sg_utils. For NexentaStor users I provide some automation for this in a KB article on the customer portal. Also for NexentaStor users, DataON offers a GUI plugin called DSM that shows the enclosure, blinky lights, and all of the status information available -- power supplies, fans, etc -- good stuff!

For the curious, fmtopo shows the bay for each disk and the serial number of the disk therein. You can then cross-reference the c*t*d* number for the OS instance to the serial number. Note that for dual-port disks, you can get different c*t*d* numbers for each node connected to the disk (rare, but possible). Caveat: please verify prior to rolling into production that the bay number matches the enclosure silkscreen. The numbers are programmable and different vendors deliver the same enclosure with different silkscreened numbers. As always, the disk serial number is supposed to be unique, so you can test this very easily.

For the later Nexenta, OpenSolaris or Solaris 11 Express releases, the mpt_sas driver will try to light the OK2RM (ok to remove) LED for a disk when you use cfgadm to disconnect the paths. Apparently this also works for SATA disks in an enclosure that manages SATA disks. The process is documented very nicely by Cindy in the ZFS Admin Guide. However, there are a number of enclosures that do not have an OK2RM LED. YMMV.
 -- richard
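The cross-reference Richard describes boils down to comparing two lists; a rough sketch (the fmtopo path shown is typical but may differ by release, and it only helps if the enclosure speaks SES):

  # enclosure topology, including bay numbers and the disks' serial numbers
  /usr/lib/fm/fmd/fmtopo -V | more

  # c#t#d# names with vendor, model and serial number for each disk
  iostat -En

Matching the serial numbers between the two outputs gives the c#t#d#-to-bay mapping.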
Thanks for all the replies. I have a pretty good idea how the disk enclosure assigns slot locations, so should be OK.

One last thing - I see that Supermicro has just released a newer version of the card I mentioned in the first post that supports SATA 6Gbps. From what I can see it uses the Marvell 9480 controller, which I don't think is supported in Solaris 11 Express yet.

Does this mean it strictly won't work (i.e. no available drivers) or that it just wouldn't be supported if there are problems?
--
This message posted from opensolaris.org
Sorry to pester, but is anyone able to say if the Marvell 9480 chip is now supported in Solaris? The article I read saying it wasn't supported was dated May 2010, so over a year ago.
--
This message posted from opensolaris.org
OK, I have finally settled on hardware;

2x LSI SAS3081E-R controllers
2x Seagate Momentus 5400.6 rpool disks
15x Hitachi 5K3000 'data' disks

I am still undecided as to how to group the disks. I have read elsewhere that raid-z1 is best suited to either 3 or 5 disks and raid-z2 is better suited to 6 or 10 disks - is there any truth in this (although I think this was in reference to 4K sector disks)?

3x 5 drive z1 = 24TB usable
2x 6 drive z2 = 16TB usable

keeping to those recommendations, or

2x 7 disk z2 = 20TB usable with 1 cold/warm/hot spare

as per my original idea.
--
This message posted from opensolaris.org
The LSI2008 chipset is supported and works very well.

I would actually use 2 vdevs, 8 disks in each, and I would configure each vdev as raidz2. Maybe use one hot spare.

And I also have personal, subjective reasons: I like to use the number 8 in computers. 7 is an ugly number. Everything is based on powers of 2 in computers. A pocket calculator which only accepts the digits 1-8, but not the digit "9", is really ugly (having 7 discs, but not 8, is ugly). Some time ago, there was a problem unless you used an even number of discs; that problem is corrected now.

I would definitely use raidz2, because resilver time will be very long with 4-5TB disks, potentially several days. During that time, another disk problem such as a read error might occur, which means you lose all your data.
--
This message posted from opensolaris.org
Thanks.

I ruled out the SAS2008 controller as my motherboard is only PCIe 1.0, so I would not have been able to make the most of the increased bandwidth.

I can't see myself upgrading every few months (my current WHS build has lasted over 4 years without a single change), so by the time I do come to upgrade, PCIe will probably be obsolete!!
--
This message posted from opensolaris.org
On Tue, Jul 5, 2011 at 6:54 AM, Lanky Doodle <lanky_doodle at hotmail.com> wrote:
> OK, I have finally settled on hardware;
>
> 2x LSI SAS3081E-R controllers
> 2x Seagate Momentus 5400.6 rpool disks
> 15x Hitachi 5K3000 'data' disks
>
> I am still undecided as to how to group the disks. I have read elsewhere that raid-z1 is best suited to either 3 or 5 disks and raid-z2 is better suited to 6 or 10 disks - is there any truth in this, although I think this was in reference to 4K sector disks;
>
> 3x 5 drive z1 = 24TB usable
> 2x 6 drive z2 = 16TB usable

Take a look at https://spreadsheets.google.com/spreadsheet/ccc?key=0AtReWsGW-SB1dFB1cmw0QWNNd0RkR1ZnN0JEb2RsLXc&hl=en_US

I did a bunch of testing with 40 drives. I varied the configuration between extremes of 10 vdevs of 4 disks each to one vdev of all 40 drives. All vdevs were raidz2, so my net capacity changed, but I was looking for _relative_ performance differences. I did not test sequential reads, as that was not one of our expected I/O patterns. I believe the OS was Solaris 10U8. I know it was at least zpool version 15 and may have been 22.

I used the same 40 drives in all the test cases, as I had seen differences between drives, and chose 40 that all had roughly matching svc_t values (from iostat). Eventually we had Sun/Oracle come in and replace any drive whose svc_t was substantially higher than the others (these drives also usually had lots of added bad blocks mapped).

> keeping to those recommendations, or
>
> 2x 7 disk z2 = 20TB usable with 1 cold/warm/hot spare

The testing utilized a portion of our drives; we have 120 x 750GB SATA drives in J4400s, dual pathed. We ended up with 22 vdevs, each a raidz2 of 5 drives, with one drive in each of the J4400s, so we can lose two complete J4400 chassis and not lose any data.

--
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
On Tue, Jul 5, 2011 at 7:47 AM, Lanky Doodle <lanky_doodle at hotmail.com> wrote:
> Thanks.
>
> I ruled out the SAS2008 controller as my motherboard is only PCIe 1.0, so I would not have been able to make the most of the increased bandwidth.

Only PCIe 1.0? What chipset is that based on? It might be worthwhile to upgrade, as I believe Solaris power management has a fairly recent cutoff in terms of processor support (AMD Family 16 or better, Intel Nehalem or newer is what I've been told). PCIe 2.0 has been around for quite a while; PCIe 3.0 will be making an appearance on Ivy Bridge CPUs (and has already been announced by FPGA vendors), but I'm fairly confident that graphics cards will be the first target market to utilize that.

Another thing to consider is that you could buy the SAS2008-based cards and move them from motherboard to motherboard for the foreseeable future (copper PCI Express isn't going anywhere for a long time). Don't kneecap yourself because of your current mobo.

--khd
On Tue, 5 Jul 2011, Lanky Doodle wrote:
>
> I am still undecided as to how to group the disks. I have read
> elsewhere that raid-z1 is best suited to either 3 or 5 disks and
> raid-z2 is better suited to 6 or 10 disks - is there any truth in
> this, although I think this was in reference to 4K sector disks;

The decision to use raid-z1 should be based on the type and size of the drives. If you are using small enterprise-class SAS drives then raid-z1 is ok, but if you are using large near-line SAS/SATA or large desktop SATA drives then you should use raid-z2 instead. The reason for this is that you don't want to experience the case where the remaining drives in a raid-z1 encounter a failure while you are resilvering to replace a failed drive.

If you have a very good backup system and can afford to restore the whole zfs pool from scratch, then that might be an argument to use raid-z1.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Tue, Jul 5, 2011 at 12:54 PM, Lanky Doodle <lanky_doodle at hotmail.com> wrote:
> OK, I have finally settled on hardware;
>
> 2x LSI SAS3081E-R controllers

Beware that this controller does not support drives larger than 2TB.

--
Trond Michelsen
Thanks Trond.

I am aware of this, but to be honest I will not be upgrading very often (my current WHS setup has lasted 5 years without a single change!) and certainly not with each iteration of TB size increase, so by the time I do upgrade, say in the next 5 years, PCIe will have probably been replaced, or got to revision 10.0 or something stupid!

And anyway, my current motherboard (expensive server board) is only PCIe 1.0, so I wouldn't get the benefit of having a PCIe 2.0 card.
--
This message posted from opensolaris.org
> The testing utilized a portion of our drives; we
> have 120 x 750GB SATA drives in J4400s, dual pathed.
> We ended up with 22 vdevs, each a raidz2 of 5 drives,
> with one drive in each of the J4400s, so we can
> lose two complete J4400 chassis and not lose any
> data.

Thanks pk.

You know, I never thought about doing 5 drive z2's. That would be an acceptable compromise for me against 2x 7 drive z2's, as;

1) resilver times should be faster
2) 5 drive groupings, matching my 5 drive caddies
3) only losing 2TB usable against 2x 7 drive z2's
4) IOPS should be faster
5) if and when I scale up, I can add another 5 drives, in another 5 drive caddy

Super!
--
This message posted from opensolaris.org