Matt
2010-Mar-18 00:03 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
Dear list,

I am in the process of speccing an OpenSolaris box for iSCSI storage of XenServer domUs. I'm trying to get the best performance from a combination of decent SATA II disks and some SSDs, and I would really appreciate some feedback on my plans.

I don't have much idea what the workload will be like because we simply haven't got any existing implementation to guide us. All I can say is that the vast majority of domUs will be small Linux web servers, so I guess it will be largely random IO...

I was planning on using either a current development build of OpenSolaris or perhaps the next release version if it comes out in time - I understand 2009.06 has some issues which negatively affect iSCSI and/or ZFS performance?

Here is what I have in mind for the hardware:

1 x Supermicro 4U rackmount chassis, 24 x 3.5in SAS hot-swap bays
1 x Supermicro X8ST3-F server board (LGA1366, DDR3, SAS/SATA2 RAID, IPMI, GbE, PCIe, ATX) MBD-X8ST3-F-O
2 x Intel dual-port gigabit NICs (model to be decided)
1 x Supermicro AOC-USAS-L4i UIO RAID adapter, SAS 8-port, 16MB, PCIe x8
1 x Intel Xeon E5520
6 x 2GB registered ECC RAM = 12GB total
2 x 160GB Intel X25-M MLC SSDs for L2ARC
2 x 32GB Intel X25-E SLC SSDs for ZIL
18 x WD 250GB RE3 7200RPM 16MB for storage (arranged as 4 x 6-disk raidz2)
2 x 250GB SATA II for rpool mirror

This will sit on a dedicated gigabit Ethernet storage network, and the above gives me 4 x gigabit NICs' worth of throughput (ignoring the two NICs on the motherboard, which I will need for management and maybe a crossover for copying data to another box). We are hoping the hardware will scale to over 50 domUs across three dom0 boxes, but wouldn't be surprised if the network is saturated well before then, at which point we may have to look to 10gig Ethernet.

The decision to use iSCSI over NFS was made primarily because we thought the dom0s would cache some and thus reduce the amount of data travelling over the wire.

Does this configuration look OK? Stupid question: should I have a battery-backed SAS adapter? It will allegedly be protected by a UPS, but...

Later on, I would like to add a second, lower-spec box to continuously (or near-continuously) mirror the data (using a gig crossover cable, maybe). I have seen lots of ways of mirroring data to other boxes, which has left me with more questions than answers. Is there a simple, robust way of doing this without setting up a complex HA service, while at the same time minimising load on the master?

Thanks in advance, and sorry for the barrage of questions,

Matt.
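For reference, the pool layout described above would translate into something like the following sketch - the device names are placeholders and the exact number of raidz2 vdevs depends on how many data drives end up in the box:

    # data pool built from 6-disk raidz2 vdevs, a mirrored SLOG on the
    # X25-E SSDs, and the X25-M SSDs as L2ARC cache devices
    zpool create tank \
        raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
        raidz2 c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
        raidz2 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 \
        log mirror c2t0d0 c2t1d0 \
        cache c2t2d0 c2t3d0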
Ian Collins
2010-Mar-18 02:07 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
On 03/18/10 01:03 PM, Matt wrote:

Skipping the iSCSI and SAS questions...

> Later on, I would like to add a second, lower-spec box to continuously (or near-continuously) mirror the data (using a gig crossover cable, maybe). I have seen lots of ways of mirroring data to other boxes, which has left me with more questions than answers. Is there a simple, robust way of doing this without setting up a complex HA service, while at the same time minimising load on the master?

The answer really depends on how current you wish to keep your backup and how much data you have to replicate.

I have a couple of x4540s which use ZFS send/receive to replicate each other hourly. Each box has about 4TB of data, with maybe 10G of changes per hour. I have run the replication every 15 minutes, but hourly is good enough for us.

-- 
Ian.
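The incremental pattern behind that kind of scheduled replication looks roughly like this - the pool name, snapshot names and ssh target here are hypothetical:

    # on the source box: take a new recursive snapshot
    zfs snapshot -r tank@2010-03-18-02

    # send only the changes since the previous snapshot and apply
    # them to the pool on the other box
    zfs send -R -i tank@2010-03-18-01 tank@2010-03-18-02 | \
        ssh otherbox zfs receive -dF tank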
David Dyer-Bennet
2010-Mar-18 03:52 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
On 3/17/2010 21:07, Ian Collins wrote:

> I have a couple of x4540s which use ZFS send/receive to replicate each other hourly. Each box has about 4TB of data, with maybe 10G of changes per hour. I have run the replication every 15 minutes, but hourly is good enough for us.

What software version are you running? And can you show me the zfs send / zfs receive commands for the incremental case that are working for you?

My incremental replication streams never complete; they hang part-way through (and then require a system reboot to free up the IO system). I'm running 2009.06, which is 111b.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Matt
2010-Mar-18 10:00 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
Ultimately this could have 3TB of data on it, and it is difficult to estimate the volume of changed data. It would be nice to have changes mirrored immediately but asynchronously, so as not to impede the master. The second box is likely to have a lower spec with fewer spindles for cost reasons, with immediate failover taking second place to data preservation in the event of a failure of the master.

I had looked at this:

http://hub.opensolaris.org/bin/view/Project+avs/WebHome

But it did seem overkill to me, and doesn't that mean that a resilver on the master will be replicated on the slave even if not required?

A zfs send/receive every 15 minutes might well have to do.

Matt.
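If a 15-minute send/receive does turn out to be good enough, it can be driven by a small wrapper script from cron. A minimal sketch, assuming hypothetical pool names, passwordless ssh to the slave and a state file recording the last snapshot sent:

    #!/usr/bin/ksh
    # replicate.ksh - incremental zfs send/receive from tank to the slave
    NOW=$(date '+%Y%m%d%H%M')
    PREV=$(cat /var/run/last-repl)

    zfs snapshot -r tank@$NOW
    zfs send -R -i tank@$PREV tank@$NOW | ssh slave zfs receive -dF tank \
        && echo $NOW > /var/run/last-repl

    # crontab entry to run it every 15 minutes:
    # 0,15,30,45 * * * * /usr/local/bin/replicate.ksh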
Scott Meilicke
2010-Mar-18 15:50 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
It is hard, as you note, to recommend a box without knowing the load. How many Linux boxes are you talking about?

I think having a lot of space for your L2ARC is a great idea.

Will you mirror your SLOG, or load balance them? I ask because perhaps one will be enough, IO-wise. My box has one SLOG (X25-E) and can support about 2600 IOPS using an iometer profile that closely approximates my workload. My ~100 VMs on 8 ESX boxes average around 1000 IOPS, but can peak at 2-3x that during backups.

Don't discount NFS. I absolutely love NFS for management and thin provisioning reasons. Much easier (to me) than managing iSCSI, and performance is similar. I highly recommend load testing both iSCSI and NFS before you go live. Crash-consistent backups of your VMs are possible using NFS, and recovering a VM from a snapshot is a little easier using NFS, I find.

Why not larger capacity disks?

Hopefully your switches support NIC aggregation?

The only issue I have had on 2009.06 using iSCSI (I had a Windows VM directly attaching to a 4T iSCSI volume) was solved and backported to 2009.06 (bug 6794994).

-Scott
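Both kinds of share are quick to stand up on OpenSolaris for a load-test comparison. A rough sketch with hypothetical dataset names, assuming the COMSTAR stack on the iSCSI side:

    # NFS: share a filesystem to the XenServer hosts
    zfs create tank/vms-nfs
    zfs set sharenfs=rw tank/vms-nfs

    # iSCSI via COMSTAR: export a sparse zvol as a LUN
    zfs create -s -V 500G tank/vms-iscsi
    sbdadm create-lu /dev/zvol/rdsk/tank/vms-iscsi
    stmfadm add-view <lu-guid-printed-by-sbdadm>
    itadm create-target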
Matt
2010-Mar-18 16:56 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
> It is hard, as you note, to recommend a box without knowing the load. How many Linux boxes are you talking about?

This box will act as a backing store for a cluster of 3 or 4 XenServers with upwards of 50 VMs running at any one time.

> Will you mirror your SLOG, or load balance them? I ask because perhaps one will be enough, IO-wise. My box has one SLOG (X25-E) and can support about 2600 IOPS using an iometer profile that closely approximates my workload. My ~100 VMs on 8 ESX boxes average around 1000 IOPS, but can peak at 2-3x that during backups.

I was planning to mirror them - mainly in the hope that I could hot swap a new one in the event that an existing one started to degrade. I suppose I could start with one of each and convert to a mirror later, although the prospect of losing either disk fills me with dread.

> Don't discount NFS. I absolutely love NFS for management and thin provisioning reasons. Much easier (to me) than managing iSCSI, and performance is similar. I highly recommend load testing both iSCSI and NFS before you go live. Crash-consistent backups of your VMs are possible using NFS, and recovering a VM from a snapshot is a little easier using NFS, I find.

That's interesting feedback. Given how easy it is to create NFS and iSCSI shares in osol, I'll definitely try both and see how they compare.

> Why not larger capacity disks?

We will run out of IOPS before we run out of space. It is more likely that we will gradually replace some of the SATA drives with 6Gbps SAS drives to help with that, and we've been mulling over using an LSI SAS 9211-8i controller to provide that upgrade path:

http://www.lsi.com/storage_home/products_home/host_bus_adapters/sas_hbas/internal/sas9211-8i/index.html

> Hopefully your switches support NIC aggregation?

Yes, we're hoping that a bond of 4 x NICs will cope.

Any opinions on the use of battery-backed SAS adapters? It also occurred to me after writing this that perhaps we could use one and configure it to report writes as being flushed to disk before they actually were. That might give a slight edge in performance in some cases, but I would prefer to have the data security instead, tbh.

Matt.
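On the OpenSolaris side that bond would be a dladm link aggregation. A minimal sketch, assuming hypothetical interface names and LACP-capable switch ports:

    # aggregate the four add-in NIC ports into one logical link
    dladm create-aggr -L active -l igb0 -l igb1 -l igb2 -l igb3 aggr1

    # plumb it on the storage network
    ifconfig aggr1 plumb 192.168.10.10 netmask 255.255.255.0 up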
Scott Meilicke
2010-Mar-18 17:57 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
> I was planning to mirror them - mainly in the hope that I could hot swap a new one in the event that an existing one started to degrade. I suppose I could start with one of each and convert to a mirror later, although the prospect of losing either disk fills me with dread.

You do not need to mirror the L2ARC devices, as the system will just hit disk as necessary. Mirroring sounds like a good idea on the SLOG, but this has been much discussed on the forums.

>> Why not larger capacity disks?

> We will run out of IOPS before we run out of space.

Interesting. I find IOPS is more proportional to the number of VMs than to disk space.

User: I need a VM that will consume up to 80G in two years, so give me an 80G disk.
Me: OK, but recall we can expand disks and filesystems on the fly, without downtime.
User: Well, that is cool, but 80G to start with please.
Me: <sigh>

I also believe the SLOG and L2ARC will make high-RPM disks less necessary. But, from what I have read, higher-RPM disks will greatly help with scrubs and resilvers. Maybe two pools - one with fast mirrored SAS, another with big SATA. Or all SATA, but one pool with mirrors, another with raidz2. Many options. But measure to see what works for you. iometer is great for that, I find.

> Any opinions on the use of battery-backed SAS adapters?

Surely these will help with performance in write-back mode, but I have not done any hard measurements. Anecdotally my PERC5i in a Dell 2950 seemed to greatly help with IOPS on a five-disk raidz. There are pros and cons. Search the forums, but off the top of my head:

1) SLOGs are much larger than controller caches;
2) only synced write activity is cached in a ZIL, whereas a controller cache will cache everything, needed or not, thus running out of space sooner;
3) SLOGs and L2ARC devices are specialized caches for read and write loads, vs. the all-in-one cache of a controller;
4) a controller *may* be faster, since it uses RAM for the cache.

One of the benefits of a SLOG on the SAS/SATA bus is for a cluster. If one node goes down, the other can bring up the pool, check the ZIL for any necessary transactions, and apply them. To do this with battery-backed cache, you would need fancy interconnects between the nodes, cache mirroring, etc. All of those things that SAN array products do.

Sounds like you have a fun project.
Matt
2010-Mar-19 11:28 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
> You do not need to mirror the L2ARC devices, as the system will just hit disk as necessary. Mirroring sounds like a good idea on the SLOG, but this has been much discussed on the forums.

Ah, OK.

> Interesting. I find IOPS is more proportional to the number of VMs than to disk space.
>
> User: I need a VM that will consume up to 80G in two years, so give me an 80G disk.
> Me: OK, but recall we can expand disks and filesystems on the fly, without downtime.
> User: Well, that is cool, but 80G to start with please.
> Me: <sigh>

One of the reasons I am investigating Solaris for this is that sparse volumes and dedupe could really help here. Currently we use direct attached storage on the dom0s and allocate an LVM logical volume to the domU on creation. Just like your example above, we have lots of those "80G to start with please" volumes with tens of GB unused. I also think this data set would dedupe quite well, since there are a great many identical OS files across the domUs. Is that assumption correct?

> I also believe the SLOG and L2ARC will make high-RPM disks less necessary. But, from what I have read, higher-RPM disks will greatly help with scrubs and resilvers. Maybe two pools - one with fast mirrored SAS, another with big SATA. Or all SATA, but one pool with mirrors, another with raidz2. Many options. But measure to see what works for you. iometer is great for that, I find.

Yes. As part of testing this I had planned to look at the performance of the config and try some other options too, such as using a volume of 2 x mirrors. It's a classic case of balancing performance, cost and redundancy/time to resilver.

> One of the benefits of a SLOG on the SAS/SATA bus is for a cluster. If one node goes down, the other can bring up the pool, check the ZIL for any necessary transactions, and apply them. To do this with battery-backed cache, you would need fancy interconnects between the nodes, cache mirroring, etc. All of those things that SAN array products do.

I've not seen an example of that before. Do you mean having two 'head units' connected to an external JBOD enclosure, or a proper HA cluster type configuration where the entire thing, disks and all, is duplicated?

Matt.
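A sketch of what the thin-provisioned, deduped layout could look like per domU, with hypothetical dataset names (dedupe needs a recent development build, and the dedup table costs RAM, so it is worth testing before relying on it):

    # enable dedup for everything under the VM dataset
    zfs create tank/vms
    zfs set dedup=on tank/vms

    # thin-provisioned 80G volume for one domU; blocks are only
    # allocated as the guest actually writes them
    zfs create -s -V 80G tank/vms/web01

    # check the pool-wide dedup ratio later on
    zpool list tank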
Scott Meilicke
2010-Mar-19 15:32 UTC
[zfs-discuss] Is this a sensible spec for an iSCSI storage box?
> One of the reasons I am investigating Solaris for this is that sparse volumes and dedupe could really help here. Currently we use direct attached storage on the dom0s and allocate an LVM logical volume to the domU on creation. Just like your example above, we have lots of those "80G to start with please" volumes with tens of GB unused. I also think this data set would dedupe quite well, since there are a great many identical OS files across the domUs. Is that assumption correct?

This is one reason I like NFS - thin by default, and no wasted space within a zvol. zvols can be thin as well, but OpenSolaris will not know the internal format of the zvol, and you may still have a lot of wasted space after a while as files inside the zvol come and go. In theory dedupe should work well, but I would be careful about a possible speed hit.

> I've not seen an example of that before. Do you mean having two 'head units' connected to an external JBOD enclosure, or a proper HA cluster type configuration where the entire thing, disks and all, is duplicated?

I have not done any type of cluster work myself, but from what I have read on Sun's site, yes, you could connect the same JBOD to two head units, active/passive, in an HA cluster, with no duplicate disks/JBOD. When the active node goes down, the passive node detects this and takes over the pool by doing an import. During the import, any outstanding transactions in the ZIL are replayed, whether they are on a slog or not. I believe this is how Sun does it on their open storage boxes (7000 series).

Note - two JBODs could be used, one for each head unit, making an active/active setup. Each JBOD is active on one node, passive on the other.
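The takeover itself boils down to a forced import on the surviving head (pool name hypothetical); any outstanding ZIL transactions are replayed automatically as part of the import:

    # on the node taking over, once the failed head is confirmed down
    zpool import -f tank
    zpool status tank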