Since combining a ZFS storage backend, via NFS or iSCSI, with ESXi heads, I'm in love. But for one thing: the interconnect between the head & storage.

1G Ether is so cheap, but not as fast as desired. 10G Ether is fast enough, but it's overkill, and why is it so bloody expensive? Why is there nothing in between? Is there something in between? Is there a better option? I mean... SATA is cheap, and it's 3G or 6G, but it's not suitable for this purpose. Still, the point remains: there isn't a fundamental limitation that *requires* 10G to be expensive, or *requires* a leap directly from 1G to 10G. I would very much like to find a solution which is a good fit to attach ZFS storage to VMware.

What are people using, as interconnect, to use ZFS storage on ESX(i)?

Any suggestions?
On Fri, Nov 12, 2010 at 10:03:08AM -0500, Edward Ned Harvey wrote:
> Since combining a ZFS storage backend, via NFS or iSCSI, with ESXi heads, I'm
> in love. But for one thing: the interconnect between the head & storage.
>
> 1G Ether is so cheap, but not as fast as desired. 10G Ether is fast enough,

So bundle four of those. Or use IB, assuming ESX can handle IB.

> but it's overkill, and why is it so bloody expensive? [...]
> What are people using, as interconnect, to use ZFS storage on ESX(i)?

Why do you think 10 GBit Ethernet is expensive? An Intel NIC is 200 EUR,
and a crossover cable is enough. No need for a 10 GBit switch.

--
Eugen* Leitl  http://leitl.org
ICBM: 48.07100, 11.36820  http://www.ativel.com  http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
On 11/12/2010 10:03 AM, Edward Ned Harvey wrote:
> 1G Ether is so cheap, but not as fast as desired. 10G Ether is
> fast enough, but it's overkill, and why is it so bloody expensive?
> Why is there nothing in between? Is there something in between?

I suppose you could try multiple 1G interfaces bonded together - does the
ESXi hypervisor support LACP aggregations?

I'm not sure it will help though, given the algorithms that LACP can use to
distribute the traffic.

-Kyle
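[For reference, the Solaris-side half of the bonding Kyle suggests is a one-liner with dladm. A minimal sketch, assuming OpenSolaris-era dladm syntax and hypothetical e1000g interfaces; note that, as later replies point out, ESX(i) only does static trunking and each TCP connection still tops out at 1 Gbit:]

    # On the Solaris/OpenSolaris storage head (interface names are examples).
    # Use -L off for a static trunk if the switch/ESX side cannot speak LACP.
    dladm create-aggr -L active -l e1000g0 -l e1000g1 aggr1
    ifconfig aggr1 plumb
    ifconfig aggr1 192.168.10.1/24 up

    # Verify the aggregation and its port state
    dladm show-aggr -x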
Channeling Ethernet will not make it any faster. Each individual connection
will be limited to 1gbit. iSCSI with mpxio may work, NFS will not.

On Nov 12, 2010 9:26 AM, "Eugen Leitl" <eugen at leitl.org> wrote:
> So bundle four of those. Or use IB, assuming ESX can handle IB.
> [...]
> Why do you think 10 GBit Ethernet is expensive? An Intel NIC is 200 EUR,
> and a crossover cable is enough. No need for a 10 GBit switch.
On Fri, Nov 12, 2010 at 09:34:48AM -0600, Tim Cook wrote:
> Channeling Ethernet will not make it any faster. Each individual connection
> will be limited to 1gbit. iSCSI with mpxio may work, NFS will not.

Would NFSv4 as a cluster system over multiple boxes work? (This question is
not limited to ESX). I have a problem where people want a solution that
scales in ~30 TByte increments, and I'd rather avoid adding SAS expander
boxes; instead I'd add identical boxes in a cluster, and not just as
individual NFS mounts.

--
Eugen* Leitl
Edward,

I recently installed a 7410 cluster, which had added Fibre Channel HBAs. I
know the site also has Blade 6000s running VMware, but I have no idea if
they were planning to run fibre to those blades (or even had the option to
do so). But perhaps FC would be an option for you?

Mark

On Nov 12, 2010, at 9:03 AM, Edward Ned Harvey wrote:
> What are people using, as interconnect, to use ZFS storage on ESX(i)?
ESX does not support LACP, only static trunking with a host-configured path
selection algorithm.

Look at InfiniBand. Even QDR (32 Gbit) is cheaper per port than most 10GbE
solutions I've seen, and SDR/DDR certainly is. If you want to connect ESX to
storage directly via IB you will find some limitations, but between the head
and backend storage you shouldn't have as many.

-Will

________________________________
From: zfs-discuss-bounces at opensolaris.org on behalf of Kyle McDonald [kmcdonald at egenera.com]
Sent: Friday, November 12, 2010 10:26 AM
Subject: Re: [zfs-discuss] Faster than 1G Ether... ESX to ZFS

> I suppose you could try multiple 1G interfaces bonded together - does the
> ESXi hypervisor support LACP aggregations?
> I'm not sure it will help though, given the algorithms that LACP can use
> to distribute the traffic.
Check InfiniBand - the guys at anandtech/zfsbuild.com used that as well.
On 11/13/10 04:03 AM, Edward Ned Harvey wrote:
> 1G Ether is so cheap, but not as fast as desired. 10G Ether is fast
> enough, but it's overkill, and why is it so bloody expensive?

10G switches are expensive because the fabric and physical-layer chips are
expensive. You have to use dedicated switch fabric chips for 10GE (and
there's only a couple of vendors for those), and even with those, the number
of ports is limited. These cost factors have limited 10GE ports to the more
up-market layer 3 and above switches. Low-cost, mainly-software layer 2
options aren't there (yet!).

-- Ian.
Hi,

we have the same issue, ESX(i) and Solaris on the storage. Link aggregation
does not work with ESX(i) (I tried a lot with that for NFS). When you want
to use more than one 1G connection, you must configure one network or VLAN
and at least one share for each connection. But this is also limited to 1G
for each VM (or you use more than one virtual HD, split across several
shares...).

You can use 10G without switches. As someone else wrote, this is not an
expensive solution, but it must fit your ESX-storage infrastructure.

IB sounds very interesting for that...

regards
Axel

Am 12.11.2010 16:03, schrieb Edward Ned Harvey:
> What are people using, as interconnect, to use ZFS storage on ESX(i)?
>>>>> "tc" == Tim Cook <tim at cook.ms> writes:tc> Channeling Ethernet will not make it any faster. Each tc> individual connection will be limited to 1gbit. iSCSI with tc> mpxio may work, nfs will not. well...probably you will run into this problem, but it''s not necessarily totally unsolved. I am just regurgitating this list again, but: need to include L4 port number in the hash: http://www.cisco.com/en/US/products/ps9336/products_tech_note09186a0080a963a9.shtml#eclb port-channel load-balance mixed -- for L2 etherchannels mls ip cef load-sharing full -- for L3 routing (OSPF ECMP) nexus makes all this more complicated. there are a few ways that seem they''d be able to accomplish ECMP: FTag flow markers in ``FabricPath'''' L2 forwarding LISP MPLS the basic scheme is that the L4 hash is performed only by the edge router and used to calculate a label. The routing protocol will either do per-hop ECMP (FabricPath / IS-IS) or possibly some kind of per-entire-path ECMP for LISP and MPLS. unfortunately I don''t understand these tools well enoguh to lead you further, but if you''re not using infiniband and want to do >10way ECMP this is probably where you need to look. http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6817942 feature added in snv_117, NFS client connections can be spread over multiple TCP connections When rpcmod:clnt_max_conns is set to a value > 1 however Even though the server is free to return data on different connections, [it does not seem to choose to actually do so] -- 6696163 fixed snv_117 nfs:nfs3_max_threads=32 in /etc/system, which changes the default 8 async threads per mount to 32. This is especially helpful for NFS over 10Gb and sun4v this stuff gets your NFS traffic onto multiple TCP circuits, which is the same thing iSCSI multipath would accomplish. From there, you still need to do the cisco/??? stuff above to get TCP circuits spread across physical paths. http://virtualgeek.typepad.com/virtual_geek/2009/06/a-multivendor-post-to-help-our-mutual-nfs-customers-using-vmware.html -- suspect. it advises ``just buy 10gig'''' but many other places say 10G NIC''s don''t perform well in real multi-core machines unless you have at least as many TCP streams as cores, which is honestly kind of obvious. lego-netadmin bias. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20101116/c540d2bc/attachment.bin>
On Wed, Nov 17, 2010 at 7:56 AM, Miles Nordin <carton at ivy.net> wrote:
> Need to include the L4 port number in the hash:
> [...]
> This stuff gets your NFS traffic onto multiple TCP circuits, which is the
> same thing iSCSI multipath would accomplish. From there, you still need to
> do the cisco/??? stuff above to get TCP circuits spread across physical
> paths.

AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.

--Tim
On Nov 16, 2010, at 4:04 PM, Tim Cook <tim at cook.ms> wrote:
> AFAIK, esx/i doesn't support L4 hash, so that's a non-starter.

For iSCSI one just needs to have a second (third or fourth...) iSCSI session
on a different IP to the target and run mpio/mpxio/mpath, whatever your OS
calls multi-pathing.

-Ross
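[On ESX(i) 4.x specifically, the usual recipe for what Ross describes is software-iSCSI port binding plus the round-robin path policy. A rough sketch from the ESXi shell - the vmk, vmhba, and naa names are placeholders and the exact esxcli namespaces vary by 4.0/4.1 release:]

    # Bind two vmkernel ports (on separate subnets/NICs) to the sw iSCSI HBA
    esxcli swiscsi nic add -n vmk1 -d vmhba33
    esxcli swiscsi nic add -n vmk2 -d vmhba33

    # After a rescan, set the path selection policy for the LUN to round robin
    esxcli nmp device setpolicy --device naa.600144f0xxxxxxxx --psp VMW_PSP_RR
    esxcli nmp device list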
On Nov 16, 2010, at 6:37 PM, Ross Walker wrote:
> For iSCSI one just needs to have a second (third or fourth...) iSCSI
> session on a different IP to the target and run mpio/mpxio/mpath, whatever
> your OS calls multi-pathing.

MC/S (Multiple Connections per Session) support was added to the iSCSI
Target in COMSTAR, now available in Oracle Solaris 11 Express.

- Jim
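[For anyone who hasn't set up a COMSTAR iSCSI target before, the basic flow is roughly the following; the pool, zvol size, and GUID are placeholders and this is a generic sketch, not Jim's exact configuration:]

    # Create a zvol to back the LUN and enable the COMSTAR services
    zfs create -V 200g tank/esx-lun0
    svcadm enable -r svc:/system/stmf:default
    svcadm enable -r svc:/network/iscsi/target:default

    # Create an iSCSI target and register the zvol as a logical unit
    itadm create-target
    stmfadm create-lu /dev/zvol/rdsk/tank/esx-lun0
    stmfadm add-view 600144F0XXXXXXXXXXXXXXXXXXXXXXXX   # GUID printed by create-lu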
On Nov 16, 2010, at 7:49 PM, Jim Dunham <james.dunham at oracle.com> wrote:
> MC/S (Multiple Connections per Session) support was added to the iSCSI
> Target in COMSTAR, now available in Oracle Solaris 11 Express.

Good to know.

The only initiator I know of that supports that is Windows, but with MC/S
one at least doesn't need MPIO, as the initiator handles the multiplexing
over the multiple connections itself.

Doing multiple sessions and MPIO is supported almost universally though.

-Ross
I've done mpxio over multiple IP links in Linux using multipathd. Works just
fine. It's not part of the initiator but accomplishes the same thing. It was
a Linux IET target. Need to try it here with a COMSTAR target.
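[For reference, the Linux-side version of this looks roughly like the following with open-iscsi and dm-multipath; the portal IPs are placeholders and multipath.conf details vary by distribution:]

    # Discover and log in to the same target via two different portal IPs
    iscsiadm -m discovery -t sendtargets -p 192.168.10.1
    iscsiadm -m discovery -t sendtargets -p 192.168.11.1
    iscsiadm -m node --login

    # /etc/multipath.conf fragment: treat all paths as one active/active group
    #   defaults {
    #       path_grouping_policy multibus
    #   }

    # Check that both paths show up under a single multipath device
    multipath -ll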
Hi all,

Let me tell you all that MC/S *does* make a difference... I had a Windows
fileserver using an iSCSI connection to a host running snv_134, with an
average speed of 20-35 MB/s. After the upgrade to snv_151a (Solaris 11
Express), this same fileserver got a performance boost and now has an
average speed of 55-60 MB/s.

Not double the performance, but WAY better, especially if we consider that
this performance boost was purely software based :)

Nice... nice job COMSTAR guys!

Bruno

On Tue, 16 Nov 2010 19:49:59 -0500, Jim Dunham wrote:
> MC/S (Multiple Connections per Session) support was added to the iSCSI
> Target in COMSTAR, now available in Oracle Solaris 11 Express.
On Wed, Nov 17, 2010 at 10:14:10AM +0000, Bruno Sousa wrote:
> Let me tell you all that MC/S *does* make a difference... I had a Windows
> fileserver using an iSCSI connection to a host running snv_134, with an
> average speed of 20-35 MB/s. After the upgrade to snv_151a (Solaris 11
> Express), this same fileserver got a performance boost and now has an
> average speed of 55-60 MB/s.
>
> Not double the performance, but WAY better, especially if we consider that
> this performance boost was purely software based :)

Did you verify you're using more connections after the update?
Or was it just *other* COMSTAR (and/or kernel) updates making the
difference..

-- Pasi
On Wed, Nov 17, 2010 at 3:00 PM, Pasi Kärkkäinen <pasik at iki.fi> wrote:
> Did you verify you're using more connections after the update?
> Or was it just *other* COMSTAR (and/or kernel) updates making the
> difference..

This is true. If someone wasn't utilizing 1Gbps before MC/S, then going to
MC/S won't give you more, as you weren't using what you had (in fact the
added latency of MC/S may give you less!).

I am going to say that the speed improvement from 134 -> 151a was due to OS
and COMSTAR improvements and not the MC/S.

-Ross
I confirm that, from the fileserver and storage point of view, more network
connections were used.

Bruno

On Wed, 17 Nov 2010 22:00:21 +0200, Pasi Kärkkäinen <pasik at iki.fi> wrote:
> Did you verify you're using more connections after the update?
> Or was it just *other* COMSTAR (and/or kernel) updates making the
> difference..
On Wed, 17 Nov 2010 16:31:32 -0500, Ross Walker <rswwalker at gmail.com> wrote:
> This is true. If someone wasn't utilizing 1Gbps before MC/S, then going to
> MC/S won't give you more, as you weren't using what you had (in fact the
> added latency of MC/S may give you less!).
>
> I am going to say that the speed improvement from 134 -> 151a was due to OS
> and COMSTAR improvements and not the MC/S.

Well, with snv_134 the storage and fileserver used a single gigabit
connection for their iSCSI traffic. After the upgrade to snv_151a, the
fileserver and storage are capable of using 2 gigabit connections, in a
round-robin fashion I think.

So whether this is only MC/S or not, I don't have the technical expertise to
confirm, but it does make a big difference in my environment.

Bruno
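[A quick way to sanity-check how many iSCSI connections are actually in use, e.g. on the COMSTAR side, is simply to count established TCP sessions on the iSCSI port; a trivial sketch:]

    # On the storage host: count established iSCSI (port 3260) connections
    netstat -an | grep 3260 | grep -c ESTABLISHED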
Up to last year we had 4 ESXi 4 servers, each with its own NFS storage
server (NexentaStor / Core + napp-it), directly connected via 10GbE CX4. The
second CX4 storage port was connected to our SAN (HP 2910 10GbE switch) for
backups. The second port of each ESXi server was connected (tagged VLAN) to
our LAN. The 4 server pairs form two redundant groups, each holding backups
of the other.

While performance was OK, we had 8 physical servers with a lot of cabling
and hardware that could fail.

With our two new systems (since February), we are integrating the storage
server within our VMware machine by virtualizing the Nexenta ZFS server.
(The Nexenta VM is stored on a local ESXi RAID-1 datastore.) The SAS
controller and all ZFS disks/pools are passed through to Nexenta to have
full ZFS disk control like on real hardware. (We use VT-d capable mainboards
with the Intel 5520 chipset and >= 36 GB RAM, 12 GB of it for Nexenta, plus
SSD pools.)

All networking is now managed via VLANs and the ESXi virtual switch. The
server is connected via 10GbE to our 10GbE HP switch. Traffic between ESXi
and the Nexenta NFS server is managed directly by ESXi.

We are very satisfied with this solution and will move the rest of our
systems in the next weeks. We hope to have only one OS someday for the
storage and the virtualization part.

gea
napp-it.org
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>
> SAS Controller and all ZFS Disks/Pools are passed-through to Nexenta to
> have full ZFS-Disk control like on real hardware.

This is precisely the thing I'm interested in. How do you do that? On my
ESXi (test) server, I have a Solaris ZFS VM. When I configure it... and add
a disk... my options are (a) create a new virtual disk, (b) use an existing
virtual disk, or (c) (grayed out) raw device mapping. There is a comment,
"Give your virtual machine direct access to a SAN." So I guess it is only
available if you have some iSCSI target available...

But you seem to be saying... don't add the disks individually to the ZFS VM.
You seem to be saying... ensure the bulk storage is on a separate
SAS/SCSI/SATA controller from the ESXi OS... and then add the SAS/SCSI/SATA
PCI device to the guest, which will implicitly get all of the disks. Right?

Or maybe... the disks have to be SCSI (SAS)? And then you can add the SCSI
device directly pass-thru?

What's the trick that I'm missing?
> -----Original Message-----
> From: Edward Ned Harvey
> Sent: 19 November 2010 09:54
>
> > SAS Controller and all ZFS Disks/Pools are passed-through to Nexenta to
> > have full ZFS-Disk control like on real hardware.

This sounds interesting, as I have been thinking about something similar but
never implemented it because all the eggs would be in the same basket. If
you don't mind me asking for more information:

Since you use mapped raw LUNs, don't you lose HA/fault tolerance on the
storage servers, as they cannot be moved to another host?

Do you mirror LUNs from storage servers on different physical servers for
the guests to achieve fault tolerance? Or would you consider this kind of
setup "good enough" for production without making it too complex, as in the
above question?

> This is precisely the thing I'm interested in. How do you do that? [...]
> Or maybe... the disks have to be SCSI (SAS)? And then you can add the SCSI
> device directly pass-thru?

How to accomplish ESXi 4 raw device mapping with SATA at least:
http://www.vm-help.com/forum/viewtopic.php?f=14&t=1025

It does work as long as your hardware supports VT, which would be any modern
computer.

- Ville
> -----Original Message-----
> From: Edward Ned Harvey
> Sent: Thursday, November 18, 2010 9:54 PM
>
> This is precisely the thing I'm interested in. How do you do that? [...]
> What's the trick that I'm missing?

There is no trick. If you expose the HBA directly to the VM then you get all
the disks.

In order to do this, you need to configure passthrough for the device at the
host level (host -> configuration -> hardware -> advanced settings). This
requires that the VT stuff be enabled in the BIOS on your host - if you
don't have this ability, then you're out of luck.

Once the device is configured for passthrough on the host, you also have to
pass it through to the VM. This is done by 'adding' the PCI device to the VM
configuration. At that point, you just have it in the guest as with the
other virtual devices.

You might like to read this article, which describes something similar:
http://blog.laspina.ca/ubiquitous/encapsulating-vt-d-accelerated-zfs-storage-within-esxi

-Will
I haven't seen too much talk about the actual file read and write speeds. I
recently converted from using OpenFiler, which seems defunct based on their
lack of releases, to NexentaStor. The NexentaStor server is connected to my
ESXi hosts using 1 gigabit switches and network cards. The speed is very
good, as can be seen by IOZONE tests (throughput in KB/s):

          KB  reclen   write  rewrite     read   reread
      512000      32   71789    76155    94382   101022
      512000    1024   75104    69860    64282    58181
     1024000    1024   66226    60451    65974    61884

These speeds were achieved by:

1) Turning OFF the ZIL (the intent log used for synchronous writes)
2) Using SSD drives for L2ARC (read cache)
3) Using NFSv3, as NFSv4 isn't supported by ESXi version 4.0.

The speeds seem very good and my VMs run smoothly. I do use a UPS to help
mitigate data corruption in case of power loss, since the ZIL is OFF.

Here is an exceptionally detailed blog on how to achieve maximum speeds
using ZFS: http://www.anandtech.com/print/3963

Gil Vidals / VMRacks.com

On Fri, Nov 12, 2010 at 7:03 AM, Edward Ned Harvey wrote:
> What are people using, as interconnect, to use ZFS storage on ESX(i)?
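[For anyone wanting to reproduce numbers in this format, an iozone invocation along these lines produces the write/rewrite/read/reread columns shown above; the test-file path is a placeholder for a file on the NFS datastore:]

    # write/rewrite (-i 0) and read/reread (-i 1), 1 MB records, 1 GB file
    iozone -i 0 -i 1 -r 1024k -s 1g -f /vmfs/volumes/nfs-datastore/iozone.tmp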
On 19 nov. 2010, at 03:53, Edward Ned Harvey wrote:
> This is precisely the thing I'm interested in. How do you do that? [...]
> Or maybe... the disks have to be SCSI (SAS)? And then you can add the SCSI
> device directly pass-thru?

As mentioned by Will, you'll need to use VMDirectPath, which allows you to
map a hardware device (the disk controller) directly to the VM without
passing through the VMware-managed storage stack. Note that you are
presenting the hardware directly, so it needs to be a compatible controller.

You'll need two controllers in the server, since ESXi needs at least one
disk that it controls, formatted as VMFS, to hold some of its files as well
as the .vmx configuration files for the VM that will host the storage (and
the swap file, so it's got to be at least as large as the memory you plan to
assign to the VM).

Caveats - while you can install ESXi onto a USB drive, you can't manually
format a USB drive as VMFS, so for best performance you'll want at least one
SATA or SAS controller that you can leave controlled by ESXi, and a second
controller where the bulk of the storage is attached for the ZFS VM.

As far as the eggs-in-one-basket issue goes, you can either use a clustering
solution like Nexenta HA between two servers - then you have a highly
available storage solution based on two servers that can also run your VMs -
or, for a more manual failover, just use zfs send|recv to replicate the
data.

You can also accomplish something similar if you have only one controller,
by manually creating local Raw Device Maps of the local disks and presenting
them individually to the ZFS VM. But then you don't have direct access to
the controller, so I don't think stuff like blinking a drive will work in
this configuration, since you're not talking directly to the hardware.
There's no UI for creating RDMs for local drives, but there's a good
procedure over at <http://www.vm-help.com/esx40i/SATA_RDMs.php> which
explains the technique.

From a performance standpoint it works really well - I have NFS-hosted VMs
in this configuration getting 396 MB/s throughput on simple dd tests, backed
by 10 mirrored ZFS disks, all protected with hourly send|recv to a second
box.

Cheers,
Erik
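[The local-RDM technique Erik links to boils down to creating RDM pointer files by hand from the ESXi console, roughly as below; the device and datastore paths are placeholders, and -z creates a physical-compatibility mapping while -r creates a virtual one:]

    # Identify the local disk's device name
    ls -l /vmfs/devices/disks/

    # Create a physical-compatibility RDM pointer on an existing VMFS
    # datastore, then attach the resulting .vmdk to the ZFS VM as usual
    vmkfstools -z /vmfs/devices/disks/t10.ATA_____<disk_id> \
        /vmfs/volumes/datastore1/rdms/zfs-disk1-rdmp.vmdk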
hmmm
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

  Disabling the ZIL (Don't)
  Caution: Disabling the ZIL on an NFS server can lead to client side
  corruption. The ZFS pool integrity itself is not compromised by this
  tuning.

so especially with nfs i won't disable it.

it's better to add ssd read/write caches or use ssd-only pools. we use
spindles for backups or test servers. our main vms are all on ssd pools
(striped raid1 built of 120 GB sandforce-based mlc drives, about 190 euro
each).

we do not use slc; i suppose mlc are good enough for the next three years
(the warranty time). we will change them after that.

about integrated storage in vmware: i have some info on my homepage about
our solution:
http://www.napp-it.org/napp-it/all-in-one/index_en.html

gea
> From: Saxon, Will [mailto:Will.Saxon at sage.com]
>
> In order to do this, you need to configure passthrough for the device at
> the host level (host -> configuration -> hardware -> advanced settings).

Awesome. :-)

The only problem is that once a device is configured to pass-thru to the
guest VM, that device isn't available to the host anymore. So you have to
have your boot disks on a separate controller from the primary storage disks
that are passed through to the guest ZFS server.

For a typical... let's say Dell server... that could be a problem. The boot
disks would need to hold ESXi plus a ZFS server, and then you can pass-thru
the primary hot-swappable storage HBA to the ZFS guest. Then the ZFS guest
can export its storage back to the ESXi host via NFS or iSCSI... so all the
remaining VMs can be backed by ZFS. Of course you have to configure ESXi to
boot the ZFS guest before any of the other guests.

The problem is just the boot device. One option is to boot from a USB
dongle, but that's unattractive for a lot of reasons. Another option would
be a PCIe storage device, which isn't too bad an idea. Anyone using PXE to
boot ESXi?

Got any other suggestions? In a typical Dell server, there is no place to
put a disk which isn't attached via the primary hot-swappable storage HBA. I
suppose you could use a 1U rackmount server with only 2 internal disks, and
add a 2nd HBA with an external storage tray to use as pass-thru to the ZFS
guest.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of VO
>
> How to accomplish ESXi 4 raw device mapping with SATA at least:
> http://www.vm-help.com/forum/viewtopic.php?f=14&t=1025

It says: you can pass-thru individual disks if you have SCSI, but you can't
pass-thru individual SATA disks.

I don't have any way to verify this, but it seems unlikely... since SAS and
SATA are interchangeable. (Sort of.) I know I have a Dell server with a few
SAS disks plugged in and a few SATA disks plugged in. Maybe the backplane is
doing some kind of magic? But they're all presented to the OS by the HBA,
and the OS has no way of knowing whether the disks are actually SAS or
SATA... as far as I know.

It also says: you can pass-thru a PCI SATA controller, but the entire
controller must be given to the guest. This I have confirmed. I have an ESXi
server with an eSATA controller and an external disk attached. One reboot
was required in order to configure the pass-thru.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of VO
>
> This sounds interesting, as I have been thinking about something similar
> but never implemented it because all the eggs would be in the same basket.
> If you don't mind me asking for more information:
> Since you use mapped raw LUNs, don't you lose HA/fault tolerance on the
> storage servers, as they cannot be moved to another host?

There is at least one situation I can imagine where you wouldn't care.

At present, I have a bunch of Linux servers with locally attached disk. I
often wish I could run ZFS on Linux. You could install ESXi, Linux, and a
ZFS server all into the same machine. You could export the ZFS filesystem to
the Linux system via NFS. Since the network interfaces are all virtual, you
should be able to achieve near-disk speed from the Linux client, and you
should have no problem doing snapshots & zfs send & all the other features
of ZFS.

I'd love to do a proof of concept... or hear that somebody has. ;-)
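[The plumbing for that proof of concept is short; a sketch, with the pool/filesystem name and the internal vSwitch addresses as placeholders:]

    # On the ZFS guest: create a filesystem and share it over NFS
    zfs create tank/linux-data
    zfs set sharenfs=on tank/linux-data   # can be tightened to rw=@<vmnet subnet>

    # On the Linux guest: mount it over the internal vSwitch
    mount -t nfs 192.168.100.10:/tank/linux-data /mnt/data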
> From: Gil Vidals [mailto:gvidals at gmail.com]
>
> connected to my ESXi hosts using 1 gigabit switches and network cards. The
> speed is very good, as can be seen by IOZONE tests:
>
>       KB  reclen   write  rewrite     read   reread
>   512000      32   71789    76155    94382   101022
>   512000    1024   75104    69860    64282    58181
>  1024000    1024   66226    60451    65974    61884

I have the following results using local disk. ZIL enabled, no SSD, HBA
writeback enabled.

          KB  reclen   write  rewrite     read   reread
      524288      64  189783   200303  2827021  2847086
      524288    1024  201472   201837  3094348  3100793
     1048576    1024  201883   201154  3076932  3087206

So... I think your results were good relative to a 1Gb interface, but I
think you're severely limited by the 1Gb as compared to local disk.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Günther
>
> Disabling the ZIL (Don't)

This is relative. There are indeed situations where it's acceptable to
disable the ZIL. To make your choice, you need to understand a few things...

#1 In the event of an ungraceful reboot, with your ZIL disabled, after
reboot your filesystem will be in a valid state which is not the latest
point in time before the crash. Your filesystem will be valid, but you will
lose up to 30 seconds of the latest writes leading up to the crash.

#2 Even if you have the ZIL enabled, all of the above statements still apply
to async writes. The ZIL only provides nonvolatile storage for sync writes.

Given these facts, it quickly becomes much less scary to disable the ZIL,
depending on what you use your server for.
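[For completeness, this is what "disabling the ZIL" looks like in practice; a sketch with a hypothetical dataset name - the per-dataset sync property exists on newer builds such as snv_151a, while older builds only had the global tunable:]

    # Newer builds (per-dataset sync property, e.g. Solaris 11 Express):
    zfs set sync=disabled tank/nfs-vmstore

    # Older builds: the global tunable, added to /etc/system (affects every
    # pool and requires a reboot):
    #   set zfs:zil_disable = 1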
On 19 nov. 2010, at 15:04, Edward Ned Harvey wrote:
> Given these facts, it quickly becomes much less scary to disable the ZIL,
> depending on what you use your server for.

Not to mention that in this particular scenario (local storage, local VM,
loopback to ESXi), where the NFS server is only publishing to the local
host, if the local host crashes there are no other NFS clients involved that
would have local caches out of sync with the storage.

Cheers,
Erik
i have the same problem with my 2U supermicro server (24x 2.5", connected
via 6x mini-SAS 8087) and no additional mounting possibilities for 2.5" or
3.5" drives.

on those machines i use one sas port (4 drives) of an old adaptec 3805 (i
used them in my pre-zfs times) to build a raid-1 + hot spare for esxi to
boot from. the other 20 slots are connected to 3 lsi sas controllers for
pass-through - so i have 4 sas controllers in these machines.

maybe the new ssd drives mounted on a pci-e card (e.g. ocz revo drive) could
be an alternative. has anyone used them with esxi already?

gea
On Fri, 19 Nov 2010 07:16:20 PST, Günther wrote:
> i have the same problem with my 2U supermicro server (24x 2.5", connected
> via 6x mini-SAS 8087) and no additional mounting possibilities for 2.5" or
> 3.5" drives.

Hey - just as a side note... depending on what motherboard you use, you may
be able to use this:

MCP-220-82603-0N - Dual 2.5" fixed HDD tray kit for SC826 (for E-ATX X8 DP MB)

I haven't used one yet myself, but am currently planning an SMC build and
contacted their support, as I really did not want to have my system drives
hanging off the controller. As far as I can tell from a picture they sent,
it mounts on top of the motherboard itself, somewhere where there is
normally open space, and it can hold two 2.5" drives.

So maybe get in touch with their support and see if you can use something
similar.

Cheers,
Mark
> -----Original Message-----
> From: Edward Ned Harvey
> Sent: Friday, November 19, 2010 8:03 AM
>
> Got any other suggestions? In a typical Dell server, there is no place to
> put a disk which isn't attached via the primary hot-swappable storage HBA.
> I suppose you could use a 1U rackmount server with only 2 internal disks,
> and add a 2nd HBA with an external storage tray to use as pass-thru to the
> ZFS guest.

Well, with 4.1 ESXi does support boot from SAN. I guess that still presents
a chicken-and-egg problem in this scenario, but maybe you have another SAN
somewhere you can boot from.

Also, most of the big-name vendors have a USB or SD option for booting ESXi.
I believe this is the 'ESXi Embedded' flavor vs. the typical 'ESXi
Installable' that we're used to. I don't think it's a bad idea at all. I've
got a not-quite-production system I'm booting off USB right now, and while
it takes a really long time to boot, it does work. I think I like the SD
card option better though.

What I am wondering is whether this is really worth it. Are you planning to
share the storage out to other VM hosts, or are all the VMs running on the
host using the 'local' storage? I know we like ZFS vs. traditional RAID and
volume management, and I get that being able to boot any ZFS-capable OS is
good for disaster recovery, but what I don't get is how this ends up working
better than a larger dedicated ZFS system and a storage network. Is it
cheaper over several hosts? Are you getting better performance through e.g.
the vmxnet3 adapter and NFS than you would just using the disks directly?

-Will
> Also, most of the big name vendors have a USB or SD
> option for booting ESXi. I believe this is the 'ESXi
> Embedded' flavor vs. the typical 'ESXi Installable'
> that we're used to. I don't think it's a bad idea at
> all. I've got a not-quite-production system I'm
> booting off USB right now, and while it takes a
> really long time to boot it does work. I think I like
> the SD card option better though.

i need 4gb of extra space for the Nexenta zfs storage server, and it should not be as slow as a usb stick, otherwise management via the web-gui is painfully slow.

> What I am wondering is whether this is really worth
> it. Are you planning to share the storage out to
> other VM hosts, or are all the VMs running on the
> host using the 'local' storage? I know we like ZFS
> vs. traditional RAID and volume management, and I get
> that being able to boot any ZFS-capable OS is good
> for disaster recovery, but what I don't get is how
> this ends up working better than a larger dedicated
> ZFS system and a storage network. Is it cheaper over
> several hosts? Are you getting better performance
> through e.g. the vmxnet3 adapter and NFS than you
> would just using the disks directly?

mainly the storage is used via NFS for local vm's, but we also share the nfs datastores via cifs to have simple move/clone/copy or backup. we also replicate datastores at least once per day to a second machine via incremental zfs send.

we have, or plan, the same setup on all of our esxi machines: each esxi has its own local san-like storage server. (i do not like to have one big san-storage as a single point of failure, plus the high speed san cabling. so we have 4 esxi servers, each with its own virtualized zfs-storage server, plus three commonly used backup systems - connected via a 10GbE VLAN.)

we formerly had separate storage and esxi servers, but with pass-through we could integrate the two and reduce the hardware that could fail, and the cabling, by about 50%.

gea
--
This message posted from opensolaris.org
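The once-a-day replication gea describes is essentially a snapshot plus an incremental send. A rough sketch of one round, with invented dataset, snapshot and host names:

    # yesterday's snapshot already exists on both sides; take today's
    zfs snapshot tank/nfs@2010-11-20

    # push only the delta since yesterday to the backup box
    zfs send -i tank/nfs@2010-11-19 tank/nfs@2010-11-20 | \
        ssh backuphost zfs receive -F tank/nfs

    # once that completes, the oldest snapshot can be rotated out
    zfs destroy tank/nfs@2010-11-19

Run from cron, this keeps the second machine at most a day behind the primary.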
> From: Saxon, Will [mailto:Will.Saxon at sage.com]
>
> What I am wondering is whether this is really worth it. Are you planning
> to share the storage out to other VM hosts, or are all the VMs running on
> the host using the 'local' storage? I know we like ZFS vs. traditional
> RAID and volume management, and I get that being able to boot any
> ZFS-capable OS is good for disaster recovery, but what I don't get is how
> this ends up working better than a larger dedicated ZFS system and a
> storage network. Is it cheaper over several hosts? Are you getting better
> performance through e.g. the vmxnet3 adapter and NFS than you would just
> using the disks directly?

I also don't know enough details of how this works out. In particular:

If your goal is high performance storage, snapshots, backups, and data integrity for Linux or some other OS (AKA, ZFS on Linux or Windows), then you should be able to win with this method, running the Linux guest and the ZFS server both as VMs on a single physical server, utilizing a vmnet switch and either NFS or iSCSI or CIFS. But until some benchmarks are done, to show that vmware isn't adding undue overhead, I must consider it still "unproven." As compared to one big ZFS server being used as the backend SAN for a bunch of vmware hosts...

If your goal is high performance for distributed computing, then you always need to use local disk attached independently to each of the compute nodes. There's simply no way you can scale any central server large enough to handle a bunch of hosts without any performance loss. Assuming the ZFS server is able to max out its local disks... if there exists a bus which is fast enough for a remote server to max out those disks... then the remote server should have the storage attached locally, because the physical disks are the performance bottleneck.

If your goal is just "use a ZFS datastore for all your vmware hosts," AKA you're mostly interested in checksumming and snapshots and not terribly concerned with performance as long as it's "fast enough," then most likely you'll be fine with using 1Gb ether because it's so cheap. Maybe you upgrade later to a faster or different type of bus (FC or IB).
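To turn "unproven" into numbers, a quick check of both the virtual wire and the NFS path could look something like the following (iperf and dd are just stand-ins; the address and guest layout are assumptions):

    # raw virtual-switch throughput between the ZFS guest and another guest
    iperf -s                              # on the ZFS guest
    iperf -c 192.168.10.2 -t 60 -P 4      # on the other guest, 4 parallel streams

    # streaming write from a Linux guest whose virtual disk lives on the NFS datastore
    dd if=/dev/zero of=/tmp/bigfile bs=1M count=8192 conv=fdatasync

If the first number is far above the second, the overhead is in the NFS/ZFS path rather than in the virtual switch itself.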
Suppose you wanted to boot from an iscsi target, just to get vmware & a ZFS server up. Then you could pass-thru the entire local storage bus(es) to the ZFS server, and you could create other VMs whose storage is backed by the ZFS server on local disk.

One way you could do this is to buy an FC or IB adapter, and I presume these have some BIOS support to configure bootable targets. But is there any such thing for 1Gb ether? For this purpose, I think it would be fine for the vmware & ZFS server OSes to be physically remote via iscsi over 1Gb.

OTOH ... maybe PXE is the way to go. I've never done anything with PXE, but I certainly know it's pretty darn universal. I guess I'll have to look into it.
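On the "any such thing for 1Gb ether" question: gPXE/iPXE can chain-load from an ordinary PXE-capable 1G NIC and attach an iSCSI LUN as the boot disk. A sketch of the script side, with an invented target address and IQN (whether ESXi itself will cooperate with this is a separate question):

    #!ipxe
    dhcp
    # attach the iSCSI LUN and try to boot from it
    sanboot iscsi:192.168.10.5::::iqn.2010-11.org.example:esxi-boot

The four colons are just the defaulted protocol/port/LUN fields of the iSCSI root path.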
For anyone who cares:

I created an ESXi machine. Installed two guest (centos) machines and vmware-tools. Connected them to each other via only a virtual switch. Used rsh to transfer large quantities of data between the two guests, unencrypted, uncompressed. I have found that ESXi virtual switch performance peaks around 2.5Gbit.

Also, if you have an NFS datastore which is not available at the time of ESX bootup, then the NFS datastore doesn't come online, and there seems to be no way of telling ESXi to make it come online later. So you can't auto-boot any guest which is itself stored inside another guest.

So basically, if you want a layer of ZFS in between your ESX server and your physical storage, then you have to have at least two separate servers. And if you want anything resembling actual disk speed, you need infiniband, fibre channel, or 10G ethernet. (Or some really slow disks.) ;-)
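For reference, that kind of test can be reproduced with nothing fancier than dd piped over rsh between the two guests (guest names here are placeholders):

    # on guest1: push 10 GB of zeros across the vswitch to guest2, discard on arrival
    time dd if=/dev/zero bs=1M count=10240 | rsh guest2 'cat > /dev/null'

    # bytes moved divided by elapsed time gives the effective throughput;
    # the figure reported above peaked around 2.5 Gbit/s

Using /dev/zero and discarding on the far side keeps the disks out of the measurement, so the number reflects the virtual switch alone.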
On Dec 8, 2010, at 11:41 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:

> For anyone who cares:
>
> I created an ESXi machine. Installed two guest (centos) machines and
> vmware-tools. Connected them to each other via only a virtual switch. Used
> rsh to transfer large quantities of data between the two guests,
> unencrypted, uncompressed. Have found that ESXi virtual switch performance
> peaks around 2.5Gbit.
>
> Also, if you have an NFS datastore which is not available at the time of ESX
> bootup, then the NFS datastore doesn't come online, and there seems to be no
> way of telling ESXi to make it come online later. So you can't auto-boot
> any guest which is itself stored inside another guest.
>
> So basically, if you want a layer of ZFS in between your ESX server and your
> physical storage, then you have to have at least two separate servers. And
> if you want anything resembling actual disk speed, you need infiniband,
> fibre channel, or 10G ethernet. (Or some really slow disks.) ;-)

Besides the chicken-and-egg scenario that Ed mentions, there is also the CPU usage that comes with running the storage virtualized. You might find that as you put more machines on the storage, the performance decreases a lot faster than it otherwise would if it were standalone, since it competes with the very machines it is supposed to be serving.

-Ross
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> Also, if you have an NFS datastore which is not available at the time of
> ESX bootup, then the NFS datastore doesn't come online, and there seems to
> be no way of telling ESXi to make it come online later. So you can't
> auto-boot any guest which is itself stored inside another guest.

Someone just told me about
    esxcfg-nas -r
So yes, it is possible to make ESX remount the NFS datastore in order to boot the other VMs. The end result should be something which is faster than 1G ether, but not as fast as IB, FC, or 10G.
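For reference, the sequence from the ESXi console looks something like this (the exact wording of the -l output may vary by build):

    esxcfg-nas -l    # the NFS datastore shows as unmounted after boot
    esxcfg-nas -r    # re-attempt the mount now that the ZFS guest is up
    esxcfg-nas -l    # should now report it mounted

After that, the guests stored on the datastore can be powered on.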
On 9 déc. 2010, at 13:41, Edward Ned Harvey wrote:

>> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
>> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>>
>> Also, if you have an NFS datastore which is not available at the time of
>> ESX bootup, then the NFS datastore doesn't come online, and there seems
>> to be no way of telling ESXi to make it come online later. So you can't
>> auto-boot any guest which is itself stored inside another guest.
>
> Someone just told me about
>     esxcfg-nas -r
> So yes, it is possible to make ESX remount the NFS datastore in order to
> boot the other VMs. The end result should be something which is faster
> than 1G ether, but not as fast as IB, FC, or 10G.

I've got a similar setup running here - with the Nexenta VM set to auto-start, you have to wait a bit for the VM to start up before the NFS datastores become available, but the actual mount operation from the ESXi side is automatic. I suppose that if you played with the startup delays between virtual machines, you could get everything to start unattended once you know how long it takes for the NFS stores to become available.

Combined with send/recv to another box it's an affordable disaster recovery solution. And to squeeze every bit of performance out of the configuration, you can use VMDirectPath to present the HBA to your storage VM (just remember to add another controller to boot ESXi from, or to store a VMFS volume for vmx and swap files).

Erik
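If you would rather script the ordering than rely on fixed start-up delays, a rough sketch from the ESXi shell might be (the VM ids and datastore label are placeholders, and the grep pattern is an assumption about the esxcfg-nas -l output):

    # power on the storage VM first (ids come from 'vim-cmd vmsvc/getallvms')
    vim-cmd vmsvc/power.on 16

    # wait until the NFS datastore it exports actually mounts
    until esxcfg-nas -l | grep -q ' mounted'; do
        esxcfg-nas -r
        sleep 15
    done

    # now power on the guests living on that datastore
    vim-cmd vmsvc/power.on 17
    vim-cmd vmsvc/power.on 18

The leading space in the grep pattern keeps it from matching "unmounted".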