Hi,

I'm looking to build a virtualized web hosting server environment accessing
files on a hybrid storage SAN. I was looking at using the Sun X-Fire x4540
with the following configuration:

  - 6 RAID-Z vdevs with one hot spare each (all 500GB 7200RPM SATA drives)
  - 2 Intel X-25 32GB SSDs as a mirrored ZIL
  - 4 Intel X-25 64GB SSDs as the L2ARC
  - Deduplication
  - LZJB compression

The clients will be Apache web hosts serving hundreds of domains.

I have the following questions:

  - Should I use NFS with all five VMs accessing the exports, or one LUN for
    each VM, accessed over iSCSI?
  - Are the FSYNC speed issues with NFS resolved?
  - Should I go with Fibre Channel, or will the 4 built-in 1GbE NICs give me
    enough speed?
  - How many SSDs should I use for the ZIL and L2ARC?
  - What pool structure should I use?

I know these questions are slightly vague, but any input would be greatly
appreciated.

Thanks!
On 6/6/2010 6:22 PM, Ken wrote:
> I'm looking to build a virtualized web hosting server environment
> accessing files on a hybrid storage SAN. I was looking at using the
> Sun X-Fire x4540 with the following configuration:
> [...]
> I know these questions are slightly vague, but any input would be
> greatly appreciated.

Which Virtual Machine technology are you going to use?

  VirtualBox
  VMware
  Xen
  Solaris Zones
  Something else...

It will make a difference as to my recommendation (or, do you want me to
recommend a VM type, too?)

<grin>

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
I'm looking at VMware ESXi 4, but I'll take any advice offered.

On Sun, Jun 6, 2010 at 19:40, Erik Trimble <erik.trimble at oracle.com> wrote:
> Which Virtual Machine technology are you going to use?
>
>   VirtualBox
>   VMware
>   Xen
>   Solaris Zones
>   Something else...
>
> It will make a difference as to my recommendation (or, do you want me to
> recommend a VM type, too?)
>
> <grin>
Comments in-line.

On 6/6/2010 9:16 PM, Ken wrote:
> I'm looking at VMware ESXi 4, but I'll take any advice offered.
>
> On Sun, Jun 6, 2010 at 19:40, Erik Trimble <erik.trimble at oracle.com> wrote:
>> On 6/6/2010 6:22 PM, Ken wrote:
>>> I'm looking to build a virtualized web hosting server environment
>>> accessing files on a hybrid storage SAN. I was looking at using the
>>> Sun X-Fire x4540 with the following configuration:
>>> [...]
>>> The clients will be Apache web hosts serving hundreds of domains.
>>>
>>> I have the following questions:
>>>
>>>   * Should I use NFS with all five VMs accessing the exports, or one
>>>     LUN for each VM, accessed over iSCSI?

Generally speaking, it depends on your comfort level with running iSCSI
volumes to put the VMs in, or serving everything out via NFS (hosting the VM
disk file in an NFS filesystem).

If you go the iSCSI route, I would definitely go the "one iSCSI volume per VM"
route - note that you can create multiple zvols per zpool on the X4540, so
it's not limiting in any way to volume-ize a VM. It's a lot simpler, easier,
and allows for nicer management (snapshots/cloning/etc. on the X4540 side) if
you go with a VM per iSCSI volume.

With NFS-hosted VM disks, do the same thing: create a single filesystem on the
X4540 for each VM.

Performance-wise, I'd have to test, but I /think/ the iSCSI route will be
faster, even with the ZIL SSDs.

In all cases, regardless of how you host the VM images themselves, I'd serve
out the website files via NFS. I'm not sure how ESXi works, but under
something like Solaris/VBox, I could set up the base Solaris system to run
CacheFS for an NFS share, and then give all the VBox instances local access to
that single NFS mountpoint. That would allow for heavy client-side caching of
important data for your web servers. If you're careful, you can separate
read-only data from write-only data, which would allow you even better
performance tweaks. I tend to like to have the host OS handle as much network
traffic and caching of data as possible instead of each VM doing it; it tends
to be more efficient that way.

>>>   * Are the FSYNC speed issues with NFS resolved?

The ZIL SSDs will compensate for synchronous write issues in NFS. Not
completely eliminate them, but you shouldn't notice issues with sync writing
until you're up at pretty heavy loads.

>>>   * Should I go with Fibre Channel, or will the 4 built-in 1GbE NICs
>>>     give me enough speed?

Depending on how much RAM and how much local data caching you do (and the
specifics of the web site accesses), 4 GbE should be fine. However, if you
want more, I'd get another quad GbE card, and then run at least 2 guest
instances per client hardware. Try very hard to have the equivalent of a full
GbE available per VM. Personally, I'd go for client hardware that has 4 GbE
interfaces: 1 each for two VMs, 1 for external internet access, and 1 for
management. I'd then run the X4540 with 8 GbE bonded (trunked/teamed/whatever)
together. This might be overkill, so see what your setup requires in terms of
available bandwidth.

>>>   * How many SSDs should I use for the ZIL and L2ARC?

Being a website mux, your data pattern is likely to be 99% read, with small
random writes being the remaining 1%.

You need just enough high-performance SSD for the ZIL. Honestly, the 32GB
X25-E is larger than you'll likely ever need. I can't recommend anything else
for the money, but the sad truth is that ZFS really only needs 1-2GB of NVRAM
for the ZIL (for most use cases). So get the smallest device you can find that
still satisfies the high-performance requirement. Caveat: look at the archives
for all the talk about protecting your ZIL device from power outages (and the
lack of a capacitor in most modern SSDs).

For L2ARC, go big. Website files tend to be /very/ small, so you're in the
worst use case for dedup. With something like an X4540 and its huge data
capacity, get as much L2ARC SSD space as you can afford. Remember: 250 bytes
per dedup block. If you have 1k blocks for all those little files, well, your
L2ARC needs to be 25% of your data size. *Ouch* Now, you don't have to buy the
super-expensive stuff for L2ARC: the good old Intel X-25M works just fine.
Don't mirror them.

Given the explosive potential size of your DDT, I'd think long and hard about
which data you really want to dedup. Disk is cheap, but SSD isn't. Good news
is that you can selectively decide which data sets to dedup. Ain't ZFS great?

>>>   * What pool structure should I use?

If it were me (and, given what little I know of your data), I'd go like this:

(1) pool for VMs:
        8 disks, MIRRORED
        1 SSD for L2ARC
        one zvol per VM instance, served via iSCSI, each with:
            dedup turned ON, compression turned OFF

(1) pool for clients to write data to (log files, incoming data, etc.)
        6 or 8 disks, MIRRORED
        2 SSDs for ZIL, mirrored
        Ideally, as many filesystems as you have webSITES, not just client
            VMs. As this might be unwieldy for 100s of websites, you should
            segregate them into obvious groupings, taking care with
            write/read permissions.
        NFS served
        dedup OFF, compression ON (or OFF, if you seem to be having CPU
            overload on the X4540)

(1) pool for client read-only data
        All the rest of the disks, split into 7 or 8-disk RAIDZ2 vdevs
        All the remaining SSDs for L2ARC
        As many filesystems as you have webSITES, not just client VMs.
            (however, see above)
        NFS served
        dedup ON for selected websites (filesystems), compression ON for
            everything

(2) Global hot spares.

>>> I know these questions are slightly vague, but any input would be
>>> greatly appreciated.
>>>
>>> Thanks!
>>
>> Which Virtual Machine technology are you going to use?
>> [...]

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
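The pool layout Erik sketches above maps roughly onto commands like the
following. This is only a sketch: the pool names, zvol sizes, and cXtYdZ
device paths are made-up placeholders (not the X4540's real device
enumeration), per-dataset dedup assumes a build with ZFS dedup support (b128
or later), and the usual COMSTAR target/portal setup (itadm create-target,
target groups) is omitted.

    # (1) mirrored pool for VM images, one SSD as L2ARC
    zpool create vmpool mirror c1t0d0 c2t0d0 mirror c3t0d0 c4t0d0 \
        mirror c5t0d0 c6t0d0 mirror c7t0d0 c8t0d0 cache c9t0d0

    # one zvol per VM: dedup on, compression off, exported over iSCSI
    zfs create -V 40G -o dedup=on -o compression=off vmpool/vm01
    sbdadm create-lu /dev/zvol/rdsk/vmpool/vm01
    stmfadm add-view 600144F0XXXXXXXX          # GUID printed by create-lu

    # (2) mirrored pool for client writes, with a mirrored SSD log
    zpool create writepool mirror c1t1d0 c2t1d0 mirror c3t1d0 c4t1d0 \
        mirror c5t1d0 c6t1d0 log mirror c9t1d0 c9t2d0
    zfs create -o compression=on -o sharenfs=on writepool/logs

    # (3) raidz2 pool for the read-mostly site data, remaining SSDs as cache
    #     (repeat the raidz2 group for the rest of the disks)
    zpool create sitepool \
        raidz2 c1t2d0 c2t2d0 c3t2d0 c4t2d0 c5t2d0 c6t2d0 c7t2d0 c8t2d0 \
        cache c9t3d0 c9t4d0 c9t5d0
    zfs create -o compression=on -o sharenfs=on sitepool/sites
    zfs set dedup=on sitepool/sites            # only where it pays off

    # hot spares (a spare can be added to more than one pool)
    zpool add sitepool spare c8t6d0 c8t7d0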
On Sun, Jun 06, 2010 at 09:16:56PM -0700, Ken wrote:
> I'm looking at VMware ESXi 4, but I'll take any advice offered.
...
> I'm looking to build a virtualized web hosting server environment accessing
> files on a hybrid storage SAN. I was looking at using the Sun X-Fire x4540
> with the following configuration:

IMHO Solaris Zones with LOFS-mounted ZFS filesystems give you the highest
flexibility in all directions, probably the best performance and least
resource consumption, fine-grained resource management (CPU, memory, storage
space), less maintenance stress, etc.

Have fun,
jel.
-- 
Otto-von-Guericke University    http://www.cs.uni-magdeburg.de/
Department of Computer Science  Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany        Tel: +49 391 67 12768
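A bare-bones sketch of the LOFS approach jel. describes, with made-up dataset
and zone names (tank/sites, webzone): one ZFS filesystem on the host,
loopback-mounted into an existing zone.

    # in the global zone: a dataset holding the site files
    zfs create -o compression=on tank/sites

    # loop it back into the zone at /sites
    zonecfg -z webzone
    zonecfg:webzone> add fs
    zonecfg:webzone:fs> set dir=/sites
    zonecfg:webzone:fs> set special=/tank/sites
    zonecfg:webzone:fs> set type=lofs
    zonecfg:webzone:fs> end
    zonecfg:webzone> commit
    zonecfg:webzone> exit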
> Which Virtual Machine technology are you going to use?
>
>   VirtualBox
>   VMware
>   Xen
>   Solaris Zones
>   Something else...
>
> It will make a difference as to my recommendation (or, do you want me to
> recommend a VM type, too?)

This is somewhat off-topic for zfs-discuss, but still. After trying to fight a
bug - http://www.virtualbox.org/ticket/6505 - for months and getting
close-to-zero feedback from the VirtualBox developers, I have abandoned using
VBox on OpenSolaris. It may work fine for a few days, perhaps even weeks, and
then boom. I don't have the equipment to set up a test system, and my server
is located some 50km from home, so I need something that works, not part of
the time, but all the time. Due to this, I'd recommend against VirtualBox on
OpenSolaris.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented
intelligibly. It is an elementary imperative for all pedagogues to avoid
excessive use of idioms of foreign origin. In most cases, adequate and
relevant synonyms exist in Norwegian.
>>>>> "et" == Erik Trimble <erik.trimble at oracle.com> writes:

    et> With NFS-hosted VM disks, do the same thing: create a single
    et> filesystem on the X4540 for each VM.

previous posters pointed out there are unreasonable hard limits in vmware to
the number of NFS mounts or iSCSI connections or something, so you will
probably run into that snag when attempting to use the much faster
snapshotting/cloning in ZFS.

    >>> * Are the FSYNC speed issues with NFS resolved?

    et> The ZIL SSDs will compensate for synchronous write issues in NFS.

okay, but sometimes for VMs I think this often doesn't matter because NFSv3
and v4 only add fsync()s on file closings, and a virtual disk is one giant
file that the client never closes. There may still be synchronous writes
coming through if they don't get blocked in LVM2 inside the guest or blocked
in the VM software, but whatever comes through ought to be exactly the same
number of them for NFS or iSCSI, unless the vm software has different bugs in
the nfs vs iscsi back-ends.

the other difference is in the latest comstar which runs in sync-everything
mode by default, AIUI. Or it does use that mode only when zvol-backed? Or
something. I've the impression it went through many rounds of quiet changes,
both in comstar and in zvols, on its way to its present form. I've heard said
here you can change the mode both from the comstar host and on the remote
initiator, but I don't know how to do it or how sticky the change is, but if
you didn't change and stuck with the default sync-everything I think NFS would
be a lot faster. This is if we are comparing one giant .vmdk or similar on
NFS, against one zvol. If we are comparing an exploded filesystem on NFS
mounted through the virtual network adapter, then of course you're right again
Erik.

The tradeoff integrity tests are, (1) reboot the solaris storage host without
rebooting the vmware hosts & guests and see what happens, (2) cord-yank the
vmware host. Both of these are probably more dangerous than (3) command the vm
software to virtual-cord-yank the guest.

    >>> * Should I go with Fibre Channel, or will the 4 built-in 1GbE
    >>> NICs give me enough speed?

FC has different QoS properties than Ethernet because of the buffer credit
mechanism---it can exert back-pressure all the way through the fabric. same
with IB, which is HOL-blocking. This is a big deal with storage, with its
large blocks of bursty writes that aren't really the case for which TCP
shines. I would try both and compare, if you can afford it!

    je> IMHO Solaris Zones with LOFS mounted ZFSs gives you the highest
    je> flexibility in all directions, probably the best performance and
    je> least resource consumption, fine grained resource management (CPU,
    je> memory, storage space) and less maintenance stress etc...

yeah zones are really awesome, especially combined with clones and snapshots.
For once the clunky post-Unix XML crappo solaris interfaces are actually
something I appreciate a little, because lots of their value comes from being
able to do consistent repeatable operations on them. The problem is that the
zones run Solaris instead of Linux. BrandZ never got far enough to, for
example, run Apache under a 2.6-kernel-based distribution, so I don't find it
useful for any real work. I do keep a CentOS 3.8 (I think?) brandz zone
around, but not for anything production---just so I can try it if I think the
new/weird version of a tool might be broken.

as for native zones, the ipkg repository, and even the jucr repository, has
two-year-old versions of everything---django/python, gcc, movabletype. Many
things are missing outright, like nginx. I'm very disappointed that Solaris
did not adopt an upstream package system like Dragonfly did. Gentoo or pkgsrc
would have been very smart, IMHO. Even opencsw is based on Nick Moffitt's GAR
system, which was an old mostly-abandoned tool for building bleeding edge
Gnome on Linux. The ancient perpetually-abandoned set of packages on jucr and
the crufty poorly-factored RPM-like spec files leave me with little interest
in contributing to jucr myself, while if Solaris had poured the effort instead
into one of these already-portable package systems like they poured it into
Mercurial after adopting that, then I'd instead look into (a) contributing
packages that I need most, and (b) using whatever system Solaris picked on my
non-Solaris systems. This crap/marginalized build system means I need to look
at a way to host Linux under Solaris, using Solaris basically just for ZFS and
nothing else. The alternative is to spend heaps of time re-inventing the wheel
only to end up with an environment less rich than competitors and charge twice
as much for it like joyent.

But, yeah, while working on Solaris I would never install anything in the
global zone after discovering how easy it is to work with ipkg zones. They are
really brilliant, and unlike everyone else's attempt at these superchroots
like freebsd jails/johncompanies.com I feel like zones are basically finished.

however... because of:

  http://mail.opensolaris.org/pipermail/zfs-discuss/2009-October/032878.html

I wonder if it might be better to mount ZFS datasets directly in the zones,
not lofs mount them. It's easy to do this. Short version is:

 1. create dataset outside the zone with mountpoint=none
 2. add dataset to the zone with zonecfg
 3. set the dataset's mountpoint from a shell inside the zone

Long version below.

postgres cheatsheet:
-----8<-----
http://blogs.sun.com/jkshah/entry/opensolaris_2008_11_and_postgresql

need to make a dataset outside the zbe for postgres data so it'll escape beadm
snapshotting/cloning once that's working within zones for image-update.
setting mountpoints for zoned datasets is weird, though:

http://mail.opensolaris.org/pipermail/zones-discuss/2009-January/004661.html

outside the zone:

 zfs list -r tub/export/zone
 NAME                                USED  AVAIL  REFER  MOUNTPOINT
 tub/export/zone                    27.1G   335G  40.3K  /export/zone
 tub/export/zone/awabagal            917M   335G  37.4K  /export/zone/awabagal
 tub/export/zone/awabagal/ROOT       917M   335G  31.4K  legacy
 tub/export/zone/awabagal/ROOT/zbe   917M   335G  2.72G  legacy

 zfs create -o mountpoint=none tub/export/zone/awabagal/postgres-data

 zonecfg -z awabagal
 zonecfg:awabagal> add dataset
 zonecfg:awabagal:dataset> set name=tub/export/zone/awabagal/postgres-data
 zonecfg:awabagal:dataset> end
 zonecfg:awabagal> commit
 zonecfg:awabagal> exit

inside the zone:

 zfs list
 NAME                                     USED  AVAIL  REFER  MOUNTPOINT
 tub                                     1.33T   335G   498K  /tub
 tub/export                               295G   335G  63.2M  /export
 tub/export/zone                         27.1G   335G  40.3K  /export/zone
 tub/export/zone/awabagal                 919M   335G  37.4K  /export/zone/awabagal
 tub/export/zone/awabagal/ROOT            919M   335G  31.4K  legacy
 tub/export/zone/awabagal/ROOT/zbe        919M   335G  2.73G  legacy
 tub/export/zone/awabagal/postgres-data  31.4K   335G  31.4K  none

 zfs set mountpoint=/var/postgres tub/export/zone/awabagal/postgres-data

the /var/postgres directory is magical and hardcoded into the package.

the rest, you do inside the zone:

 pkg install SUNWpostgr-83-server SUNWpostgr-83-client SUNWpostgr-jdbc \
   SUNWpostgr-83-contrib SUNWpostgr-83-docs SUNWpostgr-83-devel \
   SUNWpostgr-83-tcl SUNWpostgr-83-pl SUNWpgadmin3

 svccfg import /var/svc/manifest/application/database/postgresql_83.xml
 svcadm enable postgresql_83:default_64bit

add /usr/postgres/8.3/bin to {,SU}PATH in /etc/default/{login,su}
-----8<-----
On Mon, 7 Jun 2010, Miles Nordin wrote:
> FC has different QoS properties than Ethernet because of the buffer
> credit mechanism---it can exert back-pressure all the way through the
> fabric. same with IB, which is HOL-blocking. This is a big deal with
> storage, with its large blocks of bursty writes that aren't really the
> case for which TCP shines. I would try both and compare, if you can
> afford it!

FCoE is beginning to change this, with ethernet adaptors and switches which
support the new features. Without the new FCoE standards, Ethernet can exert
back pressure but only on a local-link level, and with long delays. You can be
sure that companies like Cisco will be (or are) selling FCoE hardware to
compete with FC SANs. The intention is that ethernet will put fibre channel
out of business. We shall see if history repeats itself.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Jun 7, 2010, at 11:06 AM, Miles Nordin wrote:
> the other difference is in the latest comstar which runs in
> sync-everything mode by default, AIUI. Or it does use that mode only
> when zvol-backed? Or something.

It depends on your definition of "latest." The latest OpenSolaris release is
2009.06, which treats all zvol-backed COMSTAR iSCSI writes as sync. This was
changed in the developer releases in summer 2009, b114. For a release such as
NexentaStor 3.0.2, which is based on b140 (+/-), the initiator's write cache
enable/disable request is respected, by default.

>>>> * Should I go with Fibre Channel, or will the 4 built-in 1GbE
>>>> NICs give me enough speed?
>
> FC has different QoS properties than Ethernet because of the buffer
> credit mechanism---it can exert back-pressure all the way through the
> fabric. same with IB, which is HOL-blocking. This is a big deal with
> storage, with its large blocks of bursty writes that aren't really the
> case for which TCP shines.

Please don't confuse Ethernet with IP. Ethernet has no routing and no back-off
other than that required for the link. Since GbE and higher speeds are all
implemented as switched fabrics, the ability of the switch to manage
contention is paramount. You can observe this on a Solaris system by looking
at the NIC flow control kstats.

For a LAN environment, there is little practical difference between Ethernet
and FC wrt port contention -- high quality switches will prove better than
bargain-basement switches, with direct attach (no switches) being the optimum
cost+performance solution. WANs are a different beast, and are where we find
tuning the FC buffer credits to be worth the effort. For WANs no tuning is
required for IP on modern OSes (Ethernet doesn't do WAN).

> I would try both and compare, if you can afford it!

+1
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
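If you want to look at those flow-control counters yourself, the exact kstat
names vary by NIC driver, so the simple-minded way is just to grep for them:

    # dump all kstats and keep anything that looks like a pause/flow counter
    kstat -p | egrep -i 'pause|flowctrl'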
On Mon, 2010-06-07 at 13:32 -0700, Richard Elling wrote:
> On Jun 7, 2010, at 11:06 AM, Miles Nordin wrote:
>> the other difference is in the latest comstar which runs in
>> sync-everything mode by default, AIUI. Or it does use that mode only
>> when zvol-backed? Or something.
>
> It depends on your definition of "latest." The latest OpenSolaris release
> is 2009.06 which treats all Zvol-backed COMSTAR iSCSI writes as
> sync. This was changed in the developer releases in summer 2009, b114.
> For a release such as NexentaStor 3.0.2, which is based on b140 (+/-),
> the initiator's write cache enable/disable request is respected, by default.

Minor correction: NexentaStor 3.0.2 is based on 134, plus a "backport" of a
number of selected patches from OpenSolaris -- especially ZFS patches.

 -- Garrett
On Jun 7, 2010, at 2:10 AM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Comments in-line.
>
> On 6/6/2010 9:16 PM, Ken wrote:
>> I'm looking at VMware ESXi 4, but I'll take any advice offered.
>> [...]
>>> Should I use NFS with all five VMs accessing the exports, or one LUN
>>> for each VM, accessed over iSCSI?
>
> Generally speaking, it depends on your comfort level with running iSCSI
> volumes to put the VMs in, or serving everything out via NFS (hosting
> the VM disk file in an NFS filesystem).
>
> If you go the iSCSI route, I would definitely go the "one iSCSI volume
> per VM" route - note that you can create multiple zvols per zpool on
> the X4540, so it's not limiting in any way to volume-ize a VM. It's a
> lot simpler, easier, and allows for nicer management
> (snapshots/cloning/etc. on the X4540 side) if you go with a VM per
> iSCSI volume.
>
> With NFS-hosted VM disks, do the same thing: create a single filesystem
> on the X4540 for each VM.

VMware has a 32 mount limit which may limit the OP somewhat here.

> Performance-wise, I'd have to test, but I /think/ the iSCSI route will
> be faster. Even with the ZIL SSDs.

Actually, properly tuned they are about the same, but VMware NFS datastores
are FSYNC on all operations, which isn't the best for data vmdk files; best to
serve the data directly to the VM using either iSCSI or NFS.

>>> Are the FSYNC speed issues with NFS resolved?
>
> The ZIL SSDs will compensate for synchronous write issues in NFS.
> Not completely eliminate them, but you shouldn't notice issues with
> sync writing until you're up at pretty heavy loads.

You will need this with VMware, as every NFS operation (not just file
open/close) coming out of VMware will be marked FSYNC (for VM data integrity
in the face of server failure).

> If it were me (and, given what little I know of your data), I'd go like
> this:
>
> (1) pool for VMs:
>         8 disks, MIRRORED
>         1 SSD for L2ARC
>         one zvol per VM instance, served via iSCSI, each with:
>             dedup turned ON, compression turned OFF
>
> (1) pool for clients to write data to (log files, incoming data, etc.)
>         6 or 8 disks, MIRRORED
>         2 SSDs for ZIL, mirrored
>         [...]
>
> (1) pool for client read-only data
>         All the rest of the disks, split into 7 or 8-disk RAIDZ2 vdevs
>         All the remaining SSDs for L2ARC
>         [...]
>
> (2) Global hot spares.

Make your life easy and use NFS for VMs and data. If you need high-performance
data such as databases, use iSCSI zvols directly into the VM, otherwise
NFS/CIFS into the VM should be good enough.

-Ross
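As a rough illustration of the split Ross suggests (the dataset names and the
192.168.10.0/24 storage subnet are invented for the example):

    # NFS datastore for the VM images, restricted to the ESXi storage subnet
    zfs create tank/esx-datastore
    zfs set sharenfs='rw=@192.168.10.0/24,root=@192.168.10.0/24' tank/esx-datastore

    # a zvol handed straight to one guest over iSCSI for its database
    zfs create -V 100G tank/db-vm01
    sbdadm create-lu /dev/zvol/rdsk/tank/db-vm01
    stmfadm add-view 600144F0YYYYYYYY          # GUID reported by create-lu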
Everyone, thank you for the comments, you've given me lots of great info to
research further.

On Mon, Jun 7, 2010 at 15:57, Ross Walker <rswwalker at gmail.com> wrote:
> [...]
> Make your life easy and use NFS for VMs and data. If you need
> high-performance data such as databases, use iSCSI zvols directly into
> the VM, otherwise NFS/CIFS into the VM should be good enough.
>
> -Ross
On Jun 7, 2010, at 16:32, Richard Elling wrote:
> Please don't confuse Ethernet with IP. Ethernet has no routing and
> no back-off other than that required for the link.

Not entirely accurate going forward. IEEE 802.1Qau defines an end-to-end
congestion notification management system:

  http://blogs.netapp.com/ethernet/8021qau/

IEEE 802.1aq provides for a link state protocol for finding the topology of an
Ethernet network:

  http://en.wikipedia.org/wiki/Shortest_Path_Bridging

See also the IETF's Transparent Interconnection of Lots of Links (TRILL):

  http://tools.ietf.org/html/rfc5556
  http://tools.ietf.org/wg/trill/

All of this is being done under the rubric of "data center bridging" (DCB):

  http://en.wikipedia.org/wiki/Data_center_bridging

Brocade and IBM (?) call this Converged Enhanced Ethernet (CEE).

Things aren't what they used to was.
>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:

    re> Please don't confuse Ethernet with IP.

okay, but I'm not. seriously, if you'll look into it.

Did you misread where I said FC can exert back-pressure? I was contrasting
with Ethernet. Ethernet output queues are either FIFO or RED, and are large
compared to FC and IB. FC is buffer-credit, which HOL-blocks to prevent the
small buffers from overflowing, and IB is...blocking (almost no buffer at
all---about 2KB per port and bandwidth*delay product of about 1KB for the
whole mesh, compared to ARISTA which has about 48MB per port, so except to the
pedantic, IB is bufferless, ie it does not even buffer one full frame). Unlike
Ethernet, both are lossless fabrics (sounds good) and have an HOL-blocking
character (sounds bad). They're fundamentally different at L2, so this is not
about IP. If you run IP over IB, it is still blocking and lossless. It does
not magically start buffering when you use IP because the fabric is simply
unable to buffer---there is no RAM in the mesh anywhere.

Both L2 and L3 switches have output queues, and both L3 and L2 output queues
can be FIFO or RED, because the output buffer exists in the same piece of
silicon of an L3 switch no matter whether it's set to forward in L2 or L3
mode, so L2 and L3 switches are like each other and unlike FC & IB. This is
not about IP. It's about Ethernet.

a relevant congestion difference between L3 and L2 switches (confusing
ethernet with IP) might be ECN, because only an L3 switch can do ECN. But I
don't think anyone actually uses ECN. It's disabled by default in Solaris and,
I think, all other Unixes. AFAICT my Extreme switches, a very old L3
flow-forwarding platform, are not able to flip the bit. I think the 6500 can,
but I'm not certain.

    re> no back-off other than that required for the link. Since
    re> GbE and higher speeds are all implemented as switched fabrics,
    re> the ability of the switch to manage contention is paramount.
    re> You can observe this on a Solaris system by looking at the NIC
    re> flow control kstats.

You're really confused, though I'm sure you're going to deny it. Ethernet flow
control mostly isn't used at all, and it is never used to manage output queue
congestion except in hardware that everyone agrees is defective. I almost feel
like I've written all this stuff already, even the part about ECN.

Ethernet flow control is never correctly used to signal output queue
congestion. The ethernet signal for congestion is a dropped packet. flow
control / PAUSE frames are *not* part of some magic mesh-wide mechanism by
which switches ``manage'' congestion. PAUSE are used, when they're used at
all, for oversubscribed backplanes: for congestion on *input*, which in
Ethernet is something you want to avoid. You want to switch ethernet frames to
the output port where they may or may not encounter congestion so that you
don't hold up input frames headed toward other output ports. If you did hold
them up, you'd have something like HOL blocking. IB takes a different
approach: you simply accept the HOL blocking, but tend to design a mesh with
little or no oversubscription, unlike ethernet LANs which are heavily
oversubscribed on their trunk ports. so...the HOL blocking happens, but not as
much as it would with a typical Ethernet topology, and it happens in a way
that in practice probably increases the performance of storage networks.

This is interesting for storage because when you try to shove a 128kByte write
into an Ethernet fabric, part of it may get dropped in an output queue
somewhere along the way. In IB, never will part of the write get dropped, but
sometimes you can't shove it into the network---it just won't go, at L2. With
Ethernet you rely on TCP to emulate this can't-shove-in condition, and it does
not work perfectly, in that it can introduce huge jitter and link underuse
(the ``incast'' problem: http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf ),
and secondly leave many kilobytes in transit within the mesh or TCP buffers,
like tens of megabytes and milliseconds per hop, requiring large TCP buffers
on both ends to match the bandwidth*jitter and frustrating storage QoS by
queueing commands on the link instead of in the storage device, but in
exchange you get from Ethernet no HOL blocking and the possibility of
end-to-end network QoS. It is a fair tradeoff but arguably the wrong one for
storage, based on experience with iSCSI sucking so far.

But the point is, looking at those ``flow control'' kstats will only warn you
if your switches are shit, and shit in one particular way that even cheap
switches rarely are. The metric that's relevant is how many packets are being
dropped, and in what pattern (a big bucket of them at once like FIFO, or a
scattering like RED), and how TCP is adapting to these drops. For this you
might look at TCP stats in solaris, or at output queue drop and output queue
size stats on managed switches, or simply at the overall bandwidth, the
``goodput'' in the incast paper. The flow control kstats will never be
activated by normal congestion, unless you have some $20 gamer switch that is
misdesigned:

  http://www.networkworld.com/netresources/0913flow2.html
  http://www.smallnetbuilder.com/content/view/30212/54/
  http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html

I said PAUSE frames are mostly never used, but Cisco's Nexus FCoE supposedly
does send pause frames within a CoS when it has a link partner who wants to
play its Cisco-FCoE game, so the PAUSE apply to that CoS and not to the whole
link, and these have a completely different purpose unrelated to the original
pause frames. I'm speculating from limited information because I'm not
interested in Nexus and have not read much about it, much less have any. Cisco
has a lot of slick talk about them that makes it sound like you're getting the
best of every buzzword, but AIUI the point is to create a lossless low-jitter
HOL-blocking VLAN for storage only, so that storage traffic can be transmitted
without eating huge amounts of switch output buffer and without provoking
TCP{,-like protocols} with congestion-signal packet drops, while at the same
time running other non-storage vlans in lossful, non-HOL-blocking mode where
nothing blocks on input and the fabric signals congestion by dropping packets
from output queues and color-marking diffserv-style QoS is possible, like most
TCP app developers are accustomed to.

I know some FCoE stuff got checked into Solaris, but I don't think FCoE
support necessarily implies Nexus CoS-PAUSE support, so I don't know if
Solaris even supports this type of weird pause frame. I do think it would need
to support these frames for FCoE to work well, because otherwise you just push
the incast problem out to the edge, to the first switch facing the packet
source. Anyway FCoE's not on the table for any of this discussion so far. I
only mention it so you won't try to make my whole post sound wrong by
mentioning some pedantic nit-picky detail.

    re> The latest OpenSolaris release is 2009.06 which treats all
    re> Zvol-backed COMSTAR iSCSI writes as sync. This was changed in
    re> the developer releases in summer 2009, b114. For a release
    re> such as NexentaStor 3.0.2, which is based on b140 (+/-), the
    re> initiator's write cache enable/disable request is respected,
    re> by default.

that helps a little, but it's far from a full enough picture to be useful to
anyone IMHO. In fact it's pretty close to ``it varies and is confusing'' which
I already knew:

 * how do I control the write cache from the initiator? though I think I
   already know the answer: ``it depends on which initiator,'' and ``oh,
   you're using that one? well i don't know how to do it with THAT
   initiator'' == YOU DON'T

 * when the setting has been controlled, how long does it persist? Where can
   it be inspected?

 * ``by default'' == there is a way to make it not respect the initiator's
   setting, and through a target shell command cause it to use one setting or
   the other, persistently?

 * is the behavior different for file-backed LUNs than zvols?

I guess there is less point to figuring this out until the behavior is
settled.
On Tue, 8 Jun 2010, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
>     re> Please don't confuse Ethernet with IP.
>
> okay, but I'm not. seriously, if you'll look into it.
>
> Did you misread where I said FC can exert back-pressure? I was
> contrasting with Ethernet.
>
> You're really confused, though I'm sure you're going to deny it.

I don't think so. I think that it is time to reset and reboot yourself on the
technology curve. FC semantics have been ported onto ethernet. This is not
your grandmother's ethernet but it is capable of supporting both FCoE and
normal IP traffic. The FCoE gets per-stream QOS similar to what you are used
to from Fibre Channel. Quite naturally, you get to pay a lot more for the new
equipment and you have the opportunity to discard the equipment you bought
already.

Richard is not out in the weeds although there are probably plenty of weeds
growing at the ranch.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On 6/8/2010 6:33 PM, Bob Friesenhahn wrote:
> On Tue, 8 Jun 2010, Miles Nordin wrote:
>> Did you misread where I said FC can exert back-pressure? I was
>> contrasting with Ethernet.
>>
>> You're really confused, though I'm sure you're going to deny it.
>
> I don't think so. I think that it is time to reset and reboot
> yourself on the technology curve. FC semantics have been ported onto
> ethernet. This is not your grandmother's ethernet but it is capable
> of supporting both FCoE and normal IP traffic. The FCoE gets
> per-stream QOS similar to what you are used to from Fibre Channel.
> Quite naturally, you get to pay a lot more for the new equipment and
> you have the opportunity to discard the equipment you bought already.
>
> Richard is not out in the weeds although there are probably plenty of
> weeds growing at the ranch.
>
> Bob

Well, are you saying we might want to put certain folks out to pasture? <wink>

That said, I had a good look at FCoE about a year ago, and, unlike ATAoE,
which effectively ran over standard managed or smart switches, FCoE required
specialized switch hardware that was non-trivially expensive. That said, it
did seem to be a mature protocol implementation, so it was a viable option
once the hardware price came down (and we had wider, better software
implementations).

Also, FCoE really doesn't seem to play well with regular IP on the same link,
so you really should dedicate a link (not necessarily a switch) to FCoE, and
pipe your IP traffic via another link. It is NOT iSCSI.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Tue, Jun 08, 2010 at 08:33:40PM -0500, Bob Friesenhahn wrote:
> On Tue, 8 Jun 2010, Miles Nordin wrote:
>> Did you misread where I said FC can exert back-pressure? I was
>> contrasting with Ethernet.
>>
>> You're really confused, though I'm sure you're going to deny it.
>
> I don't think so. I think that it is time to reset and reboot yourself
> on the technology curve. FC semantics have been ported onto ethernet.
> This is not your grandmother's ethernet but it is capable of supporting
> both FCoE and normal IP traffic. The FCoE gets per-stream QOS similar to
> what you are used to from Fibre Channel. Quite naturally, you get to pay
> a lot more for the new equipment and you have the opportunity to discard
> the equipment you bought already.

Yeah, today enterprise iSCSI vendors like Equallogic (bought by Dell)
_recommend_ using flow control. Their iSCSI storage arrays are designed to
work properly with flow control and perform well. Of course you need proper
("certified") switches as well.

Equallogic says the delays from flow control pause frames are shorter than TCP
retransmits, so that's why they're using and recommending it.

-- Pasi
>>>>> "pk" == Pasi Kärkkäinen <pasik at iki.fi> writes:

    >>> You're really confused, though I'm sure you're going to deny it.

    >> I don't think so. I think that it is time to reset and reboot
    >> yourself on the technology curve. FC semantics have been ported
    >> onto ethernet. This is not your grandmother's ethernet but it is
    >> capable of supporting both FCoE and normal IP traffic. The FCoE
    >> gets per-stream QOS similar to what you are used to from Fibre
    >> Channel.

FCoE != iSCSI.

FCoE was not being discussed in the part you're trying to contradict. If you
read my entire post, I talk about FCoE at the end and say more or less ``I am
talking about FCoE here only so you don't try to throw out my entire post by
latching onto some corner case not applying to the OP by dragging FCoE into
the mix'' which is exactly what you did. I'm guessing you fired off a reply
without reading the whole thing?

    pk> Yeah, today enterprise iSCSI vendors like Equallogic (bought by
    pk> Dell) _recommend_ using flow control. Their iSCSI storage arrays
    pk> are designed to work properly with flow control and perform well.

    pk> Of course you need proper ("certified") switches as well.

    pk> Equallogic says the delays from flow control pause frames are
    pk> shorter than TCP retransmits, so that's why they're using and
    pk> recommending it.

please have a look at the three links I posted about flow control not being
used the way you think it is by any serious switch vendor, and the explanation
of why this limitation is fundamental, not something that can be overcome by
``technology curve.'' It will not hurt anything to allow autonegotiation of
flow control on non-broken switches, so I'm not surprised they recommend it
with ``certified'' known-non-broken switches, but it also will not help unless
your switches have input/backplane congestion, which they usually don't, or
your end host is able to generate PAUSE frames for PCIe congestion, which is
maybe more plausible. In particular it won't help with the typical case of the
``incast'' problem in the experiment in the FAST incast paper URL I gave,
because they narrowed down what was happening in their experiment to OUTPUT
queue congestion, which (***MODULO FCoE*** mr ``reboot yourself on the
technology curve'') never invokes ethernet flow control.

HTH.

ok let me try again:

yes, I agree it would not be stupid to run iSCSI+TCP over a CoS with blocking
storage-friendly buffer semantics if your FCoE/CEE switches can manage that,
but I would like to hear of someone actually DOING it before we drag it into
the discussion. I don't think that's happening in the wild so far, and it's
definitely not the application for which these products have been flogged.

I know people run iSCSI over IB (possibly with RDMA for moving the bulk data
rather than TCP), and I know people run SCSI over FC, and of course SCSI (not
iSCSI) over FCoE. Remember the original assertion was: please try FC as well
as iSCSI if you can afford it.

Are you guys really saying you believe people are running ***iSCSI*** over the
separate HOL-blocking hop-by-hop pause frame CoS's of FCoE meshes? or are you
just spewing a bunch of noxious white paper vapours at me? because AIUI people
using the lossless/small-output-buffer channel of FCoE are running the FC
protocol over that ``virtual channel'' of the mesh, not iSCSI, are they not?
On Fri, Jun 11, 2010 at 03:30:26PM -0400, Miles Nordin wrote:
> Are you guys really saying you believe people are running ***iSCSI***
> over the separate HOL-blocking hop-by-hop pause frame CoS's of FCoE
> meshes? or are you just spewing a bunch of noxious white paper vapours
> at me? because AIUI people using the lossless/small-output-buffer
> channel of FCoE are running the FC protocol over that ``virtual
> channel'' of the mesh, not iSCSI, are they not?

I was talking about iSCSI over TCP over IP over Ethernet. No FCoE. No IB.

-- Pasi
On Fri, 11 Jun 2010, Miles Nordin wrote:
> FCoE != iSCSI.
>
> FCoE was not being discussed in the part you're trying to contradict.
> If you read my entire post, I talk about FCoE at the end and say more
> or less ``I am talking about FCoE here only so you don't try to throw
> out my entire post by latching onto some corner case not applying to
> the OP by dragging FCoE into the mix'' which is exactly what you did.
> I'm guessing you fired off a reply without reading the whole thing?

I am deeply concerned that you are relying on your extensive experience with
legacy ethernet technologies and have not done any research on modern
technologies. Entering "FCoE" into Google resulted in many useful hits which
describe technologies which are ethernet but more advanced than the "ethernet"
you generalized in your lengthy text. For example:

  http://www.fcoe.com/
  http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-462176.html
  http://www.brocade.com/products-solutions/solutions/connectivity/FCoE/index.page
  http://www.emulex.com/products/converged-network-adapters.html

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Jun 8, 2010, at 12:46 PM, Miles Nordin wrote:
>>>>>> "re" == Richard Elling <richard.elling at gmail.com> writes:
>
>     re> Please don't confuse Ethernet with IP.
>
> okay, but I'm not. seriously, if you'll look into it.

[fine whine elided]

I think we can agree that the perfect network has yet to be invented :-)
Meanwhile, 6Gbps SAS switches are starting to hit the market... what fun :-)

>     re> The latest OpenSolaris release is 2009.06 which treats all
>     re> Zvol-backed COMSTAR iSCSI writes as sync. This was changed in
>     re> the developer releases in summer 2009, b114. For a release
>     re> such as NexentaStor 3.0.2, which is based on b140 (+/-), the
>     re> initiator's write cache enable/disable request is respected,
>     re> by default.
>
> that helps a little, but it's far from a full enough picture to be
> useful to anyone IMHO. In fact it's pretty close to ``it varies and
> is confusing'' which I already knew:
>
>  * how do I control the write cache from the initiator? though I
>    think I already know the answer: ``it depends on which initiator,''
>    and ``oh, you're using that one? well i don't know how to do it
>    with THAT initiator'' == YOU DON'T

For ZFS over a Solaris initiator, it is done by setting DKIOCSETWCE via an
ioctl. Look on or near
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_disk.c#276

I presume that this can also be set with format -e, as is done for other
devices. Has anyone else tried?

>  * when the setting has been controlled, how long does it persist?
>    Where can it be inspected?

RTFM stmfadm(1m) and look for "wcd"

<small_rant> drives me nuts that some people prefer negatives (disables) over
positives (enables) </small_rant>

>  * ``by default'' == there is a way to make it not respect the
>    initiator's setting, and through a target shell command cause it to
>    use one setting or the other, persistently?

See above.

>  * is the behavior different for file-backed LUNs than zvols?

Yes, it can be. It can also be modified by the sync property. See CR 6794730,
need zvol support for DKIOCSETWCE and friends
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6794730

> I guess there is less point to figuring this out until the behavior is
> settled.

I think it is settled, but perhaps not well documented :-(
 -- richard

-- 
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
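For anyone hunting for the knobs Richard points at, they live roughly here.
The LU GUID below is a placeholder, and "zfs set sync=" needs a build recent
enough to have the sync property:

    # target side: per-LU "write cache disable" (wcd) flag, see stmfadm(1M)
    stmfadm list-lu -v                               # shows the writeback-cache state per LU
    stmfadm modify-lu -p wcd=true 600144F0ZZZZZZZZ   # turn the write cache off for one LU

    # dataset side: force (or relax) synchronous semantics regardless of the initiator
    zfs set sync=always tank/vm01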