I am in the proof-of-concept phase of building a large ZFS/Solaris-based SAN box, and am experiencing absolutely poor / unusable performance. Where to begin...

The hardware setup:
Supermicro 4U 24-drive-bay chassis
Supermicro X8DT3 server motherboard
2x Xeon E5520 Nehalem 2.26 GHz quad-core CPUs
4GB memory
Intel EXPI9404PT 4-port gigabit server network card (used for iSCSI traffic only)
Adaptec 52445 28-port SATA/SAS RAID controller connected to
24x Western Digital WD1002FBYS 1TB enterprise drives.

I have configured the 24 drives as single simple volumes in the Adaptec RAID BIOS, and am presenting them to the OS as such.

I then create a zpool using raidz2, using all 24 drives, 1 as a hot spare:
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00

Then create a volume store:
zfs create -o canmount=off tank/volumes

Then create a 10 TB volume to be presented to our file server:
zfs create -V 10TB -o shareiscsi=on tank/volumes/fsrv1data

From here, I discover the iSCSI target on our Windows Server 2008 R2 file server, and see the disk is attached in Disk Management. I initialize the 10TB disk fine, and begin to quick format it. Here is where I begin to see the poor performance issue. The quick format took about 45 minutes, and once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.

I have no clue what I could be doing wrong. To my knowledge, I followed the documentation for setting this up correctly, though I have not looked at any tuning guides beyond the first line saying you shouldn't need to do any of this because the people who picked these defaults know more about it than you.

Jumbo frames are enabled on both sides of the iSCSI path, as well as on the switch, and rx/tx buffers are increased to 2048 on both sides as well. I know this is not a hardware / iSCSI network issue: as another test, I installed Openfiler in a similar configuration (using hardware RAID) on this box, and was getting 350-450 MB/s from our file server.

An "iostat -xndz 1" readout of the %b column during a file copy to the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2 seconds at 100, and repeats.

Is there anything I need to do to get this usable? Or any additional information I can provide to help solve this problem? As nice as Openfiler is, it doesn't have ZFS, which is necessary to achieve our final goal.
-- 
This message posted from opensolaris.org
On Wed, Feb 10, 2010 at 17:06, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
> I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.
>
> I then create a zpool using raidz2, using all 24 drives, 1 as a hot spare:
> zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00

Create several smaller raidz2 vdevs, and consider adding a log device and/or cache devices. A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance.

zpool create tank raidz2 c1t0d0 c1t1d0 ... \
    raidz2 c1t5d0 c1t6d0 ... \
    etc

You might try, say, four 5-wide stripes with a spare, a mirrored log device, and a cache device. More memory wouldn't hurt anything, either.

Will
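As a concrete sketch of the layout Will suggests (four 5-wide raidz2 vdevs, a hot spare, a mirrored log and a cache device), something like this would do it; the device names are only illustrative, not taken from the original box:

  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      raidz2 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      raidz2 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
      raidz2 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 \
      spare c1t20d0 \
      log mirror c2t0d0 c2t1d0 \
      cache c2t2d0

That uses 20 drives for data vdevs plus a spare; the log and cache devices would normally be SSDs rather than more of the same spindles.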
On 2/10/10 2:06 PM -0800 Brian E. Imhoff wrote:
> I then create a zpool using raidz2, using all 24 drives, 1 as a
> hot spare: zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare
> c1t23d00

Well there's one problem anyway. That's going to be horribly slow no matter what.
On Wed, February 10, 2010 16:28, Will Murnane wrote:
> On Wed, Feb 10, 2010 at 17:06, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>> I am in the proof-of-concept phase of building a large ZFS/Solaris based
>> SAN box, and am experiencing absolutely poor / unusable performance.
>>
>> I then create a zpool using raidz2, using all 24 drives, 1 as a hot spare:
>> zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00
>
> Create several smaller raidz2 vdevs, and consider adding a log device
> and/or cache devices. A single raidz2 vdev has about as many IOs per
> second as a single disk, which could really hurt iSCSI performance.
>
> zpool create tank raidz2 c1t0d0 c1t1d0 ... \
>     raidz2 c1t5d0 c1t6d0 ... \
>     etc
>
> You might try, say, four 5-wide stripes with a spare, a mirrored log
> device, and a cache device. More memory wouldn't hurt anything,
> either.

That's useful general advice for increasing I/O, I think, but he clearly has something other than a "general" problem. Did you read the numbers he gave on his iSCSI performance? That can't be explained just by overly-large RAIDZ groups, I don't think.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Wed, Feb 10, 2010 at 4:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
> I am in the proof-of-concept phase of building a large ZFS/Solaris based
> SAN box, and am experiencing absolutely poor / unusable performance.
>
> [...]
>
> Is there anything I need to do to get this usable? Or any additional
> information I can provide to help solve this problem? As nice as Openfiler
> is, it doesn't have ZFS, which is necessary to achieve our final goal.

You're extremely light on RAM for a system with 24TB of storage and two E5520's. I don't think it's the entire source of your issue, but I'd strongly suggest considering doubling what you have as a starting point.

What version of OpenSolaris are you using? Have you considered using COMSTAR as your iSCSI target?

--Tim
On Wed, 10 Feb 2010, Frank Cusack wrote:
> On 2/10/10 2:06 PM -0800 Brian E. Imhoff wrote:
>> I then create a zpool using raidz2, using all 24 drives, 1 as a
>> hot spare: zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare
>> c1t23d00
>
> Well there's one problem anyway. That's going to be horribly slow no
> matter what.

The other three commonly mentioned issues are:

 - Disable the Nagle algorithm on the windows clients.

 - Set the volume block size so that it matches the client filesystem
   block size (default is 128K!).

 - Check for an abnormally slow disk drive using 'iostat -xe'.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
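On the block size point: volblocksize can only be set when the zvol is created, so matching NTFS's default 4K cluster size would look roughly like this (a sketch based on the OP's original command, not something posted in the thread):

  # zfs create -V 10TB -o volblocksize=4k -o shareiscsi=on tank/volumes/fsrv1data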
Definitely use COMSTAR as Tim says.

At home I'm using 4x WD Caviar Blacks on an AMD Phenom X4 and only 2GB of RAM. I'm running snv132. No HBA - onboard SB700 SATA ports. I can, with IOmeter, saturate GigE from my WinXP laptop via iSCSI.

Can you toss the RAID controller aside and use motherboard SATA ports with just a few drives? That could help highlight whether it's the RAID controller or not, and even one drive has better throughput than you're seeing.

Cache, ZIL, and vdev tweaks are great - but you're not seeing any of those bottlenecks, I can assure you.

-marc

On 2/10/10, Tim Cook <tim at cook.ms> wrote:
> On Wed, Feb 10, 2010 at 4:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>> I am in the proof-of-concept phase of building a large ZFS/Solaris based
>> SAN box, and am experiencing absolutely poor / unusable performance.
>>
>> [...]
>
> You're extremely light on RAM for a system with 24TB of storage and two
> E5520's. I don't think it's the entire source of your issue, but I'd
> strongly suggest considering doubling what you have as a starting point.
>
> What version of OpenSolaris are you using? Have you considered using
> COMSTAR as your iSCSI target?
>
> --Tim

-- 
Sent from my mobile device
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
> On Wed, 10 Feb 2010, Frank Cusack wrote:
>
> The other three commonly mentioned issues are:
>
>  - Disable the Nagle algorithm on the windows clients.

for iSCSI? shouldn't be necessary.

>  - Set the volume block size so that it matches the client filesystem
>    block size (default is 128K!).

default for a zvol is 8 KiB.

>  - Check for an abnormally slow disk drive using 'iostat -xe'.

his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds before flushing the data to disk. tweaking the flush interval down might help.

>> An "iostat -xndz 1" readout of the %b column during a file copy to
>> the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
>> seconds of 100, and repeats.

what are the other values? ie., number of ops and actual amount of data read/written.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
How does lowering the flush interval help? If he can't ingress data fast enough, faster flushing is a Bad Thing(tm).

-marc

On 2/10/10, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>> The other three commonly mentioned issues are:
>> [...]
>
> his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
> before flushing the data to disk. tweaking the flush interval down
> might help.
>
> what are the other values? ie., number of ops and actual amount of data
> read/written.

-- 
Sent from my mobile device
On Wed, Feb 10, 2010 at 3:12 PM, Marc Nicholas <geekything at gmail.com> wrote:
> How does lowering the flush interval help? If he can't ingress data
> fast enough, faster flushing is a Bad Thing(tm).
>
> [...]

ZIL performance issues? Is writecache enabled on the LUNs?

-- 
Brent Jones
brent at servuhome.net
This is a Windows box, not a DB that flushes every write. The drives are capable of over 2000 IOPS (albeit with high latency as it's NCQ that gets you there) which would mean, even with sync flushes, 8-9MB/sec.

-marc

On 2/10/10, Brent Jones <brent at servuhome.net> wrote:
> [...]
>
> ZIL performance issues? Is writecache enabled on the LUNs?

-- 
Sent from my mobile device
On Wed, Feb 10, 2010 at 4:05 PM, Brent Jones <brent at servuhome.net> wrote:
> [...]
>
> ZIL performance issues? Is writecache enabled on the LUNs?

Also, are you using rdsk-based iSCSI LUNs, or file-based LUNs?

-- 
Brent Jones
brent at servuhome.net
[please don't top-post, please remove CC's, please trim quotes. it's really tedious to clean up your post to make it readable.]

Marc Nicholas <geekything at gmail.com> writes:
> Brent Jones <brent at servuhome.net> wrote:
>> Marc Nicholas <geekything at gmail.com> wrote:
>>> Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
>>>> his problem is "lazy" ZFS, notice how it gathers up data for 15
>>>> seconds before flushing the data to disk. tweaking the flush
>>>> interval down might help.
>>>
>>> How does lowering the flush interval help? If he can't ingress data
>>> fast enough, faster flushing is a Bad Thing(tm).

if network traffic is blocked during the flush, you can experience back-off on both the TCP and iSCSI level.

>>>> what are the other values? ie., number of ops and actual amount of
>>>> data read/written.

this remained unanswered.

>> ZIL performance issues? Is writecache enabled on the LUNs?
>
> This is a Windows box, not a DB that flushes every write.

have you checked if the iSCSI traffic is synchronous or not? I don't use Windows, but other reports on the list have indicated that at least the NTFS format operation *is* synchronous. use zilstat to see.

> The drives are capable of over 2000 IOPS (albeit with high latency as
> it's NCQ that gets you there) which would mean, even with sync flushes,
> 8-9MB/sec.

2000 IOPS is the aggregate, but the disks are set up as *one* RAID-Z2! NCQ doesn't help much, since the write operations issued by ZFS are already ordered correctly.

the OP may also want to try tweaking metaslab_df_free_pct, this helped linear write performance on our Linux clients a lot:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869229

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
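For anyone who wants to experiment with that tunable: it is a ZFS module global, so on builds of this era it can be set in /etc/system and picked up at the next reboot. The value below is only an example, not a recommendation:

  set zfs:metaslab_df_free_pct = 4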
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
> I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.
...
> From here, I discover the iscsi target on our Windows server 2008 R2 File server, and see the disk is attached in Disk Management. I initialize the 10TB disk fine, and begin to quick format it. Here is where I begin to see the poor performance issue. The Quick Format took about 45 minutes. And once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.

Did you actually make any progress on this?

I've seen exactly the same thing. Basically, terrible transfer rates with Windows and the server sitting there completely idle. We had support cases open with both Sun and Microsoft, which got nowhere.

This seems to me to be more a case of working out where the impedance mismatch is rather than a straightforward performance issue. In my case I could saturate the network from a Solaris client, but only maybe 2% from a Windows box. Yes, tweaking Nagle got us to almost 3%. Still nowhere near enough to make replacing our FC SAN with X4540s an attractive proposition.

(And I see that most of the other replies simply asserted that your zfs configuration was bad, without either having experienced this scenario or worked out that the actual delivered performance was an order of magnitude or two short of what even an admittedly sub-optimal configuration ought to have delivered.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
> On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>
> I've seen exactly the same thing. Basically, terrible transfer rates
> with Windows and the server sitting there completely idle.

I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.
-- 
This message posted from opensolaris.org
On 15 feb 2010, at 23.33, Bob Beverage wrote:
>> On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>> I've seen exactly the same thing. Basically, terrible transfer rates
>> with Windows and the server sitting there completely idle.
>
> I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.

Wasn't zvol changed a while ago from asynchronous to synchronous? Could that be it?

I don't understand that change at all - of course a zvol, with or without iscsi to access it, should behave exactly as a (not broken) disk, strictly obeying the protocol for write cache, cache flush etc. Having it entirely synchronous is in many cases almost as useless as having it asynchronous.

Just as zfs itself demands this from its disks, I believe it should provide this itself when used as storage for others. To me it seems that the zvol+iscsi functionality is not ready for production and needs more work. If anyone has any better explanation, please share it with me!

I guess a good slog could help a bit, especially if you have a bursty write load.

/ragge
On Feb 15, 2010, at 11:34 PM, Ragnar Sundblad wrote:
> On 15 feb 2010, at 23.33, Bob Beverage wrote:
>> I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.
>
> Wasn't zvol changed a while ago from asynchronous to synchronous? Could that be it?

Yes.

> I don't understand that change at all - of course a zvol, with or
> without iscsi to access it, should behave exactly as a (not broken)
> disk, strictly obeying the protocol for write cache, cache flush etc.
> Having it entirely synchronous is in many cases almost as useless
> as having it asynchronous.

There are two changes at work here, and OpenSolaris 2009.06 is in the middle of them -- and therefore is at the least optimal spot. You have the choice of moving to a later build, after b113, which has the proper fix.

> Just as zfs itself demands this from its disks, I believe it should
> provide this itself when used as storage for others. To me it seems
> that the zvol+iscsi functionality is not ready for production and
> needs more work. If anyone has any better explanation, please share
> it with me!

The fix is in Solaris 10 10/09 and the OpenStorage software. For some reason, this fix is not available in the OpenSolaris supported bug fixes. Perhaps someone from Oracle can shed light on that (non)decision? So until next month, you will need to use an OpenSolaris dev release after b113.

> I guess a good slog could help a bit, especially if you have a bursty
> write load.

Yes.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
Some more back story. I initially started with Solaris 10 u8, and was getting 40ish MB/s reads and 65-70 MB/s writes, which was still a far cry from the performance I was getting with OpenFiler. I decided to try OpenSolaris 2009.06, thinking that since it was more "state of the art & up to date" than mainline Solaris, perhaps there would be some performance tweaks or bug fixes which might bring performance closer to what I saw with OpenFiler. But then, on an untouched clean install of OpenSolaris 2009.06, I ran into something...else...apparently causing this far, far worse performance.

But, at the end of the day, this is quite a bomb: "A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance."

If I have to break 24 disks up into multiple vdevs to get the expected performance, that might be a deal breaker. To keep raidz2 redundancy, I would have to lose almost half of the available storage to get reasonable IO speeds.

Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 10u8 are in line with those limitations, and instead of fighting with whatever issue I have with this clean install of OpenSolaris, I reverted back to 10u8. I guess I'll just have to see if the speeds that Solaris iSCSI w/ ZFS is capable of are workable for what I want to do, and where the size sacrifice / performance acceptability point is.

Thanks for all the responses and help. First time posting here, and this looks like an excellent community.
-- 
This message posted from opensolaris.org
On Feb 16, 2010, at 9:44 AM, Brian E. Imhoff wrote:
> Some more back story. I initially started with Solaris 10 u8, and was getting 40ish MB/s reads and 65-70 MB/s writes, which was still a far cry from the performance I was getting with OpenFiler. I decided to try OpenSolaris 2009.06, thinking that since it was more "state of the art & up to date" than mainline Solaris. [...]

You thought a release dated 2009.06 was further along than a release dated 2009.10? :-)

CR 6794730 was fixed in April, 2009, after the freeze for the 2009.06 release, but before the freeze for 2009.10.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6794730

The schedule is published here, so you can see that there is a freeze now for the 2010.03 OpenSolaris release.
http://hub.opensolaris.org/bin/view/Community+Group+on/schedule

As they say in comedy, timing is everything :-(

> But, at the end of the day, this is quite a bomb: "A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance."

The context for this statement is small, random reads. 40 MB/sec of 8KB reads is 5,000 IOPS, or about 50 HDDs worth of small random reads @ 100 IOPS/disk, or one decent SSD.

> If I have to break 24 disks up into multiple vdevs to get the expected performance, that might be a deal breaker. To keep raidz2 redundancy, I would have to lose almost half of the available storage to get reasonable IO speeds.

Are your requirements for bandwidth or IOPS?

> Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 10u8 are in line with those limitations, and instead of fighting with whatever issue I have with this clean install of OpenSolaris, I reverted back to 10u8. [...]

In Solaris 10 you are stuck with the legacy iSCSI target code. In OpenSolaris, you have the option of using COMSTAR, which performs and scales better, as Roch describes here:
http://blogs.sun.com/roch/entry/iscsi_unleashed

> Thanks for all the responses and help. First time posting here, and this looks like an excellent community.

We try hard, and welcome the challenges :-)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
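For reference, switching to COMSTAR replaces shareiscsi=on with roughly the following sequence. This is a minimal sketch: the GUID comes from the sbdadm output, and real target/view setup is usually more involved.

  # svcadm enable stmf
  # svcadm enable -r svc:/network/iscsi/target:default
  # sbdadm create-lu /dev/zvol/rdsk/tank/volumes/fsrv1data
  # stmfadm add-view <GUID-from-sbdadm-output>
  # itadm create-target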
On Tue, Feb 16 at 9:44, Brian E. Imhoff wrote:
> But, at the end of the day, this is quite a bomb: "A single raidz2
> vdev has about as many IOs per second as a single disk, which could
> really hurt iSCSI performance."
>
> If I have to break 24 disks up into multiple vdevs to get the
> expected performance, that might be a deal breaker. To keep raidz2
> redundancy, I would have to lose almost half of the available
> storage to get reasonable IO speeds.

ZFS is quite flexible. You can put multiple vdevs in a pool, and dial your performance/redundancy just about wherever you want them. 24 disks could be:

  12x mirrored vdevs    (best random IO, 50% capacity, any 1 failure absorbed, up to 12 w/ limits)
  6x 4-disk raidz vdevs (75% capacity, any 1 failure absorbed, up to 6 with limits)
  4x 6-disk raidz vdevs (~83% capacity, any 1 failure absorbed, up to 4 with limits)
  4x 6-disk raidz2 vdevs (~66% capacity, any 2 failures absorbed, up to 8 with limits)
  1x 24-disk raidz2 vdev (~92% capacity, any 2 failures absorbed, worst random IO perf)
  etc.

I think the 4x 6-disk raidz2 vdev setup is quite commonly used with 24 disks available, but each application is different. We use mirror vdevs at work, with a separate box as a "live" backup using raidz of larger SATA drives.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
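A sketch of the mirrored layout Eric mentions, again with illustrative device names; each mirror pair is its own vdev, so random IOPS scale with the number of pairs:

  zpool create tank \
      mirror c1t0d0 c1t1d0 \
      mirror c1t2d0 c1t3d0 \
      mirror c1t4d0 c1t5d0 \
      ...
      mirror c1t22d0 c1t23d0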
Just wanted to add that I'm in the exact same boat - I'm connecting from a Windows system and getting just horrid iSCSI transfer speeds. I've tried updating to COMSTAR (although I'm not certain that I'm actually using it) to no avail, and I tried updating to the latest dev version of OpenSolaris. All that resulted from updating to the latest dev version was a completely broken system that I couldn't access the command line on. Fortunately I was able to roll back to the previous version and keep tinkering.

Anyone have any ideas as to what could really be causing this slowdown? I've got 5x 500GB Seagate Barracuda ES.2 drives that I'm using for my zpools, and I've done the following.

1 - zpool create data mirror c0t0d0 c0t1d0
2 - zfs create -s -V 600g data/iscsitarget
3 - sbdadm create-lu /dev/zvol/rdsk/data/iscsitarget
4 - stmfadm add-view xxxxxxxxxxxxxxxxxxxxxx

So I've got a 500GB RAID1 zpool, and I've created a 600GB sparse volume on top of it, shared it via iSCSI, and connected to it. Everything works stellar up until I copy files to it, then I get just sluggishness. I start to copy a file from my Windows 7 system to the iSCSI target, then pull up iostat using this command:

zpool iostat -v data 10

It shows me this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         895M   463G      0    666      0  7.93M
  mirror     895M   463G      0    666      0  7.93M
    c0t0d0      -      -      0    269      0  7.91M
    c0t1d0      -      -      0    272      0  7.93M
----------  -----  -----  -----  -----  -----  -----

So I figure, since ZFS is pretty sweet, how about I add some additional drives. That should bump up my performance. I execute this:

zpool add data mirror c1t0d0 c1t1d0

It adds it to my zpool, and I run iostat again, while the copy is still running.

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.17G   927G      0    738  1.58K  8.87M
  mirror    1.17G   463G      0    390  1.58K  4.61M
    c0t0d0      -      -      0    172  1.58K  4.61M
    c0t1d0      -      -      0    175      0  4.61M
  mirror    42.5K   464G      0    348      0  4.27M
    c1t0d0      -      -      0    156      0  4.27M
    c1t1d0      -      -      0    159      0  4.27M
----------  -----  -----  -----  -----  -----  -----

I get a whopping extra 1MB/sec by adding two drives. It fluctuates a lot, sometimes dropping down to 4MB/sec, sometimes rocketing all the way up to 20MB/sec, but nothing consistent. Basically, my transfer rates are the same no matter how many drives I add to the zpool. Is there anything I am missing on this?

BTW - "test" server specs:
AMD dual core 6000+
2GB RAM
Onboard SATA controller
Onboard Ethernet (gigabit)

I've got a very similar rig to the OP showing up next week (plus an infiniband card). I'd love to get this performing up to GB Ethernet speeds, otherwise I may have to abandon the iSCSI project if I can't get it to perform.
-- 
This message posted from opensolaris.org
On Wed, Feb 17, 2010 at 10:42 PM, Matt <registration at flash.shanje.com> wrote:
> I've got a very similar rig to the OP showing up next week (plus an infiniband card). I'd love to get this performing up to GB Ethernet speeds, otherwise I may have to abandon the iSCSI project if I can't get it to perform.

Do you have an SSD log device? If not, try disabling the ZIL temporarily to see if that helps. Your workload will likely benefit from a log device.

-- 
Brent Jones
brent at servuhome.net
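On builds of that vintage, disabling the ZIL was a system-wide kernel tunable rather than a per-dataset property (the sync property came later), so the test amounts to something like this, and is for benchmarking only, never production:

  # add to /etc/system and reboot
  set zfs:zil_disable = 1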
No SSD log device yet. I also tried disabling the ZIL, with no effect on performance.

Also - what's the best way to test local performance? I'm _somewhat_ dumb as far as opensolaris goes, so if you could provide me with an exact command line for testing my current setup (exactly as it appears above) I'd love to report the local I/O readings.
-- 
This message posted from opensolaris.org
Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).

http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4
http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002
-- 
This message posted from opensolaris.org
On Wed, Feb 17, 2010 at 11:03 PM, Matt <registration at flash.shanje.com> wrote:
> No SSD log device yet. I also tried disabling the ZIL, with no effect on performance.
>
> Also - what's the best way to test local performance? [...]

No one has said if they're using dsk, rdsk, or file-backed COMSTAR LUNs yet.

I'm using file-backed COMSTAR LUNs, with ZIL currently disabled. I can get between 100-200MB/sec, depending on random/sequential and block sizes.

Using dsk/rdsk, I was not able to see that level of performance at all.

-- 
Brent Jones
brent at servuhome.net
> No one has said if they're using dsk, rdsk, or file-backed COMSTAR LUNs yet.
> I'm using file-backed COMSTAR LUNs, with ZIL currently disabled.
> I can get between 100-200MB/sec, depending on random/sequential and block sizes.
>
> Using dsk/rdsk, I was not able to see that level of performance at all.
>
> -- 
> Brent Jones
> brent at servuhome.net

Hi, I find COMSTAR performance very low when using zvols under dsk; somehow using them under rdsk and letting COMSTAR handle the cache makes performance really good (disks/NICs become the limiting factor).

Yours
Markus Kovero
Hi Matt

Are you seeing low speeds on writes only, or on both read AND write? Are you seeing low speed just with iSCSI, or also with NFS or CIFS?

> I've tried updating to COMSTAR
> (although I'm not certain that I'm actually using it)

To check, do this:

# svcs -a | grep iscsi

If 'svc:/system/iscsitgt:default' is online, you are using the old & mature 'user mode' iscsi target.
If 'svc:/network/iscsi/target:default' is online, then you are using the new 'kernel mode' COMSTAR iscsi target.

For another good way to monitor disk i/o, try:

# iostat -xndz 1

http://docs.sun.com/app/docs/doc/819-2240/iostat-1m?a=view

Don't just assume that your Ethernet & IP & TCP layer are performing to the optimum - check it. I often use 'iperf' or 'netperf' to do this:
http://blogs.sun.com/observatory/entry/netperf
(Iperf is available by installing the SUNWiperf package. A package for netperf is in the contrib repository.)

The last time I checked, the default values used in the OpenSolaris TCP stack are not optimum for Gigabit speed, and need to be adjusted. Here is some advice I found with Google, but there are others:
http://serverfault.com/questions/13190/what-are-good-speeds-for-iscsi-and-nfs-over-1gb-ethernet

BTW, what sort of network card are you using, as this can make a difference.

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
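If the TCP defaults do turn out to be on the small side, they can be raised at run time with ndd; the values below are just an example, and they do not persist across reboots:

  # ndd -set /dev/tcp tcp_max_buf 4194304
  # ndd -set /dev/tcp tcp_xmit_hiwat 1048576
  # ndd -set /dev/tcp tcp_recv_hiwat 1048576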
Günther
2010-Feb-18 11:09 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

hello
there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3

new:
- bonnie benchmarks included (see screenshot: http://www.napp-it.org/bench.png)
- bug fixes

if you look at the benchmark screenshot:
- pool daten: zfs3 of 7 x wd 2TB raid edition (WD2002FYPS), dedup and compress enabled
- pool z3ssdcache: zfs3 of 4 x sas Seagate 15k (ST3146855SS), dedup and compress enabled + ssd read cache (supertalent ultradrive 64GB)

i was surprised about the sequential write/rewrite result. the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite. the 15k sas drives with ssd read cache perform 20x better (10MB/s -> 200 MB/s)!

download:
http://www.napp-it.org

howto setup:
http://www.napp-it.org/napp-it.pdf

gea
-- 
This message posted from opensolaris.org
Tomas Ögren
2010-Feb-18 11:16 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

On 18 February, 2010 - Günther sent me these 1,1K bytes:

> hello
> there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3
> [...]
> i was surprised about the sequential write/rewrite result.
> the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite.
> the 15k sas drives with ssd read cache perform 20x better (10MB/s -> 200 MB/s)!

Most probably due to lack of ram to hold the dedup tables, which your second version "fixes" with an l2arc. Try the same test without dedup or same l2arc in both, instead of comparing apples to canoes.

> download:
> http://www.napp-it.org
>
> howto setup:
> http://www.napp-it.org/napp-it.pdf
>
> gea

/Tomas
-- 
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Günther
2010-Feb-18 12:22 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

hello
my intention was to show how you can tune up a pool of drives (how much you can reach when using sas compared to 2 TB high capacity drives)

and now the other results with the same config and sas drives:

wd 2TB x 7, z3, dedup and compress on, no ssd:
  daten       12.6T  start 2010.02.17  8G  202 MB/s  83   10 MB/s   4  4.436 MB/s   5  135 MB/s  87  761 MB/s

sas 15k, 146GB x 4, z3, dedup and compress off, no ssd:
  z3nocache    544G  start 2010.02.18  8G   71 MB/s  31   84 MB/s  15     47 MB/s  13   87 MB/s  55  113 MB/s

sas 15k, 146GB x 4, z3, dedup and compress on, no ssd:
  z3nocache    544G  start 2010.02.18  8G  218 MB/s  99  410 MB/s  92    171 MB/s  50  148 MB/s  92  578 MB/s

sas 15k, 146GB x 4, z3, dedup and compress on + ssd read cache:
  z3cache      544G  start 2010.02.17  8G  172 MB/s  77  205 MB/s  40     95 MB/s  27  141 MB/s  90  546 MB/s

##################### result ##################################
all pools are zfs z3
sas are Seagate 15K drives, 146 GB

                             seq-write-ch  seq-write-block  rewrite    read-char  read-block
wd 2TB x 7                    202 MB/s        10 MB/s       4.4 MB/s   135 MB/s   761 MB/s
sas 15k x 4, no dedup:         71 MB/s        84 MB/s        47 MB/s    87 MB/s   113 MB/s
sas 15k x 4 + dedup + comp:   218 MB/s       410 MB/s       171 MB/s   148 MB/s   578 MB/s
sas 15k x 4 + dedup + ssd:    172 MB/s       205 MB/s        95 MB/s   141 MB/s   546 MB/s

conclusion:
if you need performance:
- use fast sas drives
- activate dedup and compress (if you have enough cpu power)
- ssd read cache is not important in the bonnie test
- high capacity drives do very well in reading and sequential writing
-- 
This message posted from opensolaris.org
On Wed, Feb 17, 2010 at 11:21:07PM -0800, Matt wrote:
> Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).

That looks like a sane combination. Please report how this particular setup performs, I'm quite curious. One question though:

> http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
> http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4

Just this one SAS adaptor? Are you connecting to the drive backplane with one cable for the 4 internal SAS connectors? Are you using SAS or SATA drives? Will you be filling up 24 slots with 2 TByte drives, and are you sure you won't be oversubscribed with just 4x SAS? And SSD, which drives are you using and in which mounts (internal or external caddies)?

> http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
> http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
> http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
This discussion is very timely, but I don't think we're done yet.

I've been working on using NexentaStor with Sun's VDI stack. The demo I've been playing with glues SunRays to VirtualBox instances using ZFS zvols over iSCSI for the boot image, with all the associated ZFS snapshot/clone goodness we all love so well. The supported config for the ZFS storage server is Solaris 10u7 or 10u8.

When I eventually got VDI going with NexentaStor (my value add), I found that some operations which only took 10 minutes with Solaris 10u8 were taking over an hour with NexentaStor. Using pfiles I found that iscsitgtd has the zvol open O_SYNC.

My hope is that COMSTAR is a lot more intelligent, and that it does indeed support DKIOCFLUSHWRITECACHE. However, if your iSCSI client expects all writes to be flushed synchronously, all the debate we've seen on this list about the new wcd=false option for rdsk zvols is moot (as using the option, when it is available, could result in data loss).

When you do iSCSI to other big-brand storage appliances, you generally have the benefit of NVRAM caching. As we all know, the same can be achieved with ZFS and an SSD "Logzilla". I didn't have one at hand, and I didn't think of disabling the ZIL (although some have reported that this only seems to help ZFS-hosted files, not zvols). Instead, since I didn't mind losing my data, for the sake of the experiment, I added a TMPFS "Logzilla" ...

# mkfile 4g /tmp/zilla
# zpool add vdipool log /tmp/zilla

WARNING: DON'T TRY THIS ON ZPOOLS YOU CARE ABOUT!

However, for the purposes of my experiment, it worked a treat, proving to me that an SSD "Logzilla" was the way ahead.

I think a lot of the angst in this thread is because "it used to work" (i.e. we used to get great iSCSI performance from zvols). But then Sun fixed a glaring bug (i.e. that zvols were unsafe for synchronous writes) and our world fell apart. Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).

Right now, it seems like an SSD "Logzilla" is needed if you want correctness and performance.

Phil

Harman Holistix - focusing on the detail and the big picture
Our holistic services include: performance health checks, system tuning, DTrace training, coding advice, developer assassinations
http://blogs.sun.com/pgdh (mothballed)
http://harmanholistix.com/mt (current)
http://linkedin.com/in/philharman
Responses inline:

> Hi Matt
> Are you seeing low speeds on writes only or on both read AND write?

Low speeds both reading and writing.

> Are you seeing low speed just with iSCSI or also with NFS or CIFS?

Haven't gotten NFS or CIFS to work properly. Maybe I'm just too dumb to figure it out, but I'm ending up with permissions errors that don't let me do much. All testing so far has been with iSCSI.

> To check, do this:
> # svcs -a | grep iscsi

It shows that I'm using the COMSTAR target.

> For another good way to monitor disk i/o, try:
> # iostat -xndz 1

Here's iostat while doing writes:

    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    1.0  256.9     3.0  2242.9   0.3   0.1     1.3     0.5  11  12  c0t0d0
    0.0  253.9     0.0  2242.9   0.3   0.1     1.0     0.4  10  11  c0t1d0
    1.0  253.9     2.5  2234.4   0.2   0.1     0.9     0.4   9  11  c1t0d0
    1.0  258.9     2.5  2228.9   0.3   0.1     1.3     0.5  12  13  c1t1d0

This shows about a 10-12% utilization of my gigabit network, as reported by Task Manager in Windows 7.

Here's iostat when doing reads:

                    extended device statistics
    r/s    w/s     kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  554.1    0.0  11256.8    0.0   3.8   0.7     6.8     1.3  68  70  c0t0d0
  749.1    0.0  11003.7    0.0   2.8   0.5     3.8     0.7  51  54  c0t1d0
  742.1    0.0  11333.4    0.0   2.9   0.5     3.9     0.7  51  49  c1t0d0
  736.1    0.0  11045.9    0.0   2.8   0.5     3.8     0.7  53  53  c1t1d0

Which gives me about 30% utilization. Another copy to the SAN yielded this result:

                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   15.1  314.2   883.9  4106.2   0.9   0.3     2.9     0.9  28  30  c0t0d0
   15.1  321.2   854.3  4106.2   0.9   0.3     2.7     0.8  26  26  c0t1d0
   28.1  315.2   916.5  4101.2   0.8   0.2     2.2     0.7  22  25  c1t0d0
   14.1  316.2   895.4  4097.2   0.9   0.3     2.7     0.8  26  27  c1t1d0

Which looks like writes held up at nearly 30% (doing multiple streams of data). Still not gigabit, but getting better. It also seems to be very hit-or-miss. It'll sustain 10-12% gigabit for a few minutes, have a little dip, jump up to 15% for a while, then back to 10%, then up to 20%, then up to 30%, then back down. I can't really make heads or tails of it.

> Don't just assume that your Ethernet & IP & TCP layer
> are performing to the optimum - check it.
> I often use 'iperf' or 'netperf' to do this:
> http://blogs.sun.com/observatory/entry/netperf

I'll look into this, I don't have either installed right now.

> The last time I checked, the default values used
> in the OpenSolaris TCP stack are not optimum
> for Gigabit speed, and need to be adjusted.
> [...]
> BTW, what sort of network card are you using,
> as this can make a difference.

Current NIC is an integrated NIC on an Abit Fatality motherboard. Just your generic fare gigabit network card. I can't imagine that it would be holding me back that much though.
-- 
This message posted from opensolaris.org
> One question though:
> Just this one SAS adaptor? Are you connecting to the drive backplane
> with one cable for the 4 internal SAS connectors? Are you using SAS
> or SATA drives? Will you be filling up 24 slots with 2 TByte drives,
> and are you sure you won't be oversubscribed with just 4x SAS? And
> SSD, which drives are you using and in which mounts (internal or
> external caddies)?

I'm just going to use the single 4x SAS. 1200MB/sec should be a great plenty for 24 drives total. I'm going to be mounting 2x SSD for ZIL and 2x SSD for ARC, then 20x 2TB drives. I'm guessing that with a random I/O workload, I'll never hit the 1200MB/sec peak that the 4x SAS can sustain.

Also - for the ZIL I will be using 2x 32GB Intel X25-E SLC drives, and for the ARC I'll be using 2x 160GB Intel X25-M MLC drives. I'm hoping that the cache will allow me to saturate gigabit and eventually InfiniBand.
-- 
This message posted from opensolaris.org
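Once the SSDs are in place, attaching them to an existing pool is a one-liner each; the pool and device names below are placeholders for wherever the X25-E and X25-M pairs show up:

  # zpool add tank log mirror c2t0d0 c2t1d0
  # zpool add tank cache c2t2d0 c2t3d0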
On Thu, Feb 18, 2010 at 10:49 AM, Matt <registration at flash.shanje.com> wrote:
> Here's iostat while doing writes:
>
> [...]
>
> This shows about a 10-12% utilization of my gigabit network, as reported by
> Task Manager in Windows 7.

Unless you are using SSDs (which I believe you're not), you're IOPS-bound on the drives IMHO. Writes are a better test of this than reads for cache reasons.

-marc
Also - still looking for the best way to test local performance - I'd love to make sure that the volume is actually able to perform at a level locally to saturate gigabit. If it can't do it internally, why should I expect it to work over GbE?
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2010-Feb-18 16:05 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

On Thu, 18 Feb 2010, Günther wrote:
> i was surprised about the sequential write/rewrite result.
> the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite.
> the 15k sas drives with ssd read cache perform 20x better (10MB/s -> 200 MB/s)!

Usually very poor re-write performance is an indication of insufficient RAM for caching combined with imperfect alignment between the written block size and the underlying zfs block size. There is no doubt that an enterprise SAS drive will smoke a high-capacity SATA "green" drive when it comes to update performance.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Run Bonnie++. You can install it with the Sun package manager and it'll appear under /usr/benchmarks/bonnie++

Look for the command line I posted a couple of days back for a decent set of flags to truly rate performance (using sync writes).

-marc

On Thu, Feb 18, 2010 at 11:05 AM, Matt <registration at flash.shanje.com> wrote:
> Also - still looking for the best way to test local performance - I'd love
> to make sure that the volume is actually able to perform at a level locally
> to saturate gigabit. If it can't do it internally, why should I expect it
> to work over GbE?
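Marc's exact command line isn't reproduced here, but a typical bonnie++ run with unbuffered (fsync-per-write) IO looks something like the following; the dataset path, size and user are placeholders, and the -s value should be at least twice the machine's RAM:

  # /usr/benchmarks/bonnie++/bonnie++ -d /data/bench -s 8g -u root -b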
Hi Matt

> Haven't gotten NFS or CIFS to work properly.
> Maybe I'm just too dumb to figure it out,
> but I'm ending up with permissions errors that don't let me do much.
> All testing so far has been with iSCSI.

So until you can test NFS or CIFS, we don't know if it's a general performance problem, or just an iSCSI problem. To get CIFS working, try this:
http://blogs.sun.com/observatory/entry/accessing_opensolaris_shares_from_windows

> Here's iostat while doing writes:
> Here's iostat when doing reads:

You're getting >1000 kr/s & kw/s, so add the iostat 'M' option to display throughput in megabytes per second.

> It'll sustain 10-12% gigabit for a few minutes, have a little dip,

I'd still be interested to see the size of the TCP buffers. What does this report:

# ndd /dev/tcp tcp_xmit_hiwat
# ndd /dev/tcp tcp_recv_hiwat
# ndd /dev/tcp tcp_conn_req_max_q
# ndd /dev/tcp tcp_conn_req_max_q0

> Current NIC is an integrated NIC on an Abit Fatality motherboard.
> Just your generic fare gigabit network card.
> I can't imagine that it would be holding me back that much though.

Well, there are sometimes bugs in the device drivers:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913756
http://sigtar.com/2009/02/12/opensolaris-rtl81118168b-issues/
That's why I say don't just assume the network is performing to the optimum.

To do a local test, direct to the hard drives, you could try 'dd', with various transfer sizes. Some advice from BenR, here:
http://www.cuddletech.com/blog/pivot/entry.php?id=820

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
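A quick local sequential sanity check along those lines might be the following; the file path and sizes are just examples, and writing zeros tells you nothing useful if compression or dedup is enabled on the dataset:

  # dd if=/dev/zero of=/data/ddtest bs=128k count=80000
  # dd if=/data/ddtest of=/dev/null bs=128k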
Another thing you could check, which has been reported to cause a problem, is whether the network or disk drivers share an interrupt with a slow device, like say a usb device. So try:
# echo ::interrupts -d | mdb -k
... and look for multiple driver names on an INT#. Regards Nigel Smith -- This message posted from opensolaris.org
On 18 feb 2010, at 13.55, Phil Harman wrote: ...
> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).

But are there any clients that assume that an iSCSI volume is synchronous? Isn't an iSCSI target supposed to behave like any other SCSI disk (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)? With that I mean: a disk which understands SCSI commands, with an optional write cache that can be turned off, with a cache sync command, and all those things. Put in another way, isn't it the OS/file system's responsibility to use the SCSI disk responsibly regardless of the underlying protocol? /ragge
On Feb 19, 2010, at 4:57 PM, Ragnar Sundblad <ragge at csc.kth.se> wrote:
>
> On 18 feb 2010, at 13.55, Phil Harman wrote:
>
> ...
>> Whilst the latest bug fixes put the world to rights again with
>> respect to correctness, it may be that some of our performance
>> workarounds are still unsafe (i.e. if my iSCSI client assumes all
>> writes are synchronised to nonvolatile storage, I'd better be
>> pretty sure of the failure modes before I work around that).
>
> But are there any clients that assume that an iSCSI volume is
> synchronous?
>
> Isn't an iSCSI target supposed to behave like any other SCSI disk
> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
> With that I mean: A disk which understands SCSI commands with an
> optional write cache that could be turned off, with cache sync
> command, and all those things.
> Put in another way, isn't it the OS/file system's responsibility to
> use the SCSI disk responsibly regardless of the underlying
> protocol?

That was my argument a while back.

If you use /dev/dsk then all writes should be asynchronous, WCE should be on, and the initiator should issue a 'sync' to make sure it's in NV storage; if you use /dev/rdsk all writes should be synchronous and WCE should be off. RCD should be off in all cases and the ARC should cache all it can.

Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the initiator flags write cache is the wrong way to go about it. It's more complicated than it needs to be and it leaves setting the storage policy up to the system admin rather than the storage admin.

It would be better to put effort into supporting FUA and DPO options in the target than dynamically changing a volume's cache policy from the initiator side.

-Ross
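For the record, COMSTAR does give the storage admin a per-LU knob for this. Something like the following should show and pin a logical unit's write-cache behaviour (from memory, so treat the property name and the GUID as assumptions and check stmfadm(1M) before relying on it):

# list logical units with their properties, including write-cache state
stmfadm list-lu -v

# set wcd (write cache disable) to true for one LU, so the target behaves
# synchronously regardless of what the initiator asks for; the GUID here is a placeholder
stmfadm modify-lu -p wcd=true 600144F0C73ABF0000004B7A00010001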
On 19/02/2010 21:57, Ragnar Sundblad wrote:
> On 18 feb 2010, at 13.55, Phil Harman wrote:
>
>> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
>>
> But are there any clients that assume that an iSCSI volume is synchronous?
>
> Isn't an iSCSI target supposed to behave like any other SCSI disk
> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
> With that I mean: A disk which understands SCSI commands with an
> optional write cache that could be turned off, with cache sync
> command, and all those things.
> Put in another way, isn't it the OS/file system's responsibility to
> use the SCSI disk responsibly regardless of the underlying
> protocol?
>
> /ragge
>

Yes, that would be nice wouldn't it? But the world is seldom that simple, is it? For example, Sun's first implementation of zvol was unsafe by default, with no cache flush option either.

A few years back we used to note that one of the reasons Solaris was slower than Linux at filesystem microbenchmarks was because Linux ran with the write caches on (whereas we would never be that foolhardy).

And then this seems to claim that NTFS may not be that smart either ...

http://blogs.sun.com/roch/entry/iscsi_unleashed

(see the WCE Settings paragraph)

I'm only going on what I've read.

Cheers, Phil
On 19 feb 2010, at 23.20, Ross Walker wrote:
> On Feb 19, 2010, at 4:57 PM, Ragnar Sundblad <ragge at csc.kth.se> wrote:
>
>>
>> On 18 feb 2010, at 13.55, Phil Harman wrote:
>>
>> ...
>>> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
>>
>> But are there any clients that assume that an iSCSI volume is synchronous?
>>
>> Isn't an iSCSI target supposed to behave like any other SCSI disk
>> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
>> With that I mean: A disk which understands SCSI commands with an
>> optional write cache that could be turned off, with cache sync
>> command, and all those things.
>> Put in another way, isn't it the OS/file system's responsibility to
>> use the SCSI disk responsibly regardless of the underlying
>> protocol?
>
> That was my argument a while back.
>
> If you use /dev/dsk then all writes should be asynchronous, WCE should be on, and the initiator should issue a 'sync' to make sure it's in NV storage; if you use /dev/rdsk all writes should be synchronous and WCE should be off. RCD should be off in all cases and the ARC should cache all it can.
>
> Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the initiator flags write cache is the wrong way to go about it. It's more complicated than it needs to be and it leaves setting the storage policy up to the system admin rather than the storage admin.
>
> It would be better to put effort into supporting FUA and DPO options in the target than dynamically changing a volume's cache policy from the initiator side.

But wouldn't the most disk-like behavior then be to implement all the FUA, DPO, cache mode page, flush cache, etc, etc, have COMSTAR implement a cache just like disks do, maybe have a user knob to set the cache size (typically 32 MB or so on modern disks, which could probably be used here too as a default), and still use /dev/rdsk devices? That could seem, in my naive limited little mind and humble opinion, a pretty good approximation of how real disks work, and no OS should have to be more surprised than usual at how a SCSI disk works. Maybe COMSTAR already does this, or parts of it? Or am I wrong?

/ragge
On 19 feb 2010, at 23.22, Phil Harman wrote:
> On 19/02/2010 21:57, Ragnar Sundblad wrote:
>> On 18 feb 2010, at 13.55, Phil Harman wrote:
>>
>>> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
>>>
>> But are there any clients that assume that an iSCSI volume is synchronous?
>>
>> Isn't an iSCSI target supposed to behave like any other SCSI disk
>> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
>> With that I mean: A disk which understands SCSI commands with an
>> optional write cache that could be turned off, with cache sync
>> command, and all those things.
>> Put in another way, isn't it the OS/file system's responsibility to
>> use the SCSI disk responsibly regardless of the underlying
>> protocol?
>>
>> /ragge
>>
>
> Yes, that would be nice wouldn't it? But the world is seldom that simple, is it? For example, Sun's first implementation of zvol was unsafe by default, with no cache flush option either.
>
> A few years back we used to note that one of the reasons Solaris was slower than Linux at filesystem microbenchmarks was because Linux ran with the write caches on (whereas we would never be that foolhardy).

(Exactly, and there is more "better fast than safe" evilness in that OS too, especially in the file system area. That is why I never use it for anything that should store anything.)

> And then this seems to claim that NTFS may not be that smart either ...
>
> http://blogs.sun.com/roch/entry/iscsi_unleashed
>
> (see the WCE Settings paragraph)
>
> I'm only going on what I've read.

But - all normal disks come with write caching enabled, so in both the Linux case and the NTFS case, this is how they always operate, with all disks, so why should an iSCSI lun behave any differently? If they can't handle the write cache (handle syncing, barriers, ordering and all that), they should turn the cache off, just as Solaris does in almost all cases except when you use an entire disk for zfs (I believe because Solaris UFS was never really adapted to write caches). And they should do that for all SCSI disks. (I seem to recall that in the bad old days you had to disable the write cache yourself if you were going to use a disk on SunOS, but that was probably because it wasn't standardized, and you did it with a jumper on the controller board.)

So - I just do not understand why an iSCSI lun should not try to emulate how all other SCSI disks work as much as possible? This must be the most compatible mode of operation, or am I wrong?

/ragge
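On the Solaris side, the per-drive write cache state can be checked (and toggled) from the expert mode of format(1M). From memory, something like this - the exact menu wording may differ, so treat it as a sketch:

# format -e
(select the disk, then)
format> cache
cache> write_cache
write_cache> display

ZFS enables the drive's write cache itself when it is given a whole disk, which is why whole-disk pools are the exception mentioned above.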
On Feb 18, 2010, at 4:55 AM, Phil Harman wrote:
> This discussion is very timely, but I don't think we're done yet. I've been working on using NexentaStor with Sun's VDI stack. The demo I've been playing with glues SunRays to VirtualBox instances using ZFS zvols over iSCSI for the boot image, with all the associated ZFS snapshot/clone goodness we all love so well.
>
> The supported config for the ZFS storage server is Solaris 10u7 or 10u8. When I eventually got VDI going with NexentaStor (my value add), I found that some operations which only took 10 minutes with Solaris 10u8 were taking over an hour with NexentaStor. Using pfiles I found that iscsitgtd has the zvol open O_SYNC.

You need the COMSTAR plugin for NexentaStor (no need to beat the dead horse :-)
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
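That pfiles check is a quick way to confirm the same thing on any box running the old user-space target: look at how the daemon has the backing zvol open and whether O_SYNC is in the flags. Roughly (the daemon name applies to the iscsitgtd stack; as far as I understand, COMSTAR handles the LU in-kernel, so this does not apply there):

# list the open files and flags of the iSCSI target daemon,
# then look for the zvol device entry and check whether O_SYNC appears on it
pfiles `pgrep -x iscsitgtd`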
>>>>> "rs" == Ragnar Sundblad <ragge at csc.kth.se> writes:rs> But are there any clients that assume that an iSCSI volume is rs> synchronous? there will probably be clients that might seem to implicitly make this assuption by mishandling the case where an iSCSI target goes away and then comes back (but comes back less whatever writes were in its write cache). Handling that case for NFS was complicated, and I bet such complexity is just missing without any equivalent from the iSCSI spec, but I could be wrong. I''d love to be educated. Even if there is some magical thing in iSCSI to handle it, the magic will be rarely used and often wrong until peopel learn how to test it, which they haven''t yet they way they have with NFS. yeah, of course, making all writes synchronous isn''t an ok way to fix this case because it''ll make iscsi way slower than non-iscsi alternatives. rs> Isn''t an iSCSI target supposed to behave like any other SCSI rs> disk (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)? With rs> that I mean: A disk which understands SCSI commands with an rs> optional write cache that could be turned off, with cache sync rs> command, and all those things. yeah, reboot a SAS disk without rebooting the host it''s attached to, and you may see some dropped writes showing up as mysterious checksum errors there as well. I bet disabling said SAS disk''s write cache will lessen/eliminate that problem. I think it''s become a stupid mess because everyone assumed long past the point where it became unreasonable that disks with mounted filesystems would not ever lose power unless the kernel with the mounted filesystem also lost power. rs> But - all normal disks come with write caching enabled, [...] rs> so why should an iSCSI lun behave any different? because normal disks usually don''t dump the contents of their write caches on the floor unless the kernel running the filesystem code also loses power at the same instant. This coincident kernel panic acts as a signal to the filesystem to expect some lost writes of the disks. It also lets the kernel take advantage of NFS server reboot recovery (asking NFS clients to replay some of their writes), and it''s an excuse to force-close any file a userland process might''ve had open on the filesystem, thus forcing those userland processes to go through their crash-recovery steps by replaying database logs and such. Over iSCSI it''s relatively common for a target to lose power and then come back without its write cache. but when iSCSI does it, now you are expected to soldier on without killing all userland processes. NFS probably could invoke its crash recovery state machine without an actual server reboot if it wanted to, but I bet it doesn''t currently know how, and that''s probably not the right fix because you''ve still got the userland processes problem. I agree with you iSCSI write cache needs to stay on, but there is probably broken shit all over the place from this. pre-ZFS iSCSI targets tend to have battery-backed NVRAM so they can be all-synchronous without demolishing performance and thus fix, or maybe just ease a little bit, this problem. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100222/3835e09a/attachment.bin>
On 22 feb 2010, at 21.28, Miles Nordin wrote:
>>>>>> "rs" == Ragnar Sundblad <ragge at csc.kth.se> writes:
>
> rs> But are there any clients that assume that an iSCSI volume is
> rs> synchronous?
>
> there will probably be clients that might seem to implicitly make this
> assumption by mishandling the case where an iSCSI target goes away and
> then comes back (but comes back less whatever writes were in its write
> cache). Handling that case for NFS was complicated, and I bet such
> complexity is just missing without any equivalent from the iSCSI spec,
> but I could be wrong. I'd love to be educated.

Yes, this area may very well be a mine field of bugs. But this is not a new phenomenon; it is the same with SAS, FC, USB, hot plug disks, and even eSATA (and I guess with CD/DVD drives also with SCSI with ATAPI (or rather SATAPI (does it have a name?))). I believe the correct way of handling this in all those cases would be having the old device instance fail, the file system being told about it, having all current operations fail and all open files be failed. When the disk comes back, it should get a new device instance, and it should have to be remounted. All files will have to be reopened. I hope no driver will just attach it again and happily continue without telling anyone/anything. But then again, crazier things have been coded...

> Even if there is some magical thing in iSCSI to handle it, the magic
> will be rarely used and often wrong until people learn how to test it,
> which they haven't yet the way they have with NFS.

I am not sure there is anything really magic or unusual about this, but I certainly agree that it is a typical thing that might not have been tested thoroughly enough.

/ragge
Miles Nordin <carton at Ivy.NET> writes:
> There will probably be clients that might seem to implicitly make this
> assumption by mishandling the case where an iSCSI target goes away and
> then comes back (but comes back less whatever writes were in its write
> cache). Handling that case for NFS was complicated, and I bet such
> complexity is just missing without any equivalent from the iSCSI spec,
> but I could be wrong. I'd love to be educated.
>
> Even if there is some magical thing in iSCSI to handle it, the magic
> will be rarely used and often wrong until people learn how to test it,
> which they haven't yet the way they have with NFS.

I decided I needed to read up on this and found RFC 3783, which is very readable, highly recommended:

http://tools.ietf.org/html/rfc3783

basically iSCSI just defines a reliable channel for SCSI. the SCSI layer handles the replaying of operations after a reboot or connection failure. as far as I understand it, anyway.
-- Kjetil T. Homme Redpill Linpro AS - Changing the game
>>>>> "kth" == Kjetil Torgrim Homme <kjetilho at linpro.no> writes:kth> basically iSCSI just defines a reliable channel for SCSI. pft. AIUI a lot of the complexity in real stacks is ancient protocol arcania for supporting multiple initiators and TCQ regardless of whther the physical target supports these things, multiple paths between a single target,initiator pair, and their weird SCTP-like notion that several physical SCSI targets ought to be combined into multiple LUN''s of a single virtual iSCSI target. I think the mapping from iSCSI to SCSI is not usually very direct. I have not dug into it though. kth> the SCSI layer handles the replaying of operations after a kth> reboot or connection failure. how? I do not think it is handled by SCSI layers, not for SAS nor iSCSI. Also, remember a write command that goes into the write cache is a SCSI command that''s succeeded, even though it''s not actually on disk for sure unless you can complete a sync cache command successfully and do so with no errors nor ``protocol events'''' in the gap between the successful write and the successful sync. A facility to replay failed commands won''t help because when a drive with write cache on reboots, successful writes are rolled back. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100223/2925a3de/attachment.bin>
Miles Nordin <carton at Ivy.NET> writes:
>>>>>> "kth" == Kjetil Torgrim Homme <kjetilho at linpro.no> writes:
>
> kth> the SCSI layer handles the replaying of operations after a
> kth> reboot or connection failure.
>
> how?
>
> I do not think it is handled by SCSI layers, not for SAS nor iSCSI.

sorry, I was inaccurate. error reporting is done by the SCSI layer, and the filesystem handles it by retrying whatever outstanding operations it has.

> Also, remember a write command that goes into the write cache is a
> SCSI command that's succeeded, even though it's not actually on disk
> for sure unless you can complete a sync cache command successfully and
> do so with no errors nor "protocol events" in the gap between the
> successful write and the successful sync. A facility to replay failed
> commands won't help because when a drive with write cache on reboots,
> successful writes are rolled back.

this is true, sorry about my lack of precision. the SCSI layer can't do this on its own.
-- Kjetil T. Homme Redpill Linpro AS - Changing the game