I am in the proof-of-concept phase of building a large ZFS/Solaris-based SAN box, and am experiencing absolutely poor / unusable performance. Where to begin...

The hardware setup:
Supermicro 4U 24-drive-bay chassis
Supermicro X8DT3 server motherboard
2x Xeon E5520 Nehalem 2.26 GHz quad-core CPUs
4GB memory
Intel EXPI9404PT 4-port gigabit server network card (used for iSCSI traffic only)
Adaptec 52445 28-port SATA/SAS RAID controller connected to
24x Western Digital WD1002FBYS 1TB enterprise drives.

I have configured the 24 drives as single simple volumes in the Adaptec RAID BIOS, and am presenting them to the OS as such.

I then create a zpool using raidz2, using all 24 drives, 1 as a hot spare:
zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00

Then create a volume store:
zfs create -o canmount=off tank/volumes

Then create a 10 TB volume to be presented to our file server:
zfs create -V 10TB -o shareiscsi=on tank/volumes/fsrv1data

From here, I discover the iSCSI target on our Windows Server 2008 R2 file server, and see the disk is attached in Disk Management. I initialize the 10TB disk fine, and begin to quick format it. Here is where I begin to see the poor performance issue. The quick format took about 45 minutes, and once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.

I have no clue what I could be doing wrong. To my knowledge, I followed the documentation for setting this up correctly, though I have not looked at any tuning guides beyond the first line saying you shouldn't need to do any of this because the people who picked these defaults know more about it than you.

Jumbo frames are enabled on both sides of the iSCSI path, as well as on the switch, and rx/tx buffers are increased to 2048 on both sides as well. I know this is not a hardware / iSCSI network issue: as another test, I installed Openfiler in a similar configuration (using hardware RAID) on this box, and was getting 350-450 MB/s from our file server.

An "iostat -xndz 1" readout of the %b column during a file copy to the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2 seconds at 100, and repeats.

Is there anything I need to do to get this usable? Or any additional information I can provide to help solve this problem? As nice as Openfiler is, it doesn't have ZFS, which is necessary to achieve our final goal.
-- 
This message posted from opensolaris.org
On Wed, Feb 10, 2010 at 17:06, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
> I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.
>
> I then create a zpool using raidz2, using all 24 drives, 1 as a hot spare:
> zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00

Create several smaller raidz2 vdevs, and consider adding a log device and/or cache devices. A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance.

zpool create tank raidz2 c1t0d0 c1t1d0 ... \
    raidz2 c1t5d0 c1t6d0 ... \
    etc

You might try, say, four 5-wide stripes with a spare, a mirrored log device, and a cache device. More memory wouldn't hurt anything, either.

Will
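As a concrete sketch of the layout Will suggests (four 5-wide raidz2 vdevs, a hot spare, a mirrored log and a cache device), something like this would do it; the device names are only illustrative, not taken from the original box:

  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
      raidz2 c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      raidz2 c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
      raidz2 c1t15d0 c1t16d0 c1t17d0 c1t18d0 c1t19d0 \
      spare c1t20d0 \
      log mirror c2t0d0 c2t1d0 \
      cache c2t2d0

That uses 20 drives for data vdevs plus a spare; the log and cache devices would normally be SSDs rather than more of the same spindles.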
On 2/10/10 2:06 PM -0800 Brian E. Imhoff wrote:
> I then create a zpool using raidz2, using all 24 drives, 1 as a
> hot spare: zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare
> c1t23d00

Well there's one problem anyway. That's going to be horribly slow no matter what.
On Wed, February 10, 2010 16:28, Will Murnane wrote:
> On Wed, Feb 10, 2010 at 17:06, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>> I am in the proof-of-concept phase of building a large ZFS/Solaris based
>> SAN box, and am experiencing absolutely poor / unusable performance.
>>
>> I then create a zpool using raidz2, using all 24 drives, 1 as a hot spare:
>> zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare c1t23d00
>
> Create several smaller raidz2 vdevs, and consider adding a log device
> and/or cache devices. A single raidz2 vdev has about as many IOs per
> second as a single disk, which could really hurt iSCSI performance.
>
> zpool create tank raidz2 c1t0d0 c1t1d0 ... \
>     raidz2 c1t5d0 c1t6d0 ... \
>     etc
>
> You might try, say, four 5-wide stripes with a spare, a mirrored log
> device, and a cache device. More memory wouldn't hurt anything,
> either.

That's useful general advice for increasing I/O, I think, but he clearly has something other than a "general" problem. Did you read the numbers he gave on his iSCSI performance? That can't be explained just by overly-large RAIDZ groups, I don't think.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
On Wed, Feb 10, 2010 at 4:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
> I am in the proof-of-concept phase of building a large ZFS/Solaris based
> SAN box, and am experiencing absolutely poor / unusable performance.
>
> [...]
>
> Is there anything I need to do to get this usable? Or any additional
> information I can provide to help solve this problem? As nice as Openfiler
> is, it doesn't have ZFS, which is necessary to achieve our final goal.

You're extremely light on RAM for a system with 24TB of storage and two E5520's. I don't think it's the entire source of your issue, but I'd strongly suggest considering doubling what you have as a starting point.

What version of OpenSolaris are you using? Have you considered using COMSTAR as your iSCSI target?

--Tim
On Wed, 10 Feb 2010, Frank Cusack wrote:
> On 2/10/10 2:06 PM -0800 Brian E. Imhoff wrote:
>> I then create a zpool using raidz2, using all 24 drives, 1 as a
>> hot spare: zpool create tank raidz2 c1t0d0 c1t1d0 [....] c1t22d0 spare
>> c1t23d00
>
> Well there's one problem anyway. That's going to be horribly slow no
> matter what.

The other three commonly mentioned issues are:

 - Disable the Nagle algorithm on the windows clients.

 - Set the volume block size so that it matches the client filesystem
   block size (default is 128K!).

 - Check for an abnormally slow disk drive using 'iostat -xe'.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
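On the block size point: volblocksize can only be set when the zvol is created, so matching NTFS's default 4K cluster size would look roughly like this (a sketch based on the OP's original command, not something posted in the thread):

  # zfs create -V 10TB -o volblocksize=4k -o shareiscsi=on tank/volumes/fsrv1data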
Definitely use COMSTAR as Tim says.

At home I'm using 4x WD Caviar Blacks on an AMD Phenom X4 and only 2GB of RAM. I'm running snv132. No HBA - onboard SB700 SATA ports. I can, with IOmeter, saturate GigE from my WinXP laptop via iSCSI.

Can you toss the RAID controller aside and use motherboard SATA ports with just a few drives? That could help highlight whether it's the RAID controller or not, and even one drive has better throughput than you're seeing.

Cache, ZIL, and vdev tweaks are great - but you're not seeing any of those bottlenecks, I can assure you.

-marc

On 2/10/10, Tim Cook <tim at cook.ms> wrote:
> On Wed, Feb 10, 2010 at 4:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>> I am in the proof-of-concept phase of building a large ZFS/Solaris based
>> SAN box, and am experiencing absolutely poor / unusable performance.
>>
>> [...]
>
> You're extremely light on RAM for a system with 24TB of storage and two
> E5520's. I don't think it's the entire source of your issue, but I'd
> strongly suggest considering doubling what you have as a starting point.
>
> What version of OpenSolaris are you using? Have you considered using
> COMSTAR as your iSCSI target?
>
> --Tim

-- 
Sent from my mobile device
Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
> On Wed, 10 Feb 2010, Frank Cusack wrote:
>
> The other three commonly mentioned issues are:
>
>  - Disable the Nagle algorithm on the windows clients.

for iSCSI? shouldn't be necessary.

>  - Set the volume block size so that it matches the client filesystem
>    block size (default is 128K!).

default for a zvol is 8 KiB.

>  - Check for an abnormally slow disk drive using 'iostat -xe'.

his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds before flushing the data to disk. tweaking the flush interval down might help.

>> An "iostat -xndz 1" readout of the %b column during a file copy to
>> the LUN shows maybe 10-15 seconds of %b at 0 for all disks, then 1-2
>> seconds of 100, and repeats.

what are the other values? ie., number of ops and actual amount of data read/written.

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
How does lowering the flush interval help? If he can't ingress data fast enough, faster flushing is a Bad Thing(tm).

-marc

On 2/10/10, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> Bob Friesenhahn <bfriesen at simple.dallas.tx.us> writes:
>> The other three commonly mentioned issues are:
>> [...]
>
> his problem is "lazy" ZFS, notice how it gathers up data for 15 seconds
> before flushing the data to disk. tweaking the flush interval down
> might help.
>
> what are the other values? ie., number of ops and actual amount of data
> read/written.

-- 
Sent from my mobile device
On Wed, Feb 10, 2010 at 3:12 PM, Marc Nicholas <geekything at gmail.com> wrote:
> How does lowering the flush interval help? If he can't ingress data
> fast enough, faster flushing is a Bad Thing(tm).
>
> [...]

ZIL performance issues? Is writecache enabled on the LUNs?

-- 
Brent Jones
brent at servuhome.net
This is a Windows box, not a DB that flushes every write. The drives are capable of over 2000 IOPS (albeit with high latency as it's NCQ that gets you there) which would mean, even with sync flushes, 8-9MB/sec.

-marc

On 2/10/10, Brent Jones <brent at servuhome.net> wrote:
> [...]
>
> ZIL performance issues? Is writecache enabled on the LUNs?

-- 
Sent from my mobile device
On Wed, Feb 10, 2010 at 4:05 PM, Brent Jones <brent at servuhome.net> wrote:
> [...]
>
> ZIL performance issues? Is writecache enabled on the LUNs?

Also, are you using rdsk-based iSCSI LUNs, or file-based LUNs?

-- 
Brent Jones
brent at servuhome.net
[please don't top-post, please remove CC's, please trim quotes. it's really tedious to clean up your post to make it readable.]

Marc Nicholas <geekything at gmail.com> writes:
> Brent Jones <brent at servuhome.net> wrote:
>> Marc Nicholas <geekything at gmail.com> wrote:
>>> Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
>>>> his problem is "lazy" ZFS, notice how it gathers up data for 15
>>>> seconds before flushing the data to disk. tweaking the flush
>>>> interval down might help.
>>>
>>> How does lowering the flush interval help? If he can't ingress data
>>> fast enough, faster flushing is a Bad Thing(tm).

if network traffic is blocked during the flush, you can experience back-off on both the TCP and iSCSI level.

>>>> what are the other values? ie., number of ops and actual amount of
>>>> data read/written.

this remained unanswered.

>> ZIL performance issues? Is writecache enabled on the LUNs?
>
> This is a Windows box, not a DB that flushes every write.

have you checked if the iSCSI traffic is synchronous or not? I don't use Windows, but other reports on the list have indicated that at least the NTFS format operation *is* synchronous. use zilstat to see.

> The drives are capable of over 2000 IOPS (albeit with high latency as
> it's NCQ that gets you there) which would mean, even with sync flushes,
> 8-9MB/sec.

2000 IOPS is the aggregate, but the disks are set up as *one* RAID-Z2! NCQ doesn't help much, since the write operations issued by ZFS are already ordered correctly.

the OP may also want to try tweaking metaslab_df_free_pct, this helped linear write performance on our Linux clients a lot:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6869229

-- 
Kjetil T. Homme
Redpill Linpro AS - Changing the game
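For anyone who wants to experiment with that tunable: it is a ZFS module global, so on builds of this era it can be set in /etc/system and picked up at the next reboot. The value below is only an example, not a recommendation:

  set zfs:metaslab_df_free_pct = 4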
On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
> I am in the proof-of-concept phase of building a large ZFS/Solaris based SAN box, and am experiencing absolutely poor / unusable performance.
...
> From here, I discover the iscsi target on our Windows server 2008 R2 File server, and see the disk is attached in Disk Management. I initialize the 10TB disk fine, and begin to quick format it. Here is where I begin to see the poor performance issue. The Quick Format took about 45 minutes. And once the disk is fully mounted, I get maybe 2-5 MB/s average to this disk.

Did you actually make any progress on this?

I've seen exactly the same thing. Basically, terrible transfer rates with Windows and the server sitting there completely idle. We had support cases open with both Sun and Microsoft, which got nowhere.

This seems to me to be more a case of working out where the impedance mismatch is rather than a straightforward performance issue. In my case I could saturate the network from a Solaris client, but only maybe 2% from a Windows box. Yes, tweaking Nagle got us to almost 3%. Still nowhere near enough to make replacing our FC SAN with X4540s an attractive proposition.

(And I see that most of the other replies simply asserted that your zfs configuration was bad, without either having experienced this scenario or worked out that the actual delivered performance was an order of magnitude or two short of what even an admittedly sub-optimal configuration ought to have delivered.)

-- 
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
> On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>
> I've seen exactly the same thing. Basically, terrible transfer rates
> with Windows and the server sitting there completely idle.

I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.
-- 
This message posted from opensolaris.org
On 15 feb 2010, at 23.33, Bob Beverage wrote:
>> On Wed, Feb 10, 2010 at 10:06 PM, Brian E. Imhoff <beimhoff at hotmail.com> wrote:
>> I've seen exactly the same thing. Basically, terrible transfer rates
>> with Windows and the server sitting there completely idle.
>
> I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.

Wasn't zvol changed a while ago from asynchronous to synchronous? Could that be it?

I don't understand that change at all - of course a zvol, with or without iscsi to access it, should behave exactly as a (not broken) disk, strictly obeying the protocol for write cache, cache flush etc. Having it entirely synchronous is in many cases almost as useless as having it asynchronous.

Just as zfs itself demands this from its disks, I believe it should provide this itself when used as storage for others. To me it seems that the zvol+iscsi functionality is not ready for production and needs more work. If anyone has any better explanation, please share it with me!

I guess a good slog could help a bit, especially if you have a bursty write load.

/ragge
On Feb 15, 2010, at 11:34 PM, Ragnar Sundblad wrote:
> On 15 feb 2010, at 23.33, Bob Beverage wrote:
>> I am also seeing this behaviour. It started somewhere around snv111 but I am not sure exactly when. I used to get 30-40MB/s transfers over cifs but at some point that dropped to roughly 7.5MB/s.
>
> Wasn't zvol changed a while ago from asynchronous to synchronous? Could that be it?

Yes.

> I don't understand that change at all - of course a zvol, with or
> without iscsi to access it, should behave exactly as a (not broken)
> disk, strictly obeying the protocol for write cache, cache flush etc.
> Having it entirely synchronous is in many cases almost as useless
> as having it asynchronous.

There are two changes at work here, and OpenSolaris 2009.06 is in the middle of them -- and therefore is at the least optimal spot. You have the choice of moving to a later build, after b113, which has the proper fix.

> Just as zfs itself demands this from its disks, I believe it should
> provide this itself when used as storage for others. To me it seems
> that the zvol+iscsi functionality is not ready for production and
> needs more work. If anyone has any better explanation, please share
> it with me!

The fix is in Solaris 10 10/09 and the OpenStorage software. For some reason, this fix is not available in the OpenSolaris supported bug fixes. Perhaps someone from Oracle can shed light on that (non)decision? So until next month, you will need to use an OpenSolaris dev release after b113.

> I guess a good slog could help a bit, especially if you have a bursty
> write load.

Yes.
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
Some more back story. I initially started with Solaris 10 u8, and was getting 40ish MB/s reads and 65-70 MB/s writes, which was still a far cry from the performance I was getting with OpenFiler. I decided to try OpenSolaris 2009.06, thinking that since it was more "state of the art & up to date" than mainline Solaris, perhaps there would be some performance tweaks or bug fixes which might bring performance closer to what I saw with OpenFiler. But then, on an untouched clean install of OpenSolaris 2009.06, I ran into something...else...apparently causing this far, far worse performance.

But, at the end of the day, this is quite a bomb: "A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance."

If I have to break 24 disks up into multiple vdevs to get the expected performance, that might be a deal breaker. To keep raidz2 redundancy, I would have to lose almost half of the available storage to get reasonable IO speeds.

Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 10u8 are in line with those limitations, and instead of fighting with whatever issue I have with this clean install of OpenSolaris, I reverted back to 10u8. I guess I'll just have to see if the speeds that Solaris iSCSI w/ ZFS is capable of are workable for what I want to do, and where the size sacrifice / performance acceptability point is.

Thanks for all the responses and help. First time posting here, and this looks like an excellent community.
-- 
This message posted from opensolaris.org
On Feb 16, 2010, at 9:44 AM, Brian E. Imhoff wrote:
> Some more back story. I initially started with Solaris 10 u8, and was getting 40ish MB/s reads and 65-70 MB/s writes, which was still a far cry from the performance I was getting with OpenFiler. I decided to try OpenSolaris 2009.06, thinking that since it was more "state of the art & up to date" than mainline Solaris. [...]

You thought a release dated 2009.06 was further along than a release dated 2009.10? :-)

CR 6794730 was fixed in April, 2009, after the freeze for the 2009.06 release, but before the freeze for 2009.10.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6794730

The schedule is published here, so you can see that there is a freeze now for the 2010.03 OpenSolaris release.
http://hub.opensolaris.org/bin/view/Community+Group+on/schedule

As they say in comedy, timing is everything :-(

> But, at the end of the day, this is quite a bomb: "A single raidz2 vdev has about as many IOs per second as a single disk, which could really hurt iSCSI performance."

The context for this statement is small, random reads. 40 MB/sec of 8KB reads is 5,000 IOPS, or about 50 HDDs worth of small random reads @ 100 IOPS/disk, or one decent SSD.

> If I have to break 24 disks up into multiple vdevs to get the expected performance, that might be a deal breaker. To keep raidz2 redundancy, I would have to lose almost half of the available storage to get reasonable IO speeds.

Are your requirements for bandwidth or IOPS?

> Now knowing about vdev IO limitations, I believe the speeds I saw with Solaris 10u8 are in line with those limitations, and instead of fighting with whatever issue I have with this clean install of OpenSolaris, I reverted back to 10u8. [...]

In Solaris 10 you are stuck with the legacy iSCSI target code. In OpenSolaris, you have the option of using COMSTAR, which performs and scales better, as Roch describes here:
http://blogs.sun.com/roch/entry/iscsi_unleashed

> Thanks for all the responses and help. First time posting here, and this looks like an excellent community.

We try hard, and welcome the challenges :-)
 -- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
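For reference, switching to COMSTAR replaces shareiscsi=on with roughly the following sequence. This is a minimal sketch: the GUID comes from the sbdadm output, and real target/view setup is usually more involved.

  # svcadm enable stmf
  # svcadm enable -r svc:/network/iscsi/target:default
  # sbdadm create-lu /dev/zvol/rdsk/tank/volumes/fsrv1data
  # stmfadm add-view <GUID-from-sbdadm-output>
  # itadm create-target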
On Tue, Feb 16 at 9:44, Brian E. Imhoff wrote:
> But, at the end of the day, this is quite a bomb: "A single raidz2
> vdev has about as many IOs per second as a single disk, which could
> really hurt iSCSI performance."
>
> If I have to break 24 disks up into multiple vdevs to get the
> expected performance, that might be a deal breaker. To keep raidz2
> redundancy, I would have to lose almost half of the available
> storage to get reasonable IO speeds.

ZFS is quite flexible. You can put multiple vdevs in a pool, and dial your performance/redundancy just about wherever you want them. 24 disks could be:

  12x mirrored vdevs    (best random IO, 50% capacity, any 1 failure absorbed, up to 12 w/ limits)
  6x 4-disk raidz vdevs (75% capacity, any 1 failure absorbed, up to 6 with limits)
  4x 6-disk raidz vdevs (~83% capacity, any 1 failure absorbed, up to 4 with limits)
  4x 6-disk raidz2 vdevs (~66% capacity, any 2 failures absorbed, up to 8 with limits)
  1x 24-disk raidz2 vdev (~92% capacity, any 2 failures absorbed, worst random IO perf)
  etc.

I think the 4x 6-disk raidz2 vdev setup is quite commonly used with 24 disks available, but each application is different. We use mirror vdevs at work, with a separate box as a "live" backup using raidz of larger SATA drives.

--eric

-- 
Eric D. Mudama
edmudama at mail.bounceswoosh.org
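A sketch of the mirrored layout Eric mentions, again with illustrative device names; each mirror pair is its own vdev, so random IOPS scale with the number of pairs:

  zpool create tank \
      mirror c1t0d0 c1t1d0 \
      mirror c1t2d0 c1t3d0 \
      mirror c1t4d0 c1t5d0 \
      ...
      mirror c1t22d0 c1t23d0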
Just wanted to add that I'm in the exact same boat - I'm connecting from a Windows system and getting just horrid iSCSI transfer speeds. I've tried updating to COMSTAR (although I'm not certain that I'm actually using it) to no avail, and I tried updating to the latest dev version of OpenSolaris. All that resulted from updating to the latest dev version was a completely broken system that I couldn't access the command line on. Fortunately I was able to roll back to the previous version and keep tinkering.

Anyone have any ideas as to what could really be causing this slowdown? I've got 5x 500GB Seagate Barracuda ES.2 drives that I'm using for my zpools, and I've done the following.

1 - zpool create data mirror c0t0d0 c0t1d0
2 - zfs create -s -V 600g data/iscsitarget
3 - sbdadm create-lu /dev/zvol/rdsk/data/iscsitarget
4 - stmfadm add-view xxxxxxxxxxxxxxxxxxxxxx

So I've got a 500GB RAID1 zpool, and I've created a 600GB sparse volume on top of it, shared it via iSCSI, and connected to it. Everything works stellar up until I copy files to it, then I get just sluggishness. I start to copy a file from my Windows 7 system to the iSCSI target, then pull up iostat using this command:

zpool iostat -v data 10

It shows me this:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data         895M   463G      0    666      0  7.93M
  mirror     895M   463G      0    666      0  7.93M
    c0t0d0      -      -      0    269      0  7.91M
    c0t1d0      -      -      0    272      0  7.93M
----------  -----  -----  -----  -----  -----  -----

So I figure, since ZFS is pretty sweet, how about I add some additional drives. That should bump up my performance. I execute this:

zpool add data mirror c1t0d0 c1t1d0

It adds it to my zpool, and I run iostat again, while the copy is still running.

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        1.17G   927G      0    738  1.58K  8.87M
  mirror    1.17G   463G      0    390  1.58K  4.61M
    c0t0d0      -      -      0    172  1.58K  4.61M
    c0t1d0      -      -      0    175      0  4.61M
  mirror    42.5K   464G      0    348      0  4.27M
    c1t0d0      -      -      0    156      0  4.27M
    c1t1d0      -      -      0    159      0  4.27M
----------  -----  -----  -----  -----  -----  -----

I get a whopping extra 1MB/sec by adding two drives. It fluctuates a lot, sometimes dropping down to 4MB/sec, sometimes rocketing all the way up to 20MB/sec, but nothing consistent. Basically, my transfer rates are the same no matter how many drives I add to the zpool. Is there anything I am missing on this?

BTW - "test" server specs:
AMD dual core 6000+
2GB RAM
Onboard SATA controller
Onboard Ethernet (gigabit)

I've got a very similar rig to the OP showing up next week (plus an infiniband card). I'd love to get this performing up to GB Ethernet speeds, otherwise I may have to abandon the iSCSI project if I can't get it to perform.
-- 
This message posted from opensolaris.org
On Wed, Feb 17, 2010 at 10:42 PM, Matt <registration at flash.shanje.com> wrote:
> I've got a very similar rig to the OP showing up next week (plus an infiniband card). I'd love to get this performing up to GB Ethernet speeds, otherwise I may have to abandon the iSCSI project if I can't get it to perform.

Do you have an SSD log device? If not, try disabling the ZIL temporarily to see if that helps. Your workload will likely benefit from a log device.

-- 
Brent Jones
brent at servuhome.net
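On builds of that vintage, disabling the ZIL was a system-wide kernel tunable rather than a per-dataset property (the sync property came later), so the test amounts to something like this, and is for benchmarking only, never production:

  # add to /etc/system and reboot
  set zfs:zil_disable = 1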
No SSD log device yet. I also tried disabling the ZIL, with no effect on performance.

Also - what's the best way to test local performance? I'm _somewhat_ dumb as far as opensolaris goes, so if you could provide me with an exact command line for testing my current setup (exactly as it appears above) I'd love to report the local I/O readings.
-- 
This message posted from opensolaris.org
Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).

http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4
http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002
-- 
This message posted from opensolaris.org
On Wed, Feb 17, 2010 at 11:03 PM, Matt <registration at flash.shanje.com> wrote:
> No SSD log device yet. I also tried disabling the ZIL, with no effect on performance.
>
> Also - what's the best way to test local performance? [...]

No one has said if they're using dsk, rdsk, or file-backed COMSTAR LUNs yet.

I'm using file-backed COMSTAR LUNs, with ZIL currently disabled. I can get between 100-200MB/sec, depending on random/sequential and block sizes.

Using dsk/rdsk, I was not able to see that level of performance at all.

-- 
Brent Jones
brent at servuhome.net
> No one has said if they're using dsk, rdsk, or file-backed COMSTAR LUNs yet.
> I'm using file-backed COMSTAR LUNs, with ZIL currently disabled.
> I can get between 100-200MB/sec, depending on random/sequential and block sizes.
>
> Using dsk/rdsk, I was not able to see that level of performance at all.
>
> -- 
> Brent Jones
> brent at servuhome.net

Hi, I find COMSTAR performance very low when using zvols under dsk; somehow using them under rdsk and letting COMSTAR handle the cache makes performance really good (disks/NICs become the limiting factor).

Yours
Markus Kovero
Hi Matt

Are you seeing low speeds on writes only, or on both read AND write? Are you seeing low speed just with iSCSI, or also with NFS or CIFS?

> I've tried updating to COMSTAR
> (although I'm not certain that I'm actually using it)

To check, do this:

# svcs -a | grep iscsi

If 'svc:/system/iscsitgt:default' is online, you are using the old & mature 'user mode' iscsi target.
If 'svc:/network/iscsi/target:default' is online, then you are using the new 'kernel mode' COMSTAR iscsi target.

For another good way to monitor disk i/o, try:

# iostat -xndz 1

http://docs.sun.com/app/docs/doc/819-2240/iostat-1m?a=view

Don't just assume that your Ethernet & IP & TCP layer are performing to the optimum - check it. I often use 'iperf' or 'netperf' to do this:
http://blogs.sun.com/observatory/entry/netperf
(Iperf is available by installing the SUNWiperf package. A package for netperf is in the contrib repository.)

The last time I checked, the default values used in the OpenSolaris TCP stack are not optimum for Gigabit speed, and need to be adjusted. Here is some advice I found with Google, but there are others:
http://serverfault.com/questions/13190/what-are-good-speeds-for-iscsi-and-nfs-over-1gb-ethernet

BTW, what sort of network card are you using, as this can make a difference.

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
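If the TCP defaults do turn out to be on the small side, they can be raised at run time with ndd; the values below are just an example, and they do not persist across reboots:

  # ndd -set /dev/tcp tcp_max_buf 4194304
  # ndd -set /dev/tcp tcp_xmit_hiwat 1048576
  # ndd -set /dev/tcp tcp_recv_hiwat 1048576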
Günther
2010-Feb-18 11:09 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

hello
there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3

new:
- bonnie benchmarks included (see screenshot: http://www.napp-it.org/bench.png)
- bug fixes

if you look at the benchmark screenshot:
- pool daten: zfs3 of 7 x wd 2TB raid edition (WD2002FYPS), dedup and compress enabled
- pool z3ssdcache: zfs3 of 4 x sas Seagate 15k (ST3146855SS), dedup and compress enabled + ssd read cache (supertalent ultradrive 64GB)

i was surprised about the sequential write/rewrite result. the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite. the 15k sas drives with ssd read cache perform 20x better (10MB/s -> 200 MB/s)!

download:
http://www.napp-it.org

howto setup:
http://www.napp-it.org/napp-it.pdf

gea
-- 
This message posted from opensolaris.org
Tomas Ögren
2010-Feb-18 11:16 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

On 18 February, 2010 - Günther sent me these 1,1K bytes:

> hello
> there is a new beta v. 0.220 of napp-it, the free webgui for nexenta(core) 3
> [...]
> i was surprised about the sequential write/rewrite result.
> the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite.
> the 15k sas drives with ssd read cache perform 20x better (10MB/s -> 200 MB/s)!

Most probably due to lack of ram to hold the dedup tables, which your second version "fixes" with an l2arc. Try the same test without dedup or same l2arc in both, instead of comparing apples to canoes.

> download:
> http://www.napp-it.org
>
> howto setup:
> http://www.napp-it.org/napp-it.pdf
>
> gea

/Tomas
-- 
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Günther
2010-Feb-18 12:22 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

hello
my intention was to show how you can tune up a pool of drives (how much you can reach when using sas compared to 2 TB high capacity drives)

and now the other results with the same config and sas drives:

wd 2TB x 7, z3, dedup and compress on, no ssd:
  daten       12.6T  start 2010.02.17  8G  202 MB/s  83   10 MB/s   4  4.436 MB/s   5  135 MB/s  87  761 MB/s

sas 15k, 146GB x 4, z3, dedup and compress off, no ssd:
  z3nocache    544G  start 2010.02.18  8G   71 MB/s  31   84 MB/s  15     47 MB/s  13   87 MB/s  55  113 MB/s

sas 15k, 146GB x 4, z3, dedup and compress on, no ssd:
  z3nocache    544G  start 2010.02.18  8G  218 MB/s  99  410 MB/s  92    171 MB/s  50  148 MB/s  92  578 MB/s

sas 15k, 146GB x 4, z3, dedup and compress on + ssd read cache:
  z3cache      544G  start 2010.02.17  8G  172 MB/s  77  205 MB/s  40     95 MB/s  27  141 MB/s  90  546 MB/s

##################### result ##################################
all pools are zfs z3
sas are Seagate 15K drives, 146 GB

                             seq-write-ch  seq-write-block  rewrite    read-char  read-block
wd 2TB x 7                    202 MB/s        10 MB/s       4.4 MB/s   135 MB/s   761 MB/s
sas 15k x 4, no dedup:         71 MB/s        84 MB/s        47 MB/s    87 MB/s   113 MB/s
sas 15k x 4 + dedup + comp:   218 MB/s       410 MB/s       171 MB/s   148 MB/s   578 MB/s
sas 15k x 4 + dedup + ssd:    172 MB/s       205 MB/s        95 MB/s   141 MB/s   546 MB/s

conclusion:
if you need performance:
- use fast sas drives
- activate dedup and compress (if you have enough cpu power)
- ssd read cache is not important in the bonnie test
- high capacity drives do very well in reading and sequential writing
-- 
This message posted from opensolaris.org
On Wed, Feb 17, 2010 at 11:21:07PM -0800, Matt wrote:
> Just out of curiosity - what Supermicro chassis did you get? I've got the following items shipping to me right now, with SSD drives and 2TB main drives coming as soon as the system boots and performs normally (using 8 extra 500GB Barracuda ES.2 drives as test drives).

That looks like a sane combination. Please report how this particular setup performs, I'm quite curious. One question though:

> http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043
> http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4

Just this one SAS adaptor? Are you connecting to the drive backplane with one cable for the 4 internal SAS connectors? Are you using SAS or SATA drives? Will you be filling up 24 slots with 2 TByte drives, and are you sure you won't be oversubscribed with just 4x SAS? And SSD, which drives are you using and in which mounts (internal or external caddies)?

> http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4
> http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187
> http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002

-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
This discussion is very timely, but I don't think we're done yet.

I've been working on using NexentaStor with Sun's VDI stack. The demo I've been playing with glues SunRays to VirtualBox instances using ZFS zvols over iSCSI for the boot image, with all the associated ZFS snapshot/clone goodness we all love so well. The supported config for the ZFS storage server is Solaris 10u7 or 10u8.

When I eventually got VDI going with NexentaStor (my value add), I found that some operations which only took 10 minutes with Solaris 10u8 were taking over an hour with NexentaStor. Using pfiles I found that iscsitgtd has the zvol open O_SYNC.

My hope is that COMSTAR is a lot more intelligent, and that it does indeed support DKIOCFLUSHWRITECACHE. However, if your iSCSI client expects all writes to be flushed synchronously, all the debate we've seen on this list about the new wcd=false option for rdsk zvols is moot (as using the option, when it is available, could result in data loss).

When you do iSCSI to other big-brand storage appliances, you generally have the benefit of NVRAM caching. As we all know, the same can be achieved with ZFS and an SSD "Logzilla". I didn't have one at hand, and I didn't think of disabling the ZIL (although some have reported that this only seems to help ZFS-hosted files, not zvols). Instead, since I didn't mind losing my data, for the sake of the experiment, I added a TMPFS "Logzilla" ...

# mkfile 4g /tmp/zilla
# zpool add vdipool log /tmp/zilla

WARNING: DON'T TRY THIS ON ZPOOLS YOU CARE ABOUT!

However, for the purposes of my experiment, it worked a treat, proving to me that an SSD "Logzilla" was the way ahead.

I think a lot of the angst in this thread is because "it used to work" (i.e. we used to get great iSCSI performance from zvols). But then Sun fixed a glaring bug (i.e. that zvols were unsafe for synchronous writes) and our world fell apart. Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).

Right now, it seems like an SSD "Logzilla" is needed if you want correctness and performance.

Phil

Harman Holistix - focusing on the detail and the big picture
Our holistic services include: performance health checks, system tuning, DTrace training, coding advice, developer assassinations
http://blogs.sun.com/pgdh (mothballed)
http://harmanholistix.com/mt (current)
http://linkedin.com/in/philharman
Responses inline:

> Hi Matt
> Are you seeing low speeds on writes only or on both read AND write?

Low speeds both reading and writing.

> Are you seeing low speed just with iSCSI or also with NFS or CIFS?

Haven't gotten NFS or CIFS to work properly. Maybe I'm just too dumb to figure it out, but I'm ending up with permissions errors that don't let me do much. All testing so far has been with iSCSI.

> To check, do this:
> # svcs -a | grep iscsi

It shows that I'm using the COMSTAR target.

> For another good way to monitor disk i/o, try:
> # iostat -xndz 1

Here's iostat while doing writes:

    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
    1.0  256.9     3.0  2242.9   0.3   0.1     1.3     0.5  11  12  c0t0d0
    0.0  253.9     0.0  2242.9   0.3   0.1     1.0     0.4  10  11  c0t1d0
    1.0  253.9     2.5  2234.4   0.2   0.1     0.9     0.4   9  11  c1t0d0
    1.0  258.9     2.5  2228.9   0.3   0.1     1.3     0.5  12  13  c1t1d0

This shows about a 10-12% utilization of my gigabit network, as reported by Task Manager in Windows 7.

Here's iostat when doing reads:

                    extended device statistics
    r/s    w/s     kr/s   kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
  554.1    0.0  11256.8    0.0   3.8   0.7     6.8     1.3  68  70  c0t0d0
  749.1    0.0  11003.7    0.0   2.8   0.5     3.8     0.7  51  54  c0t1d0
  742.1    0.0  11333.4    0.0   2.9   0.5     3.9     0.7  51  49  c1t0d0
  736.1    0.0  11045.9    0.0   2.8   0.5     3.8     0.7  53  53  c1t1d0

Which gives me about 30% utilization. Another copy to the SAN yielded this result:

                    extended device statistics
    r/s    w/s    kr/s    kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
   15.1  314.2   883.9  4106.2   0.9   0.3     2.9     0.9  28  30  c0t0d0
   15.1  321.2   854.3  4106.2   0.9   0.3     2.7     0.8  26  26  c0t1d0
   28.1  315.2   916.5  4101.2   0.8   0.2     2.2     0.7  22  25  c1t0d0
   14.1  316.2   895.4  4097.2   0.9   0.3     2.7     0.8  26  27  c1t1d0

Which looks like writes held up at nearly 30% (doing multiple streams of data). Still not gigabit, but getting better. It also seems to be very hit-or-miss. It'll sustain 10-12% gigabit for a few minutes, have a little dip, jump up to 15% for a while, then back to 10%, then up to 20%, then up to 30%, then back down. I can't really make heads or tails of it.

> Don't just assume that your Ethernet & IP & TCP layer
> are performing to the optimum - check it.
> I often use 'iperf' or 'netperf' to do this:
> http://blogs.sun.com/observatory/entry/netperf

I'll look into this, I don't have either installed right now.

> The last time I checked, the default values used
> in the OpenSolaris TCP stack are not optimum
> for Gigabit speed, and need to be adjusted.
> [...]
> BTW, what sort of network card are you using,
> as this can make a difference.

Current NIC is an integrated NIC on an Abit Fatality motherboard. Just your generic fare gigabit network card. I can't imagine that it would be holding me back that much though.
-- 
This message posted from opensolaris.org
> One question though:
> Just this one SAS adaptor? Are you connecting to the drive backplane
> with one cable for the 4 internal SAS connectors? Are you using SAS
> or SATA drives? Will you be filling up 24 slots with 2 TByte drives,
> and are you sure you won't be oversubscribed with just 4x SAS? And
> SSD, which drives are you using and in which mounts (internal or
> external caddies)?

I'm just going to use the single 4x SAS. 1200MB/sec should be a great plenty for 24 drives total. I'm going to be mounting 2x SSD for ZIL and 2x SSD for ARC, then 20x 2TB drives. I'm guessing that with a random I/O workload, I'll never hit the 1200MB/sec peak that the 4x SAS can sustain.

Also - for the ZIL I will be using 2x 32GB Intel X25-E SLC drives, and for the ARC I'll be using 2x 160GB Intel X25-M MLC drives. I'm hoping that the cache will allow me to saturate gigabit and eventually InfiniBand.
-- 
This message posted from opensolaris.org
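Once the SSDs are in place, attaching them to an existing pool is a one-liner each; the pool and device names below are placeholders for wherever the X25-E and X25-M pairs show up:

  # zpool add tank log mirror c2t0d0 c2t1d0
  # zpool add tank cache c2t2d0 c2t3d0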
On Thu, Feb 18, 2010 at 10:49 AM, Matt <registration at flash.shanje.com> wrote:
> Here's iostat while doing writes:
>
> [...]
>
> This shows about a 10-12% utilization of my gigabit network, as reported by
> Task Manager in Windows 7.

Unless you are using SSDs (which I believe you're not), you're IOPS-bound on the drives IMHO. Writes are a better test of this than reads for cache reasons.

-marc
Also - still looking for the best way to test local performance - I'd love to make sure that the volume is actually able to perform at a level locally to saturate gigabit. If it can't do it internally, why should I expect it to work over GbE?
-- 
This message posted from opensolaris.org
Bob Friesenhahn
2010-Feb-18 16:05 UTC
[zfs-discuss] Abysmal ISCSI / ZFS Performance - napp-it + benchmarks

On Thu, 18 Feb 2010, Günther wrote:
> i was surprised about the sequential write/rewrite result.
> the wd 2 TB drives perform very well only in sequential write of characters but are horribly bad in blockwise write/rewrite.
> the 15k sas drives with ssd read cache perform 20x better (10MB/s -> 200 MB/s)!

Usually very poor re-write performance is an indication of insufficient RAM for caching combined with imperfect alignment between the written block size and the underlying zfs block size. There is no doubt that an enterprise SAS drive will smoke a high-capacity SATA "green" drive when it comes to update performance.

Bob
-- 
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Run Bonnie++. You can install it with the Sun package manager and it'll appear under /usr/benchmarks/bonnie++

Look for the command line I posted a couple of days back for a decent set of flags to truly rate performance (using sync writes).

-marc

On Thu, Feb 18, 2010 at 11:05 AM, Matt <registration at flash.shanje.com> wrote:
> Also - still looking for the best way to test local performance - I'd love
> to make sure that the volume is actually able to perform at a level locally
> to saturate gigabit. If it can't do it internally, why should I expect it
> to work over GbE?
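Marc's exact command line isn't reproduced here, but a typical bonnie++ run with unbuffered (fsync-per-write) IO looks something like the following; the dataset path, size and user are placeholders, and the -s value should be at least twice the machine's RAM:

  # /usr/benchmarks/bonnie++/bonnie++ -d /data/bench -s 8g -u root -b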
Hi Matt

> Haven't gotten NFS or CIFS to work properly.
> Maybe I'm just too dumb to figure it out,
> but I'm ending up with permissions errors that don't let me do much.
> All testing so far has been with iSCSI.

So until you can test NFS or CIFS, we don't know if it's a general performance problem, or just an iSCSI problem. To get CIFS working, try this:
http://blogs.sun.com/observatory/entry/accessing_opensolaris_shares_from_windows

> Here's iostat while doing writes:
> Here's iostat when doing reads:

You're getting >1000 kr/s & kw/s, so add the iostat 'M' option to display throughput in megabytes per second.

> It'll sustain 10-12% gigabit for a few minutes, have a little dip,

I'd still be interested to see the size of the TCP buffers. What does this report:

# ndd /dev/tcp tcp_xmit_hiwat
# ndd /dev/tcp tcp_recv_hiwat
# ndd /dev/tcp tcp_conn_req_max_q
# ndd /dev/tcp tcp_conn_req_max_q0

> Current NIC is an integrated NIC on an Abit Fatality motherboard.
> Just your generic fare gigabit network card.
> I can't imagine that it would be holding me back that much though.

Well, there are sometimes bugs in the device drivers:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913756
http://sigtar.com/2009/02/12/opensolaris-rtl81118168b-issues/
That's why I say don't just assume the network is performing to the optimum.

To do a local test, direct to the hard drives, you could try 'dd', with various transfer sizes. Some advice from BenR, here:
http://www.cuddletech.com/blog/pivot/entry.php?id=820

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
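A quick local sequential sanity check along those lines might be the following; the file path and sizes are just examples, and writing zeros tells you nothing useful if compression or dedup is enabled on the dataset:

  # dd if=/dev/zero of=/data/ddtest bs=128k count=80000
  # dd if=/data/ddtest of=/dev/null bs=128k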
Another thing you could check, which has been reported to cause a problem, is whether the network or disk drivers share an interrupt with a slow device, like say a usb device. So try:
# echo ::interrupts -d | mdb -k
... and look for multiple driver names on an INT#. Regards Nigel Smith -- This message posted from opensolaris.org
On 18 feb 2010, at 13.55, Phil Harman wrote: ...
> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).

But are there any clients that assume that an iSCSI volume is synchronous? Isn't an iSCSI target supposed to behave like any other SCSI disk (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)? With that I mean: a disk which understands SCSI commands, with an optional write cache that can be turned off, with a cache sync command, and all those things. Put in another way, isn't it the OS/file system's responsibility to use the SCSI disk responsibly regardless of the underlying protocol? /ragge
On Feb 19, 2010, at 4:57 PM, Ragnar Sundblad <ragge at csc.kth.se> wrote:
>
> On 18 feb 2010, at 13.55, Phil Harman wrote:
>
> ...
>> Whilst the latest bug fixes put the world to rights again with
>> respect to correctness, it may be that some of our performance
>> workarounds are still unsafe (i.e. if my iSCSI client assumes all
>> writes are synchronised to nonvolatile storage, I'd better be
>> pretty sure of the failure modes before I work around that).
>
> But are there any clients that assume that an iSCSI volume is
> synchronous?
>
> Isn't an iSCSI target supposed to behave like any other SCSI disk
> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
> With that I mean: A disk which understands SCSI commands with an
> optional write cache that could be turned off, with cache sync
> command, and all those things.
> Put in another way, isn't it the OS/file system's responsibility to
> use the SCSI disk responsibly regardless of the underlying
> protocol?

That was my argument a while back.

If you use /dev/dsk then all writes should be asynchronous, WCE should be on, and the initiator should issue a 'sync' to make sure it's in NV storage; if you use /dev/rdsk all writes should be synchronous and WCE should be off. RCD should be off in all cases and the ARC should cache all it can.

Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the initiator flags write cache is the wrong way to go about it. It's more complicated than it needs to be and it leaves setting the storage policy up to the system admin rather than the storage admin.

It would be better to put effort into supporting FUA and DPO options in the target than dynamically changing a volume's cache policy from the initiator side.

-Ross
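For the record, COMSTAR does give the storage admin a per-LU knob for this. Something like the following should show and pin a logical unit's write-cache behaviour (from memory, so treat the property name and the GUID as assumptions and check stmfadm(1M) before relying on it):

# list logical units with their properties, including write-cache state
stmfadm list-lu -v

# set wcd (write cache disable) to true for one LU, so the target behaves
# synchronously regardless of what the initiator asks for; the GUID here is a placeholder
stmfadm modify-lu -p wcd=true 600144F0C73ABF0000004B7A00010001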
On 19/02/2010 21:57, Ragnar Sundblad wrote:
> On 18 feb 2010, at 13.55, Phil Harman wrote:
>
>> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
>>
> But are there any clients that assume that an iSCSI volume is synchronous?
>
> Isn't an iSCSI target supposed to behave like any other SCSI disk
> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
> With that I mean: A disk which understands SCSI commands with an
> optional write cache that could be turned off, with cache sync
> command, and all those things.
> Put in another way, isn't it the OS/file system's responsibility to
> use the SCSI disk responsibly regardless of the underlying
> protocol?
>
> /ragge
>

Yes, that would be nice wouldn't it? But the world is seldom that simple, is it? For example, Sun's first implementation of zvol was unsafe by default, with no cache flush option either.

A few years back we used to note that one of the reasons Solaris was slower than Linux at filesystem microbenchmarks was because Linux ran with the write caches on (whereas we would never be that foolhardy).

And then this seems to claim that NTFS may not be that smart either ...

http://blogs.sun.com/roch/entry/iscsi_unleashed

(see the WCE Settings paragraph)

I'm only going on what I've read.

Cheers, Phil
On 19 feb 2010, at 23.20, Ross Walker wrote:
> On Feb 19, 2010, at 4:57 PM, Ragnar Sundblad <ragge at csc.kth.se> wrote:
>
>>
>> On 18 feb 2010, at 13.55, Phil Harman wrote:
>>
>> ...
>>> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
>>
>> But are there any clients that assume that an iSCSI volume is synchronous?
>>
>> Isn't an iSCSI target supposed to behave like any other SCSI disk
>> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
>> With that I mean: A disk which understands SCSI commands with an
>> optional write cache that could be turned off, with cache sync
>> command, and all those things.
>> Put in another way, isn't it the OS/file system's responsibility to
>> use the SCSI disk responsibly regardless of the underlying
>> protocol?
>
> That was my argument a while back.
>
> If you use /dev/dsk then all writes should be asynchronous, WCE should be on, and the initiator should issue a 'sync' to make sure it's in NV storage; if you use /dev/rdsk all writes should be synchronous and WCE should be off. RCD should be off in all cases and the ARC should cache all it can.
>
> Making COMSTAR always start with /dev/rdsk and flip to /dev/dsk if the initiator flags write cache is the wrong way to go about it. It's more complicated than it needs to be and it leaves setting the storage policy up to the system admin rather than the storage admin.
>
> It would be better to put effort into supporting FUA and DPO options in the target than dynamically changing a volume's cache policy from the initiator side.

But wouldn't the most disk-like behavior then be to implement all the FUA, DPO, cache mode page, flush cache, etc, etc, have COMSTAR implement a cache just like disks do, maybe have a user knob to set the cache size (typically 32 MB or so on modern disks, which could probably be used here too as a default), and still use /dev/rdsk devices? That could seem, in my naive limited little mind and humble opinion, a pretty good approximation of how real disks work, and no OS should have to be more surprised than usual at how a SCSI disk works. Maybe COMSTAR already does this, or parts of it? Or am I wrong?

/ragge
On 19 feb 2010, at 23.22, Phil Harman wrote:
> On 19/02/2010 21:57, Ragnar Sundblad wrote:
>> On 18 feb 2010, at 13.55, Phil Harman wrote:
>>
>>> Whilst the latest bug fixes put the world to rights again with respect to correctness, it may be that some of our performance workarounds are still unsafe (i.e. if my iSCSI client assumes all writes are synchronised to nonvolatile storage, I'd better be pretty sure of the failure modes before I work around that).
>>>
>> But are there any clients that assume that an iSCSI volume is synchronous?
>>
>> Isn't an iSCSI target supposed to behave like any other SCSI disk
>> (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)?
>> With that I mean: A disk which understands SCSI commands with an
>> optional write cache that could be turned off, with cache sync
>> command, and all those things.
>> Put in another way, isn't it the OS/file system's responsibility to
>> use the SCSI disk responsibly regardless of the underlying
>> protocol?
>>
>> /ragge
>>
>
> Yes, that would be nice wouldn't it? But the world is seldom that simple, is it? For example, Sun's first implementation of zvol was unsafe by default, with no cache flush option either.
>
> A few years back we used to note that one of the reasons Solaris was slower than Linux at filesystem microbenchmarks was because Linux ran with the write caches on (whereas we would never be that foolhardy).

(Exactly, and there is more "better fast than safe" evilness in that OS too, especially in the file system area. That is why I never use it for anything that should store anything.)

> And then this seems to claim that NTFS may not be that smart either ...
>
> http://blogs.sun.com/roch/entry/iscsi_unleashed
>
> (see the WCE Settings paragraph)
>
> I'm only going on what I've read.

But - all normal disks come with write caching enabled, so in both the Linux case and the NTFS case, this is how they always operate, with all disks, so why should an iSCSI lun behave any differently? If they can't handle the write cache (handle syncing, barriers, ordering and all that), they should turn the cache off, just as Solaris does in almost all cases except when you use an entire disk for zfs (I believe because Solaris UFS was never really adapted to write caches). And they should do that for all SCSI disks. (I seem to recall that in the bad old days you had to disable the write cache yourself if you were going to use a disk on SunOS, but that was probably because it wasn't standardized, and you did it with a jumper on the controller board.)

So - I just do not understand why an iSCSI lun should not try to emulate how all other SCSI disks work as much as possible? This must be the most compatible mode of operation, or am I wrong?

/ragge
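On the Solaris side, the per-drive write cache state can be checked (and toggled) from the expert mode of format(1M). From memory, something like this - the exact menu wording may differ, so treat it as a sketch:

# format -e
(select the disk, then)
format> cache
cache> write_cache
write_cache> display

ZFS enables the drive's write cache itself when it is given a whole disk, which is why whole-disk pools are the exception mentioned above.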
On Feb 18, 2010, at 4:55 AM, Phil Harman wrote:
> This discussion is very timely, but I don't think we're done yet. I've been working on using NexentaStor with Sun's VDI stack. The demo I've been playing with glues SunRays to VirtualBox instances using ZFS zvols over iSCSI for the boot image, with all the associated ZFS snapshot/clone goodness we all love so well.
>
> The supported config for the ZFS storage server is Solaris 10u7 or 10u8. When I eventually got VDI going with NexentaStor (my value add), I found that some operations which only took 10 minutes with Solaris 10u8 were taking over an hour with NexentaStor. Using pfiles I found that iscsitgtd has the zvol open O_SYNC.

You need the COMSTAR plugin for NexentaStor (no need to beat the dead horse :-)
-- richard

ZFS storage and performance consulting at http://www.RichardElling.com
ZFS training on deduplication, NexentaStor, and NAS performance
http://nexenta-atlanta.eventbrite.com (March 15-17, 2010)
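That pfiles check is a quick way to confirm the same thing on any box running the old user-space target: look at how the daemon has the backing zvol open and whether O_SYNC is in the flags. Roughly (the daemon name applies to the iscsitgtd stack; as far as I understand, COMSTAR handles the LU in-kernel, so this does not apply there):

# list the open files and flags of the iSCSI target daemon,
# then look for the zvol device entry and check whether O_SYNC appears on it
pfiles `pgrep -x iscsitgtd`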
>>>>> "rs" == Ragnar Sundblad <ragge at csc.kth.se> writes:rs> But are there any clients that assume that an iSCSI volume is rs> synchronous? there will probably be clients that might seem to implicitly make this assuption by mishandling the case where an iSCSI target goes away and then comes back (but comes back less whatever writes were in its write cache). Handling that case for NFS was complicated, and I bet such complexity is just missing without any equivalent from the iSCSI spec, but I could be wrong. I''d love to be educated. Even if there is some magical thing in iSCSI to handle it, the magic will be rarely used and often wrong until peopel learn how to test it, which they haven''t yet they way they have with NFS. yeah, of course, making all writes synchronous isn''t an ok way to fix this case because it''ll make iscsi way slower than non-iscsi alternatives. rs> Isn''t an iSCSI target supposed to behave like any other SCSI rs> disk (pSCSI, SAS, FC, USB MSC, SSA, ATAPI, FW SBP...)? With rs> that I mean: A disk which understands SCSI commands with an rs> optional write cache that could be turned off, with cache sync rs> command, and all those things. yeah, reboot a SAS disk without rebooting the host it''s attached to, and you may see some dropped writes showing up as mysterious checksum errors there as well. I bet disabling said SAS disk''s write cache will lessen/eliminate that problem. I think it''s become a stupid mess because everyone assumed long past the point where it became unreasonable that disks with mounted filesystems would not ever lose power unless the kernel with the mounted filesystem also lost power. rs> But - all normal disks come with write caching enabled, [...] rs> so why should an iSCSI lun behave any different? because normal disks usually don''t dump the contents of their write caches on the floor unless the kernel running the filesystem code also loses power at the same instant. This coincident kernel panic acts as a signal to the filesystem to expect some lost writes of the disks. It also lets the kernel take advantage of NFS server reboot recovery (asking NFS clients to replay some of their writes), and it''s an excuse to force-close any file a userland process might''ve had open on the filesystem, thus forcing those userland processes to go through their crash-recovery steps by replaying database logs and such. Over iSCSI it''s relatively common for a target to lose power and then come back without its write cache. but when iSCSI does it, now you are expected to soldier on without killing all userland processes. NFS probably could invoke its crash recovery state machine without an actual server reboot if it wanted to, but I bet it doesn''t currently know how, and that''s probably not the right fix because you''ve still got the userland processes problem. I agree with you iSCSI write cache needs to stay on, but there is probably broken shit all over the place from this. pre-ZFS iSCSI targets tend to have battery-backed NVRAM so they can be all-synchronous without demolishing performance and thus fix, or maybe just ease a little bit, this problem. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100222/3835e09a/attachment.bin>
On 22 feb 2010, at 21.28, Miles Nordin wrote:
>>>>>> "rs" == Ragnar Sundblad <ragge at csc.kth.se> writes:
>
> rs> But are there any clients that assume that an iSCSI volume is
> rs> synchronous?
>
> there will probably be clients that might seem to implicitly make this
> assumption by mishandling the case where an iSCSI target goes away and
> then comes back (but comes back less whatever writes were in its write
> cache). Handling that case for NFS was complicated, and I bet such
> complexity is just missing without any equivalent from the iSCSI spec,
> but I could be wrong. I'd love to be educated.

Yes, this area may very well be a mine field of bugs. But this is not a new phenomenon; it is the same with SAS, FC, USB, hot plug disks, and even eSATA (and I guess with CD/DVD drives also with SCSI with ATAPI (or rather SATAPI (does it have a name?))). I believe the correct way of handling this in all those cases would be having the old device instance fail, the file system being told about it, having all current operations fail and all open files be failed. When the disk comes back, it should get a new device instance, and it should have to be remounted. All files will have to be reopened. I hope no driver will just attach it again and happily continue without telling anyone/anything. But then again, crazier things have been coded...

> Even if there is some magical thing in iSCSI to handle it, the magic
> will be rarely used and often wrong until people learn how to test it,
> which they haven't yet the way they have with NFS.

I am not sure there is anything really magic or unusual about this, but I certainly agree that it is a typical thing that might not have been tested thoroughly enough.

/ragge
Miles Nordin <carton at Ivy.NET> writes:
> There will probably be clients that might seem to implicitly make this
> assumption by mishandling the case where an iSCSI target goes away and
> then comes back (but comes back less whatever writes were in its write
> cache). Handling that case for NFS was complicated, and I bet such
> complexity is just missing without any equivalent from the iSCSI spec,
> but I could be wrong. I'd love to be educated.
>
> Even if there is some magical thing in iSCSI to handle it, the magic
> will be rarely used and often wrong until people learn how to test it,
> which they haven't yet the way they have with NFS.

I decided I needed to read up on this and found RFC 3783, which is very readable, highly recommended:

http://tools.ietf.org/html/rfc3783

basically iSCSI just defines a reliable channel for SCSI. the SCSI layer handles the replaying of operations after a reboot or connection failure. as far as I understand it, anyway.
-- Kjetil T. Homme Redpill Linpro AS - Changing the game
>>>>> "kth" == Kjetil Torgrim Homme <kjetilho at linpro.no> writes:kth> basically iSCSI just defines a reliable channel for SCSI. pft. AIUI a lot of the complexity in real stacks is ancient protocol arcania for supporting multiple initiators and TCQ regardless of whther the physical target supports these things, multiple paths between a single target,initiator pair, and their weird SCTP-like notion that several physical SCSI targets ought to be combined into multiple LUN''s of a single virtual iSCSI target. I think the mapping from iSCSI to SCSI is not usually very direct. I have not dug into it though. kth> the SCSI layer handles the replaying of operations after a kth> reboot or connection failure. how? I do not think it is handled by SCSI layers, not for SAS nor iSCSI. Also, remember a write command that goes into the write cache is a SCSI command that''s succeeded, even though it''s not actually on disk for sure unless you can complete a sync cache command successfully and do so with no errors nor ``protocol events'''' in the gap between the successful write and the successful sync. A facility to replay failed commands won''t help because when a drive with write cache on reboots, successful writes are rolled back. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100223/2925a3de/attachment.bin>
Miles Nordin <carton at Ivy.NET> writes:
>>>>>> "kth" == Kjetil Torgrim Homme <kjetilho at linpro.no> writes:
>
> kth> the SCSI layer handles the replaying of operations after a
> kth> reboot or connection failure.
>
> how?
>
> I do not think it is handled by SCSI layers, not for SAS nor iSCSI.

sorry, I was inaccurate. error reporting is done by the SCSI layer, and the filesystem handles it by retrying whatever outstanding operations it has.

> Also, remember a write command that goes into the write cache is a
> SCSI command that's succeeded, even though it's not actually on disk
> for sure unless you can complete a sync cache command successfully and
> do so with no errors nor "protocol events" in the gap between the
> successful write and the successful sync. A facility to replay failed
> commands won't help because when a drive with write cache on reboots,
> successful writes are rolled back.

this is true, sorry about my lack of precision. the SCSI layer can't do this on its own.
-- Kjetil T. Homme Redpill Linpro AS - Changing the game