Dear all,
I've set up the following scenario:

Galaxy 4200 running OpenSolaris build 59 as iSCSI target; the remaining
disk space of the two internal drives, 90GB in total, is used as a zpool
for the two 32GB volumes "exported" via iSCSI.

The initiator is an up-to-date Solaris 10 11/06 x86 box using the
above-mentioned volumes as disks for a local zpool.

I then started rsync to copy about 1GB of data in several thousand
files. During the operation I took the network interface on the iSCSI
target down, which resulted in no more disk I/O on that server. On the
other hand, the client happily dumped data into the ZFS cache, actually
completing the entire copy operation.

Now the big question: we plan to use this kind of setup for email and
other important services, so what happens if the client crashes while
the network is down? Does it mean that all the data in the cache is
gone forever?

If so, is this a transport-independent problem which can also happen if
ZFS used Fibre Channel attached drives instead of iSCSI devices?

Thanks for your help
Thomas

-----------------------------------------------------------------
GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
Thomas Nau writes:
> Dear all,
> I've set up the following scenario:
>
> Galaxy 4200 running OpenSolaris build 59 as iSCSI target; the remaining
> disk space of the two internal drives, 90GB in total, is used as a zpool
> for the two 32GB volumes "exported" via iSCSI.
>
> The initiator is an up-to-date Solaris 10 11/06 x86 box using the
> above-mentioned volumes as disks for a local zpool.
>
> I then started rsync to copy about 1GB of data in several thousand
> files. During the operation I took the network interface on the iSCSI
> target down, which resulted in no more disk I/O on that server. On the
> other hand, the client happily dumped data into the ZFS cache, actually
> completing the entire copy operation.
>
> Now the big question: we plan to use this kind of setup for email and
> other important services, so what happens if the client crashes while
> the network is down? Does it mean that all the data in the cache is
> gone forever?
>
> If so, is this a transport-independent problem which can also happen if
> ZFS used Fibre Channel attached drives instead of iSCSI devices?

I assume the rsync is not issuing fsyncs (and its files are not opened
O_DSYNC). If so, rsync just works against the filesystem cache and does
not commit the data to disk.

You might want to run sync(1M) after a successful rsync.

A larger rsync would presumably have blocked. It's just that the amount
of data you needed to rsync fitted in a couple of transaction groups.

-r
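A minimal sketch of the O_DSYNC alternative Roch mentions; the path and
helper name are illustrative, not from the thread. Opening a file this
way forces every write(2) to reach stable storage before it returns,
which is exactly the guarantee a plain rsync never asks for:

    #include <fcntl.h>

    /* Sketch only: path and helper name are illustrative. With O_DSYNC,
     * each write(2) returns only after the data (and the metadata needed
     * to retrieve it) has reached stable storage. */
    int open_for_sync_writes(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    }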
On Fri, 23 Mar 2007, Roch - PAE wrote:
> I assume the rsync is not issuing fsyncs (and its files are not opened
> O_DSYNC). If so, rsync just works against the filesystem cache and does
> not commit the data to disk.
>
> You might want to run sync(1M) after a successful rsync.
>
> A larger rsync would presumably have blocked. It's just that the amount
> of data you needed to rsync fitted in a couple of transaction groups.

Thanks for the hints, but this would make our worst nightmares come
true. At least it could, because it means we would have to check every
application handling critical data, and I think it's not the app's
responsibility. Up to a certain point, like a database transaction, but
not any further. There's always a time window where data might be
cached in memory, but I would argue that caching several GB of written
data, in our case thousands of files, in volatile memory circumvents
all the built-in reliability of ZFS.

I'm in a way still hoping that it's an iSCSI-related problem, as
detecting dead hosts in a network can be a non-trivial problem and it
takes quite some time for TCP to time out and inform the upper layers.
Just a guess/hope here that FC-AL, etc. do better in this case.

Thomas
On March 23, 2007 6:51:10 PM +0100 Thomas Nau <thomas.nau at uni-ulm.de> wrote:
> Thanks for the hints, but this would make our worst nightmares come
> true. At least it could, because it means we would have to check every
> application handling critical data, and I think it's not the app's
> responsibility.

I'd tend to disagree with that. POSIX/SUS does not guarantee that data
makes it to disk until you do an fsync() (or open the file with the
right flags, or use other techniques). If an application REQUIRES that
data get to disk, it really MUST DTRT.

> Up to a certain point, like a database transaction, but not any
> further. There's always a time window where data might be cached in
> memory, but I would argue that caching several GB of written data, in
> our case thousands of files, in volatile memory circumvents all the
> built-in reliability of ZFS.
>
> I'm in a way still hoping that it's an iSCSI-related problem, as
> detecting dead hosts in a network can be a non-trivial problem and it
> takes quite some time for TCP to time out and inform the upper layers.
> Just a guess/hope here that FC-AL, etc. do better in this case.

iSCSI doesn't use TCP, does it? Anyway, the problem is really transport
independent.

-frank
> I'd tend to disagree with that. POSIX/SUS does not guarantee that data
> makes it to disk until you do an fsync() (or open the file with the
> right flags, or use other techniques). If an application REQUIRES that
> data get to disk, it really MUST DTRT.

Indeed; want your data safe? Use:

    fflush(fp);
    fsync(fileno(fp));
    fclose(fp);

and check errors.

(It's remarkable how often people get the above sequence wrong and only
do something like fsync(fileno(fp)); fclose(fp);)

Casper
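A fuller sketch of that sequence with the error checking Casper asks
for; the helper name write_durably is illustrative, not from the
thread, and real code may additionally need to fsync the containing
directory after creating or renaming a file:

    #include <stdio.h>
    #include <unistd.h>

    /* Flush the stdio buffer into the kernel, force the kernel's dirty
     * data to stable storage, then close -- checking errors at every
     * step. Returns 0 on success, -1 on any failure. */
    int write_durably(FILE *fp)
    {
        if (fflush(fp) != 0) {        /* user-space buffer -> kernel */
            fclose(fp);
            return -1;
        }
        if (fsync(fileno(fp)) != 0) { /* kernel buffers -> disk */
            fclose(fp);
            return -1;
        }
        return fclose(fp) == 0 ? 0 : -1;  /* fclose can still fail */
    }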
Thomas Nau wrote:
> Dear all,
> I've set up the following scenario:
>
> Galaxy 4200 running OpenSolaris build 59 as iSCSI target; the remaining
> disk space of the two internal drives, 90GB in total, is used as a zpool
> for the two 32GB volumes "exported" via iSCSI.
>
> The initiator is an up-to-date Solaris 10 11/06 x86 box using the
> above-mentioned volumes as disks for a local zpool.

Like this?
  disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

> I'm in a way still hoping that it's an iSCSI-related problem, as
> detecting dead hosts in a network can be a non-trivial problem and it
> takes quite some time for TCP to time out and inform the upper layers.
> Just a guess/hope here that FC-AL, etc. do better in this case.

Actually, this is why NFS was invented. Prior to NFS we had something like:
  disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that the failure modes are very different for networks
and presumably reliable local disk connections. Hence NFS has a lot of
error-handling code and provides well-understood error-handling
semantics. Maybe what you really want is NFS?
 -- richard
Dear Frank & Casper,

>> I'd tend to disagree with that. POSIX/SUS does not guarantee that data
>> makes it to disk until you do an fsync() (or open the file with the
>> right flags, or use other techniques). If an application REQUIRES that
>> data get to disk, it really MUST DTRT.
>
> Indeed; want your data safe? Use:
>
>     fflush(fp);
>     fsync(fileno(fp));
>     fclose(fp);
>
> and check errors.
>
> (It's remarkable how often people get the above sequence wrong and only
> do something like fsync(fileno(fp)); fclose(fp);)

Thanks for clarifying! Seems I really need to check the apps with truss
or dtrace to see if they use that sequence. Allow me one more question:
why is fflush() required prior to fsync()?

Putting all the pieces together, this means that if an app doesn't do
this, it suffered from the same problem with UFS anyway, just with
typically smaller caches, right?

Thanks again
Thomas
Richard,

> Like this?
>   disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

Exactly.

>> I'm in a way still hoping that it's an iSCSI-related problem, as
>> detecting dead hosts in a network can be a non-trivial problem and it
>> takes quite some time for TCP to time out and inform the upper layers.
>> Just a guess/hope here that FC-AL, etc. do better in this case.
>
> Actually, this is why NFS was invented. Prior to NFS we had something like:
>   disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that our NFS, mail, DB and other servers use mirrored
disks located in different buildings on campus. Currently we use FC-AL
devices and recently switched from UFS to ZFS. The drawback with FC-AL
is that you always need to have a second infrastructure (not the real
problem), but with different components. Having all Ethernet would be
much easier.

> The problem is that the failure modes are very different for networks
> and presumably reliable local disk connections. Hence NFS has a lot of
> error-handling code and provides well-understood error-handling
> semantics. Maybe what you really want is NFS?

We thought about using NFS as a backend for as many applications as
possible, but we need to have redundancy for the fileserver itself too.

Thomas
> Thanks for clarifying! Seems I really need to check the apps with truss
> or dtrace to see if they use that sequence. Allow me one more question:
> why is fflush() required prior to fsync()?

When you use stdio, you need to make sure the data is in the system
buffers prior to calling fsync(); fclose() would otherwise write the
rest of the data, which is then not synced.

(In S10 I fixed this for the /etc/*_* driver files; they are generally
under 8K and therefore never written to disk before being fsync'ed if
not preceded by fflush().)

Casper
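A small sketch of the failure mode Casper describes (the path and
message are illustrative): with a buffered stdio stream, a short write
sits entirely in the user-space buffer, so an fsync() without a
preceding fflush() commits nothing:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *fp = fopen("/tmp/demo.conf", "w");  /* illustrative path */
        if (fp == NULL)
            return 1;

        /* A few dozen bytes: far below the default stdio buffer size,
         * so nothing has been handed to the kernel yet. */
        fprintf(fp, "some driver configuration data\n");

        /* WRONG on its own: the kernel holds no dirty data for this
         * file yet, so there is nothing for fsync() to commit. */
        fsync(fileno(fp));

        /* fclose() performs the real write(2) only now -- after the
         * fsync -- leaving the data in the cache until the next
         * transaction group commit or an explicit sync. Calling
         * fflush(fp) before the fsync fixes this. */
        fclose(fp);
        return 0;
    }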
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
> > I'm in a way still hoping that it's an iSCSI-related problem, as
> > detecting dead hosts in a network can be a non-trivial problem and it
> > takes quite some time for TCP to time out and inform the upper layers.
> > Just a guess/hope here that FC-AL, etc. do better in this case.
>
> iSCSI doesn't use TCP, does it? Anyway, the problem is really transport
> independent.

It does use TCP. Were you thinking of UDP?

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Thomas Nau <thomas.nau at uni-ulm.de> wrote:
> >     fflush(fp);
> >     fsync(fileno(fp));
> >     fclose(fp);
> >
> > and check errors.
> >
> > (It's remarkable how often people get the above sequence wrong and only
> > do something like fsync(fileno(fp)); fclose(fp);)
>
> Thanks for clarifying! Seems I really need to check the apps with truss
> or dtrace to see if they use that sequence. Allow me one more question:
> why is fflush() required prior to fsync()?

You cannot simply verify this with truss unless you trace libc::fflush()
too.

You need to call fflush() before, in order to move the user-space cache
to the kernel.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
On March 23, 2007 11:06:33 PM -0700 Adam Leventhal <ahl at eng.sun.com> wrote:
> On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
>> > I'm in a way still hoping that it's an iSCSI-related problem, as
>> > detecting dead hosts in a network can be a non-trivial problem and
>> > it takes quite some time for TCP to time out and inform the upper
>> > layers. Just a guess/hope here that FC-AL, etc. do better in this
>> > case.
>>
>> iSCSI doesn't use TCP, does it? Anyway, the problem is really
>> transport independent.
>
> It does use TCP. Were you thinking of UDP?

Or its own IP protocol. I wouldn't have thought iSCSI would want to be
subject to the vagaries of TCP.

-frank
On Sat, Mar 24, 2007 at 11:20:38AM -0700, Frank Cusack wrote:
> >> iSCSI doesn't use TCP, does it? Anyway, the problem is really
> >> transport independent.
> >
> > It does use TCP. Were you thinking of UDP?
>
> Or its own IP protocol. I wouldn't have thought iSCSI would want to be
> subject to the vagaries of TCP.

No, you'll find that iSCSI does indeed use TCP, for better or for worse. ;)

-brian
--
"The reason I don't use Gnome: every single other window manager I know of
is very powerfully extensible, where you can switch actions to different
mouse buttons. Guess which one is not, because it might confuse the poor
users? Here's a hint: it's not the small and fast one."  --Linus
Hello Thomas,

Saturday, March 24, 2007, 1:06:47 AM, you wrote:

>> The problem is that the failure modes are very different for networks
>> and presumably reliable local disk connections. Hence NFS has a lot of
>> error-handling code and provides well-understood error-handling
>> semantics. Maybe what you really want is NFS?

TN> We thought about using NFS as a backend for as many applications as
TN> possible, but we need to have redundancy for the fileserver itself too.

Then use Sun Cluster + NFS; both are free.

It won't solve your 'sync' concern by itself, but maybe you can try:
SC + NFS mounted on clients with directio, and UFS on the server
mounted with directio. It will probably be really slow, but everything
should be consistent all the time, I guess.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hi Robert,

On Sun, 25 Mar 2007, Robert Milkowski wrote:
> Then use Sun Cluster + NFS; both are free.

We use a cluster ;) but in the backend it doesn't solve the sync
problem, as you mention.

> It won't solve your 'sync' concern by itself, but maybe you can try:
> SC + NFS mounted on clients with directio, and UFS on the server
> mounted with directio.

UFS is no option due to its limitations in size and data safety. We
recently had a severe problem when the UFS log got corrupted due to a
hardware failure (a port died on an FC-AL switch). The fsck ran for 10+
hours on the mail server. Even worse, it reported some corrected
problems, but after running a few more hours in production the system
panicked again with "freeing free block/inode". At that point we
decided that 500GB of mail is not what we want to put on UFS or any
similar FS anymore.

> It will probably be really slow, but everything should be consistent
> all the time, I guess.

You might be right about that. I did a quick check with dtrace on the
mail server, and it seems IMAP, sendmail and the others nicely sync
data as they should.

Thomas
On Mar 25, 2007, at 06:14, Thomas Nau wrote:
> We use a cluster ;) but in the backend it doesn't solve the sync
> problem, as you mention.

The StorageTek Availability Suite was recently open-sourced:

  http://www.opensolaris.org/os/project/avs/