Dear all,
I've set up the following scenario:

Galaxy 4200 running OpenSolaris build 59 as iSCSI target; the remaining
disk space of the two internal drives, 90GB in total, is used as a zpool
for the two 32GB volumes "exported" via iSCSI.

The initiator is an up-to-date Solaris 10 11/06 x86 box using the
above-mentioned volumes as disks for a local zpool.

I then started rsync to copy about 1GB of data in several thousand
files. During the operation I took the network interface on the iSCSI
target down, which resulted in no more disk I/O on that server. On the
other hand, the client happily dumped data into the ZFS cache, actually
completing the entire copy operation.

Now the big question: we plan to use this kind of setup for email and
other important services, so what happens if the client crashes while
the network is down? Does it mean that all the data in the cache is
gone forever?

If so, is this a transport-independent problem which can also happen if
ZFS used Fibre Channel attached drives instead of iSCSI devices?

Thanks for your help
Thomas

-----------------------------------------------------------------
GPG fingerprint: B1 EE D2 39 2C 82 26 DA A5 4D E0 50 35 75 9E ED
Thomas Nau writes:
> Dear all,
> I've set up the following scenario:
>
> Galaxy 4200 running OpenSolaris build 59 as iSCSI target; the remaining
> disk space of the two internal drives, 90GB in total, is used as a zpool
> for the two 32GB volumes "exported" via iSCSI.
>
> The initiator is an up-to-date Solaris 10 11/06 x86 box using the
> above-mentioned volumes as disks for a local zpool.
>
> I then started rsync to copy about 1GB of data in several thousand
> files. During the operation I took the network interface on the iSCSI
> target down, which resulted in no more disk I/O on that server. On the
> other hand, the client happily dumped data into the ZFS cache, actually
> completing the entire copy operation.
>
> Now the big question: we plan to use this kind of setup for email and
> other important services, so what happens if the client crashes while
> the network is down? Does it mean that all the data in the cache is
> gone forever?
>
> If so, is this a transport-independent problem which can also happen if
> ZFS used Fibre Channel attached drives instead of iSCSI devices?

I assume the rsync is not issuing fsyncs (and its files are not opened
O_DSYNC). If so, rsync just works against the filesystem cache and does
not commit the data to disk.

You might want to run sync(1M) after a successful rsync.

A larger rsync would presumably have blocked. It's just that the amount
of data you needed to rsync fitted in a couple of transaction groups.

-r
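A minimal sketch of the O_DSYNC alternative Roch mentions; the path and
helper name are illustrative, not from the thread. Opening a file this
way forces every write(2) to reach stable storage before it returns,
which is exactly the guarantee a plain rsync never asks for:

    #include <fcntl.h>

    /* Sketch only: path and helper name are illustrative. With O_DSYNC,
     * each write(2) returns only after the data (and the metadata needed
     * to retrieve it) has reached stable storage. */
    int open_for_sync_writes(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    }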
On Fri, 23 Mar 2007, Roch - PAE wrote:
> I assume the rsync is not issuing fsyncs (and its files are not opened
> O_DSYNC). If so, rsync just works against the filesystem cache and does
> not commit the data to disk.
>
> You might want to run sync(1M) after a successful rsync.
>
> A larger rsync would presumably have blocked. It's just that the amount
> of data you needed to rsync fitted in a couple of transaction groups.

Thanks for the hints, but this would make our worst nightmares come
true. At least it could, because it means we would have to check every
application handling critical data, and I think it's not the app's
responsibility. Up to a certain point, like a database transaction, but
not any further. There's always a time window where data might be
cached in memory, but I would argue that caching several GB of written
data, in our case thousands of files, in volatile memory circumvents
all the built-in reliability of ZFS.

I'm in a way still hoping that it's an iSCSI-related problem, as
detecting dead hosts in a network can be a non-trivial problem and it
takes quite some time for TCP to time out and inform the upper layers.
Just a guess/hope here that FC-AL, etc. do better in this case.

Thomas
On March 23, 2007 6:51:10 PM +0100 Thomas Nau <thomas.nau at uni-ulm.de> wrote:
> Thanks for the hints, but this would make our worst nightmares come
> true. At least it could, because it means we would have to check every
> application handling critical data, and I think it's not the app's
> responsibility.

I'd tend to disagree with that. POSIX/SUS does not guarantee that data
makes it to disk until you do an fsync() (or open the file with the
right flags, or use other techniques). If an application REQUIRES that
data get to disk, it really MUST DTRT.

> Up to a certain point, like a database transaction, but not any
> further. There's always a time window where data might be cached in
> memory, but I would argue that caching several GB of written data, in
> our case thousands of files, in volatile memory circumvents all the
> built-in reliability of ZFS.
>
> I'm in a way still hoping that it's an iSCSI-related problem, as
> detecting dead hosts in a network can be a non-trivial problem and it
> takes quite some time for TCP to time out and inform the upper layers.
> Just a guess/hope here that FC-AL, etc. do better in this case.

iSCSI doesn't use TCP, does it? Anyway, the problem is really transport
independent.

-frank
> I'd tend to disagree with that. POSIX/SUS does not guarantee that data
> makes it to disk until you do an fsync() (or open the file with the
> right flags, or use other techniques). If an application REQUIRES that
> data get to disk, it really MUST DTRT.

Indeed; want your data safe? Use:

    fflush(fp);
    fsync(fileno(fp));
    fclose(fp);

and check errors.

(It's remarkable how often people get the above sequence wrong and only
do something like fsync(fileno(fp)); fclose(fp);)

Casper
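A fuller sketch of that sequence with the error checking Casper asks
for; the helper name write_durably is illustrative, not from the
thread, and real code may additionally need to fsync the containing
directory after creating or renaming a file:

    #include <stdio.h>
    #include <unistd.h>

    /* Flush the stdio buffer into the kernel, force the kernel's dirty
     * data to stable storage, then close -- checking errors at every
     * step. Returns 0 on success, -1 on any failure. */
    int write_durably(FILE *fp)
    {
        if (fflush(fp) != 0) {        /* user-space buffer -> kernel */
            fclose(fp);
            return -1;
        }
        if (fsync(fileno(fp)) != 0) { /* kernel buffers -> disk */
            fclose(fp);
            return -1;
        }
        return fclose(fp) == 0 ? 0 : -1;  /* fclose can still fail */
    }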
Thomas Nau wrote:
> Dear all,
> I've set up the following scenario:
>
> Galaxy 4200 running OpenSolaris build 59 as iSCSI target; the remaining
> disk space of the two internal drives, 90GB in total, is used as a zpool
> for the two 32GB volumes "exported" via iSCSI.
>
> The initiator is an up-to-date Solaris 10 11/06 x86 box using the
> above-mentioned volumes as disks for a local zpool.

Like this?
  disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

> I'm in a way still hoping that it's an iSCSI-related problem, as
> detecting dead hosts in a network can be a non-trivial problem and it
> takes quite some time for TCP to time out and inform the upper layers.
> Just a guess/hope here that FC-AL, etc. do better in this case.

Actually, this is why NFS was invented. Prior to NFS we had something like:
  disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that the failure modes are very different for networks
and presumably reliable local disk connections. Hence NFS has a lot of
error-handling code and provides well-understood error-handling
semantics. Maybe what you really want is NFS?
 -- richard
Dear Frank & Casper,

>> I'd tend to disagree with that. POSIX/SUS does not guarantee that data
>> makes it to disk until you do an fsync() (or open the file with the
>> right flags, or use other techniques). If an application REQUIRES that
>> data get to disk, it really MUST DTRT.
>
> Indeed; want your data safe? Use:
>
>     fflush(fp);
>     fsync(fileno(fp));
>     fclose(fp);
>
> and check errors.
>
> (It's remarkable how often people get the above sequence wrong and only
> do something like fsync(fileno(fp)); fclose(fp);)

Thanks for clarifying! Seems I really need to check the apps with truss
or dtrace to see if they use that sequence. Allow me one more question:
why is fflush() required prior to fsync()?

Putting all the pieces together, this means that if an app doesn't do
this, it suffered from the same problem with UFS anyway, just with
typically smaller caches, right?

Thanks again
Thomas
Richard,

> Like this?
>   disk--zpool--zvol--iscsitarget--network--iscsiclient--zpool--filesystem--app

Exactly.

>> I'm in a way still hoping that it's an iSCSI-related problem, as
>> detecting dead hosts in a network can be a non-trivial problem and it
>> takes quite some time for TCP to time out and inform the upper layers.
>> Just a guess/hope here that FC-AL, etc. do better in this case.
>
> Actually, this is why NFS was invented. Prior to NFS we had something like:
>   disk--raw--ndserver--network--ndclient--filesystem--app

The problem is that our NFS, mail, DB and other servers use mirrored
disks located in different buildings on campus. Currently we use FC-AL
devices and recently switched from UFS to ZFS. The drawback with FC-AL
is that you always need to have a second infrastructure (not the real
problem), but with different components. Having all Ethernet would be
much easier.

> The problem is that the failure modes are very different for networks
> and presumably reliable local disk connections. Hence NFS has a lot of
> error-handling code and provides well-understood error-handling
> semantics. Maybe what you really want is NFS?

We thought about using NFS as a backend for as many applications as
possible, but we need to have redundancy for the fileserver itself too.

Thomas
> Thanks for clarifying! Seems I really need to check the apps with truss
> or dtrace to see if they use that sequence. Allow me one more question:
> why is fflush() required prior to fsync()?

When you use stdio, you need to make sure the data is in the system
buffers prior to calling fsync(); fclose() would otherwise write the
rest of the data, which is then not synced.

(In S10 I fixed this for the /etc/*_* driver files; they are generally
under 8K and therefore never written to disk before being fsync'ed if
not preceded by fflush().)

Casper
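A small sketch of the failure mode Casper describes (the path and
message are illustrative): with a buffered stdio stream, a short write
sits entirely in the user-space buffer, so an fsync() without a
preceding fflush() commits nothing:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *fp = fopen("/tmp/demo.conf", "w");  /* illustrative path */
        if (fp == NULL)
            return 1;

        /* A few dozen bytes: far below the default stdio buffer size,
         * so nothing has been handed to the kernel yet. */
        fprintf(fp, "some driver configuration data\n");

        /* WRONG on its own: the kernel holds no dirty data for this
         * file yet, so there is nothing for fsync() to commit. */
        fsync(fileno(fp));

        /* fclose() performs the real write(2) only now -- after the
         * fsync -- leaving the data in the cache until the next
         * transaction group commit or an explicit sync. Calling
         * fflush(fp) before the fsync fixes this. */
        fclose(fp);
        return 0;
    }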
On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
> > I'm in a way still hoping that it's an iSCSI-related problem, as
> > detecting dead hosts in a network can be a non-trivial problem and it
> > takes quite some time for TCP to time out and inform the upper layers.
> > Just a guess/hope here that FC-AL, etc. do better in this case.
>
> iSCSI doesn't use TCP, does it? Anyway, the problem is really transport
> independent.

It does use TCP. Were you thinking of UDP?

Adam

--
Adam Leventhal, Solaris Kernel Development       http://blogs.sun.com/ahl
Thomas Nau <thomas.nau at uni-ulm.de> wrote:
> >     fflush(fp);
> >     fsync(fileno(fp));
> >     fclose(fp);
> >
> > and check errors.
> >
> > (It's remarkable how often people get the above sequence wrong and only
> > do something like fsync(fileno(fp)); fclose(fp);)
>
> Thanks for clarifying! Seems I really need to check the apps with truss
> or dtrace to see if they use that sequence. Allow me one more question:
> why is fflush() required prior to fsync()?

You cannot simply verify this with truss unless you trace libc::fflush()
too.

You need to call fflush() before, in order to move the user-space cache
to the kernel.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
On March 23, 2007 11:06:33 PM -0700 Adam Leventhal <ahl at eng.sun.com> wrote:
> On Fri, Mar 23, 2007 at 11:28:19AM -0700, Frank Cusack wrote:
>> > I'm in a way still hoping that it's an iSCSI-related problem, as
>> > detecting dead hosts in a network can be a non-trivial problem and
>> > it takes quite some time for TCP to time out and inform the upper
>> > layers. Just a guess/hope here that FC-AL, etc. do better in this
>> > case.
>>
>> iSCSI doesn't use TCP, does it? Anyway, the problem is really
>> transport independent.
>
> It does use TCP. Were you thinking of UDP?

Or its own IP protocol. I wouldn't have thought iSCSI would want to be
subject to the vagaries of TCP.

-frank
On Sat, Mar 24, 2007 at 11:20:38AM -0700, Frank Cusack wrote:
> >> iSCSI doesn't use TCP, does it? Anyway, the problem is really
> >> transport independent.
> >
> > It does use TCP. Were you thinking of UDP?
>
> Or its own IP protocol. I wouldn't have thought iSCSI would want to be
> subject to the vagaries of TCP.

No, you'll find that iSCSI does indeed use TCP, for better or for worse. ;)

-brian
--
"The reason I don't use Gnome: every single other window manager I know of
is very powerfully extensible, where you can switch actions to different
mouse buttons. Guess which one is not, because it might confuse the poor
users? Here's a hint: it's not the small and fast one."  --Linus
Hello Thomas,

Saturday, March 24, 2007, 1:06:47 AM, you wrote:

>> The problem is that the failure modes are very different for networks
>> and presumably reliable local disk connections. Hence NFS has a lot of
>> error-handling code and provides well-understood error-handling
>> semantics. Maybe what you really want is NFS?

TN> We thought about using NFS as a backend for as many applications as
TN> possible, but we need to have redundancy for the fileserver itself too.

Then use Sun Cluster + NFS; both are free.

It won't solve your 'sync' concern by itself, but maybe you can try:
SC + NFS mounted on clients with directio, and UFS on the server
mounted with directio. It will probably be really slow, but everything
should be consistent all the time, I guess.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
Hi Robert,

On Sun, 25 Mar 2007, Robert Milkowski wrote:
> Then use Sun Cluster + NFS; both are free.

We use a cluster ;) but in the backend it doesn't solve the sync
problem, as you mention.

> It won't solve your 'sync' concern by itself, but maybe you can try:
> SC + NFS mounted on clients with directio, and UFS on the server
> mounted with directio.

UFS is no option due to its limitations in size and data safety. We
recently had a severe problem when the UFS log got corrupted due to a
hardware failure (a port died on an FC-AL switch). The fsck ran for 10+
hours on the mail server. Even worse, it reported some corrected
problems, but after running a few more hours in production the system
panicked again with "freeing free block/inode". At that point we
decided that 500GB of mail is not what we want to put on UFS or any
similar FS anymore.

> It will probably be really slow, but everything should be consistent
> all the time, I guess.

You might be right about that. I did a quick check with dtrace on the
mail server, and it seems IMAP, sendmail and the others nicely sync
data as they should.

Thomas
On Mar 25, 2007, at 06:14, Thomas Nau wrote:
> We use a cluster ;) but in the backend it doesn't solve the sync
> problem, as you mention.

The StorageTek Availability Suite was recently open-sourced:

  http://www.opensolaris.org/os/project/avs/