Hi,

we are currently testing ZFS as a possible replacement for Veritas VM.
While testing, we encountered a serious problem which corrupted the whole filesystem.

First we created a standard RAID 10 pool with 4 disks:

NODE2:../# zpool create -f swimmingpool mirror c0t3d0 c0t11d0 mirror c0t4d0 c0t12d0

NODE2:../# zpool list
NAME            SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
swimmingpool   33.5G    81K  33.5G    0%  ONLINE  -

NODE2:../# zpool status
  pool: swimmingpool
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        swimmingpool    ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c0t3d0      ONLINE       0     0     0
            c0t11d0     ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            c0t4d0      ONLINE       0     0     0
            c0t12d0     ONLINE       0     0     0

errors: No known data errors

After that we created a new ZFS filesystem and copied a test file onto it:

NODE2:../# zfs create swimmingpool/babe

NODE2:../# zfs list
NAME                USED  AVAIL  REFER  MOUNTPOINT
swimmingpool        108K  33.0G  25.5K  /swimmingpool
swimmingpool/babe  24.5K  33.0G  24.5K  /swimmingpool/babe

NODE2:../# cp /etc/hosts /swimmingpool/babe/

Now we tested the behaviour when importing the pool on another system while it is still imported on the first one.
The expected behaviour would be that the pool couldn't be imported due to possible corruption, but instead it is imported just fine!
We were then able to write simultaneously from both systems to the same filesystem:

NODE1:../# zpool import -f swimmingpool
NODE1:../# man man > /swimmingpool/babe/man
NODE2:../# cat /dev/urandom > /swimmingpool/babe/testfile &
NODE1:../# cat /dev/urandom > /swimmingpool/babe/testfile2 &

NODE1:../# ls -l /swimmingpool/babe/
-r--r--r--   1 root     root        2194 Sep  8 14:31 hosts
-rw-r--r--   1 root     root       17531 Sep  8 14:52 man
-rw-r--r--   1 root     root  3830447920 Sep  8 16:20 testfile2

NODE2:../# ls -l /swimmingpool/babe/
-r--r--r--   1 root     root        2194 Sep  8 14:31 hosts
-rw-r--r--   1 root     root  3534355760 Sep  8 16:19 testfile

Surely this can't be the intended behaviour.
Did we encounter a bug, or is this still under development?
Michael Schuster
2006-Sep-13 08:02 UTC
[zfs-discuss] ZFS imported simultaneously on 2 systems...
I think this is user error: the man page explicitly says:

     -f      Forces import, even if the pool appears  to  be
             potentially active.

and that's exactly what you did. If the behaviour had been the same
without the -f option, I guess this would be a bug.

HTH

Mathias F wrote:
> Now we tested the behaviour when importing the pool on another system
> while it is still imported on the first one. The expected behaviour
> would be that the pool couldn't be imported due to possible corruption,
> but instead it is imported just fine!
> [...]
> Surely this can't be the intended behaviour.
> Did we encounter a bug, or is this still under development?

-- 
Michael Schuster                         +49 89 46008-2974 / x62974
visit the online support center:         http://www.sun.com/osc/

Recursion, n.: see 'Recursion'
Mathias F
2006-Sep-13 10:23 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Well, we are using the -f parameter to test failover functionality.
If one system with the mounted ZFS is down, we have to use the force
option to mount it on the failover system.
But when the failed system comes online again, it remounts the ZFS
without errors, so it is mounted simultaneously on both nodes...
That's the real problem we have :[

mfg
Mathias

> I think this is user error: the man page explicitly says:
>
>      -f      Forces import, even if the pool appears  to  be
>              potentially active.
>
> and that's exactly what you did. If the behaviour had been the same
> without the -f option, I guess this would be a bug.
>
> HTH
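For reference, a minimal sketch of the manual failover sequence being tested here, using the pool and host names from the example above (this assumes no cluster framework is arbitrating access, and the error text in the comments is indicative only):

    # NODE2 (primary) crashes or is halted uncleanly; its disks still
    # mark the pool as potentially active.

    # NODE1 (failover node): a plain import is refused...
    NODE1# zpool import swimmingpool      # refused: pool appears active on another system
    NODE1# zpool import -f swimmingpool   # ...so the takeover has to be forced

    # When NODE2 boots again it re-opens the pool recorded in its
    # /etc/zfs/zpool.cache, so the pool ends up imported on both nodes
    # at once, which is the problem described in this thread.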
Michael Schuster
2006-Sep-13 10:28 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote:
> Well, we are using the -f parameter to test failover functionality.
> If one system with the mounted ZFS is down, we have to use the force
> option to mount it on the failover system.
> But when the failed system comes online again, it remounts the ZFS
> without errors, so it is mounted simultaneously on both nodes...

ZFS currently doesn't support this, I'm sorry to say. *You* have to make
sure that a zpool is not imported on more than one node at a time.

regards
-- 
Michael Schuster                         +49 89 46008-2974 / x62974
visit the online support center:         http://www.sun.com/osc/

Recursion, n.: see 'Recursion'
Thomas Wagner
2006-Sep-13 10:37 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On Wed, Sep 13, 2006 at 12:28:23PM +0200, Michael Schuster wrote:
> Mathias F wrote:
> > Well, we are using the -f parameter to test failover functionality.
> > If one system with the mounted ZFS is down, we have to use the force
> > option to mount it on the failover system.
> > But when the failed system comes online again, it remounts the ZFS
> > without errors, so it is mounted simultaneously on both nodes...

This is used on a regular basis within cluster frameworks...

> ZFS currently doesn't support this, I'm sorry to say. *You* have to
> make sure that a zpool is not imported on more than one node at a time.

Why not use real cluster software as the "*You*", taking care that
resources like a filesystem (UFS, ZFS, others...) are used in a
consistent way?

I think ZFS does enough to make sure that filesystems/pools are not
accidentally used from more than one host at a time. If you want more,
please consider using a cluster framework with heartbeats and all that
great stuff...

Regards,
Thomas
Gregory Shaw
2006-Sep-13 10:49 UTC
[zfs-discuss] ZFS imported simultaneously on 2 systems...
A question: you're forcing the import of the pool on the other host.
That disregards any checks, similar to a forced import of a Veritas
disk group.

Does the same thing happen if you try to import the pool without the
force option?

On Sep 13, 2006, at 1:44 AM, Mathias F wrote:
> Now we tested the behaviour when importing the pool on another system
> while it is still imported on the first one. The expected behaviour
> would be that the pool couldn't be imported due to possible corruption,
> but instead it is imported just fine!
> We were then able to write simultaneously from both systems to the
> same filesystem:
>
> NODE1:../# zpool import -f swimmingpool
> [...]
> Surely this can't be the intended behaviour.
> Did we encounter a bug, or is this still under development?

-----
Gregory Shaw
Programmer, SysAdmin        fmSoft, Inc.
Network Planner             shaw at fmsoft.com
And homebrewer...

Prayer belongs in schools like facts belong in organized religion.
        Superintendant Chalmers - "The Simpsons"
Mathias F
2006-Sep-13 11:22 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Without the -f option, the ZFS pool can't be imported while it is
"reserved" for the other host, even if that host is down.

As I said, we are testing ZFS as a replacement for VxVM, which we are
using at the moment. So as a result our tests have failed and we have to
keep on using Veritas.

Thanks for all your answers.
Michael Schuster
2006-Sep-13 12:14 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Mathias F wrote:
> Without the -f option, the ZFS pool can't be imported while it is
> "reserved" for the other host, even if that host is down.
>
> As I said, we are testing ZFS as a replacement for VxVM, which we are
> using at the moment. So as a result our tests have failed and we have
> to keep on using Veritas.
>
> Thanks for all your answers.

I think I get the whole picture, let me summarise:

- you create a pool P and an FS on host A
- host A crashes
- you import P on host B; this only works with -f, as "zpool import"
  otherwise refuses to do so
- now P is imported on B
- host A comes back up and re-accesses P, thereby leading to (potential)
  corruption
- your hope was that when host A comes back, there exists a mechanism
  for telling it "you need to re-import"
- VxVM, as you currently use it, has this functionality

Is that correct?

regards
-- 
Michael Schuster                         +49 89 46008-2974 / x62974
visit the online support center:         http://www.sun.com/osc/

Recursion, n.: see 'Recursion'
James C. McPherson
2006-Sep-13 12:15 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Mathias F wrote:
> Without the -f option, the ZFS pool can't be imported while it is
> "reserved" for the other host, even if that host is down.

This is the correct behaviour. What do you want it to do instead?
Cause data corruption?

> As I said, we are testing ZFS as a replacement for VxVM, which we are
> using at the moment. So as a result our tests have failed and we have
> to keep on using Veritas.

As I understand things, SunCluster 3.2 is expected to have support for
HA-ZFS, and until that version is released you will not be running in a
supported configuration... and so any errors you encounter are *your
fault alone*.

Didn't we have the PMC (poor man's cluster) talk last week as well?

James C. McPherson
Zoram Thanga
2006-Sep-13 12:27 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Hi Mathias,

Mathias F wrote:
> Without the -f option, the ZFS pool can't be imported while it is
> "reserved" for the other host, even if that host is down.
>
> As I said, we are testing ZFS as a replacement for VxVM, which we are
> using at the moment. So as a result our tests have failed and we have
> to keep on using Veritas.

Sun Cluster 3.2, which is in beta at the moment, will allow you to do
this automatically. I don't think what you are trying to do here will be
supportable unless it's managed by SC3.2.

Let me know if you'd like to try out the SC3.2 beta.

Thanks,
Zoram
-- 
Zoram Thanga::Sun Cluster Development::http://blogs.sun.com/zoram
Mathias F
2006-Sep-13 12:29 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...

> I think I get the whole picture, let me summarise:
>
> - you create a pool P and an FS on host A
> - host A crashes
> - you import P on host B; this only works with -f, as "zpool import"
>   otherwise refuses to do so
> - now P is imported on B
> - host A comes back up and re-accesses P, thereby leading to (potential)
>   corruption
> - your hope was that when host A comes back, there exists a mechanism
>   for telling it "you need to re-import"
> - VxVM, as you currently use it, has this functionality
>
> Is that correct?

Yes it is, you got it ;)
VxVM just notices that its previously imported DiskGroup(s) (for ZFS this
would be the pool) were failed over and doesn't try to re-acquire them.
It waits for an admin action.

The topic of "clustering" ZFS is not the problem at the moment, we are
just testing the failover behaviour manually.
Michael Schuster
2006-Sep-13 12:42 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...

Mathias F wrote:
>> I think I get the whole picture, let me summarise:
>> [...]
>> Is that correct?
>
> Yes it is, you got it ;)
> VxVM just notices that its previously imported DiskGroup(s) (for ZFS
> this would be the pool) were failed over and doesn't try to re-acquire
> them. It waits for an admin action.
>
> The topic of "clustering" ZFS is not the problem at the moment, we are
> just testing the failover behaviour manually.

Well, I think you'll nevertheless have to wait for SunCluster 3.2 for
this to work. As others have said, ZFS as it currently stands is not
made to work the way you expect it to.

regards
-- 
Michael Schuster                         +49 89 46008-2974 / x62974
visit the online support center:         http://www.sun.com/osc/

Recursion, n.: see 'Recursion'
Dale Ghent
2006-Sep-13 12:44 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

James C. McPherson wrote:
> As I understand things, SunCluster 3.2 is expected to have support for
> HA-ZFS, and until that version is released you will not be running in a
> supported configuration... and so any errors you encounter are *your
> fault alone*.

Still, after reading Mathias's description, it seems that the former
node is doing an implicit forced import when it boots back up. This
seems wrong to me.

zpools should be imported only if the zpool itself says it's not already
taken, which of course would be overridden by a manual -f import.

<zpool>   "sorry, i already have a boyfriend, host b"
<host a>  "darn, ok, maybe next time"

rather than the current scenario:

<zpool>   "host a, I'm over you now. host b is now the man in my life!"
<host a>  "I don't care! you're coming with me anyways. you'll always be
           mine!"
* host a stuffs zpool into the car and drives off

...and we know those situations never turn out particularly well.

/dale
James C. McPherson
2006-Sep-13 12:45 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...

Mathias F wrote:
...
> Yes it is, you got it ;)
> VxVM just notices that its previously imported DiskGroup(s) (for ZFS
> this would be the pool) were failed over and doesn't try to re-acquire
> them. It waits for an admin action.
> The topic of "clustering" ZFS is not the problem at the moment, we are
> just testing the failover behaviour manually.

Actually, this is the entirety of the problem: you are expecting a
product which is *not* currently multi-host-aware to behave in the same
safe manner as one which is. *AND* you're doing so knowing that you are
outside the protection of a clustering framework.

WHY? What valid tests do you think you are going to be able to run?

Wait for the SunCluster 3.2 release (or the beta). Don't faff around
with a data-killing test suite in an unsupported configuration.

James C. McPherson
Mathias F
2006-Sep-13 13:09 UTC
[zfs-discuss] Re: Re: Re: ZFS imported simultaneously on 2 systems...

[...]
> a product which is *not* currently multi-host-aware to behave in the
> same safe manner as one which is.

That's the point we figured out while testing it ;)
I just wanted to have our thoughts reviewed by other ZFS users.

Our next step, IF the failover had succeeded, would have been to create
a little ZFS agent for a VCS testing cluster. We haven't used Sun
Cluster and won't use it in the future.

regards
Mathias
Frank Cusack
2006-Sep-13 16:14 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On September 13, 2006 6:09:50 AM -0700 Mathias F
<mathias.froehlich at finanzit.com> wrote:
> [...]
>> a product which is *not* currently multi-host-aware to behave in the
>> same safe manner as one which is.
>
> That's the point we figured out while testing it ;)
> I just wanted to have our thoughts reviewed by other ZFS users.
>
> Our next step, IF the failover had succeeded, would have been to create
> a little ZFS agent for a VCS testing cluster. We haven't used Sun
> Cluster and won't use it in the future.

/etc/zfs/zpool.cache is used at boot time to find which pools to import.
Remove it when the system boots, and after it goes down and comes back
up it won't import any pools. Not quite the same as not importing if
they are imported elsewhere, but perhaps close enough for you.

On September 13, 2006 10:15:28 PM +1000 "James C. McPherson"
<James.C.McPherson at gmail.com> wrote:
> As I understand things, SunCluster 3.2 is expected to have support for
> HA-ZFS, and until that version is released you will not be running in a
> supported configuration... and so any errors you encounter are *your
> fault alone*.
>
> Didn't we have the PMC (poor man's cluster) talk last week as well?

I understand the objection to mickey-mouse configurations, but I don't
understand the objection to (what I consider) simply improving safety.

Why again shouldn't zfs have a hostid written into the pool, to prevent
import if the hostid doesn't match?

And why should failover be limited to SC? Why shouldn't VCS be able to
play? Why should SC have secrets on how to do failover? After all, this
is OPENsolaris. And anyway, many homegrown solutions (the kind I'm
familiar with anyway) are of high quality compared to commercial ones.

-frank
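For what it's worth, a minimal sketch of the zpool.cache workaround Frank describes; the script name, location and boot ordering are assumptions (the thread does not specify exactly where in the boot sequence this has to run), so treat it as a starting point only:

    #!/sbin/sh
    # Illustrative boot-time script: clear the ZFS pool cache so this
    # node does not auto-import any pool after an unclean reboot.
    [ -f /etc/zfs/zpool.cache ] && rm /etc/zfs/zpool.cache

    # The failover logic must then import the pool explicitly, and only
    # once it has decided that this node owns it, e.g.:
    #   zpool import swimmingpool

As discussed further down the thread, importing with an alternate root ("zpool import -R") achieves much the same effect without touching the cache file.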
Eric Schrock
2006-Sep-13 16:32 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On Wed, Sep 13, 2006 at 09:14:36AM -0700, Frank Cusack wrote:
>
> Why again shouldn't zfs have a hostid written into the pool, to prevent
> import if the hostid doesn't match?

See:

	6282725 hostname/hostid should be stored in the label

Keep in mind that this is not a complete clustering solution, only a
mechanism to prevent administrator misconfiguration. In particular, it's
possible for one host to be doing a failover, and the other host to open
the pool before the hostid has been written to the disk.

> And why should failover be limited to SC? Why shouldn't VCS be able to
> play? Why should SC have secrets on how to do failover? After all, this
> is OPENsolaris. And anyway, many homegrown solutions (the kind I'm
> familiar with anyway) are of high quality compared to commercial ones.

I'm not sure I understand this. There is no built-in clustering support
for UFS; simultaneously mounting the same UFS filesystem on different
hosts will corrupt your data as well. You need some sort of higher-level
logic to correctly implement clustering. This is not an "SC secret",
it's how you manage non-clustered filesystems in a failover situation.

Storing the hostid as a last-ditch check for administrative error is a
reasonable RFE, just one that we haven't yet gotten around to. Claiming
that it will solve the clustering problem oversimplifies the problem and
will lead to people who think they have a 'safe' homegrown failover when
in reality the right sequence of actions will irrevocably corrupt their
data.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Frank Cusack
2006-Sep-13 17:20 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
On September 13, 2006 9:32:50 AM -0700 Eric Schrock
<eric.schrock at sun.com> wrote:
> On Wed, Sep 13, 2006 at 09:14:36AM -0700, Frank Cusack wrote:
>>
>> Why again shouldn't zfs have a hostid written into the pool, to
>> prevent import if the hostid doesn't match?
>
> See:
>
> 	6282725 hostname/hostid should be stored in the label
>
> Keep in mind that this is not a complete clustering solution, only a
> mechanism to prevent administrator misconfiguration. In particular,
> it's possible for one host to be doing a failover, and the other host
> to open the pool before the hostid has been written to the disk.
>
>> And why should failover be limited to SC? [...]
>
> I'm not sure I understand this. There is no built-in clustering support
> for UFS; simultaneously mounting the same UFS filesystem on different
> hosts will corrupt your data as well. You need some sort of
> higher-level logic to correctly implement clustering. This is not an
> "SC secret", it's how you manage non-clustered filesystems in a
> failover situation.

But UFS filesystems don't automatically get mounted (well, we know how
to not automatically mount them via /etc/vfstab). The SC "secret" is in
how importing of pools is prevented at boot time. Of course you need
more than that, but my complaint was against the idea that you cannot
build a reliable solution yourself, instead of just sharing info about
zpool.cache, albeit with a warning.

> Storing the hostid as a last-ditch check for administrative error is a
> reasonable RFE, just one that we haven't yet gotten around to. Claiming
> that it will solve the clustering problem oversimplifies the problem
> and will lead to people who think they have a 'safe' homegrown failover
> when in reality the right sequence of actions will irrevocably corrupt
> their data.

Thanks for that clarification, very important info.

-frank
Dale Ghent
2006-Sep-13 17:28 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On Sep 13, 2006, at 12:32 PM, Eric Schrock wrote:

> Storing the hostid as a last-ditch check for administrative error is a
> reasonable RFE, just one that we haven't yet gotten around to. Claiming
> that it will solve the clustering problem oversimplifies the problem
> and will lead to people who think they have a 'safe' homegrown failover
> when in reality the right sequence of actions will irrevocably corrupt
> their data.

The hostid is handy, but it'll only tell you who MIGHT or MIGHT NOT have
control of the pool.

Such an RFE would be even more worthwhile if it included something such
as a time stamp. This time stamp (or similar time-oriented signature)
would be updated regularly (based on some internal ZFS event). If this
stamp goes for an arbitrary length of time without being updated,
another host in the cluster could force-import the pool on the
assumption that the original host is no longer able to communicate with
the zpool.

This is a simple idea description, but perhaps worthwhile if you're
already going to change the label structure to add the hostid.

/dale
Darren J Moffat
2006-Sep-13 17:37 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Dale Ghent wrote:
> On Sep 13, 2006, at 12:32 PM, Eric Schrock wrote:
>
>> Storing the hostid as a last-ditch check for administrative error is a
>> reasonable RFE, just one that we haven't yet gotten around to.
>> [...]
>
> The hostid is handy, but it'll only tell you who MIGHT or MIGHT NOT
> have control of the pool.
>
> Such an RFE would be even more worthwhile if it included something such
> as a time stamp. This time stamp (or similar time-oriented signature)
> would be updated regularly (based on some internal ZFS event). If this
> stamp goes for an arbitrary length of time without being updated,
> another host in the cluster could force-import the pool on the
> assumption that the original host is no longer able to communicate
> with the zpool.

That might be acceptable in some environments, but it is going to cause
disks to spin up. That will be very unacceptable on a laptop and maybe
even in some energy-conscious data centres.

What you are proposing sounds a lot like a cluster heartbeat, which IMO
really should not be implemented by writing to disks.

-- 
Darren J Moffat
Frank Cusack
2006-Sep-13 17:37 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
On September 13, 2006 1:28:47 PM -0400 Dale Ghent <daleg at elemental.org>
wrote:
> Such an RFE would be even more worthwhile if it included something such
> as a time stamp. This time stamp (or similar time-oriented signature)
> would be updated regularly (based on some internal ZFS event). If this
> stamp goes for an arbitrary length of time without being updated,
> another host in the cluster could force-import the pool on the
> assumption that the original host is no longer able to communicate
> with the zpool.
>
> This is a simple idea description, but perhaps worthwhile if you're
> already going to change the label structure to add the hostid.

Sounds cool! Better than depending on an out-of-band heartbeat.

-frank
Darren J Moffat
2006-Sep-13 17:44 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Frank Cusack wrote:
> Sounds cool! Better than depending on an out-of-band heartbeat.

I disagree; it sounds really, really bad. If you want a
high-availability cluster, you really need a faster interconnect than
spinning rust, which is probably the slowest interface we have now!

-- 
Darren J Moffat
Dale Ghent
2006-Sep-13 18:08 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On Sep 13, 2006, at 1:37 PM, Darren J Moffat wrote:

> That might be acceptable in some environments, but it is going to cause
> disks to spin up. That will be very unacceptable on a laptop and maybe
> even in some energy-conscious data centres.

Introduce an option to 'zpool create'? Come to think of it, the ability
to describe attributes for a pool seems to be lacking (unlike zfs
volumes).

> What you are proposing sounds a lot like a cluster heartbeat, which IMO
> really should not be implemented by writing to disks.

That would be an extreme example of the use for this. While it /could/
be used as a heartbeat mechanism, it would also be useful
administratively:

# zpool status foopool
Pool "foopool" is currently imported by host.blah.com
    Import time:   4 April 2007 16:20:00
    Last activity: 23 June 2007 18:42:53
...
...

/dale
Ceri Davies
2006-Sep-13 19:03 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
On Wed, Sep 13, 2006 at 06:37:25PM +0100, Darren J Moffat wrote:
> Dale Ghent wrote:
> > Such an RFE would be even more worthwhile if it included something
> > such as a time stamp. This time stamp (or similar time-oriented
> > signature) would be updated regularly (based on some internal ZFS
> > event). If this stamp goes for an arbitrary length of time without
> > being updated, another host in the cluster could force-import the
> > pool on the assumption that the original host is no longer able to
> > communicate with the zpool.
>
> That might be acceptable in some environments, but it is going to cause
> disks to spin up. That will be very unacceptable on a laptop and maybe
> even in some energy-conscious data centres.
>
> What you are proposing sounds a lot like a cluster heartbeat, which IMO
> really should not be implemented by writing to disks.

Wouldn't it be possible to implement this via SCSI reservations (where
available), a la quorum devices?

Ceri
-- 
That must be wonderful!  I don't understand it at all.  -- Moliere
Darren Dunham
2006-Sep-13 19:31 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

> Still, after reading Mathias's description, it seems that the former
> node is doing an implicit forced import when it boots back up. This
> seems wrong to me.
>
> zpools should be imported only if the zpool itself says it's not
> already taken, which of course would be overridden by a manual -f
> import.

The problem (danger: I'm completely going on assumptions here) is that
this locking is almost always some sort of "dirty" or "locked" flag. So
if your server crashes, you *want* the host to re-import the dirty pool
at boot. Otherwise every unclean shutdown requires a manual
intervention.

So a simple 'dirty/clean' flag isn't enough information, and I'm
assuming that's all the -f is checking for today.

I believe that VxVM gets around this by forcing the import at boot only
if the hostid of the machine matches the hostid on the group.

> <zpool>   "sorry, i already have a boyfriend, host b"
> <host a>  "darn, ok, maybe next time"

And it's the lack of "host b" information in the pool that prevents this
at the moment...

<host a>  "Hey, let's go!"
<zpool>   "Huh, I thought I was already seeing someone?"
<host a>  "Yeah, me!  Now let's go!"
<zpool>   "Wow, really?  Okay."

-- 
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
James C. McPherson
2006-Sep-13 23:07 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
Frank Cusack wrote:
...[snip James McPherson's objections to PMC]
> I understand the objection to mickey-mouse configurations, but I don't
> understand the objection to (what I consider) simply improving safety.
...
> And why should failover be limited to SC? Why shouldn't VCS be able to
> play? Why should SC have secrets on how to do failover? After all, this
> is OPENsolaris. And anyway, many homegrown solutions (the kind I'm
> familiar with anyway) are of high quality compared to commercial ones.

Frank, this isn't a SunCluster vs VCS argument. It's an argument about

 * doing cluster-y stuff with the protection that a cluster framework
   provides

versus

 * doing cluster-y stuff without the protection that a cluster framework
   provides.

If you want to use VCS, be my guest, and let us know how it goes. If you
want to use a homegrown solution, then please let us know what you did
to get it working, how well it copes, and how you are addressing any
data corruption that might occur.

I tend to refer to SunCluster more than VCS simply because I've got more
in-depth experience with Sun's offering.

James C. McPherson
Frank Cusack
2006-Sep-13 23:33 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On September 13, 2006 6:44:44 PM +0100 Darren J Moffat
<Darren.Moffat at Sun.COM> wrote:
> Frank Cusack wrote:
>> Sounds cool! Better than depending on an out-of-band heartbeat.
>
> I disagree; it sounds really, really bad. If you want a
> high-availability cluster, you really need a faster interconnect than
> spinning rust, which is probably the slowest interface we have now!

But it's fast enough ... it's the same speed as disk access. Why would
you test something ELSE other than the actual resource you're trying to
manage, where possible? E.g. if you have a web server that does db
queries, to monitor it you have to perform an http query that hits the
db.

You'd typically have a dedicated link for heartbeat; what if that cable
gets yanked or that NIC port dies? The backup system could avoid
mounting the pool if zfs had its own heartbeat. What if the cluster
software has a bug and tells the other system to take over? zfs could
protect itself.

And anyway, unless you're exchanging state, heartbeat data is tiny
(service1: ok; service2: ok; service3: ok) and even a serial link is
good enough.

-frank
Frank Cusack
2006-Sep-13 23:45 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

On September 13, 2006 4:33:31 PM -0700 Frank Cusack
<fcusack at fcusack.com> wrote:
> You'd typically have a dedicated link for heartbeat; what if that cable
> gets yanked or that NIC port dies? The backup system could avoid
> mounting the pool if zfs had its own heartbeat. What if the cluster
> software has a bug and tells the other system to take over? zfs could
> protect itself.

Hmm, actually probably not, considering heartbeat intervals and failover
time vs. probable zpool update frequency.

-frank
Frank Cusack
2006-Sep-14 00:11 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
On September 13, 2006 7:07:40 PM -0700 Richard Elling
<Richard.Elling at Sun.COM> wrote:
> Dale Ghent wrote:
>> James C. McPherson wrote:
>>
>>> As I understand things, SunCluster 3.2 is expected to have support
>>> for HA-ZFS, and until that version is released you will not be
>>> running in a supported configuration... and so any errors you
>>> encounter are *your fault alone*.
>>
>> Still, after reading Mathias's description, it seems that the former
>> node is doing an implicit forced import when it boots back up. This
>> seems wrong to me.
>
> Repeat the experiment with UFS, or most other file systems, on a raw
> device and you would get the same behaviour as ZFS: corruption.

Again, the difference is that with UFS your filesystems won't
automatically mount at boot. If you repeated this with UFS, you wouldn't
try to mount until you decided you should own the disk.

-frank
Richard Elling
2006-Sep-14 02:07 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Dale Ghent wrote:
> James C. McPherson wrote:
>
>> As I understand things, SunCluster 3.2 is expected to have support
>> for HA-ZFS, and until that version is released you will not be
>> running in a supported configuration... and so any errors you
>> encounter are *your fault alone*.
>
> Still, after reading Mathias's description, it seems that the former
> node is doing an implicit forced import when it boots back up. This
> seems wrong to me.

Repeat the experiment with UFS, or most other file systems, on a raw
device and you would get the same behaviour as ZFS: corruption.

The question on the table is "why doesn't ZFS behave like a
cluster-aware volume manager?", not "why does ZFS behave like UFS when
two nodes mount the same file system simultaneously?"
 -- richard
Darren J Moffat
2006-Sep-14 09:47 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Frank Cusack wrote:
> On September 13, 2006 7:07:40 PM -0700 Richard Elling
> <Richard.Elling at Sun.COM> wrote:
>> Repeat the experiment with UFS, or most other file systems, on a raw
>> device and you would get the same behaviour as ZFS: corruption.
>
> Again, the difference is that with UFS your filesystems won't
> automatically mount at boot. If you repeated this with UFS, you
> wouldn't try to mount until you decided you should own the disk.

Normally on Solaris UFS filesystems are mounted via /etc/vfstab, so yes,
they will probably automatically mount at boot time.

If you are either removing them from vfstab, not having them there, or
setting the 'mount at boot' flag in /etc/vfstab to off, then whatever it
is that is doing that *is* your cluster framework. You need to
rewrite/extend that to deal with the fact that ZFS doesn't use vfstab,
and instead express it in terms of ZFS import/export.

-- 
Darren J Moffat
Anton B. Rang
2006-Sep-14 14:46 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...

> You need to rewrite/extend that to deal with the fact that ZFS doesn't
> use vfstab, and instead express it in terms of ZFS import/export.

The problem (as I see it) is that ZFS import is (by default) implicit at
startup, while a UFS mount is (by default) only performed when
explicitly requested.

ZFS export isn't very useful in the cluster-failover case because if a
system crashes, you don't get a chance to run it; and if you want to run
it after a crash (during startup), you can't use it until after the pool
has been imported (and hence corrupted).

All of that said ... it looks like 'zpool create -R' should solve the
problem for anyone who is trying to build their own clustering facility,
since it prevents the automatic import. Maybe we just need to document
that more clearly? Calling it "alternate root" doesn't really draw
attention to it.
Darren J Moffat
2006-Sep-14 14:57 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...

Anton B. Rang wrote:
>> You need to rewrite/extend that to deal with the fact that ZFS doesn't
>> use vfstab, and instead express it in terms of ZFS import/export.
>
> The problem (as I see it) is that ZFS import is (by default) implicit
> at startup, while a UFS mount is (by default) only performed when
> explicitly requested.

Compare the fact that ZFS opens (not imports) the previously known pools
at boot time not to the mounting of a UFS file system via vfstab, but to
the way that LVM or VxVM makes the volume manager device nodes
available.

It is actually pretty easy to have ZFS not mount the file systems, and
what's more it has a supported and Committed interface:

	zfs set mountpoint=none mypool

But that isn't the issue :-)

-- 
Darren J Moffat
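As a small, purely illustrative aside (dataset names taken from the original post), these are two supported ways to keep ZFS from mounting a filesystem on its own; note that neither controls whether the pool itself is opened at boot:

    # Option 1: no automatic mount; set a real mountpoint again later
    # when the filesystem should be mounted.
    zfs set mountpoint=none swimmingpool/babe
    zfs set mountpoint=/swimmingpool/babe swimmingpool/babe

    # Option 2: legacy mountpoint, managed like UFS with mount/umount
    # (and /etc/vfstab if desired).
    zfs set mountpoint=legacy swimmingpool/babe
    mount -F zfs swimmingpool/babe /swimmingpool/babe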
Darren Dunham
2006-Sep-14 15:34 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

>> Again, the difference is that with UFS your filesystems won't
>> automatically mount at boot. If you repeated this with UFS, you
>> wouldn't try to mount until you decided you should own the disk.
>
> Normally on Solaris UFS filesystems are mounted via /etc/vfstab, so
> yes, they will probably automatically mount at boot time.

I think the idea is that in a cluster environment, you would not. The
cluster would leave /etc/vfstab empty and mount only explicitly when it
thought it safe. The explicit mounts make no changes to the boot-time
behavior of the system.

With ZFS, explicit imports do change the boot-time behavior (via the zfs
cache) in a way that appears to be difficult to override easily.

> If you are either removing them from vfstab, not having them there, or
> setting the 'mount at boot' flag in /etc/vfstab to off, then whatever
> it is that is doing that *is* your cluster framework. You need to
> rewrite/extend that to deal with the fact that ZFS doesn't use vfstab,
> and instead express it in terms of ZFS import/export.

Exactly. What method could such a framework use to ask ZFS to import a
pool *now*, but not also automatically at next boot? (How does the
upcoming SC do it?)

Does it just nuke the cache early in the boot and then force-import any
pools that aren't under the cluster solution, or have I just missed a
simple option?

-- 
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Darren J Moffat
2006-Sep-14 15:42 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...

Darren Dunham wrote:
> Exactly. What method could such a framework use to ask ZFS to import a
> pool *now*, but not also automatically at next boot? (How does the
> upcoming SC do it?)

I don't know how Sun Cluster does it, and I don't know where the source
is.

As others have pointed out, you could use the fully supported alternate
root support for this:

     The "zpool create -R" and "zpool import -R" commands allow users to
     create and import a pool with a different root path. By default,
     whenever a pool is created or imported on a system, it is
     permanently added so that it is available whenever the system
     boots. For removable media, or when in recovery situations, this
     may not always be desirable. An alternate root pool *does not
     persist* on the system. Instead, it exists only until exported or
     the system is rebooted, at which point it will have to be imported
     again.

Sounds exactly like what is needed. As I said, I don't know if this is
what Sun Cluster does, but it is a possible way to build an HA-ZFS
solution; just remember not to have the scripts blindly do
"zpool import -Rf" :-)

-- 
Darren J Moffat
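Putting the pieces of this thread together, a hedged sketch of what the start and stop methods of a homegrown failover agent might look like using the alternate-root trick (pool name from the original post; deciding *when* it is safe to import is deliberately left to the framework, since that is exactly the hard part a real cluster product solves):

    # start method on the takeover node:
    # 1. decide, by whatever means the framework provides (heartbeat,
    #    quorum, operator), that this node now owns the pool
    # 2. import with an alternate root of / so the import is temporary
    #    and is not redone automatically at the next boot
    zpool import -R / swimmingpool
    # only add -f after a confirmed failure of the other node; never
    # force blindly, as warned above

    # stop method on a clean handover:
    zpool export swimmingpool

As Darren Dunham notes just below, '-R /' keeps the mount paths unchanged while still avoiding the automatic re-import at boot.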
Darren Dunham
2006-Sep-14 16:14 UTC
[zfs-discuss] Re: ZFS imported simultaneously on 2 systems...
> As others have pointed out, you could use the fully supported alternate
> root support for this:
>
>      The "zpool create -R" and "zpool import -R" commands allow ...

Yes. I tried that. It should work well.

In addition, I'm happy to note that '-R /' appears to be valid, allowing
all the filenames to remain unchanged, but still giving the
no-auto-remount behavior.

> Sounds exactly like what is needed. As I said, I don't know if this is
> what Sun Cluster does, but it is a possible way to build an HA-ZFS
> solution; just remember not to have the scripts blindly do
> "zpool import -Rf" :-)

Sure. Anything that does had better be making its own determination of
which host "owns" the pool independently.

-- 
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Eric Schrock
2006-Sep-14 17:48 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...
On Thu, Sep 14, 2006 at 07:46:33AM -0700, Anton B. Rang wrote:
>
> It looks like 'zpool create -R' should solve the problem for anyone
> who is trying to build their own clustering facility, since it
> prevents the automatic import. Maybe we just need to document that
> more clearly? Calling it "alternate root" doesn't really draw
> attention to it.

See:

	6337656 zpool should have a -t option to allow temporary import

to clarify this behavior.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Robert Milkowski
2006-Sep-19 11:55 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...
Hello Eric,

Thursday, September 14, 2006, 7:48:48 PM, you wrote:

ES> On Thu, Sep 14, 2006 at 07:46:33AM -0700, Anton B. Rang wrote:
>> It looks like 'zpool create -R' should solve the problem for anyone
>> who is trying to build their own clustering facility, since it
>> prevents the automatic import. Maybe we just need to document that
>> more clearly? Calling it "alternate root" doesn't really draw
>> attention to it.

ES> See:

ES>     6337656 zpool should have a -t option to allow temporary import

ES> To clarify this behavior.

What's the difference between -t and -R / ?

-- 
Best regards,
 Robert                            mailto:rmilkowski at task.gda.pl
                                   http://milek.blogspot.com
Eric Schrock
2006-Sep-19 16:00 UTC
[zfs-discuss] Re: Re: ZFS imported simultaneously on 2 systems...

On Tue, Sep 19, 2006 at 01:55:33PM +0200, Robert Milkowski wrote:
>
> What's the difference between -t and -R / ?

Basically nothing implementation-wise, except that you don't have the
weird notion of 'alternate root = /', which doesn't really make any
sense. It's really just a clarification of behavior, without having to
rely on the (documented) side effect of importing with an alternate
root.

Since the behavior is identical to '-R /', you can see why it hasn't
been implemented yet...

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock