hi everyone,

I am planning on creating a local SAN via NFS(v4) and several redundant nodes.
I have been using DRBD on Linux before and am asking whether some of you have
experience with on-demand network filesystem mirrors.

I have little Solaris sysadmin know-how yet, but I am interested in whether
there is on-demand support for sending snapshots, i.e. not via a cron job, but
via some kind of filesystem change notification system. Is this merely a hack,
or can it be used to build some sort of failover?

E.g. DRBD has a master/slave option, which can be configured easily. Something
like this would be nice out of the box: in case of failure another node becomes
the master, and when the former master comes back it simply becomes the slave,
so that both have the current data available again.

Any pointers to solutions in that area are greatly appreciated.

-- Jakob
On September 18, 2006 5:45:08 PM +0200 Jakob Praher <jp at hapra.at> wrote:

> hi everyone,
>
> I am planning on creating a local SAN via NFS(v4) and several redundant
> nodes.

huh. How do you create a SAN with NFS?

> I have been using DRBD on Linux before and am asking whether some of you
> have experience with on-demand network filesystem mirrors.
>
> I have little Solaris sysadmin know-how yet, but I am interested in whether
> there is on-demand support for sending snapshots, i.e. not via a cron job,
> but via some kind of filesystem change notification system.

AFAIK, Solaris does not export file change notification to userland in any
way that would be useful for on-demand filesystem replication. From looking
at drbd for 5 minutes, it looks like the kind of notification that
windows/linux/macos provides isn't what drbd uses; it does BLOCK LEVEL
replication, and part of the software is a kernel module to export that data
to userspace. It sounds like that distinction doesn't matter for what you are
trying to achieve, and I believe that this block-by-block duplication isn't a
great idea for zfs anyway. It might be neat if zfs could inform userland of
each new txg.

> Is this merely a hack, or can it be used to build some sort of failover?
>
> E.g. DRBD has a master/slave option, which can be configured easily.
> Something like this would be nice out of the box: in case of failure
> another node becomes the master, and when the former master comes back it
> simply becomes the slave, so that both have the current data available
> again.
>
> Any pointers to solutions in that area are greatly appreciated.

See if <http://blogs.sun.com/timf/entry/zfs_automatic_snapshots_now_with>
comes close.

I have 2 setups, one using SC 3.2 with a SAN (both systems can access the
same filesystem; yes, it's not as redundant as a remote node and remote
filesystem, but it's for HA, not DR). I could add another JBOD to the SAN and
configure zfs to mirror between the two enclosures to get rid of the SPoF of
the JBOD backplane/midplane, but it's not worth it.

The other setup is using my own cron script (zfs send | zfs recv) to send
snapshots to a "remote" (just another server in the same rack) host. This is
for a service that also has very high availability requirements but where I
can't afford shared storage. I do a homegrown heartbeat and failover thing.
I'm looking at replacing the cron script with the SMF service linked above,
but I'm in no rush since the cron job works quite well.

If zfs is otherwise a good solution for you, you might want to consider
whether you really need true on-demand replication. Maybe 5-minute or even
1-minute recency is good enough. I would imagine that you don't actually get
too much better than 30s with drbd anyway, since outside of fsync() data
doesn't actually make it to disk (and then get replicated by drbd) more
frequently than that for some generic application.

-frank
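PS: in case it's useful, the guts of my cron script are roughly the following.
This is simplified and retyped from memory, so treat it as a sketch; the
dataset and host names are made up, and check zfs(1M) for the exact send/recv
options on your build:

   #!/bin/sh
   # take a snapshot of tank/data and ship it to the standby host
   DS=tank/data
   HOST=standby1
   NOW="snap.`date +%Y%m%d%H%M`"
   # find the most recent snapshot we made previously (if any)
   LAST=`zfs list -H -o name -t snapshot | grep "^$DS@snap\." | sort | tail -1 | cut -d@ -f2`

   zfs snapshot $DS@$NOW

   if [ -n "$LAST" ]; then
       # incremental send from the previous snapshot
       # (recv -F rolls the target back to its last snapshot first;
       #  whether your bits have -F is something to check)
       zfs send -i $DS@$LAST $DS@$NOW | ssh $HOST zfs recv -F $DS
   else
       # first run: a full send creates the dataset on the standby
       zfs send $DS@$NOW | ssh $HOST zfs recv $DS
   fi

Old snapshots need to be pruned on both sides too, or the pools eventually
fill up; I left that part out.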
Frank Cusack wrote:
> On September 18, 2006 5:45:08 PM +0200 Jakob Praher <jp at hapra.at> wrote:
>> hi everyone,
>>
>> I am planning on creating a local SAN via NFS(v4) and several redundant
>> nodes.
>
> huh. How do you create a SAN with NFS?

Not to get into a semantic holy war on acronyms but, in the past, I have seen
NFS grouped under the SAN umbrella. Most people hear SAN and think FC block,
but recall that the "S" stands for Storage. Not very common, but it happens.

--
Torrey McMahon
Sun Microsystems Inc.
Frank Cusack wrote:
> On September 18, 2006 5:45:08 PM +0200 Jakob Praher <jp at hapra.at> wrote:
>
> huh. How do you create a SAN with NFS?

Sorry, okay, it would be Network Attached Storage, not the other way round.
I guess you are right.

BUT since we are discussing NFS for distributed storage: what is your
experience with NFSv4 performance as a storage node? How well does the
current Solaris NFSv4 stack interoperate with the Linux stack? Would you go
for that?

What about iSCSI on top of ZFS, is that an option? I did some research on
iSCSI vs. NFSv4 once and found that the overhead of transporting the fs
metadata (in the NFSv4 case) is not the real problem for many scenarios.
Especially the COMPOUND messages should help here.

>> I have been using DRBD on Linux before and am asking whether some of you
>> have experience with on-demand network filesystem mirrors.
>
> AFAIK, Solaris does not export file change notification to userland in any
> way that would be useful for on-demand filesystem replication. From looking
> at drbd for 5 minutes, it looks like the kind of notification that
> windows/linux/macos provides isn't what drbd uses; it does BLOCK LEVEL
> replication, and part of the software is a kernel module to export that
> data to userspace. It sounds like that distinction doesn't matter for what
> you are trying to achieve, and I believe that this block-by-block
> duplication isn't a great idea for zfs anyway. It might be neat if zfs
> could inform userland of each new txg.

Yes, exactly. It is a block device driver that replicates, so it sits right
underneath Linux's VFS. Okay, that is something I wanted to know.

Are there any good heartbeat control apps for Solaris out there? I mean, if I
want to have failover (even if it is a little bit cheap) it should detect
failures and react accordingly. Switching from sender to receiver should not
be difficult, given that all you need is to make ZFS snapshots, and that is
really cheap in ZFS (rough sketch in the PS below).

>> Is this merely a hack, or can it be used to build some sort of failover?
>>
>> E.g. DRBD has a master/slave option, which can be configured easily.
>> Something like this would be nice out of the box: in case of failure
>> another node becomes the master, and when the former master comes back it
>> simply becomes the slave, so that both have the current data available
>> again.
>>
>> Any pointers to solutions in that area are greatly appreciated.
>
> See if <http://blogs.sun.com/timf/entry/zfs_automatic_snapshots_now_with>
> comes close.
>
> I have 2 setups, one using SC 3.2 with a SAN (both systems can access the
> same filesystem; yes, it's not as redundant as a remote node and remote
> filesystem, but it's for HA, not DR). I could add another JBOD to the SAN
> and configure zfs to mirror between the two enclosures to get rid of the
> SPoF of the JBOD backplane/midplane, but it's not worth it.

JBOD, SPoF - what are these things?

> The other setup is using my own cron script (zfs send | zfs recv) to send
> snapshots to a "remote" (just another server in the same rack) host. This
> is for a service that also has very high availability requirements but
> where I can't afford shared storage. I do a homegrown heartbeat and
> failover thing. I'm looking at replacing the cron script with the SMF
> service linked above, but I'm in no rush since the cron job works quite
> well.
>
> If zfs is otherwise a good solution for you, you might want to consider
> whether you really need true on-demand replication. Maybe 5-minute or even
> 1-minute recency is good enough. I would imagine that you don't actually
> get too much better than 30s with drbd anyway, since outside of fsync()
> data doesn't actually make it to disk (and then get replicated by drbd)
> more frequently than that for some generic application.

Okay, I think zfs is nice. I am using xfs+lvm2 on my Linux boxes so far; that
works nicely too.

SMF is the init.d replacement in Solaris, right? What would that look like?
What would SMF do beyond restarting your app if it fails? Would you rather
have a background task running instead of kicking it off with cron?

Thanks
-- Jakob
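PS: regarding switching roles between sender and receiver - from reading the
zfs docs, I imagine something roughly like this (untested, and the host and
dataset names are made up):

   # on the standby, after the master has died: the last received copy of
   # tank/data is already a normal filesystem, so just start serving it
   zfs set sharenfs=on tank/data

   # when the old master comes back, reverse the replication direction:
   # (on the old master) throw away the stale copy
   zfs destroy -r tank/data
   # (on the new master) resync the old master with a full send
   zfs snapshot tank/data@resync
   zfs send tank/data@resync | ssh oldmaster zfs recv tank/data

Does that sound about right, or am I missing something?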
Jakob Praher wrote:
> Frank Cusack wrote:
>> On September 18, 2006 5:45:08 PM +0200 Jakob Praher <jp at hapra.at> wrote:
>
>> huh. How do you create a SAN with NFS?
> Sorry, okay, it would be Network Attached Storage, not the other way round.
> I guess you are right.
>
> BUT since we are discussing NFS for distributed storage: what is your
> experience with NFSv4 performance as a storage node? How well does the
> current Solaris NFSv4 stack interoperate with the Linux stack? Would you go
> for that?

Depends on what your application is for NFS performance. The Solaris NFS
stack can easily saturate a 1GbE link doing straight I/O. Heavy metadata
workloads obviously won't hit that number.

If you follow the NFSv4 IETF working group, you'll see that the NFSv4 people
have been meeting about every 4 months for over 6 years to do
interoperability testing. So anyone who has a serious NFSv4 stack (Sun,
Netapp, Linux, IBM, Hummingbird, etc.) interoperates great with others that
have a serious stack. Your best bet, as always, is to try your specific
application and see if it performs the way you want (a trivial throughput
sanity check is at the very bottom of this mail).

> What about iSCSI on top of ZFS, is that an option? I did some research on
> iSCSI vs. NFSv4 once and found that the overhead of transporting the fs
> metadata (in the NFSv4 case) is not the real problem for many scenarios.
> Especially the COMPOUND messages should help here.

Compound messages may help in the future, but I don't think anyone has fully
taken advantage of them yet - most VFSs are the same, and it's a little
tricky given the historical part of the kernel. We've integrated some things
in the Solaris kernel to take advantage of them and have thrown around other
ideas that haven't made it quite yet.

eric

>>> I have been using DRBD on Linux before and am asking whether some of you
>>> have experience with on-demand network filesystem mirrors.
>>
>> AFAIK, Solaris does not export file change notification to userland in any
>> way that would be useful for on-demand filesystem replication. From
>> looking at drbd for 5 minutes, it looks like the kind of notification that
>> windows/linux/macos provides isn't what drbd uses; it does BLOCK LEVEL
>> replication, and part of the software is a kernel module to export that
>> data to userspace. It sounds like that distinction doesn't matter for what
>> you are trying to achieve, and I believe that this block-by-block
>> duplication isn't a great idea for zfs anyway. It might be neat if zfs
>> could inform userland of each new txg.
>
> Yes, exactly. It is a block device driver that replicates, so it sits right
> underneath Linux's VFS. Okay, that is something I wanted to know.
>
> Are there any good heartbeat control apps for Solaris out there? I mean, if
> I want to have failover (even if it is a little bit cheap) it should detect
> failures and react accordingly. Switching from sender to receiver should
> not be difficult, given that all you need is to make ZFS snapshots, and
> that is really cheap in ZFS.
>
>>> Is this merely a hack, or can it be used to build some sort of failover?
>>>
>>> E.g. DRBD has a master/slave option, which can be configured easily.
>>> Something like this would be nice out of the box: in case of failure
>>> another node becomes the master, and when the former master comes back it
>>> simply becomes the slave, so that both have the current data available
>>> again.
>>>
>>> Any pointers to solutions in that area are greatly appreciated.
>>
>> See if <http://blogs.sun.com/timf/entry/zfs_automatic_snapshots_now_with>
>> comes close.
>>
>> I have 2 setups, one using SC 3.2 with a SAN (both systems can access the
>> same filesystem; yes, it's not as redundant as a remote node and remote
>> filesystem, but it's for HA, not DR). I could add another JBOD to the SAN
>> and configure zfs to mirror between the two enclosures to get rid of the
>> SPoF of the JBOD backplane/midplane, but it's not worth it.
>
> JBOD, SPoF - what are these things?
>
>> The other setup is using my own cron script (zfs send | zfs recv) to send
>> snapshots to a "remote" (just another server in the same rack) host. This
>> is for a service that also has very high availability requirements but
>> where I can't afford shared storage. I do a homegrown heartbeat and
>> failover thing. I'm looking at replacing the cron script with the SMF
>> service linked above, but I'm in no rush since the cron job works quite
>> well.
>>
>> If zfs is otherwise a good solution for you, you might want to consider
>> whether you really need true on-demand replication. Maybe 5-minute or even
>> 1-minute recency is good enough. I would imagine that you don't actually
>> get too much better than 30s with drbd anyway, since outside of fsync()
>> data doesn't actually make it to disk (and then get replicated by drbd)
>> more frequently than that for some generic application.
>
> Okay, I think zfs is nice. I am using xfs+lvm2 on my Linux boxes so far;
> that works nicely too.
>
> SMF is the init.d replacement in Solaris, right? What would that look like?
> What would SMF do beyond restarting your app if it fails? Would you rather
> have a background task running instead of kicking it off with cron?
>
> Thanks
> -- Jakob
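PS: the trivial throughput sanity check I mentioned above - nothing
scientific, just something like this from a client (paths are made up; use a
file larger than the client's RAM so you are not just measuring cache):

   # write 2 GB over the NFS mount, then read it back
   dd if=/dev/zero of=/mnt/nfs/testfile bs=1024k count=2048
   dd if=/mnt/nfs/testfile of=/dev/null bs=1024k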
On September 21, 2006 10:48:34 AM +0200 Jakob Praher <jp at hapra.at> wrote:
> Frank Cusack wrote:
>> On September 18, 2006 5:45:08 PM +0200 Jakob Praher <jp at hapra.at> wrote:
>
> BUT since we are discussing NFS for distributed storage: what is your
> experience with NFSv4 performance as a storage node? How well does the
> current Solaris NFSv4 stack interoperate with the Linux stack? Would you go
> for that?

My last knowledge of Linux NFSv4 vs. Solaris NFSv4 is that they don't
interoperate. This was about a year ago. I've always had to force Linux to
v3.

> Are there any good heartbeat control apps for Solaris out there?

Sun Cluster and Veritas VCS come to mind. I use ucarp for a homegrown
solution.

> JBOD, SPoF - what are these things?

Wow. Just a Bunch Of Disks. Single Point of Failure.

> SMF is the init.d replacement in Solaris, right? What would that look like?
> What would SMF do beyond restarting your app if it fails? Would you rather
> have a background task running instead of kicking it off with cron?

<http://opensolaris.org/os/community/smf/>

-frank
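PS: in case it helps, my homegrown heartbeat/failover is basically ucarp
moving a service address between the two boxes. Roughly like this - the
addresses, interface and script names are made up, so check the ucarp man
page for the exact options:

   # run on both nodes; the scripts plumb/unplumb the shared address
   ucarp --interface=bge0 --srcip=10.0.0.11 --vhid=1 --pass=secret \
         --addr=10.0.0.10 \
         --upscript=/opt/local/bin/vip-up.sh \
         --downscript=/opt/local/bin/vip-down.sh

   # vip-up.sh on the node becoming master does something like:
   #   ifconfig bge0 addif 10.0.0.10/24 up
   #   zfs set sharenfs=on tank/data

As for SMF vs. cron: SMF gives you dependencies, automatic restart, and a
single place to enable/disable the whole thing (svccfg import the manifest,
then svcadm enable/disable the service). The link above explains it better
than I can.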
Frank Cusack wrote:
> On September 21, 2006 10:48:34 AM +0200 Jakob Praher <jp at hapra.at> wrote:
>
>> BUT since we are discussing NFS for distributed storage: what is your
>> experience with NFSv4 performance as a storage node? How well does the
>> current Solaris NFSv4 stack interoperate with the Linux stack? Would you
>> go for that?
>
> My last knowledge of Linux NFSv4 vs. Solaris NFSv4 is that they don't
> interoperate. This was about a year ago. I've always had to force Linux to
> v3.

They interoperate just fine. The only weird thing is how the linux people
implemented their pseudo-filesystem - make sure to add "fsid=0" to your
exports via 'exportfs'. So to transition from v3 to v4, they require an
administrative change on your part, or otherwise your OpenSolaris clients
won't be able to mount the linux server. And yes, they are planning on fixing
it.

http://wiki.linux-nfs.org/index.php/Nfsv4_configuration
http://blogs.sun.com/macrbg/date/20051020

If you see something that doesn't work, let us or the linux people know.

eric
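To make that concrete, it looks roughly like this (server path and hostname
made up; the wiki above has the full recipe):

   # /etc/exports on the linux server - fsid=0 marks the NFSv4 pseudo-root
   /export   *(rw,sync,fsid=0)

   # re-export
   exportfs -ra

   # on the Solaris client, paths are relative to that pseudo-root,
   # so mounting the top of the export looks like:
   mount -F nfs -o vers=4 linuxserver:/ /mnt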