===== PROBLEM

To create a disk storage system that will act as an archive point for user data (non-recoverable data), and also act as a back-end storage unit for virtual machines at a block level.

===== BUDGET

Currently I have about 25-30k to start the project; more could be allocated in the next fiscal year, perhaps for a backup solution.

===== TIMEFRAME

I have 8 days to cut a P.O. before our fiscal year ends.

===== STORAGE REQUIREMENTS

5-10 TB of redundant, fairly high-speed storage.

===== QUESTION #1

What is the best way to mirror two ZFS pools in order to achieve a sort of HA storage system? I don't want to have to physically swap my disks into another system if any of the hardware on the ZFS server dies. If I have the following configuration, what is the best way to mirror these in near real time?

BOX 1 (JBOD->ZFS)    BOX 2 (JBOD->ZFS)

I've seen the zfs send and receive commands, but I'm not sure how well they would work for a close-to-real-time mirror.

===== QUESTION #2

Can ZFS be exported via iSCSI, imported as a disk on a Linux system, and then formatted with another file system? I wish to use ZFS as block-level storage for my virtual machines, specifically using Xen. If this is possible, how stable is it? How is error checking handled if the ZFS volume is exported via iSCSI and the block device is formatted ext3? Will ZFS still be able to check for errors? If this all works, is there a way to expand a ZFS iSCSI-exported volume and then expand the ext3 file system on the remote host?

===== QUESTION #3

How does ZFS handle a bad drive? What process must I go through in order to take out a bad drive and replace it with a good one?

===== QUESTION #4

What is a good way to back up this HA storage unit? Snapshots will provide an easy way to do it live, but should it be dumped into a tape library, or a third offsite ZFS pool using zfs send/receive, or something else?

===== QUESTION #5

Does the following setup work?

BOX 1 (JBOD) -> iSCSI export -> BOX 2 (ZFS)

In other words, can I set up a bunch of thin storage boxes with low CPU and RAM instead of using SAS or FC to supply the JBOD to the ZFS server?

=====

I appreciate any advice or answers you might have.
On May 31, 2007, at 12:15 AM, Nathan Huisman wrote:

> ===== PROBLEM
>
> To create a disk storage system that will act as an archive point for
> user data (non-recoverable data), and also act as a back-end storage
> unit for virtual machines at a block level.

<snip>

Here are some tips from me. I notice you mention iSCSI a lot, so I'll stick to that...

Q1: The best way to mirror in real time is to do it from the consumers of the storage, i.e. your iSCSI clients. Implement two storage servers (say, two X4100s with attached disk) and put their disk into zpools. The two servers do not have to know about each other. Configure ZFS file systems identically on both and export them to the client that will use them. Use the software mirroring feature on the client to mirror the two iSCSI shares (e.g. dynamic disks on Windows, LVM on Linux, SVM on Solaris).

What this gives you are two storage servers (ZFS-backed, serving out iSCSI shares), and the client(s) take a share from each and mirror them. If one of the ZFS servers goes kaput, the other is still there actively taking in and serving data. From the client's perspective, it will just look like one side of the mirror went down; after you get the downed ZFS server back up, you initiate the normal mirror reattachment procedure on the client(s). This also allows you to patch your ZFS servers without incurring downtime on your clients.

The disk storage on your two ZFS+iSCSI servers could be anything. Given your budget and space needs, I would suggest looking at the Apple Xserve RAID with 750GB drives. You're a .edu, so the price of these things will likely please you (I just snapped up two of them at my .edu for a really insane price).

Q2: The client will just see the iSCSI share as a raw block device. Put ext3/xfs/jfs on it as you please; to ZFS it is just data. That's the only way you can use iSCSI, really... it's block level, remember. On ZFS, the iSCSI backing store is one large sparse file.

Q3: See the zpool man page, specifically the 'zpool replace ...' command.

Q4: Since (or if) you're doing iSCSI, ZFS snapshots will be of no value to you, since ZFS can't see into those iSCSI backing store files. I'll assume that you have a backup system in place for your existing infrastructure (Networker, NetBackup or what have you), so back up the data from the *clients* and not the ZFS servers. Just space the backup schedule out if you have multiple clients, so that the ZFS+iSCSI servers aren't overloaded with all of their clients reading data at once when backup time rolls around.

Q5: Sure, nothing would stop you from doing that sort of config, but it's something that would make Rube Goldberg smile. Keep out any unneeded complexity and condense the solution. Excuse my ASCII art skills, but consider this:

[JBOD/ARRAY]---(fc)--->[ZFS/iSCSI server 1]---(iscsi share)---[Client        ]
                                                              [mirroring the ]
[JBOD/ARRAY]---(fc)--->[ZFS/iSCSI server 2]---(iscsi share)---[ two shares   ]

Kill one of the JBODs or arrays, OR one of the ZFS+iSCSI servers, and your clients are still in good shape as long as their software mirroring facility behaves.

/dale
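A minimal sketch of the client-side mirroring Dale describes, assuming the ZVOL shareiscsi support in recent OpenSolaris builds on the servers and open-iscsi plus md on a Linux client (Dale's own setup uses file-backed targets instead); every pool, volume, host, and device name below is a placeholder:

  # On each storage server (server1 and server2): carve out a block volume and export it
  zpool create tank mirror c1t0d0 c1t1d0 mirror c1t2d0 c1t3d0
  zfs create -V 500g tank/vmstore        # ZVOL, i.e. a block device backed by the pool
  zfs set shareiscsi=on tank/vmstore     # publish it as an iSCSI target

  # On the Linux client: log in to both targets and mirror them in software
  iscsiadm -m discovery -t sendtargets -p server1
  iscsiadm -m discovery -t sendtargets -p server2
  iscsiadm -m node --login
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.ext3 /dev/md0

If one server dies, md keeps running on the surviving half; once the server is back, a re-add and resync on the client restores the mirror, just as Dale outlines.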
Questions I don''t know answers to are omitted. "I am but a nestling." On 5/31/07, Nathan Huisman <nhuisman at ifa.hawaii.edu> wrote:> ===== STORAGE REQUIREMENTS > > 5-10tb of redundant fairly high speed storageWhat does "high speed" mean? How many users are there for this system? Are they accessing it via Ethernet? FC? Something else? Why the emphasis on iscsi?> ===== QUESTION #2 > > Can ZFS be exported via iscsi and then imported as a disk to a linux > system and then be formated with another file system[?]Yes. It''s in OpenSolaris but not (as I understand it) in Solaris direct from Sun. If running OpenSolaris isn''t an issue (but it probably is) it works out of the box.> ===== QUESTION #3 > > How does zfs handle a bad drive? What process must I go through in > order to take out a bad drive and replace it with a good one?ZFS only notices drives are dead when they''re really dead - they can''t be opened. If a drive is causing intermittent problems (returning bad data and so forth) it won''t get noticed, but ZFS will recover the blocks from mirrors or parity. "zpool replace" should take care of the replacement procedure, or you could keep hot spares online. I can''t comment on hotswapping drives while the machine is on; does this work in general, or require special hardware?> ===== QUESTION #4 > > What is a good way to back up this HA storage unit? Snapshots will > provide an easy way to do it live, but should it be dumped into a tape > library, or an third offsite zfs pool using zfs send/recieve or ?ZFS will be no help if all you''ve got is iscsi targets. You need something that knows what those targets hold; whatever client-OS-based stuff you use other places will do. Otherwise you end up storing/backing up a lot more than you need to - filesystem metadata, et cetera.> ===== QUESTION #5 > > Does the following setup work? > > BOX 1 (JBOD) -> iscsi export -> BOX 2 ZFS. > > In other words, can I setup a bunch of thin storage boxes with low cpu > and ram instead of using sas or fc to supply the jbod to the zfs server?As Dale mentions, this seems overly complicated. Consuming iscsi and producing "different" iscsi doesn''t sound like a good idea to me. Will
Nathan,

Some answers inline...

Nathan Huisman wrote:

> ===== PROBLEM
>
> To create a disk storage system that will act as an archive point for
> user data (non-recoverable data), and also act as a back-end storage
> unit for virtual machines at a block level.
>
> ===== BUDGET
>
> Currently I have about 25-30k to start the project, more could be
> allocated in the next fiscal year for perhaps a backup solution.
>
> ===== TIMEFRAME
>
> I have 8 days to cut a P.O. before our fiscal year ends.
>
> ===== STORAGE REQUIREMENTS
>
> 5-10tb of redundant fairly high speed storage
>
> ===== QUESTION #1
>
> What is the best way to mirror two zfs pools in order to achieve a sort
> of HA storage system? I don't want to have to physically swap my disks
> into another system if any of the hardware on the ZFS server dies. If I
> have the following configuration, what is the best way to mirror these
> in near real time?
>
> BOX 1 (JBOD->ZFS)    BOX 2 (JBOD->ZFS)
>
> I've seen the zfs send and receive commands but I'm not sure how well
> that would work with a close to real time mirror.

If you want close-to-realtime mirroring (across pools in this case), AVS would be a better option in my opinion. Refer to:
http://www.opensolaris.org/os/project/avs/Demos/AVS-ZFS-Demo-V1/

> ===== QUESTION #2
>
> Can ZFS be exported via iscsi and then imported as a disk to a linux
> system and then be formatted with another file system. I wish to use ZFS
> as a block level file system for my virtual machines. Specifically
> using xen. If this is possible, how stable is this? How is error
> checking handled if the zfs is exported via iscsi and then the block
> device formatted to ext3? Will zfs still be able to check for errors?
> If this is possible and this all works, then are there ways to expand a
> zfs iscsi exported volume and then expand the ext3 file system on the
> remote host?

Yes, you can create volumes (ZVOLs) in a zpool and export them over iSCSI. The ZVOL guarantees data consistency at the block level. Expanding the ZVOL should be possible; however, I am not sure if/how iSCSI behaves here. You might need to try it out.

> ===== QUESTION #3
>
> How does zfs handle a bad drive? What process must I go through in
> order to take out a bad drive and replace it with a good one?

# zpool replace <poolname> <bad-drive> <new-good-drive>

The other option would be to configure hot spares; they will kick in automatically when a bad drive is detected.

> ===== QUESTION #4
>
> What is a good way to back up this HA storage unit? Snapshots will
> provide an easy way to do it live, but should it be dumped into a tape
> library, or a third offsite zfs pool using zfs send/receive or ?
>
> ===== QUESTION #5
>
> Does the following setup work?
>
> BOX 1 (JBOD) -> iscsi export -> BOX 2 ZFS.
>
> In other words, can I setup a bunch of thin storage boxes with low cpu
> and ram instead of using sas or fc to supply the jbod to the zfs server?

Should be feasible. Just that you would then need a robust LAN, and it would be flooded.

Thanks and regards,
Sanjeev.

--
Solaris Revenue Products Engineering,
India Engineering Center,
Sun Microsystems India Pvt Ltd.
Tel: x27521 +91 80 669 27521
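To make the hot-spare option above concrete, a minimal sketch; the pool and device names are made up for illustration:

  zpool add tank spare c2t8d0          # keep a spare online; ZFS pulls it in when a drive faults
  zpool status -x                      # quick health check (lists only unhealthy pools)
  zpool replace tank c2t3d0 c2t9d0     # manual replacement of a failed disk with a new one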
Nathan,

Keep in mind the iSCSI target is only in OpenSolaris at this time.

On 05/30/2007 10:15 PM, Nathan Huisman wrote:

<snip>

> ===== QUESTION #1
>
> What is the best way to mirror two zfs pools in order to achieve a sort
> of HA storage system? I don't want to have to physically swap my disks
> into another system if any of the hardware on the ZFS server dies. If I
> have the following configuration, what is the best way to mirror these
> in near real time?
>
> BOX 1 (JBOD->ZFS)    BOX 2 (JBOD->ZFS)
>
> I've seen the zfs send and receive commands but I'm not sure how well
> that would work with a close to real time mirror.

If you want this to be redundant (and very scalable) you will want at least two BOX 1s and two BOX 2s, with IPMP across redundant GbE switches and NICs as well. Do not use zfs send/recv for this. Use Sun Cluster 3.2 for HA-ZFS:

http://docs.sun.com/app/docs/doc/820-0335/6nc35dge2?a=view

There is potential for data loss if the active ZFS node crashes before outstanding transaction groups commit for non-synchronous writes, but the ZVOL (and the ext3 file system on top of it) should not become corrupt (it hasn't happened to me yet). Can someone from the ZFS team comment on this?

> ===== QUESTION #2
>
> Can ZFS be exported via iscsi and then imported as a disk to a linux
> system and then be formatted with another file system. I wish to use ZFS
> as a block level file system for my virtual machines. Specifically
> using xen. If this is possible, how stable is this?

This is possible and is stable in my experience. It scales well if you design your infrastructure correctly.

> How is error
> checking handled if the zfs is exported via iscsi and then the block
> device formatted to ext3? Will zfs still be able to check for errors?

Yes, ZFS will detect/correct block-level errors in ZVOLs as long as you have a redundant zpool configuration (see the note below about LVM).

> If this is possible and this all works, then are there ways to expand a
> zfs iscsi exported volume and then expand the ext3 file system on the
> remote host?

Haven't tested it myself (yet), but it should be possible. You might have to export and re-import the iSCSI target on the Xen dom0 and then resize the ext3 partition (e.g. using 'parted'). If that doesn't work there are other ways to accomplish this.

> ===== QUESTION #3
>
> How does zfs handle a bad drive? What process must I go through in
> order to take out a bad drive and replace it with a good one?

If you have a redundant zpool configuration, you replace the failed disk and then issue a 'zpool replace'.

> ===== QUESTION #4
>
> What is a good way to back up this HA storage unit? Snapshots will
> provide an easy way to do it live, but should it be dumped into a tape
> library, or a third offsite zfs pool using zfs send/receive or ?

Send snapshots to another server that has a RAIDZ (or RAIDZ2) zpool (for backup you want space over performance/redundancy, the opposite of the *MIRRORS* you will want to use between the HA-ZFS cluster and the storage nodes). From that node you can dump to tape, etc.

> ===== QUESTION #5
>
> Does the following setup work?
>
> BOX 1 (JBOD) -> iscsi export -> BOX 2 ZFS.
>
> In other words, can I setup a bunch of thin storage boxes with low cpu
> and ram instead of using sas or fc to supply the jbod to the zfs server?

Yes. And ZFS+iSCSI makes this relatively cheap. I very strongly recommend against using LVM to handle the mirroring: *you will lose the ability to correct data corruption* at the ZFS level.
It also does not scale well, increases complexity, increases cost, and reduces throughput over iSCSI to your ZFS nodes. Leave volume management and redundancy to ZFS. Set up your Xen dom0 boxes to have a redundant path to your ZVOLs over iSCSI. Send your data _one time_ to your ZFS nodes, let ZFS handle the mirroring, and then send that out to your iSCSI LUNs on the storage nodes. Make sure you set up half of each mirror in the zpool with a disk from a separate storage node (see the sketch after this message).

Be wary of layering ZFS/ZVOLs like this. There are multiple ways to set up your storage nodes (plain iscsitadm or using ZVOLs), and if you use ZVOLs on the storage nodes you may want to disable checksums there and leave that to your ZFS nodes.

Other:

- Others have reported that Sil3124-based SATA expansion cards work well with Solaris.
- Test your failover times between ZFS nodes (BOX 2s). Having lots of iSCSI shares/filesystems can make this slow. Hopefully this will improve with parallel zpool device mounting in the future.
- ZVOLs are not sparse by default. I prefer this, but if you really want sparse ZVOLs there is a switch for it in 'zfs create'.
- This will work, but TEST, TEST, TEST for your particular scenario.
- Yes, this can be built for less than $30k US for your storage size requirement.
- I get ~150MB/s throughput on this setup with 2 storage nodes of 6 disks each. It appears as a ~3TB mirror on the ZFS nodes.
- Use build 64 or later, as there is a ZVOL bug in b63 if I'm not mistaken. Probably a good idea to read through the open ZFS bugs, too.
- Rule of thumb: about 1GHz of CPU is needed to saturate a 1GbE link.
- Run 64-bit Solaris and give the ZFS nodes as much RAM as possible.
- Read the documentation.
- ...
- Profit ;)

David Anderson
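A minimal sketch of the layout David describes, assuming ZVOL-backed targets on the storage nodes and the OpenSolaris shareiscsi property; all pool, volume, address, and device names here are placeholders, and the iSCSI LUN names on the ZFS node would really be the long c#t...d0 names that format shows:

  # On each storage node (two of them): one big ZVOL, checksums left to the ZFS node, exported over iSCSI
  zpool create storpool c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  zfs create -V 1500g storpool/lun0
  zfs set checksum=off storpool/lun0
  zfs set shareiscsi=on storpool/lun0

  # On the ZFS node (BOX 2): import one LUN from each storage node and mirror them
  iscsiadm add discovery-address 10.0.0.11
  iscsiadm add discovery-address 10.0.0.12
  iscsiadm modify discovery --sendtargets enable
  zpool create vmpool mirror c4t<LUN-from-node1>d0 c5t<LUN-from-node2>d0
  zfs create -s -V 100g vmpool/xen-guest01    # -s makes the ZVOL sparse, per David's note
  zfs set shareiscsi=on vmpool/xen-guest01    # re-export to the Xen dom0s

Losing either storage node then degrades, but does not stop, vmpool on the ZFS node.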
Since you are doing iSCSI and may not be running ZFS on the initiator (client) side, I highly recommend that you run with IPsec, using at least AH (or ESP with authentication), to protect the transport. Don't assume that your network is reliable. ZFS won't help you here if it isn't running on the iSCSI initiator, and even if it is, it would need two targets to be able to repair.

--
Darren J Moffat
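For what it's worth, a minimal sketch of the sort of policy Darren suggests on Solaris, using AH to authenticate all TCP traffic to the iSCSI target; the address is a placeholder, keys would come from IKE or manual ipseckey entries (not shown), and the exact pattern syntax should be checked against ipsecconf(1M) for your release:

  # /etc/inet/ipsecinit.conf on the initiator (with a mirror-image rule on the target)
  {raddr 10.0.0.11 ulp tcp} ipsec {auth_algs sha1 sa shared}

  # activate the policy
  ipsecconf -a /etc/inet/ipsecinit.conf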
On Thu, 31 May 2007, David Anderson wrote:

.... snip .....

> Other:
> - Others have reported that Sil3124-based SATA expansion cards work well
> with Solaris.

[Sorry - don't mean to hijack this interesting thread]

I believe that there is a serious bug with the si3124 driver that has not been addressed. Ben Rockwood and I have seen it firsthand, and a quick look at the Hg logs shows that si3124.c has not been changed in 6 months.

Basic description of the bug: under heavy load (lots of I/O ops/sec), all data from the drive(s) will completely stop for an extended period of time - 60 to 90+ seconds.

There was a recent discussion of the same issue on the Solaris on x86 list (solarisx86 at yahoogroups.com); several experienced x86ers have seen this bug and found the current driver unusable. Interestingly, one individual said (paraphrased) "don't see any issues" and then later "now I see it and it was there the entire time".

Recommendation: if you plan to use the si3124 driver, test it yourself under heavy load. A simple test with one disk drive will suffice.

In my case, it was plainly obvious with one (ex-Sun M20) drive and a UFS filesystem - all I was doing was tarring up /export/home to another drive. Periodically the tar process would simply stop (iostat went flatline); it looked like the system was going to crash, and then (after 60+ seconds) the tar process continued as if nothing had happened. This was repeated 4 or 5 times before the 'tar cvf' (of around 40Mb of data) completed successfully.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
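A sketch of the sort of load test Al recommends; the paths are placeholders, and the thing to watch for is iostat flatlining for a minute or more while the tar is still running:

  # terminal 1: sustained I/O through the si3124-attached disk
  tar cf /otherdisk/home.tar /export/home

  # terminal 2: per-device activity every 5 seconds; look for extended runs of zero activity
  iostat -xnz 5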
On Thu, 31 May 2007, Darren J Moffat wrote:

> Since you are doing iSCSI and may not be running ZFS on the initiator
> (client) then I highly recommend that you run with IPsec using at least AH
> (or ESP with Authentication) to protect the transport. Don't assume that
> your network is reliable. ZFS won't help you here if it isn't running on the

[Hi Darren]

That's a curious recommendation! You don't think that TCP/IP is reliable enough to provide iSCSI data integrity? What errors and error rates have you seen?

> iSCSI initiator, and even if it is it would need two targets to be able to
> repair.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Al Hopper wrote:

> On Thu, 31 May 2007, Darren J Moffat wrote:
>
>> Since you are doing iSCSI and may not be running ZFS on the initiator
>> (client) then I highly recommend that you run with IPsec using at
>> least AH (or ESP with Authentication) to protect the transport. Don't
>> assume that your network is reliable. ZFS won't help you here if it
>> isn't running on the
>
> [Hi Darren]
>
> That's a curious recommendation! You don't think that TCP/IP is reliable
> enough to provide iSCSI data integrity?

No, I don't. I also don't personally think that the access control model of iSCSI is sufficient, and I trust IPsec more in that respect. Personally I would like to see IPsec AH be the default for all traffic that isn't otherwise doing a cryptographically strong integrity check of its own.

> What errors and error rates have you seen?

I have seen switches flip bits in NFS traffic such that the TCP checksum still matched yet the data was corrupted. One of the ways we saw this was when files were being checked out of SCCS: the SCCS checksum failed. Another way we saw it was the compiler failing to compile untouched code.

Just as with ZFS we don't trust the HBA and the disks to give us correct data, with iSCSI the network is your HBA and cabling, and in part your disk controller as well.

Defence in depth is a common mantra in the security geek world; I take that forward to protecting the data in transit too, even when it isn't purely for security reasons.

--
Darren J Moffat
On 5/31/07, Darren J Moffat <Darren.Moffat at sun.com> wrote:

> Since you are doing iSCSI and may not be running ZFS on the initiator
> (client) then I highly recommend that you run with IPsec using at least
> AH (or ESP with Authentication) to protect the transport. Don't assume
> that your network is reliable. ZFS won't help you here if it isn't
> running on the iSCSI initiator, and even if it is it would need two
> targets to be able to repair.

If you don't intend to encrypt the iSCSI headers/payloads, why not just use the header and data digests that are part of the iSCSI protocol?

Thanks,
- Ryan

--
UNIX Administrator
http://prefetch.net
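A small sketch of what Ryan suggests, as it would look on a Solaris initiator; the option names are from memory of iscsiadm(1M), so treat them as an assumption and verify against the man page:

  # request CRC32 header and data digests for iSCSI sessions from this initiator
  iscsiadm modify initiator-node --headerdigest CRC32
  iscsiadm modify initiator-node --datadigest CRC32

The digests catch in-flight corruption of iSCSI PDUs, but unlike IPsec AH they add no authentication of the peer.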
Al Hopper wrote:

> I believe that there is a serious bug with the si3124 driver that has
> not been addressed. Ben Rockwood and I have seen it firsthand, and a
> quick look at the Hg logs shows that si3124.c has not been changed in
> 6 months.
>
> Basic description of the bug: under heavy load (lots of I/O ops/sec),
> all data from the drive(s) will completely stop for an extended period
> of time - 60 to 90+ seconds.

<snip>

Does the si3124 bug Al mentioned have something to do with the error below? I hit it in the warlock build step of my workspace, but I did nothing to the si3124 code...

warlock -c ../../common/io/warlock/si3124.wlcmd si3124.ll \
    ../sd/sd.ll ../sd/sd_xbuf.ll \
    -l ../scsi/scsi_capabilities.ll -l ../scsi/scsi_control.ll \
    -l ../scsi/scsi_watch.ll -l ../scsi/scsi_data.ll \
    -l ../scsi/scsi_resource.ll -l ../scsi/scsi_subr.ll \
    -l ../scsi/scsi_hba.ll -l ../scsi/scsi_transport.ll \
    -l ../scsi/scsi_confsubr.ll -l ../scsi/scsi_reset_notify.ll \
    -l ../cmlb/cmlb.ll \
    -l ../sata/sata.ll \
    -l ../warlock/ddi_dki_impl.ll
The following variables don't seem to be protected consistently:
dev_info::devi_state
*** Error code 10
make: Fatal error: Command failed for target `si3124.ok'
Current working directory /net/greatwall/workspaces/wifi_rtw/usr/src/uts/intel/si3124
*** Error code 1
The following command caused the error:
cd ../si3124; make clean; make warlock
make: Fatal error: Command failed for target `warlock.sata'
Current working directory /net/greatwall/workspaces/wifi_rtw/usr/src/uts/intel/warlock

- Michael
On Thu, 2007-05-31 at 13:27 +0100, Darren J Moffat wrote:

>> What errors and error rates have you seen?
>
> I have seen switches flip bits in NFS traffic such that the TCP checksum
> still matched yet the data was corrupted. One of the ways we saw this was
> when files were being checked out of SCCS: the SCCS checksum failed.
> Another way we saw it was the compiler failing to compile untouched code.

To be specific, we found that an Ethernet switch in one of our development labs had a tendency to toggle a particular bit in packets going through it. The problem was originally suspected to be a data corruption problem within Solaris itself and got a lot of attention as a result.

In the cases I examined (corrupted source files after SCCS checkout) there were complementary changes (0->1 and 1->0) in the same bit in bytes which were 256, 512, or 1024 bytes apart in the file. Because of the mathematics of the 16-bit ones-complement checksum used by TCP, the packet checksummed to the same value after the switch made these two offsetting changes: the bytes were an even distance apart, so the flipped bits fell in the same column of the 16-bit words, and one flip added the same power of two to the sum that the other removed. (I believe the switch was either inserting or removing a VLAN tag, so the Ethernet CRC had to be recomputed by the switch.)

Once we realized what was going on, we went back, looked at the output of netstat -s, and noticed that the systems in this lab had been dropping an abnormally high number of packets due to bad TCP checksums; only a few of the broken packets were making it through, but there were enough of them to disrupt things in the lab. The problem went away when the suspect switch was taken out of service.

- Bill
Al,

Has there been any resolution to this problem? I get it repeatedly on my 5 x 500GB RAID-Z configuration. I sometimes get port drop/reconnect errors when this occurs.

Gary

This message posted from opensolaris.org