It occurred to me that there are scenarios where it would be useful to be able to "zfs send -i A B" where B is a snapshot older than A. I am trying to design an encrypted disk-based off-site backup solution on top of ZFS, where budget is the primary constraint, and I wish zfs send/recv would allow me to do that. Here is why.

I have a server with 12 hot-swap disk bays. An "onsite" pool has been created on 6 disks, where snapshots of the data to be backed up are periodically taken. Two other "offsite" pools have been created on two other sets of 6 disks; let's give them the names offsite-blue and offsite-red (for use on blue/red, or even/odd, weeks). At least one of the offsite pools is always at the off-site location, while the other one is either in transit or in the server.

Every week a script basically compresses and encrypts the last few snapshots (T-2, T-1, T-0) from onsite to offsite-XXX. Here is an example:

$ rm /offsite-blue/*
$ zfs send onsite@T-2 | gzip | gpg -c >/offsite-blue/T-2.full.gz.gpg
$ zfs send -i T-2 onsite@T-1 | gzip | gpg -c >/offsite-blue/T-1.incr.gz.gpg
$ zfs send -i T-1 onsite@T-0 | gzip | gpg -c >/offsite-blue/T-0.incr.gz.gpg

Then offsite-blue is zfs export'ed and sent to the off-site location, while offsite-red is retrieved from the off-site location and brought back on-site, ready to be used for the next week. My proof-of-concept tests show it works OK, but 2 details are annoying:

o In order to restore the latest snapshot T-0, all the zfs streams, T-2, T-1 and T-0, have to be decrypted, then zfs receive'd. It is slow and inconvenient.

o My example only backs up the last 3 snapshots, but ideally I would like to fit as many as possible in the offsite pool. However, because of the unpredictable compression efficiency, I can't tell which snapshot I should start from when creating the first full stream.

These 2 problems would be non-existent if one could "zfs send -i A B" with B older than A:

$ zfs send onsite@T-0 | gzip | gpg -c >/offsite-blue/T-0.full.gz.gpg
$ zfs send -i T-0 onsite@T-1 | gzip | gpg -c >/offsite-blue/T-1.incr.gz.gpg
$ zfs send -i T-1 onsite@T-2 | gzip | gpg -c >/offsite-blue/T-2.incr.gz.gpg
$ ...  # continue forever, kill zfs(1m) when offsite-blue is 90% full

I have looked at the code, and the restriction "B must be earlier than A" is enforced in dmu_send.c:dmu_sendbackup() [1]. It looks like the code could be reworked to remove it.

Of course, when zfs-crypto ships, it will simplify a lot of things. I could just always send incremental streams and receive them directly on the encrypted pool, and manage the snapshot rotation directly by zfs destroy'ing the old ones, etc.

[1] http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_send.c#232

-marc
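For context, restoring T-0 from that layered layout means replaying every stream in order. A minimal sketch, assuming the stream file names created above and a hypothetical destination pool called restorepool:

$ # Replay full + incremental streams in order; gpg prompts for the
$ # passphrase on each stream, which is part of the inconvenience.
$ for f in T-2.full T-1.incr T-0.incr; do
>     gpg -d /offsite-blue/$f.gz.gpg | gunzip | zfs receive -F restorepool/onsite
> done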
Marc Bevand wrote:
> o In order to restore the latest snapshot T-0, all the zfs streams,
> T-2, T-1 and T-0, have to be decrypted, then zfs receive'd. It is
> slow and inconvenient.

True, but presumably restoring the snapshots is a rare event.

> o My example only backs up the last 3 snapshots, but ideally I would
> like to fit as many as possible in the offsite pool. However, because
> of the unpredictable compression efficiency, I can't tell which
> snapshot I should start from when creating the first full stream.

I thought that your onsite and offsite pools were the same size? If so then you should be able to fit the entire contents of the onsite pool in one of the offsite ones ('zfs send' will inflate the data a small bit, but gzip should more than make up for it).

That said, if you couldn't fit the entire onsite pool in your offsite pool, then you could make use of some additional space accounting data to tell how much space the 'zfs send' streams will take. (Although compression efficiency is still variable, you could probably make a good enough guess.) That's on our long-term to-do list.

Also, if you can afford to waste some space, you could do something like:

zfs send onsite@T-100 | ...
zfs send -i T-100 onsite@T-0 | ...
zfs send -i T-100 onsite@T-99 | ...
zfs send -i T-99 onsite@T-98 | ...
zfs send -i T-98 onsite@T-97 | ...
...

until you run out of space. Of course, you need at least enough space for the oldest and newest snapshots.

Glad to hear that ZFS (and send|recv) is useful to you aside from these issues, and thanks for letting us know about the difficulties too!

--matt
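A minimal sketch of the pattern Matt describes, written as a loop; it assumes weekly snapshots named T-0 (newest) through T-100 (oldest) and the same gzip/gpg pipeline as in the original example, and leaves out the "stop when the pool is ~90% full" check for brevity:

# One old full stream, one "jump" incremental to the newest snapshot,
# then walk the intermediate snapshots forward in time from T-100.
zfs send onsite@T-100        | gzip | gpg -c > /offsite-blue/T-100.full.gz.gpg
zfs send -i T-100 onsite@T-0 | gzip | gpg -c > /offsite-blue/T-0.incr.gz.gpg
i=100
while [ $i -gt 1 ]; do
    j=$((i - 1))
    zfs send -i T-$i onsite@T-$j | gzip | gpg -c > /offsite-blue/T-$j.incr.gz.gpg
    i=$j
done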
Matthew Ahrens <Matthew.Ahrens <at> sun.com> writes:
> True, but presumably restoring the snapshots is a rare event.

You are right, this would only happen in case of disaster and total loss of the backup server.

> I thought that your onsite and offsite pools were the same size? If so then
> you should be able to fit the entire contents of the onsite pool in one of
> the offsite ones.

Well, I simplified the example. In reality, the offsite pool is slightly smaller due to a different number of disks and different sizes.

> Also, if you can afford to waste some space, you could do something like:
>
> zfs send onsite@T-100 | ...
> zfs send -i T-100 onsite@T-0 | ...
> zfs send -i T-100 onsite@T-99 | ...
> zfs send -i T-99 onsite@T-98 | ...
> [...]

Yes, I thought about it. I might do this if the delta between T-100 and T-0 is reasonable.

Oh, and while I am thinking about it: besides "zfs send | gzip | gpg" and zfs-crypto, a 3rd option would be to use ZFS on top of a loficc device (lofi compression & cryptography). I went to the project page, only to realize that they haven't shipped anything yet.

Do you know how hard it would be to implement "zfs send -i A B" with B older than A? Or why hasn't this been done in the first place? I am just being curious here, I can't wait for this feature anyway (even though it would make my life so much simpler).

-marc
Roshan Perera wrote:
> Hi all,
>
> Is there a place where I can find a ZFS best practices guide to use against
> DMX, and a roadmap of ZFS?
> Also, the customer is now looking at big ZFS installations in production.
> Would you guys happen to know, or where can I find, details of the numbers
> of current installations? We are looking at almost 10 terabytes of data
> to be stored on DMX using ZFS (the customer is not comfortable with the RAID-Z
> solution in addition to their best practice of raiding at the DMX level). Any
> feedback, experiences and more importantly gotchas will be much
> appreciated.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

and I know Ben Rockwood (now of Joyent) has blogged about how much storage they're using, all managed with ZFS... I just can't find the blog entry.

Hope this helps,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Roshan Perera wrote:
>
>> But Roshan, if your pool is not replicated from ZFS' point of view,
>> then all the multipathing and raid controller backup in the world will
>> not make a difference.
>
> James, I agree from the ZFS point of view. However, from the EMC or the
> customer point of view, they want to do the replication at the EMC level
> and not from ZFS. By replicating at the ZFS level they will lose some
> storage and it doubles the replication. It's just that the customer is used
> to working with Veritas and UFS and they don't want to change their habits.
> I just have to convince the customer to use ZFS replication.

Hi Roshan,
that's a great shame, because if they actually want to make use of the features of ZFS such as replication, then they need to be serious about configuring their storage to play in the ZFS world... and that means replication that ZFS knows about.

James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Hi All,

We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, due to mirroring being done at the EMC level and not in ZFS. The client is totally EMC committed and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN-attached devices and understand the ZFS behaviour.

Can someone help me with the following questions:

Is this the way ZFS will work in the future?
Is there going to be any compromise to allow SAN RAID and ZFS to do the rest?
If so, when, and if possible, details of it?

Many Thanks

Rgds

Roshan

ZFS work with SAN-attached devices?
>
> Yes, ZFS works with either direct-attached devices or SAN-attached
> devices. However, if your storage pool contains no mirror or RAID-Z
> top-level devices, ZFS can only report checksum errors but cannot
> correct them. If your storage pool consists of mirror or RAID-Z
> devices built using storage from SAN-attached devices, ZFS can report
> and correct checksum errors.
>
> This says that if we are not using ZFS raid or mirror then the
> expected event would be for ZFS to report but not fix the error. In
> our case the system kernel panicked, which is something different. Is
> the FAQ wrong or is there a bug in ZFS?
Roshan,

Could you provide more detail please. The host and zfs should be unaware of any EMC array-side replication, so this sounds more like an EMC misconfiguration than a ZFS problem. Did you look in the messages file to see if anything happened to the devices that were in your zpools? If so then that wouldn't be a zfs error. If your EMC devices fall offline because of something happening on the array or fabric then zfs is not to blame. The same thing would have happened with any other filesystem built on those devices.

What kind of pools were in use, raidz, mirror or simple stripe?

Regards,
Vic

On 6/19/07, Roshan Perera <Roshan.Perera at sun.com> wrote:
> Hi All,
>
> We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, due to mirroring done at the EMC level and not ZFS. The client is totally EMC committed and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN attached devices and understand the ZFS behaviour.
>
> Can someone help me with the following questions:
>
> Is this the way ZFS will work in the future?
> Is there going to be any compromise to allow SAN RAID and ZFS to do the rest?
> If so, when, and if possible, details of it?
>
> [...]
We have the same problem and I have just moved back to UFS because of this issue. According to the engineer at Sun that I spoke with, he implied that there is an RFE out internally that is to address this problem.

The issue is this:

When configuring a zpool with 1 vdev in it, and zfs times out a write operation to the pool/filesystem for whatever reason (possibly just a hold back or a retryable error), the zfs module will cause a system panic because it thinks there are no other mirrors in the pool to write to, and forces a kernel panic.

The way around this is to configure the zpools with mirrors, which negates the use of a hardware raid array and sends twice the amount of data down to the RAID cache that is actually required (because of the mirroring at the ZFS layer). In our case it was a little old Sun StorEdge 3511 FC SATA Array, but the principle applies to any RAID array that is not configured as a JBOD.

Victor Engle wrote:
> Roshan,
>
> Could you provide more detail please. The host and zfs should be
> unaware of any EMC array side replication so this sounds more like an
> EMC misconfiguration than a ZFS problem. Did you look in the messages
> file to see if anything happened to the devices that were in your
> zpools? If so then that wouldn't be a zfs error. If your EMC devices
> fall offline because of something happening on the array or fabric
> then zfs is not to blame. The same thing would have happened with any
> other filesystem built on those devices.
>
> What kind of pools were in use, raidz, mirror or simple stripe?
>
> Regards,
> Vic
>
> [...]
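For illustration, a minimal sketch of the mirrored-zpool workaround Bruce describes, assuming two array-provided LUNs visible under hypothetical device names:

# Mirror two array LUNs at the ZFS level so a failed write to one side
# can be satisfied by the other instead of panicking the system.
# c4t0d0 and c5t0d0 are placeholder device names.
zpool create dbpool mirror c4t0d0 c5t0d0
zpool status dbpool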
Victor,

Thanks for your comments, but I believe it contradicts the ZFS information given below and now Bruce's mail. After some digging around I found that the messages file has thrown out some powerpath errors to one of the devices that may have caused the problem; the errors are attached below. But the question still remains: is ZFS only happy with JBOD disks and not SAN storage with hardware raid?

Thanks

Roshan

Jun 4 16:30:09 su621dwdb ltid[23093]: [ID 815759 daemon.error] Cannot start rdevmi process for remote shared drive operations on host su621dh01, cannot connect to vmd
Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to
Jun 4 16:30:12 su621dwdb last message repeated 1 time
Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0fee to
Jun 4 16:30:12 su621dwdb unix: [ID 836849 kern.notice]
Jun 4 16:30:12 su621dwdb ^Mpanic[cpu550]/thread=2a101dd9cc0:
Jun 4 16:30:12 su621dwdb unix: [ID 809409 kern.notice] ZFS: I/O failure (write on <unknown> off 0: zio 600574e7500 [L0 unallocated] 4000L/400P DVA[0]=<5:55c00:400> DVA[1]=<6:2b800:400> fletcher4 lzjb BE contiguous birth=107027 fill=0 cksum=673200f97f:34804a0e20dc:102879bdcf1d13:3ce1b8dac7357de): error 5
Jun 4 16:30:12 su621dwdb unix: [ID 100000 kern.notice]
Jun 4 16:30:12 su621dwdb genunix: [ID 723222 kern.notice] 000002a101dd9740 zfs:zio_done+284 (600574e7500, 0, a8, 708fdca0, 0, 6000f26cdc0)
Jun 4 16:30:12 su621dwdb genunix: [ID 179002 kern.notice] %l0-3: 0000060015beaf00 00000000708fdc00 0000000000000005 0000000000000005

> We have the same problem and I have just moved back to UFS because of
> this issue. According to the engineer at Sun that I spoke with, he
> implied that there is an RFE out internally that is to address
> this problem.
>
> [...]
Roshan,

As far as I know, there is no problem at all with using SAN storage with ZFS, and it does look like you were having an underlying problem with either powerpath or the array.

The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self healing feature to work. Also there is no fsck-like tool for ZFS, so it is a good idea to make sure self healing can work.

I think first I would track down the cause of the messages just prior to the zfs write error, because even with replicated pools, if several devices error at once then the pool could be lost.

Regards,
Vic

On 6/19/07, Roshan Perera <Roshan.Perera at sun.com> wrote:
> Victor,
> Thanks for your comments, but I believe it contradicts the ZFS information given below and now Bruce's mail.
> After some digging around I found that the messages file has thrown out some powerpath errors to one of the devices that may have caused the problem; the errors are attached below. But the question still remains: is ZFS only happy with JBOD disks and not SAN storage with hardware raid? Thanks
> Roshan
>
> [...]
Victor Engle wrote:
> Roshan,
>
> As far as I know, there is no problem at all with using SAN storage
> with ZFS and it does look like you were having an underlying problem
> with either powerpath or the array.

Correct. A write failed.

> The best practices guide on opensolaris does recommend replicated
> pools even if your backend storage is redundant. There are at least 2
> good reasons for that. ZFS needs a replica for the self healing
> feature to work. Also there is no fsck like tool for ZFS so it is a
> good idea to make sure self healing can work.

Yes, currently ZFS on Solaris will panic if a non-redundant write fails. This is known and being worked on, but there really isn't a good solution if a write fails, unless you have some ZFS-level redundancy.

NB. fsck is not needed for ZFS because the on-disk format is always consistent. This is orthogonal to hardware faults.

> I think first I would track down the cause of the messages just prior
> to the zfs write error because even with replicated pools if several
> devices error at once then the pool could be lost.

Yes, multiple failures can cause data loss. No magic here.
 -- richard
Roshan.Perera at Sun.COM said:
> attached below the errors. But the question still remains is ZFS only happy
> with JBOD disks and not SAN storage with hardware raid. Thanks

ZFS works fine on our SAN here. You do get a kernel panic (Solaris-10U3) if a LUN disappears for some reason (without ZFS-level redundancy), but I understand that bug is fixed in a Nevada build; I'm hoping to see the fix in Solaris-10U4.

Regards,
Marion
> > The best practices guide on opensolaris does recommend replicated
> > pools even if your backend storage is redundant. There are at least 2
> > good reasons for that. ZFS needs a replica for the self healing
> > feature to work. Also there is no fsck like tool for ZFS so it is a
> > good idea to make sure self healing can work.
>
> NB. fsck is not needed for ZFS because the on-disk format is always
> consistent. This is orthogonal to hardware faults.

I understand that the on-disk state is always consistent, but the self healing feature can correct blocks that have bad checksums if zfs is able to retrieve the block from a good replica. So even though the filesystem is consistent, the data can be corrupt in non-redundant pools.

I am unsure of what happens with a non-redundant pool when a block has a bad checksum and perhaps you could clear that up. Does this cause a problem for the pool, or is it limited to the file or files affected by the bad block while otherwise the pool is online and healthy?

Thanks,
Vic
Victor Engle wrote:
>> > The best practices guide on opensolaris does recommend replicated
>> > pools even if your backend storage is redundant. There are at least 2
>> > good reasons for that. ZFS needs a replica for the self healing
>> > feature to work. Also there is no fsck like tool for ZFS so it is a
>> > good idea to make sure self healing can work.
>>
>> NB. fsck is not needed for ZFS because the on-disk format is always
>> consistent. This is orthogonal to hardware faults.
>
> I understand that the on disk state is always consistent but the self
> healing feature can correct blocks that have bad checksums if zfs is
> able to retrieve the block from a good replica.

Yes. That is how it works. By default, metadata is replicated. For real data, you can use copies, mirroring, or raidz[12].

> So even though the
> filesystem is consistent, the data can be corrupt in non-redundant
> pools.

No. If the data is corrupt and cannot be reconstructed, it is lost. Recall that UFS's fsck only corrects file system metadata, not real data. Most file systems which have any kind of performance work this way. ZFS is safer: because of COW, ZFS won't overwrite existing data leading to corruption -- but other file systems can (eg. UFS).

> I am unsure of what happens with a non-redundant pool when a
> block has a bad checksum and perhaps you could clear that up. Does
> this cause a problem for the pool or is it limited to the file or
> files affected by the bad block and otherwise the pool is online and
> healthy.

It depends on where the bad block is. If it isn't being used, no foul [1]. If it is metadata, then we recover because of redundant metadata. If it is in a file with no redundancy (copies=1, by default) then an error will be logged to FMA and the file name is visible to zpool status. You can decide if that file is important to you.

This is an area where there is continuing development, far beyond what ZFS alone can do. The ultimate goal is that we get to the point where most faults can be tolerated. No rest for the weary :-)

[1] this is different than "software RAID" systems which don't know if a block is being used or not. In ZFS, we only care about faults in blocks which are being used, for the most part.
 -- richard
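A short sketch of the per-dataset redundancy Richard mentions; pool and dataset names here are hypothetical. copies=2 stores two copies of each newly written data block even when the pool sits on a single LUN, and zpool status -v lists files hit by unrecoverable checksum errors after a scrub:

# Keep two copies of user data in this dataset (metadata is already
# replicated by default); affects data written after the property is set.
zfs set copies=2 datapool/db

# Scrub the pool, then list any files with unrecoverable errors.
zpool scrub datapool
zpool status -v datapool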
Thanks for all your replies. Lots of info to take back. In this case it seems like emcp carried out a repair to a path to the LUN, followed by a panic:

Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to

I don't think a panic should be the answer in this type of scenario, as there is a redundant path to the LUN and hardware RAID is in place inside the SAN. From what I gather there is work being carried out to find a better solution. What the proposed solution is, and when it will be available, is the question.

Thanks again.

Roshan

----- Original Message -----
From: Richard Elling <Richard.Elling at Sun.COM>
Date: Tuesday, June 19, 2007 6:28 pm
Subject: Re: [zfs-discuss] Re: ZFS - SAN and Raid
To: Victor Engle <victor.engle at gmail.com>
Cc: Bruce McAlister <bruce.mcalister at blueface.ie>, zfs-discuss at opensolaris.org, Roshan Perera <Roshan.Perera at Sun.COM>

> Victor Engle wrote:
> > Roshan,
> >
> > As far as I know, there is no problem at all with using SAN storage
> > with ZFS and it does look like you were having an underlying problem
> > with either powerpath or the array.
>
> Correct. A write failed.
>
> [...]
>
> Yes, currently ZFS on Solaris will panic if a non-redundant write fails.
> This is known and being worked on, but there really isn't a good solution
> if a write fails, unless you have some ZFS-level redundancy.
>
> NB. fsck is not needed for ZFS because the on-disk format is always
> consistent. This is orthogonal to hardware faults.
>
> [...]
Roshan Perera wrote:
> Thanks for all your replies. Lots of info to take back. In this case it
> seems like emcp carried out a repair to a path to the LUN, followed by a
> panic.
>
> Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned
> volume Symm 000290100491 vol 0ffe to
>
> I don't think a panic should be the answer in this type of scenario, as
> there is a redundant path to the LUN and hardware RAID is in place inside
> the SAN. From what I gather there is work being carried out to find a better
> solution. What the proposed solution is, and when it will be available, is
> the question.

But Roshan, if your pool is not replicated from ZFS' point of view, then all the multipathing and raid controller backup in the world will not make a difference.

James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
On Wed, Jun 20, 2007 at 11:16:39AM +1000, James C. McPherson wrote:
> Roshan Perera wrote:
> >
> >I don't think a panic should be the answer in this type of scenario, as
> >there is a redundant path to the LUN and hardware RAID is in place inside
> >the SAN. From what I gather there is work being carried out to find a better
> >solution. What the proposed solution is, and when it will be available, is
> >the question.
>
> But Roshan, if your pool is not replicated from ZFS'
> point of view, then all the multipathing and raid
> controller backup in the world will not make a difference.

If the multipathing is working correctly, and one path to the data remains intact, the SCSI level should retry the write error successfully. This certainly happens with UFS on our fibre-channel SAN. There's usually a SCSI bus reset message along with a message about the failover to the other path. Of course, once the SCSI level exhausts its retries, something else has to happen, just as it would with a physical disk. This must be when ZFS causes a panic.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
> But Roshan, if your pool is not replicated from ZFS'
> point of view, then all the multipathing and raid
> controller backup in the world will not make a difference.

James, I agree from the ZFS point of view. However, from the EMC or the customer point of view, they want to do the replication at the EMC level and not from ZFS. By replicating at the ZFS level they will lose some storage and it doubles the replication. It's just that the customer is used to working with Veritas and UFS and they don't want to change their habits. I just have to convince the customer to use ZFS replication.

Thanks again

> James C. McPherson
> --
> Solaris kernel software engineer
> Sun Microsystems
Hi all,

Is there a place where I can find a ZFS best practices guide to use against DMX, and a roadmap of ZFS?

Also, the customer is now looking at big ZFS installations in production. Would you guys happen to know, or where can I find, details of the numbers of current installations? We are looking at almost 10 terabytes of data to be stored on DMX using ZFS (the customer is not comfortable with the RAID-Z solution, in addition to their best practice of raiding at the DMX level). Any feedback, experiences and, more importantly, gotchas will be much appreciated.

Thanks in advance.

Roshan

----- Original Message -----
From: Roshan Perera <Roshan.Perera at Sun.COM>
Date: Wednesday, June 20, 2007 10:49 am
Subject: Re: [zfs-discuss] Re: ZFS - SAN and Raid
To: James.McPherson at Sun.COM
Cc: Bruce McAlister <bruce.mcalister at blueface.ie>, zfs-discuss at opensolaris.org, Richard Elling <Richard.Elling at Sun.COM>

> > But Roshan, if your pool is not replicated from ZFS'
> > point of view, then all the multipathing and raid
> > controller backup in the world will not make a difference.
>
> James, I agree from the ZFS point of view. However, from the EMC or
> the customer point of view, they want to do the replication at the
> EMC level and not from ZFS. By replicating at the ZFS level they
> will lose some storage and it doubles the replication. It's just that
> the customer is used to working with Veritas and UFS and they don't want
> to change their habits. I just have to convince the customer to
> use ZFS replication.
>
> Thanks again
>
> [...]
James C. McPherson wrote:
> Roshan Perera wrote:
>>
>>> But Roshan, if your pool is not replicated from ZFS' point of view,
>>> then all the multipathing and raid controller backup in the world will
>>> not make a difference.
>>
>> James, I agree from the ZFS point of view. However, from the EMC or the
>> customer point of view, they want to do the replication at the EMC level
>> and not from ZFS. By replicating at the ZFS level they will lose some
>> storage and it doubles the replication. It's just that the customer is used
>> to working with Veritas and UFS and they don't want to change their habits.
>> I just have to convince the customer to use ZFS replication.
>
> Hi Roshan,
> that's a great shame because if they actually want
> to make use of the features of ZFS such as replication,
> then they need to be serious about configuring their
> storage to play in the ZFS world.... and that means
> replication that ZFS knows about.

Also, how does replication at the ZFS level use more storage - I'm assuming raw block - than at the array level?
> On 6/20/07, Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
> Also, how does replication at the ZFS level use more storage - I'm
> assuming raw block - than at the array level?

Just to add to the previous comments. In the case where you have a SAN array providing storage to a host for use with ZFS, the SAN storage really needs to be redundant in the array AND the zpools need to be redundant pools.

The reason the SAN storage should be redundant is that SAN arrays are designed to serve logical units. The logical units are usually allocated from a raid set, storage pool or aggregate of some kind. The array-side pool/aggregate may include 10 300GB disks and may have 100+ luns allocated from it, for example. If redundancy is not used in the array-side pool/aggregate, then 1 disk failure will kill 100+ luns at once.

On 6/20/07, Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
> [...]
>
> Also, how does replication at the ZFS level use more storage - I'm
> assuming raw block - than at the array level?
On Wed, Jun 20, 2007 at 12:23:18PM -0400, Torrey McMahon wrote:
> [...]
>
> Also, how does replication at the ZFS level use more storage - I'm
> assuming raw block - than at the array level?

SAN storage generally doesn't work that way. They use some magical redundancy scheme, which may be RAID-5 or WAFL, from which the Storage Administrator carves out virtual disks. These are best viewed as an array of blocks. All disk administration, such as replacing failed disks, takes place on the storage device without affecting the virtual disks. There's no need for disk administration or additional redundancy on the client side. If more space is needed on the client, the Storage Administrator simply expands the virtual disk by extending its blocks. ZFS needs to play nicely in this environment because that's what's available in large organizations that have centralized their storage. Asking for raw disks doesn't work.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Victor Engle wrote:
>> On 6/20/07, Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
>> Also, how does replication at the ZFS level use more storage - I'm
>> assuming raw block - than at the array level?
>
> Just to add to the previous comments. In the case where you have a SAN
> array providing storage to a host for use with ZFS, the SAN storage
> really needs to be redundant in the array AND the zpools need to be
> redundant pools.
>
> The reason the SAN storage should be redundant is that SAN arrays are
> designed to serve logical units. The logical units are usually
> allocated from a raid set, storage pool or aggregate of some kind. The
> array-side pool/aggregate may include 10 300GB disks and may have 100+
> luns allocated from it, for example. If redundancy is not used in the
> array-side pool/aggregate, then 1 disk failure will kill 100+ luns
> at once.

That makes a lot of sense in configurations where an array is exporting LUNs built on raid volumes to a set of heterogeneous hosts. If you're direct connected to a single box running ZFS, or a set of boxes running ZFS, you probably want to export something as close to the raw disks as possible while maintaining ZFS-level redundancy. (Like two R5 LUNs in a ZFS mirror.) Creating a raid set, carving out lots of LUNs and then handing them all over to ZFS isn't going to buy you a lot and could cause performance issues. (LUN skew, for example.)
Gary Mills wrote:
> On Wed, Jun 20, 2007 at 12:23:18PM -0400, Torrey McMahon wrote:
>
>> [...]
>> Also, how does replication at the ZFS level use more storage - I'm
>> assuming raw block - than at the array level?
>
> SAN storage generally doesn't work that way. They use some magical
> redundancy scheme, which may be RAID-5 or WAFL, from which the Storage
> Administrator carves out virtual disks. These are best viewed as an
> array of blocks. All disk administration, such as replacing failed
> disks, takes place on the storage device without affecting the virtual
> disks. There's no need for disk administration or additional
> redundancy on the client side. If more space is needed on the client,
> the Storage Administrator simply expands the virtual disk by extending
> its blocks. ZFS needs to play nicely in this environment because
> that's what's available in large organizations that have centralized
> their storage. Asking for raw disks doesn't work.

Are we talking about replication - I have a copy of my data on another system - or redundancy - I have a system where I can tolerate a local failure?

...and I understand the ZFS has to play nice with HW raid argument. :)
Hi all,

I am after some help/feedback on the subject issue explained below.

We are in the process of migrating a big DB2 database from a 6900 (24 x 200MHz CPUs) with Veritas FS and 8TB of storage on Solaris 8, to a 25K (12 dual-core CPUs x 1800MHz) with ZFS and 8TB of SAN storage (compressed & RAID-Z) on Solaris 10.

Unfortunately, we are having massive performance problems with the new solution. It all points towards IO and ZFS.

A couple of questions relating to ZFS:

1. What is the impact of using ZFS compression? What percentage of system resources is required, and how much of an overhead is this as opposed to non-compression? In our case DB2 does a similar amount of reads and writes.
2. Unfortunately we are using RAID twice (SAN-level RAID and RAID-Z) to overcome the panic problem from my previous post (for which I had a good response).
3. Any way of monitoring ZFS performance other than iostat?
4. Any help on ZFS tuning in this kind of environment, like caching etc.?

Would appreciate any feedback/help on where to go next. If this cannot be resolved we may have to go back to VxFS, which would be a shame.

Thanks in advance.
On 6/26/07, Roshan Perera <Roshan.Perera at sun.com> wrote:
> 25K 12 CPU dual core x 1800Mhz with ZFS 8TB storage SAN storage (compressed & RaidZ) Solaris 10.

RaidZ is a poor choice for database apps in my opinion; due to the way it handles checksums on raidz stripes, it must read every disk in order to satisfy small reads that traditional raid-5 would only have to read a single disk for. Raid-Z doesn't have the terrible write performance of raid 5, because you can stick small writes together and then do full-stripe writes, but by the same token you must do full-stripe reads, all the time. That's how I understand it, anyways. Thus, raidz is a poor choice for a database application which tends to do a lot of small reads.

Using mirrors (at the zfs level, not the SAN level) would probably help with this. Mirrors each get their own copy of the data, each with its own checksum, so you can read a small block by touching only one disk.

What is your vdev setup like right now? 'zpool list', in other words. How wide are your stripes? Is the SAN doing raid-1ish things with the disks, or something else?

> 2. Unfortunately we are using twice RAID (San level Raid and RaidZ) to overcome the panic problem my previous blog (for which I had good response).

Can you convince the customer to give ZFS a chance to do things its way? Let the SAN export raw disks, and make two- or three-way mirrored vdevs out of them.

> 3. Any way of monitoring ZFS performance other than iostat ?

In a word, yes. What are you interested in? DTrace or 'zpool iostat' (which reports activity of individual disks within the pool) may prove interesting.

Will
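To illustrate both suggestions, a small sketch; the pool and emcpowerNh device names are taken from the configuration posted later in the thread and stand in for whatever LUNs the array actually exports:

# Per-device activity for an existing pool, sampled every 5 seconds.
zpool iostat -v datapool1 5

# What a mirrored layout built from SAN LUNs could look like instead of
# an 8-wide raidz1 (pool would have to be recreated; names are placeholders).
zpool create datapool1 \
    mirror emcpower8h  emcpower9h  \
    mirror emcpower10h emcpower11h \
    mirror emcpower12h emcpower13h \
    mirror emcpower14h emcpower15h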
Hi Will,

Thanks for your reply. The customer has an EMC SAN solution and will not change their current layout. Therefore, asking the customer to give RAW disks to ZFS is a no-no. Hence the RAID-Z configuration as opposed to RAID-5. I have given some stats below. I know it's a bit difficult to troubleshoot with the type of data you have, but any input would be much appreciated.

zpool list
NAME         SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
datapool1   2.12T    707G   1.43T    32%  ONLINE  -
datapool2   2.12T    706G   1.44T    32%  ONLINE  -
datapool3   2.12T    702G   1.44T    32%  ONLINE  -
datapool4   2.12T    701G   1.44T    32%  ONLINE  -
dumppool     272G    171G    101G    62%  ONLINE  -
localpool     68G   12.5G   55.5G    18%  ONLINE  -
logpool      272G    157G    115G    57%  ONLINE  -

zfs get all datapool1
NAME       PROPERTY       VALUE                  SOURCE
datapool1  type           filesystem             -
datapool1  creation       Fri Jun  8 18:46 2007  -
datapool1  used           615G                   -
datapool1  available      1.22T                  -
datapool1  referenced     42.6K                  -
datapool1  compressratio  2.08x                  -
datapool1  mounted        no                     -
datapool1  quota          none                   default
datapool1  reservation    none                   default
datapool1  recordsize     128K                   default
datapool1  mountpoint     none                   local
datapool1  sharenfs       off                    default
datapool1  checksum       on                     default
datapool1  compression    on                     local
datapool1  atime          on                     default
datapool1  devices        on                     default
datapool1  exec           on                     default
datapool1  setuid         on                     default
datapool1  readonly       off                    default
datapool1  zoned          off                    default
datapool1  snapdir        hidden                 default
datapool1  aclmode        groupmask              default
datapool1  aclinherit     secure                 default

[su621dwdb/root] zpool status -v
  pool: datapool1
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool1        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower8h   ONLINE       0     0     0
            emcpower9h   ONLINE       0     0     0
            emcpower10h  ONLINE       0     0     0
            emcpower11h  ONLINE       0     0     0
            emcpower12h  ONLINE       0     0     0
            emcpower13h  ONLINE       0     0     0
            emcpower14h  ONLINE       0     0     0
            emcpower15h  ONLINE       0     0     0

errors: No known data errors

  pool: datapool2
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool2        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower16h  ONLINE       0     0     0
            emcpower17h  ONLINE       0     0     0
            emcpower18h  ONLINE       0     0     0
            emcpower19h  ONLINE       0     0     0
            emcpower20h  ONLINE       0     0     0
            emcpower21h  ONLINE       0     0     0
            emcpower22h  ONLINE       0     0     0
            emcpower23h  ONLINE       0     0     0

errors: No known data errors

  pool: datapool3
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool3        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower24h  ONLINE       0     0     0
            emcpower25h  ONLINE       0     0     0
            emcpower26h  ONLINE       0     0     0
            emcpower27h  ONLINE       0     0     0
            emcpower28h  ONLINE       0     0     0
            emcpower29h  ONLINE       0     0     0
            emcpower30h  ONLINE       0     0     0
            emcpower31h  ONLINE       0     0     0

errors: No known data errors

  pool: datapool4
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool4        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower32h  ONLINE       0     0     0
            emcpower33h  ONLINE       0     0     0
            emcpower34h  ONLINE       0     0     0
            emcpower35h  ONLINE       0     0     0
            emcpower36h  ONLINE       0     0     0
            emcpower37h  ONLINE       0     0     0
            emcpower38h  ONLINE       0     0     0
            emcpower39h  ONLINE       0     0     0

errors: No known data errors

  pool: dumppool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        dumppool     ONLINE       0     0     0
          c5t10d0    ONLINE       0     0     0
          c5t11d0    ONLINE       0     0     0
          c6t10d0    ONLINE       0     0     0
          c6t11d0    ONLINE       0     0     0

errors: No known data errors

  pool: localpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        localpool   ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t9d0  ONLINE       0     0     0
            c3t9d0  ONLINE       0     0     0

errors: No known data errors

  pool: logpool
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        logpool         ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            emcpower0h  ONLINE       0     0     0
            emcpower1h  ONLINE       0     0     0
            emcpower2h  ONLINE       0     0     0
            emcpower3h  ONLINE       0     0     0
            emcpower4h  ONLINE       0     0     0
            emcpower5h  ONLINE       0     0     0
            emcpower6h  ONLINE       0     0     0
            emcpower7h  ONLINE       0     0     0

errors: No known data errors
[su621dwdb/root]

----- Original Message -----
From: Will Murnane <will.murnane at gmail.com>
Date: Tuesday, June 26, 2007 2:00 pm
Subject: Re: [zfs-discuss] ZFS - DB2 Performance
To: Roshan Perera <Roshan.Perera at Sun.COM>
Cc: zfs-discuss at opensolaris.org

> [...]
>
> > 3. Any way of monitoring ZFS performance other than iostat ?
> In a word, yes. What are you interested in? DTrace or 'zpool iostat'
> (which reports activity of individual disks within the pool) may prove
> interesting.

Thanks...

> Will
On Jun 26, 2007, at 4:26 AM, Roshan Perera wrote:
> Hi all,
>
> I am after some help/feedback on the subject issue explained below.
>
> We are in the process of migrating a big DB2 database from a
>
> 6900 24 x 200MHz CPUs with Veritas FS 8TB of storage Solaris 8 to
> 25K 12 CPU dual core x 1800Mhz with ZFS 8TB storage SAN storage
> (compressed & RaidZ) Solaris 10.
>
> Unfortunately, we are having massive performance problems with the
> new solution. It all points towards IO and ZFS.
>
> A couple of questions relating to ZFS:
> 1. What is the impact of using ZFS compression? Percentage of
> system resources required, how much of an overhead is this as
> opposed to non-compression. In our case DB2 does a similar amount of
> reads and writes.
> 2. Unfortunately we are using RAID twice (SAN-level RAID and RAID-Z)
> to overcome the panic problem from my previous post (for which I had
> a good response).
> 3. Any way of monitoring ZFS performance other than iostat?
> 4. Any help on ZFS tuning in this kind of environment like caching
> etc?

Have you looked at:

http://blogs.sun.com/realneel/entry/zfs_and_databases
http://blogs.sun.com/realneel/entry/zfs_and_databases_time_for

?

eric

> Would appreciate any feedback/help on where to go next.
> If this cannot be resolved we may have to go back to VxFS which
> would be a shame.
>
> Thanks in advance.
Possibly the storage is flushing the write caches when it should not. Until we get a fix, cache flushing could be disabled in the storage (ask the vendor for the magic incantation). If that's not forthcoming, and if all pools are attached to NVRAM-protected devices, then these /etc/system evil tunables might help.

In older Solaris releases we have:

        set zfs:zil_noflush = 1

On newer releases:

        set zfs:zfs_nocacheflush = 1

If you implement this, do place a comment that this is a temporary workaround waiting for bug 6462690 to be fixed.

About compression: I don't have the numbers, but a reasonable guess would be that it consumes roughly 1 GHz of CPU to compress 100MB/sec. This will of course depend on the type of data being compressed.

-r

Roshan Perera writes:
> Hi all,
>
> I am after some help/feedback on the subject issue explained below.
>
> We are in the process of migrating a big DB2 database from a
>
> 6900 24 x 200MHz CPUs with Veritas FS 8TB of storage Solaris 8 to
> 25K 12 CPU dual core x 1800Mhz with ZFS 8TB storage SAN storage (compressed & RaidZ) Solaris 10.
>
> Unfortunately, we are having massive performance problems with the new solution. It all points towards IO and ZFS.
>
> [...]
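As a concrete sketch of the /etc/system workaround Roch describes (the tunable names are from his message; only applicable when every pool sits on NVRAM-protected array LUNs, and only one of the two lines should be active depending on the release):

* /etc/system -- temporary workaround, remove once bug 6462690 is fixed.
* Only safe when all pools are on NVRAM-protected (battery-backed) storage.
* Older Solaris releases:
set zfs:zil_noflush = 1
* Newer releases use this tunable instead:
* set zfs:zfs_nocacheflush = 1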
At what Solaris 10 level (patch/update) was the "single-threaded compression"
situation resolved? Could you be hitting that one?

-- MikeE
> > Roshan Perera writes:
> > Hi all,
> >
> > I am after some help/feedback on the subject issue explained below.
> >
> > We are in the process of migrating a big DB2 database from a
> > 6900 24 x 200MHz CPUs with Veritas FS and 8TB of storage on Solaris 8 to a
> > 25K 12 dual-core CPUs x 1800MHz with ZFS, 8TB of SAN storage
> > (compressed & RAID-Z), Solaris 10.

200MHz!? You mean 1200MHz ;) The slowest CPUs in a 6900 were 900MHz
UltraSPARC III Cu.

You mention Veritas FS ... as in the Veritas filesystem, vxfs? I suppose you
also include vmsa, or the whole Storage Foundation? (It could still be vxva on
Solaris 8! Oh, those were the days...)

First impressions of the system: well, it's fair to say that you have some
extra CPU power (and then some). The old UltraSPARC III at 1.2GHz was nice,
but by no means a screamer. (Years ago...)

> > Unfortunately, we are having massive performance problems with the new
> > solution. It all points towards I/O and ZFS.

Yep... CPU it isn't. Keep in mind that you have now completely moved the goal
posts when it comes to comparing performance with the previous installation.
Not only do you have a large increase in CPU performance, Solaris 10 will
blitz Solaris 8 on a bad day by miles. With all of the CPU/OS bottlenecks
removed, I sure hope you have decent I/O at the back...

> > A couple of questions relating to ZFS:
> > 1. What is the impact of using ZFS compression? What percentage of system
> > resources does it require, and how much overhead does it add compared to
> > no compression? In our case DB2 does a similar amount of reads and writes.

I'm unsure why a person who buys a 24-core 25K would activate compression on
an OLTP database. Surely when you fork out that kind of cash you want to get
every bang for your buck (and then some!). I don't think compression was
created with high-performance OLTP databases in mind. I would hope the 25K
(which in this case is light years faster than the 6900) wasn't spec'ed with
the idea of spending the extra CPU cycles on compression... oooh... *crash*
*burn*.

> > 2. Unfortunately we are using RAID twice (SAN-level RAID and RAID-Z) to
> > overcome the panic problem from my previous post (for which I had a good
> > response).

I've yet to deploy a DB on ZFS in production, so I cannot comment on the
real-world performance. What I can comment on are some basic things. RAID on
top of RAID seems silly, especially RAID-Z: it's just not as fast as a mirror
or stripe when it comes to a decent DB workout. Are you sure you want to go
with ZFS ... any real reason to go that way now? I would wait for U4, and give
the machine/storage a good workout with SVM and UFS/DirectIO. Yep, it's a
bastard to manage, but very little can touch it when it comes to pure
performance. With so many $$$ standing on the datacentre floor, I'd forget
about technology for now and let common sense and good business practice
prevail.

> > 3. Is there any way of monitoring ZFS performance other than iostat?

DTrace gurus can comment... however, iostat should suffice.

> > 4. Any help on ZFS tuning in this kind of environment, like caching, etc.?

As was posted, read the blogs on ZFS and databases.

> > Would appreciate any feedback/help on where to go next.
> > If this cannot be resolved we may have to go back to VxFS, which would
> > be a shame.

By the way ... if the client has already purchased vmsa/vxfs (oh my word, how
much was that!) then I'm unsure what ZFS will bring to the party, apart from
saving the yearly $$$ for updates, patches and support. Is that the idea?
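On the compression point, it's cheap to check what compression is actually buying before deciding to rip it out; a quick sketch, using a hypothetical dataset name tank/db2:

    $ zfs get compression,compressratio tank/db2
    $ zfs set compression=off tank/db2

Bear in mind that turning compression off only affects blocks written from then on; existing data stays compressed until it is rewritten, so a fair before/after comparison needs a reload or at least a fresh working set.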
It's not like SF is bad... but 8TB on a decently configured storage unit is
not so big that you couldn't give it a go with SVM, especially if you want to
save money on Storage Foundation.

I'm sure I'm preaching to the converted here, but DB performance problems
usually reside inside the storage architecture... I've seldom found a system
wanting in the CPU department if the architect wasn't a moron. With the
upgrade I see here, all the pressure will move to the back end (bar a bad
configuration).

If you want to speed up a regular OLTP DB... fiddle with the I/O :)

2c
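On the monitoring question, besides plain iostat there is also zpool iostat for a pool/vdev-level view; something like the following, with a hypothetical pool name tank:

    $ zpool iostat -v tank 5    # per-vdev read/write ops and bandwidth, every 5 seconds
    $ iostat -xnz 5             # the underlying LUNs: service times and %busy

Comparing the two is a quick way to see whether the pool is actually pushing the back-end storage hard, or whether the time is going somewhere else.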
> Victor Engle wrote:
> > Roshan,
> >
> > As far as I know, there is no problem at all with using SAN storage
> > with ZFS, and it does look like you were having an underlying problem
> > with either PowerPath or the array.
>
> Correct. A write failed.
>
> > The best practices guide on opensolaris does recommend replicated
> > pools even if your backend storage is redundant. There are at least two
> > good reasons for that. ZFS needs a replica for the self-healing feature
> > to work. Also, there is no fsck-like tool for ZFS, so it is a good idea
> > to make sure self-healing can work.
>
> Yes, currently ZFS on Solaris will panic if a non-redundant write fails.
> This is known and being worked on, but there really isn't a good solution
> if a write fails, unless you have some ZFS-level redundancy.

Why not? If O_DSYNC applies, a write() can still fail with EIO, right? And if
O_DSYNC does not apply, an application could not assume the written data was
on stable storage anyway. Or the write() could simply block until the problem
is corrected (if correctable) or the system is rebooted. In any case, IMO
there ought to be some sort of consistent behavior possible short of a panic.
I've seen UFS-based systems stay up even with their disks incommunicado for a
while, although they were hardly useful like that except for activity that
only read already-cached pages.