It occurred to me that there are scenarios where it would be useful to be able to "zfs send -i A B" where B is a snapshot older than A. I am trying to design an encrypted disk-based off-site backup solution on top of ZFS, where budget is the primary constraint, and I wish zfs send/recv would allow me to do that. Here is why.

I have a server with 12 hot-swap disk bays. An "onsite" pool has been created on 6 disks, where snapshots of the data to be backed up are periodically taken. Two other "offsite" pools have been created on two other sets of 6 disks; let's give them the names offsite-blue and offsite-red (for use on blue/red, or even/odd, weeks). At least one of the offsite pools is always at the off-site location, while the other one is either in transit or in the server.

Every week a script basically compresses and encrypts the last few snapshots (T-2, T-1, T-0) from onsite to offsite-XXX. Here is an example:

$ rm /offsite-blue/*
$ zfs send onsite@T-2 | gzip | gpg -c >/offsite-blue/T-2.full.gz.gpg
$ zfs send -i T-2 onsite@T-1 | gzip | gpg -c >/offsite-blue/T-1.incr.gz.gpg
$ zfs send -i T-1 onsite@T-0 | gzip | gpg -c >/offsite-blue/T-0.incr.gz.gpg

Then offsite-blue is zfs export'ed and sent to the off-site location, while offsite-red is retrieved from the off-site location and brought back on-site, ready to be used for the next week. My proof-of-concept tests show it works OK, but 2 details are annoying:

o In order to restore the latest snapshot T-0, all the zfs streams, T-2, T-1 and T-0, have to be decrypted, then zfs receive'd. It is slow and inconvenient.

o My example only backs up the last 3 snapshots, but ideally I would like to fit as many as possible in the offsite pool. However, because of the unpredictable compression efficiency, I can't tell which snapshot I should start from when creating the first full stream.

These 2 problems would be non-existent if one could "zfs send -i A B" with B older than A:

$ zfs send onsite@T-0 | gzip | gpg -c >/offsite-blue/T-0.full.gz.gpg
$ zfs send -i T-0 onsite@T-1 | gzip | gpg -c >/offsite-blue/T-1.incr.gz.gpg
$ zfs send -i T-1 onsite@T-2 | gzip | gpg -c >/offsite-blue/T-2.incr.gz.gpg
$ ...  # continue forever, kill zfs(1m) when offsite-blue is 90% full

I have looked at the code, and the restriction "B must be earlier than A" is enforced in dmu_send.c:dmu_sendbackup() [1]. It looks like the code could be reworked to remove it.

Of course, when zfs-crypto ships, it will simplify a lot of things. I could just always send incremental streams and receive them directly on the encrypted pool, and manage the snapshot rotation directly by zfs destroy'ing the old ones, etc.

[1] http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/dmu_send.c#232

-marc
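For context, restoring T-0 from that layered layout means replaying every stream in order. A minimal sketch, assuming the stream file names created above and a hypothetical destination pool called restorepool:

$ # Replay full + incremental streams in order; gpg prompts for the
$ # passphrase on each stream, which is part of the inconvenience.
$ for f in T-2.full T-1.incr T-0.incr; do
>     gpg -d /offsite-blue/$f.gz.gpg | gunzip | zfs receive -F restorepool/onsite
> done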
Marc Bevand wrote:
> o In order to restore the latest snapshot T-0, all the zfs streams,
> T-2, T-1 and T-0, have to be decrypted, then zfs receive'd. It is
> slow and inconvenient.

True, but presumably restoring the snapshots is a rare event.

> o My example only backs up the last 3 snapshots, but ideally I would
> like to fit as many as possible in the offsite pool. However, because
> of the unpredictable compression efficiency, I can't tell which
> snapshot I should start from when creating the first full stream.

I thought that your onsite and offsite pools were the same size? If so then you should be able to fit the entire contents of the onsite pool in one of the offsite ones ('zfs send' will inflate the data a small bit, but gzip should more than make up for it).

That said, if you couldn't fit the entire onsite pool in your offsite pool, then you could make use of some additional space accounting data to tell how much space the 'zfs send' streams will take. (Although compression efficiency is still variable, you could probably make a good enough guess.) That's on our long-term to-do list.

Also, if you can afford to waste some space, you could do something like:

zfs send onsite@T-100 | ...
zfs send -i T-100 onsite@T-0 | ...
zfs send -i T-100 onsite@T-99 | ...
zfs send -i T-99 onsite@T-98 | ...
zfs send -i T-98 onsite@T-97 | ...
...

until you run out of space. Of course, you need at least enough space for the oldest and newest snapshots.

Glad to hear that ZFS (and send|recv) is useful to you aside from these issues, and thanks for letting us know about the difficulties too!

--matt
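A minimal sketch of the pattern Matt describes, written as a loop; it assumes weekly snapshots named T-0 (newest) through T-100 (oldest) and the same gzip/gpg pipeline as in the original example, and leaves out the "stop when the pool is ~90% full" check for brevity:

# One old full stream, one "jump" incremental to the newest snapshot,
# then walk the intermediate snapshots forward in time from T-100.
zfs send onsite@T-100        | gzip | gpg -c > /offsite-blue/T-100.full.gz.gpg
zfs send -i T-100 onsite@T-0 | gzip | gpg -c > /offsite-blue/T-0.incr.gz.gpg
i=100
while [ $i -gt 1 ]; do
    j=$((i - 1))
    zfs send -i T-$i onsite@T-$j | gzip | gpg -c > /offsite-blue/T-$j.incr.gz.gpg
    i=$j
done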
Matthew Ahrens <Matthew.Ahrens <at> sun.com> writes:
> True, but presumably restoring the snapshots is a rare event.

You are right, this would only happen in case of disaster and total loss of the backup server.

> I thought that your onsite and offsite pools were the same size? If so then
> you should be able to fit the entire contents of the onsite pool in one of
> the offsite ones.

Well, I simplified the example. In reality, the offsite pool is slightly smaller due to a different number of disks and different sizes.

> Also, if you can afford to waste some space, you could do something like:
>
> zfs send onsite@T-100 | ...
> zfs send -i T-100 onsite@T-0 | ...
> zfs send -i T-100 onsite@T-99 | ...
> zfs send -i T-99 onsite@T-98 | ...
> [...]

Yes, I thought about it. I might do this if the delta between T-100 and T-0 is reasonable.

Oh, and while I am thinking about it: besides "zfs send | gzip | gpg" and zfs-crypto, a 3rd option would be to use ZFS on top of a loficc device (lofi compression & cryptography). I went to the project page, only to realize that they haven't shipped anything yet.

Do you know how hard it would be to implement "zfs send -i A B" with B older than A? Or why hasn't this been done in the first place? I am just being curious here, I can't wait for this feature anyway (even though it would make my life so much simpler).

-marc
Roshan Perera wrote:
> Hi all,
>
> Is there a place where I can find a ZFS best practices guide to use against
> DMX, and a roadmap of ZFS?
> Also, the customer is now looking at big ZFS installations in production.
> Would you guys happen to know, or where can I find, details of the numbers
> of current installations? We are looking at almost 10 terabytes of data
> to be stored on DMX using ZFS (the customer is not comfortable with the RAID-Z
> solution in addition to their best practice of raiding at the DMX level). Any
> feedback, experiences and more importantly gotchas will be much
> appreciated.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

and I know Ben Rockwood (now of Joyent) has blogged about how much storage they're using, all managed with ZFS... I just can't find the blog entry.

Hope this helps,
James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Roshan Perera wrote:
>
>> But Roshan, if your pool is not replicated from ZFS' point of view,
>> then all the multipathing and raid controller backup in the world will
>> not make a difference.
>
> James, I agree from the ZFS point of view. However, from the EMC or the
> customer point of view, they want to do the replication at the EMC level
> and not from ZFS. By replicating at the ZFS level they will lose some
> storage and it doubles the replication. It's just that the customer is used
> to working with Veritas and UFS and they don't want to change their habits.
> I just have to convince the customer to use ZFS replication.

Hi Roshan,
that's a great shame, because if they actually want to make use of the features of ZFS such as replication, then they need to be serious about configuring their storage to play in the ZFS world... and that means replication that ZFS knows about.

James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
Hi All,

We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, due to mirroring being done at the EMC level and not in ZFS. The client is totally EMC committed and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN-attached devices and understand the ZFS behaviour.

Can someone help me with the following questions:

Is this the way ZFS will work in the future?
Is there going to be any compromise to allow SAN RAID and ZFS to do the rest?
If so, when, and if possible, details of it?

Many Thanks

Rgds

Roshan

ZFS work with SAN-attached devices?
>
> Yes, ZFS works with either direct-attached devices or SAN-attached
> devices. However, if your storage pool contains no mirror or RAID-Z
> top-level devices, ZFS can only report checksum errors but cannot
> correct them. If your storage pool consists of mirror or RAID-Z
> devices built using storage from SAN-attached devices, ZFS can report
> and correct checksum errors.
>
> This says that if we are not using ZFS raid or mirror then the
> expected event would be for ZFS to report but not fix the error. In
> our case the system kernel panicked, which is something different. Is
> the FAQ wrong or is there a bug in ZFS?
Roshan,

Could you provide more detail please. The host and zfs should be unaware of any EMC array-side replication, so this sounds more like an EMC misconfiguration than a ZFS problem. Did you look in the messages file to see if anything happened to the devices that were in your zpools? If so then that wouldn't be a zfs error. If your EMC devices fall offline because of something happening on the array or fabric then zfs is not to blame. The same thing would have happened with any other filesystem built on those devices.

What kind of pools were in use, raidz, mirror or simple stripe?

Regards,
Vic

On 6/19/07, Roshan Perera <Roshan.Perera at sun.com> wrote:
> Hi All,
>
> We have come across a problem at a client where ZFS brought the system down with a write error on an EMC device, due to mirroring done at the EMC level and not ZFS. The client is totally EMC committed and not too happy to use ZFS for mirroring/RAID-Z. I have seen the notes below about ZFS and SAN attached devices and understand the ZFS behaviour.
>
> Can someone help me with the following questions:
>
> Is this the way ZFS will work in the future?
> Is there going to be any compromise to allow SAN RAID and ZFS to do the rest?
> If so, when, and if possible, details of it?
>
> [...]
We have the same problem and I have just moved back to UFS because of this issue. According to the engineer at Sun that I spoke with, he implied that there is an RFE out internally that is to address this problem.

The issue is this:

When configuring a zpool with 1 vdev in it, and zfs times out a write operation to the pool/filesystem for whatever reason (possibly just a hold back or a retryable error), the zfs module will cause a system panic because it thinks there are no other mirrors in the pool to write to, and forces a kernel panic.

The way around this is to configure the zpools with mirrors, which negates the use of a hardware raid array and sends twice the amount of data down to the RAID cache that is actually required (because of the mirroring at the ZFS layer). In our case it was a little old Sun StorEdge 3511 FC SATA Array, but the principle applies to any RAID array that is not configured as a JBOD.

Victor Engle wrote:
> Roshan,
>
> Could you provide more detail please. The host and zfs should be
> unaware of any EMC array side replication so this sounds more like an
> EMC misconfiguration than a ZFS problem. Did you look in the messages
> file to see if anything happened to the devices that were in your
> zpools? If so then that wouldn't be a zfs error. If your EMC devices
> fall offline because of something happening on the array or fabric
> then zfs is not to blame. The same thing would have happened with any
> other filesystem built on those devices.
>
> What kind of pools were in use, raidz, mirror or simple stripe?
>
> Regards,
> Vic
>
> [...]
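For illustration, a minimal sketch of the mirrored-zpool workaround Bruce describes, assuming two array-provided LUNs visible under hypothetical device names:

# Mirror two array LUNs at the ZFS level so a failed write to one side
# can be satisfied by the other instead of panicking the system.
# c4t0d0 and c5t0d0 are placeholder device names.
zpool create dbpool mirror c4t0d0 c5t0d0
zpool status dbpool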
Victor,

Thanks for your comments, but I believe it contradicts the ZFS information given below and now Bruce's mail. After some digging around I found that the messages file has thrown out some powerpath errors to one of the devices that may have caused the problem; the errors are attached below. But the question still remains: is ZFS only happy with JBOD disks and not SAN storage with hardware raid?

Thanks

Roshan

Jun 4 16:30:09 su621dwdb ltid[23093]: [ID 815759 daemon.error] Cannot start rdevmi process for remote shared drive operations on host su621dh01, cannot connect to vmd
Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to
Jun 4 16:30:12 su621dwdb last message repeated 1 time
Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0fee to
Jun 4 16:30:12 su621dwdb unix: [ID 836849 kern.notice]
Jun 4 16:30:12 su621dwdb ^Mpanic[cpu550]/thread=2a101dd9cc0:
Jun 4 16:30:12 su621dwdb unix: [ID 809409 kern.notice] ZFS: I/O failure (write on <unknown> off 0: zio 600574e7500 [L0 unallocated] 4000L/400P DVA[0]=<5:55c00:400> DVA[1]=<6:2b800:400> fletcher4 lzjb BE contiguous birth=107027 fill=0 cksum=673200f97f:34804a0e20dc:102879bdcf1d13:3ce1b8dac7357de): error 5
Jun 4 16:30:12 su621dwdb unix: [ID 100000 kern.notice]
Jun 4 16:30:12 su621dwdb genunix: [ID 723222 kern.notice] 000002a101dd9740 zfs:zio_done+284 (600574e7500, 0, a8, 708fdca0, 0, 6000f26cdc0)
Jun 4 16:30:12 su621dwdb genunix: [ID 179002 kern.notice] %l0-3: 0000060015beaf00 00000000708fdc00 0000000000000005 0000000000000005

> We have the same problem and I have just moved back to UFS because of
> this issue. According to the engineer at Sun that I spoke with, he
> implied that there is an RFE out internally that is to address
> this problem.
>
> [...]
Roshan,

As far as I know, there is no problem at all with using SAN storage with ZFS, and it does look like you were having an underlying problem with either powerpath or the array.

The best practices guide on opensolaris does recommend replicated pools even if your backend storage is redundant. There are at least 2 good reasons for that. ZFS needs a replica for the self healing feature to work. Also there is no fsck-like tool for ZFS, so it is a good idea to make sure self healing can work.

I think first I would track down the cause of the messages just prior to the zfs write error, because even with replicated pools, if several devices error at once then the pool could be lost.

Regards,
Vic

On 6/19/07, Roshan Perera <Roshan.Perera at sun.com> wrote:
> Victor,
> Thanks for your comments, but I believe it contradicts the ZFS information given below and now Bruce's mail.
> After some digging around I found that the messages file has thrown out some powerpath errors to one of the devices that may have caused the problem; the errors are attached below. But the question still remains: is ZFS only happy with JBOD disks and not SAN storage with hardware raid? Thanks
> Roshan
>
> [...]
Victor Engle wrote:
> Roshan,
>
> As far as I know, there is no problem at all with using SAN storage
> with ZFS and it does look like you were having an underlying problem
> with either powerpath or the array.

Correct. A write failed.

> The best practices guide on opensolaris does recommend replicated
> pools even if your backend storage is redundant. There are at least 2
> good reasons for that. ZFS needs a replica for the self healing
> feature to work. Also there is no fsck like tool for ZFS so it is a
> good idea to make sure self healing can work.

Yes, currently ZFS on Solaris will panic if a non-redundant write fails. This is known and being worked on, but there really isn't a good solution if a write fails, unless you have some ZFS-level redundancy.

NB. fsck is not needed for ZFS because the on-disk format is always consistent. This is orthogonal to hardware faults.

> I think first I would track down the cause of the messages just prior
> to the zfs write error because even with replicated pools if several
> devices error at once then the pool could be lost.

Yes, multiple failures can cause data loss. No magic here.
 -- richard
Roshan.Perera at Sun.COM said:
> attached below the errors. But the question still remains is ZFS only happy
> with JBOD disks and not SAN storage with hardware raid. Thanks

ZFS works fine on our SAN here. You do get a kernel panic (Solaris-10U3) if a LUN disappears for some reason (without ZFS-level redundancy), but I understand that bug is fixed in a Nevada build; I'm hoping to see the fix in Solaris-10U4.

Regards,
Marion
> > The best practices guide on opensolaris does recommend replicated
> > pools even if your backend storage is redundant. There are at least 2
> > good reasons for that. ZFS needs a replica for the self healing
> > feature to work. Also there is no fsck like tool for ZFS so it is a
> > good idea to make sure self healing can work.
>
> NB. fsck is not needed for ZFS because the on-disk format is always
> consistent. This is orthogonal to hardware faults.

I understand that the on-disk state is always consistent, but the self healing feature can correct blocks that have bad checksums if zfs is able to retrieve the block from a good replica. So even though the filesystem is consistent, the data can be corrupt in non-redundant pools.

I am unsure of what happens with a non-redundant pool when a block has a bad checksum and perhaps you could clear that up. Does this cause a problem for the pool, or is it limited to the file or files affected by the bad block while otherwise the pool is online and healthy?

Thanks,
Vic
Victor Engle wrote:
>> > The best practices guide on opensolaris does recommend replicated
>> > pools even if your backend storage is redundant. There are at least 2
>> > good reasons for that. ZFS needs a replica for the self healing
>> > feature to work. Also there is no fsck like tool for ZFS so it is a
>> > good idea to make sure self healing can work.
>>
>> NB. fsck is not needed for ZFS because the on-disk format is always
>> consistent. This is orthogonal to hardware faults.
>
> I understand that the on disk state is always consistent but the self
> healing feature can correct blocks that have bad checksums if zfs is
> able to retrieve the block from a good replica.

Yes. That is how it works. By default, metadata is replicated. For real data, you can use copies, mirroring, or raidz[12].

> So even though the
> filesystem is consistent, the data can be corrupt in non-redundant
> pools.

No. If the data is corrupt and cannot be reconstructed, it is lost. Recall that UFS's fsck only corrects file system metadata, not real data. Most file systems which have any kind of performance work this way. ZFS is safer: because of COW, ZFS won't overwrite existing data leading to corruption -- but other file systems can (eg. UFS).

> I am unsure of what happens with a non-redundant pool when a
> block has a bad checksum and perhaps you could clear that up. Does
> this cause a problem for the pool or is it limited to the file or
> files affected by the bad block and otherwise the pool is online and
> healthy.

It depends on where the bad block is. If it isn't being used, no foul [1]. If it is metadata, then we recover because of redundant metadata. If it is in a file with no redundancy (copies=1, by default) then an error will be logged to FMA and the file name is visible to zpool status. You can decide if that file is important to you.

This is an area where there is continuing development, far beyond what ZFS alone can do. The ultimate goal is that we get to the point where most faults can be tolerated. No rest for the weary :-)

[1] this is different than "software RAID" systems which don't know if a block is being used or not. In ZFS, we only care about faults in blocks which are being used, for the most part.
 -- richard
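A short sketch of the per-dataset redundancy Richard mentions; pool and dataset names here are hypothetical. copies=2 stores two copies of each newly written data block even when the pool sits on a single LUN, and zpool status -v lists files hit by unrecoverable checksum errors after a scrub:

# Keep two copies of user data in this dataset (metadata is already
# replicated by default); affects data written after the property is set.
zfs set copies=2 datapool/db

# Scrub the pool, then list any files with unrecoverable errors.
zpool scrub datapool
zpool status -v datapool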
Thanks for all your replies. Lots of info to take back. In this case it seems like emcp carried out a repair to a path to the LUN, followed by a panic:

Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned volume Symm 000290100491 vol 0ffe to

I don't think a panic should be the answer in this type of scenario, as there is a redundant path to the LUN and hardware RAID is in place inside the SAN. From what I gather there is work being carried out to find a better solution. What the proposed solution is, and when it will be available, is the question.

Thanks again.

Roshan

----- Original Message -----
From: Richard Elling <Richard.Elling at Sun.COM>
Date: Tuesday, June 19, 2007 6:28 pm
Subject: Re: [zfs-discuss] Re: ZFS - SAN and Raid
To: Victor Engle <victor.engle at gmail.com>
Cc: Bruce McAlister <bruce.mcalister at blueface.ie>, zfs-discuss at opensolaris.org, Roshan Perera <Roshan.Perera at Sun.COM>

> Victor Engle wrote:
> > Roshan,
> >
> > As far as I know, there is no problem at all with using SAN storage
> > with ZFS and it does look like you were having an underlying problem
> > with either powerpath or the array.
>
> Correct. A write failed.
>
> [...]
>
> Yes, currently ZFS on Solaris will panic if a non-redundant write fails.
> This is known and being worked on, but there really isn't a good solution
> if a write fails, unless you have some ZFS-level redundancy.
>
> NB. fsck is not needed for ZFS because the on-disk format is always
> consistent. This is orthogonal to hardware faults.
>
> [...]
Roshan Perera wrote:
> Thanks for all your replies. Lots of info to take back. In this case it
> seems like emcp carried out a repair to a path to the LUN, followed by a
> panic.
>
> Jun 4 16:30:12 su621dwdb emcp: [ID 801593 kern.notice] Info: Assigned
> volume Symm 000290100491 vol 0ffe to
>
> I don't think a panic should be the answer in this type of scenario, as
> there is a redundant path to the LUN and hardware RAID is in place inside
> the SAN. From what I gather there is work being carried out to find a better
> solution. What the proposed solution is, and when it will be available, is
> the question.

But Roshan, if your pool is not replicated from ZFS' point of view, then all the multipathing and raid controller backup in the world will not make a difference.

James C. McPherson
--
Solaris kernel software engineer
Sun Microsystems
On Wed, Jun 20, 2007 at 11:16:39AM +1000, James C. McPherson wrote:
> Roshan Perera wrote:
> >
> >I don't think a panic should be the answer in this type of scenario, as
> >there is a redundant path to the LUN and hardware RAID is in place inside
> >the SAN. From what I gather there is work being carried out to find a better
> >solution. What the proposed solution is, and when it will be available, is
> >the question.
>
> But Roshan, if your pool is not replicated from ZFS'
> point of view, then all the multipathing and raid
> controller backup in the world will not make a difference.

If the multipathing is working correctly, and one path to the data remains intact, the SCSI level should retry the write error successfully. This certainly happens with UFS on our fibre-channel SAN. There's usually a SCSI bus reset message along with a message about the failover to the other path. Of course, once the SCSI level exhausts its retries, something else has to happen, just as it would with a physical disk. This must be when ZFS causes a panic.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
> But Roshan, if your pool is not replicated from ZFS'
> point of view, then all the multipathing and raid
> controller backup in the world will not make a difference.

James, I agree from the ZFS point of view. However, from the EMC or the customer point of view, they want to do the replication at the EMC level and not from ZFS. By replicating at the ZFS level they will lose some storage and it doubles the replication. It's just that the customer is used to working with Veritas and UFS and they don't want to change their habits. I just have to convince the customer to use ZFS replication.

Thanks again

> James C. McPherson
> --
> Solaris kernel software engineer
> Sun Microsystems
Hi all,

Is there a place where I can find a ZFS best practices guide to use against DMX, and a roadmap of ZFS?

Also, the customer is now looking at big ZFS installations in production. Would you guys happen to know, or where can I find, details of the numbers of current installations? We are looking at almost 10 terabytes of data to be stored on DMX using ZFS (the customer is not comfortable with the RAID-Z solution, in addition to their best practice of raiding at the DMX level). Any feedback, experiences and, more importantly, gotchas will be much appreciated.

Thanks in advance.

Roshan

----- Original Message -----
From: Roshan Perera <Roshan.Perera at Sun.COM>
Date: Wednesday, June 20, 2007 10:49 am
Subject: Re: [zfs-discuss] Re: ZFS - SAN and Raid
To: James.McPherson at Sun.COM
Cc: Bruce McAlister <bruce.mcalister at blueface.ie>, zfs-discuss at opensolaris.org, Richard Elling <Richard.Elling at Sun.COM>

> > But Roshan, if your pool is not replicated from ZFS'
> > point of view, then all the multipathing and raid
> > controller backup in the world will not make a difference.
>
> James, I agree from the ZFS point of view. However, from the EMC or
> the customer point of view, they want to do the replication at the
> EMC level and not from ZFS. By replicating at the ZFS level they
> will lose some storage and it doubles the replication. It's just that
> the customer is used to working with Veritas and UFS and they don't want
> to change their habits. I just have to convince the customer to
> use ZFS replication.
>
> Thanks again
>
> [...]
James C. McPherson wrote:
> Roshan Perera wrote:
>>
>>> But Roshan, if your pool is not replicated from ZFS' point of view,
>>> then all the multipathing and raid controller backup in the world will
>>> not make a difference.
>>
>> James, I agree from the ZFS point of view. However, from the EMC or the
>> customer point of view, they want to do the replication at the EMC level
>> and not from ZFS. By replicating at the ZFS level they will lose some
>> storage and it doubles the replication. It's just that the customer is used
>> to working with Veritas and UFS and they don't want to change their habits.
>> I just have to convince the customer to use ZFS replication.
>
> Hi Roshan,
> that's a great shame because if they actually want
> to make use of the features of ZFS such as replication,
> then they need to be serious about configuring their
> storage to play in the ZFS world.... and that means
> replication that ZFS knows about.

Also, how does replication at the ZFS level use more storage - I'm assuming raw block - than at the array level?
> On 6/20/07, Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
> Also, how does replication at the ZFS level use more storage - I'm
> assuming raw block - than at the array level?

Just to add to the previous comments. In the case where you have a SAN array providing storage to a host for use with ZFS, the SAN storage really needs to be redundant in the array AND the zpools need to be redundant pools.

The reason the SAN storage should be redundant is that SAN arrays are designed to serve logical units. The logical units are usually allocated from a raid set, storage pool or aggregate of some kind. The array-side pool/aggregate may include 10 300GB disks and may have 100+ luns allocated from it, for example. If redundancy is not used in the array-side pool/aggregate, then 1 disk failure will kill 100+ luns at once.

On 6/20/07, Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
> [...]
>
> Also, how does replication at the ZFS level use more storage - I'm
> assuming raw block - than at the array level?
On Wed, Jun 20, 2007 at 12:23:18PM -0400, Torrey McMahon wrote:
> [...]
>
> Also, how does replication at the ZFS level use more storage - I'm
> assuming raw block - than at the array level?

SAN storage generally doesn't work that way. They use some magical redundancy scheme, which may be RAID-5 or WAFL, from which the Storage Administrator carves out virtual disks. These are best viewed as an array of blocks. All disk administration, such as replacing failed disks, takes place on the storage device without affecting the virtual disks. There's no need for disk administration or additional redundancy on the client side. If more space is needed on the client, the Storage Administrator simply expands the virtual disk by extending its blocks. ZFS needs to play nicely in this environment because that's what's available in large organizations that have centralized their storage. Asking for raw disks doesn't work.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Victor Engle wrote:
>> On 6/20/07, Torrey McMahon <tmcmahon2 at yahoo.com> wrote:
>> Also, how does replication at the ZFS level use more storage - I'm
>> assuming raw block - than at the array level?
>
> Just to add to the previous comments. In the case where you have a SAN
> array providing storage to a host for use with ZFS, the SAN storage
> really needs to be redundant in the array AND the zpools need to be
> redundant pools.
>
> The reason the SAN storage should be redundant is that SAN arrays are
> designed to serve logical units. The logical units are usually
> allocated from a raid set, storage pool or aggregate of some kind. The
> array-side pool/aggregate may include 10 300GB disks and may have 100+
> luns allocated from it, for example. If redundancy is not used in the
> array-side pool/aggregate, then 1 disk failure will kill 100+ luns
> at once.

That makes a lot of sense in configurations where an array is exporting LUNs built on raid volumes to a set of heterogeneous hosts. If you're direct connected to a single box running ZFS, or a set of boxes running ZFS, you probably want to export something as close to the raw disks as possible while maintaining ZFS-level redundancy. (Like two R5 LUNs in a ZFS mirror.) Creating a raid set, carving out lots of LUNs and then handing them all over to ZFS isn't going to buy you a lot and could cause performance issues. (LUN skew, for example.)
Gary Mills wrote:
> On Wed, Jun 20, 2007 at 12:23:18PM -0400, Torrey McMahon wrote:
>
>> [...]
>> Also, how does replication at the ZFS level use more storage - I'm
>> assuming raw block - than at the array level?
>
> SAN storage generally doesn't work that way. They use some magical
> redundancy scheme, which may be RAID-5 or WAFL, from which the Storage
> Administrator carves out virtual disks. These are best viewed as an
> array of blocks. All disk administration, such as replacing failed
> disks, takes place on the storage device without affecting the virtual
> disks. There's no need for disk administration or additional
> redundancy on the client side. If more space is needed on the client,
> the Storage Administrator simply expands the virtual disk by extending
> its blocks. ZFS needs to play nicely in this environment because
> that's what's available in large organizations that have centralized
> their storage. Asking for raw disks doesn't work.

Are we talking about replication - I have a copy of my data on another system - or redundancy - I have a system where I can tolerate a local failure?

...and I understand the ZFS has to play nice with HW raid argument. :)
Hi all,

I am after some help/feedback on the subject issue explained below.

We are in the process of migrating a big DB2 database from a 6900 (24 x 200MHz CPUs) with Veritas FS and 8TB of storage on Solaris 8, to a 25K (12 dual-core CPUs x 1800MHz) with ZFS and 8TB of SAN storage (compressed & RAID-Z) on Solaris 10.

Unfortunately, we are having massive performance problems with the new solution. It all points towards IO and ZFS.

A couple of questions relating to ZFS:

1. What is the impact of using ZFS compression? What percentage of system resources is required, and how much of an overhead is this as opposed to non-compression? In our case DB2 does a similar amount of reads and writes.
2. Unfortunately we are using RAID twice (SAN-level RAID and RAID-Z) to overcome the panic problem from my previous post (for which I had a good response).
3. Any way of monitoring ZFS performance other than iostat?
4. Any help on ZFS tuning in this kind of environment, like caching etc.?

Would appreciate any feedback/help on where to go next. If this cannot be resolved we may have to go back to VxFS, which would be a shame.

Thanks in advance.
On 6/26/07, Roshan Perera <Roshan.Perera at sun.com> wrote:
> 25K 12 CPU dual core x 1800Mhz with ZFS 8TB storage SAN storage (compressed & RaidZ) Solaris 10.

RaidZ is a poor choice for database apps in my opinion; due to the way it handles checksums on raidz stripes, it must read every disk in order to satisfy small reads that traditional raid-5 would only have to read a single disk for. Raid-Z doesn't have the terrible write performance of raid 5, because you can stick small writes together and then do full-stripe writes, but by the same token you must do full-stripe reads, all the time. That's how I understand it, anyways. Thus, raidz is a poor choice for a database application which tends to do a lot of small reads.

Using mirrors (at the zfs level, not the SAN level) would probably help with this. Mirrors each get their own copy of the data, each with its own checksum, so you can read a small block by touching only one disk.

What is your vdev setup like right now? 'zpool list', in other words. How wide are your stripes? Is the SAN doing raid-1ish things with the disks, or something else?

> 2. Unfortunately we are using twice RAID (San level Raid and RaidZ) to overcome the panic problem my previous blog (for which I had good response).

Can you convince the customer to give ZFS a chance to do things its way? Let the SAN export raw disks, and make two- or three-way mirrored vdevs out of them.

> 3. Any way of monitoring ZFS performance other than iostat ?

In a word, yes. What are you interested in? DTrace or 'zpool iostat' (which reports activity of individual disks within the pool) may prove interesting.

Will
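To illustrate both suggestions, a small sketch; the pool and emcpowerNh device names are taken from the configuration posted later in the thread and stand in for whatever LUNs the array actually exports:

# Per-device activity for an existing pool, sampled every 5 seconds.
zpool iostat -v datapool1 5

# What a mirrored layout built from SAN LUNs could look like instead of
# an 8-wide raidz1 (pool would have to be recreated; names are placeholders).
zpool create datapool1 \
    mirror emcpower8h  emcpower9h  \
    mirror emcpower10h emcpower11h \
    mirror emcpower12h emcpower13h \
    mirror emcpower14h emcpower15h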
Hi Will,

Thanks for your reply. The customer has an EMC SAN solution and will not change their current layout. Therefore, asking the customer to give RAW disks to ZFS is a no-no. Hence the RAID-Z configuration as opposed to RAID-5. I have given some stats below. I know it's a bit difficult to troubleshoot with the type of data you have, but any input would be much appreciated.

zpool list
NAME         SIZE    USED   AVAIL    CAP  HEALTH  ALTROOT
datapool1   2.12T    707G   1.43T    32%  ONLINE  -
datapool2   2.12T    706G   1.44T    32%  ONLINE  -
datapool3   2.12T    702G   1.44T    32%  ONLINE  -
datapool4   2.12T    701G   1.44T    32%  ONLINE  -
dumppool     272G    171G    101G    62%  ONLINE  -
localpool     68G   12.5G   55.5G    18%  ONLINE  -
logpool      272G    157G    115G    57%  ONLINE  -

zfs get all datapool1
NAME       PROPERTY       VALUE                  SOURCE
datapool1  type           filesystem             -
datapool1  creation       Fri Jun  8 18:46 2007  -
datapool1  used           615G                   -
datapool1  available      1.22T                  -
datapool1  referenced     42.6K                  -
datapool1  compressratio  2.08x                  -
datapool1  mounted        no                     -
datapool1  quota          none                   default
datapool1  reservation    none                   default
datapool1  recordsize     128K                   default
datapool1  mountpoint     none                   local
datapool1  sharenfs       off                    default
datapool1  checksum       on                     default
datapool1  compression    on                     local
datapool1  atime          on                     default
datapool1  devices        on                     default
datapool1  exec           on                     default
datapool1  setuid         on                     default
datapool1  readonly       off                    default
datapool1  zoned          off                    default
datapool1  snapdir        hidden                 default
datapool1  aclmode        groupmask              default
datapool1  aclinherit     secure                 default

[su621dwdb/root] zpool status -v
  pool: datapool1
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool1        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower8h   ONLINE       0     0     0
            emcpower9h   ONLINE       0     0     0
            emcpower10h  ONLINE       0     0     0
            emcpower11h  ONLINE       0     0     0
            emcpower12h  ONLINE       0     0     0
            emcpower13h  ONLINE       0     0     0
            emcpower14h  ONLINE       0     0     0
            emcpower15h  ONLINE       0     0     0

errors: No known data errors

  pool: datapool2
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool2        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower16h  ONLINE       0     0     0
            emcpower17h  ONLINE       0     0     0
            emcpower18h  ONLINE       0     0     0
            emcpower19h  ONLINE       0     0     0
            emcpower20h  ONLINE       0     0     0
            emcpower21h  ONLINE       0     0     0
            emcpower22h  ONLINE       0     0     0
            emcpower23h  ONLINE       0     0     0

errors: No known data errors

  pool: datapool3
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool3        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower24h  ONLINE       0     0     0
            emcpower25h  ONLINE       0     0     0
            emcpower26h  ONLINE       0     0     0
            emcpower27h  ONLINE       0     0     0
            emcpower28h  ONLINE       0     0     0
            emcpower29h  ONLINE       0     0     0
            emcpower30h  ONLINE       0     0     0
            emcpower31h  ONLINE       0     0     0

errors: No known data errors

  pool: datapool4
 state: ONLINE
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        datapool4        ONLINE       0     0     0
          raidz1         ONLINE       0     0     0
            emcpower32h  ONLINE       0     0     0
            emcpower33h  ONLINE       0     0     0
            emcpower34h  ONLINE       0     0     0
            emcpower35h  ONLINE       0     0     0
            emcpower36h  ONLINE       0     0     0
            emcpower37h  ONLINE       0     0     0
            emcpower38h  ONLINE       0     0     0
            emcpower39h  ONLINE       0     0     0

errors: No known data errors

  pool: dumppool
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        dumppool     ONLINE       0     0     0
          c5t10d0    ONLINE       0     0     0
          c5t11d0    ONLINE       0     0     0
          c6t10d0    ONLINE       0     0     0
          c6t11d0    ONLINE       0     0     0

errors: No known data errors

  pool: localpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        localpool   ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t9d0  ONLINE       0     0     0
            c3t9d0  ONLINE       0     0     0

errors: No known data errors

  pool: logpool
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        logpool         ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            emcpower0h  ONLINE       0     0     0
            emcpower1h  ONLINE       0     0     0
            emcpower2h  ONLINE       0     0     0
            emcpower3h  ONLINE       0     0     0
            emcpower4h  ONLINE       0     0     0
            emcpower5h  ONLINE       0     0     0
            emcpower6h  ONLINE       0     0     0
            emcpower7h  ONLINE       0     0     0

errors: No known data errors
[su621dwdb/root]

----- Original Message -----
From: Will Murnane <will.murnane at gmail.com>
Date: Tuesday, June 26, 2007 2:00 pm
Subject: Re: [zfs-discuss] ZFS - DB2 Performance
To: Roshan Perera <Roshan.Perera at Sun.COM>
Cc: zfs-discuss at opensolaris.org

> [...]
>
> > 3. Any way of monitoring ZFS performance other than iostat ?
> In a word, yes. What are you interested in? DTrace or 'zpool iostat'
> (which reports activity of individual disks within the pool) may prove
> interesting.

Thanks...

> Will
On Jun 26, 2007, at 4:26 AM, Roshan Perera wrote:
> Hi all,
>
> I am after some help/feedback on the subject issue explained below.
>
> We are in the process of migrating a big DB2 database from a
>
> 6900 24 x 200MHz CPUs with Veritas FS 8TB of storage Solaris 8 to
> 25K 12 CPU dual core x 1800Mhz with ZFS 8TB storage SAN storage
> (compressed & RaidZ) Solaris 10.
>
> Unfortunately, we are having massive performance problems with the
> new solution. It all points towards IO and ZFS.
>
> A couple of questions relating to ZFS:
> 1. What is the impact of using ZFS compression? Percentage of
> system resources required, how much of an overhead is this as
> opposed to non-compression. In our case DB2 does a similar amount of
> reads and writes.
> 2. Unfortunately we are using RAID twice (SAN-level RAID and RAID-Z)
> to overcome the panic problem from my previous post (for which I had
> a good response).
> 3. Any way of monitoring ZFS performance other than iostat?
> 4. Any help on ZFS tuning in this kind of environment like caching
> etc?

Have you looked at:

http://blogs.sun.com/realneel/entry/zfs_and_databases
http://blogs.sun.com/realneel/entry/zfs_and_databases_time_for

?

eric

> Would appreciate any feedback/help on where to go next.
> If this cannot be resolved we may have to go back to VxFS which
> would be a shame.
>
> Thanks in advance.
Possibly the storage is flushing the write caches when it should not. Until we get a fix, cache flushing could be disabled in the storage (ask the vendor for the magic incantation). If that's not forthcoming, and if all pools are attached to NVRAM-protected devices, then these /etc/system evil tunables might help.

In older Solaris releases we have:

        set zfs:zil_noflush = 1

On newer releases:

        set zfs:zfs_nocacheflush = 1

If you implement this, do place a comment that this is a temporary workaround waiting for bug 6462690 to be fixed.

About compression: I don't have the numbers, but a reasonable guess would be that it consumes roughly 1 GHz of CPU to compress 100MB/sec. This will of course depend on the type of data being compressed.

-r

Roshan Perera writes:
> Hi all,
>
> I am after some help/feedback on the subject issue explained below.
>
> We are in the process of migrating a big DB2 database from a
>
> 6900 24 x 200MHz CPUs with Veritas FS 8TB of storage Solaris 8 to
> 25K 12 CPU dual core x 1800Mhz with ZFS 8TB storage SAN storage (compressed & RaidZ) Solaris 10.
>
> Unfortunately, we are having massive performance problems with the new solution. It all points towards IO and ZFS.
>
> [...]
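As a concrete sketch of the /etc/system workaround Roch describes (the tunable names are from his message; only applicable when every pool sits on NVRAM-protected array LUNs, and only one of the two lines should be active depending on the release):

* /etc/system -- temporary workaround, remove once bug 6462690 is fixed.
* Only safe when all pools are on NVRAM-protected (battery-backed) storage.
* Older Solaris releases:
set zfs:zil_noflush = 1
* Newer releases use this tunable instead:
* set zfs:zfs_nocacheflush = 1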
At what Solaris 10 level (patch/update) was the "single-threaded compression"
situation resolved? Could you be hitting that one?

-- MikeE
> > Roshan Perera writes:
> > Hi all,
> >
> > I am after some help/feedback on the subject issue explained below.
> >
> > We are in the process of migrating a big DB2 database from a
> > 6900 24 x 200MHz CPUs with Veritas FS and 8TB of storage on Solaris 8 to a
> > 25K 12 dual-core CPUs x 1800MHz with ZFS, 8TB of SAN storage
> > (compressed & RAID-Z), Solaris 10.

200MHz!? You mean 1200MHz ;) The slowest CPUs in a 6900 were 900MHz
UltraSPARC III Cu.

You mention Veritas FS ... as in the Veritas filesystem, vxfs? I suppose you
also include vmsa, or the whole Storage Foundation? (It could still be vxva on
Solaris 8! Oh, those were the days...)

First impressions of the system: well, it's fair to say that you have some
extra CPU power (and then some). The old UltraSPARC III at 1.2GHz was nice,
but by no means a screamer. (Years ago...)

> > Unfortunately, we are having massive performance problems with the new
> > solution. It all points towards I/O and ZFS.

Yep... CPU it isn't. Keep in mind that you have now completely moved the goal
posts when it comes to comparing performance with the previous installation.
Not only do you have a large increase in CPU performance, Solaris 10 will
blitz Solaris 8 on a bad day by miles. With all of the CPU/OS bottlenecks
removed, I sure hope you have decent I/O at the back...

> > A couple of questions relating to ZFS:
> > 1. What is the impact of using ZFS compression? What percentage of system
> > resources does it require, and how much overhead does it add compared to
> > no compression? In our case DB2 does a similar amount of reads and writes.

I'm unsure why a person who buys a 24-core 25K would activate compression on
an OLTP database. Surely when you fork out that kind of cash you want to get
every bang for your buck (and then some!). I don't think compression was
created with high-performance OLTP databases in mind. I would hope the 25K
(which in this case is light years faster than the 6900) wasn't spec'ed with
the idea of spending the extra CPU cycles on compression... oooh... *crash*
*burn*.

> > 2. Unfortunately we are using RAID twice (SAN-level RAID and RAID-Z) to
> > overcome the panic problem from my previous post (for which I had a good
> > response).

I've yet to deploy a DB on ZFS in production, so I cannot comment on the
real-world performance. What I can comment on are some basic things. RAID on
top of RAID seems silly, especially RAID-Z: it's just not as fast as a mirror
or stripe when it comes to a decent DB workout. Are you sure you want to go
with ZFS ... any real reason to go that way now? I would wait for U4, and give
the machine/storage a good workout with SVM and UFS/DirectIO. Yep, it's a
bastard to manage, but very little can touch it when it comes to pure
performance. With so many $$$ standing on the datacentre floor, I'd forget
about technology for now and let common sense and good business practice
prevail.

> > 3. Is there any way of monitoring ZFS performance other than iostat?

DTrace gurus can comment... however, iostat should suffice.

> > 4. Any help on ZFS tuning in this kind of environment, like caching, etc.?

As was posted, read the blogs on ZFS and databases.

> > Would appreciate any feedback/help on where to go next.
> > If this cannot be resolved we may have to go back to VxFS, which would
> > be a shame.

By the way ... if the client has already purchased vmsa/vxfs (oh my word, how
much was that!) then I'm unsure what ZFS will bring to the party, apart from
saving the yearly $$$ for updates, patches and support. Is that the idea?
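On the compression point, it's cheap to check what compression is actually buying before deciding to rip it out; a quick sketch, using a hypothetical dataset name tank/db2:

    $ zfs get compression,compressratio tank/db2
    $ zfs set compression=off tank/db2

Bear in mind that turning compression off only affects blocks written from then on; existing data stays compressed until it is rewritten, so a fair before/after comparison needs a reload or at least a fresh working set.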
It's not like SF is bad... but 8TB on a decently configured storage unit is
not so big that you couldn't give it a go with SVM, especially if you want to
save money on Storage Foundation.

I'm sure I'm preaching to the converted here, but DB performance problems
usually reside inside the storage architecture... I've seldom found a system
wanting in the CPU department if the architect wasn't a moron. With the
upgrade I see here, all the pressure will move to the back end (bar a bad
configuration).

If you want to speed up a regular OLTP DB... fiddle with the I/O :)

2c
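On the monitoring question, besides plain iostat there is also zpool iostat for a pool/vdev-level view; something like the following, with a hypothetical pool name tank:

    $ zpool iostat -v tank 5    # per-vdev read/write ops and bandwidth, every 5 seconds
    $ iostat -xnz 5             # the underlying LUNs: service times and %busy

Comparing the two is a quick way to see whether the pool is actually pushing the back-end storage hard, or whether the time is going somewhere else.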
> Victor Engle wrote:
> > Roshan,
> >
> > As far as I know, there is no problem at all with using SAN storage
> > with ZFS, and it does look like you were having an underlying problem
> > with either PowerPath or the array.
>
> Correct. A write failed.
>
> > The best practices guide on opensolaris does recommend replicated
> > pools even if your backend storage is redundant. There are at least two
> > good reasons for that. ZFS needs a replica for the self-healing feature
> > to work. Also, there is no fsck-like tool for ZFS, so it is a good idea
> > to make sure self-healing can work.
>
> Yes, currently ZFS on Solaris will panic if a non-redundant write fails.
> This is known and being worked on, but there really isn't a good solution
> if a write fails, unless you have some ZFS-level redundancy.

Why not? If O_DSYNC applies, a write() can still fail with EIO, right? And if
O_DSYNC does not apply, an application could not assume the written data was
on stable storage anyway. Or the write() could simply block until the problem
is corrected (if correctable) or the system is rebooted. In any case, IMO
there ought to be some sort of consistent behavior possible short of a panic.
I've seen UFS-based systems stay up even with their disks incommunicado for a
while, although they were hardly useful like that except for activity that
only read already-cached pages.