Hi, let's take a look:

# zpool list
NAME    SIZE   USED  AVAIL  CAP  DEDUP   HEALTH  ALTROOT
rpool   68G   13.9G  54.1G  20%  42.27x  ONLINE  -

# zfs get all rpool/export/data
NAME               PROPERTY              VALUE                  SOURCE
rpool/export/data  type                  filesystem             -
rpool/export/data  creation              Mon Nov 2 16:11 2009   -
rpool/export/data  used                  46.7G                  -
rpool/export/data  available             38.7M                  -
rpool/export/data  referenced            46.7G                  -
rpool/export/data  compressratio         1.00x                  -
rpool/export/data  mounted               yes                    -
rpool/export/data  quota                 none                   default
rpool/export/data  reservation           none                   default
rpool/export/data  recordsize            128K                   default
rpool/export/data  mountpoint            /export/data           inherited from rpool/export
rpool/export/data  sharenfs              off                    default
rpool/export/data  checksum              on                     default
rpool/export/data  compression           off                    default
rpool/export/data  atime                 on                     default
rpool/export/data  devices               on                     default
rpool/export/data  exec                  on                     default
rpool/export/data  setuid                on                     default
rpool/export/data  readonly              off                    default
rpool/export/data  zoned                 off                    default
rpool/export/data  snapdir               hidden                 default
rpool/export/data  aclmode               groupmask              default
rpool/export/data  aclinherit            restricted             default
rpool/export/data  canmount              on                     default
rpool/export/data  shareiscsi            off                    default
rpool/export/data  xattr                 on                     default
rpool/export/data  copies                1                      default
rpool/export/data  version               4                      -
rpool/export/data  utf8only              off                    -
rpool/export/data  normalization         none                   -
rpool/export/data  casesensitivity       sensitive              -
rpool/export/data  vscan                 off                    default
rpool/export/data  nbmand                off                    default
rpool/export/data  sharesmb              off                    default
rpool/export/data  refquota              none                   default
rpool/export/data  refreservation        none                   default
rpool/export/data  primarycache          all                    default
rpool/export/data  secondarycache        all                    default
rpool/export/data  usedbysnapshots       0                      -
rpool/export/data  usedbydataset         46.7G                  -
rpool/export/data  usedbychildren        0                      -
rpool/export/data  usedbyrefreservation  0                      -
rpool/export/data  logbias               latency                default
rpool/export/data  dedup                 on                     local
rpool/export/data  org.opensolaris.caiman:install  ready        inherited from rpool

# df -h
Filesystem                      Size  Used  Avail  Use%  Mounted on
rpool/ROOT/os_b123_dev          2.4G  2.4G    40M   99%  /
swap                            9.1G  336K   9.1G    1%  /etc/svc/volatile
/usr/lib/libc/libc_hwcap1.so.1  2.4G  2.4G    40M   99%  /lib/libc.so.1
swap                            9.1G     0   9.1G    0%  /tmp
swap                            9.1G   40K   9.1G    1%  /var/run
rpool/export                     40M   25K    40M    1%  /export
rpool/export/home                40M   30K    40M    1%  /export/home
rpool/export/home/admin         460M  421M    40M   92%  /export/home/admin
rpool                            40M   83K    40M    1%  /rpool
rpool/export/data                47G   47G    40M  100%  /export/data
rpool/export/data2               40M   21K    40M    1%  /export/data2

# zfs get all rpool
NAME   PROPERTY              VALUE                  SOURCE
rpool  type                  filesystem             -
rpool  creation              Wed Sep 23 12:45 2009  -
rpool  used                  66.9G                  -
rpool  available             38.7M                  -
rpool  referenced            83K                    -
rpool  compressratio         1.00x                  -
rpool  mounted               yes                    -
rpool  quota                 none                   default
rpool  reservation           none                   default
rpool  recordsize            128K                   default
rpool  mountpoint            /rpool                 default
rpool  sharenfs              off                    default
rpool  checksum              on                     default
rpool  compression           off                    default
rpool  atime                 on                     default
rpool  devices               on                     default
rpool  exec                  on                     default
rpool  setuid                on                     default
rpool  readonly              off                    default
rpool  zoned                 off                    default
rpool  snapdir               hidden                 default
rpool  aclmode               groupmask              default
rpool  aclinherit            restricted             default
rpool  canmount              on                     default
rpool  shareiscsi            off                    default
rpool  xattr                 on                     default
rpool  copies                1                      default
rpool  version               4                      -
rpool  utf8only              off                    -
rpool  normalization         none                   -
rpool  casesensitivity       sensitive              -
rpool  vscan                 off                    default
rpool  nbmand                off                    default
rpool  sharesmb              off                    default
rpool  refquota              none                   default
rpool  refreservation        none                   default
rpool  primarycache          all                    default
rpool  secondarycache        all                    default
rpool  usedbysnapshots       59.5K                  -
rpool  usedbydataset         83K                    -
rpool  usedbychildren        66.9G                  -
rpool  usedbyrefreservation  0                      -
rpool  logbias               latency                default
rpool  dedup                 off                    default
rpool  org.opensolaris.caiman:install  ready        local

# zfs create rpool/new_data
# zfs get all rpool/new_data
NAME            PROPERTY              VALUE                  SOURCE
rpool/new_data  type                  filesystem             -
rpool/new_data  creation              Tue Nov 3 10:32 2009   -
rpool/new_data  used                  21K                    -
rpool/new_data  available             39.0M                  -
rpool/new_data  referenced            21K                    -
rpool/new_data  compressratio         1.00x                  -
rpool/new_data  mounted               yes                    -
rpool/new_data  quota                 none                   default
rpool/new_data  reservation           none                   default
rpool/new_data  recordsize            128K                   default
rpool/new_data  mountpoint            /rpool/new_data        default
rpool/new_data  sharenfs              off                    default
rpool/new_data  checksum              on                     default
rpool/new_data  compression           off                    default
rpool/new_data  atime                 on                     default
rpool/new_data  devices               on                     default
rpool/new_data  exec                  on                     default
rpool/new_data  setuid                on                     default
rpool/new_data  readonly              off                    default
rpool/new_data  zoned                 off                    default
rpool/new_data  snapdir               hidden                 default
rpool/new_data  aclmode               groupmask              default
rpool/new_data  aclinherit            restricted             default
rpool/new_data  canmount              on                     default
rpool/new_data  shareiscsi            off                    default
rpool/new_data  xattr                 on                     default
rpool/new_data  copies                1                      default
rpool/new_data  version               4                      -
rpool/new_data  utf8only              off                    -
rpool/new_data  normalization         none                   -
rpool/new_data  casesensitivity       sensitive              -
rpool/new_data  vscan                 off                    default
rpool/new_data  nbmand                off                    default
rpool/new_data  sharesmb              off                    default
rpool/new_data  refquota              none                   default
rpool/new_data  refreservation        none                   default
rpool/new_data  primarycache          all                    default
rpool/new_data  secondarycache        all                    default
rpool/new_data  usedbysnapshots       0                      -
rpool/new_data  usedbydataset         21K                    -
rpool/new_data  usedbychildren        0                      -
rpool/new_data  usedbyrefreservation  0                      -
rpool/new_data  logbias               latency                default
rpool/new_data  dedup                 off                    default
rpool/new_data  org.opensolaris.caiman:install  ready        inherited from rpool

# zdb -S rpool/export/data
Dataset rpool/export/data [ZPL], ID 167, cr_txg 115855, 46.7G, 1793628 objects

So... it seems that the data is deduplicated, the zpool has 54.1G of free space, but I can use only 40M.

It's x86, ONNV revision 10924, debug build, bfu'ed from b125.

Please add me to the CC, because I am not on this alias.

Thanks.
-- 
This message posted from opensolaris.org
> So.. it seems that data is deduplicated, zpool has
> 54.1G of free space, but I can use only 40M.
>
> It's x86, ONNV revision 10924, debug build, bfu'ed from b125.

I think I'm observing the same (with changeset 10936) ...

I created a 2GB file, and a "tank" zpool on top of that file, with compression and dedup enabled:

  mkfile 2g /var/tmp/tank.img
  zpool create tank /var/tmp/tank.img
  zfs set dedup=on tank
  zfs set compression=on tank

Now I tried to create four zfs filesystems, and filled them by pulling and updating the same set of onnv sources from mercurial. One copy needs ~ 800MB of disk space uncompressed, or ~ 520MB compressed. During the 4th "hg update":

> hg update
abort: No space left on device: /tank/snv_128_yy/usr/src/lib/libast/sparcv9/src/lib/libast/FEATURE/common

> zpool list tank
NAME   SIZE   USED  AVAIL  CAP  DEDUP  HEALTH  ALTROOT
tank  1,98G   720M  1,28G  35%  3.70x  ONLINE  -

> zfs list -r tank
NAME             USED  AVAIL  REFER  MOUNTPOINT
tank            1,95G      0    26K  /tank
tank/snv_128     529M      0   529M  /tank/snv_128
tank/snv_128_jk  530M      0   530M  /tank/snv_128_jk
tank/snv_128_xx  530M      0   530M  /tank/snv_128_xx
tank/snv_128_yy  368M      0   368M  /tank/snv_128_yy
-- 
This message posted from opensolaris.org
> I think I'm observing the same (with changeset 10936) ...

# mkfile 2g /var/tmp/tank.img
# zpool create tank /var/tmp/tank.img
# zfs set dedup=on tank
# zfs create tank/foobar

> dd if=/dev/urandom of=/tank/foobar/file1 bs=1024k count=512
512+0 records in
512+0 records out
> cp /tank/foobar/file1 /tank/foobar/file2
> cp /tank/foobar/file1 /tank/foobar/file3
> cp /tank/foobar/file1 /tank/foobar/file4
/tank/foobar/file4: No space left on device

> zfs list -r tank
NAME          USED  AVAIL  REFER  MOUNTPOINT
tank         1.95G      0    22K  /tank
tank/foobar  1.95G      0  1.95G  /tank/foobar

> zpool list tank
NAME   SIZE  USED  AVAIL  CAP  DEDUP  HEALTH  ALTROOT
tank  1.98G  515M  1.48G  25%  3.90x  ONLINE  -
-- 
This message posted from opensolaris.org
On Nov 3, 2009, at 6:01 AM, Jürgen Keil wrote:

>> I think I'm observing the same (with changeset 10936) ...
>
> # mkfile 2g /var/tmp/tank.img
> # zpool create tank /var/tmp/tank.img
> # zfs set dedup=on tank
> # zfs create tank/foobar

This has to do with the fact that dedup space accounting is charged to all filesystems, regardless of whether blocks are deduped. To do otherwise is impossible, as there is no true "owner" of a block, and the fact that it may or may not be deduped is often beyond the control of a single filesystem.

This has some interesting pathologies as the pool gets full. Namely, that ZFS will artificially enforce a limit on the logical size of the pool based on non-deduped data. This is obviously something that should be addressed.

- Eric

-- 
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
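To make the accounting described above concrete, here is a rough reading of the numbers from the 2 GB test pool (approximate, and ignoring metadata overhead):

  logical data charged to tank/foobar:   3 full copies + part of a 4th  ~ 1.95 GB  (zfs list USED)
  physical space allocated in the pool:  roughly one copy               ~ 515 MB   (zpool list USED)
  dedup ratio:                           ~1.95 GB / ~515 MB             ~ 3.9x     (zpool list DEDUP)

Because the ~1.95 GB of logical data is charged against a pool whose logical capacity is also about 2 GB, the fourth copy fails with "No space left on device" even though well over 1 GB of physical space is still unallocated.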
Hi Eric and all,

Eric Schrock wrote:
> On Nov 3, 2009, at 6:01 AM, Jürgen Keil wrote:
>
>>> I think I'm observing the same (with changeset 10936) ...
>>
>> # mkfile 2g /var/tmp/tank.img
>> # zpool create tank /var/tmp/tank.img
>> # zfs set dedup=on tank
>> # zfs create tank/foobar
>
> This has to do with the fact that dedup space accounting is charged to
> all filesystems, regardless of whether blocks are deduped. To do
> otherwise is impossible, as there is no true "owner" of a block

It would be great if someone could explain why it is hard (impossible? not a good idea?) to account all datasets for at least one reference to each dedup'ed block and add this space to the total free space?

> This has some interesting pathologies as the pool gets full. Namely,
> that ZFS will artificially enforce a limit on the logical size of the
> pool based on non-deduped data. This is obviously something that should
> be addressed.

Would the idea I mentioned not address this issue as well?

Thanks, Nils
Hi,

It looks like an interesting problem. Would it help if, as ZFS detects dedup blocks, it started increasing the effective size of the pool? It would create an anomaly with respect to total disk space, but it would still be accurate from each filesystem's usage point of view. Basically, dedup is at the block level, so the space freed can effectively be accounted as extra free blocks added to the pool.

Just a thought.

Regards,
Anurag.

On Tue, Nov 3, 2009 at 9:39 PM, Nils Goroll <slink at schokola.de> wrote:
> It would be great if someone could explain why it is hard (impossible?
> not a good idea?) to account all datasets for at least one reference to
> each dedup'ed block and add this space to the total free space?
>
> Would the idea I mentioned not address this issue as well?
>
> Thanks, Nils

-- 
Anurag Agarwal
CEO, Founder
KQ Infotech, Pune
www.kqinfotech.com
9881254401

Coordinator Akshar Bharati
www.aksharbharati.org
Spreading joy through reading
Well, then you could have more "logical space" than "physical space", and that would be extremely cool, but what happens if for some reason you wanted to turn off dedup on one of the filesystems? It might exhaust all the pool's space to do this. I think a good idea would be another pool/filesystem property that, when turned on, would allow allocating more "logical data" than the pool's capacity, but then you would accept the risks that involves. Then the administrator could decide which is better for his system.
-- 
This message posted from opensolaris.org
> Well, then you could have more "logical space" than
> "physical space", and that would be extremely cool,

I think we already have that, with zfs clones. I often clone a zfs onnv workspace, and everything is "deduped" between the zfs parent snapshot and the clone filesystem. The clone (initially) needs no extra zpool space. And with a zfs clone I can actually use all the remaining free space from the zpool. With zfs deduped blocks, I can't ...

> but what happens if for some reason you wanted to
> turn off dedup on one of the filesystems? It might
> exhaust all the pool's space to do this.

As far as I understand it, nothing happens to existing deduped blocks when you turn off dedup for a zfs filesystem. The new dedup=off setting affects newly written blocks only.
-- 
This message posted from opensolaris.org
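For comparison, a minimal sketch of the clone workflow described above (the dataset and snapshot names are made up for illustration):

# zfs snapshot tank/ws@base          # snapshot of the original workspace
# zfs clone tank/ws@base tank/ws2    # clone initially shares all blocks with the snapshot
# zfs list -o name,used,refer,avail -r tank

Right after creation the clone's USED is only a few kilobytes of metadata; shared blocks stay charged to the origin, and the clone is charged only as it diverges, so the pool's remaining free space stays usable by any dataset.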
On Tue, November 3, 2009 10:32, Bartlomiej Pelc wrote:
> Well, then you could have more "logical space" than "physical space", and
> that would be extremely cool, but what happens if for some reason you
> wanted to turn off dedup on one of the filesystems? It might exhaust all
> the pool's space to do this. I think a good idea would be another
> pool/filesystem property that, when turned on, would allow allocating
> more "logical data" than the pool's capacity, but then you would accept
> the risks that involves. Then the administrator could decide which is
> better for his system.

Compression has the same issues; how is that handled? (Well, except that compression is limited to the filesystem, it doesn't have cross-filesystem interactions.) They ought to behave the same with regard to reservations and quotas unless there is a very good reason for a difference.

Generally speaking, I don't find "but what if you turned off dedupe?" to be a very important question. Or rather, I consider it such an important question that I'd have to consider it very carefully in light of the particular characteristics of a particular pool; no GENERAL answer is going to be generally right.

Reserving physical space for blocks not currently stored seems like the wrong choice; it violates my expectations, and goes against the purpose of dedupe, which as I understand it is to save space so you can use it for other things. It's obvious to me that changing the dedupe setting (or the compression setting) would have consequences on space use, and it seems natural that I as the sysadmin am on the hook for those consequences. (I'd expect to find in the documentation explanations of what things I need to consider and how to find the detailed data to make a rational decision in any particular case.)

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
>>> I think I'm observing the same (with changeset 10936) ...
>>
>>   # mkfile 2g /var/tmp/tank.img
>>   # zpool create tank /var/tmp/tank.img
>>   # zfs set dedup=on tank
>>   # zfs create tank/foobar
>
> This has to do with the fact that dedup space accounting is charged to all
> filesystems, regardless of whether blocks are deduped. To do otherwise is
> impossible, as there is no true "owner" of a block, and the fact that it may
> or may not be deduped is often beyond the control of a single filesystem.
>
> This has some interesting pathologies as the pool gets full. Namely, that
> ZFS will artificially enforce a limit on the logical size of the pool based
> on non-deduped data. This is obviously something that should be addressed.

Eric,

Many people (me included) perceive deduplication as a means to save disk space and allow more data to be squeezed into a storage system. What you are saying is that ZFS dedup does a wonderful job of detecting duplicate blocks, goes to all the trouble of removing the extra copies, and keeps an accounting of everything. However, when it comes to letting me use the freed space, I will be plainly denied. If that is so, what would be the reason to use ZFS deduplication at all?

-- 
Regards,
	Cyril
I'm fairly new to all this, and I think that is the intended behavior. Also, from my limited understanding, I believe dedup would significantly cut down on access times. For the most part, though, this is such new code that I would wait a bit to see where they take it.

On Tue, Nov 3, 2009 at 3:24 PM, Cyril Plisko <cyril.plisko at mountall.com> wrote:
> Many people (me included) perceive deduplication as a means to save
> disk space and allow more data to be squeezed into a storage system.
> What you are saying is that ZFS dedup does a wonderful job of detecting
> duplicate blocks, goes to all the trouble of removing the extra copies,
> and keeps an accounting of everything. However, when it comes to letting
> me use the freed space, I will be plainly denied. If that is so, what
> would be the reason to use ZFS deduplication at all?
>
> --
> Regards,
>       Cyril
> Well, then you could have more "logical space" than "physical space"

Reconsidering my own question again, it seems to me that the question of space management is probably more fundamental than I had initially thought, and I assume members of the core team will have thought through much of it. I will try to share my thoughts and I would very much appreciate any corrections or additional explanations.

For dedup, my understanding at this point is that, first of all, every reference to dedup'ed data must be accounted to the respective dataset. Obviously, a decision has been made to account that space as "used", rather than "referenced". I am trying to understand why.

At first sight, referring to the definition of "used" space as being unique to the respective dataset, it would seem natural to account all de-duped space as "referenced". But this could lead to much space never being accounted as "used" anywhere (except for the pool). This would differ from the observed behavior of non-deduped datasets, where, to my understanding, all "referenced" space is "used" by some other dataset. Despite it being a little counter-intuitive, at first I found this simple solution quite attractive, because it wouldn't alter the semantics of used vs. referenced space (under the assumption that my understanding is correct).

My understanding from Eric's explanation is that it has been decided to go an alternative route and account all de-duped space as "used" by every dataset referencing it because, in contrast to snapshots/clones, it is impossible (?) to differentiate between used and referenced space for de-dup. Also, at first sight, this seems to be a way to keep the current semantics for (ref)reservations.

But while without de-dup all the usedsnap and usedds values should roughly sum up to the pool's used space, they can't with this concept - which is why I thought a solution could be to compensate for multiply accounted "used" space by artificially increasing the pool size. Instead, from the examples given here, what seems to have been implemented with de-dup is to simply maintain space statistics for the pool on the basis of actually used space. While one might find it counter-intuitive that the used sizes of all datasets/snapshots will exceed the pool's used size with de-dup, if my understanding is correct, this design seems to be consistent.

I am very interested in the reasons why this particular approach has been chosen and why others have been dropped.

Now to the more general question: If all datasets of a pool contained the same data and got de-duped, the sum of their "used" space still seems to be limited by the "logical" pool size, as we've seen in the examples given by Jürgen and others, and, to get a benefit from de-dup, this implementation obviously needs to be changed.

But: Isn't there an implicit expectation of a space guarantee associated with a dataset? In other words, if a dataset has 1GB of data, isn't it natural to expect to be able to overwrite that space with other data? One might want to define space guarantees (like with (ref)reservation), but I don't see how those should work with the currently implemented concept. Do we need something like a de-dup reservation, which is subtracted from the pool's free space?

Thank you for reading, Nils
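For anyone who wants to look at the per-dataset accounting being discussed, a quick sketch (the dataset name is taken from the original report; 'zfs list -o space' needs a build that already has the split used* properties, which the 'zfs get all' output above shows):

# zfs list -o space rpool/export/data
# zfs get used,referenced,usedbydataset,usedbysnapshots,usedbychildren,usedbyrefreservation rpool/export/data

The first command prints the AVAIL/USED breakdown (USEDSNAP, USEDDS, USEDREFRESERV, USEDCHILD); the second shows the same decomposition as individual properties. This is the accounting that would have to change if de-duped space were ever split differently between "used" and "referenced".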
On Tue, Nov 3, 2009 at 10:54 PM, Nils Goroll <slink at schokola.de> wrote:
> Now to the more general question: If all datasets of a pool contained the
> same data and got de-duped, the sum of their "used" space still seems to be
> limited by the "logical" pool size, as we've seen in examples given by
> Jürgen and others, and, to get a benefit from de-dup, this implementation
> obviously needs to be changed.

Agreed.

> But: Isn't there an implicit expectation of a space guarantee associated
> with a dataset? In other words, if a dataset has 1GB of data, isn't it
> natural to expect to be able to overwrite that space with other data?

I'd say that expectation is not [always] valid. Assume you have a dataset of 1GB of data and the pool free space is 200 MB. You are cloning that dataset and trying to overwrite the data on the cloned dataset. You will hit "no more space left on device" pretty soon. Wonders of virtualization :)

-- 
Regards,
	Cyril
I am a bit of a Solaris newbie. I have a brand spankin' new Solaris 10u8 machine (x4250) that is running an attached J4400 and some internal drives. We're using multipathed SAS I/O (enabled via stmsboot), so the device mount points have been moved off from their "normal" c0t5d0 to long strings -- in the case of c0t5d0, it's now /dev/rdsk/c6t5000CCA00A274EDCd0. (I can see the cross-referenced devices with stmsboot -L.)

Normally, when replacing a disk on a Solaris system, I would run cfgadm -c unconfigure c0::dsk/c0t5d0. However, cfgadm -l does not list c6, nor does it list any disks. In fact, running cfgadm against the places where I think things are supposed to live gets me the following:

bash# cfgadm -l /dev/rdsk/c0t5d0
Ap_Id                           Type         Receptacle   Occupant     Condition
/dev/rdsk/c0t5d0: No matching library found

bash# cfgadm -l /dev/rdsk/c6t5000CCA00A274EDCd0
cfgadm: Attachment point not found

bash# cfgadm -l /dev/dsk/c6t5000CCA00A274EDCd0
Ap_Id                           Type         Receptacle   Occupant     Condition
/dev/dsk/c6t5000CCA00A274EDCd0: No matching library found

bash# cfgadm -l c6t5000CCA00A274EDCd0
Ap_Id                           Type         Receptacle   Occupant     Condition
c6t5000CCA00A274EDCd0: No matching library found

I ran devfsadm -C -v and it removed all of the old attachment points for the /dev/dsk/c0t5d0 devices and created some for the c6 devices. Running cfgadm -al shows a c0, c4, and c5 -- these correspond to the actual controllers, but no devices are attached to the controllers.

I found an old email on this list about MPxIO that said the solution was basically to yank the physical device after making sure that no I/O was happening to it. While this worked and allowed us to return the device to service as a spare in the zpool it inhabits, more concerning was what happened when we ran mpathadm list lu after yanking the device and returning it to service:

------

bash# mpathadm list lu
        /dev/rdsk/c6t5000CCA00A2A9398d0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A29EE2Cd0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A2BDBFCd0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A2A8E68d0s2
                Total Path Count: 1
                Operational Path Count: 1
        /dev/rdsk/c6t5000CCA00A0537ECd0s2
                Total Path Count: 1
                Operational Path Count: 1
mpathadm: Error: Unable to get configuration information.
mpathadm: Unable to complete operation

(Side note: Some of the disks are single path via an internal controller, and some of them are multi path in the J4400 via two external controllers.)

A reboot fixed the 'issue' with mpathadm and it now outputs complete data.

--------

So -- how do I administer and remove physical devices that are in multipath-managed controllers on Solaris 10u8 without breaking multipath and causing configuration changes that interfere with the services and devices attached via mpathadm and the other voodoo and black magic inside? I can't seem to find this documented anywhere, even if the instructions to enable multipathing with stmsboot -e were quite complete and worked well!

Thanks,
Karl Katzke

-- 
Karl Katzke
Systems Analyst II
TAMU - RGS
Hi Cyril,

>> But: Isn't there an implicit expectation of a space guarantee associated
>> with a dataset? In other words, if a dataset has 1GB of data, isn't it
>> natural to expect to be able to overwrite that space with other data?
>
> I'd say that expectation is not [always] valid. Assume you have a
> dataset of 1GB of data and the pool free space is 200 MB. You are
> cloning that dataset and trying to overwrite the data on the cloned
> dataset. You will hit "no more space left on device" pretty soon.
> Wonders of virtualization :)

The point I wanted to make is that by defining a (ref)reservation for that clone, ZFS won't even create it if space does not suffice:

root@haggis:~# zpool list
NAME   SIZE  USED  AVAIL  CAP  HEALTH  ALTROOT
rpool  416G  187G   229G  44%  ONLINE  -
root@haggis:~# zfs clone -o refreservation=230g rpool/export/home/slink/tmp@zfs-auto-snap:frequent-2009-11-03-22:04:46 rpool/test
cannot create 'rpool/test': out of space

I don't see how a similar guarantee could be given with de-dup.

Nils
On Tue, November 3, 2009 15:06, Cyril Plisko wrote:
> On Tue, Nov 3, 2009 at 10:54 PM, Nils Goroll <slink at schokola.de> wrote:
>> But: Isn't there an implicit expectation of a space guarantee associated
>> with a dataset? In other words, if a dataset has 1GB of data, isn't it
>> natural to expect to be able to overwrite that space with other data?
>
> I'd say that expectation is not [always] valid. Assume you have a
> dataset of 1GB of data and the pool free space is 200 MB. You are
> cloning that dataset and trying to overwrite the data on the cloned
> dataset. You will hit "no more space left on device" pretty soon.
> Wonders of virtualization :)

Yes, and the same is true potentially with compression as well; if the old data blocks are actually deleted and freed up (meaning no snapshots or other things keeping them around), the new data still may not fit in those blocks due to differing compression based on what the data actually is.

So that's a bit of assumption we're just going to have to get over making in general. No point in trying to preserve a naive mental model that simply can't stand up to reality.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
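A small sketch of the compression case used for comparison here (pool and dataset names are made up):

# zfs create -o compression=on tank/comp
# cp -r /usr/share/man /tank/comp          # some compressible data
# zfs get compressratio,used,referenced,available tank/comp

Just as with dedup, the space needed to overwrite existing data depends on how well the *new* data compresses, so having 1 GB of old data on disk guarantees nothing about whether its replacement will fit.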
> But: Isn't there an implicit expectation of a space guarantee associated with a
> dataset? In other words, if a dataset has 1GB of data, isn't it natural to
> expect to be able to overwrite that space with other data?

Is there such a space guarantee for compressed or cloned zfs?
-- 
This message posted from opensolaris.org
> No point in trying to preserve a naive mental model that
> simply can't stand up to reality.

I kind of dislike the idea of talking about naivety here. Being able to give guarantees (in this case: reserve space) can be vital for running critical business applications. Think about the analogy in memory management (proper swap space reservation vs. the oom-killer).

But I realize that talking about an "implicit expectation" to give some motivation for reservations probably led to some misunderstanding.

Sorry, Nils
On Tue, November 3, 2009 16:36, Nils Goroll wrote:
>> No point in trying to preserve a naive mental model that
>> simply can't stand up to reality.
>
> I kind of dislike the idea of talking about naivety here.

Maybe it was a poor choice of words; I mean something more along the lines of "simplistic". The point is, "space" is no longer as simple a concept as it was 40 years ago. Even without deduplication, there is the possibility of clones and compression causing things not to behave the same way a simple filesystem on a hard drive did long ago.

> Being able to give guarantees (in this case: reserve space) can be vital
> for running critical business applications. Think about the analogy in
> memory management (proper swap space reservation vs. the oom-killer).

In my experience, systems that run on the edge of their resources and depend on guarantees to make them work have endless problems, whereas if they are not running on the edge of their resources, they work fine regardless of guarantees. For a very few kinds of embedded systems I can see the need to work to the edges (aircraft flight systems, for example), but that's not something you do in a general-purpose computer with a general-purpose OS.

> But I realize that talking about an "implicit expectation" to give some
> motivation for reservations probably led to some misunderstanding.
>
> Sorry, Nils

There's plenty of real stuff worth discussing around this issue, and I apologize for choosing a belittling term to express disagreement. I hope it doesn't derail the discussion.

-- 
David Dyer-Bennet, dd-b at dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info
Hi David,

>>> simply can't stand up to reality.
>> I kind of dislike the idea of talking about naivety here.
>
> Maybe it was a poor choice of words; I mean something more along the lines
> of "simplistic". The point is, "space" is no longer as simple a concept
> as it was 40 years ago. Even without deduplication, there is the
> possibility of clones and compression causing things not to behave the
> same way a simple filesystem on a hard drive did long ago.

Thanks for emphasizing this again - I do absolutely agree that with today's technologies proper monitoring and proactive management is much more important than ever before. But, again, risks can be reduced.

>> Being able to give guarantees (in this case: reserve space) can be vital
>> for running critical business applications. Think about the analogy in
>> memory management (proper swap space reservation vs. the oom-killer).
>
> In my experience, systems that run on the edge of their resources and
> depend on guarantees to make them work have endless problems, whereas if
> they are not running on the edge of their resources, they work fine
> regardless of guarantees.

Agree. But what if things go wrong and a process eats up all your storage in error? If it's got its own dataset and you've used a reservation for your critical application on another dataset, you have a higher chance of surviving.

> There's plenty of real stuff worth discussing around this issue, and I
> apologize for choosing a belittling term to express disagreement. I hope
> it doesn't derail the discussion.

It certainly won't on my side. Thank you for the clarification.

Thanks, Nils
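A minimal sketch of the kind of guarantee being discussed (dataset names are made up):

# zfs create -o reservation=10g tank/critical-app
# zfs set refreservation=10g tank/critical-app   # alternative: guarantee space for the dataset itself, excluding snapshots and descendants

The reservation is deducted from the pool's free space immediately, so a runaway writer in another dataset cannot consume the space set aside for the critical one. How such a guarantee should interact with de-duped blocks is exactly the open question in this thread.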
Cyril Plisko wrote:
> Many people (me included) perceive deduplication as a means to save
> disk space and allow more data to be squeezed into a storage system.
> What you are saying is that ZFS dedup does a wonderful job of detecting
> duplicate blocks, goes to all the trouble of removing the extra copies,
> and keeps an accounting of everything. However, when it comes to letting
> me use the freed space, I will be plainly denied. If that is so, what
> would be the reason to use ZFS deduplication at all?

C'mon, it is obviously a bug and not a design feature. (At least I hope/think that is the case.)
On Nov 3, 2009, at 12:24 PM, Cyril Plisko wrote:

>>>> I think I'm observing the same (with changeset 10936) ...
>>>
>>> # mkfile 2g /var/tmp/tank.img
>>> # zpool create tank /var/tmp/tank.img
>>> # zfs set dedup=on tank
>>> # zfs create tank/foobar
>>
>> This has to do with the fact that dedup space accounting is charged
>> to all filesystems, regardless of whether blocks are deduped. To do
>> otherwise is impossible, as there is no true "owner" of a block, and
>> the fact that it may or may not be deduped is often beyond the control
>> of a single filesystem.
>>
>> This has some interesting pathologies as the pool gets full.
>> Namely, that ZFS will artificially enforce a limit on the logical size
>> of the pool based on non-deduped data. This is obviously something
>> that should be addressed.
>
> Eric,
>
> Many people (me included) perceive deduplication as a means to save
> disk space and allow more data to be squeezed into a storage system.
> What you are saying is that ZFS dedup does a wonderful job of detecting
> duplicate blocks, goes to all the trouble of removing the extra copies,
> and keeps an accounting of everything. However, when it comes to letting
> me use the freed space, I will be plainly denied. If that is so, what
> would be the reason to use ZFS deduplication at all?

Please read my response before you respond. What do you think "this is obviously something that should be addressed" means? There is already a CR filed and the ZFS team is working on it.

- Eric

> --
> Regards,
>	Cyril

-- 
Eric Schrock, Fishworks
http://blogs.sun.com/eschrock
Green-bytes is publicly selling their hardware and dedup solution today. From the feedback of others, and testing by someone on our team, we've found the quality of the initial putback to be buggy and not even close to production ready. (That's fine, since nobody has stated it was production ready.)

It brings up the question, though, of where the green-bytes code is. They are obligated under the CDDL to release their changes *unless* they privately bought a license from Sun. It seems the conflicts from the lawsuit may or may not be resolved, but still...

Where's the code?
On Tuesday, November 3, 2009, "C. Bergström" <codestr0m at osunix.org> wrote:
> Green-bytes is publicly selling their hardware and dedup solution today.
> From the feedback of others, and testing by someone on our team, we've
> found the quality of the initial putback to be buggy and not even close
> to production ready. (That's fine, since nobody has stated it was
> production ready.)
>
> It brings up the question, though, of where the green-bytes code is. They
> are obligated under the CDDL to release their changes *unless* they
> privately bought a license from Sun. It seems the conflicts from the
> lawsuit may or may not be resolved, but still...
>
> Where's the code?

I highly doubt you're going to get any commentary from Sun engineers on pending litigation.

--Tim
Jorgen Lundman
2009-Nov-04 00:20 UTC
[zfs-discuss] ZFS dedup vs compression vs ZFS user/group quotas
We recently found that the ZFS user/group quota accounting for disk usage worked "opposite" to what we were expecting. I.e., any space saved from compression was a benefit to the customer, not to us. (We expected the Google style: give a customer a 2GB quota, and if compression saves space, that is profit to us.)

Is the space saved with dedup charged in the same manner? I would expect so, but I figured some of you would just know. I will check when b128 is out.

I don't suppose I can change the model? :)

Lund

-- 
Jorgen Lundman       | <lundman at lundman.net>
Unix Administrator   | +81 (0)3 -5456-2687 ext 1017 (work)
Shibuya-ku, Tokyo    | +81 (0)90-5578-8500          (cell)
Japan                | +81 (0)3 -3375-1767          (home)
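For reference, a sketch of the user-quota mechanics in question (the user and dataset names are made up):

# zfs set userquota@alice=2g rpool/export/home
# zfs get userquota@alice,userused@alice rpool/export/home

userused@ reflects the space the user's files reference on disk, i.e. after compression, which matches the observation above that compression savings currently accrue to the customer; as far as I know there is no property to switch the charging to the uncompressed size.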
Eric Schrock wrote:
> On Nov 3, 2009, at 12:24 PM, Cyril Plisko wrote:
>> However, when it comes to letting me use the freed space, I will be
>> plainly denied. If that is so, what would be the reason to use ZFS
>> deduplication at all?
>
> Please read my response before you respond. What do you think "this is
> obviously something that should be addressed" means? There is already a
> CR filed and the ZFS team is working on it.

We have a fix for this and it should be available in a couple of days.

- George
Hi Karl,

Welcome to Solaris/ZFS land ...

ZFS administration is pretty easy but our device administration is more difficult.

I'll probably bungle this response because I don't have similar hardware, and I hope some expert will correct me.

I think you will have to experiment with various forms of cfgadm. Also look at the cfgadm_fp man page.

See the examples below on a V210 with a 3510 array.

Cindy

# cfgadm -al | grep 226000c0ffa001ab
c1::226000c0ffa001ab      disk       connected    configured    unknown

# cfgadm -al -o show_SCSI_LUN
Ap_Id                     Type       Receptacle   Occupant      Condition
c1                        fc-fabric  connected    configured    unknown
c1::210000e08b1ad8c8      unknown    connected    unconfigured  unknown
c1::210100e08b3fbb64      unknown    connected    unconfigured  unknown
c1::226000c0ffa001ab,0    disk       connected    configured    unknown
c1::226000c0ffa001ab,1    disk       connected    configured    unknown
c1::226000c0ffa001ab,2    disk       connected    configured    unknown

# cfgadm -o show_FCP_dev -al
Ap_Id                     Type       Receptacle   Occupant      Condition
c1                        fc-fabric  connected    configured    unknown
c1::210000e08b1ad8c8      unknown    connected    unconfigured  unknown
c1::210100e08b3fbb64      unknown    connected    unconfigured  unknown
c1::226000c0ffa001ab,0    disk       connected    configured    unknown
c1::226000c0ffa001ab,1    disk       connected    configured    unknown
c1::226000c0ffa001ab,2    disk       connected    configured    unknown

On 11/03/09 14:10, Karl Katzke wrote:
> I am a bit of a Solaris newbie. I have a brand spankin' new Solaris 10u8
> machine (x4250) that is running an attached J4400 and some internal drives.
> We're using multipathed SAS I/O (enabled via stmsboot), so the device mount
> points have been moved off from their "normal" c0t5d0 to long strings -- in
> the case of c0t5d0, it's now /dev/rdsk/c6t5000CCA00A274EDCd0. (I can see the
> cross-referenced devices with stmsboot -L.)
>
> Normally, when replacing a disk on a Solaris system, I would run cfgadm -c
> unconfigure c0::dsk/c0t5d0. However, cfgadm -l does not list c6, nor does it
> list any disks.
George,

Your putback 10956 fixes that issue. Thanks.

On Wed, Nov 4, 2009 at 7:53 AM, George Wilson <George.Wilson at sun.com> wrote:
> We have a fix for this and it should be available in a couple of days.
>
> - George

-- 
Piotr Jasiukajtis | estibi | SCA OS0072
http://estseg.blogspot.com
Fix now available:

http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010716.html
http://hg.genunix.org/onnv-gate.hg/rev/0c81acaaf614

6897693 deduplication can only go so far
http://bugs.opensolaris.org/view_bug.do?bug_id=6897693
-- 
This message posted from opensolaris.org
Hi,

Is there a bug in this forum or something? I created a thread called "ZFS dedup issue" and now I can find here messages like:

[zfs-discuss] ZFS dedup accounting
[zfs-discuss] ZFS dedup accounting & reservations
[zfs-discuss] Where is green-bytes dedup code?
[zfs-discuss] MPxIO and removing physical devices
[zfs-discuss] ZFS dedup vs compression vs ZFS user/group quotas

I see the same on mail.opensolaris.org.
-- 
This message posted from opensolaris.org
Some people, like yourself, are posting to the forum via the web-based 'Jive' interface to the forum.

Other people are posting using email:
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Jive tries to reconcile these two methods, but it often ends up a bit of a mess, especially if the email people edit the subject line...

Regards
Nigel Smith
-- 
This message posted from opensolaris.org
Oh, and some of the people posting via email have not registered a userID profile on the OpenSolaris web site, which also does not help! -- This message posted from opensolaris.org
On Thu, November 5, 2009 09:10, Nigel Smith wrote:
> Some people, like yourself, are posting to the forum
> via the web-based 'Jive' interface to the forum.
>
> Other people are posting using email:
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> Jive tries to reconcile these two methods, but it
> often ends up a bit of a mess, especially if the
> email people edit the subject line...

AFAICT, Jive supports an NNTP back-end. Threading in NNTP (and e-mail using JWZ's algorithm) would be a lot more resilient than fickle Subject headers. :)
Cindy -

Hi! Thanks for responding, I was starting to despair a bit ..

In cfgadm -al, I don't have *any* disks showing. I see my three SCSI-BUS controllers, but there is nothing listed underneath them as there is in yours. I realize that I'm using mpt SAS and not FC, and the SAS support is new, but ... it seems that I'm not getting the output I should be. There doesn't seem to be a way to operate.

In `format`, the disks show up as /scsi_vhci/disk@(GUID)

In 'stmsboot -L', the disks show up as /dev/rdsk/c4t(GUID)

In 'cfgadm -al', the hardware controllers show up (c0, c1, c2) but the c4 'soft' controller does not.

In 'mpathadm list lu', the disks show as /dev/rdsk/c4t(GUID).

I have run devfsadm -Cv and it removed all of the old c0, c1, c2 points.

I mean, it works -- I can access data on my filesystems -- but removing or replacing a disk seriously screws things up, because the GUID of the disk is the attachment point, and I can't seem to remove a device or add a new one! There's got to be a way to do that properly. I can't put it in production this way knowing that there's going to be problems removing failed disks!

-K

Karl Katzke
Systems Analyst II
TAMU - RGS

>>> On 11/4/2009 at 10:36 AM, in message <4AF1AD76.4050504 at Sun.COM>, Cindy
Swearingen <Cindy.Swearingen at Sun.COM> wrote:
> Hi Karl,
>
> I think you will have to experiment with various forms of cfgadm.
> Also look at the cfgadm_fp man page.
>
> See the examples below on a V210 with a 3510 array.
Hi Karl,

I was hoping my lame response would bring out the device experts on this list.

I will poke around some more to get a better response.

Back soon--

Cindy

On 11/05/09 09:45, Karl Katzke wrote:
> In cfgadm -al, I don't have *any* disks showing. I see my three SCSI-BUS
> controllers, but there is nothing listed underneath them as there is in
> yours. I realize that I'm using mpt SAS and not FC, and the SAS support
> is new, but ... it seems that I'm not getting the output I should be.
>
> I mean, it works -- I can access data on my filesystems -- but removing
> or replacing a disk seriously screws things up, because the GUID of the
> disk is the attachment point, and I can't seem to remove a device or add
> a new one!
>>> [remainder of Karl's original message trimmed -- it is quoted in full in his post later in this thread]
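For anyone trying to line up the three device namespaces Karl lists, a minimal sketch of cross-referencing them follows, using only commands that already appear in this thread. The long WWN-style disk name is Karl's example rather than a prescription, and whether the abbreviated `show lu` form is accepted on a given build is an assumption worth checking against the mpathadm man page.

    # Print the mapping of pre-MPxIO names (c0t5d0) to the
    # multipathed scsi_vhci names (c6t5000CCA0...d0).
    stmsboot -L

    # Inspect one multipathed LUN and its path state; the device
    # name here is taken from Karl's mail, not a general recipe.
    mpathadm show lu /dev/rdsk/c6t5000CCA00A274EDCd0s2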
Hi Karl,

It turns out that cfgadm doesn't understand when MPxIO is enabled for SCSI disks, and this is a known problem. The bug trail starts here:

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6776330
cfgadm scsi plugin needs to provide support for mpxio enabled controllers

The good news is that this is fixed in the Solaris Nevada release, build 126. Unfortunately, you are using the latest Solaris 10 release, which does not include the resolution for this issue.

I know that you are testing drive-failure scenarios, but no easy solution exists to help you simulate a drive failure with cfgadm as long as MPxIO is enabled in the Solaris release that you are running. We have some internal tools to do this, but they are not shipped with Solaris releases. Someone else might have a better idea.

Simulated disk failures aside, setting up a redundant ZFS storage pool with hot spares is the best way to reduce down-time due to hardware failures.

Cindy

On 11/05/09 10:04, Cindy Swearingen wrote:
> [earlier messages in this thread trimmed; see the posts above]

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
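To make Cindy's hot-spare suggestion concrete, here is a minimal sketch of handling a failed disk purely at the ZFS level, with no cfgadm involved. The pool name and disk names are invented for the example and are not from this thread.

    # Create a redundant pool with a hot spare attached.
    zpool create tank mirror c6t5000CCA000000001d0 c6t5000CCA000000002d0 \
        spare c6t5000CCA000000003d0

    # When a drive fails, see which pools are unhealthy...
    zpool status -x

    # ...then swap the failed disk for its replacement; ZFS
    # resilvers onto the new device.
    zpool replace tank c6t5000CCA000000001d0 c6t5000CCA000000009d0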
Nigel Smith wrote:
> Fix now available:
>
> http://mail.opensolaris.org/pipermail/onnv-notify/2009-November/010716.html
> http://hg.genunix.org/onnv-gate.hg/rev/0c81acaaf614
>
> 6897693 deduplication can only go so far
> http://bugs.opensolaris.org/view_bug.do?bug_id=6897693

I have one more fix coming which addresses another accounting issue. Should be in later today.

Thanks to all who have tested this out!

- George
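Once a build containing these accounting fixes is in place, the effect should show up in the ordinary space reporting. A rough sketch of what one might check, with an invented pool and dataset name:

    # Pool-wide dedup ratio and free space as ZFS reports them.
    zpool list tank

    # Per-dataset view of the dedup setting and space usage.
    zfs get dedup,used,referenced,available tank/data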
Maidak Alexander J
2009-Nov-09 17:48 UTC
[zfs-discuss] MPxIO and removing physical devices
I'm not sure if this is exactly what you're looking for, but check out the workaround in this bug:

http://bugs.opensolaris.org/view_bug.do;jsessionid=9011b9dacffa0ffffffffb615db182bbcd7b?bug_id=6559281

Basically, look through "cfgadm -al" and run the following command on the "unusable" attachment points. Example:

cfgadm -o unusable_FCP_dev -c unconfigure c2::5005076801400525

You might also try the "Storage-Discuss" list.

-Alex

-----Original Message-----
From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Karl Katzke
Sent: Tuesday, November 03, 2009 3:11 PM
To: zfs-discuss at opensolaris.org
Subject: [zfs-discuss] MPxIO and removing physical devices

I am a bit of a Solaris newbie. I have a brand spankin' new Solaris 10u8 machine (x4250) that is running an attached J4400 and some internal drives. We're using multipathed SAS I/O (enabled via stmsboot), so the device mount points have been moved off from their "normal" c0t5d0 to long strings -- in the case of c0t5d0, it's now /dev/rdsk/c6t5000CCA00A274EDCd0. (I can see the cross-referenced devices with stmsboot -L.)

Normally, when replacing a disk on a Solaris system, I would run cfgadm -c unconfigure c0::dsk/c0t5d0. However, cfgadm -l does not list c6, nor does it list any disks. In fact, running cfgadm against the places where I think things are supposed to live gets me the following:

bash# cfgadm -l /dev/rdsk/c0t5d0
Ap_Id                Type         Receptacle   Occupant     Condition
/dev/rdsk/c0t5d0: No matching library found

bash# cfgadm -l /dev/rdsk/c6t5000CCA00A274EDCd0
cfgadm: Attachment point not found

bash# cfgadm -l /dev/dsk/c6t5000CCA00A274EDCd0
Ap_Id                Type         Receptacle   Occupant     Condition
/dev/dsk/c6t5000CCA00A274EDCd0: No matching library found

bash# cfgadm -l c6t5000CCA00A274EDCd0
Ap_Id                Type         Receptacle   Occupant     Condition
c6t5000CCA00A274EDCd0: No matching library found

I ran devfsadm -C -v and it removed all of the old attachment points for the /dev/dsk/c0t5d0 devices and created some for the c6 devices. Running cfgadm -al shows a c0, c4, and c5 -- these correspond to the actual controllers, but no devices are attached to the controllers.

I found an old email on this list about MPxIO that said the solution was basically to yank the physical device after making sure that no I/O was happening to it. While this worked and allowed us to return the device to service as a spare in the zpool it inhabits, more concerning was what happened when we ran mpathadm list lu after yanking the device and returning it to service:

------

bash# mpathadm list lu
/dev/rdsk/c6t5000CCA00A2A9398d0s2
Total Path Count: 1
Operational Path Count: 1
/dev/rdsk/c6t5000CCA00A29EE2Cd0s2
Total Path Count: 1
Operational Path Count: 1
/dev/rdsk/c6t5000CCA00A2BDBFCd0s2
Total Path Count: 1
Operational Path Count: 1
/dev/rdsk/c6t5000CCA00A2A8E68d0s2
Total Path Count: 1
Operational Path Count: 1
/dev/rdsk/c6t5000CCA00A0537ECd0s2
Total Path Count: 1
Operational Path Count: 1
mpathadm: Error: Unable to get configuration information.
mpathadm: Unable to complete operation

(Side note: Some of the disks are single path via an internal controller, and some of them are multi path in the J4400 via two external controllers.)

A reboot fixed the 'issue' with mpathadm and it now outputs complete data.

--------

So -- how do I administer and remove physical devices that are in multipath-managed controllers on Solaris 10u8 without breaking multipath and causing configuration changes that interfere with the services and devices attached via mpathadm and the other voodoo and black magic inside? I can't seem to find this documented anywhere, even if the instructions to enable multipathing with stmsboot -e were quite complete and worked well!

Thanks,
Karl Katzke

--
Karl Katzke
Systems Analyst II
TAMU - RGS

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
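A minimal sketch of applying Alex's workaround across every attachment point whose condition shows up as unusable; whether the FCP-specific hardware option behaves the same way for SAS attachment points on a J4400 is an open question, so treat this as a starting point rather than a recipe.

    # List attachment points whose condition column reads "unusable".
    cfgadm -al | awk '$NF == "unusable" {print $1}'

    # Unconfigure one of them, using the form from the bug report
    # (the Ap_Id is the example from Alex's mail).
    cfgadm -o unusable_FCP_dev -c unconfigure c2::5005076801400525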
Hello all

Sorry for bumping an old thread, but now that snv_128 is due to appear as a public DVD download, I wonder: has this fix for zfs-accounting and other issues with zfs dedup been integrated into build 128?

We have a fileserver which is likely to have much redundant data and we'd like to clean up its space with zfs-deduping (even if that takes copying files over to a temp dir and back - so their common blocks are noticed by the code). Will build 128 be ready for the task - and increase our server's available space after deduping - or should we better wait for another one?

In general, were there any stability issues with snv_128 during internal/BFU testing?

TIA,
//Jim
--
This message posted from opensolaris.org
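For what it is worth, the rewrite trick Jim describes would look roughly like the sketch below. The dataset and paths are invented, dedup only affects blocks written after it is enabled, and copying data aside and back is a crude way to force a rewrite, so this is a sketch of the idea rather than a tested procedure.

    # Enable dedup for new writes on the dataset.
    zfs set dedup=on tank/data

    # Rewrite existing files so their blocks go through the
    # dedup code: copy aside, remove, and move back.
    cp -rp /tank/data/projects /tank/data/projects.tmp
    rm -rf /tank/data/projects
    mv /tank/data/projects.tmp /tank/data/projects

    # Check the result at the pool level.
    zpool list tank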
Hi Jim,

Nevada build 128 had some problems, so it will not be released. The dedup space fixes should be available in build 129.

Thanks,

Cindy

On 12/02/09 02:37, Jim Klimov wrote:
> Hello all
>
> Sorry for bumping an old thread, but now that snv_128 is due to appear as a public DVD download, I wonder: has this fix for zfs-accounting and other issues with zfs dedup been integrated into build 128?
>
> We have a fileserver which is likely to have much redundant data and we'd like to clean up its space with zfs-deduping (even if that takes copying files over to a temp dir and back - so their common blocks are noticed by the code). Will build 128 be ready for the task - and increase our server's available space after deduping - or should we better wait for another one?
>
> In general, were there any stability issues with snv_128 during internal/BFU testing?
>
> TIA,
> //Jim
Hey Cindy!

Any idea of when we might see 129? (an approximation only). I ask the question because I'm pulling budget funds to build a filer, but it may not be in service until mid-January. Would it be reasonable to say that we might see 129 by then, or are we looking at summer... or even beyond?

I don't see that there's a "wrong answer" here necessarily, :) :) :) I'll go with what's out, but dedup is a big one and a feature that made me commit to this project.

-Colin

On Wed, Dec 2, 2009 at 17:06, Cindy Swearingen <Cindy.Swearingen at sun.com> wrote:

> Hi Jim,
>
> Nevada build 128 had some problems so will not be released.
>
> The dedup space fixes should be available in build 129.
>
> Thanks,
>
> Cindy
>
> On 12/02/09 02:37, Jim Klimov wrote:
>
>> Hello all
>>
>> Sorry for bumping an old thread, but now that snv_128 is due to appear as a public DVD download, I wonder: has this fix for zfs-accounting and other issues with zfs dedup been integrated into build 128?
>>
>> We have a fileserver which is likely to have much redundant data and we'd like to clean up its space with zfs-deduping (even if that takes copying files over to a temp dir and back - so their common blocks are noticed by the code). Will build 128 be ready for the task - and increase our server's available space after deduping - or should we better wait for another one?
>>
>> In general, were there any stability issues with snv_128 during internal/BFU testing?
>>
>> TIA,
>> //Jim
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Colin Raven wrote:
> Hey Cindy!
>
> Any idea of when we might see 129? (an approximation only). I ask the
> question because I'm pulling budget funds to build a filer, but it may
> not be in service until mid-January. Would it be reasonable to say
> that we might see 129 by then, or are we looking at summer... or even
> beyond?
>
> I don't see that there's a "wrong answer" here necessarily, :) :) :)
> I'll go with what's out, but dedup is a big one and a feature that
> made me commit to this project.

The unstable and experimental Sun builds typically lag about 2 weeks behind the cut of the hg tag. (Holidays and respins can derail that, of course.) The stable releases I have no clue about.

Depending on your level of adventure, osunix in our next release may be interesting to you. Feel free to email me off-list or say hi in #osunix on irc.freenode.net.

Thanks
./C