Robert Milkowski
2007-Mar-22 15:46 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hi.

System is snv_56 x86 32bit.

bash-3.00# zpool status solaris
  pool: solaris
 state: ONLINE
 scrub: scrub stopped with 0 errors on Thu Mar 22 16:25:23 2007
config:

        NAME        STATE     READ WRITE CKSUM
        solaris     ONLINE       0     0     0
          c0t1d0    ONLINE       0     0     0

errors: No known data errors

bash-3.00# zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
solaris                          11.7G  5.02G  3.27G  /solaris
solaris/d100                     1.64G  5.02G  1.64G  /solaris/d100
solaris/d100@replicate_previous      0      -  1.64G  -
solaris/d100@replicate_latest        0      -  1.64G  -
solaris/d100-copy                12.0M  5.02G  12.0M  /solaris/d100-copy
solaris/d100-copy1               1.31G  5.02G  1.31G  /solaris/d100-copy1
solaris/d101                      348M  5.02G  15.3M  /solaris/d101
solaris/d101@replicate_previous   333M      -   348M  -
solaris/d101@replicate_latest        0      -  15.3M  -
solaris/d101-copy                15.3M  5.02G  15.3M  /solaris/d101-copy
solaris/testws                   5.13G  5.02G  5.13G  /export/testws/
bash-3.00#

File systems solaris/d100 and solaris/d100-copy1 contain the same data.

bash-3.00# ls -l /solaris/d100 | wc -l
     163
bash-3.00# ls -l /solaris/d100-copy1 | wc -l
     163
bash-3.00#
bash-3.00# gtar cvf /solaris/2.tar /solaris/d100-copy1
bash-3.00# gtar cvf /solaris/1.tar /solaris/d100
bash-3.00# ls -l /solaris/1.tar
-rw-r--r--   1 root     other    1755699200 Mar 22 16:15 /solaris/1.tar
bash-3.00# ls -l /solaris/2.tar
-rw-r--r--   1 root     other    1755699200 Mar 22 16:19 /solaris/2.tar
bash-3.00#
bash-3.00# zdb -v solaris/d100 >/tmp/1
bash-3.00# zdb -v solaris/d100-copy1 >/tmp/2
bash-3.00# diff -u /tmp/1 /tmp/2
--- /tmp/1      Thu Mar 22 16:41:52 2007
+++ /tmp/2      Thu Mar 22 16:41:57 2007
@@ -1,7 +1,7 @@
-Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects
+Dataset solaris/d100-copy1 [ZPL], ID 128, cr_txg 831226, 1.31G, 807 objects

    Object  lvl   iblk   dblk  lsize  asize  type
-        0    7    16K    16K   416K   242K  DMU dnode
+        0    7    16K    16K   416K   239K  DMU dnode
         1    1    16K    512    512     1K  ZFS master node
         2    1    16K    512    512     1K  ZFS delete queue
         3    1    16K  10.5K  10.5K     4K  ZFS directory
@@ -807,5 +807,5 @@
       806    1    16K  66.5K  66.5K  66.5K  ZFS plain file
       807    1    16K  67.5K  67.5K  67.5K  ZFS plain file
       808    1    16K  24.5K  24.5K  24.5K  ZFS plain file
-      809    3    16K   128K  1.58G  1.58G  ZFS plain file
+      809    3    16K   128K  1.58G  1.24G  ZFS plain file
bash-3.00#
bash-3.00# find /solaris/d100-copy1/ -inum 809 -ls
   809 1304748 -rw-r--r--   1 root     other    1692205056 Mar 22 16:05 /solaris/d100-copy1/m1
bash-3.00# find /solaris/d100/ -inum 809 -ls
   809 1652825 -rw-r--r--   1 root     other    1692205056 Mar 22 16:05 /solaris/d100/m1
bash-3.00# diff -b /solaris/d100/m1 /solaris/d100-copy1/m1
bash-3.00#

While lsize is the same for both files, asize is smaller for the second one. Why is that? How is it possible? Both file systems have compression turned off and the default recordsize. diff claims both files are the same.

Any idea?
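A few quick cross-checks that show the same discrepancy without zdb (a sketch using the thread's paths; on ZFS, du reports allocated blocks, roughly asize, while ls -l reports logical length):

  ls -l /solaris/d100/m1 /solaris/d100-copy1/m1      # logical sizes: identical
  du -k /solaris/d100/m1 /solaris/d100-copy1/m1      # allocated sizes: differ
  zfs get compression,recordsize solaris/d100 solaris/d100-copy1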
Matthew Ahrens
2007-Mar-22 19:07 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> While lsize is the same for both files, asize is smaller for the second
> one. Why is that? How is it possible? Both file systems have
> compression turned off and default recordsize. Diff claims both files
> to be the same.

Metadata (eg, "DMU dnode", and indirect blocks for "ZFS plain file", which you can see broken out by using more -b's) is always compressed.

Because the metadata is necessarily different (there are different block pointers; also the object numbers could be allocated differently, though not in your situation), it can compress by different amounts.

So, this is always possible, and in fact likely.

--matt
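A reading aid for the zdb block-pointer lines quoted later in the thread: each line carries the block's logical and physical sizes in hex (NNNNL/NNNNP), so metadata compression is directly visible. For example, annotating one line from the output further down:

  #  offset  lvl  DVA (vdev:offset:asize)  lsize/psize   fill     birth txg
     0       L2   0:ea1a0800:1400          4000L/1400P   F=12911  B=831388
  # a 16K (0x4000) logical indirect block stored compressed in 5K (0x1400) on disk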
Robert Milkowski
2007-Mar-22 22:49 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hello Matthew,

Thursday, March 22, 2007, 8:07:14 PM, you wrote:

MA> Robert Milkowski wrote:
>> While lsize is the same for both files, asize is smaller for the second
>> one. Why is that? How is it possible? Both file systems have
>> compression turned off and default recordsize. Diff claims both files
>> to be the same.

MA> Metadata (eg, "DMU dnode", and indirect blocks for "ZFS plain file",
MA> which you can see broken out by using more -b's) is always compressed.
MA> Because the metadata is necessarily different (there are different block
MA> pointers; also the object numbers could be allocated differently, though
MA> not in your situation), it can compress by different amounts.

MA> So, this is always possible, and in fact likely.

Well, I don't know. The DMU dnode in both cases is so small that it doesn't really matter. Both are the same files (diff confirms that), about 1.6GB in size, and the actual on-disk size differs by more than 20%. That's really a big difference for just one large file.

zdb -b (or -bbb) doesn't work here (b56):

bash-3.00# zdb -b solaris/d100 809
Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects
bash-3.00# zdb -bbb solaris/d100 809
Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects
bash-3.00# zdb -bbbvvv solaris/d100 809
Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects

bash-3.00# zdb -vvvv solaris/d100 809 >/tmp/a
bash-3.00# zdb -vvvv solaris/d100-copy1 809 >/tmp/b
bash-3.00# cat /tmp/a | wc -l
   13070
bash-3.00# cat /tmp/b | wc -l
   10295

bash-3.00# tail -10 /tmp/a
        64d00000  L0 0:213420000:20000 20000L/20000P F=1 B=831385
        64d20000  L0 0:213440000:20000 20000L/20000P F=1 B=831385
        64d40000  L0 0:213460000:20000 20000L/20000P F=1 B=831385
        64d60000  L0 0:213480000:20000 20000L/20000P F=1 B=831385
        64d80000  L0 0:2134a0000:20000 20000L/20000P F=1 B=831385
        64da0000  L0 0:2134c0000:20000 20000L/20000P F=1 B=831385
        64dc0000  L0 0:ea1c0000:20000 20000L/20000P F=1 B=831388

                segment [0000000000000000, 0000000065000000) size 1.58G

bash-3.00# tail -10 /tmp/b
        64d00000  L0 0:116a60000:20000 20000L/20000P F=1 B=831417
        64d20000  L0 0:116a80000:20000 20000L/20000P F=1 B=831417
        64d40000  L0 0:116aa0000:20000 20000L/20000P F=1 B=831417
        64d60000  L0 0:116ac0000:20000 20000L/20000P F=1 B=831417
        64d80000  L0 0:116ae0000:20000 20000L/20000P F=1 B=831417
        64da0000  L0 0:116b00000:20000 20000L/20000P F=1 B=831417
        64dc0000  L0 0:116b20000:20000 20000L/20000P F=1 B=831417

                segment [0000000014c40000, 0000000026000000) size  276M
bash-3.00#

What's the last line about?
Also, only /tmp/a has Deadlist entries:

    Deadlist: 33 entries, 235K (114K/114K comp)

         Item   0: 0:191e0ea00:e00 4000L/e00P F=0 B=831102
         Item   1: 0:ea1a2000:800 4000L/800P F=0 B=831388
         Item   2: 0:191d58000:1000 4000L/1000P F=0 B=831102
         Item   3: 0:2507b2200:1200 4000L/1200P F=0 B=831294
         Item   4: 0:191e06200:1200 4000L/1200P F=0 B=831102
         Item   5: 0:191e07400:1200 4000L/1200P F=0 B=831102
         Item   6: 0:250186000:1000 4000L/1000P F=0 B=831294
         Item   7: 0:191e0b800:e00 4000L/e00P F=0 B=831102
         Item   8: 0:191e0d800:1200 4000L/1200P F=0 B=831102
         Item   9: 0:191e03e00:1200 4000L/1200P F=0 B=831102
         Item  10: 0:250188000:1000 4000L/1000P F=0 B=831294
         Item  11: 0:191e09800:1200 4000L/1200P F=0 B=831102
         Item  12: 0:191e10a00:1200 4000L/1200P F=0 B=831102
         Item  13: 0:191e02c00:1200 4000L/1200P F=0 B=831102
         Item  14: 0:191e05000:1200 4000L/1200P F=0 B=831102
         Item  15: 0:191e08600:1200 4000L/1200P F=0 B=831102
         Item  16: 0:2507b3400:e00 4000L/e00P F=0 B=831294
         Item  17: 0:191d57000:1000 4000L/1000P F=0 B=831102
         Item  18: 0:191d56000:1000 4000L/1000P F=0 B=831102
         Item  19: 0:250189000:1000 4000L/1000P F=0 B=831294
         Item  20: 0:191d59000:1000 4000L/1000P F=0 B=831102
         Item  21: 0:191e0f800:1200 4000L/1200P F=0 B=831102
         Item  22: 0:191e12e00:1200 4000L/1200P F=0 B=831102
         Item  23: 0:191e11c00:1200 4000L/1200P F=0 B=831102
         Item  24: 0:191e0aa00:e00 4000L/e00P F=0 B=831102
         Item  25: 0:25339a400:e00 4000L/e00P F=0 B=831342
         Item  26: 0:ea1a2800:800 4000L/800P F=0 B=831388
         Item  27: 0:ea1a1c00:400 4000L/400P F=0 B=831388
         Item  28: 0:ea1a3000:400 4000L/400P F=0 B=831388
         Item  29: 0:ea1a3400:400 4000L/400P F=0 B=831388
         Item  30: 0:ea1a3800:400 4000L/400P F=0 B=831388
         Item  31: 0:ea1a3c00:400 4000L/400P F=0 B=831388
         Item  32: 0:ea1a4000:200 400L/200P F=0 B=831388

What are those?

And even if such a big difference in actual space utilization is to be expected, something is far from perfect here. Both file systems are in the same pool, and over 20% difference in size for just one large file is huge; perhaps some algorithms are suboptimal.

-- 
Best regards,
 Robert                      mailto:rmilkowski@task.gda.pl
                             http://milek.blogspot.com
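A note on reading that last line: zdb's segment lines list contiguous ranges of allocated offsets within the file, start and end in hex, so ranges not covered by any segment are holes. The printed size checks out with shell arithmetic:

  printf '%d\n' $(( (0x26000000 - 0x14c40000) / 1024 / 1024 ))   # prints 275; zdb rounds to 276M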
Matthew Ahrens
2007-Mar-22 23:01 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> What's the last line about?

Ah -- I think that may help explain things. It may be that your file has some runs of zeros in it, which are represented as holes in d100-copy1/m1, but as blocks of zeros in d100/m1. It begs the question: what is this file, and how did you create the copy?

> Also only /tmp/a has Deadlist entries:

That's because you have snapshots of d100 but not of d100-copy1, and apparently the contents of the d100 fs have changed since the most recent snapshot.

--matt
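The holes-versus-zeros distinction is easy to demonstrate; a minimal sketch (paths illustrative; with compression off, as here, written zeros occupy real blocks, and only never-written ranges are holes):

  # written zeros: every block is allocated
  dd if=/dev/zero of=/solaris/d100/zeros bs=128k count=8

  # never-written range: a hole (mkfile -n sets the size without allocating blocks)
  mkfile -n 1m /solaris/d100/sparse

  ls -l /solaris/d100/zeros /solaris/d100/sparse   # same logical size, 1MB each
  du -k /solaris/d100/zeros /solaris/d100/sparse   # only 'zeros' has blocks allocated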
Robert Milkowski
2007-Mar-22 23:46 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hello Matthew,

Friday, March 23, 2007, 12:01:12 AM, you wrote:

MA> Robert Milkowski wrote:
>> What's the last line about?

MA> Ah -- I think that may help explain things. It may be that your file
MA> has some runs of zeros in it, which are represented as holes in
MA> d100-copy1/m1, but as blocks of zeros in d100/m1. It begs the
MA> question, what is this file and how did you create the copy?

This file is full of 0s - it was created by

  dd if=/dev/zero of=/solaris/d100/m1 bs=32k &

Then file system solaris/d100 was replicated, in a way similar to zfs send | zfs recv, into solaris/d100-copy1.

Now I wonder how the holes were created, and why not for the entire file...

>> Also only /tmp/a has Deadlist entries:

MA> That's because you have snapshots of d100 but not of d100-copy1, and
MA> apparently the contents of the d100 fs have changed since the most
MA> recent snapshot.

thanks for info

-- 
Best regards,
 Robert                      mailto:rmilkowski@task.gda.pl
                             http://milek.blogspot.com
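For reference, the stock snapshot-based replication that the custom mechanism approximates looks like this (snapshot name illustrative); a full send transmits every allocated block of the snapshot, so it would have reproduced the zeros rather than holes:

  zfs snapshot solaris/d100@replicate
  zfs send solaris/d100@replicate | zfs recv solaris/d100-copy1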
Matthew Ahrens
2007-Mar-23 01:49 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> MA> Ah -- I think that may help explain things. It may be that your file
> MA> has some runs of zeros in it, which are represented as holes in
> MA> d100-copy1/m1, but as blocks of zeros in d100/m1. It begs the
> MA> question, what is this file and how did you create the copy?
> 
> This file is full of 0s - it was created by
> 
>   dd if=/dev/zero of=/solaris/d100/m1 bs=32k &
> 
> Then file system solaris/d100 was replicated, in a way similar to
> zfs send | zfs recv, into solaris/d100-copy1.
> 
> Now I wonder how the holes were created, and why not for the entire file...

Hmm, that's definitely curious. What do you mean by "a similar way to zfs send | zfs recv"? Can you send me the full output of your 'zdb -vvvv solaris/d100{-copy1} 809'?

--matt
Robert Milkowski
2007-Mar-23 07:47 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hello Matthew,

Friday, March 23, 2007, 2:49:03 AM, you wrote:

MA> Hmm, that's definitely curious. What do you mean by "a similar way to
MA> zfs send | zfs recv"? Can you send me the full output of your
MA> 'zdb -vvvv solaris/d100{-copy1} 809'?

See http://milek.blogspot.com/2007/03/zfs-online-replication.html

Basically we've implemented a mechanism to replicate a zfs file system by implementing a new ioctl based on zfs send|recv. The difference is that we sleep() for a specified time (default 5s) and then ask for a new transaction, and if there is one we send it out.

More details really soon, I hope.

ps. zdb output sent privately

-- 
Best regards,
 Robert                      mailto:rmilkowski@task.gda.pl
                             http://milek.blogspot.com
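In stock-command terms the cycle corresponds roughly to the loop below; this is only a userland sketch, while the actual mechanism is an in-kernel ioctl that walks the live filesystem without taking these snapshots:

  # rough userland analogue of the 5-second replication cycle (sketch only)
  zfs snapshot solaris/d100@replicate_previous
  zfs send solaris/d100@replicate_previous | zfs recv solaris/d100-copy1

  while true; do
      sleep 5
      zfs snapshot solaris/d100@replicate_latest
      zfs send -i solaris/d100@replicate_previous solaris/d100@replicate_latest \
          | zfs recv solaris/d100-copy1
      # retire the old baseline on both sides and promote the new one
      zfs destroy solaris/d100@replicate_previous
      zfs destroy solaris/d100-copy1@replicate_previous
      zfs rename solaris/d100@replicate_latest solaris/d100@replicate_previous
      zfs rename solaris/d100-copy1@replicate_latest solaris/d100-copy1@replicate_previous
  done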
Matthew Ahrens
2007-Mar-23 16:01 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> Basically we've implemented a mechanism to replicate a zfs file system
> by implementing a new ioctl based on zfs send|recv. The difference is
> that we sleep() for a specified time (default 5s) and then ask for a
> new transaction, and if there is one we send it out.
> 
> More details really soon, I hope.
> 
> ps. zdb output sent privately

The smaller file has its first 320MB as a hole, while the larger file is entirely filled in. You can see this from the zdb output (the first number on each line is the offset):

    Indirect blocks:
               0 L2   0:115be2400:1200 4000L/1200P F=10192 B=831417
        14000000 L1   0:c0028c00:400 4000L/400P F=30 B=831370
        14c40000 L0   0:b8180000:20000 20000L/20000P F=1 B=831367
        14c60000 L0   0:b81a0000:20000 20000L/20000P F=1 B=831367
        ...

vs.

    Indirect blocks:
               0 L2   0:ea1a0800:1400 4000L/1400P F=12911 B=831388
               0 L1   0:2553bb400:400 4000L/400P F=128 B=831346
               0 L0   0:255400000:20000 20000L/20000P F=1 B=831346
           20000 L0   0:255420000:20000 20000L/20000P F=1 B=831346
           40000 L0   0:255440000:20000 20000L/20000P F=1 B=831346
        ...

How it got that way, I couldn't really say without looking at your code. If you are able to reproduce this using OpenSolaris bits, let me know.

--matt
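The 320MB figure falls straight out of the first populated offset in the smaller file's listing: offsets are hex, and the first L1/L0 blocks appear at 0x14000000, so nothing below that offset is allocated:

  printf '%d\n' $(( 0x14000000 / 1024 / 1024 ))    # prints 320 (MiB)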
> How it got that way, I couldn't really say without looking at your code.

It works like this:

In the new ioctl operation zfs_ioc_replicate_send(zfs_cmd_t *zc) we open the filesystem (not a snapshot):

  dmu_objset_open(zc->zc_name, DMU_OST_ANY,
      DS_MODE_STANDARD | DS_MODE_READONLY, &filesystem);

and call the dmu replicate send function (txg is the transaction group number):

  dmu_replicate_send(filesystem, &txg, ...);

There we set max_txg:

  ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

and call traverse_dsl_dataset:

  traverse_dsl_dataset(filesystem->os->os_dsl_dataset, *txg,
      ADVANCE_PRE | ADVANCE_HOLES | ADVANCE_DATA | ADVANCE_NOLOCK,
      replicate_cb, &ba);

After traversing, the next txg is returned:

  if (ba.got_data != 0)
      *txg = ba.max_txg + 1;

In replicate_cb we do the same as backup_cb does, but at the beginning we check the txg:

  /* remember last txg */
  if (bc->bc_blkptr.blk_birth) {
          if (bc->bc_blkptr.blk_birth > ba->max_txg)
                  return;
          ba->got_data = 1;
  }

After a 5 second delay we call the ioctl again with the txg returned from the last operation.
Matthew Ahrens
2007-Mar-24 00:09 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Łukasz wrote:
>> How it got that way, I couldn't really say without looking at your code.
> 
> It works like this:
...
> we set max_txg
> ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

So, how do you send the initial stream? Presumably you need to do it with ba.max_txg = 0? If, say, the first 320MB were written before your first ba.max_txg, then you wouldn't be sending that data, thus explaining the behavior you're seeing.

It seems to me that your algorithm is fundamentally flawed -- if the filesystem is changing, it will not result in a consistent (from the ZPL's point of view) filesystem. For example:

There are two directories, A and B. You last sent txg 10. In txg 13, a file is renamed from directory A to directory B. It is now txg 15, and you begin traversing to do a send, from txg 10 -> 15. While that's in progress, a new file is created in directory A, and synced out in txg 16.

When you visit directory A, you see that its birth time is 16 > 15, so you don't send it. When you visit directory B, you see that its birth time is 13 <= 15, so you send it. Now the other side has two links to the file, when it should have one.

Given that you don't actually have the data from txg 15 (because you didn't take a snapshot), I don't see how you could make this work.

(Also FYI, traversing changing filesystems in this way will almost certainly break once we rewrite as part of the pool space reduction work.)

--matt
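The race in that example, laid out as a timeline:

  txg 10   last send completed; the next traversal will cover (10, 15]
  txg 13   file renamed from directory A to directory B (both reborn at 13)
  txg 15   traversal of the live filesystem begins for the (10, 15] send
  txg 16   new file created in directory A; A's block reborn at txg 16

  visit A: birth 16 > 15  -> not sent, so the removed link is never propagated
  visit B: birth 13 <= 15 -> sent, adding the new link on the receiver
  result:  the receiver has the file linked in both A and B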
Matthew Ahrens
2007-Mar-24 18:13 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Kangurek wrote:
> Thanks for info.
> My idea was to traverse the changing filesystem; now I see that it will
> not work.
> I will try to traverse snapshots. Zreplicate will:
> 1. take snapshot @replicate_latest, and
> 2. send data up to snapshot @replicate_latest
> 3. wait X sec (X = 20)
> 4. remove @replicate_previous, rename @replicate_latest to
>    @replicate_previous
> 5. repeat from 1.
> 
> I'm sure it will work, but taking snapshots will be slow on a loaded
> filesystem.
> Do you have any idea how to speed up operations on snapshots?
> 1. remove @replicate_previous
> 2. rename @replicate_latest to @replicate_previous
> 3. create @replicate_latest

You can avoid the rename by doing:

  zfs create @A
  again:
  zfs destroy @B
  zfs create @B
  zfs send @A @B
  zfs destroy @A
  zfs create @A
  zfs send @B @A
  goto again

I'm not sure exactly what will be slow about taking snapshots, but one aspect might be that we have to suspend the intent log (see the call to zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to change that for a while now -- just let the snapshot have the (non-empty) zil header in it, but don't use it (eg. if we rollback or clone, explicitly zero out the zil header). So you might want to look into that.

--matt
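In concrete commands, the rotation above maps onto something like the loop below; a sketch only, with dataset names from the thread and an initial full send added to seed the target:

  # seed the target once with a full stream
  zfs snapshot solaris/d100@A
  zfs send solaris/d100@A | zfs recv solaris/d100-copy1

  while true; do
      # send the delta A -> B, then retire A on both sides
      zfs snapshot solaris/d100@B
      zfs send -i solaris/d100@A solaris/d100@B | zfs recv solaris/d100-copy1
      zfs destroy solaris/d100@A
      zfs destroy solaris/d100-copy1@A

      # send the delta B -> A, then retire B; no rename needed
      zfs snapshot solaris/d100@A
      zfs send -i solaris/d100@B solaris/d100@A | zfs recv solaris/d100-copy1
      zfs destroy solaris/d100@B
      zfs destroy solaris/d100-copy1@B
  done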
Neil Perrin
2007-Mar-24 18:30 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Matthew Ahrens wrote On 03/24/07 12:13,:
> Kangurek wrote:
> 
>> Thanks for info.
>> My idea was to traverse the changing filesystem; now I see that it
>> will not work.
>> I will try to traverse snapshots. Zreplicate will:
>> 1. take snapshot @replicate_latest, and
>> 2. send data up to snapshot @replicate_latest
>> 3. wait X sec (X = 20)
>> 4. remove @replicate_previous, rename @replicate_latest to
>>    @replicate_previous
>> 5. repeat from 1.
>>
>> I'm sure it will work, but taking snapshots will be slow on a loaded
>> filesystem.
>> Do you have any idea how to speed up operations on snapshots?
>> 1. remove @replicate_previous
>> 2. rename @replicate_latest to @replicate_previous
>> 3. create @replicate_latest
> 
> You can avoid the rename by doing:
> 
>   zfs create @A
>   again:
>   zfs destroy @B
>   zfs create @B
>   zfs send @A @B
>   zfs destroy @A
>   zfs create @A
>   zfs send @B @A
>   goto again
> 
> I'm not sure exactly what will be slow about taking snapshots, but one
> aspect might be that we have to suspend the intent log (see the call to
> zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to
> change that for a while now -- just let the snapshot have the
> (non-empty) zil header in it, but don't use it (eg. if we rollback or
> clone, explicitly zero out the zil header). So you might want to look
> into that.

I've always thought the slowness was due to the txg_wait_synced().
I just counted 5 for one snapshot:

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 0, 7aa610d3, 70170800, ...)
zfs`zil_commit_writer+0x34c(30010c55200, 151, 151, 1, 3fe, 7aa84600)
zfs`zil_commit+0x68(30010c55200, 151, 0, 30010c5527c, 151, 0)
zfs`zil_suspend+0xc0(30010c55200, 2a1010db240, 0, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 3, 151, c00431549f, 3fe, 7aa84600)
zfs`zil_destroy+0xc(30010c55200, 0, 0, 30010c5527c, 30014b32e00, 0)
zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36f8, 300000593b0, 1f8, 1f8, 180c000)
zfs`zil_destroy+0x1b0(30010c55200, 0, 701d5760, 30010c5527c, ...)
zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36f9, 300000593b0, 1f8, 1f8, 180c000)
zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, 7aa60700, ...)
zfs`dmu_objset_snapshot+0x100(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36fa, 300000593b0, 1f8, 1f8, 180c000)
zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, ...)
zfs`dsl_sync_task_do+0x28(30005c51dc0, 0, 7aa2d898, 300028f7680, ...)
zfs`spa_history_log+0x30(300028f7680, 3000dee1490, 0, 7aa2d800, 1, 18)
zfs`zfs_ioc_pool_log_history+0xd8(7aa64c00, 0, 17, 18, 3000dee1490, 7aa64c00)
zfs`zfsdev_ioctl+0x12c(701cf768, 701cf660, ffbfe850, 108, 701cf400, ...)
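The count is reproducible with DTrace; a sketch assuming fbt probes on the zfs module and an illustrative snapshot name:

  # count txg_wait_synced() entries, with kernel stacks, while one snapshot is taken
  dtrace -n 'fbt:zfs:txg_wait_synced:entry { @[stack()] = count(); }' \
      -c 'zfs snapshot solaris/d100@count-test'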
Matthew Ahrens
2007-Mar-24 18:36 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Neil Perrin wrote:
>> I'm not sure exactly what will be slow about taking snapshots, but one
>> aspect might be that we have to suspend the intent log (see the call
>> to zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to
>> change that for a while now -- just let the snapshot have the
>> (non-empty) zil header in it, but don't use it (eg. if we rollback or
>> clone, explicitly zero out the zil header). So you might want to look
>> into that.
> 
> I've always thought the slowness was due to the txg_wait_synced().
> I just counted 5 for one snapshot:

Yeah, well 3 of the 5 are for zil_suspend(), so I think you've proved my point :-)

I believe that the one from spa_history_log() will go away with MarkS's delegated admin work, leaving just the one "actually do it" txg_wait_synced().

Bottom line, it should be possible to make zfs snapshot take 5x less time, without an extraordinary effort.

--matt
Neil Perrin
2007-Mar-24 18:44 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Matthew Ahrens wrote On 03/24/07 12:36,:
> Neil Perrin wrote:
>> I've always thought the slowness was due to the txg_wait_synced().
>> I just counted 5 for one snapshot:
> 
> Yeah, well 3 of the 5 are for zil_suspend(), so I think you've proved
> my point :-)
> 
> I believe that the one from spa_history_log() will go away with MarkS's
> delegated admin work, leaving just the one "actually do it"
> txg_wait_synced().
> 
> Bottom line, it should be possible to make zfs snapshot take 5x less
> time, without an extraordinary effort.

I'm not sure. Doing one will take the same time as doing more than one (assuming the same txg), but at least one is needed to ensure all transactions prior to the snapshot are committed.

Neil.