Robert Milkowski
2007-Mar-22 15:46 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hi.

System is snv_56 x86 32bit.

bash-3.00# zpool status solaris
  pool: solaris
 state: ONLINE
 scrub: scrub stopped with 0 errors on Thu Mar 22 16:25:23 2007
config:

        NAME        STATE     READ WRITE CKSUM
        solaris     ONLINE       0     0     0
          c0t1d0    ONLINE       0     0     0

errors: No known data errors

bash-3.00# zfs list
NAME                              USED  AVAIL  REFER  MOUNTPOINT
solaris                          11.7G  5.02G  3.27G  /solaris
solaris/d100                     1.64G  5.02G  1.64G  /solaris/d100
solaris/d100@replicate_previous      0      -  1.64G  -
solaris/d100@replicate_latest        0      -  1.64G  -
solaris/d100-copy                12.0M  5.02G  12.0M  /solaris/d100-copy
solaris/d100-copy1               1.31G  5.02G  1.31G  /solaris/d100-copy1
solaris/d101                      348M  5.02G  15.3M  /solaris/d101
solaris/d101@replicate_previous   333M      -   348M  -
solaris/d101@replicate_latest        0      -  15.3M  -
solaris/d101-copy                15.3M  5.02G  15.3M  /solaris/d101-copy
solaris/testws                   5.13G  5.02G  5.13G  /export/testws/
bash-3.00#

File systems solaris/d100 and solaris/d100-copy1 contain the same data.

bash-3.00# ls -l /solaris/d100 | wc -l
     163
bash-3.00# ls -l /solaris/d100-copy1 | wc -l
     163
bash-3.00#
bash-3.00# gtar cvf /solaris/2.tar /solaris/d100-copy1
bash-3.00# gtar cvf /solaris/1.tar /solaris/d100
bash-3.00# ls -l /solaris/1.tar
-rw-r--r--   1 root     other    1755699200 Mar 22 16:15 /solaris/1.tar
bash-3.00# ls -l /solaris/2.tar
-rw-r--r--   1 root     other    1755699200 Mar 22 16:19 /solaris/2.tar
bash-3.00#
bash-3.00# zdb -v solaris/d100 >/tmp/1
bash-3.00# zdb -v solaris/d100-copy1 >/tmp/2
bash-3.00# diff -u /tmp/1 /tmp/2
--- /tmp/1      Thu Mar 22 16:41:52 2007
+++ /tmp/2      Thu Mar 22 16:41:57 2007
@@ -1,7 +1,7 @@
-Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects
+Dataset solaris/d100-copy1 [ZPL], ID 128, cr_txg 831226, 1.31G, 807 objects

    Object  lvl   iblk   dblk  lsize  asize  type
-        0    7    16K    16K   416K   242K  DMU dnode
+        0    7    16K    16K   416K   239K  DMU dnode
         1    1    16K    512    512     1K  ZFS master node
         2    1    16K    512    512     1K  ZFS delete queue
         3    1    16K  10.5K  10.5K     4K  ZFS directory
@@ -807,5 +807,5 @@
       806    1    16K  66.5K  66.5K  66.5K  ZFS plain file
       807    1    16K  67.5K  67.5K  67.5K  ZFS plain file
       808    1    16K  24.5K  24.5K  24.5K  ZFS plain file
-      809    3    16K   128K  1.58G  1.58G  ZFS plain file
+      809    3    16K   128K  1.58G  1.24G  ZFS plain file
bash-3.00#
bash-3.00# find /solaris/d100-copy1/ -inum 809 -ls
   809 1304748 -rw-r--r--   1 root     other    1692205056 Mar 22 16:05 /solaris/d100-copy1/m1
bash-3.00# find /solaris/d100/ -inum 809 -ls
   809 1652825 -rw-r--r--   1 root     other    1692205056 Mar 22 16:05 /solaris/d100/m1
bash-3.00# diff -b /solaris/d100/m1 /solaris/d100-copy1/m1
bash-3.00#

While lsize is the same for both files, asize is smaller for the second one. Why is that? How is it possible? Both file systems have compression turned off and the default recordsize. diff claims both files are the same.

Any idea?
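A few quick cross-checks that show the same discrepancy without zdb (a sketch using the thread's paths; on ZFS, du reports allocated blocks, roughly asize, while ls -l reports logical length):

  ls -l /solaris/d100/m1 /solaris/d100-copy1/m1      # logical sizes: identical
  du -k /solaris/d100/m1 /solaris/d100-copy1/m1      # allocated sizes: differ
  zfs get compression,recordsize solaris/d100 solaris/d100-copy1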
Matthew Ahrens
2007-Mar-22 19:07 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> While lsize is the same for both files, asize is smaller for the second
> one. Why is that? How is it possible? Both file systems have
> compression turned off and default recordsize. Diff claims both files
> to be the same.

Metadata (eg, "DMU dnode", and indirect blocks for "ZFS plain file", which you can see broken out by using more -b's) is always compressed.

Because the metadata is necessarily different (there are different block pointers; also the object numbers could be allocated differently, though not in your situation), it can compress by different amounts.

So, this is always possible, and in fact likely.

--matt
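A reading aid for the zdb block-pointer lines quoted later in the thread: each line carries the block's logical and physical sizes in hex (NNNNL/NNNNP), so metadata compression is directly visible. For example, annotating one line from the output further down:

  #  offset  lvl  DVA (vdev:offset:asize)  lsize/psize   fill     birth txg
     0       L2   0:ea1a0800:1400          4000L/1400P   F=12911  B=831388
  # a 16K (0x4000) logical indirect block stored compressed in 5K (0x1400) on disk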
Robert Milkowski
2007-Mar-22 22:49 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hello Matthew,

Thursday, March 22, 2007, 8:07:14 PM, you wrote:

MA> Robert Milkowski wrote:
>> While lsize is the same for both files, asize is smaller for the second
>> one. Why is that? How is it possible? Both file systems have
>> compression turned off and default recordsize. Diff claims both files
>> to be the same.

MA> Metadata (eg, "DMU dnode", and indirect blocks for "ZFS plain file",
MA> which you can see broken out by using more -b's) is always compressed.
MA> Because the metadata is necessarily different (there are different block
MA> pointers; also the object numbers could be allocated differently, though
MA> not in your situation), it can compress by different amounts.

MA> So, this is always possible, and in fact likely.

Well, I don't know. The DMU dnode in both cases is so small that it doesn't really matter. Both are the same files (diff confirms that), about 1.6GB in size, and the actual on-disk size differs by more than 20%. That's really a big difference for just one large file.

zdb -b (or -bbb) doesn't work here (b56):

bash-3.00# zdb -b solaris/d100 809
Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects
bash-3.00# zdb -bbb solaris/d100 809
Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects
bash-3.00# zdb -bbbvvv solaris/d100 809
Dataset solaris/d100 [ZPL], ID 189, cr_txg 779704, 1.64G, 807 objects

bash-3.00# zdb -vvvv solaris/d100 809 >/tmp/a
bash-3.00# zdb -vvvv solaris/d100-copy1 809 >/tmp/b
bash-3.00# cat /tmp/a | wc -l
   13070
bash-3.00# cat /tmp/b | wc -l
   10295

bash-3.00# tail -10 /tmp/a
        64d00000  L0 0:213420000:20000 20000L/20000P F=1 B=831385
        64d20000  L0 0:213440000:20000 20000L/20000P F=1 B=831385
        64d40000  L0 0:213460000:20000 20000L/20000P F=1 B=831385
        64d60000  L0 0:213480000:20000 20000L/20000P F=1 B=831385
        64d80000  L0 0:2134a0000:20000 20000L/20000P F=1 B=831385
        64da0000  L0 0:2134c0000:20000 20000L/20000P F=1 B=831385
        64dc0000  L0 0:ea1c0000:20000 20000L/20000P F=1 B=831388

                segment [0000000000000000, 0000000065000000) size 1.58G

bash-3.00# tail -10 /tmp/b
        64d00000  L0 0:116a60000:20000 20000L/20000P F=1 B=831417
        64d20000  L0 0:116a80000:20000 20000L/20000P F=1 B=831417
        64d40000  L0 0:116aa0000:20000 20000L/20000P F=1 B=831417
        64d60000  L0 0:116ac0000:20000 20000L/20000P F=1 B=831417
        64d80000  L0 0:116ae0000:20000 20000L/20000P F=1 B=831417
        64da0000  L0 0:116b00000:20000 20000L/20000P F=1 B=831417
        64dc0000  L0 0:116b20000:20000 20000L/20000P F=1 B=831417

                segment [0000000014c40000, 0000000026000000) size  276M
bash-3.00#

What's the last line about?
Also, only /tmp/a has Deadlist entries:

    Deadlist: 33 entries, 235K (114K/114K comp)

         Item   0: 0:191e0ea00:e00 4000L/e00P F=0 B=831102
         Item   1: 0:ea1a2000:800 4000L/800P F=0 B=831388
         Item   2: 0:191d58000:1000 4000L/1000P F=0 B=831102
         Item   3: 0:2507b2200:1200 4000L/1200P F=0 B=831294
         Item   4: 0:191e06200:1200 4000L/1200P F=0 B=831102
         Item   5: 0:191e07400:1200 4000L/1200P F=0 B=831102
         Item   6: 0:250186000:1000 4000L/1000P F=0 B=831294
         Item   7: 0:191e0b800:e00 4000L/e00P F=0 B=831102
         Item   8: 0:191e0d800:1200 4000L/1200P F=0 B=831102
         Item   9: 0:191e03e00:1200 4000L/1200P F=0 B=831102
         Item  10: 0:250188000:1000 4000L/1000P F=0 B=831294
         Item  11: 0:191e09800:1200 4000L/1200P F=0 B=831102
         Item  12: 0:191e10a00:1200 4000L/1200P F=0 B=831102
         Item  13: 0:191e02c00:1200 4000L/1200P F=0 B=831102
         Item  14: 0:191e05000:1200 4000L/1200P F=0 B=831102
         Item  15: 0:191e08600:1200 4000L/1200P F=0 B=831102
         Item  16: 0:2507b3400:e00 4000L/e00P F=0 B=831294
         Item  17: 0:191d57000:1000 4000L/1000P F=0 B=831102
         Item  18: 0:191d56000:1000 4000L/1000P F=0 B=831102
         Item  19: 0:250189000:1000 4000L/1000P F=0 B=831294
         Item  20: 0:191d59000:1000 4000L/1000P F=0 B=831102
         Item  21: 0:191e0f800:1200 4000L/1200P F=0 B=831102
         Item  22: 0:191e12e00:1200 4000L/1200P F=0 B=831102
         Item  23: 0:191e11c00:1200 4000L/1200P F=0 B=831102
         Item  24: 0:191e0aa00:e00 4000L/e00P F=0 B=831102
         Item  25: 0:25339a400:e00 4000L/e00P F=0 B=831342
         Item  26: 0:ea1a2800:800 4000L/800P F=0 B=831388
         Item  27: 0:ea1a1c00:400 4000L/400P F=0 B=831388
         Item  28: 0:ea1a3000:400 4000L/400P F=0 B=831388
         Item  29: 0:ea1a3400:400 4000L/400P F=0 B=831388
         Item  30: 0:ea1a3800:400 4000L/400P F=0 B=831388
         Item  31: 0:ea1a3c00:400 4000L/400P F=0 B=831388
         Item  32: 0:ea1a4000:200 400L/200P F=0 B=831388

What are those?

And even if such a big difference in actual space utilization is to be expected, something is far from perfect here. Both file systems are in the same pool, and over 20% difference in size for just one large file is huge; perhaps some algorithms are suboptimal.

-- 
Best regards,
 Robert                      mailto:rmilkowski@task.gda.pl
                             http://milek.blogspot.com
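A note on reading that last line: zdb's segment lines list contiguous ranges of allocated offsets within the file, start and end in hex, so ranges not covered by any segment are holes. The printed size checks out with shell arithmetic:

  printf '%d\n' $(( (0x26000000 - 0x14c40000) / 1024 / 1024 ))   # prints 275; zdb rounds to 276M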
Matthew Ahrens
2007-Mar-22 23:01 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> What's the last line about?

Ah -- I think that may help explain things. It may be that your file has some runs of zeros in it, which are represented as holes in d100-copy1/m1, but as blocks of zeros in d100/m1. It begs the question: what is this file, and how did you create the copy?

> Also only /tmp/a has Deadlist entries:

That's because you have snapshots of d100 but not of d100-copy1, and apparently the contents of the d100 fs have changed since the most recent snapshot.

--matt
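The holes-versus-zeros distinction is easy to demonstrate; a minimal sketch (paths illustrative; with compression off, as here, written zeros occupy real blocks, and only never-written ranges are holes):

  # written zeros: every block is allocated
  dd if=/dev/zero of=/solaris/d100/zeros bs=128k count=8

  # never-written range: a hole (mkfile -n sets the size without allocating blocks)
  mkfile -n 1m /solaris/d100/sparse

  ls -l /solaris/d100/zeros /solaris/d100/sparse   # same logical size, 1MB each
  du -k /solaris/d100/zeros /solaris/d100/sparse   # only 'zeros' has blocks allocated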
Robert Milkowski
2007-Mar-22 23:46 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hello Matthew,

Friday, March 23, 2007, 12:01:12 AM, you wrote:

MA> Robert Milkowski wrote:
>> What's the last line about?

MA> Ah -- I think that may help explain things. It may be that your file
MA> has some runs of zeros in it, which are represented as holes in
MA> d100-copy1/m1, but as blocks of zeros in d100/m1. It begs the
MA> question, what is this file and how did you create the copy?

This file is full of 0s - it was created by

  dd if=/dev/zero of=/solaris/d100/m1 bs=32k &

Then file system solaris/d100 was replicated, in a way similar to zfs send | zfs recv, into solaris/d100-copy1.

Now I wonder how the holes were created, and why not for the entire file...

>> Also only /tmp/a has Deadlist entries:

MA> That's because you have snapshots of d100 but not of d100-copy1, and
MA> apparently the contents of the d100 fs have changed since the most
MA> recent snapshot.

thanks for info

-- 
Best regards,
 Robert                      mailto:rmilkowski@task.gda.pl
                             http://milek.blogspot.com
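For reference, the stock snapshot-based replication that the custom mechanism approximates looks like this (snapshot name illustrative); a full send transmits every allocated block of the snapshot, so it would have reproduced the zeros rather than holes:

  zfs snapshot solaris/d100@replicate
  zfs send solaris/d100@replicate | zfs recv solaris/d100-copy1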
Matthew Ahrens
2007-Mar-23 01:49 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> MA> Ah -- I think that may help explain things. It may be that your file
> MA> has some runs of zeros in it, which are represented as holes in
> MA> d100-copy1/m1, but as blocks of zeros in d100/m1. It begs the
> MA> question, what is this file and how did you create the copy?
> 
> This file is full of 0s - it was created by
> 
>   dd if=/dev/zero of=/solaris/d100/m1 bs=32k &
> 
> Then file system solaris/d100 was replicated, in a way similar to
> zfs send | zfs recv, into solaris/d100-copy1.
> 
> Now I wonder how the holes were created, and why not for the entire file...

Hmm, that's definitely curious. What do you mean by "a similar way to zfs send | zfs recv"? Can you send me the full output of your 'zdb -vvvv solaris/d100{-copy1} 809'?

--matt
Robert Milkowski
2007-Mar-23 07:47 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Hello Matthew,

Friday, March 23, 2007, 2:49:03 AM, you wrote:

MA> Hmm, that's definitely curious. What do you mean by "a similar way to
MA> zfs send | zfs recv"? Can you send me the full output of your
MA> 'zdb -vvvv solaris/d100{-copy1} 809'?

See http://milek.blogspot.com/2007/03/zfs-online-replication.html

Basically we've implemented a mechanism to replicate a zfs file system by implementing a new ioctl based on zfs send|recv. The difference is that we sleep() for a specified time (default 5s) and then ask for a new transaction, and if there is one we send it out.

More details really soon, I hope.

ps. zdb output sent privately

-- 
Best regards,
 Robert                      mailto:rmilkowski@task.gda.pl
                             http://milek.blogspot.com
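In stock-command terms the cycle corresponds roughly to the loop below; this is only a userland sketch, while the actual mechanism is an in-kernel ioctl that walks the live filesystem without taking these snapshots:

  # rough userland analogue of the 5-second replication cycle (sketch only)
  zfs snapshot solaris/d100@replicate_previous
  zfs send solaris/d100@replicate_previous | zfs recv solaris/d100-copy1

  while true; do
      sleep 5
      zfs snapshot solaris/d100@replicate_latest
      zfs send -i solaris/d100@replicate_previous solaris/d100@replicate_latest \
          | zfs recv solaris/d100-copy1
      # retire the old baseline on both sides and promote the new one
      zfs destroy solaris/d100@replicate_previous
      zfs destroy solaris/d100-copy1@replicate_previous
      zfs rename solaris/d100@replicate_latest solaris/d100@replicate_previous
      zfs rename solaris/d100-copy1@replicate_latest solaris/d100-copy1@replicate_previous
  done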
Matthew Ahrens
2007-Mar-23 16:01 UTC
[zfs-discuss] asize is 300MB smaller than lsize - why?
Robert Milkowski wrote:
> Basically we've implemented a mechanism to replicate a zfs file system
> by implementing a new ioctl based on zfs send|recv. The difference is
> that we sleep() for a specified time (default 5s) and then ask for a
> new transaction, and if there is one we send it out.
> 
> More details really soon, I hope.
> 
> ps. zdb output sent privately

The smaller file has its first 320MB as a hole, while the larger file is entirely filled in. You can see this from the zdb output (the first number on each line is the offset):

    Indirect blocks:
               0 L2   0:115be2400:1200 4000L/1200P F=10192 B=831417
        14000000 L1   0:c0028c00:400 4000L/400P F=30 B=831370
        14c40000 L0   0:b8180000:20000 20000L/20000P F=1 B=831367
        14c60000 L0   0:b81a0000:20000 20000L/20000P F=1 B=831367
        ...

vs.

    Indirect blocks:
               0 L2   0:ea1a0800:1400 4000L/1400P F=12911 B=831388
               0 L1   0:2553bb400:400 4000L/400P F=128 B=831346
               0 L0   0:255400000:20000 20000L/20000P F=1 B=831346
           20000 L0   0:255420000:20000 20000L/20000P F=1 B=831346
           40000 L0   0:255440000:20000 20000L/20000P F=1 B=831346
        ...

How it got that way, I couldn't really say without looking at your code. If you are able to reproduce this using OpenSolaris bits, let me know.

--matt
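The 320MB figure falls straight out of the first populated offset in the smaller file's listing: offsets are hex, and the first L1/L0 blocks appear at 0x14000000, so nothing below that offset is allocated:

  printf '%d\n' $(( 0x14000000 / 1024 / 1024 ))    # prints 320 (MiB)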
> How it got that way, I couldn't really say without looking at your code.

It works like this:

In the new ioctl operation zfs_ioc_replicate_send(zfs_cmd_t *zc) we open the filesystem (not a snapshot):

  dmu_objset_open(zc->zc_name, DMU_OST_ANY,
      DS_MODE_STANDARD | DS_MODE_READONLY, &filesystem);

and call the dmu replicate send function (txg is the transaction group number):

  dmu_replicate_send(filesystem, &txg, ...);

There we set max_txg:

  ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

and call traverse_dsl_dataset:

  traverse_dsl_dataset(filesystem->os->os_dsl_dataset, *txg,
      ADVANCE_PRE | ADVANCE_HOLES | ADVANCE_DATA | ADVANCE_NOLOCK,
      replicate_cb, &ba);

After traversing, the next txg is returned:

  if (ba.got_data != 0)
      *txg = ba.max_txg + 1;

In replicate_cb we do the same as backup_cb does, but at the beginning we check the txg:

  /* remember last txg */
  if (bc->bc_blkptr.blk_birth) {
          if (bc->bc_blkptr.blk_birth > ba->max_txg)
                  return;
          ba->got_data = 1;
  }

After a 5 second delay we call the ioctl again with the txg returned from the last operation.
Matthew Ahrens
2007-Mar-24 00:09 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Łukasz wrote:
>> How it got that way, I couldn't really say without looking at your code.
> 
> It works like this:
...
> we set max_txg
> ba.max_txg = (spa_get_dsl(filesystem->os->os_spa))->dp_tx.tx_synced_txg;

So, how do you send the initial stream? Presumably you need to do it with ba.max_txg = 0? If, say, the first 320MB were written before your first ba.max_txg, then you wouldn't be sending that data, thus explaining the behavior you're seeing.

It seems to me that your algorithm is fundamentally flawed -- if the filesystem is changing, it will not result in a consistent (from the ZPL's point of view) filesystem. For example:

There are two directories, A and B. You last sent txg 10. In txg 13, a file is renamed from directory A to directory B. It is now txg 15, and you begin traversing to do a send, from txg 10 -> 15. While that's in progress, a new file is created in directory A, and synced out in txg 16.

When you visit directory A, you see that its birth time is 16 > 15, so you don't send it. When you visit directory B, you see that its birth time is 13 <= 15, so you send it. Now the other side has two links to the file, when it should have one.

Given that you don't actually have the data from txg 15 (because you didn't take a snapshot), I don't see how you could make this work.

(Also FYI, traversing changing filesystems in this way will almost certainly break once we rewrite as part of the pool space reduction work.)

--matt
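The race in that example, laid out as a timeline:

  txg 10   last send completed; the next traversal will cover (10, 15]
  txg 13   file renamed from directory A to directory B (both reborn at 13)
  txg 15   traversal of the live filesystem begins for the (10, 15] send
  txg 16   new file created in directory A; A's block reborn at txg 16

  visit A: birth 16 > 15  -> not sent, so the removed link is never propagated
  visit B: birth 13 <= 15 -> sent, adding the new link on the receiver
  result:  the receiver has the file linked in both A and B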
Matthew Ahrens
2007-Mar-24 18:13 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Kangurek wrote:
> Thanks for info.
> My idea was to traverse the changing filesystem; now I see that it will
> not work.
> I will try to traverse snapshots. Zreplicate will:
> 1. take snapshot @replicate_latest, and
> 2. send data up to snapshot @replicate_latest
> 3. wait X sec (X = 20)
> 4. remove @replicate_previous, rename @replicate_latest to
>    @replicate_previous
> 5. repeat from 1.
> 
> I'm sure it will work, but taking snapshots will be slow on a loaded
> filesystem.
> Do you have any idea how to speed up operations on snapshots?
> 1. remove @replicate_previous
> 2. rename @replicate_latest to @replicate_previous
> 3. create @replicate_latest

You can avoid the rename by doing:

  zfs create @A
  again:
  zfs destroy @B
  zfs create @B
  zfs send @A @B
  zfs destroy @A
  zfs create @A
  zfs send @B @A
  goto again

I'm not sure exactly what will be slow about taking snapshots, but one aspect might be that we have to suspend the intent log (see the call to zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to change that for a while now -- just let the snapshot have the (non-empty) zil header in it, but don't use it (eg. if we rollback or clone, explicitly zero out the zil header). So you might want to look into that.

--matt
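In concrete commands, the rotation above maps onto something like the loop below; a sketch only, with dataset names from the thread and an initial full send added to seed the target:

  # seed the target once with a full stream
  zfs snapshot solaris/d100@A
  zfs send solaris/d100@A | zfs recv solaris/d100-copy1

  while true; do
      # send the delta A -> B, then retire A on both sides
      zfs snapshot solaris/d100@B
      zfs send -i solaris/d100@A solaris/d100@B | zfs recv solaris/d100-copy1
      zfs destroy solaris/d100@A
      zfs destroy solaris/d100-copy1@A

      # send the delta B -> A, then retire B; no rename needed
      zfs snapshot solaris/d100@A
      zfs send -i solaris/d100@B solaris/d100@A | zfs recv solaris/d100-copy1
      zfs destroy solaris/d100@B
      zfs destroy solaris/d100-copy1@B
  done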
Neil Perrin
2007-Mar-24 18:30 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Matthew Ahrens wrote On 03/24/07 12:13,:
> Kangurek wrote:
> 
>> Thanks for info.
>> My idea was to traverse the changing filesystem; now I see that it
>> will not work.
>> I will try to traverse snapshots. Zreplicate will:
>> 1. take snapshot @replicate_latest, and
>> 2. send data up to snapshot @replicate_latest
>> 3. wait X sec (X = 20)
>> 4. remove @replicate_previous, rename @replicate_latest to
>>    @replicate_previous
>> 5. repeat from 1.
>>
>> I'm sure it will work, but taking snapshots will be slow on a loaded
>> filesystem.
>> Do you have any idea how to speed up operations on snapshots?
>> 1. remove @replicate_previous
>> 2. rename @replicate_latest to @replicate_previous
>> 3. create @replicate_latest
> 
> You can avoid the rename by doing:
> 
>   zfs create @A
>   again:
>   zfs destroy @B
>   zfs create @B
>   zfs send @A @B
>   zfs destroy @A
>   zfs create @A
>   zfs send @B @A
>   goto again
> 
> I'm not sure exactly what will be slow about taking snapshots, but one
> aspect might be that we have to suspend the intent log (see the call to
> zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to
> change that for a while now -- just let the snapshot have the
> (non-empty) zil header in it, but don't use it (eg. if we rollback or
> clone, explicitly zero out the zil header). So you might want to look
> into that.

I've always thought the slowness was due to the txg_wait_synced().
I just counted 5 for one snapshot:

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 0, 7aa610d3, 70170800, ...)
zfs`zil_commit_writer+0x34c(30010c55200, 151, 151, 1, 3fe, 7aa84600)
zfs`zil_commit+0x68(30010c55200, 151, 0, 30010c5527c, 151, 0)
zfs`zil_suspend+0xc0(30010c55200, 2a1010db240, 0, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 3, 151, c00431549f, 3fe, 7aa84600)
zfs`zil_destroy+0xc(30010c55200, 0, 0, 30010c5527c, 30014b32e00, 0)
zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36f8, 300000593b0, 1f8, 1f8, 180c000)
zfs`zil_destroy+0x1b0(30010c55200, 0, 701d5760, 30010c5527c, ...)
zfs`zil_suspend+0x108(30010c55200, 2a1010db240, 30010c5527c, 0, 30014b32e00, 0)
zfs`dmu_objset_snapshot_one+0x74(0, 2a1010db420, 7aa60700, 0, 0, 0)
zfs`dmu_objset_snapshot+0xe8(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36f9, 300000593b0, 1f8, 1f8, 180c000)
zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, 7aa60700, ...)
zfs`dmu_objset_snapshot+0x100(300265bd000, 300265bd400, 0, 0, ...)
zfs`zfsdev_ioctl+0x12c(701cf9f0, 701cf660, ffbfe850, 390, 701cf400, ...)

[0]> $c
zfs`txg_wait_synced+0xc(30005c51dc0, 36fa, 300000593b0, 1f8, 1f8, 180c000)
zfs`dsl_sync_task_group_wait+0x11c(300109a7ac8, 30005c51dc0, ...)
zfs`dsl_sync_task_do+0x28(30005c51dc0, 0, 7aa2d898, 300028f7680, ...)
zfs`spa_history_log+0x30(300028f7680, 3000dee1490, 0, 7aa2d800, 1, 18)
zfs`zfs_ioc_pool_log_history+0xd8(7aa64c00, 0, 17, 18, 3000dee1490, 7aa64c00)
zfs`zfsdev_ioctl+0x12c(701cf768, 701cf660, ffbfe850, 108, 701cf400, ...)
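The count is reproducible with DTrace; a sketch assuming fbt probes on the zfs module and an illustrative snapshot name:

  # count txg_wait_synced() entries, with kernel stacks, while one snapshot is taken
  dtrace -n 'fbt:zfs:txg_wait_synced:entry { @[stack()] = count(); }' \
      -c 'zfs snapshot solaris/d100@count-test'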
Matthew Ahrens
2007-Mar-24 18:36 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Neil Perrin wrote:
>> I'm not sure exactly what will be slow about taking snapshots, but one
>> aspect might be that we have to suspend the intent log (see the call
>> to zil_suspend() in dmu_objset_snapshot_one()). I've been meaning to
>> change that for a while now -- just let the snapshot have the
>> (non-empty) zil header in it, but don't use it (eg. if we rollback or
>> clone, explicitly zero out the zil header). So you might want to look
>> into that.
> 
> I've always thought the slowness was due to the txg_wait_synced().
> I just counted 5 for one snapshot:

Yeah, well 3 of the 5 are for zil_suspend(), so I think you've proved my point :-)

I believe that the one from spa_history_log() will go away with MarkS's delegated admin work, leaving just the one "actually do it" txg_wait_synced().

Bottom line, it should be possible to make zfs snapshot take 5x less time, without an extraordinary effort.

--matt
Neil Perrin
2007-Mar-24 18:44 UTC
[zfs-discuss] Re: asize is 300MB smaller than lsize - why?
Matthew Ahrens wrote On 03/24/07 12:36,:
> Neil Perrin wrote:
>> I've always thought the slowness was due to the txg_wait_synced().
>> I just counted 5 for one snapshot:
> 
> Yeah, well 3 of the 5 are for zil_suspend(), so I think you've proved
> my point :-)
> 
> I believe that the one from spa_history_log() will go away with MarkS's
> delegated admin work, leaving just the one "actually do it"
> txg_wait_synced().
> 
> Bottom line, it should be possible to make zfs snapshot take 5x less
> time, without an extraordinary effort.

I'm not sure. Doing one will take the same time as doing more than one (assuming the same txg), but at least one is needed to ensure all transactions prior to the snapshot are committed.

Neil.