On 2012-12-20 18:25, sol wrote:
> Hi
>
> I know some of this has been discussed in the past but I can't quite
> find the exact information I'm seeking
> (and I'd check the ZFS wikis but the websites are down at the moment).
>
> Firstly, which is correct, free space shown by "zfs list" or by
> "zpool iostat"? (...)
> (That's a big difference, and the percentage doesn't agree)
I believe zpool iostat (and zpool list) report raw storage accounting -
basically, the number of HDD sectors available and consumed, including
redundancy and metadata (so the available space also includes the
not-yet-consumed redundancy overhead), as well as the reserved space
(roughly 1/64 of the pool size kept for system use - partly as an
attempt to counter the performance degradation on full pools that you
mention below).
zfs list displays user-data accounting - what is available after
redundancy and system reservations, and is in general subject to
"(ref)reservation" and "(ref)quota" settings on datasets in the pool.
When cloning, dedup and compression come into play, this accounting
becomes tricky.
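For example, to see the two views side by side (here "tank" is just a
placeholder for your pool name):
# zpool list tank
# zfs list -o space tank
The zpool line shows the raw SIZE/ALLOC/FREE figures, while
"zfs list -o space" breaks the usable space of each dataset down into
AVAIL, USED, USEDSNAP, USEDDS, USEDREFRESERV and USEDCHILD.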
Overall, there is one number you can trust if you limit or bill by
consumption: the used space of a dataset says how much userdata
(including directory structures, and after compression) is referenced
in this filesystem - the end-user value of your service. This does not
mean that only this filesystem references those blocks, though. The
other numbers are vaguer (e.g. with good dedup+compression ratios the
sum of the used spaces can come to much more than the raw pool size).
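If compression or dedup are in use, the ratios can be checked directly
(again, "tank" and "tank/somefs" are placeholders for your pool and
dataset names):
# zfs get compressratio,used,referenced tank/somefs
# zpool get dedupratio tank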
> Secondly, there's 8 vdevs each of 11 disks.
> 6 vdevs show used 8.19 TB, free 1.81 TB, free = 18.1%
> 2 vdevs show used 6.39 TB, free 3.61 TB, free = 36.1%
How did you look that up? ;)
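(For the record, per-TLVDEV usage can be seen with something like the
following, "tank" again being a placeholder:
# zpool iostat -v tank
which prints the capacity alloc/free columns for the pool as a whole
and for each top-level vdev.)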
>
> I've heard that
> a) performance degrades when free space is below a certain amount
Basically, the "mechanics" of the degradation is that ZFS writes new
data into available space "bubbles" within a range called a "metaslab".
It tries to make writes sequential so they complete faster. If your
pool has seen lots of writes and deletions, its free space may have
become fragmented, so the search for the "bubbles" takes longer, and
the ones found may be too small to fit the whole incoming transaction -
leading to more HDD seeks and thus more latency on writes. In the
extreme case, ZFS can't even find holes big enough for a single block,
so it splits the block data into several pieces and writes
"gang blocks", using many tiny IOs with many mechanical HDD seeks.
The numbers - how full a pool must be before it displays these
problems - are highly individual. Some pools have seen it after filling
to 60%, 80-90% is typical, and on write-only pools you might never see
the problem because you don't delete stuff (well, except maybe metadata
rewritten during updates, all of which usually consumes 1-3% of the
total allocation).
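If you want to peek at how chopped-up the free space actually is, zdb
can dump the metaslab information (read-only, but it may take a while
on a large pool; "tank" is a placeholder):
# zdb -m tank | less
This lists each metaslab with its free space; a doubled flag (zdb -mm)
should also print the individual free segments, so you can see whether
the remaining space consists of many small "bubbles".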
> b) data is written to different vdevs depending on free space
There are several rules which influence the preference of a top-level
VDEV and of a metaslab region inside it; these probably include free
space, the known presence of large "bubbles" to write into, and the
location on the disk (faster vs. slower LBA tracks).
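One simple way to watch the effect is an interval iostat while the pool
is being written to, e.g.:
# zpool iostat -v tank 5
and see how the write operations (and the growth of "alloc") are spread
over the top-level vdevs - emptier ones typically receive a larger
share of new writes.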
>
> So a) how do I determine the exact value when performance degrades and
> how significant is it?
> b) has that threshold been reached (or exceeded?) in the first six vdevs?
> and if so are the two emptier vdevs being used exclusively to prevent
> performance degrading
> so it will only degrade when all vdevs reach the magic 18.1% free (or
> whatever it is)?
Hopefully, this was answered above :)
> Presumably there's no way to identify which files are on which vdevs
> in order to delete them and recover the performance?
It is possible, but not simple, and is not guaranteed to get the
result you want (though there is little harm in trying).
You can use "zdb" to extract information about an inode on a dataset
as a listing of block pointer entries which form a tree for this file.
For example:
# ls -lani /lib/libnsl.so.1
9239 -rwxr-xr-x 1 0 2 649720 Jun 8 2012 /lib/libnsl.so.1
# df -k /lib/libnsl.so.1
Filesystem kbytes used avail capacity Mounted on
rpool/ROOT/oi_151a4 61415424 452128 24120824 2% /
Here the first number from "ls -i" gives us the inode number of the
file, and "df" confirms the dataset name. So we can do the zdb walk:
# zdb -ddddd -bbbbbb rpool/ROOT/oi_151a4 9239
Dataset rpool/ROOT/oi_151a4 [ZPL], ID 5299, cr_txg 1349648, 442M,
8213 objects, rootbp DVA[0]=<0:a6921d600:200>
DVA[1]=<0:2ffc7b400:200>
[L0 DMU objset] fletcher4 lzjb LE contiguous unique double
size=800L/200P birth=4682209L/4682209P fill=8213
cksum=16f122cb05:77d20eea7b8:155c69ed5a6ce:2b90104e19641f
Object lvl iblk dblk dsize lsize %full type
9239 2 16K 128K 642K 640K 100.00 ZFS plain file
168 bonus System attributes
dnode flags: USED_BYTES USERUSED_ACCOUNTED
dnode maxblkid: 4
path /lib/libnsl.so.1
uid 0
gid 2
atime Fri Jun 8 00:22:17 2012
mtime Fri Jun 8 00:22:17 2012
ctime Fri Jun 8 00:22:17 2012
crtime Fri Jun 8 00:22:17 2012
gen 1349746
mode 100755
size 649720
parent 25
links 1
pflags 40800000104
Indirect blocks:
0 L1 DVA[0]=<0:940298000:400>
DVA[1]=<0:263234a00:400>
[L1 ZFS plain file] fletcher4 lzjb LE contiguous unique double
size=4000L/400P birth=1349746L/1349746P fill=5
cksum=682d4fda0b:3cc1aa306094:13ebb22837cf14:4c5c67e522dbca8
0 L0 DVA[0]=<0:95f337000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=23fce6aa160b:5ab11e5fcbc6c2e:5b38f230e01d508d:12cf92941e4b2487
20000 L0 DVA[0]=<0:95f357000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=3f0ac207affd:f8ed413113d6bdd:24e36c7682cfc297:2549c866ab61e464
40000 L0 DVA[0]=<0:95f377000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=3d40bf3329f0:f459bc876303dd7:2230ee348b7b08c5:3a65d1ebbf52c9dc
60000 L0 DVA[0]=<0:95f397000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=19e01b53eb67:956b52d1df6ecd4:38ff9bd1302bf879:e4661798dd1ae8a0
80000 L0 DVA[0]=<0:95f3b7000:20000> [L0 ZFS plain file]
fletcher4 uncompressed LE contiguous unique single size=20000L/20000P
birth=1349746L/1349746P fill=1
cksum=361e6fd03d40:d0903e491fa09e9:7a2e453ed28baa92:28562c53af3c0495
segment [0000000000000000, 00000000000a0000) size 640K
After several higher layers of pointers (just L1 in the example above),
you get the "L0" entries, whose DVA fields point to the actual data
blocks.
The example file above fits in five 128K blocks at level L0.
The first component of the DVA address is the top-level vdev ID,
followed by offset and allocation size (including raidzN redundancy).
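To take one DVA from the listing above: in DVA[0]=<0:95f337000:20000>
the leading "0" is the top-level vdev ID, 0x95f337000 is the offset
within that vdev, and 0x20000 (128K) is the allocated size. The vdev
IDs can be matched to devices by their order in "zpool status" output,
or via the "id" fields printed by:
# zdb -C tank
("tank" again standing in for your pool name).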
Depending on your pool's history, however, larger files may have been
striped over several TLVDEVs, and relocating them (copying them over
and deleting the original) may or may not help free up a particular
TLVDEV (upon rewrite they will be striped again, though ZFS may make
different decisions for the new write and prefer the emptier devices).
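If you do want to try rewriting a file so its blocks land elsewhere, a
naive sketch (the path is just an example, and this assumes an exact
copy plus rename is acceptable to the application) could be:
# cp -p /tank/data/bigfile /tank/data/bigfile.rewrite
# mv /tank/data/bigfile.rewrite /tank/data/bigfile
The old blocks are only freed once nothing else references them, as
noted below.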
Also, if the file's blocks are referenced via snapshots, clones,
dedup or hardlinks, they won't actually be released when you delete
a particular copy of the file.
HTH,
//Jim Klimov