As I'm sure you're all aware, file size in ZFS can differ greatly from actual disk usage, depending on access patterns. e.g. truncating a 1M file down to 1 byte still uses up about 130k on disk when recordsize=128k. I'm aware that this is a result of ZFS's rather different internals, and that it works well for normal usage, but this can make things difficult for applications that wish to restrain their own disk usage.

The particular application I'm working on that has such a problem is the OpenAFS <http://www.openafs.org/> client, when it uses ZFS as the disk cache partition. The disk cache is constrained to a user-configurable size, and the amount of cache used is tracked by counters internal to the OpenAFS client. Normally cache usage is tracked by just taking the file length of a particular file in the cache, and rounding it up to the next frsize boundary of the cache filesystem. This is obviously wrong when ZFS is used, and so our cache usage tracking can get very incorrect. So, I have two questions which would help us fix this:

1. Is there any interface to ZFS (or a configuration knob or something) that we can use from a kernel module to explicitly return a file to the more predictable size? In the above example, truncating a 1M file (call it 'A') to 1 byte makes it take up 130k, but if we create a new file (call it 'B') with that 1 byte in it, it only takes up about 1k. Is there any operation we can perform on file 'A' to make it take up less space without having to create a new file 'B'?

The cache files are often truncated and overwritten with new data, which is why this can become a problem. If there were some way to explicitly signal to ZFS that we want a particular file to be put in a smaller block or something, that would be helpful. (I am mostly ignorant of ZFS internals; if there's somewhere that would have told me this information, let me know.)

2. Lacking 1., can anyone give an equation relating file length, maximum size on disk, and recordsize (and any additional parameters needed)? If we just have a way of knowing in advance how much disk space we're going to take up by writing a certain amount of data, we should be okay.

Or, if anyone has any other ideas on how to overcome this, they would be welcome.

--
Andrew Deason
adeason at sinenomine.net
Andrew Deason wrote:

> As I'm sure you're all aware, file size in ZFS can differ greatly from actual disk usage, depending on access patterns. e.g. truncating a 1M file down to 1 byte still uses up about 130k on disk when recordsize=128k. I'm aware that this is a result of ZFS's rather different internals, and that it works well for normal usage, but this can make things difficult for applications that wish to restrain their own disk usage.
>
> The particular application I'm working on that has such a problem is the OpenAFS <http://www.openafs.org/> client, when it uses ZFS as the disk cache partition. The disk cache is constrained to a user-configurable size, and the amount of cache used is tracked by counters internal to the OpenAFS client. Normally cache usage is tracked by just taking the file length of a particular file in the cache, and rounding it up to the next frsize boundary of the cache filesystem. This is obviously wrong when ZFS is used, and so our cache usage tracking can get very incorrect. So, I have two questions which would help us fix this:
>
> 1. Is there any interface to ZFS (or a configuration knob or something) that we can use from a kernel module to explicitly return a file to the more predictable size? In the above example, truncating a 1M file (call it 'A') to 1 byte makes it take up 130k, but if we create a new file (call it 'B') with that 1 byte in it, it only takes up about 1k. Is there any operation we can perform on file 'A' to make it take up less space without having to create a new file 'B'?
>
> The cache files are often truncated and overwritten with new data, which is why this can become a problem. If there were some way to explicitly signal to ZFS that we want a particular file to be put in a smaller block or something, that would be helpful. (I am mostly ignorant of ZFS internals; if there's somewhere that would have told me this information, let me know.)
>
> 2. Lacking 1., can anyone give an equation relating file length, maximum size on disk, and recordsize (and any additional parameters needed)? If we just have a way of knowing in advance how much disk space we're going to take up by writing a certain amount of data, we should be okay.
>
> Or, if anyone has any other ideas on how to overcome this, they would be welcome.

When creating a new file, ZFS will set its block size to be no larger than the current value of recordsize. If there is at least recordsize of data to be written, then the blocksize will equal the recordsize. From then on the file's blocksize is "frozen" - that's why, when you truncate it, it keeps its original blocksize. It also means that if the file was smaller than recordsize (so its blocksize was smaller too), then when you truncate it to 1 byte it will keep its smaller blocksize.

IMHO you won't be able to lower a file's blocksize other than by creating a new file.
For example:

> milek at r600:~/progs$ mkfile 10m file1
> milek at r600:~/progs$ ./stat file1
> size: 10485760 blksize: 131072
> milek at r600:~/progs$ truncate -s 1 file1
> milek at r600:~/progs$ ./stat file1
> size: 1 blksize: 131072
> milek at r600:~/progs$
> milek at r600:~/progs$ rm file1
> milek at r600:~/progs$
> milek at r600:~/progs$ mkfile 10000 file1
> milek at r600:~/progs$ ./stat file1
> size: 10000 blksize: 10240
> milek at r600:~/progs$ truncate -s 1 file1
> milek at r600:~/progs$ ./stat file1
> size: 1 blksize: 10240
> milek at r600:~/progs$

If you are not worried by this extra overhead, and you are mostly concerned with proper accounting of used disk space, then instead of relying on the file size alone you should take its blocksize into account and round the file size up to a blocksize boundary (the actual file size on disk, not counting metadata, is N*blocksize). However, IIRC there is an open bug/RFE asking for special treatment of a file's tail block so that it can be smaller than the file's blocksize. Once it is integrated, your math could be wrong again.

Please also note that relying on the logical file size could be even more misleading if compression is enabled in ZFS (or dedup in the future). Relying on the blocksize will give you more accurate estimates.

You can get a file's blocksize by using stat() and reading buf.st_blksize, or you can get a good estimate of used disk space from buf.st_blocks*512:

> milek at r600:~/progs$ cat stat.c
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <sys/stat.h>
>
> int main(int argc, char **argv)
> {
>     struct stat buf;
>
>     if (!stat(argv[1], &buf))
>     {
>         /* st_size and st_blksize are not plain ints; cast for printf */
>         printf("size: %lld\tblksize: %lld\n",
>             (long long)buf.st_size, (long long)buf.st_blksize);
>     }
>     else
>     {
>         printf("ERROR: stat(), errno: %d\n", errno);
>         exit(1);
>     }
>
>     return 0;
> }
>
> milek at r600:~/progs$

--
Robert Milkowski
http://milek.blogspot.com
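[To make the rounding concrete, a minimal sketch (untested, and assuming a POSIX stat() where st_blocks is reported in 512-byte units) of the two estimates just described: the logical size rounded up to the file's blocksize, and the allocated size from st_blocks.]

/* Minimal sketch (untested): two estimates of a file's on-disk usage,
 * ignoring metadata.  Assumes POSIX stat() and 512-byte st_blocks units. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat buf;

    if (argc < 2 || stat(argv[1], &buf) != 0) {
        perror("stat");
        return 1;
    }

    /* Estimate 1: logical size rounded up to the next st_blksize boundary. */
    long long bsz = (long long)buf.st_blksize;
    long long rounded = ((long long)buf.st_size + bsz - 1) / bsz * bsz;

    /* Estimate 2: blocks actually allocated, reported in 512-byte units. */
    long long allocated = (long long)buf.st_blocks * 512LL;

    printf("logical: %lld  rounded to blksize: %lld  allocated: %lld\n",
        (long long)buf.st_size, rounded, allocated);
    return 0;
}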
On Thu, 17 Sep 2009 22:55:38 +0100 Robert Milkowski <milek at task.gda.pl> wrote:

> IMHO you won't be able to lower a file's blocksize other than by creating a new file. For example:

Okay, thank you.

> If you are not worried by this extra overhead, and you are mostly concerned with proper accounting of used disk space, then instead of relying on the file size alone you should take its blocksize into account and round the file size up to a blocksize boundary (the actual file size on disk, not counting metadata, is N*blocksize).

Metadata can be nontrivial for small blocksizes, though, can't it? I tried similar tests with varying recordsizes, and with recordsize=1k a file with 1M bytes written to it took up significantly more than 1024 1k blocks. Is there a reliable way to account for this?

Through experimenting with various recordsizes and file sizes I can see enough of a pattern to try to come up with an equation for the total disk usage, but that doesn't mean such a relation would be correct... if someone could give me something a bit more authoritative, it would be nice.

> However, IIRC there is an open bug/RFE asking for special treatment of a file's tail block so that it can be smaller than the file's blocksize. Once it is integrated, your math could be wrong again.
>
> Please also note that relying on the logical file size could be even more misleading if compression is enabled in ZFS (or dedup in the future). Relying on the blocksize will give you more accurate estimates.

I was a bit unclear. We're not so concerned about the math being wrong in general; we just need to make sure we are not significantly underestimating the usage. If we overestimate within reason, that's fine, but getting the tightest bound is obviously more desirable. So I'm not worried about compression, dedup, or the tail block being treated in such a way.

> You can get a file's blocksize by using stat() and reading buf.st_blksize, or you can get a good estimate of used disk space from buf.st_blocks*512

Hmm, I thought I had tried this, but st_blocks didn't seem to be updated accurately until some time after a write. I'd also like to avoid having to stat the file each time after a write or truncate in order to get the file size. The current code is structured with the intent that the space calculations are made /before/ the write is done. It may be possible to change that, but I'd rather not, if possible (and I'd have to make sure there's not a significant speed hit in doing so).

--
Andrew Deason
adeason at sinenomine.net
If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?

Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.

The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.

--
Robert Milkowski
http://milek.blogspot.com
On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski <milek at task.gda.pl> wrote:

> If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?

No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.

We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.

I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.

Can't I get an /estimate/ of the data+metadata disk usage? What about the hypothetical case of the metadata compression ratio being effectively the same as without compression - what would it be then?

--
Andrew Deason
adeason at sinenomine.net
On Sep 18, 2009, at 7:36 AM, Andrew Deason wrote:

> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>> If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?
>
> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

The transactional nature of ZFS may work against you here. Until the data is committed to disk, it is unclear how much space it will consume. Compression clouds the crystal ball further.

> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

Use delegation. Users can create their own datasets, set parameters, etc. For this case, you could consider changing recordsize, if you really are so worried about 1k. IMHO, it is easier and less expensive in process and pain to just buy more disk when needed.
 -- richard
On Fri, 18 Sep 2009 12:48:34 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> The transactional nature of ZFS may work against you here. Until the data is committed to disk, it is unclear how much space it will consume. Compression clouds the crystal ball further.

...but not impossible. I'm just looking for a reasonable upper bound. For example, if I always rounded up to the next 128k mark, and added an additional 128k, that would always give me an upper bound (for files <1M), as far as I can tell. But that is not a very tight bound; can you suggest anything better?

> > I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.
>
> Use delegation. Users can create their own datasets, set parameters, etc. For this case, you could consider changing recordsize, if you really are so worried about 1k. IMHO, it is easier and less expensive in process and pain to just buy more disk when needed.

Users of OpenAFS, not "unprivileged users". All of the users I am talking about are the administrators of their machines. I would just like to reduce the number of filesystem-specific steps that need to be taken to set up the cache. You don't need to do anything special for a tmpfs cache, for instance, or for ext2/3 caches on Linux.

--
Andrew Deason
adeason at sinenomine.net
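[For concreteness, the upper-bound heuristic described above ("round up to the next 128k mark and add an additional 128k") might be sketched roughly as follows; the hard-coded 128k recordsize and the function name are assumptions for illustration, not anything taken from OpenAFS.]

#include <stdio.h>
#include <stdint.h>

/* Assumed dataset recordsize; 128k is only the ZFS default. */
#define RECORDSIZE (128 * 1024)

/* Upper-bound estimate: round the logical length up to the next
 * recordsize boundary, then add one extra record as slack. */
static uint64_t
cache_space_bound(uint64_t length)
{
    uint64_t records = (length + RECORDSIZE - 1) / RECORDSIZE;
    return (records + 1) * RECORDSIZE;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)cache_space_bound(1));        /* 262144 */
    printf("%llu\n", (unsigned long long)cache_space_bound(1048576));  /* 1179648 */
    return 0;
}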
Andrew Deason wrote:

> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>> If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?
>
> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

But having a dedicated dataset will let you answer such a question immediately, as ZFS will then give you information for the dataset on how much space is used (everything: data + metadata) and how much is left.

> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

No, it is not.

>> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
>> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.
>
> We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.
>
> I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.

What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller blocksize; larger ones will use a blocksize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with their old blocksize. But in most cases it probably won't be a practical problem anyway.
On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski <milek at task.gda.pl> wrote:

> > No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.
>
> But having a dedicated dataset will let you answer such a question immediately, as ZFS will then give you information for the dataset on how much space is used (everything: data + metadata) and how much is left.

Immediately? There isn't a delay between the write and the next commit when the space is recorded? (Do you mean a statvfs equivalent, or some zfs-specific call?)

And the current code is structured such that we record usage changes before a write; it would be a huge pain to rely on the write to calculate the usage (for that and other reasons).

> >> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
> >> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.
> >
> > We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.
> >
> > I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.
>
> What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller blocksize; larger ones will use a blocksize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with their old blocksize. But in most cases it probably won't be a practical problem anyway.

Well, it may or may not be 'fine'; we may have a lot of little files in the cache, and rounding up to 128k for each one reduces our disk efficiency somewhat. Files are truncated to 0 and grow again quite often in busy clients. But that's an efficiency issue; we'd still be able to stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets their recordsize to something different? That's why I was wondering about the overhead if someone sets the recordsize to 1k; is there no way to account for it even if I know the recordsize is 1k?

--
Andrew Deason
adeason at sinenomine.net
Andrew Deason wrote:

> On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>>> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.
>>
>> But having a dedicated dataset will let you answer such a question immediately, as ZFS will then give you information for the dataset on how much space is used (everything: data + metadata) and how much is left.
>
> Immediately? There isn't a delay between the write and the next commit when the space is recorded? (Do you mean a statvfs equivalent, or some zfs-specific call?)
>
> And the current code is structured such that we record usage changes before a write; it would be a huge pain to rely on the write to calculate the usage (for that and other reasons).

There will be a delay of up to 30s currently.

But how much data do you expect to be pushed within 30s? Let's say it was as much as 10GB to lots of small files, and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500MB? Does it matter in practice?

>>>> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
>>>> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.
>>>
>>> We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.
>>>
>>> I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.
>>
>> What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller blocksize; larger ones will use a blocksize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with their old blocksize. But in most cases it probably won't be a practical problem anyway.
>
> Well, it may or may not be 'fine'; we may have a lot of little files in the cache, and rounding up to 128k for each one reduces our disk efficiency somewhat. Files are truncated to 0 and grow again quite often in busy clients. But that's an efficiency issue; we'd still be able to stay within the configured limit that way.
>
> But anyway, 128k may be fine for me, but what about if someone sets their recordsize to something different? That's why I was wondering about the overhead if someone sets the recordsize to 1k; is there no way to account for it even if I know the recordsize is 1k?

What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes?

What if a user creates a snapshot? How would you take that into account?

I suspect that you are looking too closely for no real benefit.
Especially if you don't want to dedicate a dataset to the cache, you have to expect other applications in the system to write to the same file system in different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.

IMHO a dedicated dataset and statvfs() on it should be good enough, perhaps with an estimate before writing your data (as a total logical file size from the application's point of view) - however, due to compression or dedup enabled by the user, that estimate could be totally wrong, so it probably doesn't actually make sense.

--
Robert Milkowski
http://milek.blogspot.com
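[As a rough illustration of the statvfs() approach suggested above, a minimal sketch; the cache directory path used here is just a placeholder, and it assumes the cache sits on its own dedicated dataset so the numbers reflect the cache alone.]

/* Minimal sketch: query the space used/available on the filesystem
 * backing the cache directory via statvfs().  Assumes the cache
 * directory lives on a dedicated dataset. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    const char *cachedir = "/usr/vice/cache";   /* placeholder cache path */
    struct statvfs vfs;

    if (statvfs(cachedir, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    /* f_frsize is the fundamental block size the counts are reported in. */
    unsigned long long frsize = vfs.f_frsize ? vfs.f_frsize : vfs.f_bsize;
    unsigned long long total  = (unsigned long long)vfs.f_blocks * frsize;
    unsigned long long avail  = (unsigned long long)vfs.f_bavail * frsize;

    printf("total: %llu  available: %llu  used: %llu (bytes)\n",
        total, avail, total - avail);
    return 0;
}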
On Fri, 18 Sep 2009 17:54:41 -0400 Robert Milkowski <milek at task.gda.pl> wrote:

> There will be a delay of up to 30s currently.
>
> But how much data do you expect to be pushed within 30s? Let's say it was as much as 10GB to lots of small files, and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500MB? Does it matter in practice?

Well, that wasn't the problem I was thinking of. I meant, if we have to wait 30 seconds after the write to measure the disk usage... what do I do, just sleep 30s after the write before polling for disk usage?

We could just ask for the disk usage when we write, knowing that it doesn't take into account the write we are performing... but then we're changing what we're measuring. If we are removing things from the cache in order to free up space, how do we know when to stop?

To illustrate: normally when the cache is 98% full, we remove items until we are 95% full before we allow a write to happen again. If we relied on statvfs information for our disk usage information, we would start removing items at 98%, and have no idea when we hit 95% unless we wait 30 seconds.

If you are simply saying that the difference between the logical size and the used disk blocks on ZFS is small enough not to make a difference... well, that's what I've been asking. I have asked what the maximum difference is between "logical size rounded up to recordsize" and "size taken up on disk", and haven't received an answer yet. If the answer is "small enough that you don't care", then fantastic.

> What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes?
>
> What if a user creates a snapshot? How would you take that into account?

Then it will be wrong; we do not take them into account. I do not care about those cases. It is already impossible to enforce that the cache tracking data is 100% correct all of the time.

Imagine we somehow had a way to account for all of those cases you listed, one that would make me happy. Say the directory the user uses for the cache data is /usr/vice/cache (one standard path to put it). The OpenAFS client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of other files. If the user puts their own file in /usr/vice/cache/reallybigfile, our cache tracking information will always be off, in all current implementations. We have no control over it, and we do not try to solve that problem.

I am treating cases like "what if the user creates a snapshot" as a similar situation. If someone does that and runs out of space, it is pretty easy to troubleshoot their system and say "you have a snapshot of the cache dataset; do not do that". Right now, if someone runs an OpenAFS client cache on ZFS and runs out of space, the only thing I can tell them is "don't use ZFS", which I don't want to do. If it works for _a_ configuration -- the default one -- that is all I am asking for.

> I suspect that you are looking too closely for no real benefit. Especially if you don't want to dedicate a dataset to the cache, you have to expect other applications in the system to write to the same file system in different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.

It is certainly possible for other applications to fill up the disk.
We just need to ensure that we don't fill up the disk so much that we block other applications. You may think this is fruitless, and just from that description alone, it may be. But you must understand that without an accurate bound on the cache, well... we can eat up the disk a lot faster than other applications, without the user realizing it.

--
Andrew Deason
adeason at sinenomine.net
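[For illustration, the 98%/95% watermark behaviour described a couple of messages back could be sketched like this; the numbers and helper logic here are only stand-ins for whatever bookkeeping the cache actually does, not OpenAFS code.]

#include <stdio.h>
#include <stdint.h>

#define HIGH_WATER_PCT 98
#define LOW_WATER_PCT  95

/* Stand-ins for the cache's real bookkeeping (hypothetical). */
static uint64_t usage;        /* estimated bytes used on disk */
static uint64_t cache_size;   /* configured cache size in bytes */

static void
evict_one_item(void)
{
    usage -= 128 * 1024;      /* pretend each eviction frees one 128k record */
}

/* Start evicting at 98% full; keep going until usage drops to 95%. */
static void
maybe_evict(void)
{
    if (usage * 100 < cache_size * HIGH_WATER_PCT)
        return;
    while (usage * 100 > cache_size * LOW_WATER_PCT)
        evict_one_item();
}

int main(void)
{
    cache_size = 100 * 1024 * 1024;   /* 100MB cache */
    usage = 99 * 1024 * 1024;         /* currently 99% full */
    maybe_evict();
    printf("usage after eviction: %llu bytes\n", (unsigned long long)usage);
    return 0;
}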
If you are just building a cache, why not just make a file system and put a reservation on it? Turn off auto snapshots and set other features as per best practices for your workload? In other words, treat it like we treat dump space.

I think that we are getting caught up in trying to answer the question you ask rather than solving the problem you have... perhaps because we don't understand the problem.
 -- richard

On Sep 20, 2009, at 2:17 PM, Andrew Deason wrote:

> On Fri, 18 Sep 2009 17:54:41 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>> There will be a delay of up to 30s currently.
>>
>> But how much data do you expect to be pushed within 30s? Let's say it was as much as 10GB to lots of small files, and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500MB? Does it matter in practice?
>
> Well, that wasn't the problem I was thinking of. I meant, if we have to wait 30 seconds after the write to measure the disk usage... what do I do, just sleep 30s after the write before polling for disk usage?
>
> We could just ask for the disk usage when we write, knowing that it doesn't take into account the write we are performing... but then we're changing what we're measuring. If we are removing things from the cache in order to free up space, how do we know when to stop?
>
> To illustrate: normally when the cache is 98% full, we remove items until we are 95% full before we allow a write to happen again. If we relied on statvfs information for our disk usage information, we would start removing items at 98%, and have no idea when we hit 95% unless we wait 30 seconds.
>
> If you are simply saying that the difference between the logical size and the used disk blocks on ZFS is small enough not to make a difference... well, that's what I've been asking. I have asked what the maximum difference is between "logical size rounded up to recordsize" and "size taken up on disk", and haven't received an answer yet. If the answer is "small enough that you don't care", then fantastic.
>
>> What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes?
>>
>> What if a user creates a snapshot? How would you take that into account?
>
> Then it will be wrong; we do not take them into account. I do not care about those cases. It is already impossible to enforce that the cache tracking data is 100% correct all of the time.
>
> Imagine we somehow had a way to account for all of those cases you listed, one that would make me happy. Say the directory the user uses for the cache data is /usr/vice/cache (one standard path to put it). The OpenAFS client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of other files. If the user puts their own file in /usr/vice/cache/reallybigfile, our cache tracking information will always be off, in all current implementations. We have no control over it, and we do not try to solve that problem.
>
> I am treating cases like "what if the user creates a snapshot" as a similar situation. If someone does that and runs out of space, it is pretty easy to troubleshoot their system and say "you have a snapshot of the cache dataset; do not do that". Right now, if someone runs an OpenAFS client cache on ZFS and runs out of space, the only thing I can tell them is "don't use ZFS", which I don't want to do.
> If it works for _a_ configuration -- the default one -- that is all I am asking for.
>
>> I suspect that you are looking too closely for no real benefit. Especially if you don't want to dedicate a dataset to the cache, you have to expect other applications in the system to write to the same file system in different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.
>
> It is certainly possible for other applications to fill up the disk. We just need to ensure that we don't fill up the disk so much that we block other applications. You may think this is fruitless, and just from that description alone, it may be. But you must understand that without an accurate bound on the cache, well... we can eat up the disk a lot faster than other applications, without the user realizing it.
>
> --
> Andrew Deason
> adeason at sinenomine.net
On Sun, 20 Sep 2009 20:31:57 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> If you are just building a cache, why not just make a file system and put a reservation on it? Turn off auto snapshots and set other features as per best practices for your workload? In other words, treat it like we treat dump space.
>
> I think that we are getting caught up in trying to answer the question you ask rather than solving the problem you have... perhaps because we don't understand the problem.

Yes, possibly... some of these suggestions don't quite make a lot of sense to me. We can't just make a filesystem and put a reservation on it; we are just an application the administrator puts on a machine for it to access AFS. So I'm not sure when you are imagining we do that; when the client starts up? Or as part of the installation procedure? Requiring a separate filesystem seems unnecessarily restrictive.

And I still don't see how that helps. Making an fs with a reservation would definitely limit us to the specified space, but we still can't get an accurate picture of the current disk usage. I already mentioned why using statvfs is not usable with that commit delay.

But solving the general problem isn't necessary for me. If I could just get a ballpark estimate of the max overhead for a file, I would be fine. I haven't paid attention to it before, so I don't even have an intuitive feel for what it is.

--
Andrew Deason
adeason at sinenomine.net
On Sep 21, 2009, at 7:11 AM, Andrew Deason wrote:

> On Sun, 20 Sep 2009 20:31:57 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>
>> If you are just building a cache, why not just make a file system and put a reservation on it? Turn off auto snapshots and set other features as per best practices for your workload? In other words, treat it like we treat dump space.
>>
>> I think that we are getting caught up in trying to answer the question you ask rather than solving the problem you have... perhaps because we don't understand the problem.
>
> Yes, possibly... some of these suggestions don't quite make a lot of sense to me. We can't just make a filesystem and put a reservation on it; we are just an application the administrator puts on a machine for it to access AFS. So I'm not sure when you are imagining we do that; when the client starts up? Or as part of the installation procedure? Requiring a separate filesystem seems unnecessarily restrictive.
>
> And I still don't see how that helps. Making an fs with a reservation would definitely limit us to the specified space, but we still can't get an accurate picture of the current disk usage. I already mentioned why using statvfs is not usable with that commit delay.

OK, so the problem you are trying to solve is "how much stuff can I place in the remaining free space?" I don't think this is knowable for a dynamic file system like ZFS, where metadata is dynamically allocated.

> But solving the general problem isn't necessary for me. If I could just get a ballpark estimate of the max overhead for a file, I would be fine. I haven't paid attention to it before, so I don't even have an intuitive feel for what it is.

You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
 -- richard
On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> OK, so the problem you are trying to solve is "how much stuff can I place in the remaining free space?" I don't think this is knowable for a dynamic file system like ZFS, where metadata is dynamically allocated.

Yes. And I acknowledge that we can't know that precisely; I'm trying for an estimate of the bound.

> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.

Is that the max with copies=3? Assume copies=1; what is it then?

--
Andrew Deason
adeason at sinenomine.net
On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:

> On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>
>> OK, so the problem you are trying to solve is "how much stuff can I place in the remaining free space?" I don't think this is knowable for a dynamic file system like ZFS, where metadata is dynamically allocated.
>
> Yes. And I acknowledge that we can't know that precisely; I'm trying for an estimate of the bound.
>
>> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
>
> Is that the max with copies=3? Assume copies=1; what is it then?

1x size + 1 block.
 -- richard
On Mon, 21 Sep 2009 18:20:53 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
>
>> On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>>
>>> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
>>
>> Is that the max with copies=3? Assume copies=1; what is it then?
>
> 1x size + 1 block.

That seems to differ quite a bit from what I've seen; perhaps I am misunderstanding... is the "+ 1 block" of a different size than the recordsize? With recordsize=1k:

$ ls -ls foo
2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo

1024k vs 1130k

--
Andrew Deason
adeason at sinenomine.net
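[Spelling out the arithmetic behind that "1024k vs 1130k" comparison; ls -ls reports the allocation in 512-byte blocks.]

  2261 blocks * 512 bytes/block = 1,157,632 bytes  (~1130k allocated on disk)
  logical file size             = 1,048,576 bytes  ( 1024k)
  difference                    =   109,056 bytes  (~10.4% overhead at recordsize=1k)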
On Sep 22, 2009, at 8:07 AM, Andrew Deason wrote:

> On Mon, 21 Sep 2009 18:20:53 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>
>> On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
>>
>>> On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>>>
>>>> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
>>>
>>> Is that the max with copies=3? Assume copies=1; what is it then?
>>
>> 1x size + 1 block.
>
> That seems to differ quite a bit from what I've seen; perhaps I am misunderstanding... is the "+ 1 block" of a different size than the recordsize? With recordsize=1k:
>
> $ ls -ls foo
> 2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo

Well, there it is. I suggest suitable guard bands.
 -- richard

> 1024k vs 1130k
>
> --
> Andrew Deason
> adeason at sinenomine.net
On Tue, 22 Sep 2009 13:26:59 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> > That seems to differ quite a bit from what I've seen; perhaps I am misunderstanding... is the "+ 1 block" of a different size than the recordsize? With recordsize=1k:
> >
> > $ ls -ls foo
> > 2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo
>
> Well, there it is. I suggest suitable guard bands.

So, you would say it's reasonable to assume the overhead will always be less than about 100k, or 10%? And to be sure... if we're rounding up to the next recordsize boundary, are we guaranteed to be able to get the recordsize from the blocksize reported by statvfs?

--
Andrew Deason
adeason at sinenomine.net