As I'm sure you're all aware, file size in ZFS can differ greatly from actual disk usage, depending on access patterns. e.g. truncating a 1M file down to 1 byte still uses up about 130k on disk when recordsize=128k. I'm aware that this is a result of ZFS's rather different internals, and that it works well for normal usage, but this can make things difficult for applications that wish to restrain their own disk usage.

The particular application I'm working on that has such a problem is the OpenAFS <http://www.openafs.org/> client, when it uses ZFS as the disk cache partition. The disk cache is constrained to a user-configurable size, and the amount of cache used is tracked by counters internal to the OpenAFS client. Normally cache usage is tracked by just taking the file length of a particular file in the cache, and rounding it up to the next frsize boundary of the cache filesystem. This is obviously wrong when ZFS is used, and so our cache usage tracking can get very incorrect. So, I have two questions which would help us fix this:

1. Is there any interface to ZFS (or a configuration knob or something) that we can use from a kernel module to explicitly return a file to the more predictable size? In the above example, truncating a 1M file (call it 'A') to 1 byte makes it take up 130k, but if we create a new file (call it 'B') with that 1 byte in it, it only takes up about 1k. Is there any operation we can perform on file 'A' to make it take up less space without having to create a new file 'B'?

The cache files are often truncated and overwritten with new data, which is why this can become a problem. If there were some way to explicitly signal to ZFS that we want a particular file to be put in a smaller block or something, that would be helpful. (I am mostly ignorant of ZFS internals; if there's somewhere that would have told me this information, let me know.)

2. Lacking 1., can anyone give an equation relating file length, maximum size on disk, and recordsize (and any additional parameters needed)? If we just have a way of knowing in advance how much disk space we're going to take up by writing a certain amount of data, we should be okay.

Or, if anyone has any other ideas on how to overcome this, they would be welcome.

--
Andrew Deason
adeason at sinenomine.net
Andrew Deason wrote:

> As I'm sure you're all aware, file size in ZFS can differ greatly from actual disk usage, depending on access patterns. e.g. truncating a 1M file down to 1 byte still uses up about 130k on disk when recordsize=128k. I'm aware that this is a result of ZFS's rather different internals, and that it works well for normal usage, but this can make things difficult for applications that wish to restrain their own disk usage.
>
> The particular application I'm working on that has such a problem is the OpenAFS <http://www.openafs.org/> client, when it uses ZFS as the disk cache partition. The disk cache is constrained to a user-configurable size, and the amount of cache used is tracked by counters internal to the OpenAFS client. Normally cache usage is tracked by just taking the file length of a particular file in the cache, and rounding it up to the next frsize boundary of the cache filesystem. This is obviously wrong when ZFS is used, and so our cache usage tracking can get very incorrect. So, I have two questions which would help us fix this:
>
> 1. Is there any interface to ZFS (or a configuration knob or something) that we can use from a kernel module to explicitly return a file to the more predictable size? In the above example, truncating a 1M file (call it 'A') to 1 byte makes it take up 130k, but if we create a new file (call it 'B') with that 1 byte in it, it only takes up about 1k. Is there any operation we can perform on file 'A' to make it take up less space without having to create a new file 'B'?
>
> The cache files are often truncated and overwritten with new data, which is why this can become a problem. If there were some way to explicitly signal to ZFS that we want a particular file to be put in a smaller block or something, that would be helpful. (I am mostly ignorant of ZFS internals; if there's somewhere that would have told me this information, let me know.)
>
> 2. Lacking 1., can anyone give an equation relating file length, maximum size on disk, and recordsize (and any additional parameters needed)? If we just have a way of knowing in advance how much disk space we're going to take up by writing a certain amount of data, we should be okay.
>
> Or, if anyone has any other ideas on how to overcome this, they would be welcome.

When creating a new file, ZFS will set its block size to be no larger than the current value of recordsize. If there is at least recordsize of data to be written, then the blocksize will equal the recordsize. From then on the file's blocksize is "frozen" - that's why, when you truncate it, it keeps its original blocksize. It also means that if the file was smaller than recordsize (so its blocksize was smaller too), then when you truncate it to 1 byte it will keep its smaller blocksize.

IMHO you won't be able to lower a file's blocksize other than by creating a new file.
For example:

> milek at r600:~/progs$ mkfile 10m file1
> milek at r600:~/progs$ ./stat file1
> size: 10485760 blksize: 131072
> milek at r600:~/progs$ truncate -s 1 file1
> milek at r600:~/progs$ ./stat file1
> size: 1 blksize: 131072
> milek at r600:~/progs$
> milek at r600:~/progs$ rm file1
> milek at r600:~/progs$
> milek at r600:~/progs$ mkfile 10000 file1
> milek at r600:~/progs$ ./stat file1
> size: 10000 blksize: 10240
> milek at r600:~/progs$ truncate -s 1 file1
> milek at r600:~/progs$ ./stat file1
> size: 1 blksize: 10240
> milek at r600:~/progs$

If you are not worried by this extra overhead, and you are mostly concerned with proper accounting of used disk space, then instead of relying on the file size alone you should take its blocksize into account and round the file size up to a blocksize boundary (the actual file size on disk, not counting metadata, is N*blocksize). However, IIRC there is an open bug/RFE asking for special treatment of a file's tail block so that it can be smaller than the file's blocksize. Once it is integrated, your math could be wrong again.

Please also note that relying on the logical file size could be even more misleading if compression is enabled in ZFS (or dedup in the future). Relying on the blocksize will give you more accurate estimates.

You can get a file's blocksize by using stat() and reading buf.st_blksize, or you can get a good estimate of used disk space from buf.st_blocks*512:

> milek at r600:~/progs$ cat stat.c
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <sys/stat.h>
>
> int main(int argc, char **argv)
> {
>     struct stat buf;
>
>     if (!stat(argv[1], &buf))
>     {
>         /* st_size and st_blksize are not plain ints; cast for printf */
>         printf("size: %lld\tblksize: %lld\n",
>             (long long)buf.st_size, (long long)buf.st_blksize);
>     }
>     else
>     {
>         printf("ERROR: stat(), errno: %d\n", errno);
>         exit(1);
>     }
>
>     return 0;
> }
>
> milek at r600:~/progs$

--
Robert Milkowski
http://milek.blogspot.com
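[To make the rounding concrete, a minimal sketch (untested, and assuming a POSIX stat() where st_blocks is reported in 512-byte units) of the two estimates just described: the logical size rounded up to the file's blocksize, and the allocated size from st_blocks.]

/* Minimal sketch (untested): two estimates of a file's on-disk usage,
 * ignoring metadata.  Assumes POSIX stat() and 512-byte st_blocks units. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat buf;

    if (argc < 2 || stat(argv[1], &buf) != 0) {
        perror("stat");
        return 1;
    }

    /* Estimate 1: logical size rounded up to the next st_blksize boundary. */
    long long bsz = (long long)buf.st_blksize;
    long long rounded = ((long long)buf.st_size + bsz - 1) / bsz * bsz;

    /* Estimate 2: blocks actually allocated, reported in 512-byte units. */
    long long allocated = (long long)buf.st_blocks * 512LL;

    printf("logical: %lld  rounded to blksize: %lld  allocated: %lld\n",
        (long long)buf.st_size, rounded, allocated);
    return 0;
}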
On Thu, 17 Sep 2009 22:55:38 +0100 Robert Milkowski <milek at task.gda.pl> wrote:

> IMHO you won't be able to lower a file's blocksize other than by creating a new file. For example:

Okay, thank you.

> If you are not worried by this extra overhead, and you are mostly concerned with proper accounting of used disk space, then instead of relying on the file size alone you should take its blocksize into account and round the file size up to a blocksize boundary (the actual file size on disk, not counting metadata, is N*blocksize).

Metadata can be nontrivial for small blocksizes, though, can't it? I tried similar tests with varying recordsizes, and with recordsize=1k a file with 1M bytes written to it took up significantly more than 1024 1k blocks. Is there a reliable way to account for this?

Through experimenting with various recordsizes and file sizes I can see enough of a pattern to try to come up with an equation for the total disk usage, but that doesn't mean such a relation would be correct... if someone could give me something a bit more authoritative, it would be nice.

> However, IIRC there is an open bug/RFE asking for special treatment of a file's tail block so that it can be smaller than the file's blocksize. Once it is integrated, your math could be wrong again.
>
> Please also note that relying on the logical file size could be even more misleading if compression is enabled in ZFS (or dedup in the future). Relying on the blocksize will give you more accurate estimates.

I was a bit unclear. We're not so concerned about the math being wrong in general; we just need to make sure we are not significantly underestimating the usage. If we overestimate within reason, that's fine, but getting the tightest bound is obviously more desirable. So I'm not worried about compression, dedup, or the tail block being treated in such a way.

> You can get a file's blocksize by using stat() and reading buf.st_blksize, or you can get a good estimate of used disk space from buf.st_blocks*512

Hmm, I thought I had tried this, but st_blocks didn't seem to be updated accurately until some time after a write. I'd also like to avoid having to stat the file each time after a write or truncate in order to get the file size. The current code is structured with the intent that the space calculations are made /before/ the write is done. It may be possible to change that, but I'd rather not, if possible (and I'd have to make sure there's not a significant speed hit in doing so).

--
Andrew Deason
adeason at sinenomine.net
If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?

Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.

The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.

--
Robert Milkowski
http://milek.blogspot.com
On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski <milek at task.gda.pl> wrote:

> If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?

No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.

We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.

I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.

Can't I get an /estimate/ of the data+metadata disk usage? What about the hypothetical case of the metadata compression ratio being effectively the same as without compression - what would it be then?

--
Andrew Deason
adeason at sinenomine.net
On Sep 18, 2009, at 7:36 AM, Andrew Deason wrote:

> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>> If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?
>
> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

The transactional nature of ZFS may work against you here. Until the data is committed to disk, it is unclear how much space it will consume. Compression clouds the crystal ball further.

> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

Use delegation. Users can create their own datasets, set parameters, etc. For this case, you could consider changing recordsize, if you really are so worried about 1k. IMHO, it is easier and less expensive in process and pain to just buy more disk when needed.
 -- richard
On Fri, 18 Sep 2009 12:48:34 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> The transactional nature of ZFS may work against you here. Until the data is committed to disk, it is unclear how much space it will consume. Compression clouds the crystal ball further.

...but not impossible. I'm just looking for a reasonable upper bound. For example, if I always rounded up to the next 128k mark, and added an additional 128k, that would always give me an upper bound (for files <1M), as far as I can tell. But that is not a very tight bound; can you suggest anything better?

> > I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.
>
> Use delegation. Users can create their own datasets, set parameters, etc. For this case, you could consider changing recordsize, if you really are so worried about 1k. IMHO, it is easier and less expensive in process and pain to just buy more disk when needed.

Users of OpenAFS, not "unprivileged users". All of the users I am talking about are the administrators of their machines. I would just like to reduce the number of filesystem-specific steps that need to be taken to set up the cache. You don't need to do anything special for a tmpfs cache, for instance, or for ext2/3 caches on Linux.

--
Andrew Deason
adeason at sinenomine.net
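[For concreteness, the upper-bound heuristic described above ("round up to the next 128k mark and add an additional 128k") might be sketched roughly as follows; the hard-coded 128k recordsize and the function name are assumptions for illustration, not anything taken from OpenAFS.]

#include <stdio.h>
#include <stdint.h>

/* Assumed dataset recordsize; 128k is only the ZFS default. */
#define RECORDSIZE (128 * 1024)

/* Upper-bound estimate: round the logical length up to the next
 * recordsize boundary, then add one extra record as slack. */
static uint64_t
cache_space_bound(uint64_t length)
{
    uint64_t records = (length + RECORDSIZE - 1) / RECORDSIZE;
    return (records + 1) * RECORDSIZE;
}

int main(void)
{
    printf("%llu\n", (unsigned long long)cache_space_bound(1));        /* 262144 */
    printf("%llu\n", (unsigned long long)cache_space_bound(1048576));  /* 1179648 */
    return 0;
}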
Andrew Deason wrote:

> On Thu, 17 Sep 2009 18:40:49 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>> If you create a dedicated dataset for your cache and set a quota on it, then instead of tracking the disk space usage of each file you could easily check how much disk space is being used in the dataset. Would that suffice for you?
>
> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.

But having a dedicated dataset will let you answer such a question immediately, as ZFS will then give you information for the dataset on how much space is used (everything: data + metadata) and how much is left.

> I'd also _like_ not to require a dedicated dataset for it, but it's not like it's difficult for users to create one.

No, it is not.

>> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
>> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.
>
> We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.
>
> I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.

What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller blocksize; larger ones will use a blocksize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with their old blocksize. But in most cases it probably won't be a practical problem anyway.
On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski <milek at task.gda.pl> wrote:

> > No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.
>
> But having a dedicated dataset will let you answer such a question immediately, as ZFS will then give you information for the dataset on how much space is used (everything: data + metadata) and how much is left.

Immediately? There isn't a delay between the write and the next commit when the space is recorded? (Do you mean a statvfs equivalent, or some zfs-specific call?)

And the current code is structured such that we record usage changes before a write; it would be a huge pain to rely on the write to calculate the usage (for that and other reasons).

> >> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
> >> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.
> >
> > We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.
> >
> > I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.
>
> What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller blocksize; larger ones will use a blocksize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with their old blocksize. But in most cases it probably won't be a practical problem anyway.

Well, it may or may not be 'fine'; we may have a lot of little files in the cache, and rounding up to 128k for each one reduces our disk efficiency somewhat. Files are truncated to 0 and grow again quite often in busy clients. But that's an efficiency issue; we'd still be able to stay within the configured limit that way.

But anyway, 128k may be fine for me, but what about if someone sets their recordsize to something different? That's why I was wondering about the overhead if someone sets the recordsize to 1k; is there no way to account for it even if I know the recordsize is 1k?

--
Andrew Deason
adeason at sinenomine.net
Andrew Deason wrote:

> On Fri, 18 Sep 2009 16:38:28 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>>> No. We need to be able to tell how close to full we are, for determining when to start/stop removing things from the cache before we can add new items to the cache again.
>>
>> But having a dedicated dataset will let you answer such a question immediately, as ZFS will then give you information for the dataset on how much space is used (everything: data + metadata) and how much is left.
>
> Immediately? There isn't a delay between the write and the next commit when the space is recorded? (Do you mean a statvfs equivalent, or some zfs-specific call?)
>
> And the current code is structured such that we record usage changes before a write; it would be a huge pain to rely on the write to calculate the usage (for that and other reasons).

There will be a delay of up to 30s currently.

But how much data do you expect to be pushed within 30s? Let's say it was as much as 10GB to lots of small files, and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500MB? Does it matter in practice?

>>>> Setting recordsize to 1k, if you have lots of files (I assume) larger than that, doesn't really make sense.
>>>> The problem with metadata is that by default it is also compressed, so there is no easy way to tell how much disk space it occupies for a specified file using the standard API.
>>>
>>> We do not know in advance what file sizes we'll be seeing in general. We could of course tell people to tune the cache dataset according to their usage pattern, but I don't think users are generally going to know what their cache usage pattern looks like.
>>>
>>> I can say that, at least right now, usually each file will be at most 1M long (1M is the max unless the user specifically changes it). But between the range 1k-1M, I don't know what the distribution looks like.
>>
>> What I meant was that I believe the default recordsize of 128k should be fine for you (files smaller than 128k will use a smaller blocksize; larger ones will use a blocksize of 128k). The only problem will be with files truncated to 0 and growing again, as they will be stuck with their old blocksize. But in most cases it probably won't be a practical problem anyway.
>
> Well, it may or may not be 'fine'; we may have a lot of little files in the cache, and rounding up to 128k for each one reduces our disk efficiency somewhat. Files are truncated to 0 and grow again quite often in busy clients. But that's an efficiency issue; we'd still be able to stay within the configured limit that way.
>
> But anyway, 128k may be fine for me, but what about if someone sets their recordsize to something different? That's why I was wondering about the overhead if someone sets the recordsize to 1k; is there no way to account for it even if I know the recordsize is 1k?

What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes?

What if a user creates a snapshot? How would you take that into account?

I suspect that you are looking too closely for no real benefit.
Especially if you don't want to dedicate a dataset to the cache, you have to expect other applications in the system to write to the same file system in different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.

IMHO a dedicated dataset and statvfs() on it should be good enough, perhaps with an estimate before writing your data (as a total logical file size from the application's point of view) - however, due to compression or dedup enabled by the user, that estimate could be totally wrong, so it probably doesn't actually make sense.

--
Robert Milkowski
http://milek.blogspot.com
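[As a rough illustration of the statvfs() approach suggested above, a minimal sketch; the cache directory path used here is just a placeholder, and it assumes the cache sits on its own dedicated dataset so the numbers reflect the cache alone.]

/* Minimal sketch: query the space used/available on the filesystem
 * backing the cache directory via statvfs().  Assumes the cache
 * directory lives on a dedicated dataset. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(void)
{
    const char *cachedir = "/usr/vice/cache";   /* placeholder cache path */
    struct statvfs vfs;

    if (statvfs(cachedir, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }

    /* f_frsize is the fundamental block size the counts are reported in. */
    unsigned long long frsize = vfs.f_frsize ? vfs.f_frsize : vfs.f_bsize;
    unsigned long long total  = (unsigned long long)vfs.f_blocks * frsize;
    unsigned long long avail  = (unsigned long long)vfs.f_bavail * frsize;

    printf("total: %llu  available: %llu  used: %llu (bytes)\n",
        total, avail, total - avail);
    return 0;
}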
On Fri, 18 Sep 2009 17:54:41 -0400 Robert Milkowski <milek at task.gda.pl> wrote:

> There will be a delay of up to 30s currently.
>
> But how much data do you expect to be pushed within 30s? Let's say it was as much as 10GB to lots of small files, and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500MB? Does it matter in practice?

Well, that wasn't the problem I was thinking of. I meant, if we have to wait 30 seconds after the write to measure the disk usage... what do I do, just sleep 30s after the write before polling for disk usage?

We could just ask for the disk usage when we write, knowing that it doesn't take into account the write we are performing... but then we're changing what we're measuring. If we are removing things from the cache in order to free up space, how do we know when to stop?

To illustrate: normally when the cache is 98% full, we remove items until we are 95% full before we allow a write to happen again. If we relied on statvfs information for our disk usage information, we would start removing items at 98%, and have no idea when we hit 95% unless we wait 30 seconds.

If you are simply saying that the difference between the logical size and the used disk blocks on ZFS is small enough not to make a difference... well, that's what I've been asking. I have asked what the maximum difference is between "logical size rounded up to recordsize" and "size taken up on disk", and haven't received an answer yet. If the answer is "small enough that you don't care", then fantastic.

> What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes?
>
> What if a user creates a snapshot? How would you take that into account?

Then it will be wrong; we do not take them into account. I do not care about those cases. It is already impossible to enforce that the cache tracking data is 100% correct all of the time.

Imagine we somehow had a way to account for all of those cases you listed, one that would make me happy. Say the directory the user uses for the cache data is /usr/vice/cache (one standard path to put it). The OpenAFS client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of other files. If the user puts their own file in /usr/vice/cache/reallybigfile, our cache tracking information will always be off, in all current implementations. We have no control over it, and we do not try to solve that problem.

I am treating cases like "what if the user creates a snapshot" as a similar situation. If someone does that and runs out of space, it is pretty easy to troubleshoot their system and say "you have a snapshot of the cache dataset; do not do that". Right now, if someone runs an OpenAFS client cache on ZFS and runs out of space, the only thing I can tell them is "don't use ZFS", which I don't want to do. If it works for _a_ configuration -- the default one -- that is all I am asking for.

> I suspect that you are looking too closely for no real benefit. Especially if you don't want to dedicate a dataset to the cache, you have to expect other applications in the system to write to the same file system in different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.

It is certainly possible for other applications to fill up the disk.
We just need to ensure that we don't fill up the disk so much that we block other applications. You may think this is fruitless, and just from that description alone, it may be. But you must understand that without an accurate bound on the cache, well... we can eat up the disk a lot faster than other applications, without the user realizing it.

--
Andrew Deason
adeason at sinenomine.net
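[For illustration, the 98%/95% watermark behaviour described a couple of messages back could be sketched like this; the numbers and helper logic here are only stand-ins for whatever bookkeeping the cache actually does, not OpenAFS code.]

#include <stdio.h>
#include <stdint.h>

#define HIGH_WATER_PCT 98
#define LOW_WATER_PCT  95

/* Stand-ins for the cache's real bookkeeping (hypothetical). */
static uint64_t usage;        /* estimated bytes used on disk */
static uint64_t cache_size;   /* configured cache size in bytes */

static void
evict_one_item(void)
{
    usage -= 128 * 1024;      /* pretend each eviction frees one 128k record */
}

/* Start evicting at 98% full; keep going until usage drops to 95%. */
static void
maybe_evict(void)
{
    if (usage * 100 < cache_size * HIGH_WATER_PCT)
        return;
    while (usage * 100 > cache_size * LOW_WATER_PCT)
        evict_one_item();
}

int main(void)
{
    cache_size = 100 * 1024 * 1024;   /* 100MB cache */
    usage = 99 * 1024 * 1024;         /* currently 99% full */
    maybe_evict();
    printf("usage after eviction: %llu bytes\n", (unsigned long long)usage);
    return 0;
}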
If you are just building a cache, why not just make a file system and put a reservation on it? Turn off auto snapshots and set other features as per best practices for your workload? In other words, treat it like we treat dump space.

I think that we are getting caught up in trying to answer the question you ask rather than solving the problem you have... perhaps because we don't understand the problem.
 -- richard

On Sep 20, 2009, at 2:17 PM, Andrew Deason wrote:

> On Fri, 18 Sep 2009 17:54:41 -0400 Robert Milkowski <milek at task.gda.pl> wrote:
>
>> There will be a delay of up to 30s currently.
>>
>> But how much data do you expect to be pushed within 30s? Let's say it was as much as 10GB to lots of small files, and you calculated the total size by only summing up the logical size of the data. Would you really expect the error to be greater than 5%, which would be 500MB? Does it matter in practice?
>
> Well, that wasn't the problem I was thinking of. I meant, if we have to wait 30 seconds after the write to measure the disk usage... what do I do, just sleep 30s after the write before polling for disk usage?
>
> We could just ask for the disk usage when we write, knowing that it doesn't take into account the write we are performing... but then we're changing what we're measuring. If we are removing things from the cache in order to free up space, how do we know when to stop?
>
> To illustrate: normally when the cache is 98% full, we remove items until we are 95% full before we allow a write to happen again. If we relied on statvfs information for our disk usage information, we would start removing items at 98%, and have no idea when we hit 95% unless we wait 30 seconds.
>
> If you are simply saying that the difference between the logical size and the used disk blocks on ZFS is small enough not to make a difference... well, that's what I've been asking. I have asked what the maximum difference is between "logical size rounded up to recordsize" and "size taken up on disk", and haven't received an answer yet. If the answer is "small enough that you don't care", then fantastic.
>
>> What if a user enables compression like lzjb or even gzip? How would you like to take that into account before doing writes?
>>
>> What if a user creates a snapshot? How would you take that into account?
>
> Then it will be wrong; we do not take them into account. I do not care about those cases. It is already impossible to enforce that the cache tracking data is 100% correct all of the time.
>
> Imagine we somehow had a way to account for all of those cases you listed, one that would make me happy. Say the directory the user uses for the cache data is /usr/vice/cache (one standard path to put it). The OpenAFS client will put cache data in e.g. /usr/vice/cache/D0/V1 and a bunch of other files. If the user puts their own file in /usr/vice/cache/reallybigfile, our cache tracking information will always be off, in all current implementations. We have no control over it, and we do not try to solve that problem.
>
> I am treating cases like "what if the user creates a snapshot" as a similar situation. If someone does that and runs out of space, it is pretty easy to troubleshoot their system and say "you have a snapshot of the cache dataset; do not do that". Right now, if someone runs an OpenAFS client cache on ZFS and runs out of space, the only thing I can tell them is "don't use ZFS", which I don't want to do.
> If it works for _a_ configuration -- the default one -- that is all I am asking for.
>
>> I suspect that you are looking too closely for no real benefit. Especially if you don't want to dedicate a dataset to the cache, you have to expect other applications in the system to write to the same file system in different locations, over which you have no control and no ability to predict how much data will be written at all. Be it Linux, Solaris, BSD, ... the issue will be there.
>
> It is certainly possible for other applications to fill up the disk. We just need to ensure that we don't fill up the disk so much that we block other applications. You may think this is fruitless, and just from that description alone, it may be. But you must understand that without an accurate bound on the cache, well... we can eat up the disk a lot faster than other applications, without the user realizing it.
>
> --
> Andrew Deason
> adeason at sinenomine.net
On Sun, 20 Sep 2009 20:31:57 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> If you are just building a cache, why not just make a file system and put a reservation on it? Turn off auto snapshots and set other features as per best practices for your workload? In other words, treat it like we treat dump space.
>
> I think that we are getting caught up in trying to answer the question you ask rather than solving the problem you have... perhaps because we don't understand the problem.

Yes, possibly... some of these suggestions don't quite make a lot of sense to me. We can't just make a filesystem and put a reservation on it; we are just an application the administrator puts on a machine for it to access AFS. So I'm not sure when you are imagining we do that; when the client starts up? Or as part of the installation procedure? Requiring a separate filesystem seems unnecessarily restrictive.

And I still don't see how that helps. Making an fs with a reservation would definitely limit us to the specified space, but we still can't get an accurate picture of the current disk usage. I already mentioned why using statvfs is not usable with that commit delay.

But solving the general problem isn't necessary for me. If I could just get a ballpark estimate of the max overhead for a file, I would be fine. I haven't paid attention to it before, so I don't even have an intuitive feel for what it is.

--
Andrew Deason
adeason at sinenomine.net
On Sep 21, 2009, at 7:11 AM, Andrew Deason wrote:

> On Sun, 20 Sep 2009 20:31:57 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>
>> If you are just building a cache, why not just make a file system and put a reservation on it? Turn off auto snapshots and set other features as per best practices for your workload? In other words, treat it like we treat dump space.
>>
>> I think that we are getting caught up in trying to answer the question you ask rather than solving the problem you have... perhaps because we don't understand the problem.
>
> Yes, possibly... some of these suggestions don't quite make a lot of sense to me. We can't just make a filesystem and put a reservation on it; we are just an application the administrator puts on a machine for it to access AFS. So I'm not sure when you are imagining we do that; when the client starts up? Or as part of the installation procedure? Requiring a separate filesystem seems unnecessarily restrictive.
>
> And I still don't see how that helps. Making an fs with a reservation would definitely limit us to the specified space, but we still can't get an accurate picture of the current disk usage. I already mentioned why using statvfs is not usable with that commit delay.

OK, so the problem you are trying to solve is "how much stuff can I place in the remaining free space?" I don't think this is knowable for a dynamic file system like ZFS, where metadata is dynamically allocated.

> But solving the general problem isn't necessary for me. If I could just get a ballpark estimate of the max overhead for a file, I would be fine. I haven't paid attention to it before, so I don't even have an intuitive feel for what it is.

You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
 -- richard
On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> OK, so the problem you are trying to solve is "how much stuff can I place in the remaining free space?" I don't think this is knowable for a dynamic file system like ZFS, where metadata is dynamically allocated.

Yes. And I acknowledge that we can't know that precisely; I'm trying for an estimate of the bound.

> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.

Is that the max with copies=3? Assume copies=1; what is it then?

--
Andrew Deason
adeason at sinenomine.net
On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:

> On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>
>> OK, so the problem you are trying to solve is "how much stuff can I place in the remaining free space?" I don't think this is knowable for a dynamic file system like ZFS, where metadata is dynamically allocated.
>
> Yes. And I acknowledge that we can't know that precisely; I'm trying for an estimate of the bound.
>
>> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
>
> Is that the max with copies=3? Assume copies=1; what is it then?

1x size + 1 block.
 -- richard
On Mon, 21 Sep 2009 18:20:53 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
>
>> On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>>
>>> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
>>
>> Is that the max with copies=3? Assume copies=1; what is it then?
>
> 1x size + 1 block.

That seems to differ quite a bit from what I've seen; perhaps I am misunderstanding... is the "+ 1 block" of a different size than the recordsize? With recordsize=1k:

$ ls -ls foo
2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo

1024k vs 1130k

--
Andrew Deason
adeason at sinenomine.net
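[Spelling out the arithmetic behind that "1024k vs 1130k" comparison; ls -ls reports the allocation in 512-byte blocks.]

  2261 blocks * 512 bytes/block = 1,157,632 bytes  (~1130k allocated on disk)
  logical file size             = 1,048,576 bytes  ( 1024k)
  difference                    =   109,056 bytes  (~10.4% overhead at recordsize=1k)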
On Sep 22, 2009, at 8:07 AM, Andrew Deason wrote:

> On Mon, 21 Sep 2009 18:20:53 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>
>> On Sep 21, 2009, at 2:43 PM, Andrew Deason wrote:
>>
>>> On Mon, 21 Sep 2009 17:13:26 -0400 Richard Elling <richard.elling at gmail.com> wrote:
>>>
>>>> You don't know the max overhead for the file before it is allocated. You could guess at a max of 3x size + at least three blocks. Since you can't control this, it seems like the worst case is when copies=3.
>>>
>>> Is that the max with copies=3? Assume copies=1; what is it then?
>>
>> 1x size + 1 block.
>
> That seems to differ quite a bit from what I've seen; perhaps I am misunderstanding... is the "+ 1 block" of a different size than the recordsize? With recordsize=1k:
>
> $ ls -ls foo
> 2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo

Well, there it is. I suggest suitable guard bands.
 -- richard

> 1024k vs 1130k
>
> --
> Andrew Deason
> adeason at sinenomine.net
On Tue, 22 Sep 2009 13:26:59 -0400 Richard Elling <richard.elling at gmail.com> wrote:

> > That seems to differ quite a bit from what I've seen; perhaps I am misunderstanding... is the "+ 1 block" of a different size than the recordsize? With recordsize=1k:
> >
> > $ ls -ls foo
> > 2261 -rw-r--r-- 1 root root 1048576 Sep 22 10:59 foo
>
> Well, there it is. I suggest suitable guard bands.

So, you would say it's reasonable to assume the overhead will always be less than about 100k, or 10%? And to be sure... if we're rounding up to the next recordsize boundary, are we guaranteed to be able to get the recordsize from the blocksize reported by statvfs?

--
Andrew Deason
adeason at sinenomine.net