An issue was found with the NetBeans installer where
the installation was failing on a large ZFS filesystem.

This resulted in CR 6560644 (zfs statvfs f_frsize needs work).
The issue is that large filesystems can cause EOVERFLOW on
statvfs() calls. This behavior is documented in the statvfs(2)
man page, but I think we can do better.

The problem was initially reported against ZFS, and my first
fix was to zfs_statvfs(). But the VFS_STATVFS routines only
see a statvfs64, so they have no way to tell if they're
about to cause EOVERFLOW. This means that they would always
have to limit themselves to 32-bit values, which seems
like the wrong direction for a fix.

The change detects when cstatvfs32 is about to return EOVERFLOW,
and attempts to rescale the statvfs values so that we return valid
(albeit less precise) information.

Please send me any comments or suggestions, especially
with respect to conformance.

-Chris


Here's the suggested patch:

--- old/usr/src/uts/common/syscall/statvfs.c	Mon Aug 27 11:17:30 2007
+++ new/usr/src/uts/common/syscall/statvfs.c	Mon Aug 27 11:17:30 2007
@@ -19,7 +19,7 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2006 Sun Microsystems, Inc. All rights reserved.
+ * Copyright 2007 Sun Microsystems, Inc. All rights reserved.
  * Use is subject to license terms.
  */
 
@@ -31,7 +31,7 @@
  * under license from the Regents of the University of California.
  */
 
-#pragma ident "@(#)statvfs.c 1.19 06/05/17 SMI"
+#pragma ident "@(#)statvfs.c 1.20 07/08/27 SMI"
 
 /*
  * Get file system statistics (statvfs and fstatvfs).
@@ -76,6 +76,27 @@
  * Common routines for statvfs and fstatvfs.
  */
 
+/*
+ * Underlying VFS_STATVFS routines can't tell if they are going to cause
+ * EOVERFLOW with f_frsize, because all they see is a struct statvfs64.
+ * If EOVERFLOW was about to occur, try to adjust f_frsize upward so that
+ * we return valid (but less precise) block information.
+ */
+static void
+rescale_frsize(struct statvfs64 *sbp)
+{
+	/*
+	 * We're limited to 32 bits for the number of frsize blocks.
+	 * Find the smallest power of 2 that won't cause EOVERFLOW.
+	 */
+	while (sbp->f_blocks > UINT32_MAX && sbp->f_frsize < (1U << 31)) {
+		sbp->f_frsize <<= 1;
+		sbp->f_blocks >>= 1;
+		sbp->f_bfree >>= 1;
+		sbp->f_bavail >>= 1;
+	}
+}
+
 static int
 cstatvfs32(struct vfs *vfsp, struct statvfs32 *ubp)
 {
@@ -114,10 +135,28 @@
 	if (ds64.f_bfree == (fsblkcnt64_t)-1)
 		ds64.f_bfree = UINT32_MAX;
 
+	/*
+	 * If we're about to cause EOVERFLOW with any of the inode
+	 * counts, cap the value(s) at UINT32_MAX.
+	 */
+	if (ds64.f_files > UINT32_MAX)
+		ds64.f_files = UINT32_MAX;
+	if (ds64.f_ffree > UINT32_MAX)
+		ds64.f_ffree = UINT32_MAX;
+	if (ds64.f_favail > UINT32_MAX)
+		ds64.f_favail = UINT32_MAX;
+
+	/*
+	 * If necessary, try to avoid EOVERFLOW by increasing f_frsize.
+	 */
 	if (ds64.f_blocks > UINT32_MAX || ds64.f_bfree > UINT32_MAX ||
-	    ds64.f_bavail > UINT32_MAX || ds64.f_files > UINT32_MAX ||
-	    ds64.f_ffree > UINT32_MAX || ds64.f_favail > UINT32_MAX)
+	    ds64.f_bavail > UINT32_MAX)
+		rescale_frsize(&ds64);
+
+	if (ds64.f_blocks > UINT32_MAX || ds64.f_bfree > UINT32_MAX ||
+	    ds64.f_bavail > UINT32_MAX)
 		return (EOVERFLOW);
+
 #ifdef _LP64
 	/*
 	 * On the 64-bit kernel, even these fields grow to 64-bit
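
The rescaling loop in the patch can be tried outside the kernel. The following is only a user-space sketch: `struct fs_counts` and `rescale_frsize_sketch()` are hypothetical names standing in for the block-related fields of `struct statvfs64` and for the patch's `rescale_frsize()`, and the field types are assumed to be 64-bit.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the block-related fields of struct statvfs64.
 * All counts are in units of f_frsize bytes.
 */
struct fs_counts {
	uint64_t f_frsize;
	uint64_t f_blocks;
	uint64_t f_bfree;
	uint64_t f_bavail;
};

/*
 * Same loop as the patch: double the unit size and halve the counts
 * until the total block count fits in 32 bits, or until f_frsize
 * reaches 2^31.
 */
static void
rescale_frsize_sketch(struct fs_counts *c)
{
	while (c->f_blocks > UINT32_MAX && c->f_frsize < (1U << 31)) {
		c->f_frsize <<= 1;
		c->f_blocks >>= 1;
		c->f_bfree >>= 1;
		c->f_bavail >>= 1;
	}
}
```

For a filesystem of exactly 2^50 bytes reported in 512-byte units (2^41 blocks), the loop runs ten times and ends with f_frsize at 512 KB and f_blocks at 2^31.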
Can you explain in a bit more detail why we're doing this? I probably
don't understand the issue in sufficient detail. It seems like the
large file compilation environment, lfcompile(5), exists to solve this
exact problem. Isn't it the application's responsibility to properly
handle EOVERFLOW or choose an interface that can handle file offsets
that are greater than 2Gbytes? Is there something obvious here that I'm
missing?

-j

On Mon, Aug 27, 2007 at 03:38:26PM -0500, Chris Kirby wrote:
> An issue was found with the NetBeans installer where
> the installation was failing on a large ZFS filesystem.
>
> This resulted in CR 6560644 (zfs statvfs f_frsize needs work).
> The issue is that large filesystems can cause EOVERFLOW on
> statvfs() calls. This behavior is documented in the statvfs(2)
> man page, but I think we can do better.
>
> The problem was initially reported against ZFS, and my first
> fix was to zfs_statvfs(). But the VFS_STATVFS routines only
> see a statvfs64, so they have no way to tell if they're
> about to cause EOVERFLOW. This means that they would always
> have to limit themselves to 32-bit values, which seems
> like the wrong direction for a fix.
>
> The change detects when cstatvfs32 is about to return EOVERFLOW,
> and attempts to rescale the statvfs values so that we return valid
> (albeit less precise) information.
>
> Please send me any comments or suggestions, especially
> with respect to conformance.
>
> -Chris
johansen-osdev at sun.com wrote:
> Can you explain in a bit more detail why we're doing this? I probably
> don't understand the issue in sufficient detail. It seems like the
> large file compilation environment, lfcompile(5), exists to solve this
> exact problem. Isn't it the application's responsibility to properly
> handle EOVERFLOW or choose an interface that can handle file offsets
> that are greater than 2Gbytes? Is there something obvious here that I'm
> missing?
>

It's not a large file issue, it's a large *filesystem* issue
that revolves around f_frsize unit reporting via the cstatvfs32
interface. f_blocks, f_bfree, and f_bavail are all reported in
units of f_frsize.

For ZFS, we report f_frsize as 512 regardless of the size of
the fs. This means we can only express vfs size up to
UINT32_MAX * 512 bytes. That's not a terribly large fs
by today's standards. Anything larger will result in EOVERFLOW
from statvfs.

You're entirely correct that it's the application's responsibility
to deal with EOVERFLOW, perhaps by using statvfs64. But if we can
return valid information instead of an error, that seems like a
good thing.

-Chris
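
To put a number on the "UINT32_MAX * 512 bytes" limit mentioned above, here is a small sketch. The helper name `max_reportable_bytes()` is hypothetical; it only multiplies the largest 32-bit block count by the unit size.

```c
#include <stdint.h>

/*
 * Largest filesystem size, in bytes, that a 32-bit block count can
 * describe when each block is frsize bytes. The helper name is
 * hypothetical and used for illustration only.
 */
static uint64_t
max_reportable_bytes(uint64_t frsize)
{
	return ((uint64_t)UINT32_MAX * frsize);
}
```

With frsize set to 512 this gives 2199023255040 bytes, which is 512 bytes short of 2 TB (2^41).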
> This could be a bad thing if the application thinks that all of the
> information in the statvfs structure is valid, when only some of it is
> valid. Why not set the invalid fields in the statvfs structure to
> something distinctive like -EOVERFLOW instead of the proposed change of
> setting such fields to the "smallest power of 2 that won't cause
> EOVERFLOW?" It's better to return something that's obviously garbage
> than it is to return something that looks like it's valid when it's not?

I don't think that Chris is proposing that we return invalid data to the
application. Rather, I think he's talking about increasing the size
that we report blocks to be in the statvfs structure. statvfs(2)
explains this in a little more detail. Basically, f_frsize is a scaling
factor for f_blocks, f_bfree, and f_bavail.

IIUC, Chris is proposing adjusting the scaling factor to avoid integer
overflow. Thus, if f_frsize is 512b and f_blocks is UINT32_MAX, perhaps
we could set f_frsize to 1024b and have f_blocks be 2^31 instead. This
means that there's a potential to overreport the amount of space used.
However, we already have and use a scaling factor, so we've been doing
this since day one. This seems to be a philosophical question, not one
of correctness.

Chris, correct me if I got any of that wrong.

-j
Chris,
There are several issues here. Please find comments in-line below...

	Cheers,
	Don

>Date: Mon, 27 Aug 2007 16:04:42 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>To: johansen-osdev at sun.com
>Cc: ufs-discuss at opensolaris.org, don.cragun at sun.com, zfs-code at opensolaris.org
>
>johansen-osdev at sun.com wrote:
>> Can you explain in a bit more detail why we're doing this? I probably
>> don't understand the issue in sufficient detail. It seems like the
>> large file compilation environment, lfcompile(5), exists to solve this
>> exact problem. Isn't it the application's responsibility to properly
>> handle EOVERFLOW or choose an interface that can handle file offsets
>> that are greater than 2Gbytes? Is there something obvious here that I'm
>> missing?
>>
>
>It's not a large file issue, it's a large *filesystem* issue
>that revolves around f_frsize unit reporting via the cstatvfs32
>interface. f_blocks, f_bfree, and f_bavail are all reported in
>units of f_frsize.

ZFS has large file and large filesystem issues. But, those of us who
participated in the Large File Summit (vendors and consumers who
jointly produced the Large File Summit Specification and did the work
to get the non-transitional interfaces integrated into X/Open's (now
The Open Group's) X/Open Portability Guide, Issue 4, Version 2)
remember the discussions that led to the creation of the EOVERFLOW
errno value.

The base point is that any time you lie to applications, some
application software is going to make wrong decisions based on the lie.

>
>For ZFS, we report f_frsize as 512 regardless of the size of
>the fs. ...

Why? Why shouldn't you always set f_frsize to the actual size of an
allocation unit on the filesystem? Is it still true that we don't
support disks formatted with 1024 byte sectors?

> ... This means we can only express vfs size up to
>UINT32_MAX * 512 bytes. That's not a terribly large fs
>by today's standards. Anything larger will result in EOVERFLOW
>from statvfs.
>
>You're entirely correct that it's the application's responsibility
>to deal with EOVERFLOW, perhaps by using statvfs64. But if we can
>return valid information instead of an error, that seems like a
>good thing.

When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
correct values for these fields are larger; you are not returning valid
information.

You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
and f_bavail; but you aren't checking to see if that is true or not.
(If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
that was not a zero bit; the scaled values being returned are not
valid.)

Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
relationship between f_bsize and f_frsize, applications may well have
made their own assumptions. Is there documentation somewhere that
specifies how many bytes should be written at a time (on boundaries
that are a multiple of that value) to get the most efficiency out of
the underlying hardware? I would hope that f_bsize would be that
value. If it is, it seems that f_bsize should be an integral multiple
of f_frsize.

>
>-Chris
Don, thanks for your comments, please see below:

Don Cragun wrote:
>>Date: Mon, 27 Aug 2007 16:04:42 -0500
>>From: Chris Kirby <chris.kirby at sun.com>

> The base point is that any time you lie to applications, some
> application software is going to make wrong decisions based on the lie.

Yes, and we certainly don't want to lie. But returning an
error when we can return valid (albeit less precise) info
will also cause applications to make wrong decisions.

In the case of the NetBeans installer, it died because
it thought there wasn't enough free space when in fact,
there were several TB of space available.

I suspect that most apps that use f_bfree/f_bavail just
want to know if they have enough space to write their
data.

>
>>For ZFS, we report f_frsize as 512 regardless of the size of
>>the fs. ...
>
>
> Why? Why shouldn't you always set f_frsize to the actual size of an
> allocation unit on the filesystem? Is it still true that we don't
> support disks formatted with 1024 byte sectors?

For ZFS, we don't have a fixed allocation block size so in general
there won't be one true f_frsize across an entire VFS. So we return
SPA_MINBLOCKSIZE (512) for f_frsize.

> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
> correct values for these fields are larger; you are not returning valid
> information.

I think it's valid in the sense that you will be able to create at
least UINT32_MAX files. Of course once you've done so,
we might still report that you can create UINT32_MAX
additional files. :-)

Any application making a decision on an available file count such
that UINT32_MAX is not enough, but UINT32_MAX+1 would be OK, should be
using the correct largefile syscalls like statvfs64.

>
> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
> and f_bavail; but you aren't checking to see if that is true or not.
> (If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
> that was not a zero bit; the scaled values being returned are not
> valid.)

You're right that we're discarding some bytes through the scaling
process. However, any non-zero bits that are discarded are effectively
partial f_frsize blocks. For any filesystem large enough to get into
this situation, we're talking about a relatively *very* small
amount of rounding down. (e.g. for a 1PB fs, f_frsize is
only 256K)

Remember that the fs code can be doing delayed writes, delayed
allocation, background delete processing, etc. So the statvfs
values are just rumors anyway. Most filesystems don't even bother
to grab a lock when reporting statvfs info.

>
> Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
> relationship between f_bsize and f_frsize, applications may well have
> made their own assumptions. Is there documentation somewhere that
> specifies how many bytes should be written at a time (on boundaries
> that is a multiple of that value) to get the most efficiency out of
> the underlying hardware? I would hope that f_bsize would be that
> value. If it is, it seems that f_bsize should be an integral multiple
> of f_frsize.

Aside from the comment in statvfs(2) about f_bsize being the
"preferred file system block size", I can't find any documentation
that talks about that.

For filesystems that support direct I/O, f_bsize has traditionally
provided the most efficient I/O size multiplier.

But the setting of f_bsize is up to the underlying fs. And at least
for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
That's also why I don't rescale f_bsize.

-Chris
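
The rounding discussed in this exchange can be made concrete. This is a sketch only: `rescale_loss_bytes()` is a hypothetical helper, not part of the patch, and it assumes the shift is less than 64. It returns the number of bytes that vanish when a block count is shifted right, so a nonzero result is exactly the "discarded non-zero bit" case Don describes.

```c
#include <stdint.h>

/*
 * Bytes dropped when a count of frsize-byte blocks is shifted right
 * by `shift` bits (that is, when the unit grows to frsize << shift).
 * Hypothetical helper; assumes shift < 64.
 */
static uint64_t
rescale_loss_bytes(uint64_t blocks, uint64_t frsize, unsigned shift)
{
	/* The low `shift` bits are the blocks that do not fill a new unit. */
	uint64_t dropped = blocks & ((1ULL << shift) - 1);

	return (dropped * frsize);
}
```

The loss is always smaller than the new unit size: with 512-byte blocks and a shift of 10 (a 512 KB unit), the worst case is 1023 dropped blocks, or 523776 bytes.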
Chris,
More comments below...

	Cheers,
	Don

>Date: Tue, 28 Aug 2007 17:23:58 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>
>Don, thanks for your comments, please see below:
>
>
>Don Cragun wrote:
>>>Date: Mon, 27 Aug 2007 16:04:42 -0500
>>>From: Chris Kirby <chris.kirby at sun.com>
>
>
>> The base point is that any time you lie to applications, some
>> application software is going to make wrong decisions based on the lie.
>
>
>Yes, and we certainly don't want to lie. But returning an
>error when we can return valid (albeit less precise) info
>will also cause applications to make wrong decisions.

Yes. But, if you give them an error, the application writer will soon
be notified that their application isn't working on ZFS due to an
EOVERFLOW error. If you lie <--- give them less precise info, other
applications will make bad assumptions and fail later in strange and
wondrous ways (with no indication that an overflowed value was the
culprit).

>
>In the case of the NetBeans installer, it died because
>it thought there wasn't enough free space when in fact,
>there were several TB of space available.
>
>I suspect that most apps that use f_bfree/f_bavail just
>want to know if they have enough space to write their
>data.

Unfortunately, ZFS has no way of knowing when the application calling
statvfs() is not one of those apps.

>
>
>>
>>>For ZFS, we report f_frsize as 512 regardless of the size of
>>>the fs. ...
>>
>>
>> Why? Why shouldn't you always set f_frsize to the actual size of an
>> allocation unit on the filesystem? Is it still true that we don't
>> support disks formatted with 1024 byte sectors?
>
>
>For ZFS, we don't have a fixed allocation block size so in general
>there won't be one true f_frsize across an entire VFS. So we return
>SPA_MINBLOCKSIZE (512) for f_frsize.
>
>
>> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>> correct values for these fields are larger; you are not returning valid
>> information.
>
>
>I think it's valid in the sense that you will be able to create at
>least UINT32_MAX files. Of course once you've done so,
>we might still report that you can create UINT32_MAX
>additional files. :-)

You may also find an app (A) that checks number of free files, removes
a few and then crashes because it was supposed to be running on a quiet
machine and has now detected that some other app (B) is creating files
as fast as A can remove them. (B doesn't really exist, but since the
number of free files isn't rising, A has to assume that B is active.)

>
>Any application making a decision on an available file count such
>that UINT32_MAX is not enough, but UINT32_MAX+1 would be OK, should be
>using the correct largefile syscalls like statvfs64.

And the only way for that app to detect that this is what is going on
is for statvfs() to fail with an EOVERFLOW error.

>
>>
>> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
>> and f_bavail; but you aren't checking to see if that is true or not.
>> (If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
>> that was not a zero bit; the scaled values being returned are not
>> valid.)
>
>
>You're right that we're discarding some bytes through the scaling
>process. However, any non-zero bits that are discarded are effectively
>partial f_frsize blocks. For any filesystem large enough to get into
>this situation, we're talking about a relatively *very* small
>amount of rounding down. (e.g. for a 1PB fs, f_frsize is
>only 256K)
>
>Remember that the fs code can be doing delayed writes, delayed
>allocation, background delete processing, etc. So the statvfs
>values are just rumors anyway. Most filesystems don't even bother
>to grab a lock when reporting statvfs info.
>
>>
>> Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
>> relationship between f_bsize and f_frsize, applications may well have
>> made their own assumptions. Is there documentation somewhere that
>> specifies how many bytes should be written at a time (on boundaries
>> that is a multiple of that value) to get the most efficiency out of
>> the underlying hardware? I would hope that f_bsize would be that
>> value. If it is, it seems that f_bsize should be an integral multiple
>> of f_frsize.
>
>Aside from the comment in statvfs(2) about f_bsize being the
>"preferred file system block size", I can't find any documentation
>that talks about that.
>
>For filesystems that support direct I/O, f_bsize has traditionally
>provided the most efficient I/O size multiplier.
>
>But the setting of f_bsize is up to the underlying fs. And at least
>for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
>That's also why I don't rescale f_bsize.

Correct. I'm not suggesting that statvfs() should scale f_bsize; I'm
saying that if you scale f_frsize, some application may think its
world has turned upside down because the relationship it thought
existed between f_frsize and f_bsize is no longer true.

I believe statvfs() should be returning an error condition with errno
set to EOVERFLOW and that applications that run into the EOVERFLOW
should be fixed to handle the brave new world of large filesystems.

By the logic you're using, we would not have needed to change the df
utility to be large filesystem aware; we should have just let it
truncate the number of blocks it said were available for all
filesystems to 32-bit values. For a sysadmin that wants to know if the
ZFS filesystem that was just created came out at the correct size, this
clearly is not sufficient; but for "most" users who just want to know
if there is room to create a file, it will meet their needs perfectly.

For any particular call to statvfs(), the system won't know whether the
discarded low order bits of f_blocks, f_bfree, and f_bavail and the
high order bits of f_files, f_ffree, and f_favail are important or
not. The only safe thing to do is report the overflows and fix the
applications that get the resulting EOVERFLOW errors.

>
>
>-Chris
Chris Kirby <chris.kirby at sun.com> writes:
> Don, thanks for your comments, please see below:
>
> Don Cragun wrote:
>>>Date: Mon, 27 Aug 2007 16:04:42 -0500 From: Chris Kirby
>>><chris.kirby at sun.com>
>
>
>> The base point is that any time you lie to applications, some
>> application software is going to make wrong decisions based on the
>> lie.
>
>
> Yes, and we certainly don't want to lie. But returning an error when
> we can return valid (albeit less precise) info will also cause
> applications to make wrong decisions.
>
> In the case of the NetBeans installer, it died because it thought
> there wasn't enough free space when in fact, there were several TB of
> space available.
>
> I suspect that most apps that use f_bfree/f_bavail just want to know
> if they have enough space to write their data.

I don't claim to be an expert in this area (or any other) but all this
seems very clearly to be an application bug. If the application asks
"how much free space is there on this filesystem" and the system says
"More space than your data structure can handle" then surely the
correct response is:

"Since the data I'm installing *will* fit into a 32 bit size and the
free space *won't*, there is no problem, just install."

Or,

"The data I'm installing *won't* fit into a 32 bit size so using a 32
bit struct to get the available space was stupid in the first place"

>>>For ZFS, we report f_frsize as 512 regardless of the size of the fs.
>>>...
>>
>>
>> Why? Why shouldn't you always set f_frsize to the actual size of an
>> allocation unit on the filesystem? Is it still true that we don't
>> support disks formatted with 1024 byte sectors?
>
>
> For ZFS, we don't have a fixed allocation block size so in general
> there won't be one true f_frsize across an entire VFS. So we return
> SPA_MINBLOCKSIZE (512) for f_frsize.
>
>
>> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>> correct values for these fields are larger; you are not returning
>> valid information.
>
>
> I think it's valid in the sense that you will be able to create at
> least UINT32_MAX files. Of course once you've done so, we might still
> report that you can create UINT32_MAX additional files. :-)
>
> Any application making a decision on an available file count such
> that UINT32_MAX is not enough, but UINT32_MAX+1 would be OK, should be
> using the correct largefile syscalls like statvfs64.
>
>>
>> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
>> and f_bavail; but you aren't checking to see if that is true or
>> not. (If shifting f_blocks, f_bfree, or f_bavail right throws away a
>> bit that was not a zero bit; the scaled values being returned are not
>> valid.)
>
>
> You're right that we're discarding some bytes through the scaling
> process. However, any non-zero bits that are discarded are
> effectively partial f_frsize blocks. For any filesystem large enough
> to get into this situation, we're talking about a relatively *very*
> small amount of rounding down. (e.g. for a 1PB fs, f_frsize is only
> 256K)
>
> Remember that the fs code can be doing delayed writes, delayed
> allocation, background delete processing, etc. So the statvfs values
> are just rumors anyway. Most filesystems don't even bother to grab a
> lock when reporting statvfs info.

Not to mention that (even if it's 100% accurate) the amount of free
space *now* is not valid at any time other than *now*. Assuming that if
there's 300GB available now it means I can write 300GB is not valid in
the face of other writers.
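
On the application side, the thread suggests either using a 64-bit interface or treating EOVERFLOW as its own condition instead of as "no space". The sketch below assumes a POSIX system where defining `_FILE_OFFSET_BITS` to 64 selects the large-file variant of statvfs() (as lfcompile(5) describes); `avail_bytes()` is a hypothetical helper name. Note that EOVERFLOW by itself does not prove there is plenty of free space, since the total block count can overflow on a large but nearly full filesystem.

```c
#define _FILE_OFFSET_BITS 64	/* assumption: selects large-file variants */
#include <errno.h>
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Hypothetical helper: on success, store the space available to
 * unprivileged users (in bytes) in *out and return 0. On failure,
 * return the errno value from statvfs() so the caller can tell
 * EOVERFLOW apart from other errors and from "not enough space".
 */
static int
avail_bytes(const char *path, uint64_t *out)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return (errno);

	/* f_bavail is counted in units of f_frsize bytes. */
	*out = (uint64_t)sv.f_bavail * (uint64_t)sv.f_frsize;
	return (0);
}
```

A caller such as an installer could compare *out against the bytes it needs when the return value is 0, and report a distinct diagnostic when the return value is EOVERFLOW.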
Don Cragun wrote:
>>From: Chris Kirby <chris.kirby at sun.com>

>>>When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>>>correct values for these fields are larger; you are not returning valid
>>>information.
>>
>>
>>I think it's valid in the sense that you will be able to create at
>>least UINT32_MAX files. Of course once you've done so,
>>we might still report that you can create UINT32_MAX
>>additional files. :-)
>
>
> You may also find an app (A) that checks number of free files, removes
> a few and then crashes because it was supposed to be running on a quiet
> machine and has now detected that some other app (B) is creating files
> as fast as A can remove them. (B doesn't really exist, but since the
> number of free files isn't rising, A has to assume that B is active.)

Remember that f_ffiles is in no way under an application's
control. That value usually represents an inode count, which
is not necessarily updated synchronously at unlink time. The
fs itself might be doing background activity (garbage collection)
that could cause this count to fluctuate on an otherwise idle system.

Plus, the fs is free to report whatever number it wants here. It can
even be the same number all the time. QFS, which does dynamic
inode allocation, always reports f_ffiles as -1.

None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
values are rumors at best anyway.

>>
>>But the setting of f_bsize is up to the underlying fs. And at least
>>for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
>>That's also why I don't rescale f_bsize.
>
>
> Correct. I'm not suggesting that statvfs() should scale f_bsize; I'm
> saying that if you scale f_frsize, some application may think its
> world has turned upside down because the relationship it thought
> existed between f_frsize and f_bsize is no longer true.

Given that we don't document any relationship between those two
fields, I'm not sure why that would be a valid assumption.

AFAICS, these values aren't even required to be constant for
consecutive calls to the same vfs.

>
> I believe statvfs() should be returning an error condition with errno
> set to EOVERFLOW and that applications that run into the EOVERFLOW
> should be fixed to handle the brave new world of large filesystems.
>
> By the logic you're using, we would not have needed to change the df
> utility to be large filesystem aware; we should have just let it
> truncate the number of blocks it said were available for all
> filesystems to 32-bit values. For a sysadmin that wants to know if the
> ZFS filesystem that was just created came out at the correct size, this
> clearly is not sufficient; but for "most" users who just want to know
> if there is room to create a file, it will meet their needs perfectly.

I'm not suggesting that we just truncate the number of blocks.
We're also adjusting the unit size (f_frsize) so that
(f_blocks * f_frsize) is the same before and after the rescaling.

Yes, there might be a rounding error of up to (f_frsize - 1) *bytes*.
So on a 1PB fs, we might be off by (256K - 1) bytes. That's
0.000000026 percent. It doesn't seem like it would generate
many support calls.

>
> For any particular call to statvfs(), the system won't know whether the
> discarded low order bits of f_blocks, f_bfree, and f_bavail and the
> high order bits of f_files, f_ffree, and f_favail are important or
> not. The only safe thing to do is report the overflows and fix the
> applications that get the resulting EOVERFLOW errors.

Even if an application decides that those bits were important (and
we're talking about the very last tiny bit of space on the fs),
there are currently no guarantees that those few bytes would be
available to a user anyway. Growing a file by one byte could still
result in ENOSPC if the fs needed more than the remaining bytes for
its own data structures.

I agree that broken apps should be fixed. But that's not always
possible. And I guess I still don't see where we would be breaking
any well-behaved applications.

-Chris
Chris,
	If ZFS is ever going to pass the test suites so we can claim
that ZFS meets POSIX and UNIX filesystem requirements, you'll find that
the conformance test suites will be trying things similar to what I've
been describing.  The test suites certainly aren't perfect, but over
time they get more and more touchy about lies.  If the test suites get
an EOVERFLOW when requesting information, they'll report an unresolved
status along with what they were trying to test, and specify that human
intervention (in this case an explanation) will be required to get
certification.  If the test suite catches you in a lie, you'll have to
file for a test suite waiver and explain why the test suite is wrong.
If statvfs() says there are X files, blocks, ... available when there
are 2X or X+1, there is a very low chance of getting the waiver.  At
that point you will have the option of not certifying your product or
fixing your code so it passes the test.

	You asked for my standards opinion.  My opinion is that the
change you're suggesting does not comply with standards requirements.

 - Don

>Date: Wed, 29 Aug 2007 10:55:34 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>To: Don Cragun <don.cragun at sun.com>
>Cc: johansen-osdev at sun.com, ufs-discuss at opensolaris.org, zfs-code at opensolaris.org
>
>Don Cragun wrote:
>>>From: Chris Kirby <chris.kirby at sun.com>
>
>>>>When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>>>>correct values for these fields are larger, you are not returning valid
>>>>information.
>>>
>>>I think it's valid in the sense that you will be able to create at
>>>least UINT32_MAX files.  Of course once you've done so,
>>>we might still report that you can create UINT32_MAX
>>>additional files. :-)
>>
>> You may also find an app (A) that checks number of free files, removes
>> a few and then crashes because it was supposed to be running on a quiet
>> machine and has now detected that some other app (B) is creating files
>> as fast as A can remove them.  (B doesn't really exist, but since the
>> number of free files isn't rising, A has to assume that B is active.)
>
>Remember that f_ffiles is in no way under an application's
>control.  That value usually represents an inode count, which
>is not necessarily updated synchronously at unlink time.  The
>fs itself might be doing background activity (garbage collection)
>that could cause this count to fluctuate on an otherwise idle system.
>
>Plus, the fs is free to report whatever number it wants here.  It can
>even be the same number all the time.  QFS, which does dynamic
>inode allocation, always reports f_ffiles as -1.
>
>None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>values are rumors at best anyway.
>
>>>
>>>But the setting of f_bsize is up to the underlying fs.  And at least
>>>for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
>>>That's also why I don't rescale f_bsize.
>>
>> Correct.  I'm not suggesting that statvfs() should scale f_bsize; I'm
>> saying that if you scale f_frsize, some application may think its
>> world has turned upside down because the relationship it thought
>> existed between f_frsize and f_bsize is no longer true.
>
>Given that we don't document any relationship between those two
>fields, I'm not sure why that would be a valid assumption.
>
>AFAICS, these values aren't even required to be constant for
>consecutive calls to the same vfs.
>
>>
>> I believe statvfs() should be returning an error condition with errno
>> set to EOVERFLOW and that applications that run into the EOVERFLOW
>> should be fixed to handle the brave new world of large filesystems.
>>
>> By the logic you're using, we would not have needed to change the df
>> utility to be large filesystem aware; we should have just let it
>> truncate the number of blocks it said were available for all
>> filesystems to 32-bit values.  For a sysadmin that wants to know if the
>> ZFS filesystem that was just created came out at the correct size, this
>> clearly is not sufficient; but for "most" users who just want to know
>> if there is room to create a file, it will meet their needs perfectly.
>
>I'm not suggesting that we just truncate the number of blocks.
>We're also adjusting the unit size (f_frsize) so that
>(f_blocks * f_frsize) is the same before and after the rescaling.
>
>Yes, there might be a rounding error of up to (f_frsize - 1) *bytes*.
>So on a 1PB fs, we might be off by (256K - 1) bytes.  That's
>0.000000026 percent.  It doesn't seem like it would generate
>many support calls.
>
>>
>> For any particular call to statvfs(), the system won't know whether the
>> discarded low order bits of f_blocks, f_bfree, and f_bavail and the
>> high order bits of f_files, f_ffree, and f_favail are important or
>> not.  The only safe thing to do is report the overflows and fix the
>> applications that get the resulting EOVERFLOW errors.
>
>Even if an application decides that those bits were important (and
>we're talking about the very last tiny bit of space on the fs),
>there are currently no guarantees that those few bytes would be
>available to a user anyway.  Growing a file by one byte could still
>result in ENOSPC if the fs needed more than the remaining bytes for
>its own data structures.
>
>I agree that broken apps should be fixed.  But that's not always
>possible.  And I guess I still don't see where we would be breaking
>any well-behaved applications.
>
>-Chris
>
On Wed, 29 Aug 2007, Chris Kirby wrote:

[ ... ]
> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
> values are rumors at best anyway.

That is not correct for UFS. This piece:

	/*
	 * Adjust the numbers based on things waiting to be deleted.
	 * modifies f_bfree and f_ffree.  Afterwards, everything we
	 * come up with will be self-consistent.  By definition, this
	 * is a point-in-time snapshot, so the fact that the delete
	 * thread's probably already invalidated the results is not a
	 * problem.  Note that if the delete thread is ever extended to
	 * non-logging ufs, this adjustment must always be made.
	 */
	if (TRANS_ISTRANS(ufsvfsp))
		ufs_delete_adjust_stats(ufsvfsp, sp);

in ufs_statvfs() is grabbing locks. And it was introduced because some
"nitpickers" claimed guaranteed point-in-time consistency of statvfs()
is mandated by the standards, and arguing that "the values are out of
date as soon as the syscall returns no matter what" wouldn't matter in
that context.

For the gory details, see:

	5012326 delay between unlink/close and space becoming available
	        may be arbitrarily long

and then:

	6251659 statvfs taking 30 second to complete

Still wondering sometimes whether it'd been the right thing to claim
"statvfs must be 100% accurate".

[ ... ]
> I'm not suggesting that we just truncate the number of blocks.
> We're also adjusting the unit size (f_frsize) so that
> (f_blocks * f_frsize) is the same before and after the rescaling.
>
> Yes, there might be a rounding error of up to (f_frsize - 1) *bytes*.
> So on a 1PB fs, we might be off by (256K - 1) bytes.  That's
> 0.000000026 percent.  It doesn't seem like it would generate
> many support calls.

Even if. As was mentioned about ZFS behaviour a few times, ZFS will
"compactify" small files anyway (meaning that, even if the FS were full
to that degree, extending _small_ files may be possible), and since it
optionally also does disk compression the amount of free space can only
be estimated.

The point, in particular for ZFS, can be made that the accuracy of all
statvfs() return values is, by design, not to the 'least significant
bit'. Hence, as long as things like "space free" are the lower bound of
what really is free, it's performing its task. Don's statement about the
standard requiring "X free - not X+1" is, hmm, to say it politically
correct, "difficult to meet with a filesystem that does
compress/tail-compact".

The "X not X-1" is, on the other hand, to be taken very seriously. I
cannot claim to have free space if I haven't, and that's what the above
UFS bugfixes have been about.

I agree with Don that we can't lie about having free space if we don't.
I also agree that statvfs() and statvfs64() shouldn't return different
values. But I don't see why we couldn't round up the used space / round
(down) the free space to some value/blocksize chosen by the filesystem
itself. We just have to make this consistent. If:

	statvfs()
	statvfs64()
	64bit statvfs()

all return the same values, and:

	- freeing a unit fs_frsize adds '1' to fs_blkfree
	- allocating a unit fs_frsize subtracts '1' from fs_blkfree
	- an alloc attempt when fs_blkfree is zero must fail

are met, we're standards compliant, aren't we ?

The standard, for statvfs(), doesn't make any statement about what
happens on attempts to allocate/free amounts _less than_ fs_frsize
bytes, except that "if I release N bytes I must be able to allocate N
bytes again - to the same file". As I read the proposal, the blocked
behaviour is exactly what's being requested, isn't it ?

We have to remember that statvfs returns _BLOCKED_ units. Space
(non)availability is judged in these units. And actions like:

	- release fs_frsize/N from N different files ==> fs_blkfree++
	- fs_blkfree == 1, alloc fs_frsize/N to N different files succeeds

are _NOT_ covered by the standard as "must (not) fail".

If I'm wrong about this, then please provide pointers. I'm very
interested.

Thanks,
FrankH.

>> For any particular call to statvfs(), the system won't know whether the
>> discarded low order bits of f_blocks, f_bfree, and f_bavail and the
>> high order bits of f_files, f_ffree, and f_favail are important or
>> not.  The only safe thing to do is report the overflows and fix the
>> applications that get the resulting EOVERFLOW errors.
>
> Even if an application decides that those bits were important (and
> we're talking about the very last tiny bit of space on the fs),
> there are currently no guarantees that those few bytes would be
> available to a user anyway.  Growing a file by one byte could still
> result in ENOSPC if the fs needed more than the remaining bytes for
> its own data structures.
>
> I agree that broken apps should be fixed.  But that's not always
> possible.  And I guess I still don't see where we would be breaking
> any well-behaved applications.

A 32bit application that uses statvfs() in a non-largefile compile
environment - aka not getting statvfs64() - isn't "well behaved". It's
been way over a decade since the LF64 interfaces were introduced. That's
a while before anyone used the term 'Java' or 'Netbeans Installer'.

> -Chris
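The rounding rule proposed in the message above (report free space rounded down and used space rounded up to whole fs_frsize units) can be sketched in a few lines of C. This is only an illustration: free_units() and used_units() are hypothetical helper names, not functions from any Solaris source, and the byte counts are assumed to be known exactly.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not taken from any Solaris source): convert byte
 * counts into frsize-sized units following the rounding rule suggested
 * in the message above.
 */

/* Free space rounds down, so the report never exceeds what is really free. */
static uint64_t
free_units(uint64_t free_bytes, uint64_t frsize)
{
	assert(frsize > 0);
	return (free_bytes / frsize);
}

/* Used space rounds up, so a partly used unit counts as fully used. */
static uint64_t
used_units(uint64_t used_bytes, uint64_t frsize)
{
	assert(frsize > 0);
	return (used_bytes / frsize + (used_bytes % frsize != 0 ? 1 : 0));
}
```

With this rule the reported free space is always a lower bound on the real free space, which is the "X not X-1" property. The repeated `>>= 1` on f_bfree and f_bavail in the proposed rescale_frsize() appears to round the same way, since halving and discarding the low bit k times equals one floor division by 2^k.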
Frank Hofmann wrote:
> On Wed, 29 Aug 2007, Chris Kirby wrote:
>
> [ ... ]
>
>> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>> values are rumors at best anyway.
>
> That is not correct for UFS. This piece:
>
>	/*
>	 * Adjust the numbers based on things waiting to be deleted.
>	 * modifies f_bfree and f_ffree.  Afterwards, everything we
>	 * come up with will be self-consistent.  By definition, this
>	 * is a point-in-time snapshot, so the fact that the delete
>	 * thread's probably already invalidated the results is not a
>	 * problem.  Note that if the delete thread is ever extended to
>	 * non-logging ufs, this adjustment must always be made.
>	 */
>	if (TRANS_ISTRANS(ufsvfsp))
>		ufs_delete_adjust_stats(ufsvfsp, sp);
>
> in ufs_statvfs() is grabbing locks. And it was introduced because some
> "nitpickers" claimed guaranteed point-in-time consistency of statvfs()
> is mandated by the standards, and arguing that "the values are out of
> date as soon as the syscall returns no matter what" wouldn't matter in
> that context.

Ouch, that's expensive.

I still can't find a reference showing that the standards actually
require that sort of behavior.  In fact, there doesn't seem to be a lot
of information about what *is* required of statvfs by the standards.

One interesting quote is from Apple's statvfs(3) man page:

STANDARDS
     The statvfs() and fstatvfs() functions conform to IEEE Std
     1003.1-2001 (``POSIX.1'').  As standardized, portable applications
     cannot depend on these functions returning any valid information at
     all.  This implementation attempts to provide as much useful
     information as is provided by the underlying file system, subject
     to the limitations of the specified data types.
Another is from the Open Group page on statvfs:

http://www.opengroup.org/onlinepubs/000095399/functions/statvfs.html

"It is unspecified whether all members of the statvfs structure have
meaningful values on all file systems."

I suspect that the conformance tests for statvfs are a bit overzealous
in their expectations.

-Chris
Chris Kirby <chris.kirby at sun.com> writes:
> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
> values are rumors at best anyway.

Actually, that wasn't me. I pointed out that the free space at this
moment is no guide as to the free space when we actually come to write
the files.

Boyd
Boyd Adamson wrote:
> Chris Kirby <chris.kirby at sun.com> writes:
>
>>None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>>handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>>values are rumors at best anyway.
>
> Actually, that wasn't me. I pointed out that the free space at this
> moment is no guide as to the free space when we actually come to write
> the files.

Yep, that was what I meant by "statvfs values are rumors at best
anyway".

-Chris
Richard Lowe wrote:
> Boyd Adamson <boyd-adamson at usa.net> writes:
>
>>Chris Kirby <chris.kirby at sun.com> writes:
>>
>>>None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>>>handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>>>values are rumors at best anyway.
>>
>>Actually, that wasn't me. I pointed out that the free space at this
>>moment is no guide as to the free space when we actually come to write
>>the files.
>
> And that surely the right thing to do here is to have the application
> either deal with EOVERFLOW sanely, or use the largefile environment.

No doubt, that's definitely the best solution in general.

The request for this RFE was to see if we could do something reasonable
for apps that can't be fixed for whatever reason (no maintainer, no
source code, etc.).  That's what I'm trying to explore. :-)

-Chris
On 8/30/07, Chris Kirby <chris.kirby at sun.com> wrote:
> Richard Lowe wrote:
> > Boyd Adamson <boyd-adamson at usa.net> writes:
> >
> >>Chris Kirby <chris.kirby at sun.com> writes:
> >>
> >>>None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
> >>>handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
> >>>values are rumors at best anyway.
> >>
> >>Actually, that wasn't me. I pointed out that the free space at this
> >>moment is no guide as to the free space when we actually come to write
> >>the files.
> >
> > And that surely the right thing to do here is to have the application
> > either deal with EOVERFLOW sanely, or use the largefile environment.
>
> No doubt, that's definitely the best solution in general.
>
> The request for this RFE was to see if we could do something reasonable
> for apps that can't be fixed for whatever reason (no maintainer, no
> source code, etc.).  That's what I'm trying to explore. :-)

If only there were a way to detect which apps were in that state :)
Don Cragun wrote:
> Chris,
>	If ZFS is ever going to pass the test suites so we can claim
> that ZFS meets POSIX and UNIX filesystem requirements, you'll find that
> the conformance test suites will be trying things similar to what I've
> been describing.  The test suites certainly aren't perfect, but over
> time they get more and more touchy about lies.  If the test suites get
> an EOVERFLOW when requesting information, they'll report an unresolved
> status along with what they were trying to test, and specify that human
> intervention (in this case an explanation) will be required to get
> certification.  If the test suite catches you in a lie, you'll have to
> file for a test suite waiver and explain why the test suite is wrong.
> If statvfs() says there are X files, blocks, ... available when there
> are 2X or X+1, there is a very low chance of getting the waiver.  At
> that point you will have the option of not certifying your product or
> fixing your code so it passes the test.

Don,

Can you provide a concrete example of this?  Eg: which standard requires
what behavior, and how would the changes that Chris is proposing violate
them?  It's difficult for me to see how any reasonable test program
could succeed under ZFS today but fail with Chris's changes.

With ZFS, the number of blocks or files "available", while true, does
not exactly and simply correspond to whether any future syscall will
succeed or fail -- even in an otherwise idle system, even waiting
"sufficiently long" for all changes to percolate through the system.
This is due to metadata, compression, gang blocks, raid-z, snapshots,
etc.  This is somewhat true even in UFS -- indirect blocks have to come
from somewhere.

So for example, if there is X bytes of free space, you can not
necessarily write X bytes to an existing file.  If you can, there will
not necessarily be 0 free bytes.  Same goes for number of files
available.
To take the one example you mentioned, if X files are removed, on ZFS
today the number of free files will not necessarily go up by X (nor
necessarily by X or more, nor necessarily by X or less; the number of
free files may even decrease!).  This is because metadata is shared
between files (there are many dnodes per block), and because of
snapshots.  Therefore applications can not make any assumption about how
any action will affect the number of available files.

--matt
On Sep 05, 2007  17:39 -0700, Matthew Ahrens wrote:
> Can you provide a concrete example of this?  Eg: which standard requires what
> behavior, and how would the changes that Chris is proposing violate them?
> It's difficult for me to see how any reasonable test program could succeed
> under ZFS today but fail with Chris's changes.

Just as an FYI - we've been scaling the statfs blocksize with Lustre for
many years without problems.  32-bit systems would overflow df at 16TB,
because the userspace (df in particular) and many apps were not compiled
to use a 64-bit statfs interface at the time, yet customers were using
filesystems hundreds of TB in size.

Even today we scale up the blocksize on 32-bit systems as needed to
always fit into a 32-bit total block count, because Linux has the same
issue that the fs doesn't know if the caller is using the 32-bit or
64-bit statfs syscall.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
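The kind of scaling described in the message above can be illustrated with a small sketch. It is not Lustre's code: scale_to_32() is a hypothetical helper that assumes the block size is a power of two, and it simply doubles the block size until the total block count fits in 32 bits, in the same spirit as the rescale_frsize() proposal at the top of the thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not Lustre source): double the block size and
 * halve the block count until the count fits in 32 bits.  `bsize` is
 * assumed to be a power of two.  Returns 0 and fills in the outputs on
 * success, or -1 if no block size below 4GB makes the count fit.
 */
static int
scale_to_32(uint64_t blocks, uint64_t bsize,
    uint32_t *out_blocks, uint32_t *out_bsize)
{
	assert(bsize != 0 && (bsize & (bsize - 1)) == 0);

	while (blocks > UINT32_MAX && bsize < (1ULL << 31)) {
		bsize <<= 1;
		blocks >>= 1;	/* rounds down; low-order bits are lost */
	}
	if (blocks > UINT32_MAX || bsize > UINT32_MAX)
		return (-1);

	*out_blocks = (uint32_t)blocks;
	*out_bsize = (uint32_t)bsize;
	return (0);
}
```

For example, a 16TB file system with 4K blocks has 2^32 blocks, one more than a 32-bit counter can hold; a single doubling gives 2^31 blocks of 8K, which describes the same number of bytes.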