An issue was found with the netBeans installer where
the installation was failing on a large ZFS filesystem.

This resulted in CR 6560644 (zfs statvfs f_frsize needs work).
The issue is that large filesystems can cause EOVERFLOW on
statvfs() calls.  This behavior is documented in the statvfs(2)
man page, but I think we can do better.

The problem was initially reported against ZFS, and my first
fix was to zfs_statvfs().  But the VFS_STATVFS routines only
see a statvfs64, so they have no way to tell if they're
about to cause EOVERFLOW.  This means that they would always
have to limit themselves to 32-bit values, which seems
like the wrong direction for a fix.

The change detects when cstatvfs32 is about to return EOVERFLOW,
and attempts to rescale the statvfs values so that we return valid
(albeit less precise) information.

Please send me any comments or suggestions, especially
with respect to conformance.

-Chris


Here's the suggested patch:

--- old/usr/src/uts/common/syscall/statvfs.c	Mon Aug 27 11:17:30 2007
+++ new/usr/src/uts/common/syscall/statvfs.c	Mon Aug 27 11:17:30 2007
@@ -19,7 +19,7 @@
  * CDDL HEADER END
  */
 /*
- * Copyright 2006 Sun Microsystems, Inc. All rights reserved.
+ * Copyright 2007 Sun Microsystems, Inc. All rights reserved.
  * Use is subject to license terms.
  */
 
@@ -31,7 +31,7 @@
  * under license from the Regents of the University of California.
  */
 
-#pragma ident	"@(#)statvfs.c	1.19	06/05/17 SMI"
+#pragma ident	"@(#)statvfs.c	1.20	07/08/27 SMI"
 
 /*
  * Get file system statistics (statvfs and fstatvfs).
@@ -76,6 +76,27 @@
  * Common routines for statvfs and fstatvfs.
  */
 
+/*
+ * Underlying VFS_STATVFS routines can't tell if they are going to cause
+ * EOVERFLOW with f_frsize, because all they see is a struct statvfs64.
+ * If EOVERFLOW was about to occur, try to adjust f_frsize upward so that
+ * we return valid (but less precise) block information.
+ */
+static void
+rescale_frsize(struct statvfs64 *sbp)
+{
+	/*
+	 * We're limited to 32 bits for the number of frsize blocks.
+	 * Find the smallest power of 2 that won't cause EOVERFLOW.
+	 */
+	while (sbp->f_blocks > UINT32_MAX && sbp->f_frsize < (1U << 31)) {
+		sbp->f_frsize <<= 1;
+		sbp->f_blocks >>= 1;
+		sbp->f_bfree >>= 1;
+		sbp->f_bavail >>= 1;
+	}
+}
+
 static int
 cstatvfs32(struct vfs *vfsp, struct statvfs32 *ubp)
 {
@@ -114,10 +135,28 @@
 	if (ds64.f_bfree == (fsblkcnt64_t)-1)
 		ds64.f_bfree = UINT32_MAX;
 
+	/*
+	 * If we're about to cause EOVERFLOW with any of the inode
+	 * counts, cap the value(s) at UINT32_MAX.
+	 */
+	if (ds64.f_files > UINT32_MAX)
+		ds64.f_files = UINT32_MAX;
+	if (ds64.f_ffree > UINT32_MAX)
+		ds64.f_ffree = UINT32_MAX;
+	if (ds64.f_favail > UINT32_MAX)
+		ds64.f_favail = UINT32_MAX;
+
+	/*
+	 * If necessary, try to avoid EOVERFLOW by increasing f_frsize.
+	 */
 	if (ds64.f_blocks > UINT32_MAX || ds64.f_bfree > UINT32_MAX ||
-	    ds64.f_bavail > UINT32_MAX || ds64.f_files > UINT32_MAX ||
-	    ds64.f_ffree > UINT32_MAX || ds64.f_favail > UINT32_MAX)
+	    ds64.f_bavail > UINT32_MAX)
+		rescale_frsize(&ds64);
+
+	if (ds64.f_blocks > UINT32_MAX || ds64.f_bfree > UINT32_MAX ||
+	    ds64.f_bavail > UINT32_MAX)
 		return (EOVERFLOW);
+
 #ifdef _LP64
 	/*
 	 * On the 64-bit kernel, even these fields grow to 64-bit
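To make the effect of the proposed rescaling concrete, here is a small user-space sketch, not the kernel patch itself: the struct, its field set, and the 200 TB sample size are invented for illustration, but the loop is the same doubling idea as rescale_frsize() above.

#include <stdio.h>
#include <stdint.h>

/* Simplified stand-in for the block-related fields of struct statvfs64. */
struct fake_statvfs64 {
	uint64_t f_frsize;	/* fundamental block (fragment) size */
	uint64_t f_blocks;	/* total blocks, in units of f_frsize */
	uint64_t f_bfree;	/* free blocks */
	uint64_t f_bavail;	/* free blocks available to non-root */
};

/*
 * Same idea as the proposed rescale_frsize(): double f_frsize (and halve
 * the block counts) until the counts fit in 32 bits.
 */
static void
rescale(struct fake_statvfs64 *sbp)
{
	while (sbp->f_blocks > UINT32_MAX && sbp->f_frsize < (1U << 31)) {
		sbp->f_frsize <<= 1;
		sbp->f_blocks >>= 1;
		sbp->f_bfree >>= 1;
		sbp->f_bavail >>= 1;
	}
}

int
main(void)
{
	/* Hypothetical 200 TB pool, reported in 512-byte fragments. */
	uint64_t bytes = 200ULL << 40;
	struct fake_statvfs64 s = {
		512, bytes / 512, (bytes / 512) / 2, (bytes / 512) / 2
	};

	printf("before: frsize=%llu blocks=%llu (fits in 32 bits: %s)\n",
	    (unsigned long long)s.f_frsize, (unsigned long long)s.f_blocks,
	    s.f_blocks > UINT32_MAX ? "no" : "yes");
	rescale(&s);
	printf("after:  frsize=%llu blocks=%llu (fits in 32 bits: %s)\n",
	    (unsigned long long)s.f_frsize, (unsigned long long)s.f_blocks,
	    s.f_blocks > UINT32_MAX ? "no" : "yes");
	/* Bytes implied by the rescaled values; here exactly preserved,
	 * in general rounded down by less than f_frsize. */
	printf("total bytes implied: %llu\n",
	    (unsigned long long)(s.f_frsize * s.f_blocks));
	return (0);
}

For the 200 TB sample the loop settles on a 64 KB fragment size and a block count of about 3.3 billion, and f_frsize * f_blocks is unchanged, which is the invariant the patch relies on.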
Can you explain in a bit more detail why we're doing this?  I probably
don't understand the issue in sufficient detail.  It seems like the
large file compilation environment, lfcompile(5), exists to solve this
exact problem.  Isn't it the application's responsibility to properly
handle EOVERFLOW or choose an interface that can handle file offsets
that are greater than 2Gbytes?  Is there something obvious here that I'm
missing?

-j

On Mon, Aug 27, 2007 at 03:38:26PM -0500, Chris Kirby wrote:
> An issue was found with the netBeans installer where
> the installation was failing on a large ZFS filesystem.
>
> This resulted in CR 6560644 (zfs statvfs f_frsize needs work).
> The issue is that large filesystems can cause EOVERFLOW on
> statvfs() calls.  This behavior is documented in the statvfs(2)
> man page, but I think we can do better.
>
> The problem was initially reported against ZFS, and my first
> fix was to zfs_statvfs().  But the VFS_STATVFS routines only
> see a statvfs64, so they have no way to tell if they're
> about to cause EOVERFLOW.  This means that they would always
> have to limit themselves to 32-bit values, which seems
> like the wrong direction for a fix.
>
> The change detects when cstatvfs32 is about to return EOVERFLOW,
> and attempts to rescale the statvfs values so that we return valid
> (albeit less precise) information.
>
> Please send me any comments or suggestions, especially
> with respect to conformance.
>
> -Chris
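For reference, the application-side fix alluded to in the question above looks roughly like the sketch below: retry with the 64-bit interface when statvfs() reports EOVERFLOW. This is only an illustrative sketch; it assumes a 32-bit Solaris build where the transitional large-file interface (statvfs64) is made visible via _LARGEFILE64_SOURCE, and the helper name free_bytes() is invented for the example.

#define _LARGEFILE64_SOURCE 1	/* expose statvfs64() in a 32-bit build */

#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/statvfs.h>

/*
 * Report the bytes available on the filesystem containing "path".
 * Sketch only: minimal error handling.
 */
static int
free_bytes(const char *path, unsigned long long *bytesp)
{
	struct statvfs sv;

	if (statvfs(path, &sv) == 0) {
		*bytesp = (unsigned long long)sv.f_bavail * sv.f_frsize;
		return (0);
	}
	if (errno == EOVERFLOW) {
		/* Values don't fit in 32 bits; retry with the 64-bit API. */
		struct statvfs64 sv64;

		if (statvfs64(path, &sv64) == 0) {
			*bytesp = (unsigned long long)sv64.f_bavail *
			    sv64.f_frsize;
			return (0);
		}
	}
	return (-1);
}

int
main(int argc, char **argv)
{
	unsigned long long avail;

	if (argc > 1 && free_bytes(argv[1], &avail) == 0)
		printf("%s: %llu bytes available\n", argv[1], avail);
	return (0);
}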
johansen-osdev at sun.com wrote:
> Can you explain in a bit more detail why we're doing this?  I probably
> don't understand the issue in sufficient detail.  It seems like the
> large file compilation environment, lfcompile(5), exists to solve this
> exact problem.  Isn't it the application's responsibility to properly
> handle EOVERFLOW or choose an interface that can handle file offsets
> that are greater than 2Gbytes?  Is there something obvious here that I'm
> missing?

It's not a large file issue, it's a large *filesystem* issue
that revolves around f_frsize unit reporting via the cstatvfs32
interface.  f_blocks, f_bfree, and f_bavail are all reported in
units of f_frsize.

For ZFS, we report f_frsize as 512 regardless of the size of
the fs.  This means we can only express vfs size up to
UINT32_MAX * 512 bytes.  That's not a terribly large fs
by today's standards.  Anything larger will result in EOVERFLOW
from statvfs.

You're entirely correct that it's the application's responsibility
to deal with EOVERFLOW, perhaps by using statvfs64.  But if we can
return valid information instead of an error, that seems like a
good thing.

-Chris
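The limit described above works out to just under 2 TiB. A quick check of the arithmetic (nothing here is measured anywhere; it is just UINT32_MAX multiplied by the fixed 512-byte f_frsize):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint64_t frsize = 512;			/* what ZFS reports */
	uint64_t max_blocks = UINT32_MAX;	/* 32-bit f_blocks limit */
	uint64_t max_bytes = frsize * max_blocks;

	/* Prints 2199023255040 bytes, i.e. roughly 2047 GiB. */
	printf("max reportable size: %llu bytes (~%llu GiB)\n",
	    (unsigned long long)max_bytes,
	    (unsigned long long)(max_bytes >> 30));
	return (0);
}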
> This could be a bad thing if the application thinks that all of the
> information in the statvfs structure is valid, when only some of it is
> valid.  Why not set the invalid fields in the statvfs structure to
> something distinctive like -EOVERFLOW instead of the proposed change of
> setting such fields to the "smallest power of 2 that won't cause
> EOVERFLOW?"  It's better to return something that's obviously garbage
> than it is to return something that looks like it's valid when it's not?

I don't think that Chris is proposing that we return invalid data to the
application.  Rather, I think he's talking about increasing the size that
we report blocks to be in the statvfs structure.

statvfs(2) explains this in a little more detail.  Basically, f_frsize is
a scaling factor for f_blocks, f_bfree, and f_bavail.  IIUC, Chris is
proposing adjusting the scaling factor to avoid integer overflow.  Thus,
if f_frsize is 512b and f_blocks is UINT32_MAX, perhaps we could set
f_frsize to 1024b and have f_blocks be 2^31 instead.  This means that
there's a potential to overreport the amount of space used.  However, we
already have and use a scaling factor, so we've been doing this since day
one.  This seems to be a philosophical question, not one of correctness.

Chris, correct me if I got any of that wrong.

-j
Chris,
 There are several issues here.  Please find comments in-line below...

Cheers,
Don

>Date: Mon, 27 Aug 2007 16:04:42 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>To: johansen-osdev at sun.com
>Cc: ufs-discuss at opensolaris.org, don.cragun at sun.com, zfs-code at opensolaris.org
>
>johansen-osdev at sun.com wrote:
>> Can you explain in a bit more detail why we're doing this?  I probably
>> don't understand the issue in sufficient detail.  It seems like the
>> large file compilation environment, lfcompile(5), exists to solve this
>> exact problem.  Isn't it the application's responsibility to properly
>> handle EOVERFLOW or choose an interface that can handle file offsets
>> that are greater than 2Gbytes?  Is there something obvious here that I'm
>> missing?
>>
>
>It's not a large file issue, it's a large *filesystem* issue
>that revolves around f_frsize unit reporting via the cstatvfs32
>interface.  f_blocks, f_bfree, and f_bavail are all reported in
>units of f_frsize.

ZFS has large file and large filesystem issues.  But, those of us who
participated in the Large File Summit (vendors and consumers who jointly
produced the Large File Summit Specification and did the work to get the
non-transitional interfaces integrated into X/Open's (now The Open
Group's) X/Open Portability Guide, Issue 4, Version 2) remember the
discussions that led to the creation of the EOVERFLOW errno value.  The
base point is that any time you lie to applications, some application
software is going to make wrong decisions based on the lie.

>
>For ZFS, we report f_frsize as 512 regardless of the size of
>the fs. ...

Why?  Why shouldn't you always set f_frsize to the actual size of an
allocation unit on the filesystem?  Is it still true that we don't
support disks formatted with 1024 byte sectors?

> ... This means we can only express vfs size up to
>UINT32_MAX * 512 bytes.  That's not a terribly large fs
>by today's standards.  Anything larger will result in EOVERFLOW
>from statvfs.
>
>You're entirely correct that it's the application's responsibility
>to deal with EOVERFLOW, perhaps by using statvfs64.  But if we can
>return valid information instead of an error, that seems like a
>good thing.

When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
correct values for these fields are larger; you are not returning valid
information.

You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
and f_bavail; but you aren't checking to see if that is true or not.
(If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
that was not a zero bit; the scaled values being returned are not
valid.)

Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
relationship between f_bsize and f_frsize, applications may well have
made their own assumptions.  Is there documentation somewhere that
specifies how many bytes should be written at a time (on boundaries
that is a multiple of that value) to get the most efficiency out of
the underlying hardware?  I would hope that f_bsize would be that
value.  If it is, it seems that f_bsize should be an integral multiple
of f_frsize.

>
>-Chris
Don, thanks for your comments, please see below:

Don Cragun wrote:
>>Date: Mon, 27 Aug 2007 16:04:42 -0500
>>From: Chris Kirby <chris.kirby at sun.com>

> The base point is that any time you lie to applications, some
> application software is going to make wrong decisions based on the lie.

Yes, and we certainly don't want to lie.  But returning an
error when we can return valid (albeit less precise) info
will also cause applications to make wrong decisions.

In the case of the netBeans installer, it died because
it thought there wasn't enough free space when in fact,
there were several TB of space available.

I suspect that most apps that use f_bfree/f_bavail just
want to know if they have enough space to write their
data.

>
>>For ZFS, we report f_frsize as 512 regardless of the size of
>>the fs. ...
>
> Why?  Why shouldn't you always set f_frsize to the actual size of an
> allocation unit on the filesystem?  Is it still true that we don't
> support disks formatted with 1024 byte sectors?

For ZFS, we don't have a fixed allocation block size so in general
there won't be one true f_frsize across an entire VFS.  So we return
SPA_MINBLOCKSIZE (512) for f_frsize.

> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
> correct values for these fields are larger; you are not returning valid
> information.

I think it's valid in the sense that you will be able to create at
least UINT32_MAX files.  Of course once you've done so,
we might still report that you can create UINT32_MAX
additional files. :-)

Any application making a decision on an available file count such
that UINT32_MAX is not enough, but UINT32_MAX+1 would be OK, should
be using the correct largefile syscalls like statvfs64.

>
> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
> and f_bavail; but you aren't checking to see if that is true or not.
> (If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
> that was not a zero bit; the scaled values being returned are not
> valid.)

You're right that we're discarding some bytes through the scaling
process.  However, any non-zero bits that are discarded are effectively
partial f_frsize blocks.  For any filesystem large enough to get into
this situation, we're talking about a relatively *very* small
amount of rounding down.  (e.g. for a 1PB fs, f_frsize is only 256K)

Remember that the fs code can be doing delayed writes, delayed
allocation, background delete processing, etc.  So the statvfs
values are just rumors anyway.  Most filesystems don't even bother
to grab a lock when reporting statvfs info.

>
> Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
> relationship between f_bsize and f_frsize, applications may well have
> made their own assumptions.  Is there documentation somewhere that
> specifies how many bytes should be written at a time (on boundaries
> that is a multiple of that value) to get the most efficiency out of
> the underlying hardware?  I would hope that f_bsize would be that
> value.  If it is, it seems that f_bsize should be an integral multiple
> of f_frsize.

Aside from the comment in statvfs(2) about f_bsize being the
"preferred file system block size", I can't find any documentation
that talks about that.

For filesystems that support direct I/O, f_bsize has traditionally
provided the most efficient I/O size multiplier.

But the setting of f_bsize is up to the underlying fs.  And at least
for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
That's also why I don't rescale f_bsize.

-Chris
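The "256K for a 1PB fs" figure can be checked with a few lines of arithmetic: keep doubling a 512-byte fragment size until the block count for a 10^15-byte filesystem fits in 32 bits. A throwaway sketch (the helper name frsize_for() is invented for the example):

#include <stdio.h>
#include <stdint.h>

/*
 * Smallest power-of-two multiple of 512 such that (bytes / frsize)
 * fits in a 32-bit block count -- the value the proposed rescaling
 * loop would settle on.
 */
static uint64_t
frsize_for(uint64_t bytes)
{
	uint64_t frsize = 512;

	while (bytes / frsize > UINT32_MAX && frsize < (1ULL << 31))
		frsize <<= 1;
	return (frsize);
}

int
main(void)
{
	/* 1 PB (decimal), as in the example in the message above. */
	uint64_t pb = 1000000000000000ULL;

	/* Prints 262144, i.e. 256K. */
	printf("f_frsize for a 1PB fs: %llu\n",
	    (unsigned long long)frsize_for(pb));
	return (0);
}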
Chris,
 More comments below...

Cheers,
Don

>Date: Tue, 28 Aug 2007 17:23:58 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>
>Don, thanks for your comments, please see below:
>
>Don Cragun wrote:
>>>Date: Mon, 27 Aug 2007 16:04:42 -0500
>>>From: Chris Kirby <chris.kirby at sun.com>
>
>> The base point is that any time you lie to applications, some
>> application software is going to make wrong decisions based on the lie.
>
>Yes, and we certainly don't want to lie.  But returning an
>error when we can return valid (albeit less precise) info
>will also cause applications to make wrong decisions.

Yes.  But, if you give them an error, the application writer will soon
be notified that their application isn't working on ZFS due to an
EOVERFLOW error.  If you lie <--- give them less precise info, other
applications will make bad assumptions and fail later in strange and
wondrous ways (with no indication that an overflowed value was the
culprit).

>
>In the case of the netBeans installer, it died because
>it thought there wasn't enough free space when in fact,
>there were several TB of space available.
>
>I suspect that most apps that use f_bfree/f_bavail just
>want to know if they have enough space to write their
>data.

Unfortunately, ZFS has no way of knowing when the application calling
statvfs() is not one of those apps.

>
>>>For ZFS, we report f_frsize as 512 regardless of the size of
>>>the fs. ...
>>
>> Why?  Why shouldn't you always set f_frsize to the actual size of an
>> allocation unit on the filesystem?  Is it still true that we don't
>> support disks formatted with 1024 byte sectors?
>
>For ZFS, we don't have a fixed allocation block size so in general
>there won't be one true f_frsize across an entire VFS.  So we return
>SPA_MINBLOCKSIZE (512) for f_frsize.
>
>> When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>> correct values for these fields are larger; you are not returning valid
>> information.
>
>I think it's valid in the sense that you will be able to create at
>least UINT32_MAX files.  Of course once you've done so,
>we might still report that you can create UINT32_MAX
>additional files. :-)

You may also find an app (A) that checks number of free files, removes
a few and then crashes because it was supposed to be running on a quiet
machine and has now detected that some other app (B) is creating files
as fast as A can remove them.  (B doesn't really exist, but since the
number of free files isn't rising, A has to assume that B is active.)

>
>Any application making a decision on an available file count such
>that UINT32_MAX is not enough, but UINT32_MAX+1 would be OK, should
>be using the correct largefile syscalls like statvfs64.

And the only way for that app to detect that this is what is going on,
statvfs() has to fail with an EOVERFLOW error.

>
>> You may be returning "valid" values for f_frsize, f_blocks, f_bfree,
>> and f_bavail; but you aren't checking to see if that is true or not.
>> (If shifting f_blocks, f_bfree, or f_bavail right throws away a bit
>> that was not a zero bit; the scaled values being returned are not
>> valid.)
>
>You're right that we're discarding some bytes through the scaling
>process.  However, any non-zero bits that are discarded are effectively
>partial f_frsize blocks.  For any filesystem large enough to get into
>this situation, we're talking about a relatively *very* small
>amount of rounding down.  (e.g. for a 1PB fs, f_frsize is only 256K)
>
>Remember that the fs code can be doing delayed writes, delayed
>allocation, background delete processing, etc.  So the statvfs
>values are just rumors anyway.  Most filesystems don't even bother
>to grab a lock when reporting statvfs info.
>
>> Since the statvfs(2) and statvfs.h(3HEAD) man pages don't state any
>> relationship between f_bsize and f_frsize, applications may well have
>> made their own assumptions.  Is there documentation somewhere that
>> specifies how many bytes should be written at a time (on boundaries
>> that is a multiple of that value) to get the most efficiency out of
>> the underlying hardware?  I would hope that f_bsize would be that
>> value.  If it is, it seems that f_bsize should be an integral multiple
>> of f_frsize.
>
>Aside from the comment in statvfs(2) about f_bsize being the
>"preferred file system block size", I can't find any documentation
>that talks about that.
>
>For filesystems that support direct I/O, f_bsize has traditionally
>provided the most efficient I/O size multiplier.
>
>But the setting of f_bsize is up to the underlying fs.  And at least
>for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
>That's also why I don't rescale f_bsize.

Correct.  I'm not suggesting that statvfs() should scale f_bsize; I'm
saying that if you scale f_frsize, some application may think its
world has turned upside down because the relationship it thought
existed between f_frsize and f_bsize is no longer true.

I believe statvfs() should be returning an error condition with errno
set to EOVERFLOW and that applications that run into the EOVERFLOW
should be fixed to handle the brave new world of large filesystems.

By the logic you're using, we would not have needed to change the df
utility to be large filesystem aware; we should have just let it
truncate the number of blocks it said were available for all
filesystems to 32-bit values.  For a sysadmin that wants to know if the
ZFS filesystem that was just created came out at the correct size, this
clearly is not sufficient; but for "most" users who just want to know
if there is room to create a file, it will meet their needs perfectly.

For any particular call to statvfs(), the system won't know whether the
discarded low order bits of f_blocks, f_bfree, and f_bavail and the
high order bits of f_files, f_ffree, and f_favail are important or
not.  The only safe thing to do is report the overflows and fix the
applications that get the resulting EOVERFLOW errors.

>
>-Chris
Chris Kirby <chris.kirby at sun.com> writes:
> I suspect that most apps that use f_bfree/f_bavail just
> want to know if they have enough space to write their
> data.

I don't claim to be an expert in this area (or any other) but all this
seems very clearly to be an application bug.  If the application asks
"how much free space is there on this filesystem" and the system says
"More space than your data structure can handle" then surely the
correct response is: "Since the data I'm installing *will* fit into a
32 bit size and the free space *won't*, there is no problem, just
install."  Or, "The data I'm installing *won't* fit into a 32 bit size
so using a 32 bit struct to get the available space was stupid in the
first place."

> Remember that the fs code can be doing delayed writes, delayed
> allocation, background delete processing, etc.  So the statvfs values
> are just rumors anyway.  Most filesystems don't even bother to grab a
> lock when reporting statvfs info.

Not to mention that (even if it's 100% accurate) the amount of free
space *now* is not valid at any time other than *now*.  Assuming that
if there's 300GB available now it means I can write 300GB is not valid
in the face of other writers.
Don Cragun wrote:
>>From: Chris Kirby <chris.kirby at sun.com>

>>>When you cap f_files, f_ffree, and f_favail at UINT32_MAX when the
>>>correct values for these fields are larger; you are not returning valid
>>>information.
>>
>>I think it's valid in the sense that you will be able to create at
>>least UINT32_MAX files.  Of course once you've done so,
>>we might still report that you can create UINT32_MAX
>>additional files. :-)
>
> You may also find an app (A) that checks number of free files, removes
> a few and then crashes because it was supposed to be running on a quiet
> machine and has now detected that some other app (B) is creating files
> as fast as A can remove them.  (B doesn't really exist, but since the
> number of free files isn't rising, A has to assume that B is active.)

Remember that f_ffree is in no way under an application's
control.  That value usually represents an inode count, which
is not necessarily updated synchronously at unlink time.  The
fs itself might be doing background activity (garbage collection)
that could cause this count to fluctuate on an otherwise idle system.

Plus, the fs is free to report whatever number it wants here.  It can
even be the same number all the time.  QFS, which does dynamic
inode allocation, always reports f_ffree as -1.

None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
values are rumors at best anyway.

>>
>>But the setting of f_bsize is up to the underlying fs.  And at least
>>for QFS, UFS, and ZFS, its value is not scaled based on f_frsize.
>>That's also why I don't rescale f_bsize.
>
> Correct.  I'm not suggesting that statvfs() should scale f_bsize; I'm
> saying that if you scale f_frsize, some application may think its
> world has turned upside down because the relationship it thought
> existed between f_frsize and f_bsize is no longer true.

Given that we don't document any relationship between those two
fields, I'm not sure why that would be a valid assumption.

AFAICS, these values aren't even required to be constant for
consecutive calls to the same vfs.

>
> I believe statvfs() should be returning an error condition with errno
> set to EOVERFLOW and that applications that run into the EOVERFLOW
> should be fixed to handle the brave new world of large filesystems.
>
> By the logic you're using, we would not have needed to change the df
> utility to be large filesystem aware; we should have just let it
> truncate the number of blocks it said were available for all
> filesystems to 32-bit values.  For a sysadmin that wants to know if the
> ZFS filesystem that was just created came out at the correct size, this
> clearly is not sufficient; but for "most" users who just want to know
> if there is room to create a file, it will meet their needs perfectly.

I'm not suggesting that we just truncate the number of blocks.
We're also adjusting the unit size (f_frsize) so that
(f_blocks * f_frsize) is the same before and after the rescaling.

Yes, there might be a rounding error of up to (f_frsize - 1) *bytes*.
So on a 1PB fs, we might be off by (256K - 1) bytes.  That's
0.000000026 percent.  It doesn't seem like it would generate
many support calls.

>
> For any particular call to statvfs(), the system won't know whether the
> discarded low order bits of f_blocks, f_bfree, and f_bavail and the
> high order bits of f_files, f_ffree, and f_favail are important or
> not.  The only safe thing to do is report the overflows and fix the
> applications that get the resulting EOVERFLOW errors.

Even if an application decides that those bits were important (and
we're talking about the very last tiny bit of space on the fs),
there are currently no guarantees that those few bytes would be
available to a user anyway.  Growing a file by one byte could still
result in ENOSPC if the fs needed more than the remaining bytes for
its own data structures.

I agree that broken apps should be fixed.  But that's not always
possible.  And I guess I still don't see where we would be breaking
any well-behaved applications.

-Chris
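The 0.000000026 percent figure checks out: the worst-case rounding loss is f_frsize - 1 bytes, i.e. 262143 bytes against a 10^15-byte filesystem. A trivial verification, purely illustrative arithmetic:

#include <stdio.h>

int
main(void)
{
	double fs_bytes = 1e15;			/* 1 PB, decimal */
	double max_loss = 256.0 * 1024.0 - 1.0;	/* (f_frsize - 1) bytes */

	/* Prints roughly 0.000000026 (percent). */
	printf("worst-case rounding error: %.9f percent\n",
	    max_loss / fs_bytes * 100.0);
	return (0);
}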
Chris,
 If ZFS is ever going to pass the test suites so we can claim
that ZFS meets POSIX and UNIX filesystem requirements, you'll find that
the conformance test suites will be trying things similar to what I've
been describing.  The test suites certainly aren't perfect, but over
time they get more and more touchy about lies.  If the test suites get
an EOVERFLOW when requesting information, they'll report an unresolved
status along with what they were trying to test, and specify that human
intervention (in this case an explanation) will be required to get
certification.  If the test suite catches you in a lie, you'll have to
file for a test suite waiver and explain why the test suite is wrong.
If statvfs() says there are X files, blocks, ... available when there
are 2X or X+1, there is a very low chance of getting the waiver.  At
that point you will have the option of not certifying your product or
fixing your code so it passes the test.

You asked for my standards opinion.  My opinion is that the change
you're suggesting does not comply with standards requirements.

 - Don

>Date: Wed, 29 Aug 2007 10:55:34 -0500
>From: Chris Kirby <chris.kirby at sun.com>
>Subject: Re: [zfs-code] statvfs change
>To: Don Cragun <don.cragun at sun.com>
>Cc: johansen-osdev at sun.com, ufs-discuss at opensolaris.org, zfs-code at opensolaris.org
On Wed, 29 Aug 2007, Chris Kirby wrote:

[ ... ]

> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
> values are rumors at best anyway.

That is not correct for UFS.  This piece:

	/*
	 * Adjust the numbers based on things waiting to be deleted.
	 * modifies f_bfree and f_ffree.  Afterwards, everything we
	 * come up with will be self-consistent.  By definition, this
	 * is a point-in-time snapshot, so the fact that the delete
	 * thread's probably already invalidated the results is not a
	 * problem.  Note that if the delete thread is ever extended to
	 * non-logging ufs, this adjustment must always be made.
	 */
	if (TRANS_ISTRANS(ufsvfsp))
		ufs_delete_adjust_stats(ufsvfsp, sp);

in ufs_statvfs() is grabbing locks.  And it was introduced because some
"nitpickers" claimed guaranteed point-in-time consistency of statvfs()
is mandated by the standards, and arguing that "the values are out of
date as soon as the syscall returns no matter what" wouldn't matter in
that context.

For the gory details, see:

	5012326 delay between unlink/close and space becoming available
		may be arbitrarily long

and then:

	6251659 statvfs taking 30 second to complete

Still wondering sometimes whether it'd been the right thing to claim
"statvfs must be 100% accurate".

[ ... ]

> I'm not suggesting that we just truncate the number of blocks.
> We're also adjusting the unit size (f_frsize) so that
> (f_blocks * f_frsize) is the same before and after the rescaling.
>
> Yes, there might be a rounding error of up to (f_frsize - 1) *bytes*.
> So on a 1PB fs, we might be off by (256K - 1) bytes.  That's
> 0.000000026 percent.  It doesn't seem like it would generate
> many support calls.

Even if.  As was mentioned about ZFS behaviour a few times, ZFS will
"compactify" small files anyway (meaning that, even if the FS were full
to that degree, extending _small_ files may be possible), and since it
optionally also does disk compression the amount of free space can only
be estimated.  The point, in particular for ZFS, can be made that the
accuracy of all statvfs() return values is, by design, not to the
'least significant bit'.  Hence, as long as things like "space free" is
the lower bound of what really is free, it's performing its task.

Don's statement about the standard requiring "X free - not X+1" is,
hmm, to say it politically correct, "difficult to meet with a
filesystem that does compress/tail-compact".  The "X not X-1" is, on
the other hand, to be taken very seriously.  I cannot claim to have
free space if I haven't, and that's what the above UFS bugfixes have
been about.

I agree with Don that we can't lie about having free space if we
don't.  I also agree that statvfs() and statvfs64() shouldn't return
different values.  But I don't see why we couldn't round up the used
space / round (down) the free space to some value/blocksize chosen by
the filesystem itself.  We just have to make this consistent.  If:

	statvfs()
	statvfs64()
	64bit statvfs()

all return the same values, and:

	- freeing a unit fs_frsize adds '1' to fs_blkfree
	- allocating a unit fs_frsize subtracts '1' from fs_blkfree
	- an alloc attempt when fs_blkfree is zero must fail

are met, we're standards compliant, aren't we ?

The standard, for statvfs(), doesn't make any statement about what
happens on attempts to allocate/free amounts _less than_ fs_frsize
bytes, except that "if I release N bytes I must be able to allocate N
bytes again - to the same file".  As I read the proposal, the blocked
behaviour is exactly what's being requested, isn't it ?

We have to remember that statvfs returns _BLOCKED_ units.  Space
(non)availability is judged in these units.  And actions like:

	- release fs_frsize/N from N different files ==> fs_blkfree++
	- fs_blkfree == 1, alloc fs_frsize/N to N different files succeeds

are _NOT_ covered by the standard as "must (not) fail".

If I'm wrong about this, then please provide pointers.  I'm very
interested.

Thanks,
FrankH.

>
>> For any particular call to statvfs(), the system won't know whether the
>> discarded low order bits of f_blocks, f_bfree, and f_bavail and the
>> high order bits of f_files, f_ffree, and f_favail are important or
>> not.  The only safe thing to do is report the overflows and fix the
>> applications that get the resulting EOVERFLOW errors.
>
> Even if an application decides that those bits were important (and
> we're talking about the very last tiny bit of space on the fs),
> there are currently no guarantees that those few bytes would be
> available to a user anyway.  Growing a file by one byte could still
> result in ENOSPC if the fs needed more than the remaining bytes for
> its own data structures.
>
> I agree that broken apps should be fixed.  But that's not always
> possible.  And I guess I still don't see where we would be breaking
> any well-behaved applications.

A 32bit application that uses statvfs() in a non-largefile compile
environment - aka not getting statvfs64() - isn't "well behaved".
It's way over a decade that the LF64 interfaces were introduced.
That's a while before anyone used the term 'Java' or 'Netbeans
Installer'.
Frank Hofmann wrote:
> On Wed, 29 Aug 2007, Chris Kirby wrote:
>
> [ ... ]
>
>> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>> values are rumors at best anyway.
>
> That is not correct for UFS.  This piece:
>
>	/*
>	 * Adjust the numbers based on things waiting to be deleted.
>	 * modifies f_bfree and f_ffree.  Afterwards, everything we
>	 * come up with will be self-consistent.  By definition, this
>	 * is a point-in-time snapshot, so the fact that the delete
>	 * thread's probably already invalidated the results is not a
>	 * problem.  Note that if the delete thread is ever extended to
>	 * non-logging ufs, this adjustment must always be made.
>	 */
>	if (TRANS_ISTRANS(ufsvfsp))
>		ufs_delete_adjust_stats(ufsvfsp, sp);
>
> in ufs_statvfs() is grabbing locks.  And it was introduced because some
> "nitpickers" claimed guaranteed point-in-time consistency of statvfs()
> is mandated by the standards, and arguing that "the values are out of
> date as soon as the syscall returns no matter what" wouldn't matter in
> that context.

Ouch, that's expensive.

I still can't find a reference showing that the standards actually
require that sort of behavior.  In fact, there doesn't seem to be a lot
of information about what *is* required of statvfs by the standards.

One interesting quote is from Apple's statvfs(3) man page:

  STANDARDS
     The statvfs() and fstatvfs() functions conform to IEEE Std
     1003.1-2001 (``POSIX.1'').  As standardized, portable applications
     cannot depend on these functions returning any valid information
     at all.  This implementation attempts to provide as much useful
     information as is provided by the underlying file system, subject
     to the limitations of the specified data types.

Another is from the Open Group page on statvfs:

  http://www.opengroup.org/onlinepubs/000095399/functions/statvfs.html

  "It is unspecified whether all members of the statvfs structure have
  meaningful values on all file systems."

I suspect that the conformance tests for statvfs are a bit overzealous
in their expectations.

-Chris
Chris Kirby <chris.kirby at sun.com> writes:
> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
> values are rumors at best anyway.

Actually, that wasn't me.  I pointed out that the free space at this
moment is no guide as to the free space when we actually come to write
the files.

Boyd
Boyd Adamson wrote:
> Chris Kirby <chris.kirby at sun.com> writes:
>
>> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>> values are rumors at best anyway.
>
> Actually, that wasn't me.  I pointed out that the free space at this
> moment is no guide as to the free space when we actually come to write
> the files.

Yep, that was what I meant by "statvfs values are rumors at best
anyway".

-Chris
Richard Lowe wrote:
> Boyd Adamson <boyd-adamson at usa.net> writes:
>
>> Chris Kirby <chris.kirby at sun.com> writes:
>>
>>> None of {ZFS,QFS,UFS} even bother to grab a lock in their statvfs
>>> handlers, so as Boyd Adamson correctly pointed out, all of the statvfs
>>> values are rumors at best anyway.
>>
>> Actually, that wasn't me.  I pointed out that the free space at this
>> moment is no guide as to the free space when we actually come to write
>> the files.
>
> And that surely the right thing to do here is to have the application
> either deal with EOVERFLOW sanely, or use the largefile environment.

No doubt, that's definitely the best solution in general.

The request for this RFE was to see if we could do something reasonable
for apps that can't be fixed for whatever reason (no maintainer, no
source code, etc.).  That's what I'm trying to explore. :-)

-Chris
On 8/30/07, Chris Kirby <chris.kirby at sun.com> wrote:
> Richard Lowe wrote:
>> And that surely the right thing to do here is to have the application
>> either deal with EOVERFLOW sanely, or use the largefile environment.
>
> No doubt, that's definitely the best solution in general.
>
> The request for this RFE was to see if we could do something reasonable
> for apps that can't be fixed for whatever reason (no maintainer, no
> source code, etc.).  That's what I'm trying to explore. :-)

If only there were a way to detect which apps were in that state :)
Don Cragun wrote:
> Chris,
>  If ZFS is ever going to pass the test suites so we can claim
> that ZFS meets POSIX and UNIX filesystem requirements, you'll find that
> the conformance test suites will be trying things similar to what I've
> been describing.  The test suites certainly aren't perfect, but over
> time they get more and more touchy about lies.  If the test suites get
> an EOVERFLOW when requesting information, they'll report an unresolved
> status along with what they were trying to test, and specify that human
> intervention (in this case an explanation) will be required to get
> certification.  If the test suite catches you in a lie, you'll have to
> file for a test suite waiver and explain why the test suite is wrong.
> If statvfs() says there are X files, blocks, ... available when there
> are 2X or X+1, there is a very low chance of getting the waiver.  At
> that point you will have the option of not certifying your product or
> fixing your code so it passes the test.

Don,

Can you provide a concrete example of this?  Eg: which standard
requires what behavior, and how would the changes that Chris is
proposing violate them?  It's difficult for me to see how any
reasonable test program could succeed under ZFS today but fail with
Chris's changes.

With ZFS, the number of blocks or files "available", while true, does
not exactly and simply correspond to whether any future syscall will
succeed or fail -- even in an otherwise idle system, even waiting
"sufficiently long" for all changes to percolate through the system.
This is due to metadata, compression, gang blocks, raid-z, snapshots,
etc.  This is somewhat true even in UFS -- indirect blocks have to come
from somewhere.

So for example, if there is X bytes of free space, you can not
necessarily write X bytes to an existing file.  If you can, there will
not necessarily be 0 free bytes.  Same goes for number of files
available.  To take the one example you mentioned, if X files are
removed, on ZFS today the number of free files will not necessarily go
up by X (nor necessarily by X or more, nor necessarily by X or less;
the number of free files may even decrease!).  This is because metadata
is shared between files (there are many dnodes per block), and because
of snapshots.  Therefore applications can not make any assumption about
how any action will affect the number of available files.

--matt
On Sep 05, 2007  17:39 -0700, Matthew Ahrens wrote:
> Can you provide a concrete example of this?  Eg: which standard
> requires what behavior, and how would the changes that Chris is
> proposing violate them?  It's difficult for me to see how any
> reasonable test program could succeed under ZFS today but fail with
> Chris's changes.

Just as an FYI - we've been scaling the statfs blocksize with Lustre
for many years without problems, because 32-bit systems would overflow
df at 16TB because the userspace (df in particular) and many apps were
not compiled to use a 64-bit statfs interface at the time but customers
were using filesystems hundreds of TB in size.

Even today we scale up the blocksize on 32-bit systems as needed to
always fit into a 32-bit total block count, because Linux has the same
issue that the fs doesn't know if the caller is using the 32-bit or
64-bit statfs syscall.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.