thr3ads.net - zfs discuss - [zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool [Dec 2007]

Darren Reed

2007-Dec-27 04:58 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Having just done a largish mv from one ZFS filesystem to another ZFS
filesystem in the same zpool, I was somewhat surprised at how long it
took - I was expecting it to be near instant like it would be within the
same filesystem.

Are there optimisations possible here?

Surely it should be possible to just change some block pointers and do
a little bit of accounting?  Or is the amount of work required only really
beneficial for large files (hundreds of MB or GB)?

e.g. mv will currently do this:
...
lstat64("/biscuit/bfu//hda4.dump", 0x08066370)  Err#2 ENOENT
rename("hda4.dump", "/biscuit/bfu//hda4.dump")  Err#18 EXDEV
open64("hda4.dump", O_RDONLY)                   = 3
creat64("/biscuit/bfu//hda4.dump", 0644)        = 4
stat64("/biscuit/bfu//hda4.dump", 0x08066370)   = 0
chmod("/biscuit/bfu//hda4.dump", 0100644)       = 0
fstat64(3, 0x080662E0)                          = 0
mmap64(0x00000000, 8388608, PROT_READ, MAP_SHARED, 3, 0) = 0xFE600000
write(4, "01\0\0\0 {1CDF ?\0\0\0\0".., 8388608) = 8388608
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 8388608) = 0xFE600000
write(4, "11 603\0\f\001\0 .\0 d 1".., 8388608) = 8388608
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x01000000) = 0xFE600000
write(4, "\v998B\t96CBD2 X * ;E7ED".., 8388608) = 8388608
...

There are likely other factors to consider (such as encryption being
on/off, etc.), but if all of the properties are the same for a given pair
of ZFS filesystems between which a file is being moved, surely it should
be possible to take some nice shortcuts?

RFE worthy?

Darren

Joerg Schilling

2007-Dec-27 12:00 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Darren Reed <Darren.Reed at Sun.COM> wrote:
> Having just done a largish mv from one ZFS filesystem to another ZFS
> filesystem in the same zpool, I was somewhat surprised at how long it
> took - I was expecting it to be near instant like it would be within the
> same filesystem.
I would guess that this is caused by different st_dev values in the new 
filesystem. In such a case, mv copies the files instead of renaming them.

Jörg

-- 
 EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                (uni)  
       schilling at fokus.fraunhofer.de     (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily

Casper.Dik at Sun.COM

2007-Dec-27 15:28 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

>
>I would guess that this is caused by different st_dev values in the new
>filesystem. In such a case, mv copies the files instead of renaming them.

No, it's because they are different filesystems and the data needs to be
copied; zfs doesn't allow data movement between filesystems within a pool.

The code inside "mv" would immediately support such renames as it *first*
checks whether rename works and only then will it try "plan B":

                if (rename(source, target) >= 0)
                        return (0);
                if (errno != EXDEV) {
                        /* fatal errors */
                }
                ... continue with plan B: copy & remove ...


Casper
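
The "plan A, then plan B" logic Casper outlines can be sketched as a small self-contained routine. This is only an illustration, not the actual mv(1) source: the names move_file() and copy_file() are made up here, and a real mv also preserves mode, ownership and timestamps, which this sketch does not attempt.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Copy src to dst; returns 0 on success, -1 on error. */
static int copy_file(const char *src, const char *dst)
{
    char buf[65536];
    int ok = 1;
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        close(in);
        return -1;
    }
    for (;;) {
        ssize_t n = read(in, buf, sizeof buf);
        if (n == 0)
            break;                      /* end of file */
        if (n < 0) {
            ok = 0;
            break;
        }
        /* write(2) may be short, so loop until the chunk is out */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, (size_t)(n - off));
            if (w < 0) {
                ok = 0;
                break;
            }
            off += w;
        }
        if (!ok)
            break;
    }
    close(in);
    if (close(out) < 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Plan A: rename(2). Plan B, only on EXDEV: copy, then remove the source. */
int move_file(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    if (rename(src, dst) == 0)
        return 0;
    if (errno != EXDEV)
        return -1;                      /* fatal: not a cross-device error */
    if (copy_file(src, dst) != 0)
        return -1;
    return unlink(src);
}
```

If rename(2) ever succeeded across datasets in one pool, code of this shape would pick that up without modification, because the copy path is only reached on EXDEV.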

Frank.Hofmann at Sun.COM

2007-Dec-27 15:50 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Thu, 27 Dec 2007, Casper.Dik at Sun.COM wrote:
>> I would guess that this is caused by different st_dev values in the new
>> filesystem. In such a case, mv copies the files instead of renaming them.
>
> No, it's because they are different filesystems and the data needs to be
> copied; zfs doesn't allow data movement between filesystems within a pool.

It's not ZFS that blocks this by design - it's the VFS framework.
vn_rename() has this piece:

         /*
          * Make sure both the from vnode directory and the to directory
          * are in the same vfs and the to directory is writable.
          * We check fsid's, not vfs pointers, so loopback fs works.
          */
         if (fromvp != tovp) {
                 vattr.va_mask = AT_FSID;
                 if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
                         goto out;
                 fsid = vattr.va_fsid;
                 vattr.va_mask = AT_FSID;
                 if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
                         goto out;
                 if (fsid != vattr.va_fsid) {
                         error = EXDEV;
                         goto out;
                 }
         }

ZFS will never even see such a rename request.

FrankH.
------------------------------------------------------------------------------
No good can come from selling your freedom, not for all the gold in the world,
for the value of this heavenly gift far exceeds that of any fortune on earth.
------------------------------------------------------------------------------
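
The fsid comparison Frank quotes happens inside the kernel, but a rough user-space analogue is to compare st_dev for the two directories before attempting the rename. The sketch below assumes that st_dev as returned by stat(2) reflects the same distinction vn_rename() draws via va_fsid; same_filesystem() is a hypothetical helper name, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths report the same st_dev, 0 if they differ,
 * and -1 if either stat(2) call fails.
 */
int same_filesystem(const char *a, const char *b)
{
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev;
}
```

A result of 0 for the source and target directories is the case in which rename(2) would be expected to fail with EXDEV today, two datasets in the same pool included.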

Darren Reed

2007-Dec-28 03:14 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank.Hofmann at Sun.COM wrote:
> On Thu, 27 Dec 2007, Casper.Dik at Sun.COM wrote:
>
>>> I would guess that this is caused by different st_dev values in the new
>>> filesystem. In such a case, mv copies the files instead of renaming
>>> them.
>>
>> No, it's because they are different filesystems and the data needs to be
>> copied; zfs doesn't allow data movement between filesystems within a
>> pool.
>
> It's not ZFS that blocks this by design - it's the VFS framework.
> vn_rename() has this piece:
>
>         /*
>          * Make sure both the from vnode directory and the to directory
>          * are in the same vfs and the to directory is writable.
>          * We check fsid's, not vfs pointers, so loopback fs works.
>          */
>         if (fromvp != tovp) {
>                 vattr.va_mask = AT_FSID;
>                 if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
>                         goto out;
>                 fsid = vattr.va_fsid;
>                 vattr.va_mask = AT_FSID;
>                 if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
>                         goto out;
>                 if (fsid != vattr.va_fsid) {
>                         error = EXDEV;
>                         goto out;
>                 }
>         }
>
> ZFS will never even see such a rename request.

Is this behaviour defined by a standard (such as POSIX or the
VFS design) or are we free to innovate here and do something
that allowed such a shortcut as required?

Although I'm not sure the effort required would be worth the
added complexity (to VFS and ZFS) for such a minor "feature".

Darren

2007-Dec-28 04:52 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the

> such a minor "feature"
I don't think copying files is a minor feature.

Doubly so since the words I've read from Sun suggest that ZFS
"file systems" (or "data sets" or whatever they are called
now) can be used in the way directories on a normal file system are used.

This message posted from opensolaris.org

Frank Hofmann

2007-Dec-28 10:00 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Fri, 28 Dec 2007, Darren Reed wrote:
> Frank.Hofmann at Sun.COM wrote:
>> On Thu, 27 Dec 2007, Casper.Dik at Sun.COM wrote:
>>
>>>> I would guess that this is caused by different st_dev values in the new
>>>> filesystem. In such a case, mv copies the files instead of renaming them.
>>>
>>> No, it's because they are different filesystems and the data needs to be
>>> copied; zfs doesn't allow data movement between filesystems within a pool.
>>
>> It's not ZFS that blocks this by design - it's the VFS framework.
>> vn_rename() has this piece:
>> [ ... ]
>> ZFS will never even see such a rename request.
>
> Is this behaviour defined by a standard (such as POSIX or the
> VFS design) or are we free to innovate here and do something
> that allowed such a shortcut as required?
>
> Although I'm not sure the effort required would be worth the
> added complexity (to VFS and ZFS) for such a minor "feature".
>
> Darren

Hi Darren,

I don't think the standards would prevent us from adding "cross-fs rename"
capabilities. It's beyond the standards as of now, and I'd expect that
were it ever added to that, it'd be an optional feature as well, to be
queried for via e.g. pathconf().

The VFS design/framework is "ours" - the OpenSolaris community is free to
innovate there and change it as desired. It's not on the stability level
of the DDI. You can't revamp it at a whim, but you can change/extend it.

Precedent exists for things that FS X can do but FS Y cannot, and
changing the framework to check "does this fs claim to support cross-fs
rename ?" wouldn't be too hard.

A filesystem could advertise that e.g. via a VFSSW capabilities flag (the
VSW_* stuff from <sys/vfs.h>), or via VFS features (VFSFT_*, again see
<sys/vfs.h>, this is relatively recent, got added by the CIFS projects).

I don't know enough about ZFS internals to help you code the backend
support, but if you wish to work on it, I'd be happy to help you with the
framework changes. Those won't be more than ~50 lines.

"Minor feature" ? I guess that depends how you look at it. It would be
another thing that highlights what no one else but ZFS can do for you.
Who knows what users will do with it in ten years :)

FrankH.

Frank Hofmann

2007-Dec-28 10:20 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Fri, 28 Dec 2007, Darren Reed wrote:
[ ... ]
> Is this behaviour defined by a standard (such as POSIX or the
> VFS design) or are we free to innovate here and do something
> that allowed such a shortcut as required?
Wrt. to standards, quote from:

 	http://www.opengroup.org/onlinepubs/009695399/functions/rename.html

 	ERRORS
 	The rename() function shall fail if:
[ ... ]
 	[EXDEV]
 	[CX]  The links named by new and old are on different file systems and the
 	implementation does not support links between file systems.

Hence, it's implementation-dependent, as per IEEE1003.1.

FrankH.

Casper.Dik at Sun.COM

2007-Dec-28 11:38 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

>On Fri, 28 Dec 2007, Darren Reed wrote:
>[ ... ]
>> Is this behaviour defined by a standard (such as POSIX or the
>> VFS design) or are we free to innovate here and do something
>> that allowed such a shortcut as required?
>
>Wrt. to standards, quote from:
>
> 	http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>
> 	ERRORS
> 	The rename() function shall fail if:
>[ ... ]
> 	[EXDEV]
> 	[CX]  The links named by new and old are on different file systems and the
> 	implementation does not support links between file systems.
>
>Hence, it's implementation-dependent, as per IEEE1003.1.

So it can be transparently solved in the VFS layer (also nice for moving
files back from cloned snapshots and such).

Casper

Joerg Schilling

2007-Dec-28 13:42 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Darren Reed <Darren.Reed at Sun.COM> wrote:
> >         if (fromvp != tovp) {
> >                 vattr.va_mask = AT_FSID;
> >                 if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
> >                         goto out;
> >                 fsid = vattr.va_fsid;
> >                 vattr.va_mask = AT_FSID;
> >                 if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
> >                         goto out;
> >                 if (fsid != vattr.va_fsid) {
> >                         error = EXDEV;
> >                         goto out;
> >                 }
> >         }
> >
> > ZFS will never even see such a rename request.
>
> Is this behaviour defined by a standard (such as POSIX or the
> VFS design) or are we free to innovate here and do something
> that allowed such a shortcut as required?

EXDEV means "cross-device link", not "cross-filesystem link".
A ZFS pool acts as the underlying "storage device", so everything that
is within a single ZFS pool may be a candidate for a rename.

POSIX grants that st_dev and st_ino together uniquely identify a file
on a system. As long as neither st_dev nor st_ino change during the
rename(2) call, POSIX does not prevent this rename operation.

Jörg

Joerg Schilling

2007-Dec-28 13:48 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:
> I don't think the standards would prevent us from adding "cross-fs rename"
> capabilities. It's beyond the standards as of now, and I'd expect that
> were it ever added to that, it'd be an optional feature as well, to be
> queried for via e.g. pathconf().

Why do you believe there is a need for a pathconf() call?
Either rename(2) succeeds or it fails with a cross-device error.

Jörg

Frank Hofmann

2007-Dec-28 14:02 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Fri, 28 Dec 2007, Joerg Schilling wrote:
> Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:
>
>> I don't think the standards would prevent us from adding "cross-fs rename"
>> capabilities. It's beyond the standards as of now, and I'd expect that
>> were it ever added to that, it'd be an optional feature as well, to be
>> queried for via e.g. pathconf().
>
> Why do you believe there is a need for a pathconf() call?
> Either rename(2) succeeds or it fails with a cross-device error.

Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let
such requests fail with ENAMETOOLONG.

Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.

There's no _need_. But the convenience exists for others as well.


FrankH.
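
For reference, the existing queries Frank mentions look like this from user space. This is a minimal sketch: show_limit() is a made-up helper, and no selector for cross-fs rename exists in any standard, so a hypothetical name for one is deliberately left out of the code.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Query one pathconf(2) variable for a path and print it.
 * Returns the value, or -1 if the query failed or no limit is defined.
 */
long show_limit(const char *path, int name, const char *label)
{
    assert(path != NULL && label != NULL);
    errno = 0;                          /* lets us tell "error" from "no limit" */
    long v = pathconf(path, name);
    if (v == -1 && errno != 0)
        printf("%s: query failed\n", label);
    else if (v == -1)
        printf("%s: no fixed limit\n", label);
    else
        printf("%s: %ld\n", label, v);
    return v;
}
```

A caller would use it as show_limit(".", _PC_NAME_MAX, "NAME_MAX") or show_limit(".", _PC_FILESIZEBITS, "FILESIZEBITS"); a capability query for cross-fs rename would presumably take the same form if one were ever defined.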

Joerg Schilling

2007-Dec-28 14:09 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:
> > Why do you believe there is a need for a pathconf() call?
> > Either rename(2) succeeds or it fails with a cross-device error.
>
> Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let
> such requests fail with ENAMETOOLONG.
>
> Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.
>
> There's no _need_. But the convenience exists for others as well.

What kind of call do you propose?

Jörg

Frank Hofmann

2007-Dec-28 14:10 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Fri, 28 Dec 2007, Joerg Schilling wrote:
[ ... ]
> POSIX grants that st_dev and st_ino together uniquely identify a file
> on a system. As long as neither st_dev nor st_ino change during the
> rename(2) call, POSIX does not prevent this rename operation.

Clarification request: Where's the piece in the standard that forces an
interpretation:

 	"rename() operations shall not change st_ino/st_dev"

I don't see where such a requirement would come from.


FrankH.

Joerg Schilling

2007-Dec-28 14:21 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:
> On Fri, 28 Dec 2007, Joerg Schilling wrote:
> [ ... ]
> > POSIX grants that st_dev and st_ino together uniquely identify a file
> > on a system. As long as neither st_dev nor st_ino change during the
> > rename(2) call, POSIX does not prevent this rename operation.
>
> Clarification request: Where's the piece in the standard that forces an
> interpretation:
>
>  	"rename() operations shall not change st_ino/st_dev"
>
> I don't see where such a requirement would come from.

See: http://www.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html

"The st_ino and st_dev fields taken together uniquely identify the file within
the system."

The identity of an open file cannot change during the lifetime of a process.
Note that the renamed file may be open and the process may call fstat(2)
on the open file before and after the rename(2). As rename(2) does not change
the content of the file, it may only affect the time stamps of the file.

Note that some programs call stat/fstat on files in order to compare file
identities. What happens if program A calls stat("file1"), then program B
calls rename("file1", "file2") and after that, program A calls stat("file2")?
A POSIX compliant system will grant that stat("file1") and stat("file2") will
return the same st_dev/st_ino identity.

Jörg
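
The identity comparison Jörg describes can be tried out for the ordinary same-filesystem case. The sketch below is only an illustration of that scenario; identity_survives_rename() is a hypothetical name, and it says nothing about what a cross-filesystem rename would or should do.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates `from`, renames it to `to`, and compares st_dev/st_ino before
 * and after, both through the still-open descriptor and through the new
 * path. Returns 1 if the identity is unchanged, 0 if it changed, and
 * -1 on error.
 */
int identity_survives_rename(const char *from, const char *to)
{
    struct stat before, after_fd, after_path;
    int result = -1;

    assert(from != NULL && to != NULL);
    int fd = open(from, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) == 0 &&
        rename(from, to) == 0 &&
        fstat(fd, &after_fd) == 0 &&
        stat(to, &after_path) == 0) {
        result = before.st_dev == after_fd.st_dev &&
                 before.st_ino == after_fd.st_ino &&
                 before.st_dev == after_path.st_dev &&
                 before.st_ino == after_path.st_ino;
    }
    close(fd);
    unlink(from);                       /* no-op if the rename succeeded */
    unlink(to);
    return result;
}
```

Programs that rely on this check are the ones Jörg is worried about: they would see a different st_dev if a rename were allowed to cross filesystems.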

Carson Gaspar

2007-Dec-29 07:08 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank Hofmann wrote:
> On Fri, 28 Dec 2007, Joerg Schilling wrote:
...
>> Why do you believe there is a need for a pathconf() call?
>> Either rename(2) succeeds or it fails with a cross-device error.
>
> Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let
> such requests fail with ENAMETOOLONG.
>
> Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.
>
> There's no _need_. But the convenience exists for others as well.

Because those 2 involve variable types and/or buffer allocations, so
knowing them in advance is a major advantage. rename will either succeed
or fail.

-- 
Carson

Jonathan Loran

2007-Dec-29 07:33 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Joerg Schilling wrote:
> See: http://www.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html
>
> "The st_ino and st_dev fields taken together uniquely identify the file within
> the system."
>
> The identity of an open file cannot change during the lifetime of a process.
> Note that the renamed file may be open and the process may call fstat(2)
> on the open file before and after the rename(2). As rename(2) does not change
> the content of the file, it may only affect the time stamps of the file.
>
> Note that some programs call stat/fstat on files in order to compare file
> identities. What happens if program A calls stat("file1"), then program B
> calls rename("file1", "file2") and after that, program A calls stat("file2").
> A POSIX compliant system will grant that stat("file1") and stat("file2") will
> return the same st_dev/st_ino identity.

And consider what happens if the originating zfs is exported via NFS,
and the destination isn't.  If an NFS client has the subject file open,
we need to ensure the correct behavior after the move.

The Unix file system behavior (Sorry, don't have references to POSIX or
RFCs here, just 25 years of experience..) would be that if a file is
moved between file systems, it is removed from the source, yet the file
storage will continue to exist until the last process which has this
file open closes.  In effect, this means the file in the old location
(file system) should continue to exist indefinitely, if it is open by a
long running process.  I fear if we aren't careful, we will introduce a
boat load of bugs.

Hey, here's an idea:  We snapshot the file as it exists at the time of
the mv in the old file system until all referring file handles are
closed, then destroy the single file snap.  I know, not easy to
implement, but that is the correct behavior, I believe.

All this said, I would love to have this "feature" introduced.  Moving
large file stores between zfs file systems would be so handy!  From my
own sloppiness, I've suffered dearly from the lack of it.

Jon

-- 

-     _____/     _____/      /           - Jonathan Loran -           -
-    /          /           /                IT Manager               -
-  _____  /   _____  /     /     Space Sciences Laboratory, UC Berkeley
-        /          /     /      (510) 643-5146 jloran at ssl.berkeley.edu
- ______/    ______/    ______/           AST:7731^29u18e3
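
The traditional behaviour Jon describes, where storage outlives the name as long as a descriptor is open, can be observed with a few calls. This is a minimal sketch assuming a local filesystem (NFS clients may report link counts differently); data_survives_unlink() is a hypothetical name.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Writes a short message, unlinks the file while it is still open, then
 * reads the message back through the descriptor. Returns 1 if the data
 * is intact and st_nlink has dropped to 0, 0 if not, and -1 on error.
 */
int data_survives_unlink(const char *path)
{
    static const char msg[] = "still here";
    char buf[sizeof msg] = {0};
    struct stat st;
    int result = -1;

    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, msg, sizeof msg) == (ssize_t)sizeof msg &&
        unlink(path) == 0 &&
        fstat(fd, &st) == 0 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, sizeof buf) == (ssize_t)sizeof buf) {
        result = st.st_nlink == 0 && memcmp(buf, msg, sizeof msg) == 0;
    }
    close(fd);                          /* storage is released only here */
    unlink(path);                       /* cleanup if we failed before unlink */
    return result;
}
```

Any block-pointer shortcut for a cross-dataset move would have to keep a process in this position working, which is the emulation problem Jon raises.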

Jonathan Edwards

2007-Dec-29 14:37 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Dec 29, 2007, at 2:33 AM, Jonathan Loran wrote:
> Hey, here's an idea:  We snapshot the file as it exists at the time of
> the mv in the old file system until all referring file handles are
> closed, then destroy the single file snap.  I know, not easy to
> implement, but that is the correct behavior, I believe.
>
> All this said, I would love to have this "feature" introduced.  Moving
> large file stores between zfs file systems would be so handy!  From my
> own sloppiness, I've suffered dearly from the lack of it.

since in the current implementation a mv between filesystems would
have to assign new st_ino values (fsids in NFS should also be
different), all you should need to do is assign new block pointers in
the new side of the filesystem .. that would also be handy for cp as
well

---
.je

Joerg Schilling

2007-Dec-29 15:11 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Jonathan Edwards <Jonathan.Edwards at Sun.COM> wrote:
>
> On Dec 29, 2007, at 2:33 AM, Jonathan Loran wrote:
>
> > Hey, here's an idea:  We snapshot the file as it exists at the time of
> > the mv in the old file system until all referring file handles are
> > closed, then destroy the single file snap.  I know, not easy to
> > implement, but that is the correct behavior, I believe.
> >
> > All this said, I would love to have this "feature" introduced.  Moving
> > large file stores between zfs file systems would be so handy!  From my
> > own sloppiness, I've suffered dearly from the lack of it.
>
> since in the current implementation a mv between filesystems would
> have to assign new st_ino values (fsids in NFS should also be
> different), all you should need to do is assign new block pointers in
> the new side of the filesystem .. that would also be handy for cp as
> well

If the rename would keep the blocks from the old file for the new name
then the new file would inherit the identity of the old file.

If you did implement the rename in a way that would cause new values for
st_dev/st_ino to be returned from a fstat(2) call then this could confuse
programs.

If you instead set st_nlink for the open file to 0, then this would be OK
from the viewpoint of the old file but not be OK from the view to the whole
system. How would you implement writes into the open fd from the old name?

Jörg

Jonathan Loran

2007-Dec-30 08:01 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Joerg Schilling wrote:
> Jonathan Edwards <Jonathan.Edwards at Sun.COM> wrote:
>
>> since in the current implementation a mv between filesystems would
>> have to assign new st_ino values (fsids in NFS should also be
>> different), all you should need to do is assign new block pointers in
>> the new side of the filesystem .. that would also be handy for cp as
>> well
>
> If the rename would keep the blocks from the old file for the new name
> then the new file would inherit the identity of the old file.
>
> If you did implement the rename in a way that would cause new values for
> st_dev/st_ino to be returned from a fstat(2) call then this could confuse
> programs.
>
> If you instead set st_nlink for the open file to 0, then this would be OK
> from the viewpoint of the old file but not be OK from the view to the whole
> system. How would you implement writes into the open fd from the old name?
>
> Jörg

More concise way of putting what I'm saying.  Traditional mv between two
fs will create two copies of the data, if the source file is open.  At a
minimum, this will have to be emulated or things will break.  Since zfs
file systems are really different Unix file systems, we have to deal
with the semantics.  It's not just a path change as in a directory mv.

Jon

Darren Reed

2007-Dec-31 08:20 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank Hofmann wrote:
> On Fri, 28 Dec 2007, Darren Reed wrote:
> [ ... ]
>> Is this behaviour defined by a standard (such as POSIX or the
>> VFS design) or are we free to innovate here and do something
>> that allowed such a shortcut as required?
>
> Wrt. to standards, quote from:
>
>     http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>
>     ERRORS
>     The rename() function shall fail if:
> [ ... ]
>     [EXDEV]
>     [CX]  The links named by new and old are on different file systems and the
>     implementation does not support links between file systems.
>
> Hence, it's implementation-dependent, as per IEEE1003.1.

This implies that we'd also have to look at allowing
link(2) to also function between filesystems where
rename(2) was going to work without doing a copy,
correct?  Which I suppose makes sense.

Darren

Frank.Hofmann at Sun.COM

2007-Dec-31 08:41 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Mon, 31 Dec 2007, Darren Reed wrote:
> Frank Hofmann wrote:
>> On Fri, 28 Dec 2007, Darren Reed wrote:
>> [ ... ]
>>> Is this behaviour defined by a standard (such as POSIX or the
>>> VFS design) or are we free to innovate here and do something
>>> that allowed such a shortcut as required?
>>
>> Wrt. to standards, quote from:
>>
>>     http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>>
>>     ERRORS
>>     The rename() function shall fail if:
>> [ ... ]
>>     [EXDEV]
>>     [CX]  The links named by new and old are on different file systems and the
>>     implementation does not support links between file systems.
>>
>> Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing
> link(2) to also function between filesystems where
> rename(2) was going to work without doing a copy,
> correct?  Which I suppose makes sense.

Copy-on-write. rename() is just defined as an "atomic" sequence of:

 	link(old, new);
 	unlink(old);

If cross-fs rename is possible, then cross-fs link is as well. It's
"per-file clone".

Btw, Joerg, this addresses the concern you had in any case. It's cross-fs,
that means st_dev/st_ino _WILL_ change. Persistence of open files is not
related to that. If you hold a file open, the st_dev/st_ino associated
with the open fd will stay around and continue to be accessible with
fstat() - but not necessarily with stat(). It definitely would not be in
case the file got removed. That cross-fs rename would, on the source fs,
remove the file is, for all I can see, not violating anything.
The location of the file's data is _NOT_ the only way to derive a unique
st_dev/st_ino pair.
rename() _within_ a filesystem (as defined by the set of nodes with a
common st_dev) should preserve st_ino if the fs supports link counts
larger than one, agreed. But let's not confuse this with cross-fs rename,
where by definition (cross-fs) st_dev must change. The identity of that
file, therefore, has changed.
We're just in the happy situation with ZFS that the storage low-level
implementation can know that the contents haven't.

That's a sad situation for backup utilities, by the way - a backup tool
would have no way of finding out that file X on fs A already existed as
file Z on fs B. So what ? If the file got copied, byte by byte, the same
situation exists, the contents are identical. I don't think just because
this makes backups slower than they could be if the backup utility were
omniscient, that makes a reason to slow file copy/rename operations down.

Happy new year !
FrankH.
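
The link()/unlink() pairing Frank refers to can be written out for the same-filesystem case. This is only a sketch of the idea: unlike the real rename(2) it is not atomic, it fails with EEXIST when the target already exists, and it does not handle directories. Both function names are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Non-atomic stand-in for rename(2): add the new name, drop the old one. */
int rename_via_link(const char *old_name, const char *new_name)
{
    assert(old_name != NULL && new_name != NULL);
    if (link(old_name, new_name) != 0)
        return -1;                      /* errno from link(2), e.g. EXDEV */
    return unlink(old_name);
}

/*
 * Creates old_name, moves it with rename_via_link(), and reports whether
 * st_dev/st_ino stayed the same: 1 if unchanged, 0 if changed, -1 on error.
 */
int inode_kept_by_link_rename(const char *old_name, const char *new_name)
{
    struct stat before, after;
    int result = -1;

    int fd = open(old_name, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    if (stat(old_name, &before) == 0 &&
        rename_via_link(old_name, new_name) == 0 &&
        stat(new_name, &after) == 0)
        result = before.st_dev == after.st_dev &&
                 before.st_ino == after.st_ino;
    unlink(old_name);                   /* no-op after a successful move */
    unlink(new_name);
    return result;
}
```

Within one filesystem the inode is shared by both names, so the identity is preserved; across filesystems link(2) currently fails with EXDEV, which is the case the thread is debating.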

Joerg Schilling

2007-Dec-31 12:23 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Darren Reed <Darren.Reed at Sun.COM> wrote:
> > Wrt. to standards, quote from:
> >
> >     http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
> >
> >     ERRORS
> >     The rename() function shall fail if:
> > [ ... ]
> >     [EXDEV]
> >     [CX]  The links named by new and old are on different file systems and the
> >     implementation does not support links between file systems.
> >
> > Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing
> link(2) to also function between filesystems where
> rename(2) was going to work without doing a copy,
> correct?  Which I suppose makes sense.

Thank you for mentioning this.

This brings us closer to the demand that st_dev/st_ino need to be
kept during a rename.

Basically, rename(2) was introduced to avoid needing a privileged
link(2)/unlink(2) on directories in order to rename directories. For this
reason, I would expect a rename(2) to work similarly to a link/unlink
chain.

Something that I should also mention is that there may be programs
that rename open files and assume traditional st_dev/st_ino semantics if
the rename succeeded.
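
As an illustration of the assumption such programs make, here is a minimal
sketch (not from the thread; the helper name rename_keeps_identity and the
scratch file names are made up for this example). It keeps a file open,
renames it within one directory, and checks that fstat(2) on the descriptor
and stat(2) on the new name still report the same st_dev/st_ino. It only
exercises today's same-filesystem rename(2); it says nothing about what a
cross-dataset rename would do.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates a scratch file in dir, keeps it open,
 * renames it inside dir, then compares st_dev/st_ino as seen through the
 * open descriptor and through the new name.
 * Returns 1 if the identity was kept, 0 if it changed, -1 on error.
 */
int
rename_keeps_identity(const char *dir)
{
	char oldp[512], newp[512];
	struct stat before, after_fd, after_path;
	int fd, ok;

	assert(dir != NULL);
	(void) snprintf(oldp, sizeof (oldp), "%s/idtest.%ld.old", dir,
	    (long)getpid());
	(void) snprintf(newp, sizeof (newp), "%s/idtest.%ld.new", dir,
	    (long)getpid());

	fd = open(oldp, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0)
		return (-1);
	if (fstat(fd, &before) != 0 || rename(oldp, newp) != 0) {
		(void) close(fd);
		(void) unlink(oldp);
		return (-1);
	}
	if (fstat(fd, &after_fd) != 0 || stat(newp, &after_path) != 0) {
		(void) close(fd);
		(void) unlink(newp);
		return (-1);
	}
	/* Same device and inode before and after, via fd and via path. */
	ok = before.st_dev == after_fd.st_dev &&
	    before.st_ino == after_fd.st_ino &&
	    after_fd.st_dev == after_path.st_dev &&
	    after_fd.st_ino == after_path.st_ino;
	(void) close(fd);
	(void) unlink(newp);
	return (ok);
}
```

A program that caches the (st_dev, st_ino) pair of an open file relies on
exactly this check continuing to hold after a successful rename(2).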

Jörg

-- 
 EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                (uni)  
       schilling at fokus.fraunhofer.de     (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily

Darren Reed

2008-Jan-02 03:46 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Frank.Hofmann at Sun.COM wrote:
> ...
> That's a sad situation for backup utilities, by the way - a backup
> tool would have no way of finding out that file X on fs A already
> existed as file Z on fs B. So what? If the file got copied, byte by
> byte, the same situation exists, the contents are identical. I don't
> think just because this makes backups slower than they could be if the
> backup utility were omniscient, that makes a reason to slow file
> copy/rename operations down.

I don't see this as being a problem at all.

This idea is aimed at being a filesystem performance optimisation,
not a backup optimisation.

Darren

Wee Yeh Tan

2008-Jan-02 09:10 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Jan 2, 2008 11:46 AM, Darren Reed <Darren.Reed at sun.com> wrote:
> Frank.Hofmann at Sun.COM wrote:
> > ...
> > That's a sad situation for backup utilities, by the way - a backup
> > tool would have no way of finding out that file X on fs A already
> > existed as file Z on fs B. So what? If the file got copied, byte by
> > byte, the same situation exists, the contents are identical. I don't
> > think just because this makes backups slower than they could be if the
> > backup utility were omniscient, that makes a reason to slow file
> > copy/rename operations down.
>
> I don't see this as being a problem at all.
>
> This idea is aimed at being a filesystem performance optimisation,
> not a backup optimisation.

Anyway, the same "problem" already exists with cloned filesystems.


-- 
Just me,
Wire ...
Blog: <prstat.blogspot.com>

Nicolas Williams

2008-Jan-02 16:24 UTC

[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)

On Mon, Dec 31, 2007 at 07:20:30PM +1100, Darren Reed wrote:
> Frank Hofmann wrote:
> >     http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
> >
> >     ERRORS
> >     The rename() function shall fail if:
> > [ ... ]
> >     [EXDEV]
> >     [CX]  The links named by new and old are on different file systems
> >     and the implementation does not support links between file systems.
> >
> > Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing
> link(2) to also function between filesystems where
> rename(2) was going to work without doing a copy,
> correct?  Which I suppose makes sense.

If so then a cross-dataset rename(2) won't necessarily work.

link(2) preserves inode numbers.  mv(1) does not [when crossing
devices].  A cross-dataset rename(2) may not be able to preserve inode
numbers either (e.g., if the one at the source is already in use on the
target).
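
To make the inode-number point concrete, here is a small sketch (not from
the thread; link_shares_inode and the scratch file names are hypothetical).
It hard-links a file with link(2) and shows that both names report the same
st_dev/st_ino, while a separately created file in the same directory, which
is what a copy produces, gets its own inode number.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: creates file "a" in dir, hard-links it to "b" with
 * link(2), and creates an unrelated file "c" the way a copy would.
 * Returns 1 if a and b share st_dev/st_ino (with st_nlink == 2) while c
 * has a different st_ino; 0 if not; -1 on error.
 */
int
link_shares_inode(const char *dir)
{
	char a[512], b[512], c[512];
	struct stat sa, sb, sc;
	long pid = (long)getpid();
	int fd, rc = -1;

	assert(dir != NULL);
	(void) snprintf(a, sizeof (a), "%s/lnktest.%ld.a", dir, pid);
	(void) snprintf(b, sizeof (b), "%s/lnktest.%ld.b", dir, pid);
	(void) snprintf(c, sizeof (c), "%s/lnktest.%ld.c", dir, pid);

	if ((fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600)) < 0)
		return (-1);
	(void) close(fd);
	if ((fd = open(c, O_CREAT | O_EXCL | O_WRONLY, 0600)) >= 0) {
		(void) close(fd);
		if (link(a, b) == 0 && stat(a, &sa) == 0 &&
		    stat(b, &sb) == 0 && stat(c, &sc) == 0) {
			/* Two names, one inode; the new file has its own. */
			rc = sa.st_dev == sb.st_dev &&
			    sa.st_ino == sb.st_ino &&
			    sa.st_nlink == 2 && sc.st_ino != sa.st_ino;
		}
	}
	(void) unlink(a);
	(void) unlink(b);
	(void) unlink(c);
	return (rc);
}
```

The copy case is what mv(1) falls back to across devices, so the moved file
ends up with whatever inode number the target filesystem hands out.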

Nico
--

Nicolas Williams

2008-Jan-02 16:32 UTC

[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)

Oof, I see this has been discussed since (and, actually, IIRC it was
discussed a long time ago too).

Anyways, IMO, this requires a new syscall or syscalls:

    xdevrename(2)
    xdevcopy(2)

and then mv(1) can do:

	if (rename(old, new) != 0) {
		if (xdevrename(old, new) != 0) {
			/* do a cp(1) instead */
			return (do_cp(old, new));
		}
		return (0);
	}
	return (0);

cp(1), and maybe ln(1), could do something similar using xdevcopy(2).
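
For comparison, the fallback that mv(1) already performs today (the truss
output at the start of the thread shows rename failing with EXDEV followed
by a read/write copy) can be sketched roughly as below. This is only an
illustration: copy_file and move_file are made-up names, the proposed
xdevrename(2) does not exist and is therefore left out, and a real mv(1)
would also carry over mode, ownership and timestamps.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copies src to dst byte for byte; returns 0 on success, -1 on error. */
int
copy_file(const char *src, const char *dst)
{
	char buf[65536];
	ssize_t n = 0;
	int in, out, rc = 0;

	if ((in = open(src, O_RDONLY)) < 0)
		return (-1);
	if ((out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600)) < 0) {
		(void) close(in);
		return (-1);
	}
	while (rc == 0 && (n = read(in, buf, sizeof (buf))) > 0) {
		ssize_t off = 0;

		/* write(2) may be partial, so loop until the chunk is out. */
		while (off < n) {
			ssize_t w = write(out, buf + off, (size_t)(n - off));
			if (w < 0) {
				rc = -1;
				break;
			}
			off += w;
		}
	}
	if (n < 0)
		rc = -1;
	(void) close(in);
	if (close(out) != 0)
		rc = -1;
	return (rc);
}

/*
 * Tries rename(2) first and only falls back to copy + unlink when the
 * failure is EXDEV, i.e. the two names are on different filesystems.
 */
int
move_file(const char *oldp, const char *newp)
{
	assert(oldp != NULL && newp != NULL);
	if (rename(oldp, newp) == 0)
		return (0);
	if (errno != EXDEV)
		return (-1);
	if (copy_file(oldp, newp) != 0)
		return (-1);
	return (unlink(oldp));
}
```

A new cross-dataset primitive would slot in between the EXDEV check and the
call to copy_file(), which is where the expensive data copy happens.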

Nico
--

Wee Yeh Tan

2008-Jan-03 01:06 UTC

[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)

On Jan 3, 2008 12:32 AM, Nicolas Williams <Nicolas.Williams at sun.com> wrote:
> Oof, I see this has been discussed since (and, actually, IIRC it was
> discussed a long time ago too).
>
> Anyways, IMO, this requires a new syscall or syscalls:
>
>     xdevrename(2)
>     xdevcopy(2)
>
> and then mv(1) can do:
>
>         if (rename(old, new) != 0) {
>                 if (xdevrename(old, new) != 0) {
>                         /* do a cp(1) instead */
>                         return (do_cp(old, new));
>                 }
>                 return (0);
>         }
>         return (0);
>
> cp(1), and maybe ln(1), could do something similar using xdevcopy(2).

Could it be cleaner to do that within vn_renameat() instead?  This
would save creating a new syscall and updating quite a number of
utilities.


-- 
Just me,
Wire ...
Blog: <prstat.blogspot.com>

Darren Reed

2008-Jan-03 03:09 UTC

[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)

Nicolas Williams wrote:
> On Mon, Dec 31, 2007 at 07:20:30PM +1100, Darren Reed wrote:
>> Frank Hofmann wrote:
>>>     http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>>>
>>>     ERRORS
>>>     The rename() function shall fail if:
>>> [ ... ]
>>>     [EXDEV]
>>>     [CX]  The links named by new and old are on different file systems
>>>     and the implementation does not support links between file systems.
>>>
>>> Hence, it's implementation-dependent, as per IEEE1003.1.
>>
>> This implies that we'd also have to look at allowing
>> link(2) to also function between filesystems where
>> rename(2) was going to work without doing a copy,
>> correct?  Which I suppose makes sense.
>
> If so then a cross-dataset rename(2) won't necessarily work.
>
> link(2) preserves inode numbers.  mv(1) does not [when crossing
> devices].  A cross-dataset rename(2) may not be able to preserve inode
> numbers either (e.g., if the one at the source is already in use on the
> target).

Unless POSIX or similar says the preservation of inode numbers is required,
I can't see why that is important.

Darren

Carsten Bormann

2008-Jan-03 08:29 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

On Dec 29 2007, at 08:33, Jonathan Loran wrote:
> We snapshot the file as it exists at the time of
> the mv in the old file system until all referring file handles are
> closed, then destroy the single file snap.  I know, not easy to
> implement, but that is the correct behavior, I believe.

Exactly.

Note that apart from open descriptors, there may be other links to the
file on the old FS; it has to be clear whether writes to the file in
the new FS change the file in the old FS or not.  I'd rather say they
shouldn't.
Yes, this would be different from the normal rename(2) semantics with
respect to multiply linked files.  And yes, the semantics of link(2)
should also be consistent with this.

Gruesse, Carsten

Joerg Schilling

2008-Jan-03 08:47 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Carsten Bormann <cabocabo at gmail.com> wrote:
> On Dec 29 2007, at 08:33, Jonathan Loran wrote:
>
> > We snapshot the file as it exists at the time of
> > the mv in the old file system until all referring file handles are
> > closed, then destroy the single file snap.  I know, not easy to
> > implement, but that is the correct behavior, I believe.
>
> Exactly.
>
> Note that apart from open descriptors, there may be other links to the
> file on the old FS; it has to be clear whether writes to the file in
> the new FS change the file in the old FS or not.  I'd rather say they
> shouldn't.
> Yes, this would be different from the normal rename(2) semantics with
> respect to multiply linked files.  And yes, the semantics of link(2)
> should also be consistent with this.

This is an interesting problem. Your proposal would imply that a file
may have different identities in different filesystems:

-	different st_dev
-	different st_ino
-	different link count

This cannot be implemented with a single "inode data" anymore.

Well, it is not impossible, as my WOFS (mentioned before) implements
hardlinks via "inode relative symlinks". In order to allow this, a file
would need a storage-pool-global serial number that allows matching the
different inode sets for the file.

Jörg

-- 
 EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                (uni)  
       schilling at fokus.fraunhofer.de     (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily

Jonathan Loran

2008-Jan-03 22:55 UTC

[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool

Joerg Schilling wrote:
> Carsten Bormann <cabocabo at gmail.com> wrote:
>> On Dec 29 2007, at 08:33, Jonathan Loran wrote:
>>> We snapshot the file as it exists at the time of
>>> the mv in the old file system until all referring file handles are
>>> closed, then destroy the single file snap.  I know, not easy to
>>> implement, but that is the correct behavior, I believe.
>>
>> Exactly.
>>
>> Note that apart from open descriptors, there may be other links to the
>> file on the old FS; it has to be clear whether writes to the file in
>> the new FS change the file in the old FS or not.  I'd rather say they
>> shouldn't.
>> Yes, this would be different from the normal rename(2) semantics with
>> respect to multiply linked files.  And yes, the semantics of link(2)
>> should also be consistent with this.
>
> This is an interesting problem. Your proposal would imply that a file
> may have different identities in different filesystems:
>
> -	different st_dev
> -	different st_ino
> -	different link count
>
> This cannot be implemented with a single "inode data" anymore.
>
> Well, it is not impossible, as my WOFS (mentioned before) implements
> hardlinks via "inode relative symlinks". In order to allow this, a file
> would need a storage-pool-global serial number that allows matching the
> different inode sets for the file.
>
> Jörg

At first, as I mentioned in my earlier email, I was thinking we needed
to emulate the cross-fs rename/link/etc. behavior as it is currently
implemented, where a file appears to actually be copied.  But now I'm
not so sure.

In Unixland, the ideal has always been to have the whole file system,
kit and caboodle, singly rooted at /.  Heck, even devices are in the
file system.  Of course, reality has required that, programmatically, we
be aware of which file system our cwd is in.  At a minimum,
it's returned in our various stat structs (st_dev).

I can see I'm getting long-winded, but I'm thinking: what is the value
of having different behavior for a cross-zfs file move within the same
pool than for a move between directories?  I'm not addressing the previous
discussion about how to treat file handles, etc., but more about sharing
open file blocks, linked across zfs boundaries before and after such a mv.

I think the test is this: can we find a scenario where something would
break if we did share the file blocks across zfs boundaries after such a
mv?  For every example I've been able to think of, if I ask the
question "what if I moved the file from one directory to the other,
instead of across zfs boundaries, would it have been different?", the
answer has been no.  Comments please.

Jon

-- 

-     _____/     _____/      /           - Jonathan Loran -           -
-    /          /           /                IT Manager               -
-  _____  /   _____  /     /     Space Sciences Laboratory, UC Berkeley
-        /          /     /      (510) 643-5146 jloran at ssl.berkeley.edu
- ______/    ______/    ______/           AST:7731^29u18e3
