Darren Reed
2007-Dec-27 04:58 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Having just done a largish mv from one ZFS filesystem to another ZFS filesystem in the same zpool, I was somewhat surprised at how long it took - I was expecting it to be near instant like it would be within the same filesystem.

Are there optimisations possible here? Surely it should be possible to just change some block pointers and do a little bit of accounting? Or is the amount of work required only really beneficial for large files (hundreds of MB or GB)?

e.g. mv will currently do this:

    ...
    lstat64("/biscuit/bfu//hda4.dump", 0x08066370) Err#2 ENOENT
    rename("hda4.dump", "/biscuit/bfu//hda4.dump") Err#18 EXDEV
    open64("hda4.dump", O_RDONLY) = 3
    creat64("/biscuit/bfu//hda4.dump", 0644) = 4
    stat64("/biscuit/bfu//hda4.dump", 0x08066370) = 0
    chmod("/biscuit/bfu//hda4.dump", 0100644) = 0
    fstat64(3, 0x080662E0) = 0
    mmap64(0x00000000, 8388608, PROT_READ, MAP_SHARED, 3, 0) = 0xFE600000
    write(4, "01\0\0\0 {1CDF ?\0\0\0\0".., 8388608) = 8388608
    mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 8388608) = 0xFE600000
    write(4, "11 603\0\f\001\0 .\0 d 1".., 8388608) = 8388608
    mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x01000000) = 0xFE600000
    write(4, "\v998B\t96CBD2 X * ;E7ED".., 8388608) = 8388608
    ...

There are likely other factors to consider (such as encryption being on/off, etc.) but if all of the properties are the same for a given pair of ZFS filesystems between which a file is being moved, surely it should be possible to take some nice shortcuts?

RFE worthy?

Darren
Joerg Schilling
2007-Dec-27 12:00 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Darren Reed <Darren.Reed at Sun.COM> wrote:

> Having just done a largish mv from one ZFS filesystem to another ZFS filesystem in the same zpool, I was somewhat surprised at how long it took - I was expecting it to be near instant like it would be within the same filesystem.

I would guess that this is caused by different st_dev values in the new filesystem. In such a case, mv copies the files instead of renaming them.

Jörg

--
EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de (uni)
       schilling at fokus.fraunhofer.de (work)
Blog:  http://schily.blogspot.com/
URL:   http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
Casper.Dik at Sun.COM
2007-Dec-27 15:28 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
> I would guess that this is caused by different st_dev values in the new filesystem. In such a case, mv copies the files instead of renaming them.

No, it's because they are different filesystems and the data needs to be copied; zfs doesn't allow data movement between filesystems within a pool.

The code inside "mv" would immediately support such renames as it *first* checks whether rename works and only then will it try "plan B":

    if (rename(source, target) >= 0)
            return (0);
    if (errno != EXDEV) {
            /* fatal errors */
    }
    ... continue with plan B: copy & remove ...

Casper
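The "plan B" logic Casper outlines can be sketched in a self-contained form. This is only an illustration, not the actual mv(1) source: the helper names move_file() and copy_and_remove() are made up for this sketch, and a real mv also preserves ownership, timestamps, ACLs and handles directories, all of which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>      /* rename() */
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: copy src to dst byte by byte, then remove src. */
static int copy_and_remove(const char *src, const char *dst)
{
    char buf[65536];
    struct stat st;
    ssize_t nread;
    int in, out;

    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    if (fstat(in, &st) != 0) {
        close(in);
        return -1;
    }
    out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 07777);
    if (out < 0) {
        close(in);
        return -1;
    }
    while ((nread = read(in, buf, sizeof buf)) > 0) {
        char *p = buf;
        ssize_t left = nread;

        /* write() may be partial, so loop until the chunk is out. */
        while (left > 0) {
            ssize_t nwritten = write(out, p, (size_t)left);
            if (nwritten < 0) {
                close(in);
                close(out);
                return -1;
            }
            p += nwritten;
            left -= nwritten;
        }
    }
    close(in);
    if (close(out) != 0 || nread < 0)
        return -1;
    return unlink(src);
}

/* Hypothetical helper: try rename(2) first, copy only on EXDEV. */
int move_file(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    if (rename(src, dst) == 0)
        return 0;               /* same filesystem: done */
    if (errno != EXDEV)
        return -1;              /* fatal error, errno says why */
    return copy_and_remove(src, dst);
}
```

Because the copy path is only entered on EXDEV, a kernel that started accepting cross-filesystem renames would make this code fast without any change to it, which is the point Casper is making.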
Frank.Hofmann at Sun.COM
2007-Dec-27 15:50 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Thu, 27 Dec 2007, Casper.Dik at Sun.COM wrote:

>> I would guess that this is caused by different st_dev values in the new filesystem. In such a case, mv copies the files instead of renaming them.
>
> No, it's because they are different filesystems and the data needs to be copied; zfs doesn't allow data movement between filesystems within a pool.

It's not ZFS that blocks this by design - it's the VFS framework. vn_rename() has this piece:

    /*
     * Make sure both the from vnode directory and the to directory
     * are in the same vfs and the to directory is writable.
     * We check fsid's, not vfs pointers, so loopback fs works.
     */
    if (fromvp != tovp) {
            vattr.va_mask = AT_FSID;
            if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
                    goto out;
            fsid = vattr.va_fsid;
            vattr.va_mask = AT_FSID;
            if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
                    goto out;
            if (fsid != vattr.va_fsid) {
                    error = EXDEV;
                    goto out;
            }
    }

ZFS will never even see such a rename request.

FrankH.

> The code inside "mv" would immediately support such renames as it *first* checks whether rename works and only then will it try "plan B":
>
>     if (rename(source, target) >= 0)
>             return (0);
>     if (errno != EXDEV) {
>             /* fatal errors */
>     }
>     ... continue with plan B: copy & remove ...
>
> Casper

------------------------------------------------------------------------------
No good can come from selling your freedom, not for all the gold in the world, for the value of this heavenly gift far exceeds that of any fortune on earth.
------------------------------------------------------------------------------
Darren Reed
2007-Dec-28 03:14 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank.Hofmann at Sun.COM wrote:

> On Thu, 27 Dec 2007, Casper.Dik at Sun.COM wrote:
>
>>> I would guess that this is caused by different st_dev values in the new filesystem. In such a case, mv copies the files instead of renaming them.
>>
>> No, it's because they are different filesystems and the data needs to be copied; zfs doesn't allow data movement between filesystems within a pool.
>
> It's not ZFS that blocks this by design - it's the VFS framework. vn_rename() has this piece:
>
>     /*
>      * Make sure both the from vnode directory and the to directory
>      * are in the same vfs and the to directory is writable.
>      * We check fsid's, not vfs pointers, so loopback fs works.
>      */
>     if (fromvp != tovp) {
>             vattr.va_mask = AT_FSID;
>             if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
>                     goto out;
>             fsid = vattr.va_fsid;
>             vattr.va_mask = AT_FSID;
>             if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
>                     goto out;
>             if (fsid != vattr.va_fsid) {
>                     error = EXDEV;
>                     goto out;
>             }
>     }
>
> ZFS will never even see such a rename request.

Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?

Although I'm not sure the effort required would be worth the added complexity (to VFS and ZFS) for such a minor "feature".

Darren
> such a minor "feature"

I don't think copying files is a minor feature. Doubly so since the words I've read from Sun suggest that ZFS "file systems" (or "data sets" or whatever they are called now) can be used in the way directories on a normal file system are used.
Frank Hofmann
2007-Dec-28 10:00 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Fri, 28 Dec 2007, Darren Reed wrote:

> Frank.Hofmann at Sun.COM wrote:
>> On Thu, 27 Dec 2007, Casper.Dik at Sun.COM wrote:
>>
>>>> I would guess that this is caused by different st_dev values in the new filesystem. In such a case, mv copies the files instead of renaming them.
>>>
>>> No, it's because they are different filesystems and the data needs to be copied; zfs doesn't allow data movement between filesystems within a pool.
>>
>> It's not ZFS that blocks this by design - it's the VFS framework. vn_rename() has this piece:
> [ ... ]
>> ZFS will never even see such a rename request.
>
> Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?
>
> Although I'm not sure the effort required would be worth the added complexity (to VFS and ZFS) for such a minor "feature".
>
> Darren

Hi Darren,

I don't think the standards would prevent us from adding "cross-fs rename" capabilities. It's beyond the standards as of now, and I'd expect that were it ever added to them, it'd be an optional feature as well, to be queried for via e.g. pathconf().

The VFS design/framework is "ours" - the OpenSolaris community is free to innovate there and change it as desired. It's not on the stability level of the DDI. You can't revamp it at a whim, but you can change/extend it.

Precedent exists for things that FS X can do but FS Y cannot, and changing the framework to check "does this fs claim to support cross-fs rename?" wouldn't be too hard. A filesystem could advertise that e.g. via a VFSSW capabilities flag (the VSW_* stuff from <sys/vfs.h>), or via VFS features (VFSFT_*, again see <sys/vfs.h>; this is relatively recent, it got added by the CIFS projects).

I don't know enough about ZFS internals to help you code the backend support, but if you wish to work on it, I'd be happy to help you with the framework changes. Those won't be more than ~50 lines.

"Minor feature"? I guess that depends how you look at it. It would be another thing that highlights what no one else but ZFS can do for you. Who knows what users will do with it in ten years :)

FrankH.
Frank Hofmann
2007-Dec-28 10:20 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Fri, 28 Dec 2007, Darren Reed wrote:
[ ... ]
> Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?

Wrt. standards, quote from:

http://www.opengroup.org/onlinepubs/009695399/functions/rename.html

    ERRORS
    The rename() function shall fail if:
    [ ... ]
    [EXDEV]
        [CX] The links named by new and old are on different file systems
        and the implementation does not support links between file systems.

Hence, it's implementation-dependent, as per IEEE1003.1.

FrankH.
Casper.Dik at Sun.COM
2007-Dec-28 11:38 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
> On Fri, 28 Dec 2007, Darren Reed wrote:
> [ ... ]
>> Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?
>
> Wrt. standards, quote from:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>
>     ERRORS
>     The rename() function shall fail if:
>     [ ... ]
>     [EXDEV]
>         [CX] The links named by new and old are on different file systems
>         and the implementation does not support links between file systems.
>
> Hence, it's implementation-dependent, as per IEEE1003.1.

So it can be transparently solved in the VFS layer (also nice for moving files back from cloned snapshots and such).

Casper
Joerg Schilling
2007-Dec-28 13:42 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Darren Reed <Darren.Reed at Sun.COM> wrote:

> >     if (fromvp != tovp) {
> >             vattr.va_mask = AT_FSID;
> >             if (error = VOP_GETATTR(fromvp, &vattr, 0, CRED(), NULL))
> >                     goto out;
> >             fsid = vattr.va_fsid;
> >             vattr.va_mask = AT_FSID;
> >             if (error = VOP_GETATTR(tovp, &vattr, 0, CRED(), NULL))
> >                     goto out;
> >             if (fsid != vattr.va_fsid) {
> >                     error = EXDEV;
> >                     goto out;
> >             }
> >     }
> >
> > ZFS will never even see such a rename request.
>
> Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?

EXDEV means "cross device link", not "cross filesystem link".

A ZFS pool acts as the underlying "storage device", so everything that is within a single ZFS pool may be a candidate for a rename.

POSIX guarantees that st_dev and st_ino together uniquely identify a file on a system. As long as neither st_dev nor st_ino change during the rename(2) call, POSIX does not prevent this rename operation.

Jörg
Joerg Schilling
2007-Dec-28 13:48 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:

> I don't think the standards would prevent us from adding "cross-fs rename" capabilities. It's beyond the standards as of now, and I'd expect that were it ever added to that it'd be an optional feature as well, to be queried for via e.g. pathconf().

Why do you believe there is a need for a pathconf() call? Either rename(2) succeeds or it fails with a cross-device error.

Jörg
Frank Hofmann
2007-Dec-28 14:02 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Fri, 28 Dec 2007, Joerg Schilling wrote:

> Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:
>
>> I don't think the standards would prevent us from adding "cross-fs rename" capabilities. It's beyond the standards as of now, and I'd expect that were it ever added to that it'd be an optional feature as well, to be queried for via e.g. pathconf().
>
> Why do you believe there is a need for a pathconf() call? Either rename(2) succeeds or it fails with a cross-device error.

Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let such requests fail with ENAMETOOLONG.

Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.

There's no _need_. But the convenience exists for others as well.

FrankH.
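For reference, the existing queries Frank mentions look roughly like this from a program's point of view. The sketch uses only the standard pathconf(3) selectors _PC_NAME_MAX and _PC_FILESIZEBITS; query_limit() is a name invented here, and no selector for cross-filesystem rename exists, so any such query would be a new, hypothetical addition.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper around pathconf(3).
 * Returns 0 and stores the limit in *value on success,
 * 1 if the limit is indeterminate (pathconf returned -1 without
 * touching errno), or -1 on a real error (errno is set).
 */
int query_limit(const char *path, int name, long *value)
{
    long v;

    assert(path != NULL && value != NULL);
    errno = 0;
    v = pathconf(path, name);
    if (v == -1)
        return (errno == 0) ? 1 : -1;
    *value = v;
    return 0;
}
```

A caller would use it as query_limit("/some/dir", _PC_NAME_MAX, &v) before building a file name, instead of waiting for ENAMETOOLONG; a cross-fs rename capability could be exposed the same way if the community chose to add a selector for it.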
Joerg Schilling
2007-Dec-28 14:09 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:

> > Why do you believe there is a need for a pathconf() call? Either rename(2) succeeds or it fails with a cross-device error.
>
> Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let such requests fail with ENAMETOOLONG.
>
> Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.
>
> There's no _need_. But the convenience exists for others as well.

What kind of call do you propose?

Jörg
Frank Hofmann
2007-Dec-28 14:10 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Fri, 28 Dec 2007, Joerg Schilling wrote:
[ ... ]
> POSIX guarantees that st_dev and st_ino together uniquely identify a file on a system. As long as neither st_dev nor st_ino change during the rename(2) call, POSIX does not prevent this rename operation.

Clarification request: Where's the piece in the standard that forces an interpretation:

    "rename() operations shall not change st_ino/st_dev"

I don't see where such a requirement would come from.

FrankH.
Joerg Schilling
2007-Dec-28 14:21 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank Hofmann <Frank.Hofmann at Sun.COM> wrote:

> On Fri, 28 Dec 2007, Joerg Schilling wrote:
> [ ... ]
> > POSIX guarantees that st_dev and st_ino together uniquely identify a file on a system. As long as neither st_dev nor st_ino change during the rename(2) call, POSIX does not prevent this rename operation.
>
> Clarification request: Where's the piece in the standard that forces an interpretation:
>
>     "rename() operations shall not change st_ino/st_dev"
>
> I don't see where such a requirement would come from.

See: http://www.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html

    "The st_ino and st_dev fields taken together uniquely identify the file within the system."

The identity of an open file cannot change during the lifetime of a process. Note that the renamed file may be open and the process may call fstat(2) on the open file before and after the rename(2). As rename(2) does not change the content of the file, it may only affect the time stamps of the file.

Note that some programs call stat/fstat on files in order to compare file identities. What happens if program A calls stat("file1"), then program B calls rename("file1", "file2") and after that, program A calls stat("file2")? A POSIX compliant system will guarantee that stat("file1") and stat("file2") return the same st_dev/st_ino identity.

Jörg
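The scenario Jörg describes can be checked with a few lines of C. This is a sketch only: identity_survives_rename() is a name made up for the example, and it exercises the ordinary same-filesystem rename(2) that exists today, where the identity is expected to be preserved; it says nothing about what a future cross-filesystem rename would do.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>      /* rename() */
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open old_path, rename it to new_path, then
 * compare st_dev/st_ino as seen before the rename (via the open fd),
 * after the rename (same fd), and after the rename (via the new name).
 * Returns 1 if all three agree, 0 if not, -1 on error.
 */
int identity_survives_rename(const char *old_path, const char *new_path)
{
    struct stat before, after_fd, after_path;
    int fd, same;

    assert(old_path != NULL && new_path != NULL);
    fd = open(old_path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0 ||
        rename(old_path, new_path) != 0 ||
        fstat(fd, &after_fd) != 0 ||
        stat(new_path, &after_path) != 0) {
        close(fd);
        return -1;
    }
    close(fd);
    same = before.st_dev == after_fd.st_dev &&
           before.st_ino == after_fd.st_ino &&
           before.st_dev == after_path.st_dev &&
           before.st_ino == after_path.st_ino;
    return same;
}
```

Programs that compare identities this way are the ones that could be surprised if a successful rename(2) started returning a different st_dev for the new name.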
Carson Gaspar
2007-Dec-29 07:08 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank Hofmann wrote:

> On Fri, 28 Dec 2007, Joerg Schilling wrote:
...
>> Why do you believe there is a need for a pathconf() call? Either rename(2) succeeds or it fails with a cross-device error.
>
> Why do you have a NAME_MAX / SYMLINK_MAX query - you can just as well let such requests fail with ENAMETOOLONG.
>
> Why do you have a FILESIZEBITS query - there's EOVERFLOW to tell you.
>
> There's no _need_. But the convenience exists for others as well.

Because those 2 involve variable types and/or buffer allocations, so knowing them in advance is a major advantage. rename will either succeed or fail.

--
Carson
Jonathan Loran
2007-Dec-29 07:33 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Joerg Schilling wrote:

> See: http://www.opengroup.org/onlinepubs/009695399/basedefs/sys/stat.h.html
>
>     "The st_ino and st_dev fields taken together uniquely identify the file within the system."
>
> The identity of an open file cannot change during the lifetime of a process. Note that the renamed file may be open and the process may call fstat(2) on the open file before and after the rename(2). As rename(2) does not change the content of the file, it may only affect the time stamps of the file.
>
> Note that some programs call stat/fstat on files in order to compare file identities. What happens if program A calls stat("file1"), then program B calls rename("file1", "file2") and after that, program A calls stat("file2"). A POSIX compliant system will guarantee that stat("file1") and stat("file2") will return the same st_dev/st_ino identity.

And consider what happens if the originating zfs is exported via NFS, and the destination isn't. If an NFS client has the subject file open, we need to ensure the correct behavior after the move. The Unix file system behavior (sorry, I don't have references to POSIX or RFCs here, just 25 years of experience) would be that if a file is moved between file systems, it is removed from the source, yet the file storage will continue to exist until the last process which has this file open closes it. In effect, this means the file in the old location (file system) should continue to exist indefinitely, if it is open by a long running process. I fear if we aren't careful, we will introduce a boat load of bugs.

Hey, here's an idea: We snapshot the file as it exists at the time of the mv in the old file system until all referring file handles are closed, then destroy the single file snap. I know, not easy to implement, but that is the correct behavior, I believe.

All this said, I would love to have this "feature" introduced. Moving large file stores between zfs file systems would be so handy! From my own sloppiness, I've suffered dearly from the lack of it.

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146 jloran at ssl.berkeley.edu
AST:7731^29u18e3
Jonathan Edwards
2007-Dec-29 14:37 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Dec 29, 2007, at 2:33 AM, Jonathan Loran wrote:

> Hey, here's an idea: We snapshot the file as it exists at the time of the mv in the old file system until all referring file handles are closed, then destroy the single file snap. I know, not easy to implement, but that is the correct behavior, I believe.
>
> All this said, I would love to have this "feature" introduced. Moving large file stores between zfs file systems would be so handy! From my own sloppiness, I've suffered dearly from the lack of it.

since in the current implementation a mv between filesystems would have to assign new st_ino values (fsids in NFS should also be different), all you should need to do is assign new block pointers in the new side of the filesystem .. that would also be handy for cp as well

---
.je
Joerg Schilling
2007-Dec-29 15:11 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Jonathan Edwards <Jonathan.Edwards at Sun.COM> wrote:

> On Dec 29, 2007, at 2:33 AM, Jonathan Loran wrote:
>
> > Hey, here's an idea: We snapshot the file as it exists at the time of the mv in the old file system until all referring file handles are closed, then destroy the single file snap. I know, not easy to implement, but that is the correct behavior, I believe.
> >
> > All this said, I would love to have this "feature" introduced. Moving large file stores between zfs file systems would be so handy! From my own sloppiness, I've suffered dearly from the lack of it.
>
> since in the current implementation a mv between filesystems would have to assign new st_ino values (fsids in NFS should also be different), all you should need to do is assign new block pointers in the new side of the filesystem .. that would also be handy for cp as well

If the rename kept the blocks from the old file for the new name, then the new file would inherit the identity of the old file.

If you implemented the rename in a way that caused new values for st_dev/st_ino to be returned from an fstat(2) call, then this could confuse programs.

If you instead set st_nlink for the open file to 0, then this would be OK from the viewpoint of the old file but not OK from the viewpoint of the whole system. How would you implement writes into the open fd from the old name?

Jörg
Jonathan Loran
2007-Dec-30 08:01 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Joerg Schilling wrote:

> Jonathan Edwards <Jonathan.Edwards at Sun.COM> wrote:
>
>> since in the current implementation a mv between filesystems would have to assign new st_ino values (fsids in NFS should also be different), all you should need to do is assign new block pointers in the new side of the filesystem .. that would also be handy for cp as well
>
> If the rename would keep the blocks from the old file for the new name then the new file would inherit the identity of the old file.
>
> If you did implement the rename in a way that would cause new values for st_dev/st_ino to be returned from a fstat(2) call then this could confuse programs.
>
> If you instead set st_nlink for the open file to 0, then this would be OK from the viewpoint of the old file but not be OK from the view to the whole system. How would you implement writes into the open fd from the old name?
>
> Jörg

Here is a more concise way of putting what I'm saying. A traditional mv between two file systems will create two copies of the data if the source file is open. At a minimum, this will have to be emulated or things will break. Since zfs file systems are really different Unix file systems, we have to deal with the semantics. It's not just a path change as in a directory mv.

Jon
Darren Reed
2007-Dec-31 08:20 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank Hofmann wrote:

> On Fri, 28 Dec 2007, Darren Reed wrote:
> [ ... ]
>> Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?
>
> Wrt. standards, quote from:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>
>     ERRORS
>     The rename() function shall fail if:
>     [ ... ]
>     [EXDEV]
>         [CX] The links named by new and old are on different file systems
>         and the implementation does not support links between file systems.
>
> Hence, it's implementation-dependent, as per IEEE1003.1.

This implies that we'd also have to look at allowing link(2) to also function between filesystems where rename(2) was going to work without doing a copy, correct? Which I suppose makes sense.

Darren
Frank.Hofmann at Sun.COM
2007-Dec-31 08:41 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Mon, 31 Dec 2007, Darren Reed wrote:

> Frank Hofmann wrote:
>> On Fri, 28 Dec 2007, Darren Reed wrote:
>> [ ... ]
>>> Is this behaviour defined by a standard (such as POSIX or the VFS design) or are we free to innovate here and do something that allowed such a shortcut as required?
>>
>> Wrt. standards, quote from:
>>
>> http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>>
>>     ERRORS
>>     The rename() function shall fail if:
>>     [ ... ]
>>     [EXDEV]
>>         [CX] The links named by new and old are on different file systems
>>         and the implementation does not support links between file systems.
>>
>> Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing link(2) to also function between filesystems where rename(2) was going to work without doing a copy, correct? Which I suppose makes sense.

Copy-on-write. rename() is just defined as an "atomic" sequence of:

    link(old, new);
    unlink(old);

If cross-fs rename is possible, then cross-fs link is as well. It's "per-file clone".

Btw, Joerg, this addresses the concern you had in any case. It's cross-fs, that means st_dev/st_ino _WILL_ change. Persistence of open files is not related to that. If you hold a file open, the st_dev/st_ino associated with the open fd will stay around and continue to be accessible with fstat() - but not necessarily with stat(). It definitely would not be in case the file got removed. That cross-fs rename would, on the source fs, remove the file is, for all I can see, not violating anything.

The location of the file's data is _NOT_ the only way to derive a unique st_dev/st_ino pair.

rename() _within_ a filesystem (as defined by the set of nodes with a common st_dev) should preserve st_ino if the fs supports link counts larger than one, agreed. But let's not confuse this with cross-fs rename, where by definition (cross-fs) st_dev must change. The identity of that file, therefore, has changed. We're just in the happy situation with ZFS that the storage low-level implementation can know that the contents haven't.

That's a sad situation for backup utilities, by the way - a backup tool would have no way of finding out that file X on fs A already existed as file Z on fs B. So what? If the file got copied, byte by byte, the same situation exists, the contents are identical. I don't think just because this makes backups slower than they could be if the backup utility were omniscient, that makes a reason to slow file copy/rename operations down.

Happy new year!

FrankH.
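The two behaviours Frank relies on, rename as link-then-unlink and an open descriptor outliving the removal of its name, can be tried out with the sketch below. Both function names are invented for illustration. Note that the emulation is not atomic and, unlike rename(2), fails with EEXIST if the target already exists; it also only works within one filesystem today, since link(2) across filesystems returns EXDEV.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: non-atomic emulation of rename() via link + unlink. */
int rename_via_link(const char *old_path, const char *new_path)
{
    assert(old_path != NULL && new_path != NULL);
    if (link(old_path, new_path) != 0)
        return -1;
    return unlink(old_path);
}

/*
 * Hypothetical helper: open path, remove its name, and check that
 * fstat() on the still-open descriptor reports the same st_dev/st_ino
 * while stat() on the name now fails with ENOENT.
 * Returns 1 if so, 0 if not, -1 on error.
 */
int open_file_outlives_unlink(const char *path)
{
    struct stat before, after, by_name;
    int fd, result;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0 || unlink(path) != 0 ||
        fstat(fd, &after) != 0) {
        close(fd);
        return -1;
    }
    result = before.st_dev == after.st_dev &&
             before.st_ino == after.st_ino &&
             stat(path, &by_name) != 0 && errno == ENOENT;
    close(fd);
    return result;
}
```

The second helper shows the distinction drawn above: the identity stays reachable through fstat() on the open descriptor even though stat() on the old name no longer finds it.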
Joerg Schilling
2007-Dec-31 12:23 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Darren Reed <Darren.Reed at Sun.COM> wrote:

> > Wrt. standards, quote from:
> >
> > http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
> >
> >     ERRORS
> >     The rename() function shall fail if:
> >     [ ... ]
> >     [EXDEV]
> >         [CX] The links named by new and old are on different file systems
> >         and the implementation does not support links between file systems.
> >
> > Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing link(2) to also function between filesystems where rename(2) was going to work without doing a copy, correct? Which I suppose makes sense.

Thank you for mentioning this. This brings us closer to the demand that st_dev/st_ino need to be kept during a rename.

Basically, rename was introduced in order to avoid a privileged link(2)/unlink(2) on directories in order to rename directories. For this reason, I would expect a rename(2) to work similarly to a link/unlink chain.

Something that I should also mention is that there may be programs that rename open files and assume traditional st_dev/st_ino semantics in case the rename worked.

Jörg
Darren Reed
2008-Jan-02 03:46 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Frank.Hofmann at Sun.COM wrote:

> ...
> That's a sad situation for backup utilities, by the way - a backup tool would have no way of finding out that file X on fs A already existed as file Z on fs B. So what? If the file got copied, byte by byte, the same situation exists, the contents are identical. I don't think just because this makes backups slower than they could be if the backup utility were omniscient, that makes a reason to slow file copy/rename operations down.

I don't see this as being a problem at all.

This idea is aimed at being a filesystem performance optimisation, not a backup optimisation.

Darren
Wee Yeh Tan
2008-Jan-02 09:10 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Jan 2, 2008 11:46 AM, Darren Reed <Darren.Reed at sun.com> wrote:

> Frank.Hofmann at Sun.COM wrote:
> > ...
> > That's a sad situation for backup utilities, by the way - a backup tool would have no way of finding out that file X on fs A already existed as file Z on fs B. So what? If the file got copied, byte by byte, the same situation exists, the contents are identical. I don't think just because this makes backups slower than they could be if the backup utility were omniscient, that makes a reason to slow file copy/rename operations down.
>
> I don't see this as being a problem at all.
>
> This idea is aimed at being a filesystem performance optimisation, not a backup optimisation.

Anyway, the same "problem" already exists with cloned filesystems.

--
Just me,
Wire ...
Blog: <prstat.blogspot.com>
Nicolas Williams
2008-Jan-02 16:24 UTC
[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)
On Mon, Dec 31, 2007 at 07:20:30PM +1100, Darren Reed wrote:

> Frank Hofmann wrote:
> > http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
> >
> >     ERRORS
> >     The rename() function shall fail if:
> >     [ ... ]
> >     [EXDEV]
> >         [CX] The links named by new and old are on different file systems
> >         and the implementation does not support links between file systems.
> >
> > Hence, it's implementation-dependent, as per IEEE1003.1.
>
> This implies that we'd also have to look at allowing link(2) to also function between filesystems where rename(2) was going to work without doing a copy, correct? Which I suppose makes sense.

If so then a cross-dataset rename(2) won't necessarily work.

link(2) preserves inode numbers. mv(1) does not [when crossing devices]. A cross-dataset rename(2) may not be able to preserve inode numbers either (e.g., if the one at the source is already in use on the target).

Nico
Nicolas Williams
2008-Jan-02 16:32 UTC
[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)
Oof, I see this has been discussed since (and, actually, IIRC it was
discussed a long time ago too).

Anyways, IMO, this requires a new syscall or syscalls:

    xdevrename(2)
    xdevcopy(2)

and then mv(1) can do:

    if (rename(old, new) != 0) {
        if (xdevrename(old, new) != 0) {
            /* do a cp(1) instead */
            return (do_cp(old, new));
        }
        return (0);
    }
    return (0);

cp(1), and maybe ln(1), could do something similar using xdevcopy(2).

Nico
--
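The pseudocode in this message can be fleshed out into something that compiles. This is only a sketch under stated assumptions: xdevrename(2) is the proposed call and does not exist, so it is stubbed here to fail with ENOSYS; do_cp() is a simplified stand-in for cp(1) that preserves no permissions, timestamps or ACLs; mv_file is a made-up name. One deliberate difference from the pseudocode: the fallbacks are tried only when rename(2) fails with EXDEV, on the assumption that copying would not help with any other error.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * HYPOTHETICAL: stand-in for the proposed xdevrename(2). No such system
 * call exists; this stub always fails with ENOSYS so the copy path runs.
 */
static int
xdevrename(const char *oldpath, const char *newpath)
{
    (void) oldpath;
    (void) newpath;
    errno = ENOSYS;
    return (-1);
}

/* Simplified fallback: copy the bytes, then remove the source. */
static int
do_cp(const char *oldpath, const char *newpath)
{
    char buf[8192];
    size_t n;
    int ok;
    FILE *in, *out;

    if ((in = fopen(oldpath, "rb")) == NULL)
        return (-1);
    if ((out = fopen(newpath, "wb")) == NULL) {
        (void) fclose(in);
        return (-1);
    }
    while ((n = fread(buf, 1, sizeof (buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            break;      /* short write: ferror(out) is now set */
    }
    ok = !ferror(in) && !ferror(out);
    (void) fclose(in);
    if (fclose(out) != 0)
        ok = 0;
    if (!ok) {
        (void) remove(newpath);     /* do not leave a partial copy behind */
        return (-1);
    }
    return (remove(oldpath));
}

/* Try the cheap paths first; copy only when the rename crosses devices. */
int
mv_file(const char *oldpath, const char *newpath)
{
    assert(oldpath != NULL && newpath != NULL);

    if (rename(oldpath, newpath) == 0)
        return (0);
    if (errno != EXDEV)
        return (-1);    /* a real error; copying would not help */
    if (xdevrename(oldpath, newpath) == 0)
        return (0);
    return (do_cp(oldpath, newpath));
}
```

With the stub in place, a move between two datasets always ends up in do_cp(), which matches what mv(1) does today; the middle step only starts to matter if a real xdevrename(2) is ever provided.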
Wee Yeh Tan
2008-Jan-03 01:06 UTC
[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)
On Jan 3, 2008 12:32 AM, Nicolas Williams <Nicolas.Williams at sun.com> wrote:
> Oof, I see this has been discussed since (and, actually, IIRC it was
> discussed a long time ago too).
>
> Anyways, IMO, this requires a new syscall or syscalls:
>
>     xdevrename(2)
>     xdevcopy(2)
>
> and then mv(1) can do:
>
>     if (rename(old, new) != 0) {
>         if (xdevrename(old, new) != 0) {
>             /* do a cp(1) instead */
>             return (do_cp(old, new));
>         }
>         return (0);
>     }
>     return (0);
>
> cp(1), and maybe ln(1), could do something similar using xdevcopy(2).

Could it be cleaner to do that within vn_renameat() instead? This will
save creating a new syscall and updating quite a number of utilities.

--
Just me, Wire ...
Blog: <prstat.blogspot.com>
Darren Reed
2008-Jan-03 03:09 UTC
[zfs-discuss] Inode (dnode) numbers (Re: rename(2) (mv(1)) between ZFS filesystems in the same zpool)
Nicolas Williams wrote:
> On Mon, Dec 31, 2007 at 07:20:30PM +1100, Darren Reed wrote:
>
>> Frank Hofmann wrote:
>>
>>> http://www.opengroup.org/onlinepubs/009695399/functions/rename.html
>>>
>>> ERRORS
>>> The rename() function shall fail if:
>>> [ ... ]
>>> [EXDEV]
>>> [CX] The links named by new and old are on different file systems
>>> and the implementation does not support links between file systems.
>>>
>>> Hence, it's implementation-dependent, as per IEEE1003.1.
>>>
>> This implies that we'd also have to look at allowing
>> link(2) to also function between filesystems where
>> rename(2) was going to work without doing a copy,
>> correct? Which I suppose makes sense.
>
> If so then a cross-dataset rename(2) won't necessarily work.
>
> link(2) preserves inode numbers. mv(1) does not [when crossing
> devices]. A cross-dataset rename(2) may not be able to preserve inode
> numbers either (e.g., if the one at the source is already in use on the
> target).

Unless POSIX or similar says the preservation of inode numbers is
required, I can't see why that is important.

Darren
Carsten Bormann
2008-Jan-03 08:29 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
On Dec 29 2007, at 08:33, Jonathan Loran wrote:
> We snapshot the file as it exists at the time of
> the mv in the old file system until all referring file handles are
> closed, then destroy the single file snap. I know, not easy to
> implement, but that is the correct behavior, I believe.

Exactly.

Note that apart from open descriptors, there may be other links to the
file on the old FS; it has to be clear whether writes to the file in
the new FS change the file in the old FS or not. I'd rather say they
shouldn't.
Yes, this would be different from the normal rename(2) semantics with
respect to multiply linked files. And yes, the semantics of link(2)
should also be consistent with this.

Regards, Carsten
Joerg Schilling
2008-Jan-03 08:47 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Carsten Bormann <cabocabo at gmail.com> wrote:
> On Dec 29 2007, at 08:33, Jonathan Loran wrote:
>
> > We snapshot the file as it exists at the time of
> > the mv in the old file system until all referring file handles are
> > closed, then destroy the single file snap. I know, not easy to
> > implement, but that is the correct behavior, I believe.
>
> Exactly.
>
> Note that apart from open descriptors, there may be other links to the
> file on the old FS; it has to be clear whether writes to the file in
> the new FS change the file in the old FS or not. I'd rather say they
> shouldn't.
> Yes, this would be different from the normal rename(2) semantics with
> respect to multiply linked files. And yes, the semantics of link(2)
> should also be consistent with this.

This is an interesting problem. Your proposal would imply that a file
may have different identities in different filesystems:

- different st_dev

- different st_ino

- different link count

This cannot be implemented with a single "inode data" anymore.

Well, it is not impossible, as my WOFS (mentioned before) implements
hardlinks via "inode relative symlinks". In order to allow this, a file
would need a storage pool global serial number that allows matching
different inode sets for the file.

Jörg
Jonathan Loran
2008-Jan-03 22:55 UTC
[zfs-discuss] rename(2) (mv(1)) between ZFS filesystems in the same zpool
Joerg Schilling wrote:
> Carsten Bormann <cabocabo at gmail.com> wrote:
>
>> On Dec 29 2007, at 08:33, Jonathan Loran wrote:
>>
>>> We snapshot the file as it exists at the time of
>>> the mv in the old file system until all referring file handles are
>>> closed, then destroy the single file snap. I know, not easy to
>>> implement, but that is the correct behavior, I believe.
>>
>> Exactly.
>>
>> Note that apart from open descriptors, there may be other links to the
>> file on the old FS; it has to be clear whether writes to the file in
>> the new FS change the file in the old FS or not. I'd rather say they
>> shouldn't.
>> Yes, this would be different from the normal rename(2) semantics with
>> respect to multiply linked files. And yes, the semantics of link(2)
>> should also be consistent with this.
>
> This is an interesting problem. Your proposal would imply that a file
> may have different identities in different filesystems:
>
> - different st_dev
>
> - different st_ino
>
> - different link count
>
> This cannot be implemented with a single "inode data" anymore.
>
> Well, it is not impossible, as my WOFS (mentioned before) implements
> hardlinks via "inode relative symlinks". In order to allow this, a file
> would need a storage pool global serial number that allows matching
> different inode sets for the file.
>
> Jörg

At first, as I mentioned in my earlier email, I was thinking we needed
to emulate the cross-fs rename/link/etc behavior as it is currently
implemented, where a file appears to actually be copied. But now I'm
not so sure. In Unixland, the ideal has always been to have the whole
file system, kit and caboodle, singly rooted at /. Heck, even devices
are in the file system. Of course, reality required that,
programmatically, we be aware of which file system our cwd is in. At a
minimum, it's returned in our various stat structs (st_dev).

I can see I'm getting long winded, but I'm thinking: what is the value
of having different behavior for a cross-zfs file move within the same
pool than for a move between directories? I'm not addressing the
previous discussion about how to treat file handles, etc, but more
about sharing open file blocks, linked across zfs boundaries before and
after such a mv. I think the test is this: can we find a scenario where
something would break if we did share the file blocks across zfs
boundaries after such a mv? For every example I've been able to think
of, if I ask the question "what if I moved the file from one directory
to the other, instead of across zfs boundaries, would it have been
different?", the answer has been no. Comments please.

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146 jloran at ssl.berkeley.edu
AST:7731^29u18e3