KRELLE,BRIAN (HP-Roseville,ex1)
2001-Sep-12 00:20 UTC
Distinct transactions (MV vs rename())?
I have a question regarding a thread in June called "Distinct transactions", which I have included below. It seems to me that the solution is not atomic for daemons opening the file, as there is a moment where the filename is not in the directory (i.e. unlink then link).

In summary, poster Charlie Woloszynski wanted to update a configuration file in a safe manner (i.e. like a transaction, the update would either complete or not if power failed). The solution was:

(1) remove "new" if in existence
(2) copy "current" to "new"
(3) modify "new"
(4) remove "old" if in existence
(5) hard link "current" to "old"
(6) move "new" to "current"

Step 6 was to use the mv command, which Andreas Dilger justified as follows: "The reason for doing this is that at no time will you NOT have a valid config file in place. One mv is guaranteed atomic, whereas two are not."

I was rooting around in the MV code. It seems to perform an unlink() then a link() if the filename already exists (Theodore Tso, please correct me if I'm wrong). This implies to me there is a moment where the filename (i.e. "current") is not in the directory.

Another thread on July 26th about MTAs and metadata synchronization, cryptically titled "ext3-2.4-0.9.4", talked about rename() being atomic in regards to other processes attempting to open the file being renamed.

Since MV, or LN, do not seem to use the rename() function, how does one accomplish step 6 above atomically from processes using the FS API? It would also be nice to minimize the number of underlying filesystem transactions to be powerfail safe. An unlink() of a hard link needs to remove the entry from the directory and decrement the link count in the "old" inode. The link() would add an entry to the directory and increment the link count in the "new" inode, for a total of four metadata changes. I was thinking of using symlinks, but I do not think one can "move" where a symlink points using the rename() function. (A minimal sketch of step 6 done with rename() directly is appended after the quoted thread below.)

Thanks in advance for your advice as I'm trying to solve this problem. Daemons we write are polite (i.e. they only read their configuration file upon receipt of a SIGHUP), but some we use are not (they open and read their files of their own volition).

Brian Krelle

"Distinct transactions" thread from June
---------------------------------------------------------------------------
To: ext3-users(a)redhat.com
Subject: Distinct transactions
From: Charlie Woloszynski <chw(a)clearmetrix.com>
Date: Thu, 14 Jun 2001 09:21:20 -0400

Folks:

I have been asked to set up a Linux box that can be power-cycled at any time and it should not fail to reboot. EXT3 seems like the basis for such a box, and we have set up a RedHat 6.2 box with the new kernel and we have ext3 working. We have set up ext3 to journal both metadata and data changes (and we are willing to accept the reduced disk throughput). We are expecting to have logging data flowing onto the hard drive and we expect to have configuration changes also made to the box.

From what I can see, I can expect EXT3 to make sure that the logging data flowing into the box will either be appended to the file or not, but I am not sure at what level will the filesystem changes be 'chunked' into transactions. Is there some direct mapping of system calls to 'transactions' on the file system that we can assume?

Of even more importance are our configuration files. If these get left in some partly modified state, all else is pointless. So, we are trying to make sure we have some understanding of the 'right' way to make sure that we keep consistent files for the configuration stuff.

We are proposing:
(1) copy the config file to some new file.
(2) modify the new file and close it.
(3) move the existing config file to a backup file
(4) move the modified file to replace the original file

Are we right in assuming that these actions retain their sequentiality with journaling? That is, can we be sure that if (3) occurred in the filesystem, that (2) must have been completed? If we can be *sure*, then we can fix any power cycle in the middle of the update by looking at the state of the two files. If the existing config file is in place, we remove any 'new config' files. If the existing config file has been renamed, then we move the 'new config' in place and continue. We should never see the existing config file moved and no new config file on the system.

Have we made any mistaken assumptions about the level and sequence of transactions in the EXT3 stuff? Any and all feedback and commentary are greatly appreciated.

Thanks,
Charlie
---------------------------------------------------------------------------
To: ext3-users(a)redhat.com
Subject: Re: Distinct transactions
From: "Stephen C. Tweedie" <sct(a)redhat.com>
Date: Thu, 14 Jun 2001 15:09:55 +0100

Hi,

On Thu, Jun 14, 2001 at 09:21:20AM -0400, Charlie Woloszynski wrote:
> I have been asked to set up a Linux box that can be power-cycled at any
> time and it should not fail to reboot. EXT3 seems like the basis for such
> a box, and we have set up a RedHat 6.2 box with the new kernel and we have
> ext3 working. We have set up ext3 to journal both metadata and data
> changes (and we are willing to accept the reduced disk throughput).
>
> From what I can see, I can expect EXT3 to make sure that the logging data
> flowing into the box will either be appended to the file or not, but I am
> not sure at what level will the filesystem changes be 'chunked' into
> transactions. Is there some direct mapping of system calls to
> 'transactions' on the file system that we can assume?

First of all, if you are extending existing files, then the "data=ordered" journaling mode will give you the same atomicity guarantees as "data=journaled", without the performance penalty of writing everything twice.

Secondly, nearly every transaction is atomic on disk in ext3. Truncates can span more than one transaction, but ext3 recovery after a crash will complete any partial truncate which happened to be split over transactions. Only large writes will ever be split over transactions. Now, in older versions of ext3, the definition of "large" was slightly broken, but in 0.0.7a only transactions which take up more than 64 blocks in the log should ever be broken up. That's hard to translate exactly into the size of a write(): it depends on how fragmented the written data is, and whether or not quotas are enabled. But any write below about 16k is going to be completely atomic on ext3 with data journaling, and any append of that size will be atomic even with ordered journaling.

> We are proposing:
> (1) copy the config file to some new file.
> (2) modify the new file and close it.
> (3) move the existing config file to a backup file
> (4) move the modified file to replace the original file
>
> Are we right in assuming that these actions retain their sequentiality with
> journaling?

Yes.

> That is, can we be sure that if (3) occurred in the
> filesystem, that (2) must have been completed?

Yes. data=ordered and data=journal should both satisfy your constraints in this case.
Cheers,
Stephen
---------------------------------------------------------------------------
To: ext3-users(a)redhat.com
Subject: Re: Distinct transactions
From: Andreas Dilger <adilger(a)turbolinux.com>
Date: Thu, 14 Jun 2001 18:17:34 -0600 (MDT)

Charlie writes:
> We are proposing:
> (1) copy the config file to some new file.
> (2) modify the new file and close it.
> (3) move the existing config file to a backup file
> (4) move the modified file to replace the original file
>
> Are we right in assuming that these actions retain their sequentiality with
> journaling? That is, can we be sure that if (3) occurred in the
> filesystem, that (2) must have been completed?

If you are doing journalling, I would propose the following:
1) copy config to new file
2) modify new file
3) copy original config to backup
4) mv modified file to replace original

The reason for doing this is that at no time will you NOT have a valid config file in place. One mv is guaranteed atomic, whereas two are not.

Cheers,
Andreas
---------------------------------------------------------------------------
To: ext3-users(a)redhat.com
Subject: Re: Distinct transactions
From: Nigel Metheringham <Nigel.Metheringham(a)vdata.co.uk>
Date: 15 Jun 2001 09:31:34 +0100

On 14 Jun 2001 18:17:34 -0600, Andreas Dilger wrote:
> If you are doing journalling, I would propose the following:
> 1) copy config to new file
> 2) modify new file
> 3) copy original config to backup
> 4) mv modified file to replace original
>
> The reason for doing this is that at no time will you NOT have a valid
> config file in place. One mv is guaranteed atomic, whereas two are not.

To pick a minor nit, you can do this a little more efficiently by making step #3 a (hard) link of original config to backup.

Nigel.
---------------------------------------------------------------------------
To: ext3-users(a)redhat.com
Subject: Re: Distinct transactions
From: Charlie Woloszynski <chw(a)clearmetrix.com>
Date: Fri, 15 Jun 2001 14:04:52 -0400

Folks:

Thanks for the improvements. To be clear, the new process is:

(1) copy current config file to "new"
(2) modify new file to desired state
(3) hard link current config to "old"
(4) move new file to current file.

Using this approach, I think Andreas is suggesting that if the system power-cycles, there is no recovery process needed. The current config file is always correct. If the system power-cycles before (4) is completed, then the changes in process are lost. If (4) was executed, then all the changes to the system have been journaled and the new config file is as desired.

If the system fails, it might leave a "new" file or an "old" file. I think I need to add removing these, if they exist, as part of this process. So, I think I get:

(1) remove "new" if in existence
(2) copy "current" to "new"
(3) modify "new"
(4) remove "old" if in existence
(5) hard link "current" to "old"
(6) move "new" to "current"

I figured I'd leave "old" around as a backup in case the configuration change process failed (not a journaling failure, but a user failure) and we want to see the previous configuration.

Does this approach require full data and metadata journaling, or just "ordered" journaling?

Thanks,
Charlie
---------------------------------------------------------------------------
Tweedie" <sct(a)redhat.com> Date: Fri, 15 Jun 2001 19:14:20 +0100 Hi, On Fri, Jun 15, 2001 at 02:04:52PM -0400, Charlie Woloszynski wrote:> Thanks for the improvements. To be clear, the new process is: > > (1) copy current config file to "new" > (2) modify new file to desired state > (3) hard link current config to "old" > (4) move new file to current file.> Does this approach require full data and metadata journaling, or just > "ordered" journaling?Ordered should work just fine. It will guarantee that the new data created in stage (2) is flushed to disk before the transaction completes. Cheers, Stephen
Hi,

On Tue, Sep 11, 2001 at 05:20:44PM -0700, KRELLE,BRIAN (HP-Roseville,ex1) wrote:
> I was rooting around in the MV code. It seems to perform an unlink()
> then a link() if the filename already exists (Theodore Tso please
> correct me if I'm wrong). This implies to me there is a moment where
> the filename (i.e. "current") is not in the directory.

$ touch a
$ touch b
$ strace mv a b
[...]
stat64("b", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lstat64("a", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lstat64("b", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
rename("a", "b")                        = 0
_exit(0)                                = ?
$

Seems to be atomic to me.

Cheers,
Stephen
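
A quick way to convince oneself of the same thing from user space, beyond the strace output above, is a small stress test. This is a hypothetical harness, not something posted in the thread: it assumes an empty scratch directory and simply counts how often a concurrent open() of the destination name fails while another process keeps renaming a fresh file onto it.

/* Hypothetical stress test (not from the thread): one process keeps
 * renaming a freshly created file onto "current" while the other keeps
 * opening "current".  With rename() the open should never fail; an
 * unlink()+link() replacement would occasionally return ENOENT.
 * Run it in an empty scratch directory.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ITERATIONS 10000

int main(void)
{
    int fd = open("current", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("create current"); return 1; }
    close(fd);

    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 1; }

    if (pid == 0) {
        /* Writer: repeatedly replace "current" with a new file. */
        for (int i = 0; i < ITERATIONS; i++) {
            int nfd = open("new", O_CREAT | O_TRUNC | O_WRONLY, 0644);
            if (nfd < 0) { perror("create new"); _exit(1); }
            close(nfd);
            if (rename("new", "current") != 0) { perror("rename"); _exit(1); }
        }
        _exit(0);
    }

    /* Reader: the name "current" should always resolve. */
    int misses = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        int rfd = open("current", O_RDONLY);
        if (rfd < 0)
            misses++;
        else
            close(rfd);
    }

    waitpid(pid, NULL, 0);
    printf("open(\"current\") failed %d of %d times\n", misses, ITERATIONS);
    return 0;
}

With rename() the reported count should be zero, which is the guarantee Brian's "impolite" daemons need.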
On Tue, Sep 11, 2001 at 05:20:44PM -0700, KRELLE,BRIAN (HP-Roseville,ex1) wrote:
> Another thread on July 26th about MTAs and metadata synchronization,
> cryptically titled "ext3-2.4-0.9.4", talked about rename() being atomic
> in regards to other processes attempting to open the file being renamed.
>
> Since MV, or LN, do not seem to use the rename() function, how does one
> accomplish step 6 above atomically from processes using the FS API? It
> would also be nice to minimize the number of underlying filesystem
> transactions to be powerfail safe.

As near as I can tell, mv seems to use the rename() function, as long as source and target are on the same filesystem. If the source and destination directories are on different filesystems, then a copy followed by an unlink will take place.

- Ted
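
Ted's same-filesystem caveat can be observed directly: rename() across filesystems fails with EXDEV, and the copy-then-unlink fallback that mv performs in that case is exactly the non-atomic path the config-update scheme is trying to avoid. The following sketch is illustrative only (the paths come from the command line, and the fallback is deliberately not implemented):

/* Illustrative sketch of the same-filesystem requirement Ted describes:
 * rename() cannot cross filesystems and fails with EXDEV, which is when
 * mv falls back to a non-atomic copy + unlink.  Paths are taken from
 * the command line.
 */
#include <errno.h>
#include <stdio.h>

static int replace_atomically(const char *newpath, const char *curpath)
{
    if (rename(newpath, curpath) == 0)
        return 0;                       /* atomic replacement succeeded */

    if (errno == EXDEV) {
        /* Different filesystems: mv would copy then unlink here, which
         * is not atomic.  For the scheme in this thread, keep "new" in
         * the same directory as "current" instead.
         */
        fprintf(stderr, "%s and %s are on different filesystems\n",
                newpath, curpath);
        return -1;
    }

    perror("rename");
    return -1;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <new> <current>\n", argv[0]);
        return 2;
    }
    return replace_atomically(argv[1], argv[2]) == 0 ? 0 : 1;
}

Keeping "new" in the same directory as "current" is the simplest way to guarantee the rename() path is taken.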