KRELLE,BRIAN (HP-Roseville,ex1)
2001-Sep-12  00:20 UTC
Distinct transactions (MV vs rename())?
I have a question regarding a thread in June called "Distinct
transactions", which I have included below.  It seems to me that the
solution is not atomic for daemons opening the file as there is a moment
where the filename is not in the directory (i.e.  unlink then link).
In summary, poster Charlie Woloszynski wanted to update a configuration
file in a safe manner (i.e.  as a transaction would either complete or
not if power failed).
The solution was:
    (1) remove "new" if in existence
    (2) copy "current" to "new"
    (3) modify "new"
    (4) remove "old" if in existence
    (5) hard link "current" to "old"
    (6) move "new" to "current"
Step 6 was to use the MV command, about which Andreas Dilger (quoted by
Nigel Metheringham) stated:
    The reason for doing this is that at no time will
    you NOT have a valid config file in place.  One mv
    is guaranteed atomic, whereas two are not.
I was rooting around in the MV code.  It seems to perform an unlink()
then a link() if the filename already exists (Theodore Tso please
correct me if I'm wrong).  This implies to me there is a moment where
the filename (i.e. "current") is not in the directory.
Another thread on July 26th about MTAs and metadata synchronization,
cryptically titled "ext3-2.4-0.9.4", talked about rename() being atomic
with regard to other processes attempting to open the file being renamed.
Since MV, or LN, do not seem to use the rename() function, how does one
accomplish step 6 above atomically from processes using the FS API?  It
would also be nice to minimize the number of underlying filesystem 
transactions to be powerfail safe.
An unlink() of a hard link needs to remove the entry from the directory
and decrement the link count in the "old" inode.  The link() would add an
entry to the directory and increment the link count in the "new" inode.
That is a total of four metadata changes.
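The link-count bookkeeping described above can be observed from user space. The following is a minimal sketch, assuming a POSIX filesystem that supports hard links; the function name and the file names "current" and "old" are only illustrative.

```python
import os
import tempfile

def link_counts_demo(directory):
    """Return st_nlink of one inode after create, link() and unlink()."""
    current = os.path.join(directory, "current")
    old = os.path.join(directory, "old")
    with open(current, "w") as f:
        f.write("setting = 1\n")
    counts = [os.stat(current).st_nlink]      # one directory entry so far
    os.link(current, old)                     # add an entry, increment the count
    counts.append(os.stat(current).st_nlink)
    os.unlink(old)                            # remove the entry, decrement the count
    counts.append(os.stat(current).st_nlink)
    return counts

with tempfile.TemporaryDirectory() as d:
    print(link_counts_demo(d))
```

Each link() or unlink() touches both a directory entry and an inode link count, which is where the "four metadata changes" for an unlink-then-link sequence come from.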
I was thinking of using symlinks but I do not think one can "move" its
link using the rename() function.
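On the symlink question: rename() acts on the symlink itself rather than on its target, so one possible approach is to create a fresh symlink under a scratch name and rename it over the existing one. This is a minimal sketch under that assumption; the helper name switch_symlink, the ".tmp" suffix and the versioned file names are hypothetical.

```python
import os
import tempfile

def switch_symlink(link_path, new_target):
    """Repoint link_path at new_target by renaming a fresh symlink over it."""
    tmp_link = link_path + ".tmp"        # hypothetical scratch name
    if os.path.lexists(tmp_link):
        os.unlink(tmp_link)              # clear debris left by an earlier crash
    os.symlink(new_target, tmp_link)
    os.rename(tmp_link, link_path)       # replaces the old symlink, not its target

with tempfile.TemporaryDirectory() as d:
    for name in ("config.v1", "config.v2"):
        with open(os.path.join(d, name), "w") as f:
            f.write(name + "\n")
    current = os.path.join(d, "current")
    os.symlink("config.v1", current)
    switch_symlink(current, "config.v2")
    print(os.readlink(current))
```

A daemon that opens "current" at any point resolves either the old or the new symlink, because the name itself is never removed from the directory.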
Thanks in advance for your advice as I'm trying to solve this problem.
Daemons we write are polite (i.e.  only read their configuration file
upon receipt of a SIGHUP) but some we use are not (e.g.  open and read
their files of their own volition).
Brian Krelle
Distinct transactions Thread from June
---------------------------------------------------------------------------
To: ext3-users(a)redhat.com 
Subject: Distinct transactions 
From: Charlie Woloszynski <chw(a)clearmetrix.com> 
Date: Thu, 14 Jun 2001 09:21:20 -0400 
Folks:
I have been asked to set up a Linux box that can be power-cycled at any
time and it should not fail to reboot.  EXT3 seems like the basis for such
a box, and we have set up a RedHat 6.2 box with the new kernel and we have
ext3 working.  We have set up ext3 to journal both metadata and data
changes (and we are willing to accept the reduced disk throughput).
We are expecting to have logging data flowing onto the hard drive and we
expect to have configuration changes also made to the box.
From what I can see, I can expect EXT3 to make sure that the logging data
flowing into the box will either be appended to the file or not, but I am
not sure at what level will the filesystem changes be 'chunked' into
transactions.  Is there some direct mapping of system calls to
'transactions' on the file system that we can assume?
Of even more importance are our configuration files.  If these get left in
some partly modified state, all else is pointless.  So, we are trying to
make sure we have some understanding of the 'right' way to keep the
configuration files consistent.
We are proposing:
(1) copy the config file to some new file.
(2) modify the new file and close it.
(3) move the existing config file to a backup file
(4) move the modified file to replace the original file
Are we right in assuming that these actions retain their sequentiality with
journaling?  That is, can we be sure that if (3) occurred in the
filesystem, that (2) must have been completed?
If we can be *sure*, then we can fix any power cycle in the middle of the
update by looking at the state of the two files.  If the existing config
file is in place, we remove any 'new config' files.  If the existing config
file has been renamed, then we move the 'new config' in place and
continue.  We should never see the existing config file moved and no new
config file on the system.
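The recovery rule described above could be sketched as follows. This is only an illustration of the proposed two-move scheme; the function name recover, its return strings and the file names are hypothetical.

```python
import os
import tempfile

def recover(current, new, backup):
    """Apply the recovery rule after a power cycle; report what was done."""
    if os.path.exists(current):
        if os.path.exists(new):
            os.unlink(new)              # update never reached step (3): discard it
            return "discarded-new"
        return "nothing-to-do"
    if os.path.exists(backup) and os.path.exists(new):
        os.rename(new, current)         # finish step (4)
        return "completed-update"
    raise RuntimeError("no current config and nothing to restore it from")

with tempfile.TemporaryDirectory() as d:
    current, new, backup = (os.path.join(d, n) for n in ("config", "config.new", "config.bak"))
    # Simulate a crash between steps (3) and (4): only backup and new exist.
    for path, text in ((backup, "old\n"), (new, "new\n")):
        with open(path, "w") as f:
            f.write(text)
    print(recover(current, new, backup))
```

The need for this check at boot is what the later replies in the thread remove by linking or copying the backup instead of moving the live file.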
Have we made any mistaken assumptions about the level and sequence of
transactions in the EXT3 stuff?  Any and all feedback and commentary are
greatly appreciated.
Thanks,
Charlie
----------------------------------------------------------------------------
To: ext3-users(a)redhat.com 
Subject: Re: Distinct transactions 
From: "Stephen C. Tweedie" <sct(a)redhat.com> 
Date: Thu, 14 Jun 2001 15:09:55 +0100 
Hi,
On Thu, Jun 14, 2001 at 09:21:20AM -0400, Charlie Woloszynski wrote:
>
> I have been asked to set up a Linux box that can be power-cycled at any
> time and it should not fail to reboot.  EXT3 seems like the basis for such
> a box, and we have set up a RedHat 6.2 box with the new kernel and we have
> ext3 working.  We have set up ext3 to journal both metadata and data
> changes (and we are willing to accept the reduced disk throughput).
> 
> From what I can see, I can expect EXT3 to make sure that the logging data
> flowing into the box will either be appended to the file or not, but I am
> not sure at what level will the filesystem changes be 'chunked' into
> transactions.  Is there some direct mapping of system calls to
> 'transactions' on the file system that we can assume?
First of all, if you are extending existing files, then the
"data=ordered" journaling mode will give you the same atomicity
guarantees as "data=journal", without the performance penalty of
writing everything twice.
Secondly, nearly every transaction is atomic on disk in ext3.
Truncates can span more than one transaction, but ext3 recovery after
a crash will complete any partial truncate which happened to be split
over transactions.
Only large writes will ever be split over transactions.  Now, in older
versions of ext3, the definition of "large" was slightly broken, but
in 0.0.7a only transactions which take up more than 64 blocks in the
log should ever be broken up.  That's hard to translate exactly into
the size of a write(): it depends on how fragmented the written data
is, and whether or not quotas are enabled.  But any write below about
16k is going to be completely atomic on ext3 with data journaling, and
any append of that size will be atomic even with ordered journaling.  
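One way to stay within the small-append case described above is to issue each log record as a single write() on a descriptor opened with O_APPEND. This is a minimal sketch; the 16 KiB cap simply mirrors the "about 16k" figure quoted here and is not a guaranteed limit, and the helper name append_record is hypothetical. Whether a given write maps to one transaction still depends on the ext3 version, fragmentation and quotas, as noted above.

```python
import os
import tempfile

MAX_RECORD = 16 * 1024   # mirrors the "about 16k" figure; an assumption, not a hard limit

def append_record(path, record: bytes):
    """Append one record with a single write() call on an O_APPEND descriptor."""
    if len(record) >= MAX_RECORD:
        raise ValueError("record too large for a single small append")
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        written = os.write(fd, record)   # the whole record goes out in one call
        if written != len(record):
            raise OSError("short write")
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "events.log")
    append_record(log, b"boot ok\n")
    append_record(log, b"config reloaded\n")
    with open(log, "rb") as f:
        print(f.read())
```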
> We are proposing:
> (1) copy the config file to some new file.
> (2) modify the new file and close it.
> (3) move the existing config file to a backup file
> (4) move the modified file to replace the original file
> 
> Are we right in assuming that these actions retain their sequentiality
> with journaling?
Yes.
> That is, can we be sure that if (3) occurred in the
> filesystem, that (2) must have been completed?
Yes.  data=ordered and data=journal should both satisfy your
constraints in this case.
Cheers,
 Stephen
----------------------------------------------------------------------------
To: ext3-users(a)redhat.com 
Subject: Re: Distinct transactions 
From: Andreas Dilger <adilger(a)turbolinux.com> 
Date: Thu, 14 Jun 2001 18:17:34 -0600 (MDT) 
Charlie writes:
> We are proposing:
> (1) copy the config file to some new file.
> (2) modify the new file and close it.
> (3) move the existing config file to a backup file
> (4) move the modified file to replace the original file
> 
> Are we right in assuming that these actions retain their sequentiality
> with journaling?  That is, can we be sure that if (3) occurred in the
> filesystem, that (2) must have been completed?
If you are doing journalling, I would propose the following:
1) copy config to new file
2) modify new file
3) copy original config to backup
4) mv modified file to replace original
The reason for doing this is that at no time will you NOT have a valid
config file in place.  One mv is guaranteed atomic, whereas two are not.
Cheers, Andreas
----------------------------------------------------------------------------
To: ext3-users(a)redhat.com 
Subject: Re: Distinct transactions 
From: Nigel Metheringham <Nigel.Metheringham(a)vdata.co.uk> 
Date: 15 Jun 2001 09:31:34 +0100 
On 14 Jun 2001 18:17:34 -0600, Andreas Dilger wrote:
> If you are doing journalling, I would propose the following:
> 1) copy config to new file
> 2) modify new file
> 3) copy original config to backup
> 4) mv modified file to replace original
> 
> The reason for doing this is that at no time will you NOT have a valid
> config file in place.  One mv is guaranteed atomic, whereas two are not.
To pick a minor nit, you can do this a little more efficiently by making
step #3 a (hard) link of original config to backup.
	Nigel.
----------------------------------------------------------------------------
To: ext3-users(a)redhat.com 
Subject: Re: Distinct transactions 
From: Charlie Woloszynski <chw(a)clearmetrix.com> 
Date: Fri, 15 Jun 2001 14:04:52 -0400 
Folks:
Thanks for the improvements.  To be clear, the new process is:
(1) copy current config file to "new"
(2) modify new file to desired state
(3) hard link current config to "old"
(4) move new file to current file.
Using this approach, I think Andreas is suggesting that if the system
power-cycles, there is no recovery process needed.  The current config file
is always correct.  If the system power cycles before (4) is completed, then
the changes in process are lost.  If (4) was executed, then all the changes
to the system have been journaled and the new config file is as desired.
If the system fails, it might leave a "new" file or an "old" file.  I think
I need to add in removing these, if they exist, as part of this process.
So, I think I get:
(1) remove "new" if in existence
(2) copy "current" to "new"
(3) modify "new"
(4) remove "old" if in existence
(5) hard link "current" to "old"
(6) move "new" to "current"
I figured I'd leave "old" around as a backup in case the configuration
change process failed (not a journaling failure, but a user failure) and we
want to see the previous configuration.
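The six steps above might look like this when driven through the filesystem API directly. This is a minimal sketch, not the poster's actual code: the ".new" and ".old" suffixes, the helper names and the modify callback are hypothetical, and the fsync() call is an extra precaution added here rather than something the thread asks for.

```python
import os
import shutil
import tempfile

def remove_if_exists(path):
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass

def update_config(current, modify):
    """Run steps (1)-(6); modify is a callable that edits the file at a path."""
    new = current + ".new"               # hypothetical names for "new" and "old"
    old = current + ".old"
    remove_if_exists(new)                # (1) remove "new" if it exists
    shutil.copy2(current, new)           # (2) copy "current" to "new"
    modify(new)                          # (3) modify "new"
    fd = os.open(new, os.O_RDWR)         # assumption: flush explicitly as a precaution
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    remove_if_exists(old)                # (4) remove "old" if it exists
    os.link(current, old)                # (5) hard link "current" to "old"
    os.rename(new, current)              # (6) a single rename() replaces "current"

def add_line(path):
    with open(path, "a") as f:
        f.write("loglevel = debug\n")

with tempfile.TemporaryDirectory() as d:
    cfg = os.path.join(d, "current")
    with open(cfg, "w") as f:
        f.write("port = 8080\n")
    update_config(cfg, add_line)
    with open(cfg) as f:
        print(f.read(), end="")
```

After step (5) "old" and "current" name the same inode; step (6) then points "current" at the new inode, so "old" keeps the previous contents.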
Does this approach require full data and metadata journaling, or just
"ordered" journaling?
Thanks,
Charlie
----------------------------------------------------------------------------
To: ext3-users(a)redhat.com 
Subject: Re: Distinct transactions 
From: "Stephen C. Tweedie" <sct(a)redhat.com> 
Date: Fri, 15 Jun 2001 19:14:20 +0100 
Hi,
On Fri, Jun 15, 2001 at 02:04:52PM -0400, Charlie Woloszynski wrote:
> Thanks for the improvements.  To be clear, the new process is:
> 
> (1) copy current config file to "new"
> (2) modify new file to desired state
> (3) hard link current config to "old"
> (4) move new file to current file.
> Does this approach require full data and metadata journaling, or just
> "ordered" journaling?
Ordered should work just fine.  It will guarantee that the new data
created in stage (2) is flushed to disk before the transaction
completes.
Cheers,
 Stephen
----------------------------------------------------------------------------
Hi,
On Tue, Sep 11, 2001 at 05:20:44PM -0700, KRELLE,BRIAN (HP-Roseville,ex1) wrote:
> I was rooting around in the MV code.  It seems to perform an unlink()
> then a link() if the filename already exists (Theodore Tso please
> correct me if I'm wrong).  This implies to me there is a moment where
> the filename (i.e. "current") is not in the directory.
$ touch a
$ touch b
$ strace mv a b
[...]
stat64("b", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lstat64("a", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lstat64("b", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
rename("a", "b") = 0
_exit(0) = ?
$
Seems to be atomic to me.
Cheers,
 Stephen
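What a daemon sees around that rename() call can be checked from user space. The sketch below assumes POSIX rename semantics on a single filesystem; the function name replace_while_open and the file contents are illustrative only.

```python
import os
import tempfile

def replace_while_open(directory):
    """Rename "a" over "b" while a reader already holds "b" open."""
    a = os.path.join(directory, "a")
    b = os.path.join(directory, "b")
    with open(a, "w") as f:
        f.write("new contents\n")
    with open(b, "w") as f:
        f.write("old contents\n")
    with open(b) as reader:              # a daemon that opened "b" before the mv
        os.rename(a, b)                  # the same call the strace output shows
        seen_by_open_fd = reader.read()
    with open(b) as f:                   # a daemon that opens "b" after the mv
        seen_by_new_open = f.read()
    return seen_by_open_fd, seen_by_new_open, os.path.exists(a)

with tempfile.TemporaryDirectory() as d:
    print(replace_while_open(d))
```

The earlier open keeps reading the old inode, a later open gets the new one, and neither sees a missing or half-written "b".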
----------------------------------------------------------------------------
On Tue, Sep 11, 2001 at 05:20:44PM -0700, KRELLE,BRIAN (HP-Roseville,ex1) wrote:
> Another thread on July 26th about MTAs and metadata synchronization,
> cryptically titled "ext3-2.4-0.9.4", talked about rename() being atomic
> in regards to other processes attempting to open the file being renamed.
>
> Since MV, or LN, do not seem to use the rename() function, how does one
> accomplish step 6 above atomically from processes using the FS API?  It
> would also be nice to minimize the number of underlying filesystem
> transactions to be powerfail safe.
As near as I can tell mv seems to use the rename() function, as long as
source and target are on the same filesystem.  If the source and
destination directories are on different filesystems, then a copy followed
by an unlink will take place.
						- Ted
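The fallback Ted describes can be approximated in a few lines: try rename() first and fall back to copy-then-unlink only when the kernel reports EXDEV (a cross-device rename). This is a rough sketch of that behaviour, not the actual mv source; the function name move and its return strings are hypothetical.

```python
import errno
import os
import shutil
import tempfile

def move(src, dst):
    """rename() when possible; otherwise copy then unlink (which is not atomic)."""
    try:
        os.rename(src, dst)
        return "rename"
    except OSError as e:
        if e.errno != errno.EXDEV:       # only a cross-filesystem move falls through
            raise
    shutil.copy2(src, dst)               # a reader may observe a partial dst here
    os.unlink(src)
    return "copy+unlink"

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "new")
    dst = os.path.join(d, "current")
    with open(src, "w") as f:
        f.write("port = 8080\n")
    print(move(src, dst))
```

Keeping "new" in the same directory as "current" keeps step 6 on the rename() path, since both names are then on the same filesystem.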