samba-bugs at samba.org
2019-Jan-02 17:28 UTC
[Bug 13735] New: Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
https://bugzilla.samba.org/show_bug.cgi?id=13735
Bug ID: 13735
Summary: Synchronize files when the sending side has newer
change times while modification times and sizes are
identical on both sides
Product: rsync
Version: 3.1.3
Hardware: All
OS: All
Status: NEW
Severity: normal
Priority: P5
Component: core
Assignee: wayned at samba.org
Reporter: sbehuret at gmail.com
QA Contact: rsync-qa at samba.org
Files that have identical sizes and modification times on the sending and receiving
sides, but different contents, will not be synchronized by default (e.g. rsync
-a --delete source:/path/ dest:/path/). Synchronizing these files requires the
use of --checksum or --ignore-times options, which are both sub-optimal in most
cases (see caveats below). I would like to propose new options to efficiently
synchronize these files.
To make this report as clear as possible: modification times (mtime) can be
set manually by users (e.g. with touch), and rsync preserves them during
synchronization (with -a, which includes -t). By contrast, change times
(ctime) are updated automatically by the OS whenever a file's content or
metadata changes, and cannot be set by users. rsync relies on sizes and
modification times to decide whether it should skip or transfer files.
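The difference between the two timestamps can be observed directly. The following is a minimal sketch, assuming GNU coreutils (`stat -c` and `touch -d` are GNU-specific); the helper names `mtime_of` and `ctime_of` and the temporary file are illustrative only:

```shell
#!/bin/sh
# Sketch: back-dating mtime with touch does not back-date ctime.
# Assumes GNU coreutils (stat -c, touch -d). Helper names are hypothetical.

mtime_of() { stat -c %Y "$1"; }   # last data modification, seconds since epoch
ctime_of() { stat -c %Z "$1"; }   # last inode change, seconds since epoch

demo=$(mktemp)
trap 'rm -f "$demo"' EXIT         # remove the temporary file when the shell exits

echo 'some content' > "$demo"

# Back-date the modification time to 2010-01-01 00:00:00 UTC.
touch -m -d '2010-01-01 00:00:00 UTC' "$demo"

echo "mtime: $(mtime_of "$demo")"   # the user-chosen date
echo "ctime: $(ctime_of "$demo")"   # set by the kernel when touch ran
```

The mtime reported is the back-dated value, while the ctime reflects the moment `touch` was run, which is why ctime can reveal a change that mtime hides.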
There are some use cases where files are modified while preserving their
original sizes and mtimes. This can happen when a fixed-size file is updated
with new content in a build chain while forcibly preserving a specific mtime.
To the best of my knowledge, rsync does not have any option to transfer these
files in an efficient manner.
I illustrate the issue below:
# Create two different files with same size
echo 'new content' > srcfile
echo 'old content' > destfile
# Set identical mtime for both files: update srcfile's mtime to match destfile's mtime
touch -mr destfile srcfile
# At this point, srcfile and destfile have:
# - identical size
# - identical mtime
# - different content
# - srcfile's ctime is newer than destfile's ctime
# rsync experiments (dry runs)
rsync -avn srcfile destfile # will not synchronize
rsync -avn --checksum srcfile destfile # will synchronize
rsync -avn --ignore-times srcfile destfile # will synchronize
The desired behavior is to synchronize srcfile to destfile, because srcfile is
different and has a newer ctime.
Caveats:
- The --checksum option is I/O intensive and will check all files.
- The --ignore-times option will force a (re)synchronization of all files.
To solve this issue, we could check change times. Considering that rsync will
not be able to control change times on the receiving side, we must be careful.
If we suppose that rsync used ctimes (and not mtimes) to compare these files,
it would first find that srcfile is newer, and then on subsequent rsync passes,
it would find that destfile is newer. Therefore we can't transfer files if
[srcfile's ctime != destfile's ctime]. However, we can use a combination of
mtime and ctime to resolve the above ambiguity.
Considering that both files have identical size and mtime, I propose the
following new flags:
--ignore-times-if-newer
"don't skip files that match size and (m)time if source has newer ctime"
This would force a transfer for files that are newer on the sending side,
regardless of their sizes and mtimes.
--checksum-if-newer
"skip based on checksum if source has newer ctime"
Likewise, this would force a transfer for files that are newer on the sending
side and different in content.
Similarly, --ignore-times-if-older and --checksum-if-older may be desirable
when we trust older files on the sending side more than newer files on the
receiving side.
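The proposed rules can be sketched for two local files as follows. This is not rsync code: the function names are hypothetical, GNU `stat` is assumed, and `cmp` stands in for rsync's checksum comparison.

```shell
#!/bin/sh
# Sketch of the proposed decision rules for two LOCAL files.
# Not rsync code: function names are hypothetical; assumes GNU stat.

# Succeeds (exit 0) when size and mtime are identical, i.e. when the
# default quick check would skip the file.
quick_check_match() {
    [ "$(stat -c '%s %Y' "$1")" = "$(stat -c '%s %Y' "$2")" ]
}

# --ignore-times-if-newer: transfer when the quick check fails, or when it
# matches but the source has a strictly newer ctime.
transfer_if_newer() {
    quick_check_match "$1" "$2" || return 0
    [ "$(stat -c %Z "$1")" -gt "$(stat -c %Z "$2")" ]
}

# --checksum-if-newer: as above, but a newer ctime only triggers a content
# comparison (cmp stands in for a checksum); transfer only if content differs.
transfer_if_newer_and_different() {
    quick_check_match "$1" "$2" || return 0
    [ "$(stat -c %Z "$1")" -gt "$(stat -c %Z "$2")" ] && ! cmp -s "$1" "$2"
}
```

Each function takes the source file first and the destination second, and returns 0 when a transfer would be requested. Because the comparison is "strictly newer" rather than "not equal", the destination's fresh ctime after a transfer makes later passes fall back to the plain quick check.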
New rsync experiments:
# Try to synchronize srcfile to destfile
rsync -avn --ignore-times-if-newer srcfile destfile # will synchronize
rsync -avn --ignore-times-if-older srcfile destfile # will not synchronize
# Swap source and destination: Try to synchronize destfile to srcfile
rsync -avn --ignore-times-if-newer destfile srcfile # will not synchronize
rsync -avn --ignore-times-if-older destfile srcfile # will synchronize
# Real examples (wet runs) to test multiple rsync passes
# Synchronizing srcfile to destfile
rsync -av --ignore-times-if-newer srcfile destfile # 1st pass: will synchronize
rsync -av --ignore-times-if-newer srcfile destfile # 2nd pass: will skip
rsync -av --ignore-times-if-newer srcfile destfile # nth pass: will always skip
# (srcfile's ctime is no longer newer, so rsync relies only on size and mtime,
# which are identical)
rsync -av --ignore-times-if-older srcfile destfile # 1st pass: will synchronize
rsync -av --ignore-times-if-older srcfile destfile # 2nd pass: will synchronize
rsync -av --ignore-times-if-older srcfile destfile # nth pass: will always synchronize
# (srcfile's ctime will always be older at this point)
# Final cleanup
rm srcfile destfile
And similarly for --checksum-if-newer/--checksum-if-older.
These additional options would efficiently synchronize newer
(--ignore-times-if-newer/--checksum-if-newer) or older
(--ignore-times-if-older/--checksum-if-older) files that do not differ in size
and mtime.
--
You are receiving this mail because:
You are the QA Contact for the bug.
samba-bugs at samba.org
2019-Jan-02 17:44 UTC
[Bug 13735] Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
https://bugzilla.samba.org/show_bug.cgi?id=13735
--- Comment #1 from Kevin Korb <rsync at sanitarium.net> ---
Yes, a user can back-date the mtime on a file. This is what rsync does when it
copies a file and then copies the timestamp. Your suggestion of a better
--checksum option is an interesting idea, but so far we don't even have a
--checksum that behaves as intelligently as the man page describes it. Try
rsyncing a big tree to an empty dir with --checksum. According to the man page
--checksum shouldn't matter, but you will see that it checksums files when it
has nothing to compare them to. It should skip checksumming files that have no
match. It should also skip checksumming files that don't have matching file
sizes, but it doesn't. This is why I say that --checksum is almost always a bad
idea.
samba-bugs at samba.org
2019-Jan-15 18:06 UTC
[Bug 13735] Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
https://bugzilla.samba.org/show_bug.cgi?id=13735
Wayne Davison <wayne at opencoder.net> changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|NEW |RESOLVED
Resolution|--- |WONTFIX
--- Comment #2 from Wayne Davison <wayne at opencoder.net> ---
Yes, the quick-check algorithm relies on the user and the apps to not make a
data change and then back-date the mtime to lie about when the file was last
modified. To help deal with this hiding of changes, the rsync patches repo has
several checksum-caching idioms that accelerate a --checksum transfer. My
favorite is db.diff that allows you to stash off checksum info in a local
SQLite db file (or even a shared MySQL db). Such a setup doesn't avoid all
redundant checksum computations (since the ctime can change on a file that has
no data changes), but it does make checksum transfers MUCH faster.
As Kevin mentions, the --checksum algorithm could also be improved to make it
more selective, but that would require a protocol change and (in the simplest
generator modification) we'd need to make the generator sit around and wait for
its requested checksum data for each file that needs it before finishing up the
current file and moving on to check the next one. Such a revised checksum
method would be required for any maybe-checksum processing, such as the options
you propose. Your options would also require adding in ctime info into the file
lists. One nice thing about the checksum-caching patches is that you can make
use of the optimization on just one side of the transfer and the other side
gets the normal checksum-using file data. It also allows the sharing of
updated checksum info between multiple rsync copies (whereas your suggested
options would require each source/destination ctime difference to re-compute
the checksum).
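To illustrate the general idea of checksum caching (this is NOT the db.diff patch, whose schema is not described in this thread), here is a minimal sketch that stores checksums in a flat text file keyed on size, mtime, ctime and path. The function name `cached_sum` and the cache format are assumptions made for illustration; GNU `stat` and `sha256sum` are assumed to be available.

```shell
#!/bin/sh
# Sketch of checksum caching keyed on file metadata. This is NOT the db.diff
# patch; the cache format and function name are assumptions for illustration.
# Assumes GNU stat and sha256sum; paths must not contain tabs or newlines.

# Usage: cached_sum FILE CACHE
# Prints FILE's SHA-256, recomputing only when size, mtime or ctime changed.
cached_sum() {
    file=$1 cache=$2
    key="$(stat -c '%s:%Y:%Z' "$file"):$file"
    sum=
    if [ -f "$cache" ]; then
        # Look for an exact match on the metadata key (first tab-separated field).
        sum=$(KEY=$key awk -F '\t' '$1 == ENVIRON["KEY"] { print $2; exit }' "$cache")
    fi
    if [ -z "$sum" ]; then
        # Cache miss: compute the checksum and append it. Stale entries are
        # never pruned in this sketch.
        sum=$(sha256sum < "$file" | cut -d ' ' -f 1)
        printf '%s\t%s\n' "$key" "$sum" >> "$cache"
    fi
    printf '%s\n' "$sum"
}
```

Because ctime is part of the key, a metadata-only change (such as chmod) forces a recomputation even though the data is unchanged, which matches the redundant computations mentioned above.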
samba-bugs at samba.org
2019-Jan-22 10:57 UTC
[Bug 13735] Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
https://bugzilla.samba.org/show_bug.cgi?id=13735
--- Comment #3 from Sébastien Béhuret <sbehuret at gmail.com> ---
Thank you for suggesting the patches repo. An improved checksum/maybe-checksum
algorithm would be great, but it appears to require a lot of work. Checksums
are very handy for special cases (e.g. to detect and fix data corruption) but
are still relatively slow and prone to collisions, or require specific patches
as you suggested. Ideally, we want the possibility to enforce the
synchronization of files that are more recent on the sending side when mtime
and size are identical on both sides. This would improve the reliability of
system backup software that is based on rsync, and could be implemented as a
new option to alter the behavior of the quick-check algorithm.

Overall, rsync lacks a solid way to detect and transfer back-dated files. I
feel that the importance of dealing with back-dated files is underestimated:

In a file system, file back-dating may occur during software updates without
malicious intent and without users being aware of it. An example of file
back-dating is found in the Firefox package in Debian-based distributions. Some
JS files in the /usr/share/firefox/browser/defaults/preferences/ directory are
always dated 2010-01-01 00:00:00. When changes in these files are small (e.g. a
version string, or a fixed-size series of characters such as a timestamp, hash
or key), the files end up with the same size and mtime, and the changes won’t
be detected by rsync's quick-check algorithm. Backup software relying on rsync
for incremental updates will eventually produce incorrect backups unless it
uses the --checksum option, but this is sub-optimal (and sometimes buggy), and
most backup systems don’t even allow the user to add this option.

Quick fix suggestion:

This may be a bit of an oversimplification, but assuming that the current rsync
quick-check algorithm looks like this:

    synchronize(source, dest) IF [ mtime(source) != mtime(dest) OR
    size(source) != size(dest) ]

Then a new option (e.g. --use-ctime or --ignore-times-if-newer) could alter it
in the following way:

    synchronize(source, dest) IF [[ ctime(source) > ctime(dest) ] OR
    [ mtime(source) != mtime(dest) OR size(source) != size(dest) ]]

(Notice the use of ‘greater than’ rather than ‘not equal’ to compare ctimes.)

This would do the trick and ensure that files that were back-dated are properly
detected and synchronized during incremental updates. I think that such an
option is a must-have for reliable backup software, and could even be enabled
by default since atime updates do not alter ctime.
Joe
2019-Jan-22 11:27 UTC
[Bug 13735] Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
While a solution using rsync would be ideal, if you don't have a huge number of
files or lots of huge files which meet this special case, it probably wouldn't
be prohibitively difficult to write your own special-case script. It could be
set to ignore all files that rsync will handle normally (but it would still
have to look at both sides to determine that).
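A special-case script along these lines might look like the following sketch. It only handles two local directory trees, assumes GNU `find` and `stat`, and uses a hypothetical function name, `list_backdated`. File names containing newlines are not supported.

```shell
#!/bin/sh
# Sketch of a helper that lists files the default quick check would skip
# even though the source copy has a newer ctime. Local trees only; the
# function name is hypothetical. Assumes GNU find and stat, and file names
# without newlines.

# Usage: list_backdated SRC_DIR DEST_DIR
# Prints paths relative to SRC_DIR, one per line.
list_backdated() {
    src=$1 dest=$2
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        rel=${rel#./}
        [ -f "$dest/$rel" ] || continue      # new file: rsync copies it anyway
        [ "$(stat -c '%s %Y' "$src/$rel")" = "$(stat -c '%s %Y' "$dest/$rel")" ] \
            || continue                      # quick check already catches it
        if [ "$(stat -c %Z "$src/$rel")" -gt "$(stat -c %Z "$dest/$rel")" ]; then
            printf '%s\n' "$rel"
        fi
    done
}
```

The resulting list could then be passed to rsync's existing options, for example `list_backdated /path/to/source /path/to/dest > backdated.list` followed by `rsync -a --ignore-times --files-from=backdated.list /path/to/source/ /path/to/dest/`. Files whose ctime changed for unrelated reasons (a rename or a permission change) are listed too, so some of the forced transfers will be redundant.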
samba-bugs at samba.org
2019-Jun-20 17:43 UTC
[Bug 13735] Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
https://bugzilla.samba.org/show_bug.cgi?id=13735
Sébastien Béhuret <sbehuret at gmail.com> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|WONTFIX |---
Status|RESOLVED |REOPENED
--- Comment #4 from Sébastien Béhuret <sbehuret at gmail.com> ---
Quick workaround to find backdated files that may be wrongly ignored by rsync:
# print if ctime > mtime
find /path/to/source/ -type f -exec sh -c 'test "$(stat -c %Z "$1")" -gt "$(stat -c %Y "$1")"' sh {} \; -print
However, on a regular system, most files will have a ctime later than their
mtime, e.g. if they have been moved or had their permissions/ownership altered,
which makes the above command impractical.
A practical use case is the following: An initial transfer was done a year ago
and we want to update the destination incrementally. To find backdated files
that may be ignored by rsync, we have to look for files that were backdated
after the original transfer:
# print if ctime - mtime > 1y (31536000s), i.e. the difference between the time
# of the original transfer and the time of the incremental update
find /path/to/source/ -type f -exec sh -c 'test "$(( $(stat -c %Z "$1") - $(stat -c %Y "$1") ))" -gt 31536000' sh {} \; -print
Re-opening this bug report as there is currently no satisfactory solution to
ensure rsync will transfer all modified files during incremental updates.
samba-bugs at samba.org
2020-Apr-05 22:14 UTC
[Bug 13735] Synchronize files when the sending side has newer change times while modification times and sizes are identical on both sides
https://bugzilla.samba.org/show_bug.cgi?id=13735
Wayne Davison <wayne at opencoder.net> changed:
What |Removed |Added
----------------------------------------------------------------------------
Resolution|--- |WONTFIX
Status|REOPENED |RESOLVED