On 6/30/07, Wayne Davison <wayned@samba.org> committed:
> Added Files:
>     checksum-xattr.diff
> Log Message:
> A simple patch that lets rsync use cached checksum values stored in
> each file's extended attributes. A perl script is provided to create
> and update the values.

Wayne,

You should be aware of two drawbacks of caching checksums in xattrs:

First, setting the xattr hits the file's ctime. Thus, in exchange for rsync being able to skip the file, other tools that use ctime (such as GNU tar incremental backups) unnecessarily reprocess it. Beagle also caches checksums in xattrs, and one of its users complained about the effect on the ctime:

http://www.mail-archive.com/dashboard-hackers@gnome.org/msg03251.html

Second, it is impossible to make xattr-based checksum caching foolproof against same-second modification. Suppose a file is written during second 5 and then rsync caches its checksum during second 8; now the file has mtime 5 and ctime 8. Sometime later, rsync notices that the file still has mtime 5 and ctime 8. Does rsync trust the cached checksum? It must; otherwise the benefit of caching checksums would be lost. However, rsync will be fooled if the file was modified and then touched back to mtime 5 during second 8, right after the checksum was cached. This concern may not be relevant when the content is slowly changing.

Matt
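The same-second scenario above can be sketched as follows. This is only an illustration of the reasoning, not code from the patch: the names `Stamp`, `CacheEntry`, and `trust_cached` are hypothetical, timestamps are assumed to have one-second granularity, and the checksum strings are placeholders.

```python
from typing import NamedTuple


class Stamp(NamedTuple):
    """File metadata as seen through stat(), in whole seconds (assumed)."""
    mtime: int
    ctime: int
    size: int


class CacheEntry(NamedTuple):
    """What a cache would remember alongside the checksum (hypothetical)."""
    stamp: Stamp
    checksum: str


def trust_cached(entry: CacheEntry, current: Stamp) -> bool:
    # The cache can only compare metadata; it never sees the content.
    return entry.stamp == current


# File written during second 5; checksum cached during second 8.
# Setting the xattr moves ctime to 8.
cached = CacheEntry(Stamp(mtime=5, ctime=8, size=100), checksum="old-content")

# Later in second 8 the file is rewritten (same size) and touched back to
# mtime 5. Both changes update ctime, but it is still second 8.
after_modification = Stamp(mtime=5, ctime=8, size=100)

# The metadata is indistinguishable, so the stale checksum is trusted.
stale_checksum_trusted = trust_cached(cached, after_modification)
```

Here `stale_checksum_trusted` comes out true even though the content changed, which is the failure described above. Any change that lands in a different second for mtime or ctime would be caught.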

Matt McCutchen wrote:
> Second, it is impossible to make xattr-based checksum caching
> foolproof against same-second modification. Suppose a file is written
> during second 5 and then rsync caches its checksum during second 8;
> now the file has mtime 5 and ctime 8. Sometime later, rsync notices
> that the file still has mtime 5 and ctime 8. Does rsync trust the
> cached checksum? It must; otherwise the benefit of caching checksums
> would be lost. However, rsync will be fooled if the file was modified
> and then touched back to mtime 5 during second 8, right after the
> checksum was cached. This concern may not be relevant when the
> content is slowly changing.

There really ought to be a special kind of xattr which automatically disappears when the file is modified, for this sort of thing. Or a modification serial number, perhaps only incremented when somebody actually has read it.

Alas, I think attempts to get one into Linux didn't get very far; nobody thought it was that important.

-- Jamie

On Sat, Jun 30, 2007 at 04:17:29PM -0400, Matt McCutchen wrote:
> First, setting the xattr hits the file's ctime.

Yeah, I realize that, and that's why none of the xattr values cache the ctime. This does mean that this method isn't good for updating checksum values on existing files (since a general-purpose trusting/updating of checksums based on size and mtime would be no better than a non-checksum quick check). It is still useful for allowing a server to cache the checksum values without requiring any extra files. As long as it is used on files that aren't being actively updated, it works great.

I might make this patch capable of creating the cached checksum values when rsync creates a file, but I don't plan to make rsync ever update an xattr checksum on an existing file.

> Second, it is impossible to make xattr-based checksum caching
> foolproof against same-second modification.

Not really. The git algorithm only works if nothing modifies the files while the checksum operation is running. So, the algorithm protects against bad things for sequential operations, but not parallel operations.

A paranoid checksummer could notice if the mtime of a file was "now"(*) and delay checksumming that file until later in the run. It could also compare the mtime of a file from before and after it was read to ensure that it wasn't modified during the read phase (assuming that it never starts to read a file with an mtime of "now").

(*) Note that "now" for a particular disk may not be the same as time() if the disk is remote, so network filesystems can be rather complicated. Also, being off by a second might still be "now" if the value of the seconds field rolled over during the check.

The perl script in my patch that creates/updates these xattr checksums doesn't try to deal with any of these complications.

..wayne..
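
A minimal sketch of the "paranoid checksummer" idea described above, written in Python for illustration. It is not the perl script from the patch, which does not attempt any of this. The function name `checksum_if_stable` and the `slack` parameter are hypothetical, SHA-256 is only a stand-in for whatever checksum rsync would cache, and "now" is taken from the local clock, which, as the footnote says, may not match the clock of a remote disk.

```python
import hashlib
import os
import time


def checksum_if_stable(path, now=None, slack=1):
    """Return a hex digest, or None if the file should be retried later.

    A file whose mtime is within `slack` seconds of "now" is deferred,
    because a later write in the same second would leave mtime unchanged.
    The slack of one second covers the seconds field rolling over during
    the check.
    """
    if now is None:
        now = time.time()  # assumption: local clock matches the disk's

    before = os.stat(path)
    if abs(int(now) - int(before.st_mtime)) <= slack:
        return None  # mtime is "now": delay until later in the run

    digest = hashlib.sha256()  # stand-in digest, not rsync's own checksum
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)

    # Compare metadata from before and after the read phase.
    after = os.stat(path)
    if (after.st_mtime_ns, after.st_size) != (before.st_mtime_ns, before.st_size):
        return None  # modified while being read: result is unreliable

    return digest.hexdigest()
```

A caller would collect the paths for which this returns `None` and try them again at the end of the run. This narrows the window for sequential writers but, as noted above, a writer that restores the mtime after modifying the file can still defeat it.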