James A. Dinkel
2007-Feb-05 19:12 UTC
[Samba] cleaning up duplicate files on the file server
I imagine we can save some space on our file server by cleaning up all the files that are saved multiple times by different people. There is already the fdupes command in Linux that will scan a directory tree and report which files have duplicates. This could easily be scripted to turn those duplicate files into symlinks to one file.

The problem I see, then, is what would happen if someone tries to change a duplicate file that they think is their own copy. Of course, everyone with a symlink to that file would get the changes, which is not what I would want. What it would need is some sort of copy-on-edit mechanism, so that when the file is changed, instead of changing the original file, the symlink is replaced with the edited version of the file.

Does this make sense? Has anyone else thought about this, or found an elegant solution to it?

James Dinkel
Network Engineer
Butler County of Kansas

There are 10 types of people in the world: those who understand binary, and those who don't.
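For illustration, here is a minimal sketch of the kind of script described above, assuming fdupes is installed and that /share is the tree to scan (the path is just a placeholder). It keeps the first file in each duplicate group and turns the rest into symlinks; it does nothing about the copy-on-edit problem raised above.

#!/bin/sh
# Sketch only: fdupes -r prints duplicate groups separated by blank lines.
# Keep the first path in each group, replace the others with symlinks.
fdupes -r /share | while IFS= read -r file; do
    if [ -z "$file" ]; then
        keep=""                       # blank line ends a duplicate group
    elif [ -z "$keep" ]; then
        keep=$file                    # first copy in the group is kept
    else
        ln -sf "$keep" "$file"        # remaining copies become symlinks
    fi
done

Since /share is given as an absolute path, fdupes prints absolute paths, so the symlink targets resolve from anywhere.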
Dealing with data duplication is not always particularly easy. What I would suggest is the following:

1) Identify the duplicates with the oldest modification date.
2) Notify your users that you are making changes and to be on the lookout for any problems.
3) Change the file permissions so that they can't be accessed by anyone other than you.
4) If, after some predetermined length of time (measured in months, preferably), nobody has complained, delete the duplicates.

Changing the permissions offers you an easy way to simulate deleting without actually deleting. You could issue a command to dump the ACLs for each file into a log by using a modified form of the command I've posted in the past for setting the archive bit on files that have been modified. Here it is for your convenience:

/usr/bin/find /share/ -name '*' -mtime 0 -exec setfattr --name=user.DOSATTRIB --value=0x30783230 {} \;

You could change the find command to use your duplicate search instead, and change the setfattr to getfacl. With some fancy footwork, you should be able to do all of that and redirect the output into a text file, in case you ever have to restore permissions to their previous state. Of course, you could also use this command to set permissions on all of the files by using setfacl.

Just a suggestion. Any shell gurus out there who can offer up better or clearer advice, please do so.
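A rough sketch of how those pieces might fit together, assuming fdupes supplies the duplicate list and that /share and /root/dup-acls.txt are placeholder paths, not anything from the original post. For each duplicate group it keeps the first copy, logs the ACLs of the rest with getfacl, and then locks them down with setfacl so only root can reach them:

#!/bin/sh
# Sketch only: "soft-delete" duplicates by saving their ACLs to a log
# and then removing all owner/group/other access.
LOG=/root/dup-acls.txt
fdupes -r /share | while IFS= read -r file; do
    if [ -z "$file" ]; then
        keep=""                                      # blank line ends a group
    elif [ -z "$keep" ]; then
        keep=$file                                   # first copy stays accessible
    else
        getfacl --absolute-names "$file" >> "$LOG"   # record ACL before changing it
        setfacl --set u::---,g::---,o::--- "$file"   # nobody but root can open it
    fi
done

If nobody complains, the locked files can be deleted later; if someone does, setfacl --restore=/root/dup-acls.txt should put the logged permissions back the way they were.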