Herve Pages
2008-May-16 01:50 UTC
File corruptions with rsync version 2.6.9 on 64-bit openSUSE 10.3
Hi,

I'm part of the team that runs the Bioconductor project (http://bioconductor.org/), and so far we've used rsync successfully for many different things, in particular for moving the hundreds of packages that we build and check every day through our build system pipe (which is made of several build nodes running different OSes; see our daily build report here: http://bioconductor.org/checkResults/2.2/bioc-LATEST/).

At the very end of the build pipe, rsync is used again to sync our public package repository (http://bioconductor.org/packages/2.2/bioc/) with an internal repository that is behind a firewall.

Until recently, the internal repository was hosted on lamb1, a 64-bit SUSE LINUX 10.1 system:

  biocadmin@lamb1:~> rsync --version
  rsync version 2.6.6 protocol version 29
  Copyright (C) 1996-2005 by Andrew Tridgell and others
  <http://rsync.samba.org/>
  Capabilities: 64-bit files, socketpairs, hard links, ACLs, symlinks, batchfiles,
                inplace, IPv6, 64-bit system inums, 64-bit internal inums, SLP

  rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you
  are welcome to redistribute it under certain conditions. See the GNU
  General Public Licence for details.

and AFAICT we've never observed any file corruption when rsync'ing between lamb1 and bioconductor.org. rsync was run every day on lamb1 with the following options:

  rsync --delete -ave ssh SRC USER@HOST:DEST

Recently we've set up a new machine, wilson1, for hosting the internal package repository. wilson1 is a 64-bit openSUSE 10.3 system:

  biocadmin@wilson1:~> rsync --version
  rsync version 2.6.9 protocol version 29
  Copyright (C) 1996-2006 by Andrew Tridgell, Wayne Davison, and others.
  <http://rsync.samba.org/>
  Capabilities: 64-bit files, socketpairs, hard links, symlinks, batchfiles,
                inplace, IPv6, ACLs, xattrs, SLP
                64-bit system inums, 64-bit internal inums

  rsync comes with ABSOLUTELY NO WARRANTY. This is free software, and you
  are welcome to redistribute it under certain conditions. See the GNU
  General Public Licence for details.

Now when we use rsync on wilson1 to synchronize the internal and public package repositories, we end up with corrupted files on the public repository (the md5sums of the local and remote files differ, but their sizes and timestamps are exactly the same). On wilson1, we use rsync exactly the same way as on lamb1, i.e. we do:

  rsync --delete -ave ssh SRC USER@HOST:DEST

The destination machine (bioconductor.org) is a 64-bit SUSE LINUX Enterprise Server 9 system. It did not change when we switched the source machine from lamb1 to wilson1.

The frequency of the corruptions seems to be low, but since the total volume of packages that we produce is high (> 30G, and a few packages are several hundred MB), we end up with a few corrupted packages on bioconductor.org (9 in total today; most of them are among the biggest packages we produce, i.e. they are > 700MB).

Of course, if I rerun

  rsync --delete -ave ssh SRC USER@HOST:DEST

the corrupted files are not detected, so nothing happens. But strangely enough, if I delete the corrupted file by hand and rerun the above command, then this time the transfer seems to be OK. But maybe that's just luck (given that the corruptions seem to happen randomly). I've only done this manual deletion once, and for 1 file only, because I want to give our IT guys some time to look into this problem.

Any idea what could be going wrong? What kind of extra information would you need?

Thanks in advance for your help,
H.
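
The message above notes that the corrupted files keep the same size and timestamp as the good copies, which is why a plain rerun skips them: by default rsync only compares size and modification time, unless an option such as --checksum or --ignore-times is given. The following is a minimal Python sketch of that distinction, not rsync's actual implementation. The function names (md5_of, quick_check_equal, content_equal) are hypothetical and exist only for illustration.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def quick_check_equal(path_a, path_b):
    """Compare size and whole-second mtime only.

    Assumption: this mimics the spirit of a size-and-mtime check;
    it is not a reproduction of rsync's own logic.
    """
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_size == stat_b.st_size
            and int(stat_a.st_mtime) == int(stat_b.st_mtime))


def content_equal(path_a, path_b):
    """Compare the actual file contents through their MD5 digests."""
    return md5_of(path_a) == md5_of(path_b)
```

Two files for which quick_check_equal returns True while content_equal returns False are exactly the kind of file a size-and-mtime comparison cannot flag.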
Herve Pages
2008-May-22 01:43 UTC
File corruptions with rsync version 2.6.9 on 64-bit openSUSE 10.3
Hi,

An update on this: we might have a hardware problem. After moving our internal package repository to another machine with the same OS, same patch level, same rsync version and same hardware, we don't observe file corruptions anymore.

We've tried different versions of rsync on the broken machine (2.6.9, 3.0.2 and 2.6.6) with different options (--whole-file and --ignore-times), and we always ended up with a few corrupted files on the remote machine (the destination).

Then we discovered that running md5sum on the local files at different moments produced different results, even though no process/job was supposed to modify those files in the meantime (and the timestamps confirmed this). Some files would have an abnormal md5sum and look corrupted for a few minutes, and then be back to their normal md5sum and look fine again.

All the files are on a hardware RAID10 made of 4 disks of 230GB each, and our IT guys are starting to suspect it. I'll post here again when we know more...

Cheers,
H.

Kyle Lanclos wrote:
> Are you experiencing hardware problems? While disk problems usually show
> up in a log somewhere, something like memory or CPU problems usually do
> not on Linux systems.
>
> I've had at least two systems manifest CPU problems in the form of random
> I/O corruption.
>
> --Kyle
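
The diagnostic described in this update, hashing the same unmodified file at different moments and checking whether the digest changes, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (md5_of, digests_over_time, is_stable) and the rounds and pause_seconds parameters are hypothetical. Repeated reads may also be served from the operating system's cache rather than from the disks, so a stable result does not by itself prove the hardware is healthy.

```python
import hashlib
import time


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_over_time(path, rounds=5, pause_seconds=0.0):
    """Hash the same file `rounds` times, pausing between reads."""
    digests = []
    for index in range(rounds):
        digests.append(md5_of(path))
        if index < rounds - 1:
            time.sleep(pause_seconds)
    return digests


def is_stable(digests):
    """True when every recorded digest is identical."""
    return len(set(digests)) <= 1
```

For a file that nothing is writing to, is_stable(digests_over_time(path)) returning False points at the storage, memory or CPU rather than at rsync.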