On Fri, Sep 10, 2010 at 10:45:08AM +0200, freebsd wrote:
> hi list,
>
> we upgraded some 20 boxes from 7.1 and 7.2 to 7.3-RELEASE-p2 (all amd64)
> and now are experiencing some weird behaviour on 6 of them with rsnapshot:
>
> after a few days/several weeks (seems to be completely random),
> rsnapshot reports that it can't start due to its lockfile and process
> still being present. on such boxes either a zombie rm or find process
> (which was presumably launched by rsnapshot) can be found.
> if the backup was done to a separate partition (physical disks or RAIDs)
> any access (ls, stat, fsck, etc) to the partition would kill the current
> SSH session, creating a new zombie of the process one just started.
> unmounting the affected partition would render the server completely
> unresponsive and required a hardware reset.
>
> when trying to restart, the machines wouldn't even shut down completely
> but hung somewhere after syncing buffers; only a hardware reset
> worked. after the reboot, those partitions were unmounted and fscked,
> after which the backups would work again until the next error
> happened.
>
> the hardware of the affected and unaffected systems is:
>
> HP ProLiant DL380 G4
> HP ProLiant DL380 G5
> HP ProLiant DL360 G5
>
> there is no visible pattern between affected and unaffected boxes. also
> those machines were upgraded the exact same way, running identical
> kernels (more or less GENERIC, with QUOTA activated).
>
> we upgraded the most critical boxes, which showed that behaviour
> daily, to 8.0-RELEASE, and the behaviour has not reappeared for
> nearly 3 months now.
>
> we installed a debug-kernel on an affected box, but the machine wouldn't
> panic when the error occurred. when trying to unmount the affected
> partition it just went completely unresponsive, as mentioned above.
>
> before trying to unmount procstat -ak showed some processes with
> VOP_LOCK1_APV:
>
> 55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup
> vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
> 70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire
> _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf
> ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
>
> since this hardware was working before 7.3 and -- as we assume --
> would work again with 8.*, we would be grateful for any hints as to
> what could be the cause of all this.
It sounds like a deadlock, but the cause cannot be identified without
further diagnostics. It might be the driver (ciss, I assume), but it may
also be the quota code, or something else entirely.
Please follow the instructions at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
to obtain the required information.
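
While gathering that information, it can help to pull the threads that are
stuck on a vnode lock out of a long procstat -ak listing. The following is
only a rough sketch in Python and not part of any FreeBSD tool: the function
name blocked_threads and the SAMPLE text are made up for illustration, and it
assumes each thread occupies a single line of the form "PID TID COMM TDNAME"
followed by the stack frames, with no whitespace inside the names.

```python
# Sketch only: filter `procstat -ak` output for threads whose kernel stack
# contains a given frame (VOP_LOCK1_APV by default).
# Assumption: one thread per line, laid out as PID TID COMM TDNAME frames...

# The two stacks quoted in this thread, rejoined onto single lines.
SAMPLE = """\
  PID    TID COMM             TDNAME           KSTACK
55396 100135 find - mi_switch sleepq_switch sleepq_wait _sleep acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget cache_lookup vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat lstat syscall
70923 100146 rsync - mi_switch sleepq_switch sleepq_wait _sleep acquire _lockmgr ffs_lock VOP_LOCK1_APV _vn_lock vget vfs_hash_get ffs_vgetf ufs_lookup_ vfs_cache_lookup VOP_LOOKUP_APV lookup namei kern_lstat
"""


def blocked_threads(procstat_output, marker="VOP_LOCK1_APV"):
    """Return (pid, tid, command, frames) for each thread whose stack has marker."""
    blocked = []
    for line in procstat_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and anything too short to hold a stack.
        if len(fields) < 5 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        pid, tid, comm = int(fields[0]), int(fields[1]), fields[2]
        frames = fields[4:]  # everything after TDNAME
        if marker in frames:
            blocked.append((pid, tid, comm, frames))
    return blocked


for pid, tid, comm, frames in blocked_threads(SAMPLE):
    # Frames are listed innermost first, so the entry after the marker is its caller.
    caller = frames[frames.index("VOP_LOCK1_APV") + 1]
    print(pid, tid, comm, "waiting in VOP_LOCK1_APV, called from", caller)
```

This only narrows down which threads are waiting; it says nothing about which
thread holds the lock, which is what the DDB output described in the handbook
is needed for.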