Hi all,

I've encountered a LustreError that might have triggered an unwanted failover of an MGS/MGD HA pair of servers. I'm not sure about the latter, but at least I have not found a trace of that error via Google, so it might be worth considering. And it occurred in this form only the two times the heartbeat monitoring failed shortly afterwards:

kern.log.1:Jul 20 06:47:19 kernel: LustreError: 27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0
kern.log.1:Jul 20 06:47:41 kernel: LustreError: 27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0

There was no Lustre log activity the day before that, the last entry before being an eviction of a client at Jul 25 19:31:09.

The system is running Lustre 1.6.3, kernel 2.6.22, Debian Etch.

There are some more 'acquire timeout' messages dating from Jul 24+25, however not for 'key 0' but for key 4209, 4409, ..., whatever this may mean. No "fatal" consequences then.

On Jul 27, the same happened again:

kern.log:Jul 27 06:47:17 kernel: LustreError: 24327:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError: 22381:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0

This time it took heartbeat only seconds to lose its IP:

lrmd[10373]: 2008/07/27_06:47:31 WARN: IPaddr:monitor process (PID 26903) timed out (try 1). Killing with signal SIGTERM (15).

On another system running Lustre 1.6.5, without any heartbeat errors, it was:

kern.log:Jul 27 06:47:20 kernel: LustreError: 4627:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError: 3581:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout exceeded for key 0

Of course these temporal coincidences look verrrrry suspicious.
So far, I have no idea what kind of weird script might be running at these times causing all the trouble, still I'm already looking forward to next Sunday ;-)

But it would be nice if somebody could explain these Lustre errors, and perhaps assure me that these Lustre errors are entirely harmless or cannot possibly have any influence on the stability of the system.

Thanks,
Thomas
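
To line the timestamps up across servers, the 'acquire timeout' entries can be pulled out of kern.log mechanically. The following is only a rough sketch: the function name `find_acquire_timeouts` is made up for illustration, and the regular expression assumes the exact message layout quoted above, which may differ in other Lustre versions or syslog configurations.

```python
import re

# Assumed layout: "<file>:<Mon> <day> <HH:MM:SS> kernel: LustreError: <pid>:0:(upcall_cache.c:...) ..."
PATTERN = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r".*?LustreError:\s+(?P<pid>\d+):\d+:"
    r"\(upcall_cache\.c:\d+:upcall_cache_get_entry\(\)\)\s+"
    r"acquire timeout exceeded for key (?P<key>\d+)"
)


def find_acquire_timeouts(lines):
    """Return (month, day, time, key) for every 'acquire timeout' log line."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append(
                (match["month"], int(match["day"]), match["time"], int(match["key"]))
            )
    return hits


if __name__ == "__main__":
    # Two of the lines quoted in this post, used as sample input.
    sample = [
        "kern.log.1:Jul 20 06:47:19 kernel: LustreError: 27696:0:"
        "(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0",
        "kern.log:Jul 27 06:47:17 kernel: LustreError: 24327:0:"
        "(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout exceeded for key 0",
    ]
    for hit in find_acquire_timeouts(sample):
        print(hit)
```

Grouping the results by time of day would make a recurring trigger such as the 06:47 pattern stand out, and separating key 0 from the other keys would show whether only those entries coincide with the heartbeat failures.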
Andreas Dilger
2008-Jul-30 10:44 UTC
[Lustre-discuss] LustreError: acquire timeout exceeded
On Jul 29, 2008 18:51 +0200, Thomas Roth wrote:
> kern.log.1:Jul 20 06:47:19 kernel: LustreError:
> 27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
> exceeded for key 0
> kern.log.1:Jul 20 06:47:41 kernel: LustreError:
> 27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout
> exceeded for key 0

This means the MDS was unable to find supplementary group information for the root user.

> Of course these temporal coincidences look verrrrry suspicious. So far,
> I have no idea what kind of weird script might be running at these times
> causing all the trouble, still I'm already looking forward to next
> Sunday ;-)

It isn't clear whether this is cause or effect though. If you use any kind of network user/group database (LDAP, NIS, etc) then the timeout likely means that the network had already failed at this time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
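
One way to test the network-database theory is to time a group lookup for root on the MDS node around the suspicious hour. The sketch below is only an approximation: it is not what Lustre itself runs, the function name `time_group_lookup` is invented for illustration, and it assumes that the MDS resolves groups through the node's ordinary name service configuration and that key 0 corresponds to UID 0, as the reply implies.

```python
import os
import pwd
import time


def time_group_lookup(uid=0):
    """Resolve a user's groups via the system name service and time the lookup."""
    start = time.monotonic()
    user = pwd.getpwuid(uid)  # answered by files, LDAP, NIS, ... depending on local setup
    gids = os.getgrouplist(user.pw_name, user.pw_gid)  # primary plus supplementary groups
    elapsed = time.monotonic() - start
    return user.pw_name, sorted(gids), elapsed


if __name__ == "__main__":
    name, gids, elapsed = time_group_lookup(0)
    print(f"{name}: {len(gids)} group(s) resolved in {elapsed:.3f} s")
```

If this lookup takes several seconds or hangs at the times in question, that would point at the user/group database or the network behind it rather than at Lustre.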