thr3ads.net - Lustre discuss - [Lustre-discuss] LustreError: acquire timeout exceeded [Jul 2008]

Thomas Roth

2008-Jul-29 16:51 UTC

[Lustre-discuss] LustreError: acquire timeout exceeded

Hi all,

I've encountered a LustreError that might have triggered an unwanted 
failover of a MGS/MGD HA pair of servers. I'm not sure about the 
latter, but at least I have not found a trace of that error via Google, 
so it might be worth considering.
It has occurred in this form only twice, and both times the heartbeat 
monitoring failed shortly afterwards:

kern.log.1:Jul 20 06:47:19 kernel: LustreError: 
27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout 
exceeded for key 0
kern.log.1:Jul 20 06:47:41 kernel: LustreError: 
27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout 
exceeded for key 0

There was no Lustre log activity the day before that; the last earlier 
entry was the eviction of a client at Jul 25 19:31:09.

The system is running Lustre 1.6.3, kernel 2.6.22, Debian Etch.

There are some more 'acquire timeout' messages dating from Jul 24+25, 
however not for 'key 0' but for key 4209, 4409, ..., whatever this may 
mean. No "fatal" consequences then.

On Jul 27, the same happened again:

kern.log:Jul 27 06:47:17 kernel: LustreError: 
24327:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout 
exceeded for key 0
kern.log:Jul 27 06:47:37 kernel: LustreError: 
22381:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout 
exceeded for key 0

This time it took heartbeat only seconds to lose its IP:

lrmd[10373]: 2008/07/27_06:47:31 WARN: IPaddr:monitor process (PID 
26903) timed out (try 1).  Killing with signal SIGTERM (15).

On another system running Lustre 1.6.5, without any heartbeat errors, it 
was:
kern.log:Jul 27 06:47:20  kernel: LustreError: 
4627:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout 
exceeded for key 0
kern.log:Jul 27 06:47:37  kernel: LustreError: 
3581:0:(upcall_cache.c:325:upcall_cache_get_entry()) acquire timeout 
exceeded for key 0

Of course these temporal coincidences look verrrrry suspicious. So far, 
I have no idea what kind of weird script might be running at these times 
causing all the trouble, still I'm already looking forward to next 
Sunday ;-)
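
To check the timing coincidence across log files, the messages can be tallied by time of day and key. The following is only a rough sketch: it assumes each log entry sits on a single line in the syslog format quoted above, and the names `PATTERN` and `summarize` are made up for this illustration.

```python
import re
from collections import Counter

# Matches entries such as
#   "Jul 27 06:47:17 kernel: LustreError: ... acquire timeout exceeded for key 0"
# The format is assumed from the lines quoted in this thread.
PATTERN = re.compile(
    r"(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}):\d{2}\s+"
    r".*acquire timeout exceeded for key (?P<key>\d+)"
)


def summarize(lines):
    """Count 'acquire timeout' messages per (HH:MM, key) pair."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("time"), int(match.group("key")))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python tally.py kern.log kern.log.1
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for (hhmm, key), n in sorted(summarize(handle).items()):
                print(f"{path}: {hhmm} key={key} count={n}")
```

If the same minute shows up for key 0 on several dates, that points towards a periodic job (cron, log rotation, backup) rather than a random failure.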

But it would be nice if somebody could explain these Lustre errors, and 
perhaps assure me that they are entirely harmless or cannot possibly 
have any influence on the stability of the system.

Thanks,
Thomas

Andreas Dilger

2008-Jul-30 10:44 UTC

[Lustre-discuss] LustreError: acquire timeout exceeded

On Jul 29, 2008  18:51 +0200, Thomas Roth wrote:
> kern.log.1:Jul 20 06:47:19 kernel: LustreError: 
> 27696:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout 
> exceeded for key 0
> kern.log.1:Jul 20 06:47:41 kernel: LustreError: 
> 27713:0:(upcall_cache.c:326:upcall_cache_get_entry()) acquire timeout 
> exceeded for key 0
This means the MDS was unable to find supplementary group information
for the root user.
> Of course these temporal coincidences look verrrrry suspicious. So far, 
> I have no idea what kind of weird script might be running at these times 
> causing all the trouble, still I'm already looking forward to next
> Sunday ;-)
It isn't clear whether this is cause or effect, though.  If you use any
kind of network user/group database (LDAP, NIS, etc.), then the timeout
likely means that the network had already failed at this time.
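
One way to see whether group lookups are slow on the server is to time a supplementary-group lookup for uid 0 (the "key 0" in the message) through the system's normal name service. This is a minimal sketch, not what the MDS upcall itself runs; the function name `time_group_lookup` is invented for the example, and it only measures the host's passwd/group resolution.

```python
import os
import pwd
import time


def time_group_lookup(uid=0):
    """Resolve a user's supplementary groups and report how long it took.

    Returns (elapsed_seconds, list_of_gids). Goes through the ordinary
    passwd/group lookup path of the host (files, LDAP, NIS, ... as
    configured), so a slow or unreachable directory shows up as a
    long elapsed time.
    """
    start = time.monotonic()
    user = pwd.getpwuid(uid)
    gids = os.getgrouplist(user.pw_name, user.pw_gid)
    elapsed = time.monotonic() - start
    return elapsed, gids


if __name__ == "__main__":
    elapsed, gids = time_group_lookup(0)
    print(f"uid 0: {len(gids)} groups resolved in {elapsed:.3f} s")
```

Running this from cron shortly before and during the suspicious time window would show whether lookups stall then, which would fit the network having failed first.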

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
