thr3ads.net - Lustre discuss - [Lustre-discuss] [robinhood-support] robinhood error messages [Nov 2010]

LEIBOVICI Thomas

2010-Nov-24 12:20 UTC

[Lustre-discuss] [robinhood-support] robinhood error messages

Hi Thomas,

We have already seen this, basically after the filesystem was blocked for a
while, or after an OSS had crashed.
If it is stuck for too long (default timeout is 1 hour), robinhood tries
to cancel its operation on the current directory and continues with the next
one.
Maybe it didn't recover successfully from this cancellation, and you
have been receiving those messages ever since.

To avoid this problem, you can increase the timeout to a very high 
value, to make sure it is never reached (e.g. xxx days).
In that case, robinhood will remain stuck as long as its current 
operation in Lustre is blocked,
and it will resume the current operation as soon as Lustre is back.

You can change this timeout by setting the "scan_op_timeout" parameter
in the "FS_Scan" section of the config file.

Alternatively, you can keep a reasonable timeout and make robinhood
exit when the filesystem is not responding
by setting "exit_on_timeout = TRUE" in the same section of the config.
You can then respawn the robinhood daemon once everything is fixed.
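
As a minimal sketch of the two options above, the "FS_Scan" section could look something like the following. The parameter names are the ones mentioned in this message; the block layout, the "#" comments, the duration suffixes and the 30d value are assumptions for illustration only (no specific value is recommended here), so check them against the config template shipped with your robinhood version.

```
FS_Scan
{
    # Option 1: a very high timeout, so that it is never reached in practice.
    # 30d is an arbitrary placeholder value, not a recommendation.
    scan_op_timeout = 30d;

    # Option 2 (alternative): keep a reasonable timeout and make robinhood
    # exit when it is reached, then respawn the daemon once Lustre is back.
    # scan_op_timeout = 1h;
    # exit_on_timeout = TRUE;
}
```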

Best regards,
Thomas LEIBOVICI
CEA/DAM
> A support request from lustre-discuss.
>
> ------------------------------------------------------------------------
>
> Subject:
> [Lustre-discuss] robinhood error messages
> From:
> Thomas Roth <t.roth at gsi.de>
> Date:
> Tue, 23 Nov 2010 20:20:33 +0100
> To:
> lustre-discuss at lists.lustre.org
>
>
> Hi all,
>
> we are running robinhood (v2.2.1) on our 1.8.4 cluster (basically to 
> find out where and who the big space consumers are - no purging).
>
> Robinhood sends me lots and lots of messages (~100/day) of the type
>
>  > ===== FS scan is blocked (/lustre) ====
>  > Date: 2010/11/23 20:05:22
>  > Program: robinhood (pid 4826)
>  > Host: lxb310
>  > Filesystem: /lustre
>  > A thread has been inactive for 3660 sec
>  > while scanning directory /lustre/....
>
> This seems to indicate some trouble accessing certain directories on the 
> node where robinhood is running. However, this is independent of the 
> node, and at the same time we neither see any issues / slowness/ 
> connectivity problems nor get any user complaints of the like.
>
> So I wonder whether anybody else is using robinhood and has seen similar 
> messages.
>
> Regards,
> Thomas
> _______________________________________________
> Lustre-discuss mailing list
> Lustre-discuss at lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>
> _______________________________________________
> robinhood-support mailing list
> robinhood-support at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/robinhood-support
>

Thomas Roth

2010-Nov-24 13:00 UTC

[Lustre-discuss] [robinhood-support] robinhood error messages

Thank you Thomas.
If these messages mean that robinhood just continues after the timeout, 
it would be nothing to worry about, but I will try to adapt the timeout 
anyhow.
Right now, however, it seems the scan is really stuck: for days,
rbh-report -i has been telling me about 612 TB in the filesystem, but lfs df
says we have 787 TB ;-)

Btw, whenever I restart the scan, e.g. after a reconfiguration such as
for the timeout, I get the logfile full of
 > ListMgr | DB query failed in ListMgr_Insert line 340...
and assorted messages, which seem to indicate that the new robinhood
scan tries to put something into the DB that is already there, and
stumbles on this. Or maybe that happens when several robins are running
simultaneously. I'm not sure if it is a problem for the scan; it is,
however, a problem for the free space on /var, or wherever I point the
log to ;-)

Regards,
Thomas


-- 
--------------------------------------------------------------------
Thomas Roth           IT-HPC-Linux
Location: SB3 1.262   Phone: +49-6159-71 1453


http://twitter.com/gsi_it

LEIBOVICI Thomas

2010-Nov-24 14:17 UTC

[Lustre-discuss] [robinhood-support] robinhood error messages

Thomas Roth wrote:
> Thank you Thomas.
> If these messages mean that robinhood just continues after the
> timeout, it would be nothing to worry about, but I will try to adapt
> the timeout anyhow.
> Right now, however, it seems the scan is really stuck: since days,
> rbh-report -i tells me about 612 TB in the filesystem, but lfs df says
> we have 787 TB ;-)

A couple of such messages would not be a big deal, but 100s/day over
several days is not normal... I suspect a problem with timeout handling in
robinhood that leads to such blocking. That's why I suggest you
avoid timeouts by increasing the timeout value.

> Btw, whenever I restart the scan, e.g. after a reconfiguration such as
> for the timeout, I get the logfile full of

Tip: to change such a scalar parameter, you do not need to fully
restart the daemon. "service robinhood reload" or "kill -HUP" on the
process is enough.

> > ListMgr | DB query failed in ListMgr_Insert line 340...
> and assorted messages, which seem to indicate that the new robinhood
> scan tries to put something into the DB that is already there, and
> stumbles on this. Or maybe that happens when several robins are
> running simultaneously.

Are you running several instances for scanning the same filesystem??

> I'm not sure if it is a problem for the scan, it is, however, a
> problem for the free space on /var, or wherever I point the log to ;-)

Thomas Roth

2010-Nov-24 16:58 UTC

[Lustre-discuss] [robinhood-support] robinhood error messages

On 24.11.2010 15:17, LEIBOVICI Thomas wrote:
> Thomas Roth wrote:
>  > > ListMgr | DB query failed in ListMgr_Insert line 340...
>  > and assorted messages, which seem to indicate that the new robinhood
>  > scan tries to put something into the DB that is already there, and
>  > stumbles on this. Or maybe that happens when several robins are
>  > running simultaneously.
> Are you running several instances for scanning the same filesystem??

Well, yes, tried that also. Actually I was under the impression that
this is a feature of Robinhood - of course, now that I am looking for
this in the documentation I can't find it.

But these errors from the DB definitely first appeared when I restarted
robinhood after some changes (location of log file, debug level,
...) in the config file. Since there was no change in the robinhood
version, I did not empty the database. After this restart, I immediately
got a lot of
 > 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3 performing database operation.
 > 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry '54051386:6D286C' for key 1

I suppose this is something that should not happen when one is feeding a 
database?

Cheers,
Thomas

LEIBOVICI Thomas

2010-Nov-25 09:07 UTC

[Lustre-discuss] [robinhood-support] robinhood error messages

Hello Thomas,

Sorry, I have only just seen the email you sent to the robinhood-support
mailing list; it was held pending admin validation.
About multiple robinhood instances, the documentation says that you can
split the features across different nodes:
basically, the database server can run on one machine, the FS scan on another
machine, disk resource monitoring and purging on another machine, etc.
But you must only run a single instance of each feature at a given time.

Thomas Roth wrote:
>> Is there a way to "partition" a file system for Robinhood? Tell an
>> instance to only scan certain directories? Because I think the issue is
>> not a really broken data base, but simply a later coming Robin scanning
>> files that were already done?

What exactly do you need? Do you want to speed up the scan by running
several robinhood instances,
or do you only want to scan certain directories?
- About speed, robinhood already performs scans in parallel with
multiple threads, each one scanning different directories.
So if you want more parallelism, increase the number of scan threads.
- If you need to scan only some parts of the namespace, you can
ignore directories by specifying "ignore" rules in the configuration
file (FS_Scan section).
E.g. ignore { path == "/lustre/xyz*" } if you know the path you want to
ignore, or a negation:
ignore { not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*" ) }
if you know the paths you want to scan.
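
Put together, the two suggestions above might look roughly like this in the config file. This is only a sketch: the "ignore" rule is the first example from this message, while "nb_threads_scan" as the name of the thread-count parameter, the value 8, the "#" comments and the block layout are assumptions to verify against the config template of your robinhood version.

```
FS_Scan
{
    # More parallelism within a single instance.
    # Parameter name and value are assumptions, not taken from this thread.
    nb_threads_scan = 8;

    # Do not scan this part of the namespace.
    ignore
    {
        path == "/lustre/xyz*"
    }
}
```

Either way, only one scanning instance runs against the filesystem; the parallelism comes from its threads.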
> I suppose this is something that should not happen when one is feeding
> a database?

Yes, these errors seem to be caused by concurrency between several
feeders. This is not sane, and the DB content may be inconsistent now.
So I recommend that you stop all your running instances, clear the DB
content (command "rbh-config empty_db"),
and then start only a single instance for scanning.

Best regards,
Thomas.
