LEIBOVICI Thomas
2010-Nov-24 12:20 UTC
[Lustre-discuss] [robinhood-support] robinhood error messages
Hi Thomas,

We have already seen this, typically after the filesystem was blocked for a while or after an OSS had crashed.
If a scan operation is stuck for too long (the default timeout is 1 hour), robinhood tries to cancel its operation on the current directory and continues with the next one.
Maybe it did not recover successfully from this cancellation, and you have been receiving those messages ever since.

To avoid this problem, you can increase the timeout to a very high value, to make sure it is never reached (e.g. xxx days).
In that case, robinhood will remain stuck as long as its current operation in Lustre is blocked, and it will resume that operation as soon as Lustre is back.

You can change this timeout by setting the "scan_op_timeout" parameter in the "FS_Scan" section of the config file.

Alternatively, you can keep a reasonable timeout and make robinhood exit when the filesystem is not responding, by setting "exit_on_timeout = TRUE" in the same section of the config.
You can then respawn the robinhood daemon once everything is fixed.

Best regards,
Thomas LEIBOVICI
CEA/DAM

> A support request from lustre-discuss.
>
> Subject: [Lustre-discuss] robinhood error messages
> From: Thomas Roth <t.roth at gsi.de>
> Date: Tue, 23 Nov 2010 20:20:33 +0100
> To: lustre-discuss at lists.lustre.org
>
> Hi all,
>
> we are running robinhood (v2.2.1) on our 1.8.4 cluster (basically to
> find out where and who the big space consumers are - no purging).
>
> Robinhood sends me lots and lots of messages (~100/day) of the type
>
> > ===== FS scan is blocked (/lustre) ====
> > Date: 2010/11/23 20:05:22
> > Program: robinhood (pid 4826)
> > Host: lxb310
> > Filesystem: /lustre
> > A thread has been inactive for 3660 sec
> > while scanning directory /lustre/....
>
> This seems to indicate some trouble accessing certain directories on the
> node where robinhood is running. However, this is independent of the
> node, and at the same time we neither see any issues / slowness /
> connectivity problems nor get any user complaints of the like.
>
> So I wonder whether anybody else is using robinhood and has seen similar
> messages.
>
> Regards,
> Thomas
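The two options described in the reply above can be sketched as a configuration fragment. This is only a sketch under stated assumptions: it assumes robinhood's block syntax with "name = value ;" parameters and duration suffixes such as "h" and "d". The names FS_Scan, scan_op_timeout and exit_on_timeout and the 1-hour default come from the message; the value 30d is an arbitrary placeholder, since the message only says "xxx days".

```
# Option A: make the timeout effectively unreachable, so robinhood waits
# for Lustre to come back instead of cancelling the operation.
# "30d" is an illustrative placeholder, not a value given in the thread.
FS_Scan
{
    scan_op_timeout = 30d ;
}

# Option B (use instead of A): keep a reasonable timeout and let the
# daemon exit when it is reached, so it can be respawned later.
# FS_Scan
# {
#     scan_op_timeout = 1h ;       # default according to the thread
#     exit_on_timeout = TRUE ;
# }
```

The two options are alternatives: with option A the scan blocks until the filesystem responds again, while option B relies on something outside robinhood to restart the daemon after it exits.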
Thomas Roth
2010-Nov-24 13:00 UTC
[Lustre-discuss] [robinhood-support] robinhood error messages
Thank you Thomas.

If these messages mean that robinhood just continues after the timeout, it would be nothing to worry about, but I will try to adapt the timeout anyhow.
Right now, however, it seems the scan is really stuck: for days now, rbh-report -i has been telling me about 612 TB in the filesystem, but lfs df says we have 787 TB ;-)

Btw, whenever I restart the scan, e.g. after a reconfiguration such as for the timeout, I get the logfile full of

> ListMgr | DB query failed in ListMgr_Insert line 340...

and assorted messages, which seem to indicate that the new robinhood scan tries to put something into the DB that is already there, and stumbles on this. Or maybe that happens when several robins are running simultaneously.

I'm not sure if it is a problem for the scan; it is, however, a problem for the free space on /var, or wherever I point the log to ;-)

Regards,
Thomas

-- 
Thomas Roth
IT-HPC-Linux
Location: SB3 1.262
Phone: +49-6159-71 1453
http://twitter.com/gsi_it
LEIBOVICI Thomas
2010-Nov-24 14:17 UTC
[Lustre-discuss] [robinhood-support] robinhood error messages
Thomas Roth wrote:
> Thank you Thomas.
> If these messages mean that robinhood just continues after the
> timeout, it would be nothing to worry about, but I will try to adapt
> the timeout anyhow.
> Right now, however, it seems the scan is really stuck: for days now,
> rbh-report -i has been telling me about 612 TB in the filesystem, but
> lfs df says we have 787 TB ;-)

A couple of such messages would not be a big deal, but hundreds per day over several days is not normal...
I suspect a problem in robinhood's timeout handling that leads to such blocking. That's why I suggest avoiding timeouts by increasing the timeout value.

> Btw, whenever I restart the scan, e.g. after a reconfiguration such as
> for the timeout, I get the logfile full of

Tip: to change such a scalar parameter, you do not need to fully restart the daemon. "service robinhood reload" or "kill -HUP" on the process is enough.

> > ListMgr | DB query failed in ListMgr_Insert line 340...
> and assorted messages, which seem to indicate that the new robinhood
> scan tries to put something into the DB that is already there, and
> stumbles on this. Or maybe that happens when several robins are
> running simultaneously.

Are you running several instances scanning the same filesystem?

> I'm not sure if it is a problem for the scan, it is, however, a
> problem for the free space on /var, or wherever I point the log to ;-)
>
> Regards,
> Thomas
Thomas Roth
2010-Nov-24 16:58 UTC
[Lustre-discuss] [robinhood-support] robinhood error messages
On 24.11.2010 15:17, LEIBOVICI Thomas wrote:
> Thomas Roth wrote:
> > > ListMgr | DB query failed in ListMgr_Insert line 340...
> > and assorted messages, which seem to indicate that the new robinhood
> > scan tries to put something into the DB that is already there, and
> > stumbles on this. Or maybe that happens when several robins are
> > running simultaneously.
> Are you running several instances for scanning the same filesystem??

Well, yes, I tried that also. Actually I was under the impression that this is a feature of Robinhood - of course, now that I am looking for this in the documentation, I can't find it.

But these errors from the DB definitely first arose when I restarted robinhood after some changes (location of log file, debug level, ...) in the config file. Since there was no change in the robinhood version, I did not empty the database. After this restart, I immediately got a lot of

> 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3 performing database operation.
> 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry '54051386:6D286C' for key 1

I suppose this is something that should not happen when one is feeding a database?

Cheers,
Thomas
LEIBOVICI Thomas
2010-Nov-25 09:07 UTC
[Lustre-discuss] [robinhood-support] robinhood error messages
Hello Thomas,

Sorry, I only just saw the email you sent to the robinhood-support mailing list; it was held waiting for admin validation.

About multiple robinhood instances: the documentation says that you can split the features across different nodes. Basically, the database server can run on one machine, the FS scan on another machine, disk resource monitoring and purging on another machine, etc. But you must run only a single instance of each feature at a given time.

Thomas Roth wrote:
>> Is there a way to "partition" a file system for Robinhood? Tell an
>> instance to only scan certain directories? Because I think the issue is
>> not a really broken data base, but simply a later coming Robin scanning
>> files that were already done?

What is your need exactly? Do you want to speed up the scan by running several robinhood instances, or do you only want to scan certain directories?

- About speed: robinhood already performs scans in parallel with multiple threads, each one scanning different directories. So if you want more parallelism, increase the number of scan threads.
- If your need is to scan only some parts of the namespace, you can ignore directories by specifying "ignore" rules in the configuration file (FS_Scan section). E.g.
      ignore { path == "/lustre/xyz*" }
  if you know the path you want to ignore, or a negation:
      ignore { not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*" ) }
  if you know the paths you want to scan.

>> > > ListMgr | DB query failed in ListMgr_Insert line 340...
>> > and assorted messages, which seem to indicate that the new robinhood
>> > scan tries to put something into the DB that is already there, and
>> > stumbles on this. Or maybe that happens when several robins are
>> > running simultaneously.
>> Are you running several instances for scanning the same filesystem??
>
> Well, yes, tried that also. Actually I was under the impression that
> this is a feature of Robinhood - of course, now that I am looking for
> this in the documentation I can't find it.
>
> But these errors from the DB definitely did arise first when I
> restarted robinhood anew after some changes (location of log file,
> debug level, ...) in the config file. But since there was no change in
> the robinhood version, I did not empty the database. After this
> restart, I immediately got a lot of
> > 2010/11/04 11:27:45 robinhood[1489/4]: EntryProc | Error 3
> performing database operation.
> > 2010/11/04 11:27:45 robinhood[1489/8]: ListMgr | DB query failed in
> ListMgr_Insert line 340: pk='54051386:6D286C', code=3: Duplicate entry
> '54051386:6D286C' for key 1
>
> I suppose this is something that should not happen when one is feeding
> a database?

Yes, these errors seem to be caused by concurrency between several feeders. This is not sane, and the db content may now be inconsistent.
So I recommend that you stop all your running instances, clear the db content (command "rbh-config empty_db"), and then start only a single instance for scanning.

Best regards,
Thomas.
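The scan-tuning advice in this reply can be laid out as a single FS_Scan fragment. This is a sketch, not a tested configuration: the "ignore" expressions are copied from the message, but the thread does not name the thread-count parameter, so "nb_threads_scan" is an assumption to check against the configuration template shipped with your robinhood version, and the value 8 is arbitrary.

```
FS_Scan
{
    # Assumed parameter name (not given in the thread); 8 is an
    # arbitrary illustrative value for "more scan threads".
    nb_threads_scan = 8 ;

    # Variant 1: skip a subtree you know you do not want to scan.
    ignore
    {
        path == "/lustre/xyz*"
    }

    # Variant 2 (use instead of variant 1): scan only the listed paths
    # by ignoring everything that does not match them.
    # ignore
    # {
    #     not ( path == "/lustre/dir1" or path == "/lustre/dir2/subdir*" )
    # }
}
```

Either variant keeps a single scanning instance, which is the point of the reply: parallelism comes from threads inside one daemon, not from several daemons feeding the same database.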