oscaruser@programmer.net
2006-May-17 18:06 UTC
[Xapian-discuss] How to update DB concurrently?
Folks,

I have many (~150) Perl spider processes running concurrently that gather data from targeted URLs. They use MySQL to maintain state, and locking is handled only when two spider processes happen to select the same URL to process. I want to use omindex to add the data to the index being built from the source HTML downloaded near the start of the spidering process. However, each omindex process creates a db_lock file before modifying the index data, so other processes are locked out. It would also be nice to be able to search the index while it is being updated and grows; it seems the db_lock does not prevent searching. What is a better way to do this than waiting for omindex to exit? One thought is to modify omindex into a server that accepts IPC connections over a named pipe or socket; the spiders would connect, send their data, and close the connection. Has this been done anywhere, so that I need not write it?

Thanks,
OSC

--
___________________________________________________
Play 100s of games for FREE! http://games.mail.com/
oscaruser@programmer.net
2006-May-17 19:19 UTC
[Xapian-discuss] How to update DB concurrently?
Seems xapian-tcpsrv is what was needed, after increasing the listen backlog to 256 in tcpserver.cc.

Thanks
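The tcpserver.cc edit itself isn't shown in the thread, but in generic socket terms it amounts to raising the accept backlog. A minimal sketch (plain Python sockets, not tcpserver.cc) of what the backlog controls:

```python
import socket

# The backlog passed to listen() bounds how many pending connections the
# kernel will queue for a busy server before refusing new ones. With ~150
# spiders connecting at once, a small default backlog can drop connections.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
srv.listen(256)             # backlog of 256, matching the tcpserver.cc change
port = srv.getsockname()[1]
srv.close()
print(port > 0)
```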
oscaruser@programmer.net
2006-May-18 05:53 UTC
[Xapian-discuss] How to update DB concurrently?
I set this up, but found that the 150 spiders were processing data at a much faster rate than the indexer was able to build the index. This is a serious performance bottleneck, since it does not scale. How can I raise the rate of the indexer to the level at which the spiders are processing the URLs?

Thanks

> ----- Original Message -----
> From: "Olly Betts" <olly@survex.com>
> To: oscaruser@programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Wed, 17 May 2006 20:17:16 +0100
>
> On Wed, May 17, 2006 at 10:08:06AM -0800, oscaruser@programmer.net wrote:
> > seems xapian-tcpsrv is what was needed, while increasing listen
> > backlog to 256 in tcpserver.cc
>
> The remote backend (which uses xapian-tcpsrv) only supports reading
> databases currently, though it looks like someone's going to commission
> me to implement a writable remote backend in the near future.
>
> But adding documents in batches is much more efficient - if you try
> to scale your current setup, you'll probably hit a limit to how fast
> you can add documents.
>
> I'd suggest that each spider should dump pages in a form suitable for
> feeding into scriptindex. In Perl you can just suck the whole page
> into $html and then:
>
> $html =~ s/\n/ /g;
>
> then create a dump file entry like so:
>
> print DUMPFILE_TMP <<END;
> url=$url
> html=$html
>
> END
>
> You can include any other meta information you want - title,
> content-type, modification time, sitename, etc. - in other fields.
> A suitable index script would be something like:
>
> url : field=url hash boolean=Q unique=Q
> html : unhtml index truncate=250 field=sample
>
> And then when you've dumped 100 or 1000 or so you can switch
> to a new dump file and feed the old one into scriptindex.
> The way I'd do that is to have a spool directory which dump files just
> get renamed into by the spiders, and an indexer process which does
> something like:
>
> chdir "spool" or die $!;
> while (1) {
>     my @files = glob "*.dump";
>     if (@files) {
>         system("scriptindex", $database, $indexscript, @files) == 0
>             or die "scriptindex failed: $?";
>         unlink @files;
>     } else {
>         sleep 60;
>     }
> }
>
> This "spool directory" style of design is both simple and suitably
> robust.
>
> Cheers,
> Olly
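For readers not working in Perl, the dump-entry format described above can be sketched in Python as well. This is only an illustration of the two-field format from the example index script (url, html); the field names come from that script:

```python
import io

def write_dump_entry(out, url, html):
    """Append one scriptindex dump entry: one field=value line per
    field, terminated by a blank line. Newlines inside the HTML are
    flattened to spaces so the value stays on a single line."""
    out.write("url=%s\n" % url)
    out.write("html=%s\n\n" % html.replace("\n", " "))

buf = io.StringIO()
write_dump_entry(buf, "http://example.com/", "<p>\nhello\n</p>")
print(buf.getvalue())
```

Each spider appends entries like this to a temporary file, then renames the finished file into the spool directory so the indexer never sees a half-written dump.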
oscaruser@programmer.net
2006-May-18 23:14 UTC
[Xapian-discuss] How to update DB concurrently?
Folks,

I switched to flint, set XAPIAN_FLUSH_THRESHOLD, and rolled the indexer into the spiders. Now they create 150 separate indexes. I am using omega.cgi to perform searches. How can I query all 150 DBs at the same time?

Thanks

> ----- Original Message -----
> From: "Olly Betts" <olly@survex.com>
> To: oscaruser@programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Thu, 18 May 2006 09:41:22 +0100
>
> On Wed, May 17, 2006 at 08:52:58PM -0800, oscaruser@programmer.net wrote:
> > How can I increase or improve the rate of the indexer to the level the
> > spiders are processing the URLs?
>
> Hmm, I'd imagine 150 spiders are probably netting you several hundred
> documents per second, maybe thousands.
>
> Some ideas:
>
> * Read http://www.xapian.org/docs/scalability.html if you haven't
>   already.
>
> * Make sure the indexer is running continuously and don't call flush()
>   explicitly.
>
> * Batch up updates by setting XAPIAN_FLUSH_THRESHOLD in the
>   environment (don't forget to export it!) It defaults to 10000 - if
>   you've plenty of RAM, you can raise this substantially. Gmane uses
>   100000 (100 thousand) currently.
>
> * Use the flint backend instead of quartz:
>   http://wiki.xapian.org/FlintBackend
>   Don't be put off by the warning - the current state is very stable
>   (sufficiently good that I'm contemplating forking off a copy as
>   the default backend for Xapian 1.0.)
>
> * Make sure the machine has plenty of RAM and fast disks.
>
> * Run several indexers into separate databases and merge these later
>   with xapian-compact (for flint) or quartzcompact (for quartz). The
>   indexing rate drops off gradually as database size grows, so the
>   fastest way to build a large database is to build a number of
>   databases and merge - gmane builds databases containing 1 million
>   documents each and then merges them together. I chose this threshold
>   after doing a bit of profiling, so it's a good starting value, but
>   you may be able to tune it further, and it'll depend on your hardware
>   too.
>
> * If you aren't trying to read from the databases while building
>   them, you could try enabling "dangerous mode" - for flint you
>   just need to uncomment the obvious #define in
>   backends/flint/flint_table.cc (search for DANGEROUS) and recompile.
>   "Dangerous" mode updates blocks in place rather than ensuring the
>   old version is preserved, so reading while writing won't work, and
>   (this is the "danger" bit) if the power fails or the system crashes
>   your database may not be in a consistent state. But it reduces the
>   amount of I/O and buys you a little speed. I use this mode to build
>   gmane's database.
>
> I also have plans for a number of improvements, which I'm working on
> in an on-going fashion. If you're in a hurry and have a budget for
> your project, then funding is always welcome and would enable me to
> devote more time to this work!
>
> Cheers,
> Olly
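A toy model (not Xapian itself) of why raising XAPIAN_FLUSH_THRESHOLD helps: one expensive, seek-heavy flush is paid per threshold-sized batch of documents instead of more often, so a 10x larger threshold means 10x fewer flushes for the same number of documents:

```python
def count_flushes(num_docs, flush_threshold):
    """Toy model of batched flushing: a flush happens each time the
    number of pending documents reaches the threshold, plus one final
    flush for any remainder at close."""
    flushes, pending = 0, 0
    for _ in range(num_docs):
        pending += 1
        if pending == flush_threshold:
            flushes += 1
            pending = 0
    if pending:
        flushes += 1  # final flush for the partial batch
    return flushes

# The default threshold vs. the value gmane uses, for a million documents:
print(count_flushes(1_000_000, 10_000))   # 100
print(count_flushes(1_000_000, 100_000))  # 10
```

The real cost per flush grows with database size, which is also why the advice above is to build several smaller databases and merge them rather than push one database ever larger.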
oscaruser@programmer.net
2006-May-19 00:35 UTC
[Xapian-discuss] How to update DB concurrently?
I just added the following lines to omega.cc (I'll do a better job coding it in a loop in C++):

for (int i = 0; i < 150; i++)
    db.add_database(Xapian::Database(db));

Thanks
oscaruser@programmer.net
2006-May-19 01:40 UTC
[Xapian-discuss] How to update DB concurrently?
I added the following:

delta:/home/oscar/xapian/omega-0.9.6# diff omega.cc ../orig/omega-0.9.6/omega.cc
33,34d32
< #include <iomanip>
< #include <sstream>
136,143d133
<
<     for (int index = 0; index < 150; index++) {
<         std::ostringstream s;
<         s << "/svr/hda1/omega/data/mydb" << std::setfill('0') << std::setw(4) << index
<           << "/default";
<         //cout << s.str() << endl;
<         db.add_database(Xapian::Database(s.str()));
<     }
delta:/home/oscar/xapian/omega-0.9.6#

Activating the URL "http://delta/cgi-bin/omega.cgi" in the browser properly shows the index being built, but searching does not return any results. The query goes through the templates and shows a result page with the URL "http://delta/cgi-bin/omega.cgi?P=test&DEFAULTOP=or&DB=default&FMT=query&xP=test.&xDB=default&xFILTERS=--O", but no results appear. How can I show the results of querying all the DBs?

Thanks
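The naming scheme in the patch above can be checked outside omega. A standalone sketch mirroring the ostringstream loop (same base path and zero-padding as in the diff):

```python
def db_paths(base="/svr/hda1/omega/data/mydb", count=150):
    """Mirror of the patch's loop: zero-pad the index to four digits,
    giving .../mydb0000/default through .../mydb0149/default."""
    return ["%s%04d/default" % (base, i) for i in range(count)]

paths = db_paths()
print(paths[0])    # /svr/hda1/omega/data/mydb0000/default
print(paths[149])  # /svr/hda1/omega/data/mydb0149/default
```

Listing the paths this way makes it easy to confirm they match the directories the 150 indexers actually create before wiring the loop into omega.cc.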
oscaruser@programmer.net
2006-May-19 20:20 UTC
[Xapian-discuss] How to update DB concurrently?
Olly,

Super, this worked perfectly. I want to change the appearance of the search results by adding my own data. To make this work, I want to generate an XML stream with just URLs as the output of omega.cgi. How can I do this?

Thanks

> ----- Original Message -----
> From: "Olly Betts" <olly@survex.com>
> To: oscaruser@programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Fri, 19 May 2006 04:17:53 +0100
>
> On Thu, May 18, 2006 at 04:39:36PM -0800, oscaruser@programmer.net wrote:
> > < for (int index = 0; index < 150; index++) {
> > <     std::ostringstream s;
> > <     s << "/svr/hda1/omega/data/mydb" << std::setfill('0') << std::setw(4) << index
> > <       << "/default";
> > <     //cout << s.str() << endl;
> > <     db.add_database(Xapian::Database(s.str()));
> > < }
>
> There's no context, so I can't see where you're patching that, but it
> looks plausible.
>
> However, I don't think you want to search 150 databases at once like
> this. The overhead from opening that many databases on every search is
> likely to be noticeable, and searching one big database will be more
> efficient.
>
> Instead use xapian-compact to merge them all together like so:
>
> xapian-compact -F -m /svr/hda1/omega/data/mydb*/default /path/to/output/db
>
> And then search the merged database '/path/to/output/db'.
>
> You almost certainly want to use -m if you can - it merges in multiple
> passes, which for 150 databases should be substantially quicker. It's
> probably worth using -F too, unless you want to update the merged
> database by opening it as a WritableDatabase.
>
> Cheers,
> Olly
oscaruser@programmer.net
2006-May-19 20:35 UTC
[Xapian-discuss] How to update DB concurrently?
Reading omega/docs/omegascript.txt

Thanks...
oscaruser@programmer.net
2006-May-19 21:37 UTC
[Xapian-discuss] How to update DB concurrently?
Folks,

Sorry for the excessive emails, but just in case someone else has the same issue: after some reading and looking around, I found that setting "FMT=xml" was what was needed, because it selects the xml template from the omega/templates subdirectory of the Omega source tarball.

Example:

http://delta/cgi-bin/omega.cgi?P=jacket&DEFAULTOP=and&DB=default&FMT=xml&xP=north.face.jacket.&xDB=default&xFILTERS=--A

Thanks