oscaruser@programmer.net
2006-May-17 18:06 UTC
[Xapian-discuss] How to update DB concurrently?
Folks,

I have many (~150) Perl spider processes running concurrently that gather data from targeted URLs. They use MySQL to maintain state, and locking is handled only when two spider processes happen to select the same URL to process. I want to use omindex to add the data to the index being built from the source HTML downloaded near the start of the spidering process. However, each omindex process creates a db_lock file before modifying the index data, so other processes are locked out. It would also be nice to be able to search the index while it is being updated and grows; it seems the db_lock does not prevent searching. What is a better way to do this than waiting for omindex to exit? One thought is to modify omindex into a server that accepts IPC connections over a named pipe or socket; the spiders would connect, send their data, and close the connection. Has this been done anywhere, so that I need not write it?

Thanks,
OSC

--
___________________________________________________
Play 100s of games for FREE! http://games.mail.com/
oscaruser@programmer.net
2006-May-17 19:19 UTC
[Xapian-discuss] How to update DB concurrently?
Seems xapian-tcpsrv is what was needed, after increasing the listen backlog to 256 in tcpserver.cc.

Thanks
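The tcpserver.cc edit itself isn't shown in the thread, but in generic socket terms it amounts to raising the accept backlog. A minimal sketch (plain Python sockets, not tcpserver.cc) of what the backlog controls:

```python
import socket

# The backlog passed to listen() bounds how many pending connections the
# kernel will queue for a busy server before refusing new ones. With ~150
# spiders connecting at once, a small default backlog can drop connections.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))  # port 0 = let the OS pick a free port
srv.listen(256)             # backlog of 256, matching the tcpserver.cc change
port = srv.getsockname()[1]
srv.close()
print(port > 0)
```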
oscaruser@programmer.net
2006-May-18 05:53 UTC
[Xapian-discuss] How to update DB concurrently?
I set this up, but found that the 150 spiders were processing data at a much faster rate than the indexer was able to build the index. This is a serious performance bottleneck, since it does not scale. How can I raise the rate of the indexer to the level at which the spiders are processing the URLs?

Thanks

> ----- Original Message -----
> From: "Olly Betts" <olly@survex.com>
> To: oscaruser@programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Wed, 17 May 2006 20:17:16 +0100
>
> On Wed, May 17, 2006 at 10:08:06AM -0800, oscaruser@programmer.net wrote:
> > seems xapian-tcpsrv is what was needed, while increasing listen
> > backlog to 256 in tcpserver.cc
>
> The remote backend (which uses xapian-tcpsrv) only supports reading
> databases currently, though it looks like someone's going to commission
> me to implement a writable remote backend in the near future.
>
> But adding documents in batches is much more efficient - if you try
> to scale your current setup, you'll probably hit a limit to how fast
> you can add documents.
>
> I'd suggest that each spider should dump pages in a form suitable for
> feeding into scriptindex. In Perl you can just suck the whole page
> into $html and then:
>
> $html =~ s/\n/ /g;
>
> then create a dump file entry like so:
>
> print DUMPFILE_TMP <<END;
> url=$url
> html=$html
>
> END
>
> You can include any other meta information you want - title,
> content-type, modification time, sitename, etc. - in other fields.
> A suitable index script would be something like:
>
> url : field=url hash boolean=Q unique=Q
> html : unhtml index truncate=250 field=sample
>
> And then when you've dumped 100 or 1000 or so you can switch
> to a new dump file and feed the old one into scriptindex.
> The way I'd do that is to have a spool directory which dump files just
> get renamed into by the spiders, and an indexer process which does
> something like:
>
> chdir "spool" or die $!;
> while (1) {
>     my @files = glob "*.dump";
>     if (@files) {
>         system("scriptindex", $database, $indexscript, @files) == 0
>             or die "scriptindex failed: $?";
>         unlink @files;
>     } else {
>         sleep 60;
>     }
> }
>
> This "spool directory" style of design is both simple and suitably
> robust.
>
> Cheers,
> Olly
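For readers not working in Perl, the dump-entry format described above can be sketched in Python as well. This is only an illustration of the two-field format from the example index script (url, html); the field names come from that script:

```python
import io

def write_dump_entry(out, url, html):
    """Append one scriptindex dump entry: one field=value line per
    field, terminated by a blank line. Newlines inside the HTML are
    flattened to spaces so the value stays on a single line."""
    out.write("url=%s\n" % url)
    out.write("html=%s\n\n" % html.replace("\n", " "))

buf = io.StringIO()
write_dump_entry(buf, "http://example.com/", "<p>\nhello\n</p>")
print(buf.getvalue())
```

Each spider appends entries like this to a temporary file, then renames the finished file into the spool directory so the indexer never sees a half-written dump.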
oscaruser@programmer.net
2006-May-18 23:14 UTC
[Xapian-discuss] How to update DB concurrently?
Folks,

I switched to flint, set XAPIAN_FLUSH_THRESHOLD, and rolled the indexer into the spiders. Now they create 150 separate indexes. I am using omega.cgi to perform searches. How can I query all 150 DBs at the same time?

Thanks

> ----- Original Message -----
> From: "Olly Betts" <olly@survex.com>
> To: oscaruser@programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Thu, 18 May 2006 09:41:22 +0100
>
> On Wed, May 17, 2006 at 08:52:58PM -0800, oscaruser@programmer.net wrote:
> > How can I increase or improve the rate of the indexer to the level the
> > spiders are processing the URLs?
>
> Hmm, I'd imagine 150 spiders are probably netting you several hundred
> documents per second, maybe thousands.
>
> Some ideas:
>
> * Read http://www.xapian.org/docs/scalability.html if you haven't
>   already.
>
> * Make sure the indexer is running continuously and don't call flush()
>   explicitly.
>
> * Batch up updates by setting XAPIAN_FLUSH_THRESHOLD in the
>   environment (don't forget to export it!) It defaults to 10000 - if
>   you've plenty of RAM, you can raise this substantially. Gmane uses
>   100000 (100 thousand) currently.
>
> * Use the flint backend instead of quartz:
>   http://wiki.xapian.org/FlintBackend
>   Don't be put off by the warning - the current state is very stable
>   (sufficiently good that I'm contemplating forking off a copy as
>   the default backend for Xapian 1.0.)
>
> * Make sure the machine has plenty of RAM and fast disks.
>
> * Run several indexers into separate databases and merge these later
>   with xapian-compact (for flint) or quartzcompact (for quartz). The
>   indexing rate drops off gradually as database size grows, so the
>   fastest way to build a large database is to build a number of
>   databases and merge - gmane builds databases containing 1 million
>   documents each and then merges them together. I chose this threshold
>   after doing a bit of profiling, so it's a good starting value, but
>   you may be able to tune it further, and it'll depend on your hardware
>   too.
>
> * If you aren't trying to read from the databases while building
>   them, you could try enabling "dangerous mode" - for flint you
>   just need to uncomment the obvious #define in
>   backends/flint/flint_table.cc (search for DANGEROUS) and recompile.
>   "Dangerous" mode updates blocks in place rather than ensuring the
>   old version is preserved, so reading while writing won't work, and
>   (this is the "danger" bit) if the power fails or the system crashes
>   your database may not be in a consistent state. But it reduces the
>   amount of I/O and buys you a little speed. I use this mode to build
>   gmane's database.
>
> I also have plans for a number of improvements, which I'm working on
> in an on-going fashion. If you're in a hurry and have a budget for
> your project, then funding is always welcome and would enable me to
> devote more time to this work!
>
> Cheers,
> Olly
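A toy model (not Xapian itself) of why raising XAPIAN_FLUSH_THRESHOLD helps: one expensive, seek-heavy flush is paid per threshold-sized batch of documents instead of more often, so a 10x larger threshold means 10x fewer flushes for the same number of documents:

```python
def count_flushes(num_docs, flush_threshold):
    """Toy model of batched flushing: a flush happens each time the
    number of pending documents reaches the threshold, plus one final
    flush for any remainder at close."""
    flushes, pending = 0, 0
    for _ in range(num_docs):
        pending += 1
        if pending == flush_threshold:
            flushes += 1
            pending = 0
    if pending:
        flushes += 1  # final flush for the partial batch
    return flushes

# The default threshold vs. the value gmane uses, for a million documents:
print(count_flushes(1_000_000, 10_000))   # 100
print(count_flushes(1_000_000, 100_000))  # 10
```

The real cost per flush grows with database size, which is also why the advice above is to build several smaller databases and merge them rather than push one database ever larger.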
oscaruser@programmer.net
2006-May-19 00:35 UTC
[Xapian-discuss] How to update DB concurrently?
I just added the following lines to omega.cc (I'll do a better job coding it in a loop in C++):

for (int i = 0; i < 150; i++)
    db.add_database(Xapian::Database(db));

Thanks
oscaruser@programmer.net
2006-May-19 01:40 UTC
[Xapian-discuss] How to update DB concurrently?
I added the following:

delta:/home/oscar/xapian/omega-0.9.6# diff omega.cc ../orig/omega-0.9.6/omega.cc
33,34d32
< #include <iomanip>
< #include <sstream>
136,143d133
<
<     for (int index = 0; index < 150; index++) {
<         std::ostringstream s;
<         s << "/svr/hda1/omega/data/mydb" << std::setfill('0') << std::setw(4) << index
<           << "/default";
<         //cout << s.str() << endl;
<         db.add_database(Xapian::Database(s.str()));
<     }
delta:/home/oscar/xapian/omega-0.9.6#

Activating the URL "http://delta/cgi-bin/omega.cgi" in the browser properly shows the index being built, but searching does not return any results. The query goes through the templates and shows a result page with the URL "http://delta/cgi-bin/omega.cgi?P=test&DEFAULTOP=or&DB=default&FMT=query&xP=test.&xDB=default&xFILTERS=--O", but no results appear. How can I show the results of querying all the DBs?

Thanks
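The naming scheme in the patch above can be checked outside omega. A standalone sketch mirroring the ostringstream loop (same base path and zero-padding as in the diff):

```python
def db_paths(base="/svr/hda1/omega/data/mydb", count=150):
    """Mirror of the patch's loop: zero-pad the index to four digits,
    giving .../mydb0000/default through .../mydb0149/default."""
    return ["%s%04d/default" % (base, i) for i in range(count)]

paths = db_paths()
print(paths[0])    # /svr/hda1/omega/data/mydb0000/default
print(paths[149])  # /svr/hda1/omega/data/mydb0149/default
```

Listing the paths this way makes it easy to confirm they match the directories the 150 indexers actually create before wiring the loop into omega.cc.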
oscaruser@programmer.net
2006-May-19 20:20 UTC
[Xapian-discuss] How to update DB concurrently?
Olly,

Super, this worked perfectly. I want to change the appearance of the search results by adding my own data. To make this work, I want to generate an XML stream with just URLs as the output of omega.cgi. How can I do this?

Thanks

> ----- Original Message -----
> From: "Olly Betts" <olly@survex.com>
> To: oscaruser@programmer.net
> Subject: Re: [Xapian-discuss] How to update DB concurrently?
> Date: Fri, 19 May 2006 04:17:53 +0100
>
> On Thu, May 18, 2006 at 04:39:36PM -0800, oscaruser@programmer.net wrote:
> > < for (int index = 0; index < 150; index++) {
> > <     std::ostringstream s;
> > <     s << "/svr/hda1/omega/data/mydb" << std::setfill('0') << std::setw(4) << index
> > <       << "/default";
> > <     //cout << s.str() << endl;
> > <     db.add_database(Xapian::Database(s.str()));
> > < }
>
> There's no context, so I can't see where you're patching that, but it
> looks plausible.
>
> However, I don't think you want to search 150 databases at once like
> this. The overhead from opening that many databases on every search is
> likely to be noticeable, and searching one big database will be more
> efficient.
>
> Instead use xapian-compact to merge them all together like so:
>
> xapian-compact -F -m /svr/hda1/omega/data/mydb*/default /path/to/output/db
>
> And then search the merged database '/path/to/output/db'.
>
> You almost certainly want to use -m if you can - it merges in multiple
> passes, which for 150 databases should be substantially quicker. It's
> probably worth using -F too, unless you want to update the merged
> database by opening it as a WritableDatabase.
>
> Cheers,
> Olly
oscaruser@programmer.net
2006-May-19 20:35 UTC
[Xapian-discuss] How to update DB concurrently?
Reading omega/docs/omegascript.txt

Thanks...
oscaruser@programmer.net
2006-May-19 21:37 UTC
[Xapian-discuss] How to update DB concurrently?
Folks,

Sorry for the excessive emails, but just in case someone else has the same issue: after some reading and looking around, I found that setting "FMT=xml" was what was needed, because it selects the xml template from the omega/templates subdirectory of the Omega source tarball.

Example:

http://delta/cgi-bin/omega.cgi?P=jacket&DEFAULTOP=and&DB=default&FMT=xml&xP=north.face.jacket.&xDB=default&xFILTERS=--A

Thanks