athlon athlonf
2008-Jan-15 21:04 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
Currently I have the following indexscript:

    pid : unique=Q boolean=Q field=pid
    postdate : field=startdate
    author_name: unhtml boolean=XAUTHORNAME field=author
    author_id: boolean=XAUTHORID field=authorid
    url : field=url
    sample : weight=1 index field=sample

How can I create the same indexing using PHP? With this, I can get a searchable index, but I have no idea how to set the fields so that I can actually GET something back (with the code below, I just get a bunch of pids back).

    $doc = new XapianDocument();
    $doc->set_data($postrow['pid']);
    $doc->add_value(1,date('Ymd',$postrow['postdate']));
    $doc->add_value(2,$postrow['author_id']);
    $doc->add_term("XAUTHORID".$postrow["author_id"]);
    $doc->add_term("XAUTHORNAME".$postrow["forum_id"]);
    $indexer->set_document($doc);
    $indexer->index_text($postrow['post']); //post == sample
    // Add the document to the database.
    $database->add_document($doc);
James Aylett
2008-Jan-15 21:47 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
On Tue, Jan 15, 2008 at 01:04:10PM -0800, athlon athlonf wrote:
> How can I create the same indexing using PHP? With this, I can get
> an searchable index, but I have no idea how to set the fields, so
> that I can actually GET something back (with the underneath code, I
> just get a bunch of pid's back).
>
> $doc = new XapianDocument();
> $doc->set_data($postrow['pid']);

You're setting the document data field to just the pid. If you want it to be compatible with omega, you'll need to put the right data in there, as documented in ``Document data construction'' in omega/docs/overview.html.

Alternatively, if you're using your own searcher, you could pull data out of wherever it came from to display it and cut down on data duplication - which makes more sense depends on a huge number of factors. The former is the easier to migrate gradually from the omega tools to your own system.

J

-- 
James Aylett  xapian.org
james@tartarus.org  uncertaintydivision.org
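One way to read the alternative James describes (keep only the pid in the document data and fetch the display fields from the original database at search time) is sketched below. This is an assumption-laden illustration, not something taken from the thread: the table name `posts`, its column names, and both helper functions are hypothetical. The pids are assumed to have been collected from each match's document data.

```php
<?php
// Hypothetical helper: build one parameterised query for all pids in a result page.
// The table and column names are assumptions made for illustration only.
function build_lookup_query(array $pids): array
{
    // Normalise to unique integers while keeping first-seen (rank) order.
    $pids = array_values(array_unique(array_map('intval', $pids)));
    if ($pids === []) {
        return [null, []]; // nothing to look up
    }
    $placeholders = implode(', ', array_fill(0, count($pids), '?'));
    $sql = "SELECT pid, author_name, url, post FROM posts WHERE pid IN ($placeholders)";
    return [$sql, $pids];
}

// Hypothetical helper: the database returns rows in arbitrary order, so put
// them back into the order in which the search engine ranked the pids.
function order_by_rank(array $rows, array $pids): array
{
    $byPid = [];
    foreach ($rows as $row) {
        $byPid[(int) $row['pid']] = $row;
    }
    $ordered = [];
    foreach ($pids as $pid) {
        if (isset($byPid[(int) $pid])) {
            $ordered[] = $byPid[(int) $pid];
        }
    }
    return $ordered;
}
```

The SQL string and parameter list can be handed to whatever database layer the site already uses. The trade-off is an extra query per result page in exchange for not duplicating post text inside the Xapian database.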
athlon athlonf
2008-Jan-15 22:33 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
Hi James, thanks for the answer. Indeed, if I do something like this:

    $data = 'author='.$postrow['starter_name']."\n";
    $data .= 'authorid='.$postrow['starter_id']."\n";
    $data .= 'forum_id='.$postrow['forum_id']."\n";
    $doc->set_data($data);

it will be correctly inserted.

However, what do you mean by "you could pull data out of wherever it came from to display it"? That I could parse the result by looking back at the data source (a database)? I guess there are pros and cons to that.
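The field=value document data built by hand in the message above can be wrapped in a small helper. This is a minimal sketch, assuming one field per line as in that snippet; `build_document_data` and `parse_document_data` are hypothetical names, and the sample row is made up.

```php
<?php
// Hypothetical helper: turn an associative array into "name=value" lines.
function build_document_data(array $fields): string
{
    $lines = [];
    foreach ($fields as $name => $value) {
        // One field per line, so line breaks inside a value become spaces.
        $clean = str_replace(["\r\n", "\r", "\n"], ' ', (string) $value);
        $lines[] = $name . '=' . $clean;
    }
    return implode("\n", $lines);
}

// Hypothetical helper: read the same format back, e.g. in a custom searcher.
function parse_document_data(string $data): array
{
    $fields = [];
    foreach (explode("\n", $data) as $line) {
        $pos = strpos($line, '=');
        if ($pos === false) {
            continue; // skip lines that are not name=value
        }
        $fields[substr($line, 0, $pos)] = substr($line, $pos + 1);
    }
    return $fields;
}

// Made-up sample row, using the keys from the message above.
$postrow = ['starter_name' => 'alice', 'starter_id' => 42, 'forum_id' => 7];
$data = build_document_data([
    'author'   => $postrow['starter_name'],
    'authorid' => $postrow['starter_id'],
    'forum_id' => $postrow['forum_id'],
]);
// With the Xapian bindings loaded, this would then be: $doc->set_data($data);
```

Replacing line breaks inside values keeps a multi-line post body from being split into bogus fields when the data is read back.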
athlon athlonf
2008-Jan-16 09:18 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
I've managed to correctly use the PHP bindings to index my database and I'm really amazed by the speed. Apparently, using a Perl script like dbi2omega to generate the input file and then using scriptindex to parse and index it is much slower.

Indexing with PHP took 9 hours to complete on my development machine (AMD64 3800 with 2GB of RAM and a 5-disk RAID5) for 3 million documents, with less load. Indexing with dbi2omega->scriptindex has been running for more than 24 hours and is not even at 40% (I've made several intermediate files), at load 5. And this is on a dual AMD Opteron 246 with RAID1 and 3GB of RAM.
athlon athlonf
2008-Jan-16 17:58 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
>Load 5 suggests something's wrong, because dbi2omega and scriptindex
>are both linear processes. Are you running several instances in
>parallel in some way?

It usually starts off fairly low, but then after half an hour or so, it will reach load 5 constantly. I'm only doing one scriptindex at a time, but it's a fairly complicated indexing script, I guess. And of course, the input files are huge, mostly beyond 2GB.

>I believe that right now, none of the supplied Xapian indexing scripts
>or binaries will go significantly above a load of 1, unless you have
>other issues or something else happening on the machine.

The machine is itself also a webserver, but the normal average load is below 0.25. I can imagine it running into memory problems when scriptindex tries to parse 2GB worth of data each time.

This is the indexscript I was using:

    tid : boolean=Q field=id
    pid : unique=Q boolean=Q field=pid
    topic_title : unhtml weight=10 field=title index=Z index
    forum_id : indexnopos=XFORUMID field=forum_id
    postdate : field=startdate
    postdateh : field=startdateh date=yyyymmdd value=2
    topicStart : field=topicStart
    topicStarth : field=topicStarth date=yyyymmdd value=3
    author_name: unhtml boolean=XAUTHORNAME field=author
    author_id: boolean=XAUTHORID field=authorid
    state : field=state
    url : field=url
    sample : weight=1 index field=sample
athlon athlonf
2008-Jan-16 19:47 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
Actually I'm already graphing my servers, but haven't really investigated the load due to scriptindex, as I was mostly developing at the moment. I've just taken a look and it seems not to be a memory problem. During the load spikes, memory use went from 20MB to 1GB, but there was still 2GB left (albeit cached). I'd suspect it to be I/O load.

Hmm... the index=Z... no idea really. A copy-and-paste error, I think. I remember that when I delved into my first Xapian database, I saw that the "content" which will be searched had the prefix Z, so...

----- Original Message ----
From: James Aylett <james-xapian@tartarus.org>
To: xapian-discuss@lists.xapian.org
Sent: Wednesday, January 16, 2008 7:19:25 PM
Subject: Re: [Xapian-discuss] PHP indexing, what's the PHP method for indexscript

On Wed, Jan 16, 2008 at 09:58:11AM -0800, athlon athlonf wrote:
> >Load 5 suggests something's wrong, because dbi2omega and scriptindex
> >are both linear processes. Are you running several instances in
> >parallel in some way?
>
> it usually starts off fairly low, but then after half an hour of so,
> it will reach load 5 constantly.

I'd guess at a memory problem then, with processes blocking on VM I/O. But that's really a guess. Load is an indication of the length of the run queue, ie the number of processes trying to get access to the processor at any one point in time; so there's something going on beyond scriptindex itself there.

If you want to figure out what's going on, I'd recommend pulling snmp data out of the system and graphing it (probably using something like cacti) -- you'll see things like memory usage over time that way, and it'll be obvious if (for instance) the load shoots up when it hits a certain amount of free memory or something.

> I'm only doing one scriptindex at a time, but it's fairly
> complicated indexingscript I guess. And of course, the inputfiles
> are huge, mostly beyond 2GB.

scriptindex only pulls a line at a time from the input files, so that won't matter per se. You may be running into issues with Xapian not flushing to disk enough - that again

> topic_title : unhtml weight=10 field=title index=Z index

What is the index=Z intended to do?

J
athlon athlonf
2008-Jan-17 20:26 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
The explanation sounds plausible. As for the indexer, no, it does not use replace_document (I didn't know about that function, actually...).

This is the relevant part of the PHP script:

    $doc = new XapianDocument();
    $doc->set_data($data);
    $doc->add_value(1,$postrow['forum_id']);
    $doc->add_value(2,date('Ymd',$postrow['postdate']));
    $doc->add_value(3,$postrow['author_id']);

    // Add the boolean terms
    $doc->add_term("XFORUMID".$postrow["forum_id"]);
    $doc->add_term("XAUTHORID".$postrow["author_id"]);
    $doc->add_term("XAUTHORNAME".$postrow["forum_id"]);

    // Assign the document to the TermGenerator, which will generate the terms used for searching
    $indexer->set_document($doc);
    $indexer->index_text($postrow['post']);
    $indexer->index_text($postrow['title'], 2);

    // Add the document to the database.
    $database->add_document($doc);
    $postrow = null;
    $data = null;
    $doc = null;

So... I should probably use replace_document if I "update" existing documents?

----- Original Message ----
From: Olly Betts <olly@survex.com>
To: athlon athlonf <athlonkmf@yahoo.com>
Cc: xapian-discuss@lists.xapian.org
Sent: Thursday, January 17, 2008 3:14:15 AM
Subject: Re: [Xapian-discuss] PHP indexing, what's the PHP method for indexscript

On Wed, Jan 16, 2008 at 09:58:11AM -0800, athlon athlonf wrote:
> >Load 5 suggests something's wrong, because dbi2omega and scriptindex
> >are both linear processes. Are you running several instances in
> >parallel in some way?
>
> it usually starts off fairly low, but then after half an hour of so,
> it will reach load 5 constantly.

As James says, the scriptindex process itself shouldn't raise the load by more than 1 (since it's essentially a single process, plus one /bin/cat child process, which will always be blocked on read except very briefly when the database is opened or closed).

I suspect what is happening here is that the scriptindex process is causing the machine to swap so that webserver requests take a lot longer and so start to overlap. Hence 4 of the load is actually due to the webserver (although caused by scriptindex). I can't think of another plausible explanation anyway.

> tid : boolean=Q field=id
> pid : unique=Q boolean=Q field=pid

It doesn't seem to make a lot of sense to have two fields mapping to "Q" like this...

FWIW, I think this may explain why your PHP script is so much faster - "unique" is quite a slow operation (even if no duplicate documents exist, just checking for them significantly slows indexing). Does your PHP indexer contain code like this:

    $db->replace_document($qterm, $doc);

If not, does it handle enforcing unique documents another way? If it doesn't, then you aren't comparing like with like.

If this isn't the explanation, it would be interesting to work out why there's such a difference.

Cheers,
    Olly
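The boolean= and value= actions in the index script correspond roughly to the add_term and add_value calls in the PHP script. The sketch below collects them in one place so the two can be compared side by side. It rests on several assumptions: both function names are hypothetical, the `author_name` key is assumed (the PHP script in the thread passes forum_id for XAUTHORNAME, which looks like a slip), `gmdate` is used instead of `date` only to keep the output independent of the server timezone, and the exact terms scriptindex generates may differ in detail.

```php
<?php
// Hypothetical helper: prefixed boolean terms matching the index script's
// unique=Q / boolean=... lines. The author_name key is an assumption.
function boolean_terms(array $postrow): array
{
    return [
        'Q' . $postrow['pid'],
        'XFORUMID' . $postrow['forum_id'],
        'XAUTHORID' . $postrow['author_id'],
        'XAUTHORNAME' . $postrow['author_name'],
    ];
}

// Hypothetical helper: value slots as used in the thread's PHP script
// (1 = forum id, 2 = post date as yyyymmdd, 3 = author id).
function sort_values(array $postrow): array
{
    return [
        1 => (string) $postrow['forum_id'],
        2 => gmdate('Ymd', (int) $postrow['postdate']),
        3 => (string) $postrow['author_id'],
    ];
}

// With the Xapian bindings loaded, the results would be applied like this:
//   foreach (boolean_terms($postrow) as $term) { $doc->add_term($term); }
//   foreach (sort_values($postrow) as $slot => $value) { $doc->add_value($slot, $value); }
```

Note that the 'Q' term is included here even though the thread's PHP script never adds it; without it there is nothing for a later uniqueness check to match against.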
Olly Betts
2008-Jan-24 02:03 UTC
[Xapian-discuss] PHP indexing, what's the PHP method for indexscript
On Thu, Jan 17, 2008 at 12:25:57PM -0800, athlon athlonf wrote:
> So...I should probably use replace_document if I "update" existing documents?

Yes, although you can get away without doing so when reindexing from scratch, assuming there are no duplicates in an indexing run.

Someone recently benchmarked (the discussion was on this mailing list), and there's a significant overhead to checking the uid term. It's something I'm intending to look into, but I don't have a lot of spare time at the moment...

Cheers,
    Olly
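A minimal sketch of the replace_document approach discussed above, assuming the pid is the unique key and 'Q' is the prefix, as in the index script. The function names `unique_term` and `store_post` are hypothetical. `$database` is assumed to be a XapianWritableDatabase and `$doc` a XapianDocument that has already been filled in. As far as I know, replace_document does not add the unique term to the document itself, so the sketch adds it first.

```php
<?php
// Hypothetical helper: the unique ID term, prefix Q plus the post id.
function unique_term($pid): string
{
    return 'Q' . $pid;
}

// Hypothetical helper: add or update one post.
// replace_document() replaces any document already indexed by the term,
// and adds the document as new if no such document exists.
function store_post($database, $doc, $pid): void
{
    $term = unique_term($pid);
    $doc->add_term($term); // the term must be on the document to be found next time
    $database->replace_document($term, $doc);
}

// Usage, with the objects from the thread's script:
//   store_post($database, $doc, $postrow['pid']);
```

When rebuilding an index from scratch with no duplicate pids in the input, plain add_document avoids the lookup overhead Olly mentions. Adding the Q term anyway keeps later incremental updates possible.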