thr3ads.net - Xapian discuss - [Xapian-discuss] PHP indexing, what's the PHP method for indexscript [Jan 2008]

athlon athlonf

2008-Jan-15 21:04 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

Currently I have the following indexscript:

pid : unique=Q boolean=Q field=pid
postdate : field=startdate
author_name: unhtml boolean=XAUTHORNAME field=author
author_id: boolean=XAUTHORID field=authorid
url :  field=url
sample : weight=1 index field=sample



How can I create the same indexing using PHP?
With this, I can get a searchable index, but I have no idea how to set the
fields so that I can actually GET something back (with the code below, I
just get a bunch of pids back).

$doc = new XapianDocument();
$doc->set_data($postrow['pid']);
$doc->add_value(1, date('Ymd', $postrow['postdate']));
$doc->add_value(2, $postrow['author_id']);

// Boolean terms.
$doc->add_term("XAUTHORID".$postrow["author_id"]);
// NOTE: this uses forum_id, although the index script maps author_name to XAUTHORNAME.
$doc->add_term("XAUTHORNAME".$postrow["forum_id"]);

$indexer->set_document($doc);
$indexer->index_text($postrow['post']);          // post == sample

// Add the document to the database.
$database->add_document($doc);



     

James Aylett

2008-Jan-15 21:47 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

On Tue, Jan 15, 2008 at 01:04:10PM -0800, athlon athlonf wrote:
> How can I create the same indexing using PHP?  With this, I can get
> a searchable index, but I have no idea how to set the fields, so
> that I can actually GET something back (with the code below, I
> just get a bunch of pids back).
> 
> $doc = new XapianDocument();
> $doc->set_data($postrow['pid']);

You're setting the document data field to just the pid. If you want it
to be compatible with omega, you'll need to put the right data in
there, as documented in "Document data construction" in
omega/docs/overview.html. Alternatively, if you're using your own
searcher, you could pull data out of wherever it came from to display
it and cut down on data duplication. Which makes more sense depends
on a huge number of factors. The former makes it easier to migrate
gradually from the omega tools to your own system.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org

athlon athlonf

2008-Jan-15 22:33 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

Hi James,

thanks for the answer.
Indeed, if I do something like this:

    $data  = 'author='.$postrow['starter_name']."\n";
    $data .= 'authorid='.$postrow['starter_id']."\n";
    $data .= 'forum_id='.$postrow['forum_id']."\n";
    $doc->set_data($data);

it will be correctly inserted.
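
The same pattern can be wrapped in a small helper. This is a minimal sketch, assuming the one name=value pair per line layout used above; build_document_data() is a hypothetical name, not part of the Xapian bindings, and the sample $postrow values are made up.

```php
<?php
// Sketch only: build_document_data() is a hypothetical helper, not a Xapian call.
// It turns an array of field names and values into "name=value" lines.
function build_document_data(array $fields): string
{
    $data = '';
    foreach ($fields as $name => $value) {
        // Each field must stay on one line, so flatten embedded line breaks.
        $flat = preg_replace('/[\r\n]+/', ' ', (string) $value);
        $data .= $name . '=' . $flat . "\n";
    }
    return $data;
}

// Made-up sample row standing in for a real database row.
$postrow = ['starter_name' => 'alice', 'starter_id' => 42, 'forum_id' => 7];

$data = build_document_data([
    'author'   => $postrow['starter_name'],
    'authorid' => $postrow['starter_id'],
    'forum_id' => $postrow['forum_id'],
]);

// With a real XapianDocument in $doc, the result would be stored with:
// $doc->set_data($data);
```

Flattening line breaks is a design choice made here so that a multi-line value (such as a post body used as a sample) cannot be mistaken for the start of another field.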

However, what do you mean by "you could pull data out of wherever it came
from to display it"?
That I could fill in the results by looking them up in the data source (a
database)? I guess there are pros and cons to that.
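
A rough illustration of that second approach: keep only the pid in the document data, and look the rest up at display time. This is a sketch under assumptions, not code from the thread. rows_for_hits(), $rowsByPid and the sample rows are hypothetical; in a real searcher the pids would come from get_data() on each matching document, and the lookup would be a query against the source database.

```php
<?php
// Sketch only: rows_for_hits() is a hypothetical helper. $fetchRow is any
// callable that returns the source row for a pid, or null if it is gone.
function rows_for_hits(array $pids, callable $fetchRow): array
{
    $rows = [];
    foreach ($pids as $pid) {
        $row = $fetchRow($pid);
        if ($row !== null) {
            // Skip posts deleted from the source since the index was built.
            $rows[] = $row;
        }
    }
    return $rows;
}

// Made-up stand-in for the source database, keyed by pid.
$rowsByPid = [
    '101' => ['pid' => '101', 'title' => 'A', 'author' => 'alice'],
    '103' => ['pid' => '103', 'title' => 'B', 'author' => 'bob'],
];
$fetchRow = function ($pid) use ($rowsByPid) {
    return $rowsByPid[$pid] ?? null;
};

// Pids in the order the search returned them; 999 no longer exists.
$rows = rows_for_hits(['103', '999', '101'], $fetchRow);
```

The trade-off is the one raised in the reply: less duplicated data in the index, at the cost of one extra lookup per displayed hit and the need to cope with rows that have changed or disappeared.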



athlon athlonf

2008-Jan-16 09:18 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

I've managed to index my database using the PHP bindings, and I'm
really amazed by the speed.

Apparently, using a Perl script like dbi2omega to generate the input file
and then having scriptindex parse and index it is much slower.
Indexing with PHP took 9 hours to complete on my development machine (AMD64
3800 with 2GB of RAM and a 5-disk RAID 5) for 3 million documents, with less
load.

Indexing with dbi2omega->scriptindex has taken more than 24 hours and is
not even at 40% (I've made several intermediate files), at load 5. And this
is on a dual AMD Opteron 246 with RAID 1 and 3GB of RAM.





     

athlon athlonf

2008-Jan-16 17:58 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

>Load 5 suggests something's wrong, because dbi2omega and scriptindex
>are both linear processes. Are you running several instances in
>parallel in some way?

It usually starts off fairly low, but then after half an hour or so, it
reaches load 5 and stays there.
I'm only running one scriptindex at a time, but it's a fairly complicated
index script, I guess. And of course, the input files are huge, mostly beyond
2GB.

>I believe that right now, none of the supplied Xapian indexing scripts
>or binaries will go significantly above a load of 1, unless you have
>other issues or something else happening on the machine.

The machine itself is also a web server, but the normal average load is below
0.25. I can imagine it running into memory problems when scriptindex tries to
parse 2GB worth of data each time.


This is the index script I was using:

tid : boolean=Q field=id
pid : unique=Q boolean=Q field=pid
topic_title : unhtml weight=10 field=title index=Z index
forum_id : indexnopos=XFORUMID field=forum_id
postdate : field=startdate
postdateh : field=startdateh  date=yyyymmdd value=2
topicStart : field=topicStart
topicStarth : field=topicStarth  date=yyyymmdd value=3
author_name: unhtml boolean=XAUTHORNAME field=author
author_id: boolean=XAUTHORID field=authorid
state : field=state
url :  field=url
sample : weight=1 index field=sample



     

athlon athlonf

2008-Jan-16 19:47 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

Actually I'm already graphing my servers, but I haven't really investigated
the load caused by scriptindex, as I was mostly busy developing at the moment.
I've just taken a look and it doesn't seem to be a memory problem. During the
load spikes, memory went from 20MB to 1GB, but there was still 2GB left
(albeit cached).

I'd suspect it to be an IO load.
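
One lightweight way to see whether the load climbs in step with indexing progress is to log the load average from the indexing script itself. This is only a sketch: format_load_line() and the 10000-document figure are hypothetical, and it assumes a platform where PHP's sys_getloadavg() is available (it is not on Windows). It does not distinguish CPU load from I/O wait.

```php
<?php
// Sketch only: format_load_line() is a hypothetical helper that formats the
// 1, 5 and 15 minute load averages alongside the indexing progress.
function format_load_line(int $docsIndexed, array $load): string
{
    // %F keeps the decimal point independent of the current locale.
    return sprintf(
        '%d docs indexed, load averages %.2F %.2F %.2F',
        $docsIndexed,
        $load[0],
        $load[1],
        $load[2]
    );
}

// In a real indexing loop this would run every few thousand documents;
// 10000 is just an illustrative count.
$load = function_exists('sys_getloadavg') ? sys_getloadavg() : false;
if ($load !== false) {
    error_log(format_load_line(10000, $load));
}
```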

Hmm... the index=Z... no idea really. A copy-and-paste error, I think. I
remember that when I delved into my first Xapian database, I saw that the
"content" which would be searched had the prefix Z, so...

----- Original Message ----
From: James Aylett <james-xapian@tartarus.org>
To: xapian-discuss@lists.xapian.org
Sent: Wednesday, January 16, 2008 7:19:25 PM
Subject: Re: [Xapian-discuss] PHP indexing, what's the PHP method for
indexscript


On Wed, Jan 16, 2008 at 09:58:11AM -0800, athlon athlonf wrote:
> >Load 5 suggests something's wrong, because dbi2omega and scriptindex
> >are both linear processes. Are you running several instances in
> >parallel in some way?
> 
> it usually starts off fairly low, but then after half an hour or so,
> it will reach load 5 constantly.

I'd guess at a memory problem then, with processes blocking on VM
I/O. But that's really a guess.

Load is an indication of the length of the run queue, i.e. the number of
processes trying to get access to the processor at any one point in
time; so there's something going on beyond scriptindex itself
there. If you want to figure out what's going on, I'd recommend
pulling snmp data out of the system and graphing it (probably using
something like cacti) -- you'll see things like memory usage over time
that way, and it'll be obvious if (for instance) the load shoots up
when it hits a certain amount of free memory or something.

> I'm only doing one scriptindex at a time, but it's fairly
> complicated indexingscript I guess. And of course, the inputfiles
> are huge, mostly beyond 2GB.

scriptindex only pulls a line at a time from the input files, so that
won't matter per se. You may be running into issues with Xapian not
flushing to disk enough - that again 

> topic_title : unhtml weight=10 field=title index=Z index

What is the index=Z intended to do?

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                  xapian.org
  james@tartarus.org                               uncertaintydivision.org

_______________________________________________
Xapian-discuss mailing list
Xapian-discuss@lists.xapian.org
http://lists.xapian.org/mailman/listinfo/xapian-discuss





     

athlon athlonf

2008-Jan-17 20:26 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

The explanation sounds plausible.

As for the indexer, no, it does not use replace_document (I didn't know about
that function, actually...)

This is the relevant part of the PHP script:

    $doc = new XapianDocument();
    $doc->set_data($data);
    $doc->add_value(1, $postrow['forum_id']);
    $doc->add_value(2, date('Ymd', $postrow['postdate']));
    $doc->add_value(3, $postrow['author_id']);

    // Add the boolean terms.
    $doc->add_term("XFORUMID".$postrow["forum_id"]);
    $doc->add_term("XAUTHORID".$postrow["author_id"]);
    // NOTE: this uses forum_id, although the index script maps
    // author_name to XAUTHORNAME.
    $doc->add_term("XAUTHORNAME".$postrow["forum_id"]);

    // Assign the document to the TermGenerator, which will generate the
    // terms used for searching.
    $indexer->set_document($doc);
    $indexer->index_text($postrow['post']);
    $indexer->index_text($postrow['title'], 2);

    // Add the document to the database.
    $database->add_document($doc);

    $postrow = null;
    $data = null;
    $doc = null;

So... I should probably use replace_document if I "update" existing
documents?


----- Original Message ----
From: Olly Betts <olly@survex.com>
To: athlon athlonf <athlonkmf@yahoo.com>
Cc: xapian-discuss@lists.xapian.org
Sent: Thursday, January 17, 2008 3:14:15 AM
Subject: Re: [Xapian-discuss] PHP indexing,  what's the PHP method for
indexscript


On Wed, Jan 16, 2008 at 09:58:11AM -0800, athlon athlonf wrote:
> >Load 5 suggests something's wrong, because dbi2omega and scriptindex
> >are both linear processes. Are you running several instances in
> >parallel in some way?
> 
> it usually starts off fairly low, but then after half an hour of so,
> it will reach load 5 constantly.

As James says, the scriptindex process itself shouldn't raise the load
by more than 1 (since it's essentially a single process, plus one
/bin/cat child process, which will always be blocked on read except very
briefly when the database is opened or closed).

I suspect what is happening here is that the scriptindex process is
causing the machine to swap so that webserver requests take a lot longer
and so start to overlap.  Hence 4 of the load is actually due to the
webserver (although caused by scriptindex).  I can't think of another
plausible explanation anyway.

> tid : boolean=Q field=id
> pid : unique=Q boolean=Q field=pid

It doesn't seem to make a lot of sense to have two fields mapping to "Q"
like this...

FWIW, I think this may explain why your PHP script is so much faster -
"unique" is quite a slow operation (even if no duplicate documents
exist, just checking for them significantly slows indexing).  Does your
PHP indexer contain code like this:

    $db->replace_document($qterm, $doc);

If not, does it handle enforcing unique documents another way?  If it
doesn't, then you aren't comparing like with like.

If this isn't the explanation, it would be interesting to work out why
there's such a difference.

Cheers,
    Olly





     

Olly Betts

2008-Jan-24 02:03 UTC

[Xapian-discuss] PHP indexing, what's the PHP method for indexscript

On Thu, Jan 17, 2008 at 12:25:57PM -0800, athlon athlonf wrote:
> So...I should probably use replace_document if I "update" existing
> documents?

Yes, although you can get away without doing so when reindexing from
scratch, assuming there are no duplicates in an indexing run.

Someone recently benchmarked this (the discussion was on this mailing list),
and there's a significant overhead to checking the uid term.  It's
something I'm intending to look into, but I don't have a lot of spare
time at the moment...

Cheers,
    Olly
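
The advice above can be sketched in PHP roughly as follows. This is a minimal sketch, not code from the thread: store_post() and its $fromScratch flag are hypothetical names, and $database and $doc are assumed to be a XapianWritableDatabase and a XapianDocument from the PHP bindings. It adds the Q-prefixed pid term (the counterpart of boolean=Q in the index script) and then either replaces any existing document carrying that term (the counterpart of unique=Q), or simply adds the document when reindexing from scratch.

```php
<?php
// Sketch only: store_post() is a hypothetical helper, not part of Xapian.
// $database is expected to offer add_document() and replace_document(),
// and $doc to offer add_term(), as the Xapian PHP classes do.
function store_post($database, $doc, string $pid, bool $fromScratch = false): string
{
    // Same prefix the index script uses for unique=Q boolean=Q.
    $qterm = 'Q' . $pid;

    // Index the pid as a boolean term so the document can be found again.
    $doc->add_term($qterm);

    if ($fromScratch) {
        // No uniqueness check: only safe if the run contains no duplicates.
        $database->add_document($doc);
    } else {
        // Replaces the document indexed by $qterm, or adds it if none exists.
        $database->replace_document($qterm, $doc);
    }
    return $qterm;
}

// Usage with real Xapian objects would look something like:
// store_post($database, $doc, $postrow['pid']);        // incremental update
// store_post($database, $doc, $postrow['pid'], true);  // full rebuild
```

Keeping the choice behind a flag makes the cost visible: the replace path pays for the uniqueness check on every document, which is the overhead discussed above.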
