thr3ads.net - Xapian discuss - [Xapian-discuss] Newbie problems with searching from Python [Apr 2009]


Richard Lewis

2009-Apr-27 17:26 UTC

[Xapian-discuss] Newbie problems with searching from Python

Hi there,

I'm brand new to Xapian and trying to use it to add a full-text search
facility to my Web-published database using Python. I've developed an
API which gives me XML views of the database records I need to
index. I also have XSLT stylesheets which transform those XML views
into HTML for Web presentation.

What I'm trying to do is build a Xapian index of my HTML documents and
provide a simple keyword search interface to that index. Both the
indexing operation and the searching operation need to be called in
response to HTTP requests on an always-alive server (CherryPy, in
fact) and so a Python bindings-based solution is preferable to using
an external application (such as Omega) run in a separate process.

So far, I have the following:

import xapian
from lxml import etree  # assumed: the .xpath()/etree.tostring() calls below look like lxml

# some document 'value' (or metadata) constants
DOC_PATH = 0
DOC_RECORD_TYPE = 1
DOC_CATNO = 2
DOC_TITLE = 3
DOC_SUBTITLE = 4
DOC_YEAR = 5

def build_fulltext_index():
    # initialise the Xapian indexer
    database = xapian.WritableDatabase('indexes',
                                       xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    stemmer = xapian.Stem('english')
    indexer.set_stemmer(stemmer)

    for work in works_table.list_records():
        work_html = html_XSLT(work.xml())

        # create a Xapian document
        doc = xapian.Document()

        # set its properties to the properties of the work
        doc.add_value(DOC_PATH, '/works/%s' % work['catalogue_no'])
        doc.add_value(DOC_RECORD_TYPE,
            work_html.xpath('//meta[@name="record-type"]')[0].attrib['content'])
        doc.add_value(DOC_CATNO, work['catalogue_no'])
        doc.add_value(DOC_TITLE, work['title'])
        doc.add_value(DOC_SUBTITLE, work['subtitle'])
        doc.add_value(DOC_YEAR, work['year'])

        # set the HTML version of the work as the Xapian
        # document's data
        doc.set_data(etree.tostring(work_html))

        # index all the text inside the HTML <div class="work">
        # element
        indexer.set_document(doc)
        indexer.index_text('\n'.join(
            work_html.getroot().xpath('//div[@class="work"]//text()')))

        # add the document to the database
        if doc.get_docid() == 0:
            print '/works/%s has docid of 0' % work['catalogue_no']
        else:
            database.replace_document(doc.get_docid(), doc)

def search(terms, start, count):
    # load the index and initialise the query
    database = xapian.Database('indexes')
    enquire = xapian.Enquire(database)
    qp = xapian.QueryParser()
    stemmer = xapian.Stem('english')
    qp.set_stemmer(stemmer)
    qp.set_database(database)
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    query = qp.parse_query(terms)

    # execute the query
    enquire.set_query(query)
    matches = enquire.get_mset(start - 1, count)

    # iterate over the results
    for m in matches:
        # retrieve the document
        document = m.document

        # get the catalogue number and record-type of the hit
        (cat_no, record_type) = (document.get_value(Search.DOC_CATNO),
                                 document.get_value(Search.DOC_RECORD_TYPE))


The above code seems to generate the index OK. And it also manages
storing and retrieving the metadata (like titles, catalogue numbers,
etc.) However, I don't get anything like the number of hits I'd expect
for any given search. Generally, I get two or three hits for the
search terms I try. In all cases I know that tens of records in my
database match the terms I'm testing. (I used to use Swish-e for my
indexing and it returned the kinds of results I'm expecting to see.)

So I don't really know what to ask. Does anyone know what I'm doing
wrong? Or is Xapian behaving as it should and I'm just expecting the
wrong thing of it?

Cheers,
Richard

James Aylett

2009-Apr-27 17:55 UTC


[Xapian-discuss] Newbie problems with searching from Python

On Mon, Apr 27, 2009 at 06:26:24PM +0100, Richard Lewis wrote...

Richard; I can't easily answer your general question about why you
aren't getting the number of results you expect without some idea of
what you're putting into your document. As the FAQ for when you get no
matches at all suggests, try using delve to find out what's actually
being stored in the database for each document.

<http://trac.xapian.org/wiki/FAQ/NoMatches>

(This FAQ perhaps needs rewording to make it obvious that it's useful
when you're getting fewer than expected results as well.)
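
If running delve is awkward, roughly the same check can be sketched from
Python. This is only an illustration: list_document_terms is a
hypothetical helper name, and it assumes the object passed in behaves
like an opened xapian.Database (get_document() returning a document
whose termlist() yields items with .term and .wdf attributes).

```python
def list_document_terms(database, docid):
    """Return (term, wdf) pairs for one stored document.

    `database` is assumed to be an opened xapian.Database (or anything
    with the same get_document()/termlist() interface).
    """
    doc = database.get_document(docid)
    # Each term list item carries the term itself and its
    # within-document frequency (wdf).
    return [(item.term, item.wdf) for item in doc.termlist()]
```

Calling it as list_document_terms(xapian.Database('indexes'), 1) should
show whether the words you expect to match (and their 'Z'-prefixed
stemmed forms) were actually indexed for that document.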

However this looks weird:

>         # add the document to the database
>         if doc.get_docid() == 0:
>             print '/works/%s has docid of 0' % work['catalogue_no']
>         else:
>             database.replace_document(doc.get_docid(), doc)

Unless I'm misreading the rest of your code, I don't see how this is
going to index anything! get_docid() will return 0 when the document
hasn't been added to a database yet; so you need a
`database.add_document(...)` on that leg of the if, don't you?
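
One way to avoid branching on the docid at all is to give every document
a unique identifying term and let replace_document() do the
add-or-replace by that term. The sketch below is an assumption about
what would suit this data, not the only fix: upsert_document is a
hypothetical helper, and the 'Q' prefix is simply the convention Omega
uses for unique-ID terms.

```python
def upsert_document(database, doc, catalogue_no):
    """Add `doc`, or replace the existing document for this catalogue number.

    `database` is assumed to be an xapian.WritableDatabase and `doc` an
    xapian.Document. Returns the docid Xapian assigned.
    """
    # Unique ID term; 'Q' is a naming convention, not a requirement.
    id_term = 'Q' + str(catalogue_no)
    doc.add_term(id_term)
    # With a term as the first argument, replace_document() adds the
    # document if nothing is indexed under that term yet, and replaces
    # the existing document otherwise.
    return database.replace_document(id_term, doc)
```

Re-running the indexer then updates each work in place rather than
skipping it or piling up duplicates.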

I note that you're only indexing some of the text in your HTML
output. This suggests that lots of your HTML template is building
chrome (menus, navigation and so on); it may be easier to index out of
the data store directly rather than converting to HTML and then
indexing that, which strikes me as just an extra step for things to go
wrong or get more confusing in.

J

-- 
  James Aylett

  talktorex.co.uk - xapian.org - uncertaintydivision.org

Simon Roe

2009-Apr-28 07:43 UTC


[Xapian-discuss] Newbie problems with searching from Python

On Mon, Apr 27, 2009 at 6:26 PM, Richard Lewis
<richardlewis at fastmail.co.uk> wrote:

> # some document 'value' (or metadata) constants
> DOC_PATH = 0
> DOC_RECORD_TYPE = 1
> DOC_CATNO = 2
> DOC_TITLE = 3
> DOC_SUBTITLE = 4
> DOC_YEAR = 5

Not related to your problem, but you may want to use term prefixes or
the document data for some of this stuff, rather than values.

http://www.flax.co.uk/blog/2009/04/02/xapian-search-architecture/
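
As a rough illustration of the term-prefix approach for the record
type: the 'XTYPE' prefix, the 'type' field name and both helper names
below are made up for this example; only add_term() and
QueryParser.add_boolean_prefix() are actual Xapian calls.

```python
RECORD_TYPE_PREFIX = 'XTYPE'  # hypothetical prefix chosen for this sketch

def add_record_type_term(doc, record_type):
    """Index the record type as a prefixed term (doc: xapian.Document)."""
    # Lowercased so the term matches a lowercase value typed in a query.
    term = RECORD_TYPE_PREFIX + record_type.lower()
    doc.add_term(term)
    return term

def register_record_type_prefix(query_parser):
    """Map the user-facing field 'type' onto the prefix (xapian.QueryParser)."""
    query_parser.add_boolean_prefix('type', RECORD_TYPE_PREFIX)
```

With the prefix registered on the parser, a query such as
"type:work sonata" should restrict the matches to documents carrying
the XTYPEwork term.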

-- 
Help save the economy:
http://seriouschange.org.uk/

E: simon.roe at talusdesign.co.uk
M: 07742079314
