Richard Lewis
2009-Apr-27 17:26 UTC
[Xapian-discuss] Newbie problems with searching from Python
Hi there,

I'm brand new to Xapian and trying to use it to add a full-text search facility to my Web-published database using Python.

I've developed an API which gives me XML views of the database records I need to index. I also have XSLT stylesheets which transform those XML views into HTML for Web presentation.

What I'm trying to do is build a Xapian index of my HTML documents and provide a simple keyword search interface to that index. Both the indexing operation and the searching operation need to be called in response to HTTP requests on an always-alive server (CherryPy, in fact) and so a Python bindings-based solution is preferable to using an external application (such as Omega) run in a separate process.

So far, I have the following:

    import xapian

    # some document 'value' (or metadata) constants
    DOC_PATH = 0
    DOC_RECORD_TYPE = 1
    DOC_CATNO = 2
    DOC_TITLE = 3
    DOC_SUBTITLE = 4
    DOC_YEAR = 5

    def build_fulltext_index():
        # initialise the Xapian indexer
        database = xapian.WritableDatabase('indexes', xapian.DB_CREATE_OR_OPEN)
        indexer = xapian.TermGenerator()
        stemmer = xapian.Stem('english')
        indexer.set_stemmer(stemmer)

        for work in works_table.list_records():
            work_html = html_XSLT(work.xml())

            # create a Xapian document
            doc = xapian.Document()

            # set its properties to the properties of the work
            doc.add_value(DOC_PATH, '/works/%s' % work['catalogue_no'])
            doc.add_value(DOC_RECORD_TYPE, work_html.xpath('//meta[@name="record-type"]')[0].attrib['content'])
            doc.add_value(DOC_CATNO, work['catalogue_no'])
            doc.add_value(DOC_TITLE, work['title'])
            doc.add_value(DOC_SUBTITLE, work['subtitle'])
            doc.add_value(DOC_YEAR, work['year'])

            # set the HTML version of the work as the Xapian
            # document's data
            doc.set_data(etree.tostring(work_html))

            # index all the text inside the HTML <div class="work">
            # element
            indexer.set_document(doc)
            indexer.index_text('\n'.join(work_html.getroot().xpath('//div[@class="work"]//text()')))

            # add the document to the database
            if doc.get_docid() == 0:
                print '/works/%s has docid of 0' % work['catalogue_no']
            else:
                database.replace_document(doc.get_docid(), doc)

    def search(terms):
        # load the index and initialise the query
        database = xapian.Database('indexes')
        enquire = xapian.Enquire(database)
        qp = xapian.QueryParser()
        stemmer = xapian.Stem('english')
        qp.set_stemmer(stemmer)
        qp.set_database(database)
        qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
        query = qp.parse_query(terms)

        # execute the query
        enquire.set_query(query)
        matches = enquire.get_mset(start - 1, count)

        # iterate over the results
        for m in matches:
            # retrieve the document
            document = m.document

            # get the catalogue number and record-type of the hit
            (cat_no, record_type) = (document.get_value(Search.DOC_CATNO),
                                     document.get_value(Search.DOC_RECORD_TYPE))

The above code seems to generate the index OK. It also manages storing and retrieving the metadata (titles, catalogue numbers, etc.).

However, I don't get anything like the number of hits I'd expect for any given search. Generally, I get two or three hits for the search terms I try. In all cases I know that tens of records in my database match the terms I'm testing. (I used to use Swish-e for my indexing and it returned the kinds of results I'm expecting to see.)

So I don't really know what to ask. Does anyone know what I'm doing wrong? Or is Xapian behaving as it should and I'm just expecting the wrong thing of it?

Cheers,
Richard
James Aylett
2009-Apr-27 17:55 UTC
[Xapian-discuss] Newbie problems with searching from Python
On Mon, Apr 27, 2009 at 06:26:24PM +0100, Richard Lewis wrote...

Richard; I can't easily answer your general question about why you aren't getting the number of results you expect without some idea of what you're putting into your document. As the FAQ for when you get no matches at all suggests, try using delve to find out what's actually being stored in the database for each document.

<http://trac.xapian.org/wiki/FAQ/NoMatches>

(This FAQ perhaps needs rewording to make it obvious that it's useful when you're getting fewer than expected results as well.)

However this looks weird:

> # add the document to the database
> if doc.get_docid() == 0:
>     print '/works/%s has docid of 0' % work['catalogue_no']
> else:
>     database.replace_document(doc.get_docid(), doc)

Unless I'm misreading the rest of your code, I don't see how this is going to index anything! get_docid() will return 0 when the document hasn't been added to a database yet; so you need a `database.add_document(...)` on that leg of the if, don't you?

I note that you're only indexing some of the text in your HTML output. This suggests that lots of your HTML template is building chrome (menus, navigation and so on); it may be easier to index out of the data store directly rather than converting to HTML and then indexing that, which strikes me as just an extra step for things to go wrong or get more confusing in.

J

-- 
James Aylett
talktorex.co.uk - xapian.org - uncertaintydivision.org
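
A minimal sketch of the fix described above, under assumptions that are not in the thread. A freshly created xapian.Document reports a docid of 0, so instead of branching on get_docid() one option is to tag each record with a unique term and call replace_document() with that term, which adds the document when no existing document carries the term and overwrites it otherwise. This is a swapped-in variation on the plain add_document() call suggested above, chosen so that rebuilding the index does not create duplicates. The helper names `unique_id_term` and `store_work` are hypothetical, and the 'Q' prefix is the convention Omega uses for unique ID terms.

```python
# Hypothetical helpers; `database` is expected to be a
# xapian.WritableDatabase and `doc` a xapian.Document.

ID_PREFIX = 'Q'  # assumption: Omega's conventional prefix for unique IDs


def unique_id_term(catalogue_no):
    """Build a term that identifies exactly one record."""
    return ID_PREFIX + str(catalogue_no)


def store_work(database, doc, catalogue_no):
    """Add the document, or replace the one already indexed for this record.

    Returns the document id that replace_document() reports.
    """
    id_term = unique_id_term(catalogue_no)
    # The term must be on the document itself so a later call can find it.
    doc.add_term(id_term)
    # With a term as first argument, replace_document() replaces any
    # document indexed by that term, or adds a new one if none exists.
    return database.replace_document(id_term, doc)
```

In the indexing loop from the original post, `store_work(database, doc, work['catalogue_no'])` would take the place of the whole `if doc.get_docid() == 0:` block.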
Simon Roe
2009-Apr-28 07:43 UTC
[Xapian-discuss] Newbie problems with searching from Python
On Mon, Apr 27, 2009 at 6:26 PM, Richard Lewis <richardlewis at fastmail.co.uk> wrote:

> # some document 'value' (or metadata) constants
> DOC_PATH = 0
> DOC_RECORD_TYPE = 1
> DOC_CATNO = 2
> DOC_TITLE = 3
> DOC_SUBTITLE = 4
> DOC_YEAR = 5

Not related to your problem, but you may want to use term prefixes or the document data for some of this stuff, rather than values.

http://www.flax.co.uk/blog/2009/04/02/xapian-search-architecture/

-- 
Help save the economy: http://seriouschange.org.uk/
E: simon.roe at talusdesign.co.uk
M: 07742079314
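
A rough sketch of what the term-prefix suggestion above could look like for the title field, offered as one possible reading rather than the approach the linked article prescribes. The helper names `index_work_text` and `configure_query_parser` are hypothetical, the `title` field name is an assumption, and 'S' is the prefix Omega conventionally uses for title terms. The calls used are TermGenerator.index_text(text, wdf_inc, prefix), TermGenerator.increase_termpos() and QueryParser.add_prefix(field, prefix).

```python
# Hypothetical helpers; `indexer` is expected to be a xapian.TermGenerator
# (with set_document() already called) and `qp` a xapian.QueryParser.

TITLE_PREFIX = 'S'  # assumption: Omega's conventional prefix for titles


def index_work_text(indexer, title, body):
    """Index the title under a prefix and as free text, then the body."""
    # Prefixed terms make the title searchable as its own field.
    indexer.index_text(title, 1, TITLE_PREFIX)
    # Leave a gap in term positions so phrases cannot span two fields.
    indexer.increase_termpos()
    # Unprefixed copy, so ordinary searches still match title words.
    indexer.index_text(title)
    indexer.increase_termpos()
    indexer.index_text(body)


def configure_query_parser(qp):
    """Map the user-facing field name onto the term prefix."""
    qp.add_prefix('title', TITLE_PREFIX)
```

With the prefix registered, a query such as `title:sonata` would be restricted to title terms, while an unqualified `sonata` still searches everything. Values would then only be needed for fields used in sorting or shown alongside results.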