robersonja
2006-Jan-13 22:22 UTC
[Xapian-discuss] How to accomplish this task with the Python Bindings?
I am working on creating an OS X Spotlight-like application. The first task is to index fully qualified paths. I want to be able to search for filenames first, as an exercise to learn Xapian and the Python bindings.

I tried using Xapwrap by divmod.org, but that didn't pan out: I could not get the actual data back after a search. A search would return a document uid, but I never could get .get_document().get_data() to return anything. So I decided to just use the "raw" Python bindings, and tried the simpleindex and simplesearch Python example programs.

I think in both cases (xapwrap and just the default xapian bindings) I am getting indexing to happen, but I can't really tell because I can't get any search results to confirm anything. When I tried with the xapian Python bindings directly, I couldn't get the search to work. Granted, the simplesearch example program is broken, so I am kind of groping in the dark on how to get the search to return a list of documents and have get_data() actually return something.

I guess what I need is some simple example code that will allow me to do the following.
Given some data like

    /this/is/a/fully/qualified/path/to/a/filename

how do I create a document and add it to an index so that I can search for it by 'filename'?

This is what I am doing to create documents and add them to the index:

    #!/usr/bin/python
    # indexer.py
    import sys
    import xapian

    # setup the file to index
    fileToIndex = sys.argv[1]
    if len(sys.argv) >= 3:
        maxRecordsToIndex = int(sys.argv[2])
    else:
        maxRecordsToIndex = 0
    recordCount = -1

    # setup the xapian database
    try:
        db = xapian.WritableDatabase('/tmp/index', xapian.DB_CREATE_OR_OPEN)
        # index the file
        for line in file(fileToIndex):
            doc = xapian.Document()
            doc.set_data(line)
            db.add_document(doc)
            # my input file is 70GB of data, this is to make testing faster
            recordCount = recordCount + 1
            if maxRecordsToIndex > -1 and recordCount >= maxRecordsToIndex:
                break
            elif recordCount % 1000 == 0:
                print 'print processed %s records so far!' % recordCount
        print 'processed %s records' % recordCount
    except Exception, e:
        print 'Exception: %s' % str(e)
        sys.exit(1)

And this is what I am doing to try to get the data back from a search. The problem is that I can't get it to find anything. Given the example data above, when I run

    python searcher.py /tmp/index filename

I get 0 records found!

    #!/usr/local/bin/python
    # searcher.py
    import sys
    import xapian

    if len(sys.argv) < 3:
        print "usage: %s <path to database> <search terms>" % sys.argv[0]
        sys.exit(1)

    try:
        database = xapian.Database(sys.argv[1])
        enquire = xapian.Enquire(database)
        query = xapian.Query(sys.argv[2])
        print "Performing query `%s'" % query.get_description()
        enquire.set_query(query)
        matches = enquire.get_mset(0, 10)
        print "%i results found" % matches.get_matches_estimated()
        for match in matches:
            print "ID %i %i%% [%s]" % (match[xapian.MSET_DID],
                                       match[xapian.MSET_PERCENT],
                                       match[xapian.MSET_DOCUMENT].get_data())
    except Exception, e:
        print "Exception: %s" % str(e)
        sys.exit(1)
Olly Betts
2006-Jan-14 01:24 UTC
[Xapian-discuss] How to accomplish this task with the Python Bindings?
On Fri, Jan 13, 2006 at 05:06:22PM -0500, robersonja wrote:
> I think in both cases ( xapwrap and just the default xapian )
> bindings I am getting indexing to happen, but I can't really tell
> because I can't get any search results to confirm anything.

For this sort of debugging, the "delve" utility is very handy. It's in the examples subdirectory of xapian-core, and should be installed by make install. You can use it to check that the index contains what you expect and narrow down a problem to being on the indexing or searching side.

For example, you can look at the terms indexing document 7:

    delve /path/to/database -r 7

Or the posting list for term "wibble" (note that delve wants the term exactly as in the database, which may be stemmed or have a prefix, etc):

    delve /path/to/database -t wibble

For other options, read "delve --help".

> When I tried with the xapian python bindings directly, I can't get
> the search to work. Granted the simplesearch example program is
> broken, so I am kind of groping in the dark on how to get the search
> to return a list of documents and have get_data() actually return
> something.

Sorry about the broken simplesearch.py, but it's only broken in that it uses a query constructor which SWIG was failing to wrap as intended. The rest of the code is correct, only the part which builds the query is wrong (well, arguably the example is right, and the bindings are wrong).

This is fixed in SVN trunk, so you might want to try a snapshot:

    http://www.oligarchy.co.uk/xapian/trunk/

They're in good shape right now as I'm busy tying up loose ends for the next release.

> # index the file
> for line in file(fileToIndex):
>     doc = xapian.Document()
>     doc.set_data(line)

You need to add some index entries here which you want searches for this document to match.
So split line on "/" and add a posting for each entry:

    pos = 0
    for term in line.strip().split("/"):
        if term:  # skip the empty string before the leading "/"
            doc.add_posting(term, pos)
            pos += 1

> db.add_document(doc)

You may want to stem "term" before adding it (if you do, you also need to correspondingly stem terms before searching for them).

> and this is what I am doing to try and get the data back from a
> search, the problem is I can't get it to find anything.

The search script looks plausible. I think if you actually add some postings it'll all start to work.

Cheers,
Olly
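
The advice in this reply could be sketched roughly as below. This is an illustration written for Python 3, not code from the thread: `path_postings` and `index_path` are hypothetical helper names, and `stem` is assumed to be any callable that maps a term to its stemmed form. Only `path_postings` runs without Xapian installed; `index_path` assumes `db` is an already opened `xapian.WritableDatabase`.

```python
def path_postings(line, stem=None):
    """Return (term, position) pairs for one path.

    Empty components (such as the one before a leading "/") are skipped,
    and surrounding whitespace, including the trailing newline read from
    the input file, is stripped first.
    """
    postings = []
    pos = 0
    for component in line.strip().split("/"):
        if not component:
            continue
        # Hypothetical hook: apply the same stem function when searching.
        term = stem(component) if stem else component
        postings.append((term, pos))
        pos += 1
    return postings


def index_path(db, line, stem=None):
    """Add one path to `db`, assumed to be a xapian.WritableDatabase."""
    # Imported here so path_postings can be tried without Xapian installed.
    import xapian

    doc = xapian.Document()
    doc.set_data(line.strip())
    for term, pos in path_postings(line, stem):
        doc.add_posting(term, pos)
    db.add_document(doc)


if __name__ == "__main__":
    # Show the terms that would be indexed for the example path.
    for term, pos in path_postings("/this/is/a/fully/qualified/path/to/a/filename\n"):
        print(pos, term)
```

With terms added this way, a search for the exact term `filename` has something to match. If a `stem` function is passed at indexing time, the same function needs to be applied to the search terms before building the query.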