robersonja
2006-Jan-13 22:22 UTC
[Xapian-discuss] How to accomplish this task with the Python Bindings?
I am working on creating an OS X Spotlight-like application.
My first task is to index fully qualified paths. I want to be able to
search for filenames first, as an exercise to learn Xapian and
the Python bindings.
I tried using Xapwrap by divmod.org, but that didn't pan out: I could not
get the actual data back after a search. A search would return a
document uid, but I could never get .get_document().get_data() to
return anything.
So I decided to just use the "raw" Python bindings provided, and
tried the simpleindex and simplesearch Python example programs.
I think in both cases ( xapwrap and just the default xapian )
bindings I am getting indexing to happen, but I can't really tell
because I can't get any search results to confirm anything.
When I tried with the xapian python bindings directly, I can't get
the search to work. Granted the simplesearch example program is
broken, so I am kind of groping in the dark on how to get the search
to return a list of documents and have get_data() actually return
something.
I guess what I need is some simple example code that will allow me to
do the following.
Given some data like
/this/is/a/fully/qualified/path/to/a/filename
how do I create a document and add it to an index so that I can
search for it by 'filename'?
This is what I am doing to create documents and add them to the index:
#!/usr/bin/python
# indexer.py
import sys
import xapian
# setup the file to index
fileToIndex = sys.argv[1]
if len(sys.argv) >= 3:
    maxRecordsToIndex = int(sys.argv[2])
else:
    maxRecordsToIndex = 0
recordCount = -1
# setup the xapian database
try:
    db = xapian.WritableDatabase('/tmp/index',
                                 xapian.DB_CREATE_OR_OPEN)
    # index the file
    for line in file(fileToIndex):
        doc = xapian.Document()
        doc.set_data(line)
        db.add_document(doc)
        # my input file is 70GB of data, this is to make testing faster
        recordCount = recordCount + 1
        if maxRecordsToIndex > -1 and recordCount >= maxRecordsToIndex:
            break
        elif recordCount % 1000 == 0:
            print 'print processed %s records so far!' % recordCount
    print 'processed %s records' % recordCount
except Exception, e:
    print'Exception: %s' % str(e)
    sys.exit(1)
And this is what I am doing to try to get the data back from a
search. The problem is that I can't get it to find anything.
Given the example data above, when I run:
    python searcher.py /tmp/index filename
I get 0 records found!
#!/usr/local/bin/python
# searcher.py
import sys
import xapian
if len(sys.argv) < 3:
    print "usage: %s <path to database> <search terms>" % sys.argv[0]
    sys.exit(1)
try:
    database = xapian.Database(sys.argv[1])
    enquire = xapian.Enquire(database)
    query = xapian.Query(sys.argv[2])
    print "Performing query `%s'" % query.get_description()
    enquire.set_query(query)
    matches = enquire.get_mset(0, 10)
    print "%i results found" % matches.get_matches_estimated()
    for match in matches:
        print "ID %i %i%% [%s]" % (match[xapian.MSET_DID],
                                   match[xapian.MSET_PERCENT],
                                   match[xapian.MSET_DOCUMENT].get_data())
except Exception, e:
    print "Exception: %s" % str(e)
    sys.exit(1)
Olly Betts
2006-Jan-14 01:24 UTC
[Xapian-discuss] How to accomplish this task with the Python Bindings?
On Fri, Jan 13, 2006 at 05:06:22PM -0500, robersonja wrote:
> I think in both cases ( xapwrap and just the default xapian )
> bindings I am getting indexing to happen, but I can't really tell
> because I can't get any search results to confirm anything.

For this sort of debugging, the "delve" utility is very handy. It's in
the examples subdirectory of xapian-core, and should be installed by
make install. You can use it to check that the index contains what you
expect and narrow down a problem to being on the indexing or searching
side.

For example, you can look at the terms indexing document 7:

    delve /path/to/database -r 7

Or the posting list for term "wibble" (note that delve wants the term
exactly as in the database, which may be stemmed or have a prefix, etc):

    delve /path/to/database -t wibble

For other options, read "delve --help".

> When I tried with the xapian python bindings directly, I can't get
> the search to work. Granted the simplesearch example program is
> broken, so I am kind of groping in the dark on how to get the search
> to return a list of documents and have get_data() actually return
> something.

Sorry about the broken simplesearch.py, but it's only broken in that it
uses a query constructor which SWIG was failing to wrap as intended.
The rest of the code is correct, only the part which builds the query is
wrong (well, arguably the example is right, and the bindings are wrong).

This is fixed in SVN trunk, so you might want to try a snapshot:

    http://www.oligarchy.co.uk/xapian/trunk/

They're in good shape right now as I'm busy tying up loose ends for the
next release.

> # index the file
> for line in file(fileToIndex):
>     doc = xapian.Document()
>     doc.set_data(line)

You need to add some index entries here which you want searches for
this document to match. So split line on "/" and add a posting for each
entry:

    pos = 0
    for term in line.split("/"):
        doc.add_posting(term, pos)
        pos += 1

>     db.add_document(doc)

You may want to stem "term" before adding it (if you do, you also need
to correspondingly stem terms before searching for them).

> and this is what I am doing to try and get the data back from a
> search, the problem is I can't get it to find anything.

The search script looks plausible. I think if you actually add some
postings it'll all start to work.

Cheers,
    Olly
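
A minimal sketch of the advice in this reply, written for a current Python 3 interpreter rather than the Python 2 used in the thread. The helper names (normalize, path_postings, index_path, query_term) are hypothetical and not part of Xapian. Lowercasing is only a stand-in for real stemming, and the doc argument is assumed to offer set_data() and add_posting() as xapian.Document does in the code above. Unlike the original indexer, the sketch strips the trailing newline before storing the line and skips the empty component that a leading "/" produces.

```python
def normalize(term):
    """Stand-in for stemming (assumption): lowercase the term."""
    return term.lower()


def path_postings(line):
    """Return (term, position) pairs for the components of a path."""
    # strip() removes the trailing newline left by iterating over a file;
    # the filter drops the empty string produced by a leading "/".
    parts = [p for p in line.strip().split("/") if p]
    return [(normalize(part), pos) for pos, part in enumerate(parts)]


def index_path(doc, line):
    """Store the path as document data and add one posting per component."""
    doc.set_data(line.strip())
    for term, pos in path_postings(line):
        doc.add_posting(term, pos)


def query_term(user_input):
    """Apply the same normalization to a search term as at index time."""
    return normalize(user_input.strip())
```

With this in place, the loop in indexer.py would call index_path(doc, line) before db.add_document(doc), and searcher.py would pass query_term(sys.argv[2]) to xapian.Query, so that the indexed terms and the search terms are transformed in the same way.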