I'm creating a unique ID for every document. I have about 3500 documents so
far, and seem to have run into a problem while testing. Here's what I did to
"discover" my issue.

The first term I add to a document is of the form (Python):

    import sha, random, time

    uid = sha.new(str(random.random()) + str(time.time())).hexdigest()
    doc.add_term("Q" + uid)

which is basically a random float plus a unix timestamp as a float, run
through SHA-1. I used .add_term() for this, and I can ensure that every key
is unique and is actually being added to the document.

So I came up with some code like this to list all terms:

-- listterms.py --
#!/usr/bin/env python
import xapian

xapdb = xapian.Database("..")

it = xapdb.allterms_begin()
end = xapdb.allterms_end()
while not it == end:
    print it.get_term()
    it.next()
-- listterms.py --

Using a little bash loop, I then requested each document from my server:

    ./misc/listterms.py | grep ^Q | cut -c2- | while read id; do
        curl -s "http://localhost:8080/bin/read?id=$id" | grep -n ^ERROR
    done

-- the /bin/read (hacked down for brevity) --
import sys, cgi
import xapian

sys.stderr = sys.stdout
FieldStore = cgi.FieldStorage()

print "Content-Type: text/html"
print

xapdb = xapian.Database("..")

enquire = xapian.Enquire(xapdb)
stemmer = xapian.Stem("english")

qp = xapian.QueryParser()
# i do have other prefixes but only Q is important to my example
qp.set_prefix("id", "Q")

id = FieldStore.getvalue("id", "")
q = "id:" + id

query = qp.parse_query(q)

enquire.set_query(query)
matches = enquire.get_mset(0, 1)

if matches.get_matches_upper_bound() == 0:
    print "ERROR: Oops, unable to find a message %s" % (id)
    sys.exit(0)

match = iter(matches).next()

print "ID %i %i%% [%s]" % \
    (match[xapian.MSET_DID], match[xapian.MSET_PERCENT],
     match[xapian.MSET_DOCUMENT].get_data())
-- the /bin/read --

I tried calling flush, thinking it could be a Python side effect: after
every 50 documents, after every record, and not at all. I ended up with
pretty much the same results each time. There are about 200 unique IDs
that are not found. How can this be?

TIA

Sig
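P.S. For reference, the flush variants I tried were essentially this
(trimmed down; "records" and add_document() here stand in for my real
indexing loop, and xapdb is the xapian.WritableDatabase the indexer
writes to):

    count = 0
    for record in records:
        add_document(record)   # builds the doc and calls doc.add_term("Q" + uid)
        count += 1
        if count % 50 == 0:    # variant: flush every 50 documents
            xapdb.flush()
        # xapdb.flush()        # variant: flush after every record
    xapdb.flush()              # final flush in every variant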
On Fri, Mar 11, 2005 at 10:39:53AM -0500, Sig Lange wrote:

> -- the /bin/read (hacked down for brevity) --
> import sys, cgi
> import xapian
>
> sys.stderr = sys.stdout
> FieldStore = cgi.FieldStorage()
>
> print "Content-Type: text/html"
> print
>
> xapdb = xapian.Database("..")
>
> enquire = xapian.Enquire(xapdb)
> stemmer = xapian.Stem("english")
>
> qp = xapian.QueryParser()
> # i do have other prefixes but only Q is important to my example
> qp.set_prefix("id", "Q")
>
> id = FieldStore.getvalue("id", "")
> q = "id:" + id
>
> query = qp.parse_query(q)
>
> enquire.set_query(query)
> matches = enquire.get_mset(0, 1)
>
> if matches.get_matches_upper_bound() == 0:
>     print "ERROR: Oops, unable to find a message %s" % (id)
>     sys.exit(0)
>
> match = iter(matches).next()
>
> print "ID %i %i%% [%s]" % \
>     (match[xapian.MSET_DID], match[xapian.MSET_PERCENT],
>      match[xapian.MSET_DOCUMENT].get_data())
> -- the /bin/read --

How about, instead of your QueryParser bit:

----------------------------------------------------------------------
id = FieldStore.getvalue("id", "")
query = xapian.Query(xapian.Query.OP_OR, ["Q%s" % str(id)])
enquire.set_query(query)
----------------------------------------------------------------------

since you really don't need to bother with the query parser for this kind
of work.

I'm pretty sure that if a document never got into the database, its Q-term
won't have done either. So if you're getting the id terms to look for out
of the Xapian database, you should be able to find the documents as well.

If you're worried there's an inconsistency between the different tables,
you could try getting the posting list for each Q-term you find, printing
its docid, and then interrogating the database directly for those
documents; there's a sketch of this below my sig.

J

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                    xapian.org
  james@tartarus.org                                 uncertaintydivision.org
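P.S. Untested, but the posting-list check would be something along these
lines. I'm assuming the bindings expose get_docid() on the posting
iterator the same way they expose get_term() on the term iterator;
get_document() will throw if a docid in a posting list turns out to be
dangling:

----------------------------------------------------------------------
import xapian

xapdb = xapian.Database("..")

it = xapdb.allterms_begin()
end = xapdb.allterms_end()
while not it == end:
    term = it.get_term()
    if term.startswith("Q"):
        # walk the posting list for this Q-term...
        p = xapdb.postlist_begin(term)
        pend = xapdb.postlist_end(term)
        while not p == pend:
            docid = p.get_docid()
            # ...and fetch each document directly by docid
            doc = xapdb.get_document(docid)
            print term, docid, len(doc.get_data())
            p.next()
    it.next()
----------------------------------------------------------------------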