Alex Aminoff
2016-May-09 17:11 UTC
Given a document, how do you get its ID? (perl bindings)
I am writing an indexer that will crawl our web site. Following the recommendation at https://trac.xapian.org/wiki/FAQ/UniqueIds, I'm using the URL as the unique ID for each document.

I see how to get a document from the Xapian database if I know its URL, but I also need to be able to find out the URL from the document. Does this mean I need to store the URL in a value as well as in a term?

In fact, I notice that there is no get_id method on a document object, so even if you use the numeric IDs assigned by Xapian, you cannot get them from a document.

- Alex
Richard Boulton
2016-May-09 20:26 UTC
Given a document, how do you get its ID? (perl bindings)
Document does have a method for getting the numeric document ID: Document::get_docid(). See https://xapian.org/docs/apidoc/html/classXapian_1_1Document.html#a03ff36283ac7d14a1a3b1c9fb26eff61.

However, if you're using a URL as the unique ID, getting Xapian's internal numeric docid isn't of much use. Instead, to recover the ID stored using the method described in the UniqueIds document in the FAQ, you can look for a term beginning with a "Q" in the document.

You could do it with a function something like this (in Python, and untested - I'm not up to date with the perl bindings):

    def get_id_string_from_doc(doc):
        termlist = doc.termlist()
        try:
            # skip_to() advances the iterator to the first term starting
            # with a "Q" (more precisely, the first term sorting at or
            # after "Q") and returns the item it lands on.
            item = termlist.skip_to("Q")
        except StopIteration:
            raise KeyError("No ID in the document")
        term = item.term
        # The term found may sort after every "Q" term, so check the prefix.
        # Terms are byte strings in the Python 3 bindings; b"Q" also works
        # on Python 2.
        if not term.startswith(b"Q"):
            raise KeyError("No ID in the document")
        return term[1:]  # Remove the leading "Q" from the term

On Mon, May 9, 2016 at 6:12 PM Alex Aminoff <aminoff at nber.org> wrote:
> I am writing an indexer that will crawl our web site. Following the
> recommendation here:
>
> https://trac.xapian.org/wiki/FAQ/UniqueIds
>
> I'm using the URL as the unique ID for each document. I see how to get a
> document from the xapian database if I know its URL, but what I need is
> also to be able to find out the URL from the document. Does this mean I
> need to store the URL in a value in addition to as a term? In fact I
> notice that there is no get_id method on a document object, so even if
> you use numeric IDs assigned by Xapian you can not get them from a
> document.
>
> - Alex
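The prefix lookup described in the reply can be tried without Xapian installed. The following is only a sketch under stated assumptions: a sorted Python list stands in for a document's term list, `bisect_left` stands in for `skip_to`, and the function name `id_from_sorted_terms` and the sample terms are hypothetical. It assumes ASCII terms, for which Python's string ordering matches byte ordering.

```python
from bisect import bisect_left


def id_from_sorted_terms(terms, prefix="Q"):
    """Return the ID stored in the first term that starts with `prefix`.

    `terms` is a sorted list of strings standing in for a term list
    (hypothetical stand-in, not a Xapian object).
    """
    # bisect_left finds the first term sorting at or after the prefix,
    # which is the position a skip-to-prefix step would land on.
    pos = bisect_left(terms, prefix)
    # Landing on a term does not mean it has the prefix: it may be the
    # first term sorting after every prefixed term, so check explicitly.
    if pos == len(terms) or not terms[pos].startswith(prefix):
        raise KeyError("No ID in the document")
    return terms[pos][len(prefix):]


if __name__ == "__main__":
    sample = sorted(["Zcrawl", "crawl", "Qhttps://example.org/page.html", "Thtml"])
    print(id_from_sorted_terms(sample))
```

A term list with no "Q" term, such as `["Thtml", "crawl"]`, still yields a landing position (on "Thtml"), which is why the prefix check is needed before stripping the first character.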