thr3ads.net - Xapian discuss - Given a document, how do you get its ID? (perl bindings) [May 2016]

Alex Aminoff

2016-May-09 17:11 UTC

Given a document, how do you get its ID? (perl bindings)

I am writing an indexer that will crawl our web site. Following the 
recommendation here:

https://trac.xapian.org/wiki/FAQ/UniqueIds

I'm using the URL as the unique ID for each document. I see how to get a 
document from the xapian database if I know its URL, but what I need is 
also to be able to find out the URL from the document. Does this mean I 
need to store the URL in a value in addition to as a term? In fact I 
notice that there is no get_id method on a document object, so even if 
you use numeric IDs assigned by Xapian you can not get them from a document.

  - Alex

Richard Boulton

2016-May-09 20:26 UTC

Given a document, how do you get its ID? (perl bindings)

Document does have a method for getting the numeric document ID:
Document::get_docid(). See
https://xapian.org/docs/apidoc/html/classXapian_1_1Document.html#a03ff36283ac7d14a1a3b1c9fb26eff61.
However, if you're using a URL as the unique ID, getting Xapian's internal
numeric docid isn't of much use.

Instead, to find the ID string using the method described in the UniqueIds
entry in the FAQ, you can look for a term beginning with a "Q" in the
document. You could do it with a function something like this (in Python,
and untested - I'm not up to date with the Perl bindings):

def get_id_string_from_doc(doc):
    termlist = doc.termlist()
    try:
        # skip_to() advances the iterator to the first term sorting at or
        # after "Q" and returns that item; calling next() afterwards would
        # move on to the following term.
        item = termlist.skip_to("Q")
    except StopIteration:
        raise KeyError("No ID in the document")
    term = item.term
    if isinstance(term, bytes):  # the Python 3 bindings return terms as bytes
        term = term.decode("utf-8")
    # The first term at or after "Q" may belong to a later prefix, so check it.
    if not term.startswith("Q"):
        raise KeyError("No ID in the document")
    return term[1:]  # Remove the leading "Q" from the term
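
For anyone who wants to check the lookup logic without a Xapian install, here is a minimal sketch of the same idea on a plain sorted list of strings. The list stands in for a document's term list (which Xapian iterates in sorted order), and bisect_left plays the role of skip_to("Q"). The function name, the prefix parameter, and the example URL are all made up for illustration; none of this is part of the Xapian API.

```python
from bisect import bisect_left


def get_id_string_from_terms(sorted_terms, prefix="Q"):
    """Return the unique ID stored as a prefixed term, without the prefix.

    sorted_terms is a hypothetical stand-in for a document's term list:
    a plain Python list of strings, already sorted.
    """
    # Rough equivalent of skip_to(prefix): index of the first term that
    # sorts at or after the prefix.
    pos = bisect_left(sorted_terms, prefix)
    # Either we ran off the end, or we landed on a term with a later
    # prefix (for example "T..."), so there is no ID term.
    if pos == len(sorted_terms) or not sorted_terms[pos].startswith(prefix):
        raise KeyError("No ID in the document")
    return sorted_terms[pos][len(prefix):]


# Made-up term list: one "Q" ID term plus a few ordinary terms.
terms = sorted(["Qhttps://example.org/about", "Ttitle", "about", "example"])
print(get_id_string_from_terms(terms))
```

The startswith check matters because jumping to "Q" only guarantees a term sorting at or after "Q"; a document with no ID term but with, say, a "T"-prefixed term would otherwise return a bogus ID. Python compares str values by code point while Xapian compares bytes, which gives the same order for ASCII terms like the ones here.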

