thr3ads.net - Xapian discuss - deleting terms with a given prefix [Aug 2025]

If this information is useful, please help other people find it:
Share via:

David Bremner

2025-Aug-28 11:09 UTC

deleting terms with a given prefix

I'm looking at some benchmarks where the bottleneck is in the following
function, and wondering if there is a better way to delete all the terms
with a given prefix from a document?  According to perf, 41% of the time
is in this function, almost all (40%) in
Xapian::Document::Internal::remove_term.

At a higher level, this particular benchmark is replacing all of the
tags for a bunch of messages, and the current strategy is to delete all
existing ones and then add the new ones.

void
_notmuch_message_remove_terms (notmuch_message_t *message, const char *prefix)
{
    Xapian::TermIterator i;
    size_t prefix_len = 0;

    prefix_len = strlen (prefix);

    while (1) {
	i = message->doc.termlist_begin ();
	i.skip_to (prefix);

	/* Terminate loop when no terms remain with desired prefix. */
	if (i == message->doc.termlist_end () ||
	    strncmp ((*i).c_str (), prefix, prefix_len))
	    break;

	try {
	    message->doc.remove_term ((*i));
	    message->modified = true;
	} catch (const Xapian::InvalidArgumentError) {
	    /* Ignore failure to remove non-existent term. */
	}
    }

    _notmuch_message_invalidate_metadata (message, "property");
}

Olly Betts

2025-Aug-28 20:34 UTC

head link

deleting terms with a given prefix

On Thu, Aug 28, 2025 at 08:09:07AM -0300, David Bremner
wrote:> void
> _notmuch_message_remove_terms (notmuch_message_t *message, const char
*prefix)
> {
>     Xapian::TermIterator i;
>     size_t prefix_len = 0;
> 
>     prefix_len = strlen (prefix);
> 
>     while (1) {
> 	i = message->doc.termlist_begin ();
> 	i.skip_to (prefix);
> 
> 	/* Terminate loop when no terms remain with desired prefix. */
> 	if (i == message->doc.termlist_end () ||
> 	    strncmp ((*i).c_str (), prefix, prefix_len))
> 	    break;
> 
> 	try {
> 	    message->doc.remove_term ((*i));
> 	    message->modified = true;
> 	} catch (const Xapian::InvalidArgumentError) {
> 	    /* Ignore failure to remove non-existent term. */
> 	}
>     }
You're creating a new iterator and skipping it for every term removed
whereas you should be able to just increment the iterator to get the
next term you need to consider.

The only wrinkle (which I'm guessing may be why the code is as above) is
that you would need to increment *before* removing the term so as not
to invalidate the iterator, so something like:

    size_t prefix_len = strlen (prefix);

    Xapian::TermIterator i = message->doc.termlist_begin ();
    i.skip_to (prefix);

    while (i != message->doc.termlist_end ()) {
        const string& term = *i;
        /* Terminate loop when no terms remain with desired prefix. */
        if (strncmp (term.c_str (), prefix, prefix_len))
            break;
        ++i;
        try {
            message->doc.remove_term (term);
            message->modified = true;
        } catch (const Xapian::InvalidArgumentError) {
            /* Ignore failure to remove non-existent term. */
        }
     }

(It shouldn't be possible for the term not to exist here BTW - if
that was added due to actually encountering that situation that's
definitely a bug that would be useful to report...)

That should help but there's unfortunately still a significant
inefficiency here though, as remove_term() takes a string which it then
has to look up in the document terms to refind the position which we
actually already have in the iterator, and that's O(log(n)) (whereas
using the position from the iterator would be O(0) !)

A variant of remove_term() which took an iterator could avoid this
(and automatically increment the iterator for you).

We could also/instead provide a method to directly remove all terms
with a specified prefix (or maybe between two specified terms) from the
document, which is probably better for this specific case as I think it
can be done even more efficiently.

I'm trying to focus on getting a new release series started, so please
can you open a ticket for this so it doesn't get forgotten?

Cheers,
    Olly

Xapian discuss - Aug 2025 - deleting terms with a given prefix

deleting terms with a given prefix

deleting terms with a given prefix