Dear Xapian,

I am writing a little tool for indexing/searching email messages in maildirs. For indexing the message bodies, Xapian looks like an interesting option, but I have some newbie questions. What I would *like* to do is being able to add the email bodies to the Xapian database, and then be able to search for some words.

I am looking at the Quickstart (http://xapian.org/docs/quickstart.html), and it seems I have to create a Xapian::Document instance, then (1) add document data with set_data and (2) add some search terms with add_posting.

I could use the message path as the document data, but what about the search terms? Should I split my body text into words, and add every single one of them as a search term? That does not sound very attractive... It seems that 'recoll' (which uses Xapian) is doing that, though.

Or is there some easier way to simply provide blobs of text, and being able to search for them later? I have the feeling I am misunderstanding something....

Hope someone can give me some hints. Thanks in advance!

Dirk.

-- 
-----------------------------------------------
Dirk-Jan C. Binnema <djcb at djcbsoftware.nl>
blog: http://www.djcbsoftware.nl/ChangeLog (NL)
      http://djcbflux.blogspot.com (EN)
-----------------------------------------------
djcb wrote:
> Or is there some easier way to simply provide blobs of text, and being
> able to search for them later?

You want XapianTermGenerator, which takes a blob of text and adds all the words in it to Xapian, e.g. (snippet of the written-in-PHP http://sandwich.ukcod.org.uk/~matthew/subtitles/?source=1#indexer ):

    $indexer = new XapianTermGenerator();
    $indexer->set_flags(128);
    $indexer->set_database($db); # For spelling

    [... then for each document ... ]

    $doc = new XapianDocument();
    $indexer->set_document($doc);
    $doc->set_data( [...] );
    $doc->add_term( [...] );
    $doc->add_value( [...] );
    $indexer->index_text($text);
    $db->add_document($doc);

ATB,
Matthew
djcb wrote:
> I am writing a little tool for indexing/searching email messages in
> maildirs.
> For indexing the message bodies, Xapian looks like an interesting
> option, but I have some newbie questions. What I would *like* to do is
> being able to add the email bodies to the Xapian database, and then be
> able to search for some words.

I'd also recommend looking at the GMANE indexer (it is targeted at mboxes, though). The Debian list archives use a derivative of that (together with the omega search engine). The Debian code should also be available somewhere [1].

Kind regards

T.

1. http://people.debian.org/~tviehmann/list-search/
   but if that's at all interesting, I'll put up the current stuff, too.

-- 
Thomas Viehmann, http://thomas.viehmann.net/
On Sat, 30 Aug 2008, djcb wrote:
> Dear Xapian,
>
> I am writing a little tool for indexing/searching email messages in
> maildirs.
<snip (16 lines)>
> Or is there some easier way to simply provide blobs of text, and being
> able to search for them later? I have the feeling I am misunderstanding
> something....

Thanks all for the quick replies!

Matthew Somerville <matthew at mysociety.org> wrote:
> You want XapianTermGenerator, which takes a blob of text and adds all
> the words in it to Xapian. e.g. (snippet of the written-in-PHP
> http://sandwich.ukcod.org.uk/~matthew/subtitles/?source=1#indexer ):

Ah, that did the trick, great! I now integrated Xapian with my code, and it seems to work nicely. I'll take a look at some of the other indexers that were mentioned.

I noticed that the stemming is language-specific (understandably); is there some recommended way to guess the language of a blob of text? For me, speed is more important than 100% accuracy (which would be hard anyway, and consider multi-language text etc...)

BTW, my little maildir indexer/searcher 'mu':
http://www.djcbsoftware.nl/code/mu/
Version 0.1 does not have Xapian-search yet, but 0.2 will :-)

Best wishes,
Dirk.
Rusty Conover
2008-Sep-01 09:57 UTC
[Xapian-discuss] using xapian for indexing mails [SOLVED]
> I noticed that the stemming is language-specific (understandably); is
> there some recommended way to guess the language of a blob of text?
> For me, speed is more important than 100% accuracy (which would be hard
> anyway, and consider multi-language text etc...)

n-gram analysis works pretty well. In a nutshell, it works like this:

Step 1. Training: Using sample texts in various languages, produce n-grams and keep the most popular N n-grams for each language, where N is sufficiently large.

Step 2. Analysis: Compare the n-grams from the unknown-language text against the n-gram samples from each language and count the matches. The language with the most matches is probably the language of that text.

See:
http://www.rubyinside.com/whatlanguage-ruby-language-detection-library-1085.html
http://code.activestate.com/recipes/326576/

Regards,
Rusty

-- 
Rusty Conover
InfoGears Inc. / www.GearBuyer.com / www.FootwearBuyer.com
http://www.infogears.com
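
The two steps described above can be sketched roughly as follows. This is only a minimal illustration, not the code of either linked library: the function names, the use of character trigrams, the cutoff of 300 n-grams, and the tiny English/Dutch sample texts are all assumptions made for the example. Real training texts would need to be far larger.

```python
from collections import Counter


def ngrams(text, n=3):
    """Return the character n-grams of text (lowercased, whitespace-normalised, space-padded)."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


def train(samples, top=300, n=3):
    """Step 1: for each language, keep the `top` most frequent n-grams of its sample text."""
    return {
        lang: {gram for gram, _ in Counter(ngrams(text, n)).most_common(top)}
        for lang, text in samples.items()
    }


def guess_language(text, profiles, n=3):
    """Step 2: return the language whose profile matches the most n-grams of text."""
    grams = ngrams(text, n)
    scores = {
        lang: sum(1 for gram in grams if gram in profile)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training data; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "runs away from the house with the other animals",
    "nl": "de snelle bruine vos springt over de luie hond en dan rent de "
          "hond weg van het huis met de andere dieren",
}
PROFILES = train(SAMPLES)

if __name__ == "__main__":
    print(guess_language("the dog and the fox", PROFILES))
```

Training happens once, and guessing is a single pass over the text with set lookups, which fits the "speed over accuracy" requirement. The returned code could then be used to choose which stemmer to apply; which language names or codes the stemmer accepts should be checked against the Xapian documentation for the version in use.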
Olly Betts
2008-Sep-02 03:37 UTC
On Mon, Sep 01, 2008 at 03:57:29AM -0600, Rusty Conover wrote:
> > I noticed that the stemming is language-specific (understandably); is
> > there some recommended way to guess the language of a blob of text?
>
> n-gram analysis works pretty well..
[...]
> See:
> http://www.rubyinside.com/whatlanguage-ruby-language-detection-library-1085.html
> http://code.activestate.com/recipes/326576/

Also:
http://odur.let.rug.nl/~vannoord/TextCat/

Cheers,
Olly
Peter Karman
2008-Sep-02 21:29 UTC
djcb wrote on 8/31/08 4:36 AM:
> On Sat, 30 Aug 2008, djcb wrote:
>
>> Dear Xapian,
>>
>> I am writing a little tool for indexing/searching email messages in
>> maildirs.
>

fwiw, you may want to look at
http://search.cpan.org/~karman/SWISH-Prog-0.20/lib/SWISH/Prog/Aggregator/Mail.pm
which will eventually have a Xapian backend as well.

-- 
Peter Karman . http://peknet.com/ . peter at peknet.com