Hello,

I have a corpus that I have indexed with xapian/xappy and I would now like to generate a corpus-specific list of stopwords. (This is a technical corpus, so a typical stopword list wouldn't be helpful.)

My first thought was to ask the xapian database for a list of terms along with their frequencies. My intuition is that I could put together a list of stopwords by examining the head and tail of that list. This would let me exclude both terms that are too common and terms that are unique but non-informative.

Is there a good way to get this information?

Thanks,
Van
Richard Boulton
2010-Oct-08 15:21 UTC
[Xapian-discuss] Get a list of all terms in an indexed corpus
On 8 October 2010 15:38, VanL <van.lindberg at gmail.com> wrote:
> Hello,
>
> I have a corpus that I have indexed with xapian/xappy and I would now
> like to generate a corpus-specific list of stopwords. (This is a
> technical corpus, so a typical stopword list wouldn't be helpful.)

Xapian doesn't store a list of terms sorted by frequency, so you'll need to do that sorting yourself outside xapian.

Using xappy, you can call SearchConnection.iter_terms_for_field(fieldname) to get an iterator over the terms generated from a given field. However, this returns the terms in lexicographic order and doesn't return their frequencies.

Using xapian, you can call xapian.Database.allterms() to get an iterator over all the terms. This iterator returns xapian.TermListItem objects, which have a .termfreq property containing the number of documents the term occurs in (and a .term property containing the term string itself). You'll still need to sort by frequency yourself, but this should give you what you need.

Hope this helps,

-- 
Richard
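
For illustration, the allterms() approach from the reply could be sketched roughly as follows. This is only a sketch under assumptions: the function names and the max_df_ratio and min_df thresholds are hypothetical and would need tuning for a real corpus, and terms in a xappy-built index may carry field prefixes, which this does not strip. The only things it relies on from xapian are Database.allterms(), Database.get_doccount(), and the .term and .termfreq properties mentioned above.

```python
def stopword_candidates(term_items, doc_count, max_df_ratio=0.5, min_df=1):
    """Split terms into "too common" and "too rare" stopword candidates.

    term_items: any iterable of objects with .term and .termfreq
                (for example, what xapian.Database.allterms() yields).
    doc_count:  total number of documents in the index.
    max_df_ratio, min_df: hypothetical thresholds, not xapian settings.
    """
    common, rare = [], []
    for item in term_items:
        term, freq = item.term, item.termfreq
        # Some versions of the bindings return terms as bytes.
        if isinstance(term, bytes):
            term = term.decode("utf-8", "replace")
        if freq >= max_df_ratio * doc_count:
            common.append((term, freq))   # head of the frequency list
        elif freq <= min_df:
            rare.append((term, freq))     # tail of the frequency list
    # Xapian yields terms in lexicographic order, so sort by frequency here.
    common.sort(key=lambda pair: (-pair[1], pair[0]))
    rare.sort(key=lambda pair: (pair[1], pair[0]))
    return common, rare


def candidates_from_database(path, **thresholds):
    """Run stopword_candidates() over every term in the database at `path`."""
    import xapian  # requires the xapian Python bindings to be installed

    db = xapian.Database(path)
    return stopword_candidates(db.allterms(), db.get_doccount(), **thresholds)
```

Something like `common, rare = candidates_from_database("/path/to/index")` would then give two lists to review by hand. Note that .termfreq counts documents containing the term rather than total occurrences, so the cutoffs here are document-frequency cutoffs.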