thr3ads.net - similar to: "Get a list of all terms in an indexed corpus"

Displaying 20 results from an estimated 400 matches similar to: "Get a list of all terms in an indexed corpus"

Where do I stick the PARTIAL flag in xappy?

2011 Jun 10

I want to be able to run some searches with FLAG_PARTIAL and some without. Most searches should run without it, but the autocomplete widget needs PARTIAL. I'm using xappy and I can't find where to pass the flag when I build up the query. The docs talk of setting up the database with or without FLAG_PARTIAL but that's probably not what I want. Peter

Something to think about

2007 Oct 10

I'm planning to add multiple-database support for searches to my "Xappy" python wrapper (more on this wrapper later, but for now, see http://code.google.com/p/xappy for details). This is reasonably straightforward, because Xapian supports this nicely: except that "Xappy" generates a "fieldname->prefix" mapping automatically. The prefix which corresponds

DatabaseCorruptError

2011 Jan 17

Hi there, My web app uses Xapian via the PHP bindings. I'm getting this error thrown occasionally when attempting to instantiate a XapianDatabase object for searching. DatabaseCorruptError: Expected block 107 to be level 1, not 0 Here's the line that invokes it: $database = new XapianDatabase(PROJROOT.'/data/xapian/posts'); And my version is xapian-core 1.2.3 with matchspy.

Xapian configuration

2018 Jul 21

Hello, I want a tutorial on how to configure Xapian on a MoinMoin wiki. I installed my MoinMoin wiki in a virtual env, but I'm having a lot of problems configuring Xapian. Thanks

KMeans Clusterer - Going forward

2017 Jun 14

Hello, I have finished moving the API to PIMPL classes and will fix issues within the current code over the next week, based on reviews from mentors. The next step going forward is to start with forming document vectors that are reduced and more useful. This majorly helps in saving run time (since time for distance calculation depends on number of terms). Getting the useful terms within a

how can i use stopwords?

2008 Mar 12

Hi, I do not understand the stopword function... I've set the termgenerator like this: $self->{'Stemmer'} = new Search::Xapian::Stem(german2); $self->{'Stopper'} = new Search::Xapian::SimpleStopper(); $self->{'TermGenerator'} = new Search::Xapian::TermGenerator; $self->{'TermGenerator'}->set_stemmer( $self->{'Stemmer'} );

[python indexer] add meta informations

2009 Nov 11

Hello, I'm trying to index some blog content through the python bindings. I'd like to know how to add some information (url, title, date, and so on) so that I can reach it through a xapian.Enquire object. I believe it's something to be set in xapian.TermGenerator(), but I can't find which function. I'm expecting something like: xtermgen.add_meta('url',

range query for terms

2015 Mar 14

First, thank you, Xapian! Then I'd like to ask if it is possible to do a range query on terms (like the range query on values), or if only a wildcard (right truncation) match is available. The case is searching IP addresses between "10.10.0.0" and "10.10.255.255". The user wants: 1. query "10.10.10.10" < ip < "10.10.10.12" gives "10.10.10.11" 2. query
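The ordering problem behind this question can be sketched in plain Python. Dotted-quad strings do not sort in numeric order ("10.10.10.9" sorts after "10.10.10.10" as a string), so any range comparison on string keys needs a fixed-width encoding first. The sketch below does not use the Xapian API; the helper names `ip_sort_key` and `in_open_range` and the candidate list are hypothetical, chosen only to illustrate the encoding.

```python
import ipaddress


def ip_sort_key(ip: str) -> str:
    """Encode an IPv4 address as a zero-padded 10-digit decimal string.

    Hypothetical helper: with a fixed width, lexicographic order of the
    keys matches the numeric order of the addresses.
    """
    return format(int(ipaddress.IPv4Address(ip)), "010d")


def in_open_range(ip: str, low: str, high: str) -> bool:
    """Return True if low < ip < high, comparing the encoded keys."""
    return ip_sort_key(low) < ip_sort_key(ip) < ip_sort_key(high)


# Made-up candidate addresses, for illustration only.
candidates = [
    "10.10.10.9",
    "10.10.10.10",
    "10.10.10.11",
    "10.10.10.12",
    "10.10.2.1",
]

matches = [
    ip for ip in candidates
    if in_open_range(ip, "10.10.10.10", "10.10.10.12")
]
print(matches)
```

Keys encoded this way can be compared as ordinary strings, which is the property a string-based range comparison depends on; whether to store them as terms or as values is a separate design decision that this sketch does not address.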

Lucene 3.6.2 backend for xapian (#25)

2013 Oct 30

[Replying to xapian-devel, as I think a wider audience would be useful] On Mon, Oct 21, 2013 at 11:24:51PM +0800, jiangwen jiang wrote: > yes, it's less efficient. Lucene database has multiple segments, each segment can treat as a independent database. The same term may exists in >= 1 segments. Sorry for taking a while to respond - I've been both busy and mulling this

Help with cleaning a corpus

2011 Apr 18

Hi! I created a corpus and I started to clean it with this piece of code: txt <-tm_map(txt,removeWords, stopwords("spanish")) txt <-tm_map(txt,stripWhitespace) txt <-tm_map(txt,tolower) txt <-tm_map(txt,removeNumbers) txt <-tm_map(txt,removePunctuation) But something happened: some of the documents in the corpus became empty, which is a problem when I try to make a

Stopword addition and stemming

2010 Nov 15

Hi, Two questions which I'm unsure about: Stemming: I've turned on stemming, etc, but how can I confirm that it's being used in searches? What should I look/search for? Stopwords: I'm trying out xapian on a regional dataset (searching data from a *.co.us TLD, eg) . I've noticed that searching for [bob co.us] results in *very* slow search times (tens of seconds), since it

Fetching document content by Q term in Python

2007 Feb 09

Hello, I'd like to be able to retrieve the index's stored copy of the document text and tried the following: terms = self.db.allterms() terms.skip_to('Q' + uri.encode('utf-8')) term = terms.next() doc = self.db.get_document(term[1]) print doc.get_data() I just wildly guessed that [1] was the docid, but of course it isn't. So the question is, how do I

what is the fastest way to fetch results which are sorted by timestamp ?

2011 Aug 09

What is the fastest way to fetch results which are sorted by timestamp? I want to use Xapian as my search engine, using add_boolean_term(something) and add_value(0,sortable_serialise(get_timestamp())) on each doc. I search through enquire.set_weighting_scheme(xapian.BoolWeight()) and enquire.set_sort_by_value(0,True) to ensure that the results are sorted by the timestamp. This method is OK, but

How can this code be improved?

2009 Nov 12

I am running the following code on a MacBook Pro 17" Unibody early 2009 with 8GB RAM, OS X 10.5.8, R 2.10.0 Patch from Nov. 2, 2009, in 64-bit mode. freq.stopwords <- numeric(0) freq.nonstopwords <- numeric(0) token.tables <- list(0) i.ss <- c(0) cat("Beginning at ", date(), ".\n") for (i.d in 1:length(tokens)) { tt <- list(0) for (i.s in

Per-namespace proxying?

2009 Jan 02

Hello, Searching in the archives, I saw the following posting from Timo Sirainen: [Dovecot] Roadmap to future (06 Dec 2007): [...] > Proxying > -------- > > - These could be implemented to v1.2. - Log in normally (no proxying) > if destination IP is the server itself. > > - Support for per-namespace proxying: > > namespace public { > prefix = Public/ >

Rsync when using --whole-file

2012 Dec 20

I have a question about what happens at the code level when I use --whole-file. I know that it turns off the rolling checksum. I also understand that it only checks the file's mtime and size to identify whether there should be some transfer. Two questions: 1) Could anyone give me a pointer to the correct file so that I can read what happens when --whole-file is used? 2) When using

compile xapian-extras

2010 Oct 21

Hi, xapian-extras supports image similarity (http://xapian.wordpress.com/2009/03/11/xappy-now-supports-image-similarity-searching/). I compiled xapian-extras and xapian-extras-bindings with python. import xapian import xapian.imgseek doc = xapian.Document() imgsig = xapian.imgseek.ImgSig.register_Image(JPEG_PATH) imgterms = xapian.imgseek.ImgTerms('A', 300)

Phrase search problem

2011 Jul 20

Hi, I'm experiencing problems when doing phrase searches with adjacent repeated terms. Example: if I search for 'curtain curtain' and there are documents that match the query, they aren't returned. But if I search for 'curtain nice curtain' and there are documents that match this query, it works OK. Attached is a python program that shows the problem. I tried

Help-Multi class classification for large datasets

2017 Jul 18

Hi all, We are working on multi-class classification. Currently the Ranger package in R can handle up to 1.1 million records. Training time on 128 GB RAM is 12 days, which is not practically feasible. In future we will have a dataset of 10 million records, so we are searching for a package or framework which can handle 10 million records with at least 12000

bug when assigning new analyzer?

2007 May 09

require 'rubygems' require 'ferret' include Ferret PATH = '/tmp/ferret_stopwords_test' index = Index::IndexWriter.new(:path => PATH, :create => true) index.analyzer = Analysis::StandardAnalyzer.new([]) index << {:title => 'a few good men', :language => 'en'} index.analyzer =
