Displaying 20 results from an estimated 3000 matches similar to: "Get term from document by position"
2015 Jul 23
1
Get term from document by position
Hello. Is there any FAST way to get a term from a Xapian document by its position, something like
std::string term = Xapian::Document::GetTermByPosition(int position) ?
Below I have described the task I am trying to solve, in case somebody is interested.
============================================================================
When displaying search results, I would like to
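
There is no Xapian::Document::GetTermByPosition() in the Xapian API; the call above is what the poster wishes existed. One workaround (a minimal sketch, not necessarily fast for huge documents) is to walk the document's termlist once and invert each term's position list into a position-to-term map:

#include <xapian.h>

#include <map>
#include <string>

// Build a position -> term map for one document by walking its termlist and
// inverting each term's position list.  The unprefixed terms are the ones
// that carry positional data.
std::map<Xapian::termpos, std::string>
position_to_term(const Xapian::Document &doc)
{
    std::map<Xapian::termpos, std::string> by_pos;
    for (Xapian::TermIterator t = doc.termlist_begin();
         t != doc.termlist_end(); ++t) {
        for (Xapian::PositionIterator p = t.positionlist_begin();
             p != t.positionlist_end(); ++p) {
            by_pos[*p] = *t;
        }
    }
    return by_pos;
}

Building the map costs one pass over the document's postings, so it is worth caching if several positions are needed from the same document.
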
2007 Mar 04
5
Getting non-stemmed terms from IndexReader
I need to get a set of terms being indexed using Ferret. I used
IndexReader.terms and it returns a list of TermEnum nicely. The only
problem is that my analyzer includes a stemming filter.
So now, the terms I'm getting back are all stemmed. Is there any way to
get the original unstemmed terms back from the index? Thanks.
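
Ferret/Ruby specifics aside, the underlying issue is that a stemmed index cannot reproduce the original words. A library-agnostic C++ sketch of the usual workaround, recording a stem-to-surface-form map while indexing (the names here are illustrative, not any Ferret API):

#include <map>
#include <set>
#include <string>

// Side index: remember which surface forms produced each stem while indexing,
// since the stemmed index alone cannot give them back.
std::map<std::string, std::set<std::string> > surface_forms;

void remember(const std::string &stem, const std::string &original)
{
    surface_forms[stem].insert(original);
}

// e.g. remember("run", "running"); surface_forms["run"] later lists the
// original words that were folded into that stem.
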
2015 Jul 26
1
Get term from document by position
> Can you file a bug with some example outputs that are unrelated to the search string?
Here is the example (see attachment).
This example does the following:
1) First, it indexes the text from the "text.txt" file (see attachment); this is actually the text of the book "Abbas, Lichtman. Basic Immunology".
2) Next, it searches for the "extracellular
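
A minimal sketch of the two steps described above, using the Xapian C++ API (the "en" stemmer and database path are assumptions; the query word is taken from the truncated sentence above):

#include <xapian.h>

#include <fstream>
#include <iostream>
#include <sstream>

int main()
{
    // 1) Index the contents of text.txt into a fresh database.
    Xapian::WritableDatabase db("example.db", Xapian::DB_CREATE_OR_OPEN);
    std::ifstream in("text.txt");
    std::stringstream buf;
    buf << in.rdbuf();

    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    Xapian::Document doc;
    tg.set_document(doc);
    tg.index_text(buf.str());
    db.add_document(doc);
    db.commit();

    // 2) Search it, with query-time stemming matching the indexer.
    Xapian::QueryParser qp;
    qp.set_stemmer(Xapian::Stem("en"));
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
    Xapian::Enquire enquire(db);
    enquire.set_query(qp.parse_query("extracellular"));
    Xapian::MSet matches = enquire.get_mset(0, 10);
    std::cout << matches.get_matches_estimated() << " matches\n";
    return 0;
}
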
2018 Sep 14
3
How to make database build threaded?
On 14/09/2018 at 09:30, Jean-Francois Dockes wrote:
> Hi,
>
> You may be interested by how Recoll does it:
>
> https://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html
>
> A few things in the document are slightly obsolete (esp. the last
> paragraph: recollindex now does use vfork()), but it's overall quite close
> to how the current indexer works.
2015 Jul 26
1
Get term from document by position
> Here is the example (see attachment).
>
> Attachments get stripped out by the mailing list, so I've made a private gist of the two files here: <https://gist.github.com/jaylett/ce8455b37e2b84422346>.
>
> Actually, when I run it I get 0 matches, which would explain why you're just getting the start of the document. However, if I adjust things (match the stemming strategy for TermGenerator to
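
The fix hinted at above is to keep the TermGenerator and QueryParser settings in step. A minimal sketch, assuming an English stemmer and the default STEM_SOME strategy on both sides:

#include <xapian.h>

// Give the indexer and the query parser the same stemmer and the same
// stemming strategy, so the Z-prefixed terms generated at query time
// actually exist in the database.
void configure(Xapian::TermGenerator &tg, Xapian::QueryParser &qp)
{
    Xapian::Stem stem("en");
    tg.set_stemmer(stem);
    tg.set_stemming_strategy(Xapian::TermGenerator::STEM_SOME);
    qp.set_stemmer(stem);
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
}
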
2011 May 26
0
Desktopsearch "Recoll" for CentOS 5.5 64bit
Hi Folks,
Is there an RPM package of the desktop search tool "Recoll" for CentOS 5.5 64-bit?
If yes, where is it?
I've tried fedora-packages from
http://www.lesbonscomptes.com/recoll/download.html#rpms
but got many dependency errors
Thx
Timothy
2017 Jun 14
2
KMeans Clusterer - Going forward
Hello,
I have finished moving the API to PIMPL classes and will fix issues within
the current code over the next week, based on reviews from mentors.
The next step going forward is to start forming document vectors that are
reduced and more useful. This helps significantly in saving run time (since
the time for distance calculation depends on the number of terms). Getting the
useful terms within a
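
One simple way to form such reduced vectors (a sketch only, not necessarily what the clustering branch does) is to read term and wdf pairs from the document's termlist and drop terms below a wdf threshold; the threshold of 2 here is arbitrary:

#include <xapian.h>

#include <map>
#include <string>

// Term -> wdf vector for one document, keeping only terms whose
// within-document frequency reaches min_wdf, to shrink the vectors used
// in distance calculations.
std::map<std::string, Xapian::termcount>
document_vector(const Xapian::Document &doc, Xapian::termcount min_wdf = 2)
{
    std::map<std::string, Xapian::termcount> vec;
    for (Xapian::TermIterator t = doc.termlist_begin();
         t != doc.termlist_end(); ++t) {
        if (t.get_wdf() >= min_wdf)
            vec[*t] = t.get_wdf();
    }
    return vec;
}
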
2018 Sep 14
0
How to make database build threaded?
Franco Martelli writes:
> Hi everybody,
> I'm the author of a small C++11 program called XDGSearch. The source
> code is hosted on GitHub; for a quick overview you can visit this link:
> https://github.com/frank67/XDGSearch/blob/master/README.md
> I'm writing to the mailing list because I'd like to split the database
> build process into multiple threads. Is it
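
Xapian's WritableDatabase is not meant to be written to from several threads at once, so the usual shape (roughly what the Recoll write-up linked in the reply above describes) is to parallelise text extraction and term generation in worker threads and funnel finished documents to a single writer thread. A rough C++11 sketch, with a trivial file-reading stand-in for real text extraction:

#include <xapian.h>

#include <condition_variable>
#include <fstream>
#include <functional>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Stand-in text extraction: just read the file as plain text.
std::string extract_text(const std::string &path)
{
    std::ifstream in(path);
    std::stringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

std::mutex mtx;
std::condition_variable cv;
std::queue<Xapian::Document> ready;
bool done = false;

// Workers build Xapian::Document objects in parallel and queue them.
void worker(const std::vector<std::string> &paths)
{
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    for (const std::string &path : paths) {
        Xapian::Document doc;
        tg.set_document(doc);
        tg.index_text(extract_text(path));
        std::lock_guard<std::mutex> lk(mtx);
        ready.push(doc);
        cv.notify_one();
    }
}

// A single thread owns the WritableDatabase and performs all writes.
void writer(Xapian::WritableDatabase &db)
{
    std::unique_lock<std::mutex> lk(mtx);
    for (;;) {
        cv.wait(lk, [] { return done || !ready.empty(); });
        while (!ready.empty()) {
            Xapian::Document doc = ready.front();
            ready.pop();
            lk.unlock();
            db.add_document(doc);
            lk.lock();
        }
        if (done) break;
    }
    db.commit();
}

int main()
{
    Xapian::WritableDatabase db("index.db", Xapian::DB_CREATE_OR_OPEN);
    std::vector<std::string> half1 = {"a.txt"}, half2 = {"b.txt"}; // split the work
    std::thread t1(worker, half1), t2(worker, half2);
    std::thread w(writer, std::ref(db));
    t1.join();
    t2.join();
    {
        std::lock_guard<std::mutex> lk(mtx);
        done = true;
    }
    cv.notify_one();
    w.join();
    return 0;
}
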
2006 Aug 11
3
Proposed changes to omindex
Proposed changes to omindex
Currently Available Items
=========================
1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
indexing.
2) Add the document's last modified time to the value table (ID 0). This would allow incremental
indexing based on the timestamp and also sorting by date in omega (SORT=0)
a. Currently I store the timestamp
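
A sketch of what items 1) and 2) amount to in code; md5_hex() is a hypothetical helper, and the "Q" prefix and value slot 0 follow the proposal above rather than any shipped omindex behaviour:

#include <xapian.h>

#include <ctime>
#include <string>

// Hypothetical helper: hex MD5 digest of a string (not part of Xapian).
std::string md5_hex(const std::string &s);

// "Q"-prefixed unique-ID term built from the file name, plus the
// last-modified time in value slot 0 for date sorting / incremental checks.
void index_file(Xapian::WritableDatabase &db,
                const std::string &filename,
                const std::string &text,
                time_t mtime)
{
    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    tg.set_document(doc);
    tg.index_text(text);

    std::string idterm = "Q" + md5_hex(filename);
    doc.add_boolean_term(idterm);
    doc.add_value(0, Xapian::sortable_serialise(double(mtime)));
    doc.set_data(filename);

    // Replaces any existing document carrying the same unique-ID term,
    // which is what makes incremental reindexing work.
    db.replace_document(idterm, doc);
}
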
2023 Mar 26
1
manual flushing thresholds for deletes?
On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote:
> Realizing I had documents of hugely varying sizes (0.5KB..20MB)
> and little RAM, I instead tracked the number of raw bytes in the
> text being indexed and flushed whenever I'd seen a configurable
> byte count. Not the most scientific way, but it seems to work
> well enough on low-end systems.
>
> Now, I'm
2008 Mar 27
2
Proper noun stemming
Hi All
I was wondering if anyone had a solution for the following problem.
I use QueryParser to stem my documents before adding them to a
database. During the stemming process I would like to find a way of
keeping proper nouns that span two or more words together as a phrase.
For example "New York" or "Gordon Brown" or "Prime Minister" get spilt
up. I see
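
One workaround is to index the text normally and additionally add each detected multi-word proper noun as a single extra term, so it can never be split. A sketch, where the phrase list and the "XP" prefix are purely illustrative (detecting the phrases is a separate problem):

#include <xapian.h>

#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Index the text normally, then add each known multi-word proper noun as a
// single extra term so it is never split by the tokeniser or stemmer.
void index_with_phrases(Xapian::WritableDatabase &db,
                        const std::string &text,
                        const std::vector<std::string> &phrases)
{
    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    tg.set_document(doc);
    tg.index_text(text);

    for (std::string phrase : phrases) {      // e.g. "New York"
        std::transform(phrase.begin(), phrase.end(), phrase.begin(), ::tolower);
        std::replace(phrase.begin(), phrase.end(), ' ', '_');
        doc.add_term("XP" + phrase);          // one unsplit term: XPnew_york
    }
    db.add_document(doc);
}

At query time the same normalisation has to be applied to build the matching XP term for a quoted phrase.
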
2005 Nov 16
1
query time stemming and term weights
I am developping a personal/desktop search tool for which I am
experimenting with doing no stemming during the indexing, but instead
having a stem database (or several for different languages), used for
expanding the query terms at search time.
(i.e. user query: flooring -> stem: floor
-> final query for: [floored flooring floorings floors])
I have thought of a possible problem with
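
A sketch of the query-time expansion step described above; expand_stem() is a hypothetical lookup into the separate stem database the poster mentions, and OP_SYNONYM (available in current Xapian) makes the expanded forms score as one term, though OP_OR would also work:

#include <xapian.h>

#include <string>
#include <vector>

// Hypothetical helper: all surface forms recorded for a stem in the
// separate stem database described above.
std::vector<std::string> expand_stem(const std::string &stem);

// "flooring" -> stem "floor" -> query over [floored flooring floorings floors].
Xapian::Query expanded_query(const std::string &user_word)
{
    Xapian::Stem stemmer("en");
    std::vector<std::string> forms = expand_stem(stemmer(user_word));
    return Xapian::Query(Xapian::Query::OP_SYNONYM,
                         forms.begin(), forms.end());
}
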
2009 Mar 26
1
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was
wondering if anyone had any good rules of thumb on how to pick which
words to blacklist. It all seems a little... well... vague. Although I
guess it kind of depends on the sort of documents you want to index.
My current idea is to write a little script to output the terms with the
highest frequency in my
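
A sketch of such a script using the Xapian API directly: walk the database's full term list and print terms whose document frequency exceeds some share of the collection. The 40% cut-off is an arbitrary example:

#include <xapian.h>

#include <iostream>

// Print terms occurring in more than `threshold` of all documents, as
// candidate stopwords.
void stopword_candidates(const Xapian::Database &db, double threshold = 0.4)
{
    const double cutoff = threshold * db.get_doccount();
    for (Xapian::TermIterator t = db.allterms_begin();
         t != db.allterms_end(); ++t) {
        if (t.get_termfreq() >= cutoff)
            std::cout << *t << '\t' << t.get_termfreq() << '\n';
    }
}
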
2011 May 27
1
Does OP_NEAR works with stemming?
Hi All,
I used the OP_NEAR operator with the QueryParser, and when I searched for "apple store" in my own collection, the query is parsed as "Zappl:(pos=1) NEAR 11 Zstore:(pos=2)" but retrieved nothing. However, if I type in "Apple Store", the query is parsed as Xapian::Query((apple:(pos=1) NEAR 11 store:(pos=2))) and some results are shown. I'm not sure whether
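
The likely cause is that the lower-cased query is stemmed to Z-prefixed terms, which are indexed without positional data, so NEAR has nothing to work with; the capitalised form skips stemming and matches the positional unprefixed terms. One approach (a sketch, assuming reindexing is acceptable) is to stem the positional terms themselves by using STEM_ALL on both sides:

#include <xapian.h>

#include <string>

Xapian::Stem stem("en");

// Index time: stem the positional (unprefixed) terms themselves.
void index_doc(Xapian::WritableDatabase &db, const std::string &text)
{
    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(stem);
    tg.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL);
    tg.set_document(doc);
    tg.index_text(text);
    db.add_document(doc);
}

// Query time: same strategy, with OP_NEAR as the default operator, so
// "apple store" becomes appl NEAR store over terms that carry positions.
Xapian::Query near_query(const std::string &q)
{
    Xapian::QueryParser qp;
    qp.set_stemmer(stem);
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
    qp.set_default_op(Xapian::Query::OP_NEAR);
    return qp.parse_query(q);
}
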
2017 Jan 12
2
NEAR non-leaf subqueries
Olly Betts writes:
> On Wed, Jan 04, 2017 at 07:29:58AM +0100, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > The ticket has a patch which attempts to handle the OR case (which seems
> > > to be the part you actually care about) but this suffers from issues with
> > > object lifetimes which get a bit involved in the details. Since there
>
2007 Jun 28
1
TermGenerator and SimpleStopper
Hi,
I'm using SimpleStopper with TermGenerator in a Python indexing
script, in an attempt to keep my index size down (currently 30K per
doc, and I have 200 million docs to index, which I think implies
6TB.) However, unprefixed (positional?) terms are not affected by
the stopper, though Z-prefixed terms are.
I assume this is intentional for phrase queries, but I need to reduce
my
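
For reference, a sketch of the SimpleStopper/TermGenerator wiring being described, plus one size-reduction option: index_text_without_positions() skips the positional data entirely, which helps a lot with index size when phrase searching can be sacrificed (the stopword list and the "en" stemmer are placeholders):

#include <xapian.h>

#include <string>

// Wire a SimpleStopper into a TermGenerator.  The stopper suppresses the
// Z-prefixed (stemmed) terms for stopwords; the unprefixed positional terms
// are still indexed so phrase searches keep working.
void index_with_stopper(Xapian::WritableDatabase &db, const std::string &text)
{
    Xapian::SimpleStopper stopper;
    stopper.add("the");
    stopper.add("and");
    stopper.add("of");

    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    tg.set_stopper(&stopper);
    tg.set_document(doc);

    // If phrase searching is expendable, this variant skips positional data,
    // which usually shrinks the index considerably.
    tg.index_text_without_positions(text);

    db.add_document(doc);
}
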
2013 Nov 30
4
Full text search improvements
FTS indexing is something I hear quite often nowadays. I've added some hacks to make it work better for some installations, but it's about time to think about the whole design and how it could be improved for everyone in future. Here are some of my initial thoughts.
Currently Dovecot supports 3 full text search engines: Solr, CLucene and Dovecot Squat. CLucene plugin has various features built
2017 Jan 20
2
NEAR non-leaf subqueries
Olly Betts writes:
> On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote:
>
> > Recoll also supports multi-word synonyms which could potentially
> > generate PHRASE subqueries inside NEAR queries, but this
> > understandably already did not work with 1.2, so the multi-word
> > expansions are only used when proximity is not involved (by the way,
2016 Jan 08
2
Strange index consistency issue
Hi,
A Recoll user is reporting an index corruption problem. In general, index
corruption happens from time to time with Recoll, because of crashes,
reboots, misc Recoll bugs, etc.
The strange thing here is that xapian-check does not seem to detect anything.
In a nutshell, some document numbers seem to point to a data black hole: the
docids are returned when searching for the file/doc unique
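
A small sketch that narrows the symptom down: re-run the search and try to fetch every returned docid, reporting any that throw (building the query itself is left to the caller):

#include <xapian.h>

#include <iostream>

// Re-run a search and try to fetch every docid it returns, reporting any
// that cannot actually be read back from the database.
void verify(Xapian::Database &db, const Xapian::Query &query)
{
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet matches = enquire.get_mset(0, db.get_doccount());
    for (Xapian::MSetIterator m = matches.begin(); m != matches.end(); ++m) {
        try {
            Xapian::Document doc = db.get_document(*m);
            (void)doc.get_data();    // force the document data to be read
        } catch (const Xapian::Error &e) {
            std::cerr << "docid " << *m << ": " << e.get_msg() << '\n';
        }
    }
}
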
2023 Mar 24
1
manual flushing thresholds for deletes?
Years ago, I ran into OOM problems with the default flush
threshold of 10000 documents while indexing (add/replace).
Realizing I had documents of hugely varying sizes (0.5KB..20MB)
and little RAM, I instead tracked the number of raw bytes in the
text being indexed and flushed whenever I'd seen a configurable
byte count. Not the most scientific way, but it seems to work
well enough on low-end
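
A sketch of the workaround described above: count the bytes of text handed to the indexer and commit() whenever a configurable budget is crossed, instead of relying on the document-count auto-flush (the "en" stemmer is a placeholder):

#include <xapian.h>

#include <cstddef>
#include <string>

// Flush by bytes of indexed text rather than by document count, so a few
// huge documents cannot blow up memory use.
class ByteBudgetIndexer {
public:
    ByteBudgetIndexer(Xapian::WritableDatabase &db, std::size_t budget)
        : db_(db), budget_(budget), seen_(0) {}

    void add(const std::string &text)
    {
        Xapian::Document doc;
        Xapian::TermGenerator tg;
        tg.set_stemmer(Xapian::Stem("en"));
        tg.set_document(doc);
        tg.index_text(text);
        db_.add_document(doc);

        seen_ += text.size();
        if (seen_ >= budget_) {        // configurable byte threshold
            db_.commit();
            seen_ = 0;
        }
    }

private:
    Xapian::WritableDatabase &db_;
    std::size_t budget_;
    std::size_t seen_;
};
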