Displaying 20 results from an estimated 3000 matches similar to: "Get term from document by position"
2015 Jul 23
1
Get term from document by position
Hello. Is there any FAST way to get a term from a Xapian document by its position, something like
std::string term = Xapian::Document::GetTermByPosition(int position) ?
Below I have described the task I am trying to solve, in case somebody is interested.
============================================================================
When displaying search results, I would like to
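
There is no Xapian::Document::GetTermByPosition() in the Xapian API; the call above is what the poster wishes existed. One workaround (a minimal sketch, not necessarily fast for huge documents) is to walk the document's termlist once and invert each term's position list into a position-to-term map:

#include <xapian.h>

#include <map>
#include <string>

// Build a position -> term map for one document by walking its termlist and
// inverting each term's position list.  The unprefixed terms are the ones
// that carry positional data.
std::map<Xapian::termpos, std::string>
position_to_term(const Xapian::Document &doc)
{
    std::map<Xapian::termpos, std::string> by_pos;
    for (Xapian::TermIterator t = doc.termlist_begin();
         t != doc.termlist_end(); ++t) {
        for (Xapian::PositionIterator p = t.positionlist_begin();
             p != t.positionlist_end(); ++p) {
            by_pos[*p] = *t;
        }
    }
    return by_pos;
}

Building the map costs one pass over the document's postings, so it is worth caching if several positions are needed from the same document.
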
2007 Mar 04
5
Getting non-stemmed terms from IndexReader
I need to get a set of terms being indexed using Ferret. I used
IndexReader.terms and it returns a list of TermEnum nicely. The only
problem is that my analyzer includes a stemming filter.
So now, the terms I'm getting back are all stemmed. Is there any way to
get the original unstemmed terms back from the index? Thanks.
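
Ferret/Ruby specifics aside, the underlying issue is that a stemmed index cannot reproduce the original words. A library-agnostic C++ sketch of the usual workaround, recording a stem-to-surface-form map while indexing (the names here are illustrative, not any Ferret API):

#include <map>
#include <set>
#include <string>

// Side index: remember which surface forms produced each stem while indexing,
// since the stemmed index alone cannot give them back.
std::map<std::string, std::set<std::string> > surface_forms;

void remember(const std::string &stem, const std::string &original)
{
    surface_forms[stem].insert(original);
}

// e.g. remember("run", "running"); surface_forms["run"] later lists the
// original words that were folded into that stem.
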
2015 Jul 26
1
Get term from document by position
> Can you file a bug with some example outputs that are unrelated to the search string?
Here is the example (see attachment).
This example does the following:
1) First, it indexes the text from the "text.txt" file (see attachment); this is actually the text of the book "Abbas, Lichtman. Basic Immunology".
2) Next, it searches for the "extracellular
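
A minimal sketch of the two steps described above, using the Xapian C++ API (the "en" stemmer and database path are assumptions; the query word is taken from the truncated sentence above):

#include <xapian.h>

#include <fstream>
#include <iostream>
#include <sstream>

int main()
{
    // 1) Index the contents of text.txt into a fresh database.
    Xapian::WritableDatabase db("example.db", Xapian::DB_CREATE_OR_OPEN);
    std::ifstream in("text.txt");
    std::stringstream buf;
    buf << in.rdbuf();

    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    Xapian::Document doc;
    tg.set_document(doc);
    tg.index_text(buf.str());
    db.add_document(doc);
    db.commit();

    // 2) Search it, with query-time stemming matching the indexer.
    Xapian::QueryParser qp;
    qp.set_stemmer(Xapian::Stem("en"));
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
    Xapian::Enquire enquire(db);
    enquire.set_query(qp.parse_query("extracellular"));
    Xapian::MSet matches = enquire.get_mset(0, 10);
    std::cout << matches.get_matches_estimated() << " matches\n";
    return 0;
}
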
2018 Sep 14
3
How to make database build threaded?
On 14/09/2018 at 09:30, Jean-Francois Dockes wrote:
> Hi,
>
> You may be interested by how Recoll does it:
>
> https://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html
>
> A few things in the document are slightly obsolete (esp. the last
> paragraph: recollindex now does use vfork()), but it's overall quite close
> to how the current indexer works.
2015 Jul 26
1
Get term from document by position
> Here is the example (see attachment).
>
> Attachments get stripped out by the mailing list, so I've made a private gist of the two files here: <https://gist.github.com/jaylett/ce8455b37e2b84422346>.
>
> Actually, when I run it I get 0 matches, which would explain why you're just getting the start of the document. However, if I adjust things (match the stemming strategy for TermGenerator to
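
The fix hinted at above is to keep the TermGenerator and QueryParser settings in step. A minimal sketch, assuming an English stemmer and the default STEM_SOME strategy on both sides:

#include <xapian.h>

// Give the indexer and the query parser the same stemmer and the same
// stemming strategy, so the Z-prefixed terms generated at query time
// actually exist in the database.
void configure(Xapian::TermGenerator &tg, Xapian::QueryParser &qp)
{
    Xapian::Stem stem("en");
    tg.set_stemmer(stem);
    tg.set_stemming_strategy(Xapian::TermGenerator::STEM_SOME);
    qp.set_stemmer(stem);
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_SOME);
}
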
2011 May 26
0
Desktopsearch "Recoll" for CentOS 5.5 64bit
Hi Folks,
Is there an RPM package of the desktop search tool "Recoll" for CentOS 5.5 64-bit?
If yes, where is it?
I've tried fedora-packages from
http://www.lesbonscomptes.com/recoll/download.html#rpms
but got many dependency errors
Thx
Timothy
2017 Jun 14
2
KMeans Clusterer - Going forward
Hello,
I have finished moving the API to PIMPL classes and will fix issues within
the current code over the next week, based on reviews from mentors.
The next step going forward is to start forming document vectors that are
reduced and more useful. This helps significantly in saving run time (since
the time for distance calculation depends on the number of terms). Getting the
useful terms within a
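
One simple way to form such reduced vectors (a sketch only, not necessarily what the clustering branch does) is to read term and wdf pairs from the document's termlist and drop terms below a wdf threshold; the threshold of 2 here is arbitrary:

#include <xapian.h>

#include <map>
#include <string>

// Term -> wdf vector for one document, keeping only terms whose
// within-document frequency reaches min_wdf, to shrink the vectors used
// in distance calculations.
std::map<std::string, Xapian::termcount>
document_vector(const Xapian::Document &doc, Xapian::termcount min_wdf = 2)
{
    std::map<std::string, Xapian::termcount> vec;
    for (Xapian::TermIterator t = doc.termlist_begin();
         t != doc.termlist_end(); ++t) {
        if (t.get_wdf() >= min_wdf)
            vec[*t] = t.get_wdf();
    }
    return vec;
}
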
2018 Sep 14
0
How to make database build threaded?
Franco Martelli writes:
> Hi everybody,
> I'm the author of a small C++11 program called XDGSearch. The source
> code is hosted on GitHub; for a quick overview you can visit this link:
> https://github.com/frank67/XDGSearch/blob/master/README.md
> I'm writing to the mailing list because I'd like to split the database
> build process into multiple threads. Is it
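
Xapian's WritableDatabase is not meant to be written to from several threads at once, so the usual shape (roughly what the Recoll write-up linked in the reply above describes) is to parallelise text extraction and term generation in worker threads and funnel finished documents to a single writer thread. A rough C++11 sketch, with a trivial file-reading stand-in for real text extraction:

#include <xapian.h>

#include <condition_variable>
#include <fstream>
#include <functional>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Stand-in text extraction: just read the file as plain text.
std::string extract_text(const std::string &path)
{
    std::ifstream in(path);
    std::stringstream buf;
    buf << in.rdbuf();
    return buf.str();
}

std::mutex mtx;
std::condition_variable cv;
std::queue<Xapian::Document> ready;
bool done = false;

// Workers build Xapian::Document objects in parallel and queue them.
void worker(const std::vector<std::string> &paths)
{
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    for (const std::string &path : paths) {
        Xapian::Document doc;
        tg.set_document(doc);
        tg.index_text(extract_text(path));
        std::lock_guard<std::mutex> lk(mtx);
        ready.push(doc);
        cv.notify_one();
    }
}

// A single thread owns the WritableDatabase and performs all writes.
void writer(Xapian::WritableDatabase &db)
{
    std::unique_lock<std::mutex> lk(mtx);
    for (;;) {
        cv.wait(lk, [] { return done || !ready.empty(); });
        while (!ready.empty()) {
            Xapian::Document doc = ready.front();
            ready.pop();
            lk.unlock();
            db.add_document(doc);
            lk.lock();
        }
        if (done) break;
    }
    db.commit();
}

int main()
{
    Xapian::WritableDatabase db("index.db", Xapian::DB_CREATE_OR_OPEN);
    std::vector<std::string> half1 = {"a.txt"}, half2 = {"b.txt"}; // split the work
    std::thread t1(worker, half1), t2(worker, half2);
    std::thread w(writer, std::ref(db));
    t1.join();
    t2.join();
    {
        std::lock_guard<std::mutex> lk(mtx);
        done = true;
    }
    cv.notify_one();
    w.join();
    return 0;
}
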
2006 Aug 11
3
Proposed changes to omindex
Proposed changes to omindex
Currently Available Items
=========================
1) Have the Q prefix contain the 16 byte MD5 of the full file name used for document lookup during
indexing.
2) Add the document's last modified time to the value table (ID 0). This would allow incremental
indexing based on the timestamp and also sorting by date in omega (SORT=0)
a. Currently I store the timestamp
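
A sketch of what items 1) and 2) amount to in code; md5_hex() is a hypothetical helper, and the "Q" prefix and value slot 0 follow the proposal above rather than any shipped omindex behaviour:

#include <xapian.h>

#include <ctime>
#include <string>

// Hypothetical helper: hex MD5 digest of a string (not part of Xapian).
std::string md5_hex(const std::string &s);

// "Q"-prefixed unique-ID term built from the file name, plus the
// last-modified time in value slot 0 for date sorting / incremental checks.
void index_file(Xapian::WritableDatabase &db,
                const std::string &filename,
                const std::string &text,
                time_t mtime)
{
    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    tg.set_document(doc);
    tg.index_text(text);

    std::string idterm = "Q" + md5_hex(filename);
    doc.add_boolean_term(idterm);
    doc.add_value(0, Xapian::sortable_serialise(double(mtime)));
    doc.set_data(filename);

    // Replaces any existing document carrying the same unique-ID term,
    // which is what makes incremental reindexing work.
    db.replace_document(idterm, doc);
}
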
2023 Mar 26
1
manual flushing thresholds for deletes?
On Fri, Mar 24, 2023 at 10:37:41AM +0000, Eric Wong wrote:
> Realizing I had documents of hugely varying sizes (0.5KB..20MB)
> and little RAM, I instead tracked the number of raw bytes in the
> text being indexed and flushed whenever I'd seen a configurable
> byte count. Not the most scientific way, but it seems to work
> well enough on low-end systems.
>
> Now, I'm
2008 Mar 27
2
Proper noun stemming
Hi All
I was wondering if anyone had a solution for the following problem.
I use QueryParser to stem my documents before adding them to a
database. During the stemming process I would like to find a way of
keeping proper nouns that span two or more words together as a phrase.
For example "New York" or "Gordon Brown" or "Prime Minister" get spilt
up. I see
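
One workaround is to index the text normally and additionally add each detected multi-word proper noun as a single extra term, so it can never be split. A sketch, where the phrase list and the "XP" prefix are purely illustrative (detecting the phrases is a separate problem):

#include <xapian.h>

#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Index the text normally, then add each known multi-word proper noun as a
// single extra term so it is never split by the tokeniser or stemmer.
void index_with_phrases(Xapian::WritableDatabase &db,
                        const std::string &text,
                        const std::vector<std::string> &phrases)
{
    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    tg.set_document(doc);
    tg.index_text(text);

    for (std::string phrase : phrases) {      // e.g. "New York"
        std::transform(phrase.begin(), phrase.end(), phrase.begin(), ::tolower);
        std::replace(phrase.begin(), phrase.end(), ' ', '_');
        doc.add_term("XP" + phrase);          // one unsplit term: XPnew_york
    }
    db.add_document(doc);
}

At query time the same normalisation has to be applied to build the matching XP term for a quoted phrase.
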
2005 Nov 16
1
query time stemming and term weights
I am developping a personal/desktop search tool for which I am
experimenting with doing no stemming during the indexing, but instead
having a stem database (or several for different languages), used for
expanding the query terms at search time.
(i.e. user query: flooring -> stem: floor
-> final query for: [floored flooring floorings floors])
I have thought of a possible problem with
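
A sketch of the query-time expansion step described above; expand_stem() is a hypothetical lookup into the separate stem database the poster mentions, and OP_SYNONYM (available in current Xapian) makes the expanded forms score as one term, though OP_OR would also work:

#include <xapian.h>

#include <string>
#include <vector>

// Hypothetical helper: all surface forms recorded for a stem in the
// separate stem database described above.
std::vector<std::string> expand_stem(const std::string &stem);

// "flooring" -> stem "floor" -> query over [floored flooring floorings floors].
Xapian::Query expanded_query(const std::string &user_word)
{
    Xapian::Stem stemmer("en");
    std::vector<std::string> forms = expand_stem(stemmer(user_word));
    return Xapian::Query(Xapian::Query::OP_SYNONYM,
                         forms.begin(), forms.end());
}
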
2009 Mar 26
1
ideas on picking stopwords
I'm looking at adding some stopwords to my indexing procedure, and was
wondering if anyone had any good rules of thumb on how to pick which
words to blacklist. It all seems a little... well... vague. Although I
guess it kind of depends on the sort of documents you want to index.
My current idea is to write a little script to output the terms with the
highest frequency in my
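
A sketch of such a script using the Xapian API directly: walk the database's full term list and print terms whose document frequency exceeds some share of the collection. The 40% cut-off is an arbitrary example:

#include <xapian.h>

#include <iostream>

// Print terms occurring in more than `threshold` of all documents, as
// candidate stopwords.
void stopword_candidates(const Xapian::Database &db, double threshold = 0.4)
{
    const double cutoff = threshold * db.get_doccount();
    for (Xapian::TermIterator t = db.allterms_begin();
         t != db.allterms_end(); ++t) {
        if (t.get_termfreq() >= cutoff)
            std::cout << *t << '\t' << t.get_termfreq() << '\n';
    }
}
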
2011 May 27
1
Does OP_NEAR works with stemming?
Hi All,
I used the OP_NEAR operator with the QueryParser, and when I searched for "apple store" in my own collection, the query is parsed as "Zappl:(pos=1) NEAR 11 Zstore:(pos=2)" but retrieved nothing. However, if I type in "Apple Store", the query is parsed as Xapian::Query((apple:(pos=1) NEAR 11 store:(pos=2))) and some results are shown. I'm not sure whether
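
The likely cause is that the lower-cased query is stemmed to Z-prefixed terms, which are indexed without positional data, so NEAR has nothing to work with; the capitalised form skips stemming and matches the positional unprefixed terms. One approach (a sketch, assuming reindexing is acceptable) is to stem the positional terms themselves by using STEM_ALL on both sides:

#include <xapian.h>

#include <string>

Xapian::Stem stem("en");

// Index time: stem the positional (unprefixed) terms themselves.
void index_doc(Xapian::WritableDatabase &db, const std::string &text)
{
    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(stem);
    tg.set_stemming_strategy(Xapian::TermGenerator::STEM_ALL);
    tg.set_document(doc);
    tg.index_text(text);
    db.add_document(doc);
}

// Query time: same strategy, with OP_NEAR as the default operator, so
// "apple store" becomes appl NEAR store over terms that carry positions.
Xapian::Query near_query(const std::string &q)
{
    Xapian::QueryParser qp;
    qp.set_stemmer(stem);
    qp.set_stemming_strategy(Xapian::QueryParser::STEM_ALL);
    qp.set_default_op(Xapian::Query::OP_NEAR);
    return qp.parse_query(q);
}
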
2017 Jan 12
2
NEAR non-leaf subqueries
Olly Betts writes:
> On Wed, Jan 04, 2017 at 07:29:58AM +0100, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > The ticket has a patch which attempts to handle the OR case (which seems
> > > to be the part you actually care about) but this suffers from issues with
> > > object lifetimes which get a bit involved in the details. Since there
>
2007 Jun 28
1
TermGenerator and SimpleStopper
Hi,
I'm using SimpleStopper with TermGenerator in a Python indexing
script, in an attempt to keep my index size down (currently 30K per
doc, and I have 200 million docs to index, which I think implies
6TB.) However, unprefixed (positional?) terms are not affected by
the stopper, though Z-prefixed terms are.
I assume this is intentional for phrase queries, but I need to reduce
my
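
For reference, a sketch of the SimpleStopper/TermGenerator wiring being described, plus one size-reduction option: index_text_without_positions() skips the positional data entirely, which helps a lot with index size when phrase searching can be sacrificed (the stopword list and the "en" stemmer are placeholders):

#include <xapian.h>

#include <string>

// Wire a SimpleStopper into a TermGenerator.  The stopper suppresses the
// Z-prefixed (stemmed) terms for stopwords; the unprefixed positional terms
// are still indexed so phrase searches keep working.
void index_with_stopper(Xapian::WritableDatabase &db, const std::string &text)
{
    Xapian::SimpleStopper stopper;
    stopper.add("the");
    stopper.add("and");
    stopper.add("of");

    Xapian::Document doc;
    Xapian::TermGenerator tg;
    tg.set_stemmer(Xapian::Stem("en"));
    tg.set_stopper(&stopper);
    tg.set_document(doc);

    // If phrase searching is expendable, this variant skips positional data,
    // which usually shrinks the index considerably.
    tg.index_text_without_positions(text);

    db.add_document(doc);
}
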
2013 Nov 30
4
Full text search improvements
FTS indexing is something I hear quite often nowadays. I've added some hacks to make it work better for some installations, but it's about time to think about the whole design and how it could be improved for everyone in future. Here are some of my initial thoughts.
Currently Dovecot supports 3 full text search engines: Solr, CLucene and Dovecot Squat. CLucene plugin has various features built
2017 Jan 20
2
NEAR non-leaf subqueries
Olly Betts writes:
> On Thu, Jan 12, 2017 at 07:53:21PM +0100, Jean-Francois Dockes wrote:
>
> > Recoll also supports multi-word synonyms which could potentially
> > generate PHRASE subqueries inside NEAR queries, but this
> > understandably already did not work with 1.2, so the multi-word
> > expansions are only used when proximity is not involved (by the way,
2016 Jan 08
2
Strange index consistency issue
Hi,
A Recoll user is reporting an index corruption problem. In general, index
corruption happens from time to time with Recoll, because of crashes,
reboots, misc Recoll bugs, etc.
The strange thing here is that xapian-check does not seem to detect anything.
In a nutshell, some document numbers seem to point to a data black hole: the
docids are returned when searching for the file/doc unique
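
A small sketch that narrows the symptom down: re-run the search and try to fetch every returned docid, reporting any that throw (building the query itself is left to the caller):

#include <xapian.h>

#include <iostream>

// Re-run a search and try to fetch every docid it returns, reporting any
// that cannot actually be read back from the database.
void verify(Xapian::Database &db, const Xapian::Query &query)
{
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    Xapian::MSet matches = enquire.get_mset(0, db.get_doccount());
    for (Xapian::MSetIterator m = matches.begin(); m != matches.end(); ++m) {
        try {
            Xapian::Document doc = db.get_document(*m);
            (void)doc.get_data();    // force the document data to be read
        } catch (const Xapian::Error &e) {
            std::cerr << "docid " << *m << ": " << e.get_msg() << '\n';
        }
    }
}
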
2023 Mar 24
1
manual flushing thresholds for deletes?
Years ago, I ran into OOM problems with the default flush
threshold of 10000 documents while indexing (add/replace).
Realizing I had documents of hugely varying sizes (0.5KB..20MB)
and little RAM, I instead tracked the number of raw bytes in the
text being indexed and flushed whenever I'd seen a configurable
byte count. Not the most scientific way, but it seems to work
well enough on low-end
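
A sketch of the workaround described above: count the bytes of text handed to the indexer and commit() whenever a configurable budget is crossed, instead of relying on the document-count auto-flush (the "en" stemmer is a placeholder):

#include <xapian.h>

#include <cstddef>
#include <string>

// Flush by bytes of indexed text rather than by document count, so a few
// huge documents cannot blow up memory use.
class ByteBudgetIndexer {
public:
    ByteBudgetIndexer(Xapian::WritableDatabase &db, std::size_t budget)
        : db_(db), budget_(budget), seen_(0) {}

    void add(const std::string &text)
    {
        Xapian::Document doc;
        Xapian::TermGenerator tg;
        tg.set_stemmer(Xapian::Stem("en"));
        tg.set_document(doc);
        tg.index_text(text);
        db_.add_document(doc);

        seen_ += text.size();
        if (seen_ >= budget_) {        // configurable byte threshold
            db_.commit();
            seen_ = 0;
        }
    }

private:
    Xapian::WritableDatabase &db_;
    std::size_t budget_;
    std::size_t seen_;
};
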