thr3ads.net - similar to: "Proposed changes to omindex"

Displaying 20 results from an estimated 3000 matches similar to: "Proposed changes to omindex"

2017 Jun 14

KMeans Clusterer - Going forward

Hello, I have finished moving the API to PIMPL classes and will fix issues within the current code over the next week, based on reviews from mentors. The next step is to start forming document vectors that are reduced and more useful. This helps considerably in saving run time (since the time for distance calculation depends on the number of terms). Getting the useful terms within a

2007 Mar 04

Getting non-stemmed terms from IndexReader

I need to get a set of terms being indexed using Ferret. I used IndexReader.terms and it returns a list of TermEnum nicely. The only problem is that my analyzer includes a stemming filter. So now, the terms I'm getting back are all stemmed. Is there any way to get the original unstemmed terms back from the index somehow? Thanks. -- Posted via http://www.ruby-forum.com/.

2009 Apr 06

omindex => Unknown extension

Hi all, I'm having a recurrent problem with Omega's indexing. When I run omindex, it sometimes fails to recognize the extension of some files (.doc, .pdf) and skips them. In the same run, omindex is otherwise perfectly able to index other files with the same extensions. The reason is not clear, but it should occur before it selects a content converter since, for example, if I manually run

2008 Mar 27

Proper noun stemming

Hi All, I was wondering if anyone had a solution for the following problem. I use QueryParser to stem my documents before adding them to a database. During the stemming process I would like to find a way of keeping proper nouns that span two or more words together as a phrase. For example "New York" or "Gordon Brown" or "Prime Minister" get split up. I see

2015 Jul 26

Get term from document by position

> Snippet highlighting is something that was worked on for a GSoC project a few years ago, and is mentioned in our FAQ: <http://trac.xapian.org/wiki/FAQ/Snippets>. It's not available in the 1.2 series, but as I understand it should work out of the box in 1.3.3.

I tried it; this approach returns snippets that have nothing to do with the search string. Moreover, it takes too

2013 Feb 27

Reading a password-protected PDF

Hello respected developers, I was wondering if it is possible for xapian to read a password-protected PDF. Searches in the archives and Google have yielded 0 results. I also tried looking at the source code but I could not find the specific part related to this issue. The characteristics of the set of PDFs are: 1. a set of password-protected PDF documents 2. all PDFs are set with the same password. 3.

2009 Mar 26

ideas on picking stopwords

I'm looking at adding some stopwords to my indexing procedure, and was wondering if anyone had any good rules of thumb on how to pick which words to blacklist. It all seems a little... well... vague. Although I guess it kind of depends on the sort of documents you're wanting to index. My current idea is to write a little script to output the terms with the highest frequency in my
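The frequency-ranking idea in this post could be sketched roughly as follows. This is only an illustration under assumptions: the post is cut off before saying where the terms come from, so the sketch ranks terms by document frequency over plain strings rather than reading a real index, and the function name `stopword_candidates` and the tokenizing regex are hypothetical.

```python
import re
from collections import Counter


def stopword_candidates(documents, top_n=10):
    """Return the top_n (term, document_frequency) pairs, most frequent first.

    Assumption: "frequency" is taken to mean the number of documents that
    contain a term; the original post does not say which measure it had in mind.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term at most once per document.
        terms = set(re.findall(r"[a-z0-9']+", text.lower()))
        doc_freq.update(terms)
    # Sort by descending frequency, then alphabetically so ties are stable.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    sample = ["the cat", "the dog", "the bird sings"]
    for term, freq in stopword_candidates(sample, top_n=3):
        print(term, freq)
```

The ranked list is only a starting point for manual review: a term that appears in nearly every document may still matter for phrase searches.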

2019 Mar 21

[GSoC] Questions about project Text-Extraction Libraries

Hello! I have a few questions related to the project Text-Extraction Libraries. Firstly, I think that trying to isolate library bugs in subprocesses could work, but I am not sure about how to handle deadlocks or infinite loops. I feel that using a timer is the only way to deal with it, but I would like to know what you think about it. Secondly, I have been reading the source code of
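The timer approach mentioned in this post might look something like the sketch below. It is not the project's design; it only shows the general pattern of running an extractor in a child process and giving up after a deadline. The helper name `run_isolated` is hypothetical, and the command passed to it stands in for whatever extraction program would actually be used.

```python
import subprocess
import sys


def run_isolated(argv, timeout_seconds):
    """Run argv in a child process and return its stdout as text.

    Returns None if the child exceeds the timeout (for example a deadlock
    or an infinite loop) or exits with a non-zero status (for example a crash).
    """
    try:
        result = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # subprocess.run kills the child on expiry
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout


if __name__ == "__main__":
    # A stand-in for a hanging extractor: sleeps far longer than the deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_isolated(hung, timeout_seconds=0.5))
```

Because the work happens in a separate process, a hang or crash in the child does not take the indexing process down with it.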

2011 May 27

Does OP_NEAR works with stemming?

Hi All, I used the OP_NEAR operator for queryparser, and when I searched for "apple store" from my own collection, the query is parsed as "Zappl:(pos=1) NEAR 11 Zstore:(pos=2)" but nothing is retrieved. However, if I type in "Apple Store", the query is parsed as Xapian::Query((apple:(pos=1) NEAR 11 store:(pos=2))) and some results are shown. I'm not sure whether

2008 Jul 30

Dealing with image PDF's

Guys, I was just playing around and added a bit of code to omindex.cc so I could OCR tiff and tif with gocr, which seems to work. Here's what it looks like:

    // Tiff:
    } else if (startswith(mimetype, "image/tif")) {
        // Inspired by http://mjr.towers.org.uk/comp/sxw2text
        string safefile = shell_protect(file);
        string cmd = "tifftopnm " + safefile + "

2013 Mar 04

Need Beginner Guide for Matcher Optimisations Project

Hi, While searching for a project which matches my interest and skill level, I found this project named Matcher Optimization. This project is really challenging and exciting from my viewpoint and I would like to be a part of it. The optimization techniques mentioned in the reference links provided will take some time for me to understand well. But I am trying to get my

2006 Aug 20

omindex patch

Attached is my rather largish omindex.cc patch with ChangeLog. It needs autoreconf to update configure and the Makefiles. Note that unrar is not patent infected, only rar, the compressor. I've put some AC_PATH_PROG checks into configure for all helpers. The patch is not yet complete.

2006-08-18 15:13:32 Reini Urban <reinhard.urban at avl.com>

omega-0.9.6b:
* omindex.cc: last_mod as

2012 Dec 13

omindex one file at a time?

Hi, all -- I want to do Plain Old Omindex'ing *but* the mapping between my documents' filenames and the URLs where I want search users to find them is, uh..., strange. The simplest thing (to me) would be to run omindex for each document, e.g.

    omindex --no-delete -U /cool-url-1 /funky/doc/file-blah.pdf
    omindex --no-delete -U /cool-url-7 /doc/funky/ohmy/blah-file.txt
    ...

and so on...
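The one-invocation-per-document approach described in this post could be driven from a small script. The sketch below only builds the command lines from a URL-to-file mapping and prints them; it uses just the `--no-delete` and `-U` options that appear in the post and adds no database option, since none is shown there. The function name `omindex_commands` and the mapping are illustrative assumptions.

```python
import shlex


def omindex_commands(url_to_path):
    """Build one omindex argument list per (URL, file path) pair.

    Only the options quoted in the post are used; a real run would likely
    need further options (such as the database location) that the post omits.
    """
    commands = []
    for url, path in url_to_path.items():
        commands.append(["omindex", "--no-delete", "-U", url, path])
    return commands


if __name__ == "__main__":
    mapping = {
        "/cool-url-1": "/funky/doc/file-blah.pdf",
        "/cool-url-7": "/doc/funky/ohmy/blah-file.txt",
    }
    for argv in omindex_commands(mapping):
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print(shlex.join(argv))
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if a path contains spaces, and the lists can be passed straight to `subprocess.run` once the remaining options are filled in.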

2011 Oct 18

patch proposal: omindex library or daemon

Olly (looking at commit logs, I think this is your dept :-) For apps which re/index files frequently and need format conversion, I'd like to propose a patch for one of...

Omindex library (thread safe):
    Omindex::init(options) // struct Omindex::options { ... }
        initialize mime_map, store default options
    session = new Omindex::Session(db_pathname)
        user threads use different sessions

2009 Jun 20

omindex hangs while scanning

Hello, I was looking for a search engine for a small internal documentation site and found xapian and omega. I downloaded and compiled it using MSYS and MinGW on a German Windows XP system. Finally, I installed Apache on the same box. Following the omega example, I copied the book to .../apache/htdocs and started omindex, which hung on the first document found. Even on very short doc with

2009 May 19

omindex options

Hi. I am writing a Python equivalent of omindex (we are using scriptindex currently, but I wanted to use omindex instead and extend it to work with our internal file format, BUT did not want to compile code if possible... so anyway). I have tried to keep the code as close as possible to the native omindex code, but am facing a bit of confusion: what exactly is the reason for omindex to take

2009 Feb 02

Ticket #282: omindex-assorted-enhancements.patch woes

I would really like to try out the features in the patch above. But I can't ever seem to get the resulting omindex.cc to "make". I tried updating to rev 10801 from SVN and then running ./bootstrap, but then I seem to get errors compiling everything when I try to do "make" (I'm using Ubuntu 8.10). So I thought I'd try and apply the patch to the latest stable version

2013 May 15

How to omindex some sub-directories?

Given a directory tree like ...

    /foo
     |
     +-- A
     |
     +-- B
     |
     +-- C

... what is the best way to index A and C into a single Xapian database? AFAIK the alternatives are:

    omindex --db /my_db --no-delete /foo /foo/A
    omindex --db /my_db --no-delete /foo /foo/B

or

    omindex --db /my_A_db /foo /foo/A
    omindex --db /my_B_db /foo /foo/B
    xapian-compact /my_A_db /my_B_db /my_db

The first alternative does not

2007 Jul 12

omega: omindex behaviour with duplicate files

Hi all, I need a little clarification with regard to Omega's behaviour with 'duplicate' files when running 'omindex'. How is a duplicate recognised? Is it simply by file path? How is an unmodified file detected, if at all? I would like to set up a Subversion post-commit hook to update my index. If possible I would like to just update the index with the newly committed files.
