thr3ads.net - Xapian discuss - How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below: [Apr 2024]

Susmita/Rajib

2024-Apr-22 04:17 UTC

How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:

Dear senior ML members and developers of Xapian Omega,

Mr. Olly has helped me cross the bump of the initial learning curve.
(ref: https://lists.xapian.org/pipermail/xapian-discuss/2024-April/010034.html)

How can I use Xapian Omega directly (i.e., without using `recoll` and
`xapiandb`) to index a directory of text files and build, from all
strings longer than 3 characters, an index of the kind that typically
appears at the end of a book, giving each string's location in the
specific files, without using a Recoll database? I want to create an
extensive list first with Xapian Omega, then post-process it to keep
all strings longer than 3 characters together with the indexing data
showing where they appear in each text document (position, document
name, etc.). How could I include phrases? Could `omgrep` or a Python
script be used for specific phrases?

Would the following steps help? Am I planning wrongly?:

1.  Index the Text Files within a directory: Use the omindex command
to index the text files. The example command to index all text files
in a directory:
$omindex -d /path/to/index_directory /path/to/text/files/directory

2.  Once the text files are indexed, use the omindex command with the
-i option to generate an extensive list of all words. example command:
$omindex -d /path/to/index_directory -i > extensive_list.txt
to generate a plain text file named extensive_list.txt containing all
the words extracted from the indexed text files.

3. Post-processing for strings longer than 3 characters: After
generating the extensive list, post-process it to keep only the
strings longer than 3 characters. I can use various tools and
scripting languages like grep, awk, or Python to accomplish this task.
For example, using grep:
$grep -E '\b\w{4,}\b' extensive_list.txt > filtered_list.txt
This command should theoretically keep the words with more than 3
characters from extensive_list.txt and save the result in
filtered_list.txt.

But how do I create an index for a pre-determined set of phrases?
Would I require a specific script using omgrep, like using?:
$omgrep "my phrase" /path/to/index/directory

Please suggest extensive code-lines, considering me a novice.

Best wishes,
Rajib
Etc.

Susmita/Rajib

2024-Apr-22 07:17 UTC

The post here:
https://lists.xapian.org/pipermail/xapian-discuss/2024-April/010036.html

Olly Betts

2024-Apr-26 04:23 UTC

On Mon, Apr 22, 2024 at 09:47:54AM +0530, Susmita/Rajib wrote:
> How can I use Xapian Omega directly (i.e., without using `recoll` and
> `xapiandb`) to index a directory of text files and build, from all
> strings longer than 3 characters, an index of the kind that typically
> appears at the end of a book, giving each string's location in the
> specific files, without using a Recoll database? I want to create an
> extensive list first with Xapian Omega, then post-process it to keep
> all strings longer than 3 characters together with the indexing data
> showing where they appear in each text document (position, document
> name, etc.). How could I include phrases? Could `omgrep` or a Python
> script be used for specific phrases?

Um, there isn't an "omgrep" tool...

You could write a python script to extract the information needed to
do this from a Xapian database though.

> 2.  Once the text files are indexed, use the omindex command with the
> -i option to generate an extensive list of all words. example command:
> $omindex -d /path/to/index_directory -i > extensive_list.txt
> to generate a plain text file named extensive_list.txt containing all
> the words extracted from the indexed text files.

No, "-i" means "ignore meta robots tags and similar exclusions" -
omindex will still index, and this option only affects decisions about
which files to index.

If you want a command line tool to extract this sort of information from
the database, look at "xapian-delve" (part of xapian-core; at least for
Debian and Ubuntu it's in the xapian-tools binary package).  You could
also do it from Python or any other supported language.
> But how do I create an index for a pre-determined set of phrases?
> Would I require a specific script using omgrep, like using?:
> $omgrep "my phrase" /path/to/index/directory

You can just run a query to find the documents matching a phrase, or
any other question.

Maybe the "quest" command line tool is useful for that if you want an
existing command line tool to post-process output from?

E.g. this gives the top ten document ids matching the phrase:

quest -d data/default '"stemming algorithm"' | sed 's/^\([0-9]\+\): \[.*\]$/\1/p;d'

Probably better to write some python using Xapian's python bindings
rather than trying to parse output that was never intended to be
machine-readable.
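
For instance, a phrase lookup through the Python bindings might look
roughly like the sketch below. It assumes the bindings are installed
and that lowercasing and splitting on whitespace is close enough to how
the words were indexed, which is a simplification; the function names
phrase_terms and find_phrase are invented for this example.

```python
def phrase_terms(phrase):
    """Lowercase the phrase and split it on whitespace (a simplification)."""
    return phrase.lower().split()


def find_phrase(db_path, phrase, limit=10):
    """Return up to `limit` document ids whose text contains the phrase."""
    import xapian  # imported lazily; requires Xapian's Python bindings

    db = xapian.Database(db_path)
    # OP_PHRASE matches the terms adjacent and in order, using positional data.
    query = xapian.Query(xapian.Query.OP_PHRASE, phrase_terms(phrase))
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    return [match.docid for match in enquire.get_mset(0, limit)]
```

Running find_phrase("data/default", "stemming algorithm") would then
return a list of document ids, which can be looped over for a
pre-determined set of phrases.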

The "getting started" guide shows how to write a simple search script:

https://getting-started-with-xapian.readthedocs.io/en/latest/

Cheers,
    Olly
