Susmita/Rajib
2024-Apr-22 04:17 UTC
How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:
Dear senior ML members and developers of Xapian Omega,

Mr. Olly has helped me cross the bump of the initial learning curve. (ref: https://lists.xapian.org/pipermail/xapian-discuss/2024-April/010034.html)

How can I use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) to index a directory of text files with all strings greater than 3 characters, to create an index text file typically occurs in the End of a Book, with location in specific files, without using Recoll database?

I want to create an extensive list first with xapian omega, then have the list post-processed for all strings greater than 3 characters, along with the indexing data, as to where those texts appear in all specific text documents, like position, document name, etc. How could I include phrases? Could `omgrep` or `python` script be used for specific phrases?

Would the following steps help? Am I planning wrongly?:

1. Index the Text Files within a directory: Use the omindex command to index the text files. The example command to index all text files in a directory:

   $omindex -d /path/to/index_directory /path/to/text/files/directory

2. Once the text files are indexed, use the omindex command with the -i option to generate an extensive list of all words. example command:

   $omindex -d /path/to/index_directory -i > extensive_list.txt

   to generate a plain text file named extensive_list.txt containing all the words extracted from the indexed text files.

3. Post-Processing for Strings Greater Than 3 Characters: After generating the extensive list, to post-process it to filter out strings greater than 3 characters. I can use various tools and scripting languages like grep, awk, or Python to accomplish this task. For example, using grep:

   $grep -E '\b\w{4,}\b' extensive_list.txt > filtered_list.txt

   This command should theoretically filter out words with more than 3 characters from the extensive_list.txt and save the result in filtered_list.txt.

But how do I create an index for a pre-determined set of phrases? Would I require a specific script using omgrep, like using?:

   $omgrep "my phrase" /path/to/index/directory

Please suggest extensive code-lines, considering me a novice.

Best wishes,
Rajib
Etc.
Susmita/Rajib
2024-Apr-22 07:17 UTC
How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:
The post here: https://lists.xapian.org/pipermail/xapian-discuss/2024-April/010036.html
Olly Betts
2024-Apr-26 04:23 UTC
How to use Xapian Omega directly (i.e., without using `recoll` and `xapiandb`) ... Full Set Of Questions Below:
On Mon, Apr 22, 2024 at 09:47:54AM +0530, Susmita/Rajib wrote:
> How can I use Xapian Omega directly (i.e., without using `recoll` and
> `xapiandb`) to index a directory of text files with all strings
> greater than 3 characters, to create an index text file typically
> occurs in the End of a Book, with location in specific files, without
> using Recoll database? I want to create an extensive list first with
> xapian omega, then have the list post-processed for all strings
> greater than 3 characters, along with the indexing data, as to where
> those texts appear in all specific text documents, like position,
> document name, etc. How could I include phrases? Could `omgrep` or
> `python` script be used for specific phrases?

Um, there isn't an "omgrep" tool...

You could write a python script to extract the information needed to do this from a Xapian database though.

> 2. Once the text files are indexed, use the omindex command with the
> -i option to generate an extensive list of all words. example command:
> $omindex -d /path/to/index_directory -i > extensive_list.txt
> to generate a plain text file named extensive_list.txt containing all
> the words extracted from the indexed text files.

No, "-i" means "ignore meta robots tags and similar exclusions" - omindex will still index, and this option only affects decisions about which files to index.

If you want a command line tool to extract this sort of information from the database, look at "xapian-delve" (part of xapian-core; at least for Debian and Ubuntu it's in the xapian-tools binary package). You could also do it from Python or any other supported language.

> But how do I create an index for a pre-determined set of phrases?
> Would I require a specific script using omgrep, like using?:
> $omgrep "my phrase" /path/to/index/directory

You can just run a query to find the documents matching a phrase, or any other question.

Maybe the "quest" command line tool is useful for that if you want an existing command line tool to post-process output from? E.g. this gives the top ten document ids matching the phrase:

   quest -d data/default '"stemming algorithm"' |sed 's/^\([0-9]\+\): \[.*\]$/\1/p;d'

Probably better to write some python using Xapian's python bindings rather than trying to parse output that was never intended to be machine-readable. The "getting started" guide shows how to write a simple search script:

https://getting-started-with-xapian.readthedocs.io/en/latest/

Cheers,
Olly
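As a rough illustration of the "write some python" suggestion, here is a minimal sketch of a script that reads an omindex-built database and prints a back-of-book style list of words with the documents and positions where they occur, followed by the documents matching any phrases given on the command line. It rests on several assumptions: the python3 Xapian bindings are installed; the database was created by omindex, so each document's data contains a "url=" line and special terms (stemmed forms, MIME types and so on) carry a prefix starting with a capital letter; and positional information was stored for the ordinary words. The helper names (is_plain_word, url_from_data, iter_postings, format_index, phrase_matches) are invented for this example and are not part of Xapian.

```python
"""Sketch: back-of-book style word index from an omindex-built Xapian database."""
import sys
from collections import defaultdict

MIN_LENGTH = 4  # "strings greater than 3 characters"


def is_plain_word(term, min_length=MIN_LENGTH):
    """True for unprefixed terms of at least min_length characters.

    Assumption: omindex gives special terms a prefix that starts with an
    ASCII capital letter and leaves ordinary words lowercased and unprefixed.
    """
    return len(term) >= min_length and not ("A" <= term[0] <= "Z")


def url_from_data(data):
    """Return the url= field from omindex-style document data, or ''."""
    for line in data.splitlines():
        if line.startswith("url="):
            return line[4:]
    return ""


def iter_postings(db_path):
    """Yield (term, document name, positions) for every plain word."""
    import xapian  # imported lazily so the pure helpers work without the bindings

    db = xapian.Database(db_path)
    names = {}  # docid -> document name, cached to avoid re-reading documents
    for item in db.allterms():
        term = item.term.decode("utf-8", "replace")
        if not is_plain_word(term):
            continue
        for posting in db.postlist(item.term):
            docid = posting.docid
            if docid not in names:
                data = db.get_document(docid).get_data().decode("utf-8", "replace")
                names[docid] = url_from_data(data) or str(docid)
            yield term, names[docid], list(db.positionlist(docid, item.term))


def format_index(postings):
    """Turn (term, name, positions) tuples into sorted 'term: refs' lines."""
    entries = defaultdict(list)
    for term, name, positions in postings:
        where = ",".join(str(p) for p in positions)
        entries[term].append(f"{name} [{where}]" if where else name)
    return [f"{term}: {'; '.join(sorted(refs))}"
            for term, refs in sorted(entries.items())]


def phrase_matches(db_path, phrase, limit=10):
    """Return (docid, document name) for the top documents containing phrase."""
    import xapian

    db = xapian.Database(db_path)
    # Lowercased, unstemmed words, matching how plain terms are assumed stored.
    query = xapian.Query(xapian.Query.OP_PHRASE, phrase.lower().split())
    enquire = xapian.Enquire(db)
    enquire.set_query(query)
    results = []
    for match in enquire.get_mset(0, limit):
        data = match.document.get_data().decode("utf-8", "replace")
        results.append((match.docid, url_from_data(data)))
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py DATABASE_DIR ["some phrase" ...]
    for line in format_index(iter_postings(sys.argv[1])):
        print(line)
    for phrase in sys.argv[2:]:
        for docid, name in phrase_matches(sys.argv[1], phrase):
            print(f'"{phrase}": {name or docid}')
```

Run it with the database directory as the first argument and any phrases after it, redirecting the output to a file if wanted. Splitting the phrase on whitespace is a simplification; punctuation inside a phrase would need the same tokenisation the indexer applied.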