V S P
2012-May-21 02:59 UTC
[Xapian-discuss] [q] phrase replacement in thousands of text files
Hello, first post. I searched through the docs and examples but did not see this particular problem answered.

I have thousands of text files, total size about 17GB. Within a file I need to find a phrase (typically up to 3 words together, separated by spaces, commas, or non-period punctuation marks). I have a dictionary of about 3 million phrases and their replacements, so I need to replace all of the matching phrases from the dictionary with their replacements.

The most brute-force approach I thought of was:

a) build an index on all of the 17GB of documents
b) for every one of the 3 million search phrases, do a search
c) expect ( b ) to return a Xapian match giving the start and end byte location in a file; for every search, remember that location and the found phrase in a 'future replacement list'
d) when done, use the 'future replacement list' to perform the replacement operation

Obviously there are a couple of problems. First, this means searching 17GB worth of text 3 million times. Second, I do not understand how (if at all possible) to get the start/end offset of the found phrase within the source file. Third, how do I ensure that the phrase words are together (and that a match with a period between them is not considered a find)?

Thank you in advance for any suggestions,
vsp
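For illustration, step (d) could look roughly like the sketch below. It assumes the 'future replacement list' already exists as (start, end, replacement) tuples whose offsets refer to the original, unmodified text; the function name `apply_replacements` is hypothetical, and it works on a decoded string rather than raw bytes for simplicity.

```python
def apply_replacements(text, replacements):
    """Apply (start, end, new_text) edits whose offsets refer to `text`.

    The result is built in one forward pass from the untouched slices and
    the replacement strings, so offsets never need adjusting even when a
    replacement changes the length of the text.
    """
    pieces = []
    cursor = 0  # end of the last span already consumed
    for start, end, new_text in sorted(replacements):
        if start < cursor:
            continue  # overlaps an edit that was already applied; skip it
        pieces.append(text[cursor:start])  # unchanged text before the match
        pieces.append(new_text)
        cursor = end
    pieces.append(text[cursor:])  # unchanged tail
    return "".join(pieces)
```

Skipping overlapping matches is only one possible policy; preferring the longest match would be another reasonable choice.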
Olly Betts
2012-May-21 05:46 UTC
[Xapian-discuss] [q] phrase replacement in thousands of text files
On Sun, May 20, 2012 at 10:59:11PM -0400, V S P wrote:
> Obviously couple of problems 3 million times search 17GB worth of text

I'm not sure I see why this is a problem unless the run time is highly sensitive.

> Second -- I do not understand how (if at all possible) to get the
> start/end offset of the found phrase within the source file

Xapian doesn't store byte offsets (only word offsets), so this isn't possible. It can narrow down the number of files you need to go and look at for each replacement though, which could make quite a difference if many of the replacements are rarely done.

> Third how do I ensure that the phrase words are together (and the one
> with period between them is not considered a find).

When indexing, pass each chunk of text between periods to TermGenerator::index_text(), calling increase_termpos() after each index_text() call. That leaves a gap in the word positions between chunks, so phrases can't span a period.

Cheers,
Olly
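To see why a position gap between chunks stops phrases from spanning a period, here is a small pure-Python sketch of the idea. It does not call Xapian; `build_positions`, `contains_phrase` and the gap of 100 are hypothetical stand-ins that only mimic the effect of calling index_text() on each chunk followed by increase_termpos(). Splitting on every "." is a simplification that would also split decimals and abbreviations.

```python
import re

TERMPOS_GAP = 100  # assumed size of the jump in word positions between chunks


def build_positions(text, gap=TERMPOS_GAP):
    """Map each lowercased word to the set of word positions where it occurs.

    The text is split at periods. After each chunk the position counter
    jumps by `gap`, so words on either side of a period are never adjacent.
    """
    positions = {}
    pos = 0
    for chunk in text.split("."):
        for word in re.findall(r"\w+", chunk.lower()):
            pos += 1
            positions.setdefault(word, set()).add(pos)
        pos += gap  # plays the role of the position increase between chunks
    return positions


def contains_phrase(positions, phrase):
    """Return True if the words of `phrase` occur at consecutive positions."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return False
    return any(
        all(start + i in positions.get(word, ()) for i, word in enumerate(words))
        for start in positions.get(words[0], ())
    )
```

With this layout a comma between two words does not break a phrase, since both words sit in the same chunk, whereas a period does, because the second word's position is more than one step away from the first.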