I'm planning on indexing XML/HTML files. I only want to index the text contained in the files, not any of the elements or tags.

I just finished reading Chapter 6 of "Ferret" (Balmain/O'Reilly), which presents a solution to this issue. The essence of the solution is to parse the XML/HTML and extract the text content using a parser such as Hpricot. My concern is that this approach will not support highlighting of the results [correct me if I'm wrong here], since the indexed field will contain only the text, without the elements and tags needed to locate that text in the original document.

Question: wouldn't a better approach be to implement a tokenizer that ignores XML/HTML tags and preserves the positions of the indexed tokens? If this is indeed an ideal approach, does such a solution already exist? If not, how can I contribute it once I implement it?

Regards,
John aka sd.codewarrior
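
P.S. To make the tokenizer idea concrete, here is a rough sketch in plain Ruby of what I have in mind. It does not use Ferret at all: `MarkupToken`, `tokenize_markup` and `highlight_markup` are names I made up for illustration, and I have not worked out how this would plug into Ferret's analyzer classes. It also assumes simple, well-formed tags (no comments, scripts, entities, or `>` inside attribute values).

```ruby
require 'strscan'

# Hypothetical token type: the term plus its character offsets
# into the ORIGINAL markup (tags included).
MarkupToken = Struct.new(:text, :start_offset, :end_offset)

# Hypothetical helper, not part of Ferret.
# Skips anything that looks like a tag and records where each word
# sits in the original string.
def tokenize_markup(source)
  tokens = []
  scanner = StringScanner.new(source)
  until scanner.eos?
    if scanner.scan(/<[^>]*>/)
      # A tag: consume it without emitting a token.
    elsif scanner.scan(/[[:alnum:]]+/)
      word = scanner.matched
      finish = scanner.charpos
      tokens << MarkupToken.new(word.downcase, finish - word.length, finish)
    else
      scanner.getch # whitespace, punctuation, stray '<', etc.
    end
  end
  tokens
end

# Hypothetical helper: wraps every occurrence of +term+ in the
# original markup, using the offsets recorded above. Works from the
# last match backwards so earlier offsets stay valid.
def highlight_markup(source, term, pre = '<em>', post = '</em>')
  result = source.dup
  tokenize_markup(source).reverse_each do |token|
    next unless token.text == term.downcase
    result.insert(token.end_offset, post)
    result.insert(token.start_offset, pre)
  end
  result
end

html = '<p>Hello <b>Ferret</b> world</p>'
puts highlight_markup(html, 'ferret')
```

The point is that the offsets refer to the original document, so the match can be marked up in place without disturbing the surrounding tags. Whether offsets like these can be carried through the index and used by the highlighter is exactly the part I'm unsure about.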