Jeremy C. Reed
2011-Feb-19 19:21 UTC
[Xapian-discuss] index everything? (no extensions/no mime-types)
I have around 550,000 files (4.7GB) that I need to index. It is a huge mix of file types. I don't need access to this via the web; I just use it for research locally. For now I run grep and wait several minutes.

omindex complains:

    Unknown extension: .... - skipping

because I have many thousands of files that don't have extensions (no period).

Is there any way to use omindex to index regardless of the extensions? Maybe just treat them as plain text or run strings on them?

Thanks,
Jeremy C. Reed
Olly Betts
2011-Feb-20 14:34 UTC
[Xapian-discuss] index everything? (no extensions/no mime-types)
On Sat, Feb 19, 2011 at 01:21:49PM -0600, Jeremy C. Reed wrote:
> I have around 550,000 files (4.7GB) that I need to index. It is a huge
> mix of file types. I don't need access to this via web. I just use for
> research locally. For now I do a grep and wait several minutes.
>
> omindex complains of
>
> Unknown extension: .... - skipping
>
> As I have many thousands of files that don't have extensions. (No
> Period.)
>
> Any way to use omindex to index regardless of the extensions? Maybe just
> use are plain text or run strings on them?

You can set a mapping for no extension - e.g. to treat such files as plain text:

    -M:text/plain

I think that should work with any Omega version you're likely to be using.

New in 1.2.4, Omega can use libmagic to detect the content-type of files for which there's no extension mapping. This is enabled if the libmagic development files are found, so install those if building from source, or if using a package, politely ask your packager to ensure libmagic is installed when building (the Debian and Ubuntu packages have this enabled).

1.2.4 also adds a way to specify filters on the command line, so you can set a mapping for no extension with:

    -M:application/octet-stream

and then tell omindex to run "strings -n8" on such files using:

    --filter=application/octet-stream:'strings -n8'

There isn't currently a way to set a content-type regardless of extension. I'm not sure that I can see a good use case for that.

You also can't clear all the mappings except by passing -Mtext/html: etc for every content-type which has a mapping by default, which is very cumbersome. Removing all the mappings means libmagic would be used on all files, so it might be useful to have a simple way to achieve this.

Cheers,
    Olly
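
For reference, the options from the reply above could be combined into full command lines roughly as follows. This is only a sketch: the database and source paths are hypothetical placeholders, the omindex_cmd helper is made up for illustration, and "--url /" is an assumed base URL for a local-only index. The helper prints each command instead of running it, so it can be checked before indexing 550,000 files.

```shell
#!/bin/sh
# Hypothetical locations; override DB and SRC to match your own setup.
DB="${DB:-$HOME/research.db}"
SRC="${SRC:-$HOME/research}"

# Print (do not run) an omindex command for files that have no extension.
#   text    - map the empty extension to text/plain
#   strings - map it to application/octet-stream and filter those files
#             through "strings -n8" (--filter needs Omega 1.2.4 or later)
omindex_cmd() {
    case "$1" in
        text)
            printf '%s\n' "omindex --db $DB --url / -M:text/plain $SRC"
            ;;
        strings)
            printf '%s\n' "omindex --db $DB --url / -M:application/octet-stream --filter=application/octet-stream:'strings -n8' $SRC"
            ;;
        *)
            echo "usage: omindex_cmd text|strings" >&2
            return 1
            ;;
    esac
}

omindex_cmd text
omindex_cmd strings
```

Once a printed command looks right, it can be pasted into the shell or run with eval. The text variant should suit any Omega version; the strings variant is only for 1.2.4 and later.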