Chris TenHarmsel
2008-Feb-21 17:35 UTC
[Ferret-talk] Experience using ferret to index log files
Hi everyone,

I've been exploring using ferret for indexing large amounts of production log files. Right now we have a homemade system for searching through the logs that involves specifying a date/time range and then grepping through the relevant files. This can take a long time.

My initial tests (on 2 GB of log files) have been promising. I've taken two separate approaches:

The first is loading each line in each log file as a "document". The plus side to this is that doing a search will get you individual log lines as the results, which is what I want. The downside is that indexing takes a very long time and the index size is very large, even when not storing the contents of the lines. This approach is not viable for indexing all of our logs.

The second approach is indexing the log files as documents. This is relatively fast, 211 sec for 2 GB of logs, and the index size is a nice 12% of the sample size. The downside is that after figuring out which files match your search terms, you have to crawl through each "hit" document to find the relevant lines.

For the sake of full disclosure, at any given time we keep roughly 30 days of logs, which comes to about 800 GB of log files. Each file is roughly 15 MB in size before it gets rotated.

Has anyone else tackled a problem like this and can offer any ideas on how to go about searching those logs? The best idea I can come up with (that I haven't implemented yet to get real numbers) is to index a certain number of log files by line, like the last 2 days, and then do another set by file (like the last week). This would give fast results for the more recent logs, and you would just have to be patient for the slightly older logs.

Any ideas/help?

Thanks,
Chris
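The tiered idea at the end of this message can be sketched as a small decision function. This is only an illustration: the helper name, the returned symbols, and the third "grep only" tier for logs older than a week are assumptions, and the 2-day and 7-day cut-offs are simply the figures floated above. Nothing here touches Ferret itself.

```ruby
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical helper: pick an indexing strategy for a log file from its age.
# The cut-offs default to the "last 2 days" / "last week" figures in the message.
def index_mode_for(file_mtime, now = Time.now, line_days: 2, file_days: 7)
  age_days = (now - file_mtime) / SECONDS_PER_DAY
  if age_days <= line_days
    :per_line   # one document per log line: large index, but hits are lines
  elsif age_days <= file_days
    :per_file   # one document per file: compact, but hits must be scanned
  else
    :grep_only  # assumed fallback: not indexed, use the existing grep search
  end
end
```

A rotation job could call this for each file and move files between the two indexes as they age, so the expensive per-line index only ever covers the most recent logs.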
Hi Chris,

I've been toying with the idea of a Ferret log indexer for my Linux systems, so this is rather interesting.

Regarding performance of the one ferret document per line approach, you should look into the various tunables. An obvious one is ensuring auto_flush is disabled, but the next most likely is :max_buffered_docs. By default, this is set to flush to the index every 10,000 documents, and your log file lines will be hitting that regularly. Also consider :max_buffer_memory.

As log files will often have lots of unique but "useless" terms (such as the timestamps), I'd recommend pre-parsing your log lines. If it's syslog files you're indexing, parse the timestamp, convert it to 200802221816 format and add that as a separate untokenized field to the index. Cut it down to the maximum accuracy you'll need, as this will reduce the number of unique terms in the index (maybe you'll only ever need to find logs down to the day, not the hour and minute).

Also, disable term vectors, as this will save disk space.

I've also found using a field as the id is slooow, so avoid that (that's usually only something done with the primary key from databases though, so I doubt you're doing it).

Regarding performance and index size for the one ferret document per log file approach: by default, ferret only indexes the first 10,000 terms of each document, so it might only be faster because it's indexing less! Ditto for the index file size :S See the :max_field_length option.

Write your own custom stop words list to skip indexing hugely common words - this will reduce the size of your index.
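The timestamp pre-parsing suggested above could look roughly like the following. It is a minimal sketch in plain Ruby with no Ferret calls; the function name and the :day/:hour/:minute precision keys are made up for illustration. Syslog lines carry no year, so the caller has to supply one, which is an assumption of this sketch.

```ruby
require "date"

# Number of leading digits of YYYYMMDDHHMM to keep for each precision.
# Coarser precision means fewer unique terms in the index.
PRECISION_LENGTHS = { day: 8, hour: 10, minute: 12 }.freeze

# Hypothetical helper: turn the leading syslog timestamp of a line into a
# compact sortable string such as "200802210513". Returns nil if the line
# does not start with a syslog-style timestamp.
def normalize_syslog_timestamp(line, year:, precision: :minute)
  m = line.match(/\A([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}):(\d{2}):\d{2}/)
  return nil unless m

  month = Date::ABBR_MONTHNAMES.index(m[1]) # "Feb" => 2
  return nil unless month

  digits = format("%04d%02d%02d%02d%02d", year, month, m[2].to_i, m[3].to_i, m[4].to_i)
  digits[0, PRECISION_LENGTHS.fetch(precision)]
end
```

The returned string is what would go into the separate untokenized field; because it sorts lexically in date order, it should also suit range queries on that field.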
Consider writing your own Analyzer to do tokenization to reduce the number of unique terms. For example, take the following line from a log file on my system:

    Feb 21 05:13:10 lion named[15722]: unexpected RCODE (SERVFAIL) resolving 'ns1.rfrjqrkfccysqlycevtyz.info/AAAA/IN': 194.168.8.100#53

I'm not sure exactly how the default analyzer would tokenize this, but an ideal list of tokens would probably be:

    lion named unexpected RCODE SERVFAIL resolving ns1.rfrjqrkfccysqlycevtyz.info AAAA IN 194.168.8.100 53

If you still want to stick to one document per log file, you can use the term_vectors to find the offset of the match in the log file - then you just open the log file and jump to that position (store the log filename). It does use a bit more disk space per term indexed, but it's useful!

Also, omitting norms will save 1 byte per field per document too, a huge saving I'm sure you'll agree ;)

    :index => :yes_omit_norms

Um, I think I'm done. The Ferret shortcut book by the Ferret author covers all this stuff - it's cheap and good:

http://www.oreilly.com/catalog/9780596527853/index.html

John.
--
http://johnleach.co.uk
http://www.brightbox.co.uk - UK/EU Ruby on Rails hosting
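As a rough illustration of the tokenization described in this reply, the following plain-Ruby function produces the "ideal" token list for the example line. It is a sketch, not a Ferret Analyzer: the function name and regular expressions are assumptions, and wiring the logic into Ferret's analysis classes is left out.

```ruby
# Leading syslog timestamp, e.g. "Feb 21 05:13:10 ".
SYSLOG_TIMESTAMP = /\A[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+/
# Process id suffix such as "[15722]".
PROCESS_ID = /\[\d+\]/
# Runs of word characters, kept together across dots so that hostnames
# and IP addresses survive as single tokens.
LOG_TOKEN = /[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*/

# Hypothetical helper: drop the timestamp and the first process id,
# then return the remaining tokens in order.
def tokenize_log_line(line)
  line.sub(SYSLOG_TIMESTAMP, "").sub(PROCESS_ID, " ").scan(LOG_TOKEN)
end
```

Dropping the timestamp and process id here is what keeps the number of unique terms down; the timestamp would go into its own field instead of the main text.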
Chris TenHarmsel
2008-Feb-22 20:09 UTC
[Ferret-talk] Experience using ferret to index log files
Note: Sorry if this was double posted, I sent it from the wrong email address before.

Hi John,

Thanks for the tips. Currently I'm using these tunables for my indexer:

    :max_buffer_memory => 204857600,
    :max_buffered_docs => 1000000,
    :merge_factor => 100000,

For some reason, if I set max_buffered_docs to 1000001 or higher, Ferret segfaults, so I'm stuck at that. I wasn't aware that Ferret by default only indexes the first 10000 terms, so I will definitely have to change that for log file-level indexing.

Could you maybe elaborate a little more on a couple of things?

First, I'm not really that knowledgeable about the tokenizing that is happening. I looked through the docs and I think I understand the basics, but I'm not even sure how I would go about doing my own tokenizing to create more meaningful tokens. Is a token basically a thing that can be searched for? So if I had a token of "sometoken" and searched for "some", would it find it? From what I can tell, I would have to subclass the TokenStream class and implement "text=()" to split the input into my "tokens" and then have the "next" method just return them in order, correct?

Secondly, I'm not sure what you mean by looking at the term_vector to find the position. If I do a search and get "Hits" (http://ferret.davebalmain.com/api/classes/Ferret/Search/Hit.html) back, I thought all I got was the doc id and the score. Can you explain a little more on this?

Thanks,

On Fri, Feb 22, 2008 at 6:52 PM, John Leach <john at johnleach.co.uk> wrote:
> [...]
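The tokenizer shape described in this message (set the text, then hand tokens back one at a time from "next") can be sketched in plain Ruby as follows. This class does not subclass anything from Ferret: LogToken is a stand-in for whatever token object Ferret expects, and its field names are assumptions. The offsets are character offsets, which equal byte offsets only for ASCII input.

```ruby
# Stand-in token type (hypothetical): the text plus where it was found.
LogToken = Struct.new(:text, :start_offset, :end_offset)

class LogTokenStream
  # Word characters, kept together across dots (hostnames, IP addresses).
  TOKEN = /[A-Za-z0-9_-]+(?:\.[A-Za-z0-9_-]+)*/

  def initialize(text = "")
    self.text = text
  end

  # Assigning new text restarts tokenization from the beginning.
  def text=(text)
    @text = text
    @position = 0
  end

  # Returns the next token in order, or nil once the input is exhausted.
  def next
    match = TOKEN.match(@text, @position)
    return nil unless match

    @position = match.end(0)
    LogToken.new(match[0], match.begin(0), match.end(0))
  end
end
```

Keeping the start and end offsets on each token is what later makes it possible to jump from a matched term back to its position in the original text.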
Hi Chris,

On Fri, 2008-02-22 at 20:09 +0000, Chris TenHarmsel wrote:

> First, I'm not really that knowledgeable on the tokenizing that is
> happening. I looked through the docs and I think I understand the
> basics, but I'm not even sure how I would go about doing my own
> tokenizing to create more meaningful tokens. Is a token basically a
> thing that can be searched for?

Tokenizing is splitting the input text into words that can be searched for. Sometimes you can just split the text up by whitespace, but I'm thinking that log files might need some specific attention.

> So if I had a token of "sometoken" and searched for "some" would it
> find it?

No. Though if you did a search for "some*", Ferret would search the available tokens (one of which would be sometoken), then do a search on the matching tokens.

You might write a clever tokenizer to recognise that "sometoken" was actually two words without a space and return them as the separate tokens "some" and "token".

> From what I can tell, I would have to subclass the TokenStream class
> and implement "text=()" to split the input into my "tokens" and then
> have the "next" method just return them in order, correct?

I'm not sure off the top of my head, but that's about right. You then need to make an Analyzer class that uses your new tokenizer. I have an example but I've not got time to extract it right now, sorry!

> Secondly, I'm not sure what you mean by looking at the term_vector to
> find the position. If I do a search and get
> "Hits" (http://ferret.davebalmain.com/api/classes/Ferret/Search/Hit.html)
> back, I thought all I got was the doc id and the score. Can you explain
> a little more on this?

The term vectors store the offset of each match in the document (byte position and length) - they're often used for highlighting search matches. I've not actually used them myself - a quick look at the API makes it sound like they're used internally by the highlight method.

You can get to them using some methods on the index_reader, which return TermVector objects:

    index_reader.term_vector(doc_id, field)

http://ferret.davebalmain.com/api/classes/Ferret/Index/TermVector.html

John.
--
http://www.brightbox.co.uk - UK/EU Ruby on Rails Hosting
http://johnleach.co.uk
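Once an offset has been pulled out of a term vector, the "open the log file and jump to that position" step could look something like this. It is a sketch under two assumptions: that the offset is a byte offset into the unmodified log file (that is, the file was indexed as-is), and that lines end in "\n". The function name is made up, and no Ferret calls are involved.

```ruby
# Hypothetical helper: return the full line of the file at `path` that
# contains byte `offset`, or nil if the offset is outside the file.
def line_at_offset(path, offset)
  File.open(path, "rb") do |file|
    return nil if offset.negative? || offset >= file.size

    # Walk backwards to just after the previous newline (or the file start).
    start = offset
    while start.positive?
      file.seek(start - 1)
      break if file.read(1) == "\n"

      start -= 1
    end

    file.seek(start)
    line = file.gets
    line && line.chomp
  end
end
```

With the log filename stored in the index and the offset taken from the term vector, this avoids crawling through the whole "hit" document to find the matching line.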