This snippet of code pulls an array of hostnames from some log files. It
has to parse around 3GB of log files, so I'm keen on making it as
efficient as possible. Can you think of any way to optimize this to run
faster?

HOSTS=()
for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
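A side note for anyone following along: on bash 4 or later the same loop
can be collapsed with the mapfile builtin. A minimal sketch, just a
restatement of the original pipeline rather than a fix in itself:

# sort -u replaces the separate sort | uniq stages; mapfile reads the
# resulting lines straight into the array, one element per line
mapfile -t HOSTS < <(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" "${TMPDIR}"/* | sort -u)

This also sidesteps the word splitting of the unquoted $(...) in the
original loop.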
Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible. Can you think of any way to optimize this to
> run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
>     HOSTS+=("$host")
> done

For one, do the sort and uniq in one step: sort -u. For another, are the
hostnames always in the same field? For example, if the logs are all
/var/log/messages, I'd do

awk '{print $4;}' | sort -u

mark
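Folding both of those suggestions back into the original loop might look
like the sketch below. The $4 is an assumption -- it only works if the
hostname really does sit in the fourth whitespace-separated field of
every line:

HOSTS=()
# assumes field 4 holds the hostname, as in a typical syslog line;
# adjust $4 to whatever your log format actually uses
for host in $(awk '{print $4}' "${TMPDIR}"/* | sort -u); do
    HOSTS+=("$host")
done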
On 06/28/2012 11:30 AM, Sean Carolan wrote:
> Can you think of any way to optimize this to run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
>     HOSTS+=("$host")
> done

You have two major performance problems in this script. First, UTF-8
processing is slow. Second, unanchored wildcard matches are EXTREMELY
SLOW!

You'll get a small performance improvement by using a C locale, *if* you
know that all of your text will be ASCII (hostnames will be). You can
set LANG either for the whole script or just for grep/sort:

$ export LANG=C

or

$ env LANG=C grep ... | env LANG=C sort

I don't think you'll get much from running uniq in a C locale.

You'll get a HUGE performance boost from anchoring your regex with some
known prefix. As it is written, your regex will iterate over every
character in each line. If that character is a member of the first set,
grep will then iterate over all of the following characters until it
finds one that isn't a match, then check for ".com". That second loop
increases the processing load tremendously. If you know the prefix, use
it, and cut it out in a subsequent stage:

$ grep 'host: [-\.0-9a-z][-\.0-9a-z]*.com' ${TMPDIR}/*

or

$ egrep '(host:|hostname:|from:) [-\.0-9a-z][-\.0-9a-z]*.com' \
    ${TMPDIR}/*
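To make the "cut it out in a subsequent stage" step concrete, here is one
possible pipeline. The "host: " prefix is hypothetical -- substitute
whatever fixed text actually precedes hostnames in these logs:

# grep -o emits "host: example.com"; sed then strips the assumed prefix
LANG=C grep -h -o 'host: [-\.0-9a-z][-\.0-9a-z]*.com' "${TMPDIR}"/* \
    | sed 's/^host: //' \
    | LANG=C sort -u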
On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible. Can you think of any way to optimize this to
> run faster?

If the key phrase is *as efficient as possible*, then I would say you
want a compiled pattern search. Lex is the tool for this, and for this
job it's not hard. Lex will generate a specific scanner(*) in C or C++
(depending on what flavor of lex you use). It will probably be
table-based. Grep and awk, in contrast, generate their scanners on the
fly, and specifying complicated regular expressions is somewhat clumsier
in grep and awk.

(*) Strictly speaking, you are *scanning*, not *parsing*. Parsing
involves a grammar, and there's no grammar here. If it develops that
these domain names are context sensitive, then you will need a grammar.

The suggestions of others -- setting LANG, cutting a specific field, and
so on -- are all very valuable, and may be *practically* more valuable
than writing a scanner with lex, or could be used in conjunction with a
"proper" scanner.

Note that lex will allow you to use a much better definition of "domain
name" -- such as more than one suffix, names of arbitrary complexity,
names that may violate the RFC, numeric-style names, case sensitivity,
names that match certain special templates like "*.cn" or "goog*.*", and
so on. If you are unfamiliar with lex, note that it is the front end for
many a compiler.

BTW, you could easily incorporate a sorting function in lex that would
eliminate the need for an external sort. This might be done in awk, too,
but in lex it would be more natural. You simply would not enter
duplicates in the tree. When the run is over, traverse the tree and out
come the unique hostnames. I'm assuming you'll have many collisions.
(You could even keep a count of collisions, if you're interested in
which hosts are "popular".) Consider btree(3) or hash(3) for this.

Dave

-- 
Programming is tedious, but it is still fun after all these years.
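To make the lex route concrete, here is a rough sketch: it writes a
minimal flex specification to a file, compiles it, and runs the scanner
over the logs. The file and program names (hostscan.l, hostscan) are made
up, and Dave's in-scanner dedup tree is omitted for brevity, so this
version still leans on an external sort -u:

cat > hostscan.l <<'EOF'
%option noyywrap
%%
[-.0-9a-z]+\.com    printf("%s\n", yytext);  /* emit each candidate hostname */
.|\n                ;                        /* swallow everything else */
%%
int main(void) { return yylex(); }
EOF
flex hostscan.l && cc lex.yy.c -o hostscan
cat "${TMPDIR}"/* | ./hostscan | sort -u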
Woodchuck wrote:
> On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
>> This snippet of code pulls an array of hostnames from some log files.
>> It has to parse around 3GB of log files, so I'm keen on making it as
>> efficient as possible. Can you think of any way to optimize this to
>> run faster?
>
> If the key phrase is *as efficient as possible*, then I would say
> you want a compiled pattern search. Lex is the tool for this, and

That, to me, would be a Big Deal.

<snip>
> BTW, you could easily incorporate a sorting function in lex that
> would eliminate the need for an external sort. This might be done
> in awk, too, but in lex it would be more natural. You simply would not
<snip>

Hello, mark, wake up. Of course, there's an even easier way, just using
awk:

awk '{if (/[-\.0-9a-z][-\.0-9a-z]*.com/) { hostarray[$9] = 1;}} END { for (i in hostarray) { print i;}}'

This dumps the hostnames into an associative array - that's one whose
indices are strings - so duplicates are eliminated automatically. (Note
that awk doesn't guarantee any particular iteration order for "for (i in
array)", so pipe the output through sort if you need it ordered.)

mark
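One caveat with that one-liner: it keys the array on $9, so it only works
if the hostname always lands in field 9. A variant sketch that keys on
the matched text itself, using POSIX awk's match() with RSTART/RLENGTH so
field positions no longer matter:

awk 'match($0, /[-\.0-9a-z][-\.0-9a-z]*\.com/) {
         hosts[substr($0, RSTART, RLENGTH)] = 1
     }
     END { for (h in hosts) print h }' "${TMPDIR}"/* | sort

The trailing sort is only there because the iteration order of the array
is unspecified.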