This snippet of code pulls an array of hostnames from some log files. It
has to parse around 3GB of log files, so I'm keen on making it as
efficient as possible. Can you think of any way to optimize this to run
faster?

HOSTS=()
for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* | sort | uniq); do
    HOSTS+=("$host")
done
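A side note for anyone following along: on bash 4 or later the same loop
can be collapsed with the mapfile builtin. A minimal sketch, just a
restatement of the original pipeline rather than a fix in itself:

# sort -u replaces the separate sort | uniq stages; mapfile reads the
# resulting lines straight into the array, one element per line
mapfile -t HOSTS < <(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" "${TMPDIR}"/* | sort -u)

This also sidesteps the word splitting of the unquoted $(...) in the
original loop.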
Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible. Can you think of any way to optimize this to
> run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
>     HOSTS+=("$host")
> done

For one, do the sort and uniq in one step: sort -u. For another, are the
hostnames always in the same field? For example, if the logs are all
/var/log/messages, I'd do

awk '{print $4;}' | sort -u

mark
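Folding both of those suggestions back into the original loop might look
like the sketch below. The $4 is an assumption -- it only works if the
hostname really does sit in the fourth whitespace-separated field of
every line:

HOSTS=()
# assumes field 4 holds the hostname, as in a typical syslog line;
# adjust $4 to whatever your log format actually uses
for host in $(awk '{print $4}' "${TMPDIR}"/* | sort -u); do
    HOSTS+=("$host")
done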
On 06/28/2012 11:30 AM, Sean Carolan wrote:
> Can you think of any way to optimize this to run faster?
>
> HOSTS=()
> for host in $(grep -h -o "[-\.0-9a-z][-\.0-9a-z]*.com" ${TMPDIR}/* |
> sort | uniq); do
>     HOSTS+=("$host")
> done

You have two major performance problems in this script. First, UTF-8
processing is slow. Second, unanchored wildcard matches are EXTREMELY
SLOW!

You'll get a small performance improvement by using a C locale, *if* you
know that all of your text will be ASCII (hostnames will be). You can
set LANG either for the whole script or just for grep/sort:

$ export LANG=C

or

$ env LANG=C grep ... | env LANG=C sort

I don't think you'll get much from running uniq in a C locale.

You'll get a HUGE performance boost from anchoring your regex with some
known prefix. As it is written, your regex will iterate over every
character in each line. If that character is a member of the first set,
grep will then iterate over all of the following characters until it
finds one that isn't a match, then check for ".com". That second loop
increases the processing load tremendously. If you know the prefix, use
it, and cut it out in a subsequent stage:

$ grep 'host: [-\.0-9a-z][-\.0-9a-z]*.com' ${TMPDIR}/*

or

$ egrep '(host:|hostname:|from:) [-\.0-9a-z][-\.0-9a-z]*.com' \
    ${TMPDIR}/*
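To make the "cut it out in a subsequent stage" step concrete, here is one
possible pipeline. The "host: " prefix is hypothetical -- substitute
whatever fixed text actually precedes hostnames in these logs:

# grep -o emits "host: example.com"; sed then strips the assumed prefix
LANG=C grep -h -o 'host: [-\.0-9a-z][-\.0-9a-z]*.com' "${TMPDIR}"/* \
    | sed 's/^host: //' \
    | LANG=C sort -u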
On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
> This snippet of code pulls an array of hostnames from some log files.
> It has to parse around 3GB of log files, so I'm keen on making it as
> efficient as possible. Can you think of any way to optimize this to
> run faster?

If the key phrase is *as efficient as possible*, then I would say you
want a compiled pattern search. Lex is the tool for this, and for this
job it's not hard. Lex will generate a specific scanner(*) in C or C++
(depending on what flavor of lex you use). It will probably be
table-based. Grep and awk, in contrast, generate their scanners on the
fly, and specifying complicated regular expressions is somewhat clumsier
in grep and awk.

(*) Strictly speaking, you are *scanning*, not *parsing*. Parsing
involves a grammar, and there's no grammar here. If it develops that
these domain names are context sensitive, then you will need a grammar.

The suggestions of others -- setting LANG, cutting a specific field, and
so on -- are all very valuable, and may be *practically* more valuable
than writing a scanner with lex, or could be used in conjunction with a
"proper" scanner.

Note that lex will allow you to use a much better definition of "domain
name" -- such as more than one suffix, names of arbitrary complexity,
names that may violate the RFC, numeric-style names, case sensitivity,
names that match certain special templates like "*.cn" or "goog*.*", and
so on. If you are unfamiliar with lex, note that it is the front end for
many a compiler.

BTW, you could easily incorporate a sorting function in lex that would
eliminate the need for an external sort. This might be done in awk, too,
but in lex it would be more natural. You simply would not enter
duplicates in the tree. When the run is over, traverse the tree and out
come the unique hostnames. I'm assuming you'll have many collisions.
(You could even keep a count of collisions, if you're interested in
which hosts are "popular".) Consider btree(3) or hash(3) for this.

Dave

-- 
Programming is tedious, but it is still fun after all these years.
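To make the lex route concrete, here is a rough sketch: it writes a
minimal flex specification to a file, compiles it, and runs the scanner
over the logs. The file and program names (hostscan.l, hostscan) are made
up, and Dave's in-scanner dedup tree is omitted for brevity, so this
version still leans on an external sort -u:

cat > hostscan.l <<'EOF'
%option noyywrap
%%
[-.0-9a-z]+\.com    printf("%s\n", yytext);  /* emit each candidate hostname */
.|\n                ;                        /* swallow everything else */
%%
int main(void) { return yylex(); }
EOF
flex hostscan.l && cc lex.yy.c -o hostscan
cat "${TMPDIR}"/* | ./hostscan | sort -u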
Woodchuck wrote:
> On Thu, Jun 28, 2012 at 01:30:33PM -0500, Sean Carolan wrote:
>> This snippet of code pulls an array of hostnames from some log files.
>> It has to parse around 3GB of log files, so I'm keen on making it as
>> efficient as possible. Can you think of any way to optimize this to
>> run faster?
>
> If the key phrase is *as efficient as possible*, then I would say
> you want a compiled pattern search. Lex is the tool for this, and

That, to me, would be a Big Deal.

<snip>
> BTW, you could easily incorporate a sorting function in lex that
> would eliminate the need for an external sort. This might be done
> in awk, too, but in lex it would be more natural. You simply would not
<snip>

Hello, mark, wake up. Of course, there's an even easier way, just using
awk:

awk '{if (/[-\.0-9a-z][-\.0-9a-z]*.com/) { hostarray[$9] = 1;}} END { for (i in hostarray) { print i;}}'

This dumps the hostnames into an associative array - that's one whose
indices are strings - so duplicates are eliminated automatically. (Note
that awk doesn't guarantee any particular iteration order for "for (i in
array)", so pipe the output through sort if you need it ordered.)

mark
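One caveat with that one-liner: it keys the array on $9, so it only works
if the hostname always lands in field 9. A variant sketch that keys on
the matched text itself, using POSIX awk's match() with RSTART/RLENGTH so
field positions no longer matter:

awk 'match($0, /[-\.0-9a-z][-\.0-9a-z]*\.com/) {
         hosts[substr($0, RSTART, RLENGTH)] = 1
     }
     END { for (h in hosts) print h }' "${TMPDIR}"/* | sort

The trailing sort is only there because the iteration order of the array
is unspecified.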