Need scripting help to sort a list and print all the duplicate lines.

My data looks something like this:

host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no

I need to sort this list and print all the lines where column 3 has a duplicate entry: the whole line should be printed whenever its column 3 value occurs more than once. I tried using a combination of "sort" and "uniq" but was not successful.
I wonder if you can do this in two steps:

1. Parse out the column-3 values that occur more than once into a file.
2. Run a second pass over the data to print every line whose third column matches one of the values identified.

I don't know how to do this in a shell script.
I would write a simple Java program to do it.
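For reference, the two steps map directly onto standard tools. A rough sketch, assuming the data sits in data.txt (a hypothetical filename):

# step 1: collect the column-3 values that occur more than once
cut -d: -f3 data.txt | sort | uniq -d > dupvals.txt

# step 2: print every line whose column 3 is in that list
awk -F: 'NR == FNR { dup[$1]; next } $3 in dup' dupvals.txt data.txt

Pipe the result through "sort -t: -k3,3" if the output should also be sorted on column 3.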
Neil
--
Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime
_____
From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On Behalf Of Truejack
Sent: Wednesday, October 28, 2009 12:10 PM
To: centos at centos.org
Subject: [CentOS] Scripting help please....
Need scripting help to sort out a list and list all the duplicate lines.
[...]
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I tried using a combination of "sort" and "uniq" but was not successful.
2009/10/28 Neil Aggarwal <neil at jammconsulting.com>:
> I don't know how to do this in a shell script.

Could be a job for awk. Bit too busy at work to look into it further at the moment, though.

Ben
> From: Truejack <truejack at gmail.com>
> To: centos at centos.org
> Sent: Wed, October 28, 2009 6:09:41 PM
> Subject: [CentOS] Scripting help please....
>
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.

A quick and dirty example (only prints the extra duplicate lines, not the first occurrence of each):

awk -F: '{ v[$3] = v[$3] + 1; if (v[$3] > 1) print $0; }' datafile

JD
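If the first line of each duplicate group should come out as well, a two-pass variant of the same counting idea works (the file is read twice; datafile as above):

# pass 1 counts each column-3 value; pass 2 prints lines whose count exceeds 1
awk -F: 'NR == FNR { count[$3]++; next } count[$3] > 1' datafile datafile

Output stays in input order; pipe through "sort -t: -k3,3" if sorted output is wanted.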
On 2009-10-28 18:09, Truejack wrote:
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.

Long time ago (when I was still young and beautiful), having also run into the limitations of "uniq", I wrote a small program in C to do these kinds of things. It is designed to handle record-oriented data in groups, similar to uniq. The primary purpose was as a preprocessor for awk/perl, but simple things like this are built in.

You can find it here: ftp://ftp.xplanation.com/utils/by-src.zip

Unpack, make, and copy the program "by" somewhere in your PATH. Then, to solve your problem, do:

sort -t: -k 3 InputFile | by -F: -f3 -D

This sorts the input on field 3 (fields separated by colons) and outputs all lines that are duplicates according to field 3 (-D). The program can do more as well; a little tutorial is included in the zip.

--
Paul Bijnens, Xplanation Technology Services        Tel +32 16 397.525
Interleuvenlaan 86, B-3001 Leuven, BELGIUM          Fax +32 16 397.552
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, ~., *
* stop, end, ^]c, +++ ATH, disconnect, halt, abort, hangup, KJOB,     *
* ^X^X, :D::D, kill -9 1, kill -1 $$, shutdown, init 0, Alt-F4,       *
* Alt-f-e, Ctrl-Alt-Del, Alt-SysRq-reisub, Stop-A, AltGr-NumLock, ... *
* ... "Are you sure?" ... YES ... Phew ... I'm out                    *
***********************************************************************
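In case that ftp link is no longer reachable, a rough coreutils/awk equivalent of the "sort | by -F: -f3 -D" pipeline, assuming the data sits in data.txt (a hypothetical filename):

sort -t: -k3,3 data.txt | awk -F: '
  $3 == last {                                  # same key as the previous line
    if (held != "") { print held; held = "" }   # emit the held first line of the group
    print                                       # emit the current duplicate
    next
  }
  { last = $3; held = $0 }                      # new key: hold the line for now
'

Like the two-pass awk above, this prints every line of each duplicate group, including the first, but in a single pass over sorted input.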
I think it can be optimized, and if the programming language doesn't matter:
#!/usr/bin/python
filename = "test.txt"   # renamed from "file" to avoid shadowing the builtin
fl = open(filename, 'r')
toParse = fl.readlines()
fl.close()

duplicates = []   # column-3 values that occur more than once
seen = []         # column-3 values encountered so far

# First pass: work out which column-3 values are duplicated.
for ln in toParse:
    target = ln.strip().split(':')[2]
    if target in seen:
        if target not in duplicates:
            duplicates.append(target)
    else:
        seen.append(target)

# Second pass: print every line whose column 3 is duplicated.
for ln in toParse:
    ln = ln.strip()
    if ln.split(':')[2] in duplicates:
        print(ln)
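For reference, the duplicated column-3 values in the sample data are 36 and 41, so any correct solution should emit these six lines (here in input order):

host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no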
On Wed, Oct 28, 2009 at 7:09 PM, Truejack <truejack at gmail.com> wrote:
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.

list.awk:

BEGIN { FS = ":"; }
{
    if ($3 == last) { print $0; }
    last = $3;
}

Run it against input sorted on field 3 (a plain "sort" orders by the whole line, so equal column-3 values would not necessarily end up adjacent):

sort -t: -k3,3 <file> | awk -f list.awk

Like JD's one-liner, this prints only the second and later lines of each duplicate group.

mark

"*how* long an awk script would you like?"
On Wed, Oct 28, 2009 at 10:39:41PM +0530, Truejack wrote:
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]

A key to your answer is the --all-repeated option of uniq, applied to a sorted file.

I call this "find-duplicates" -- this post makes it GPL:

#! /bin/bash
#SIZER=' -size +10240k'
SIZER=' -size +0'
#SIZER=""
DIRLIST=". "
# md5sum every file, sort by hash, then keep only lines whose hash repeats
find $DIRLIST -type f $SIZER -print0 | xargs -0 md5sum |\
    sort > /tmp/looking4duplicates
tput bel; sleep 2
cat /tmp/looking4duplicates | uniq --check-chars=32 --all-repeated=prepend | less
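Applied to the original colon-separated data, the same --all-repeated idea needs the key moved to the front, since uniq can only skip whitespace-separated fields or compare a fixed-width prefix. A rough sketch, assuming column 3 never exceeds 10 characters and the data sits in data.txt (a hypothetical filename):

# prepend column 3 as a fixed-width key, group on it, then strip the key again
awk -F: '{ printf "%-10s%s\n", $3, $0 }' data.txt \
    | sort \
    | uniq --all-repeated=none --check-chars=10 \
    | cut -c11-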