Need scripting help to sort a list and print all the duplicate lines.

My data looks something like this:

host6:dev406mum.dd.mum.test.com:22:11:11:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host17:dev258mum.dd.mum.test.com:31:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:45:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no

I need to sort this list and print all the lines where column 3 has a duplicate entry: the whole line should be printed whenever its column 3 value occurs more than once. I tried using a combination of "sort" and "uniq" but was not successful.
I wonder if you can do this in two steps:

1. Parse out the column-3 values that occur more than once into a file.
2. Run a second pass over the data to print every line whose third column matches one of the values identified.

I don't know how to do this in a shell script.
I would write a simple Java program to do it.
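For reference, the two steps map directly onto standard tools. A rough sketch, assuming the data sits in data.txt (a hypothetical filename):

# step 1: collect the column-3 values that occur more than once
cut -d: -f3 data.txt | sort | uniq -d > dupvals.txt

# step 2: print every line whose column 3 is in that list
awk -F: 'NR == FNR { dup[$1]; next } $3 in dup' dupvals.txt data.txt

Pipe the result through "sort -t: -k3,3" if the output should also be sorted on column 3.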
Neil
--
Neil Aggarwal, (281)846-8957, http://www.JAMMConsulting.com
CentOS 5.4 KVM VPS $55/mo, no setup fee, no contract, dedicated 64bit CPU
1GB dedicated RAM, 40GB RAID storage, 500GB/mo premium BW, Zero downtime
_____
From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On Behalf Of Truejack
Sent: Wednesday, October 28, 2009 12:10 PM
To: centos at centos.org
Subject: [CentOS] Scripting help please....
Need scripting help to sort out a list and list all the duplicate lines.
[...]
I need to sort this list and print all the lines where column 3 has a duplicate entry.
I tried using a combination of "sort" and "uniq" but was not successful.
2009/10/28 Neil Aggarwal <neil at jammconsulting.com>:
> I don't know how to do this in a shell script.

Could be a job for awk. Bit too busy at work to look into it further at the moment, though.

Ben
> From: Truejack <truejack at gmail.com>
> To: centos at centos.org
> Sent: Wed, October 28, 2009 6:09:41 PM
> Subject: [CentOS] Scripting help please....
>
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.

A quick and dirty example (only prints the extra duplicate lines, not the first occurrence of each):

awk -F: '{ v[$3] = v[$3] + 1; if (v[$3] > 1) print $0; }' datafile

JD
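If the first line of each duplicate group should come out as well, a two-pass variant of the same counting idea works (the file is read twice; datafile as above):

# pass 1 counts each column-3 value; pass 2 prints lines whose count exceeds 1
awk -F: 'NR == FNR { count[$3]++; next } count[$3] > 1' datafile datafile

Output stays in input order; pipe through "sort -t: -k3,3" if sorted output is wanted.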
On 2009-10-28 18:09, Truejack wrote:
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.

Long time ago (when I was still young and beautiful), having also run into the limitations of "uniq", I wrote a small program in C to do these kinds of things. It is designed to handle record-oriented data in groups, similar to uniq. The primary purpose was as a preprocessor for awk/perl, but simple things like this are built in.

You can find it here: ftp://ftp.xplanation.com/utils/by-src.zip

Unpack, make, and copy the program "by" somewhere in your PATH. Then, to solve your problem, do:

sort -t: -k 3 InputFile | by -F: -f3 -D

This sorts the input on field 3 (fields separated by colons) and outputs all lines that are duplicates according to field 3 (-D). The program can do more as well; a little tutorial is included in the zip.

--
Paul Bijnens, Xplanation Technology Services        Tel +32 16 397.525
Interleuvenlaan 86, B-3001 Leuven, BELGIUM          Fax +32 16 397.552
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, ~., *
* stop, end, ^]c, +++ ATH, disconnect, halt, abort, hangup, KJOB,     *
* ^X^X, :D::D, kill -9 1, kill -1 $$, shutdown, init 0, Alt-F4,       *
* Alt-f-e, Ctrl-Alt-Del, Alt-SysRq-reisub, Stop-A, AltGr-NumLock, ... *
* ... "Are you sure?" ... YES ... Phew ... I'm out                    *
***********************************************************************
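In case that ftp link is no longer reachable, a rough coreutils/awk equivalent of the "sort | by -F: -f3 -D" pipeline, assuming the data sits in data.txt (a hypothetical filename):

sort -t: -k3,3 data.txt | awk -F: '
  $3 == last {                                  # same key as the previous line
    if (held != "") { print held; held = "" }   # emit the held first line of the group
    print                                       # emit the current duplicate
    next
  }
  { last = $3; held = $0 }                      # new key: hold the line for now
'

Like the two-pass awk above, this prints every line of each duplicate group, including the first, but in a single pass over sorted input.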
I think it can be optimized, and if the programming language doesn't matter:
#!/usr/bin/python
filename = "test.txt"   # renamed from "file" to avoid shadowing the builtin
fl = open(filename, 'r')
toParse = fl.readlines()
fl.close()

duplicates = []   # column-3 values that occur more than once
seen = []         # column-3 values encountered so far

# First pass: work out which column-3 values are duplicated.
for ln in toParse:
    target = ln.strip().split(':')[2]
    if target in seen:
        if target not in duplicates:
            duplicates.append(target)
    else:
        seen.append(target)

# Second pass: print every line whose column 3 is duplicated.
for ln in toParse:
    ln = ln.strip()
    if ln.split(':')[2] in duplicates:
        print(ln)
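For reference, the duplicated column-3 values in the sample data are 36 and 41, so any correct solution should emit these six lines (here in input order):

host7:dev258mum.dd.mum.test.com:36:17:19:no
host7:dev258mum.dd.mum.test.com:36:17:19:no
host12:dev258mum.dd.mum.test.com:41:17:19:no
host2:dev258mum.dd.mum.test.com:36:17:19:no
host4:dev258mum.dd.mum.test.com:41:17:19:no
host4:dev258mum.dd.mum.test.com:36:17:19:no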
On Wed, Oct 28, 2009 at 7:09 PM, Truejack <truejack at gmail.com> wrote:
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]
> I need to sort this list and print all the lines where column 3 has a duplicate entry.
> I tried using a combination of "sort" and "uniq" but was not successful.

list.awk:

BEGIN { FS = ":"; }
{
    if ($3 == last) { print $0; }
    last = $3;
}

Run it against input sorted on field 3 (a plain "sort" orders by the whole line, so equal column-3 values would not necessarily end up adjacent):

sort -t: -k3,3 <file> | awk -f list.awk

Like JD's one-liner, this prints only the second and later lines of each duplicate group.

mark

"*how* long an awk script would you like?"
On Wed, Oct 28, 2009 at 10:39:41PM +0530, Truejack wrote:
> Need scripting help to sort out a list and list all the duplicate lines.
> [...]

A key to your answer is the --all-repeated option of uniq, applied to a sorted file.

I call this "find-duplicates" -- this post makes it GPL:

#! /bin/bash
#SIZER=' -size +10240k'
SIZER=' -size +0'
#SIZER=""
DIRLIST=". "
# md5sum every file, sort by hash, then keep only lines whose hash repeats
find $DIRLIST -type f $SIZER -print0 | xargs -0 md5sum |\
    sort > /tmp/looking4duplicates
tput bel; sleep 2
cat /tmp/looking4duplicates | uniq --check-chars=32 --all-repeated=prepend | less
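Applied to the original colon-separated data, the same --all-repeated idea needs the key moved to the front, since uniq can only skip whitespace-separated fields or compare a fixed-width prefix. A rough sketch, assuming column 3 never exceeds 10 characters and the data sits in data.txt (a hypothetical filename):

# prepend column 3 as a fixed-width key, group on it, then strip the key again
awk -F: '{ printf "%-10s%s\n", $3, $0 }' data.txt \
    | sort \
    | uniq --all-repeated=none --check-chars=10 \
    | cut -c11-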