Hi,

My grep regex-fu is not very good and googling is getting me nowhere, so hopefully someone is kind enough to give me some pointers.

Goal: grep the (non-.dbg) filenames and versions from an FTP directory listing and a raw HTML file:

$ wget --no-remove-listing -O ftp-index.txt ftp://127.0.0.1/test/
$ wget --no-remove-listing -O index.html http://127.0.0.1/test/

The relevant parts of the files above (the first part is the FTP listing, the second part is the HTML file; both were copied to test_regex.txt) are:

2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.dbg.tgz">bar-4.5.6.i686.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.tgz">bar-4.5.6.i686.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.dbg.tgz">bar-4.5.6.x86_64.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.tgz">bar-4.5.6.x86_64.tgz</a> (5551274 bytes)

<tr><td><a href="foo-bar-1.2.3+1.2.3.tar.gz">foo-bar-1.2.3+1.2.3.tar.gz</td></tr>

This is what I have now (improvements most welcome):

$ egrep -o ">([A-Za-z_-]+)([[:digit:]]{1,3}(\.[[:digit:]]{1,3})*).+(.|t)gz" ./test_regex.txt | grep -v ".dbg" | tr -d '>'

Output:

foo-bar-1.2.3+1.2.3.tar.gz
bar-4.5.6.i686.tgz
bar-4.5.6.x86_64.tgz

So far so good, but now I also want to get the version numbers, which I can't figure out. Does anyone have a pointer on how to get the version number from these filenames (1.2.3+1.2.3 and 4.5.6)?

Thanks!
Patrick

On Sat, Mar 5, 2011 at 5:13 PM, Patrick Lists <centos-list at puzzled.xs4all.nl> wrote:
> So far so good but now I also want to get the version numbers which I
> can't figure out. Anyone have a pointer how to get the version number
> from these filenames (1.2.3+1.2.3 and 4.5.6)?

Separate the ".i686.tgz" part from the version with something like a '-' or '_', not a dot. And, if possible, be consistent about using .tar.gz instead of mixing .tar.gz and .tgz.

Hello,

In my opinion, grep is not powerful enough to achieve what you want. It would be preferable to use one of the (old but powerful) tools such as sed or awk, or even better, perl.

What you actually need is a tool that provides capture buffers (this is perl jargon; sed calls them "back references"), so that you can capture the string you want to extract rather than trying to build up a positive matching regex for it. The string boundaries seem easy enough to describe with regexes.

Regards
---
Robert GRASSO - System engineer
CEDRAT S.A.
15 Chemin de Malacher - Inovallée - 38246 MEYLAN cedex - FRANCE
Phone: +33 (0)4 76 90 50 45 - Fax: +33 (0)4 56 38 08 30
mailto:robert.grasso at cedrat.com - http://www.cedrat.com

> -----Original Message-----
> From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On behalf of Patrick Lists
> Sent: 5 March 2011 23:14
> To: CentOS mailing list
> Subject: [CentOS] OT: grep regex pointer appreciated
>
> Hi,
>
> My grep regex foo is not very good and googling is getting me nowhere so
> hopefully someone is kind enough to give me some pointers.
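
The capture-buffer approach suggested above could look roughly like the following sed sketch. It is an illustration under assumptions, not something posted in the thread: the function name extract_versions is hypothetical, it assumes a sed that accepts -E (extended regexes), that each file name sits between a '>' and a '<' and ends in "gz", and that a version is dot-separated digits with an optional "+digits" part. Lines containing ".dbg." are dropped first; then the whole file name is captured in \1 and the version in \2.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print "<filename> <version>"
# for every non-.dbg archive link found in the given files or on stdin.
extract_versions() {
    # -n : print only lines where the substitution succeeded (p flag).
    # -E : extended regexes, so groups are written ( ) instead of \( \).
    # 1st expression: skip debug packages such as bar-4.5.6.i686.dbg.tgz.
    # 2nd expression: \1 = whole file name, \2 = version, i.e. digits
    #                 separated by dots with an optional "+digits..." part.
    sed -nE \
        -e '/\.dbg\./d' \
        -e 's/.*>([A-Za-z_-]+-([0-9]+(\.[0-9]+)*(\+[0-9]+(\.[0-9]+)*)?)\.[^<>]*gz)<.*/\1 \2/p' \
        "$@"
}

# Sample input taken from the listing in the original post.
extract_versions <<'EOF'
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.dbg.tgz">bar-4.5.6.i686.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.i686.tgz">bar-4.5.6.i686.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.dbg.tgz">bar-4.5.6.x86_64.dbg.tgz</a> (5551274 bytes)
2011 Jan 28 21:25 File <a href="ftp://127.0.0.1/bar-4.5.6.x86_64.tgz">bar-4.5.6.x86_64.tgz</a> (5551274 bytes)
<tr><td><a href="foo-bar-1.2.3+1.2.3.tar.gz">foo-bar-1.2.3+1.2.3.tar.gz</td></tr>
EOF
```

With the sample lines this should print one "filename version" pair per line, for example "bar-4.5.6.i686.tgz 4.5.6" and "foo-bar-1.2.3+1.2.3.tar.gz 1.2.3+1.2.3". The function can also be pointed at a file, as in extract_versions ./test_regex.txt. The version pattern only stops cleanly because the architecture part (i686, x86_64) starts with a letter; a name with a purely numeric component after the version would need a stricter separator.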