Zoltan Frombach
2004-Nov-14 15:40 UTC
Either I do something wrong or there is a regexp bug in sed !!
I'm trying to use sed under FreeBSD 5.3-RELEASE in a new 'netqmail' port I am currently working on. I want to replace a bunch of digits (in plain English: a decimal number) at the beginning of a line in a text file. Here is how the original file looks before I do anything (this file is part of the netqmail-1.05 package, but that is unimportant):

--- file conf-split begins
23
This is the queue subdirectory split.
--- file conf-split ends

Okay, so I try to replace 23 (or whatever number is there!) at the beginning of the first line with, let's say, 199 in this file using sed. I would expect this to work:

    sed -e "s/^[0-9]+/199/" conf-split > conf-split.new

But it doesn't change anything in conf-split.new!! My regexp ^[0-9]+ doesn't match anything! After spending like an hour investigating this, I realized that the + after my bracket expression ( I'm talking about this part here: [0-9]+ ) does not match! If I omit the + and use * instead, I can make my regexp match. So this works - but IMHO it's ugly:

    sed -e "s/^[0-9][0-9]*/199/" conf-split > conf-split.new

It gives this output, which is what I wanted all along:

--- file conf-split.new begins
199
This is the queue subdirectory split.
--- file conf-split.new ends

According to the sed man page, the regexp syntax that is used by sed is documented in the re_format man page. And according to the re_format man page:

"A piece is an atom possibly followed by a single `*', `+', `?', or bound. An atom followed by `*' matches a sequence of 0 or more matches of the atom. An atom followed by `+' matches a sequence of 1 or more matches of the atom. ..."

And the definition of an "atom" is (quoted from the same man page):

"An atom is a regular expression enclosed in `()' (matching a match for the regular expression), an empty set of `()' (matching the null string), a bracket expression (see below) ..."
So either my bracket expression ( [0-9] ) in my first sed command was not recognized as an atom, or if it was recognized as an atom then the + that followed it was not interpreted properly... Can anyone please tell me why?

I believe this is a bug in sed or in the regexp library which sed uses. If it is a regexp library issue, then there is a chance that it affects other programs that use it, as well! At least it can break all programs that use sed regexps, especially ports...

My uname -a is:

    FreeBSD www.xxxxxxxx.com 5.3-RELEASE FreeBSD 5.3-RELEASE #0: Fri Nov 12 01:07:41 PST 2004 xxx@www.xxxxxxxx.com:/usr/obj/usr/src/sys/XXXXXXXX i386

Zoltan
Brandon S. Allbery KF8NH
2004-Nov-14 15:48 UTC
Either I do something wrong or there is a regexp bug in sed !!
On Sun, 2004-11-14 at 18:39, Zoltan Frombach wrote:
> match anything! After spending like an hour investigating this, I realized
> that the + after my bracket expression ( I'm talking about this part here:

Normal.

> According to the sed man page, the regexp syntax that is used by sed is
> documented in the re_format man page. And according to the re_format man
> page: "A piece is an atom possibly followed by a single `*', `+', `?', or

You need to read it more carefully. There are two kinds of regular expressions, "basic" and "extended". sed, ed, and grep speak BRE syntax, whereas awk and egrep speak ERE syntax. + is special only in ERE syntax.

(And then there's GNU, where the difference between BRE and ERE is that some things use a preceding backslash in BRE and don't in ERE, and vice versa, so GNU sed does what you want if you use \+ instead of +.)

-- 
brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com
system administrator [WAY too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon univ. KF8NH
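
For anyone who wants to reproduce the BRE/ERE difference described above, here is a minimal sketch. The sample input is an assumption modelled on the conf-split file from the original post, and the variable names are made up for illustration. It also assumes a sed that accepts -E to select extended syntax (current FreeBSD and GNU sed both do; older or other implementations may not). The \{1,\} bound is the standard BRE spelling of "one or more" and may be a tidier alternative to [0-9][0-9]*.

```shell
#!/bin/sh
# Assumed sample input, modelled on the conf-split file from the thread.
input='23
This is the queue subdirectory split.'

# Helper: run one sed script over the sample and keep only the first line.
first_line() {
    printf '%s\n' "$input" | sed "$@" | head -n 1
}

# BRE (sed's default): '+' is an ordinary character, so nothing matches.
bre_plus=$(first_line -e 's/^[0-9]+/199/')

# BRE workaround from the original post: one digit, then zero or more.
bre_star=$(first_line -e 's/^[0-9][0-9]*/199/')

# BRE bound: \{1,\} means "one or more" in basic syntax.
bre_bound=$(first_line -e 's/^[0-9]\{1,\}/199/')

# ERE (requires a sed with -E): '+' is now a repetition operator.
ere_plus=$(first_line -E -e 's/^[0-9]+/199/')

# In BRE the '+' really is literal: it matches an actual plus sign.
literal_plus=$(printf '2+3\n' | sed -e 's/[0-9]+/X/')

echo "BRE  ^[0-9]+        -> $bre_plus"      # 23 (unchanged)
echo "BRE  ^[0-9][0-9]*   -> $bre_star"      # 199
echo "BRE  ^[0-9]\\{1,\\}   -> $bre_bound"     # 199
echo "ERE  ^[0-9]+        -> $ere_plus"      # 199
echo "BRE  [0-9]+ on 2+3  -> $literal_plus"  # X3
```

The last case shows why the first command in the original post is not a bug: in basic syntax the pattern [0-9]+ means "a digit followed by a literal plus sign", which simply does not occur in conf-split.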