It has long been known that the C specification of *scanf() leaves behavior undefined for things like int i; sscanf("9999999999999999", "%i", &i); C11 7.21.6.2 P12 "Matches an optionally signed integer, whose format is the same as expected for the subject sequence of the strtol function with the value 0 for the base argument." C11 7.21.6.2 P10 "If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined." as there is an overflow when consuming the input which matches the strtol subject sequence but does not fit in the width of an int. On my Linux system, 'man sscanf' mentions that ERANGE might be set in such a case, but neither C nor POSIX actually requires this behavior; other likely behaviors is storing the value mod 2^32 into i, or storing INT_MAX into i, or ... This is annoying - the only safe way to parse integers from untrustworthy sources, where overflow MUST be detected, is to manually open-code strtol() calls, which can get quite lengthy in comparison to the concise representations possible with *scanf. Would glibc be willing to consider a GNU extension to add an optional flag character between '%' and the various numeric conversion specifiers (both integral based on strto*l, and floating point based on strtod), where we could force *scanf to treat numeric overflow as a matching failure, rather than undefined behavior? Or even a second flag to request that printf stop consuming characters if the next character in input would cause overflow in the current specifier, leaving that character to instead be matched to the remainder of the format string? Let's suppose for arguments that we add '^' as a request to force overflow to be a matching error. Then sscanf("9999999999999999", "%^i", &i) would be well-specified to return 0, rather than returning 1 with an unknown value assigned into i or any other behavior that other libc do with the undefined behavior when the ^ is not present. And if glibc likes the idea of such an extension, and we see an uptick in applications actually using it, I'd also be happy to champion the addition of such an extension in POSIX (but the POSIX folks will definitely want to see existing practice first - both an implementation and applications that use that implementation). The libguestfs suite of programs is willing to be an early adopter, if glibc is willing to pursue adding such a safety valve. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org
On Fri, May 22, 2020 at 03:59:14PM -0500, Eric Blake via Libc-alpha wrote:> It has long been known that the C specification of *scanf() leaves > behavior undefined for things like > int i; > sscanf("9999999999999999", "%i", &i); > > C11 7.21.6.2 P12 > "Matches an optionally signed integer, whose format is the same as > expected for the subject sequence of the strtol function with the > value 0 for the base argument." > C11 7.21.6.2 P10 > "If this object does not have an appropriate type, or if the result > of the conversion cannot be represented in the object, the behavior > is undefined." > > as there is an overflow when consuming the input which matches the > strtol subject sequence but does not fit in the width of an int. On > my Linux system, 'man sscanf' mentions that ERANGE might be set in > such a case, but neither C nor POSIX actually requires this > behavior; other likely behaviors is storing the value mod 2^32 into > i, or storing INT_MAX into i, or ... > > This is annoying - the only safe way to parse integers from > untrustworthy sources, where overflow MUST be detected, is to > manually open-code strtol() calls, which can get quite lengthy in > comparison to the concise representations possible with *scanf. > > Would glibc be willing to consider a GNU extension to add an > optional flag character between '%' and the various numeric > conversion specifiers (both integral based on strto*l, and floating > point based on strtod), where we could force *scanf to treat numeric > overflow as a matching failure, rather than undefined behavior? Or > even a second flag to request that printf stop consuming characters > if the next character in input would cause overflow in the current > specifier, leaving that character to instead be matched to the > remainder of the format string?Since conversion specifier forms outside the standard *also* have undefined behavior, I see no advantage to defining that particular undefined case vs just defining the result of the overflowing conversion, unless you're worried the standard might later define a conflicting definition. Neither way is amenable to configure detection (without breaking cross compiling) without also adopting something like my proposal on libc-coord: https://www.openwall.com/lists/libc-coord/2020/04/22/1 BTW there is a portable only-somewhat-hideous way to do this with sscanf: using assignment suppression combined with %n, then strtol, etc. with the offsets sproduced by %n.> Let's suppose for arguments that we add '^' as a request to force > overflow to be a matching error. Then sscanf("9999999999999999", > "%^i", &i) would be well-specified to return 0, rather than > returning 1 with an unknown value assigned into i or any other > behavior that other libc do with the undefined behavior when the ^ > is not present. > > And if glibc likes the idea of such an extension, and we see an > uptick in applications actually using it, I'd also be happy to > champion the addition of such an extension in POSIX (but the POSIX > folks will definitely want to see existing practice first - both an > implementation and applications that use that implementation). The > libguestfs suite of programs is willing to be an early adopter, if > glibc is willing to pursue adding such a safety valve.I think it would be more useful to look for existing practice where the UB blows up in horrible ways, and if there is none (if all implementations behave somewhat reasonably) define the intersection of their behaviors as standard and get rid of the UB here. A new feature will not reliably be usable for decades in portable software, but new documentation of existing universal practice would be immediately usable. Rich
The context to this is that nbdkit uses sscanf to parse simple file formats in various places, eg: https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172 https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/filters/ddrescue/ddrescue.c#L98 We can only do this safely where we can prove that overflow does not matter. In other cases we've had to change sscanf uses to strto* etc which is much more difficult to use correctly. Just look at how much code is required to wrap strto* functions to use them safely: https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/server/public.c#L113-L296 Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-p2v converts physical machines to virtual machines. Boot with a live CD or over the network (PXE) and turn machines into KVM guests. http://libguestfs.org/virt-v2v
On Fri, May 22, 2020 at 08:06:34PM -0700, Paul Eggert wrote:> On 5/22/20 6:16 PM, Rich Felker wrote: > > A new feature > > will not reliably be usable for decades in portable software, but new > > documentation of existing universal practice would be immediately > > usable. > > We could do both. > > Also, we could change glibc's behavior in a simpler way, by not adding a new > flag; but if an integer is out of range, then scan only the initial prefix that > fits, leaving the trailing digits for the rest of the format to scan. This also > conforms to POSIX and is more likely to cause C programs to do the right thing > (i.e., report a failure) than the current behavior does. And with luck perhaps > we could eventually get POSIX to standardize this behavior.I'm not really a fan of stopping on an initial prefix. While UB allows anything, that's contrary to the abstract behavior defined for scanf (matching fields syntactically then value conversion) and does not admit easily sharing a backend with strto*. It's also even *more likely* to break programs that don't expect the behavior than just storing a wrapped or clamped value, since all the remaining fields will misalign with the conversion specifier string. FILE-based (as opposed to string-based) scanf forms inherently do not admit any kind of "recovery" after mismatch without the caller seeking backwards (requiring a seekable stream); many of them are lossy on error. This is mainly a reaon not to use them, not a justification for a weird definition for one special case. I'm pretty sure the real answer here is just "don't use *scanf for that." Rich
On Sat, May 23, 2020 at 08:06:54AM +0100, Richard W.M. Jones via Libc-alpha wrote:> The context to this is that nbdkit uses sscanf to parse simple file > formats in various places, eg: > > https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/plugins/data/format.c#L171-L172 > https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/filters/ddrescue/ddrescue.c#L98 > > We can only do this safely where we can prove that overflow does not > matter.Being that it's specified as UB, it can never "not matter"; arbitrarily bad side effects are permitted. So it's only safe to use scanf where the input is *trusted not to contain overflowing values*. What would be really nice to fix here is getting the standard to specify that overflow has behavior like strto* or at least "unspecified value" rather than "undefined behavior" so that it's safe to let it overflow in cases where you don't care (e.g. you'll be consistency-checking the value afterwards anyway).> In other cases we've had to change sscanf uses to strto* etc > which is much more difficult to use correctly. Just look at how much > code is required to wrap strto* functions to use them safely: > > https://github.com/libguestfs/nbdkit/blob/b23f4f53cf71326f1dba481f64f7f182c20fa3dc/server/public.c#L113-L296Really that much code is just for the sake of verbose error messages, and they're not even accurate. "errno!=0" does not mean "could not parse"; it can also be overflow of a perfectly parseable value. And if you've already caught errno!=0 then end==str is impossible (dead code). The last case, not hitting null, is also likely spurious/wrong; you usually *want* to pick up where strto* stopped, and the next thing the parser does will catch whether the characters after the number are valid there or not. strto* do have some annoying design flaws in error reporting, but they're not really that hard to use right, and much easier than scanf which just *lacks the reporting channels* for the kind of fine-grained error reporting you're insisting on doing here. FWIW this code would also be a lot cleaner as a static inline function rather than a many-line macro. Rich
On Sat, May 23, 2020 at 09:28:26AM -0700, Paul Eggert wrote:> On 5/23/20 9:11 AM, Rich Felker wrote: > > > stopping on an initial prefix ... does not admit easily sharing a backend with strto*. > > I don't see why. If the backend has a "stop scanning on integer overflow" flag > (which it would need to have anyway, to support the proposed behavior), then > *scanf can use the flag and strto* can not use it. > > Anyway, this is not an issue for glibc, which has no such backend.It's relevant because you want to propose this for standardization.> > that's contrary to the abstract behavior defined for scanf > > (matching fields syntactically then value conversion) > > That's not really a problem. The abstract behavior already provides for matching > that is not purely syntactic. For example, string conversion specifiers can > impose length limits on the match, which means the matching does not rely purely > on the syntax of the input. It would be easy to say that integer conversion > specifiers can also impose limits related to integer overflow.Sure that's syntax. It's /[^ ]{1,n}"/. Of course for integers you can define a syntax that matches every non-overflowing value (this is always true for finite matching sets), but that's nothing like how the function is specified and I don't think anyone reasonable would classify non-overflow as a syntactic property.> > It's also even *more > > likely* to break programs that don't expect the behavior than just > > storing a wrapped or clamped value > > That's not true of the code that I looked at (see the URLs earlier in this > thread). That code was pretty carefully written and yet still vulnerable to the > integer-overflow issue.I don't follow. *Any* use of scanf on untrusted input is "vulnerable to the integer-overflow issue" in the sense that overflow is UB. This is not something subtle. If you mean actually using overflowed values in an unsafe way (assuming no ballooning effects of UB, just wrong values), I don't see how it's subtle either. Any value that could be produced via overflow could also be produced via non-overflowing input, and you have to validate data either way.> > I'm pretty sure the real answer here is just "don't use *scanf for > > that." > > Absolutely true right now. We are merely talking about (a) what sort of > implementation behavior is more useful for programs that are currently relying > on undefined behavior, and (b) what might be the cleanest addition to POSIX > later, to help improve this mess so that future programmers can use *scanf > safely in more situations.This is absolutely not "clean" and I am opposed to it. Rich
On Sat, May 23, 2020 at 12:45:01PM -0400, Rich Felker wrote:> I don't follow. *Any* use of scanf on untrusted input is "vulnerable > to the integer-overflow issue" in the sense that overflow is UB. This > is not something subtle.{,s}scanf is a useful, natural way to parse strings, and strto* is a horrible interface with many bear traps. It seems to me scanf could be changed to make it safe for overflow, simply by stopping parsing at the point where the overflow occurs and returning a short count (or the various other ideas suggested already in this thread). Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
Seemingly Similar Threads
- Re: RFC: *scanf vs. overflow
- Re: [PATCH nbdkit v3 2/3] Add new retry filter.
- [PATCH] include/checkpatch: Prefer __scanf to __attribute__((format(scanf, ...)
- [PATCH] include/checkpatch: Prefer __scanf to __attribute__((format(scanf, ...)
- [PATCH nbdkit] ddrescue: Miscellaneous fixes.