Hi,
I decided to implement a sanitizer that tracks the active member, and
reports reads from inactive members.
e.g.:
  struct S {
    union {
      long l;
      char s[42];
      double d;
    };
  };
  void f() {
    S s;
    s.d = 42.0;
    if (s.l > 100) // fire here
      ;
  }
My current implementation is naïve (also possibly wrong), and doesn't
care about "special guarantee" that allows inspecting common initial
sequence. Briefly;
1. in EmitBinaryOperatorLValue, if we are storing value of a union
member, we call a ubsan runtime function (not a handler) that maps
pointer value (of the union) to a tuple/struct (of TypeDescr, active
field index, expression location that has changed the active member,
and array of field names)
2. When a union member is being loaded, a call is emitted to a ubsan
runtime function that both checks for, and reports access to inactive
members
For above example, sanitizer says:
  union-track-active-t1.cpp:11:9: runtime error: active field of union
'S::(anonymous union at union-track-active-t1.cpp:2:3)' is 'd';
access
to field 'l' is undefined
  union-track-active-t1.cpp:10:7: note: active index set to 'd' here
Runtime functions I've added:
  struct StringVec {
    unsigned NumElts;
    const char *Elts[];
  };
  struct UnionStaticData {
    TypeDescriptor *Type;
    StringVec* FieldNames;
  };
extern "C" SANITIZER_INTERFACE_ATTRIBUTE
void __ubsan_union_set_active_field_index(uptr Addr, const
UnionStaticData *Data,
                                            const SourceLocation *ActivationLoc,
                                            unsigned ActiveFieldIndex);
extern "C" SANITIZER_INTERFACE_ATTRIBUTE
void __ubsan_union_check_access(uptr Addr, const SourceLocation *Loc,
                                  unsigned FieldIndex);
For above code, IR looks like:
  // s.d = 42.0;
  store double 4.200000e+01, double* %d, align 8
  // injected by sanitizer
  // __ubsan_union_set_active_field_index(/*union addr*/&s,
  // /*type descr=*/'S::anonymous union',
  // /*field names*/{"l", "s", "d"},
  // /*expr loc=*/"file.cpp:10:7",
  // /*field index*/2);
  %1 = bitcast %union.anon* %0 to i8*
  call void @__ubsan_union_set_active_field_index(i8* %1, i8* bitcast
({ { i16, i16, [56 x i8] }*, { i32, [3 x i8*] }* }* @6 to i8*), i8*
bitcast ({ [26 x i8]*, i32, i32 }* @5 to i8*), i32 2)
  // if (s.l > 100L) ;
  %2 = getelementptr inbounds %struct.S* %s, i32 0, i32 0
  %l = bitcast %union.anon* %2 to i64*
  %3 = bitcast i64* %l to i8*
  call void @__ubsan_union_check_access(i8* %3, i8* bitcast ({ [26 x
i8]*, i32, i32 }* @7 to i8*), i32 0)
  %4 = load i64* %l, align 8
  %cmp = icmp sgt i64 %4, 100
I have a few questions regarding the overall design:
1. Do you think this is a useful check?
2. Where can I store type and field info about the union; some form of
a shadow memory or a simple array/map?
3. Should sanitizer abort immediately or continue upon detection?
4. When/how can I remove entries from ubsan shadow memory when union's
lifetime ends; perhaps in a module pass or at the end of each
function?
It's far from submission quality at the moment, that's why I am
holding back the patch. Before proceeding further, I'd like to get
some feedback as to whether this is a valuable feature, and whether I
am on the right track.
Ismail
On 12/15/2014 10:24 PM, Ismail Pazarbasi wrote:> s.d = 42.0; > if (s.l > 100) // fire hereNote that code like this is frequently used to convert integers to floats so you'll get tons of false positives. Emitting error for accessing differently sized elements of enum may work (but should already be handled by MSan?).> I have a few questions regarding the overall design: > 1. Do you think this is a useful check?That's actually an interesting questions. It could be useful for tagged unions although I believe programmers usually surround them with checking asserts anyway.> 2. Where can I store type and field info about the union; some form of > a shadow memory or a simple array/map?Without shadow it may be unacceptably slow in union-intensive applications. But with shadow, it'll greatly complicate UBSan.> 3. Should sanitizer abort immediately or continue upon detection?AFAIK normally UBSan checks continue after error (but there's a flag to alter this).> 4. When/how can I remove entries from ubsan shadow memory when union's > lifetime ends; perhaps in a module pass or at the end of each > function?Take a look at how ASan does this (it's not easy). -Y
On 12/16/2014 02:43 AM, Yury Gribov wrote:> On 12/15/2014 10:24 PM, Ismail Pazarbasi wrote: >> s.d = 42.0; >> if (s.l > 100) // fire here > > Note that code like this is frequently used to convert integers to > floats so you'll get tons of false positives.True positives. The fix is to use memcpy instead. Emitting error for> accessing differently sized elements of enum may work (but should > already be handled by MSan?). > >> I have a few questions regarding the overall design: >> 1. Do you think this is a useful check? > > That's actually an interesting questions. It could be useful for tagged > unions although I believe programmers usually surround them with > checking asserts anyway.Useful, yes. It will find bugs. I haven't heard anyone clamoring for it though.>> 2. Where can I store type and field info about the union; some form of >> a shadow memory or a simple array/map? > > Without shadow it may be unacceptably slow in union-intensive > applications. But with shadow, it'll greatly complicate UBSan.None of the other checks in UBSan change the ABI. You can freely link ubsan .o files with non-ubsan .o files and the program will still work and ubsan will generate no false positives. With this check, that is not so. Uninstrumented code changing the active member of a union will cause a false positive when we next read it in instrumented code.>> 3. Should sanitizer abort immediately or continue upon detection? > > AFAIK normally UBSan checks continue after error (but there's a flag to > alter this). > >> 4. When/how can I remove entries from ubsan shadow memory when union's >> lifetime ends; perhaps in a module pass or at the end of each >> function? > > Take a look at how ASan does this (it's not easy). > > -Y > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >
Richard Smith
2014-Dec-19  01:26 UTC
[LLVMdev] [cfe-dev] ubsan - active member check for unions
On Mon, Dec 15, 2014 at 11:24 AM, Ismail Pazarbasi < ismail.pazarbasi at gmail.com> wrote:> > Hi, > > I decided to implement a sanitizer that tracks the active member, and > reports reads from inactive members. > > e.g.: > struct S { > union { > long l; > char s[42]; > double d; > }; > }; > void f() { > S s; > s.d = 42.0; > if (s.l > 100) // fire here > ; > } > > My current implementation is naïve (also possibly wrong), and doesn't > care about "special guarantee" that allows inspecting common initial > sequence. Briefly; > 1. in EmitBinaryOperatorLValue, if we are storing value of a union > member, we call a ubsan runtime function (not a handler) that maps > pointer value (of the union) to a tuple/struct (of TypeDescr, active > field index, expression location that has changed the active member, > and array of field names) > 2. When a union member is being loaded, a call is emitted to a ubsan > runtime function that both checks for, and reports access to inactive > members >I do not believe this is correct, and I do not believe it can be made to be correct. Here's why: * In C, there is no restriction on accessing an inactive union member. The only relevant rule is the effective type rule, in C11 6.5/6, that says that you can't store to memory with one type and then load from that same memory with an incompatible type. But beware! C is deeply confused, and C11 footnote 95 says that reading a union member with the wrong effective type will produce "the appropriate part of the object representation of the value [...] reinterpreted as an object representation in the new type", and then refers the reader to another part of the standard which says no such thing. * In C++, the only standard way to change the active member of a union object is by applying placement new to the member (and even that doesn't work in some cases). No-one actually writes code that way, so every implementation in practice allows the active union member to change in various undocumented and non-standard ways. * It is an explicit goal of UBSan to allow some TUs to be built with it enabled and some to be built with it disabled. This check does not and cannot support that, because the active union member / effective type can change in uninstrumented code. So this does not belong in UBSan. However... Implementing a sanitizer for C's effective type rule seems like a good idea. Such a sanitizer would also be correct in C++ (whose rules are strictly less permissive), but not complete. The best way to approach this would be to provide shadow memory indicating the effective type of real memory. In order to keep things simple, you'd probably want to ignore aggregate types and simply track the primitive effective type of memory. Up to compatibility, there are not many such types (bool, char, short, int, long, long long, float, double, long double, and some language extensions), so you can probably get away with a 4 bit representation of the type. Further, if a byte has effective type of (signed/unsigned/plain) char, the adjacent byte must have that type too (due to alignment restrictions), so you can use the same type representation for each pair of bytes. There are likely to be enough spare values that you can separately encode [4 char], [2 char][short], [short][2 char], and [short][short], which would let you use a single macrotype for each 4 byte chunk of memory. That gives only a 12.5% memory overhead, which seems very respectable. For above example, sanitizer says:> union-track-active-t1.cpp:11:9: runtime error: active field of union > 'S::(anonymous union at union-track-active-t1.cpp:2:3)' is 'd'; access > to field 'l' is undefined > union-track-active-t1.cpp:10:7: note: active index set to 'd' here > > Runtime functions I've added: > struct StringVec { > unsigned NumElts; > const char *Elts[]; > }; > struct UnionStaticData { > TypeDescriptor *Type; > StringVec* FieldNames; > }; > > extern "C" SANITIZER_INTERFACE_ATTRIBUTE > void __ubsan_union_set_active_field_index(uptr Addr, const > UnionStaticData *Data, > const SourceLocation > *ActivationLoc, > unsigned ActiveFieldIndex); > > extern "C" SANITIZER_INTERFACE_ATTRIBUTE > void __ubsan_union_check_access(uptr Addr, const SourceLocation *Loc, > unsigned FieldIndex); > > For above code, IR looks like: > // s.d = 42.0; > store double 4.200000e+01, double* %d, align 8 > // injected by sanitizer > // __ubsan_union_set_active_field_index(/*union addr*/&s, > // /*type descr=*/'S::anonymous union', > // /*field names*/{"l", "s", "d"}, > // /*expr loc=*/"file.cpp:10:7", > // /*field index*/2); > %1 = bitcast %union.anon* %0 to i8* > call void @__ubsan_union_set_active_field_index(i8* %1, i8* bitcast > ({ { i16, i16, [56 x i8] }*, { i32, [3 x i8*] }* }* @6 to i8*), i8* > bitcast ({ [26 x i8]*, i32, i32 }* @5 to i8*), i32 2) > > // if (s.l > 100L) ; > %2 = getelementptr inbounds %struct.S* %s, i32 0, i32 0 > %l = bitcast %union.anon* %2 to i64* > %3 = bitcast i64* %l to i8* > call void @__ubsan_union_check_access(i8* %3, i8* bitcast ({ [26 x > i8]*, i32, i32 }* @7 to i8*), i32 0) > %4 = load i64* %l, align 8 > %cmp = icmp sgt i64 %4, 100 > > I have a few questions regarding the overall design: > 1. Do you think this is a useful check? > 2. Where can I store type and field info about the union; some form of > a shadow memory or a simple array/map? > 3. Should sanitizer abort immediately or continue upon detection? > 4. When/how can I remove entries from ubsan shadow memory when union's > lifetime ends; perhaps in a module pass or at the end of each > function? > > It's far from submission quality at the moment, that's why I am > holding back the patch. Before proceeding further, I'd like to get > some feedback as to whether this is a valuable feature, and whether I > am on the right track. > > Ismail > > _______________________________________________ > cfe-dev mailing list > cfe-dev at cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141218/3aee0288/attachment.html>
Ismail Pazarbasi
2014-Dec-23  14:51 UTC
[LLVMdev] [cfe-dev] ubsan - active member check for unions
On Fri, Dec 19, 2014 at 2:26 AM, Richard Smith <richard at metafoo.co.uk> wrote:> On Mon, Dec 15, 2014 at 11:24 AM, Ismail Pazarbasi > <ismail.pazarbasi at gmail.com> wrote: >> >> Hi, >> >> I decided to implement a sanitizer that tracks the active member, and >> reports reads from inactive members. >> >> e.g.: >> struct S { >> union { >> long l; >> char s[42]; >> double d; >> }; >> }; >> void f() { >> S s; >> s.d = 42.0; >> if (s.l > 100) // fire here >> ; >> } >> >> My current implementation is naïve (also possibly wrong), and doesn't >> care about "special guarantee" that allows inspecting common initial >> sequence. Briefly; >> 1. in EmitBinaryOperatorLValue, if we are storing value of a union >> member, we call a ubsan runtime function (not a handler) that maps >> pointer value (of the union) to a tuple/struct (of TypeDescr, active >> field index, expression location that has changed the active member, >> and array of field names) >> 2. When a union member is being loaded, a call is emitted to a ubsan >> runtime function that both checks for, and reports access to inactive >> members > > > I do not believe this is correct, and I do not believe it can be made to be > correct. Here's why: > > * In C, there is no restriction on accessing an inactive union member. The > only relevant rule is the effective type rule, in C11 6.5/6, that says that > you can't store to memory with one type and then load from that same memory > with an incompatible type. But beware! C is deeply confused, and C11 > footnote 95 says that reading a union member with the wrong effective type > will produce "the appropriate part of the object representation of the value > [...] reinterpreted as an object representation in the new type", and then > refers the reader to another part of the standard which says no such thing. > > * In C++, the only standard way to change the active member of a union > object is by applying placement new to the member (and even that doesn't > work in some cases). No-one actually writes code that way, so every > implementation in practice allows the active union member to change in > various undocumented and non-standard ways. > > * It is an explicit goal of UBSan to allow some TUs to be built with it > enabled and some to be built with it disabled. This check does not and > cannot support that, because the active union member / effective type can > change in uninstrumented code. So this does not belong in UBSan.The primary reason I thought ubsan would be the right place is because type mismatch check is implemented in ubsan. I also thought about asking whether ubsan is the right place for this check. According to my interpretation of the discussions in -core reflector, C DR236, and in UB list, reasons of undefined-ness of this case are: a) lifetime of inactive members seems like to end when a member becomes active (C++-only) b) multiple subobjects with different type point to the same storage (C, and C++) For reference (C++ N3797): 9.5/p1: In a union, at most one of the non-static data members can be active at any time, that is, the value of at most one of the non-static data members can be stored in a union at any time. ... The size of a union is sufficient to contain the largest of its non-static data members. Each non-static data member is allocated as if it were the sole member of a struct. Per your message 25690 to -core reflector, given a union: U { int i; float f; } u; both members' lifetimes will begin when storage for `u` is allocated, the union has no active member. u.i = 0; means lifetime of `u.f` will end immediately, per 3.8/p1 [...]the storage which the object occupies is reused[...] So: u.f = 0; is undefined behavior. But: new (&u.f) float(0); is well-defined.> > However... > > Implementing a sanitizer for C's effective type rule seems like a good idea. > Such a sanitizer would also be correct in C++ (whose rules are strictly less > permissive), but not complete. The best way to approach this would be to > provide shadow memory indicating the effective type of real memory. In order > to keep things simple, you'd probably want to ignore aggregate types and > simply track the primitive effective type of memory. > > Up to compatibility, there are not many such types (bool, char, short, int, > long, long long, float, double, long double, and some language extensions), > so you can probably get away with a 4 bit representation of the type. > Further, if a byte has effective type of (signed/unsigned/plain) char, the > adjacent byte must have that type too (due to alignment restrictions), so > you can use the same type representation for each pair of bytes. There are > likely to be enough spare values that you can separately encode [4 char], [2 > char][short], [short][2 char], and [short][short], which would let you use a > single macrotype for each 4 byte chunk of memory. That gives only a 12.5% > memory overhead, which seems very respectable.Thanks for the idea! Which sanitizer do you think fits best for this check? :)> >> For above example, sanitizer says: >> union-track-active-t1.cpp:11:9: runtime error: active field of union >> 'S::(anonymous union at union-track-active-t1.cpp:2:3)' is 'd'; access >> to field 'l' is undefined >> union-track-active-t1.cpp:10:7: note: active index set to 'd' here >> >> Runtime functions I've added: >> struct StringVec { >> unsigned NumElts; >> const char *Elts[]; >> }; >> struct UnionStaticData { >> TypeDescriptor *Type; >> StringVec* FieldNames; >> }; >> >> extern "C" SANITIZER_INTERFACE_ATTRIBUTE >> void __ubsan_union_set_active_field_index(uptr Addr, const >> UnionStaticData *Data, >> const SourceLocation >> *ActivationLoc, >> unsigned ActiveFieldIndex); >> >> extern "C" SANITIZER_INTERFACE_ATTRIBUTE >> void __ubsan_union_check_access(uptr Addr, const SourceLocation *Loc, >> unsigned FieldIndex); >> >> For above code, IR looks like: >> // s.d = 42.0; >> store double 4.200000e+01, double* %d, align 8 >> // injected by sanitizer >> // __ubsan_union_set_active_field_index(/*union addr*/&s, >> // /*type descr=*/'S::anonymous union', >> // /*field names*/{"l", "s", "d"}, >> // /*expr loc=*/"file.cpp:10:7", >> // /*field index*/2); >> %1 = bitcast %union.anon* %0 to i8* >> call void @__ubsan_union_set_active_field_index(i8* %1, i8* bitcast >> ({ { i16, i16, [56 x i8] }*, { i32, [3 x i8*] }* }* @6 to i8*), i8* >> bitcast ({ [26 x i8]*, i32, i32 }* @5 to i8*), i32 2) >> >> // if (s.l > 100L) ; >> %2 = getelementptr inbounds %struct.S* %s, i32 0, i32 0 >> %l = bitcast %union.anon* %2 to i64* >> %3 = bitcast i64* %l to i8* >> call void @__ubsan_union_check_access(i8* %3, i8* bitcast ({ [26 x >> i8]*, i32, i32 }* @7 to i8*), i32 0) >> %4 = load i64* %l, align 8 >> %cmp = icmp sgt i64 %4, 100 >> >> I have a few questions regarding the overall design: >> 1. Do you think this is a useful check? >> 2. Where can I store type and field info about the union; some form of >> a shadow memory or a simple array/map? >> 3. Should sanitizer abort immediately or continue upon detection? >> 4. When/how can I remove entries from ubsan shadow memory when union's >> lifetime ends; perhaps in a module pass or at the end of each >> function? >> >> It's far from submission quality at the moment, that's why I am >> holding back the patch. Before proceeding further, I'd like to get >> some feedback as to whether this is a valuable feature, and whether I >> am on the right track. >> >> Ismail >> >> _______________________________________________ >> cfe-dev mailing list >> cfe-dev at cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev