Tozer, Stephen via llvm-dev
2020-Sep-15 17:56 UTC
[llvm-dev] [Debuginfo] Changing llvm.dbg.value and DBG_VALUE to support multiple location operands
> That sounds to me to be the same concept that I am calling r-value vs. l-value. Do you agree, or is there some subtlety that I am missing?I've been assuming that the l-value vs r-value distinction is analogous to C++: an l-value can be written to by the debugger, an r-value cannot. A memory location and a register location are both l-values, while implicit locations (stack value/implicit pointer) are r-values. Directness on the other hand, I've been using to mean "the variable's value is equal to the value computed by the DIExpression". Applied to a DWARF expression this description would only refer to a stack value. LLVM's DIExpressions are different however, because we don't have DW_OP_reg or DW_OP_breg: we simply refer to the register and use context to determine which it should be. The best example of this is from this example[0], expanded on here with the location type: $noreg, (plus_uconst, 8), -> DW_OP_breg7 RSP+8) Memory l-value Indirect 0, (plus_uconst, 8), -> DW_OP_breg7 RSP+8) Memory l-value Indirect $noreg, (plus_uconst, 8, stackval), -> DW_OP_breg7 RSP+8, stackval Stack r-value Direct 0, (plus_uconst, 8, stackval), -> DW_OP_breg7 RSP+8, stackval Stack r-value Direct $noreg, (plus_uconst, 8, deref), -> DW_OP_breg7 RSP+8 Memory l-value Indirect 0, (plus_uconst, 8, deref), -> DW_OP_breg7 RSP+8, deref Memory l-value Indirect $noreg, (plus_uconst, 8, deref, stackval)-> DW_OP_breg7 RSP+8, deref, stackval Stack r-value Direct 0, (plus_uconst, 8, deref, stackval)-> DW_OP_breg7 RSP+8, deref, stackval Stack r-value Direct $noreg, (), -> DW_OP_reg7 RSP Register l-value Direct 0, (), -> DW_OP_breg7 RSP+0 Memory l-value Indirect $noreg, (deref), -> DW_OP_breg7 RSP+0 Memory l-value Indirect 0, (deref), -> DW_OP_breg7 RSP+0, DW_OP_deref Memory l-value Indirect The point of using DW_OP_LLVM_direct is that it supplants the current directness flag, without the redundancy or unintuitive behaviour that the flag does. I believe the only reason that the indirectness flag is necessary right now is to allow register locations to be emitted, by delineating the register location "$rsp, $noreg, ()" -> DW_OP_reg7 RSP" from the memory location "$rsp, 0, () -> DW_OP_breg7 RSP+0". Outside of this case the existing representation is sufficient for all other locations, and ideally the indirectness flag would have no effect (although unfortunately it does). The justification for DW_OP_LLVM_direct rests on the idea that we will generally choose to produce "DW_OP_reg7 RSP" instead of "DW_OP_breg7 RSP, DW_OP_stack_value". It doesn't prevent us from using DW_OP_stack_value instead if we have an exception, but I don't believe there are any. Using DW_OP_LLVM_direct instead of directness and stackval for the table above, we get this: (plus_uconst, 8), -> DW_OP_breg7 RSP+8 Memory l-value Indirect (plus_uconst, 8, LLVM_direct), -> DW_OP_breg7 RSP+8, stackval Stack r-value Direct (plus_uconst, 8, deref), -> DW_OP_breg7 RSP+8, deref Memory l-value Indirect (plus_uconst, 8, deref, LLVM_direct), -> DW_OP_breg7 RSP+8, deref, stackval Stack r-value Direct (), -> DW_OP_breg7 RSP+0 Memory l-value Inirect (LLVM_direct), -> DW_OP_reg7 RSP Register l-value Direct (deref), -> DW_OP_breg7 RSP+0, deref Memory l-value Indirect (deref, LLVM_direct), -> DW_OP_breg7 RSP+0, deref, stackval Stack r-value Direct Two of the examples in this table should be excluded from actual use: the two rows that end with "deref, LLVM_direct" shouldn't be produced within LLVM, because we can cancel the two operators out to give a memory location rather than producing a "deref, stackval" expression. This can be done in LLVM itself through the DIExpression interface, so that we don't hold DIExpressions in an incorrect intermediate state. I'm currently operating under the belief that if a dbg.value can be an l-value, it always should be; if not, then we can use DW_OP_stack_value instead in all cases where we require a given dbg.value to be an r-value. To give an example of why having this could be more useful than just applying stack value, consider a hypothetical "DIExpression optimizer" pass applied to the following code: // `int a` is live... int b = a + 5; int c = b - 5; If both b and c are optimized out and salvaged, then we end up with the following dbg.values: @llvm.dbg.value(i32 %a, !"b", !DIExpression(DW_OP_plus_uconst, 5, DW_OP_LLVM_direct)) @llvm.dbg.value(i32 %a, !"c", !DIExpression(DW_OP_plus_uconst, 5, DW_OP_constu, 5, DW_OP_minus, DW_OP_LLVM_direct)) ; DIExpressions optimized... @llvm.dbg.value(i32 %a, !"b", !DIExpression(DW_OP_plus_uconst, 5, DW_OP_LLVM_direct)) @llvm.dbg.value(i32 %a, !"c", !DIExpression(DW_OP_LLVM_direct)) In this admittedly strange case, we start with b and c as l-values (before they are optimized out), they then become r-values due to optimization, and finally c is a valid l-value again. If we instead applied DW_OP_stack_value when we salvage, then c would not be recovered as an l-value. If we had DW_OP_implicit_ptr instead of DW_OP_LLVM_direct, then the result would be an r-value either way; likewise if we were referencing a memory location, the result would be an l-value regardless of how we modified it. I suspect there may be disagreements over whether c should share an l-value location with a, since this means that a user could write to either c or a, and that doing so would assign to both of them. My personal belief is that even if it seems confusing, we shouldn't arbitrarily restrict write-access to variables on criteria that will not always be clear to a debug user; whether or not to apply such a restriction should be left to the debugger, rather than being baked into the information we produce for it. Personal opinions aside, the other reason I'm taking this approach right now is that it matches LLVM's existing behaviour. If we have the source code: int a = ... int b = a; We will produce the IR: call void @llvm.dbg.value(metadata i32 %a, metadata !13, metadata !DIExpression()), !dbg !15 call void @llvm.dbg.value(metadata i32 %a, metadata !14, metadata !DIExpression()), !dbg !15 This IR will in turn produce register locations for both variables at the same location. Based on that, I believe that the current expected behaviour is that if two source variables map to the same actual location then they should share a DWARF location.> Here is an example of why I think that an optimization pass must have the ability to downgrade an l-value to an r-value: > ... > The optimization eliminated the store of constant 0 to %mem and replaced all loads of %mem in the subsequent block with a constant "i32 0". This means that we need mark the first dbg.value (that would otherwise look like an l-value) as an r-value, because writing a new value to %mem there would not affect the code that is now hardcoded to use a constant 0 value for x.The behaviour seen by the user in this case is: ; BEFORE call %llvm.dbg.declare(%mem, !DILocalVariable("x")) store i32* %mem, i32 %foo ; "x" is set to %foo, and can be written to ... store i32* %mem, i32 0 ; "x" is set to 0, and can be written to ... store i32* %mem, i32 %foo ; "x" is set to %foo, and can be written to ... ; AFTER call %llvm.dbg.value(%mem, !DILocalVariable("x"), !DIExpression(DW_OP_deref)) store i32* %mem, i32 %foo ; "x" is set to %foo, and can be written to ... call %llvm.dbg.value(%mem, !DILocalVariable("x"), !DIExpression(DW_OP_constu 0, DW_OP_stack_value)) ; "x" is set to %foo, and is read-only ... call %llvm.dbg.value(%mem, !DILocalVariable("x"), !DIExpression(DW_OP_deref)) ; "x" is set to its value prior to 0, and can be written to ... I don't think there's any issue with a variable being an r-value at some points in its live range and an l-value at others; in this case I think it's correct that the first dbg.value should be an l-value. Any write to x will not affect the code after the eliminated store, but even without optimizations x would be set to 0 (overriding any debugger assignment) at that point anyway. I do agree that the code produced may be slightly confusing to a user; this code likely maps to something along the lines of: int x = foo; ... x = 0; ... x = foo; If the user breaks just after the first "x = foo" and assigns `x = 5` in the debugger, they will see the correct result for all subsequent uses of x until the next assignment to x. When they step over "x = 0", x has the value 0 (as expected) and it becomes read-only. Finally after stepping over the next "x = foo" they will see that `x == 5`, which might not make a lot of sense when the user is expecting it to be assigned a different value. Even so, this information is a correct representation of the program state. The alternative of making the first dbg.value an r-value is restrictive - if the initial assignment to x is at the top of a large function that the user is debugging, and the other two assignments occur at the very end, it would likely be frustrating to the user that they have no write-access to x throughout the function. Because of this I don't believe that it would be right to make the first dbg.value an r-value; I think again that choosing to apply these restrictions should be left to the debugger. [0] https://bugs.llvm.org/show_bug.cgi?id=41675#c8 -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200915/c0ca235b/attachment-0001.html>
Adrian Prantl via llvm-dev
2020-Sep-15 20:32 UTC
[llvm-dev] [Debuginfo] Changing llvm.dbg.value and DBG_VALUE to support multiple location operands
> On Sep 15, 2020, at 10:56 AM, Tozer, Stephen <stephen.tozer at sony.com> wrote: > > > That sounds to me to be the same concept that I am calling r-value vs. l-value. Do you agree, or is there some subtlety that I am missing? > > I've been assuming that the l-value vs r-value distinction is analogous to C++: an l-value can be written to by the debugger, an r-value cannot.Up to here, I fully agree!> A memory location and a register location are both l-values, while implicit locations (stack value/implicit pointer) are r-values.Here, however I think I need to to do a better job at explaining my thoughts ;-) You are correct to point out that all l-values must be memory or register locations, since they need to be something writeable. An l-value can't be a constant since you can't write to a constant. But — not all memory locations are l-values. If we have a DWARF location list for variable "x" which points to a memory address for the first n instructions and the switches to a constant for the remainder of the scope, the memory address is not guaranteed to be an l-value, because writing the the memory address cannot affect the later part of the function where the variable is a constant.> Directness on the other hand, I've been using to mean "the variable's value is equal to the value computed by the DIExpression".as opposed to "the value computed by the expression is (the address) of the variable's storage"? That makes sense, and I think for "direct" values in your definition it is true that all direct values are r-values.> Applied to a DWARF expression this description would only refer to a stack value. LLVM's DIExpressions are different however, because we don't have DW_OP_reg or DW_OP_breg: we simply refer to the register and use context to determine which it should be. The best example of this is from this example[0], expanded on here with the location type: > > $noreg, (plus_uconst, 8), -> DW_OP_breg7 RSP+8) Memory l-value Indirect > 0, (plus_uconst, 8), -> DW_OP_breg7 RSP+8) Memory l-value Indirect > $noreg, (plus_uconst, 8, stackval), -> DW_OP_breg7 RSP+8, stackval Stack r-value Direct > 0, (plus_uconst, 8, stackval), -> DW_OP_breg7 RSP+8, stackval Stack r-value Direct > $noreg, (plus_uconst, 8, deref), -> DW_OP_breg7 RSP+8 Memory l-value Indirect > 0, (plus_uconst, 8, deref), -> DW_OP_breg7 RSP+8, deref Memory l-value Indirect > $noreg, (plus_uconst, 8, deref, stackval)-> DW_OP_breg7 RSP+8, deref, stackval Stack r-value Direct > 0, (plus_uconst, 8, deref, stackval)-> DW_OP_breg7 RSP+8, deref, stackval Stack r-value Direct > $noreg, (), -> DW_OP_reg7 RSP Register l-value Directis this strictly an r-value?> 0, (), -> DW_OP_breg7 RSP+0 Memory l-value Indirect > $noreg, (deref), -> DW_OP_breg7 RSP+0 Memory l-value Indirect > 0, (deref), -> DW_OP_breg7 RSP+0, DW_OP_deref Memory l-value Indirect > > The point of using DW_OP_LLVM_direct is that it supplants the current directness flag, without the redundancy or unintuitive behaviour that the flag does.Why do we need DW_OP_LLVM_direct when we already have DW_OP_LLVM_stack_value? Can you give an example of something that is definitely not a stack value, but direct?> I believe the only reason that the indirectness flag is necessary right now is to allow register locations to be emitted, by delineating the register location "$rsp, $noreg, ()" -> DW_OP_reg7 RSP" from the memory location "$rsp, 0, () -> DW_OP_breg7 RSP+0". Outside of this case the existing representation is sufficient for all other locations, and ideally the indirectness flag would have no effect (although unfortunately it does). The justification for DW_OP_LLVM_direct rests on the idea that we will generally choose to produce "DW_OP_reg7 RSP" instead of "DW_OP_breg7 RSP, DW_OP_stack_value". It doesn't prevent us from using DW_OP_stack_value instead if we have an exception, but I don't believe there are any. Using DW_OP_LLVM_direct instead of directness and stackval for the table above, we get this: > > (plus_uconst, 8), -> DW_OP_breg7 RSP+8 Memory l-value Indirect > (plus_uconst, 8, LLVM_direct), -> DW_OP_breg7 RSP+8, stackval Stack r-value Direct > (plus_uconst, 8, deref), -> DW_OP_breg7 RSP+8, deref Memory l-value Indirect > (plus_uconst, 8, deref, LLVM_direct), -> DW_OP_breg7 RSP+8, deref, stackval Stack r-value Direct > (), -> DW_OP_breg7 RSP+0 Memory l-value Inirect > (LLVM_direct), -> DW_OP_reg7 RSP Register l-value Direct > (deref), -> DW_OP_breg7 RSP+0, deref Memory l-value Indirect > (deref, LLVM_direct), -> DW_OP_breg7 RSP+0, deref, stackval Stack r-value Direct > > Two of the examples in this table should be excluded from actual use: the two rows that end with "deref, LLVM_direct" shouldn't be produced within LLVM, because we can cancel the two operators out to give a memory location rather than producing a "deref, stackval" expression. This can be done in LLVM itself through the DIExpression interface, so that we don't hold DIExpressions in an incorrect intermediate state. I'm currently operating under the belief that if a dbg.value can be an l-value, it always should be; if not, then we can use DW_OP_stack_value instead in all cases where we require a given dbg.value to be an r-value. > > To give an example of why having this could be more useful than just applying stack value, consider a hypothetical "DIExpression optimizer" pass applied to the following code: > > // `int a` is live... > int b = a + 5; > int c = b - 5; > > If both b and c are optimized out and salvaged, then we end up with the following dbg.values: > > @llvm.dbg.value(i32 %a, !"b", !DIExpression(DW_OP_plus_uconst, 5, DW_OP_LLVM_direct)) > @llvm.dbg.value(i32 %a, !"c", !DIExpression(DW_OP_plus_uconst, 5, DW_OP_constu, 5, DW_OP_minus, DW_OP_LLVM_direct)) > ; DIExpressions optimized... > @llvm.dbg.value(i32 %a, !"b", !DIExpression(DW_OP_plus_uconst, 5, DW_OP_LLVM_direct)) > @llvm.dbg.value(i32 %a, !"c", !DIExpression(DW_OP_LLVM_direct)) > > In this admittedly strange case, we start with b and c as l-values (before they are optimized out), they then become r-values due to optimization, and finally c is a valid l-value again.But is that safe? If we wrote to %a, yes we will modify "c", but wouldn't we also unintentionally modify "a", and "b"? It sounds more like we'd want to have an rvalue in this case.> If we instead applied DW_OP_stack_value when we salvage, then c would not be recovered as an l-value. If we had DW_OP_implicit_ptr instead of DW_OP_LLVM_direct, then the result would be an r-value either way; likewise if we were referencing a memory location, the result would be an l-value regardless of how we modified it. > > I suspect there may be disagreements over whether c should share an l-value location with a, since this means that a user could write to either c or a, and that doing so would assign to both of them. My personal belief is that even if it seems confusing, we shouldn't arbitrarily restrict write-access to variables on criteria that will not always be clear to a debug user; whether or not to apply such a restriction should be left to the debugger, rather than being baked into the information we produce for it.Ah — you've thought about this already :-) This is actually really important to me: My take is that we must always err on the safe side. If the compiler cannot prove that it is safe to write to an l-value (ie., that the semantics are the same as if the write were inserted into the source code at that source location) it must downgrade the l-value to an r-value. That's one thing I'm feeling really strongly about. The reason behind this is that if we allow debug info that is only maybe correct (or correct on some paths, or under some other circumstances that are opaque to the user) — if we allow potentially incorrect debug info to leak into the program, the user can not distinguish between information they can trust and information that is bogus. That means all information that they see in the debugger is potentially bogus. If the debugger is not trustworthy it's worthless. That is feedback I keep getting particularly from kernel developers, and I must say it resonates with me. -- adrian> Personal opinions aside, the other reason I'm taking this approach right now is that it matches LLVM's existing behaviour. If we have the source code: > > int a = ... > int b = a; > > We will produce the IR: > > call void @llvm.dbg.value(metadata i32 %a, metadata !13, metadata !DIExpression()), !dbg !15 > call void @llvm.dbg.value(metadata i32 %a, metadata !14, metadata !DIExpression()), !dbg !15 > > This IR will in turn produce register locations for both variables at the same location. Based on that, I believe that the current expected behaviour is that if two source variables map to the same actual location then they should share a DWARF location. > > > Here is an example of why I think that an optimization pass must have the ability to downgrade an l-value to an r-value: > > ... > > The optimization eliminated the store of constant 0 to %mem and replaced all loads of %mem in the subsequent block with a constant "i32 0". This means that we need mark the first dbg.value (that would otherwise look like an l-value) as an r-value, because writing a new value to %mem there would not affect the code that is now hardcoded to use a constant 0 value for x. > > The behaviour seen by the user in this case is: > > ; BEFORE > call %llvm.dbg.declare(%mem, !DILocalVariable("x")) > store i32* %mem, i32 %foo > ; "x" is set to %foo, and can be written to > ... > store i32* %mem, i32 0 > ; "x" is set to 0, and can be written to > ... > store i32* %mem, i32 %foo > ; "x" is set to %foo, and can be written to > ... > > ; AFTER > call %llvm.dbg.value(%mem, !DILocalVariable("x"), !DIExpression(DW_OP_deref)) > store i32* %mem, i32 %foo > ; "x" is set to %foo, and can be written to > ... > call %llvm.dbg.value(%mem, !DILocalVariable("x"), !DIExpression(DW_OP_constu 0, DW_OP_stack_value)) > ; "x" is set to %foo, and is read-only > ... > call %llvm.dbg.value(%mem, !DILocalVariable("x"), !DIExpression(DW_OP_deref)) > ; "x" is set to its value prior to 0, and can be written to > ... > > I don't think there's any issue with a variable being an r-value at some points in its live range and an l-value at others; in this case I think it's correct that the first dbg.value should be an l-value. Any write tox will not affect the code after the eliminated store, but even without optimizations x would be set to 0 (overriding any debugger assignment) at that point anyway. I do agree that the code produced may be slightly confusing to a user; this code likely maps to something along the lines of: > > int x = foo; > ... > x = 0; > ... > x = foo; > > If the user breaks just after the first "x = foo" and assigns `x = 5` in the debugger, they will see the correct result for all subsequent uses of x until the next assignment to x. When they step over "x = 0", x has the value 0 (as expected) and it becomes read-only. Finally after stepping over the next "x = foo" they will see that `x == 5`, which might not make a lot of sense when the user is expecting it to be assigned a different value. Even so, this information is a correct representation of the program state. > > The alternative of making the first dbg.value an r-value is restrictive - if the initial assignment to x is at the top of a large function that the user is debugging, and the other two assignments occur at the very end, it would likely be frustrating to the user that they have no write-access to x throughout the function. Because of this I don't believe that it would be right to make the first dbg.value an r-value; I think again that choosing to apply these restrictions should be left to the debugger. > > [0] https://bugs.llvm.org/show_bug.cgi?id=41675#c8 <https://bugs.llvm.org/show_bug.cgi?id=41675#c8>-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200915/7e99ec89/attachment.html>
Robinson, Paul via llvm-dev
2020-Sep-15 22:07 UTC
[llvm-dev] [Debuginfo] Changing llvm.dbg.value and DBG_VALUE to support multiple location operands
Hi Adrian & Stephen, One thought here: But — not all memory locations are l-values. If we have a DWARF location list for variable "x" which points to a memory address for the first n instructions and the switches to a constant for the remainder of the scope, the memory address is not guaranteed to be an l-value, because writing the the memory address cannot affect the later part of the function where the variable is a constant. A very interesting point. I believe this is a concept that LLVM does not yet understand? that it needs to consider the l/r-value-ness of the variable throughout the function, before deciding on the form that the location-expressions will take. I would not want to spend a lot of time parsing expressions in order to make this kind of decision. And a second thought here: I suspect there may be disagreements over whether c should share an l-value location with a, since this means that a user could write to either c or a, and that doing so would assign to both of them. My personal belief is that even if it seems confusing, we shouldn't arbitrarily restrict write-access to variables on criteria that will not always be clear to a debug user; whether or not to apply such a restriction should be left to the debugger, rather than being baked into the information we produce for it. Ah — you've thought about this already :-) This is actually really important to me: My take is that we must always err on the safe side. If the compiler cannot prove that it is safe to write to an l-value (ie., that the semantics are the same as if the write were inserted into the source code at that source location) it must downgrade the l-value to an r-value. That's one thing I'm feeling really strongly about. The reason behind this is that if we allow debug info that is only maybe correct (or correct on some paths, or under some other circumstances that are opaque to the user) — if we allow potentially incorrect debug info to leak into the program, the user can not distinguish between information they can trust and information that is bogus. That means all information that they see in the debugger is potentially bogus. If the debugger is not trustworthy it's worthless. That is feedback I keep getting particularly from kernel developers, and I must say it resonates with me. I should say that Stephen and I had a discussion a while back, where I said IMO the compiler should emit debug info that correctly reflects the code as-it-is, and not arbitrarily decide that ‘c’ and/or ‘a’ should not be writeable. I believe I also said this was arguable, but that I felt pretty strongly about it. I have to say, it did not occur to me to consider emitting info that showed *neither* ‘c’ nor ‘a’ to be writeable! This is a kind of protect-the-debugger-from-the-user perspective that I’ve not encountered before. I would actually argue that showing both variables as writeable is in fact correct (truthful), because it reflects the instructions as-produced; if you modify one, you necessarily modify the other. That’s what the code as-produced does. I accept that is not what the code as-originally-written does… and that it is unlikely that the debugger would go to the trouble to notice that two variables share a storage location and caution the user about that unexpected interaction. (It’s hard enough to persuade the compiler to notice…) Given that engineering practicality, and that producing debug info that helps the user without confusing/dismaying the user is a valuable goal, then I can agree that describing both variables as r-values is more helpful. In short, I agree with your conclusion, although not for your reasons. 😊 --paulr -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200915/683d28a4/attachment.html>