Hi all,
I have been looking at how to convert by-reference captures into by-copy
captures for captured statements and possibly C++ lambdas, and am looking for
some feedback on my approach. The motivation for trying to use copy captures is
to avoid unnecessary loads that are otherwise required inside the outlined
function. This can be important when the outlined function represents the body
of a loop and cannot be inlined, as with cilk_for or a library-based
parallel-for built on lambdas.
I have been prototyping an LLVM IR pass that can move loads out of a captured
statement body when possible. The approach:
Use named metadata to find the captured statement helpers (or lambda functions),
as well as the "kind" of captured region they represent.
e.g.
!capturedstmt.helper = !{!0, !1}
!0 = metadata !{<function1>, metadata !"cilk_for"}
!1 = metadata !{<function2>, metadata !"default"}
For each field in the implicit capture-struct parameter, determine whether it
can be "promoted" to a by-copy capture. This involves
1) checking the type of the field: currently only pointers to pointers, or
pointers to primitive types no larger than a pointer, can be promoted. This
skips types that might require a copy constructor, and ensures that the
promoted values are cheap to pass by value.
2) looking at the uses of the field in the helper: if the field is used in any
operation other than a load, then assume it cannot be promoted
3) looking at the uses of the field in the call-site(s): if the pointer stored
in the field may also be passed into the helper in another way, then it cannot
be promoted. I have not implemented anything for this, but I imagine there are
existing passes that would be useful here, such as alias analysis. For captured
statements, there is only one call-site to worry about, but in lambdas all
call-sites would need to be considered.
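To make the three checks concrete, here is a minimal sketch in Python. It is a toy model only, not LLVM API: the names CaptureField, pointee_is_cheap and can_promote, the 8-byte pointer size, and the precomputed may_alias_other_arg flag (standing in for the call-site analysis I have not written) are all assumptions made for illustration.

```python
from dataclasses import dataclass

POINTER_SIZE = 8  # assumption: 64-bit target, size in bytes

# Sizes in bytes of the primitive types this toy model knows about.
PRIMITIVE_SIZES = {"i8": 1, "i16": 2, "i32": 4, "i64": 8, "float": 4, "double": 8}


@dataclass
class CaptureField:
    ir_type: str        # field type as written in the IR, e.g. "i32*"
    helper_uses: list   # opcodes of instructions using the captured pointer
    may_alias_other_arg: bool = False  # hypothetical call-site analysis result


def pointee_is_cheap(ir_type: str) -> bool:
    """Check 1: pointer to pointer, or pointer to a small primitive."""
    if not ir_type.endswith("*"):
        return False  # not a by-reference capture at all
    pointee = ir_type[:-1]
    if pointee.endswith("*"):
        return True  # pointer to pointer
    # Unknown types (structs, classes, ...) are treated as too large.
    return PRIMITIVE_SIZES.get(pointee, POINTER_SIZE + 1) <= POINTER_SIZE


def can_promote(f: CaptureField) -> bool:
    only_loads = all(op == "load" for op in f.helper_uses)  # check 2
    no_other_path = not f.may_alias_other_arg               # check 3
    return pointee_is_cheap(f.ir_type) and only_loads and no_other_path
```

For example, can_promote(CaptureField("i32*", ["load", "load"])) is accepted, while the same field with a "store" among its uses is rejected, since the helper might then write through the pointer.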
If any fields can be promoted, clone the original function with a new
capture struct parameter, e.g. {i32**, i32*} -> {i32*, i32}. Inside the cloned
function, replace each load through the original pointer field with the value
now stored directly in the field. The call-site is updated to call the new
function and to load each promoted value before storing it into the capture
struct. These added loads may be removed by later optimizations.
e.g.
%a = alloca i32
%context = alloca {i32*}
%field = getelementptr inbounds {i32*}* %context, i32 0, i32 0
store i32* %a, i32** %field
call void @__captured_stmt_helper({i32*}* %context)

define void @__captured_stmt_helper({i32*}* %context) {
  %field = getelementptr inbounds {i32*}* %context, i32 0, i32 0
  %load.field = load i32** %field
  %a = load i32* %load.field
  ...
}
Becomes something like
%a = alloca i32
%context = alloca {i32}
%field = getelementptr inbounds {i32}* %context, i32 0, i32 0
%load.a = load i32* %a
store i32 %load.a, i32* %field
call void @__captured_stmt_helper_new({i32}* %context)

define void @__captured_stmt_helper_new({i32}* %context) {
  %field = getelementptr inbounds {i32}* %context, i32 0, i32 0
  %a = load i32* %field
  ...
}
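As a rough illustration of the capture-struct type change, here is a small Python sketch that strips one pointer level from each promoted field of a literal struct type. It works on type strings only and is not how the prototype pass is written; the function name promote_struct_type and the set-of-indices interface are made up for this example, and it does not handle nested struct types that contain commas.

```python
def promote_struct_type(struct_type: str, promotable: set) -> str:
    """Strip one pointer level from each promoted field of a literal struct."""
    body = struct_type.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a literal struct type such as '{i32**, i32*}'")
    # Assumption: flat struct, so splitting on commas separates the fields.
    fields = [f.strip() for f in body[1:-1].split(",")]
    new_fields = []
    for index, ty in enumerate(fields):
        if index in promotable:
            if not ty.endswith("*"):
                raise ValueError(f"field {index} ({ty}) is not a pointer")
            ty = ty[:-1]  # by-reference capture becomes by-copy
        new_fields.append(ty)
    return "{" + ", ".join(new_fields) + "}"
```

With both fields promotable, promote_struct_type("{i32**, i32*}", {0, 1}) gives "{i32*, i32}", and the single-field case "{i32*}" with field 0 promoted gives "{i32}".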
I'd love to hear any feedback about this approach, since I'm still not sure
whether this would be better done on Clang's AST instead. Thanks,
Ben