Witold Waligora via llvm-dev
2017-Feb-13 13:11 UTC
[llvm-dev] (RFC) JumpMaps: switch statement optimization
Hi All, This is our first appearance on llvm-dev so a small introduction is in order. We are a team of two working for good few years now on an out-of-tree embedded back-end. Our primary goal is code size, secondary performance. We've recently implemented an optimization that we believe others could use. It's about time to contribute back. ------------------------------------------------------- JumpMaps: a generalization of JumpTables JumpTables produce fast and small code but have a limitation - they only work if case values are easy to compute from the variable being switch()ed. To overcome this limitation we introduce a {key,value} structure - a JumpMap. Simplifying somewhat, LLVM would currently generate a sequential if-elseif- for small switches and an inline binary-search for large switches, if it can't produce a JumpTable. JumpMaps would instead generate a call followed by a jump-through-register. Pseudo-code example: switch(x) { case 0x6851: {BB1} case 0x1383: {BB2} case 0x2224: {BB3} default: {defaultBB} } Without Jump Maps: if(x==0x6851) { goto BB1 } else if(x==0x1383) { goto BB2 } else if(x==0x2224){ goto BB3 } else{ goto defaultBB } With Jump Maps: jumpmap_0 = { keys = {3, 0x6851, 0x1383, 0x2224} vals = {defaultBB, BB1, BB2, BB3} } addr dst = __jumpmap_find(&jumpmap_0, x) goto dst; On our target Jump Maps produce both smaller and faster code even for quite small switch statements. We believe other architectures would benefit as well, X86 and ARM included: - jumpmap struct has a good chance of being cached more efficiently than a large chunk of inline binary-search code - for large switches branch prediction should have an easier time compared to inline binary-search - embedded architectures (ARM, MSP430) would benefit from smaller code (trading off .text for .rodata) In terms of lowering, JumpMap flow is very similar to that of JumpTables. The lowering is even slightly easier as JumpMaps do not require an extra range-check basic-block - the default case is also handled by __jumpmap_find. Targets need to provide lowering for the map structure and __jumpmap_find function, which would typically become a part of compier-rt library and could be written in hand-crafted asm for extra speed. Questions or comments? Please let me know. Best Regards, Witold Waligóra Myre Laboratories
Friedman, Eli via llvm-dev
2017-Feb-14 00:23 UTC
[llvm-dev] (RFC) JumpMaps: switch statement optimization
On 2/13/2017 5:11 AM, Witold Waligora via llvm-dev wrote:> Hi All, > > This is our first appearance on llvm-dev so a small introduction is in > order. > We are a team of two working for good few years now on an out-of-tree > embedded back-end. Our primary goal is code size, secondary performance. > We've recently implemented an optimization that we believe others could > use. It's about time to contribute back. > > ------------------------------------------------------- > JumpMaps: a generalization of JumpTables > JumpTables produce fast and small code but have a limitation - they only > work if case values are easy to compute from the variable being > switch()ed. To overcome this limitation we introduce a {key,value} > structure - a JumpMap. > > Simplifying somewhat, LLVM would currently generate a sequential > if-elseif- for small switches and an inline binary-search for large > switches, if it can't produce a JumpTable. JumpMaps would instead > generate a call followed by a jump-through-register. > Pseudo-code example: > > switch(x) { > case 0x6851: {BB1} > case 0x1383: {BB2} > case 0x2224: {BB3} > default: {defaultBB} > } > > Without Jump Maps: > if(x==0x6851) { > goto BB1 > } > else if(x==0x1383) { > goto BB2 > } > else if(x==0x2224){ > goto BB3 > } > else{ > goto defaultBB > } > > With Jump Maps: > jumpmap_0 = { > keys = {3, 0x6851, 0x1383, 0x2224} > vals = {defaultBB, BB1, BB2, BB3} > } > addr dst = __jumpmap_find(&jumpmap_0, x) > goto dst; > > On our target Jump Maps produce both smaller and faster code even for > quite small switch statements. We believe other architectures would > benefit as well, X86 and ARM included: > - jumpmap struct has a good chance of being cached more efficiently than > a large chunk of inline binary-search code > - for large switches branch prediction should have an easier time > compared to inline binary-search > - embedded architectures (ARM, MSP430) would benefit from smaller code > (trading off .text for .rodata) > > In terms of lowering, JumpMap flow is very similar to that of > JumpTables. The lowering is even slightly easier as JumpMaps do not > require an extra range-check basic-block - the default case is also > handled by __jumpmap_find. > Targets need to provide lowering for the map structure and > __jumpmap_find function, which would typically become a part of > compier-rt library and could be written in hand-crafted asm for extra speed. > > > Questions or comments? > Please let me know. >Smaller codesize for switches could be interesting for Thumb targets. Some thoughts. 1. For Thumb, the overall size is more important that constants vs. code; the tables will end up in the text segment anyway. 2. How many versions of __jumpmap_find do you need? If you're going for size, you probably want to put 1 or 2 byte relative jump offsets into the table, if possible. You'd also want to shrink the keys if possible: for a small range, we could use 2-byte, or even 1-byte keys. You could also encode deltas into the table, rather than absolute values; the deltas are likely to be smaller than the keys, and you're iterating over the table anyway. 3. Is the linear search a problem? I mean, the linear nature of it doesn't matter for small tables, but eventually you get to a point where it just doesn't make sense. Linearly iterating over 1000 entries in a table is clearly going to be slower than a binary search. Or is the idea that you'd do a binary search down to say, 20 entries, then call __jumpmap_find? 4. How often does this pattern come up, in your experience? -Eli -- Employee of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Joerg Sonnenberger via llvm-dev
2017-Feb-14 00:40 UTC
[llvm-dev] (RFC) JumpMaps: switch statement optimization
On Mon, Feb 13, 2017 at 02:11:24PM +0100, Witold Waligora via llvm-dev wrote:> JumpMaps: a generalization of JumpTables > JumpTables produce fast and small code but have a limitation - they only > work if case values are easy to compute from the variable being > switch()ed. To overcome this limitation we introduce a {key,value} > structure - a JumpMap.We've been discussing supporting either perfect hashing or just a pre-transformation a while ago. That would have a very similar use case, the primary issue is the additional code for checking that the hashed value is actually the expected value. Joerg
Nema, Ashutosh via llvm-dev
2017-Feb-14 09:53 UTC
[llvm-dev] (RFC) JumpMaps: switch statement optimization
Hi Witold, Can you explain what you meant by adding support in compiler_rt & changes in lowering. Did you tried any benchmarks, please share if you have any numbers with performance improvements & code size reduction ? Regards, Ashutosh -----Original Message----- From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Witold Waligora via llvm-dev Sent: Monday, February 13, 2017 6:41 PM To: llvm-dev at lists.llvm.org Subject: [llvm-dev] (RFC) JumpMaps: switch statement optimization Hi All, This is our first appearance on llvm-dev so a small introduction is in order. We are a team of two working for good few years now on an out-of-tree embedded back-end. Our primary goal is code size, secondary performance. We've recently implemented an optimization that we believe others could use. It's about time to contribute back. ------------------------------------------------------- JumpMaps: a generalization of JumpTables JumpTables produce fast and small code but have a limitation - they only work if case values are easy to compute from the variable being switch()ed. To overcome this limitation we introduce a {key,value} structure - a JumpMap. Simplifying somewhat, LLVM would currently generate a sequential if-elseif- for small switches and an inline binary-search for large switches, if it can't produce a JumpTable. JumpMaps would instead generate a call followed by a jump-through-register. Pseudo-code example: switch(x) { case 0x6851: {BB1} case 0x1383: {BB2} case 0x2224: {BB3} default: {defaultBB} } Without Jump Maps: if(x==0x6851) { goto BB1 } else if(x==0x1383) { goto BB2 } else if(x==0x2224){ goto BB3 } else{ goto defaultBB } With Jump Maps: jumpmap_0 = { keys = {3, 0x6851, 0x1383, 0x2224} vals = {defaultBB, BB1, BB2, BB3} } addr dst = __jumpmap_find(&jumpmap_0, x) goto dst; On our target Jump Maps produce both smaller and faster code even for quite small switch statements. We believe other architectures would benefit as well, X86 and ARM included: - jumpmap struct has a good chance of being cached more efficiently than a large chunk of inline binary-search code - for large switches branch prediction should have an easier time compared to inline binary-search - embedded architectures (ARM, MSP430) would benefit from smaller code (trading off .text for .rodata) In terms of lowering, JumpMap flow is very similar to that of JumpTables. The lowering is even slightly easier as JumpMaps do not require an extra range-check basic-block - the default case is also handled by __jumpmap_find. Targets need to provide lowering for the map structure and __jumpmap_find function, which would typically become a part of compier-rt library and could be written in hand-crafted asm for extra speed. Questions or comments? Please let me know. Best Regards, Witold Waligóra Myre Laboratories _______________________________________________ LLVM Developers mailing list llvm-dev at lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Witold Waligora via llvm-dev
2017-Feb-14 13:17 UTC
[llvm-dev] (RFC) JumpMaps: switch statement optimization
Eli,> 2. How many versions of __jumpmap_find do you need?In our out-of-tree implementation we currently encode case values as-is, thus we need 4 versions of __jumpmap_find function (i8,i16,i32,i64). We have an analysis pass that helps us figure out which kinds of maps are actually profitable. For example, if we discover that we only have one i8 jumpmap but many i16 jumpmaps, then we would zero-extend the i8 map to i16 so that we don't use __jumpmap_find_i8 at all and it can be removed.> If you're going for > size, you probably want to put 1 or 2 byte relative jump offsets into > the table, if possible. You'd also want to shrink the keys if possible: > for a small range, we could use 2-byte, or even 1-byte keys. You could > also encode deltas into the table, rather than absolute values; the > deltas are likely to be smaller than the keys, and you're iterating over > the table anyway.Yes, many options are possible. In general, as long as jumpmap_find() function can undo what the compiler did to the map you'd be fine. We plan to keep this opt flexible when upstreaming.> 3. Is the linear search a problem? I mean, the linear nature of it > doesn't matter for small tables, but eventually you get to a point where > it just doesn't make sense. Linearly iterating over 1000 entries in a > table is clearly going to be slower than a binary search. Or is the > idea that you'd do a binary search down to say, 20 entries, then call > __jumpmap_find?Our __jumpmap_find does binary-search internally and it turned out to be both faster and smaller than what compiler would generate otherwise (bittest clustering). There is some concern about how well jumpmaps could integrate with existing switch clustering code for a pure speed Target.> 4. How often does this pattern come up, in your experience?This is highly dependent on how large switches will be profitable to fold as a jumpmap on a particular target. For us the break-even point is at 5 cases which is very small and fires quite often. I can't say for X86/speed, but as code size opt, this fires on any switch statement >= 5 that fails to fold as a JumpTable. Witold
Witold Waligora via llvm-dev
2017-Feb-14 13:28 UTC
[llvm-dev] (RFC) JumpMaps: switch statement optimization
JumpMap lowering is nearly identical to that of JumpTables with the exception of lack of range-check basic-block. We introduce JumpMapInfo structure which follows the same flow as JumpTableInfo and is finally emitted by AsmPrinter. There are many ways a Target may want to encode jumpmaps (deltas, compression, relative vs absolute), so we plan to keep this flexible and target-driven when upstreaming. Witold W dniu 2017-02-14 o 10:53, Nema, Ashutosh pisze:> Hi Witold, > > Can you explain what you meant by adding support in compiler_rt & changes in lowering. > > Did you tried any benchmarks, please share if you have any numbers with performance improvements & code size reduction ? > > Regards, > Ashutosh > > -----Original Message----- > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Witold Waligora via llvm-dev > Sent: Monday, February 13, 2017 6:41 PM > To: llvm-dev at lists.llvm.org > Subject: [llvm-dev] (RFC) JumpMaps: switch statement optimization > > Hi All, > > This is our first appearance on llvm-dev so a small introduction is in order. > We are a team of two working for good few years now on an out-of-tree embedded back-end. Our primary goal is code size, secondary performance. > We've recently implemented an optimization that we believe others could use. It's about time to contribute back. > > ------------------------------------------------------- > JumpMaps: a generalization of JumpTables JumpTables produce fast and small code but have a limitation - they only work if case values are easy to compute from the variable being switch()ed. To overcome this limitation we introduce a {key,value} structure - a JumpMap. > > Simplifying somewhat, LLVM would currently generate a sequential > if-elseif- for small switches and an inline binary-search for large switches, if it can't produce a JumpTable. JumpMaps would instead generate a call followed by a jump-through-register. > Pseudo-code example: > > switch(x) { > case 0x6851: {BB1} > case 0x1383: {BB2} > case 0x2224: {BB3} > default: {defaultBB} > } > > Without Jump Maps: > if(x==0x6851) { > goto BB1 > } > else if(x==0x1383) { > goto BB2 > } > else if(x==0x2224){ > goto BB3 > } > else{ > goto defaultBB > } > > With Jump Maps: > jumpmap_0 = { > keys = {3, 0x6851, 0x1383, 0x2224} > vals = {defaultBB, BB1, BB2, BB3} > } > addr dst = __jumpmap_find(&jumpmap_0, x) > goto dst; > > On our target Jump Maps produce both smaller and faster code even for quite small switch statements. We believe other architectures would benefit as well, X86 and ARM included: > - jumpmap struct has a good chance of being cached more efficiently than a large chunk of inline binary-search code > - for large switches branch prediction should have an easier time compared to inline binary-search > - embedded architectures (ARM, MSP430) would benefit from smaller code (trading off .text for .rodata) > > In terms of lowering, JumpMap flow is very similar to that of JumpTables. The lowering is even slightly easier as JumpMaps do not require an extra range-check basic-block - the default case is also handled by __jumpmap_find. > Targets need to provide lowering for the map structure and __jumpmap_find function, which would typically become a part of compier-rt library and could be written in hand-crafted asm for extra speed. > > > Questions or comments? > Please let me know. > > > Best Regards, > Witold Waligóra > Myre Laboratories > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >