Daniel Way via llvm-dev
2020-Jul-21 07:12 UTC
[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Hello Sjoerd,

Thank you for your response! I was not aware that -Oz is a closer equivalent to GCC's -Os. I tried -Oz when compiling with clang and confirmed that Clang's generated assembly is equivalent to GCC's for the code snippet I posted above.

clang --target=armv6m-none-eabi -Oz -fomit-frame-pointer
memcpy_alt1:
        push    {r4, lr}
        movs    r3, #0
.LBB0_1:
        cmp     r2, r3
        beq     .LBB0_3
        ldrb    r4, [r1, r3]
        strb    r4, [r0, r3]
        adds    r3, r3, #1
        b       .LBB0_1
.LBB0_3:
        pop     {r4, pc}

On the other hand, GCC at -O2 still uses the register-offset load and store instructions, while Clang at -O2 generates the same assembly as at -Os: immediate-offset (offset 0) load/store followed by incrementing the base register addresses.

I have not tried to benchmark the Clang-generated code; it is possible that execution time is bounded by the load and store instructions and memory access latency. From an intuitive view, however, both GCC and Clang generate code with 1 load and 1 store, so if Clang inserts two additional adds instructions, the binary size is larger, execution *could* be slower, and there is no improvement in register utilization over GCC.

I wanted to try a couple of other variants of memcpy-like functions. The https://godbolt.org/z/d7P6rG link includes memcpy_alt2, which copies data from src to dst starting at the high address, and memcpy_silly, which copies src to dst<0-4>. Here is the behavior I have noticed from GCC and Clang.

*memcpy_alt2*
- With -Os, GCC generates just 6 instructions. -O2 generates 7 but reduces branching to once per loop.
- Clang with -Os or -O2 does a decent job of using a common register to offset the load and store bases. It adds some overhead, though, by pre-decrementing the base registers: 10 instructions generated.
- Clang with -Oz is pathological, generating 13 instructions. It uses register-offset load/store instructions, but uses different registers for the offsets.
*memcpy_silly*
- I created this case to see if clang would select load/store with a common offset register once enough load instructions were added.
- Clang with -Os or -O2 does not seem to care about register-offset load/store and prefers to increment each base register address.
- Clang with -Oz performs the optimization I want. It produces the same number of instructions as GCC, and avoids an issue where GCC has to re-read the same value from the stack each time through the loop.

I really think that, when limited to the Thumb1 ISA, register-offset load and store instructions should be used at the -Oz, -Os, and -O2 optimization levels. Explicitly incrementing a register holding the base address seems wasteful, and I cannot see how it improves execution time in the examples I'm investigating. I'd like to know if I'm wrong in assuming that LDR Rd, [Rn, Rm] and LDR Rd, [Rn, #<imm>] have the same execution time, but based on the Cortex-M0+ TRM they should both require 2 clock cycles.

Best regards,

Daniel Way

On Mon, Jul 20, 2020 at 6:15 PM Sjoerd Meijer <Sjoerd.Meijer at arm.com> wrote:
> Hello Daniel,
>
> LLVM and GCC's optimisation levels are not really equivalent. In Clang,
> -Os makes a performance and code-size trade-off. In GCC, -Os is minimising
> code size, which is equivalent to -Oz with Clang. I haven't looked into
> the details yet, but changing -Os to -Oz in the godbolt link gives the codegen
> you're looking for?
>
> Cheers,
> Sjoerd.
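[Editor's note: the godbolt link above is not reproduced in this archive. Based on the descriptions in the post, the two extra variants were presumably along these lines; the function names come from the thread, but the bodies are a reconstruction, not the actual source behind the link.]

```c
#include <stddef.h>

/* Reconstruction of memcpy_alt2: copies from the high address downward,
 * so the loop counter can serve directly as the load/store offset. */
void* memcpy_alt2(void* dst, const void* src, size_t len) {
    char* d = (char*)dst;
    const char* s = (const char*)src;
    for (size_t i = len; i > 0; --i)
        d[i - 1] = s[i - 1];
    return dst;
}

/* Reconstruction of memcpy_silly: a fixed, unrolled copy of a few bytes,
 * so several loads/stores could share one offset register. The exact
 * byte count (the post's "dst<0-4>") is a guess. */
void* memcpy_silly(void* dst, const void* src) {
    char* d = (char*)dst;
    const char* s = (const char*)src;
    d[0] = s[0];
    d[1] = s[1];
    d[2] = s[2];
    d[3] = s[3];
    return dst;
}
```

In both shapes the load and the store use the same index against two different bases, which is exactly the pattern the Thumb1 register-offset forms (ldrb Rd, [Rn, Rm] / strb Rd, [Rn, Rm]) can encode.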
> ------------------------------
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Daniel Way via llvm-dev <llvm-dev at lists.llvm.org>
> *Sent:* 20 July 2020 06:54
> *To:* llvm-dev at lists.llvm.org <llvm-dev at lists.llvm.org>
> *Subject:* [llvm-dev] [ARM] Should Use Load and Store with Register Offset
>
> Hello LLVM Community (specifically anyone working with ARM Cortex-M),
>
> While trying to compile the Newlib C library I found that Clang 10 was
> generating slightly larger binaries than the libc from the prebuilt
> gcc-arm-none-eabi toolchain. I looked at a few specific functions (memcpy,
> strcpy, etc.) and noticed that LLVM does not tend to generate load/store
> instructions with a register offset (e.g. the ldr Rd, [Rn, Rm] form) and
> instead prefers the immediate-offset form.
>
> When copying a contiguous sequence of bytes, this results in additional
> instructions to modify the base address. https://godbolt.org/z/T1xhae
>
> void* memcpy_alt1(void* dst, const void* src, size_t len) {
>     char* save = (char*)dst;
>     for (size_t i = 0; i < len; ++i)
>         *((char*)(dst + i)) = *((char*)(src + i));
>     return save;
> }
>
> clang --target=armv6m-none-eabi -Os -fomit-frame-pointer
> memcpy_alt1:
>         push    {r4, lr}
>         cmp     r2, #0
>         beq     .LBB0_3
>         mov     r3, r0
> .LBB0_2:
>         ldrb    r4, [r1]
>         strb    r4, [r3]
>         adds    r1, r1, #1
>         adds    r3, r3, #1
>         subs    r2, r2, #1
>         bne     .LBB0_2
> .LBB0_3:
>         pop     {r4, pc}
>
> arm-none-eabi-gcc -march=armv6-m -Os
> memcpy_alt1:
>         movs    r3, #0
>         push    {r4, lr}
> .L2:
>         cmp     r3, r2
>         bne     .L3
>         pop     {r4, pc}
> .L3:
>         ldrb    r4, [r1, r3]
>         strb    r4, [r0, r3]
>         adds    r3, r3, #1
>         b       .L2
>
> Because this code appears in a loop that could be copying hundreds of
> bytes, I want to add an optimization that will prioritize load/store
> instructions with register offsets when the offset is used multiple times.
> I have not worked on LLVM before, so I'd like advice about where to start.
>
> - The generated code is correct, just sub-optimal, so is it appropriate
> to submit a bug report?
> - Is anyone already tackling this change, or is there someone with more
> experience interested in collaborating?
> - Is this optimization better performed early, during instruction
> selection, or late, using C++ (i.e. ARMLoadStoreOptimizer.cpp)?
> - What is the potential to cause harm to other parts of the code gen,
> specifically for other ARM targets? I'm working with armv6m, but armv7m
> offers base register updating in a single instruction. I don't want to
> break other useful optimizations.
>
> So far, I am reading through the LLVM documentation to see where a change
> could be applied. I have also:
>
> - Compiled with -S -emit-llvm (see Godbolt link).
> There is an identifiable pattern where a getelementptr instruction is
> followed by a load or store. When multiple getelementptr instructions appear
> with the same virtual register offset, maybe this should match a tLDRr or
> tSTRr.
> - Ran llc with --print-machineinstrs.
> It appears that tLDRBi and tSTRBi are selected very early and never
> replaced by the equivalent t<LDRB|STRB>r instructions.
>
> Thank you,
>
> Daniel Way
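[Editor's note: the IR pattern described above, where two getelementptrs share one index and feed a load and a store, looks roughly like the following. This is a hand-written sketch in LLVM 10-era typed-pointer syntax, not actual clang output.]

```llvm
; Inside the copy loop, %i is the induction variable shared by
; both address computations.
%src.addr = getelementptr inbounds i8, i8* %src, i32 %i
%dst.addr = getelementptr inbounds i8, i8* %dst, i32 %i
%byte = load i8, i8* %src.addr, align 1
store i8 %byte, i8* %dst.addr, align 1
; Because both addresses reuse %i, instruction selection could in
; principle pick the register-offset forms tLDRBr/tSTRBr here,
; instead of tLDRBi at offset 0 plus two adds on the bases.
```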
Sjoerd Meijer via llvm-dev
2020-Jul-21 09:05 UTC
[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Hi Daniel,

Your observations seem valid to me. Some high-level comments from my side.

As you said, the loops are quite similar. We have also observed that in general we generate more code around loops, in the function prologue and epilogue, where some data and arguments get moved and reshuffled, etc. While this is very obvious in these micro-benchmarks, it hasn't bothered us enough yet for larger apps, where this is less important (or where other things are more important). The outlier does indeed look to be Clang -Oz for memcpy_alt2; that is perhaps a "code-size bug". As I haven't looked into it, it's too early for me to blame this on just the addressing modes, as there could be several things going on.

Since this is a micro-benchmark, and lowering memcpy is a bit of an art ;-), for which a specialised implementation is probably available, you might want to look at some other code that is important for you too.

Your remarks about execution times might be right too and, as you said, are probably best confirmed with benchmark numbers. In our group, we have not really looked into performance for the Cortex-M0, probably because it's the only v6m core (although the Cortex-M23 and Armv8-M Baseline are very similar) and code size would be more important for us, but there might be something to be gained here.

Cheers,
Sjoerd.
Daniel Way via llvm-dev
2020-Jul-22 03:33 UTC
[llvm-dev] [ARM] Should Use Load and Store with Register Offset
Thank you, Sjoerd. Your high-level comments are very helpful and much appreciated.

I ended up rebuilding the Newlib-nano source with -Oz instead of -Os and found an overall improvement in code size. The final size is still larger than the gcc-arm-none-eabi toolchain's. Of course, there are a few caveats to this:

- Newlib is designed around GCC;
- I'm not sure I perfectly reproduced the build settings for the pre-built toolchain (macros, etc.);
- and this comparison considers all libc functions, many of which may not end up in the final image.

For now, I've submitted *BUG 46801* for the case when -Oz produces more instructions than -Os. I don't know if it needs to be a priority, but I thought it should be recorded.

I may try benchmarking the memcpy implementations as well as a few other libc functions, but I haven't done this before. Of course, I'll share my results if I do end up testing.

Thank you for the help.

Daniel Way