Bharathi Seshadri via llvm-dev
2018-May-24 18:06 UTC
[llvm-dev] X86 Intrinsics : _mm_storel_epi64/ _mm_loadl_epi64 with -m32
Hi, I'm using _mm_storel_epi64/_mm_loadl_epi64 in my test case as below and generating 32-bit code (using -m32 and -msse4.2). The 64-bit load and 64-bit store operations are replaced with two 32-bit mov instructions, presumably due to the use of the uint64_t type. If I use __m128i instead of uint64_t everywhere, then the read and write happen as 64-bit operations using the xmm registers, as expected.

    void indvbl_write64(volatile void *p, uint64_t v)
    {
        __m128i tmp = _mm_loadl_epi64((__m128i const *)&v);
        _mm_storel_epi64((__m128i *)p, tmp);
    }

    uint64_t indvbl_read64(volatile void *p)
    {
        __m128i tmp = _mm_loadl_epi64((__m128i const *)p);
        return *(uint64_t *)&tmp;
    }

Options used to compile:

    clang -O2 -c -msse4.2 -m32 test.c

Generated code:

    00000000 <indvbl_write64>:
       0: 8b 44 24 08           mov    0x8(%esp),%eax
       4: 8b 54 24 04           mov    0x4(%esp),%edx
       8: 8b 4c 24 0c           mov    0xc(%esp),%ecx
       c: 89 4a 04              mov    %ecx,0x4(%edx)
       f: 89 02                 mov    %eax,(%edx)
      11: c3                    ret
      12: 66 2e 0f 1f 84 00 00  nopw   %cs:0x0(%eax,%eax,1)
      19: 00 00 00
      1c: 0f 1f 40 00           nopl   0x0(%eax)

    00000020 <indvbl_read64>:
      20: 8b 4c 24 04           mov    0x4(%esp),%ecx
      24: 8b 01                 mov    (%ecx),%eax
      26: 8b 51 04              mov    0x4(%ecx),%edx
      29: c3                    ret

The front end generates insertelement <2 x i64> and extractelement <2 x i64> for the loads and stores as expected, and the optimizer generates load i64 and store i64, which are then lowered into 32-bit move instructions during instruction selection.

Would it be possible and safe to generate a single 64-bit load/store in this case with -m32? If so, could I have some pointers to the relevant parts of the code I should be looking at to make this improvement?

Thanks,
Bharathi
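For reference, the "__m128i everywhere" variant mentioned above, which keeps the value in an xmm register so both accesses compile to a single 64-bit movq even with -m32, could look like the sketch below. The `*_xmm` function names and the `low64` helper are illustrative, not from the original test case:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <string.h>

    /* Store the low 64 bits of v to *p with one movq. */
    static void indvbl_write64_xmm(volatile void *p, __m128i v)
    {
        _mm_storel_epi64((__m128i *)p, v);
    }

    /* Load 64 bits from *p into the low half of an xmm register
       (upper half zeroed) with one movq. */
    static __m128i indvbl_read64_xmm(volatile void *p)
    {
        return _mm_loadl_epi64((__m128i const *)p);
    }

    /* Extract the low 64 bits without a strict-aliasing violation. */
    static uint64_t low64(__m128i v)
    {
        uint64_t x;
        memcpy(&x, &v, sizeof x);
        return x;
    }

Because the uint64_t value never leaves the vector domain between the load and the store, the backend has no reason to split it into two 32-bit GPR moves.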