Compiler choice of not using REP MOVSB instruction for a byte array move
I'm checking the Release build of my project, built with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:
//ncbSzBuffDataUsed of type INT32
UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
pDst[i] = pSrc[i];
}
as such:

UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2 movsxd r8,edx
00007FF664412521 4C 2B D1 sub r10,rcx
00007FF664412524 0F 1F 40 00 nop dword ptr [rax]
00007FF664412528 0F 1F 84 00 00 00 00 00 nop dword ptr [rax+rax]
00007FF664412530 41 0F B6 04 0A movzx eax,byte ptr [r10+rcx]
{
pDst[i] = pSrc[i];
00007FF664412535 88 01 mov byte ptr [rcx],al
00007FF664412537 48 8D 49 01 lea rcx,[rcx+1]
00007FF66441253B 49 83 E8 01 sub r8,1
00007FF66441253F 75 EF jne _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)
}
versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
What's the surrounding code? Can the compiler prove that the src and dst don't overlap? Declaring function args or globals with __restrict, like uint8_t *__restrict pDPE, promises the compiler that the pointed-to memory isn't also accessed any other way. Failed aliasing analysis defeats auto-vectorization in general. And BTW, a vector copy loop is usually slightly better than rep movsb (Enhanced REP MOVSB for memcpy), but for large copies maybe only with runtime CPU dispatching for AVX, because rep movsb can use 256b loads/stores. – Peter Cordes, Jul 1 at 6:01
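Peter's suggestion can be sketched as follows (a minimal example, not the OP's exact code; the function name is hypothetical):

```cpp
#include <cstddef>

// Hedged sketch: with __restrict on both pointer parameters, the
// compiler is allowed to assume src and dst never overlap, which is
// what lets MSVC recognize this loop as a memcpy.
void copy_bytes(unsigned char* __restrict pDst,
                const unsigned char* __restrict pSrc,
                std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        pDst[i] = pSrc[i];
}
```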
Can you post a Minimal, Complete, and Verifiable example of this code-gen on godbolt.org, with one of the MSVC versions it has? (Use /Ox for full optimization.) – Peter Cordes, Jul 1 at 6:05
I'm curious why you don't just use memcpy(pMXB + 1, pDPE, ncbSzBuffDataUsed);? The compiler implements exactly what you write in the source code. Want a more efficient binary? Write more efficient source. – RbMm, Jul 1 at 8:01
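RbMm's point, sketched with stand-in parameters (the wrapper function and its signature are illustrative, not from the question):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch: calling memcpy directly states the intent, so
// the compiler can emit its best copy sequence (inline expansion or a
// call to the optimized CRT routine) instead of a byte loop.
void copy_payload(std::uint8_t* pDst, const std::uint8_t* pSrc,
                  std::int32_t ncbSzBuffDataUsed)
{
    if (ncbSzBuffDataUsed > 0)
        std::memcpy(pDst, pSrc, (std::size_t)ncbSzBuffDataUsed);
}
```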
@PeterCordes: Your intuition was right: if I add __restrict, then MSVC uses memcpy: godbolt.org/g/PUfErC. – geza, Jul 1 at 8:45
2 Answers
Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here, and I believe him (I guess I already did). If you want to force the compiler to do things this way, see __movsb(): https://docs.microsoft.com/en-us/cpp/intrinsics/movsb.
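A hedged sketch of forcing rep movsb via the intrinsic (the wrapper name and the portable fallback are mine; only the __movsb call itself comes from the MSVC docs):

```cpp
#include <cstddef>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>   // declares __movsb
#endif

// Sketch: on MSVC, __movsb(dst, src, n) emits a rep movsb directly.
// The memcpy fallback just keeps this sketch compilable elsewhere.
void copy_with_movsb(unsigned char* dst, const unsigned char* src,
                     std::size_t n)
{
#if defined(_MSC_VER)
    __movsb(dst, src, n);
#else
    std::memcpy(dst, src, n);
#endif
}
```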
As to why the compiler didn't do this for you, in the absence of any other ideas the answer might be register pressure. To use rep movsb, the compiler would have to:

- load rsi with the source address
- load rdi with the destination address
- load rcx with the count
- issue the rep movsb instruction
So now it has had to use up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.
Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.
In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1, optimise for size, vs /O2, optimise for speed) will likely also affect this.
More on the x64 register passing convention here, and on the x64 ABI generally here.
Edit 2 (again inspired by Peter's comments):
The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
Also
rep movsb works byte by byte (at least conceptually), and moving in a loop data word by word might be faster– Basile Starynkevitch
Jul 1 at 5:50
Then, it could be time to switch to a better compiler (or just fine-tune the many optimization flags your compiler provides). Did you try GCC or Clang or icc? – Basile Starynkevitch, Jul 1 at 5:53
Register pressure is not really plausible. Any decent compiler knows that it's worth saving/restoring a register or two if it enables a big speedup inside a loop. Compiling loops efficiently is job #1 for compilers. (And yes I'm including MSVC here; not gimping loops to save a couple prologue/epilogue instructions is a pretty low bar for compilers.) Almost certainly something stopped MSVC from auto-vectorizing or recognizing this as a memcpy / memmove. (Inserting a call to memmove would be a good optimization here if the count is normally large.) – Peter Cordes, Jul 1 at 6:07
@c00000fd: Intel's optimization manual has a whole section on implementing memcpy with either a vector loop or rep movsb, and on alignment considerations. If you only care about CPUs since IvyBridge, then just use rep movsb if you insist on using rep movs at all, because the ERMSB feature makes it at least as good as rep movsq. But make sure your pointers are both aligned, or else avoid rep movs. See performance links in stackoverflow.com/tags/x86/info. – Peter Cordes, Jul 1 at 6:11
BTW, I had been thinking that potential overlap made it a memmove. But that's not true: with dst = src+1, it turns into memset(dst, src[0]). Anyway, see my updated answer on Does any of current C++ compilers ever emit "rep movsb/w/d"?. It might be interesting for a compiler to consider inlining rep movsb in case of possible overlap, especially if compiling with -Os (optimize for size), on a CPU with ERMSB. – Peter Cordes, Jul 2 at 1:31
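Peter's dst = src+1 observation can be checked directly (a small demo of my own, not from the thread):

```cpp
#include <cstddef>

// Demo of why a forward byte copy is not a memmove: with dst = src + 1,
// each store feeds the next iteration's load, so the loop smears src[0]
// across the destination; it behaves like memset(dst, src[0], n).
void forward_byte_copy(unsigned char* dst, const unsigned char* src,
                       std::size_t n)
{
    for (std::size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

With buf = {7,1,2,3,4,5,6,0} and forward_byte_copy(buf+1, buf, 7), every byte of buf ends up equal to 7.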
This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)
What also makes a difference is how you structure your loops. For instance:
Assuming the following struct definitions:
#define PCALLBACK ULONG64
#pragma pack(push)
#pragma pack(1)
typedef struct {
    ULONG64 ui0;
    USHORT w0;
    USHORT w1;
    //Followed by:
    // PCALLBACK 'array' - variable size array
} DPE;
#pragma pack(pop)
(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
{
pDstClbks[i] = info.callbackFuncs[i];
}
As was mentioned in the answer above, it is clear that the compiler was starved of registers to have produced the following monstrosity (see how it reused rax for the loop end limit, or the movzx eax,word ptr [r13] instruction that could clearly have been hoisted out of the loop):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30 add rcx,30h
for(size_t i = 0; i < (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00 cmp bx,word ptr [r13]
00007FF7029327D8 73 1F jae 07FF7029327F9h
00007FF7029327DA 4C 8B C1 mov r8,rcx
00007FF7029327DD 4C 2B F1 sub r14,rcx
{
pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08 mov rax,qword ptr [r14+r8+8]
00007FF7029327E5 48 FF C3 inc rbx
00007FF7029327E8 49 89 00 mov qword ptr [r8],rax
00007FF7029327EB 4D 8D 40 08 lea r8,[r8+8]
00007FF7029327EF 41 0F B7 45 00 movzx eax,word ptr [r13]
00007FF7029327F4 48 3B D8 cmp rbx,rax
00007FF7029327F7 72 E7 jb 07FF7029327E0h
}
00007FF7029327F9 45 0F B7 C7 movzx r8d,r15w
(2) So if I rewrite it into a less familiar C pattern:
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs;
pDstClbks < pEndDstClbks;
pScrClbks++, pDstClbks++)
{
*pDstClbks = *pScrClbks;
}
this produces more sensible machine code (on the same compiler, in the same function, in the same project):
PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30 add rcx,30h
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx eax,word ptr [rsi+88h]
00007FF71D7E27CD 48 8D 14 C1 lea rdx,[rcx+rax*8]
for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA cmp rcx,rdx
00007FF71D7E27D4 76 14 jbe 07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1 sub rsi,rcx
{
*pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08 mov rax,qword ptr [rsi+rcx+8]
00007FF71D7E27DE 48 89 01 mov qword ptr [rcx],rax
00007FF71D7E27E1 48 83 C1 08 add rcx,8
00007FF71D7E27E5 48 3B CA cmp rcx,rdx
00007FF71D7E27E8 77 EF jb 07FF71D7E27D9h
}
00007FF71D7E27EA 45 0F B7 C6 movzx r8d,r14w
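A third option (my own sketch, using a hypothetical stand-in for the struct in the question) is to keep the indexed loop but cache the count in a local; the compiler then has no reason to re-load wNumCallbackFuncs after every store, since a local cannot alias the destination:

```cpp
#include <cstddef>

// Hypothetical stand-in for the serialization struct in the question.
struct Info {
    unsigned short wNumCallbackFuncs;
    unsigned long long callbackFuncs[4];
};

void copy_callbacks(unsigned long long* pDstClbks, const Info& info)
{
    // Caching the bound in a local removes the per-iteration re-load of
    // info.wNumCallbackFuncs seen in listing (1): a local can't alias
    // the stores through pDstClbks.
    const std::size_t n = info.wNumCallbackFuncs;
    for (std::size_t i = 0; i < n; i++)
        pDstClbks[i] = info.callbackFuncs[i];
}
```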
Hmmm. Why is curiosity so often punished with a downvote? – Paul Sanders, Jul 1 at 5:44