My code is
...
fragment1 // compares several regions in D1$ to D1$/D3$
__atomic_fetch_add(&lock,-1,__ATOMIC_ACQ_REL); // stmtA
fragment2 // moves several regions from D1$/D3$ to D1$
__atomic_fetch_add(&lock,-1,__ATOMIC_ACQ_REL);
...
The atomic ops will presumably be compiled to LOCK XADD (i.e. no CAS loop is required) and will not cause branch mispredictions. The result of the atomic is not used, so nothing needs to wait on it.
The question: how much does stmtA slow down the core it is running on?
Motivation: if stmtA is cheap, I will use it to allow other threads to acquire the lock faster. If it is expensive, I will delete stmtA, decrement the lock by 2 in the other atomic, and let the other threads wait for fragment2 to finish.
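To make the two options concrete, here is a minimal sketch. The bodies of fragment1 and fragment2 are empty placeholders, the function names are made up for illustration, and the starting value of 2 is only an assumption so that both variants end at the same count:

```c
/* Hypothetical starting value; the real locking protocol is not shown here. */
static int lock = 2;

/* Placeholders for the real work described above. */
static void fragment1(void) { /* compares several regions */ }
static void fragment2(void) { /* moves several regions */ }

/* Option 1: keep stmtA, so one unit is released as soon as fragment1 is done. */
static void with_stmtA(void) {
    fragment1();
    __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL); /* stmtA */
    fragment2();
    __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);
}

/* Option 2: delete stmtA and release both units after fragment2. */
static void without_stmtA(void) {
    fragment1();
    fragment2();
    __atomic_fetch_add(&lock, -2, __ATOMIC_ACQ_REL);
}
```

Both options leave the counter at the same final value. They differ only in when the first decrement becomes visible to the waiting threads, and in whether this core pays for one locked instruction or two.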
I am guessing that any delay caused by stmtA is related to the memory ordering. Can someone spell this out? Does stmtA have to wait for outstanding reads to complete? For anything else? Can accesses to the local caches continue while it executes?