
My code is:

...
fragment1  // compares several regions in D1$ to D1$/D3$
__atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);   // stmtA
fragment2  // moves several regions from D1$/D3$ to D1$
__atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);
...

The atomic ops will presumably be done with LOCK XADD (i.e. no CAS retry loop required) and will not cause branch misprediction. Also, the result of the atomic is not used and does not need to be waited on.

The question: how much does stmtA slow down the core it is running on?

Motivation: if stmtA is cheap, I will use it to let other threads acquire the lock sooner. If it is expensive, I will delete stmtA, decrement the lock by 2 in the other atomic, and let the other threads wait for fragment2 to finish. A sketch of both variants follows.
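For concreteness, here is a sketch of the two variants being weighed (fragment1/fragment2 are the placeholders from above):

    // Variant A: keep stmtA, releasing half the lock as soon as fragment1 is done
    fragment1
    __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);   // stmtA
    fragment2
    __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);

    // Variant B: drop stmtA; one decrement of 2 after fragment2 finishes
    fragment1
    fragment2
    __atomic_fetch_add(&lock, -2, __ATOMIC_ACQ_REL);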

I am guessing that any delay caused by stmtA is related to memory ordering. Can someone spell this out? Does the core wait for outstanding reads to complete? For anything else? Can accesses to local caches continue?

1 Answer


On x86-64, all atomic RMWs¹ are full barriers, defeating all memory-level parallelism, including to local data caches, since all cache is coherent. (I-cache coherency doesn't have to respect acq_rel synchronization for code fetch, which is why cross-modifying code needs a fully serializing instruction; so at least later instructions can still be fetched and decoded.)
Instruction-level parallelism for non-memory instructions is still possible around locked instructions.

Most of the cost of an atomic RMW is usually in getting ownership of the cache line, but even on a "hot" cache line it's several uops, with a throughput of 1 per ~20 cycles depending on the uarch.
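If you want a rough number for your own machine, a minimal single-threaded probe like the following (assuming GNU C and POSIX clock_gettime) measures back-to-back locked RMWs on a core-private hot line. Note it shows uncontended throughput cost, not the cost of contended ownership transfers:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        int lock = 0;                      // stays hot in this core's L1d
        const long n = 100000000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i++)
            __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per locked RMW\n", ns / n);   // multiply by GHz for cycles
    }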

When you don't use the result, good compilers will use lock add or lock dec instead of lock xadd, but that only saves a couple of uops. The instruction costs on https://uops.info/ are measured in a single-threaded loop, with the memory location being RMWed hot in cache.
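For example, with GCC or Clang, something like this typically compiles to a memory-destination lock sub when the result is ignored, versus lock xadd when it is used (the exact instruction choice is up to the compiler):

    void dec_ignore(int *lock) {   // typically: lock sub dword ptr [rdi], 1
        __atomic_fetch_add(lock, -1, __ATOMIC_ACQ_REL);
    }

    int dec_use(int *lock) {       // typically: mov eax, -1; lock xadd [rdi], eax
        return __atomic_fetch_add(lock, -1, __ATOMIC_ACQ_REL);
    }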


Footnote 1 - except for the RAO-INT extension: "remote atomics", which execute at the L3 slice instead of having to bring the cache line to the core. They were going to debut in Grand Ridge but got pushed back, and apparently didn't make it into Arrow Lake either. I'm not sure whether they're still planned, or when they did / will arrive. In any case, RAO-INT atomics are weakly ordered (memory_order_relaxed) and fire-and-forget.

Or for RMWs that are only atomic with respect to interrupts / other code on the same core, where you don't need a lock prefix. But you can't get C compilers to do that for you, except maybe by omitting lock prefixes entirely with assembler options, in which case your code could only run correctly on uniprocessor machines / VMs.
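If you did want such a no-lock RMW by hand, GNU C inline asm is one way to express it. A minimal sketch (the helper name is mine): a single memory-destination instruction without a lock prefix is atomic with respect to interrupts / signal handlers on the same core, but not with respect to other cores.

    // Atomic only wrt. interrupts on this core; NOT safe across cores/threads.
    static inline void add_this_core_only(int *p, int v) {
        __asm__ volatile ("addl %1, %0" : "+m"(*p) : "ri"(v));
    }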


2 Comments

Thanks. It makes sense that the core issuing the atomic would wait for its write queue to drain. Since the atomic is both an acquire and a release, is there a similar delay in the current owner, i.e. does a core owning a line RFO'd by another core wait for its store queue to drain before yielding ownership?
No, that's totally asynchronous. Cores don't hold onto cache lines for extra time even if there are pending stores to that line, unless it's literally in the middle of an atomic RMW with the load side having already locked that cache line. (That's how they achieve RMW atomicity without asserting a system-wide LOCK# bus-lock signal that would block memory ops from all cores, as they have to do for a cache-line-split atomic RMW: stackoverflow.com/questions/39393850/…)
