
My code is:

...
fragment1  // compares several regions in D1$ to D1$/D3$
__atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);   // stmtA
fragment2  // moves several regions from D1$/D3$ to D1$
__atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);
...

The atomic ops will presumably be done with LOCK XADD (i.e. no CAS retry loop required) and will not cause branch misprediction. Also, the result of the atomic is not used and does not need to be waited on.

The question: how much does stmtA slow down the core it is running on?

Motivation: if stmtA is cheap, I will use it to let other threads acquire the lock sooner. If it is expensive, I will delete stmtA, decrement the lock by 2 in the other atomic, and let the other threads wait for fragment2 to finish. A sketch of both variants follows.
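For concreteness, here is a sketch of the two variants being weighed (fragment1/fragment2 are the placeholders from above):

    // Variant A: keep stmtA, releasing half the lock as soon as fragment1 is done
    fragment1
    __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);   // stmtA
    fragment2
    __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);

    // Variant B: drop stmtA; one decrement of 2 after fragment2 finishes
    fragment1
    fragment2
    __atomic_fetch_add(&lock, -2, __ATOMIC_ACQ_REL);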

I am guessing that any delay caused by stmtA is related to memory ordering. Can someone spell this out? Does the core wait for outstanding reads to complete? For anything else? Can accesses to local caches continue?

1 Answer


On x86-64, all atomic RMWs¹ are full barriers, defeating all memory-level parallelism, including to local data caches, since all cache is coherent. (I-cache coherency doesn't have to respect acq_rel synchronization for code fetch, which is why cross-modifying code needs a fully serializing instruction; so at least later instructions can still be fetched and decoded.)
Instruction-level parallelism for non-memory instructions is still possible around locked instructions.

Most of the cost of an atomic RMW is usually in getting ownership of the cache line, but even on a "hot" cache line it's several uops, with a throughput of 1 per ~20 cycles depending on the uarch.
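If you want a rough number for your own machine, a minimal single-threaded probe like the following (assuming GNU C and POSIX clock_gettime) measures back-to-back locked RMWs on a core-private hot line. Note it shows uncontended throughput cost, not the cost of contended ownership transfers:

    #include <stdio.h>
    #include <time.h>

    int main(void) {
        int lock = 0;                      // stays hot in this core's L1d
        const long n = 100000000;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; i++)
            __atomic_fetch_add(&lock, -1, __ATOMIC_ACQ_REL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns per locked RMW\n", ns / n);   // multiply by GHz for cycles
    }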

When you don't use the result, good compilers will use lock add or lock dec instead of lock xadd, but that only saves a couple of uops. The instruction costs on https://uops.info/ are measured in a single-threaded loop, with the memory location being RMWed hot in cache.
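For example, with GCC or Clang, something like this typically compiles to a memory-destination lock sub when the result is ignored, versus lock xadd when it is used (the exact instruction choice is up to the compiler):

    void dec_ignore(int *lock) {   // typically: lock sub dword ptr [rdi], 1
        __atomic_fetch_add(lock, -1, __ATOMIC_ACQ_REL);
    }

    int dec_use(int *lock) {       // typically: mov eax, -1; lock xadd [rdi], eax
        return __atomic_fetch_add(lock, -1, __ATOMIC_ACQ_REL);
    }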


Footnote 1 - except for the RAO-INT extension: "remote atomics", which execute at the L3 slice instead of having to bring the cache line to the core. They were going to debut in Grand Ridge but got pushed back, and apparently didn't make it into Arrow Lake either. I'm not sure whether they're still planned, or when they did / will arrive. In any case, RAO-INT atomics are weakly ordered (memory_order_relaxed) and fire-and-forget.

Or for RMWs that are only atomic with respect to interrupts / other code on the same core, where you don't need a lock prefix. But you can't get C compilers to do that for you, except maybe by omitting lock prefixes entirely with assembler options, in which case your code could only run correctly on uniprocessor machines / VMs.
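If you did want such a no-lock RMW by hand, GNU C inline asm is one way to express it. A minimal sketch (the helper name is mine): a single memory-destination instruction without a lock prefix is atomic with respect to interrupts / signal handlers on the same core, but not with respect to other cores.

    // Atomic only wrt. interrupts on this core; NOT safe across cores/threads.
    static inline void add_this_core_only(int *p, int v) {
        __asm__ volatile ("addl %1, %0" : "+m"(*p) : "ri"(v));
    }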


2 Comments

Thanks. It makes sense that the core issuing the atomic would wait for its write queue to drain. Since the atomic is both an acquire and a release, is there a similar delay in the current owner, i.e. does a core owning a line RFO'd by another core wait for its store queue to drain before yielding ownership?
No, that's totally asynchronous. Cores don't hold onto cache lines for extra time even if there are pending stores to that line, unless it's literally in the middle of an atomic RMW with the load side having already locked that cache line. (That's how they achieve RMW atomicity without asserting a system-wide LOCK# bus-lock signal that would block memory ops from all cores, as they have to do for a cache-line-split atomic RMW: stackoverflow.com/questions/39393850/…)
