Newest 'cpu-architecture' Questions

Tooling

0 votes

1 reply

45 views

Simulating an AArch64 (ARM 64-bit) branch predictor unit (BPU)

I am working on a microarchitectural tooling project, and as part of a heuristic I need the ability to observe and manipulate the internal state of a branch predictor. Specifically, I am looking for ...

Gal Kaptsenel

93

asked Nov 27 at 16:02

-3 votes

0 answers

49 views

How to determine the exact architecture and architecture version? [closed]

My old desktop PC is not able to run some modern Docker images due to its architecture version. That desktop has an AMD Phenom II X6 1100T BE, and after a really long search on the internet, I found that ...

PillFall

188

asked Nov 24 at 4:21

0 votes

0 answers

38 views

What constitutes a "Mixed I/O-bound" workload in the context of CPU scheduling simulations?

I am currently working on an assignment to simulate and analyze the performance of various CPU scheduling algorithms (such as FCFS, SJF, and Round Robin). I was given a list of workload scenarios to ...

TBOYT

1

asked Nov 22 at 18:06

0 votes

2 answers

108 views

Why doesn't a Wallace tree produce numbers with more than 2n final product bits?

I am trying to understand Wallace trees. The algorithm goes: multiply each bit of one number (a) by each bit of the other (b), which is accomplished with a simple AND gate, where the partial product of ...

yosmo78

641

asked Nov 17 at 4:52
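
A minimal sketch of the partial-product step described in the Wallace tree excerpt above, assuming unsigned n-bit operands. The function names are hypothetical, and a plain sum stands in for the carry-save adder layers, which change how the partial products are added but not the total. The 2n-bit bound follows from (2^n - 1)^2 = 2^(2n) - 2^(n+1) + 1 < 2^(2n).

```python
def partial_products(a: int, b: int, n: int) -> list[int]:
    """Return the n*n weighted partial products of two n-bit unsigned integers."""
    products = []
    for i in range(n):
        for j in range(n):
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate per bit pair
            products.append(bit << (i + j))        # this bit carries weight 2**(i + j)
    return products


def wallace_style_product(a: int, b: int, n: int) -> int:
    """Sum the partial products; a real Wallace tree reduces them with adder layers."""
    return sum(partial_products(a, b, n))


if __name__ == "__main__":
    n = 4
    largest = (1 << n) - 1                      # 0b1111, the largest 4-bit operand
    worst_case = wallace_style_product(largest, largest, n)
    assert worst_case == largest * largest      # same value as ordinary multiplication
    assert worst_case.bit_length() <= 2 * n     # never needs more than 2n bits
```

Because the largest possible product already fits in 2n bits, no arrangement of adders that computes the same sum can produce a wider result.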

0 votes

1 answer

90 views

How does the LDA instruction interact with RAM and IR on MU1 architecture? [closed]

I am trying to understand this LDA instruction. After the fetch instruction, IR holds LDA S. 1 - In the first step, DIN = [IR], IR is supposed to be interpreted as an address (because of the brackets),...

Mohamed Badis Kerdellou

53

asked Nov 11 at 16:04

3 votes

0 answers

157 views

The cost of non-contiguous reads and writes (naive matrix transpose, power-of-2 and other sizes)

I was benchmarking a naive transposition and noticed a very large performance discrepancy between: a naive operation where we read data contiguously and write with a large stride; the ...

Etienne M

715

asked Nov 3 at 14:49
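
One factor often raised for strided access with power-of-2 sizes is cache set conflicts. The sketch below is only an illustration under assumed parameters (64-byte lines and 64 sets, as in a 32 KiB 8-way L1d); the excerpt above does not state the hardware, and the names are hypothetical. It counts how many distinct cache sets a run of strided accesses touches.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 64     # assumed: 32 KiB / (64-byte lines * 8 ways)


def distinct_sets(stride_bytes: int, accesses: int = 64) -> int:
    """Count the cache sets touched by `accesses` strided accesses starting at address 0."""
    sets = {(i * stride_bytes // LINE_BYTES) % NUM_SETS for i in range(accesses)}
    return len(sets)


if __name__ == "__main__":
    # Row pitch of 4096 bytes (power of 2) versus the same pitch padded by one line.
    print(distinct_sets(4096), distinct_sets(4096 + LINE_BYTES))
```

Under these assumptions a 4096-byte stride maps every access to the same set, while padding each row by one cache line spreads the same accesses over all 64 sets. Whether this explains the measurements in the question depends on details the excerpt does not show.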

Advice

1 vote

2 replies

148 views

How many general purpose registers are on an x86-64 processor, including alias registers?

I was curious and wondering how many registers are on an x64 processor. I know there are 16 general purpose registers available to the user, but there are supposedly general purpose alias registers ...

misInformationSpreader

31

asked Nov 2 at 7:11

5 votes

0 answers

230 views

Why does this data race have some consistent invariants with writers updating one of three atomic<int> variables?

I have the following program. The relevant info is: There are 3 variables atomic<int> x, y, z accessed by all threads. 3 writer threads: Each thread reads all 3 values x, y, z and updates exactly 1 ...

Huy Le

1,989

asked Oct 29 at 8:47

0 votes

2 answers

111 views

"docker: no matching manifest for linux/amd64 in the manifest list entries"

I'm trying to run software on a big-endian architecture. Following the update at the end of this answer, I tried this: $ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes ...

optical

267

asked Oct 23 at 13:56

3 votes

0 answers

83 views

After enabling an interrupt via CSRRW in RISC-V, how many instructions may execute before trap entry?

In RISC-V machine mode, when you issue a csrrw that sets a bit in mie (i.e. enabling an interrupt that is already pending), must the very next instruction immediately branch to the interrupt handler? ...

Ömer GÜZEL

305

asked Oct 22 at 14:15

2 votes

0 answers

94 views

Too big a latency of ping-pong between two IPC processes on Sapphire Rapids Xeon with plain loads and stores, instruction order makes a big difference

I am running simple Ping/Pong between two processes A, B with shared memory: shm_A and shm_B are in separate cache lines. Allocated with separate calls to shm_open, so probably in different pages, ...

Samuel Hapak

7,284

asked Oct 16 at 7:52

2 votes

1 answer

83 views

Why are J1 and J2 used with XOR in ARMv6-M BL instruction immediate calculation?

I’m trying to understand how the BL instruction is decoded in the ARMv6-M architecture. The part I don’t get is in the imm32 calculation: the values of I1 and I2 are derived using J1 and J2, but they’...

zenprogrammer

751

asked Sep 30 at 21:46

1 vote

1 answer

104 views

How are MMIO requests routed in CPU microarchitecture — cache-bypass on same path or a separate bus/port?

Short background: MMIO regions are typically mapped as uncacheable / device memory, so the CPU must not treat device registers like normal cacheable DRAM. I’m asking about the microarchitecture routing and ...

SungwookKang

13

asked Sep 30 at 10:10

0 votes

0 answers

63 views

How to decide the data size handled by each processor/core in SIMD?

I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing. Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently ...

Catdev

1

asked Sep 29 at 16:57

2 votes

2 answers

182 views

Fractional-cycle latency of CPU instructions

I am trying to characterize the instruction latency of ARM's aese and aesmc instructions in Apple's M1, M3 and M4 CPUs. For M1, Dougall Johnson obtains 3 cycles for a fused pair of aese + aesmc. ...

swineone

3,000

asked Sep 18 at 16:13

1 vote

1 answer

108 views

Is CPU multithreading affected by divergence?

Building on this question here: the term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...

bigcodeszzer

960

asked Sep 18 at 1:37

7 votes

1 answer

226 views

Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper-Threading enabled). My test loop is ...

Andrey Dmitriev

179

asked Sep 12 at 9:26

1 vote

1 answer

113 views

Which resources of a modern x86 CPU core are occupied by memory transactions in flight?

I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory ...

xealits

4,808

asked Sep 5 at 0:27

3 votes

2 answers

169 views

How to support Carryless Multiplication operation in .NET 8.0 on various platforms

I use the Pclmulqdq.CarrylessMultiply method in a .NET 8.0 / C# program. The method performs carryless multiplication using an x86 processor instruction, which is very fast. Method documentation: https://learn....

PanJanek

6,717

asked Sep 1 at 13:55

0 votes

0 answers

92 views

How does a failed spinlock CAS affect out-of-order speculation and RMW reordering on weak memory architectures?

I’m trying to understand how speculative execution interacts with weak memory models (ARM/Power) in the context of a spinlock implemented with a plain CAS. Example: // Spinlock acquisition attempt if (...

Delark

1,385

asked Aug 28 at 15:52

0 votes

0 answers

42 views

What protocol does the LLC directory use to synchronize parallel RFO signals?

The MESI or MOESI protocols need the LLC directory in order to work... and the directory needs to synchronize parallel RFO + snoop-invalidation calls in order for it to work (in TSO architectures that ...

Delark

1,385

asked Aug 27 at 0:47

3 votes

0 answers

123 views

IPC collapse with larger loop bodies despite constant I-cache miss rate, what's the bottleneck?

I'm seeing dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing loop body size in simple arithmetic code with no branches, but instruction cache miss rate stays exactly constant ...

Transcendental

969

asked Aug 26 at 1:47

2 votes

0 answers

210 views

Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?

I am measuring the latency of instructions. For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...

Zack Light

362

asked Aug 22 at 5:35

2 votes

0 answers

116 views

Is it possible on any real hardware, for the updated value of an atomic integer to become visible earlier via an indirect path than via a direct path?

Is it possible on any real hardware in the real world, for the updated value of an atomic integer written by one thread to become visible to another thread earlier via an indirect path, where a third ...

Qwert Yuiop

362

asked Aug 19 at 21:10

1 vote

1 answer

67 views

How to simulate a weakly ordered memory environment on a host with strong memory ordering?

I want to test whether there are any bugs in my program where memory ordering has not been used properly, but I do not have a weakly ordered memory environment for testing. For example, on x86, all loads ...

untitled

563

asked Aug 17 at 14:14

0 votes

1 answer

97 views

When an atomic variable becomes visible to a thread other than the writing thread, is it also immediately globally visible?

Suppose I have three threads. If x was written by thread2 and x is visible to thread1, do I have the guarantee that the latest value of x is also visible to thread3? In other words, can the new value ...

Qwert Yuiop

362

asked Aug 15 at 21:06

2 votes

2 answers

224 views

Can the hardware reorder an atomic load followed by an atomic store, if the store is conditional on the load?

Can the hardware reorder an atomic load followed by an atomic store, if the store is conditional on the load? It would be highly unintuitive if this could happen, because if thread1 speculatively due ...

Qwert Yuiop

362

asked Aug 15 at 20:54

2 votes

1 answer

96 views

Is there a seq_cst sequence between different parts of an atomic object when atomic operations with different sizes are mixed?

Updated: I already know that this is UB in ISO C; I apologize for the vague statement I made earlier. This question originates from my previous question Can atomic operations of different sizes be ...

untitled

563

asked Aug 15 at 16:22

3 votes

1 answer

135 views

What is the overhead of jumps and call-rets for CPU front-end decoder?

How do jumps and call-ret pairs affect the CPU front-end decoder in the best-case scenario, when there are few instructions, they are well cached, and branches are well predicted? For example, I run a ...

xealits

4,808

asked Aug 12 at 19:09

2 votes

1 answer

205 views

Can atomic operations of different sizes be mixed?

For the same memory address, if I use atomic operations of different widths to operate on it (assuming the memory is aligned), for example (assuming the hardware supports 128-bit atomic operations): #...

untitled

563

asked Aug 10 at 9:05

3 votes

4 answers

324 views

Is sizeof(pointer) the same as processor's native word size?

Say, is sizeof(void*) the same as the size the processor can atomically access per instruction? For example, a 32-bit processor can read aligned 4 bytes atomically, a 64-bit processor can read aligned 8 bytes ...

PkDrew

2,301

asked Aug 2 at 3:03

2 votes

1 answer

82 views

Portability of newer assembly instructions

I was reading through Arm64 instructions and saw that some instructions (e.g. CTZ) are optional and become mandatory with some extensions (e.g. Armv8.9 for CTZ). I am now wondering about portability ...

alexisrdt

524

asked Jul 19 at 2:59

5 votes

1 answer

118 views

Minimum required atomic instructions to support C++11 concurrency libraries

I'm implementing a multi-core system consisting of several custom/specialty CPUs. Those CPUs need to be able to support the C++11 concurrency libraries (thread/mutex etc.). I'm not sure what kind of ...

dsula

267

asked Jul 17 at 18:20

4 votes

1 answer

176 views

On x86-64 can aligned writes to code be assumed to be read atomically by other cores?

I'm investigating the possibility of cross-modifying (hotpatching) code without pausing other threads. The Intel and AMD manuals specifically document that aligned writes to memory of 1, 2, 4 or 8 ...

Joseph Garvin

22.3k

asked Jul 17 at 17:33

0 votes

0 answers

71 views

Why must memory addresses be aligned?

Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memories ...

LEE LUNA

1

asked Jul 8 at 9:39

-3 votes

1 answer

110 views

Understanding when a hazard in MIPS occurs

I have a question regarding these two instructions: lw r2, 10(r1) lw r1, 10(r2) Is there a hazard here? Do I need stalls between the two of them? I want to know if any kind of hazard happens here. I ...

mer mer

17

asked Jun 28 at 15:34
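
A minimal sketch of the dependency check behind the MIPS question above, assuming only lw instructions written in the form shown in the excerpt. The helper names are hypothetical. It reports a read-after-write (RAW) dependency when the second load uses the first load's destination register as its base address.

```python
import re

# Matches "lw rD, offset(rB)" as written in the excerpt.
LW_PATTERN = re.compile(r"lw\s+(r\d+),\s*(-?\d+)\((r\d+)\)")


def parse_lw(text: str) -> tuple[str, str]:
    """Return (destination register, base register) of an lw instruction."""
    match = LW_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not an lw instruction: {text!r}")
    dest, _offset, base = match.groups()
    return dest, base


def has_raw_dependency(first: str, second: str) -> bool:
    """True if `second` reads the register that `first` writes."""
    dest_first, _ = parse_lw(first)
    _, base_second = parse_lw(second)
    return dest_first == base_second


if __name__ == "__main__":
    # r2 is written by the first load and used as the base of the second.
    print(has_raw_dependency("lw r2, 10(r1)", "lw r1, 10(r2)"))
```

For the pair in the excerpt the check reports a dependency on r2. In the classic five-stage in-order pipeline with forwarding (an assumption, since the excerpt does not name the pipeline), a load followed immediately by a consumer of its result is the load-use case that needs one stall cycle.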

6 votes

0 answers

247 views

What is an explanation for the performance characteristics of CLWB when sharing data between cores (Tiger Lake)?

Goal: I would like to transfer a 32KiB buffer between two cores (C1 and C2) as fast as possible, performing loads and stores at both cores. Observation: A simple benchmark is devised: one core performs ...

doliphin

1,044

asked May 31 at 20:44

1 vote

1 answer

119 views

Cache line sizes for AMD Zen 3 Architecture

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...

Kush Jenamani

11

asked May 27 at 15:27

1 vote

0 answers

77 views

How to analyze the microarchitecture resource requirements based on the trace generated by program execution?

I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...

Gerrie

455

asked May 19 at 12:26

3 votes

0 answers

163 views

In the Independent Read Independent Write (IRIW) scenario, is changing loads to seq_cst alone sufficient to prevent the result in C++23?

In C++23, consider the classic IRIW litmus test, with the modification that all loads are now seq_cst, while stores are still relaxed: void reader0(atomic_int *x, atomic_int *y) { int l0x = x->...

Liu Xiaoyi

31

asked May 14 at 18:51

7 votes

2 answers

375 views

Why is rep movsb slow for overlapping forward memmove, and why does libc use it?

TL;DR: memmove that uses rep movsb is orders of magnitude slower on an overlapping array of 8193 bytes compared to memmove for 8191 elements that doesn't use rep movsb. I'm asking why. Consider the following ...

Alex Guteniev

14.3k

asked May 13 at 9:30

1 vote

0 answers

55 views

What is the baseline implementation for "allocate on update" cache policy for L2 cache?

Hello. I’m analyzing a fully-featured L2 cache with the following properties: non-blocking, write-allocate, write-back. For simplification: only full-cacheline stores are allowed (every store allocates ...

Konstantin Kazartsev

97

asked May 2 at 17:54

4 votes

1 answer

205 views

Why is my benchmark using __mm_prefetch slower?

I am trying to learn some C++ optimizations and I have tried using __mm_prefetch for summing an array. The benchmark test for my code is: #include <benchmark/benchmark.h> #include <vector>...

Tom McLean

6,643

asked Apr 25 at 14:51

1 vote

0 answers

77 views

Why is the core-to-core-latency performance of EPYC 4 so poor in NUMA2 mode?

I tested the EPYC 9564 CPU (dual socket); the core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...

wang fuqiang

81

asked Apr 25 at 2:34

0 votes

2 answers

112 views

Are cache coherence protocols only active when explicitly using certain types in your code?

When I read a description of cache coherence protocols it talks about how separate CPU cores keep track of memory address ranges modified, I think, through methods like bus snooping. The end result of ...

Zebrafish

16.3k

asked Apr 21 at 19:47

0 votes

1 answer

170 views

Why does my CPU's efficiency core have a lower core-to-core latency?

I used https://github.com/nviennot/core-to-core-latency to measure my CPU's (Intel(R) Core(TM) Ultra 7 268V) core-to-core latency and these are my results: ~/Developer/core-to-core-latency main ❯ ...

weineng

364

asked Apr 19 at 3:11

1 vote

0 answers

129 views

Why can't LWSYNC make the Independent Read Independent Write Example (IRIW) behave sensibly on PowerPC?

For the PowerPC platform, the book A Primer on Memory Consistency and Cache Coherence states: As depicted in Table 5.18, Power’s HWSYNCs can be used to make the Independent Read Independent Write Example (...

Anonemous

319

asked Apr 1 at 12:20

0 votes

1 answer

107 views

Execution stages in a superscalar microarchitecture

In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated that (under Multiple issue - Superscalar) the fetch and decode/dispatch stages must be enhanced so they can decode ...

Rishi

41

asked Mar 27 at 9:33

2 votes

1 answer

121 views

Context switching in hardware threads

In Hyper-Threading (or SMT), when two threads of a CPU core get swapped in and out, does a context switch occur? Would it be called a context switch? If not, what is the terminology for it?

Rishi

41

asked Mar 23 at 4:47

3 votes

1 answer

104 views

How to efficiently set a 64-bit register to -1 on x64? [duplicate]

If the question were "How to efficiently set a 64-bit register to zero on x64?" that would be easy. You just can't beat xor eax, eax. While it's not exactly correct to say it takes zero ...

David Wohlferd

7,610

asked Mar 14 at 22:54
