4,323 questions
Tooling
0
votes
1
replies
45
views
Simulating aarch64 (ARM 64-bit) branch predictor unit (BPU)
I am working on a microarchitectural tooling project, and as part of a heuristic I need the ability to observe and manipulate the internal state of a branch predictor. Specifically, I am looking for ...
-3
votes
0
answers
49
views
How to determine the exact architecture and architecture version? [closed]
My old desktop PC is not able to run some modern Docker images due to its architecture version. That desktop has an AMD Phenom II x6 1100T BE, and after a really long search on the internet, I found that ...
0
votes
0
answers
38
views
What constitutes a "Mixed I/O-bound" workload in the context of CPU scheduling simulations?
I am currently working on an assignment to simulate and analyze the performance of various CPU scheduling algorithms (such as FCFS, SJF, and Round Robin).
I was given a list of workload scenarios to ...
0
votes
2
answers
108
views
Why doesn't a Wallace tree produce numbers with more than 2n final product bits?
I am trying to understand Wallace Trees.
The algorithm goes
multiply each bit of one number a by each bit of the other (b), which is accomplished as a simple AND gate, where the partial product of ...
0
votes
1
answer
90
views
How does the LDA instruction interact with RAM and IR on MU1 architecture? [closed]
I am trying to understand this LDA instruction.
After the fetch instruction, IR holds LDA S.
1 - In the first step, DIN = [IR], IR is supposed to be interpreted as an address (because of the brackets),...
3
votes
0
answers
157
views
The cost of non-contiguous reads and writes (naive matrix transpose, power-of-2 and other sizes)
I was benchmarking a naive transposition and noticed a very large performance discrepancy between:
a naive operation where we read data contiguously and write with a large stride;
the ...
Advice
1
vote
2
replies
148
views
How many general purpose registers are on an x86-64 processor, including alias registers?
I was curious and wondering how many registers are on an x64 processor. I know there are 16 general purpose registers available to the user, but there are supposedly general purpose alias registers ...
5
votes
0
answers
230
views
Why does this data race have some consistent invariants with writers updating one of three atomic<int> variables?
I have the following program. The relevant info is:
There are 3 variables atomic<int> x,y,z accessed by all threads.
3 writer threads: Each thread reads all 3 values x, y, z and updates exactly 1 ...
0
votes
2
answers
111
views
"docker: no matching manifest for linux/amd64 in the manifest list entries"
I'm trying to run software on a big-endian architecture. Following the update at the end of this answer, I tried this:
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
...
3
votes
0
answers
83
views
After enabling an interrupt via CSRRW in RISC-V, how many instructions may execute before trap entry?
In RISC-V machine mode, when you issue a csrrw that sets a bit in mie (i.e. enabling an interrupt that is already pending), must the very next instruction immediately branch to the interrupt handler? ...
2
votes
0
answers
94
views
Too big a latency of ping-pong between two IPC processes on Sapphire Rapids Xeon with plain loads and stores, instruction order makes a big difference
I am running simple Ping/Pong between two processes A, B with shared memory:
shm_A and shm_B are in separate cache lines. Allocated with separate calls to shm_open, so probably in different pages, ...
2
votes
1
answer
83
views
Why are J1 and J2 used with XOR in ARMv6-M BL instruction immediate calculation?
I’m trying to understand how the BL instruction is decoded in the ARMv6-M architecture.
The part I don’t get is in the imm32 calculation: the values of I1 and I2 are derived using J1 and J2, but they’...
1
vote
1
answer
104
views
How are MMIO requests routed in CPU microarchitecture — cache-bypass on same path or a separate bus/port?
Short background: MMIO regions are typically mapped as uncacheable / device memory, so the CPU must not treat device registers like normal cacheable DRAM. I’m asking about the microarchitecture routing and ...
0
votes
0
answers
63
views
How to decide the data size handled by each processor/core in SIMD?
I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing.
Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently ...
2
votes
2
answers
182
views
Fractional-cycle latency of CPU instructions
I am trying to characterize the instruction latency of ARM's aese and aesmc instructions in Apple's M1, M3 and M4 CPUs.
For M1, Dougall Johnson obtains 3 cycles for a fused pair of aese + aesmc. ...
1
vote
1
answer
108
views
Is CPU multithreading affected by divergence?
Building on this question here
The term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
7
votes
1
answer
226
views
Why are all IMUL µOPs dispatched to Port 1 only (on Haswell), even when multiple IMULs are executed in parallel?
I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled).
My test loop is ...
1
vote
1
answer
113
views
Which resources of a modern x86 CPU core are occupied by memory transactions in flight?
I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory ...
3
votes
2
answers
169
views
How to support Carryless Multiplication operation in .NET 8.0 on various platforms
I use the Pclmulqdq.CarrylessMultiply method in a .NET 8.0 / C# program. The method performs carryless multiplication using an x86 processor instruction, which is very fast.
Method documentation: https://learn....
0
votes
0
answers
92
views
How does a failed spinlock CAS affect out-of-order speculation and RMW reordering on weak memory architectures?
I’m trying to understand how speculative execution interacts with weak memory models (ARM/Power) in the context of a spinlock implemented with a plain CAS. Example:
// Spinlock acquisition attempt
if (...
0
votes
0
answers
42
views
What protocol does the LLC directory use to synchronize parallel RFO signals?
The MESI or MOESI protocols need the LLC directory in order to work... and the directory needs to synchronize parallel RFO + snoop-invalidation calls in order for it to work
(in TSO architectures that ...
3
votes
0
answers
123
views
IPC collapse with larger loop bodies despite constant I-cache miss rate, what's the bottleneck?
I'm seeing dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing loop body size in simple arithmetic code with no branches, but instruction cache miss rate stays exactly constant ...
2
votes
0
answers
210
views
Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?
I am measuring the latency of instructions.
For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...
2
votes
0
answers
116
views
Is it possible on any real hardware, for the updated value of an atomic integer to become visible earlier via an indirect path than via a direct path?
Is it possible on any real hardware in the real world, for the updated value of an atomic integer written by one thread to become visible to another thread earlier via an indirect path, where a third ...
1
vote
1
answer
67
views
How to simulate a weakly ordered memory environment on a host with strong memory ordering?
I want to test whether my program has any bugs where memory ordering has not been used properly, but I do not have a weakly ordered memory environment for testing.
For example, on x86, all loads ...
0
votes
1
answer
97
views
When an atomic variable becomes visible to a thread other than the writing thread, is it also immediately globally visible?
Suppose I have three threads. If x was written by thread2 and x is visible to thread1, do I have the guarantee that the latest value of x is also visible to thread3? In other words, can the new value ...
2
votes
2
answers
224
views
Can the hardware reorder an atomic load followed by an atomic store, if the store is conditional on the load?
Can the hardware reorder an atomic load followed by an atomic store, if the store is conditional on the load? It would be highly unintuitive if this could happen, because if thread1 speculatively due ...
2
votes
1
answer
96
views
Is there a seq_cst sequence between different parts of an atomic object when atomic operations with different sizes are mixed?
Updated:
I already know that this is UB in ISO C; I apologize for the vague statement I made earlier.
This question originates from my previous question
Can atomic operations of different sizes be ...
3
votes
1
answer
135
views
What is the overhead of jumps and call-rets for CPU front-end decoder?
How do jumps and call-ret pairs affect the CPU front-end decoder in the best-case scenario, when there are few instructions, they are well cached, and branches are well predicted?
For example, I run a ...
2
votes
1
answer
205
views
Can atomic operations of different sizes be mixed?
For the same memory address, if I use atomic operations of different widths to operate on it (assuming the memory is aligned), for example (assuming the hardware supports 128-bit atomic operations):
#...
3
votes
4
answers
324
views
Is sizeof(pointer) the same as processor's native word size?
Say, is sizeof(void*) the same as the size the processor can atomically access per instruction?
For example, a 32-bit processor can read aligned 4 bytes atomically, a 64-bit processor can read aligned 8 bytes ...
2
votes
1
answer
82
views
Portability of newer assembly instructions
I was reading through Arm64 instructions and saw that some instructions (e.g. CTZ) are optional and become mandatory with some extensions (e.g. Armv8.9 for CTZ).
I am now wondering about portability ...
5
votes
1
answer
118
views
Minimum required atomic instructions to support C++11 concurrency libraries
I'm implementing a multi-core system consisting of several custom/specialty CPUs. Those CPUs need to be able to support the C++11 concurrency libraries (thread/mutex etc.).
I'm not sure what kind of ...
4
votes
1
answer
176
views
On x86-64 can aligned writes to *code* be assumed to be read atomically by other cores?
I'm investigating the possibility of cross-modifying (hotpatching) code without pausing other threads.
The Intel and AMD manuals specifically document that aligned writes to memory of 1, 2, 4 or 8 ...
0
votes
0
answers
71
views
Why must memory addresses be aligned?
Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memories ...
-3
votes
1
answer
110
views
Understanding when a hazard in MIPS occurs
I have a question regarding these two instructions:
lw r2, 10(r1)
lw r1, 10(r2)
Is there a hazard here? Do I need stalls between the two of them?
I want to know if any kind of hazard happens here? I ...
6
votes
0
answers
247
views
What is an explanation for the performance characteristics of CLWB when sharing data between cores (Tigerlake)?
Goal
I would like to transfer a 32KiB buffer between two cores (C1 and C2) as fast as possible, performing loads and stores at both cores.
Observation
A simple benchmark is devised: one core performs ...
1
vote
1
answer
119
views
Cache line sizes for AMD Zen 3 Architecture
I wanted to see if I am correctly interpreting the attached diagram.
It shows the AMD Zen 3's cache lines.
OC Fetch is Opcode Cache,
IC Fetch is Instruction Cache.
I am just unable to make sense of ...
1
vote
0
answers
77
views
How to analyze the microarchitecture resource requirements based on the trace generated by program execution?
I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...
3
votes
0
answers
163
views
In the Independent Read Independent Write (IRIW) scenario, is changing loads to seq_cst alone sufficient to prevent the result in C++23?
In C++23, consider the classic IRIW litmus test, with the modification that all loads are now seq_cst, while stores are still relaxed:
void reader0(atomic_int *x, atomic_int *y) {
int l0x = x->...
7
votes
2
answers
375
views
Why is rep movsb slow for overlapping forward memmove, and why does libc use it?
TL;DR: memmove that uses rep movsb is orders of magnitude slower on an overlapping array of 8193 bytes compared to memmove for 8191 elements that doesn't use rep movsb.
I'm asking why.
Consider the following ...
1
vote
0
answers
55
views
What is the baseline implementation for "allocate on update" cache policy for L2 cache?
Hello.
I’m analyzing a fully-featured L2 cache with the following properties:
Non-blocking
Write-allocate
Write-back
For simplification: Only full-cacheline stores are allowed (every store allocates ...
4
votes
1
answer
205
views
Why is my benchmark using _mm_prefetch slower?
I am trying to learn some C++ optimizations and I have tried using _mm_prefetch for summing an array. The benchmark test for my code is:
#include <benchmark/benchmark.h>
#include <vector>...
1
vote
0
answers
77
views
Why is the core-to-core-latency performance of EPYC 4 so poor in NUMA2 mode?
I tested the EPYC 9564 CPU (dual socket), and the core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...
0
votes
2
answers
112
views
Are cache coherence protocols only active when explicitly using certain types in your code?
When I read a description of cache coherence protocols it talks about how separate CPU cores keep track of memory address ranges modified, I think, through methods like bus snooping. The end result of ...
0
votes
1
answer
170
views
Why does my CPU's efficiency core have a lower core-to-core latency?
I used https://github.com/nviennot/core-to-core-latency to measure my CPU's (Intel(R) Core(TM) Ultra 7 268V) core-to-core latency and these are my results:
~/Developer/core-to-core-latency main
❯ ...
1
vote
0
answers
129
views
Why can't LWSYNC make the Independent Read Independent Write Example (IRIW) behave sensibly on PowerPC?
For the PowerPC platform, the book A Primer on Memory Consistency and Cache Coherence states:
As depicted in Table 5.18, Power’s HWSYNCs can be used to make the
Independent Read Independent Write Example (...
0
votes
1
answer
107
views
Execution stages in a superscalar microarchitecture
In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated that (under Multiple issue - Superscalar)
the fetch and decode/dispatch stages must be enhanced so they can decode ...
2
votes
1
answer
121
views
Context switching in hardware threads
In Hyper-Threading (or SMT), when two threads of a CPU core get swapped in and out, does a context switch occur?
Would it be called a context switch? If not, what is the terminology for it?
3
votes
1
answer
104
views
How to efficiently set a 64-bit register to -1 on x64? [duplicate]
If the question were "How to efficiently set a 64-bit register to zero on x64?" that would be easy. You just can't beat xor eax, eax. While it's not exactly correct to say it takes zero ...