Tooling
0 votes
1 reply
45 views

I am working on a microarchitectural tooling project, and as part of a heuristic I need the ability to observe and manipulate the internal state of a branch predictor. Specifically, I am looking for ...
Gal Kaptsenel's user avatar
-3 votes
0 answers
49 views

My old desktop PC is not able to run some modern Docker images due to its architecture version. That desktop has an AMD Phenom II x6 1100T BE, and after a really long search on the internet, I found that ...
PillFall's user avatar
  • 188
0 votes
0 answers
38 views

I am currently working on an assignment to simulate and analyze the performance of various CPU scheduling algorithms (such as FCFS, SJF, and Round Robin). I was given a list of workload scenarios to ...
TBOYT's user avatar
  • 1
0 votes
2 answers
108 views

I am trying to understand Wallace Trees. The algorithm goes: multiply each bit of one number (a) by each bit of the other (b), which is accomplished with a simple AND gate, where the partial product of ...
yosmo78's user avatar
  • 641
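The first step described in the excerpt above (ANDing each bit of a with each bit of b) can be sketched roughly as follows. This is only a minimal illustration, not the asker's code: the function names `partial_products` and `sum_columns` are hypothetical, and the sketch stops at grouping the partial products by column weight. It does not perform the carry-save reduction that a real Wallace tree applies afterwards.

```python
def partial_products(a: int, b: int, width: int) -> list[list[int]]:
    """Group the 1-bit partial products a_i AND b_j by column weight i + j."""
    columns: list[list[int]] = [[] for _ in range(2 * width - 1)]
    for i in range(width):
        for j in range(width):
            # One AND gate per pair of input bits.
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            columns[i + j].append(bit)
    return columns


def sum_columns(columns: list[list[int]]) -> int:
    """Reference check: add every partial product at its weight 2**k."""
    return sum(sum(col) << k for k, col in enumerate(columns))
```

Summing the columns at their weights should reproduce the ordinary product, which gives a simple way to check the partial-product stage before any reduction layers are added.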
0 votes
1 answer
90 views

I am trying to understand this LDA instruction. After the fetch instruction, IR holds LDA S. 1 - In the first step, DIN = [IR], IR is supposed to be interpreted as an address (because of the brackets),...
Mohamed Badis Kerdellou's user avatar
3 votes
0 answers
157 views

I was benchmarking a naive transposition and noticed a very large performance discrepancy between: a naive operation where we read data contiguously and write with a large stride; the ...
Etienne M's user avatar
  • 715
Advice
1 vote
2 replies
148 views

I was curious and wondering how many registers are on an x64 processor. I know there are 16 general purpose registers available to the user, but there are supposedly general purpose alias registers ...
misInformationSpreader's user avatar
5 votes
0 answers
230 views

I have the following program. The relevant info is: There are 3 variables atomic<int> x, y, z accessed by all threads. 3 writer threads: Each thread reads all 3 values x, y, z, and updates exactly 1 ...
Huy Le's user avatar
  • 1,989
0 votes
2 answers
111 views

I'm trying to run software on a big-endian architecture. Following the update at the end of this answer, I tried this: $ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes ...
optical's user avatar
  • 267
3 votes
0 answers
83 views

In RISC-V machine mode, when you issue a csrrw that sets a bit in mie (i.e. enabling an interrupt that is already pending), must the very next instruction immediately branch to the interrupt handler? ...
Ömer GÜZEL's user avatar
2 votes
0 answers
94 views

I am running simple Ping/Pong between two processes A, B with shared memory: shm_A and shm_B are in separate cache lines. Allocated with separate calls to shm_open, so probably in different pages, ...
Samuel Hapak's user avatar
  • 7,284
2 votes
1 answer
83 views

I’m trying to understand how the BL instruction is decoded in the ARMv6-M architecture. The part I don’t get is in the imm32 calculation: the values of I1 and I2 are derived using J1 and J2, but they’...
zenprogrammer's user avatar
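As a rough illustration of the imm32 calculation the excerpt above asks about, here is a small sketch of the T1 BL encoding as I understand it from the Arm architecture pseudocode: I1 = NOT(J1 EOR S), I2 = NOT(J2 EOR S), and imm32 = SignExtend(S:I1:I2:imm10:imm11:'0'). The function name `decode_bl_imm32` is hypothetical, and the result is returned as a signed Python integer rather than a 32-bit pattern.

```python
def decode_bl_imm32(hw1: int, hw2: int) -> int:
    """Decode the signed branch offset of a Thumb BL from its two halfwords."""
    s = (hw1 >> 10) & 1      # sign bit S
    imm10 = hw1 & 0x3FF      # upper part of the offset
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF      # lower part of the offset

    # I1 and I2 are J1 and J2 XORed with S, then inverted.
    i1 = (~(j1 ^ s)) & 1
    i2 = (~(j2 ^ s)) & 1

    # Concatenate S:I1:I2:imm10:imm11:'0' into a 25-bit value.
    imm25 = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)

    # Sign-extend from 25 bits.
    return imm25 - (1 << 25) if s else imm25
```

For example, the halfword pair 0xF7FF, 0xFFFE (a BL that targets itself) decodes to an offset of -4, which matches a PC value that reads 4 bytes ahead of the instruction.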
1 vote
1 answer
104 views

Short background: MMIO regions are typically mapped as uncachable / device memory, so CPU must not treat device registers like normal cacheable DRAM. I’m asking about the microarchitecture routing and ...
SungwookKang's user avatar
0 votes
0 answers
63 views

I’m learning how to use SIMD (Single Instruction, Multiple Data) for parallel data processing. Suppose I have a large dataset (e.g., an array of 1 million floats), and I want to process it efficiently ...
Catdev's user avatar
  • 1
2 votes
2 answers
182 views

I am trying to characterize the instruction latency of ARM's aese and aesmc instructions in Apple's M1, M3 and M4 CPUs. For M1, Dougall Johnson obtains [3 cycles][1] for a fused pair of aese + aesmc. ...
swineone's user avatar
  • 3,000
1 vote
1 answer
108 views

Building on this question here: the term thread divergence is used in CUDA; from my understanding it's a situation where different threads are assigned to do different tasks and this results in a big ...
bigcodeszzer's user avatar
7 votes
1 answer
226 views

I'm experimenting with the IMUL r64, r64 instruction on an Intel Xeon E5-1620 v3 (Haswell architecture, base clock 3.5 GHz, turbo boost up to 3.6 GHz, Hyper Threading is enabled). My test loop is ...
Andrey Dmitriev's user avatar
1 vote
1 answer
113 views

I want to clarify how modern x86 architectures handle the latency of memory transactions that go all the way to DRAM. Specifically, which resources (which queues) get occupied waiting for the memory ...
xealits's user avatar
  • 4,808
3 votes
2 answers
169 views

I use the Pclmulqdq.CarrylessMultiply method in a .NET 8.0 / C# program. The method performs carryless multiplication using an x86 processor instruction, which is very fast. Method documentation: https://learn....
PanJanek's user avatar
  • 6,717
0 votes
0 answers
92 views

I’m trying to understand how speculative execution interacts with weak memory models (ARM/Power) in the context of a spinlock implemented with a plain CAS. Example: // Spinlock acquisition attempt if (...
Delark's user avatar
  • 1,385
0 votes
0 answers
42 views

The MESI or MOESI protocols need the LLC directory in order to work... and the directory needs to synchronize parallel RFO + snoop-invalidation calls in order for it to work (in TSO architectures that ...
Delark's user avatar
  • 1,385
3 votes
0 answers
123 views

I'm seeing dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing loop body size in simple arithmetic code with no branches, but instruction cache miss rate stays exactly constant ...
Transcendental's user avatar
2 votes
0 answers
210 views

I am measuring the latency of instructions. For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...
Zack Light's user avatar
2 votes
0 answers
116 views

Is it possible on any real hardware in the real world, for the updated value of an atomic integer written by one thread to become visible to another thread earlier via an indirect path, where a third ...
Qwert Yuiop's user avatar
1 vote
1 answer
67 views

I want to test whether there are any bugs in my program where memory ordering has not been used properly, but I do not have a weakly ordered memory environment for testing. For example, on x86, all loads ...
untitled's user avatar
  • 563
0 votes
1 answer
97 views

Suppose I have three threads. If x was written by thread2 and x is visible to thread1, do I have the guarantee that the latest value of x is also visible to thread3? In other words, can the new value ...
Qwert Yuiop's user avatar
2 votes
2 answers
224 views

Can the hardware reorder an atomic load followed by an atomic store, if the store is conditional on the load? It would be highly unintuitive if this could happen, because if thread1 speculatively due ...
Qwert Yuiop's user avatar
2 votes
1 answer
96 views

Updated: I already know that this is UB in ISO C; I apologize for the vague statement I made earlier. This question originates from my previous question Can atomic operations of different sizes be ...
untitled's user avatar
  • 563
3 votes
1 answer
135 views

How do jumps and call-ret pairs affect the CPU front-end decoder in the best-case scenario, when there are few instructions, they are well cached, and branches are well predicted? For example, I run a ...
xealits's user avatar
  • 4,808
2 votes
1 answer
205 views

For the same memory address, if I use atomic operations of different widths to operate on it (assuming the memory is aligned), for example (assuming the hardware supports 128-bit atomic operations): #...
untitled's user avatar
  • 563
3 votes
4 answers
324 views

Say, is sizeof(void*) the same as the size processor can atomically access per instruction? For example, 32-bit processor can read aligned 4 bytes atomically, 64-bit processor can read aligned 8 bytes ...
PkDrew's user avatar
  • 2,301
2 votes
1 answer
82 views

I was reading through Arm64 instructions and saw that some instructions (e.g. CTZ) are optional and become mandatory with some extensions (e.g. Armv8.9 for CTZ). I am now wondering about portability ...
alexisrdt's user avatar
  • 524
5 votes
1 answer
118 views

I'm implementing a multi core system consisting of several custom/specialty CPUs. Those CPUs need to be able to support the C++11 concurrency libraries (thread/mutex etc.). I'm not sure what kind of ...
dsula's user avatar
  • 267
4 votes
1 answer
176 views

I'm investigating the possibility of cross-modifying (hotpatching) code without pausing other threads. The Intel and AMD manuals specifically document that aligned writes to memory of 1, 2, 4 or 8 ...
Joseph Garvin's user avatar
0 votes
0 answers
71 views

Memory addresses must be aligned before they are used. I know that if they are not, there is a performance cost in CPU caching. I discovered that certain processors raise exceptions when unaligned memories ...
LEE LUNA's user avatar
-3 votes
1 answer
110 views

I have a question regarding these two instructions: lw r2, 10(r1) lw r1, 10(r2) Is there a hazard here? Do I need stalls between the two of them? I want to know if any kind of hazard happens here. I ...
mer mer's user avatar
  • 17
6 votes
0 answers
247 views

Goal I would like to transfer a 32KiB buffer between two cores (C1 and C2) as fast as possible, performing loads and stores at both cores. Observation A simple benchmark is devised: one core performs ...
doliphin's user avatar
  • 1,044
1 vote
1 answer
119 views

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...
Kush Jenamani's user avatar
1 vote
0 answers
77 views

I'm doing an in-depth CPU microarchitectural resource analysis. I want to know the requirements of my program on processor microarchitectural resources and compare the requirements of different ...
Gerrie's user avatar
  • 455
3 votes
0 answers
163 views

In C++23, consider the classic IRIW litmus test, with the modification that all loads are now seq_cst, while stores are still relaxed: void reader0(atomic_int *x, atomic_int *y) { int l0x = x->...
Liu Xiaoyi's user avatar
7 votes
2 answers
375 views

TL;DR: memmove that uses rep movsb is orders of magnitude slower on an overlapping array of 8193 bytes compared to memmove for 8191 elements that doesn't use rep movsb. I'm asking why. Consider the following ...
Alex Guteniev's user avatar
1 vote
0 answers
55 views

Hello. I’m analyzing a fully-featured L2 cache with the following properties: Non-blocking Write-allocate Write-back For simplification: Only full-cacheline stores are allowed (every store allocates ...
Konstantin Kazartsev's user avatar
4 votes
1 answer
205 views

I am trying to learn some C++ optimizations and I have tried using _mm_prefetch for summing an array. The benchmark test for my code is: #include <benchmark/benchmark.h> #include <vector>...
Tom McLean's user avatar
  • 6,643
1 vote
0 answers
77 views

I tested the EPYC 9564 CPU (dual socket); the core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...
wang fuqiang's user avatar
0 votes
2 answers
112 views

When I read a description of cache coherence protocols it talks about how separate CPU cores keep track of memory address ranges modified, I think, through methods like bus snooping. The end result of ...
Zebrafish's user avatar
  • 16.3k
0 votes
1 answer
170 views

I used https://github.com/nviennot/core-to-core-latency to measure my CPU's (Intel(R) Core(TM) Ultra 7 268V) core-to-core latency and these are my results: ~/Developer/core-to-core-latency main ❯ ...
weineng's user avatar
  • 364
1 vote
0 answers
129 views

On PowerPC platform, the book A Primer on Memory Consistency and Cache Coherence stated: As depicted in Table 5.18, Power’s HWSYNCs can be used to make the Independent Read Independent Write Example (...
Anonemous's user avatar
  • 319
0 votes
1 answer
107 views

In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated that (under Multiple issue - Superscalar) the fetch and decode/dispatch stages must be enhanced so they can decode ...
Rishi's user avatar
  • 41
2 votes
1 answer
121 views

In Hyper-Threading (or SMT), when two threads of a CPU core get swapped in and out, does a context switch occur? Would it be called a context switch? If not, what is the terminology for it?
Rishi's user avatar
  • 41
3 votes
1 answer
104 views

If the question were "How to efficiently set a 64bit register to zero on x64?" that would be easy. You just can't beat xor eax, eax. While it's not exactly correct to say it takes zero ...
David Wohlferd's user avatar
