541 questions
3
votes
0
answers
157
views
The cost of non contiguous reads and writes (naive matrix transpose, power-of-2 and other sizes)
I was benchmarking a naive transposition and noticed a very large performance discrepancy between:
a naive operation where we read data contiguously and write with a large stride;
the ...
10
votes
1
answer
431
views
AVX-512 MD5 implementation: unexplained performance regression on Zen 4
I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...
1
vote
0
answers
34
views
Tracking Per Channel Memory Traffic in AMD Zen 2 (Rome)
I am using perf to profile workloads on my system, and I need to track the memory traffic generated by my workload on each NUMA node. Currently, I only have perf results for LLC cache misses, which ...
0
votes
1
answer
117
views
Why does PERF_COUNT_HW_REF_CPU_CYCLES have much higher variance on Zen 5 CPUs than PERF_COUNT_HW_CPU_CYCLES?
My understanding is that PERF_COUNT_HW_REF_CPU_CYCLES should map to some counter that counts at a constant rate, as opposed to PERF_COUNT_HW_CPU_CYCLES which is affected by frequency scaling. I'd ...
1
vote
1
answer
119
views
Cache line sizes for AMD Zen 3 Architecture
I wanted to see if I am correctly interpreting the attached diagram.
It shows the AMD Zen 3's cache lines.
OC Fetch is Opcode Cache,
IC Fetch is Instruction Cache.
I am just unable to make sense of ...
1
vote
0
answers
77
views
Why is the core-to-core-latency performance of EPYC 4 so poor in NUMA2 mode?
I am testing the EPYC 9564 CPU (dual socket). The core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...
2
votes
0
answers
69
views
Why does perf complain that it cannot open this L1 cache event on Zen 2?
I am trying to read cache events on an AMD Zen 2:
L1d all read accesses
L1d all write accesses
L1d read misses (not shown below)
L1d write misses (not shown below)
According to the perf_event_open(2) ...
1
vote
0
answers
118
views
Can Zen 4 run more than 1 branch per cycle?
In Performance optimization, and how to do it wrong
the author claims:
the CPU can't predict more than one branch per cycle
A single if statement inside a loop is enough to stop any further ...
0
votes
0
answers
40
views
Can CLGI block virtual interrupts or not?
The AMD SDM implies CLGI can block vINTR
Table 15-10 Effect of the GIF on Interrupt Handling
15.21.4 Injecting Virtual (INTR) Interrupts
The processor takes a virtual INTR interrupt if:
V_IRQ and ...
1
vote
0
answers
91
views
What do the StD and IntD components mean in the Zen 5 CPU microarchitecture?
In the AMD Zen5 architecture block diagram, the FP/Vector execution unit has two components, StD and IntD with arrows connecting them to the "Load/Store Queue". What are the functions of ...
2
votes
0
answers
72
views
How to verify the granularity of memory access interleaving across different channels?
According to AMD's material, access to contiguous physical addresses will be interleaved across all memory channels (if set to NPS1). When a machine has 8 memory channels and the size of memory ...
1
vote
1
answer
175
views
Unable to cross-compile Rust project using tokio-udev
I want to cross-compile a minimal project which uses tokio-udev. The linker fails because of missing libudev:
aarch64-linux-musl/bin/ld: cannot find -ludev
I can cross-compile Rust projects which do ...
2
votes
1
answer
199
views
Tracking DRAM traffic in AMD Zen 2 (Rome)
I want to track the number of read/write accesses at each of the Unified Memory Controllers (UMCs) in my AMD EPYC processor (family: 0x17 and model: 0x31). The AMDuProfPcm tool, when used with the -m ...
0
votes
1
answer
237
views
Choose the CPU (Intel or AMD) of the machine hosting the action runner
Besides choosing between linux/windows/mac and 32/64 bit, is it possible to choose the processor of the machine where the action runner will be running? In my organization we have been using actions ...
0
votes
1
answer
50
views
Problems opening FOC motor control app in Vivado 2023.2
I have bought the Kria KD240 Starter Kit to get used to working with drives applications and FOC control. I am following the steps mentioned here but I can't open the Vivado project correctly. When I ...
5
votes
0
answers
125
views
Repeated x87 fnstenv yields cleared instruction pointer after arbitrary time
I have a program that calls the x87 instruction fnstenv multiple times per second and with only the occasional floating point computation being executed (in periods of multiple seconds apart), I had ...
2
votes
1
answer
108
views
Why does an AMD processor use a sub instruction instead of xor to verify the stack canary?
So I've been exploring chapter 12 of the picoCTF primer and suddenly saw a difference between my assembly of the program and picoCTF's at the end of the main function, where the stack canary is being ...
3
votes
1
answer
187
views
Twice as slow SIMD performance without extra copy
I've been optimizing some code, and stumbled across a peculiar case.
Here are the two assembly listings:
; FAST
lea rcx,[rsp+50h]
call qword ptr [Random_get_float3] ;this function ...
2
votes
1
answer
172
views
SymFromAddr fails on AMD Machine with the error message "Attempt to access Invalid address"
struct StackFrame
{
    DWORD64 address;
    std::string name;
    std::string module;
    std::string filename;
    int line_number;
};
std::vector<StackFrame> GetStackTrace(CONTEXT context)
{
...
1
vote
0
answers
855
views
Cache inclusivity policy differences on x86 between Intel and AMD
(tldr: the question itself is at the bottom)
I've read that on AMD family 17h processors (Zen-Zen2, although it might be the case with the following generations as well, but I am not familiar with ...
0
votes
0
answers
149
views
How to debug a HIP/HIPRT application on Windows?
I'm writing a path tracer using HIPRT on Windows but I couldn't find anything to debug my application yet. I'd like to be able to execute my kernels line by line, watch kernel variables, print to ...
2
votes
0
answers
134
views
Why do instructions after an atomic operation make execution faster (on an AMD CPU)?
I wrote the following test cases to bench some operations:
#define BENCH_ROUNDS 1000000000 // 10**9
static volatile UINT64 _test_argument, _test_result;
static _Atomic(UINT64) _test_atom;
// For ...
1
vote
0
answers
285
views
Why is polars faster on an Intel CPU than on an AMD CPU?
I have two PCs: one is an Intel i7 13700KF with 64GB RAM and the other is an AMD 3970X with the same RAM. Both PCs use an SSD as storage and both have Python 3.11 and polars 0.20.5. I run the code below:
df = pl....
9
votes
0
answers
288
views
Are there processors on which VPMASKMOVD generates faults for the masked-out elements?
Are there processors on which VPMASKMOVD generates faults for the masked-out elements?
Going by the Intel Software Developer's Manual, the answer is plainly "no":
Faults occur only due to ...
0
votes
0
answers
144
views
What's the difference between those "cache_as_ram.S" in coreboot?
I want to learn how "cache as RAM" works, so I found some asm files in "/src/cpu/intel/car/" from coreboot. But there are four folders containing "cache_as_ram.S". What'...
0
votes
1
answer
187
views
Why is amd_pmu_v2_handle_irq being called when not using perf?
amd_pmu_v2_handle_irq should be used to handle PMU overflow on AMD processors. When I use perf top -ag in the system, it is heavily called.
But when I use the perf stat -a command, there are fewer ...
-1
votes
1
answer
333
views
Why is the frequency of the CPU lower than the Max. Boost Clock?
I am using AMD's EPYC 7713 CPU. According to the specification, its maximum frequency is 3.675GHz. But when I run stress-ng (only running single threaded cpu loads), its frequency does not exceed 3....
2
votes
2
answers
1k
views
What x86 CPUs, if any, still have MOVDIRI or MOVDIR64b instructions?
I've recently been checking the Intel CPUs that I have access to.
None of them (they're all Xeons) have the MOVDIRI or MOVDIR64b instructions, which are store instructions that bypass the caches. Are ...
0
votes
1
answer
352
views
Illegal instruction (core dumped) in cv::findHomography
I am getting this error:
Illegal instruction (core dumped)
When calling:
cv::findHomography(query_points, reference_points, cv::RANSAC, homography_ransac_threshold_, h_mask);
This happens only on an AWS ...
0
votes
0
answers
1k
views
What are the advantages of write-combine memory compared to write-back memory?
In Software Optimization Guide for the AMD Zen4 Microarchitecture, it is written that:
Write-combining is the merging of multiple memory write cycles that target locations within the address range of ...
2
votes
0
answers
453
views
What does the cache bank mean in AMD CPU?
In AMD's optimization manual, the L1 Data cache is described as follows:
The L1 DC provides multiple access ports using a banked structure. The read ports are shared by three load pipes and victim ...
2
votes
1
answer
944
views
What's the difference between dispatching and issuing in CPU pipeline
In Software Optimization Guide for the AMD Zen4 Microarchitecture, the terminology is explained as follows:
Dispatching: Dispatching refers to the act of transferring macro ops from the front end of ...
4
votes
1
answer
593
views
What does L2 poison mean in CPU?
I have encountered the same problem as this.
What does L2 poison mean?
I'm using AMD CPU.
0
votes
1
answer
326
views
How to test the latency and throughput of an intrinsic function?
In Intel's Intrinsic guide, each function has its own latency and throughput. For example, _mm256_loadu_ps:
Architecture, Latency, Throughput (CPI)
Alderlake, 7, 0.333333333
Icelake Intel Core, 7, 0.5
...
0
votes
0
answers
104
views
model.fit() stopping halfway on 1 epoch using tensorflow-directml. What to do?
Currently using tensorflow-directml as I am training a model on AMD (RX 580). The problem is, upon model.fit() it seems to be stuck at epoch 1 with no progress. Here's my code and error:
with ...
7
votes
0
answers
3k
views
Intel OneAPI MPI MKL with AMD, is there an AMD flavor?
I've always used Intel CPUs in Intel-chipset-based servers, and as such have used Intel's MPI and MKL for the past 20 years; that's all I kinda know. With their OneAPI I only need and use MPI, ...
1
vote
0
answers
113
views
How do different monitoring tools calculate memory bandwidth?
For monitoring memory bandwidth, there is pcm-memory on the Intel platform and AMDuProf on the AMD platform.
How do they calculate memory bandwidth usage? Which PMUs were used?
Is it using 1024 or ...
2
votes
0
answers
1k
views
Difference in floating-point arithmetic on AMD CPUs and Intel CPUs
I am a student majoring in computational science. When I deal with mixed-precision projects on AMD CPUs, I find that single precision data behaves similarly to double precision data. Sometimes, single-...
4
votes
0
answers
93
views
Use perf to see if I'm write bound?
I have a loop that's running slower than I expected. I measure how long it takes per collection it processes and notice it takes twice as long when I use 8 cores (overall 4x faster). There's no data ...
1
vote
0
answers
127
views
Ryzen AMD x86_64 increment for 64 bits on memory runs 8 times faster than 8,16 or 32 bit increment
I wanted to benchmark the atomic instructions compared to the non-atomic ones, so I wrote the code that follows below. Besides benchmarking locked accesses I noticed a different aspect too that seems to ...
0
votes
1
answer
62
views
How can I use kernel functions in SVM root(execute) mode?
I ran into the following problem: when I initialize the kernel hypervisor (SVM in my case), I exit from vmrun and get into my SvmExitHandler (this is the dispatcher that manages exit codes), then ...
1
vote
1
answer
88
views
Assembly instructions showing how zenbleed was found
While looking at this zenbleed article, it was found that a randomly generated sequence of instructions and the same sequence but with randomized alignment, serialization and speculation fences added ...
2
votes
0
answers
419
views
Obtaining SMI_COUNT on an AMD CPU
I am trying to get familiar with AMD's interface to SMM. I want to implement a simple task:
Check SMI_COUNT
Trigger SMI
Check SMI_COUNT after trigger
The SMI-interrupt is a rare thing (I believe), so ...
1
vote
0
answers
115
views
numpy built with locally built blis does not use multithreading
I'm looking for help with an issue I'm having building Numpy against locally built blis for zen3.
I've configured blis to enable threading using openmp. (it is installed and working on my machine, ...
-1
votes
1
answer
290
views
Windows 10 nested virtualization on AMD CPU
I work at a software company, mainly developing on Linux. For Windows development we have a couple of machines that are shared. However, a new project came up, and we need more resources on ...
1
vote
1
answer
205
views
Not getting any cache-pollution benefit from PREFETCHNTA on Zen 3
I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes) and first doing some artificial benchmarking to determine what prefetch distances work well. I've ...
2
votes
0
answers
76
views
Can rdpmc be used to read the fixed-function counters on AMD?
On Intel the fixed-function performance counters can be read by setting bit 30 of ecx as well as the index of the counter to read (0-4) in the bottom bits of that same register.
Is something similar ...
7
votes
1
answer
940
views
AMD DE_CFG[9] documentation
As a mitigation against the recent zenbleed vulnerability (https://lock.cmpxchg8b.com/zenbleed.html) it is advised to set DE_CFG[9] = 1.
I have not managed to find anything on this MSR, except for Is ...
2
votes
0
answers
222
views
At what granularity does memory channel interleaving occur when enabled in BIOS?
Memory channel interleaving is a method of setting a physical address area which can be enabled in BIOS, so that all memory channels are alternately used to achieve best bandwidth and latency.
I want ...
3
votes
0
answers
75
views
Equivalent for uops_dispatched_port and uops_executed_port on AMD
I am working on ensuring optimal loading of the CPU pipeline with FMA operations.
I need to make measurements on AMD Ryzen 9 (Zen 4), not the Intel platform.
Could you suggest HW events or a ...