Newest 'amd-processor' Questions

3 votes

0 answers

157 views

The cost of non contiguous reads and writes (naive matrix transpose, power-of-2 and other sizes)

I was benchmarking a naive transposition and noticed a very large performance discrepancy between: a naive operation where we read data contiguously and write with a large stride; the ...

Etienne M

725

asked Nov 3 at 14:49

10 votes

1 answer

431 views

AVX-512 MD5 implementation: unexplained performance regression on Zen 4

I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...

fuz

94.7k

asked Oct 8 at 16:55

1 vote

0 answers

34 views

Tracking Per Channel Memory Traffic in AMD Zen 2 (Rome)

I am using perf to profile workloads on my system, and I need to track the memory traffic generated by my workload on each NUMA node. Currently, I only have perf results for LLC cache misses, which ...

smz

515

asked Aug 20 at 19:51

0 votes

1 answer

117 views

Why does PERF_COUNT_HW_REF_CPU_CYCLES have much higher variance on Zen5 cpus than PERF_COUNT_HW_CPU_CYCLES?

My understanding is that PERF_COUNT_HW_REF_CPU_CYCLES should map to some counter that counts at a constant rate, as opposed to PERF_COUNT_HW_CPU_CYCLES which is affected by frequency scaling. I'd ...

Joseph Garvin

22.3k

asked Jul 24 at 17:17

1 vote

1 answer

119 views

Cache line sizes for AMD Zen 3 Architecture

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...

Kush Jenamani

11

asked May 27 at 15:27

1 vote

0 answers

77 views

Why is the core-to-core-latency performance of EPYC 4 so poor in NUMA2 mode?

I tested the EPYC 9564 CPU (dual socket). The core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...

wang fuqiang

81

asked Apr 25 at 2:34

2 votes

0 answers

69 views

Why does perf complain that it cannot open this L1 cache event on Zen 2?

I am trying to read cache events on an AMD Zen 2: L1d all read accesses, L1d all write accesses, L1d read misses (not shown below), and L1d write misses (not shown below). According to the perf_event_open(2) ...

onlycparra

815

asked Mar 20 at 5:02

1 vote

0 answers

118 views

Can Zen 4 run more than 1 branch per cycle?

In "Performance optimization, and how to do it wrong", the author claims: the CPU can't predict more than one branch per cycle. A single if statement inside a loop is enough to stop any further ...

HesLg

58

asked Mar 8 at 0:54

0 votes

0 answers

40 views

Can CLGI block virtual interrupts or not?

The AMD SDM implies CLGI can block vINTR: Table 15-10, Effect of the GIF on Interrupt Handling; 15.21.4, Injecting Virtual (INTR) Interrupts: The processor takes a virtual INTR interrupt if: V_IRQ and ...

wang fuqiang

81

asked Mar 5 at 10:35

1 vote

0 answers

91 views

What do the StD and IntD components mean in the Zen 5 CPU microarchitecture?

In the AMD Zen5 architecture block diagram, the FP/Vector execution unit has two components, StD and IntD with arrows connecting them to the "Load/Store Queue". What are the functions of ...

Frontier_Setter

809

asked Mar 4 at 3:48

2 votes

0 answers

72 views

How to verify the granularity of memory access interleaving across different channels?

According to AMD's material, access to contiguous physical addresses will be interleaved across all memory channels (if set to NPS1). When a machine has 8 memory channels and the size of memory ...

Frontier_Setter

809

asked Dec 17, 2024 at 5:31

1 vote

1 answer

175 views

Unable to cross-compile Rust project using tokio-udev

I want to cross-compile a minimal project which uses tokio-udev. The linker fails because of missing libudev: aarch64-linux-musl/bin/ld: cannot find -ludev I can cross-compile Rust projects which do ...

Twonky

814

asked Nov 11, 2024 at 13:25

2 votes

1 answer

199 views

Tracking DRAM traffic in AMD Zen 2 (Rome)

I want to track the number of read/write accesses at each of the Unified Memory Controllers (UMCs) in my AMD EPYC processor (family: 0x17 and model: 0x31). The AMDuProfPcm tool, when used with the -m ...

smz

515

asked Oct 4, 2024 at 18:37

0 votes

1 answer

237 views

Choose CPU processor (Intel or AMD) of machine hosting action runner

Besides choosing between linux/windows/mac and 32/64 bit, is it possible to choose the processor of the machine where the action runner will be running? In my organization we have been using actions ...

Alberto Gascón

3

asked Sep 17, 2024 at 7:23

0 votes

1 answer

50 views

Problems opening FOC motor control app in Vivado 2023.2

I have bought the Kria KD240 Starter Kit to get used to working with drives applications and FOC control. I am following the steps mentioned here but I can't open the Vivado project correctly. When I ...

alagal

1

asked Sep 16, 2024 at 11:53

5 votes

0 answers

125 views

Repeated x87 fnstenv yields cleared instruction pointer after arbitrary time

I have a program that calls the x87 instruction fnstenv multiple times per second and with only the occasional floating point computation being executed (in periods of multiple seconds apart), I had ...

Thomas Reitmayr

51

asked Aug 24, 2024 at 13:44

2 votes

1 answer

108 views

Why does AMD processor use sub instruction instead of xor to verify the stack canary?

So I've been exploring chapter 12 of the picoCTF primer and suddenly saw a difference between my assembly of the program and picoCTF's at the end of the main function, where the stack canary is being ...

digitale

23

asked Aug 7, 2024 at 18:16

3 votes

1 answer

187 views

Twice as slow SIMD performance without extra copy

I've been optimizing some code and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...

Alex

582

asked Jul 19, 2024 at 8:54

2 votes

1 answer

172 views

SymFromAddr fails on AMD Machine with the error message "Attempt to access Invalid address"

struct StackFrame { DWORD64 address; std::string name; std::string module; std::string filename; int line_number; }; std::vector<StackFrame> GetStackTrace(CONTEXT context) { ...

Hari E

490

asked Mar 14, 2024 at 7:08

1 vote

0 answers

855 views

Cache inclusivity policy differences on x86 between Intel and AMD

(tldr: the question itself is at the bottom) I've read that on AMD family 17h processors (Zen-Zen2, although it might be the case with the following generations as well, but I am not familiar with ...

Andriy Sultanov

88

asked Feb 8, 2024 at 19:50

0 votes

0 answers

149 views

How to debug an HIP/HIPRT application on windows?

I'm writing a path tracer using HIPRT on Windows but I couldn't find anything to debug my application yet. I'd like to be able to execute my kernels line by line, watch kernel variables, print to ...

Tom Clabault

502

asked Feb 2, 2024 at 9:54

2 votes

0 answers

134 views

Why do instructions after an atomic operation make execution faster (on AMD CPU)?

I wrote the following test cases to bench some operations: #define BENCH_ROUNDS 1000000000 // 10**9 static volatile UINT64 _test_argument, _test_result; static _Atomic(UINT64) _test_atom; // For ...

Wilderness Ranger

312

asked Feb 2, 2024 at 9:29

1 vote

0 answers

285 views

Why is polars faster on an Intel CPU than on an AMD CPU?

I have two PCs: one is an Intel i7 13700KF with 64GB RAM and the other is an AMD 3970X with the same RAM. Both PCs use SSD storage and have Python 3.11 and polars 0.20.5. I run the code below: df = pl....

Hakase

331

asked Jan 30, 2024 at 2:56

9 votes

0 answers

288 views

Are there processors on which VPMASKMOVD generates faults for the masked-out elements?

Are there processors on which VPMASKMOVD generates faults for the masked-out elements? Going by the Intel Software Developer's Manual, the answer is plainly "no": Faults occur only due to ...

user555045

65.8k

asked Jan 28, 2024 at 15:16

0 votes

0 answers

144 views

What's the difference between those "cache_as_ram.S" in coreboot?

I want to learn how "cache as RAM" works, so I found some asm files in "/src/cpu/intel/car/" from coreboot. But there are four folders containing "cache_as_ram.S". What'...

50han Bill

1

asked Jan 13, 2024 at 9:38

0 votes

1 answer

187 views

Why is amd_pmu_v2_handle_irq being called when not using perf?

amd_pmu_v2_handle_irq should be used to handle PMU overflow on AMD processors. When I use perf top -ag in the system, it is heavily called. But when I use the perf stat -a command, there are fewer ...

Frontier_Setter

809

asked Dec 22, 2023 at 11:21

-1 votes

1 answer

333 views

Why is the frequency of the CPU lower than the Max. Boost Clock?

I am using AMD's EPYC 7713 CPU. According to the specification, its maximum frequency is 3.675GHz. But when I run stress-ng (only running single threaded cpu loads), its frequency does not exceed 3....

Frontier_Setter

809

asked Dec 4, 2023 at 15:35

2 votes

2 answers

1k views

What x86 CPUs, if any, still have MOVDIRI or MOVDIR64b instructions?

I've recently been checking the Intel CPUs that I have access to. None of them (they're all Xeons) have the MOVDIRI or MOVDIR64b instructions, which are store instructions that bypass the caches. Are ...

user22797201

asked Nov 1, 2023 at 16:36

0 votes

1 answer

352 views

Illegal instruction (core dumped) in cv::findHomography

I am getting this error: Illegal instruction (core dumped) When calling: cv::findHomography(query_points, reference_points, cv::RANSAC, homography_ransac_threshold_, h_mask); This happens only on an AWS ...

Humam Helfawi

20.4k

asked Oct 27, 2023 at 20:36

0 votes

0 answers

1k views

What are the advantages of write-combine memory compared to write-back memory?

In Software Optimization Guide for the AMD Zen4 Microarchitecture, it is written that: Write-combining is the merging of multiple memory write cycles that target locations within the address range of ...

Frontier_Setter

809

asked Oct 15, 2023 at 4:35

2 votes

0 answers

453 views

What does the cache bank mean in AMD CPU?

In AMD's optimization manual, the L1 Data cache is described as follows: The L1 DC provides multiple access ports using a banked structure. The read ports are shared by three load pipes and victim ...

Frontier_Setter

809

asked Oct 13, 2023 at 11:25

2 votes

1 answer

944 views

What's the difference between dispatching and issuing in CPU pipeline

In Software Optimization Guide for the AMD Zen4 Microarchitecture, the terminology is explained as follows: Dispatching: Dispatching refers to the act of transferring macro ops from the front end of ...

Frontier_Setter

809

asked Oct 13, 2023 at 9:36

4 votes

1 answer

593 views

What does L2 poison mean in CPU?

I have encountered the same problem as this. What does L2 poison mean? I'm using AMD CPU.

Frontier_Setter

809

asked Oct 7, 2023 at 3:29

0 votes

1 answer

326 views

How to test the latency and throughput of an intrinsic function?

In Intel's Intrinsic guide, each function has its own latency and throughput. For example, _mm256_loadu_ps: Architecture, Latency, Throughput (CPI) Alderlake, 7, 0.333333333 Icelake Intel Core, 7, 0.5 ...

Frontier_Setter

809

asked Sep 26, 2023 at 13:20

0 votes

0 answers

104 views

model.fit() stopping halfway on 1 epoch using tensorflow-directml. What to do?

Currently using tensorflow-directml as I am training a model on AMD (RX 580). The problem is, upon model.fit() it seems to be stuck at epoch 1 with no progress. Here's my code and error: with ...

user21525821

45

asked Sep 20, 2023 at 8:14

7 votes

0 answers

3k views

Intel OneAPI MPI MKL with AMD, is there an AMD flavor?

I've always happened to use Intel CPUs in Intel-chipset-based servers, and as such have used Intel's MPI and MKL for the past 20 years; that's all I really know. With their OneAPI I only need and use MPI, ...

ron

1,035

asked Sep 19, 2023 at 13:59

1 vote

0 answers

113 views

How do different monitoring tools calculate memory bandwidth?

For monitoring memory bandwidth, there is pcm-memory on the Intel platform and AMDuProf on the AMD platform. How do they calculate memory bandwidth usage? Which PMUs were used? Is it using 1024 or ...

Frontier_Setter

809

asked Sep 8, 2023 at 13:19

2 votes

0 answers

1k views

Difference of floating arithmetics on AMD CPUs and Intel CPUs

I am a student majoring in computational science. When I deal with mixed-precision projects on AMD CPUs, I find that single precision data behaves similarly to double precision data. Sometimes, single-...

Singyuk Lau

21

asked Sep 6, 2023 at 3:53

4 votes

0 answers

93 views

Use perf to see if I'm write bound?

I have a loop that's running slower than I expected. I measure how long it takes per collection it processes and notice it takes twice as long when I use 8 cores (overall 4x faster). There's no data ...

David

41

asked Aug 28, 2023 at 19:04

1 vote

0 answers

127 views

Ryzen AMD x86_64 increment for 64 bits on memory runs 8 times faster than 8,16 or 32 bit increment

I wanted to benchmark the atomic instructions compared to the non-atomic ones, so I wrote the code that follows below. Besides benchmarking locked accesses I noticed a different aspect too that seems to ...

George Kourtis

2,624

asked Aug 25, 2023 at 16:20

0 votes

1 answer

62 views

How can I use kernel functions in SVM root(execute) mode?

I ran into the following problem: When I initialized the kernel hypervisor, for me it is SVM and I exit from vmrun and get into my SvmExitHandler (this is the dispatcher that manages exit codes), then ...

Barbosso

11

asked Aug 17, 2023 at 13:10

1 vote

1 answer

88 views

Assembly instructions showing how zenbleed was found

While looking at this zenbleed article, it was found that a randomly generated sequence of instructions and the same sequence but with randomized alignment, serialization and speculation fences added ...

vengy

2,467

asked Aug 13, 2023 at 17:30

2 votes

0 answers

419 views

Obtaining SMI_COUNT on amd cpu

I am trying to get familiar with AMD's interface to SMM. I want to implement a simple task: check SMI_COUNT, trigger an SMI, then check SMI_COUNT after the trigger. The SMI interrupt is a rare thing (I believe), so ...

Rockrid3r

321

asked Aug 11, 2023 at 19:26

1 vote

0 answers

115 views

numpy built with locally built blis does not use multithreading

I'm looking for help with an issue I'm having building Numpy against locally built blis for zen3. I've configured blis to enable threading using openmp. (it is installed and working on my machine, ...

Crispy Holiday

472

asked Aug 10, 2023 at 15:06

-1 votes

1 answer

290 views

Windows 10 nested virtualization on AMD CPU

I work at a software company, mainly developing on Linux. For Windows development we have a couple of machines that are shared. However, a new project came up, and we need more resources on ...

wizard

155

asked Aug 10, 2023 at 13:46

1 vote

1 answer

205 views

Not getting any cache-pollution benefit from PREFETCHNTA on Zen 3

I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes) and first doing some artificial benchmarking to determine what prefetch distances work well. I've ...

Bruce Merry

790

asked Aug 7, 2023 at 9:35

2 votes

0 answers

76 views

Can rdpmc be used to read the fixed-function counters on AMD?

On Intel the fixed-function performance counters can be read by setting bit 30 of ecx as well as the index of the counter to read (0-4) in the bottom bits of that same register. Is something similar ...

BeeOnRope

66.3k

asked Aug 3, 2023 at 23:52
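
The Intel-side encoding described in the excerpt above (bit 30 of ecx set, counter index in the low bits) can be sketched as follows. This is only an illustration of the bit layout, not a way to execute rdpmc; the function name `fixed_counter_ecx` and the `num_fixed` parameter are hypothetical, and the default of 5 counters is taken solely from the "0-4" range quoted in the excerpt.

```python
# Bit 30 of ECX selects the fixed-function counter space for rdpmc on Intel.
FIXED_COUNTER_SELECT = 1 << 30


def fixed_counter_ecx(index: int, num_fixed: int = 5) -> int:
    """Return the ECX value that selects Intel fixed-function counter `index`.

    `num_fixed` is an assumption based on the "0-4" range in the excerpt;
    the real number of fixed counters depends on the CPU.
    """
    if not 0 <= index < num_fixed:
        raise ValueError(f"counter index {index} outside 0..{num_fixed - 1}")
    # Selector bit in bit 30, counter index in the bottom bits.
    return FIXED_COUNTER_SELECT | index


if __name__ == "__main__":
    for i in range(3):
        print(f"fixed counter {i}: ecx = {fixed_counter_ecx(i):#010x}")
```

Whether AMD offers an equivalent selector is exactly what the question asks, so nothing here should be read as applying to AMD.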

7 votes

1 answer

940 views

AMD DE_CFG[9] documentation

As a mitigation against the recent zenbleed vulnerability (https://lock.cmpxchg8b.com/zenbleed.html) it is advised to set DE_CFG[9] = 1. I have not managed to find anything on this MSR, except for Is ...

benjamin-lieser

1,888

asked Jul 25, 2023 at 13:10

2 votes

0 answers

222 views

At what granularity does memory channel interleaving occur when enabled in BIOS?

Memory channel interleaving is a method of setting a physical address area which can be enabled in BIOS, so that all memory channels are alternately used to achieve best bandwidth and latency. I want ...

Frontier_Setter

809

asked Jul 24, 2023 at 12:36

3 votes

0 answers

75 views

Equivalent for uops_dispatched_port and uops_executed_port on AMD

I am working on ensuring optimal loading of the CPU pipeline by FMA operations. I need to make measurements on AMD Ryzen 9 (Zen 4), not the Intel platform. Could you suggest HW events or a ...

Andrey Lomakin

666

asked Jul 18, 2023 at 2:45
