3 votes
0 answers
157 views

I was benchmarking a naive transposition and noticed a very large performance discrepancy between: a naive operation where we read data contiguously and write with a large stride; the ...
Etienne M's user avatar
  • 725
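
For reference, the first variant described in the excerpt above (contiguous reads, strided writes) can be sketched as follows. This is only an illustration of the index pattern, assuming a row-major matrix stored in a flat list; the function name is hypothetical and the asker's actual benchmark code is not shown in the excerpt. Python lists do not reproduce the cache behaviour being measured.

```python
def transpose_read_contiguous(src, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list.

    Reads walk src sequentially (stride 1); writes land `rows` elements
    apart in dst, which is the large-stride write pattern from the excerpt.
    """
    dst = [0] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # src index advances by 1 per inner iteration;
            # dst index advances by `rows` per inner iteration.
            dst[c * rows + r] = src[r * cols + c]
    return dst
```

Swapping the two loops would make the writes contiguous and the reads strided instead.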
10 votes
1 answer
431 views

I have written an implementation of the MD5 hash function using AVX-512. While it uses SIMD instructions, it is fundamentally a scalar algorithm. The point of using SIMD instructions is to access ...
fuz's user avatar
  • 94.7k
1 vote
0 answers
34 views

I am using perf to profile workloads on my system, and I need to track the memory traffic generated by my workload on each NUMA node. Currently, I only have perf results for LLC cache misses, which ...
smz's user avatar
  • 515
0 votes
1 answer
117 views

My understanding is that PERF_COUNT_HW_REF_CPU_CYCLES should map to some counter that counts at a constant rate, as opposed to PERF_COUNT_HW_CPU_CYCLES which is affected by frequency scaling. I'd ...
Joseph Garvin's user avatar
1 vote
1 answer
119 views

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...
Kush Jenamani's user avatar
1 vote
0 answers
77 views

I tested the EPYC 9564 CPU (dual socket), and the core-to-core latency of the second socket is very high, even greater than the latency for inter-socket communication. As shown for AMD EPYC 7R13, 48 Cores, ...
wang fuqiang's user avatar
2 votes
0 answers
69 views

I am trying to read cache events on an AMD Zen2: L1d all read accesses; L1d all write accesses; L1d read misses (not shown below); L1d write misses (not shown below). According to the perf_event_open(2) ...
onlycparra's user avatar
1 vote
0 answers
118 views

In "Performance optimization, and how to do it wrong", the author claims: the CPU can't predict more than one branch per cycle. A single if statement inside a loop is enough to stop any further ...
HesLg's user avatar
  • 58
0 votes
0 answers
40 views

The AMD SDM implies CLGI can block vINTR. Table 15-10: Effect of the GIF on Interrupt Handling. 15.21.4 Injecting Virtual (INTR) Interrupts: The processor takes a virtual INTR interrupt if: V_IRQ and ...
wang fuqiang's user avatar
1 vote
0 answers
91 views

In the AMD Zen5 architecture block diagram, the FP/Vector execution unit has two components, StD and IntD with arrows connecting them to the "Load/Store Queue". What are the functions of ...
Frontier_Setter's user avatar
2 votes
0 answers
72 views

According to AMD's material, access to contiguous physical addresses will be interleaved across all memory channels (if set to NPS1). When a machine has 8 memory channels and the size of memory ...
Frontier_Setter's user avatar
1 vote
1 answer
175 views

I want to cross-compile a minimal project which uses tokio-udev. The linker fails because of missing libudev: aarch64-linux-musl/bin/ld: cannot find -ludev I can cross-compile Rust projects which do ...
Twonky's user avatar
  • 814
2 votes
1 answer
199 views

I want to track the number of read/write accesses at each of the Unified Memory Controllers (UMCs) in my AMD EPYC processor (family: 0x17 and model: 0x31). The AMDuProfPcm tool, when used with the -m ...
smz's user avatar
  • 515
0 votes
1 answer
237 views

Besides choosing between Linux/Windows/Mac and 32/64 bit, is it possible to choose the processor of the machine where the action runner will be running? In my organization we have been using actions ...
Alberto Gascón's user avatar
0 votes
1 answer
50 views

I have bought the Kria KD240 Starter Kit to get used to working with drives applications and FOC control. I am following the steps mentioned here but I can't open the Vivado project correctly. When I ...
alagal's user avatar
  • 1
5 votes
0 answers
125 views

I have a program that calls the x87 instruction fnstenv multiple times per second, with only the occasional floating point computation being executed (multiple seconds apart). I had ...
Thomas Reitmayr's user avatar
2 votes
1 answer
108 views

So I've been exploring chapter 12 of the picoCTF primer and noticed a difference between my assembly of the program and picoCTF's at the end of the main function, where the stack canary is being ...
digitale's user avatar
3 votes
1 answer
187 views

I've been optimizing some code, and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...
Alex's user avatar
  • 582
2 votes
1 answer
172 views

struct StackFrame { DWORD64 address; std::string name; std::string module; std::string filename; int line_number; }; std::vector<StackFrame> GetStackTrace(CONTEXT context) { ...
Hari E's user avatar
  • 490
1 vote
0 answers
855 views

(tldr: the question itself is at the bottom) I've read that on AMD family 17h processors (Zen-Zen2, although it might be the case with the following generations as well, but I am not familiar with ...
Andriy Sultanov's user avatar
0 votes
0 answers
149 views

I'm writing a path tracer using HIPRT on Windows but I couldn't find anything to debug my application yet. I'd like to be able to execute my kernels line by line, watch kernel variables, print to ...
Tom Clabault's user avatar
2 votes
0 answers
134 views

I wrote the following test cases to bench some operations: #define BENCH_ROUNDS 1000000000 // 10**9 static volatile UINT64 _test_argument, _test_result; static _Atomic(UINT64) _test_atom; // For ...
Wilderness Ranger's user avatar
1 vote
0 answers
285 views

I have two PCs: one is an Intel i7 13700KF with 64GB RAM and the other is an AMD 3970X with the same RAM. Both PCs use SSD storage and both have Python 3.11 and polars 0.20.5. I run the code below: df = pl....
Hakase's user avatar
  • 331
9 votes
0 answers
288 views

Are there processors on which VPMASKMOVD generates faults for the masked-out elements? Going by the Intel Software Developer's Manual, the answer is plainly "no": Faults occur only due to ...
user555045's user avatar
  • 65.8k
0 votes
0 answers
144 views

I want to learn how "cache as RAM" works, so I found some asm files in "/src/cpu/intel/car/" from coreboot. But there are four folders containing "cache_as_ram.S". What'...
50han Bill's user avatar
0 votes
1 answer
187 views

amd_pmu_v2_handle_irq should be used to handle PMU overflow on AMD processors. When I use perf top -ag on the system, it is called heavily. But when I use the perf stat -a command, there are fewer ...
Frontier_Setter's user avatar
-1 votes
1 answer
333 views

I am using AMD's EPYC 7713 CPU. According to the specification, its maximum frequency is 3.675GHz. But when I run stress-ng (only running single threaded cpu loads), its frequency does not exceed 3....
Frontier_Setter's user avatar
2 votes
2 answers
1k views

I've recently been checking the Intel CPUs that I have access to. None of them (they're all Xeons) have the MOVDIRI or MOVDIR64B instructions, which are store instructions that bypass the caches. Are ...
0 votes
1 answer
352 views

I am getting this error: Illegal instruction (core dumped) when calling: cv::findHomography(query_points, reference_points, cv::RANSAC, homography_ransac_threshold_, h_mask); This happens only on an AWS ...
Humam Helfawi's user avatar
0 votes
0 answers
1k views

In Software Optimization Guide for the AMD Zen4 Microarchitecture, it is written that: Write-combining is the merging of multiple memory write cycles that target locations within the address range of ...
Frontier_Setter's user avatar
2 votes
0 answers
453 views

In AMD's optimization manual, the L1 Data cache is described as follows: The L1 DC provides multiple access ports using a banked structure. The read ports are shared by three load pipes and victim ...
Frontier_Setter's user avatar
2 votes
1 answer
944 views

In Software Optimization Guide for the AMD Zen4 Microarchitecture, the terminology is explained as follows: Dispatching: Dispatching refers to the act of transferring macro ops from the front end of ...
Frontier_Setter's user avatar
4 votes
1 answer
593 views

I have encountered the same problem as this. What does L2 poison mean? I'm using an AMD CPU.
Frontier_Setter's user avatar
0 votes
1 answer
326 views

In Intel's Intrinsics Guide, each function has its own latency and throughput. For example, _mm256_loadu_ps: Architecture, Latency, Throughput (CPI): Alderlake, 7, 0.333333333; Icelake Intel Core, 7, 0.5; ...
Frontier_Setter's user avatar
0 votes
0 answers
104 views

Currently using tensorflow-directml as I am training a model on AMD (RX 580). The problem is, upon model.fit() it seems to be stuck at epoch 1 with no progress. Here's my code and error: with ...
user21525821's user avatar
7 votes
0 answers
3k views

I've always happened to use Intel CPUs in Intel-chipset-based servers, and as such have used Intel's MPI and MKL for the past 20 years; that's all I kinda know. With their OneAPI I only need and use MPI, ...
ron's user avatar
  • 1,035
1 vote
0 answers
113 views

For monitoring memory bandwidth, there is pcm-memory on the Intel platform and AMDuProf on the AMD platform. How do they calculate memory bandwidth usage? Which PMUs were used? Is it using 1024 or ...
Frontier_Setter's user avatar
2 votes
0 answers
1k views

I am a student majoring in computational science. When I deal with mixed-precision projects on AMD CPUs, I find that single precision data behaves similarly to double precision data. Sometimes, single-...
Singyuk Lau's user avatar
4 votes
0 answers
93 views

I have a loop that's running slower than I expected. I measure how long it takes per collection it processes and notice it takes twice as long when I use 8 cores (overall 4x faster). There's no data ...
David's user avatar
  • 41
1 vote
0 answers
127 views

I wanted to benchmark the atomic instructions compared to the non-atomic ones, so I wrote the code that follows below. Besides benchmarking locked accesses, I noticed a different aspect too that seems to ...
George Kourtis's user avatar
0 votes
1 answer
62 views

I ran into the following problem: when I initialize the kernel hypervisor (for me it is SVM), I exit from vmrun and get into my SvmExitHandler (this is the dispatcher that manages exit codes), then ...
Barbosso's user avatar
1 vote
1 answer
88 views

While looking at this zenbleed article, it was found that a randomly generated sequence of instructions and the same sequence but with randomized alignment, serialization and speculation fences added ...
vengy's user avatar
  • 2,467
2 votes
0 answers
419 views

I am trying to get familiar with AMD's interface of SMM. I want to implement a simple task: check SMI_COUNT, trigger an SMI, then check SMI_COUNT after the trigger. The SMI-interrupt is a rare thing (I believe), so ...
Rockrid3r's user avatar
  • 321
1 vote
0 answers
115 views

I'm looking for help with an issue I'm having building Numpy against locally built blis for zen3. I've configured blis to enable threading using openmp. (it is installed and working on my machine, ...
Crispy Holiday's user avatar
-1 votes
1 answer
290 views

I am working at a software company, mainly developing on Linux. For Windows development we have a couple of machines that are shared. However, a new project came up, and we need more resources on ...
wizard's user avatar
  • 155
1 vote
1 answer
205 views

I'm trying to write a non-cache-polluting memcpy (using PREFETCHNTA for reads and streaming writes) and first doing some artificial benchmarking to determine what prefetch distances work well. I've ...
Bruce Merry's user avatar
2 votes
0 answers
76 views

On Intel the fixed-function performance counters can be read by setting bit 30 of ecx as well as the index of the counter to read (0-4) in the bottom bits of that same register. Is something similar ...
BeeOnRope's user avatar
  • 66.3k
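
The ecx encoding described in the excerpt above can be written out as plain bit arithmetic. This is a minimal sketch that only computes the selector value; the function and constant names are hypothetical, the 0-4 index range is taken from the excerpt, and actually executing rdpmc would need inline assembly or a compiler intrinsic, which is not shown here.

```python
# Bit 30 of ecx marks the request as a fixed-function counter read
# (per the excerpt); the name of this constant is an assumption.
FIXED_COUNTER_FLAG = 1 << 30


def rdpmc_fixed_selector(index):
    """Return the ecx value for reading fixed-function counter `index`."""
    if not 0 <= index <= 4:  # index range as stated in the excerpt
        raise ValueError("fixed counter index must be in 0..4")
    # Flag in bit 30, counter index in the bottom bits.
    return FIXED_COUNTER_FLAG | index
```

For example, `hex(rdpmc_fixed_selector(1))` gives `0x40000001`.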
7 votes
1 answer
940 views

As a mitigation against the recent zenbleed vulnerability (https://lock.cmpxchg8b.com/zenbleed.html) it is advised to set DE_CFG[9] = 1. I have not managed to find anything on this MSR, except for Is ...
benjamin-lieser's user avatar
2 votes
0 answers
222 views

Memory channel interleaving is a method of setting a physical address area which can be enabled in BIOS, so that all memory channels are alternately used to achieve best bandwidth and latency. I want ...
Frontier_Setter's user avatar
3 votes
0 answers
75 views

I am working on ensuring optimal loading of the CPU pipeline by FMA operations. I need to make measurements on AMD Ryzen 9 (Zen 4), not the Intel platform. Could you suggest HW events or a ...
Andrey Lomakin's user avatar
