I am building a PyTorch wheel from source, based on the project's CI build script (https://github.com/pytorch/pytorch/blob/v2.6.0/.ci/manywheel/build_common.sh). I first tested it on a "local" g5.xlarge EC2 instance: I installed the wheel with pip and everything worked. To speed up the build, I then built the same wheel on a g5.12xlarge instance, tested it on that machine, and everything worked there too. However, when I install the wheel built on the g5.12xlarge on a g5.xlarge instance, importing torch crashes:

Python 3.11.11 (main, Nov 13 2025, 17:12:08) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction (core dumped)

Running the import under gdb shows:

Program received signal SIGILL, Illegal instruction.
0x00007fffe7b74d69 in ska::detailv3::sherwood_v3_table<std::pair<c10::OperatorName, c10::OperatorHandle>, c10::OperatorName, std::hash<c10::OperatorName>, ska::detailv3::KeyOrValueHasher<c10::OperatorName, std::pair<c10::OperatorName, c10::OperatorHandle>, std::hash<c10::OperatorName> >, std::equal_to<c10::OperatorName>, ska::detailv3::KeyOrValueEquality<c10::OperatorName, std::pair<c10::OperatorName, c10::OperatorHandle>, std::equal_to<c10::OperatorName> >, std::allocator<std::pair<c10::OperatorName, c10::OperatorHandle> >, std::allocator<ska::detailv3::sherwood_v3_entry<std::pair<c10::OperatorName, c10::OperatorHandle> > > >::rehash(unsigned long) () \
from /home/prod/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so

So it seems that libtorch_cpu.so has different symbols. I am trying to understand how this happened, because the two instance types have the same CPU. I would appreciate help in making this work, i.e. how to build the wheel on a g5.12xlarge instance so that it runs on a g5.xlarge instance.

Update: Both g5.xlarge and g5.12xlarge report identical CPUs in /proc/cpuinfo:

vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7R32
stepping        : 0
microcode       : 0x830107f
cpu MHz         : 3299.275
cache size      : 512 KB

GDB shows the crashing instruction:

Program received signal SIGILL, Illegal instruction.
0x00007fffe7b74d69 in ska::detailv3....

(gdb) x/i $pc
=> 0x7fffe7b74d69 <_ZN3ska8detail...EEE6rehashEm+25>:   vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0

Update #2: Here is more GDB output:

(gdb) disas/r $pc, $pc+1
Dump of assembler code from 0x7fffe7b74d69 to 0x7fffe7b74d6a:
=> 0x00007fffe7b74d69 <_ZN3ska8detailv317sherwood_v3_tableISt4pairIN3c1012OperatorNameENS3_14OperatorHandleEES4_St4hashIS4_ENS0_16KeyOrValueHasherIS4_S6_S8_EESt8equal_toIS4_ENS0_18KeyOrValueEqualityIS4_S6_SC_EESaIS6_ESaINS0_17sherwood_v3_entryIS6_EEEE6rehashEm+25>:   62 f1 f7 08 7b 47 03    vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0
End of assembler dump.
  • Missing symbols are unlikely to be the problem here. What is the output from cat /proc/cpuinfo on the two machines, and what is the output from (gdb) x/i $pc at the crash point? Commented Nov 20 at 5:30
  • @EmployedRussian here is the output for both things: pastebin.com/r02mgBy8 thanks for the help Commented Nov 20 at 23:38
  • I've updated your question with the new info. Could you update it further with the output from (gdb) disas/r $pc, $pc+1. Commented Nov 21 at 2:01
  • I have updated the post with this information Commented Nov 21 at 21:22

1 Answer

It looks like the crashing instruction, vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0, is an AVX512F instruction (the leading 62 byte in the disas/r output is the EVEX prefix), which neither EC2 instance supports.
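One way to confirm this on each machine is to look for the avx512f feature flag in /proc/cpuinfo. The following is a minimal sketch, not part of any PyTorch tooling; the helper names cpu_flags and supports_avx512f are made up for illustration, and it assumes the usual Linux x86 layout where a line starting with "flags" lists the CPU features.

```python
import os


def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (for example, a non-x86 or truncated file).
    return set()


def supports_avx512f(cpuinfo_text):
    """True if the CPU advertises the AVX-512 Foundation feature."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = "/proc/cpuinfo"  # Linux-only assumption
    if os.path.exists(path):
        with open(path) as f:
            print("avx512f supported:", supports_avx512f(f.read()))
```

If this prints False on the machine that crashes, any AVX512F instruction that is executed unconditionally will raise SIGILL there.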

This is probably happening because you built on an AVX512-capable Intel machine, and your compilation flags include -march=native.

Changing the flags to -march=x86-64 and rebuilding may solve this crash.
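Before rebuilding, it may help to check whether the flags the build sees tie the binary to the build host. The sketch below assumes the flags arrive through the CFLAGS and CXXFLAGS environment variables; your build script may set them somewhere else, in which case pass that flag string to the function instead. The helper name host_specific_flags is hypothetical, and the check only covers -march=native and explicit -mavx512* options.

```python
import os
import shlex


def host_specific_flags(flag_string):
    """Return the flags that target the build host or explicitly enable AVX-512."""
    found = []
    for flag in shlex.split(flag_string):
        if flag == "-march=native" or flag.startswith("-mavx512"):
            found.append(flag)
    return found


if __name__ == "__main__":
    # Assumption: global compiler flags are passed via these variables.
    for var in ("CFLAGS", "CXXFLAGS"):
        print(var, host_specific_flags(os.environ.get(var, "")))
```

An empty list for both variables does not prove the build is portable, since flags can also be injected by the build system itself; it only rules out the most common cause.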


It is unclear to me why only one of the machines executes this code (and crashes). The other machine must not execute it, or it would have crashed as well.

1 Comment

Will try and report back in a couple of hours!
