I am building a PyTorch wheel from source, based on their CI build script: https://github.com/pytorch/pytorch/blob/v2.6.0/.ci/manywheel/build_common.sh. I first built and tested it on a "local" g5.xlarge EC2 instance, installed it with pip, and everything works well. To speed up the build, I then built the same wheel on a g5.12xlarge instance and tested it on that machine, where everything also works. The problem appears when I install the g5.12xlarge-built wheel on a g5.xlarge instance:
Python 3.11.11 (main, Nov 13 2025, 17:12:08) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Illegal instruction (core dumped)
Running the import under gdb shows:
Program received signal SIGILL, Illegal instruction.
0x00007fffe7b74d69 in ska::detailv3::sherwood_v3_table<std::pair<c10::OperatorName, c10::OperatorHandle>, c10::OperatorName, std::hash<c10::OperatorName>, ska::detailv3::KeyOrValueHasher<c10::OperatorName, std::pair<c10::OperatorName, c10::OperatorHandle>, std::hash<c10::OperatorName> >, std::equal_to<c10::OperatorName>, ska::detailv3::KeyOrValueEquality<c10::OperatorName, std::pair<c10::OperatorName, c10::OperatorHandle>, std::equal_to<c10::OperatorName> >, std::allocator<std::pair<c10::OperatorName, c10::OperatorHandle> >, std::allocator<ska::detailv3::sherwood_v3_entry<std::pair<c10::OperatorName, c10::OperatorHandle> > > >::rehash(unsigned long) () \
from /home/prod/.local/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so
So it seems the libtorch_cpu.so built on the g5.12xlarge contains instructions that the g5.xlarge cannot execute. I am trying to understand how this happened, because these two instance types have the same CPUs. I would love some help in making this work, i.e. how to build the wheel on a g5.12xlarge instance so that it works on a g5.xlarge instance.
Update: Both g5.xlarge and g5.12xlarge report identical CPUs in /proc/cpuinfo:
vendor_id : AuthenticAMD
cpu family : 23
model : 49
model name : AMD EPYC 7R32
stepping : 0
microcode : 0x830107f
cpu MHz : 3299.275
cache size : 512 KB
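The fields above identify the CPU model but do not show the `flags` line, which lists the instruction-set extensions the kernel exposes. A minimal sketch for comparing that line between the two machines follows. It assumes the output of `cat /proc/cpuinfo` from each machine has been saved to a text file; the helper names and the file arguments are hypothetical.

```python
import sys


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no "flags" line found


def flags_missing_on_target(build_cpuinfo: str, target_cpuinfo: str) -> set[str]:
    """Flags present on the build machine but absent on the target machine."""
    return cpu_flags(build_cpuinfo) - cpu_flags(target_cpuinfo)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_flags.py build_cpuinfo.txt target_cpuinfo.txt
    with open(sys.argv[1]) as build_file, open(sys.argv[2]) as target_file:
        missing = flags_missing_on_target(build_file.read(), target_file.read())
    print(sorted(missing))
```

An empty result would suggest that both machines expose the same extensions, in which case the difference is more likely to come from the build configuration than from the hardware.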
GDB shows the crashing instruction:
Program received signal SIGILL, Illegal instruction.
0x00007fffe7b74d69 in ska::detailv3....
(gdb) x/i $pc
=> 0x7fffe7b74d69 <_ZN3ska8detail...EEE6rehashEm+25>: vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0
Update #2: Here is more GDB output:
(gdb) disas/r $pc, $pc+1
Dump of assembler code from 0x7fffe7b74d69 to 0x7fffe7b74d6a:
=> 0x00007fffe7b74d69 <_ZN3ska8detailv317sherwood_v3_tableISt4pairIN3c1012OperatorNameENS3_14OperatorHandleEES4_St4hashIS4_ENS0_16KeyOrValueHasherIS4_S6_S8_EESt8equal_toIS4_ENS0_18KeyOrValueEqualityIS4_S6_SC_EESaIS6_ESaINS0_17sherwood_v3_entryIS6_EEEE6rehashEm+25>: 62 f1 f7 08 7b 47 03 vcvtusi2sdq 0x18(%rdi),%xmm1,%xmm0
End of assembler dump.
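The raw bytes in the dump begin with 0x62. In 64-bit code that byte is the EVEX prefix, the encoding introduced with AVX-512, so it may be worth checking whether `avx512f` appears in the CPU flags of each machine. A minimal sketch for classifying an instruction from its bytes is below; `encoding_family` is a hypothetical helper, and looking only at the first byte is a simplification that assumes 64-bit code.

```python
def encoding_family(opcode_hex: str) -> str:
    """Classify a 64-bit x86 instruction by its first byte (simplified)."""
    first = bytes.fromhex(opcode_hex)[0]
    if first == 0x62:
        return "EVEX"  # prefix used by AVX-512 encoded instructions
    if first in (0xC4, 0xC5):
        return "VEX"  # prefix used by AVX/AVX2 encoded instructions
    return "legacy/other"


# Bytes copied from the disas/r output above.
print(encoding_family("62 f1 f7 08 7b 47 03"))  # prints "EVEX"
```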
cat /proc/cpuinfo on the two machines, and what is the output from (gdb) x/i $pc at the crash point? (gdb) disas/r $pc, $pc+1.