
I learned scientific computing back when C and C++ were the thing, and I'm relatively new to Python. Back then I could estimate algorithmic cost fairly easily because, even with the STL, I knew how data was being handled (of course there was already low-level optimisation, so your code might do better than expected, but you still had a guideline).

Now that I use numpy heavily, I'm at a loss: I have never found information on the complexity or expected CPU cost of numpy functions.

This puzzles me whenever I need to choose between options, e.g. whether I should store some derived representation of the data or just rebuild it on each use, depending on how many calls I expect to make.

To give one concrete example: I have N vector arrays, which have no reason to be contiguous in memory. I will need them M times for element-by-element multiplications in which the N arrays appear as a single concatenated vector. Depending on M and N, when should I store the concatenated array? (Assume the entire data set is very small compared to RAM.) Or is this so impossible to know, portably or even in a single environment, that I just shouldn't care?
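For concreteness, here is a sketch of the two options I'm weighing (sizes and names are made up):

```
import numpy as np

N, M = 50, 1000                                    # made-up sizes
parts = [np.random.rand(100) for _ in range(N)]    # N non-contiguous arrays
weights = np.random.rand(100 * N)                  # the other multiplicand

# Option 1: store the concatenation once, reuse it M times.
flat = np.concatenate(parts)
for _ in range(M):
    out = flat * weights        # one contiguous element-wise multiply

# Option 2: rebuild the concatenation on every call.
for _ in range(M):
    out = np.concatenate(parts) * weights
```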

Comments:

  • How is this different from the same scenario in C++, using N vectors? (Commented May 29 at 17:52)
  • Big parts of numpy are written in C and C++: github.com/numpy/numpy/tree/main/numpy/_core/src. You should assume that whatever algorithms you call or implement will use data layout and loops just like any other C code would. (Commented May 29 at 19:49)

1 Answer

If you allocate a numpy array and don't make a non-contiguous view of it, it will be contiguous in memory even if it is multidimensional; see https://numpy.org/doc/stable/dev/internals.html. If you make a non-contiguous view, you are back to strided access, which is the same thing that occurs in C when you address a linear array as foo[stride*z] with stride != 1. If you build your own non-contiguous structures in numpy, you get the same pointer chasing as in C.
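As a quick illustration (a minimal sketch using only the standard flags attribute and np.ascontiguousarray), you can check contiguity directly:

```
import numpy as np

A = np.zeros((1000, 1000))          # fresh allocation: one contiguous buffer
print(A.flags['C_CONTIGUOUS'])      # True
view = A[:, ::2]                    # strided view into the same buffer
print(view.flags['C_CONTIGUOUS'])   # False: element access is now strided
print(view.base is A)               # True: no data was copied
copy = np.ascontiguousarray(view)   # an explicit copy restores contiguity
print(copy.flags['C_CONTIGUOUS'])   # True
```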

So Q = A + B will be O(3N) stores and loads (load A, load B, store Q) in numpy as well. If A or B is more expensive to compute than two loads from RAM (e.g. it itself needs multiple uncached loads from RAM), then storing it might be sensible. For operations implemented as ufuncs, the same complexity assumptions as in C should typically hold: most of them are implemented in C/C++, as Wolfgang Bangerth said, so their per-element cost should be close to what you get in C/C++. Operations like sorting have their complexity written in the docs and are also implemented in C/C++.
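To make "more expensive to compute than two loads" concrete, here is a minimal sketch (the array size and the choice of np.exp as the expensive operation are mine, not from the question) comparing a memory-bound ufunc with a compute-heavier one; if the second is much slower, caching its result pays off as soon as it is reused:

```
import timeit
import numpy as np

N = 1 << 23
A = np.random.rand(N)
B = np.random.rand(N)
Q = np.empty(N)

# memory-bound: ~3N loads/stores, almost no arithmetic per element
t_add = min(timeit.repeat(lambda: np.add(A, B, out=Q), number=20, repeat=5))
# less memory traffic (~2N), but transcendental math per element
t_exp = min(timeit.repeat(lambda: np.exp(A, out=Q), number=20, repeat=5))
print(f"add: {t_add:.3f}s   exp: {t_exp:.3f}s")
```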

Tangentially related: I've personally been wondering whether Q = A + B + C + ... produces temporaries and is therefore possibly inferior to a loop over the summands, so I wrote a small test script for it (see the end of the post). Once the array is big enough, I typically saw results like this (times in seconds; statistics over 5 runs of 20 adds each):

             mean     stddev   min      max      
Q=A+B        0.474155 0.007966 0.468307 0.485545 
Q=A+B+C      0.671866 0.004024 0.667896 0.678112 
Q=sum(A, B)  0.343913 0.002298 0.341384 0.346913 
Q=sum(A,B,C) 0.553073 0.001132 0.551913 0.554523 
Q=sum(4)     0.765597 0.001289 0.763745 0.767078 

Here sum(...) is implemented by first copying A into Q and then adding the remaining operands in place as binary operations, which avoids temporaries. This would suggest that something besides adding is happening in Q = A + B + ... . In any case, as you can see, the runtime grows (roughly) linearly with the number of operands, regardless of how you sum them. As for the per-element cost: the numbers above are samples for 20 operations at a time, so the min of sum(A,B) comes to 0.0174085s per vector add, while a STREAM ADD benchmark of the same size on the same machine gets 0.014576s, which isn't too far off.
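For reference, the back-of-the-envelope conversion from such a timing to effective memory bandwidth looks like this (using the per-add figure quoted above; three streams because each add loads A, loads B, and stores Q):

```
N = 1 << 23                 # array length used in the test script below
t_per_add = 0.0174085       # per-add time for sum(A,B) quoted above
bytes_moved = 3 * N * 8     # three streams of 8-byte doubles per add
print(f"{bytes_moved / t_per_add / 1e9:.1f} GB/s effective bandwidth")
```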

Test code:

```
import timeit
import numpy as np
from scipy import stats

def add2(Q, A, B):
  # Q = A + B: allocates a temporary for A + B, then copies it into Q
  Q[:] = A[:] + B[:]

def add3(Q, A, B, C):
  # Q = A + B + C: allocates temporaries for the intermediate sums
  Q[:] = A[:] + B[:] + C[:]

def addk(Q, *args):
  # "sum" variant: copy the first operand into Q, then accumulate the
  # rest in place as binary adds, avoiding temporaries
  Q[:] = args[0]
  for arg in args[1:]:
    Q[:] += arg

outprec = 8
def pprint_header(prelen):
  # column labels, aligned with the data rows printed by pprint
  cols = ["mean", "stddev", "min", "max"]
  flen = outprec + 1
  fmt = ("{:<%d}" % flen) * len(cols)
  print(" " * prelen, fmt.format(*cols))

def pprint(lab, r, prelen):
  # summarise the timing samples in r as mean / stddev / min / max
  sr = stats.describe(r)
  sdev = np.sqrt(sr.variance)
  dats = [sr.mean, sdev, sr.minmax[0], sr.minmax[1]]
  fmt = ("{:<%df} " % outprec) * len(dats)
  print(f"{lab:<{prelen}}", fmt.format(*dats))

N = 1<<23 # sufficient to be memory-bound and not suffer from python-level inefficiencies, hopefully
Q = np.zeros(N)
A = np.ones(N)
B = np.ones(N)
C = np.ones(N)
D = np.ones(N)

tests = [lambda: add2(Q, A, B),
         lambda: add3(Q, A, B, C),
         lambda: addk(Q, A, B),
         lambda: addk(Q, A, B, C),
         lambda: addk(Q, A, B, C, D)]
names = ["Q=A+B", "Q=A+B+C", "Q=sum(A, B)", "Q=sum(A,B,C)", "Q=sum(4)"]
prelen = max([len(x) for x in names])
for _ in range(3):
  tests[-1]() # warmup: touches Q and all four operand arrays

pprint_header(prelen)
for test, name in zip(tests, names):
  r = timeit.repeat(test, number=20, repeat=5)
  pprint(name, r, prelen)
```
Comments:

  • You are right, that was an excellent copy-paste failure. Running with the edited version doesn't show much timing difference, probably since it's already past the relevant cache sizes. (Commented Jun 1 at 23:59)
