
I learned scientific computing back when C and C++ were the thing, and I'm relatively new to Python. Back then I could estimate algorithmic cost fairly easily because, even with the STL, I knew how data was being handled (of course there was already low-level optimisation, so your code might do better than expected, but you still had a guideline).

Now that I use numpy heavily, I'm at a loss: I have never found information on the complexity or expected CPU cost of numpy functions.

This puzzles me whenever I need to choose between options, e.g. whether I should store some derived representation of the data or just rebuild it on each use, depending on how many calls I expect to make.

To give one concrete example: I have N vector arrays, which have no reason to be contiguous in memory. I will need them M times for element-by-element multiplications in which the N arrays appear as a single concatenated vector. Depending on M and N, when should I store the concatenated array? (Assume the entire data set is very small compared to RAM.) Or is this so impossible to know, portably or even in a single environment, that I just shouldn't care?
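For concreteness, here is a sketch of the two options I'm weighing (sizes and names are made up):

```
import numpy as np

N, M = 50, 1000                                    # made-up sizes
parts = [np.random.rand(100) for _ in range(N)]    # N non-contiguous arrays
weights = np.random.rand(100 * N)                  # the other multiplicand

# Option 1: store the concatenation once, reuse it M times.
flat = np.concatenate(parts)
for _ in range(M):
    out = flat * weights        # one contiguous element-wise multiply

# Option 2: rebuild the concatenation on every call.
for _ in range(M):
    out = np.concatenate(parts) * weights
```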

Comments:

  • How is this different from the same scenario in C++, using N vectors? (Commented May 29 at 17:52)
  • Big parts of numpy are written in C and C++: github.com/numpy/numpy/tree/main/numpy/_core/src. You should assume that whatever algorithms you call or implement will use data layout and loops just like any other C code would. (Commented May 29 at 19:49)

1 Answer

If you allocate a numpy array and don't make a non-contiguous view of it, it will be contiguous in memory even if it is multidimensional; see https://numpy.org/doc/stable/dev/internals.html. If you make a non-contiguous view, you are back to strided access, which is the same thing that occurs in C when you address a linear array as foo[stride*z] with stride != 1. If you build your own non-contiguous structures in numpy, you get the same pointer chasing as in C.
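As a quick illustration (a minimal sketch using only the standard flags attribute and np.ascontiguousarray), you can check contiguity directly:

```
import numpy as np

A = np.zeros((1000, 1000))          # fresh allocation: one contiguous buffer
print(A.flags['C_CONTIGUOUS'])      # True
view = A[:, ::2]                    # strided view into the same buffer
print(view.flags['C_CONTIGUOUS'])   # False: element access is now strided
print(view.base is A)               # True: no data was copied
copy = np.ascontiguousarray(view)   # an explicit copy restores contiguity
print(copy.flags['C_CONTIGUOUS'])   # True
```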

So Q = A + B will be O(3N) stores and loads (load A, load B, store Q) in numpy as well. If A or B is more expensive to compute than two loads from RAM (e.g. it itself needs multiple uncached loads from RAM), then storing it might be sensible. For operations implemented as ufuncs, the same complexity assumptions as in C should typically hold: most of them are implemented in C/C++, as Wolfgang Bangerth said, so their per-element cost should be close to what you get in C/C++. Operations like sorting have their complexity written in the docs and are also implemented in C/C++.
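To make "more expensive to compute than two loads" concrete, here is a minimal sketch (the array size and the choice of np.exp as the expensive operation are mine, not from the question) comparing a memory-bound ufunc with a compute-heavier one; if the second is much slower, caching its result pays off as soon as it is reused:

```
import timeit
import numpy as np

N = 1 << 23
A = np.random.rand(N)
B = np.random.rand(N)
Q = np.empty(N)

# memory-bound: ~3N loads/stores, almost no arithmetic per element
t_add = min(timeit.repeat(lambda: np.add(A, B, out=Q), number=20, repeat=5))
# less memory traffic (~2N), but transcendental math per element
t_exp = min(timeit.repeat(lambda: np.exp(A, out=Q), number=20, repeat=5))
print(f"add: {t_add:.3f}s   exp: {t_exp:.3f}s")
```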

Tangentially related: I've personally been wondering whether Q = A + B + C + ... produces temporaries and is therefore possibly inferior to a loop over the summands, so I wrote a small test script for it (see the end of the post). Once the array is big enough, I typically saw results like this (times in seconds; statistics over 5 runs of 20 adds each):

             mean     stddev   min      max      
Q=A+B        0.474155 0.007966 0.468307 0.485545 
Q=A+B+C      0.671866 0.004024 0.667896 0.678112 
Q=sum(A, B)  0.343913 0.002298 0.341384 0.346913 
Q=sum(A,B,C) 0.553073 0.001132 0.551913 0.554523 
Q=sum(4)     0.765597 0.001289 0.763745 0.767078 

Here sum(...) is implemented by first copying A into Q and then adding the remaining operands in place as binary operations, which avoids temporaries. This would suggest that something besides adding is happening in Q = A + B + ... . In any case, as you can see, the runtime grows (roughly) linearly with the number of operands, regardless of how you sum them. As for the per-element cost: the numbers above are samples for 20 operations at a time, so the min of sum(A,B) comes to 0.0174085s per vector add, while a STREAM ADD benchmark of the same size on the same machine gets 0.014576s, which isn't too far off.
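For reference, the back-of-the-envelope conversion from such a timing to effective memory bandwidth looks like this (using the per-add figure quoted above; three streams because each add loads A, loads B, and stores Q):

```
N = 1 << 23                 # array length used in the test script below
t_per_add = 0.0174085       # per-add time for sum(A,B) quoted above
bytes_moved = 3 * N * 8     # three streams of 8-byte doubles per add
print(f"{bytes_moved / t_per_add / 1e9:.1f} GB/s effective bandwidth")
```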

Test code:

```
import timeit
import numpy as np
from scipy import stats

def add2(Q, A, B):
  # Q = A + B: allocates a temporary for A + B, then copies it into Q
  Q[:] = A[:] + B[:]

def add3(Q, A, B, C):
  # Q = A + B + C: allocates temporaries for the intermediate sums
  Q[:] = A[:] + B[:] + C[:]

def addk(Q, *args):
  # "sum" variant: copy the first operand into Q, then accumulate the
  # rest in place as binary adds, avoiding temporaries
  Q[:] = args[0]
  for arg in args[1:]:
    Q[:] += arg

outprec = 8
def pprint_header(prelen):
  # column labels, aligned with the data rows printed by pprint
  cols = ["mean", "stddev", "min", "max"]
  flen = outprec + 1
  fmt = ("{:<%d}" % flen) * len(cols)
  print(" " * prelen, fmt.format(*cols))

def pprint(lab, r, prelen):
  # summarise the timing samples in r as mean / stddev / min / max
  sr = stats.describe(r)
  sdev = np.sqrt(sr.variance)
  dats = [sr.mean, sdev, sr.minmax[0], sr.minmax[1]]
  fmt = ("{:<%df} " % outprec) * len(dats)
  print(f"{lab:<{prelen}}", fmt.format(*dats))

N = 1<<23 # sufficient to be memory-bound and not suffer from python-level inefficiencies, hopefully
Q = np.zeros(N)
A = np.ones(N)
B = np.ones(N)
C = np.ones(N)
D = np.ones(N)

tests = [lambda: add2(Q, A, B),
         lambda: add3(Q, A, B, C),
         lambda: addk(Q, A, B),
         lambda: addk(Q, A, B, C),
         lambda: addk(Q, A, B, C, D)]
names = ["Q=A+B", "Q=A+B+C", "Q=sum(A, B)", "Q=sum(A,B,C)", "Q=sum(4)"]
prelen = max([len(x) for x in names])
for _ in range(3):
  tests[-1]() # warmup: touches Q and all four operand arrays

pprint_header(prelen)
for test, name in zip(tests, names):
  r = timeit.repeat(test, number=20, repeat=5)
  pprint(name, r, prelen)
```
Comments:

  • You are right, that was an excellent copy-paste failure. Running with the edited version doesn't show much timing difference, probably since it's already past the relevant cache sizes. (Commented Jun 1 at 23:59)
