If you allocate a numpy array and don't take a non-contiguous view of it, it will be contiguous in memory even if it is multidimensional; see https://numpy.org/doc/stable/dev/internals.html . If you take a non-contiguous view, you are back to strided access, which is the same thing that happens in C when you address a linear array as foo[stride*z] with stride != 1. And if you build your own non-contiguous structures in numpy, you get the same pointer chasing as in C.
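You can check this yourself via the array flags; a minimal sketch:

```python
import numpy as np

a = np.zeros((4, 5))             # freshly allocated 2-D array
print(a.flags['C_CONTIGUOUS'])   # True: rows are laid out back-to-back in one buffer

v = a[:, ::2]                    # every other column: a strided view
print(v.flags['C_CONTIGUOUS'])   # False: strided access, like foo[stride*z] in C
```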
So Q = A + B will be O(3N) stores+loads in numpy too: N loads each for A and B, and N stores for Q. If A or B are more expensive to compute than a load from RAM (e.g. they themselves need multiple uncached loads), then storing them might be sensible. The same complexity assumptions as in C should typically hold for operations implemented as ufuncs. Most of these are implemented in C/C++, as Wolfgang Bangerth said, so their per-element cost should be close to what you get in C/C++. Operations like sorting have their complexity stated in the docs and are also implemented in C/C++.
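For what it's worth, the allocation of a fresh result array for A + B can be avoided when you already have a destination: ufuncs accept an out= argument that writes directly into an existing array. A minimal sketch:

```python
import numpy as np

A = np.ones(1000)
B = np.ones(1000)
Q = np.empty_like(A)

# Q = A + B allocates a new result array on every call;
# np.add(A, B, out=Q) writes straight into Q instead.
np.add(A, B, out=Q)
```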
Tangentially related: I've personally been wondering whether Q = A + B + C + ... produces temporaries and is therefore possibly inferior to a loop over the summands. So I wrote a small test script (see the end of the post). Once the array is big enough, I typically saw results like the following (statistics over 5 runs of 20 adds each):
```
              mean      stddev    min       max
Q=A+B         0.474155  0.007966  0.468307  0.485545
Q=A+B+C       0.671866  0.004024  0.667896  0.678112
Q=sum(A, B)   0.343913  0.002298  0.341384  0.346913
Q=sum(A,B,C)  0.553073  0.001132  0.551913  0.554523
Q=sum(4)      0.765597  0.001289  0.763745  0.767078
```
Here sum(...) is implemented by first copying A into Q and then adding the remaining operands in place as binary operations, avoiding temporaries. This suggests to me that something besides adding is happening in Q = A + B + .... In any case, as you can see, there is a roughly linear relationship between the number of operands and the runtime, regardless of how you sum them. As for the per-element cost: the numbers above are for 20 operations at a time, so the min of sum(A,B) comes to 0.0174085 s per vector add, while a STREAM ADD benchmark of the same size on the same machine takes 0.014576 s, which isn't too far off.
Test code:
```python
import timeit

import numpy as np
from scipy import stats


def add2(Q, A, B):
    Q[:] = A[:] + B[:]


def add3(Q, A, B, C):
    Q[:] = A[:] + B[:] + C[:]


def addk(Q, *args):
    # Copy the first operand into Q, then accumulate the rest
    # in place as binary operations, avoiding temporaries.
    Q[:] = args[0]
    for arg in args[1:]:
        Q[:] += arg


outprec = 8


def pprint_header(prelen):
    cols = ["mean", "stddev", "min", "max"]
    flen = outprec + 1
    fmt = ("{:<%d}" % flen) * len(cols)
    print(" " * prelen, fmt.format(*cols))


def pprint(lab, r, prelen):
    sr = stats.describe(r)
    sdev = np.sqrt(sr.variance)
    dats = [sr.mean, sdev, sr.minmax[0], sr.minmax[1]]
    fmt = ("{:<%df} " % outprec) * len(dats)
    print(f"{lab:<{prelen}}", fmt.format(*dats))


N = 1 << 23  # big enough to be memory-bound and not dominated by Python-level overhead, hopefully
Q = np.zeros(N)
A = np.ones(N)
B = np.ones(N)
C = np.ones(N)
D = np.ones(N)

tests = [lambda: add2(Q, A, B),
         lambda: add3(Q, A, B, C),
         lambda: addk(Q, A, B),
         lambda: addk(Q, A, B, C),
         lambda: addk(Q, A, B, C, D)]
names = ["Q=A+B", "Q=A+B+C", "Q=sum(A, B)", "Q=sum(A,B,C)", "Q=sum(4)"]
prelen = max(len(x) for x in names)

for _ in range(3):
    tests[-1]()  # warmup

pprint_header(prelen)
for test, name in zip(tests, names):
    r = timeit.repeat(test, number=20, repeat=5)
    pprint(name, r, prelen)
```