I have read the paper "Attention Is All You Need" by Vaswani et al. (2017). This paper uses a so-called position-wise feed-forward network, whose input is a matrix $\mathbf{X} \in \mathbb{R}^{n \times d_\mathrm{model}}$ (not a single vector $\mathbf{x} \in \mathbb{R}^{d_\mathrm{model}}$). If I am not mistaken, "position-wise" means that the same feed-forward layer is applied to every vector $\mathbf{X}_{i*}$ (the $i$th row of $\mathbf{X}$) for $i = 1, \dots, n$; thus, the weights are shared across positions.
I want to do backpropagation for a position-wise network consisting of only a single linear layer with no activation. Let the output dimensionality also be $d_\mathrm{model}$. Applying this network yields $\mathbf{Z} \in \mathbb{R}^{n \times d_\mathrm{model}}$, where each row $\mathbf{Z}_{i*}$, $i = 1, \dots, n$, is given by $\mathbf{Z}_{i*} = \mathbf{X}_{i*} \mathbf{W} + \mathbf{b}^\intercal$. Here, $\mathbf{W} \in \mathbb{R}^{d_\mathrm{model} \times d_\mathrm{model}}$ and $\mathbf{b} \in \mathbb{R}^{d_\mathrm{model}}$ are the weight matrix and bias vector, respectively.
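To make the weight sharing concrete, here is a minimal NumPy sketch (the shapes and the names `X`, `W`, `b` are made up for illustration): applying the same $\mathbf{W}$ and $\mathbf{b}$ to each row separately gives the same result as the single matrix product $\mathbf{X}\mathbf{W} + \mathbf{b}^\intercal$.

```python
import numpy as np

n, d_model = 4, 8
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))        # n position vectors, one per row
W = rng.normal(size=(d_model, d_model))  # shared weight matrix
b = rng.normal(size=(d_model,))          # shared bias

# Position-wise application: the same (W, b) is used for every row X[i].
Z_rowwise = np.stack([X[i] @ W + b for i in range(n)])

# Equivalent batched form: one matrix product over all positions at once.
Z_batched = X @ W + b

assert np.allclose(Z_rowwise, Z_batched)
```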
Let $L$ be the loss function. Considering only the $i$th row, I get: $\dfrac{\partial L}{\partial \mathbf{W}_{pq}} = \dfrac{\partial L}{\partial \mathbf{Z}_{i1}} \dfrac{\partial \mathbf{Z}_{i1}}{\partial \mathbf{W}_{pq}} + \dfrac{\partial L}{\partial \mathbf{Z}_{i2}} \dfrac{\partial \mathbf{Z}_{i2}}{\partial \mathbf{W}_{pq}} + \dots + \dfrac{\partial L}{\partial \mathbf{Z}_{id_\mathrm{model}}} \dfrac{\partial \mathbf{Z}_{id_\mathrm{model}}}{\partial \mathbf{W}_{pq}} = \dfrac{\partial L}{\partial \mathbf{Z}_{iq}} \mathbf{X}_{ip}$, for $p, q = 1, \dots, d_\mathrm{model}$, since $\mathbf{Z}_{ij} = \sum_k \mathbf{X}_{ik} \mathbf{W}_{kj} + \mathbf{b}_j$ depends on $\mathbf{W}_{pq}$ only when $j = q$, with $\dfrac{\partial \mathbf{Z}_{iq}}{\partial \mathbf{W}_{pq}} = \mathbf{X}_{ip}$.
Thus, I end up with $\dfrac{\partial L}{\partial \mathbf{W}} = \mathbf{X}_{i*}^\intercal \, \dfrac{\partial L}{\partial \mathbf{Z}_{i*}}$ (treating $\mathbf{X}_{i*}$ and $\dfrac{\partial L}{\partial \mathbf{Z}_{i*}}$ as row vectors, so this is a $d_\mathrm{model} \times d_\mathrm{model}$ outer product).
My question: does $\mathbf{X}_{1*}^\intercal \, \dfrac{\partial L}{\partial \mathbf{Z}_{1*}} = \mathbf{X}_{2*}^\intercal \, \dfrac{\partial L}{\partial \mathbf{Z}_{2*}} = \dots = \mathbf{X}_{n*}^\intercal \, \dfrac{\partial L}{\partial \mathbf{Z}_{n*}}$ hold, i.e., is this expression the same for every row $i = 1, \dots, n$?
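In case it helps, here is the small NumPy check I have in mind (a sketch with made-up shapes; `dZ` is a random stand-in for $\partial L / \partial \mathbf{Z}$, used only for illustration): it forms each row's term $\mathbf{X}_{i*}^\intercal \, \dfrac{\partial L}{\partial \mathbf{Z}_{i*}}$ as an outer product so the terms can be compared directly.

```python
import numpy as np

n, d_model = 4, 8
rng = np.random.default_rng(1)

X = rng.normal(size=(n, d_model))   # inputs, one position per row
dZ = rng.normal(size=(n, d_model))  # random stand-in for dL/dZ, for illustration only

# Each row's term X_{i*}^T (dL/dZ_{i*}): a d_model x d_model outer product.
per_row = [np.outer(X[i], dZ[i]) for i in range(n)]

# Compare every row's term against the first row's term.
print([np.allclose(per_row[0], per_row[i]) for i in range(1, n)])
```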