I am working on my computer organization lab and trying to make my matrix transpose more efficient for a 64 by 64 integer array. The cache is direct-mapped with 32 sets and 32 bytes per line (s=5, E=1, b=5).
I'm using blocking, which was recommended to me, and I delay the diagonal write until the inner for loop ends so that it doesn't cause an unnecessary eviction. After looking through the memory references for this code, I found that I should use a block size of 4 (instead of 8, which at first seemed like the more logical choice for this cache configuration, since one 32-byte line holds 8 ints).
I don't see any obvious mistakes in my transpose that would cause excessive evictions for the 64x64 array. Also, I'm only allowed to define integer variables in my code.
    int BS = 4;
    int ii, jj, i, j;
    for (ii = 0; ii < N; ii += BS) {
        for (jj = 0; jj < M; jj += BS) {
            int ii_max = (ii + BS < N ? ii + BS : N);
            int jj_max = (jj + BS < M ? jj + BS : M);
            if (ii != jj) {
                /* off-diagonal block: plain copy */
                for (i = ii; i < ii_max; i++) {
                    for (j = jj; j < jj_max; j++) {
                        B[j][i] = A[i][j];
                    }
                }
            } else {
                /* diagonal block: hold A[i][i] in tmp and write it last */
                for (i = ii; i < ii_max; i++) {
                    int tmp;
                    for (j = jj; j < jj_max; j++) {
                        if (i == j) {
                            tmp = A[i][j];
                        } else {
                            B[j][i] = A[i][j];
                        }
                    }
                    B[i][i] = tmp;
                }
            }
        }
    }
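To compare block sizes without reading traces by hand, something like the following sketch might help. It replays the same access order as the loop above through a toy model of the cache described earlier (32 sets, one 32-byte line per set). Everything here is an assumption made for illustration: the helper names (`touch`, `addr_A`, `addr_B`, `count_misses`, `is_transposed`) are hypothetical, ints are taken to be 4 bytes, `A` is placed at byte 0 with `B` directly after it, `BS` must divide 64, and only accesses to `A` and `B` are counted (loop variables and `tmp` are treated as living in registers).

```c
#define DIM 64          /* matrix is DIM x DIM ints */
#define SETS 32         /* s = 5 */
#define LINE_BYTES 32   /* b = 5 */
#define INT_BYTES 4     /* ASSUMPTION: 4-byte int */

int A[DIM][DIM], B[DIM][DIM];
long tags[SETS];        /* tag currently held by each set, -1 when empty */
int misses;

/* Model one access to a byte address: a miss replaces the line in its set. */
void touch(long addr)
{
    long line = addr / LINE_BYTES;
    int set = (int)(line % SETS);
    long tag = line / SETS;
    if (tags[set] != tag) {
        tags[set] = tag;
        misses++;
    }
}

/* ASSUMPTION: A starts at byte 0 and B starts right after A. */
long addr_A(int i, int j) { return (long)INT_BYTES * (i * DIM + j); }
long addr_B(int i, int j) { return (long)INT_BYTES * (DIM * DIM + i * DIM + j); }

/* Run the blocked transpose with block size BS and return the miss count. */
int count_misses(int BS)
{
    int ii, jj, i, j, tmp = 0;

    misses = 0;
    for (i = 0; i < SETS; i++)
        tags[i] = -1;
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            A[i][j] = i * DIM + j;

    for (ii = 0; ii < DIM; ii += BS) {
        for (jj = 0; jj < DIM; jj += BS) {
            for (i = ii; i < ii + BS; i++) {
                for (j = jj; j < jj + BS; j++) {
                    touch(addr_A(i, j));            /* read A[i][j] */
                    if (ii == jj && i == j) {
                        tmp = A[i][j];              /* delay the diagonal write */
                    } else {
                        touch(addr_B(j, i));        /* write B[j][i] */
                        B[j][i] = A[i][j];
                    }
                }
                if (ii == jj) {
                    touch(addr_B(i, i));            /* delayed diagonal write */
                    B[i][i] = tmp;
                }
            }
        }
    }
    return misses;
}

/* Returns 1 if B currently holds the transpose of A, 0 otherwise. */
int is_transposed(void)
{
    int i, j;
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            if (B[j][i] != A[i][j])
                return 0;
    return 1;
}
```

Calling `count_misses(4)` and `count_misses(8)` from a `main` and printing both values should show the gap between the two block sizes under this model. The real lab simulator may place the arrays differently, so the absolute numbers are only indicative.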
I also wrote down which set each reference maps to on each iteration (and therefore which line it evicts), in case it helps.
The first value is i, the second is j, and the last is the set index.
0,0,0 (diagonal replacement)
0,1,0
1,0,8
0,2,0
2,0,16
0,3,0
3,0,24
0,4,0
4,0,0
0,5,0
5,0,8
0,6,0
6,0,16
0,7,0
7,0,24
0,8,1
8,0,0
0,9,1
9,0,8
0,10,1
10,0,16
0,11,1
11,0,24
0,12,1
12,0,0
0,13,1
13,0,8
0,14,1
14,0,16
0,15,1
15,0,24
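The set indices in the list above can be reproduced with a small helper like the one below. This is only a sketch: the name `set_index` is hypothetical, and it assumes 4-byte ints, 64 ints per row, and an array whose first element maps to set 0.

```c
/* Set index for element [row][col] of a 64x64 int array.
 * ASSUMPTIONS: 4-byte int, and the array base address maps to set 0. */
int set_index(int row, int col)
{
    int byte_offset = (row * 64 + col) * 4;
    /* drop the b = 5 block-offset bits, then keep the s = 5 set bits */
    return (byte_offset >> 5) & 31;
}
```

One row is 256 bytes, or 8 cache lines, so moving down one row advances the set index by 8, and rows that are 4 apart (for example rows 0 and 4) land in the same set. That matches the 0, 8, 16, 24, 0, ... pattern in the list.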
N != M.