I am working on my computer organization lab and trying to make my matrix transpose more efficient for a 64 by 64 integer array. The cache is direct-mapped with 32 sets and 32 bytes per line (s=5, E=1, b=5).

I'm using blocking (which was recommended to me), and I delay writing the diagonal element until the inner for loop ends so that it doesn't cause an unnecessary eviction. For now, by looking through the memory references for this code, I found that I should use a block size of 4x4 integers (instead of 8x8, which at first seemed like the more logical choice for this cache configuration).

I don't see any obvious mistakes in my transpose that would cause excessive evictions for a 64x64 array. Also, I'm only allowed to define integer variables in my code.

    int BS = 4;
    int ii, jj, i, j;
    for (ii = 0; ii < N; ii += BS) {
        for (jj = 0; jj < M; jj += BS) {
    
            int ii_max = (ii + BS < N ? ii + BS : N);
            int jj_max = (jj + BS < M ? jj + BS : M);
    
            if (ii != jj) {
                for (i = ii; i < ii_max; i++) {
                    for (j = jj; j < jj_max; j++) {
                        B[j][i] = A[i][j];
                    }
                }
            } else {
                for (i = ii; i < ii_max; i++) {
                    int tmp;
                    for (j = jj; j < jj_max; j++) {
                        if (i == j) {
                            tmp = A[i][j];
                        } else {
                            B[j][i] = A[i][j];
                        }
                    }
                    B[i][i] = tmp; 
                }
            }
        }
    }

I also wrote down which set each reference maps to on each iteration (and therefore which line it evicts), in case it helps. The first value is i, the second is j, and the last is the set index.

0,0,0 (diagonal replacement)

0,1,0 
1,0,8

0,2,0
2,0,16

0,3,0
3,0,24


0,4,0
4,0,0

0,5,0
5,0,8

0,6,0
6,0,16

0,7,0
7,0,24


0,8,1
8,0,0

0,9,1
9,0,8

0,10,1
10,0,16

0,11,1
11,0,24


0,12,1
12,0,0

0,13,1
13,0,8

0,14,1
14,0,16

0,15,1
15,0,24
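
The set indices in the list above can be reproduced with a few lines of C. This is only a sketch under stated assumptions: `int` is 4 bytes, and each array starts at an address that maps to set 0 (which is what the listed values imply). The name `set_index` and the constants are hypothetical, not part of the lab.

```c
#include <assert.h>

enum { DIM = 64, INT_BYTES = 4, LINE_BYTES = 32, NUM_SETS = 32 };

/* Set index of element [i][j] of a DIM x DIM int array.
   Assumption: the array starts at an address that maps to set 0. */
int set_index(int i, int j) {
    assert(i >= 0 && i < DIM && j >= 0 && j < DIM);
    int byte_offset = (i * DIM + j) * INT_BYTES;   /* row-major layout */
    return (byte_offset / LINE_BYTES) % NUM_SETS;  /* line number modulo set count */
}
```

With these assumptions, `set_index(0, 0)` and `set_index(4, 0)` both return 0: rows that are four apart land in the same set, so an 8x8 block has rows competing for the same cache lines while a 4x4 block does not.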
  • There is a good cache oblivious answer here: stackoverflow.com/questions/5200338/… Commented Jun 19 at 12:42
  • Your post does not indicate there is any problem. Nowhere does it state that execution is taking too long or that there are excessive evictions, let alone provide information on how much too long, how many excessive evictions, or how you know that. Commented Jun 19 at 14:04
  • (One tip: Never write “block size of 4.” Always specify units. Bytes, array elements, whatever. It makes no more sense to say “block size of 4” than it does to say your computer weighs 3 or its height is ½ or its speed is 8.) Commented Jun 19 at 14:06
  • I appreciate the help, guys. I will try the solution that you sent when I get to my computer. The problem I'm dealing with is that my code causes 1796 misses for a 64x64 integer array (hits: 6401, misses: 1796, evictions: 1764), and I need to get it below 1300 to get a full grade. And yes, I should have specified that I'm working with a block size of 4 by 4 integers. (This place is a little stricter than I expected.) Commented Jun 19 at 14:30
  • The classic way to do it with minimal evictions from cache is to transpose all the 4x4 tiles in place on the first pass through and then do a second pass to shuffle the tiles into their correct final positions. Tiles on the diagonal for a square matrix do not move. It is very much easier to get this right on a square matrix. Expect some serious fun debugging it when N != M. Commented Jun 19 at 14:46

1 Answer

Here are my remarks:

  • I don't understand why you make a special case of the diagonal blocks (ii == jj) and of the diagonal elements. Lack of symmetry tends to prevent compiler optimisations.
  • Instead, try to make it obvious to the compiler that the inner loops can be unrolled and reordered as appropriate.
  • Make sure you tell the compiler that the source and destination matrices do not overlap by using the restrict keyword.
  • Furthermore, you might want to group the writes to the same cache line and read from the scattered lines by swapping the loop indices.
  • The tests on partial blocks are also useless if the blocking factor evenly divides N and M; separate this special case (which is probably the most common) so the compiler can optimize it.
  • Use size_t for the index variables unless you must use int, which would be a bizarre requirement. size_t is an integer type.

Here is a simpler version you might want to try:

    // Assuming the matrix dimensions are known at compile time
    #include <stddef.h>  // size_t

    #define M 64
    #define N 64

    typedef int celltype_t;  // Assuming `int` matrix cells, adjust for your purpose

    // For array parameters, `restrict` goes inside the outermost brackets
    void transpose_matrix(celltype_t B[restrict M][N],
                          const celltype_t A[restrict N][M]) {
        const size_t BS = 4;  // block edge, in array elements
        size_t ii, jj, i, j;
        if (N % BS != 0 || M % BS != 0) {
            // General case: the last block in a row or column may be partial
            for (jj = 0; jj < M; jj += BS) {
                for (ii = 0; ii < N; ii += BS) {
                    size_t ii_max = (ii + BS < N ? ii + BS : N);
                    size_t jj_max = (jj + BS < M ? jj + BS : M);
                    for (j = jj; j < jj_max; j++) {
                        for (i = ii; i < ii_max; i++) {
                            B[j][i] = A[i][j];
                        }
                    }
                }
            }
        } else {
            // Common case: BS divides both dimensions, so no bounds tests
            for (jj = 0; jj < M; jj += BS) {
                for (ii = 0; ii < N; ii += BS) {
                    for (j = 0; j < BS; j++) {
                        for (i = 0; i < BS; i++) {
                            B[jj + j][ii + i] = A[ii + i][jj + j];
                        }
                    }
                }
            }
        }
    }
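
To compare blocking factors without guessing, it can help to count misses in a small model of the cache from the question (32 direct-mapped sets, 32-byte lines). The following is a rough sketch, not the lab's own simulator: it assumes 4-byte `int`, places A at byte offset 0 and B immediately after it, counts only accesses to the two arrays, and replays the plain blocked loop from the question without the diagonal special case. The names `touch` and `count_misses` are hypothetical.

```c
#include <assert.h>

enum { DIM = 64, INT_BYTES = 4, LINE_BYTES = 32, NUM_SETS = 32 };

static long line_in_set[NUM_SETS];  /* line number held by each set, -1 if empty */
static long miss_count;

/* Simulate one access to a direct-mapped cache (one line per set). */
static void touch(long byte_addr) {
    long line = byte_addr / LINE_BYTES;
    int set = (int)(line % NUM_SETS);
    if (line_in_set[set] != line) {  /* cold miss or conflict miss */
        miss_count++;
        line_in_set[set] = line;     /* load the line, evicting the old one */
    }
}

/* Misses of a blocked transpose B[j][i] = A[i][j] with bs x bs element blocks.
   Assumption: A starts at byte 0 and B directly follows A in memory. */
long count_misses(int bs) {
    int ii, jj, i, j, s;
    const long a_base = 0;
    const long b_base = (long)DIM * DIM * INT_BYTES;
    assert(bs > 0);
    for (s = 0; s < NUM_SETS; s++) line_in_set[s] = -1;
    miss_count = 0;
    for (ii = 0; ii < DIM; ii += bs)
        for (jj = 0; jj < DIM; jj += bs)
            for (i = ii; i < ii + bs && i < DIM; i++)
                for (j = jj; j < jj + bs && j < DIM; j++) {
                    touch(a_base + ((long)i * DIM + j) * INT_BYTES);  /* read A[i][j] */
                    touch(b_base + ((long)j * DIM + i) * INT_BYTES);  /* write B[j][i] */
                }
    return miss_count;
}
```

Calling `count_misses(4)` and `count_misses(8)` shows the 4x4 blocks producing fewer misses in this model, because with 8x8 blocks the rows of B that are four apart share a set and evict each other on every pass. The absolute numbers depend on where the real arrays are placed, so treat them as a guide only.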