Please imagine I have a dataframe like this:
df = pd.DataFrame(index=pd.Index(['1', '1', '2', '2'], name='from'), columns=['to'], data= ['2', '2', '4', '5'])
df:
Now, I would like to calculate a matrix comprising of the percentage of times each value in the index "from" transitions to each value in column 'to', which is known as a transition matrix. I can achieve this by creating an empty transition matrix first and then populating it with the percentages using a for loop:
#Create an empty matrix to populate later (using sparse dtype to save memory):
matrix = pd.DataFrame(index=df.index.unique(), columns=df.to.unique(), data=0, dtype=pd.SparseDtype(dtype=np.float16, fill_value=0))
matrix:
for i in range(len(df)):
from_, to = df.index[i], df.to.iloc[i]
matrix[to] = matrix[to].sparse.to_dense() # Convert to dense format because sparse dtype does not allow value assignment with .loc in the next line:
matrix.loc[from_, to] += 1 # Do a normal insertion with .loc[]
matrix[to] = matrix[to].astype(pd.SparseDtype(dtype=np.float16, fill_value=0)) # Back to the original sparse format
matrix = (matrix.div(matrix.sum(axis=1), axis=0)*100) # converting counts to percentages
matrix:
This works. For example, index "1" only transitioned to "2" (100% of the time) and index "2" transitioned to "4" 50% of the time and to "5" the other 50% of the time, as can be verified in df.
Issue: The actual matrix is about 500K by 500K and the for loop takes a really long time to finish. So, is there a vectorized or other efficient way of calculating matrix from df
Note: I would get MemoryError without using the whole Sparse dtype thing even with dtype=float16 in pd.DataFrame() so I prefer to keep that if possible. It would be great if the 500K by 500K matrix will not take up more than 10-12Gb of RAM. Also, if it matters, these percentages will always have a 0-100 range, obviously.


