
Please imagine I have a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.Index(['1', '1', '2', '2'], name='from'),
                  columns=['to'],
                  data=['2', '2', '4', '5'])

df:

     to
from   
1     2
1     2
2     4
2     5

Now, I would like to calculate a matrix containing the percentage of times each value in the index 'from' transitions to each value in the column 'to', which is known as a transition matrix. I can achieve this by first creating an empty transition matrix and then populating it with the percentages using a for loop:

# Create an empty matrix to populate later (using a sparse dtype to save memory):
matrix = pd.DataFrame(index=df.index.unique(), columns=df.to.unique(), data=0,
                      dtype=pd.SparseDtype(dtype=np.float16, fill_value=0))

matrix:

        2    4    5
from               
1     0.0  0.0  0.0
2     0.0  0.0  0.0

for i in range(len(df)):
    from_, to = df.index[i], df.to.iloc[i]
    # Convert to dense because the sparse dtype does not allow value assignment with .loc:
    matrix[to] = matrix[to].sparse.to_dense()
    matrix.loc[from_, to] += 1  # Do a normal insertion with .loc[]
    # Back to the original sparse format:
    matrix[to] = matrix[to].astype(pd.SparseDtype(dtype=np.float16, fill_value=0))

matrix = (matrix.div(matrix.sum(axis=1), axis=0)*100) # converting counts to percentages

matrix:

          2     4     5
from                   
1     100.0   0.0   0.0
2       0.0  50.0  50.0

This works. For example, index "1" only transitioned to "2" (100% of the time) and index "2" transitioned to "4" 50% of the time and to "5" the other 50% of the time, as can be verified in df.
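
As a quick sanity check (my own addition, not strictly necessary), the row sums of the percentage matrix should all be 100 and the individual cells should match the transitions counted above:

# Sanity check on the small example above (assumes matrix was built as shown):
assert (matrix.sum(axis=1) == 100).all()  # each row of percentages sums to 100
assert matrix.loc['1', '2'] == 100        # '1' -> '2' 100% of the time
assert matrix.loc['2', '4'] == 50         # '2' -> '4' 50% of the time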

Issue: The actual matrix is about 500K by 500K and the for loop takes a really long time to finish. So, is there a vectorized or otherwise more efficient way of calculating matrix from df?

Note: I get a MemoryError without the sparse dtype, even with dtype=np.float16 in pd.DataFrame(), so I would prefer to keep it if possible. Ideally, the 500K by 500K matrix would not take up more than 10-12 GB of RAM. Also, if it matters, these percentages will always be in the 0-100 range.
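
For a rough sense of scale (my own back-of-the-envelope estimate): a fully dense 500K by 500K matrix is far beyond that 10-12 GB budget, which is why the sparse representation matters here:

# Dense storage estimate, ignoring any pandas overhead (assumption: plain contiguous layout):
n = 500_000
print(n * n * 2 / 1e9)   # ~500 GB for float16
print(n * n * 8 / 1e12)  # ~2 TB for float64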

1 Answer

Option 1: pd.crosstab

out = (pd.crosstab(index=df.index, 
                   columns=df['to'], 
                   normalize='index'
                   )
       .mul(100)
       .rename_axis(index='from', columns=None)
       )

Output:

          2     4     5
from                   
1     100.0   0.0   0.0
2       0.0  50.0  50.0

Option 2: df.pivot_table

out2 = (df.reset_index()
        .pivot_table(index='from', columns='to', 
                     values='to', aggfunc='size', fill_value=0)
        .div(df.index.value_counts(), axis=0)
        .mul(100)
        .rename_axis(columns=None)
        )

out2.equals(out)
# True

Option 3: df.groupby

out3 = (df.groupby(df.index)
        .value_counts(normalize=True)
        .unstack(-1, fill_value=0)
        .mul(100)
        .rename_axis(columns=None)
        )

out3.equals(out)
# True
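
If you want to compare the speed of the options yourself, a rough benchmark along these lines could help (a sketch only; df_big, nodes and the sizes are made up for illustration, and actual timings depend on your pandas version and data):

import timeit
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
nodes = np.arange(1_000).astype(str)
df_big = pd.DataFrame(index=pd.Index(rng.choice(nodes, 100_000), name='from'),
                      data={'to': rng.choice(nodes, 100_000)})

t1 = timeit.timeit(lambda: pd.crosstab(df_big.index, df_big['to'], normalize='index'),
                   number=3)
t3 = timeit.timeit(lambda: (df_big.groupby(df_big.index)
                            .value_counts(normalize=True)
                            .unstack(-1, fill_value=0)),
                   number=3)
print(f'crosstab: {t1:.2f}s, groupby + value_counts: {t3:.2f}s')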

Edit:

With a very sizeable df, I would consider creating a scipy csr_matrix to avoid the MemoryError. In this case, we only use df.groupby + value_counts, and use Index.factorize to pass compatible row_ind and col_ind values.

from scipy.sparse import csr_matrix

g = df.groupby(df.index).value_counts(normalize=True).mul(100)

idx_codes, idx_unique = g.index.get_level_values(0).factorize()
col_codes, col_unique = g.index.get_level_values(1).factorize() 

m = csr_matrix((g.values, (idx_codes, col_codes)))

If you want, you can turn that back into a DataFrame with pd.DataFrame.sparse.from_spmatrix:

df_m = pd.DataFrame.sparse.from_spmatrix(m, 
                                         index=idx_unique, 
                                         columns=col_unique)

Output:

       2     4     5
1  100.0   0.0   0.0
2    0.0  50.0  50.0
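
To get a rough idea of how much memory this actually uses (a quick check I would add, assuming the names above):

# Bytes held by the CSR matrix itself (values + column indices + row pointers):
print((m.data.nbytes + m.indices.nbytes + m.indptr.nbytes) / 1e6, 'MB')

# And the sparse DataFrame built from it:
print(df_m.memory_usage(deep=True).sum() / 1e6, 'MB')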

That said, you can also keep working with m directly and still use idx_unique and col_unique for slicing. E.g.:

bool_idx = col_unique.isin(['4','5'])

m[:, bool_idx].sum(axis=1)

matrix([[  0.],
        [100.]])

Compare:

df_m.loc[:, bool_idx].sum(axis=1)

1        0
2    100.0
dtype: Sparse[float64, 0]
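
Similarly, a single "from" row can be read straight off m by locating its position in idx_unique (again just a sketch using the names above):

row = m[idx_unique.get_loc('2')]              # sparse 1 x n_cols row for from='2'
dict(zip(col_unique[row.indices], row.data))
# e.g. {'4': 50.0, '5': 50.0}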

3 Comments

@ouroboros1 Is there any way you can modify the code (maybe option 3, because I think it's the fastest) to reduce the data size? I'm getting MemoryError: Unable to allocate 29.4 GiB for an array with shape (60654, 65130) and data type float64. Most of the values are 0s and the others range from 0 to 100.
@Saeed: updated the answer. Have a look and let me know if this can work for your use case.
Wow, df_m is less than 100 MB. Awesome, thanks so much for the update.
