1

So I have two columns

| col_arr | col_ind | 
|[1, 2, 3]| [0, 2]  |
|[5, 1]   | [1]     |

and I'd like my result to be an extraction of the values in col_arr by col_ind resulting in col_val below:

| col_arr | col_ind | col_val |
|[1, 2, 3]| [0, 2]  | [1, 3]  |
|[5, 1]   | [1]     | [1]     |

What would be the most elegant and efficient way to do this?


My first idea is to use a UDF, but it feels like an overkill

@udf
def sub_select(arr, inds):
    if (arr is not None) and (inds is not None):
        return [arr[ind] for ind in inds]

I was also thinking about the array_position function dynamically with expr. It is not clear to me how to do so in a way that is flexible to the length of col_ind.

F.expr("array_position(col_arr, array_position(col_ind, 0))")

Simplifications and complications:

  1. Assume the length of indices to query is bounded by a small number (ex. the max len of a col_ind list is 5).
  2. What if I have several col_arr columns (col_arr1, col_arr2, col_arr3), and only one col_ind?

1 Answer 1

2

Apply a transformation function on col_ind and for each index return the corresponding value from col_arr

df = df.withColumn('col_val', F.expr("transform(col_ind, i -> col_arr[i])"))

# df.show()
# +---------+-------+-------+
# |  col_arr|col_ind|col_val|
# +---------+-------+-------+
# |[1, 2, 3]| [0, 2]| [1, 3]|
# |   [5, 1]|    [1]|    [1]|
# +---------+-------+-------+

Bonus: What if I have several col_arr columns (col_arr1, col_arr2, col_arr3), and only one col_ind? Same approach can be used for each column seperately

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.