So I have two columns
| col_arr | col_ind |
|[1, 2, 3]| [0, 2] |
|[5, 1] | [1] |
and I'd like my result to be an extraction of the values in col_arr by col_ind resulting in col_val below:
| col_arr | col_ind | col_val |
|[1, 2, 3]| [0, 2] | [1, 3] |
|[5, 1] | [1] | [1] |
What would be the most elegant and efficient way to do this?
My first idea is to use a UDF, but it feels like an overkill
@udf
def sub_select(arr, inds):
if (arr is not None) and (inds is not None):
return [arr[ind] for ind in inds]
I was also thinking about the array_position function dynamically with expr. It is not clear to me how to do so in a way that is flexible to the length of col_ind.
F.expr("array_position(col_arr, array_position(col_ind, 0))")
Simplifications and complications:
- Assume the length of indices to query is bounded by a small number (ex. the max len of a col_ind list is 5).
- What if I have several col_arr columns (col_arr1, col_arr2, col_arr3), and only one col_ind?