I'm trying to wrap my head around a seemingly simple weighted average calculation.
I have a table, joined from different sources, with the columns customer, product_id and product_interaction_id:
| customer | product_id | product_interaction_id |
|---|---|---|
| cust1 | 64527 | NULL |
| cust1 | 82582 | a927c943-4061-4187-b1a0-1536ea1cc091 |
| cust1 | 82582 | 7d7ac180-599d-4b8c-aefc-7cb5f9525254 |
| cust1 | 11042 | NULL |
| cust2 | 92753 | 0b43b633-0271-4835-b5c7-6bc320ed1a0a |
| cust2 | 92753 | f5277e30-9d1a-4d86-a8b4-7e48286d3dd0 |
| cust3 | 75161 | 5bbe5d36-fcb3-4c7f-9c40-8933b9f751da |
| cust4 | 38156 | NULL |
| cust4 | 45124 | NULL |
| cust5 | 18980 | e8c4c98e-905c-445d-956d-97680bc910d5 |
| cust5 | 18980 | ad1233b2-19ed-4689-8922-2b2dbff14494 |
| cust5 | 18980 | 60905f1d-8c5a-4307-ac00-549f6274f168 |
| cust5 | 33635 | NULL |
| cust6 | 46350 | NULL |
| cust6 | 65928 | 279803ba-9e39-4c1a-b1f4-2de485c95ecb |
| cust6 | 65928 | 66e7e098-f822-4726-89a9-f5f29419bbcb |
| cust6 | 91255 | NULL |
Each row in this table describes one customer-product pair and, where present, one interaction by the customer concerning that product. Think of an interaction as, for example, a conversation with a support representative about that particular product. If product_interaction_id is NULL for a given customer and product, the customer has had no interaction for that product. Each non-NULL product_interaction_id counts as one interaction, so several product_interaction_id values for the same customer and product mean several interactions.
Now I want to understand how to calculate an overall average number of product interactions. My intuition is that I should somehow adjust for the fact that customers with more products have more opportunities to interact than customers with fewer products.
Does anyone see how I can get a correct overall average number of interactions across all customers?
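To make the question concrete, here is a sketch of the two candidate averages I can think of. The notation is my own assumption: $N$ is the number of customers, $p_c$ the number of distinct products of customer $c$, and $i_c$ the number of interactions of customer $c$.

```latex
\bar{r}_{\text{unweighted}} = \frac{1}{N} \sum_{c=1}^{N} \frac{i_c}{p_c},
\qquad
\bar{r}_{\text{weighted}}
  = \frac{\sum_{c=1}^{N} p_c \cdot \frac{i_c}{p_c}}{\sum_{c=1}^{N} p_c}
  = \frac{\sum_{c=1}^{N} i_c}{\sum_{c=1}^{N} p_c}
```

If I read this correctly, weighting each customer's rate by their number of products collapses to total interactions divided by total products, while the unweighted version treats every customer equally regardless of how many products they have.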
Code:
library(dplyr)

customer_product_interactions <- tibble::tribble(
  ~customer, ~product_id, ~product_interaction_id,
  "cust1", "11042", NA,
  "cust1", "64527", NA,
  "cust1", "82582", "a927c943-4061-4187-b1a0-1536ea1cc091",
  "cust1", "82582", "7d7ac180-599d-4b8c-aefc-7cb5f9525254",
  "cust2", "92753", "0b43b633-0271-4835-b5c7-6bc320ed1a0a",
  "cust2", "92753", "f5277e30-9d1a-4d86-a8b4-7e48286d3dd0",
  "cust3", "75161", "5bbe5d36-fcb3-4c7f-9c40-8933b9f751da",
  "cust4", "38156", NA,
  "cust4", "45124", NA,
  "cust5", "18980", "e8c4c98e-905c-445d-956d-97680bc910d5",
  "cust5", "18980", "ad1233b2-19ed-4689-8922-2b2dbff14494",
  "cust5", "18980", "60905f1d-8c5a-4307-ac00-549f6274f168",
  "cust5", "33635", NA,
  "cust6", "46350", NA,
  "cust6", "65928", "279803ba-9e39-4c1a-b1f4-2de485c95ecb",
  "cust6", "65928", "66e7e098-f822-4726-89a9-f5f29419bbcb",
  "cust6", "91255", NA
)
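
# Per customer: distinct products, distinct interactions, and their ratio
customer_product_interactions |>
  group_by(customer) |>
  summarise(
    n_products = n_distinct(product_id),
    # na.rm = TRUE so that NA (no interaction) is not counted as an interaction
    n_product_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    product_interactions_per_customer_products = n_product_interactions / n_products
  )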
#> # A tibble: 6 × 4
#>   customer n_products n_product_interactions product_interactions_per_customer…¹
#>   <chr>         <int>                  <int>                               <dbl>
#> 1 cust1             3                      2                               0.667
#> 2 cust2             1                      2                               2
#> 3 cust3             1                      1                               1
#> 4 cust4             2                      0                               0
#> 5 cust5             2                      3                               1.5
#> 6 cust6             3                      2                               0.667
#> # ℹ abbreviated name: ¹product_interactions_per_customer_products

# Pooled over all customers: total distinct interactions / total distinct products
customer_product_interactions |>
  summarise(
    n_products = n_distinct(product_id),
    n_product_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    product_interactions_per_customer_products = n_product_interactions / n_products
  )
#> # A tibble: 1 × 3
#>   n_products n_product_interactions product_interactions_per_customer_products
#>        <int>                  <int>                                      <dbl>
#> 1         12                     10                                      0.833
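
For comparison, here is a minimal sketch that puts the candidate averages side by side. It is only an illustration under stated assumptions: `per_product` is a hypothetical helper table whose interaction counts I tallied by hand from the table at the top (0 where the only row for a customer-product pair has a NULL product_interaction_id), and the column names in `averages` are my own.

```r
library(dplyr)

# Hypothetical helper table: one row per customer-product pair, with the
# number of interactions tallied by hand from the table at the top.
per_product <- tibble::tribble(
  ~customer, ~product_id, ~n_interactions,
  "cust1", "11042", 0,
  "cust1", "64527", 0,
  "cust1", "82582", 2,
  "cust2", "92753", 2,
  "cust3", "75161", 1,
  "cust4", "38156", 0,
  "cust4", "45124", 0,
  "cust5", "18980", 3,
  "cust5", "33635", 0,
  "cust6", "46350", 0,
  "cust6", "65928", 2,
  "cust6", "91255", 0
)

# Per-customer totals and interaction rate per product
per_customer <- per_product |>
  group_by(customer) |>
  summarise(
    n_products = n(),  # one row per product in per_product
    n_product_interactions = sum(n_interactions),
    rate = n_product_interactions / n_products
  )

averages <- per_customer |>
  summarise(
    # every customer counts equally, regardless of number of products
    mean_of_customer_rates = mean(rate),
    # every customer's rate is weighted by their number of products
    product_weighted_mean = weighted.mean(rate, w = n_products),
    # total interactions divided by total products
    pooled_ratio = sum(n_product_interactions) / sum(n_products)
  )

averages
```

In this sketch `product_weighted_mean` and `pooled_ratio` come out identical, while `mean_of_customer_rates` differs because customers with a single product get the same weight as customers with three. The pooled version also sums per-customer product counts, so it would stay correct if two customers happened to share a product_id, whereas an overall `n_distinct(product_id)` would count that product only once.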