Weighted averages (average customer-product interactions)

Question

I'm trying to wrap my head around a seemingly simple weighted average calculation.

I have a table, joined from different sources, with the columns customer, product_id and product_interaction_id:

customer	product_id	product_interaction_id
cust1	64527	NULL
cust1	82582	a927c943-4061-4187-b1a0-1536ea1cc091
cust1	82582	7d7ac180-599d-4b8c-aefc-7cb5f9525254
cust1	11042	NULL
cust2	92753	0b43b633-0271-4835-b5c7-6bc320ed1a0a
cust2	92753	f5277e30-9d1a-4d86-a8b4-7e48286d3dd0
cust3	75161	5bbe5d36-fcb3-4c7f-9c40-8933b9f751da
cust4	38156	NULL
cust4	45124	NULL
cust5	18980	e8c4c98e-905c-445d-956d-97680bc910d5
cust5	18980	ad1233b2-19ed-4689-8922-2b2dbff14494
cust5	18980	60905f1d-8c5a-4307-ac00-549f6274f168
cust5	33635	NULL
cust6	46350	NULL
cust6	65928	279803ba-9e39-4c1a-b1f4-2de485c95ecb
cust6	65928	66e7e098-f822-4726-89a9-f5f29419bbcb
cust6	91255	NULL

A row in this table describes an interaction by a customer using one of their specific products. Think of an interaction as e.g. the customer has had an interaction with a support representative (concerning that particular product). If the product_interaction_id for a given product, for a given customer, is NULL, then the customer has had no interaction for that product. If the product_interaction_id is not null, then the customer has had 1 interaction. When there are multiple product_interaction_id for a given product, for a given customer, then there are multiple customer-product interactions.

Now I want to understand how to calculate an overall average number of product interactions. My intuition is that I should somehow adjust for the fact that some customers have several products and therefore more opportunities for product interactions than customers with fewer products.

Does anyone see how I can get a correct overall average number of interactions across all customers?

Code:

library(dplyr)
customer_product_interactions <- tibble::tribble(
  ~customer, ~product_id, ~product_interaction_id,
  "cust1", "11042", NA,
  "cust1", "64527", NA,
  "cust1", "82582", "a927c943-4061-4187-b1a0-1536ea1cc091",
  "cust1", "82582", "7d7ac180-599d-4b8c-aefc-7cb5f9525254",
  "cust2", "92753", "0b43b633-0271-4835-b5c7-6bc320ed1a0a",
  "cust2", "92753", "f5277e30-9d1a-4d86-a8b4-7e48286d3dd0",
  "cust3", "75161", "5bbe5d36-fcb3-4c7f-9c40-8933b9f751da",
  "cust4", "38156", NA,
  "cust4", "45124", NA,
  "cust5", "18980", "e8c4c98e-905c-445d-956d-97680bc910d5",
  "cust5", "18980", "ad1233b2-19ed-4689-8922-2b2dbff14494",
  "cust5", "18980", "60905f1d-8c5a-4307-ac00-549f6274f168",
  "cust5", "33635", NA,
  "cust6", "46350", NA,
  "cust6", "65928", "279803ba-9e39-4c1a-b1f4-2de485c95ecb",
  "cust6", "65928", "66e7e098-f822-4726-89a9-f5f29419bbcb",
  "cust6", "91255", NA
)

customer_product_interactions |>
  group_by(customer) |>
  summarise(
    n_products = n_distinct(product_id),
    n_product_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    product_interactions_per_customer_products = n_product_interactions / n_products
  )
#> # A tibble: 6 × 4
#>   customer n_products n_product_interactions product_interactions_per_customer…¹
#>   <chr>         <int>                  <int>                               <dbl>
#> 1 cust1             3                      2                               0.667
#> 2 cust2             1                      2                               2
#> 3 cust3             1                      1                               1
#> 4 cust4             2                      0                               0
#> 5 cust5             2                      3                               1.5
#> 6 cust6             3                      2                               0.667
#> # ℹ abbreviated name: ¹product_interactions_per_customer_products

customer_product_interactions |>
  summarise(
    n_products = n_distinct(product_id),
    n_product_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    product_interactions_per_customer_products = n_product_interactions / n_products
  )
#> # A tibble: 1 × 3
#>   n_products n_product_interactions product_interactions_per_customer_products
#>        <int>                  <int>                                      <dbl>
#> 1         12                     10                                      0.833

jginestet · Accepted Answer · 2024-08-15 01:27:54Z

You can compute an average in four ways. All four make sense, but each tells you something different. The confusion probably arises because you say you want an average without saying an average per "what".
The first option is to compute the average number of interactions per customer. In your example data, you have 10 interactions over 6 customers, so the "average" is $10/6=1.667$. This tells you that, on average, a customer will interact 1.667 times.
The second option is to average per product (all your product codes seem to be specific to one customer, but I will assume that two customers could have the same product). You have the same 10 interactions over 12 products, so the "average" is $10/12=0.833$. For any product, you can expect 0.833 interactions on average.
Next, you can average over customer-product pairs. Here you have 12 customer-product pairs (the same count as above, because in your sample no product is shared between customers. If one customer shared a product with another, the number of distinct products would not change, but the number of customer-product pairs would go up by 1). So here you get the same "average" of $10/12=0.833$.
Last, you can compute a "grand average" over all the records. Your example still has 10 interactions, now over 17 records, so $10/17=0.59$. That tells you that, on average, 59% of the rows in the table record an interaction (and 41% record that no interaction took place).
There is no single "correct" average to compute; there are four, all four are "correct", and each gives you a different insight into your data.
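
The four averages described above can be reproduced from the question's table. The following is a minimal sketch, assuming dplyr and the same example data as in the question; the summary column names (`per_customer`, `per_product`, `per_customer_product`, `per_record`) are illustrative and do not come from the original post.

```r
library(dplyr)

# Same example data as in the question (NA = no interaction for that product)
customer_product_interactions <- tibble::tribble(
  ~customer, ~product_id, ~product_interaction_id,
  "cust1", "11042", NA,
  "cust1", "64527", NA,
  "cust1", "82582", "a927c943-4061-4187-b1a0-1536ea1cc091",
  "cust1", "82582", "7d7ac180-599d-4b8c-aefc-7cb5f9525254",
  "cust2", "92753", "0b43b633-0271-4835-b5c7-6bc320ed1a0a",
  "cust2", "92753", "f5277e30-9d1a-4d86-a8b4-7e48286d3dd0",
  "cust3", "75161", "5bbe5d36-fcb3-4c7f-9c40-8933b9f751da",
  "cust4", "38156", NA,
  "cust4", "45124", NA,
  "cust5", "18980", "e8c4c98e-905c-445d-956d-97680bc910d5",
  "cust5", "18980", "ad1233b2-19ed-4689-8922-2b2dbff14494",
  "cust5", "18980", "60905f1d-8c5a-4307-ac00-549f6274f168",
  "cust5", "33635", NA,
  "cust6", "46350", NA,
  "cust6", "65928", "279803ba-9e39-4c1a-b1f4-2de485c95ecb",
  "cust6", "65928", "66e7e098-f822-4726-89a9-f5f29419bbcb",
  "cust6", "91255", NA
)

averages <- customer_product_interactions |>
  summarise(
    # Numerator shared by all four averages: distinct non-missing interaction ids
    n_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    # The four possible denominators
    n_customers = n_distinct(customer),
    n_products = n_distinct(product_id),
    n_customer_products = n_distinct(paste(customer, product_id)),
    n_records = n(),
    # The four averages
    per_customer = n_interactions / n_customers,
    per_product = n_interactions / n_products,
    per_customer_product = n_interactions / n_customer_products,
    per_record = n_interactions / n_records
  )

averages
```

On the example data this counts 10 interactions, 6 customers, 12 products, 12 customer-product pairs and 17 records, so the four ratios are 10/6, 10/12, 10/12 and 10/17.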

Thanks for the answer. I have a couple of clarifications: (1) product codes are actually product installation ids (for heat pump installations, for example). (2) Most causes for interaction (such as support/service) are related to the idiosyncrasies of the individual installation and customer, rather than the physical product attributes alone. Therefore, I think calculating the average interaction per customer-product would be the most appropriate choice here.
– QueryingQuail, Commented Aug 15, 2024 at 7:01
@QueryingQuail, it is entirely up to you to decide the most appropriate "average" for your case. Note that if the "product code" is really customer specific, then averaging by product code, or by customer-product pair, will give you the same results. But yes, per customer-installation does make sense.
– jginestet, Commented Aug 15, 2024 at 17:08
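
The per customer-product average chosen in the comments is also where the "weighted average" of the question title comes in. Pooling all interactions over all customer-products is the same as averaging the per-customer ratios with each customer weighted by their number of products. A minimal sketch in base R, assuming the per-customer counts obtained by counting rows in the question's example data (the variable names are illustrative):

```r
# Per-customer counts, taken by counting rows in the question's example data
per_customer <- data.frame(
  customer = paste0("cust", 1:6),
  n_products = c(3, 1, 1, 2, 2, 3),
  n_product_interactions = c(2, 2, 1, 0, 3, 2)
)

# Interactions per product, computed separately for each customer
ratio <- per_customer$n_product_interactions / per_customer$n_products

# Unweighted mean of the ratios: every customer counts equally,
# regardless of how many products they have
unweighted <- mean(ratio)

# Mean of the ratios weighted by each customer's number of products
weighted <- weighted.mean(ratio, w = per_customer$n_products)

# Pooled ratio: total interactions over total customer-products
pooled <- sum(per_customer$n_product_interactions) / sum(per_customer$n_products)

c(unweighted = unweighted, weighted = weighted, pooled = pooled)
```

The weighted mean and the pooled ratio are both 10/12, while the unweighted mean of the ratios is 35/36 (about 0.972). The unweighted version gives a customer with one product the same influence as a customer with three, which is the adjustment the question was asking about.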
