$\begingroup$

I'm trying to wrap my head around a seemingly simple weighted average calculation.

I have a table, joined from different sources, with the columns customer, product_id and product_interaction_id:

customer product_id product_interaction_id
cust1 64527 NULL
cust1 82582 a927c943-4061-4187-b1a0-1536ea1cc091
cust1 82582 7d7ac180-599d-4b8c-aefc-7cb5f9525254
cust1 11042 NULL
cust2 92753 0b43b633-0271-4835-b5c7-6bc320ed1a0a
cust2 92753 f5277e30-9d1a-4d86-a8b4-7e48286d3dd0
cust3 75161 5bbe5d36-fcb3-4c7f-9c40-8933b9f751da
cust4 38156 NULL
cust4 45124 NULL
cust5 18980 e8c4c98e-905c-445d-956d-97680bc910d5
cust5 18980 ad1233b2-19ed-4689-8922-2b2dbff14494
cust5 18980 60905f1d-8c5a-4307-ac00-549f6274f168
cust5 33635 NULL
cust6 46350 NULL
cust6 65928 279803ba-9e39-4c1a-b1f4-2de485c95ecb
cust6 65928 66e7e098-f822-4726-89a9-f5f29419bbcb
cust6 91255 NULL

Each row in this table pairs a customer with one of their specific products and, where there is one, an interaction concerning that product. Think of an interaction as, e.g., the customer contacting a support representative about that particular product. If the product_interaction_id for a given product and customer is NULL, the customer has had no interaction for that product. If the product_interaction_id is not NULL, that row records one interaction. When there are multiple product_interaction_id values for a given product and customer, there have been multiple customer-product interactions.

Now I want to understand how one should calculate an overall average number of product interactions. My brain tells me that I should somehow adjust for the fact that some customers have multiple products and therefore a greater chance of having many product interactions, compared to customers with fewer products.

Does anyone see how I can get to a correct overall average number of interactions across all customers?

Code:

library(dplyr)
customer_product_interactions <- tibble::tribble(
  ~customer, ~product_id, ~product_interaction_id,
  "cust1", "11042", NA,
  "cust1", "64527", NA,
  "cust1", "82582", "a927c943-4061-4187-b1a0-1536ea1cc091",
  "cust1", "82582", "7d7ac180-599d-4b8c-aefc-7cb5f9525254",
  "cust2", "92753", "0b43b633-0271-4835-b5c7-6bc320ed1a0a",
  "cust2", "92753", "f5277e30-9d1a-4d86-a8b4-7e48286d3dd0",
  "cust3", "75161", "5bbe5d36-fcb3-4c7f-9c40-8933b9f751da",
  "cust4", "38156", NA,
  "cust4", "45124", NA,
  "cust5", "18980", "e8c4c98e-905c-445d-956d-97680bc910d5",
  "cust5", "18980", "ad1233b2-19ed-4689-8922-2b2dbff14494",
  "cust5", "18980", "60905f1d-8c5a-4307-ac00-549f6274f168",
  "cust5", "33635", NA,
  "cust6", "46350", NA,
  "cust6", "65928", "279803ba-9e39-4c1a-b1f4-2de485c95ecb",
  "cust6", "65928", "66e7e098-f822-4726-89a9-f5f29419bbcb",
  "cust6", "91255", NA
)

customer_product_interactions |>
  group_by(customer) |>
  summarise(
    n_products = n_distinct(product_id),
    n_product_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    product_interactions_per_customer_products = n_product_interactions / n_products
  )
#> # A tibble: 6 × 4
#>   customer n_products n_product_interactions product_interactions_per_customer…¹
#>   <chr>         <int>                  <int>                               <dbl>
#> 1 cust1             3                      2                               0.667
#> 2 cust2             1                      2                               2
#> 3 cust3             1                      1                               1
#> 4 cust4             2                      0                               0
#> 5 cust5             2                      3                               1.5
#> 6 cust6             3                      2                               0.667
#> # ℹ abbreviated name: ¹​product_interactions_per_customer_products

customer_product_interactions |>
  summarise(
    n_products = n_distinct(product_id),
    n_product_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    product_interactions_per_customer_products = n_product_interactions / n_products
  )
#> # A tibble: 1 × 3
#>   n_products n_product_interactions product_interactions_per_customer_products
#>        <int>                  <int>                                      <dbl>
#> 1         12                     10                                      0.833
$\endgroup$

1 Answer

$\begingroup$

In fact, you can compute an average in one of four ways. All four make sense, but each tells you something different. Your confusion probably arises because you say you want to compute an average, but you do not say an average per "what".
The first option is to compute the average number of interactions per customer. In your example data, you have 10 interactions over 6 customers, so the "average" is $10/6=1.667$. This tells you that, on average, a customer will interact 1.667 times.
The second option is to average per product (all your product codes seem to be specific to one customer, but I will assume that two customers could have the same product). You have the same 10 interactions over 12 products, so the "average" is $10/12=0.833$. For any product, you can expect, on average, 0.833 interactions.
Next, you can average over customer-product pairs. Here you have 12 customer-product pairs, the same count as above, because in your sample each product belongs to exactly one customer. If one customer shared a product with another customer, your numbers of customers and of distinct products would not change, but your number of customer-product pairs would go up by 1. So here you get the same "average" of $10/12=0.833$.
Last, you can compute a "grand average" over all the records. Your example still has 10 interactions, now over 17 records, so $10/17=0.59$. That tells you that 59% of the entries in the table are there because of an interaction, and 41% because no interaction took place.
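The four options above can be computed in a single `summarise()` call. This is only a sketch using the example data from the question; the output column names (`per_customer`, `per_product`, `per_pair`, `per_record`) are made up for illustration, and it assumes, as the question's own code does, that each non-missing `product_interaction_id` counts as one interaction.

```r
library(dplyr)

# Example data copied from the question (NA = no interaction for that product).
customer_product_interactions <- tibble::tribble(
  ~customer, ~product_id, ~product_interaction_id,
  "cust1", "11042", NA,
  "cust1", "64527", NA,
  "cust1", "82582", "a927c943-4061-4187-b1a0-1536ea1cc091",
  "cust1", "82582", "7d7ac180-599d-4b8c-aefc-7cb5f9525254",
  "cust2", "92753", "0b43b633-0271-4835-b5c7-6bc320ed1a0a",
  "cust2", "92753", "f5277e30-9d1a-4d86-a8b4-7e48286d3dd0",
  "cust3", "75161", "5bbe5d36-fcb3-4c7f-9c40-8933b9f751da",
  "cust4", "38156", NA,
  "cust4", "45124", NA,
  "cust5", "18980", "e8c4c98e-905c-445d-956d-97680bc910d5",
  "cust5", "18980", "ad1233b2-19ed-4689-8922-2b2dbff14494",
  "cust5", "18980", "60905f1d-8c5a-4307-ac00-549f6274f168",
  "cust5", "33635", NA,
  "cust6", "46350", NA,
  "cust6", "65928", "279803ba-9e39-4c1a-b1f4-2de485c95ecb",
  "cust6", "65928", "66e7e098-f822-4726-89a9-f5f29419bbcb",
  "cust6", "91255", NA
)

four_averages <- customer_product_interactions |>
  summarise(
    # Numerator shared by all four averages: distinct non-missing interactions.
    n_interactions = n_distinct(product_interaction_id, na.rm = TRUE),
    # The four possible denominators.
    n_customers = n_distinct(customer),
    n_products  = n_distinct(product_id),
    n_pairs     = n_distinct(customer, product_id),  # distinct customer-product pairs
    n_records   = n(),
    # The four "averages" (illustrative column names).
    per_customer = n_interactions / n_customers,
    per_product  = n_interactions / n_products,
    per_pair     = n_interactions / n_pairs,
    per_record   = n_interactions / n_records
  )

four_averages
```

On this data the four ratios are $10/6$, $10/12$, $10/12$ and $10/17$. `per_product` and `per_pair` only coincide because no product is shared between customers here.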
It is not that there is one "correct" average to compute; there are four, all four are "correct", and each gives you a different insight into your data.
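On the weighting concern raised in the question, a short derivation may help. The symbols are introduced here for illustration and do not appear in the question: $i_c$ is the number of interactions of customer $c$, $p_c$ the number of products of customer $c$, and $r_c = i_c / p_c$ the per-customer rate. Weighting each customer's rate by their number of products gives the pooled ratio over customer-product pairs:

$$
\bar{r}_{\text{weighted}}
= \frac{\sum_c p_c \, r_c}{\sum_c p_c}
= \frac{\sum_c p_c \,(i_c / p_c)}{\sum_c p_c}
= \frac{\sum_c i_c}{\sum_c p_c}
= \frac{10}{12} = 0.833
$$

So the average per customer-product pair already adjusts for customers holding different numbers of products. The unweighted mean of the per-customer rates, $\frac{1}{6}\sum_c r_c$, is in general a different number and answers a different question: the expected rate for a randomly chosen customer rather than for a randomly chosen customer-product pair.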

$\endgroup$
  • $\begingroup$ Thanks for the answer. I have a couple of clarifications: (1) product codes are actually product installation ids (for heat pump installations, for example). (2) Most causes for interaction (such as support/service) are related to the idiosyncrasies of the individual installation and customer, rather than the physical product attributes alone. Therefore, I think calculating the average interaction per customer-product would be the most appropriate choice here. $\endgroup$ Commented Aug 15, 2024 at 7:01
  • $\begingroup$ @QueryingQuail, it is entirely up to you to decide the most appropriate "average" for your case. Note that if the "product code" is really customer specific, then averaging by product code, or by customer-product pair, will give you the same results. But yes, per customer-installation does make sense. $\endgroup$ Commented Aug 15, 2024 at 17:08
