$\begingroup$

For example:

Let's say I have dataset A:
Measured body temperature of a person during the day.
I have measurements from 3 people in the span of a year.
If I cluster it, I expect the clusters to tell me something that helps me give each person advice such as "Drink more water at this and this hour during the day".

So I use k-means to cluster the data from person #1 in dataset A (100 000 points, no ground truth).
It gives me 5 clusters.

A new dataset (B) becomes available from a newer data-collection system, and this time I get measurements from 9 people.
I cluster the data from person #1 in dataset B in the same way (200 000 points, no ground truth).
It gives me 3 clusters.

I want to see if the performance suffered, improved or stayed the same.

Question(s):
How would I go about:

  1. Comparing their performance
    (considering the difference in the number of clusters, the difference in the amount of data, and possibly differences in data quality that are not visible when skimming through the data; can it be a 1-to-1 comparison at all?)
  2. Validating it (because both could be bad/wrong), or at least choosing a sensible yardstick?

EDIT: Actually, the yardstick could be high temperature and low temperature, or anything in between, to give some kind of "better/worse" direction. But still, how do I compare the two when the number of clusters differs and the amount of data differs?
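
To make the comparison problem concrete, here is a minimal sketch of one possible yardstick: the mean silhouette coefficient, an internal index that lies in [-1, 1] regardless of the number of clusters or the number of points. It is only an illustration, not a claim that this is the right measure for this problem. The data and labels below are synthetic stand-ins (the helper `synthetic_days` and its cluster centers are hypothetical); in practice the labels would come from the k-means runs on datasets A and B.

```python
import math
import random


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points, in [-1, 1].

    points: list of equal-length numeric sequences (e.g. one vector per day).
    labels: list of cluster ids, one per point (e.g. from a k-means run).
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def synthetic_days(centers, per_cluster, spread, seed):
    """Hypothetical stand-in data: noisy copies of a few daily profiles."""
    rng = random.Random(seed)
    points, labels = [], []
    for lab, center in enumerate(centers):
        for _ in range(per_cluster):
            points.append([c + rng.gauss(0, spread) for c in center])
            labels.append(lab)
    return points, labels


# Dataset A: 5 clusters, fewer points. Dataset B: 3 clusters, more points.
points_a, labels_a = synthetic_days(
    [[10, 12], [15, 18], [20, 24], [25, 30], [30, 36]], 40, 1.0, seed=0
)
points_b, labels_b = synthetic_days(
    [[10, 12], [20, 24], [30, 36]], 100, 1.0, seed=1
)

score_a = mean_silhouette(points_a, labels_a)
score_b = mean_silhouette(points_b, labels_b)
print(f"A: k=5, n={len(points_a)}, silhouette={score_a:.3f}")
print(f"B: k=3, n={len(points_b)}, silhouette={score_b:.3f}")
```

Because the score is an average per point, the difference in data amount does not inflate it by itself, and it is defined for any number of clusters of at least two. This pairwise version is quadratic in the number of points, so for datasets of the size described here it would have to be computed on a random subsample. It also only measures how compact and well separated the clusters are; it does not say whether either clustering is useful for the advice I want to give.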

$\endgroup$
  • $\begingroup$ Are you clustering days, where each day is represented by a vector of temperature measurements? Why would such clusters be informative and lead to advice such as "Drink more water at this and this hour during the day"? $\endgroup$ Commented Aug 12, 2022 at 9:08
  • $\begingroup$ @micans Yes, I'm clustering days the way you described. It would be useful because I can look at the centroid of each cluster and decide which cluster represents "days with high/medium/low temperature". The days assigned to the "high" cluster are then labeled as "high" on the calendar, and the same goes for medium and low. The calendar is sent to the user as a suggestion: "you will probably need to bring more water to work during this and this day, but not this and this day". It's just an example of an application; the main point is how do you compare asymmetric models? $\endgroup$ Commented Aug 15, 2022 at 4:04
  • $\begingroup$ I'm still trying to understand the setting. Body temperature should be fairly constant. Is it really body temperature, or external temperature? How do you apply k-means? Try different k and use some criterion to pick a k? Is there any significance to the fact that people are grouped together in a dataset? (So far I'm not convinced this is a suitable problem for clustering.) $\endgroup$ Commented Aug 15, 2022 at 21:53
  • $\begingroup$ I'm not at liberty to change the approach. This is a hypothetical equivalent to something real I had (but that I can't divulge in detail because of NDAs). It's not something that needs elucidation down to the last detail to imagine the problem, though. So assume that I use the elbow method to decide K. Assume it's external temperature. Assume that the reason is severe hardware constraints and no ground truth, so you're basically muscled into unsupervised learning. (And assume that there's a way to make k-means work on the hardware.) $\endgroup$ Commented Aug 16, 2022 at 11:27
  • $\begingroup$ That's a lot of assumptions, and it asks contributors here to contort their brains. The example does not come alive; I suggest thinking of a better example with clearer motivation. $\endgroup$ Commented Aug 17, 2022 at 12:39
