I have a dataset of around 100,000 companies. For each company, I have features such as the number of employees, number of customers, number of complaints, and other company attributes, along with the current number of managers.
My goal is to determine how many managers a company “should” have, even though there is no direct label indicating whether the current manager count is correct or adequate. Essentially, I want to identify companies that may be under-managed, and eventually build a calculator so that if a new company gives me its data (employees, customers, etc.), I can suggest a recommended manager count.
Since there’s no ground truth or existing label (e.g., the ideal manager count is X), I’m exploring unsupervised approaches such as clustering. The idea is to group companies by their operational characteristics (employees, customers, complaints, etc.) and then derive a typical or “normative” manager count (or ratio) within each group.
However, I’m unsure about the best way to include the existing manager count in this analysis. Specifically:
Should I exclude the manager count when clustering, and afterward analyze the distribution of manager counts within each cluster to see which companies deviate? Or should I include the manager count as a clustering feature, so the algorithm can learn how many managers typically go with a given company profile?
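To make the first option concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: the toy rows, the log transform, the choice of k = 2, and the hand-rolled k-means (in practice I would probably use a library implementation). It clusters on the operational features only, then takes the median manager count in each cluster as the candidate "norm".

```python
import math
import statistics

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with a deterministic init (evenly spaced rows)."""
    step = len(points) // k
    centers = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [statistics.fmean(c) for c in zip(*members)]
    return labels

# Hypothetical toy data: (employees, customers, complaints, managers)
companies = [
    (10, 100, 2, 1), (12, 120, 3, 2), (11, 90, 1, 1), (9, 110, 2, 1),
    (500, 9000, 80, 25), (520, 9500, 90, 27), (480, 8700, 70, 8), (510, 9100, 85, 26),
]

# Cluster on operational features only; the manager count is deliberately left out.
features = [[math.log1p(v) for v in row[:3]] for row in companies]
managers = [row[3] for row in companies]

labels = kmeans(standardize(features), k=2)

# Candidate "normative" manager count per cluster: the median of current counts.
norm = {c: statistics.median(m for m, l in zip(managers, labels) if l == c)
        for c in set(labels)}

for row, l in zip(companies, labels):
    print(row, "cluster", l, "cluster median managers", norm[l])
```

In this toy example the company with 480 employees and 8 managers lands in the same cluster as peers with around 25 managers, which is the kind of deviation I would like to surface.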
I also wonder how to validate or compare different approaches, given the lack of direct labels. Are there standard methods, internal metrics, or domain-driven techniques to confirm that a derived “manager count” is sensible? Are there best practices for choosing thresholds (e.g., for flagging companies that deviate significantly from their cluster average)?
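As an example of the kind of threshold I am considering, here is a minimal sketch that scores each company against its cluster using a median/MAD-based modified z-score, which is less sensitive to extreme values than mean and standard deviation. The manager counts are made up, and the cutoff of 3.5 is a commonly quoted rule of thumb that I am assuming here, not something I have validated on my data.

```python
import statistics

def robust_z(values):
    """Modified z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median; nothing can be scored meaningfully.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]

# Hypothetical manager counts for the companies in one cluster.
cluster_managers = [25, 27, 8, 26, 24, 28, 26, 25]

THRESHOLD = -3.5  # assumed cutoff; only the low side matters for "under-managed"

scores = robust_z(cluster_managers)
under_managed = [m for m, z in zip(cluster_managers, scores) if z < THRESHOLD]
print(under_managed)
```

Whether the score should be computed on raw manager counts or on a ratio such as managers per employee is part of what I am unsure about.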
Any advice or references on how to tackle this problem would be appreciated. Thanks!