5
$\begingroup$

I have a dataset of around 100,000 companies. For each company, I have features such as the number of employees, number of customers, number of complaints, and other company attributes, as well as the current number of managers.

My goal is to determine how many managers a company “should” have, even though there is no label indicating whether the current manager count is correct or adequate. Essentially, I want to identify companies that may be under-managed and, eventually, build a calculator that suggests a recommended manager count when a new company provides its data (employees, customers, etc.).

Since there is no ground truth (e.g., “the ideal manager count is X”), I’m exploring unsupervised approaches such as clustering. The idea is to group companies by their operational characteristics (employees, customers, complaints, etc.) and then derive a typical or “normative” manager count (or ratio) within each group.

However, I’m unsure about the best way to include the existing manager count in this analysis. Specifically:

Should I exclude the manager count when clustering, and afterward analyze the distribution of manager counts within each cluster to see which companies deviate? Or should I include the manager count among the clustering features, so that the algorithm can learn how many managers typically go with a given company profile?

I also wonder how to validate or compare different approaches, given the lack of direct labels. Are there standard methods, internal metrics, or domain-driven techniques to confirm that a derived “manager count” is sensible? Are there best practices for choosing thresholds (e.g., for identifying outliers that deviate significantly from the cluster average)?

Any advice or references on how to tackle this problem would be appreciated. Thanks.

$\endgroup$
3
  • $\begingroup$ I worry about the applicability of XKCD1838 to the description of this analysis. There is an unknown number of ways for confounding to silently influence the results of such an analysis in undesirable ways. $\endgroup$ Commented Feb 14 at 14:50
  • $\begingroup$ You moved from 'ideal' to 'normal' in the question. The two are not necessarily overlapping. $\endgroup$ Commented Feb 14 at 16:28
  • $\begingroup$ Unless this is a class project, you have two major assumptions. One is that the number of managers (presumably a way to measure the hierarchical nature of the organization) influences the number of complaints. The second is that the organization strives to minimize complaints. How do you define a manager? At my first job decades ago when matrix management was the in thing my officemate once said to me "It's Tuesday. Does that mean I am your boss or you are mine?" I think you are asking the wrong question. $\endgroup$ Commented Feb 15 at 3:27

2 Answers

10
$\begingroup$

The only variable in your list that sounds like it could be used to determine whether one firm is performing "better" or "worse" than another one (which you would need to determine a "good" manager density) seems to be "Number of complaints".

So I would suggest you build a model, using this as a target variable.

You can then use this model and feed in all the other characteristics of a new firm. Then predict the number of complaints for various numbers of managers, and output the manager density that minimizes (or maximizes, depending on what you prefer) this number.
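A minimal sketch of that scan, assuming a model has already been fitted and is exposed as a callable. The names `recommend_managers` and `toy_predict`, and the U-shaped formula inside `toy_predict`, are hypothetical stand-ins rather than a fitted model; in practice `predict` would wrap whatever regression was trained with complaints (or complaints per customer) as the target.

```python
def recommend_managers(predict, company, candidates):
    """Return the candidate manager count with the lowest predicted target.

    predict    -- callable(company, managers) -> predicted complaints per customer
    company    -- dict holding the company's other features
    candidates -- iterable of manager counts to try
    """
    return min(candidates, key=lambda m: predict(company, m))


def toy_predict(company, managers):
    # Hypothetical stand-in for a fitted model: predicted complaints per
    # customer are lowest at 0.1 managers per employee and rise on either side.
    ratio = managers / company["employees"]
    return 0.02 + (ratio - 0.1) ** 2


new_company = {"employees": 200, "customers": 5000}
best = recommend_managers(toy_predict, new_company, range(1, 61))
print(best)
```

The scan is only as trustworthy as the model behind it, so it seems sensible to restrict `candidates` to manager densities that actually occur in the training data rather than extrapolating.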

If you have other variables that look like they could be targets, e.g., profitability or shareholder returns, you can do the same exercise for those. Or try to build a composite target variable and work with that.

$\endgroup$
3
  • $\begingroup$ My first stab would have been to target the ratio of complaints/customer? $\endgroup$ Commented Feb 13 at 23:56
  • 3
    $\begingroup$ @DanielR.Collins: yes, that makes sense, otherwise we will just think smaller companies are better. Alternatively, use the ratio of complaints to employees. Or something else. It just should make sense in the context. $\endgroup$ Commented Feb 14 at 7:19
  • $\begingroup$ @StephanKolassa can you comment on or answer this question when you have time? $\endgroup$ Commented Jul 16 at 11:02
5
$\begingroup$

Welcome to CV.

I don't have any good references on this, but you clearly will want to look at books on cluster analysis.

My thoughts are that you should exclude manager count from the clustering. If you include it, it's sort of like having the same variable on two sides of an equation. Not exactly the same, of course, because there is no equation.

Then you can look at ranges of managers within clusters. You could also do a regression (or ANOVA, or maybe a nonparametric test) to see if the number of managers varies across clusters. If not, you may not have the right variables.

For detecting outliers, you have all the usual issues. There have been many discussions here about outliers. You could search on the "outliers" tag.
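One possible way to look at manager levels within clusters and flag low outliers is a robust z-score on the managers-per-employee ratio. This is only a sketch: it assumes cluster labels have already been assigned by a clustering fitted without the manager count, and the function name `flag_under_managed`, the dict keys, the threshold of 3, and the data are all made up for illustration.

```python
from collections import defaultdict
from statistics import median


def flag_under_managed(companies, threshold=3.0):
    """Return names of companies far below their cluster's typical manager ratio.

    companies -- list of dicts with keys "name", "cluster", "managers", "employees"
    threshold -- cutoff on the robust z-score (illustrative, not a recommendation)
    """
    ratios = defaultdict(list)
    for c in companies:
        ratios[c["cluster"]].append(c["managers"] / c["employees"])

    # Per-cluster median and median absolute deviation (MAD) of the ratio.
    stats = {}
    for label, values in ratios.items():
        med = median(values)
        stats[label] = (med, median(abs(v - med) for v in values))

    flagged = []
    for c in companies:
        med, mad = stats[c["cluster"]]
        if mad == 0:
            continue  # no spread in this cluster, so no basis for a z-score
        # 1.4826 * MAD estimates the standard deviation for normal data.
        z = (c["managers"] / c["employees"] - med) / (1.4826 * mad)
        if z < -threshold:
            flagged.append(c["name"])
    return flagged


# Made-up data: two clusters of five companies, 100 employees each.
managers = {0: [10, 11, 9, 10, 2], 1: [20, 21, 19, 22, 20]}
companies = [
    {"name": f"c{label}_{i}", "cluster": label, "managers": m, "employees": 100}
    for label, counts in managers.items()
    for i, m in enumerate(counts)
]
flagged = flag_under_managed(companies)
print(flagged)
```

The median and MAD are used instead of the mean and standard deviation so that the outliers being searched for do not distort the baseline they are compared against.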

For validation, one option is to divide the data set into train and test sets and check whether the clusters, and the distribution of manager counts within them, look similar in both. And if any of your variables are purely positive or negative (that is, unambiguously good or bad outcomes), you could do some sort of regression (depending on the nature of that variable) with cluster and number of managers as independent variables.
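A rough sketch of such a similarity check, comparing the per-cluster median managers-per-employee ratio between two random halves of the data. It assumes cluster labels are already attached to each company; a fuller version would fit the clustering on the training half only and then assign the test companies to those clusters. The function names, dict keys, and data are hypothetical.

```python
import random
from collections import defaultdict
from statistics import median


def cluster_medians(companies):
    """Median managers-per-employee ratio for each cluster label."""
    by_cluster = defaultdict(list)
    for c in companies:
        by_cluster[c["cluster"]].append(c["managers"] / c["employees"])
    return {label: median(values) for label, values in by_cluster.items()}


def split_half_gap(companies, seed=0):
    """Absolute gap in per-cluster median ratio between two random halves."""
    shuffled = companies[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    first = cluster_medians(shuffled[:half])
    second = cluster_medians(shuffled[half:])
    # Only clusters that appear in both halves can be compared.
    return {k: abs(first[k] - second[k]) for k in first.keys() & second.keys()}


# Made-up data: cluster 0 has 10 to 12 managers, cluster 1 has 20 to 22,
# and every company has 100 employees.
companies = [
    {"cluster": i % 2, "employees": 100, "managers": 10 * (1 + i % 2) + i % 3}
    for i in range(40)
]
gaps = split_half_gap(companies)
print(gaps)
```

Small gaps relative to the differences between clusters would suggest the within-cluster norms are stable; large gaps would suggest the clusters or the chosen variables are not capturing much.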

$\endgroup$
