
I am currently studying whether academic freedom (independent variable) has an impact on university rankings (dependent variable). So far, my model is only composed of these two variables, as well as university fixed effects.

In order to get a better estimate, I am currently looking for control variables to include in my model (the problem being that there is almost no literature on the topic, so I have to pick them out myself instead of relying on what previous authors chose).

I was wondering: what do I have to look for in order to determine whether the control variables I picked are well suited for my model? I know that the control variables should be correlated with both my dependent and independent variable, but apart from this, I'm not sure what to look for.

Are there some specific tests I can conduct? Or should I just look at some specific statistics such as the $R^2$ or the $F$ statistic?

  • Why should the control variable be correlated with your primary variable of interest? Commented Apr 19, 2023 at 12:18
  • It's just what I have been told by my teacher. I guess that it helps in getting the "true" coefficient of my variable of interest. For example, if I introduce R&D expenditures into my model as a control and the coefficient on academic freedom suddenly gets smaller, that means that part of the effect on ranking that we attributed to academic freedom was actually due to R&D expenditures. We would not observe this if academic freedom and R&D were not correlated. That's how I see it at least, but I'm far from an expert in the field. Commented Apr 19, 2023 at 18:56
  • That has to do with omitted-variable bias, and that is important. However, correlated regressors inflate the standard errors, so your decrease in estimation bias might coincide with an inflation of the estimation variance. I would typically think of a control variable as a way to account for known determinants of the outcome in order to reduce variance and get small standard errors. The fact that a correlated control can reduce bias yet inflate variance, alongside its usual potential to decrease variance, is part of what makes this a difficult problem (a small simulation illustrating this trade-off follows these comments). Commented Apr 19, 2023 at 19:18
  • Thank you for the explanation. I do want to tackle an issue of omitted-variable bias, hence the need for correlation with both X and Y. How could I check whether this omitted-variable bias is still there after I introduce my "control" variables? Commented Apr 20, 2023 at 18:44
  • Here's a short paper that may help. At least the title of your post and the paper seem to match pretty well. Commented May 24, 2024 at 11:06
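To make the trade-off discussed in the comments concrete, here is a minimal simulation sketch (Python with numpy and statsmodels). The variable names `academic_freedom`, `rd_spending`, and `ranking`, and all coefficients, are invented for illustration and are not taken from the actual study. Omitting a control that is correlated with the predictor and affects the outcome biases the coefficient of interest; including it removes that bias, while the resulting standard error reflects two opposing forces: collinearity inflates it and the smaller residual variance shrinks it.

```python
# Illustrative simulation (not the actual study data): omitted-variable bias
# when a control correlated with the predictor of interest is left out.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

academic_freedom = rng.normal(size=n)
# The control is correlated with academic freedom and also affects the outcome.
rd_spending = 0.7 * academic_freedom + rng.normal(size=n)
ranking = 1.0 * academic_freedom + 0.8 * rd_spending + rng.normal(size=n)

# Model 1: control omitted -> the coefficient on academic_freedom absorbs part
# of the R&D effect (it drifts toward ~1.56 here instead of the true 1.0).
m1 = sm.OLS(ranking, sm.add_constant(academic_freedom)).fit()

# Model 2: control included -> the bias is removed; the standard error now
# reflects both collinearity (inflation) and the smaller residual variance.
X2 = sm.add_constant(np.column_stack([academic_freedom, rd_spending]))
m2 = sm.OLS(ranking, X2).fit()

print("without control: coef =", round(m1.params[1], 2), " se =", round(m1.bse[1], 3))
print("with control:    coef =", round(m2.params[1], 2), " se =", round(m2.bse[1], 3))
```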

1 Answer

Exploring Control Variables

I will first say that I doubt there is truly no literature on this subject; perhaps you are looking in the wrong places. I would look into adjacent fields or run some exploratory analyses to discover which variables might be worth considering as controls in your model.

I will say that normally your best resources for deciding which controls to include in your model are the literature and, if you are a Ph.D. student, your supervisor (as they will already know much of the literature). As for what controls are, the classic definition of a control variable is a variable that influences both the predictor and the outcome. Mathematically, including a control in a model simply removes the variance associated with variables that are not of interest and removes potential confounding bias (Bartram, 2021; Bernerth & Aguinis, 2016).

An example of this relationship in DAG form is shown below (from Statistical Rethinking). Suppose we know that body mass and brain size both influence milk production in mammals, where $M$ is body mass, $N$ is neocortex percent (brain size), and $K$ is kilocalories in milk. In the DAG on the left, body mass predicts brain size, which in turn predicts milk production; yet body mass also predicts milk production directly, so it influences both causal pathways. In this case, $M$ is the confounder because it supposedly causes both $N$ and $K$. However, the temporal ordering matters here: we could instead believe the relationship is flipped (DAG on the right), where $N$ predicts $M$ and $K$.

[Figure: two DAGs. Left: M → N, N → K, and M → K. Right: N → M, N → K, and M → K.]
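For intuition, here is a small simulation sketch of the left-hand DAG (Python; the coefficients are invented for illustration and not estimated from the primate milk data). Regressing $K$ on $N$ alone mixes the direct $N \to K$ effect with the backdoor path $N \leftarrow M \to K$; adding $M$ as a control closes that path and recovers the direct effect.

```python
# Illustrative simulation of the left-hand DAG: M -> N, M -> K, N -> K.
# Coefficients are made up; they are not taken from the milk dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 10_000

M = rng.normal(size=n)                      # body mass
N = 0.8 * M + rng.normal(size=n)            # neocortex percent, caused by M
K = 0.5 * N - 0.6 * M + rng.normal(size=n)  # kilocalories in milk

# K on N alone: the open backdoor path through M biases the coefficient
# (here it shrinks toward ~0.2 instead of the true 0.5).
naive = sm.OLS(K, sm.add_constant(N)).fit()

# K on N and M: conditioning on the confounder recovers the direct effect.
adjusted = sm.OLS(K, sm.add_constant(np.column_stack([N, M]))).fit()

print("N coefficient, M omitted:   ", round(naive.params[1], 2))
print("N coefficient, M controlled:", round(adjusted.params[1], 2))
```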

This is again where the literature will be your best friend, as nobody can know exactly which controls to include without a firm understanding of the research area. But as they say, all models are wrong (Box, 1976), and a model at the end of the day is just a model (Knudsen et al., 2019). We could enter a billion different control variables into a model, but what we are usually after is a parsimonious and useful model that explains the phenomenon. Consider the most important confounding factors in your design and go from there.

There are a lot of excellent resources on this subject, one of which is already cited in the comments (Bartram, 2021; Bernerth & Aguinis, 2016; Cinelli et al., 2020). There is also an excellent discussion of controls and causality in general in Robert Long's post here. Hopefully the resource list below will be helpful.

References

Control Variables

Modeling Decisions
