Data taken from a survey where survey-takers self-report a continuous variable

Question

I have a problem with some health data that I'm trying to analyze. The main issue is that one census variable is derived from self-reported times. The variable is sleep duration, which is calculated from the hour at which the survey-taker reports going to sleep and the hour at which they report waking up. The documentation says that the value is then rounded to the nearest half hour. Here is a histogram of the data:

Because of the self-reported nature of the data, and the rounding, there seems to be a bias towards whole hour values over half hour values. Intuitively, I'd expect this variable to be distributed normally. I want to somehow correct this bias, or at least artificially modify the data so it is distributed rationally.

I do not mind modifying the data, as it having a sensible distribution is more important than accurately mirroring the survey data for me. I tried adding Gaussian noise with SD=0.5, and I got the following histogram:

This looks more like what I would expect the actual values to look like. However, I don't know if there is a better or standard way to correct/analyze data with this kind of bias. If there is, or if there is some flaw in my reasoning, please let me know.
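For reference, the noise-adding step described above can be sketched in R on simulated data. Everything here is an assumption made for illustration: the sample size, mean, and SD are invented, the names `true_sleep`, `sleep`, and `sleep_noisy` are hypothetical, and the simulation models only the half-hour rounding, not any preference for whole hours.

```r
set.seed(1)

# Hypothetical stand-in for the survey variable (NOT the real data):
# simulated "true" sleep durations in hours, rounded to the nearest half hour.
true_sleep <- rnorm(5000, mean = 7.5, sd = 1.2)
sleep <- round(true_sleep * 2) / 2

# The smoothing step from the question: add Gaussian noise with SD = 0.5.
sleep_noisy <- sleep + rnorm(length(sleep), mean = 0, sd = 0.5)

# Independent noise adds its own variance (0.5^2 = 0.25) to the variable,
# so the smoothed values are more spread out than the recorded ones.
var(sleep)
var(sleep_noisy)
```

Under these assumptions the histogram of `sleep_noisy` looks smoother, but its variance is larger than that of `sleep` by roughly 0.25, so any later estimate that depends on the spread of the variable is affected by the added noise.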

This seems like a VERY bad idea to me. But what are you going to do with the sleep variable? Is it a DV in a regression? An IV? Part of a cluster analysis? Or what?
– Peter Flom, Commented Nov 18, 2023 at 11:35
"having a sensible distribution is more important than accurately mirroring the survey data for me" Why?
– J-J-J, Commented Nov 18, 2023 at 11:45
I don't actually care about the data all that much. It's for a statistics class, and I've spoken with my professor about this issue. She doesn't mind if the data is partially artificial. The assignment is more about applying statistical methods than doing a super rigorous analysis.
– Ender_The_Xenocide, Commented Nov 18, 2023 at 11:48
I will also probably do some kind of two-sample hypothesis test.
– Ender_The_Xenocide, Commented Nov 18, 2023 at 12:02

Robert Long · Accepted Answer · 2023-11-18 12:20:16Z


You say:

I want to somehow correct this bias, or at least artificially modify the data so it is distributed rationally.

and also:

I'm going to do a multiple regression with sleep as the DV

In that case, do NOT "modify" the data. Just use sleep as it exists in the dataset as your outcome, then do the usual regression diagnostics. Modifying the data seems to be a very bad idea, as Peter Flom mentioned in his comment on the question, and adding noise to your outcome variable does not make any sense to me. Also, bear in mind that a histogram can look very strange depending on the binning you use. For example:

set.seed(101)
hist(rnorm(100), breaks = 20)

produces this:
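To see how much the bin count alone changes the picture, here is a minimal sketch along the same lines. It reuses the seed and sample from the example above; the three `breaks` values and the object names `h5`, `h20`, and `h50` are arbitrary choices made for illustration.

```r
set.seed(101)
x <- rnorm(100)  # same seed and sample size as the example above

# Compute the bin counts without plotting, for three bin settings.
# Note: `breaks` is only a suggestion; hist() picks "pretty" cut points.
h5  <- hist(x, breaks = 5,  plot = FALSE)
h20 <- hist(x, breaks = 20, plot = FALSE)
h50 <- hist(x, breaks = 50, plot = FALSE)

# Draw the three histograms side by side for comparison.
op <- par(mfrow = c(1, 3))
plot(h5,  main = "breaks = 5")
plot(h20, main = "breaks = 20")
plot(h50, main = "breaks = 50")
par(op)
```

All three panels show exactly the same 100 values. Only the bin width differs, so any spikes and gaps that appear at finer bin widths come from the binning and sampling variation rather than from a change in the data.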


I also need to fit a distribution. I'm sorry I failed to mention that.
– Ender_The_Xenocide, Commented Nov 18, 2023 at 12:28
That doesn't change anything, does it?
– Robert Long, Commented Nov 18, 2023 at 12:29
I need to do a goodness of fit test, and I suspect that will not go well.
– Ender_The_Xenocide, Commented Nov 18, 2023 at 12:31
You mean a goodness of fit test of your model fit? That's just part of the usual regression diagnostics. It doesn't change anything.
– Robert Long, Commented Nov 18, 2023 at 12:34
So go ahead and fit the distribution to the data. It may or may not be a good fit. Are you marked based on how well it fits? I would doubt that. I would think the teacher just wants to see that you know how to fit a distribution to data and interpret the finding. As for the regression model, none of that changes my advice. But it seems that you already know what you want to do, so it makes me wonder why you are asking a question here. I think you've been told by several people that what you want to do is not a good idea, but you seem intent on doing what you want despite the best advice.
– Robert Long, Commented Nov 18, 2023 at 13:24
