I have a problem with some health data that I'm trying to analyze. The main issue originates from a census variable that is derived from self-reported times. The variable is sleep duration, computed from the hour at which the survey-taker reports going to sleep and the hour at which they report waking up. The documentation says that the value is then rounded to the nearest half hour. Here is a histogram of the data:
Because of the self-reported nature of the data, and the rounding, there seems to be a bias towards whole-hour values over half-hour values. Intuitively, I'd expect this variable to be normally distributed. I want to somehow correct this bias, or at least artificially modify the data so that its distribution looks reasonable.
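One way to put a number on this preference for whole hours is to compare the share of values that land exactly on a whole hour with the roughly 50% that half-hour rounding alone would give. The following is only a minimal sketch: the data are synthetic, and the heaping rate of 0.6, the distribution parameters, and the name `whole_hour_share` are assumptions for illustration, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the survey variable (hypothetical, not the real data).
true_hours = rng.normal(loc=7.5, scale=1.2, size=10_000)

# Rounding to the nearest half hour, as the documentation describes.
half_hour = np.round(true_hours * 2) / 2

# Assumed heaping: a fraction of respondents effectively round to the whole hour.
prefers_whole = rng.random(true_hours.size) < 0.6
sleep = np.where(prefers_whole, np.round(true_hours), half_hour)

def whole_hour_share(values):
    """Fraction of values that fall exactly on a whole hour."""
    values = np.asarray(values, dtype=float)
    return float(np.mean(values % 1 == 0))

print("half-hour rounding only:", whole_hour_share(half_hour))
print("with assumed heaping:   ", whole_hour_share(sleep))
```

A share well above one half on the real variable would be consistent with the whole-hour bias visible in the histogram.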
I do not mind modifying the data: for my purposes, having a sensible distribution is more important than accurately mirroring the survey data. I tried adding Gaussian noise with SD=0.5, and I got the following histogram:
This looks more like what I would expect the actual values to look like. However, I don't know if there is a better or standard way to correct/analyze data with this kind of bias. If there is, or if there is some flaw in my reasoning, please let me know.
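For reference, the noise step I tried looks roughly like the sketch below. The array `sleep` is a small made-up example and the helper name `jitter` is hypothetical; only the Gaussian noise with SD=0.5 corresponds to what I actually did.

```python
import numpy as np

def jitter(values, sd=0.5, seed=None):
    """Return a copy of `values` with independent Gaussian noise added."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Zero-mean noise; independent noise adds sd**2 to the variance.
    return values + rng.normal(loc=0.0, scale=sd, size=values.shape)

# Made-up durations in hours, heaped on whole and half hours.
sleep = np.array([6.0, 6.5, 7.0, 7.0, 7.5, 8.0, 8.0, 8.0, 9.0])
smoothed = jitter(sleep, sd=0.5, seed=0)
print(smoothed)
```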


