Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
3
votes
1
answer
119
views
Advice on regression approach
How should I handle a mass-point in the dependent variable when running OLS regression in R?
I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...
4
votes
5
answers
711
views
How Do Quartiles Help Us Understand a Dataset?
It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...
5
votes
3
answers
533
views
How to handle outliers when some predictors perform better with them and others without
I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score.
I’ve ...
1
vote
1
answer
136
views
What is the current consensus on "using test set as training set, post testing"? [duplicate]
This question is inspired by a blog post, https://www.argmin.net/p/in-defense-of-typing-monkeys, and several rumors I've heard from other people who work in machine learning.
The gist of it is that ...
1
vote
0
answers
99
views
Looking for an authentic example with extremely small coefficient of variation [closed]
Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation: $\text{CV} = \frac{s}{\bar{x}}$. To ...
1
vote
2
answers
123
views
How can the standard error measure how accurately a sample represents the population, when we don’t have access to the population’s data?
If I got it correct, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter.
Please suggest any ...
1
vote
0
answers
47
views
Potential CNN Overfitting Due to Limited Training Data
Neural network beginner here. I am currently implementing a CNN in PyTorch for recognizing Japanese handwritten letters, which has 46 output classes.
I found a dataset on Kaggle https://www.kaggle....
2
votes
1
answer
102
views
Quantitatively determining unexplored parameter spaces [closed]
If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature etc. recorded from experiments (not performed by us) are there established methods to quantitatively ...
1
vote
1
answer
94
views
A correct approach to validate/correct readings from similar sensors?
I am looking to apply a calibration/correction approach on a set of sensors and I just wanted to know that the approach I am going to use is statistically correct and acceptable.
I am using a set of ...
12
votes
5
answers
2k
views
How much missing data is too much? part 2: statistical power to impute?
A question is how many missing values are too many to be handled. It has been asked in the context of applying specific software and method (MICE).
I am interested in understanding a bit better what ...
2
votes
1
answer
83
views
Customer propensity: time based split or random split
I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...
7
votes
4
answers
880
views
Can you remove outliers if they are less than 10% of the datapoints? [duplicate]
I am currently attending my first data analysis class and we do some simple hypothesis tests like the t-test etc. Our teacher told us that we can remove outliers, as long as they are not more than 10% ...
0
votes
1
answer
50
views
Theoretical question around Implicit Attitude Test data between timepoints: single vs. multiple datapoints per person?
I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...
187
votes
15
answers
57k
views
Are large data sets inappropriate for hypothesis testing?
In a recent article of Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of ...
10
votes
1
answer
1k
views
Would it be possible to generate data from real data in medical research? [closed]
We are trying to develop some predictive models in medical research. We have combination of clinical and RNA-seq data just for 40 patients. The problem is classification. After feature selection, we ...
11
votes
4
answers
957
views
Are data in the real world "sampled" in the statistical sense?
In machine learning, it is commonly assumed that samples are generated i.i.d. according to some probability distribution. On the importance of the i.i.d. assumption in statistical learning
The ...
0
votes
0
answers
39
views
Is there any standard or common notation for censored values, in data files?
Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one.
Are there any standard ...
6
votes
0
answers
320
views
Reconstructing count table when only pairwise features are visible
Assume we are only able to observe two-way entry table counting the number of observations of a pair of categorical features $x_i,x_j$.
$$
\begin{array}{c|ccc}
& & x_j & \\
\hline
...
103
votes
25
answers
43k
views
Locating freely available data samples
I've been working on a new method for analyzing and parsing datasets to identify and isolate subgroups of a population without foreknowledge of any subgroup's characteristics. While the method works ...
1
vote
1
answer
292
views
SHAP values across different groups
I developed and compared four ML models via Random Forest, Support Vector Machine, Logistic Regression, and Xgboost (tidymodels R package) algorithms using data without stratification by age groups. ...
0
votes
0
answers
49
views
Question in longitudinal survey is no longer asked. MNAR?
In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...
0
votes
0
answers
129
views
Variable selection in highly multivariate dataset
I have a metagenomics dataset with more than 2 million features (each one being the relative abundance of a gene family, this is, a cluster of gene sequences) in 30 samples. First, I CLR-transformed ...
1
vote
1
answer
95
views
Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?
The MNIST dataset can be obtained directly using Keras by running the following lines of Python code.
...
5
votes
1
answer
181
views
Bivariate data generation
Consider the distribution with the bivariate cumulative distribution function $$F(t_1,t_2)=t_1^{1+\theta\log(t_2)}t_2,0<t_1,t_2<1; \theta\leq 0$$.
I want to generate data from this distribution (...
1
vote
1
answer
87
views
How to extract and compare the distribution of predicted values of two mixed effect models?
I have 100 samples for 6 days (100 observations every day, 600 observations in total). I have tried to fit the mixed model to the data. Another time, I tried to fit the mixed model just for ...
2
votes
0
answers
56
views
What discussion exists regarding statisticians' relationship to statistical methodological assumptions in applications? [closed]
This is more of a philosophical question, and also a question asking for references. To those who follow statistical academic literature, there are papers that discuss philosophical issues in ...
95
votes
2
answers
193k
views
How to normalize data between -1 and 1?
I have seen the min-max normalization formula but that normalizes values between 0 and 1. How would I normalize my data between -1 and 1? I have both negative and positive values in my data matrix.
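For illustration, the usual min-max formula can be stretched to any target interval $[a, b]$ via $x' = a + (b - a)\,\frac{x - \min(x)}{\max(x) - \min(x)}$, which gives $[-1, 1]$ for $a = -1$, $b = 1$. The sketch below is one possible way to do this, not a quoted answer; the function name `rescale` and the sample values are assumptions made for the example.

```python
def rescale(values, lo=-1.0, hi=1.0):
    """Linearly map values so the minimum lands on lo and the maximum on hi.

    `rescale` is a hypothetical helper name used only for this sketch.
    """
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column has no spread, so the mapping is undefined.
        raise ValueError("cannot rescale constant data")
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


# Made-up sample with both negative and positive values.
scaled = rescale([-5.0, 0.0, 5.0, 15.0])
print(scaled)
```

Note that this maps the observed minimum to -1 and the observed maximum to 1, so an original value of 0 generally does not stay at 0. If zero must be preserved, dividing by the largest absolute value is a different rescaling with a different result.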
1
vote
0
answers
83
views
Aggregate Ordinal Data?
In my research, I am examining the impact of AI labels (with vs. without) on various brand perceptions and behavioral intentions. Specifically, I analyze how the stimulus (IV, 4 stimuli in 2 subgroups)...
0
votes
0
answers
49
views
Handling Missing Values in the dataset
I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score(Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...
0
votes
0
answers
68
views
Fixed Effects OLS Estimation in R and what various specifications mean
I have a question about panel data related to feols in R.
Suppose that I have the linear regression model
$y_{it} = a + x_{1it} + x_{2it} + \text{error}_{it}$
where i=1,...,T is ...
1
vote
0
answers
79
views
Analyzing ACE twin models from a long data set rather than wide data set [closed]
I am wondering if we can analyze ACE twin models from a long data set rather than the usual wide data set (R lavaan, mplus).
For example, I have found that you can create ACE twin data in R from here.
...
0
votes
0
answers
92
views
Employment status categories that include pensioners, learners, students and non schooling
I have collected a dataset on Employment status. I created the following categories; Pensioners, Formally employed, Informally Employed, Self-employed, and Unemployed. I also have Learners or Students ...
0
votes
0
answers
53
views
Partially deleting rows or columns of data with too many missings: Best order?
There is a dataset of roughly 4,000 people who were interviewed and examined. It includes a dependent variable I am interested in and roughly 50 variables that I would like to investigate further in their ...
0
votes
0
answers
42
views
Regression with unbalanced frequency of independent variable
I am investigating the relationship between the number of branches on the upper part of a plant (independent) and the number of branches on the lower part (dependent). Thus, the data is numerical ...
0
votes
0
answers
57
views
Random Forest on small sample
I am learning about Random Forests and want to test it on a dataset that I have. There are 500 samples equally distributed among 50 classes, the samples have ~500 values. Is this suitable for random ...
0
votes
0
answers
43
views
Why does GEV fit sometimes not fit the tails well?
I am performing a generalized extreme value analysis using about 20 years of data sampled every 1 minute. I am doing this in order to predict return levels at e.g. 1-in-50 and 1-in-100 intervals. The ...
2
votes
1
answer
161
views
Best Practices for Imputing Missing Data in Trade Data (Linear Interpolation and Random Volume)
I am working on a dataset containing trade data, and my goal is to impute the missing data for a period of around 24 hours. Here's a sample of the trade data I'm working with:
timestamp
symbol
price
...
21
votes
4
answers
4k
views
Realistically, does the i.i.d. assumption hold for the vast majority of supervised learning tasks?
The i.i.d. assumption states:
We are given a data set, $\{(x_i,y_i)\}_{i = 1, \ldots, n}$, each data $(x_i,y_i)$ is generated in an independent and identically distributed fashion.
To me, ...
0
votes
1
answer
137
views
Would it be possible to use regularization methods as a feature selection method and then use machine learning models to analyze data?
My data is RNA-seq data with more than 14000 features and the problem is binary classification. Then the total sample is 50 and p>>n. When I use Elasticnet method with train and test data, the ...
96
votes
6
answers
9k
views
Essential data checking tests
In my job role I often work with other people's datasets; non-experts bring me clinical data and I help them summarise it and perform statistical tests.
The problem I am having is that the datasets I ...
1
vote
0
answers
132
views
Proof of asymmetry of relative entropy (KL-divergence) $D(p∥q) \neq D(q∥p)$ [duplicate]
Unlike a real distance measure, relative entropy is not symmetric in the
sense that $D(p(x)∥q(x)) \neq D(q(x)∥p(x))$. It turns out that many information measures can be expressed by relative entropies....
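As a small numerical illustration of the asymmetry (a single counterexample is enough to show that symmetry does not hold in general), the sketch below computes both directions for two discrete distributions. The distributions `p` and `q` and the helper name `kl_divergence` are assumptions chosen for the example; natural logarithms are used, so the values are in nats.

```python
import math


def kl_divergence(p, q):
    """Discrete relative entropy D(p || q) in nats.

    Assumes p and q are probability vectors over the same support and that
    q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Two made-up distributions on a two-point space.
p = [0.5, 0.5]
q = [0.9, 0.1]

d_pq = kl_divergence(p, q)  # D(p || q)
d_qp = kl_divergence(q, p)  # D(q || p)
print(d_pq, d_qp)
```

The two values differ (roughly 0.51 versus 0.37 nats), because each direction weights the log-ratio by a different distribution.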
6
votes
1
answer
429
views
Name of academic field studying geometric structure of data sets [closed]
I have questions about the geometric structure of data sets, esp. as it relates to the relationships between predictors. Is there a name for this field?
2
votes
1
answer
119
views
How can I augment a 1D tabular dataset using an additional 2D dataset?
I have the following two types of datasets:
dataset-1 is tabular data that describes 7000 proteins. dataset-1 is only one file.
dataset-2 consists of 7000 individual data files that are 2D ...
3
votes
1
answer
495
views
Can I change values in data from yes and no to binary
I have a dataset that I want to perform a regression on. However, some of the columns are not in numerical form. For example, the extra classes column. What I ...
0
votes
1
answer
506
views
Conditions to Select Pairwise Deletion
When should I select pairwise deletion?
So I grasp the idea of pairwise deletion, but what conditions are actually needed to select this? Is it when data is MCAR? Why would researchers select this ...
1
vote
1
answer
79
views
How can I treat my pilot data that has 1 repeated indicator/survey item for all 3 companies in R Studio?
I'm conducting a study on corporate social responsibility (CSR) and am encountering a challenge with my data cleaning. I've instructed respondents to answer the Likert scale questions only for ...
2
votes
2
answers
111
views
Imputation Missing data
I have a longitudinal data set with 2 dependent variables (couple) - a husband and a wife. There were 2 waves for the husbands and 3 waves for the wives. Since there is a lot of missing data, I ...
2
votes
2
answers
378
views
R/Econometrics: transform yearly dataset into monthly dataset?
As explained in the title, I would like to transform a yearly dataset into a monthly one, but including a constraint.
My current dataset gives the yearly production of a commodity, and from year to ...
4
votes
3
answers
2k
views
Which are outliers?
I am in the process of solving a Machine Learning challenge, and I want to do it the right way.
I did some exploratory data analysis and I wanted to check the distribution of the data.
As displayed in ...
0
votes
1
answer
129
views
How to calculate reliability of difference scores?
I am trying to calculate the reliability of a difference score. Specifically, the data have, for each participant, scores for 10 items in Condition X (1s and 0s), as well as 10 different items in ...