Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
3
votes
1
answer
105
views
Advice on regression approach
How should I handle a mass-point in the dependent variable when running OLS regression in R?
I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...
5
votes
3
answers
533
views
How to handle outliers when some predictors perform better with them and others without
I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score.
I’ve ...
4
votes
5
answers
711
views
How Do Quartiles Help Us Understand a Dataset?
It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...
1
vote
0
answers
47
views
Potential CNN Overfitting Due to Limited Training Data
Neural network beginner here. I am currently implementing a CNN in PyTorch for recognizing Japanese handwritten letters, which has 46 output classes.
I found a dataset on Kaggle https://www.kaggle....
1
vote
0
answers
99
views
Looking for an authentic example with extremely small coefficient of variation [closed]
Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation: $\text{CV} = \frac{s}{\bar{x}}$. To ...
1
vote
1
answer
136
views
What is the current consensus on "using test set as training set, post testing"? [duplicate]
This question is inspired by a blog post (https://www.argmin.net/p/in-defense-of-typing-monkeys) and several rumors I've heard from other people who work in machine learning.
The gist of it is that ...
1
vote
2
answers
123
views
How can the standard error measure how accurately a sample represents the population, when we don’t have access to the population’s data?
If I understand correctly, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter.
Please suggest any ...
0
votes
1
answer
50
views
Theoretical question around Implicit Attitude Test data between timepoints: single vs. multiple datapoints per person?
I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...
2
votes
1
answer
102
views
Quantitatively determining unexplored parameter spaces [closed]
If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature etc. recorded from experiments (not performed by us) are there established methods to quantitatively ...
1
vote
1
answer
94
views
A correct approach to validate/correct readings from similar sensors?
I am looking to apply a calibration/correction approach to a set of sensors, and I want to know whether the approach I am going to use is statistically correct and acceptable.
I am using a set of ...
2
votes
1
answer
83
views
Customer propensity: time based split or random split
I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...
0
votes
0
answers
39
views
Is there any standard or common notation for censored values, in data files?
Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one.
Are there any standard ...
0
votes
0
answers
49
views
Question in longitudinal survey is no longer asked. MNAR?
In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...
1
vote
1
answer
95
views
Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?
The MNIST dataset can be obtained directly using Keras by running the following lines of Python code.
...
0
votes
0
answers
49
views
Handling Missing Values in the dataset
I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...
1
vote
1
answer
292
views
SHAP values across different groups
I developed and compared four ML models via Random Forest, Support Vector Machine, Logistic Regression, and Xgboost (tidymodels R package) algorithms using data without stratification by age groups. ...
0
votes
0
answers
42
views
Regression with unbalanced frequency of independent variable
I am investigating the relationship between the number of branches on the upper part of a plant (independent) and the number of branches on the lower part (dependent). Thus, the data is numerical ...
0
votes
0
answers
129
views
Variable selection in highly multivariate dataset
I have a metagenomics dataset with more than 2 million features (each one being the relative abundance of a gene family, that is, a cluster of gene sequences) in 30 samples. First, I CLR-transformed ...
2
votes
0
answers
56
views
What discussion exists regarding statisticians' relationship to statistical methodological assumptions in applications? [closed]
This is more of a philosophical question, and also a question asking for references. To those who follow statistical academic literature, there are papers that discuss philosophical issues in ...
6
votes
0
answers
320
views
Reconstructing count table when only pairwise features are visible
Assume we are only able to observe a two-way table counting the number of observations of a pair of categorical features $x_i,x_j$.
$$
\begin{array}{c|ccc}
& & x_j & \\
\hline
...
1
vote
0
answers
83
views
Aggregate Ordinal Data?
In my research, I am examining the impact of AI labels (with vs. without) on various brand perceptions and behavioral intentions. Specifically, I analyze how the stimulus (IV, 4 stimuli in 2 subgroups)...
1
vote
1
answer
87
views
How to extract and compare the distribution of predicted values of two mixed effect models?
I have 100 samples for 6 days (100 observations every day, 600 observations in total). I have tried to fit the mixed model to the data. Another time, I tried to fit the mixed model just for ...
0
votes
0
answers
43
views
Why does GEV fit sometimes not fit the tails well?
I am performing a generalized extreme value analysis using about 20 years of data sampled every 1 minute. I am doing this in order to predict return levels at e.g. 1-in-50 and 1-in-100 intervals. The ...
0
votes
0
answers
53
views
Partially deleting rows or columns of data with too many missings: Best order?
There is a dataset of roughly 4,000 people interviewed and examined, including a dependent variable I am interested in and roughly 50 variables that I would like to investigate further in their ...
0
votes
0
answers
57
views
Random Forest on small sample
I am learning about Random Forests and want to test it on a dataset that I have. There are 500 samples equally distributed among 50 classes, the samples have ~500 values. Is this suitable for random ...
0
votes
0
answers
68
views
Fixed Effects OLS Estimation in R and what various specifications mean
I have a question about panel data related to feols in R.
Suppose that I have the linear regression model
$$y_{it}=a+x_{1it}+x_{2it}+\text{error}_{it}$$
where i=1,...,T is ...
0
votes
0
answers
92
views
Employment status categories that include pensioners, learners, students and non schooling
I have collected a dataset on employment status. I created the following categories: Pensioners, Formally employed, Informally Employed, Self-employed, and Unemployed. I also have Learners or Students ...
7
votes
4
answers
880
views
Can you remove outliers if they are less than 10% of the datapoints? [duplicate]
I am currently attending my first data analysis class and we do some simple hypothesis tests, such as the t-test. Our teacher told us that we can remove outliers, as long as they are not more than 10% ...
0
votes
0
answers
25
views
Can a panel dataset consist of units sampled at random points in time?
I have a data frame with the variables $Judge\ ID$ (uniquely identifies judges), $Case\ ID$ (uniquely identifies court cases), $Decision$ (records case outcome), and $Comp\ Date$ (datetime variable ...
2
votes
1
answer
161
views
Best Practices for Imputing Missing Data in Trade Data (Linear Interpolation and Random Volume)
I am working on a dataset containing trade data, and my goal is to impute the missing data for a period of around 24 hours. Here's a sample of the trade data I'm working with:
timestamp
symbol
price
...
1
vote
0
answers
43
views
What’s the best training set modality for a model that takes two random configurations as input and predicts which one is better?
I want to create a model that predicts which Spark configuration is better. What’s the best dataset for this? Is it better to compare different configurations of various Spark jobs one-on-one using ...
1
vote
1
answer
79
views
How can I treat my pilot data that has 1 repeated indicator/survey item for all 3 companies in R Studio?
I'm conducting a study on corporate social responsibility (CSR) and am encountering a challenge with my data cleaning. I've instructed respondents to answer the Likert scale questions only for ...
5
votes
1
answer
181
views
Bivariate data generation
Consider the distribution with the bivariate cumulative distribution function $$F(t_1,t_2)=t_1^{1+\theta\log(t_2)}t_2,0<t_1,t_2<1; \theta\leq 0$$.
I want to generate data from this distribution (...
0
votes
0
answers
64
views
Probability distribution of numeric input variables for linear machine learning models
I was reading this book: Data Preparation for Machine Learning by Jason Brownlee. There was a block of text that was a bit confusing, and I couldn't find any explanation.
"For example, ...
11
votes
4
answers
957
views
Are data in the real world "sampled" in the statistical sense?
In machine learning, it is commonly assumed that samples are generated i.i.d. according to some probability distribution. On the importance of the i.i.d. assumption in statistical learning
The ...
1
vote
0
answers
132
views
Proof of asymmetry of relative entropy (KL-divergence) $D(p∥q) \neq D(q∥p)$ [duplicate]
Unlike a real distance measure, relative entropy is not symmetric in the
sense that $D(p(x)∥q(x)) \neq D(q(x)∥p(x))$. It turns out that many information measures can be expressed by relative entropies....
1
vote
1
answer
63
views
Mean follow up time
I am using the pmsampsize function in R to calculate the sample size for a Cox proportional hazards model (pilot study). I need to input the value of ...
0
votes
0
answers
34
views
Simple but important question: how do you write down the formula for the probability density of data in general? [duplicate]
In machine learning many data can be thought of as generated from a probability density function (also called probability distribution).
But most probability textbooks only discuss probability density ...
1
vote
0
answers
55
views
Effect size of categorical variables [closed]
Suppose I have a bio test of two categories with different dimensions, where the larger size = 190 samples and df = 63. Is Cramér's V suitable in this case? And how should the effect size be interpreted? ...
12
votes
5
answers
2k
views
How much missing data is too much? part 2: statistical power to impute?
A question is how many missing values are too many to be handled. It has been asked in the context of applying specific software and method (MICE).
I am interested in understanding a bit better what ...
1
vote
0
answers
36
views
Is data set enough for regression analysis? [closed]
I need to conduct regression analysis for my thesis (panel data); however, my data set turned out to be really small. I have only 6 companies to study (annual data for 11 years) and the sector I am ...
0
votes
1
answer
94
views
Is there any way to deal with missing data without imputation? Model considering NA values [closed]
My question is how to model data with NA values without imputing. Is this possible, and what are the advantages and disadvantages? The problem is classification.
0
votes
0
answers
41
views
How to better analyze the correlation between binary and log-scaled data?
Imagine I have two datasets as shown in the figure:
data1 is an array of zeroes and ones, while data2 is an array of real ...
1
vote
0
answers
79
views
Analyzing ACE twin models from a long data set rather than wide data set [closed]
I am wondering if we can analyze ACE twin models from a long data set rather than the usual wide data set (R lavaan, mplus).
For example, I have found that you can create ACE twin data in R from here.
...
10
votes
1
answer
1k
views
Would it be possible to generate data from real data in medical research? [closed]
We are trying to develop some predictive models in medical research. We have a combination of clinical and RNA-seq data for just 40 patients. The problem is classification. After feature selection, we ...
0
votes
1
answer
137
views
Would it be possible to use regularization methods as a feature selection method and then use machine learning models to analyze the data?
My data is RNA-seq data with more than 14000 features and the problem is binary classification. The total sample is 50, so p>>n. When I use the elastic net method with train and test data, the ...
2
votes
0
answers
51
views
Using whole training set for choosing model
I am working on a classification problem with what I understand to be a big dataset. I first split it into my "train" dataset and the "test" one. (Actually I am convinced ...
2
votes
1
answer
80
views
Object detection for finite dataset
Consider the following scenario
If I want to train a model to detect and count these squares:
These squares will never be different. They will always look exactly the same, and be of exactly the ...
2
votes
2
answers
111
views
Imputation Missing data
I have a longitudinal data set with 2 dependent variables (couple) - a husband and a wife. There were 2 waves for the husbands and 3 waves for the wives. Since there is a lot of missing data, I ...
2
votes
1
answer
119
views
How can I augment a 1D tabular dataset using an additional 2D dataset?
I have the following two types of datasets:
dataset-1 is tabular data that describes 7000 proteins. dataset-1 is only one file.
dataset-2 consists of 7000 individual data files that are 2D ...