Newest 'dataset' Questions

3 votes

1 answer

105 views

Advice on regression approach

How should I handle a mass-point in the dependent variable when running OLS regression in R? I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...

Jim

31

asked 17 hours ago

5 votes

3 answers

533 views

How to handle outliers when some predictors perform better with them and others without

I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score. I’ve ...

QualityX

51

asked Oct 8 at 19:23

4 votes

5 answers

711 views

How Do Quartiles Help Us Understand a Dataset?

It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...

Buchi

41

asked Oct 2 at 17:14

1 vote

0 answers

47 views

Potential CNN Overfitting Due to Limited Training Data

Neural network beginner here. I am currently implementing a CNN in PyTorch for recognizing Japanese handwritten letters, with 46 output classes. I found a dataset on Kaggle https://www.kaggle....

Krish Thyagarajan

11

asked Sep 7 at 16:33

1 vote

0 answers

99 views

Looking for an authentic example with extremely small coefficient of variation [closed]

Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation: $\text{CV} = \frac{s}{\bar{x}}$. To ...

Gregg H

7,077

asked Sep 1 at 14:44

1 vote

1 answer

136 views

What is the current consensus on "using test set as training set, post testing"? [duplicate]

This question is inspired by a blog post at https://www.argmin.net/p/in-defense-of-typing-monkeys and several rumors I've heard from other people who work in machine learning. The gist of it is that ...

Your neighbor Todorovich

707

asked Aug 22 at 4:12

1 vote

2 answers

123 views

How can the standard error measure how accurately a sample represents the population, when we don’t have access to the population’s data?

If I understand correctly, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter. Please suggest any ...

okman

315

asked Aug 3 at 10:23

0 votes

1 answer

50 views

Theoretical question around Implicit Attitude Test data between timepoints: single vs. multiple datapoints per person?

I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...

Jonathan Kim

11

asked Jul 28 at 21:03

2 votes

1 answer

102 views

Quantitatively determining unexplored parameter spaces [closed]

If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature, etc. recorded from experiments (not performed by us), are there established methods to quantitatively ...

Sunera Wijeratne

31

asked Jul 23 at 17:14

1 vote

1 answer

94 views

A correct approach to validate/correct readings from similar sensors?

I am looking to apply a calibration/correction approach on a set of sensors and I just wanted to know whether the approach I am going to use is statistically correct and acceptable. I am using a set of ...

Milad

157

asked Jul 14 at 11:41

2 votes

1 answer

83 views

Customer propensity: time based split or random split

I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...

remon

21

asked Jul 9 at 5:00

0 votes

0 answers

39 views

Is there any standard or common notation for censored values, in data files?

Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one. Are there any standard ...

pglpm

1,356

asked Jun 21 at 16:33

0 votes

0 answers

49 views

Question in longitudinal survey is no longer asked. MNAR?

In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...

Kevin

353

asked May 9 at 2:13

1 vote

1 answer

95 views

Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?

The MNIST dataset can be obtained directly using Keras by running the following lines of Python code. ...

user3728501

353

asked May 5 at 12:57

0 votes

0 answers

49 views

Handling Missing Values in the dataset

I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...

Anirudh

1

asked Apr 2 at 7:34

1 vote

1 answer

292 views

SHAP values across different groups

I developed and compared four ML models using the Random Forest, Support Vector Machine, Logistic Regression, and XGBoost algorithms (tidymodels R package) on data without stratification by age groups. ...

Data and data

33

asked Mar 29 at 1:30

0 votes

0 answers

42 views

Regression with unbalanced frequency of independent variable

I am investigating the relationship between the number of branches on the upper part of a plant (independent) and the number of branches on the lower part (dependent). Thus, the data is numerical ...

Scott

1

asked Mar 26 at 17:37

0 votes

0 answers

129 views

Variable selection in highly multivariate dataset

I have a metagenomics dataset with more than 2 million features (each one being the relative abundance of a gene family, that is, a cluster of gene sequences) in 30 samples. First, I CLR-transformed ...

AdrianLG

1

asked Mar 7 at 16:30

2 votes

0 answers

56 views

What discussion exists regarding statisticians' relationship to statistical methodological assumptions in applications? [closed]

This is more of a philosophical question, and also a question asking for references. To those who follow statistical academic literature, there are papers that discuss philosophical issues in ...

cgmil

1,633

asked Mar 6 at 21:34

6 votes

0 answers

320 views

Reconstructing count table when only pairwise features are visible

Assume we are only able to observe a two-way table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...

Three Diag

517

asked Feb 7 at 16:29

1 vote

0 answers

83 views

Aggregate Ordinal Data?

In my research, I am examining the impact of AI labels (with vs. without) on various brand perceptions and behavioral intentions. Specifically, I analyze how the stimulus (IV, 4 stimuli in 2 subgroups)...

Marcel El Joundi

11

asked Jan 31 at 12:52

1 vote

1 answer

87 views

How to extract and compare the distribution of predicted values of two mixed effect models?

I have 100 samples for 6 days (100 observations every day, 600 observations in total). I have tried to fit a mixed model to the data. Another time, I tried to fit the mixed model just for ...

Leila

11

asked Jan 24 at 14:45

0 votes

0 answers

43 views

Why does GEV fit sometimes not fit the tails well?

I am performing a generalized extreme value analysis using about 20 years of data sampled every 1 minute. I am doing this in order to predict return levels at e.g. 1-in-50 and 1-in-100 intervals. The ...

Darcy

947

asked Jan 21 at 19:14

0 votes

0 answers

53 views

Partially deleting rows or columns of data with too many missings: Best order?

There is a dataset of roughly 4,000 people interviewed and examined, containing a dependent variable I am interested in and roughly 50 variables that I would like to investigate further in their ...

Bernhard

8,645

asked Jan 9 at 12:42

0 votes

0 answers

57 views

Random Forest on small sample

I am learning about Random Forests and want to test them on a dataset that I have. There are 500 samples equally distributed among 50 classes, and the samples have ~500 values. Is this suitable for random ...

user438409385

1

asked Jan 5 at 1:33

0 votes

0 answers

68 views

Fixed Effects OLS Estimation in R and what various specifications mean

I have a question about panel data related to feols in R. Suppose that I have the linear regression model $y_{it}=a+x_{1it}+x_{2it}+\text{error}_{it}$ where $i=1,\dots,T$ is ...

user454850

1

asked Dec 26, 2024 at 16:31

0 votes

0 answers

92 views

Employment status categories that include pensioners, learners, students and non schooling

I have collected a dataset on employment status. I created the following categories: Pensioners, Formally employed, Informally Employed, Self-employed, and Unemployed. I also have Learners or Students ...

Amelia Nicodemus

371

asked Dec 9, 2024 at 15:03

7 votes

4 answers

880 views

Can you remove outliers if they are less than 10% of the datapoints? [duplicate]

I am currently attending my first data analysis class and we do some simple hypothesis tests like the t-test etc. Our teacher told us that we can remove outliers, as long as they are not more than 10% ...

Maria

71

asked Dec 7, 2024 at 10:22

0 votes

0 answers

25 views

Can a panel dataset consist of units sampled at random points in time?

I have a data frame with the variables $Judge\ ID$ (uniquely identifies judges), $Case\ ID$ (uniquely identifies court cases), $Decision$ (records case outcome), and $Comp\ Date$ (datetime variable ...

mrhumanzee

1

asked Dec 3, 2024 at 19:33

2 votes

1 answer

161 views

Best Practices for Imputing Missing Data in Trade Data (Linear Interpolation and Random Volume)

I am working on a dataset containing trade data, and my goal is to impute the missing data for a period of around 24 hours. Here's a sample of the trade data I'm working with: timestamp symbol price ...

Mocak

21

asked Oct 14, 2024 at 10:28

1 vote

0 answers

43 views

What’s the best training set modality for a model that takes two random configurations as input and predicts which one is better?

I want to create a model that predicts which Spark configuration is better. What’s the best dataset for this? Is it better to compare different configurations of various Spark jobs one-on-one using ...

Hijaw

175

asked Oct 1, 2024 at 17:00

1 vote

1 answer

79 views

How can I treat my pilot data that has 1 repeated indicator/survey item for all 3 companies in RStudio?

I'm conducting a study on corporate social responsibility (CSR) and am encountering a challenge with my data cleaning. I've instructed respondents to answer the Likert scale questions only for ...

user432017

11

asked Sep 19, 2024 at 1:28

5 votes

1 answer

181 views

Bivariate data generation

Consider the distribution with the bivariate cumulative distribution function $$F(t_1,t_2)=t_1^{1+\theta\log(t_2)}t_2,0<t_1,t_2<1; \theta\leq 0$$. I want to generate data from this distribution (...

Unknown

220

asked Sep 15, 2024 at 9:02

0 votes

0 answers

64 views

Probability distribution of numeric input variables for linear machine learning models

So I was reading this book: Data Preparation for Machine Learning by Jason Brownlee. And there was this block of text that was a bit confusing and I couldn't find any explanation. "For example, ...

tatv047

31

asked Sep 12, 2024 at 11:01

11 votes

4 answers

957 views

Are data in the real world "sampled" in the statistical sense?

In machine learning, it is commonly assumed that samples are generated i.i.d. according to some probability distribution. On the importance of the i.i.d. assumption in statistical learning The ...

Your neighbor Todorovich

707

asked Sep 12, 2024 at 9:19

1 vote

0 answers

132 views

Proof of asymmetry of relative entropy (KL-divergence) $D(p∥q) \neq D(q∥p)$ [duplicate]

Unlike a real distance measure, relative entropy is not symmetric in the sense that $D(p(x)∥q(x)) \neq D(q(x)∥p(x))$. It turns out that many information measures can be expressed by relative entropies....

허정윤

11

asked Sep 11, 2024 at 10:40

1 vote

1 answer

63 views

Mean follow up time

I am using the pmsampsize function in R to calculate the sample size for a Cox proportional hazards model (pilot study). I need to input the value of ...

elisa

305

asked Sep 6, 2024 at 6:55

0 votes

0 answers

34 views

Simple but important question: how do you write down the formula for the probability density of data in general? [duplicate]

In machine learning, many data can be thought of as generated from a probability density function (also called probability distribution). But most probability textbooks only discuss probability density ...

Your neighbor Todorovich

707

asked Sep 2, 2024 at 10:37

1 vote

0 answers

55 views

Effect size of categorical variables [closed]

If I have a bio test of two categories with different dimensions and a larger size (190 samples, df = 63), is Cramér's V suitable in this case? And how should the effect size be interpreted? ...

Sara Scofild

11

asked Aug 31, 2024 at 9:30

12 votes

5 answers

2k views

How much missing data is too much? part 2: statistical power to impute?

A question is how many missing values are too many to be handled. It has been asked in the context of applying specific software and method (MICE). I am interested in understanding a bit better what ...

Johan

346

asked Aug 27, 2024 at 11:17

1 vote

0 answers

36 views

Is data set enough for regression analysis? [closed]

I need to conduct regression analysis for my thesis (panel data); however, my data set turned out to be really small. I have only 6 companies to study (annual data for 11 years) and the sector I am ...

Irene K

11

asked Aug 26, 2024 at 8:33

0 votes

1 answer

94 views

Is there any way to deal with missing data without imputation? Model considering NA values [closed]

My question is how to model data with NA values without imputing. Is this possible, and what are the advantages and disadvantages? The problem is classification.

Leila ali

189

asked Aug 5, 2024 at 10:39

0 votes

0 answers

41 views

How to better analyze the correlation between binary and log-scaled data?

Imagine I have two datasets as shown in the figure: data1 is an array of zeroes and ones, while data2 is an array of real ...

sam wolfe

180

asked Jul 31, 2024 at 14:41

1 vote

0 answers

79 views

Analyzing ACE twin models from a long data set rather than wide data set [closed]

I am wondering if we can analyze ACE twin models from a long data set rather than the usual wide data set (R lavaan, Mplus). For example, I have found that you can create ACE twin data in R from here. ...

POC

688

asked Jul 26, 2024 at 17:29

10 votes

1 answer

1k views

Would it be possible to generate data from real data in medical research? [closed]

We are trying to develop some predictive models in medical research. We have a combination of clinical and RNA-seq data for just 40 patients. The problem is classification. After feature selection, we ...

Leila ali

189

asked Jul 24, 2024 at 8:30

0 votes

1 answer

137 views

Would it be possible to use regularization methods as a feature selection method and then use machine learning models to analyze data?

My data is RNA-seq data with more than 14000 features and the problem is binary classification. The total sample size is 50, so p>>n. When I use the elastic net method with train and test data, the ...

Leila ali

189

asked Jul 20, 2024 at 10:51

2 votes

0 answers

51 views

Using whole training set for choosing model

I am working on a classification problem with what I understand to be a big dataset. I first split it into my "train" dataset and the "test" one. (Actually I am convinced ...

Videgain

121

asked Jul 11, 2024 at 18:34

2 votes

1 answer

80 views

Object detection for finite dataset

Consider the following scenario: I want to train a model to detect and count these squares. These squares will never be different. They will always look exactly the same, and be of exactly the ...

Fresh Prince Of Nigeria

23

asked Jul 3, 2024 at 3:04

2 votes

2 answers

111 views

Imputation Missing data

I have a longitudinal data set with 2 dependent variables (couple) - a husband and a wife. There were 2 waves for the husbands and 3 waves for the wives. Since there is a lot of missing data, I ...

eagersquirrel

41

asked Jul 1, 2024 at 10:36

2 votes

1 answer

119 views

How can I augment a 1D tabular dataset using an additional 2D dataset?

I have the following two types of datasets: dataset-1 is tabular data that describes 7000 proteins. dataset-1 is only one file. dataset-2 consists of 7000 individual data files that are 2D ...

user366312

2,077

asked Jul 1, 2024 at 6:08
