Trending 'dataset' questions

3 votes

1 answer

119 views

Advice on regression approach

How should I handle a mass-point in the dependent variable when running OLS regression in R? I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...

Jim

31

asked 20 hours ago

4 votes

5 answers

711 views

How Do Quartiles Help Us Understand a Dataset?

It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...

Buchi

41

asked Oct 2 at 17:14

5 votes

3 answers

533 views

How to handle outliers when some predictors perform better with them and others without

I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score. I’ve ...

QualityX

51

asked Oct 8 at 19:23

1 vote

1 answer

136 views

What is the current consensus on "using test set as training set, post testing"? [duplicate]

This question is inspired by a blog post at https://www.argmin.net/p/in-defense-of-typing-monkeys and several rumors I've heard from other people who work in machine learning. The gist of it is that ...

Your neighbor Todorovich

707

asked Aug 22 at 4:12

1 vote

0 answers

99 views

Looking for an authentic example with extremely small coefficient of variation [closed]

Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation: $\text{CV} = \frac{s}{\bar{x}}$. To ...

Gregg H

7,077

asked Sep 1 at 14:44

1 vote

2 answers

123 views

How can the standard error measure how accurately a sample represents the population, when we don’t have access to the population’s data?

If I understand correctly, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter. Please suggest any ...

okman

315

asked Aug 3 at 10:23

1 vote

0 answers

47 views

Potential CNN Overfitting Due to Limited Training Data

Neural Network Beginner here. I am currently implementing a CNN on PyTorch for recognizing Japanese handwritten letters, which has 46 classes of outputs. I found a dataset on Kaggle https://www.kaggle....

Krish Thyagarajan

11

asked Sep 7 at 16:33

2 votes

1 answer

102 views

Quantitatively determining unexplored parameter spaces [closed]

If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature etc. recorded from experiments (not performed by us) are there established methods to quantitatively ...

Sunera Wijeratne

31

asked Jul 23 at 17:14

1 vote

1 answer

94 views

A correct approach to validate/correct readings from similar sensors?

I am looking to apply a calibration/correction approach on a set of sensors and I just wanted to know that the approach I am going to use is statistically correct and acceptable. I am using a set of ...

Milad

157

asked Jul 14 at 11:41

12 votes

5 answers

2k views

How much missing data is too much? part 2: statistical power to impute?

A question is how many missing values are too many to be handled. It has been asked in the context of applying specific software and method (MICE). I am interested in understanding a bit better what ...

Johan

346

asked Aug 27, 2024 at 11:17

2 votes

1 answer

83 views

Customer propensity: time based split or random split

I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...

remon

21

asked Jul 9 at 5:00

7 votes

4 answers

880 views

Can you remove outliers if they are less than 10% of the datapoints? [duplicate]

I am currently attending my first data analysis class and we do some simple hypothesis tests like the t-test etc. Our teacher told us that we can remove outliers, as long as they are not more than 10% ...

Maria

71

asked Dec 7, 2024 at 10:22

0 votes

1 answer

50 views

Theoretical question around Implicit Attitude Test data between timepoints: single vs. multiple datapoints per person?

I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...

Jonathan Kim

11

asked Jul 28 at 21:03

187 votes

15 answers

57k views

Are large data sets inappropriate for hypothesis testing?

In a recent article of Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of ...

Carlos Accioly

5,095

asked Sep 9, 2010 at 18:21

10 votes

1 answer

1k views

Would it be possible to generate data from real data in medical research? [closed]

We are trying to develop some predictive models in medical research. We have a combination of clinical and RNA-seq data for just 40 patients. The problem is classification. After feature selection, we ...

Leila ali

189

asked Jul 24, 2024 at 8:30

11 votes

4 answers

957 views

Are data in the real world "sampled" in the statistical sense?

In machine learning, it is commonly assumed that samples are generated i.i.d. according to some probability distribution. On the importance of the i.i.d. assumption in statistical learning The ...

Your neighbor Todorovich

707

asked Sep 12, 2024 at 9:19

0 votes

0 answers

39 views

Is there any standard or common notation for censored values, in data files?

Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one. Are there any standard ...

pglpm

1,356

asked Jun 21 at 16:33

6 votes

0 answers

320 views

Reconstructing count table when only pairwise features are visible

Assume we are only able to observe two-way entry table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...

Three Diag

517

asked Feb 7 at 16:29

103 votes

25 answers

43k views

Locating freely available data samples

I've been working on a new method for analyzing and parsing datasets to identify and isolate subgroups of a population without foreknowledge of any subgroup's characteristics. While the method works ...

Community wiki

3 revs, 2 users 100%
EAMann

1 vote

1 answer

292 views

SHAP values across different groups

I developed and compared four ML models via Random Forest, Support Vector Machine, Logistic Regression, and XGBoost (tidymodels R package) algorithms using data without stratification by age groups. ...

Data and data

33

asked Mar 29 at 1:30

0 votes

0 answers

49 views

Question in longitudinal survey is no longer asked. MNAR?

In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...

Kevin

353

asked May 9 at 2:13

0 votes

0 answers

129 views

Variable selection in highly multivariate dataset

I have a metagenomics dataset with more than 2 million features (each one being the relative abundance of a gene family, that is, a cluster of gene sequences) in 30 samples. First, I CLR-transformed ...

AdrianLG

1

asked Mar 7 at 16:30

1 vote

1 answer

95 views

Why is the Keras MNIST dataset split into training and test samples of lengths 60k and 10k respectively?

The MNIST dataset can be obtained directly using Keras by running the following lines of Python code. ...

user3728501

353

asked May 5 at 12:57

5 votes

1 answer

181 views

Bivariate data generation

Consider the distribution with the bivariate cumulative distribution function $$F(t_1,t_2)=t_1^{1+\theta\log(t_2)}t_2,0<t_1,t_2<1; \theta\leq 0$$. I want to generate data from this distribution (...

Unknown

220

asked Sep 15, 2024 at 9:02

1 vote

1 answer

87 views

How to extract and compare the distribution of predicted values of two mixed effect models?

I have 100 samples for 6 days (100 observations every day, 600 observations in total). I have tried to fit the mixed model to the data. Another time, I tried to fit the mixed model just for ...

Leila

11

asked Jan 24 at 14:45

2 votes

0 answers

56 views

What discussion exists regarding statisticians' relationship to statistical methodological assumptions in applications? [closed]

This is more of a philosophical question, and also a question asking for references. To those who follow statistical academic literature, there are papers that discuss philosophical issues in ...

cgmil

1,633

asked Mar 6 at 21:34

95 votes

2 answers

193k views

How to normalize data between -1 and 1?

I have seen the min-max normalization formula but that normalizes values between 0 and 1. How would I normalize my data between -1 and 1? I have both negative and positive values in my data matrix.

covfefe

1,299

asked Oct 26, 2015 at 1:02

1 vote

0 answers

83 views

Aggregate Ordinal Data?

In my research, I am examining the impact of AI labels (with vs. without) on various brand perceptions and behavioral intentions. Specifically, I analyze how the stimulus (IV, 4 stimuli in 2 subgroups)...

Marcel El Joundi

11

asked Jan 31 at 12:52

0 votes

0 answers

49 views

Handling Missing Values in the dataset

I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...

Anirudh

1

asked Apr 2 at 7:34

0 votes

0 answers

68 views

Fixed Effects OLS Estimation in R and what various specifications mean

I have a question about panel data related to feols in R. Suppose that I have the linear regression model y_{it}=a+x_{1it}+x_{2it}+error_{it} where i=1,...,T is ...

user454850

1

asked Dec 26, 2024 at 16:31

1 vote

0 answers

79 views

Analyzing ACE twin models from a long data set rather than wide data set [closed]

I am wondering if we can analyze ACE twin models from a long data set rather than the usual wide data set (R lavaan, mplus). For example, I have found that you can create ACE twin data in R from here. ...

POC

688

asked Jul 26, 2024 at 17:29

0 votes

0 answers

92 views

Employment status categories that include pensioners, learners, students and non schooling

I have collected a dataset on Employment status. I created the following categories; Pensioners, Formally employed, Informally Employed, Self-employed, and Unemployed. I also have Learners or Students ...

Amelia Nicodemus

371

asked Dec 9, 2024 at 15:03

0 votes

0 answers

53 views

Partially deleting rows or columns of data with too many missings: Best order?

There is a dataset of roughly 4,000 people interviewed and examined, containing a dependent variable I am interested in and roughly 50 variables that I would like to investigate further in their ...

Bernhard

8,645

asked Jan 9 at 12:42

0 votes

0 answers

42 views

Regression with unbalanced frequency of independent variable

I am investigating the relationship between the number of branches on the upper part of a plant (independent) and the number of branches on the lower part (dependent). Thus, the data is numerical ...

Scott

1

asked Mar 26 at 17:37

0 votes

0 answers

57 views

Random Forest on small sample

I am learning about Random Forests and want to test it on a dataset that I have. There are 500 samples equally distributed among 50 classes, the samples have ~500 values. Is this suitable for random ...

user438409385

1

asked Jan 5 at 1:33

0 votes

0 answers

43 views

Why does GEV fit sometimes not fit the tails well?

I am performing a generalized extreme value analysis using about 20 years of data sampled every 1 minute. I am doing this in order to predict return levels at e.g. 1-in-50 and 1-in-100 intervals. The ...

Darcy

947

asked Jan 21 at 19:14

2 votes

1 answer

161 views

Best Practices for Imputing Missing Data in Trade Data (Linear Interpolation and Random Volume)

I am working on a dataset containing trade data, and my goal is to impute the missing data for a period of around 24 hours. Here's a sample of the trade data I'm working with: timestamp symbol price ...

Mocak

21

asked Oct 14, 2024 at 10:28

21 votes

4 answers

4k views

Realistically, does the i.i.d. assumption hold for the vast majority of supervised learning tasks?

The i.i.d. assumption states: We are given a data set, $\{(x_i,y_i)\}_{i = 1, \ldots, n}$, each data $(x_i,y_i)$ is generated in an independent and identically distributed fashion. To me, ...

Olórin

744

asked Jan 19, 2020 at 4:13

0 votes

1 answer

137 views

Would it be possible to use regularization methods as a feature selection method and then use machine learning models to analyze data?

My data is RNA-seq data with more than 14000 features and the problem is binary classification. The total sample size is 50, so p>>n. When I use the Elastic Net method with train and test data, the ...

Leila ali

189

asked Jul 20, 2024 at 10:51

96 votes

6 answers

9k views

Essential data checking tests

In my job role I often work with other people's datasets; non-experts bring me clinical data and I help them summarise it and perform statistical tests. The problem I am having is that the datasets I ...

Chris Beeley

5,921

asked Jun 7, 2011 at 8:19

1 vote

0 answers

132 views

Proof of asymmetry of relative entropy (KL-divergence) $D(p∥q) \neq D(q∥p)$ [duplicate]

Unlike a real distance measure, relative entropy is not symmetric in the sense that $D(p(x)∥q(x)) \neq D(q(x)∥p(x))$. It turns out that many information measures can be expressed by relative entropies....

허정윤

11

asked Sep 11, 2024 at 10:40

6 votes

1 answer

429 views

Name of academic field studying geometric structure of data sets [closed]

I have questions about the geometric structure of data sets, esp. as it relates to the relationships between predictors. Is there a name for this field?

Chris Science

403

asked Dec 14, 2023 at 17:44

2 votes

1 answer

119 views

How can I augment a 1D tabular dataset using an additional 2D dataset?

I have the following two types of datasets: dataset-1 is tabular data that describes 7000 proteins. dataset-1 is only one file. dataset-2 consists of 7000 individual data files that are 2D ...

user366312

2,077

asked Jul 1, 2024 at 6:08

3 votes

1 answer

495 views

Can I change values in data from yes and no to binary?

I have a dataset that I want to perform a regression on. However, some of the columns are not in numerical form. For example, the extra classes column. What I ...

Charlotte

31

asked Feb 21, 2024 at 20:58

0 votes

1 answer

506 views

Conditions to Select Pairwise Deletion

When should I select pairwise deletion? So I grasp the idea of pairwise deletion, but what conditions are actually needed to select this? Is it when data is MCAR? Why would researchers select this ...

Fats

21

asked Oct 21, 2022 at 21:56

1 vote

1 answer

79 views

How can I treat my pilot data that has 1 repeated indicator/survey item for all 3 companies in R Studio?

I'm conducting a study on corporate social responsibility (CSR) and am encountering a challenge with my data cleaning. I've instructed respondents to answer the Likert scale questions only for ...

user432017

11

asked Sep 19, 2024 at 1:28

2 votes

2 answers

111 views

Imputation Missing data

I have a longitudinal data set with 2 dependent variables (couple) - a husband and a wife. There were 2 waves for the husbands and 3 waves for the wives. Since there is a lot of missing data, I ...

eagersquirrel

41

asked Jul 1, 2024 at 10:36

2 votes

2 answers

378 views

R/Econometrics: transform yearly dataset into monthly dataset?

As explained in the title, I would like to transform a yearly dataset into a monthly one, but including a constraint. My current dataset gives the yearly production of a commodity, and from year to ...

EstebanVer

21

asked Jul 29, 2023 at 10:35

4 votes

3 answers

2k views

Which are outliers?

I am in the process of solving a Machine Learning challenge, and I want to do it the right way. I did some exploratory data analysis and I wanted to check the distribution of the data. As displayed in ...

Spicy strike

51

asked Jan 26, 2022 at 14:38

0 votes

1 answer

129 views

How to calculate reliability of difference scores?

I am trying to calculate the reliability of a difference score. Specifically, the data have, for each participant, scores for 10 items in Condition X (1s and 0s), as well as 10 different items in ...

Altair555

61

asked Mar 13, 2024 at 17:03

Questions tagged [dataset]