Newest 'dataset' Questions - Page 3

0 votes

1 answer

75 views

Is there a test I could run to test the statistical significance between survey respondents over multiple years?

Assume I have a dataset that ranges from 2014-2022 composed of survey responses from two different groups e.g. store 1 and store 2. I want to test whether there is a statistically significant ...

Daria

63

asked Oct 3, 2023 at 12:31

0 votes

1 answer

99 views

How can I perform an analysis of the NHIS imputed income variables?

I have downloaded five family income variables from https://nhis.ipums.org/nhis-action/variables/group?id=economic_income (INCPPOINT1, INCPPOINT2, INCPPOINT3, INCPPOINT4, INCPPOINT5) for the years ...

tryingtogetsmth

1

asked Oct 1, 2023 at 16:05

2 votes

1 answer

1k views

Is batching needed for the test set?

I'm just starting to learn about CNNs (convolutional neural networks). Does the test data also need to be divided into batches, as is done with the training data?

xioXuei

21

asked Sep 25, 2023 at 5:47
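
As a hedged illustration of the batching question above: a minimal pure-Python sketch, assuming a model whose prediction for one example does not depend on the other examples in the batch (for instance, a network in inference mode). The names `accuracy_in_batches` and `toy_predict` are hypothetical stand-ins, not any library's API. Under that assumption, batching the test set is only a memory convenience: the metric is unchanged as long as correct predictions are summed across batches and divided by the total once.

```python
from typing import Callable, List, Sequence


def accuracy_in_batches(
    predict: Callable[[Sequence[float]], List[int]],
    inputs: Sequence[float],
    labels: Sequence[int],
    batch_size: int,
) -> float:
    """Evaluate accuracy batch by batch, summing correct predictions."""
    correct = 0
    for start in range(0, len(inputs), batch_size):
        batch_x = inputs[start:start + batch_size]
        batch_y = labels[start:start + batch_size]
        preds = predict(batch_x)
        # Count hits instead of averaging per-batch accuracies, so that a
        # smaller final batch does not get too much weight.
        correct += sum(p == y for p, y in zip(preds, batch_y))
    return correct / len(inputs)


def toy_predict(batch: Sequence[float]) -> List[int]:
    """Hypothetical stand-in for a trained model: thresholds at zero."""
    return [1 if x > 0 else 0 for x in batch]


inputs = [-2.0, -1.0, 0.5, 1.5, 3.0, -0.5, 2.0]
labels = [0, 0, 1, 0, 1, 0, 1]

# One batch holding the whole test set versus batches of three examples.
full = accuracy_in_batches(toy_predict, inputs, labels, batch_size=len(inputs))
batched = accuracy_in_batches(toy_predict, inputs, labels, batch_size=3)
print(full == batched)
```

The sketch does not cover layers whose output depends on batch composition (for example, batch normalisation left in training mode); in that case the equality need not hold.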

1 vote

1 answer

266 views

What is an “independent” effect?

According to one study by Wang and colleagues (2013), smoking cigars has “a positive and independent” effect on testosterone. What is an independent effect and what is the significance of one?

TylerDurden

113

asked Sep 24, 2023 at 21:00

1 vote

1 answer

175 views

Generate marginally dependent (with predetermined covariance) but conditionally independent data from a Mixture of Gaussians

Suppose you have three variables $y\in\{0,1\}$ and $x_1\in\mathbb{R}$ and $x_2\in\mathbb{R}$. I want to produce data with the following generative process which corresponds to a Mixture of Gaussians (...

Sergio

336

asked Sep 22, 2023 at 13:44

1 vote

0 answers

105 views

With more labelled samples in a grouped dataset, to check the correlation with another, is it better to duplicate samples or take the means? [closed]

Another issue is that the dataset with more labels per group has less intra-group than inter-group deviation. So I could take the means of them (the 'second' index), or the whole label means (the '...

anon

asked Sep 20, 2023 at 20:52

0 votes

1 answer

64 views

Data normalisation for a newbie

I am new to data analysis and I have been given the task of comparing the safety of trams with that of buses. Since there are far more buses than trams, I was introduced to the concept of normalisation by ...

Revonda Sanchez

1

asked Sep 20, 2023 at 9:45

0 votes

0 answers

46 views

Right Way to Sample a Validation Set

I am working on a project that uses training data selection techniques; it involves sampling the training set in some smart way rather than sampling randomly. The goal is to compare different data ...

Mr.Robot

257

asked Sep 19, 2023 at 23:22

2 votes

1 answer

158 views

Comparison of Data that looks at correlation and absolute values

I am looking for a way to compare two sets of data in order to find out how similar they are to each other. My application: I am trying to compare multiple measurement methods that all measure the sound ...

Anton Wolf

23

asked Aug 23, 2023 at 19:42

1 vote

0 answers

36 views

Any image datasets with Inter-annotator agreement (IAA) values recorded? [closed]

Is anyone familiar with publicly accessible image datasets that include Inter-annotator agreement (IAA) scores? Ideally for object detection or classification tasks. Thank you!

SQL_Noob

11

asked Aug 11, 2023 at 0:48

1 vote

1 answer

123 views

Representing Nested Models as an SKLearn Pipeline [closed]

Goal: Represent nested models with SKLearn's Pipeline / ColumnTransformer / FeatureUnion setup. Specific issue: I cannot figure out how to use the prediction from one model as a factor of a secondary ...

chris

31

asked Aug 10, 2023 at 16:25

1 vote

0 answers

51 views

Dataset for Swimming with Heartbeat and Motion Data [closed]

I am currently working on the final project for my CS degree, which involves analyzing swimming data, specifically focusing on both heartbeat and motion data. I am looking for a dataset that includes these two ...

Malak Qaadan

11

asked Aug 5, 2023 at 21:40

3 votes

1 answer

364 views

Comparing 2 Likert Scales for 1 Population

I have two 5-point Likert scales (Strongly Disagree to Strongly Agree) and I want to compare the results within one population sample. I expect participants to choose 'agree' or 'strongly agree' on ...

YasG

31

asked Aug 1, 2023 at 11:47

2 votes

2 answers

378 views

R/Econometrics: transform yearly dataset into monthly dataset?

As explained in the title, I would like to transform a yearly dataset into a monthly one, but including a constraint. My current dataset gives the yearly production of a commodity, and from year to ...

EstebanVer

21

asked Jul 29, 2023 at 10:35

0 votes

1 answer

86 views

Why are some datasets denoted as D = {X,y} instead of D={x,y}?

In this question: On the importance of the i.i.d. assumption in statistical learning the dataset is denoted as D={X,y}. In statistics, capital letters are usually used to refer to a random variable, ...

StackExchanger

101

asked Jul 28, 2023 at 21:44

1 vote

0 answers

385 views

How to generate random values based on mean, standard deviation, skew and kurtosis in Python?

Given these values, is it possible to generate random values that conform to this distribution (using Python, but preferably without the SciPy package)? Statistics: Mean = 1.518, Std Dev = 24.827, Skew ...

m01010011

111

asked Jul 28, 2023 at 4:03

1 vote

1 answer

261 views

What is the best way to analyze differences in demographics data among groups of different sizes

I have groups of individuals in a population. These groups have different sizes--for example, group 1 has 50 individuals and group 2 has 1000 individuals. I want to show the differences in the groups' ...

anni

31

asked Jul 26, 2023 at 5:16

1 vote

0 answers

34 views

Examples of interventions changing the Population distribution of a system

When learning about the datasets of naturally occurring systems you encounter distributions, for example power law distributions in the domain of population sizes of cities. In my understanding this ...

FreeThought

11

asked Jul 10, 2023 at 18:31

0 votes

0 answers

109 views

Adding pre and post treatment dummies in DiD regression

I want to estimate the impact of economic sanctions on Italy's GDP and other macroeconomic indicators in 1935 with DiD. I have observations of GDP of Italy (treated) and of other countries (control) ...

rickycala

1

asked Jun 21, 2023 at 16:38

4 votes

2 answers

132 views

Odds of 2 people sharing the same exact ticket set of a size up to 38

Say user 1 entered the raffle 18 times (18 tickets) and they had ticket ids [0, 2, 3, 4, 8, 9, 14, 16, 27, 28, 30, 31, 32, 33] (so they acquired the first ticket, missed the second (2nd hour ticket) ...

user390427

asked Jun 15, 2023 at 11:58

1 vote

0 answers

92 views

The covariance of a data matrix

Please let me know whether the statement below is valid. Suppose that $X$ is an $n\times p$ data matrix with $p$ features and $n$ data samples. Suppose further that each feature (column) is zero ...

sj.kim

11

asked Jun 14, 2023 at 10:14

3 votes

1 answer

348 views

Monte Carlo cross-validation on an imbalanced dataset

I have been doing a little research on the various cross-validation methods, but one issue with the Monte Carlo method remains unclear to me. Let's suppose I have 2,000 data points. ...

The Student

89

asked Jun 12, 2023 at 21:10

1 vote

0 answers

99 views

I want to apply Jeffrey divergence / Jensen-Shannon divergence to two histograms whose data are not probability distributions [closed]

I found the Jeffrey divergence formula here, where $h_i$ and $k_i$ are the bin data of two histograms: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/rubner-jcviu-00.pdf I want to find the Jeffrey divergence. Is there any ...

Om Prakash

11

asked Jun 8, 2023 at 8:00

1 vote

0 answers

44 views

Confirm which dataset is closer to the benchmark

I have 3 datasets of different sizes and am trying to select the one that is closest to the benchmark, in the sense of minimized pairwise differences. I checked for normality using QQ plots and Shapiro-...

gregV

111

asked Jun 7, 2023 at 22:48

2 votes

0 answers

60 views

How to learn from a dataset of weighted polynomials

I have a dataset of weighted polynomials, i.e. each data point is a polynomial (of variable size/degree) together with a weight vector (of fixed size). Each data point has an integer label that ranges ...

tomate

21

asked May 22, 2023 at 11:49

0 votes

1 answer

82 views

Two-way ANOVA Hypothesis Testing

I am looking to perform a two-way ANOVA on the following dataset: ...

Sammy

1

asked May 13, 2023 at 10:56

1 vote

0 answers

593 views

Best Datasets and Packages for Comparing LASSO, Elastic Net, and Ridge [closed]

I have recently been working with the MASS, lars, and glmnet packages to study variable ...

YessuhYessuhYessuh

227

asked May 6, 2023 at 3:09

1 vote

0 answers

125 views

How to transform percentage cover data into species abundance data in algae

I have collected percentage data for a rocky shore and summarised it per algae type like the example below: ...

Martina

11

asked May 5, 2023 at 20:51

1 vote

0 answers

38 views

Scaling datasets for multi-dataset time series

Suppose that I have training data with dimension $(N,H,F)$, where $N$ represents the number of different datasets, $H$ is the history size and $F$ is the input size. Normalizing each dataset over the ...

Hadar

125

asked May 5, 2023 at 15:36

2 votes

0 answers

67 views

How to calculate the area (or Time) spent in each "zone" on a stacked area graph?

I have a dataset of blood pressure recordings measured over 36 hours, which I plotted with BP (aka MAP) on the Y axis, and minutes on the X axis. I then plotted the blood pressure percentiles for age (...

user387100

21

asked May 3, 2023 at 3:46

1 vote

0 answers

31 views

Measurements for calculating the difference that outliers make to a data set

Consider a scenario where I compare a single column (independent variable) in a data set to the other columns (dependent variables). If I find the outliers in the independent variable, what could I test ...

edge selcuk

11

asked Apr 25, 2023 at 22:10

2 votes

1 answer

88 views

What's the best way to visualize load testing results

I have a piece of code that I'm adding a feature onto, and I need to test performance before and after the feature addition. So I used a test instance with a large amount of data and triggered the ...

SachiDangalla

121

asked Apr 20, 2023 at 12:54

1 vote

0 answers

83 views

Ranked data and hypothesis testing

I have a few questions regarding ranked data and hypothesis testing. Hypothesis: Car drivers are more likely to consider the car as the most beneficial transportation method for mental health than ...

csira_allapot_5876

11

asked Apr 19, 2023 at 10:21

3 votes

0 answers

70 views

What is a way to estimate job match quality using new job wage, old job wage, and how long the new job was held? [closed]

The data I have seeks to understand whether an individual's new job is of high quality. I have an individual's old job wage: how much they were paid before, ...

user321627

4,372

asked Apr 15, 2023 at 4:39

0 votes

0 answers

135 views

Are these data points decaying exponentially or as a power law?

I have a set of data points. The first coordinate is time and the second coordinate is energy. I am trying to figure out how the energy is decaying over time. Particularly, I have to find if it is ...

HadamardN2

1

asked Apr 12, 2023 at 1:19

1 vote

1 answer

535 views

How to deal with groups when splitting data into train and test?

Say I have a dataset with groups that I want to use for a regression problem that looks like the following, where feature1 is the group column: ...

Hamza Adnan

11

asked Apr 7, 2023 at 14:09
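
One common way to handle the grouping question above is to split on groups rather than on rows, so that no group appears in both train and test. The following is a minimal pure-Python sketch of that idea, not the asker's or any answerer's method. The helper `group_train_test_split`, the toy rows, and the `target` column are hypothetical; only the use of `feature1` as the group column comes from the question.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, object]


def group_train_test_split(
    rows: List[Row],
    group_key: str,
    test_fraction: float = 0.25,
    seed: int = 0,
) -> Tuple[List[Row], List[Row]]:
    """Assign whole groups to either train or test, never to both."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out at least one group so the test set is never empty.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Toy data (assumed): feature1 is the group column, target is made up.
rows = [
    {"feature1": g, "target": float(i)}
    for i, g in enumerate(["A", "A", "B", "B", "C", "C", "D", "D"])
]

train, test = group_train_test_split(rows, group_key="feature1")
train_groups = {row["feature1"] for row in train}
test_groups = {row["feature1"] for row in test}
print(train_groups.isdisjoint(test_groups))
```

Note that `test_fraction` here is a fraction of groups, not of rows, so the realised row proportion depends on group sizes.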

1 vote

0 answers

182 views

Efficient estimation of conditional probability density

The formulation of the conditional density is: $$ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}. $$ I need to estimate this density from data and it's prohibitively time-consuming to calculate the joint ...

smthack

61

asked Apr 7, 2023 at 10:58

1 vote

0 answers

46 views

Transformation to compare many distributions with different means and variances

I have daily sales (total volume in dollars) from 200 stores of the same franchise, over two years. I would like to identify any store with anomalies or special patterns, which could be the sign of a ...

olivaw

111

asked Apr 5, 2023 at 22:16

1 vote

0 answers

29 views

Ascertaining an appropriate test for this scenario

I need to run an analysis in SPSS; could you please advise me? I have given my examples below, so please correct me. Three groups: control, body positivity, body neutrality; measure: self-esteem, ...

Ana

11

asked Apr 2, 2023 at 23:59

1 vote

1 answer

77 views

Right way of comparing my model's results with the state of the art

I know that I have to use the same datasets and experimental settings. In this way, I don't have to run their code. What if I used different datasets? Is it OK to use their code from GitHub to get ...

AMAS AL

33

asked Apr 1, 2023 at 13:29

1 vote

3 answers

1k views

Is there any downside to scaling a dataset?

I've read that some models, such as decision trees, don't require scaling to work effectively. However, the author of the linked article states there's no downside to scaling data for a decision tree ...

Connor

677

asked Mar 31, 2023 at 7:41

1 vote

1 answer

72 views

Smallest set of words contained in a number of documents

I have a collection of text documents. I would like to find the smallest set of words such that searching by those words allows discovering each document. It is quite natural to describe this data as ...

mmh

1,019

asked Mar 28, 2023 at 8:40

1 vote

1 answer

895 views

Can I scale a dataset using different methods on different columns and why?

I'm relatively new to this, and this question has been plaguing me. Say I have a dataset with feature A, feature B, and feature C. I need to scale for my model. Based on their distributions, feature A is ...

Marque

11

asked Mar 25, 2023 at 15:23

2 votes

0 answers

82 views

ANOVA Assumptions on Big data

I wanted to know whether one needs to check for violations of the ANOVA assumptions before running an ANOVA model on a big dataset (the dataset has 57 million rows). Thanks!

Akira Banerjee

21

asked Mar 8, 2023 at 21:18

0 votes

0 answers

65 views

Modeling a time series without trend or seasonality

I am dealing with a dataset that has different input data with different behavior, but I am getting confused as I cannot find any trend or pattern in my data. I checked seasonality and found nothing. ...

john22

157

asked Mar 7, 2023 at 14:42

0 votes

1 answer

490 views

Order of pre-processing the dataset

Suppose I have a categorical dataset and I'm doing data pre-processing. What is the correct order for applying these 3 techniques: train-test split; SMOTEN to oversample the minority class; categorical ...

Mohamed Ahmed

3

asked Mar 7, 2023 at 10:09

0 votes

0 answers

42 views

How do I analyse panel data with HDI as the independent variable and Case Fatality Rate and Number of Cases as the dependent variables?

I have Covid-19 Case fatality rate and Number of cases for 37 states as dependent variables and HD Index of the 37 States as the independent variable. That is a single predictor and multiple responses....

HENRY

1

asked Feb 22, 2023 at 12:33

1 vote

1 answer

95 views

Does a newly constructed ML dataset need to have an official train-dev-test split? Should the test set be intentionally balanced?

I have constructed a novel ML (NLP) dataset for classification and labeled it with three classes. The dataset is rather small with about 700 examples, out of which the classes have about 400, 200, and ...

Arno

11

asked Feb 16, 2023 at 10:47

0 votes

0 answers

65 views

My dataset is unbalanced and has missing values, when does this become a problem?

I'm writing a thesis on data and variable specification sensitivity pertaining to economic growth. I have panel data with 39 countries ranging from 1981-2012, however, lots of countries had important ...

Maurits de Vries

1

asked Feb 11, 2023 at 18:03

1 vote

1 answer

44 views

Comparing microRNA expression at different tumour stages and identifying trends in miRNA expression

I'm currently working on a university project using publicly available data. I'm collecting genome data on cancer patients at different pathological stages (Stage I, II, III and IV). I'm aiming to ...

user378714

11

asked Jan 29, 2023 at 23:36
