Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
0
votes
1
answer
75
views
Is there a test I could run to test the statistical significance between survey respondents over multiple years?
Assume I have a dataset that ranges from 2014-2022 composed of survey responses from two different groups e.g. store 1 and store 2. I want to test whether there is a statistically significant ...
0
votes
1
answer
99
views
How can I perform an analysis of the NHIS imputed income variables?
I have downloaded five family income variables from https://nhis.ipums.org/nhis-action/variables/group?id=economic_income (INCPPOINT1, INCPPOINT2, INCPPOINT3, INCPPOINT4, INCPPOINT5) for the years ...
2
votes
1
answer
1k
views
Is batching needed for the test set?
I'm just starting to learn about CNN (convolutional neural networks). Does the test data also need to be divided into batches, similar to how it's done with the training data?
1
vote
1
answer
266
views
What is an “independent” effect?
According to one study by Wang and colleagues (2013), smoking cigars has “a positive and independent” effect on testosterone. What is an independent effect and what is the significance of one?
1
vote
1
answer
175
views
Generate marginally dependent (with predetermined covariance) but conditionally independent data from a Mixture of Gaussians
Suppose you have three variables $y\in\{0,1\}$ and $x_1\in\mathbb{R}$ and $x_2\in\mathbb{R}$. I want to produce data with the following generative process which corresponds to a Mixture of Gaussians (...
1
vote
0
answers
105
views
With more labelled samples in a grouped dataset, to check the correlation with another, is it better to duplicate samples or take the means? [closed]
Another issue is that datasets with more labels in groups have less deviation within groups than between groups. So I could take the means of them (the 'second' index), or the whole label means (the '...
0
votes
1
answer
64
views
Data normalisation for a newbie
I am new to data analysis and I have been given the task of comparing the safety of trams with that of buses. Since there are far more buses than trams, I was introduced to the concept of normalisation by ...
0
votes
0
answers
46
views
Right Way to Sample a Validation Set
I am working on a project that uses training data selection techniques; it involves sampling the training set in some smart way rather than sampling randomly. The goal is to compare different data ...
2
votes
1
answer
158
views
Comparison of Data that looks at correlation and absolute values
I am looking for a way to compare two sets of data in order to find out how similar they are to each other.
My application: I am trying to compare multiple measurement methods that all measure the sound ...
1
vote
0
answers
36
views
Any image datasets with Inter-annotator agreement (IAA) values recorded? [closed]
Is anyone familiar with publicly accessible image datasets that include Inter-annotator agreement (IAA) scores? Ideally for object detection or classification tasks.
Thank you!
1
vote
1
answer
123
views
Representing Nested Models as an SKLearn Pipeline [closed]
Goal: Represent nested models with SKLearn's Pipeline / ColumnTransformer / FeatureUnion setup.
Specific issue: I cannot figure out how to use the prediction from one model as a factor of a secondary ...
1
vote
0
answers
51
views
Dataset for Swimming with Heartbeat and Motion Data [closed]
I am currently working on the final project for my CS degree, which involves analyzing swimming data, specifically focusing on both heartbeat and motion data. I am looking for a dataset that includes these two ...
3
votes
1
answer
364
views
Comparing 2 Likert Scales for 1 Population
I have two 5-point Likert scales (Strongly Disagree to Strongly Agree) and I want to compare the results within one population sample. I expect participants to choose 'agree' or 'strongly agree' on ...
2
votes
2
answers
378
views
R/Econometrics: transform yearly dataset into monthly dataset?
As explained in the title, I would like to transform a yearly dataset into a monthly one, but including a constraint.
My current dataset gives the yearly production of a commodity, and from year to ...
0
votes
1
answer
86
views
Why are some datasets denoted as D = {X,y} instead of D={x,y}?
In the question "On the importance of the i.i.d. assumption in statistical learning", the dataset is denoted as D={X,y}. In statistics, capital letters are usually used to refer to a random variable, ...
1
vote
0
answers
385
views
How to generate random values based on mean, standard deviation, skew and kurtosis in Python?
Given these values, is it possible to generate random values that conform to this distribution (using Python, but preferably without the SciPy package)?
Statistic | Value
Mean | 1.518
Std Dev | 24.827
Skew | ...
1
vote
1
answer
261
views
What is the best way to analyze differences in demographics data among groups of different sizes
I have groups of individuals in a population. These groups have different sizes--for example, group 1 has 50 individuals and group 2 has 1000 individuals. I want to show the differences in the groups' ...
1
vote
0
answers
34
views
Examples of interventions changing the Population distribution of a system
When learning about datasets of naturally occurring systems, you encounter distributions, for example power law distributions in the domain of population sizes of cities.
In my understanding this ...
0
votes
0
answers
109
views
Adding pre and post treatment dummies in DiD regression
I want to estimate the impact of economic sanctions on the GDP and other macroeconomic indicators of Italy in 1935 with DiD. I have observations of GDP of Italy (treated) and of other countries (control) ...
4
votes
2
answers
132
views
Odds of 2 people sharing the same exact ticket set of a size up to 38
Say user 1 entered the raffle 18 times (18 tickets) and he had ticket ids [0, 2, 3, 4, 8, 9, 14, 16, 27, 28, 30, 31, 32, 33] (so they acquired the first ticket, missed the second (2nd hour ticket) ...
1
vote
0
answers
92
views
The covariance of a data matrix
Please let me know whether the statement below is valid:
Suppose that $X$ is an $n\times p$ data matrix with $p$ features and $n$ data samples.
Suppose further that each feature (column) is zero ...
3
votes
1
answer
348
views
Monte Carlo cross-validation on an imbalanced dataset
I have been doing a little research on the various cross-validation methods, but one point about the Monte Carlo method remains unclear to me. Let's suppose I have 2,000 data points. ...
1
vote
0
answers
99
views
I want to apply Jeffrey divergence / Jensen-Shannon divergence to two histograms whose data are not probability distributions [closed]
I found the Jeffrey formula as
here $h_i$ and $k_i$ are the bin data of the two histograms.
https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/rubner-jcviu-00.pdf
I want to find the Jeffrey divergence.
Is there any ...
1
vote
0
answers
44
views
Confirm which dataset is closest to the benchmark
I have 3 datasets of different sizes and am trying to select the one that is closest to the benchmark, in the sense of minimized pairwise differences. I checked for normality using QQ plots and Shapiro-...
2
votes
0
answers
60
views
How to learn from a dataset of weighted polynomials
I have a dataset of weighted polynomials, i.e. each data point is a polynomial (of variable size/degree) together with a weight vector (of fixed size).
Each data point has an integer label that ranges ...
0
votes
1
answer
82
views
Two-way ANOVA Hypothesis Testing
I am looking to perform a two-way ANOVA on the following dataset:
...
1
vote
0
answers
593
views
Best Datasets and Packages for Comparing LASSO, Elastic Net, and Ridge [closed]
I have recently been working with the MASS, lars, and glmnet packages to study variable ...
1
vote
0
answers
125
views
How to transform percentage cover data into species abundance data in algae
I have collected percentage data for a rocky shore and summarised it per algae type, like the example below:
...
1
vote
0
answers
38
views
Scaling datasets for multi-dataset time series
Suppose that I have training data with dimension $(N,H,F)$, where $N$ represents the number of different datasets, $H$ is the history size and $F$ is the input size. Normalizing each dataset over the ...
2
votes
0
answers
67
views
How to calculate the area (or Time) spent in each "zone" on a stacked area graph?
I have a dataset of blood pressure recordings measured over 36 hours, which I plotted with BP (aka MAP) on the Y axis, and minutes on the X axis. I then plotted the blood pressure percentiles for age (...
1
vote
0
answers
31
views
Measurements for calculating the difference that outliers make to a data set
Suppose I compare a single column (independent variable) in a data set to the other columns (dependent variables). If I find the outliers in the independent variable, what could I test ...
2
votes
1
answer
88
views
What's the best way to visualize load testing results
I have a piece of code that I'm adding a feature onto, and I need to test performance before and after the feature addition. So I used a test instance with a large amount of data and triggered the ...
1
vote
0
answers
83
views
Ranked data and hypothesis testing
I have a few questions regarding ranked data and hypothesis testing.
Hypothesis: Car drivers are more likely to consider the car as the most beneficial transportation method for mental health than ...
3
votes
0
answers
70
views
What is a way to estimate job match quality using new job wage, old job wage, and how long the new job was held? [closed]
The data I have seeks to understand whether an individual's new job is of high quality. I have an individual's
old job wage: how much they were paid before,
...
0
votes
0
answers
135
views
Are these data points decaying exponentially or as a power law?
I have a set of data points. The first coordinate is time and the second coordinate is energy. I am trying to figure out how the energy decays over time. In particular, I have to find if it is ...
1
vote
1
answer
535
views
How to deal with groups when splitting data into train and test?
Say I have a dataset with groups that I want to use for a regression problem. It looks like the following, where feature1 is the group column:
...
1
vote
0
answers
182
views
Efficient estimation of conditional probability density
The formulation of the conditional density is:
$$ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}. $$
I need to estimate this density from data and it's prohibitively time-consuming to calculate the joint ...
1
vote
0
answers
46
views
Transformation to compare many distributions with different means and variances
I have daily sales (total volume in dollars) from 200 stores of the same franchise, over two years. I would like to identify any store with anomalies or special patterns, which could be the sign of a ...
1
vote
0
answers
29
views
Ascertaining an appropriate test for this scenario
I need to run an analysis in SPSS. Could you please advise me on this?
I have given my examples below; please correct me.
Three groups: control, body positivity, body neutrality,
Measure: self-esteem, ...
1
vote
1
answer
77
views
Right way of comparing my model's results with the state of the art
I know that I have to use the same datasets and experimental settings. In this way, I don't have to run their codes. What if I used different datasets? Is it OK to use their code from GitHub to get ...
1
vote
3
answers
1k
views
Is there any downside to scaling a dataset?
I've read that some models, such as decision trees, don't require scaling to work effectively.
However, the author of the linked article states there's no downside to scaling data for a decision tree ...
1
vote
1
answer
72
views
Smallest set of words contained in a number of documents
I have a collection of text documents. I would like to find the smallest set of words such that searching by those words allows discovering each document.
It is quite natural to describe this data as ...
1
vote
1
answer
895
views
Can I scale a dataset using different methods on different columns and why?
I'm relatively new to this, and this question has been plaguing me.
Say I have a dataset with feature A, feature B, and feature C. I need to scale for my model. Based on their distributions, feature A is ...
2
votes
0
answers
82
views
ANOVA Assumptions on Big data
I wanted to know whether one needs to check for violations of ANOVA assumptions before running an ANOVA model on a big dataset (the dataset has 57 million rows).
Thanks!
0
votes
0
answers
65
views
Modeling a time series without trend or seasonality
I am dealing with a dataset that has different input data with different behavior, but I am getting confused as I cannot find any trend or pattern in my data. I checked seasonality and found nothing....
0
votes
1
answer
490
views
Order of pre-processing the dataset
Suppose I have a categorical dataset and I'm doing data pre-processing.
What is the correct order for applying these 3 techniques?
Train-test split
SMOTEN to oversample the minority class
Categorical ...
0
votes
0
answers
42
views
How do I analyse panel data with HDI as the independent variable and case fatality rate and number of cases as the dependent variables?
I have the Covid-19 case fatality rate and number of cases for 37 states as dependent variables and the HD Index of the 37 states as the independent variable. That is a single predictor and multiple responses....
1
vote
1
answer
95
views
Does a newly constructed ML dataset need to have an official train-dev-test split? Should the test set be intentionally balanced?
I have constructed a novel ML (NLP) dataset for classification and labeled it with three classes. The dataset is rather small with about 700 examples, out of which the classes have about 400, 200, and ...
0
votes
0
answers
65
views
My dataset is unbalanced and has missing values, when does this become a problem?
I'm writing a thesis on data and variable specification sensitivity pertaining to economic growth. I have panel data with 39 countries ranging from 1981-2012, however, lots of countries had important ...
1
vote
1
answer
44
views
Comparing microRNA expression at different tumour stages and identifying trends in miRNA expression
I'm currently working on a university project using publicly available data. I'm collecting genome data on cancer patients at different pathological stages (Stage I, II, III and IV).
I'm aiming to ...