Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

0 votes
1 answer
75 views

Assume I have a dataset that ranges from 2014-2022 composed of survey responses from two different groups e.g. store 1 and store 2. I want to test whether there is a statistically significant ...
Daria's user avatar
  • 63
0 votes
1 answer
99 views

I have downloaded five family income variables from https://nhis.ipums.org/nhis-action/variables/group?id=economic_income (INCPPOINT1, INCPPOINT2, INCPPOINT3, INCPPOINT4, INCPPOINT5) for the years ...
tryingtogetsmth's user avatar
2 votes
1 answer
1k views

I'm just starting to learn about CNN (convolutional neural networks). Does the test data also need to be divided into batches, similar to how it's done with the training data?
xioXuei's user avatar
  • 21
1 vote
1 answer
266 views

According to one study by Wang and colleagues (2013), smoking cigars has “a positive and independent” effect on testosterone. What is an independent effect and what is the significance of one?
TylerDurden's user avatar
1 vote
1 answer
175 views

Suppose you have three variables $y\in\{0,1\}$ and $x_1\in\mathbb{R}$ and $x_2\in\mathbb{R}$. I want to produce data with the following generative process which corresponds to a Mixture of Gaussians (...
Sergio's user avatar
  • 336
1 vote
0 answers
105 views

Another issue is that datasets with more labels in groups have less intra-group deviation than inter-group deviation. So I could take the means of them (the 'second' index), or the whole label means (the '...
0 votes
1 answer
64 views

I am new to data analysis and I have been given the task of comparing the safety of trams with that of buses. Since there are far more buses than trams, I was introduced to the concept of normalisation by ...
Revonda Sanchez's user avatar
0 votes
0 answers
46 views

I am working on a project that uses training data selection techniques; it involves sampling the training set in some smart way rather than sampling randomly. The goal is to compare different data ...
Mr.Robot's user avatar
  • 257
2 votes
1 answer
158 views

I am looking for a way to compare two sets of data in order to find out how similar they are to each other. My application: I am trying to compare multiple measurement methods that measure the sound ...
Anton Wolf's user avatar
1 vote
0 answers
36 views

Is anyone familiar with publicly accessible image datasets that include Inter-annotator agreement (IAA) scores? Ideally for object detection or classification tasks. Thank you!
SQL_Noob's user avatar
1 vote
1 answer
123 views

Goal: Represent nested models with SKLearn's Pipeline / ColumnTransformer / FeatureUnion setup. Specific issue: I cannot figure out how to use the prediction from one model as a factor of a secondary ...
chris's user avatar
  • 31
1 vote
0 answers
51 views

I am currently working on the final project for my CS degree, which involves analyzing swimming data, specifically focusing on both heartbeat and motion data. I am looking for a dataset that includes these two ...
Malak Qaadan's user avatar
3 votes
1 answer
364 views

I have two 5-point Likert scales (Strongly Disagree to Strongly Agree) and I want to compare the results within one population sample. I expect participants to choose 'agree' or 'strongly agree' on ...
YasG's user avatar
  • 31
2 votes
2 answers
378 views

As explained in the title, I would like to transform a yearly dataset into a monthly one, but including a constraint. My current dataset gives the yearly production of a commodity, and from year to ...
EstebanVer's user avatar
0 votes
1 answer
86 views

In the question "On the importance of the i.i.d. assumption in statistical learning", the dataset is denoted as $D=\{X,y\}$. In statistics, capital letters are usually used to refer to a random variable, ...
StackExchanger's user avatar
1 vote
0 answers
385 views

Given these values, is it possible to generate random values that conform to this distribution (using Python, but preferably without the SciPy package)? Statistic: Value; Mean: 1.518; Std Dev: 24.827; Skew ...
m01010011's user avatar
  • 111
1 vote
1 answer
261 views

I have groups of individuals in a population. These groups have different sizes--for example, group 1 has 50 individuals and group 2 has 1000 individuals. I want to show the differences in the groups' ...
anni's user avatar
  • 31
1 vote
0 answers
34 views

When learning about datasets of naturally occurring systems, you encounter distributions, for example power law distributions in the domain of population sizes of cities. In my understanding this ...
FreeThought's user avatar
0 votes
0 answers
109 views

I want to estimate the impact of economic sanctions on the GDP and other macroeconomic indicators of Italy in 1935 using difference-in-differences (DiD). I have observations of GDP of Italy (treated) and of other countries (control) ...
rickycala's user avatar
4 votes
2 answers
132 views

Say user 1 entered the raffle 18 times (18 tickets) and he had ticket ids [0, 2, 3, 4, 8, 9, 14, 16, 27, 28, 30, 31, 32, 33] (so they acquired the first ticket, missed the second (2nd hour ticket) ...
1 vote
0 answers
92 views

Please let me know whether the statement below is valid: Suppose that $X$ is an $n\times p$ data matrix with $p$ features and $n$ data samples. Suppose further that each feature (column) is zero ...
sj.kim's user avatar
  • 11
3 votes
1 answer
348 views

I have been doing a little research on the various cross-validation methods, but there is one point about the Monte Carlo method that I am still unsure of. Let's suppose I have 2,000 data points. ...
The Student's user avatar
1 vote
0 answers
99 views

I found the Jeffrey divergence formula here, where $h_i$ and $k_i$ are the bin data of two histograms: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/rubner-jcviu-00.pdf I want to compute the Jeffrey divergence. Is there any ...
Om Prakash's user avatar
1 vote
0 answers
44 views

I have 3 datasets of different sizes and am trying to select the one that is closest to the benchmark, in the sense of minimized pairwise differences. I checked for normality using QQ plots and Shapiro-...
gregV's user avatar
  • 111
2 votes
0 answers
60 views

I have a dataset of weighted polynomials, i.e. each data point is a polynomial (of variable size/degree) together with a weight vector (of fixed size). Each data point has an integer label that ranges ...
tomate's user avatar
  • 21
0 votes
1 answer
82 views

I am looking to perform a two-way ANOVA on the following dataset: ...
Sammy's user avatar
  • 1
1 vote
0 answers
593 views

I have recently been working with the MASS, lars, and glmnet packages to study variable ...
YessuhYessuhYessuh's user avatar
1 vote
0 answers
125 views

I have collected percentage data for a rocky shore and summarised it per algae type, like the example below: ...
Martina's user avatar
  • 11
1 vote
0 answers
38 views

Suppose that I have training data with dimension $(N,H,F)$, where $N$ represents the number of different datasets, $H$ is the history size and $F$ is the input size. Normalizing each dataset over the ...
Hadar's user avatar
  • 125
2 votes
0 answers
67 views

I have a dataset of blood pressure recordings measured over 36 hours, which I plotted with BP (aka MAP) on the Y axis, and minutes on the X axis. I then plotted the blood pressure percentiles for age (...
user387100's user avatar
1 vote
0 answers
31 views

Consider a scenario where I compare a single column (independent variable) in a data set to the other columns (dependent variables). If I find the outliers in the independent variable, what could I test ...
edge selcuk's user avatar
2 votes
1 answer
88 views

I have a piece of code that I'm adding a feature onto, and I need to test performance before and after the feature addition. So I used a test instance with a large amount of data and triggered the ...
SachiDangalla's user avatar
1 vote
0 answers
83 views

I have a few questions regarding ranked data and hypothesis testing. Hypothesis: Car drivers are more likely to consider the car as the most beneficial transportation method for mental health than ...
csira_allapot_5876's user avatar
3 votes
0 answers
70 views

The data I have seeks to understand whether an individual's new job is of high quality. I have an individual's old job wage: how much they were paid before, ...
user321627's user avatar
  • 4,372
0 votes
0 answers
135 views

I have a set of data points. The first coordinate is time and the second coordinate is energy. I am trying to figure out how the energy is decaying over time. Particularly, I have to find if it is ...
HadamardN2's user avatar
1 vote
1 answer
535 views

Say I have a dataset with groups that I want to use for a regression problem that looks like the following, where feature1 is the group column: ...
Hamza Adnan's user avatar
1 vote
0 answers
182 views

The formulation of the conditional density is: $$ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}. $$ I need to estimate this density from data and it's prohibitively time-consuming to calculate the joint ...
smthack's user avatar
  • 61
1 vote
0 answers
46 views

I have daily sales (total volume in dollars) from 200 stores of the same franchise, over two years. I would like to identify any store with anomalies or special patterns, which could be the sign of a ...
olivaw's user avatar
  • 111
1 vote
0 answers
29 views

I need to run an analysis in SPSS; could you please advise me on this? I have given my examples below, please correct me. Three groups: control, body positivity, body neutrality; measure: self esteem, ...
Ana's user avatar
  • 11
1 vote
1 answer
77 views

I know that I have to use the same datasets and experimental settings. In this way, I don't have to run their code. What if I used different datasets? Is it OK to use their code from GitHub to get ...
AMAS AL's user avatar
  • 33
1 vote
3 answers
1k views

I've read that some models, such as decision trees, don't require scaling to work effectively. However, the author of the linked article states there's no downside to scaling data for a decision tree ...
Connor's user avatar
  • 677
1 vote
1 answer
72 views

I have a collection of text documents. I would like to find the smallest set of words such that searching by those words allows discovering each document. It is quite natural to describe this data as ...
mmh's user avatar
  • 1,019
1 vote
1 answer
895 views

I'm relatively new to this, and this question has been plaguing me. Say I have a dataset with feature A, feature B, and feature C. I need to scale for my model. Based on their distributions, feature A is ...
Marque's user avatar
  • 11
2 votes
0 answers
82 views

I wanted to know whether one needs to check for violations of the ANOVA assumptions before running an ANOVA model on a big dataset (57 million rows). Thanks!
Akira Banerjee's user avatar
0 votes
0 answers
65 views

I am dealing with a dataset that has different input data with different behavior, but I am getting confused as I cannot find any trend or pattern in my data. I checked seasonality and found nothing. ...
john22's user avatar
  • 157
0 votes
1 answer
490 views

Suppose I have a categorical dataset and I'm doing data pre-processing. What is the correct order for applying these 3 techniques: train-test split; SMOTEN to oversample the minority class; categorical ...
Mohamed Ahmed's user avatar
0 votes
0 answers
42 views

I have the Covid-19 case fatality rate and number of cases for 37 states as dependent variables and the HD Index of the 37 states as the independent variable. That is a single predictor and multiple responses. ...
HENRY's user avatar
  • 1
1 vote
1 answer
95 views

I have constructed a novel ML (NLP) dataset for classification and labeled it with three classes. The dataset is rather small with about 700 examples, out of which the classes have about 400, 200, and ...
Arno's user avatar
  • 11
0 votes
0 answers
65 views

I'm writing a thesis on data and variable specification sensitivity pertaining to economic growth. I have panel data with 39 countries ranging from 1981-2012, however, lots of countries had important ...
Maurits de Vries's user avatar
1 vote
1 answer
44 views

I'm currently working on a university project using publicly available data. I'm collecting genome data on cancer patients at different pathological stages (Stage I, II, III and IV). I'm aiming to ...
user378714's user avatar
