Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
2
votes
1
answer
86
views
What are some best practices for labeling data that exists in a continuum?
I am building computer vision models on data that exists in a continuum. For example, imagine I'm trying to do semantic segmentation on cars. Some of the labels are distinct, like "chipped paint"...
2
votes
1
answer
108
views
For human annotation projects, what are some commonly used metrics to assess grader reliability?
Lots of machine learning datasets are now created by having human raters annotate and provide labels to questions. Usually, a gold set is the most robust way of seeing if the raters are doing a good ...
1
vote
0
answers
39
views
Can transfer learning be applied after learning using homomorphic encryption to obfuscate dataset source?
Context
Suppose one has a public dataset plant_labels with:
Input: plant pictures
Labels: plant names
And a larger public model: ...
0
votes
0
answers
33
views
Why should exploratory analysis not be followed up with a confirmatory analysis in the same dataset? [duplicate]
On the Wikipedia page for data analysis, the following claim is made. "...one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory ...
0
votes
1
answer
152
views
Proving a class imbalance IS a problem in Machine Learning [duplicate]
Context: Have been trying to create a prediction model for a 1% outcome variable using Random Forest Machine Learning for a large health survey (entirely multi-level categorical data, yes/no outcome, ~...
3
votes
0
answers
48
views
How to combine ML + Expert knowledge? (constrained machine learning)
I am working in the sector of computer science for agricultural research. I deal here with algorithms for crop yield prediction. However, data in agriculture is very limited.
To overcome the issues of ...
0
votes
0
answers
20
views
Validation of Data on SPSS [duplicate]
I have data from 150 participants [2 different methods that assess the same thing (blood pressure), one of which is considered the "gold standard" method] and I want to validate them on ...
0
votes
1
answer
425
views
How to make a pairwise correlation matrix including interaction with a third variable?
My dataset follows this structure:
...
1
vote
0
answers
80
views
Correct way to split data for propensity modeling
For generic propensity (purchase, churn etc.) modeling a lot of typical references / examples available use randomized splitting for train / eval / test sets. For propensity modeling in practice ...
0
votes
0
answers
605
views
Grouped stratified train-val-test split for a multilabel dataset
I was wondering if there is a fast heuristic algorithm for performing a grouped stratified dataset split on a multilabel dataset. Question originally posted on Data Science Stack Exchange here.
...
0
votes
0
answers
47
views
Example studies that use inferences from cross-validation of the entire dataset
I am performing a study where I perform inferences based on cross-validation metrics that use the entire dataset. My reasoning behind this is (a) my dataset is small and imbalanced and (b) this is ...
1
vote
0
answers
245
views
Calculating total revenue in RStudio (price*quantity sold) [closed]
Hello everyone,
I need to work with the data presented above. This question involves a little bit of knowledge of RStudio. I would (1) like to obtain total revenue PER product (ucp), and brand, (2) ...
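A minimal sketch of the "total revenue per product and brand" step, written in Python rather than R. The data referred to in the question is not shown here, so the rows below are invented for illustration; `ucp` and `brand` come from the question, while the `price` and `quantity` column names are assumptions.

```python
from collections import defaultdict

# Hypothetical rows: the original data is not available in this excerpt.
# "ucp" and "brand" are named in the question; "price" and "quantity" are assumed.
rows = [
    {"ucp": "A1", "brand": "X", "price": 2.50, "quantity": 4},
    {"ucp": "A1", "brand": "X", "price": 2.00, "quantity": 1},
    {"ucp": "B7", "brand": "Y", "price": 10.00, "quantity": 3},
]


def total_revenue(rows):
    """Sum price * quantity for each (ucp, brand) pair."""
    totals = defaultdict(float)
    for row in rows:
        key = (row["ucp"], row["brand"])
        totals[key] += row["price"] * row["quantity"]
    return dict(totals)


revenue = total_revenue(rows)
for (ucp, brand), amount in sorted(revenue.items()):
    print(ucp, brand, amount)
```

The same grouping idea carries over to a grouped summary in R; the sketch only shows the arithmetic, not a specific R API.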
2
votes
1
answer
368
views
Which test to use with three variables? [closed]
That's my chart:
Let's say that D0.5 was missing OJ. What statistical test could I use in this case?
1
vote
1
answer
183
views
Missing data mechanism for a single missing value?
I am currently studying statistics and I have come across these terms about missing data mechanisms; MCAR, MAR and MNAR. I have a dataset with exactly one missing value and I can only think of that ...
2
votes
1
answer
86
views
Should I remove 0 values from my dataset if they seem to be from instrument error?
I have conducted an experiment looking at the decay rate of a DNA target in flowing water over time, with 4 replicates of each treatment. The data is collected via a QIAcuity dPCR instrument that ...
0
votes
1
answer
58
views
Size of final test set
Is there a rule of thumb for how large the final test set has to be in machine learning?
Assuming I have 1,000 images, how many images do I set aside and use only in the final run?
My proposal:
Select ...
1
vote
1
answer
98
views
Data with a lot of zeroes
Please forgive a simple-minded question. I'm looking at a dataset now of a few thousand values, and trying to analyze it statistically. Most of the values, about 90%, are zeroes, and then the rest are ...
2
votes
2
answers
2k
views
How many datapoints are enough for a regression model to predict with reasonable (say 88%-92%) accuracy? [closed]
Is there any number that we can land on for our regression model to predict with high accuracy? (The accuracy metrics I have in mind are RMSE or R-squared.) Also high accuracy may mean something above 88% ...
2
votes
1
answer
953
views
How to find the strongest correlation with big data in R? [closed]
I am trying to find the strongest correlation between two data sets in R and one set has 9000+ columns. I used cor() and it worked well, but is there a function or way to find the strongest ...
1
vote
2
answers
911
views
Generating synthetic time series data with limited data
I would like some opinions on my current situation.
I have a set of time series data that I want to forecast. The data however is not very long (around 500 rows) so I was looking into generating many ...
2
votes
1
answer
83
views
Calculate an equation based on conditional existing data
I'm trying to figure out which method to use for calculating an equation based on variables from my database. I have the variable sex, which is 0 for males and 1 for females. I also have serum ...
1
vote
0
answers
71
views
Is PCA redundant when the first PC equals the mean of the data? [closed]
I have a situation where the first principal component of my data is almost equal to the mean of the data. Does this make PCA redundant? Does PCA not work in this setting?
1
vote
1
answer
72
views
Prudent to reduce data size for the sake of model performance?
I am currently working on predicting customer revenue in the next 3, 6, or 9 months using the two methods below:
a) Buy Till You Die probabilistic models
b) Tweedie regression and other regression ...
1
vote
2
answers
1k
views
How to normalize data 'with a sample range from -1 to 1 and a mean value of 0'?
I am trying to pre-process data following a statement in a paper.
They said
for the normalization, each dataset is normalized on a per channel basis with a sample range from -1 to 1 and a mean value ...
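One possible reading of the quoted statement, sketched in plain Python: center each channel on zero, then divide by the largest absolute value so that every sample falls within [-1, 1]. This is an assumption about what the paper means, not its confirmed procedure. Note that a linear rescaling can guarantee a mean of 0 and values inside [-1, 1], but it generally cannot make the samples reach both -1 and 1 at the same time. The function names and sample data are hypothetical.

```python
def normalize_channel(samples):
    """Center one channel on zero, then scale so the largest |value| is 1."""
    mean = sum(samples) / len(samples)
    centered = [x - mean for x in samples]
    peak = max(abs(x) for x in centered)
    if peak == 0:
        # Constant channel: nothing to scale, all values are already 0.
        return centered
    return [x / peak for x in centered]


def normalize_per_channel(channels):
    """Apply the normalization independently to each channel."""
    return [normalize_channel(ch) for ch in channels]


# Hypothetical data: two channels with four samples each.
data = [[1.0, 2.0, 3.0, 6.0], [10.0, 10.0, 10.0, 10.0]]
normalized = normalize_per_channel(data)
```

Scaling after centering keeps the mean at zero, because dividing by a constant does not move a zero mean.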
1
vote
1
answer
404
views
Machine Learning Models for predicting probability
I have a dataset where the dependent variable is a success probability ranging from 0 to 1. I cannot use the regular linear regression to model because linear regression does not restrict the output ...
3
votes
2
answers
199
views
When is an unbalanced dataset large enough for calculating a decision threshold?
I have a (large i.e. >1M rows) very unbalanced (1% event label, binary classification) dataset with data from various institutions. At the moment, I train an XGBoost model on this data and get good ...
0
votes
0
answers
36
views
How to determine if I should adopt a time series approach given a dataset?
I'm learning about machine learning and data science and I would like to know, given a dataset which contains "Date" as one of its many features, how to determine if I'm facing a time ...
1
vote
1
answer
64
views
Should I use long-format or recode the condition column? [closed]
I am just starting out with R and struggling to wrap my head around it.
I have a data set from a $2 \times 2$ repeated measures experiment (IV 1 - Expectation with two levels, number or letter; IV 2 - ...
1
vote
1
answer
1k
views
What happens if I fit my model on the same training dataset multiple times?
If I fit my model on the same training set two or three times, does the model remember what it learned from each iteration and improve?
1
vote
1
answer
87
views
What is the correct method for training NLP models with augmented data?
I have a very small dataset (~50 rows) for a text classification problem. I found some open source data that's similar to the problem I'm trying to solve.
Should I...
Train the (BERT) model on the ...
0
votes
1
answer
72
views
Using SVM for subsets
Let's say I have a set of 20 data points. However, for certain unexplained reasons, I can only perform SVM on 4 of those data points at a time. Is there any way I can do SVM for each subset of 4 ...
0
votes
0
answers
44
views
Binary Logistic Regression - How are my IVs affecting each other?
I am a relative stats noob trying to create a binary logistic regression model in SPSS to explore the relationship between internet access and feeling part of your community. I am also interested in ...
0
votes
1
answer
77
views
One-Sided Hypothesis Test with Categorical Covariate in R, Iris Data Set
I'm currently working with the famous Iris data set in R. I want to test whether the difference in sepal width between setosa and the other plant species is positive, i.e., whether setosa has a larger ...
0
votes
1
answer
506
views
Conditions to Select Pairwise Deletion
When should I select pairwise deletion?
So I grasp the idea of pairwise deletion, but what conditions are actually needed to select this? Is it when data is MCAR? Why would researchers select this ...
2
votes
2
answers
757
views
Converting nominal variables and ordinal variables in dataframe - categorical variables
I have a data frame in which I have some categorical variables; some are ordinal and others are nominal.
How can I deal with nominal columns/variables that have too many levels?
For example, I have a ...
1
vote
1
answer
106
views
Why do eigenvalues of $\mathbf\Phi^T\mathbf\Phi$ increase with the size of data set?
The question comes from a paragraph in page 171 of "Pattern Recognition and Machine Learning" by Christopher M. Bishop:
Here $\mathbf\Phi$ is the design matrix for a data set of $N$ samples ...
1
vote
0
answers
134
views
How to add a constant to a new variable conditional on an existing variable in R? [closed]
I have a dataset for 100 households and 10 years.
I created a new variable called x1hat conditional on the household identifier. Then, I assigned the same value as under variable x1 to all households ...
0
votes
1
answer
154
views
Combine two data sets from two different regions
This is actually a very basic question, but I can't get my head around it.
I have two datasets for Europe and the U.S. that contain the same two variables. These two variables are in a linear ...
0
votes
0
answers
25
views
How to check statistical correlation of an ordinal variable against a continuous one [duplicate]
I've been trying to run some stats to check for correlation between ordinal categories (such as body condition classes) and a continuous variable (such as body measurements in cm). I'm really ...
1
vote
0
answers
45
views
Missing trade data for difference in differences model
I have missing monthly trade data for my dependent variable in the DiD model. Are there any methods to compute the missing data, so it does not lead to a bias? Also, for many products, there are no ...
1
vote
0
answers
68
views
Provide an example of a dataset where maximum likelihood is inapplicable as third moments and fourth moments "assumptions" do not apply
An additional complication arises with estimation, since maximum likelihood estimation may not be feasible without making unrealistically strong "assumptions" about third‐ and ...
0
votes
1
answer
90
views
Is it correct to use Wilcoxon signed rank test on mean data?
I have X individuals and 2 categories of interest (category A and category B) per individual. The problem is that I have variability in the number of measures I have between individuals and between ...
2
votes
0
answers
120
views
Neural network not working as expected - Autonomous Driving
Background/problem I am trying to solve: I have vehicle timeline data for position, velocity, accelerations for all vehicles on a section of a motorway for ~30min where adjacent data points are spaced ...
2
votes
0
answers
123
views
How to interpret the relationship between batch size and bootstrap count in a specific paper?
In the paper "Active Learning for Natural Language Parsing and Information Extraction", the author mentioned:
In tests on this data, test examples were chosen independently for 10 trials ...
0
votes
1
answer
141
views
Categorical data for binary classification [duplicate]
ML newbie here. I'm preparing my data for a binary classification to predict whether a person has an account or not. In total I have 8 variables: 2 numeric (age and household size) and 6 categorical. ...
5
votes
1
answer
965
views
Why do I get different results after shuffling data using DBSCAN?
Sometimes, by simply shuffling my data, not changing the parameters, I get a different cluster result using sklearn.DBSCAN. Why does this happen?
I mean, by shuffling data, the data distribution is the ...
0
votes
1
answer
53
views
Climatological datasets [closed]
What are some good repositories of climatological data? Is there a gold standard used in the community?
I need some datasets with gridded data with a reasonably good resolution, both temporal (...
1
vote
0
answers
83
views
Differentiate between two set of points
Consider two sets of points (in the pictures below), whose "center of gravity" is the same. What measure can differentiate between the two sets?
e.g.
Image 1 ...
3
votes
0
answers
1k
views
Combining multiple stock time series to one data set for LSTM
I am trying to predict daily stock return volatility using an LSTM network. My data comprises price data of five different stocks, over the same time frame. My question, to which I have not found an ...
1
vote
0
answers
49
views
How to analyze datasets collected by different companies
I have data from six different experimental groups that were collected by two different companies (three groups were collected by company A and the other three by company B).
I'm interested in comparing a ...