Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

2 votes
1 answer
86 views

I am building computer vision models on data that exists in a continuum. For example, imagine I'm trying to do semantic segmentation on cars. Some of the labels are distinct, like "chipped paint&...
jss367
  • 446
2 votes
1 answer
108 views

Lots of machine learning datasets are now created by having human raters annotate and provide labels to questions. Usually, a gold set is the most robust way of seeing if the raters are doing a good ...
user321627
  • 4,372
1 vote
0 answers
39 views

Context Suppose one has a public dataset plant_labels with: Input: plant pictures Labels: plant names And a larger public model: ...
a.t.
  • 111
0 votes
0 answers
33 views

On the Wikipedia page for data analysis, the following claim is made. "...one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory ...
anna6931
  • 151
0 votes
1 answer
152 views

Context: Have been trying to create a prediction model for a 1% outcome variable using Random Forest Machine Learning for a large health survey (entirely multi-level categorical data, yes/no outcome, ~...
MJay
  • 1
3 votes
0 answers
48 views

I work in computer science for agricultural research, dealing with algorithms for crop yield prediction. However, data in agriculture is very limited. To overcome the issues of ...
MvB
  • 31
0 votes
0 answers
20 views

I have the data of 150 participants [2 different methods that assess the same thing (Blood pressure), one of which is considered as "gold standard method"] and I want to validate them on ...
0 votes
1 answer
425 views

My dataset follows this structure: ...
1 vote
0 answers
80 views

For generic propensity (purchase, churn etc.) modeling a lot of typical references / examples available use randomized splitting for train / eval / test sets. For propensity modeling in practice ...
permustats
0 votes
0 answers
605 views

I was wondering if there is a fast heuristic algorithm for performing grouped stratified dataset split on a multilabel dataset. Question originally posted on Data Science Stack Exchange here. ...
jasperhyp
0 votes
0 answers
47 views

I am performing a study where I perform inferences based on cross-validation metrics that use the entire dataset. My reasoning behind this is (a) my dataset is small and imbalanced and (b) this is ...
Adam_G
  • 371
1 vote
0 answers
245 views

Hello everyone, I need to work with the data presented above. This question involves a little bit of knowledge of RStudio. I would (1) like to obtain total revenue PER product (ucp), and brand, (2) ...
René González
2 votes
1 answer
368 views

That's my chart: let's say that D0.5 was missing OJ. What statistical test could I use in this case?
NERD
  • 23
1 vote
1 answer
183 views

I am currently studying statistics and I have come across these terms for missing data mechanisms: MCAR, MAR and MNAR. I have a dataset with exactly one missing value and I can only think of that ...
restingquartH
2 votes
1 answer
86 views

I have conducted an experiment looking at the decay rate of a DNA target in flowing water over time, with 4 replicates of each treatment. The data is collected via a QIAcuity dPCR instrument that ...
Mitchell Liddick
0 votes
1 answer
58 views

Is there a rule of thumb for how large the final test set has to be in machine learning? Assuming I have 1,000 images, how many images do I set aside and use only in the final run? My proposal: Select ...
SchwarzbrotMitHummus
1 vote
1 answer
98 views

Please forgive a simple-minded question. I'm looking at a dataset now of a few thousand values, and trying to analyze it statistically. Most of the values, about 90%, are zeroes, and then the rest are ...
Greg Markowsky
2 votes
2 answers
2k views

Is there any number that we can land on for our regression model to predict with high accuracy? (The accuracy metrics I have in mind are RMSE or R-squared.) Also, high accuracy may mean something above 88% ...
SJa
  • 554
2 votes
1 answer
953 views

I am trying to find the strongest correlation between two data sets in R and one set has 9000+ columns. I used cor() and it worked well, but is there a function or way to find the strongest ...
guest101010
1 vote
2 answers
911 views

I would like some opinions on my current situation. I have a set of time series data that I want to forecast. The data however is not very long (around 500 rows) so I was looking into generating many ...
codinator
  • 123
2 votes
1 answer
83 views

I'm trying to figure out which method to use to calculate an equation based on variables from my database. I have the variables sex, which is 0 for males and 1 for females. I also have serum ...
Fred
  • 21
1 vote
0 answers
71 views

I have a situation where the principal component of my data is almost equal to the mean of the data. Does this make PCA redundant? Does PCA not work in this setting?
Martian
  • 11
1 vote
1 answer
72 views

I am currently working on predicting customer revenue in the next 3, 6, or 9 months using the two methods below: a) Buy Till You Die probabilistic models, b) Tweedie regression and other regression ...
The Great
  • 3,380
1 vote
2 answers
1k views

I am trying to pre-process data following a statement in a paper. They said for the normalization, each dataset is normalized on a per channel basis with a sample range from -1 to 1 and a mean value ...
Margie Shi
1 vote
1 answer
404 views

I have a dataset where the dependent variable is a success probability ranging from 0 to 1. I cannot use the regular linear regression to model because linear regression does not restrict the output ...
Simon
  • 51
3 votes
2 answers
199 views

I have a (large i.e. >1M rows) very unbalanced (1% event label, binary classification) dataset with data from various institutions. At the moment, I train an XGBoost model on this data and get good ...
Spill4963
0 votes
0 answers
36 views

I'm learning about machine learning and data science, and I would like to know: given a dataset that contains "Date" as one of its many features, how do I determine if I'm facing a time ...
Antonio Caipora
1 vote
1 answer
64 views

I am just starting out with R and struggling to wrap my head around it. I have a data set from a $2 \times 2$ repeated measures experiment (IV 1 - Expectation with two levels, number or letter; IV 2 - ...
1 vote
1 answer
1k views

If I fit my model on the same training set two or three times, does the model remember what it learned in each iteration and improve?
Parthsarthi Joshi
1 vote
1 answer
87 views

I have a very small dataset (~50 rows) for a text classification problem. I found some open source data that's similar to the problem I'm trying to solve. Should I... Train the (BERT) model on the ...
krisjuna
0 votes
1 answer
72 views

Let's say I have a set of 20 data points. However, for certain unexplained reasons, I can only perform SVM on 4 of those data points at a time. Is there any way I can do SVM for each subset of 4 ...
MeltedStatementRecognizing
0 votes
0 answers
44 views

I am a relative stats noob trying to create a binary logistic regression model in SPSS to explore the relationship between internet access and feeling part of your community. I am also interested in ...
VerticalSlice
0 votes
1 answer
77 views

I'm currently working with the famous Iris data set in R. I want to test whether the difference in sepal width between setosa and the other plant species is positive, i.e., whether setosa has a larger ...
Pame
  • 331
0 votes
1 answer
506 views

When should I select pairwise deletion? So I grasp the idea of pairwise deletion, but what conditions are actually needed to select this? Is it when data is MCAR? Why would researchers select this ...
Fats
  • 21
2 votes
2 answers
757 views

I have a data frame in which I have some categorical variables, some are ordinal and others are nominal. How can I deal with nominal columns/variables that have too many levels? For example, I have a ...
Daniel
  • 21
1 vote
1 answer
106 views

The question comes from a paragraph on page 171 of "Pattern Recognition and Machine Learning" by Christopher M. Bishop: Here $\mathbf\Phi$ is the design matrix for a data set of $N$ samples ...
zzzhhh
  • 333
1 vote
0 answers
134 views

I have a dataset for 100 households and 10 years. I created a new variable called x1hat conditional on the household identifier. Then, I assigned the same value as under variable x1 to all households ...
TFT
  • 345
0 votes
1 answer
154 views

This is actually a very basic question, but I can't get my head around it. I have two datasets for Europe and U.S. that contain the same two variables. These two variables are in a linear ...
Soda
  • 1
0 votes
0 answers
25 views

I've been trying to run some stats to check for correlation between ordinal categories (such as body condition classes) and a continuous variable (such as body measurements in cm). I'm really ...
Jaq
  • 1
1 vote
0 answers
45 views

I have missing monthly trade data for my dependent variable in the DiD model. Are there any methods to compute the missing data, so it does not lead to a bias? Also, for many products, there are no ...
kz500
  • 11
1 vote
0 answers
68 views

An additional complication arises with estimation, since maximum likelihood estimation may not be feasible without making unrealistically strong "assumptions" about third‐ and ...
0 votes
1 answer
90 views

I have X individuals and 2 categories of interest (category A and category B) per individual. The problem is that I have variability in the number of measures between individuals and between ...
2 votes
0 answers
120 views

Background/problem I am trying to solve: I have vehicle timeline data for position, velocity, accelerations for all vehicles on a section of a motorway for ~30min where adjacent data points are spaced ...
hokage007
2 votes
0 answers
123 views

In the paper "Active Learning for Natural Language Parsing and Information Extraction", the author mentioned: In tests on this data, test examples were chosen independently for 10 trials ...
LCheng
  • 221
0 votes
1 answer
141 views

ML newbie here. I'm preparing my data for a binary classification to predict whether a person has an account or not. In total I have 8 variables: 2 numeric (age and household size) and 6 categorical. ...
adam g.
  • 21
5 votes
1 answer
965 views

Sometimes, by simply shuffling my data, not changing the parameters, I get a different cluster result using sklearn.DBSCAN. Why does this happen? I mean, by shuffling data, the data distribution is the ...
Brown_Z
  • 53
0 votes
1 answer
53 views

What are some good repositories of climatological data? Is there a gold standard used in the community? I need some datasets with gridded data with a reasonably good resolution, both temporal (...
AbateFaria
1 vote
0 answers
83 views

Consider two sets of points (in the pictures below) whose "center of gravity" is the same. What measure can differentiate between the two sets? e.g. Image 1 ...
s510
  • 161
3 votes
0 answers
1k views

I am trying to predict daily stock return volatility using an LSTM network. My data comprises price data of five different stocks, over the same time frame. My question, to which I have not found an ...
Draenth
  • 31
1 vote
0 answers
49 views

I have data from six different experimental groups that were collected by two different companies (three groups were collected by company A and the other three by company B). I'm interested in comparing a
bettertospeakordie
