Newest 'dataset' Questions - Page 4

2 votes

1 answer

86 views

What are some best practices for labeling data that exists in a continuum?

I am building computer vision models on data that exists in a continuum. For example, imagine I'm trying to do semantic segmentation on cars. Some of the labels are distinct, like "chipped paint"...

jss367

446

asked Jan 19, 2023 at 17:52

2 votes

1 answer

108 views

For human annotation projects, what are some commonly used metrics to assess grader reliability?

Lots of machine learning datasets are now created by having human raters annotate and provide labels to questions. Usually, a gold set is the most robust way of seeing if the raters are doing a good ...

user321627

4,372

asked Jan 19, 2023 at 0:00
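
One commonly cited agreement metric for this kind of setup is Cohen's kappa, which corrects raw agreement between a rater and a gold set (or between two raters) for agreement expected by chance. The sketch below is a minimal pure-Python version; the function name `cohen_kappa` and the example labels are hypothetical and do not come from the question.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both labels match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels: one rater compared against a gold set.
rater = ["cat", "cat", "dog", "dog", "cat", "dog"]
gold = ["cat", "cat", "dog", "cat", "cat", "dog"]
kappa = cohen_kappa(rater, gold)
print(round(kappa, 3))
```

Kappa handles only two raters and nominal labels; with more raters or ordinal scales, other statistics (for example Fleiss' kappa or Krippendorff's alpha) are usually considered instead.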

1 vote

0 answers

39 views

Can transfer learning be applied after learning using homomorphic encryption to obfuscate dataset source?

Context Suppose one has a public dataset plant_labels with: Input: plant pictures Labels: plant names And a larger public model: ...

a.t.

111

asked Jan 17, 2023 at 21:27

0 votes

0 answers

33 views

Why should exploratory analysis not be followed up with a confirmatory analysis in the same dataset? [duplicate]

On the Wikipedia page for data analysis, the following claim is made. "...one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory ...

anna6931

151

asked Jan 17, 2023 at 8:05

0 votes

1 answer

152 views

Proving a class imbalance IS a problem in Machine Learning [duplicate]

Context: Have been trying to create a prediction model for a 1% outcome variable using Random Forest Machine Learning for a large health survey (entirely multi-level categorical data, yes/no outcome, ~...

MJay

1

asked Jan 14, 2023 at 2:58

3 votes

0 answers

48 views

How to combine ML + Expert knowledge? (constrained machine learning)

I am working in the sector of computer science for agriculture research. I deal here with algorithms for crop yield prediction. However, data in agriculture is very limited. To overcome the issues of ...

MvB

31

asked Jan 13, 2023 at 15:27

0 votes

0 answers

20 views

Validation of Data on SPSS [duplicate]

I have the data of 150 participants [2 different methods that assess the same thing (Blood pressure), one of which is considered as "gold standard method"] and I want to validate them on ...

user376879

asked Jan 5, 2023 at 17:23

0 votes

1 answer

425 views

How to make a pairwise correlation matrix including interaction with a third variable?

My dataset follows this structure: ...

user351731

asked Jan 3, 2023 at 22:40

1 vote

0 answers

80 views

Correct way to split data for propensity modeling

For generic propensity (purchase, churn etc.) modeling a lot of typical references / examples available use randomized splitting for train / eval / test sets. For propensity modeling in practice ...

permustats

11

asked Jan 3, 2023 at 22:36

0 votes

0 answers

605 views

Grouped stratified train-val-test split for a multilabel dataset

I was wondering if there is a fast heuristic algorithm for performing a grouped stratified dataset split on a multilabel dataset. Question originally posted on Data Science Stack Exchange here. ...

jasperhyp

21

asked Dec 19, 2022 at 0:22

0 votes

0 answers

47 views

Example studies that use inferences from cross-validation of the entire dataset

I am performing a study where I perform inferences based on cross-validation metrics that use the entire dataset. My reasoning behind this is (a) my dataset is small and imbalanced and (b) this is ...

Adam_G

371

asked Dec 16, 2022 at 19:47

1 vote

0 answers

245 views

Calculating total revenue in Rstudio (price*quantity sold) [closed]

Hello everyone, I need to work with the data presented above. This question involves a little bit of knowledge of Rstudio. I would (1) like to obtain total revenue PER product (ucp), and brand, (2) ...

René González

11

asked Dec 16, 2022 at 11:52

2 votes

1 answer

368 views

Which test to use with three variables? [closed]

That's my chart: let's say that D0.5 was missing OJ. What statistical test could I use in this case?

NERD

23

asked Dec 11, 2022 at 7:12

1 vote

1 answer

183 views

Missing data mechanism for a single missing value?

I am currently studying statistics and I have come across these terms about missing data mechanisms; MCAR, MAR and MNAR. I have a dataset with exactly one missing value and I can only think of that ...

restingquartH

13

asked Dec 5, 2022 at 21:34

2 votes

1 answer

86 views

Should I remove 0 values from my dataset if they seem to be from instrument error?

I have conducted an experiment looking at the decay rate of a DNA target in flowing water over time, with 4 replicates of each treatment. The data is collected via a dPCR QIAcuity instrument that ...

Mitchell Liddick

31

asked Dec 3, 2022 at 20:34

0 votes

1 answer

58 views

Size of Final TestSet

Is there a rule of thumb for how large the final test set has to be in machine learning? Assuming I have 1,000 images, how many images do I set aside and use only in the final run? My proposal: Select ...

SchwarzbrotMitHummus

11

asked Nov 29, 2022 at 14:00
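
One way to reason about this is through the uncertainty of the final metric. The sketch below assumes the metric is plain accuracy and that test images are independent, so the standard error follows the binomial formula sqrt(p(1 - p) / n). The function name and the assumed 90% accuracy are illustrative only and are not from the question.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy estimate from n_test items."""
    if not 0.0 <= accuracy <= 1.0 or n_test <= 0:
        raise ValueError("accuracy must be in [0, 1] and n_test positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Hypothetical scenario: true accuracy near 0.90, different hold-out sizes
# taken from a pool of 1,000 images.
assumed_accuracy = 0.90
for n_test in (100, 200, 300):
    se = accuracy_standard_error(assumed_accuracy, n_test)
    # Rough 95% interval half-width under a normal approximation.
    print(n_test, round(se, 4), round(1.96 * se, 4))
```

Because the standard error shrinks only with the square root of the test set size, doubling the hold-out set narrows the interval by about 29%, which is the trade-off against having fewer images left for training.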

1 vote

1 answer

98 views

Data with a lot of zeroes

Please forgive a simple-minded question. I'm looking at a dataset now of a few thousand values, and trying to analyze it statistically. Most of the values, about 90%, are zeroes, and then the rest are ...

Greg Markowsky

137

asked Nov 25, 2022 at 18:08

2 votes

2 answers

2k views

How many datapoints are enough for a regression model to predict with reasonable (say 88%-92%) accuracy? [closed]

Is there any number that we can land on for our regression model to predict with high accuracy? (The accuracy metrics I have in mind are RMSE or R-squared.) Also, high accuracy may mean something above 88% ...

SJa

554

asked Nov 23, 2022 at 15:26

2 votes

1 answer

953 views

How to find the strongest correlation with big data in R? [closed]

I am trying to find the strongest correlation between two data sets in R and one set has 9000+ columns. I used cor() and it worked well, but is there a function or way to find the strongest ...

guest101010

29

asked Nov 21, 2022 at 3:07

1 vote

2 answers

911 views

Generating synthetic time series data with limited data

I would like some opinions on my current situation. I have a set of time series data that I want to forecast. The data however is not very long (around 500 rows) so I was looking into generating many ...

codinator

123

asked Nov 15, 2022 at 14:07

2 votes

1 answer

83 views

Calculate an equation based on conditional existing data

I'm trying to figure out which method to use for calculating an equation based on variables from my database. I have the variable sex, which is 0 for males and 1 for females. I also have serum ...

Fred

21

asked Nov 13, 2022 at 21:32

1 vote

0 answers

71 views

Is PCA redundant when the first PC equals the mean of the data? [closed]

I have a situation where the first principal component of my data is almost equal to the mean of the data. Does this make PCA redundant? Does PCA not work in this setting?

Martian

11

asked Nov 10, 2022 at 22:30

1 vote

1 answer

72 views

Prudent to reduce data size for the sake of model performance?

I am currently working on predicting customer revenue in the next 3, 6, or 9 months using the two methods below: a) Buy Till You Die probabilistic models b) Tweedie regression and other regression ...

The Great

3,380

asked Nov 9, 2022 at 12:12

1 vote

2 answers

1k views

How to normalize data 'with a sample range from -1 to 1 and a mean value of 0'?

I am trying to pre-process data following a statement in a paper. They said for the normalization, each dataset is normalized on a per channel basis with a sample range from -1 to 1 and a mean value ...

Margie Shi

13

asked Nov 8, 2022 at 11:12
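
The quoted statement is ambiguous, so the sketch below shows only one possible reading: center each channel to mean 0, then divide by the largest absolute deviation so that every sample falls within [-1, 1]. The function name `normalize_channel` and the sample values are hypothetical, and this is not necessarily what the paper's authors did.

```python
def normalize_channel(samples):
    """Center a channel to mean 0, then scale so the largest |value| is 1."""
    if not samples:
        raise ValueError("channel must contain at least one sample")
    mean = sum(samples) / len(samples)
    centered = [x - mean for x in samples]
    peak = max(abs(x) for x in centered)
    if peak == 0:
        return centered  # constant channel: every value is already 0
    return [x / peak for x in centered]


# Hypothetical single channel.
channel = [2.0, 4.0, 6.0, 12.0]
normalized = normalize_channel(channel)
print(normalized)
```

Note that a plain min-max rescaling to [-1, 1] generally does not give a mean of 0, and with the approach above only one end of the range is guaranteed to reach -1 or 1 unless the channel is symmetric around its mean.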

1 vote

1 answer

404 views

Machine Learning Models for predicting probability

I have a dataset where the dependent variable is a success probability ranging from 0 to 1. I cannot use the regular linear regression to model because linear regression does not restrict the output ...

Simon

51

asked Nov 7, 2022 at 16:43

3 votes

2 answers

199 views

When is an unbalanced dataset large enough for calculating a decision threshold?

I have a large (i.e. >1M rows), very unbalanced (1% event label, binary classification) dataset with data from various institutions. At the moment, I train an XGBoost model on this data and get good ...

Spill4963

31

asked Nov 3, 2022 at 12:21

0 votes

0 answers

36 views

How to determine if I should adopt a time series approach given a dataset?

I'm learning about machine learning and data science, and I would like to know: given a dataset which contains "Date" as one of its many features, how do I determine if I'm facing a time ...

Antonio Caipora

61

asked Oct 31, 2022 at 14:57

1 vote

1 answer

64 views

Should I use long-format or recode the condition column? [closed]

I am just starting out with R and struggling to wrap my head around it. I have a data set from a $2 \times 2$ repeated measures experiment (IV 1 - Expectation with two levels, number or letter; IV 2 - ...

user371552

asked Oct 29, 2022 at 18:58

1 vote

1 answer

1k views

What happens if I fit my model on the same training dataset multiple times?

If I fit my model on the same training set two or three times, does the model remember what it learned in each iteration and improve?

Parthsarthi Joshi

11

asked Oct 29, 2022 at 14:10

1 vote

1 answer

87 views

What is the correct method for training NLP models with augmented data?

I have a very small dataset (~50 rows) for a text classification problem. I found some open source data that's similar to the problem I'm trying to solve. Should I... Train the (BERT) model on the ...

krisjuna

31

asked Oct 29, 2022 at 0:41

0 votes

1 answer

72 views

Using SVM for subsets

Let's say I have a set of 20 data points. However, for certain unexplained reasons, I can only perform SVM on 4 of those data points at a time. Is there any way I can do SVM for each subset of 4 ...

MeltedStatementRecognizing

103

asked Oct 28, 2022 at 13:48

0 votes

0 answers

44 views

Binary Logistic Regression - How are my IVs affecting each other?

I am a relative stats noob trying to create a binary logistic regression model in SPSS to explore the relationship between internet access and feeling part of your community. I am also interested in ...

VerticalSlice

1

asked Oct 28, 2022 at 0:56

0 votes

1 answer

77 views

One-Sided Hypothesis Test with Categorical Covariate in R, Iris Data Set

I'm currently working with the famous Iris data set in R. I want to test whether the difference in sepal width between setosa and the other plant species is positive, i.e., whether setosa has a larger ...

Pame

331

asked Oct 24, 2022 at 10:30

0 votes

1 answer

506 views

Conditions to Select Pairwise Deletion

When should I select pairwise deletion? I grasp the idea of pairwise deletion, but what conditions are actually needed to select it? Is it when data is MCAR? Why would researchers select this ...

Fats

21

asked Oct 21, 2022 at 21:56

2 votes

2 answers

757 views

Converting nominal variables and ordinal variables in dataframe - categorical variables

I have a data frame in which I have some categorical variables, some are ordinal and others are nominal. How can I deal with nominal columns/variables that have too many levels? For example, I have a ...

Daniel

21

asked Oct 21, 2022 at 10:50

1 vote

1 answer

106 views

Why do eigenvalues of $\mathbf\Phi^T\mathbf\Phi$ increase with the size of data set?

The question comes from a paragraph in page 171 of "Pattern Recognition and Machine Learning" by Christopher M. Bishop: Here $\mathbf\Phi$ is the design matrix for a data set of $N$ samples ...

zzzhhh

333

asked Oct 19, 2022 at 5:41

1 vote

0 answers

134 views

How to add a constant to a new variable conditional on an existing variable in r? [closed]

I have a dataset for 100 households and 10 years. I created a new variable called x1hat conditional on the household identifier. Then, I assigned the same value as under variable x1 to all households ...

TFT

345

asked Oct 14, 2022 at 16:22

0 votes

1 answer

154 views

Combine two data sets from two different regions

This is actually a very basic question, but I can't get my head around it. I have two datasets for Europe and the U.S. that contain the same two variables. These two variables are in a linear ...

Soda

1

asked Oct 11, 2022 at 8:38

0 votes

0 answers

25 views

How to check statistical correlation of an ordinal variable against a continuous one [duplicate]

I've been trying to run some stats to check for correlation between ordinal categories (such as body condition classes) and a continuous variable (such as body measurements in cm). I'm really ...

Jaq

1

asked Oct 11, 2022 at 4:57

1 vote

0 answers

45 views

Missing trade data for difference in differences model

I have missing monthly trade data for my dependent variable in the DiD model. Are there any methods to compute the missing data, so it does not lead to a bias? Also, for many products, there are no ...

kz500

11

asked Oct 10, 2022 at 3:52

1 vote

0 answers

68 views

Provide an example of a dataset where maximum likelihood is inapplicable as third moments and fourth moments "assumptions" do not apply

An additional complication arises with estimation, since maximum likelihood estimation may not be feasible without making unrealistically strong "assumptions" about third‐ and ...

user318514

asked Oct 6, 2022 at 17:18

0 votes

1 answer

90 views

Is it correct to use Wilcoxon signed rank test on mean data?

I have X individuals and 2 categories of interest (category A and category B) per individual. The problem is that I have variability in the number of measures between individuals and between ...

anon

asked Sep 28, 2022 at 0:44

2 votes

0 answers

120 views

Neural network not working as expected - Autonomous Driving

Background/problem I am trying to solve: I have vehicle timeline data for position, velocity, accelerations for all vehicles on a section of a motorway for ~30min where adjacent data points are spaced ...

hokage007

21

asked Sep 15, 2022 at 10:51

2 votes

0 answers

123 views

How to interpret the relationship between batch size and bootstrap count in a specific paper?

In the paper "Active Learning for Natural Language Parsing and Information Extraction", the author mentioned: In tests on this data, test examples were chosen independently for 10 trials ...

LCheng

221

asked Sep 12, 2022 at 4:49

0 votes

1 answer

141 views

Categorical data for binary classification [duplicate]

ML newbie here. I'm preparing my data for a binary classification to predict whether a person has an account or not. In total I have 8 variables: 2 numeric (age and household size) and 6 categorical. ...

adam g.

21

asked Sep 10, 2022 at 0:28

5 votes

1 answer

965 views

Why do I get different results after shuffling data using DBSCAN

Sometimes, by simply shuffling my data, without changing the parameters, I get a different cluster result using sklearn.DBSCAN. Why does this happen? I mean, by shuffling data, the data distribution is the ...

Brown_Z

53

asked Sep 8, 2022 at 17:54

0 votes

1 answer

53 views

Climatological datasets [closed]

What are some good repositories of climatological data? Is there a gold standard used in the community? I need some datasets with gridded data with a reasonably good resolution, both temporal (...

AbateFaria

207

asked Sep 5, 2022 at 10:09

1 vote

0 answers

83 views

Differentiate between two set of points

Consider two sets of points (in the pictures below) whose "center of gravity" is the same. What measure can differentiate between the two sets? e.g. Image 1 ...

s510

161

asked Aug 30, 2022 at 13:55

3 votes

0 answers

1k views

Combining multiple stock time series to one data set for LSTM

I am trying to predict daily stock return volatility using an LSTM network. My data comprises price data of five different stocks, over the same time frame. My question, to which I have not found an ...

Draenth

31

asked Aug 26, 2022 at 7:07

1 vote

0 answers

49 views

How to analyze datasets collected by different companies

I have data from six different experimental groups that were collected by two different companies (three groups were collected by company A and the other three by company B). I'm interested in comparing a ...

bettertospeakordie

21

asked Aug 24, 2022 at 22:36

Questions tagged [dataset]