Newest 'dataset' Questions - Page 5

1 vote

0 answers

130 views

Train-test split within modeling function

I have this function I wrote: ...

Igor9094

31

asked Aug 19, 2022 at 20:13

2 votes

2 answers

231 views

Algorithm for detecting collective outliers

What algorithm should I go for if I want to determine collective outliers within a dataset? By collective outliers, I mean a series of data points that differ significantly from the trends in the rest of ...

Iamtrying

33

asked Aug 15, 2022 at 13:01

3 votes

4 answers

398 views

Can Positive Values of Observations imply Population Positive Mean?

Suppose that we have $N$ observations of a random variable $V$ from a population: $$\{v_1,\;\ldots,\;v_N\}$$, and all of the observations are positive: $v_i>0\quad \forall i=1,\;\ldots,\;N$. Here, ...

M.C. Park

985

asked Aug 12, 2022 at 16:18

3 votes

1 answer

1k views

Are coordinates (e.g. Point(123, -123)) considered interval data?

I understand that latitude and longitude are interval data, but are coordinates (e.g. Point(123, -123)) also considered interval data? If so, how can the standard deviation, mean etc. be calculated?

Verum

31

asked Aug 10, 2022 at 12:57

1 vote

0 answers

34 views

How do I create a dataset for vacation home bookings? [closed]

My family is renting out a vacation home and I would like to create a dataset in R to calculate statistics on the bookings (for some helpful insights and also for me to practice R). I have a question ...

Sofia

11

asked Aug 8, 2022 at 9:25

2 votes

0 answers

56 views

Datasets with long tail of eigenvalues of the covariance?

In most datasets I use, the spectrum of the covariance matrix decays to 0 quite fast, meaning that they are more or less low-rank. My question is whether there are settings or disciplines that are ...

Community wiki

Roy

0 votes

2 answers

212 views

Random data generation for hurdle model using R [closed]

I would like to generate random data for a distribution which takes the form of hurdle-like model. Let's denote a random variable X with probability mass function \begin{align} Pr(X=0)&=\alpha\\ ...

RRMT

382

asked Jul 27, 2022 at 3:08

0 votes

0 answers

91 views

Comparing clustering performance of two datasets?

For example: Let's say I have dataset A: Measured body temperature of a person during the day. I have measurements from 3 people in the span of a year. If I cluster it, I expect the clusters to ...

user326964

101

asked Jul 26, 2022 at 15:37

0 votes

1 answer

552 views

Hyperparameter tuning in SelectKBest feature selector

I am working with a pretty large dataset containing 760 rows and around 58k-60k features and I'd like to perform a feature selection to reduce the dimensionality of those. After standardising the ...

Julen

23

asked Jul 26, 2022 at 10:08

0 votes

0 answers

74 views

How limited is my dataset?

So I have multiple datasets which are all histograms and cannot be linked. The topic of the data is quite complex so for example imagine I was surveying different qualities between men and women. The ...

sputnik44

1

asked Jul 25, 2022 at 14:04

2 votes

1 answer

138 views

Making my linear regression model meet assumptions causes a large increase in mean squared error

I was creating a linear regression model on a particle collisions dataset. I observed that my model breaks several assumptions in linear regression, and when I tried to fix them, it increased the ...

Featherball

216

asked Jul 21, 2022 at 14:52

7 votes

4 answers

3k views

Data Imbalance: what would be an ideal number (ratio) of samples for newly added classes?

Assume that I have 10 classes with 100 samples for each class (the same number of samples, a perfectly balanced dataset). I want to add 3 new classes, and which of the following is the best option for the number of ...

Kevin Choi

183

asked Jul 21, 2022 at 9:03

1 vote

0 answers

51 views

Training a dataset with just one variable [closed]

It has been suggested that I train my model on a single labeled type of internet traffic and then test it with both traffic types, i.e., normal and DDoS traffic. Is it possible to train the model with just one ...

Akash Singh

11

asked Jul 20, 2022 at 1:57

1 vote

1 answer

67 views

My max target value is seen quite frequently since the data source has a threshold on what they can measure. Should I remove these data?

I'm working on a regression problem involving nutrient concentrations. The lab I'm getting my data from can measure up to 9000ppm of a particular nutrient. Beyond that, everything is reported as ...

Viv Crowe

11

asked Jul 19, 2022 at 17:04

1 vote

0 answers

29 views

Comparing forecast models for signs of conversion

I am trying to analyze two external forecast models for weather data that each generate hourly forecasts twice a day for one week ahead. Thereby I get a panel-like dataset, in which I am interested in ...

kriskj

11

asked Jul 18, 2022 at 11:01

2 votes

1 answer

219 views

Predicting Win Probabilities In-Game

I'm looking to predict the probability of a team winning a game in basketball so that I can create something close to this: If you can't see the picture, it's a graph with time remaining on the x-...

sla813

85

asked Jul 17, 2022 at 18:06

1 vote

1 answer

133 views

How to compute score on separate test set after k-fold cross validation on separate train set?

I am aware there are quite a few similar questions, but none of the answers dealt with the following situation: I have a task with a train dataset and a test dataset provided. All previous approaches are measured ...

Baltazar Gąbka

11

asked Jul 8, 2022 at 14:54

2 votes

0 answers

422 views

NLP - How to deal with a dataset where some spaces between words are missing

I've been normalizing a dataset and after tokenizing my words I've noticed that some records contain combinations of words where the spaces between them are missing. ...

Tolure

121

asked Jul 6, 2022 at 12:33

0 votes

0 answers

37 views

Why $P$ and $Q$ don't exist on the same coordinate, they need to be reconciled (processed) to exist on the exact same cells in order to calculate RMSE

I have a question about the root mean square error and Wasserstein distance on the paper https://arxiv.org/abs/2111.08736?context=stat. Consider two discrete probability distributions $P=\{P_i\}_{i=1}...

oliver

1

asked Jul 6, 2022 at 1:07

1 vote

1 answer

183 views

Choosing class-balance of training dataset for unbalanced binary classification problem

There are many discussions on here about techniques for handling unbalanced datasets, e.g. Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?. My question is different ...

Ryan Volpi

2,008

asked Jul 2, 2022 at 13:34

1 vote

1 answer

772 views

Choice of time origin for survival analysis when no specific event or beginning of study can be chosen

I'm currently struggling with the choice of time origin for survival analysis in my data. My data comes from an ongoing clinical database of patients who all have the same genetic disease. In it, I ...

floubert

67

asked Jul 1, 2022 at 18:27

0 votes

1 answer

70 views

How to find a mapping to a higher dimension that separates the data, given a data set

We have the following dataset: $$ \begin{bmatrix} x_1 & x_2 & y\\ +1 & 0 & +1\\ -1 & 0 & +1\\ 0 & +2 & +1\\ 0 & +1 & -1 \end{bmatrix} $$ I was asked to find ...

user361992

35

asked Jul 1, 2022 at 12:20

1 vote

0 answers

54 views

Applying statistical test on different data [closed]

We frequently apply various statistical tests to data, including the stationarity test, the t-test, tests for randomness, etc. They are typically used with direct data, or information obtained ...

bogus

65

asked Jun 29, 2022 at 7:36

1 vote

1 answer

1k views

Why don’t we split the dataset into training and testing set if the sample size is small?

I learned in school that we don't split the dataset into training and testing sets if the sample size is less than 30. I wonder why we don't?

Anna Quoc Nguyen

181

asked Jun 29, 2022 at 2:57

1 vote

0 answers

75 views

Is it possible to use my dataset for survival analysis, if so, how should I adapt it?

I'll try to summarize my situation as best as I can first. I'm doing a project in which we're trying to do a prognostic model on a specific disease to determine the risk of an outcome happening to ...

floubert

67

asked Jun 27, 2022 at 15:25

4 votes

2 answers

463 views

Summarising and Visualising three attributes in R

I am trying to summarise and visualise three attributes in R: Patient_Age, Patient_Deprivation and Hospital_Time. I am trying to summarise the time patients spent in hospital by deprivation (scale 1-5) ...

Usman YousafZai

141

asked Jun 27, 2022 at 13:53

1 vote

1 answer

106 views

dataset in log & lin-lin regression function vs. dataset not in log & log-log regression function: Why different results in R?

I've discovered by chance that R produces different results when using a dataset which has been first transformed by the natural logarithm and then loaded into R for analysis and when the dataset is ...

TFT

345

asked Jun 23, 2022 at 5:50

3 votes

1 answer

151 views

Calculating Potential Usefulness of Acquiring Additional Data

Imagine Anne has a labeled training dataset for a machine learning prediction problem. There is an opportunity to acquire more data from an agent, at a cost. However, before she decides to acquire ...

NGInd

75

asked Jun 19, 2022 at 16:08

0 votes

0 answers

140 views

How to show that adding more data improves model quality?

I am working on a project for university and I would like to demonstrate that adding more rows to the same model (linear regression, for instance, or other types of models) increases the model quality. ...

Annis99

11

asked Jun 11, 2022 at 16:46

0 votes

0 answers

26 views

Generating more data rows for a binary sparse matrix

I have a matrix that has 50 named rows and 15 named columns. The matrix contains 0 and 1 values and is 85% sparse, similar to the 'last.fm dataset'. I would like to intelligently create more rows from my ...

UsrnmChck

1

asked Jun 10, 2022 at 12:30

1 vote

2 answers

90 views

Add training data to fix prediction problems

Let's suppose I have a classification problem with three y classes {bad, middle, good}. Data in my dataset and also data in true reality are identically distributed between the three labels (33%, 33%, ...

Luca

11

asked Jun 9, 2022 at 21:50

3 votes

1 answer

431 views

Compositional Data in R

I'm writing a paper on the Aitchison geometry for compositional data and I have seen an image I want to reproduce in R. I work with the "compositions" library and I want to understand how to ...

vitalmath

131

asked May 30, 2022 at 20:38

0 votes

0 answers

193 views

Linearly dependent columns in dataset

I need to do the 'cleaning' of the dataset, i.e., preprocessing. I noticed that I have two columns in the dataset that are totally linearly dependent. Is it okay to delete one, both or am I not allowed ...

Guest0225

1

asked May 29, 2022 at 0:25

1 vote

1 answer

729 views

Calculating Avg Days to Pay: Which formula to use?

I want to find the average days to pay for a client based on last year's many invoices. Nearly all of them are paid. Most formulas for this activity rely on outstanding amounts for the client. In this ...

jabs

199

asked May 27, 2022 at 15:39

0 votes

0 answers

466 views

Skewed distributions: choosing the median or average

Goal: To determine whether to use the median or average (or a weighted combination of the two) from a data set based on whether ...

p.luck

101

asked May 18, 2022 at 10:34

3 votes

1 answer

97 views

Efficient storage of functional data

I have access to a sample (size $N$) of functional data. Each observation corresponds to $C$ functions. Each function $f_{n,c}$ is represented by $T_n$ points for $1\leq n \leq N, 1\leq c \leq C$. All ...

noob

2,620

asked May 18, 2022 at 5:41

0 votes

1 answer

385 views

Separating datasets vs one dataset with extra categorical feature

I have a regression/classification problem. The dataset contains data from 4 sensors at 4 positions (1,2,3,4). The processes measured at all 4 positions are equivalent, and the same label and features describe all 4 ...

ziga

3

asked May 11, 2022 at 9:03

0 votes

0 answers

104 views

How to choose the best recommender system? What evaluation metrics to use?

I want to build a recommender system to suggest similar songs to continue a playlist (similar to what Spotify does by recommending similar songs at the end of a playlist). I want to build two models: ...

Pybubb

1

asked May 10, 2022 at 13:44

1 vote

1 answer

329 views

How to evaluate complementary datasets for ML models?

Evaluating ML models is a fundamental task and subfield of the Machine Learning practice. On the other hand, I was not able to find any existing materials, guides, protocols, papers on how to proceed ...

Betelgeux

21

asked May 10, 2022 at 13:25

0 votes

2 answers

607 views

Comparing impact of training data size - what testing data size?

I am training a classifier using BERT and want to check how the accuracy changes with increasing training data size. Up until now, I have 1k annotated training samples and tested the accuracy for ...

Sven

3

asked May 9, 2022 at 9:21

5 votes

2 answers

236 views

Will a dataset with multiple labels perform better than with binary labels?

Suppose I have a dataset of garbage items. Will a model perform better if I only label the dataset with biodegradable or non-biodegradable? Or will it be better if I label them with plastics, ...

wd violet

787

asked May 2, 2022 at 1:16

2 votes

2 answers

2k views

Is the salary of a group of people continuous or discrete?

I have salary data of 3000 employees ranging from 3000 - 10000 dollars. Based on my understanding (https://mathbitsnotebook.com/Algebra1/FunctionGraphs/FNGContinuousDiscrete.html): Continuous data is a ...

user3164187

123

asked Apr 20, 2022 at 3:49

0 votes

0 answers

130 views

Finding an optimal distribution to fit to a highly skewed data vector (DV with missing values) in R

I hope that this question has not already been asked. I am analyzing data in R (and am a novice). I have a highly skewed data vector in a dataframe with missing values that I hope to set as the ...

vochoa213

1

asked Apr 19, 2022 at 16:30

0 votes

0 answers

70 views

Rule based label - For attrition risk

I have 3 domains of supplier data (Jan 2017 to Jan 2022) and they are as follows a) Purchase data - Contains all the purchase (of product) data made by the suppliers with us. It contains columns such ...

The Great

3,380

asked Apr 19, 2022 at 13:37

1 vote

0 answers

31 views

For the given data find the clusters. Assume the relevant parameters needed [closed]

Below is the given data, how can I make clusters using symmetric matrix?

Nivedita

11

asked Apr 19, 2022 at 0:57

2 votes

1 answer

265 views

Paired t-test vs two-sample t-test - animal populations

From my understanding, a paired t-test is used when samples are dependent on each other. I'm having trouble deciding whether a paired t-test should be used when comparing the average population of a ...

user355881

asked Apr 18, 2022 at 14:08

0 votes

0 answers

135 views

Effect of duplicate/redundant labels on performance of model

I am training a CNN to predict age, mass, and tone from images. The structure of my dataset is as follows ...

Sparsh Garg

1

asked Apr 14, 2022 at 19:11

0 votes

1 answer

53 views

What are the best-practices for validating phone records? [closed]

I have a bunch of telephone data including information on call start time, end time, and duration. I am trying to evaluate the quality of the dataset to determine if the phone call data are legitimate ...

324

504

asked Apr 13, 2022 at 14:47

0 votes

0 answers

109 views

Hyperparameter tuning on training data vs validation data

If we divide the data into training data, validation data, and testing data, I remember the lesson from Andrew Ng saying we use the validation data for hyperparameter tuning purpose. (you can see this ...

william007

1,097

asked Apr 13, 2022 at 2:56

1 vote

0 answers

31 views

Finding a dataset for a computer vision project related to medical imaging (related to cancer/tumor) [closed]

I am trying to find a dataset of medical images related to tumor/cancer. There should be images of different stages of the cancer and also, preferably, details about the patient, their medical history, ...

JyotishmanK7719

11

asked Apr 10, 2022 at 8:39
