Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
1
vote
0
answers
130
views
Train-test split within modeling function
I have this function I wrote:
...
2
votes
2
answers
231
views
Algorithm for detecting collective outliers
What algorithm should I go for if I want to determine collective outliers within a dataset?
By collective outliers, I mean a series of data points that differ significantly from the trends in the rest of ...
3
votes
4
answers
398
views
Can Positive Values of Observations imply Population Positive Mean?
Suppose that we have $N$ observations of a random variable $V$ from a population:
$$\{v_1,\;\ldots,\;v_N\}$$
and all of the observations are positive: $v_i>0\quad \forall i=1,\;\ldots,\;N$.
Here, ...
3
votes
1
answer
1k
views
Are coordinates (e.g. Point(123, -123)) considered interval data?
I understand that latitude and longitude are interval data, but are coordinates (e.g. Point(123, -123)) also considered interval data? If so, how can the standard deviation, mean etc. be calculated?
1
vote
0
answers
34
views
How do I create a dataset for vacation home bookings? [closed]
My family is renting out a vacation home and I would like to create a dataset in R to calculate statistics on the bookings (for some helpful insights and also for me to practice R).
I have a question ...
2
votes
0
answers
56
views
Datasets with long tail of eigenvalues of the covariance?
In most datasets I use, the spectrum of the covariance matrix decays to 0 quite fast, meaning that they are more or less low-rank.
My question is whether there are settings or disciplines that are ...
0
votes
2
answers
212
views
Random data generation for hurdle model using R [closed]
I would like to generate random data from a distribution that takes the form of a hurdle-like model. Let's denote a random variable X with probability mass function
\begin{align}
Pr(X=0)&=\alpha\\
...
0
votes
0
answers
91
views
Comparing clustering performance of two datasets?
For example:
Let's say I have dataset A:
Measured body temperature of a person during the day.
I have measurements from 3 people in the span of a year.
If I cluster it, I expect the clusters to ...
0
votes
1
answer
552
views
Hyperparameter tuning in SelectKBest feature selector
I am working with a pretty large dataset containing 760 rows and around 58k-60k features, and I'd like to perform feature selection to reduce its dimensionality. After standardising the ...
0
votes
0
answers
74
views
How limited is my dataset?
So I have multiple datasets which are all histograms and cannot be linked. The topic of the data is quite complex so for example imagine I was surveying different qualities between men and women. The ...
2
votes
1
answer
138
views
Making my linear regression model meet assumptions causes a large increase in mean squared error
I was creating a linear regression model on a particle collisions dataset. I observed that my model violates several assumptions of linear regression, and when I tried to fix them, it increased the ...
7
votes
4
answers
3k
views
Data Imbalance: what would be an ideal number (ratio) of samples for newly added classes?
Assume that I have 10 classes with 100 samples each: the same number of samples per class, a perfectly balanced dataset.
I want to add 3 new classes, and which of the following is the best option for the number of ...
1
vote
0
answers
51
views
Training a dataset with just one variable [closed]
It has been suggested that I train my model on a single labeled type of internet traffic and then test it on both traffic types, i.e., normal and DDoS traffic.
Is it possible to train the model with just one ...
1
vote
1
answer
67
views
My max target value is seen quite frequently since the data source has a threshold on what they can measure. Should I remove these data?
I'm working on a regression problem involving nutrient concentrations. The lab I'm getting my data from can measure up to 9000ppm of a particular nutrient. Beyond that, everything is reported as ...
1
vote
0
answers
29
views
Comparing forecast models for signs of conversion
I am trying to analyze two external forecast models for weather data that each generate hourly forecasts twice a day for one week ahead. Thereby I get a panel-like dataset, in which I am interested in ...
2
votes
1
answer
219
views
Predicting Win Probabilities In-Game
I'm looking to predict the probability of a team winning a game in basketball so that I can create something close to this:
If you can't see the picture, it's a graph with time remaining on the x-...
1
vote
1
answer
133
views
How to compute score on separate test set after k-fold cross validation on separate train set?
I am aware there are quite a few similar questions, but none of the answers dealt with the following situation:
I have a task with train dataset and test dataset provided. All previous approaches are measured ...
2
votes
0
answers
422
views
NLP - How to deal with a dataset where some spaces between words are missing
I've been normalizing a dataset and after tokenizing my words I've noticed that some records contain combinations of words where the spaces between them are missing.
...
0
votes
0
answers
37
views
Why, when $P$ and $Q$ don't exist on the same coordinates, do they need to be reconciled (processed) to exist on the exact same cells in order to calculate RMSE?
I have a question about the root mean square error and Wasserstein distance in the paper https://arxiv.org/abs/2111.08736?context=stat. Consider two discrete probability distributions $P=\{P_i\}_{i=1}...
1
vote
1
answer
183
views
Choosing class-balance of training dataset for unbalanced binary classification problem
There are many discussions on here about techniques for handling unbalanced datasets, e.g.
Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?. My question is different ...
1
vote
1
answer
772
views
Choice of time origin for survival analysis when no specific event or beginning of study can be chosen
I'm currently struggling with the choice of time origin for survival analysis in my data.
My data comes from an ongoing clinical database of patients who all have the same genetic disease. In it, I ...
0
votes
1
answer
70
views
How to find a mapping to a higher dimension that separates the data, given a data set
We have the following dataset:
$$ \begin{bmatrix}
x_1 & x_2 & y\\
+1 & 0 & +1\\
-1 & 0 & +1\\
0 & +2 & +1\\
0 & +1 & -1
\end{bmatrix} $$
I was asked to find ...
1
vote
0
answers
54
views
Applying statistical test on different data [closed]
We frequently apply various statistical tests to data, including the stationarity test, the t-test, tests for randomness, etc. They are typically used with direct data, or information obtained ...
1
vote
1
answer
1k
views
Why don’t we split the dataset into training and testing set if the sample size is small?
I learned in school that we don't split the dataset into training and testing sets if the sample size is less than 30. I wonder why we don't?
1
vote
0
answers
75
views
Is it possible to use my dataset for survival analysis, if so, how should I adapt it?
I'll try to summarize my situation as best as I can first.
I'm doing a project in which we're trying to build a prognostic model for a specific disease to determine the risk of an outcome happening to ...
4
votes
2
answers
463
views
Summarising and Visualising three attributes in R
I am trying to summarise and visualise three attributes in R:
Patient_Age, Patient_Deprivation and Hospital_Time.
I am trying to summarise the time patients spent in hospital by deprivation (scale 1-5) ...
1
vote
1
answer
106
views
dataset in log & lin-lin regression function vs. dataset not in log & log-log regression function: Why different results in R?
I've discovered by chance that R produces different results when using a dataset which has been first transformed by the natural logarithm and then loaded into R for analysis and when the dataset is ...
3
votes
1
answer
151
views
Calculating Potential Usefulness of Acquiring Additional Data
Imagine Anne has a labeled training dataset for a machine learning prediction problem. There is an opportunity to acquire more data from an agent, at a cost. However, before she decides to acquire ...
0
votes
0
answers
140
views
How to show that adding more data improves model quality?
I am working on a project for university and I would like to demonstrate that adding more rows to the same model (linear regression, for instance, or other types of models) increases the model quality. ...
0
votes
0
answers
26
views
Generating more data rows for a binary sparse matrix
I have a matrix that has 50 named rows and 15 named columns. The matrix contains 0 and 1 values and is 85% sparse, similar to the 'last.fm dataset'. I would like to intelligently create more rows from my ...
1
vote
2
answers
90
views
Add training data to fix prediction problems
Let's suppose I have a classification problem with three y classes {bad, middle, good}.
Data in my dataset and also data in reality are identically distributed between the three labels (33%, 33%, ...
3
votes
1
answer
431
views
Compositional Data in R
I'm writing a paper on the Aitchison geometry for compositional data and I have seen an image I want to reproduce in R.
I work with the "compositions" library and I want to understand how to ...
0
votes
0
answers
193
views
Linearly dependent columns in dataset
I need to do the 'cleaning' of the dataset, i.e., preprocessing. I noticed that I have two columns in the dataset that are totally linearly dependent. Is it okay to delete one, both or am I not allowed ...
1
vote
1
answer
729
views
Calculating Avg Days to Pay: Which formula to use?
I want to find the average days to pay for a client based on last year's many invoices. Nearly all of them are paid.
Most formulas for this activity rely on outstanding amounts for the client. In this ...
0
votes
0
answers
466
views
Skewed distributions: choosing the median or average
Goal:
To determine whether to use the median or average (or a weighted combination of the two) from a data set based on whether ...
3
votes
1
answer
97
views
Efficient storage of functional data
I have access to a sample (size $N$) of functional data. Each observation corresponds to $C$ functions. Each function $f_{n,c}$ is represented by $T_n$ points for $1\leq n \leq N, 1\leq c \leq C$. All ...
0
votes
1
answer
385
views
Separating datasets vs one dataset with extra categorical feature
I have a regression/classification problem. The dataset contains data from 4 sensors at 4 positions (1, 2, 3, 4). The processes measured at all 4 positions are equivalent, and the same label and features describe all 4 ...
0
votes
0
answers
104
views
How to choose the best recommender system? What evaluation metrics to use?
I want to build a recommender system to suggest similar songs to continue a playlist (similar to what Spotify does by recommending similar songs at the end of a playlist).
I want to build two models: ...
1
vote
1
answer
329
views
How to evaluate complementary datasets for ML models?
Evaluating ML models is a fundamental task and subfield of the Machine Learning practice. On the other hand, I was not able to find any existing materials, guides, protocols, papers on how to proceed ...
0
votes
2
answers
607
views
Comparing impact of training data size - what testing data size?
I am training a classifier using BERT and want to check how the accuracy changes with increasing training data size. Up until now, I have 1k annotated training samples and tested the accuracy for ...
5
votes
2
answers
236
views
Will a dataset with multiple labels perform better than with binary labels?
Suppose I have a dataset composed of garbage items. Will a model perform better if I only label the dataset with biodegradable or non-biodegradable?
Or will it be better if I label them with plastics, ...
2
votes
2
answers
2k
views
Is the salary of a group of people continuous or discrete?
I have salary data for 3000 employees, ranging from 3000 to 10000 dollars.
Based on my understanding (https://mathbitsnotebook.com/Algebra1/FunctionGraphs/FNGContinuousDiscrete.html):
Continuous data is a ...
0
votes
0
answers
130
views
Finding an optimal distribution to fit to a highly skewed data vector (DV with missing values) in R
I hope that this question has not already been asked.
I am analyzing data in R (and am a novice).
I have a highly skewed data vector in a dataframe with missing values that I hope to set as the ...
0
votes
0
answers
70
views
Rule based label - For attrition risk
I have 3 domains of supplier data (Jan 2017 to Jan 2022) and they are as follows
a) Purchase data - Contains all the purchase (of product) data made by the suppliers with us. It contains columns such ...
1
vote
0
answers
31
views
For the given data find the clusters. Assume the relevant parameters needed [closed]
Below is the given data, how can I make clusters using symmetric matrix?
2
votes
1
answer
265
views
Paired t-test vs two-sample t-test - animal populations
From my understanding, a paired t-test is used when samples are dependent on each other.
I'm having trouble deciding whether a paired t-test should be used when comparing the average population of a ...
0
votes
0
answers
135
views
Effect of duplicate/redundant labels on performance of model
I am training a CNN to predict age, mass, and tone from images.
The structure of my dataset is as follows:
...
0
votes
1
answer
53
views
What are the best-practices for validating phone records? [closed]
I have a bunch of telephone data including information on call start time, end time, and duration. I am trying to evaluate the quality of the dataset to determine if the phone call data are legitimate ...
0
votes
0
answers
109
views
Hyperparameter tuning on training data vs validation data
If we divide the data into training data, validation data, and testing data, I remember the lesson from Andrew Ng saying we use the validation data for hyperparameter tuning.
(you can see this ...
1
vote
0
answers
31
views
Finding a dataset for a computer vision project related to medical imaging (related to cancer/tumor) [closed]
I am trying to find a dataset of medical images related to tumors/cancer. There should be images of different stages of the cancer and also preferably the details about the patient, their medical history, ...