Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
2
votes
1
answer
122
views
Cross-validation: error estimation and bias
When estimating the error of a model over a dataset using k-fold cross-validation, do lower values of the error estimate necessarily imply a lower bias? Are both concepts, error estimation ...
3
votes
0
answers
4k
views
Synthesize data given mean, variance, skew, and kurtosis in python [closed]
I would like to generate synthetic data by specifying their mean, variance, skew, and kurtosis. However, I only know how to generate synthetic data with mean and var.
Here is an example with mean and ...
0
votes
0
answers
135
views
What is a good type of transformation to try on this data to get normal distribution?
I've tried a log-transformation but data becomes left-skewed.
In general, when data is distributed like this, what is the next best transformation to try if you want to normalize the data?
2
votes
1
answer
287
views
The percentile method
A study used the percentile method to find the prevalence in the severe group, as below:
The outcome variable was categorized using percentiles to Mild, Moderate, and Severe in the Symptom Severity ...
0
votes
1
answer
174
views
How to learn a filter on a dataset?
Let $X$ be a tabular dataset with $N$ features indexed by row number and a categorical value "Cat". Let $A$ be an aggregation function, e.g. ...
0
votes
1
answer
641
views
Grouping using percentiles
I used the visual binning process in SPSS and made three cutpoints, as in this image:
I did check off included
I want to know the percentage range or value for each group.
If I describe it in this way is ...
0
votes
0
answers
113
views
How do I decide the frequency of data capture for modeling?
I plan to capture data to predict energy consumption in a food processing plant. I want to capture production details such as how much each category of food is produced, what is the machine's output, ...
0
votes
1
answer
141
views
"Ordinal" is to "level of measurement" as "dependent variable" is to __________ [closed]
The Question
If I'm drawing up a table that describes some variables -- a data dictionary, say -- and I consider whether a variable is "ordinal", or "nominal", or "interval" ...
3
votes
2
answers
208
views
A/B test with a result of another deep learning analysis with 90% accuracy
I am planning to conduct an A/B test with data obtained through a deep learning algorithm. Say, I got a binary classification dataset through machine learning with about 100k rows classified into yes, ...
0
votes
1
answer
107
views
About ideal data and its distribution
I was just thinking about what would be the properties of an ideal data set $X \in R^{n,d}$ where n is sample size, d represents features. I think (or at least I understood from reading text books) ...
1
vote
1
answer
203
views
Is it true that a larger, representative dataset is always better to use than a smaller, representative dataset?
By "representative" I mean that the data in the dataset faithfully reflects the "underlying signal" a model is trying to tap into. Is it always true that, as long as increasing ...
-1
votes
1
answer
148
views
Is it normal for 6% of your dataset to be outliers?
My dataset has 80,886 obs and 16 variables. I am using Mahalanobis distance to detect outliers, with a p-value below 0.001 as the cut-off. I am getting 5,423 obs flagged as outliers, which is 6% of the total ...
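As a rough sanity check on the figures in this excerpt, the arithmetic can be sketched as follows. It assumes (the excerpt does not say so) that the data are roughly multivariate normal, in which case a p < 0.001 cut-off on the Mahalanobis distance would be expected to flag only about 0.1% of observations. The variable names are illustrative only.

```python
# Figures quoted in the excerpt.
n_obs = 80_886       # total observations
n_flagged = 5_423    # observations flagged as outliers
alpha = 0.001        # p-value cut-off

# Share of observations actually flagged.
observed_share = n_flagged / n_obs

# ASSUMPTION: under multivariate normality, roughly alpha * n_obs
# observations would fall beyond the cut-off by chance alone.
expected_flagged = alpha * n_obs

# How many times more observations were flagged than expected.
ratio = n_flagged / expected_flagged

print(f"observed share: {observed_share:.2%}")
print(f"expected flagged under normality: {expected_flagged:.0f}")
print(f"observed / expected: {ratio:.1f}x")
```

A large gap between the observed and expected counts would suggest that the normality assumption does not hold for the data, rather than that thousands of rows are genuine outliers.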
1
vote
1
answer
678
views
In data setup for Cox regression, how to handle a subject's time before treatment of interest (i.e. before time-zero)?
Background
I'm designing a study that models time-to-event for two groups of study subjects: people who receive a treatment and those who do not. I'm fairly new to applied survival analysis, so I've ...
1
vote
2
answers
2k
views
How to solve the problem of having sparse data that would become too small when aggregated?
I have a dataset that provides the count of cyber incidents since 2011 for different countries and different attack types, and I want to use this data in a machine learning model to predict future ...
1
vote
1
answer
66
views
Measuring how harmonious or organized a dataset is
Suppose we have two sets of numbers: A = [1,4,9,16,25,49...100] and B = [1,4,7,7,25,49,64...100]. As you can see, the first one grows consistently; its elements are squares of numbers. But although ...
1
vote
2
answers
66
views
Data structure for p-value analysis of medical data
I have a good-sized data set containing details and outcomes of various types of injury. I've worked through it and cleaned the data as best I can - it's now consistent and complete.
At this point I'm ...
0
votes
0
answers
19
views
What test is appropriate? [duplicate]
Can I use Welch's one-way ANOVA to compare means between the following three groups?
Group 1 has 83 observations
Group 2 has 15 observations
Group 3 has only 2 observations
And in group 3 the variance is 0 ...
2
votes
1
answer
91
views
Apply machine learning to predict discharging and risk of readmission based on medical data
I am new to machine learning. I have a problem where I have to predict patients to be/not to be discharged using hospital data and depending on that prediction (i.e. if the patient is successfully ...
0
votes
0
answers
238
views
Data comparison of two sensors, to produce a correction factor
I have sensor "A" that is outputting a set of data and I also have another sensor that is acting as the reference sensor. The aim is to get Sensor "A" as close as possible to the ...
2
votes
1
answer
48
views
What is the term for data which do not include multiple variables needed for controlling confounding in analyses?
I have a terminology question that I couldn't answer by googling.
What is the term for data which do not include multiple analytically relevant variables needed for controlling confounding in analyses?...
1
vote
0
answers
37
views
Best resources to gather social media statistics [closed]
I am looking to gather data on the number of people following/viewing digital artists' content on social media like YouTube, TikTok, Instagram etc. I've looked around online but I can only find lists ...
0
votes
0
answers
213
views
Training a machine learning model on data that has several rows for each user
I have a dataset consisting of log files from a smartphone application. Currently, it creates a row each time a user clicks on something, i.e. a user clicks on the homepage, and a new row is created ...
0
votes
0
answers
150
views
How to compare/test two datasets which are not supposed to be exactly equal
An example of such a dataset is the open, high, low, close, and volume data of any stock over the last 5 days. This data is available at one-minute frequency. I want to test the quality of my ...
4
votes
0
answers
312
views
How can you evaluate the representativeness of a sample for a given distribution?
Problem:
I am looking for a metric to find the representativeness of a sample for a given distribution, where the representativeness of a random sample is the degree to which the sample is able to ...
2
votes
0
answers
153
views
What is the proper way to externally validate clusters when I have only a sample of the dataset labeled, but want to cluster the entire dataset?
I have a dataset of text-based documents that I want to cluster. For a sample of this dataset (~10%) I have manually annotated labels (i.e., the ground truth). I would like to cluster this dataset to ...
0
votes
1
answer
210
views
Variable selection with sparse data
I have a dataset with 141 observations and 8 corresponding variables and I mean to apply a GLM to this dataset. However, a lot of observations lack either one or multiple variable values. So if I want ...
1
vote
2
answers
484
views
t-test with data in long-format including levels of non-interesting factor in R
I've computed an ANOVA with one between-subjects factor (2 groups each including 26 participants) and 2 within-subjects factors (item type with 3 levels and emotion with 2 levels) in ezANOVA. ...
1
vote
1
answer
210
views
How to scale data for model retraining on production?
Let's say I have a basic regression model being used in production and now I want to implement periodical model retraining (i.e. once a month) where I take a batch of new data from last month and fit ...
3
votes
0
answers
139
views
How to identify small, moderate or large data sample sizes
I want to run a probit or a logit model and I am curious about the choice regarding the data sample size that I have. In an answer to a previous question, 'Probit vs Logit', I read that "Probit is ...
1
vote
0
answers
801
views
How to guarantee the test set is "independent"?
In Machine Learning (ML) tasks, one splits the dataset into training and test sets. We train the ML model based on the training set, and then we evaluate the performance of the model with the test ...
0
votes
1
answer
206
views
In regression, when we standardise the data, do we need an intercept?
I would like to know: when we standardize the data and then apply linear regression or Bayesian regression, do we need an intercept or not? Or does it have nothing to do with standardization?
0
votes
1
answer
147
views
For univariate data, why do we need the normalmixEM function in R instead of just computing the mean and variance with the basic methods?
I can understand why if you plot a histogram of your univariate data (1 column?) and it seems to have 2+ peaks, i.e. a mix of more than one Gaussian. But what if you plot a histogram and there looks to be ...
0
votes
1
answer
145
views
Questions on Data Quality Assessment
I have been banging my head against the wall trying to figure out a good real-world solution for this challenging problem that my friend asked me about.
Could you please give some pointers?
Let's say we want ...
0
votes
1
answer
1k
views
Are normalizing and scaling different?
This is the histogram of the original data; I have a data set and plotted it with DataFrame.hist():
After that I applied the zscore function to my data set and plotted this histogram:
After I have applied zscore, I ...
4
votes
3
answers
2k
views
Which are outliers?
I am in the process of solving a Machine Learning challenge, and I want to do it the right way.
I did some exploratory data analysis and I wanted to check the distribution of the data.
As displayed in ...
4
votes
1
answer
599
views
Calculating a 95% confidence interval without the original dataset
I have been asked to calculate a 95% confidence interval based on the following paper: https://www.nber.org/system/files/working_papers/w26107/w26107.pdf
The question reads:
Suppose rainfall in the ...
0
votes
0
answers
73
views
In Bayesian Statistics, can the data be a random variable drawn from the posterior of a separate model? Will the uncertainty flow through?
In the traditional Bayesian hierarchical approach, you typically have a hierarchy built on your coefficients. That is to say, you might have your coefficient of interest distributed around some ...
1
vote
1
answer
53
views
How to read the following table of data?
I have downloaded the following table from https://www.ons.gov.uk for car rental market.
Does Turnover (£000s) mean the values should be multiplied by 1000?
I.e....
2
votes
3
answers
369
views
Feature not applicable to some samples
I am working with a private medical dataset including categorical features coming from patients' examinations.
However, the problem is that some patients underwent MRI, others scanner, and some ...
0
votes
1
answer
38
views
How to Analyze Data to optimize committee Allocations? [closed]
On Google Sheets, I have collected responses from my team members to assign them to be committee members in any of the 4 following committees: Internal, External, Membership, Speaker Management. ...
0
votes
1
answer
71
views
Are these two datasets interval?
I have two datasets: population density and case fatality rate.
Population density is measured as number of people living in an area divided by that area size in square miles (number of people/area ...
1
vote
1
answer
41
views
Is it a good practice to collect different numbers of data (randomly picked) per class?
I am totally new to data science and neural networks. So I want to make a simple chord recognition neural network using chroma data from audio. I collected some recordings and songs, and then divide ...
0
votes
1
answer
123
views
Statistical analysis of distributed values in Java [closed]
I am writing a program in Java that outputs a List<Double> of distances that roughly follow a bell curve distribution. From this data, I need to generate two ...
0
votes
1
answer
61
views
How should my data be formatted for 1v1 match prediction?
I want to build a model that predicts 1v1 tennis match outcomes. What is the best way to lay out my data? Essentially, is it better to have 1 row per match or create rows/observations from each player's ...
0
votes
0
answers
85
views
Scientific way to construct dataset for text classification
BTOG.
I need to develop a machine learning algorithm that matches random text to predefined categories. The texts I need to make predictions for are web page text.
I know there are many machine learning ...
0
votes
1
answer
65
views
How variability with different data densities could affect comparisons among environmental variables
I have several sets of data continually measured and recorded by instruments. The periods of record are 30+ years and the frequency of measurement can be every 5, 10, 15, or 30 minutes (288, 144, 96, ...
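The per-day counts quoted in this excerpt follow from dividing the number of minutes in a day by the sampling interval. A minimal sketch of that arithmetic, where the function name `readings_per_day` is purely illustrative:

```python
MINUTES_PER_DAY = 24 * 60  # 1440 minutes in a day


def readings_per_day(interval_minutes: int) -> int:
    """Number of measurements per day at a fixed sampling interval."""
    return MINUTES_PER_DAY // interval_minutes


# Sampling intervals (in minutes) mentioned in the excerpt.
intervals = (5, 10, 15, 30)
counts = {minutes: readings_per_day(minutes) for minutes in intervals}

for minutes, n in counts.items():
    print(f"every {minutes} min: {n} readings per day")
```

Because the number of readings per day differs by up to a factor of six across these intervals, summary statistics computed from the raw series are not based on equally dense data.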
0
votes
1
answer
69
views
How to structure this multi-dimensional data for AR modelling?
I have a time-series dataset for each month for the past three years, which represents quoted prices for the same product but with different delivery months.
For example, Jul-19 is a dataset consisting ...
0
votes
1
answer
60
views
Data Wrangling for Modeling in R [closed]
I have a data set (original version, # A tibble: 33,478 x 12) of the form similar to the attached picture, and partial data:
...
0
votes
0
answers
51
views
Uncertainty estimation in the input space
My input is an array between 0 and 1000 and the output is the corresponding system velocity. The input value is randomly generated (for instance by using the function in Python ...
0
votes
1
answer
96
views
Is my data set stationary?
Here is what I obtain when I plot my data set in R. I am now wondering whether this data set is stationary or not. I'm assuming it is stationary since it has no visible trends or seasonality. However, ...