Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

2 votes
1 answer
122 views

When estimating a model's error over a dataset using k-fold cross-validation, does a lower error estimate necessarily imply a lower bias? Are both concepts, error estimation ...
dreamco9's user avatar
3 votes
0 answers
4k views

I would like to generate synthetic data by specifying their mean, variance, skew, and kurtosis. However, I only know how to generate synthetic data with a given mean and variance. Here is an example with mean and ...
Joseph's user avatar
  • 151
0 votes
0 answers
135 views

I've tried a log transformation, but the data become left-skewed. In general, when data are distributed like this, what is the next best transformation to try if you want to normalize the data?
mike_mussini's user avatar
2 votes
1 answer
287 views

A study used the percentile method to find the prevalence in the severe group, as below: The outcome variable was categorized using percentiles into Mild, Moderate, and Severe in the Symptom Severity ...
kareen kk's user avatar
0 votes
1 answer
174 views

Let $X$ be a tabular dataset with $N$ features indexed by row number and a categorical value "Cat". Let $A$ be an aggregation function, e.g. ...
Rachel's user avatar
  • 229
0 votes
1 answer
641 views

I ran the visual binning process in SPSS and made three cutpoints, as in this image: I did check off included. I want to know the percentage range or value for each group. If I describe it in this way is ...
Stats34's user avatar
  • 57
0 votes
0 answers
113 views

I plan to capture data to predict energy consumption in a food processing plant. I want to capture production details such as how much each category of food is produced, what is the machine's output, ...
NAS_2339's user avatar
  • 223
0 votes
1 answer
141 views

The Question If I'm drawing up a table that describes some variables -- a data dictionary, say -- and I consider whether a variable is "ordinal", or "nominal", or "interval" ...
logjammin's user avatar
  • 759
3 votes
2 answers
208 views

I am planning to conduct an A/B test with data obtained through a deep learning algorithm. Say I got a binary classification dataset through machine learning with about 100k rows classified into yes, ...
Dan K's user avatar
  • 63
0 votes
1 answer
107 views

I was just thinking about what the properties of an ideal data set $X \in \mathbb{R}^{n \times d}$ would be, where $n$ is the sample size and $d$ is the number of features. I think (or at least I understood from reading textbooks) ...
Kadir Gunel's user avatar
1 vote
1 answer
203 views

By "representative" I mean that the data in the dataset faithfully reflects the "underlying signal" a model is trying to tap into. Is it always true that, as long as increasing ...
sangstar's user avatar
  • 131
-1 votes
1 answer
148 views

My dataset has 80,886 obs and 16 variables. I am using Mahalanobis distance to detect outliers, with a p-value less than 0.001 as the cut-off. I am getting 5,423 obs as outliers, which is 6% of total ...
surfffffffff's user avatar
1 vote
1 answer
678 views

Background I'm designing a study that models time-to-event for two groups of study subjects: people who receive a treatment and those who do not. I'm fairly new to applied survival analysis, so I've ...
logjammin's user avatar
  • 759
1 vote
2 answers
2k views

I have a dataset that provides the count of cyber incidents since 2011 for different countries and different attack types, and I want to use this data in a machine learning model to predict future ...
Travelling Salesman's user avatar
1 vote
1 answer
66 views

Suppose we have two sets of numbers: A = [1,4,9,16,25,49...100] and B = [1,4,7,7,25,49,64...100]. As you can see, the first one grows consistently; its elements are squares of numbers. But although ...
student0434's user avatar
1 vote
2 answers
66 views

I have a good-sized data set containing details and outcomes of various types of injury. I've worked through it and cleaned the data as best I can - it's now consistent and complete. At this point I'm ...
Alex McGruder's user avatar
0 votes
0 answers
19 views

Can I use Welch's one-way ANOVA to compare means between the following three groups? Group 1 has 83 observations, Group 2 has 15 observations, and Group 3 has only 2 observations. And in group 3 the variance is 0 ...
kareen kk's user avatar
2 votes
1 answer
91 views

I am new to machine learning. I have a problem where I have to predict patients to be/not to be discharged using hospital data and depending on that prediction (i.e. if the patient is successfully ...
KALHIM's user avatar
  • 21
0 votes
0 answers
238 views

I have sensor "A" that is outputting a set of data and I also have another sensor that is acting as the reference sensor. The aim is to get Sensor "A" as close as possible to the ...
user8400863's user avatar
2 votes
1 answer
48 views

I have a terminology question that I couldn't answer by googling. What is the term for data which do not include multiple analytically relevant variables needed for controlling confounding in analyses?...
st4co4's user avatar
  • 2,327
1 vote
0 answers
37 views

I am looking to gather data on the number of people following/viewing digital artists' content on social media like YouTube, TikTok, Instagram etc. I've looked around online but I can only find lists ...
DeltaChief's user avatar
0 votes
0 answers
213 views

I have a dataset consisting of log files from a smartphone application. Currently, it creates a row each time a user clicks on something, i.e. a user clicks on the homepage, and a new row is created ...
sword134's user avatar
0 votes
0 answers
150 views

An example of such a dataset is the open, high, low, close, volume data of any stock over the last 5 days. This data is available at one-minute frequency. I want to test the quality of my ...
KnowledgeSeeeker's user avatar
4 votes
0 answers
312 views

Problem: I am looking for a metric to measure the representativeness of a sample for a given distribution, where the representativeness of a random sample is the degree to which the sample is able to ...
mcardoner's user avatar
2 votes
0 answers
153 views

I have a dataset of text-based documents that I want to cluster. For a sample of this dataset (~10%) I have manually annotated labels (i.e., the ground truth). I would like to cluster this dataset to ...
BNise's user avatar
  • 21
0 votes
1 answer
210 views

I have a dataset with 141 observations and 8 corresponding variables and I mean to apply a GLM to this dataset. However, a lot of observations lack either one or multiple variable values. So if I want ...
dumei's user avatar
  • 11
1 vote
2 answers
484 views

I've computed an ANOVA with one between-subjects factor (2 groups each including 26 participants) and 2 within-subjects factors (item type with 3 levels and emotion with 2 levels) in ezANOVA. ...
valid's user avatar
  • 45
1 vote
1 answer
210 views

Let's say I have a basic regression model being used in production and now I want to implement periodic model retraining (e.g., once a month) where I take a batch of new data from last month and fit ...
GKozinski's user avatar
  • 121
3 votes
0 answers
139 views

I want to run a probit or a logit model and I am curious about the choice given the sample size that I have. In a previous answer to the question 'Probit vs Logit' I read that "Probit is ...
Collin Focas's user avatar
1 vote
0 answers
801 views

In Machine Learning (ML) tasks, one splits the dataset into training and test sets. We train the ML model on the training set, and then we evaluate the performance of the model with the test ...
Hamed's user avatar
  • 111
0 votes
1 answer
206 views

When we standardize the data and then apply linear regression or Bayesian regression, do we still need an intercept? Or does the intercept have nothing to do with standardization?
Raz's user avatar
  • 135
0 votes
1 answer
147 views

I can understand why if from your univariate data (1 column?) you plot a histogram which seems to have 2+ peaks, i.e., a mix of more than one Gaussian. But what if you plot a histogram and there looks to be ...
spacexyz's user avatar
0 votes
1 answer
145 views

I have been banging my head against the wall trying to figure out a good real-world solution for this challenging problem that my friend asked me about. Could you please give some pointers? Let's say we want ...
Mike Dwell's user avatar
0 votes
1 answer
1k views

This is the histogram of the original data; I have a data set and plotted it with DataFrame.hist(): After that I applied the zscore function to my data set and plotted this histogram: After I have applied zscore, I ...
arcane_data's user avatar
4 votes
3 answers
2k views

I am in the process of solving a Machine Learning challenge, and I want to do it the right way. I did some exploratory data analysis and I wanted to check the distribution of the data. As displayed in ...
Spicy strike's user avatar
4 votes
1 answer
599 views

I have been asked to calculate a 95% confidence interval based on the following paper: https://www.nber.org/system/files/working_papers/w26107/w26107.pdf The question reads: Suppose rainfall in the ...
jserv's user avatar
  • 155
0 votes
0 answers
73 views

In the traditional Bayesian hierarchical approach, you typically have a hierarchy built on your coefficients. That is to say, you might have your coefficient of interest distributed around some ...
St4096's user avatar
  • 61
1 vote
1 answer
53 views

I have downloaded the following table from https://www.ons.gov.uk for the car rental market. Does Turnover (£000s) mean that the values should be multiplied by 1000? I.e....
Stretch0's user avatar
  • 111
2 votes
3 answers
369 views

I am working with a private medical dataset including categorical features coming from patients' examinations. However, the problem is that some patients underwent MRI, others scanner, and some ...
aulok's user avatar
  • 51
0 votes
1 answer
38 views

On Google Sheets, I have collected responses from my team members to assign them to be committee members in any of the 4 following committees: Internal, External, Membership, Speaker Management. ...
Tyler's user avatar
  • 3
0 votes
1 answer
71 views

I have two datasets: population density and case fatality rate. Population density is measured as the number of people living in an area divided by that area's size in square miles (number of people/area ...
kaka's user avatar
  • 11
1 vote
1 answer
41 views

I am totally new to data science and neural networks. I want to make a simple chord recognition neural network using chroma data from audio. I collected some recordings and songs, and then divided ...
Zukaru's user avatar
  • 121
0 votes
1 answer
123 views

I am writing a program in Java that outputs a List<Double> of distances that roughly follow a bell curve distribution. From this data, I need to generate two ...
Dr. Rubisco's user avatar
0 votes
1 answer
61 views

I want to build a model that predicts 1v1 tennis match outcomes. What is the best way to lay out my data? Essentially, is it better to have 1 row per match or create rows/observations from each player's ...
swang16's user avatar
  • 131
0 votes
0 answers
85 views

BTOG. I need to develop a machine learning algorithm that matches random text to predefined categories. The text that I need to classify is web page text. I know there are many machine learning ...
Ben Goz's user avatar
0 votes
1 answer
65 views

I have several sets of data continually measured and recorded by instruments. The periods of record are 30+ years and the frequency of measurement can be every 5, 10, 15, or 30 minutes (288, 144, 96, ...
Rich Shepard's user avatar
0 votes
1 answer
69 views

I have a time-series dataset for each month for the past three years which represent quoted prices for the same product but with different delivery month. For example, Jul-19 is a dataset consisting ...
MilTom's user avatar
  • 369
0 votes
1 answer
60 views

I have a data set (original version, # A tibble: 33,478 x 12) of the form similar to the attached picture, and partial data: ...
Valy's user avatar
  • 103
0 votes
0 answers
51 views

My input is an array of values between 0 and 1000 and the output is the corresponding system velocity. The input value is randomly generated (for instance by using the function in Python ...
Joe's user avatar
  • 101
0 votes
1 answer
96 views

Here is what I obtain when I plot my data set in R. I am now wondering whether this data set is stationary or not. I'm assuming it is stationary since it has no visible trends or seasonality. However, ...
JOJO's user avatar
  • 101
