Newest 'dataset' Questions - Page 6

2 votes

1 answer

122 views

Cross-validation: error estimation and bias

When estimating the error of a model over a dataset using k-fold cross-validation, does a lower error estimate necessarily imply a lower bias? Are both concepts, error estimation ...

dreamco9

65

asked Apr 10, 2022 at 8:16

3 votes

0 answers

4k views

Synthesize data given mean, variance, skew, and kurtosis in python [closed]

I would like to generate synthetic data by specifying their mean, variance, skew, and kurtosis. However, I only know how to generate synthetic data with mean and var. Here is an example with mean and ...

Joseph

151

asked Apr 10, 2022 at 0:49

0 votes

0 answers

135 views

What is a good type of transformation to try on this data to get normal distribution?

I've tried a log transformation, but the data become left-skewed. In general, when data are distributed like this, what is the next best transformation to try if you want to normalize the data?

mike_mussini

3

asked Apr 8, 2022 at 20:13

2 votes

1 answer

287 views

The percentile method

A study used the percentile method to find the prevalence in the severe group, as below: The outcome variable was categorized using percentiles into Mild, Moderate, and Severe in the Symptom Severity ...

kareen kk

81

asked Apr 6, 2022 at 6:24

0 votes

1 answer

174 views

How to learn a filter on a dataset?

Let $X$ be a tabular dataset with $N$ features indexed by row number and a categorical value "Cat". Let $A$ be an aggregation function, e.g. ...

Rachel

229

asked Apr 5, 2022 at 13:44

0 votes

1 answer

641 views

Grouping using percentiles

I ran the visual binning process in SPSS and made three cutpoints, as in this image: I did check off included. I want to know the percentage range or value for each group. If I describe it in this way is ...

Stats34

57

asked Apr 3, 2022 at 8:55

0 votes

0 answers

113 views

How do I decide the frequency of data capture for modeling?

I plan to capture data to predict energy consumption in a food processing plant. I want to capture production details such as how much each category of food is produced, what is the machine's output, ...

NAS_2339

223

asked Apr 1, 2022 at 17:07

0 votes

1 answer

141 views

"Ordinal" is to "level of measurement" as "dependent variable" is to __________ [closed]

The Question If I'm drawing up a table that describes some variables -- a data dictionary, say -- and I consider whether a variable is "ordinal", or "nominal", or "interval...

logjammin

759

asked Mar 28, 2022 at 1:33

3 votes

2 answers

208 views

A/B test with a result of another deep learning analysis with 90% accuracy

I am planning to conduct an A/B test with data obtained through a deep learning algorithm. Say, I got a binary classification dataset through machine learning with about 100k rows classified into yes, ...

Dan K

63

asked Mar 25, 2022 at 20:37

0 votes

1 answer

107 views

About ideal data and its distribution

I was just thinking about what would be the properties of an ideal data set $X \in R^{n,d}$ where n is sample size, d represents features. I think (or at least I understood from reading text books) ...

Kadir Gunel

103

asked Mar 23, 2022 at 8:32

1 vote

1 answer

203 views

Is it true that a larger, representative dataset is always better to use than a smaller, representative dataset?

By "representative" I mean that the data in the dataset faithfully reflects the "underlying signal" a model is trying to tap in to. Is it always true that, as long as increasing ...

sangstar

131

asked Mar 21, 2022 at 14:57

-1 votes

1 answer

148 views

Is it normal for 6% of your dataset to be outliers?

My dataset has 80,886 obs and 16 variables. I am using Mahalanobis distance to detect outliers, with a p-value below 0.001 as the cut-off. I am getting 5,423 obs flagged as outliers, which is 6% of total ...

surfffffffff

11

asked Mar 21, 2022 at 5:01

1 vote

1 answer

678 views

In data setup for Cox regression, how to handle a subject's time before treatment of interest (i.e. before time-zero)?

Background I'm designing a study that models time-to-event for two groups of study subjects: people who receive a treatment and those who do not. I'm fairly new to applied survival analysis, so I've ...

logjammin

759

asked Mar 21, 2022 at 1:37

1 vote

2 answers

2k views

How to solve the problem of having sparse data that would become too small when aggregated?

I have a dataset that provides the count of cyber incidents since 2011 for different countries and different attack types, and I want to use this data in a machine learning model to predict future ...

Travelling Salesman

113

asked Mar 20, 2022 at 11:24

1 vote

1 answer

66 views

Measure how dataset is harmonious or organized

Suppose we have two sets of numbers: A = [1,4,9,16,25,49...100] and B = [1,4,7,7,25,49,64...100]. As you can see, the first one grows consistently; its elements are squares of numbers. But although ...

student0434

11

asked Mar 17, 2022 at 13:14

1 vote

2 answers

66 views

Data structure for p-value analysis of medical data

I have a good-sized data set containing details and outcomes of various types of injury. I've worked through it and cleaned the data as best I can - it's now consistent and complete. At this point I'm ...

Alex McGruder

11

asked Mar 15, 2022 at 12:14

0 votes

0 answers

19 views

What test is appropriate? [duplicate]

Can I use Welch's one-way ANOVA to compare means between the following three groups? Group 1 has 83 observations, Group 2 has 15 observations, and Group 3 has only 2 observations. And in group 3 the variance is 0 ...

kareen kk

81

asked Mar 14, 2022 at 19:34

2 votes

1 answer

91 views

Apply machine learning to predict discharging and risk of readmission based on medical data

I am new to machine learning. I have a problem where I have to predict patients to be/not to be discharged using hospital data and depending on that prediction (i.e. if the patient is successfully ...

KALHIM

21

asked Mar 13, 2022 at 22:07

0 votes

0 answers

238 views

Data comparison of two sensors, to produce a correction factor

I have sensor "A" that is outputting a set of data and I also have another sensor that is acting as the reference sensor. The aim is to get Sensor "A" as close as possible to the ...

user8400863

101

asked Mar 12, 2022 at 20:34

2 votes

1 answer

48 views

What is the term for data which do not include multiple variables needed for controlling confounding in analyses?

I have a terminology question that I couldn't answer by googling. What is the term for data which do not include multiple analytically relevant variables needed for controlling confounding in analyses?...

st4co4

2,327

asked Feb 28, 2022 at 8:18

1 vote

0 answers

37 views

Best resources to gather social media statistics [closed]

I am looking to gather data on the number of people following/viewing digital artists' content on social media like YouTube, TikTok, Instagram etc. I've looked around online but I can only find lists ...

DeltaChief

111

asked Feb 28, 2022 at 6:55

0 votes

0 answers

213 views

Training a machine learning model on data that has several rows for each user

I have a dataset consisting of log files from a smartphone application. Currently, it creates a row each time a user clicks on something, i.e. a user clicks on the homepage, and a new row is created ...

sword134

1

asked Feb 22, 2022 at 13:32

0 votes

0 answers

150 views

How to compare/test two datasets which are not supposed to be exactly equal

An example of such a dataset is the open, high, low, close, and volume data of any stock over the last 5 days. This data is available at one-minute frequency. I want to test the quality of my ...

KnowledgeSeeeker

101

asked Feb 21, 2022 at 15:12

4 votes

0 answers

312 views

How can you evaluate the representativeness of a sample for a given distribution?

Problem: I am looking for a metric to find the representativeness of a sample for a given distribution, defining the representativeness of a random sample as the capacity of the sample to ...

mcardoner

41

asked Feb 17, 2022 at 15:08

2 votes

0 answers

153 views

What is the proper way to externally validate clusters when I have only a sample of the dataset labeled, but want to cluster the entire dataset?

I have a dataset of text-based documents that I want to cluster. For a sample of this dataset (~10%) I have manually annotated labels (i.e., the ground truth). I would like to cluster this dataset to ...

BNise

21

asked Feb 17, 2022 at 14:45

0 votes

1 answer

210 views

Variable selection with sparse data

I have a dataset with 141 observations and 8 corresponding variables and I mean to apply a GLM to this dataset. However, a lot of observations lack either one or multiple variable values. So if I want ...

dumei

11

asked Feb 17, 2022 at 11:08

1 vote

2 answers

484 views

t-test with data in long-format including levels of non-interesting factor in R

I've computed an ANOVA with one between-subjects factor (2 groups each including 26 participants) and 2 within-subjects factors (item type with 3 levels and emotion with 2 levels) in ezANOVA. ...

valid

45

asked Feb 16, 2022 at 18:03

1 vote

1 answer

210 views

How to scale data for model retraining on production?

Let's say I have a basic regression model being used in production and now I want to implement periodical model retraining (i.e. once a month) where I take a batch of new data from last month and fit ...

GKozinski

121

asked Feb 15, 2022 at 12:13

3 votes

0 answers

139 views

How to identify small, moderate or large data sample sizes

I want to run a probit or a logit model and I am curious about the choice regarding the data sample size that I have. In a previous answer to the question 'Probit vs Logit' I read that "Probit is ...

Collin Focas

31

asked Feb 14, 2022 at 19:23

1 vote

0 answers

801 views

How to guarantee the test set is "independent"?

In Machine Learning (ML) tasks, one splits the dataset into training and test sets. We train the ML model based on the training set, and then we evaluate the performance of the model with the test ...

Hamed

111

asked Feb 11, 2022 at 16:59

0 votes

1 answer

206 views

In regression, when we standardise the data, do we need an intercept?

I would like to know whether, when we standardize the data and then apply linear regression or Bayesian regression, we need an intercept or not. Or does it have nothing to do with standardization?

Raz

135

asked Feb 11, 2022 at 9:43
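
A minimal sketch of the standardisation-and-intercept point above, under stated assumptions: the data values and the helper names `standardize` and `ols_fit` are invented for illustration and do not come from the question. For simple least squares, the fitted intercept is `mean(y) - slope * mean(x)`, so centring both variables should drive it to zero, while standardising only the predictor should leave it at the mean of the response.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    m = fmean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Illustrative data only (hypothetical values, not from the question).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [12.1, 13.9, 16.2, 18.0, 19.8, 22.1]

a_raw, b_raw = ols_fit(x, y)                              # raw data
a_x_only, b_x_only = ols_fit(standardize(x), y)           # only x standardised
a_both, b_both = ols_fit(standardize(x), standardize(y))  # both standardised

print("raw intercept:", a_raw)
print("x standardised only:", a_x_only)
print("x and y standardised:", a_both)
```

In this sketch the intercept can be dropped only in the last case, where both the predictor and the response have been centred; if only the predictors are standardised, the intercept still carries the mean of the response.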

0 votes

1 answer

147 views

For univariate data, why do we need the normalmixEM function in R instead of just computing the mean and variance with the basic methods?

I can understand why if from your univariate data (1 column?) you plot a histogram which seems to have 2+ peaks, i.e. a mix of more than 1 Gaussian. But what if you plot a histogram and there looks to be ...

spacexyz

5

asked Feb 3, 2022 at 14:14

0 votes

1 answer

145 views

Questions on Data Quality Assessment

I have been banging my head against the wall trying to figure out a good real-world solution for this challenging problem that my friend asked me. Could you please give some pointers? Let's say we want ...

Mike Dwell

1

asked Jan 29, 2022 at 5:53

0 votes

1 answer

1k views

Are normalizing and scaling different?

This is the histogram of the original data; I have a data set and plotted it with DataFrame.hist(): After that I applied the zscore function to my data set and plotted this histogram: After I have applied zscore, I ...

arcane_data

1

asked Jan 28, 2022 at 15:46

4 votes

3 answers

2k views

Which are outliers?

I am in the process of solving a Machine Learning challenge, and I want to do it the right way. I did some exploratory data analysis and I wanted to check the distribution of the data. As displayed in ...

Spicy strike

51

asked Jan 26, 2022 at 14:38

4 votes

1 answer

599 views

Calculating a 95% confidence interval without the original dataset

I have been asked to calculate a 95% confidence interval based on the following paper: https://www.nber.org/system/files/working_papers/w26107/w26107.pdf The question reads: Suppose rainfall in the ...

jserv

155

asked Jan 23, 2022 at 13:48

0 votes

0 answers

73 views

In Bayesian Statistics, can the data be a random variable drawn from the posterior of a separate model? Will the uncertainty flow through?

In the traditional Bayesian hierarchical approach, you typically have a hierarchy built on your coefficients. That is to say, you might have your coefficient of interest distributed around some ...

St4096

61

asked Jan 21, 2022 at 21:45

1 vote

1 answer

53 views

How to read the following table of data?

I have downloaded the following table from https://www.ons.gov.uk for the car rental market. Does Turnover (£000s) mean the values should be multiplied by 1000? I.e....

Stretch0

111

asked Jan 20, 2022 at 20:13

2 votes

3 answers

369 views

Feature not applicable to some samples

I am working with a private medical dataset including categorical features coming from patients' examinations. However, the problem is that some patients underwent MRI, others scanner, and some ...

aulok

51

asked Jan 14, 2022 at 13:06

0 votes

1 answer

38 views

How to Analyze Data to optimize committee Allocations? [closed]

On Google Sheets, I have collected responses from my team members to assign them to be committee members in any of the 4 following committees: Internal, External, Membership, Speaker Management. ...

Tyler

3

asked Jan 13, 2022 at 20:54

0 votes

1 answer

71 views

Are these two datasets interval?

I have two datasets: population density and case fatality rate. Population density is measured as number of people living in an area divided by that area size in square miles (number of people/area ...

kaka

11

asked Jan 10, 2022 at 20:18

1 vote

1 answer

41 views

Is it a good practice to collect different numbers of data (randomly picked) per class?

I am totally new to data science and neural networks. So I want to make a simple chord recognition neural network using chroma data from audio. I collected some recordings and songs, and then divided ...

Zukaru

121

asked Jan 9, 2022 at 10:14

0 votes

1 answer

123 views

Statistical analysis of distributed values in Java [closed]

I am writing a program in Java that outputs a List<Double> of distances that roughly follow a bell curve distribution. From this data, I need to generate two ...

Dr. Rubisco

1

asked Dec 23, 2021 at 2:52

0 votes

1 answer

61 views

How should my data be formatted for 1v1 match prediction?

I want to build a model that predicts 1v1 tennis match outcomes. What is the best way to layout my data? Essentially is it better to have 1 row per match or create rows/observations from each player's ...

swang16

131

asked Dec 22, 2021 at 22:02

0 votes

0 answers

85 views

Scientific way to construct dataset for text classification

BTOG. I need to develop a machine learning algorithm that matches random text to predefined categories. The text that I need to predict is web page text. I know there are many machine learning ...

Ben Goz

1

asked Dec 21, 2021 at 11:45

0 votes

1 answer

65 views

How variability with different data densities could affect comparisons among environmental variables

I have several sets of data continually measured and recorded by instruments. The periods of record are 30+ years and the frequency of measurement can be every 5, 10, 15, or 30 minutes (288, 144, 96, ...

Rich Shepard

21

asked Dec 16, 2021 at 17:33

0 votes

1 answer

69 views

How to structure this multi-dimensional data for AR modelling?

I have a time-series dataset for each month for the past three years which represents quoted prices for the same product but with different delivery months. For example, Jul-19 is a dataset consisting ...

MilTom

369

asked Dec 15, 2021 at 12:53

0 votes

1 answer

60 views

Data Wrangling for Modeling in R [closed]

I have a data set (original version, # A tibble: 33,478 x 12) of the form similar to the attached picture, and partial data: ...

Valy

103

asked Dec 15, 2021 at 3:54

0 votes

0 answers

51 views

Uncertainty estimation in the input space

My input is an array between 0 and 1000 and the output is the corresponding system velocity. The input value is randomly generated (for instance by using the function in Python ...

Joe

101

asked Dec 11, 2021 at 19:08

0 votes

1 answer

96 views

Is my data set stationary?

Here is what I obtain when I plot my data set in R. I am now wondering whether this data set is stationary or not. I'm assuming it is stationary since it has no visible trends or seasonality. However, ...

JOJO

101

asked Dec 9, 2021 at 22:52
