Newest 'dataset' Questions - Page 2

2 votes

1 answer

75 views

Analyzing lists and variables of multiple answers

My current issue lies with EMR-extracted medication data. There are multiple variables named Medication_1, Medication_2, Medication_3, etc. This data may overlap and analyzing each column ...

Abdallah Al-Ani

21

asked Jun 29, 2024 at 13:16

2 votes

0 answers

59 views

What is the best metric to use to discard annotators with low IAA (inter-annotator agreement) with all others?

This question is specific to ordinal data collected on the Likert scale. What is the best metric to discard annotators with low inter-annotator agreement (IAA) with others, from e.g. Cohen’s Kappa, ...

user2160809

141

asked Jun 21, 2024 at 19:19

0 votes

0 answers

44 views

How can I create a linear regression model without having the exact same dataset

I want to run a regression between two financial indexes, but their observation dates do not correspond perfectly (for example, one has readings for 17/6, 18/6, 19/6 but ...

ConfusedConsultant

1

asked Jun 20, 2024 at 7:18

1 vote

1 answer

101 views

Poisson regression given multiple predictors on a repeating ID variable

I was wondering how a Poisson regression would work given my dataset, which describes a series of zip codes stratified by age groups, gender and death counts. The regression would use death counts as ...

Seyong Chang

13

asked Jun 12, 2024 at 18:57

0 votes

0 answers

66 views

Is there any technique that I can use to achieve the required computation without reading all 10 million rows?

I have a data file from a Monte Carlo simulation of fifteen protein chains. The file contains 10 million r_end_to_end 3D vectors as rows and 3 x 15 = 45 columns. My ...

user366312

2,077

asked Jun 5, 2024 at 21:17

1 vote

0 answers

87 views

Sales data trend

I have my historical sales data and I want to check for the trend (increase, decrease or no change). When I do my annual line graph, the slope of my linear equation is positive (indicating increase) ...

monique

31

asked May 20, 2024 at 2:33

1 vote

0 answers

30 views

Problems using custom dataset using Minirocket classification [closed]

I'm working on a bigger school project, trying to classify time series measurements with Minirocket/Rocket. My training data consists of a 1D matrix containing the measurements, and a separate 1D matrix ...

Michael

11

asked May 19, 2024 at 11:49

1 vote

0 answers

198 views

How to deal with outliers in panel data? [closed]

When we have cross-sectional data, we can easily detect and remove outliers. But how should one approach outliers when we are dealing with panel data? Since we have $i$ entities and $t$ time periods, ...

TFT

345

asked May 18, 2024 at 11:37

2 votes

0 answers

65 views

Testing time series data stationarity

I am working with time series and want to test different forecasting methods, but first I need to test whether my time series (sales) data is stationary. So I have been learning about KPSS and Dickey-...

monique

31

asked May 16, 2024 at 0:13

0 votes

0 answers

11 views

Imbalanced dataset with multiple classes [duplicate]

I have an imbalanced dataset with multiple classes, where some have fewer than 100 samples and some have more than 10k, and I want to apply a random forest (the dataset is confidential so I can't share it). I used all ...

Deepak kumar

1

asked May 7, 2024 at 17:32

1 vote

1 answer

77 views

How to train Logistic regression model with multiple inputs for 1 target value?

My data looks similar to this (the picture below is not mine, but it describes my situation perfectly), where the IDs are not unique but for each ID value I have a unique target value. The following ...

Moez Daly

43

asked May 5, 2024 at 9:21

1 vote

0 answers

184 views

Improving prediction accuracy for Gaussian process with derivative information

I am fitting two different GPs with derivative observations (one with 9-dimensional input and one with 12-dimensional input); however, for some reason I am getting much worse results for the 12-dimensional ...

m-julian

23

asked Apr 28, 2024 at 23:01

1 vote

0 answers

82 views

Best method to identify layered clusters

The Problem Hello everyone. I'm working with a dataset that has 15300 samples with 49 features each, equally distributed amongst three classes. I used TSNE to reduce the dimensions of the feature ...

Amyr14

11

asked Apr 18, 2024 at 22:23

3 votes

1 answer

92 views

Generating synthetic data with multiple records per ID

I would like to generate a synthetic dataset where there are multiple records per ID, and self-consistency is maintained among records of each ID. For example, imagine a dataset where the ID is a ...

user12138762

81

asked Apr 9, 2024 at 19:47

0 votes

0 answers

46 views

How can I address the lack of correlation and a low R-squared value in my univariate linear regression when the data is scattered?

I'm trying to find a correlation between the confirmed cases and death rates against HUMIDEX values. As you can see, the data is very scattered, so I understand that polynomial and ...

Carlos Leonel Guerrero Rodrigu

1

asked Apr 6, 2024 at 21:38

0 votes

1 answer

93 views

Data analysis for school project

I have to create a hypothetical study focusing on the relationship between sBCMA (soluble B-cell maturation antigen in blood) and the expression of BCMA on bone marrow cells in patients with multiple ...

youknow 321

1

asked Apr 3, 2024 at 11:46

2 votes

0 answers

74 views

Problem with mathematical formulas in gap statistic

I'm studying the article "Estimating the number of clusters in a data set via the gap statistic" by R. Tibshirani, G. Walther and T.Hastie: https://academic.oup.com/jrsssb/article/63/2/411/...

user2702

51

asked Apr 3, 2024 at 11:33

0 votes

0 answers

43 views

Given variable A and B containing data of lemma sentiments, what is the correct term for the variable containing average of var A and var B?

I have a data visualization, showing the sentiment of two lemmas "гей" (var a) and "трансгендер" (var b) in a news corpus throughout the year. Here is the dataframe sample of my ...

pindakazen

13

asked Apr 2, 2024 at 2:14

1 vote

0 answers

34 views

Consistent way of doing paired-trial validation (and leave-one-dataset-out validation)

In paired-trial validation, statistical (ML) models are trained on $n$ datasets separately and then applied to other datasets, as a way of estimating the generalization of the models obtained. ...

Roger V.

5,091

asked Mar 25, 2024 at 9:45

4 votes

2 answers

202 views

Autocorrelation of discontinuous time series data [closed]

I am attempting to perform an autocorrelation study using Python on a discontinuous time series dataset. To share a bit about what my data looks like, it is a single column of values, which spans over ...

Sam

83

asked Mar 20, 2024 at 13:12

0 votes

1 answer

87 views

Should the testing data be uniquely distinct and come from different source/dataset than the training data?

I am building an audio classification system using CNN. My dataset consists of different audio I have recorded and spliced to equal time lengths. Like with any other common ML or DL tasks, I am to ...

Flash

1

asked Mar 19, 2024 at 18:00

0 votes

1 answer

129 views

How to calculate reliability of difference scores?

I am trying to calculate the reliability of a difference score. Specifically, the data have, for each participant, scores for 10 items in Condition X (1s and 0s), as well as 10 different items in ...

Altair555

61

asked Mar 13, 2024 at 17:03

0 votes

0 answers

80 views

How do I calculate the weighted average of a 2D data set for a 3D structure?

I have a 2D data array indicating a chemical percentage content (PC) in a chemical droplet. I am trying to calculate the average PC in the droplet. The image of one of these arrays is shown below (the ...

user7077252

101

asked Mar 13, 2024 at 13:48

3 votes

1 answer

495 views

Can I change values in data from yes and no to binary

I have a dataset that I want to perform a regression on. However, some of the columns are not in numerical form. For example, the extra classes column. What I ...

Charlotte

31

asked Feb 21, 2024 at 20:58

1 vote

1 answer

177 views

What's the justification for comparing two separate models built on subsets of data versus using one model that uses the whole dataset?

I've noticed that in some scientific fields, authors split an entire dataset into subsets based on a particular property before analyzing it. One classic example ...

Syuma

115

asked Feb 19, 2024 at 18:22

0 votes

1 answer

67 views

Difference-in-difference age groups

I would appreciate your help with a question I have. I'm creating a Difference-in-Difference study to examine how a conditional cash transfer to individuals 18 years of age to be spent in sport ...

Retir

1

asked Feb 19, 2024 at 15:36

1 vote

0 answers

140 views

Datasets with multiple maximum likelihood estimators

There is a sizeable body of literature on the issue of multiple maximizers in maximum likelihood estimation, such as https://projecteuclid.org/journals/statistical-science/volume-15/issue-4/...

Tom Solberg

139

asked Feb 18, 2024 at 16:58

3 votes

0 answers

223 views

Recreating data variance from the posterior distribution

Take a set of data points $(x, y)$ with (Gaussian) uncertainties $\sigma_y$ on the $y$ coordinate; they are modeled as $y \sim f(x; \alpha) + \...

Jacopo Tissino

131

asked Feb 6, 2024 at 15:53

0 votes

0 answers

53 views

Reference datasets for conditional density estimation

[In case you feel inclined to close this question because I'm asking for a dataset - I'm looking for solutions in the spirit of point 2 (on-topic) in the accepted answer to this question about asking ...

Scriddie

2,673

asked Jan 31, 2024 at 12:21

0 votes

0 answers

42 views

Should I indicate "success" of an experimental run at the beginning of the data?

Because I'm that guy, I wanted to run some statistical analysis on the results of a number of experiments; specifically, I'm wanting to track my progress on different runs of the turn-based strategy ...

John Doe

85

asked Jan 29, 2024 at 18:57

1 vote

0 answers

122 views

What is the difference between construct validity and reliability? [closed]

I want to design a questionnaire and examine a new construct (variable) in my research with a five point scale from 1 to 5. How can I test whether the questionnaire satisfies the requirements of ...

Dr. Subhash Chander

61

asked Jan 29, 2024 at 16:04

1 vote

0 answers

37 views

Correlation of event occurrence in multiple sectors

I have the following problem to analyze: I divided an area into several sectors (i.e.: S1,S2,S3,…,Sn) and there is an event that can happen in one or more sectors at the same time. I considered a ...

Rodrigo

111

asked Jan 28, 2024 at 14:35

0 votes

1 answer

78 views

Make Predictions with an RNN Using a Multi-dimensional Training Set

I have a 2D matrix TD of training data that is a collection of N non-linear signals that are functions of time (hence the ...

Jonathan Frutschy

103

asked Jan 23, 2024 at 0:22

1 vote

0 answers

145 views

Plotting 3 points per data set in lollipop-type chart [closed]

I want to plot a graph where I have multiple data points per category of data. For some context, I have done some analysis on different samples and now have up to 3 data points for each sample ...

Charllotte

11

asked Jan 16, 2024 at 16:24

2 votes

1 answer

114 views

Missing data for Cox regression and HR

I'm conducting research in which patients underwent surgery; for some the surgery was successful (outcome = 1) and for some it wasn't (outcome = 0). The risk factors were calculated using a Cox ...

AREEEL

21

asked Jan 16, 2024 at 14:02

1 vote

1 answer

121 views

How can I best visualize & compare this data? Should I create weighted composite variables for a scatter plot?

I have data from a survey which asked people 1) how often they used a particular tool (daily, weekly, monthly, annually, etc.) and 2) how many hours they usually spent using it (0 - 4 hrs, 5 - 9, 10 - ...

Arctic

81

asked Jan 11, 2024 at 1:51

2 votes

1 answer

114 views

Analyzing microbiome and clinical data for event-prediction

I am analyzing clinical data and complex microbiome data in a longitudinal study. I already compared different groups at baseline and between baseline and "events" using linear mixed models (...

BHO_1990

21

asked Dec 15, 2023 at 14:16

6 votes

1 answer

429 views

Name of academic field studying geometric structure of data sets [closed]

I have questions about the geometric structure of data sets, esp. as it relates to the relationships between predictors. Is there a name for this field?

Chris Science

403

asked Dec 14, 2023 at 17:44

2 votes

1 answer

107 views

Data taken from a survey where survey-takers self-report a continuous variable

I have a problem with some health data that I'm trying to analyze. The main issue originates from a census variable that is derived from self-reported times. The variable is sleep duration, which is ...

Ender_The_Xenocide

123

asked Nov 18, 2023 at 11:26

0 votes

0 answers

65 views

Can I do a multiple linear regression analysis with a mixture of raw data and index data?

I'm trying to do a multiple linear regression analysis in Excel using the Analysis Toolpak and I am not good at math, let alone stats. So please excuse my total ignorance. I'm using the following ...

MissyM

1

asked Nov 18, 2023 at 3:07

1 vote

0 answers

50 views

What kind of machine learning model could I use on this dataset?

I am a beginner to data science. I found this dataset that covers natural disaster incidents in Afghanistan from 2016 - present. Here are the 13 columns: REGION (South West, North, etc), PROV_CODE (...

Mas

11

asked Nov 16, 2023 at 18:50

3 votes

1 answer

179 views

Are bootstrapped samples considered to be coming from the same distribution as the original sample?

Let a dataset $\mathcal{D}$ be sampled according to $F_{\mathcal{D}}$. My question is, suppose I create bootstrapped samples from $\mathcal{D}$. That is, create $\mathcal{D}_1, \ldots, \mathcal{D}_M$ ...

Your neighbor Todorovich

707

asked Nov 16, 2023 at 16:22

2 votes

1 answer

182 views

My dataset includes multiple variables, and all of these variables have sub-variables. How to visualise & test which segment is significant?

So, I have survey responses from users. Just to make it clear, if you select an issue like Poor UI then you are prompted with 4-5 specific issues about the UI to select from. Poor UI is the main ...

doodle2611

23

asked Nov 14, 2023 at 13:39

1 vote

1 answer

221 views

Multivariate Time Series dataset preparation

I am a bit confused about time series dataset preparation. On the internet, all the examples I saw that used tree-based models had input features and target defined as: ...

kg__

63

asked Nov 14, 2023 at 13:07

0 votes

1 answer

106 views

Is it possible to replace missing data if 50% of total data are missing?

In my study, 40K completed household surveys. However, when we suggested visiting a nearby health center to measure their physical parameters (height, weight, blood pressure, and blood glucose), only ...

Dr bappa

1

asked Nov 10, 2023 at 6:44

2 votes

1 answer

95 views

Understanding notation for input space

I'm taking the free Caltech machine learning course. I'm having trouble understanding the notation on one of the problems: In this problem, you will create your own target function f and data set D ...

Ben G

153

asked Oct 30, 2023 at 21:22

0 votes

1 answer

291 views

Is it valid to add more data to a training data set after evaluating on a test set?

I'm working on a machine learning project to find particular key points in images. To do this, I'm using a U-net like architecture and treating it as a regression problem to produce a heat-map of ...

ocharles

103

asked Oct 28, 2023 at 9:20

1 vote

0 answers

63 views

Are there any downsides to using AUROC curves in low event rate samples?

I was just asked to familiarize myself with some methods looking at comparing AUROC for a few predictive scores to predict outcomes. Issue is that I have a dataset of about 200 with <5% with the ...

Mike K

11

asked Oct 25, 2023 at 2:44

0 votes

1 answer

107 views

How to deal with the missingness of numerical data that is only relevant to some of the observations when making regression models?

Firstly, I'm completely new to data science (first project) and to StackExchange, so sorry if I'm asking a stupid question or not providing adequate information in my question. Please tell if I could ...

Mathias Therkelsen

13

asked Oct 20, 2023 at 14:11

0 votes

0 answers

89 views

Assign weights to examples in a highly imbalanced dataset

I have a highly imbalanced dataset and I'd like to train a simple ANN classifier on it. My model currently is a simple 2-layer feed-forward neural network with ReLU activation in between. After a few ...

Green 绿色

201

asked Oct 14, 2023 at 9:46

Questions tagged [dataset]