Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
2
votes
1
answer
75
views
Analyzing lists and variables of multiple answers
My current issue lies within EMR extracted data for medications. There are multiple variables named: Medication_1, Medication_2, Medication_3, etc...
This data may overlap and analyzing each column ...
2
votes
0
answers
59
views
What is the best metric to use to discard annotators with low IAA (inter-annotator agreement) with all others?
This question is specific to ordinal data collected on a Likert scale.
What is the best metric to discard annotators with low inter-annotator agreement (IAA) with others, e.g., Cohen’s Kappa, ...
0
votes
0
answers
44
views
How can I create a linear regression model without having exactly matching datasets?
I want to run a regression between two financial indexes, but their observation dates do not correspond perfectly (for example, one has observations for 17/6, 18/6, 19/6 but ...
1
vote
1
answer
101
views
Poisson regression given multiple predictors on a repeating ID variable
I was wondering how a Poisson regression would work given my dataset, which describes a series of zip codes stratified by age groups, gender and death counts.
The regression would use death counts as ...
0
votes
0
answers
66
views
Is there any technique that I can use to achieve the required computation without reading all 10 million rows?
I have a data file from a Monte Carlo simulation of fifteen protein chains. The file contains 10 million r_end_to_end 3D vectors as rows and 3 x 15 = 45 columns.
My ...
1
vote
0
answers
87
views
Sales data trend
I have my historical sales data and I want to check for the trend (increase, decrease or no change). When I do my annual line graph, the slope of my linear equation is positive (indicating increase) ...
1
vote
0
answers
30
views
Problems using custom dataset using Minirocket classification [closed]
I'm working on a bigger school project, trying to classify time series measurements with Minirocket/Rocket. My training data consists of a 1D matrix containing the measurements, and a separate 1D matrix ...
1
vote
0
answers
198
views
How to deal with outliers in panel data? [closed]
When we have cross-sectional data, we can easily detect and remove outliers. But how should one approach outliers when we are dealing with panel data? Since we have $i$ entities and $t$ time periods, ...
2
votes
0
answers
65
views
Testing time series data stationarity
I am working with time series and want to test different forecasting methods, but first I need to test whether my time series (sales) data is stationary. So I have been learning about KPSS and Dickey-...
0
votes
0
answers
11
views
Imbalanced dataset with multiple classes [duplicate]
I have an imbalanced dataset with multiple classes, where some have fewer than 100 samples and some have more than 10k. I want to apply a random forest (the dataset is confidential, so I can't share it). I used all ...
1
vote
1
answer
77
views
How to train Logistic regression model with multiple inputs for 1 target value?
My data looks similar to this: (the picture below is not mine, but describes my situation perfectly)
where the IDs are not unique, but for each ID value I have a unique target value. The following ...
1
vote
0
answers
184
views
Improving prediction accuracy for Gaussian process with derivative information
I am fitting two different GPs with derivative observations (one with 9 dimensional input and one 12 dimensional input), however for some reason I am getting much worse results for the 12 dimensional ...
1
vote
0
answers
82
views
Best method to identify layered clusters
The Problem
Hello everyone. I'm working with a dataset that has 15300 samples with 49 features each, equally distributed amongst three classes. I used TSNE to reduce the dimensions of the feature ...
3
votes
1
answer
92
views
Generating synthetic data with multiple records per ID
I would like to generate a synthetic dataset where there are multiple records per ID, and self-consistency is maintained among records of each ID.
For example, imagine a dataset where the ID is a ...
0
votes
0
answers
46
views
"How can I address the lack of correlation and a low R-squared value in my univariate linear regression when the data is scattered?"
I'm trying to find a correlation between confirmed case and death rates and HUMIDEX values. As you can see, the data is very scattered, so I understand that polynomial and ...
0
votes
1
answer
93
views
Data analysis for school project
I have to create a hypothetical study focusing on the relationship between sBCMA (soluble B-cell maturation antigen in blood) and the expression of BCMA on bone marrow cells in patients with multiple ...
2
votes
0
answers
74
views
Problem with mathematical formulas in gap statistic
I'm studying the article "Estimating the number of clusters in a data set via the gap statistic" by R. Tibshirani, G. Walther and T. Hastie: https://academic.oup.com/jrsssb/article/63/2/411/...
0
votes
0
answers
43
views
Given variable A and B containing data of lemma sentiments, what is the correct term for the variable containing average of var A and var B?
I have a data visualization, showing the sentiment of two lemmas "гей" (var a) and "трансгендер" (var b) in a news corpus throughout the year.
Here is the dataframe sample of my ...
1
vote
0
answers
34
views
Consistent way of doing paired-trial validation (and leave-one-dataset-out validation)
In paired-trial validation, statistical (ML) models are trained on $n$ datasets separately and then applied to other datasets, as a way of estimating the generalization of the models obtained. ...
4
votes
2
answers
202
views
Autocorrelation of discontinuous time series data [closed]
I am attempting to perform an autocorrelation study using Python on a discontinuous time series dataset. To share a bit about what my data looks like, it is a single column of values, which spans over ...
0
votes
1
answer
87
views
Should the testing data be uniquely distinct and come from different source/dataset than the training data?
I am building an audio classification system using CNN. My dataset consists of different audio I have recorded and spliced to equal time lengths. Like with any other common ML or DL tasks, I am to ...
0
votes
1
answer
129
views
How to calculate reliability of difference scores?
I am trying to calculate the reliability of a difference score. Specifically, the data have, for each participant, scores for 10 items in Condition X (1s and 0s), as well as 10 different items in ...
0
votes
0
answers
80
views
How do I calculate the weighted average of a 2D data set for a 3D structure?
I have a 2D data array indicating a chemical percentage content (PC) in a chemical droplet.
I am trying to calculate the average PC in the droplet.
The image of one of these arrays is shown below (the ...
3
votes
1
answer
495
views
Can I change values in data from yes and no to binary
I have a dataset that I want to perform a regression on. However, some of the columns are not in numerical form. For example, the extra classes column. What I ...
1
vote
1
answer
177
views
What's the justification for comparing two separate models built on subsets of data versus using one model that uses the whole dataset?
I've noticed that in some scientific fields, data analyses are done where the authors split an entire dataset into subsets based on a particular property. One classic example ...
0
votes
1
answer
67
views
Difference-in-difference age groups
I would appreciate your help with a question I have.
I'm creating a Difference-in-Difference study to examine how a conditional cash transfer to individuals 18 years of age to be spent in sport ...
1
vote
0
answers
140
views
Datasets with multiple maximum likelihood estimators
There is a sizeable body of literature on the issue of multiple maximizers in maximum likelihood estimation, such as
https://projecteuclid.org/journals/statistical-science/volume-15/issue-4/...
3
votes
0
answers
223
views
Recreating data variance from the posterior distribution
Take a set of data points $(x, y)$ with (Gaussian) uncertainties $\sigma_y$ on the $y$ coordinate; they are modeled as $y \sim f(x; \alpha) + \...
0
votes
0
answers
53
views
Reference datasets for conditional density estimation
[In case you feel inclined to close this question because I'm asking for a dataset - I'm looking for solutions in the spirit of point 2 (on-topic) in the accepted answer to this question about asking ...
0
votes
0
answers
42
views
Should I indicate "success" of an experimental run at the beginning of the data?
Because I'm that guy, I wanted to run some statistical analysis on the results of a number of experiments; specifically, I'm wanting to track my progress on different runs of the turn-based strategy ...
1
vote
0
answers
122
views
What is the difference between construct validity and reliability? [closed]
I want to design a questionnaire and examine a new construct (variable) in my research with a five-point scale from 1 to 5. How can I test whether the questionnaire satisfies the requirements of ...
1
vote
0
answers
37
views
Correlation of event occurrence in multiple sectors
I have the following problem to analyze:
I divided an area into several sectors (i.e.: S1,S2,S3,…,Sn) and there is an event that can happen in one or more sectors at the same time. I considered a ...
0
votes
1
answer
78
views
Make Predictions with an RNN Using a Multi-dimensional Training Set
I have a 2D matrix TD of training data that is a collection of N non-linear signals that are functions of time (hence the ...
1
vote
0
answers
145
views
Plotting 3 points per data set in lollipop-type chart [closed]
I want to plot a graph where I have multiple data points per category of data. For some context, I have done some analysis on different samples and now have up to 3 data points for each sample ...
2
votes
1
answer
114
views
Missing data for Cox regression and HR
I'm conducting research in which patients underwent surgery; for some the surgery was successful (outcome = 1) and for some it wasn't (outcome = 0). The risk factors were calculated using a Cox ...
1
vote
1
answer
121
views
How can I best visualize & compare this data? Should I create weighted composite variables for a scatter plot?
I have data from a survey which asked people 1) how often they used a particular tool (daily, weekly, monthly, annually, etc) and 2) how many hours they usually spent using it (0 - 4 hrs, 5 - 9, 10 - ...
2
votes
1
answer
114
views
Analyzing microbiome and clinical data for event-prediction
I am analyzing clinical data and complex microbiome data in a longitudinal study. I already compared different groups at baseline and between baseline and "events" using linear mixed models (...
6
votes
1
answer
429
views
Name of academic field studying geometric structure of data sets [closed]
I have questions about the geometric structure of data sets, esp. as it relates to the relationships between predictors. Is there a name for this field?
2
votes
1
answer
107
views
Data taken from a survey where survey-takers self-report a continuous variable
I have a problem with some health data that I'm trying to analyze. The main issue originates from a census variable that is derived from self-reported times. The variable is sleep duration, which is ...
0
votes
0
answers
65
views
Can I do a multiple linear regression analysis with a mixture of raw data and index data?
I'm trying to do a multiple linear regression analysis in Excel using the Analysis Toolpak and I am not good at math, let alone stats. So please excuse my total ignorance. I'm using the following ...
1
vote
0
answers
50
views
What kind of machine learning model could I use on this dataset?
I am a beginner to data science. I found this dataset that covers natural disaster incidents in Afghanistan from 2016 - present. Here are the 13 columns: REGION (South West, North, etc), PROV_CODE (...
3
votes
1
answer
179
views
Are bootstrapped samples considered to be coming from the same distribution as the original sample?
Let a dataset $\mathcal{D}$ be sampled according to $F_{\mathcal{D}}$.
My question is, suppose I create bootstrapped samples from $\mathcal{D}$. That is, create $\mathcal{D}_1, \ldots, \mathcal{D}_M$ ...
2
votes
1
answer
182
views
My dataset includes multiple variables and all of these variables have sub-variables. How to visualise & test which segment is significant?
So, I have survey responses from users. Just to make it clear, if you select an issue like Poor UI then you are prompted with 4-5 specific issues about the UI to select from. Poor UI is the main ...
1
vote
1
answer
221
views
Multivariate Time Series dataset preparation
I am a bit confused about time series dataset preparation. All the examples I found online that used tree-based models had input features and target defined as:
...
0
votes
1
answer
106
views
Is it possible to replace missing data if 50% of total data are missing?
In my study, 40K household surveys were completed. However, when we suggested visiting a nearby health center to measure their physical parameters (height, weight, blood pressure, and blood glucose), only ...
2
votes
1
answer
95
views
Understanding notation for input space
I'm taking the free Caltech machine learning course. I'm having trouble understanding the notation on one of the problems:
In this problem, you will create your own target function f and data set D ...
0
votes
1
answer
291
views
Is it valid to add more data to a training data set after evaluating on a test set?
I'm working on a machine learning project to find particular key points in images. To do this, I'm using a U-net like architecture and treating it as a regression problem to produce a heat-map of ...
1
vote
0
answers
63
views
Are there any downsides to using AUROC curves in low event rate samples?
I was just asked to familiarize myself with some methods looking at comparing AUROC for a few predictive scores to predict outcomes. Issue is that I have a dataset of about 200 with <5% with the ...
0
votes
1
answer
107
views
How to deal with the missingness of numerical data that is only relevant to some of the observations when making regression models?
Firstly, I'm completely new to data science (first project) and to StackExchange, so sorry if I'm asking a stupid question or not providing adequate information in my question. Please tell me if I could ...
0
votes
0
answers
89
views
Assign weights to examples in a highly imbalanced dataset
I have a highly imbalanced dataset and I'd like to train a simple ANN classifier on it. My model currently is a simple 2-layer feed-forward neural network with ReLU activation in between. After a few ...