
Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

3 votes
1 answer
119 views

How should I handle a mass-point in the dependent variable when running OLS regression in R? I’m working with a household expenditure dataset (Living Costs 2019) where the dependent variable is the ...
Jim's user avatar
  • 31
4 votes
5 answers
711 views

It’s confusing to understand how quartile values can actually be used to give insights into a dataset. Please assist with examples. I struggle to interpret the values in the context of providing ...
Buchi's user avatar
  • 41
5 votes
3 answers
533 views

I’m working on a project where I need to build a predictive model for wine quality based on its chemical properties. The goal is to find which features best explain or predict the quality score. I’ve ...
QualityX's user avatar
1 vote
1 answer
136 views

This question is inspired by a blog post at https://www.argmin.net/p/in-defense-of-typing-monkeys and several rumors I've heard from other people who work in machine learning. The gist of it is that ...
Your neighbor Todorovich's user avatar
1 vote
0 answers
99 views

Out of curiosity, I am looking for an example of an authentic variable (which one would find in a data set) with an exceptionally small coefficient of variation:  $\text{CV} = \frac{s}{\bar{x}}$.  To ...
Gregg H's user avatar
  • 7,077
1 vote
2 answers
123 views

If I got it correct, the standard error is a statistic that measures the variability of a sample’s data and how accurately a statistic represents the corresponding parameter. Please suggest any ...
okman's user avatar
  • 315
1 vote
0 answers
47 views

Neural Network Beginner here. I am currently implementing a CNN on PyTorch for recognizing Japanese handwritten letters, which has 46 classes of outputs. I found a dataset on Kaggle https://www.kaggle....
Krish Thyagarajan's user avatar
2 votes
1 answer
102 views

If we have a high-dimensional dataset (7-10 columns) of continuous variables like Time, Temperature etc. recorded from experiments (not performed by us) are there established methods to quantitatively ...
Sunera Wijeratne's user avatar
1 vote
1 answer
94 views

I am looking to apply a calibration/correction approach on a set of sensors and I just wanted to know that the approach I am going to use is statistically correct and acceptable. I am using a set of ...
Milad's user avatar
  • 157
12 votes
5 answers
2k views

A question is how many missing values are too many to be handled. It has been asked in the context of applying specific software and method (MICE). I am interested in understanding a bit better what ...
Johan's user avatar
  • 346
2 votes
1 answer
83 views

I have a task: a store where customers pay for their items at registers with cashiers has added self-service checkouts. I have 4 months of transaction data of customers who make their ...
remon's user avatar
  • 21
7 votes
4 answers
880 views

I am currently attending my first data analysis class and we do some simple hypothesis tests like the t-test etc. Our teacher told us that we can remove outliers, as long as they are not more than 10% ...
Maria's user avatar
  • 71
0 votes
1 answer
50 views

I have a question that relates to the use of IAT scores across timepoints. As part of a large health-based intervention my colleagues and I have obtained IAT scores at different timepoints, from which ...
Jonathan Kim's user avatar
187 votes
15 answers
57k views

In a recent article of Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of ...
Carlos Accioly's user avatar
10 votes
1 answer
1k views

We are trying to develop some predictive models in medical research. We have a combination of clinical and RNA-seq data for just 40 patients. The problem is classification. After feature selection, we ...
Leila ali's user avatar
  • 189
11 votes
4 answers
957 views

In machine learning, it is commonly assumed that samples are generated i.i.d. according to some probability distribution. On the importance of the i.i.d. assumption in statistical learning The ...
Your neighbor Todorovich's user avatar
0 votes
0 answers
39 views

Suppose one must share a data file – could be a simple CSV file – where each datapoint has several variates, let's say a nominal one, an ordinal one, and a continuous-real one. Are there any standard ...
pglpm's user avatar
  • 1,356
6 votes
0 answers
320 views

Assume we are only able to observe two-way entry table counting the number of observations of a pair of categorical features $x_i,x_j$. $$ \begin{array}{c|ccc} & & x_j & \\ \hline ...
Three Diag's user avatar
103 votes
25 answers
43k views

I've been working on a new method for analyzing and parsing datasets to identify and isolate subgroups of a population without foreknowledge of any subgroup's characteristics. While the method works ...
1 vote
1 answer
292 views

I developed and compared four ML models via Random Forest, Support Vector Machine, Logistic Regression, and Xgboost (tidymodels R package) algorithms using data without stratification by age groups. ...
Data and data's user avatar
0 votes
0 answers
49 views

In a longitudinal hospitalization survey dataset, where patients are asked to fill out a survey each time they are admitted into the hospital, one of the questions is no longer asked. This question ...
Kevin's user avatar
  • 353
0 votes
0 answers
129 views

I have a metagenomics dataset with more than 2 million features (each one being the relative abundance of a gene family, this is, a cluster of gene sequences) in 30 samples. First, I CLR-transformed ...
AdrianLG's user avatar
1 vote
1 answer
95 views

The MNIST dataset can be obtained directly using Keras by running the following lines of Python code. ...
user3728501's user avatar
5 votes
1 answer
181 views

Consider the distribution with the bivariate cumulative distribution function $$F(t_1,t_2)=t_1^{1+\theta\log(t_2)}t_2,0<t_1,t_2<1; \theta\leq 0$$. I want to generate data from this distribution (...
Unknown's user avatar
  • 220
1 vote
1 answer
87 views

I have 100 samples for 6 days (100 observations every day, 600 observations in total). I have tried to fit the mixed model to the data. Another time, I tried to fit the mixed model just for ...
Leila's user avatar
  • 11
2 votes
0 answers
56 views

This is more of a philosophical question, and also a question asking for references. To those who follow statistical academic literature, there are papers that discuss philosophical issues in ...
cgmil's user avatar
  • 1,633
95 votes
2 answers
193k views

I have seen the min-max normalization formula but that normalizes values between 0 and 1. How would I normalize my data between -1 and 1? I have both negative and positive values in my data matrix.
covfefe's user avatar
  • 1,299
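The rescaling asked about above can be sketched as follows. This is a minimal illustration, not taken from the question or its answers: it applies the standard linear map $x' = a + (x - x_{\min})\,\frac{b - a}{x_{\max} - x_{\min}}$ with $a=-1$ and $b=1$. The function name `rescale_to_range` and the sample values are hypothetical, and the sketch assumes the data are not all equal.

```python
def rescale_to_range(values, new_min=-1.0, new_max=1.0):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max.

    Hypothetical helper for illustration only.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A zero range would mean dividing by zero; the caller must decide what to do.
        raise ValueError("all values are equal; the range is zero")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Made-up data containing both negative and positive values.
data = [-4.0, 0.0, 2.0, 6.0]
scaled = rescale_to_range(data)
print(scaled)
```

Note that this maps the observed minimum to -1 and the observed maximum to 1, so the original zero does not generally stay at zero. If the sign of each value must be preserved, dividing by the largest absolute value is a different scaling that may suit better.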
1 vote
0 answers
83 views

In my research, I am examining the impact of AI labels (with vs. without) on various brand perceptions and behavioral intentions. Specifically, I analyze how the stimulus (IV, 4 stimuli in 2 subgroups)...
Marcel El Joundi's user avatar
0 votes
0 answers
49 views

I'm using this dataset for a regression project, and the goal is to predict the beneficiary risk score (Bene_Avg_Risk_Scre). Now, to protect beneficiary identities and safeguard this information, CMS ...
Anirudh's user avatar
0 votes
0 answers
68 views

I have a question about panel data related to feols in R. Suppose that I have the linear regression model $y_{it}=a+x_{1it}+x_{2it}+\text{error}_{it}$ where $i=1,...,T$ is ...
user454850's user avatar
1 vote
0 answers
79 views

I am wondering if we can analyze ACE twin models from a long data set rather than the usual wide data set (R lavaan, mplus). For example, I have found that you can create ACE twin data in R from here. ...
POC's user avatar
  • 688
0 votes
0 answers
92 views

I have collected a dataset on Employment status. I created the following categories; Pensioners, Formally employed, Informally Employed, Self-employed, and Unemployed. I also have Learners or Students ...
Amelia Nicodemus's user avatar
0 votes
0 answers
53 views

There is a dataset of roughly 4,000 people interviewed and examined, among them a dependent variable I am interested in and roughly 50 variables that I would like to investigate further in their ...
Bernhard's user avatar
  • 8,645
0 votes
0 answers
42 views

I am investigating the relationship between the number of branches on the upper part of a plant (independent) and the number of branches on the lower part (dependent). Thus, the data is numerical ...
Scott's user avatar
  • 1
0 votes
0 answers
57 views

I am learning about Random Forests and want to test it on a dataset that I have. There are 500 samples equally distributed among 50 classes, the samples have ~500 values. Is this suitable for random ...
user438409385's user avatar
0 votes
0 answers
43 views

I am performing a generalized extreme value analysis using about 20 years of data sampled every 1 minute. I am doing this in order to predict return levels at e.g. 1-in-50 and 1-in-100 intervals. The ...
Darcy's user avatar
  • 947
2 votes
1 answer
161 views

I am working on a dataset containing trade data, and my goal is to impute the missing data for a period of around 24 hours. Here's a sample of the trade data I'm working with: timestamp symbol price ...
Mocak's user avatar
  • 21
21 votes
4 answers
4k views

The i.i.d. assumption states: We are given a data set, $\{(x_i,y_i)\}_{i = 1, \ldots, n}$, each data $(x_i,y_i)$ is generated in an independent and identically distributed fashion. To me, ...
Olórin's user avatar
  • 744
0 votes
1 answer
137 views

My data is RNA-seq data with more than 14000 features and the problem is binary classification. Then the total sample is 50 and p>>n. When I use Elasticnet method with train and test data, the ...
Leila ali's user avatar
  • 189
96 votes
6 answers
9k views

In my job role I often work with other people's datasets; non-experts bring me clinical data and I help them summarise it and perform statistical tests. The problem I am having is that the datasets I ...
Chris Beeley's user avatar
  • 5,921
1 vote
0 answers
132 views

Unlike a real distance measure, relative entropy is not symmetric in the sense that $D(p(x)∥q(x)) \neq D(q(x)∥p(x))$. It turns out that many information measures can be expressed by relative entropies....
허정윤's user avatar
6 votes
1 answer
429 views

I have questions about the geometric structure of data sets, esp. as it relates to the relationships between predictors. Is there a name for this field?
Chris Science's user avatar
2 votes
1 answer
119 views

I have the following two types of datasets: dataset-1 is tabular data that describes 7000 proteins. dataset-1 is only one file. dataset-2 consists of 7000 individual data files that are 2D ...
user366312's user avatar
  • 2,077
3 votes
1 answer
495 views

I have a dataset that I want to perform a regression on. However, some of the columns are not in numerical form. For example, the extra classes column. What I ...
Charlotte's user avatar
0 votes
1 answer
506 views

When should I select pairwise deletion? So I grasp the idea of pairwise deletion, but what conditions are actually needed to select this? Is it when data is MCAR? Why would researchers select this ...
Fats's user avatar
  • 21
1 vote
1 answer
79 views

I'm conducting a study on corporate social responsibility (CSR) and am encountering a challenge with my data cleaning. I've instructed respondents to answer the Likert scale questions only for ...
user432017's user avatar
2 votes
2 answers
111 views

I have a longitudinal data set with 2 dependent variables (couple) - a husband and a wife. There were 2 waves for the husbands and 3 waves for the wives. Since there is a lot of missing data, I ...
eagersquirrel's user avatar
2 votes
2 answers
378 views

As explained in the title, I would like to transform a yearly dataset into a monthly one, but including a constraint. My current dataset gives the yearly production of a commodity, and from year to ...
EstebanVer's user avatar
4 votes
3 answers
2k views

I am in the process of solving a Machine Learning challenge, and I want to do it the right way. I did some exploratory data analysis and I wanted to check the distribution of the data. As displayed in ...
Spicy strike's user avatar
0 votes
1 answer
129 views

I am trying to calculate the reliability of a difference score. Specifically, the data have, for each participant, scores for 10 items in Condition X (1s and 0s), as well as 10 different items in ...
Altair555's user avatar
