Questions tagged [dataset]

Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

1 vote
0 answers
130 views

I have this function I wrote: ...
Igor9094's user avatar
2 votes
2 answers
231 views

What algorithm should I go for if I want to determine collective outliers within a dataset? By collective outliers, I mean a series of data points that differ significantly from the trends in the rest of ...
Iamtrying's user avatar
3 votes
4 answers
398 views

Suppose that we have $N$ observations of a random variable $V$ from a population: $$\{v_1,\;\ldots,\;v_N\}$$ and all of the observations are positive: $v_i>0\quad \forall i=1,\;\ldots,\;N$. Here, ...
M.C. Park's user avatar
  • 985
3 votes
1 answer
1k views

I understand that latitude and longitude are interval data, but are coordinates (e.g. Point(123, -123)) also considered interval data? If so, how can the standard deviation, mean etc. be calculated?
Verum's user avatar
  • 31
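
A minimal sketch of one way to summarise 2-D coordinates, assuming planar (projected) points rather than latitude/longitude on a sphere. The function name `centroid_and_spread` and the sample points are hypothetical, introduced only for illustration. It computes the component-wise mean (the centroid), the per-axis sample standard deviations, and a standard-distance-style spread using n-1 denominators.

```python
import math

def centroid_and_spread(points):
    """Return the centroid, per-axis sample standard deviations, and a
    standard-distance-style spread for a list of planar (x, y) points.

    Assumption: coordinates are planar; raw latitude/longitude would need
    projecting first (or a spherical method) for distances to be meaningful.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample variances along each axis (n - 1 denominator).
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    # Spread around the centroid: square root of the summed axis variances.
    spread = math.sqrt(var_x + var_y)
    return (mean_x, mean_y), (math.sqrt(var_x), math.sqrt(var_y)), spread

# Hypothetical example data: the four corners of a 2 x 2 square.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
centre, sd, spread = centroid_and_spread(points)
print(centre, sd, spread)
```

Treating each axis separately is the simplest choice; whether it is appropriate depends on the coordinate system and on whether the axes are comparable.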
1 vote
0 answers
34 views

My family is renting out a vacation home and I would like to create a dataset in R to calculate statistics on the bookings (for some helpful insights and also for me to practice R). I have a question ...
Sofia's user avatar
  • 11
2 votes
0 answers
56 views

In most datasets I use, the spectrum of the covariance matrix decays to 0 quite fast, meaning that they are more or less low-rank. My question is whether there are settings or disciplines that are ...
0 votes
2 answers
212 views

I would like to generate random data for a distribution which takes the form of a hurdle-like model. Let's denote a random variable X with probability mass function \begin{align} Pr(X=0)&=\alpha\\ ...
RRMT's user avatar
  • 382
0 votes
0 answers
91 views

For example: Let's say I have dataset A: Measured body temperature of a person during the day. I have measurements from 3 people in the span of a year. If I cluster it, I expect the clusters to ...
user326964's user avatar
0 votes
1 answer
552 views

I am working with a pretty large dataset containing 760 rows and around 58k-60k features and I'd like to perform feature selection to reduce its dimensionality. After standardising the ...
Julen's user avatar
  • 23
0 votes
0 answers
74 views

So I have multiple datasets which are all histograms and cannot be linked. The topic of the data is quite complex so for example imagine I was surveying different qualities between men and women. The ...
sputnik44's user avatar
2 votes
1 answer
138 views

I was creating a linear regression model on a particle collisions dataset. I observed that my model breaks several assumptions in linear regression, and when I tried to fix them, it increased the ...
Featherball's user avatar
7 votes
4 answers
3k views

Assume that I have 10 classes with 100 samples for each class (the same number of samples, a perfectly balanced dataset). I want to add 3 new classes, and which of the following is the best option for the number of ...
Kevin Choi's user avatar
1 vote
0 answers
51 views

It has been suggested that I train my model on a single labeled class of internet traffic and then test it on both traffic types, i.e., normal and DDoS traffic. Is it possible to train the model with just one ...
Akash Singh's user avatar
1 vote
1 answer
67 views

I'm working on a regression problem involving nutrient concentrations. The lab I'm getting my data from can measure up to 9000 ppm of a particular nutrient. Beyond that, everything is reported as ...
Viv Crowe's user avatar
1 vote
0 answers
29 views

I am trying to analyze two external forecast models for weather data that each generate hourly forecasts twice a day for one week ahead. Thereby I get a panel-like dataset, in which I am interested in ...
kriskj's user avatar
  • 11
2 votes
1 answer
219 views

I'm looking to predict the probability of a team winning a game in basketball so that I can create something close to this: If you can't see the picture, it's a graph with time remaining on the x-...
sla813's user avatar
  • 85
1 vote
1 answer
133 views

I am aware there are quite a few similar questions, but none of the answers dealt with the following situation: I have a task with a train dataset and a test dataset provided. All previous approaches are measured ...
Baltazar Gąbka's user avatar
2 votes
0 answers
422 views

I've been normalizing a dataset and after tokenizing my words I've noticed that some records contain combinations of words where the spaces between them are missing. ...
Tolure's user avatar
  • 121
0 votes
0 answers
37 views

I have a question about the root mean square error and Wasserstein distance on the paper https://arxiv.org/abs/2111.08736?context=stat. Consider two discrete probability distributions $P=\{P_i\}_{i=1}...
oliver's user avatar
  • 1
1 vote
1 answer
183 views

There are many discussions on here about techniques for handling unbalanced datasets, e.g. "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?". My question is different ...
Ryan Volpi's user avatar
  • 2,008
1 vote
1 answer
772 views

I'm currently struggling with the choice of time origin for survival analysis in my data. My data comes from an ongoing clinical database of patients who all have the same genetic disease. In it, I ...
floubert's user avatar
0 votes
1 answer
70 views

We have the following dataset: $$ \begin{bmatrix} x_1 & x_2 & y\\ +1 & 0 & +1\\ -1 & 0 & +1\\ 0 & +2 & +1\\ 0 & +1 & -1 \end{bmatrix} $$ I was asked to find ...
user361992's user avatar
1 vote
0 answers
54 views

We frequently apply various statistical tests to data, including the stationarity test, the t-test, tests for randomness, etc. They are typically used with direct data, or information obtained ...
bogus's user avatar
  • 65
1 vote
1 answer
1k views

I learned in school that we don't split the dataset into training and testing sets if the sample size is less than 30. I wonder why we don't?
Anna Quoc Nguyen's user avatar
1 vote
0 answers
75 views

I'll try to summarize my situation as best as I can first. I'm doing a project in which we're trying to do a prognostic model on a specific disease to determine the risk of an outcome happening to ...
floubert's user avatar
4 votes
2 answers
463 views

I am trying to summarise and visualise three attributes in R: Patient_Age, Patient_Deprivation and Hospital_Time. I am trying to summarise the time patients spent in hospital by deprivation (scale 1-5) ...
Usman YousafZai's user avatar
1 vote
1 answer
106 views

I've discovered by chance that R produces different results when using a dataset which has been first transformed by the natural logarithm and then loaded into R for analysis and when the dataset is ...
TFT's user avatar
  • 345
3 votes
1 answer
151 views

Imagine Anne has a labeled training dataset for a machine learning prediction problem. There is an opportunity to acquire more data from an agent, at a cost. However, before she decides to acquire ...
NGInd's user avatar
  • 75
0 votes
0 answers
140 views

I am working on a project for university and I would like to demonstrate that adding more rows to the same model (linear regression, for instance, or other types of models) increases the model quality. ...
 Annis99's user avatar
0 votes
0 answers
26 views

I have a matrix that has 50 named rows and 15 named columns. The matrix contains 0 and 1 values and is 85% sparse, similar to the 'last.fm dataset'. I would like to intelligently create more rows from my ...
UsrnmChck's user avatar
1 vote
2 answers
90 views

Let's suppose I have a classification problem with three y classes {bad, middle, good}. Data in my dataset and also data in true reality are identically distributed between the three labels (33%, 33%, ...
Luca's user avatar
  • 11
3 votes
1 answer
431 views

I'm writing a paper on the Aitchison geometry for compositional data and I have seen an image I want to reproduce in R. I work with the "compositions" library and I want to understand how to ...
vitalmath's user avatar
  • 131
0 votes
0 answers
193 views

I need to do the 'cleaning' of the dataset, i.e. preprocessing. I noticed that I have two columns in the dataset that are totally linearly dependent. Is it okay to delete one, both or am I not allowed ...
Guest0225's user avatar
1 vote
1 answer
729 views

I want to find the average days to pay for a client based on last year's many invoices. Nearly all of them are paid. Most formulas for this activity rely on outstanding amounts for the client. In this ...
jabs's user avatar
  • 199
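
A minimal sketch of the calculation described above, assuming each invoice is stored as an (issue date, payment date) pair with `None` for invoices that are still unpaid. The function name `average_days_to_pay` and the sample invoices are hypothetical. It averages the days between issue and payment over paid invoices only.

```python
from datetime import date

def average_days_to_pay(invoices):
    """Mean number of days between issue and payment, over paid invoices.

    `invoices` is a list of (issued, paid) date pairs; `paid` is None when
    the invoice is still outstanding (an assumed representation).
    """
    days = [(paid - issued).days for issued, paid in invoices if paid is not None]
    if not days:
        raise ValueError("no paid invoices to average over")
    return sum(days) / len(days)

# Hypothetical example data: two paid invoices and one still outstanding.
invoices = [
    (date(2022, 1, 1), date(2022, 1, 31)),  # paid after 30 days
    (date(2022, 2, 1), date(2022, 2, 11)),  # paid after 10 days
    (date(2022, 3, 1), None),               # unpaid, excluded from the mean
]
avg = average_days_to_pay(invoices)
print(avg)  # 20.0
```

Dropping unpaid invoices tends to understate the true average, since the slowest payers are the ones most likely to still be outstanding; with nearly all invoices paid, the effect may be small.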
0 votes
0 answers
466 views

Goal: To determine whether to use the median or average (or a weighted combination of the two) from a data set based on whether ...
p.luck's user avatar
  • 101
3 votes
1 answer
97 views

I have access to a sample (size $N$) of functional data. Each observation corresponds to $C$ functions. Each function $f_{n,c}$ is represented by $T_n$ points for $1\leq n \leq N, 1\leq c \leq C$. All ...
noob's user avatar
  • 2,620
0 votes
1 answer
385 views

I have a regression/classification problem. The dataset contains data from 4 sensors at 4 positions (1, 2, 3, 4). The processes measured at all 4 positions are equivalent, and the same label and features describe all 4 ...
ziga's user avatar
  • 3
0 votes
0 answers
104 views

I want to build a recommender system to suggest similar songs to continue a playlist (similar to what Spotify does by recommending similar songs at the end of a playlist). I want to build two models: ...
Pybubb's user avatar
  • 1
1 vote
1 answer
329 views

Evaluating ML models is a fundamental task and subfield of the Machine Learning practice. On the other hand, I was not able to find any existing materials, guides, protocols, papers on how to proceed ...
Betelgeux's user avatar
0 votes
2 answers
607 views

I am training a classifier using BERT and want to check how the accuracy changes with increasing training data size. Up until now, I have 1k annotated training samples and tested the accuracy for ...
Sven's user avatar
  • 3
5 votes
2 answers
236 views

Suppose I have a dataset of garbage. Will a model perform better if I only label the dataset with biodegradable or non-biodegradable? Or will it be better if I label them with plastics, ...
wd violet's user avatar
  • 787
2 votes
2 answers
2k views

I have salary data of 3000 employees ranging from 3000 - 10000 dollars. Based on my understanding (https://mathbitsnotebook.com/Algebra1/FunctionGraphs/FNGContinuousDiscrete.html): Continuous data is a ...
user3164187's user avatar
0 votes
0 answers
130 views

I hope that this question has not already been asked. I am analyzing data in R (and am a novice). I have a highly skewed data vector in a dataframe with missing values that I hope to set as the ...
vochoa213's user avatar
0 votes
0 answers
70 views

I have 3 domains of supplier data (Jan 2017 to Jan 2022) and they are as follows a) Purchase data - Contains all the purchase (of product) data made by the suppliers with us. It contains columns such ...
The Great's user avatar
  • 3,380
1 vote
0 answers
31 views

Below is the given data, how can I make clusters using symmetric matrix?
Nivedita's user avatar
2 votes
1 answer
265 views

From my understanding, a paired t-test is used when samples are dependent on each other. I'm having trouble deciding whether a paired t-test should be used when comparing the average population of a ...
0 votes
0 answers
135 views

I am training a CNN to predict age, mass and tone from images. The structure of my dataset is as follows ...
Sparsh Garg's user avatar
0 votes
1 answer
53 views

I have a bunch of telephone data including information on call start time, end time, and duration. I am trying to evaluate the quality of the dataset to determine if the phone call data are legitimate ...
324's user avatar
  • 504
0 votes
0 answers
109 views

If we divide the data into training data, validation data, and testing data, I remember the lesson from Andrew Ng saying we use the validation data for hyperparameter tuning purpose. (you can see this ...
william007's user avatar
  • 1,097
1 vote
0 answers
31 views

I am trying to find a dataset of medical images related to tumors/cancer. There should be images of different stages of the cancer and also preferably the details about the patient, their medical history, ...
JyotishmanK7719's user avatar
