
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.

5 votes

Data Sets suitable for k-means

To complement JEquihua's great answer, I would like to add two points. Case 3 is a nice example of where it would be useful to have a clustering algorithm that doesn't give only the cluster a …
Franck Dernoncourt's user avatar
2 votes

Looking for redacted text corpus

For medical data, a few datasets can be found at:
Physician notes with annotated PHI
1) i2b2 2006 Deidentification and Smoking Challenge's data set: NLP Data Set #1B: 889 de-identified discharge …
8 votes

What is exactly meant by a "data set"?

In the open data discipline, a dataset is the unit used to measure the information released in a public open data repository. The European Open Data portal aggregates more than half a million datasets. …
5 votes

Plotting data from several files on one plot

One way to do it is to use points:

    x <- seq(0, 2*pi, len = 51)
    y1 = sin(x)
    y2 = cos(x)
    plot(x, y1)
    points(x, y2, col = "red")

If your data files share a common axis, you can use matplot: a <- ma …
3 votes

A suitable corpus for training skip-thought vectors

Common Crawl corpus: consists of 145 TB of data from 1.81 billion webpages as of August 2015
http://www.lrec-conf.org/proceedings/lrec2018/pdf/889.pdf: see Table 1 for several summarization corpora, …
3 votes
Accepted

Why does the CIFAR-10 tutorial on TensorFlow crop the images to be 24x24?

As a side note, the CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. This means that 24x24 cropping keeps most of the image. …
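To put a number on "most of the image", here is a minimal sketch of a random 24x24 crop from a 32x32 image, using only the Python standard library. The function name `random_crop` and the toy image are illustrative assumptions; this is not the code used in the TensorFlow tutorial.

```python
import random

IMAGE_SIZE = 32  # CIFAR-10 images are 32x32
CROP_SIZE = 24   # crop size discussed in the question


def random_crop(image, crop_size, rng):
    """Return a crop_size x crop_size window at a random offset.

    `image` is a list of rows; this helper is hypothetical.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_size)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_size)
    return [row[left:left + crop_size] for row in image[top:top + crop_size]]


# Toy image whose "pixels" are their own (row, column) coordinates.
image = [[(r, c) for c in range(IMAGE_SIZE)] for r in range(IMAGE_SIZE)]
crop = random_crop(image, CROP_SIZE, random.Random(0))

side_fraction = CROP_SIZE / IMAGE_SIZE             # fraction of each side kept
area_fraction = side_fraction ** 2                 # fraction of pixels kept
num_positions = (IMAGE_SIZE - CROP_SIZE + 1) ** 2  # distinct crop offsets

print(len(crop), len(crop[0]), side_fraction, area_fraction, num_positions)
```

Each crop keeps three quarters of each side, which is a little over half of the pixels, and there are 81 distinct offsets a random crop can take.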
17 votes

Training data is imbalanced - but should my validation set also be?

The point of the validation set is to select the epoch/iteration where the neural network is most likely to perform the best on the test set. Consequently, it is preferable that the distribution of cl …
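The excerpt is cut off, so the following rests on an assumption: that the concern is keeping the class mix of the validation set in line with the data the model will be evaluated on. Under that reading, a stratified split is one simple way to carry class proportions from the full labelled set into the validation set. The function name, the 90/10 toy labels, and the 20% validation fraction are all illustrative.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, validation_fraction, seed=0):
    """Split indices so each class keeps roughly the same share in both parts.

    Hypothetical helper for illustration, not taken from the answer.
    """
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, validation_indices = [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_validation = round(len(indices) * validation_fraction)
        validation_indices.extend(indices[:n_validation])
        train_indices.extend(indices[n_validation:])
    return train_indices, validation_indices


# Imbalanced toy data: 90% negative, 10% positive (assumed numbers).
labels = ["negative"] * 900 + ["positive"] * 100
train_idx, val_idx = stratified_split(labels, validation_fraction=0.2)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts, val_counts)
```

Both parts keep the 9:1 class ratio here. If the test set is known to follow a different distribution, the validation set would need to be sampled to match that distribution instead.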