I have an open-source dataset that I would like to use in a deep learning course, but the downside of using open-source datasets is that open-source solutions already exist. I want to manipulate the data so that students cannot reuse previously trained architectures, while still allowing them to undo the manipulation once the assignment is complete. This will let them compare their results to existing results on the same data.
Is it possible to manipulate the data such that the student will require a different neural network architecture than the one that is optimal for the original dataset? In addition, is it possible to make it so that the student can undo the manipulation and perform similarly well on the original dataset with their unique architecture?
One suggestion is to use an ill-conditioned matrix to manipulate the data while preserving the information, but it is not clear to me how you could use the inverse to make the neural network architecture work on the original dataset (without retraining).
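For concreteness, here is a minimal sketch of what that kind of manipulation could look like on flattened numeric or image data. The matrix, its conditioning, and the dataset shape are all illustrative assumptions, not a worked-out scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for a flattened open-source dataset:
# 1000 samples, 64 features (e.g. 8x8 images flattened).
X = rng.normal(size=(1000, 64))

# Fixed invertible mixing matrix known only to the instructor.
# Its condition number controls how "scrambled" the data looks
# and how much numerical error the inversion introduces.
W = rng.normal(size=(64, 64))
print("condition number:", np.linalg.cond(W))

# Transformed dataset handed to the students.
X_scrambled = X @ W

# Undoing the manipulation on the *data* is straightforward:
X_recovered = X_scrambled @ np.linalg.inv(W)
print("max recovery error:", np.max(np.abs(X_recovered - X)))
```

This only shows undoing the transform on the data itself; the part I am unsure about is how (or whether) the same inverse could be absorbed into the student's trained network so that it runs on the original dataset without retraining.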
To be clear, it is known that, for the original open-source dataset:
1. An architecture that models the data well exists and is publicly available,
2. The trained model from (1) is publicly available.
Therefore, I would like to transform the open-source dataset such that:
1. The student needs to figure out a new architecture that is not publicly known,
2. As a consequence of (1), the student must train their own model,
3. Without re-training the model, the student is able to apply a transformation (deduced from the original transformation) so that their model works on the original open-source dataset.
The question is: what transform could be applied to the original data, and how do you undo that transform in the trained model? The datasets involved vary and include recordings, images, and numeric data, for both classification and regression.