Neural network immediately overfitting


I have an FFNN with two hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on the number of hidden units). (ReLU, Adam, MSE, same number of hidden units per layer, tf.keras.)



32 neurons: [training/validation loss curves]

128 neurons: [training/validation loss curves]



I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.



Afaik it is better to have a network that is too large and regularize it via L2 regularization or dropout than to lower the network's capacity, because a larger network will have more local minima but the actual loss value will be better.



Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?



If so I suppose I could increase both bounds. If not I would lower them.


from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(n_neurons, activation='relu'))
model.add(Dense(n_neurons, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')
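
For concreteness, a regularized variant of this model with Dropout and L2 (the rates below are illustrative guesses, not tuned values) might look something like:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2

# Sketch only: dropout rate and L2 factor are placeholders to be tuned on validation data.
model = Sequential()
model.add(Dense(n_neurons, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.2))
model.add(Dense(n_neurons, activation='relu', kernel_regularizer=l2(1e-4)))
model.add(Dropout(0.2))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')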





How much training data do you have? Are your training, test (and validation) sets drawn from the same distribution? If they aren't, your network will learn something completely different. Add your model code to the question.
– youhans
Jul 1 at 10:31






Done, thanks for your comment. I have 120k samples. All sets from the same distribution. Data augmentation is an option I am looking into.
– svdc
Jul 1 at 10:38




2 Answers



Hyperparameter tuning is generally the hardest step in ML. In general we try different values randomly, evaluate the model, and choose the set of values that gives the best performance.



Getting back to your question: you have a high variance problem (good on training, bad on testing).



There are eight things you can do, in order:



Depending on your computation power and time, you can set bounds on the number of hidden units and hidden layers you try.
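
For example, a minimal random-search sketch over the number of hidden units could look like the following; it assumes a hypothetical helper build_model(n) that returns a compiled model like the one in the question, and that x_train/y_train/x_val/y_val already exist. The candidate values and trial count are arbitrary placeholders.

import random

best_loss, best_n = float('inf'), None
for _ in range(10):  # number of trials is arbitrary
    n = random.choice([8, 16, 32, 64, 128])  # candidate hidden-unit counts, illustrative only
    model = build_model(n)  # hypothetical helper returning a compiled model
    history = model.fit(x_train, y_train, epochs=20,
                        validation_data=(x_val, y_val), verbose=0)
    val_loss = min(history.history['val_loss'])
    if val_loss < best_loss:
        best_loss, best_n = val_loss, n
print(best_n, best_loss)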



because a larger network will have more local minima.



Nope, this is not quite true. In reality, as the number of input dimensions increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima; it is very rare. For a local/global minimum, the derivatives across all the dimensions of the working space must be zero, which is highly unlikely in a typical model.



One more thing: I noticed you are using a linear unit for the last layer. I suggest you go for ReLU instead. In general we do not need negative values in regression, and it will reduce test/train error.



Take this: in MSE, 1/2 * (y_true - y_prediction)^2, y_prediction can be a negative value, so the whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive. Using a ReLU for the last layer makes sure that y_prediction is positive, hence a lower error can be expected.
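
If, and only if, the regression targets are known to be non-negative, the suggested change is just a different activation on the output layer; a minimal sketch:

from tensorflow.keras.layers import Dense

# Only sensible when y_true >= 0; a ReLU output cannot produce negative predictions at all.
model.add(Dense(1, activation='relu'))  # instead of activation='linear'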





Thanks for your comment. 1, 2, 3 are done. 4: it's already overfitting, so wouldn't that make things worse? Do you mean more layers, but fewer neurons per layer?
– svdc
Jul 1 at 18:06





Yup, you got it correct. There are some functions that a deep network learns easily that shallow nets, even with many neurons, cannot. Build a deep net with 3-6 layers and apply one of the regularizers. If even this doesn't help, you may need to change the loss function or the whole network architecture.
– coder3101
Jul 1 at 18:20
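
A sketch of the "more layers, fewer neurons per layer" idea from this comment, with an illustrative (untuned) dropout rate:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# 4 narrow hidden layers instead of 2 wide ones; depth and rates are placeholders.
model = Sequential()
for _ in range(4):
    model.add(Dense(16, activation='relu'))
    model.add(Dropout(0.1))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')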





Also remove the linear unit from the last layer and use ReLU. Read the new answer.
– coder3101
Jul 1 at 18:41



Let me try to substantiate some of the ideas here, referenced from Ian Goodfellow et al.'s Deep Learning book, which is available for free online:



Bottom line: you can't just play with the model and hope for the best. Check the data, understand what is required, and then apply the corresponding techniques. For more details read the book; it's very good. Your starting point should be a simple regression model, one layer, very few neurons, and see what happens. Then experiment incrementally.
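
As one possible starting point, a deliberately tiny baseline with early stopping (layer size, patience, and epoch count are placeholders; x_train/y_train/x_val/y_val are assumed to exist):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping

baseline = Sequential()
baseline.add(Dense(4, activation='relu'))  # very few neurons on purpose
baseline.add(Dense(1, activation='linear'))
baseline.compile(optimizer='adam', loss='mse')

# Stop when validation loss stops improving, then compare against larger models.
baseline.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100,
             callbacks=[EarlyStopping(patience=5, restore_best_weights=True)], verbose=0)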





Excellent comment, thank you! Both the training and test sets are drawn from the same distribution. I agree that I am severely lacking data though; I will give that noise-injection technique a go. I have a simple linear model that is performing as well as (or even better than) my NN, which seems odd. I would expect a NN to do better on a regression problem -- perhaps not by a lot, depending on the exact use case, but better nonetheless.
– svdc
Jul 1 at 18:19
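
For what it's worth, the noise-injection idea mentioned here is often done by adding small Gaussian noise to the inputs during training; one way to sketch it in tf.keras (the noise level is a placeholder to be tuned):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, GaussianNoise

model = Sequential()
model.add(GaussianNoise(0.01))  # active only during training; stddev is a placeholder
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')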






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
