A Guide to Overfitting and Underfitting in Machine Learning

1. Introduction to Overfitting and Underfitting

When training machine learning models, one of the key challenges is balancing model complexity. Models that are too simple may fail to capture the underlying patterns in the data; this is called underfitting. On the other hand, models that are too complex may latch onto noise and irrelevant details in the training data; this is called overfitting.

The goal in machine learning is to find the “Goldilocks” model that is “just right” – not too simple and not too complex. This ideal model is able to learn the true signal from the data without memorizing unnecessary noise. Finding the right level of model complexity for the problem and dataset is critical for success.

In this comprehensive guide, we will dig deep into overfitting and underfitting. You will learn how to detect overfitted and underfitted models, understand why these issues are harmful, and apply techniques to address them. Mastering model complexity helps you build machine learning models that generalize well to new, unseen data.

2. What is Model Complexity?

Model complexity refers to how flexible or rigid a machine learning model is. Simple linear models like linear regression tend to have low complexity. The model only learns a single line or plane. More complex models like deep neural networks have many parameters and can learn highly non-linear relationships.

Model complexity is driven by factors like:

  • Number of features or input variables
  • Number of model parameters that must be learned
  • Type of model (linear vs. non-linear)
  • Depth of the model (e.g. number of hidden layers in a neural network)

Complex models are very flexible and can fit many types of functions. Simple models are more rigid but may fail to pick up nuanced patterns. The optimal model complexity depends on the complexity of the true underlying relationship we want to model.
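
To make this concrete, here is a minimal sketch (using scikit-learn on a made-up one-dimensional dataset; the degrees shown are illustrative) of how a polynomial model's flexibility grows with its degree. Each extra degree adds another learnable coefficient, letting the model bend to fit more intricate shapes.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Toy data: a noisy sine wave, a relationship a straight line cannot capture.
    rng = np.random.default_rng(0)
    X = rng.uniform(0, 6, size=(80, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

    # Models of increasing complexity: each extra polynomial degree adds a coefficient.
    for degree in (1, 3, 10):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X, y)
        n_coeffs = model.named_steps["linearregression"].coef_.size
        print(f"degree={degree:2d}  learned coefficients={n_coeffs}  "
              f"training R^2={model.score(X, y):.3f}")

Note that a higher training score here is not a virtue in itself; it simply shows how added capacity lets the model chase the training data more closely.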

3. Overfitting in Machine Learning

Overfitting occurs when a machine learning model fits the training data too well, but does poorly on new data. An overfit model has “memorized” the noise and details of the training set, instead of learning the true underlying relationship.

For example, imagine training a classifier to detect spam emails. If the model overfits, it may fit the quirks and characteristics of the emails in the training set, without learning general properties of spam. The performance on the training set will be excellent, but the model will do poorly on new emails it hasn’t seen before.

Overfitting commonly occurs with highly flexible models like deep neural networks. The abundance of parameters enables the model to perfectly fit the training data by essentially memorizing it.

4. Signs that a Model is Overfitting

There are several telltale signs that a machine learning model is overfitting:

  • Training error is low, but validation error is high. The gap between training and validation performance indicates overfitting.
  • Training error decreases, but validation error starts increasing. The model begins overfitting as capacity increases.
  • The model is very complex relative to the amount and complexity of the training data. High capacity models tend to overfit small or simple datasets.
  • The model achieves very high training accuracy (e.g. above 95%) quickly and keeps improving. This suggests it may be memorizing details of the training set rather than learning general patterns.

Looking at metrics like training vs. validation error as model capacity increases helps diagnose overfitting. In general, if performance on the validation set degrades, it’s a sign of overfitting.
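
One practical way to run this check (sketched here with scikit-learn's validation_curve on a synthetic classification problem; the dataset and depth range are illustrative) is to sweep a capacity parameter, such as a decision tree's max_depth, and watch where the training and validation scores diverge.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import validation_curve
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic dataset standing in for real training data.
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

    depths = list(range(1, 15))
    train_scores, val_scores = validation_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        param_name="max_depth", param_range=depths, cv=5,
    )

    for depth, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        # A widening gap between the two scores as depth grows is the classic
        # overfitting signature.
        print(f"max_depth={depth:2d}  train={tr:.2f}  validation={va:.2f}  gap={tr - va:.2f}")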

5. Problems with Overfitting

Overfitting leads to several issues:

  • Poor generalization – The model will not perform well on new data since it did not learn the true underlying patterns.
  • Increased variance – Small changes in the training data can significantly impact the learned patterns.
  • Lack of interpretability – Analyzing and reasoning about heavily overfit models is difficult.
  • Wasted resources – Additional model capacity beyond a certain point is not helpful since the model starts overfitting.

In real-world applications, being able to apply models to new, unseen data (generalization) is critical. Overfitting interferes with this goal and should generally be avoided, except in some cases like anomaly detection.

6. What is Underfitting in Machine Learning?

Underfitting occurs when a machine learning model is not complex enough to learn the underlying pattern in the data. An underfit model will have high error on both the training data and new data.

For example, fitting a simple linear model to non-linear data will underfit the relationship. The linear model lacks the capacity to adequately capture the patterns.

Underfitting commonly occurs when very simple models, like linear regression, are applied to complex data. The model cannot represent relationships beyond its limited capacity.
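
As a rough illustration (again with scikit-learn and a made-up quadratic relationship), a plain linear model fit to curved data scores poorly both on the data it was trained on and on held-out data, which is the hallmark of underfitting.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # Made-up data with a quadratic relationship a straight line cannot follow.
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    linear = LinearRegression().fit(X_train, y_train)
    print(f"train R^2 = {linear.score(X_train, y_train):.2f}")  # low: fails even on training data
    print(f"test  R^2 = {linear.score(X_test, y_test):.2f}")    # also low: nothing generalizes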

7. Signs that a Model is Underfitting

There are a few signs that indicate a machine learning model is underfitting:

  • Training error and validation error are both high. The model fails to fit the training data well.
  • Increasing model capacity does not lead to lower training error. The model needs more complexity.
  • The model achieves poor accuracy (e.g. below 60-70%) quickly and stays poor. The model lacks capacity.
  • Adding more data does not lead to lower error. The model cannot take advantage of additional training data.

If training a model leads to high error quickly which does not improve with additional iterations or data, it’s a clear sign of underfitting. The solution is to increase model capacity.

8. Problems with Underfitting

Underfitting causes the following issues:

  • Poor generalization – Like overfitting, underfitting hurts the model’s ability to generalize to new data since it did not capture the patterns well.
  • High bias – The rigid assumptions baked into an underfit model result in high bias.
  • Wasted resources – Collecting additional training data cannot help if the model does not have enough capacity to learn from it.

Choosing too simple a model can lead to many wasted resources trying to improve performance. The first priority should be ensuring the model has sufficient complexity for the problem.

9. The Bias-Variance Tradeoff

Overfitting and underfitting relate closely to the bias-variance tradeoff. Bias is error caused by overly rigid assumptions in the model. Variance is error caused by the model being too sensitive to the particular training set it happened to see.

Simple models with low capacity have high bias and low variance. Complex models with high capacity have low bias but high variance. The goal is to find a model complexity that balances bias and variance.

As model complexity increases:

  • Bias decreases (good)
  • Variance increases (bad)

The optimal model complexity depends on the problem. Simple problems may need simple models. But most real-world datasets require some model complexity in order to fit nonlinear relationships while avoiding overfitting.
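
One way to see the tradeoff numerically is a small simulation: refit the same model class on many resampled training sets and measure how far the average prediction at a fixed point sits from the truth (bias) and how much the individual predictions scatter (variance). The sketch below uses NumPy polynomial fits; the degrees, noise level, and sample size are illustrative, so the exact numbers will vary.

    import numpy as np

    rng = np.random.default_rng(2)
    true_fn = np.sin              # the "true" relationship we pretend to know
    x_query = 1.5                 # point at which bias and variance are measured

    for degree in (1, 4, 9):
        preds = []
        for _ in range(200):      # many independent training sets
            x = rng.uniform(0, 5, 30)
            y = true_fn(x) + rng.normal(scale=0.3, size=30)
            coeffs = np.polyfit(x, y, degree)
            preds.append(np.polyval(coeffs, x_query))
        preds = np.array(preds)
        bias = preds.mean() - true_fn(x_query)
        variance = preds.var()
        print(f"degree={degree:2d}  bias={bias:+.3f}  variance={variance:.3f}")
    # Typically: low degrees show large bias, high degrees show large variance.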

10. Techniques to Deal with Overfitting and Underfitting

There are several ways to combat overfitting and underfitting:

For overfitting:

  • Regularization – Penalizes model complexity by adding a cost for larger or more numerous parameters, which improves generalization (see the sketch after this list).
  • Early stopping – Stops training before the model overfits the training data.
  • Dropout – Randomly drops neurons during neural network training to reduce co-dependencies between them.
  • Data augmentation – Expands the effective size of the training data, for example by transforming existing examples, to reduce overfitting.
  • Model averaging – Averages predictions from multiple models to reduce variance.
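
As a minimal sketch of the regularization idea (using scikit-learn's Ridge on synthetic data; the alpha values are illustrative), increasing the penalty strength shrinks the learned coefficients and will usually narrow the gap between training and test performance on noisy, high-dimensional data.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Noisy synthetic data with many more features than the signal really needs.
    X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                           noise=20.0, random_state=3)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

    # alpha controls the penalty on large coefficients; small alpha is close to
    # plain least squares, large alpha constrains the model more strongly.
    for alpha in (0.1, 1.0, 10.0, 100.0):
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        print(f"alpha={alpha:6.1f}  train R^2={model.score(X_train, y_train):.2f}  "
              f"test R^2={model.score(X_test, y_test):.2f}")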

For underfitting:

  • Increase model capacity – Use more complex models like deep neural networks.
  • Reduce constraints – Remove rigid assumptions limiting model flexibility.
  • Feature engineering – Create better model inputs that allow the model to learn richer relationships (see the sketch after this list).
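
Continuing the earlier quadratic example, a minimal sketch of the feature-engineering remedy (assuming the same scikit-learn toy setup) is simply to hand the linear model a squared input feature, which restores the capacity it was missing.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    # Same style of made-up quadratic data as in the underfitting example above.
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # Adding a squared feature gives the linear model enough capacity for this problem.
    model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
    model.fit(X_train, y_train)
    print(f"train R^2 = {model.score(X_train, y_train):.2f}")
    print(f"test  R^2 = {model.score(X_test, y_test):.2f}")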

In practice, most solutions deal with overfitting as it is more common. But underfitting should not be overlooked, especially early in the model development process.

11. Examples of Overfitting and Underfitting

Here are some examples of overfitting and underfitting:

Overfitting

  • A high-capacity deep neural network trained on a small image dataset. The model achieves 99% training accuracy but only 50% validation accuracy – a sign of overfitting.
  • A polynomial regression model trained to predict housing prices. As the polynomial degree increases, training error goes down but validation error goes up, indicating overfitting.
  • A support vector machine classifier with an RBF kernel trained to classify emails as spam or not spam. The model perfectly classifies the emails in the training set but does poorly on new emails.

Underfitting

  • Fitting a linear regression model to non-linear data. The model cannot capture the true relationship, resulting in high training and validation error.
  • Using logistic regression to predict complex human behavior. The rigid assumptions result in poor performance on complex datasets.
  • Training a decision tree with very little data and setting a max depth of 2. The overly simple model cannot learn the patterns well.

The key is to visualize training/validation performance over time and model complexity. Monitoring for divergence between the two curves helps diagnose overfitting and underfitting.
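
As a rough sketch of that monitoring step (matplotlib plus the same kind of scikit-learn validation curve used earlier; the dataset is synthetic), plotting the two mean scores against a capacity parameter makes the divergence easy to spot.

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.model_selection import validation_curve
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
    depths = list(range(1, 15))
    train_scores, val_scores = validation_curve(
        DecisionTreeClassifier(random_state=0), X, y,
        param_name="max_depth", param_range=depths, cv=5,
    )

    # The point where the curves diverge marks the onset of overfitting,
    # while both curves sitting low signals underfitting.
    plt.plot(depths, train_scores.mean(axis=1), label="training accuracy")
    plt.plot(depths, val_scores.mean(axis=1), label="validation accuracy")
    plt.xlabel("max_depth (model complexity)")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()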

12. Conclusion

Finding the right level of model complexity is critical in machine learning. Overly simple models will underfit and fail to capture rich relationships in the data. Overly complex models will overfit – memorizing noise instead of learning generalizable patterns.

Techniques like regularization, early stopping, and getting more training data can help prevent overfitting. For underfitting, increasing model capacity and removing rigid assumptions enables learning more complex relationships.

The bias-variance tradeoff provides a useful framework for reasoning about model complexity. Optimal performance requires balancing the flexibility to fit the data with regularization to avoid capturing noise. By understanding overfitting, underfitting and the tools to address them, you can build machine learning models that generalize well to new data.
