Overfitting and underfitting
Overfitting
Even when the training data and testing data are i.i.d., generalization can still fail.
Overfitting is a modeling error that occurs when a function is fit too closely to a limited set of data points.
Why is overfitting a problem
- Overfitting leads to low training error yet high testing error.
- Our goal is to make the testing error small, not the training error.
Plotting a polynomial
- Using a polynomial of degree N to fit \(y = \sum^N_{i=1} w_i x^i\)
- A higher-degree polynomial gives a more complex curve to fit the data.
https://github.com/MSBD-5001/Lecture-Materials/blob/master/l3_simulation_lecture.ipynb
Underfitting
Underfitting occurs when the model or algorithm does not fit the data well enough.
Example
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels import regression
from numpy import poly1d
x = np.arange(10)
y = 2*np.random.randn(10) + x**2          # noisy samples from a quadratic function
xs = np.linspace(-0.25, 9.25, 200)        # grid for plotting the fitted curves
lin = np.polyfit(x, y, 1)                 # degree 1: underfits
quad = np.polyfit(x, y, 2)                # degree 2: good fit
many = np.polyfit(x, y, 9)                # degree 9: overfits
plt.scatter(x, y)
plt.plot(xs, poly1d(lin)(xs))
plt.plot(xs, poly1d(quad)(xs))
plt.plot(xs, poly1d(many)(xs))
plt.ylabel('Y')
plt.xlabel('X')
plt.legend(['Underfit', 'Good fit', 'Overfit']);
Errors: Bias and variance
Expected error \(= \text{Bias}^2 + \text{Variance} + \text{Noise}\)
Bias: the difference between the average prediction of our model and the correct value we are trying to predict.
Variance: the variability of the model's prediction for a given data point.
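For squared-error loss, these terms come from the standard decomposition of the expected error. In the usual notation (not spelled out in the slides), with true function \(f\), learned model \(\hat{f}\), and observation noise \(\varepsilon\) of variance \(\sigma^2\), so that \(y = f(x) + \varepsilon\):

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}
\]

where the expectation is taken over random draws of the training set and the noise.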
Our goal is to select models that are of optimal complexity.
Complex models have low bias and high variance:
- Low bias: complicated models capture many features of the data
- High variance: the testing set may not have the same features
- Overfitting
Simple models have low variance and high bias:
- Underfitting
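A minimal simulation sketch (not part of the original notebook) that illustrates this trade-off on the same noisy quadratic as above: redraw the training set many times, refit polynomials of different degrees, and estimate the bias² and variance of the prediction at one fixed point.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10)
f = x**2                      # true (noise-free) function values
x0 = 4.5                      # a fixed test point
f0 = x0**2                    # true value at the test point

def simulate(degree, trials=2000):
    """Estimate bias^2 and variance of a degree-`degree` polynomial fit at x0."""
    preds = np.empty(trials)
    for t in range(trials):
        y = f + 2 * rng.standard_normal(len(x))   # redraw a noisy training set
        coef = np.polyfit(x, y, degree)
        preds[t] = np.poly1d(coef)(x0)
    bias2 = (preds.mean() - f0) ** 2
    variance = preds.var()
    return bias2, variance

for degree in (1, 2, 9):
    b2, var = simulate(degree)
    print(f"degree {degree}: bias^2 = {b2:8.2f}, variance = {var:8.2f}")
```

The degree-1 fit should show a large bias² with small variance, while the degree-9 fit shows the opposite.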
How to reduce variance and keep bias at a low value?
- A larger training dataset reduces variance
- Noise in the data is unavoidable
- Regularization and ensemble learning
Selecting good models
Validation
Split the training data into training data and validation data
- Validation data are used only to evaluate the performance of the trained model.
- If the model generalizes well on the validation data, it should also generalize well on the testing data.
- Drawback: part of the original training data is wasted.
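A minimal sketch of such a hold-out split, assuming scikit-learn; the synthetic data and the 80/20 split ratio are illustrative choices, not from the lecture.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = X[:, 0] ** 2 + 2 * rng.standard_normal(100)

# Hold out 20% of the training data purely for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train MSE     :", mean_squared_error(y_train, model.predict(X_train)))
print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
```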
Cross validation
Makes use of all the training data for validation
- Partition the training data into several groups
- Repeat: hold out one group as the validation set and train a new model on the remaining groups
- Performance metric: average error on the validation data.
k-fold cross validation
- Equally split the data into k folds
- Each time, use one fold for validation and the remaining folds for training
- k-fold can be used for large datasets
- Leave-one-out can be used when the dataset is small: use only 1 sample for validation and the rest for training.
- Select models with cross-validation: use cross-validation to evaluate the performance of different models, then select the best one (see the sketch below).
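A minimal sketch of model selection with 5-fold cross-validation, assuming scikit-learn; the synthetic data and the candidate polynomial degrees are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = X[:, 0] ** 2 + 2 * rng.standard_normal(100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (1, 2, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negative MSE; flip the sign to get MSE
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"degree {degree}: mean validation MSE = {scores.mean():.2f}")
```

The model with the smallest average validation error would be selected.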
Improving the models
| Method | Train sequentially or in parallel | How to generate different models | Reduces bias or variance |
|---|---|---|---|
| Bagging | Parallel | Bootstrap data | Variance |
| Random Forest | Parallel | Bootstrap + random subset of features at each split | Variance |
| Boosting | Sequential | Reweight training data | Bias and variance |
Regularization
Prevents overfitting by reducing the flexibility of the model.
Prevents the parameters from taking too large absolute values.
- Reduces variance
- Prevents overfitting
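One common way to penalize large parameters is an L2 penalty on the weights (ridge regression). A minimal sketch, assuming scikit-learn; the degree-9 polynomial features and the penalty strength alpha=1.0 are illustrative choices.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = np.arange(10)
y = x**2 + 2 * rng.standard_normal(10)
X = x.reshape(-1, 1)

# Degree-9 polynomial: unregularized vs. L2-regularized (ridge) fit
plain = make_pipeline(PolynomialFeatures(9), LinearRegression()).fit(X, y)
ridge = make_pipeline(PolynomialFeatures(9), Ridge(alpha=1.0)).fit(X, y)

# The penalty shrinks the coefficients toward zero, reducing variance
print("max |w| without regularization:", np.abs(plain[-1].coef_).max())
print("max |w| with ridge            :", np.abs(ridge[-1].coef_).max())
```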
Ensemble
Standard decision trees can achieve low bias.
- Training set error can be zero: you can always keep splitting down to the last branch
- But they have large variance
Early stopping with a fixed number of nodes or a fixed depth may incur high bias.
Averaging
For regression: simply average the results predicted by the different trees (a weighted average can also be used).
For classification: select the most frequently predicted class.
Also called voting.
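A tiny illustrative sketch of both combination rules, with made-up predictions from three hypothetical trees.

```python
import numpy as np

# Predictions from three hypothetical trees for the same test point
reg_preds = np.array([2.9, 3.4, 3.1])          # regression outputs
cls_preds = np.array(["cat", "dog", "cat"])    # classification outputs

# Regression: plain average (weights could also be used)
print("averaged prediction:", reg_preds.mean())

# Classification: majority vote, i.e. the most frequently predicted class
labels, counts = np.unique(cls_preds, return_counts=True)
print("voted prediction:", labels[counts.argmax()])
```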
Bagging
Short for Bootstrap aggregating.
Bootstrap the data B times, each sample of size N, drawn with replacement. Train B classifiers, one on each bootstrap sample.
Bagging keeps a similar bias, since the data come from resampling the same dataset, but aggregating the B models reduces the variance.
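A minimal sketch of bagging by hand, assuming scikit-learn decision trees; the synthetic data and B = 50 are illustrative. (scikit-learn also ships ready-made BaggingRegressor / BaggingClassifier estimators.)

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = X[:, 0] ** 2 + 2 * rng.standard_normal(200)

B, N = 50, len(X)          # number of bootstrap samples and sample size
trees = []
for _ in range(B):
    idx = rng.integers(0, N, size=N)      # draw N indices with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Aggregate by averaging the B predictions (voting would be used for classification)
X_test = np.array([[4.5]])
preds = np.array([t.predict(X_test)[0] for t in trees])
print("bagged prediction:", preds.mean(), " true value:", 4.5**2)
```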
Random Forest
A refinement of bagged trees.
- Problem: we want the trees to be independent, not similar to one another, but bootstrapping the data alone does not help that much: the samples are still drawn from the same dataset with all features.
- At each tree split, a random sample of m features is drawn, and only these m features are considered for splitting.
- Typically, m is \(\sqrt{p}\) or \(p/3\), where p is the total number of features.
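A minimal sketch, assuming scikit-learn's RandomForestClassifier; the synthetic dataset and parameters are illustrative. The max_features argument controls the number m of features considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=16, n_informative=6, random_state=0)

# max_features="sqrt" gives m = sqrt(p); here sqrt(16) = 4 features per split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print("CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```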
Boosting
Random forest and bagging: trees are trained in parallel.
Boosting: trees are trained sequentially.
- Start with the original training sample
- In each iteration:
  - Train a classifier and check which samples are hard to fit
  - Increase the weight of those misclassified samples in the training data
- Repeat this process
- Final classifier: a weighted combination of the trained classifiers.
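AdaBoost is a classic boosting algorithm that follows exactly this reweighting scheme. A minimal sketch, assuming scikit-learn; the synthetic dataset and parameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=16, n_informative=6, random_state=0)

# Shallow trees (depth-1 "stumps", the default base learner) are trained
# sequentially; misclassified samples are up-weighted at each round, and
# the final model is a weighted vote of all rounds
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
print("CV accuracy:", cross_val_score(boost, X, y, cv=5).mean())
```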