Divergence

ML Setup

\(P(x)\) -> generate -> Data -> learn \(Q(x)\), where \(Q\) should be as close to \(P\) as possible.

Entropy, cross entropy, and KL divergence

Entropy

\[H(p) = -\sum_{i}p_{i}\log(p_i)\]

Cross Entropy

  • p = true distribution

  • q = predicted distribution

\[H(p, q) = -\sum_i p_i \log(q_i)\]

Relative entropy or Kullback-Leibler divergence

  • Measures how much a distribution \(Q(x)\) differs from a "true" probability distribution \(P(x)\)

  • The KL divergence of \(Q\) from \(P\) is defined as follows:

\[KL(P\|Q) = \sum_x P(x)\log\left(\frac{P(x)}{Q(x)}\right)\]

Relationship between entropy, cross-entropy, and KL divergence:

\[\text{cross-entropy} = \text{entropy} + \text{KL divergence}\]

or

\[D_{KL}(p\|q) = H(p, q) - H(p)\]
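
As a quick numerical check, here is a minimal NumPy sketch (the distributions p and q below are made-up examples) that computes entropy, cross entropy, and KL divergence and verifies the identity above.

import numpy as np

# Two made-up discrete distributions over the same support
p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # predicted distribution

entropy = -np.sum(p * np.log(p))         # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl = np.sum(p * np.log(p / q))           # KL(p||q)

print(entropy, cross_entropy, kl)
print(np.isclose(cross_entropy, entropy + kl))  # True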

Minimizing cross entropy = maximizing log likelihood

Suppose the likelihood of the training set is

\[\prod_{i}(\text{probability of } i)^{\text{number of occurrences of } i} = \prod_{i}q_i^{Np_i}\]

where \(N\) is the number of conditionally independent samples in the training set.

So the log-likelihood divided by N is

\[\frac{1}{N}\log\prod_i q_i^{Np_i} = \sum_i p_i \log(q_i) = -H(p, q)\]

so maximizing the log-likelihood is equivalent to minimizing the cross entropy \(H(p, q)\).
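
A small numerical sketch (with made-up occurrence counts) of the same identity: the average log-likelihood per sample equals \(-H(p, q)\).

import numpy as np

counts = np.array([50, 30, 20])   # made-up occurrence counts for three outcomes
N = counts.sum()
p = counts / N                    # empirical ("true") distribution
q = np.array([0.4, 0.4, 0.2])     # some predicted distribution

avg_log_likelihood = np.sum(counts * np.log(q)) / N   # (1/N) * log of prod_i q_i^(N p_i)
neg_cross_entropy = np.sum(p * np.log(q))             # -H(p, q)

print(np.isclose(avg_log_likelihood, neg_cross_entropy))  # True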

Supervised learning

Unsupervised learning

Mutual information

\(H(X)\): initial uncertainty about \(X\)

\(H(X \mid Y)\): expected uncertainty about \(X\) once \(Y\) is observed
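
The mutual information is the reduction in uncertainty about \(X\) gained by observing \(Y\):

\[I(X; Y) = H(X) - H(X \mid Y)\]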

Linear Regression

\[y = w_0 + w_1x_1\]

Least Squares Regression

import matplotlib.pyplot as plt
import numpy as np

# XNOR-style toy data: y = 1 exactly when x1 == x2
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [1, 0, 0, 1]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x1, x2, y)
(Output: a 3D scatter plot of the four data points.)

using sklearn

from sklearn import linear_model

# Design matrix with one row per sample and one column per feature
X = np.column_stack((x1, x2))
lm = linear_model.LinearRegression()
model = lm.fit(X, y)           # ordinary least-squares fit
print(model.coef_)             # learned weights
print(model.intercept_)        # learned bias
print(model.score(X, y))       # R^2 on the training data
[0.00000000e+00 2.22044605e-16]
0.4999999999999999
0.0

using numpy: \(A = (X^TX)^{-1}X^TY\)

ones = [1 for i in range(len(x1))]   # column of ones for the intercept term
X = np.column_stack((ones, x1, x2))
X_T = X.transpose()
print(X)
print(X_T)
[[1 0 0]
 [1 0 1]
 [1 1 0]
 [1 1 1]]
[[1 1 1 1]
 [0 0 1 1]
 [0 1 0 1]]
dot = np.dot(X_T, X)
inverse = np.linalg.inv(dot)
print(dot)
print(inverse)
[[4 2 2]
 [2 2 1]
 [2 1 2]]
[[ 0.75 -0.5  -0.5 ]
 [-0.5   1.    0.  ]
 [-0.5   0.    1.  ]]
dot2 = np.dot(inverse, X_T).dot(y)
dot2
array([0.5, 0. , 0. ])

This matches the sklearn result above: an intercept of 0.5 and zero weights on \(x_1\) and \(x_2\).

Mean Square Error
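
For \(n\) samples with true values \(y_i\) and predictions \(\hat{y}_i\), the mean squared error is

\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]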

import numpy as np 

# Given values 
Y_true = [1,1,2,2,4]  # Y_true = Y (original values) 

# Calculated values 
Y_pred = [0.6,1.29,1.99,2.69,3.4]  # Y_pred = Y' 

# Mean Squared Error 
MSE = np.square(np.subtract(Y_true,Y_pred)).mean() 
MSE
0.21606

Hypothesis space

The hypothesis space is the set of functions that the learning algorithm is allowed to select as the solution. The size of the hypothesis space is called the capacity of the model.

For polynomial regression, the larger the degree \(d\), the higher the model capacity.

Higher model capacity implies a better fit to the training data; for instance, \(S_2\) below can fit the training set at least as well as \(S_1\) (see the sketch after the list).

  1. \(S_1 = \{y = w_0 + w_1x_1 | w_0, w_1 \in R\}\)

  2. \(S_2 = \{y=w_0 + w_1x_1 + w_2x_1^2 + w_3x_1^3 | w_0, w_1, w_2, w_3 \in R\}\)
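
A minimal sketch (with made-up 1-D data) comparing the two hypothesis spaces above: the degree-3 fit from \(S_2\) achieves a training error no larger than the degree-1 fit from \(S_1\).

import numpy as np

# Made-up 1-D training data
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

for d in (1, 3):                       # S_1: degree 1, S_2: degree 3
    coeffs = np.polyfit(x, y, deg=d)   # least-squares fit within the hypothesis space
    y_hat = np.polyval(coeffs, x)
    print(f"degree {d}: training MSE = {np.mean((y - y_hat) ** 2):.4f}")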

Generalization Error

Model selection:

  • Validation

    • Split the training data into two parts: one for training and one for validation. The split should be random (see the sketch after this list).
  • Regularization
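
A minimal validation-split sketch using sklearn's train_test_split (the data and the 25% validation fraction are arbitrary choices for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up regression data
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Random split: 75% for training, 25% for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:     ", model.score(X_train, y_train))
print("validation R^2:", model.score(X_val, y_val))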

Regularization

ridge regression

The larger the regularization constant \(\lambda\), the smaller the learned weights.
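
Ridge regression adds an \(L_2\) penalty \(\lambda\|w\|_2^2\) to the least-squares objective. A minimal sketch with sklearn (which calls the regularization constant alpha; the data here is made up) showing the weights shrinking as \(\lambda\) grows:

import numpy as np
from sklearn.linear_model import Ridge

# Made-up linear data
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(50)

# Larger alpha (i.e. larger lambda) -> smaller weight norm
for alpha in (0.01, 1.0, 100.0):
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>6}: w = {np.round(w, 3)}, ||w|| = {np.linalg.norm(w):.3f}")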