Divergence

ML Setup

\(P(x)\) -> generate -> Data -> learn \(Q(x)\), where \(Q\) should be as close to \(P\) as possible.

Entropy, cross entropy, and KL divergence

Entropy

\[H(p) = -\sum_{i}p_{i}\log(p_i)\]

Cross Entropy

  • p = true distribution

  • q = predicted distribution

\[H(p, q) = -\sum_i p_i \log(q_i)\]

Relative entropy or Kullback-Leibler divergence

  • Measures how much a distribution \(Q(x)\) differs from a "true" probability distribution \(P(x)\)

  • The KL divergence of \(Q\) from \(P\) is defined as follows:

\[KL(P\|Q) = \sum_x P(x)\log\left(\frac{P(x)}{Q(x)}\right)\]

Relationship between entropy, cross-entropy, and KL divergence:

\[\text{cross-entropy} = \text{entropy} + \text{KL divergence}\]

or

\[D_{KL}(p\|q) = H(p, q) - H(p)\]
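
As a quick numerical check, here is a minimal NumPy sketch (the distributions p and q below are made-up examples) that computes entropy, cross entropy, and KL divergence and verifies the identity above.

import numpy as np

# Two made-up discrete distributions over the same support
p = np.array([0.5, 0.3, 0.2])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])   # predicted distribution

entropy = -np.sum(p * np.log(p))         # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl = np.sum(p * np.log(p / q))           # KL(p||q)

print(entropy, cross_entropy, kl)
print(np.isclose(cross_entropy, entropy + kl))  # True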

Minimizing cross entropy = maximizing log likelihood

Suppose the likelihood of the training set is

\[\prod_{i}(\text{probability of } i)^{\text{number of occurrences of } i} = \prod_{i}q_i^{Np_i}\]

where \(N\) is the number of conditionally independent samples in the training set.

So the log-likelihood divided by N is

\[\frac{1}{N}\log\prod_i q_i^{Np_i} = \sum_i p_i \log(q_i) = -H(p, q)\]

so maximizing the log-likelihood is equivalent to minimizing the cross entropy \(H(p, q)\).
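
A small numerical sketch (with made-up occurrence counts) of the same identity: the average log-likelihood per sample equals \(-H(p, q)\).

import numpy as np

counts = np.array([50, 30, 20])   # made-up occurrence counts for three outcomes
N = counts.sum()
p = counts / N                    # empirical ("true") distribution
q = np.array([0.4, 0.4, 0.2])     # some predicted distribution

avg_log_likelihood = np.sum(counts * np.log(q)) / N   # (1/N) * log of prod_i q_i^(N p_i)
neg_cross_entropy = np.sum(p * np.log(q))             # -H(p, q)

print(np.isclose(avg_log_likelihood, neg_cross_entropy))  # True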

Supervised learning

Unsupervised learning

Mutual information

\(H(X)\): initial uncertainty about \(X\)

\(H(X \mid Y)\): expected uncertainty about \(X\) once \(Y\) is observed
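
The mutual information is the reduction in uncertainty about \(X\) gained by observing \(Y\):

\[I(X; Y) = H(X) - H(X \mid Y)\]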

Linear Regression

\[y = w_0 + w_1x_1\]

Least Squares Regression

import matplotlib.pyplot as plt
import numpy as np

# XNOR-style toy data: y = 1 exactly when x1 == x2
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [1, 0, 0, 1]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x1, x2, y)
(Output: a 3D scatter plot of the four data points.)

using sklearn

from sklearn import linear_model

# Design matrix with one row per sample and one column per feature
X = np.column_stack((x1, x2))
lm = linear_model.LinearRegression()
model = lm.fit(X, y)           # ordinary least-squares fit
print(model.coef_)             # learned weights
print(model.intercept_)        # learned bias
print(model.score(X, y))       # R^2 on the training data
[0.00000000e+00 2.22044605e-16]
0.4999999999999999
0.0

using numpy: \(A = (X^TX)^{-1}X^TY\)

ones = [1 for i in range(len(x1))]   # column of ones for the intercept term
X = np.column_stack((ones, x1, x2))
X_T = X.transpose()
print(X)
print(X_T)
[[1 0 0]
 [1 0 1]
 [1 1 0]
 [1 1 1]]
[[1 1 1 1]
 [0 0 1 1]
 [0 1 0 1]]
dot = np.dot(X_T, X)
inverse = np.linalg.inv(dot)
print(dot)
print(inverse)
[[4 2 2]
 [2 2 1]
 [2 1 2]]
[[ 0.75 -0.5  -0.5 ]
 [-0.5   1.    0.  ]
 [-0.5   0.    1.  ]]
dot2 = np.dot(inverse, X_T).dot(y)
dot2
array([0.5, 0. , 0. ])

This matches the sklearn result above: an intercept of 0.5 and zero weights on \(x_1\) and \(x_2\).

Mean Square Error
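
For \(n\) samples with true values \(y_i\) and predictions \(\hat{y}_i\), the mean squared error is

\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\]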

import numpy as np 

# Given values 
Y_true = [1,1,2,2,4]  # Y_true = Y (original values) 

# Calculated values 
Y_pred = [0.6,1.29,1.99,2.69,3.4]  # Y_pred = Y' 

# Mean Squared Error 
MSE = np.square(np.subtract(Y_true,Y_pred)).mean() 
MSE
0.21606

Hypothesis space

The hypothesis space is the set of functions that the learning algorithm is allowed to select as the solution. The size of the hypothesis space is called the capacity of the model.

For polynomial regression, the larger the degree \(d\), the higher the model capacity.

Higher model capacity implies a better fit to the training data; for instance, \(S_2\) below can fit the training set at least as well as \(S_1\) (see the sketch after the list).

  1. \(S_1 = \{y = w_0 + w_1x_1 | w_0, w_1 \in R\}\)

  2. \(S_2 = \{y=w_0 + w_1x_1 + w_2x_1^2 + w_3x_1^3 | w_0, w_1, w_2, w_3 \in R\}\)
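
A minimal sketch (with made-up 1-D data) comparing the two hypothesis spaces above: the degree-3 fit from \(S_2\) achieves a training error no larger than the degree-1 fit from \(S_1\).

import numpy as np

# Made-up 1-D training data
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

for d in (1, 3):                       # S_1: degree 1, S_2: degree 3
    coeffs = np.polyfit(x, y, deg=d)   # least-squares fit within the hypothesis space
    y_hat = np.polyval(coeffs, x)
    print(f"degree {d}: training MSE = {np.mean((y - y_hat) ** 2):.4f}")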

Generalization Error

Model selection:

  • Validation

    • Split the training data into two parts: one for training and one for validation. The split should be random (see the sketch after this list).
  • Regularization
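
A minimal validation-split sketch using sklearn's train_test_split (the data and the 25% validation fraction are arbitrary choices for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up regression data
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Random split: 75% for training, 25% for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:     ", model.score(X_train, y_train))
print("validation R^2:", model.score(X_val, y_val))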

Regularization

ridge regression

The larger the regularization constant \(\lambda\), the smaller the learned weights.
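
Ridge regression adds an \(L_2\) penalty \(\lambda\|w\|_2^2\) to the least-squares objective. A minimal sketch with sklearn (which calls the regularization constant alpha; the data here is made up) showing the weights shrinking as \(\lambda\) grows:

import numpy as np
from sklearn.linear_model import Ridge

# Made-up linear data
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(50)

# Larger alpha (i.e. larger lambda) -> smaller weight norm
for alpha in (0.01, 1.0, 100.0):
    w = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>6}: w = {np.round(w, 3)}, ||w|| = {np.linalg.norm(w):.3f}")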