Divergence¶
ML Setup
\(P(x)\) -> generate -> Data -> learn \(Q(x)\), where \(Q\) should be as close to \(P\) as possible.
Entropy, cross entropy, and KL divergence¶
Entropy¶
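For reference, the standard definition for a discrete distribution \(P\) is the average uncertainty (expected surprise):

$$H(P) = -\sum_x P(x) \log P(x)$$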
Cross Entropy¶
- p = true distribution
- q = predicted distribution
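With these, the standard definition of cross entropy is the expected surprise under q when the data actually come from p:

$$H(p, q) = -\sum_x p(x) \log q(x)$$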
Relative entropy or Kullback-Leibler divergence¶
- Measures how much a distribution Q(X) differs from a "true" probability distribution P(X)
- The KL divergence of Q from P is defined as follows:

$$D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Relationship between entropy, cross-entropy, and KL divergence:

$$H(P, Q) = H(P) + D_{KL}(P \,\|\, Q)$$

i.e. cross-entropy = entropy + KL divergence.
Minimizing cross entropy = maximizing log likelihood.

Suppose the likelihood of the training set is

$$\prod_{i=1}^{N} Q(x_i)$$

where \(N\) is the number of conditionally independent samples in the training set.

So the log-likelihood divided by \(N\) is

$$\frac{1}{N} \sum_{i=1}^{N} \log Q(x_i) \approx \mathbb{E}_{x \sim P}\left[\log Q(x)\right] = -H(P, Q),$$

so maximizing the average log-likelihood is the same as minimizing the cross entropy \(H(P, Q)\).
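A minimal numpy sketch (with two made-up discrete distributions p and q) that checks the identity \(H(P, Q) = H(P) + D_{KL}(P \| Q)\) numerically:

import numpy as np

# Two made-up discrete distributions over the same four outcomes
p = np.array([0.1, 0.2, 0.3, 0.4])      # "true" distribution P
q = np.array([0.25, 0.25, 0.25, 0.25])  # predicted distribution Q

entropy = -np.sum(p * np.log(p))        # H(P)
cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
kl = np.sum(p * np.log(p / q))          # D_KL(P || Q)

print(cross_entropy, entropy + kl)      # the two numbers should match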
Supervised learning¶
Unsupervised learning¶
Mutual information¶
H(X): Initial uncertainty about X
H(X | Y): Expected uncertainty about X if Y is observed
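Mutual information is then the expected reduction in uncertainty about X from observing Y:

$$I(X; Y) = H(X) - H(X \mid Y)$$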
Linear Regression¶
Least Squares Regression¶
import matplotlib.pyplot as plt
import numpy as np

# Four training points: y = 1 exactly when x1 == x2 (XNOR pattern)
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [1, 0, 0, 1]

# 3D scatter plot of the data
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x1, x2, y)
(3D scatter plot of the four training points)
Using sklearn:
from sklearn import linear_model

# Stack the two features into an (n_samples, 2) design matrix
X = np.column_stack((x1, x2))
lm = linear_model.LinearRegression()
model = lm.fit(X, y)

print(model.coef_)        # fitted weights
print(model.intercept_)   # fitted bias
print(model.score(X, y))  # R^2 on the training data
[0.00000000e+00 2.22044605e-16]
0.4999999999999999
0.0
Using numpy and the normal equation \(A = (X^TX)^{-1}X^TY\):
# Prepend a column of ones so the first weight acts as the intercept
ones = [1 for i in range(len(x1))]
X = np.column_stack((ones, x1, x2))
X_T = X.transpose()

print(X)
print(X_T)
[[1 0 0]
 [1 0 1]
 [1 1 0]
 [1 1 1]]
[[1 1 1 1]
 [0 0 1 1]
 [0 1 0 1]]
# Normal equation: A = (X^T X)^{-1} X^T y
dot = np.dot(X_T, X)
inverse = np.linalg.inv(dot)

print(dot)
print(inverse)
[[4 2 2]
 [2 2 1]
 [2 1 2]]
[[ 0.75 -0.5  -0.5 ]
 [-0.5   1.    0.  ]
 [-0.5   0.    1.  ]]
# Solve for the weights: [intercept, w1, w2]
dot2 = np.dot(inverse, X_T).dot(y)
dot2
array([0.5, 0. , 0. ])

This matches the sklearn fit above: intercept 0.5 and (numerically) zero weights on x1 and x2.
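As a sanity check, numpy's built-in least-squares solver can be used instead of forming the inverse explicitly; a short sketch reusing the X and y above:

# Least-squares solution without explicitly inverting X^T X
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(w)  # expected to agree with the normal-equation result above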
Mean Squared Error¶
import numpy as np
# Given values
Y_true = [1,1,2,2,4] # Y_true = Y (original values)
# Calculated values
Y_pred = [0.6,1.29,1.99,2.69,3.4] # Y_pred = Y'
# Mean Squared Error
MSE = np.square(np.subtract(Y_true,Y_pred)).mean()
MSE
0.21606
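The same value can be cross-checked with sklearn's built-in metric; a short sketch reusing Y_true and Y_pred from above:

from sklearn.metrics import mean_squared_error

# Should print the same value as the manual computation above
print(mean_squared_error(Y_true, Y_pred))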
Hypothesis space¶
The hypothesis space is the set of functions that the model is allowed to select as the solution. The size of the hypothesis space is called the capacity of the model.

For polynomial regression, the larger the degree \(d\), the higher the model capacity. Higher model capacity implies a better fit to the training data (see the sketch below). Two example hypothesis spaces:
- \(S_1 = \{y = w_0 + w_1x_1 | w_0, w_1 \in R\}\)
- \(S_2 = \{y = w_0 + w_1x_1 + w_2x_1^2 + w_3x_1^3 | w_0, w_1, w_2, w_3 \in R\}\)
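A minimal sketch (on a made-up noisy sine dataset) comparing a degree-1 and a degree-3 polynomial fit; the higher-capacity model achieves a lower training error:

import numpy as np

# Made-up 1-D dataset: noisy sine curve
rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 20)
ys = np.sin(2 * np.pi * xs) + 0.1 * rng.standard_normal(20)

for d in (1, 3):
    coeffs = np.polyfit(xs, ys, deg=d)      # fit a degree-d polynomial (hypothesis space S_d)
    ys_hat = np.polyval(coeffs, xs)         # predictions on the training points
    print(d, np.mean((ys - ys_hat) ** 2))   # training MSE shrinks as d grows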
Generalization Error¶
Model selection:

- Validation
  - Split the training data into two parts: one for training and one for validation. The split has to be random (see the sketch below).
- Regularization
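A minimal sketch of a random train/validation split using sklearn's train_test_split, on a made-up dataset (the 80/20 ratio is an arbitrary choice):

import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Made-up dataset: 100 samples, 2 features, linear target plus noise
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Random 80/20 split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = linear_model.LinearRegression().fit(X_train, y_train)
print(model.score(X_val, y_val))  # validation score guides model selection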
Regularization¶
Ridge regression¶
Ridge regression adds an \(L_2\) penalty on the weights to the least-squares objective:

$$\min_w \|y - Xw\|^2 + \lambda \|w\|^2$$

The larger the regularization constant \(\lambda\), the smaller the weights.
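A minimal sketch using sklearn's Ridge on a made-up dataset (sklearn names the regularization constant alpha), showing the weights shrinking as the constant grows:

import numpy as np
from sklearn.linear_model import Ridge

# Made-up dataset: 50 samples, 3 features, linear target plus noise
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(50)

for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)  # weights shrink toward zero as alpha grows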