
Supervised learning

  • Supervised learning

  • Unsupervised learning

  • Reinforcement learning

What is machine learning

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Supervised learning

  • A part of machine learning

  • Given a training set of pairs (x, y)

  • Find a good approximation to \(f: x \to y\)

  • Examples:

    • Spam detection (Classification)

    • Digit recognition (Classification)

    • House price prediction (Regression)

Terminology

  • Given a data point (x, y), x is called the feature vector and y is called the label

  • The dataset given for learning is the training data

  • The dataset used for evaluation is called the testing data (see the short sketch after this list)
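
A minimal sketch of this terminology in code; the feature values, labels, and the 80/20 split below are made up for illustration.

import numpy as np

# toy dataset: each row of X is a feature vector, each entry of y is its label
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1, 1])

# hold out the last 20% of the points as testing data, train on the rest
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]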

Machine learning in 3 steps

  1. Collect data, extract features

  2. Determine a model

  3. Train the model with the data

Loss

Loss on training set

We measure the error using a loss function \(L(y, \hat{y})\)

For regression, the squared error is often used: \(L(y_i, f(x_i)) = (y_i - f(x_i))^2\), as in the short example below.
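
A quick numeric illustration of the squared error; the labels and predictions are made up.

import numpy as np

y_true = np.array([3.0, 5.0, 2.0])   # labels y_i
y_pred = np.array([2.5, 5.5, 1.0])   # model outputs f(x_i)

# squared error per point and its mean over the training set
losses = (y_true - y_pred) ** 2
print(losses)         # [0.25 0.25 1.  ]
print(losses.mean())  # 0.5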

Loss on testing set

Empirical loss is the loss measured on the training set.

We assume both the training set and the testing set are drawn i.i.d. from the same distribution D, so minimizing the loss on the training set should also keep the loss on the testing set small.

Minimizing loss functions

  • The minimizers of some loss functions have analytical solutions: exact solutions that you can derive explicitly by analyzing the formula.

  • However, most popular supervised learning models use loss functions with no analytical solution.

  • We use gradient descent to approximate the minimum of the function.

  • Gradient: a vector that points in the direction in which the function value increases the fastest.

Method

  1. For function G, randomly guess an initial value \(x_0\)

  2. Repeat \(x_{i+1} = x_i - r \nabla G(x_i)\), where \(\nabla\) denotes the gradient and r denotes the learning rate

  3. Until convergence

from sympy import symbols, diff

r = 0.1  # learning rate
x_0 = (1, 1, 1)  # initial guess
x, y, z = symbols('x y z', real=True)
f = (y + 2 * x)**2 + y + 2*x
g = (diff(f, x), diff(f, y), diff(f, z))
g
(8*x + 4*y + 2, 4*x + 2*y + 1, 0)
import numpy as np
# the gradient evaluated at x_0 = (1, 1, 1) is (14, 7, 0); take one descent step
result = np.array([1, 1, 1]) - r * np.array([14, 7, 0])
result
array([-0.4,  0.3,  1. ])
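
Putting the update rule into a loop gives a minimal gradient descent sketch on the same function f; the iteration cap and the stopping tolerance 1e-6 are arbitrary choices for illustration.

import numpy as np

def grad(p):
    # gradient of f(x, y, z) = (y + 2*x)**2 + y + 2*x, derived symbolically above
    x, y, z = p
    return np.array([8 * x + 4 * y + 2, 4 * x + 2 * y + 1, 0.0])

r = 0.1                         # learning rate
p = np.array([1.0, 1.0, 1.0])   # initial guess
for _ in range(1000):
    step = r * grad(p)
    p = p - step
    if np.linalg.norm(step) < 1e-6:   # stop once the updates become tiny
        break

print(p)   # converges to a point on the line y + 2x = -1/2, here [-0.4  0.3  1. ]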

Linear Classification

  • Use a line to separate data points

  • Use \(x = (x_1, x_2)\), \(w = (w_1, w_2)\), i.e., x and w are vectors in 2D space (see the sketch after this list)
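
One common way to read "use a line to separate data points" is to classify each point by which side of the line \(w^T x = 0\) it falls on; the w and the points below are made up for illustration.

import numpy as np

w = np.array([1.0, -1.0])              # normal vector of the separating line w^T x = 0
points = np.array([[2.0, 0.5], [0.5, 2.0]])
print(np.sign(points @ w))             # [ 1. -1.] -> one point on each side of the line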


Linear regression doesn't work well for classification problems

  • Label y as either 1 or -1
  • Find \(f_w(x) = w^T x\) that minimizes the loss function \(L(f_w) = \frac{1}{n}\sum_{i=1}^n (w^T x_i - y_i)^2\)

  • Find a line that minimizes the distance between the red and blue points

If there is an outlier in the data, the separating line will misclassify some points.

  • If \(w^T x_i\) gets very large for a point with label 1, the classification is correct, but the squared loss \((w^T x_i - 1)^2\) is still large (see the short sketch below).
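
A tiny numeric illustration of this problem; the values are made up. The point is classified correctly, yet the squared loss is huge.

y_i = 1
score = 50.0                # w^T x_i for an extreme but correctly classified point
loss = (score - y_i) ** 2   # squared loss still treats this as a large error
print(loss)                 # 2401.0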

Solution:

We use the sigmoid function to squash the value into the range (0, 1): \(\sigma(a) = \frac{1}{1+\exp(-a)}\)

  1. Similar to a step function

  2. Continuous and easy to compute

Some properties of the sigmoid function

  1. \(\sigma(a) = \frac{1}{1+\exp(-a)} \in (0, 1)\)

  2. Symmetric: \(\sigma(-a) = 1 - \sigma(a)\)

  3. Easy to compute gradients: \(\sigma'(a) = \sigma(a)(1 - \sigma(a))\) (illustrated in the short sketch after this list)
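
A minimal sketch of the sigmoid and the properties above; the input value 2.0 is arbitrary.

import numpy as np

def sigmoid(a):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

a = 2.0
print(sigmoid(a))                     # ~0.881, inside (0, 1)
print(sigmoid(-a), 1 - sigmoid(a))    # symmetry: sigma(-a) = 1 - sigma(a)
print(sigmoid(a) * (1 - sigmoid(a)))  # derivative sigma'(a) = sigma(a) * (1 - sigma(a))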

Logistic Regression

  • Better approach (cross-entropy loss): find w that minimizes the loss function, which for labels in {-1, +1} can be written \(L(w) = \frac{1}{n}\sum_{i=1}^n \log(1+\exp(-y_i w^T x_i))\)

  • If a misclassification happens on the i-th data point with label 1, \(-\log(\sigma(w^T x_i))\) is very large

  • No analytical solution, so gradient descent is needed (a training sketch follows this list)
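
A minimal sketch of training logistic regression with gradient descent, assuming ±1 labels and the loss \(\frac{1}{n}\sum_i \log(1+\exp(-y_i w^T x_i))\); the toy data, learning rate, and iteration count are made up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy training data: two features per point, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
r = 0.1
for _ in range(1000):
    # gradient of (1/n) * sum_i log(1 + exp(-y_i w^T x_i))
    margins = y * (X @ w)
    grad = -(X.T @ (y * sigmoid(-margins))) / len(y)
    w = w - r * grad

print(np.sign(X @ w))   # [ 1.  1. -1. -1.] -> all four points classified correctly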

SVM

An SVM performs classification by finding the hyperplane that maximizes the margin between the two classes.
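
A minimal sketch using scikit-learn's linear SVM, assuming scikit-learn is installed; the toy data and test point are made up.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear')        # maximum-margin linear classifier
clf.fit(X, y)
print(clf.predict([[1.5, 2.5]]))  # [1]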

K-Nearest neighbor methods

  • Learning algorithm: just store the training examples

  • Prediction algorithm:

    • Regression: take the average value of the k nearest neighbors
    • Classification: assign to the most frequent class of the k nearest neighbors
  • Easy to train but has high storage requirements and a high computation cost at prediction (a short sketch follows this list)
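
A minimal NumPy sketch of k-nearest-neighbor classification; the toy data and the choice k=3 are made up for illustration.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # "training" is just storing X_train and y_train; all the work happens here
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # most frequent class among them

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.0, 0.8])))  # 1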

                Linear                                         kNN
Advantages      Easy to fit                                    No strong assumption of a linear relationship
Disadvantages   Hard to classify non-linearly-separable data   Takes a lot of computation power

Decision Tree

Entropy is used to measure how informative a probability distribution is: \(H(p) = -\sum_i p_i \log p_i\). The higher the entropy, the more uncertainty.
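
A small sketch computing entropy (in bits, i.e., with log base 2) for two made-up distributions.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0   -> maximal uncertainty for two outcomes
print(entropy([0.9, 0.1]))   # ~0.47 -> much more predictable, lower entropy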


Wrap up

  1. Collect data, extract features

  2. Determine a model

    • Select a good model for your data