
Supervised learning

  • Supervised learning

  • Unsupervised learning

  • Reinforcement learning

What is machine learning

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Supervised learning

  • A part of machine learning

  • Given a training set of pairs (x, y)

  • Find a good approximation to \(f: x \to y\)

  • Examples:

    • Spam detection (Classification)

    • Digit recognition (Classification)

    • House price prediction (Regression)

Terminology

  • Given a data point (x, y), x is called the feature vector and y is called the label

  • The dataset given for learning is the training data

  • The dataset used for evaluation is called the testing data (see the short sketch after this list)
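
A minimal sketch of this terminology in code; the feature values, labels, and the 80/20 split below are made up for illustration.

import numpy as np

# toy dataset: each row of X is a feature vector, each entry of y is its label
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1, 1])

# hold out the last 20% of the points as testing data, train on the rest
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]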

Machine learning in 3 steps

  1. Collect data, extract features

  2. Determine a model

  3. Train the model with the data

Loss

Loss on training set

We measure the error using a loss function \(L(y, \hat{y})\)

For regression, the squared error is often used: \(L(y_i, f(x_i)) = (y_i - f(x_i))^2\), as in the short example below.
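
A quick numeric illustration of the squared error; the labels and predictions are made up.

import numpy as np

y_true = np.array([3.0, 5.0, 2.0])   # labels y_i
y_pred = np.array([2.5, 5.5, 1.0])   # model outputs f(x_i)

# squared error per point and its mean over the training set
losses = (y_true - y_pred) ** 2
print(losses)         # [0.25 0.25 1.  ]
print(losses.mean())  # 0.5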

Loss on testing set

Empirical loss is the loss measured on the training set.

We assume both the training set and the testing set are drawn i.i.d. from the same distribution D, so minimizing the loss on the training set should also keep the loss on the testing set small.

Minimizing loss functions

  • The minimizers of some loss functions have analytical solutions: exact solutions that you can derive explicitly by analyzing the formula.

  • However, most popular supervised learning models use loss functions with no analytical solution.

  • We use gradient descent to approximate the minimum of the function.

  • Gradient: a vector that points in the direction in which the function value increases the fastest.

Method

  1. For function G, randomly guess an initial value \(x_0\)

  2. Repeat \(x_{i+1} = x_i - r \nabla G(x_i)\), where \(\nabla\) denotes the gradient and r denotes the learning rate

  3. Until convergence

from sympy import symbols, diff

r = 0.1  # learning rate
x_0 = (1, 1, 1)  # initial guess
x, y, z = symbols('x y z', real=True)
f = (y + 2 * x)**2 + y + 2*x
g = (diff(f, x), diff(f, y), diff(f, z))
g
(8*x + 4*y + 2, 4*x + 2*y + 1, 0)
import numpy as np
# the gradient evaluated at x_0 = (1, 1, 1) is (14, 7, 0); take one descent step
result = np.array([1, 1, 1]) - r * np.array([14, 7, 0])
result
array([-0.4,  0.3,  1. ])
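
Putting the update rule into a loop gives a minimal gradient descent sketch on the same function f; the iteration cap and the stopping tolerance 1e-6 are arbitrary choices for illustration.

import numpy as np

def grad(p):
    # gradient of f(x, y, z) = (y + 2*x)**2 + y + 2*x, derived symbolically above
    x, y, z = p
    return np.array([8 * x + 4 * y + 2, 4 * x + 2 * y + 1, 0.0])

r = 0.1                         # learning rate
p = np.array([1.0, 1.0, 1.0])   # initial guess
for _ in range(1000):
    step = r * grad(p)
    p = p - step
    if np.linalg.norm(step) < 1e-6:   # stop once the updates become tiny
        break

print(p)   # converges to a point on the line y + 2x = -1/2, here [-0.4  0.3  1. ]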

Linear Classification

  • Use a line to separate data points

  • Use \(x = (x_1, x_2)\), \(w = (w_1, w_2)\), i.e., x and w are vectors in 2D space (see the sketch after this list)
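
One common way to read "use a line to separate data points" is to classify each point by which side of the line \(w^T x = 0\) it falls on; the w and the points below are made up for illustration.

import numpy as np

w = np.array([1.0, -1.0])              # normal vector of the separating line w^T x = 0
points = np.array([[2.0, 0.5], [0.5, 2.0]])
print(np.sign(points @ w))             # [ 1. -1.] -> one point on each side of the line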


Linear regression doesn't work well for classification problems

  • Label y as either 1 or -1
  • Find \(f_w(x) = w^T x\) that minimizes the loss function \(L(f_w) = \frac{1}{n}\sum_{i=1}^n (w^T x_i - y_i)^2\)

  • Find a line that minimizes the distance between the red and blue points

If there is an outlier in the data, the separating line will misclassify some points.

  • If \(w^T x_i\) gets very large for a point with label 1, the classification is correct, but the squared loss \((w^T x_i - 1)^2\) is still large (see the short sketch below).
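
A tiny numeric illustration of this problem; the values are made up. The point is classified correctly, yet the squared loss is huge.

y_i = 1
score = 50.0                # w^T x_i for an extreme but correctly classified point
loss = (score - y_i) ** 2   # squared loss still treats this as a large error
print(loss)                 # 2401.0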

Solution:

We use the sigmoid function to squash the value into the range (0, 1): \(\sigma(a) = \frac{1}{1+\exp(-a)}\)

  1. Similar to a step function

  2. Continuous and easy to compute

Some properties of the sigmoid function

  1. \(\sigma(a) = \frac{1}{1+\exp(-a)} \in (0, 1)\)

  2. Symmetric: \(\sigma(-a) = 1 - \sigma(a)\)

  3. Easy to compute gradients: \(\sigma'(a) = \sigma(a)(1 - \sigma(a))\) (illustrated in the short sketch after this list)
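
A minimal sketch of the sigmoid and the properties above; the input value 2.0 is arbitrary.

import numpy as np

def sigmoid(a):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

a = 2.0
print(sigmoid(a))                     # ~0.881, inside (0, 1)
print(sigmoid(-a), 1 - sigmoid(a))    # symmetry: sigma(-a) = 1 - sigma(a)
print(sigmoid(a) * (1 - sigmoid(a)))  # derivative sigma'(a) = sigma(a) * (1 - sigma(a))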

Logistic Regression

  • Better approach (cross-entropy loss): find w that minimizes the loss function, which for labels in {-1, +1} can be written \(L(w) = \frac{1}{n}\sum_{i=1}^n \log(1+\exp(-y_i w^T x_i))\)

  • If a misclassification happens on the i-th data point with label 1, \(-\log(\sigma(w^T x_i))\) is very large

  • No analytical solution, so gradient descent is needed (a training sketch follows this list)
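
A minimal sketch of training logistic regression with gradient descent, assuming ±1 labels and the loss \(\frac{1}{n}\sum_i \log(1+\exp(-y_i w^T x_i))\); the toy data, learning rate, and iteration count are made up.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# toy training data: two features per point, labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
r = 0.1
for _ in range(1000):
    # gradient of (1/n) * sum_i log(1 + exp(-y_i w^T x_i))
    margins = y * (X @ w)
    grad = -(X.T @ (y * sigmoid(-margins))) / len(y)
    w = w - r * grad

print(np.sign(X @ w))   # [ 1.  1. -1. -1.] -> all four points classified correctly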

SVM

An SVM performs classification by finding the hyperplane that maximizes the margin between the two classes.
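
A minimal sketch using scikit-learn's linear SVM, assuming scikit-learn is installed; the toy data and test point are made up.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -2.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel='linear')        # maximum-margin linear classifier
clf.fit(X, y)
print(clf.predict([[1.5, 2.5]]))  # [1]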

K-Nearest neighbor methods

  • Learning algorithm: just store the training examples

  • Prediction algorithm:

    • Regression: take the average value of the k nearest neighbors
    • Classification: assign to the most frequent class of the k nearest neighbors
  • Easy to train but has high storage requirements and a high computation cost at prediction (a short sketch follows this list)
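
A minimal NumPy sketch of k-nearest-neighbor classification; the toy data and the choice k=3 are made up for illustration.

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # "training" is just storing X_train and y_train; all the work happens here
    dists = np.linalg.norm(X_train - x, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # most frequent class among them

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.0, 0.8])))  # 1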

                Linear                                         kNN
Advantages      Easy to fit                                    No strong assumption of a linear relationship
Disadvantages   Hard to classify non-linearly-separable data   Takes a lot of computation power

Decision Tree

Entropy is used to measure how informative a probability distribution is: \(H(p) = -\sum_i p_i \log p_i\). The higher the entropy, the more uncertainty.
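
A small sketch computing entropy (in bits, i.e., with log base 2) for two made-up distributions.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                  # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))   # 1.0   -> maximal uncertainty for two outcomes
print(entropy([0.9, 0.1]))   # ~0.47 -> much more predictable, lower entropy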


Wrap up

  1. Collect data, extract features

  2. Determine a model

    • Select a good model for your data