Unsupervised learning¶
Another important class of machine learning methods: analyze the structure of the data using only the features X.
- Supervised: use features X to predict labels Y
- Unsupervised: only requires features X; labels are not involved
Examples:
- Clustering: divide a dataset into meaningful groups. Data points in the same group are more similar to each other than to those in different groups.
- Dimensionality Reduction: I have a dataset with extremely high-dimensional features (e.g., images); can I represent them in a lower dimension?
- Ranking: I have a dataset represented as a graph, where each data point is a node and their relationships are edges (e.g., the World Wide Web); can I rank the importance of the data points?
Clustering¶
- Clustering: the process of grouping a set of objects into classes of similar objects
- Objects within the same cluster should be more similar.
- Objects across different clusters should be less similar.
K-means clustering method¶
Given k, the k-means algorithm is implemented in four steps (a minimal sketch follows the list):
1. Partition objects into k nonempty subsets
2. Compute the mean point of every cluster at the current partitioning
3. Reassign each object to the cluster with the nearest mean point
4. Go back to Step 2; stop when the assignment does not change
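A minimal NumPy sketch of these four steps; the initial partition here is a random assignment, and the code assumes no cluster becomes empty during the iterations:
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: partition objects into k nonempty subsets (random assignment here)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the mean point of every cluster at the current partitioning
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each object to the cluster with the nearest mean point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment does not change, otherwise repeat
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers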
k-means in Python¶
Original data
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # for plot styling
import numpy as np
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
After grouping
from sklearn.cluster import KMeans
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
    plt.title(f"k-means cluster: {i}")
    plt.show()
Example 1¶
Suppose you have 5 points in 1-D: {1, 2, 4, 7, 10}. Use k-means to cluster these points with k=2. Start with the initial partition {1} and {2, 4, 7, 10}. The distance is the absolute difference of coordinates on the axis. A worked sketch of the iterations follows.
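A short NumPy sketch that works through this example, starting from the given partition; the iterations converge to the clusters {1, 2, 4} and {7, 10}:
import numpy as np

X = np.array([1, 2, 4, 7, 10], dtype=float)
labels = np.array([0, 1, 1, 1, 1])   # initial partition: {1} and {2, 4, 7, 10}
while True:
    # mean point of each cluster under the current partition
    means = np.array([X[labels == j].mean() for j in range(2)])
    # reassign each point to the cluster with the nearest mean (1-D distance)
    new_labels = np.abs(X[:, None] - means[None, :]).argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
print(labels, means)   # final clusters {1, 2, 4} and {7, 10}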
Drawbacks¶
Both a K that is too large and a K that is too small can lead to bad results (left: K=4, right: K=2); the resulting clusters do not describe the data well.
Hierarchical clustering¶
- A method of cluster analysis which seeks to build a hierarchy of clusters
- No need to specify the number of clusters; we can generate partitions at different levels of the hierarchy.
- A dendrogram is a tree diagram that can be used to represent the hierarchical clustering structure between data points.
- The height of connections in the dendrogram represents the distance between partitions: the higher the connection, the larger the distance between the two connected clusters.
- Cut the dendrogram: a clustering is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Building the dendrogram¶
- Two types: bottom-up (agglomerative) and top-down (divisive)
- Bottom-up: two groups are merged if the distance between them is less than a threshold (see the sketch below)
- Top-down: one group is split into two if the inter-group distance is more than a threshold
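A minimal bottom-up (agglomerative) sketch using SciPy's hierarchical clustering utilities; the blobs data and the cut height of 5 are arbitrary choices for illustration:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
# Bottom-up: repeatedly merge the two closest groups (Ward linkage here)
Z = linkage(X, method='ward')
# Draw the dendrogram; connection heights show the distance between merged clusters
dendrogram(Z)
plt.show()
# Cut the dendrogram at height 5: each connected component becomes a cluster
labels = fcluster(Z, t=5, criterion='distance')
print(labels)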
Dimensionality Reduction¶
- Given data points in d dimensions, convert them to data points in r < d dimensions, with minimal loss of information.
- Used for statistical analysis, data compression, and data visualization
Idea of Principal Component Analysis¶
- Reduce from n dimensions to k dimensions: find vectors \(u^1,u^2,...,u^k\) onto which to project the data, so as to minimize the projection error.
- These vectors should carry the primary information of the data; we call them principal components.
Identity matrix¶
In linear algebra, the identity matrix (sometimes ambiguously called a unit matrix) of size n is the n × n square matrix with ones on the main diagonal and zeros elsewhere. It is denoted by In, or simply by I if the size is immaterial or can be trivially determined by the context. In some fields, such as quantum mechanics, the identity matrix is denoted by a boldface one, 1; otherwise it is identical to I. Less frequently, some mathematics books use U or E to represent the identity matrix, meaning "unit matrix" and the German word Einheitsmatrix respectively.
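A quick NumPy illustration (np.eye builds the identity matrix; multiplying by it leaves any matrix unchanged):
import numpy as np

I = np.eye(3)                     # 3 x 3 identity matrix
A = np.arange(9).reshape(3, 3)
print(I)
print(np.allclose(I @ A, A))      # I A = A, so this prints True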
Eigenvalues and eigenvectors¶
https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors
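A small NumPy check of the defining property \(Av = \lambda v\) for a symmetric 2 x 2 matrix:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# Each column of `eigenvectors` is an eigenvector of A
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ v, lam * v))   # A v = lambda v holds for each pair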
Intuitive Idea of PCA¶
https://dilloncamp.com/projects/pca.html
- What we DON'T want for the projection: the original data has large variance, but the projected data has small variance. It means the original data is spread out but the projected data is not, i.e., a lot of information is lost during projection.
- PCA's goal: maximize the variance of the projected data.
- In fact, the mathematical definition of the principal components (PCs) is the eigenvectors of the covariance matrix of the data points.
- The order of the PCs follows the magnitude of the eigenvalues, e.g., the most significant PC is the eigenvector corresponding to the largest eigenvalue (see the sketch below).
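A minimal NumPy sketch of this recipe: compute the covariance matrix, take its eigenvectors, and project the centered data onto the top k of them (the synthetic data and k = 2 are just for illustration; scikit-learn's PCA wraps the same idea):
import numpy as np

rng = np.random.default_rng(0)
# 200 points in 3 dimensions, correlated so one direction dominates the variance
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the data points
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort PCs by eigenvalue, largest first, and project onto the top k = 2
order = np.argsort(eigenvalues)[::-1]
k = 2
W = eigenvectors[:, order[:k]]
X_reduced = Xc @ W
print(X_reduced.shape)                  # (200, 2)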
Page ranking¶
A method for rating the importance of web pages using the link structure of the web
Simple Recursive Formulation¶
-
Each link’s vote is proportional to the importance of its source page
-
If page P with importance x has n out-links, each link gets x/n votes
-
Page P’s own importance is the sum of the votes on its in-links
-
Final PageRank score: Importance=sum of votes from all in-links
PageRank in Python¶
import numpy as np
import matplotlib.pyplot as plt
# set plot size
plt.rcParams['figure.figsize'] = [20, 5]
def page_rank(matrix: np.ndarray, n_iter: int = 3):
    n = matrix.shape[1]
    # start from the uniform distribution over all pages
    r = np.full((n, 1), 1 / n)
    total_results = None
    for i in range(n_iter):
        # one step of vote propagation: r <- M r
        r = matrix.dot(r)
        if i == 0:
            total_results = r
        else:
            # keep the score vector of every iteration for plotting
            total_results = np.concatenate((total_results, r), axis=1)
    return r, total_results
m = np.array(
[[0, 0.5, 0, 0],
[1/3, 0, 0, 1/2],
[1/3, 0, 0, 1/2],
[1/3, 1/2, 1, 0]
]
)
r, total_results = page_rank(m, 20)
for i, p in enumerate(total_results):
    plt.plot(p, label=f"{i} - line")
plt.legend()
Random walk interpretation¶
An equivalent view of PageRank.
Imagine a random web surfer:
- At any time t, the surfer is on some page P
- At time t+1, the surfer follows an out-link from P uniformly at random
- The surfer ends up on some page Q linked from P, and the process repeats indefinitely

Let p(t) be a vector whose \(i^{th}\) component is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.
Stationary Distribution¶
Where is the surfer at time t+1?
- Follows a link uniformly at random
- p(t+1) = Mp(t), where M is the transition probability matrix
Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
- Then p(t) is called a stationary distribution of the random walk.
- The PageRank score r is the stationary distribution, and can be solved from r = Mr.
- Normalize by scaling the sum of r to 1 (as in the sketch below).
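A minimal sketch solving r = Mr directly as the eigenvector of M with eigenvalue 1 (a column-stochastic matrix always has one), using the same 4-page transition matrix as above:
import numpy as np

M = np.array([[0, 0.5, 0, 0],
              [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2],
              [1/3, 1/2, 1, 0]])
eigenvalues, eigenvectors = np.linalg.eig(M)
# pick the eigenvector whose eigenvalue is (numerically closest to) 1
idx = np.argmin(np.abs(eigenvalues - 1))
r = np.real(eigenvectors[:, idx])
r = r / r.sum()          # normalize so the entries sum to 1
print(r)
This should agree with what the power iteration in page_rank converges to after enough iterations.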
Stationary Distribution = PageRank Score¶
The stationary distribution represents the PageRank score.
- PageRank score: a node's importance equals the sum of the votes from adjacent nodes.
- Stationary distribution: the probability of being at one node equals the sum of the probabilities coming from other nodes.
- Both of them describe the same stable state.
Spider Traps¶
- A group of pages is a spider trap if there are no links from pages within the group to pages outside the group
- The random surfer gets trapped and keeps walking inside the trap forever.
- Spider traps violate the conditions needed for the random walk theorem, as demonstrated in the sketch below; the damping factor in the implementation that follows is the standard fix.
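A small demonstration of the trap, using the same 3-page matrix as the damped example below, where page 2 only links to itself: plain power iteration drains essentially all the probability into the trapped page.
import numpy as np

# Column-stochastic transition matrix; page 2 links only to itself (a spider trap)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
r = np.full((3, 1), 1 / 3)      # start from the uniform distribution
for _ in range(100):
    r = M @ r                   # one random-walk step, no damping
print(r)                        # almost all mass ends up on the trapped page 2
With damping, the surfer teleports to a random page with probability 1 - d, so the trapped page no longer absorbs all of the score.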
import numpy as np
def pagerank(M, num_iterations: int = 100, d: float = 0.85):
    """PageRank: The trillion dollar algorithm.

    Parameters
    ----------
    M : numpy array
        adjacency matrix where M_i,j represents the link from 'j' to 'i', such that for all 'j'
        sum(i, M_i,j) = 1
    num_iterations : int, optional
        number of iterations, by default 100
    d : float, optional
        damping factor, by default 0.85

    Returns
    -------
    numpy array
        a vector of ranks such that v_i is the i-th rank from [0, 1],
        v sums to 1
    """
    N = M.shape[1]
    # start from a random distribution normalized to sum to 1
    v = np.random.rand(N, 1)
    v = v / np.linalg.norm(v, 1)
    # damped transition matrix: follow a link with probability d,
    # teleport to a random page with probability 1 - d
    M_hat = (d * M + (1 - d) / N)
    for i in range(num_iterations):
        v = M_hat @ v
    return v
M = np.array([[0.5, 0.5, 0],
[0.5, 0, 0],
[0, 0.5, 1],
])
v = pagerank(M, 100, 0.8)
print(v)
[[0.21212121]
[0.15151515]
[0.63636364]]