Unsupervised learning¶
Another important class of machine learning methods: analyze the structure of the data using only the features X.
- Supervised: use features X to predict labels Y
- Unsupervised: only requires features X; labels are not involved
Examples:
- Clustering: divide a dataset into meaningful groups. Data points in the same group are more similar to each other than to those in different groups.
- Dimensionality Reduction: I have a dataset with extremely high-dimensional features (e.g., images); can I represent them in a lower dimension?
- Ranking: I have a dataset represented as a graph, where each data point is a node and their relationships are edges (e.g., the World Wide Web); can I rank the importance of the data points?
Clustering¶
- Clustering: the process of grouping a set of objects into classes of similar objects
- Objects within the same cluster should be more similar.
- Objects across different clusters should be less similar.
K-means clustering method¶
Given k, the k-means algorithm is implemented in four steps (a minimal sketch follows the list):
1. Partition objects into k nonempty subsets
2. Compute the mean point of every cluster at the current partitioning
3. Reassign each object to the cluster with the nearest mean point
4. Go back to Step 2; stop when the assignment does not change
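A minimal NumPy sketch of these four steps; the initial partition here is a random assignment, and the code assumes no cluster becomes empty during the iterations:
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: partition objects into k nonempty subsets (random assignment here)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(max_iter):
        # Step 2: compute the mean point of every cluster at the current partitioning
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 3: reassign each object to the cluster with the nearest mean point
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment does not change, otherwise repeat
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers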
k-means in Python¶
Original data
import matplotlib.pyplot as plt
import seaborn as sns; sns.set() # for plot styling
import numpy as np
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
After grouping
from sklearn.cluster import KMeans
for i in range(1, 10):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
    plt.title(f"k-means cluster: {i}")
    plt.show()
Example 1¶
Suppose you have 5 points in 1-D: {1, 2, 4, 7, 10}. Use k-means to cluster these points with k=2. Start with the initial partition {1} and {2, 4, 7, 10}. The distance is the absolute difference of coordinates on the axis. A worked sketch of the iterations follows.
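A short NumPy sketch that works through this example, starting from the given partition; the iterations converge to the clusters {1, 2, 4} and {7, 10}:
import numpy as np

X = np.array([1, 2, 4, 7, 10], dtype=float)
labels = np.array([0, 1, 1, 1, 1])   # initial partition: {1} and {2, 4, 7, 10}
while True:
    # mean point of each cluster under the current partition
    means = np.array([X[labels == j].mean() for j in range(2)])
    # reassign each point to the cluster with the nearest mean (1-D distance)
    new_labels = np.abs(X[:, None] - means[None, :]).argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break
    labels = new_labels
print(labels, means)   # final clusters {1, 2, 4} and {7, 10}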
Drawbacks¶
Both a K that is too large and a K that is too small can lead to bad results (left: K=4, right: K=2); the resulting clusters do not describe the data well.
Hierarchical clustering¶
- A method of cluster analysis which seeks to build a hierarchy of clusters
- No need to specify the number of clusters; we can generate partitions at different levels of the hierarchy.
- A dendrogram is a tree diagram that can be used to represent the hierarchical clustering structure between data points.
- The height of connections in the dendrogram represents the distance between partitions: the higher the connection, the larger the distance between the two connected clusters.
- Cut the dendrogram: a clustering is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
Building the dendrogram¶
- Two types: bottom-up (agglomerative) and top-down (divisive)
- Bottom-up: two groups are merged if the distance between them is less than a threshold (see the sketch below)
- Top-down: one group is split into two if the inter-group distance is more than a threshold
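A minimal bottom-up (agglomerative) sketch using SciPy's hierarchical clustering utilities; the blobs data and the cut height of 5 are arbitrary choices for illustration:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
# Bottom-up: repeatedly merge the two closest groups (Ward linkage here)
Z = linkage(X, method='ward')
# Draw the dendrogram; connection heights show the distance between merged clusters
dendrogram(Z)
plt.show()
# Cut the dendrogram at height 5: each connected component becomes a cluster
labels = fcluster(Z, t=5, criterion='distance')
print(labels)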
Dimensionality Reduction¶
- Given data points in d dimensions, convert them to data points in r < d dimensions, with minimal loss of information.
- Used for statistical analysis, data compression, and data visualization
Idea of Principal Component Analysis¶
- Reduce from n dimensions to k dimensions: find vectors \(u^1,u^2,...,u^k\) onto which to project the data, so as to minimize the projection error.
- These vectors should carry the primary information of the data; we call them principal components.
Identity matrix¶
In linear algebra, the identity matrix (sometimes ambiguously called a unit matrix) of size n is the n × n square matrix with ones on the main diagonal and zeros elsewhere. It is denoted by In, or simply by I if the size is immaterial or can be trivially determined by the context. In some fields, such as quantum mechanics, the identity matrix is denoted by a boldface one, 1; otherwise it is identical to I. Less frequently, some mathematics books use U or E to represent the identity matrix, meaning "unit matrix" and the German word Einheitsmatrix respectively.
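A quick NumPy illustration (np.eye builds the identity matrix; multiplying by it leaves any matrix unchanged):
import numpy as np

I = np.eye(3)                     # 3 x 3 identity matrix
A = np.arange(9).reshape(3, 3)
print(I)
print(np.allclose(I @ A, A))      # I A = A, so this prints True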
Eigenvalues and eigenvectors¶
https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
https://www.khanacademy.org/math/linear-algebra/alternate-bases/eigen-everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors
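A small NumPy check of the defining property \(Av = \lambda v\) for a symmetric 2 x 2 matrix:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# Each column of `eigenvectors` is an eigenvector of A
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ v, lam * v))   # A v = lambda v holds for each pair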
Intuitive Idea of PCA¶
https://dilloncamp.com/projects/pca.html
- What we DON'T want for the projection: the original data has large variance, but the projected data has small variance. It means the original data is spread out but the projected data is not, i.e., a lot of information is lost during projection.
- PCA's goal: maximize the variance of the projected data.
- In fact, the mathematical definition of the principal components (PCs) is the eigenvectors of the covariance matrix of the data points.
- The order of the PCs follows the magnitude of the eigenvalues, e.g., the most significant PC is the eigenvector corresponding to the largest eigenvalue (see the sketch below).
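A minimal NumPy sketch of this recipe: compute the covariance matrix, take its eigenvectors, and project the centered data onto the top k of them (the synthetic data and k = 2 are just for illustration; scikit-learn's PCA wraps the same idea):
import numpy as np

rng = np.random.default_rng(0)
# 200 points in 3 dimensions, correlated so one direction dominates the variance
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # covariance matrix of the data points
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort PCs by eigenvalue, largest first, and project onto the top k = 2
order = np.argsort(eigenvalues)[::-1]
k = 2
W = eigenvectors[:, order[:k]]
X_reduced = Xc @ W
print(X_reduced.shape)                  # (200, 2)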
Page ranking¶
A method for rating the importance of web pages using the link structure of the web
Simple Recursive Formulation¶
-
Each link’s vote is proportional to the importance of its source page
-
If page P with importance x has n out-links, each link gets x/n votes
-
Page P’s own importance is the sum of the votes on its in-links
-
Final PageRank score: Importance=sum of votes from all in-links
PageRank in Python¶
import numpy as np
import matplotlib.pyplot as plt
# set plot size
plt.rcParams['figure.figsize'] = [20, 5]
def page_rank(matrix: np.ndarray, n_iter: int = 3):
    n = matrix.shape[1]
    # start from the uniform distribution over all pages
    r = np.full((n, 1), 1 / n)
    total_results = None
    for i in range(n_iter):
        # one step of vote propagation: r <- M r
        r = matrix.dot(r)
        if i == 0:
            total_results = r
        else:
            # keep the score vector of every iteration for plotting
            total_results = np.concatenate((total_results, r), axis=1)
    return r, total_results
m = np.array(
[[0, 0.5, 0, 0],
[1/3, 0, 0, 1/2],
[1/3, 0, 0, 1/2],
[1/3, 1/2, 1, 0]
]
)
r, total_results = page_rank(m, 20)
for i, p in enumerate(total_results):
    plt.plot(p, label=f"{i} - line")
plt.legend()
Random walk interpretation¶
An equivalent view of PageRank.
Imagine a random web surfer:
- At any time t, the surfer is on some page P
- At time t+1, the surfer follows an out-link from P uniformly at random
- The surfer ends up on some page Q linked from P, and the process repeats indefinitely

Let p(t) be a vector whose \(i^{th}\) component is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.
Stationary Distribution¶
Where is the surfer at time t+1?
- Follows a link uniformly at random
- p(t+1) = Mp(t), where M is the transition probability matrix
Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
- Then p(t) is called a stationary distribution of the random walk.
- The PageRank score r is the stationary distribution, and can be solved from r = Mr.
- Normalize by scaling the sum of r to 1 (as in the sketch below).
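A minimal sketch solving r = Mr directly as the eigenvector of M with eigenvalue 1 (a column-stochastic matrix always has one), using the same 4-page transition matrix as above:
import numpy as np

M = np.array([[0, 0.5, 0, 0],
              [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2],
              [1/3, 1/2, 1, 0]])
eigenvalues, eigenvectors = np.linalg.eig(M)
# pick the eigenvector whose eigenvalue is (numerically closest to) 1
idx = np.argmin(np.abs(eigenvalues - 1))
r = np.real(eigenvectors[:, idx])
r = r / r.sum()          # normalize so the entries sum to 1
print(r)
This should agree with what the power iteration in page_rank converges to after enough iterations.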
Stationary Distribution = PageRank Score¶
The stationary distribution represents the PageRank score.
- PageRank score: a node's importance equals the sum of the votes from adjacent nodes.
- Stationary distribution: the probability of being at one node equals the sum of the probabilities coming from other nodes.
- Both of them describe the same stable state.
Spider Traps¶
- A group of pages is a spider trap if there are no links from pages within the group to pages outside the group
- The random surfer gets trapped and keeps walking inside the trap forever.
- Spider traps violate the conditions needed for the random walk theorem, as demonstrated in the sketch below; the damping factor in the implementation that follows is the standard fix.
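A small demonstration of the trap, using the same 3-page matrix as the damped example below, where page 2 only links to itself: plain power iteration drains essentially all the probability into the trapped page.
import numpy as np

# Column-stochastic transition matrix; page 2 links only to itself (a spider trap)
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
r = np.full((3, 1), 1 / 3)      # start from the uniform distribution
for _ in range(100):
    r = M @ r                   # one random-walk step, no damping
print(r)                        # almost all mass ends up on the trapped page 2
With damping, the surfer teleports to a random page with probability 1 - d, so the trapped page no longer absorbs all of the score.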
import numpy as np
def pagerank(M, num_iterations: int = 100, d: float = 0.85):
    """PageRank: The trillion dollar algorithm.

    Parameters
    ----------
    M : numpy array
        adjacency matrix where M_i,j represents the link from 'j' to 'i', such that for all 'j'
        sum(i, M_i,j) = 1
    num_iterations : int, optional
        number of iterations, by default 100
    d : float, optional
        damping factor, by default 0.85

    Returns
    -------
    numpy array
        a vector of ranks such that v_i is the i-th rank from [0, 1],
        v sums to 1
    """
    N = M.shape[1]
    # start from a random distribution normalized to sum to 1
    v = np.random.rand(N, 1)
    v = v / np.linalg.norm(v, 1)
    # damped transition matrix: follow a link with probability d,
    # teleport to a random page with probability 1 - d
    M_hat = (d * M + (1 - d) / N)
    for i in range(num_iterations):
        v = M_hat @ v
    return v
M = np.array([[0.5, 0.5, 0],
[0.5, 0, 0],
[0, 0.5, 1],
])
v = pagerank(M, 100, 0.8)
print(v)
[[0.21212121]
[0.15151515]
[0.63636364]]