Similarity and Dissimilarity¶
Similarity is the numerical measure of how alike two data objects are.
It is a basic building block of many data processing techniques, such as:
- data integration
- data mining: classification, clustering, recommendation, anomaly detection
Dissimilarity is the numerical measure of how different two data objects are.
The term distance is frequently used as a synonym for dissimilarity.
Attribute Types¶
- Nominal (Categorical)
- Ordinal
- Interval or Ratio
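How dissimilarity is computed depends on the attribute type. The following is a minimal sketch, assuming common textbook conventions that the notes above do not spell out: a 0/1 mismatch for nominal values, a normalized rank difference for ordinal values, and an absolute difference for interval or ratio values. The function names and the example rating scale are hypothetical.

```python
def nominal_dissimilarity(a, b) -> int:
    # Nominal values can only be compared for equality.
    return 0 if a == b else 1


def ordinal_dissimilarity(a, b, ordered_levels) -> float:
    # Map each value to its rank, then normalize the rank gap to [0, 1].
    rank_a = ordered_levels.index(a)
    rank_b = ordered_levels.index(b)
    return abs(rank_a - rank_b) / (len(ordered_levels) - 1)


def interval_dissimilarity(a: float, b: float) -> float:
    # Interval and ratio values support meaningful differences.
    return abs(a - b)


levels = ["poor", "fair", "good", "excellent"]  # assumed example scale
print(nominal_dissimilarity("red", "blue"))                # 1
print(ordinal_dissimilarity("fair", "excellent", levels))  # 2 ranks apart out of 3
print(interval_dissimilarity(20.5, 18.0))                  # 2.5
```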
String Matching¶
Strings that refer to the same real-world entity often look quite different:
- Typing and OCR errors: David Smith vs. Davod Smith
- Different formatting conventions: 10/8 vs. Oct 8
- Custom abbreviation, shortening, or omission: Daniel Walker Herbert Smith vs. Dan W. Smith
- Different names and nicknames: William Smith vs. Bill Smith
- Shuffling parts of strings: Dept. of Computer Science, UST vs. Computer Science Dept., UST
namea = "Dave Smith"
nameb = 'David D. Smith'
Edit Distance¶
from pprint import pprint

def minDistance(word1: str, word2: str) -> int:
    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2
    h, w = len(word_1), len(word_2)
    min_edit_dist = [[0 for _ in range(w)] for _ in range(h)]
    # initialization for top row
    for x in range(1, w):
        min_edit_dist[0][x] = x
    # initialization for left-most column
    for y in range(1, h):
        min_edit_dist[y][0] = y
    # compute minimum edit distance with optimal substructure
    for y in range(1, h):
        for x in range(1, w):
            if word_1[y] == word_2[x]:
                # current character match, no need to edit
                min_edit_dist[y][x] = min_edit_dist[y-1][x-1]
            else:
                # current character mismatch: choose the cheapest of
                # character addition, character deletion, or character replacement
                min_edit_dist[y][x] = min(min_edit_dist[y][x-1],
                                          min_edit_dist[y-1][x],
                                          min_edit_dist[y-1][x-1]) + 1
    pprint(min_edit_dist)
    return min_edit_dist[-1][-1]

minDistance(namea, nameb)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
[1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
[2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
[3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[4, 3, 2, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
[5, 4, 3, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10],
[6, 5, 4, 3, 3, 3, 3, 3, 4, 5, 5, 6, 7, 8, 9],
[7, 6, 5, 4, 4, 4, 4, 4, 4, 5, 6, 5, 6, 7, 8],
[8, 7, 6, 5, 4, 5, 5, 5, 5, 5, 6, 6, 5, 6, 7],
[9, 8, 7, 6, 5, 5, 6, 6, 6, 6, 6, 7, 6, 5, 6],
[10, 9, 8, 7, 6, 6, 6, 7, 7, 7, 7, 7, 7, 6, 5]]
5
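An edit distance is a dissimilarity, so it is often converted into a similarity score in [0, 1] before use. The sketch below assumes the common normalization s(x, y) = 1 - d(x, y) / max(|x|, |y|), which is not stated in the notes above. The names `edit_distance` and `edit_similarity` are illustrative, and the distance is recomputed with two rows of memory instead of the full matrix.

```python
def edit_distance(x: str, y: str) -> int:
    # prev holds the previous row of the dynamic-programming matrix.
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty string
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or replacement
        prev = curr
    return prev[-1]


def edit_similarity(x: str, y: str) -> float:
    # Assumed normalization: 1 - d(x, y) / max(len(x), len(y)).
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1 - edit_distance(x, y) / longest


print(edit_distance("Dave Smith", "David D. Smith"))              # 5
print(round(edit_similarity("Dave Smith", "David D. Smith"), 3))  # 0.643
```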
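Needleman-Wunsch Measure¶
Initialize a matrix of size (n + 1) x (m + 1), where n and m are the lengths of the two strings and s(a, b) is the element in the a-th row and b-th column.
Fill the first column and row: \(s(i,0) = -i \cdot c_g, s(0,j) = -j \cdot c_g\)
Fill the remaining cells: \(s(i,j) = \max(s(i-1,j-1) + c(x_i, y_j), s(i-1,j) - c_g, s(i,j-1) - c_g)\), where \(c_g\) is the gap penalty and \(c(x_i, y_j)\) is the score for aligning character \(x_i\) with \(y_j\).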
from pprint import pprint

def find_match(word1: str, word2: str, cg, c, cm):
    if word1 == word2:
        return c
    elif word1 == ' ' or word2 == ' ':
        return -cg
    else:
        return -cm

def needle_man(word1: str, word2: str, cg=1, c=1, cm=1) -> int:
    """
    cg: gap_penalty
    c: match award
    cm: mismatch penalty
    """
    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2
    h, w = len(word_1), len(word_2)
    s = [[0 for _ in range(w)] for _ in range(h)]
    # initialization for top row
    for j in range(1, w):
        s[0][j] = -j * cg
    # initialization for left-most column
    for i in range(1, h):
        s[i][0] = -i * cg
    for i in range(1, h):
        for j in range(1, w):
            s[i][j] = max(s[i - 1][j] - cg,
                          s[i][j - 1] - cg,
                          s[i - 1][j - 1] + find_match(word_1[i], word_2[j], cg, c, cm))
    pprint(s)
    return s[-1][-1]

needle_man("dva", "deeve", c=2)
[[0, -1, -2, -3, -4, -5],
[-1, 2, 1, 0, -1, -2],
[-2, 1, 1, 0, 2, 1],
[-3, 0, 0, 0, 1, 1]]
1
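The score alone does not show which characters were aligned. As a hedged sketch, the matrix can be walked backwards from the bottom-right cell to recover one optimal alignment. The function name `needleman_wunsch_align` and the `-` gap symbol are assumptions for illustration, and this version scores character pairs only as match or mismatch (it does not treat a space specially).

```python
def needleman_wunsch_align(x: str, y: str, cg=1, c=1, cm=1):
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -i * cg
    for j in range(1, m + 1):
        s[0][j] = -j * cg

    def pair_score(a, b):
        return c if a == b else -cm

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),
                          s[i - 1][j] - cg,
                          s[i][j - 1] - cg)

    # Traceback: at each cell, follow a predecessor that explains its value.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]):
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] - cg:
            aligned_x.append(x[i - 1])  # gap in y
            aligned_y.append('-')
            i -= 1
        else:
            aligned_x.append('-')       # gap in x
            aligned_y.append(y[j - 1])
            j -= 1
    return s[n][m], ''.join(reversed(aligned_x)), ''.join(reversed(aligned_y))


score, ax, ay = needleman_wunsch_align("dva", "deeve", c=2)
print(score)  # 1
print(ax)     # d--va
print(ay)     # deeve
```

Reading the alignment column by column gives 2 (d/d) - 1 - 1 (two gaps) + 2 (v/v) - 1 (a/e) = 1, which agrees with the matrix. When several alignments tie, this traceback prefers the diagonal move.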
Affine gap measure¶
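Define x = \(x_1x_2...x_n\) and y = \(y_1y_2...y_m\), where \(x_i\) and \(y_j\) are the i-th and j-th characters of x and y.
Initialization:
- \(M(0, 0) = 0, I_x(0,0) = -c_o, I_y(0,0) = -c_o\)
- \(I_x(i,0) = -c_o - c_r \cdot (i - 1)\)
- \(I_y(0, j) = -c_o - c_r \cdot (j - 1)\)
- Other cells in top row and leftmost column = \(-\infty\)
where \(c_o\) is the cost of opening a gap, \(c_r\) is the cost of continuing a gap, and \(c(x_i, y_j)\) is the score for corresponding character \(x_i\) with \(y_j\) in the score matrix.
Recurrence for \(i, j \geq 1\):
- \(M(i,j) = \max(M(i-1,j-1), I_x(i-1,j-1), I_y(i-1,j-1)) + c(x_i, y_j)\)
- \(I_x(i,j) = \max(M(i-1,j) - c_o, I_x(i-1,j) - c_r)\)
- \(I_y(i,j) = \max(M(i,j-1) - c_o, I_y(i,j-1) - c_r)\)
Score: \(\max(M(n,m), I_x(n,m), I_y(n,m))\)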
from math import inf
from pprint import pprint

def find_match(word1, word2, reward, penalty):
    if word1 == word2:
        return reward
    else:
        return -penalty

def affine_gap(word1: str, word2: str, co=1, cr=1, cg=1, c=1, cm=1) -> int:
    """
    co: cost of opening a gap
    cr: cost of continuing a gap
    cg: gap penalty (not used by the affine gap recurrence)
    c: match award
    cm: mismatch penalty
    """
    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2
    h, w = len(word_1), len(word_2)
    m = [[-inf for _ in range(w)] for _ in range(h)]
    i_x = [[-inf for _ in range(w)] for _ in range(h)]
    i_y = [[-inf for _ in range(w)] for _ in range(h)]
    m[0][0] = 0
    i_x[0][0] = -co
    i_y[0][0] = -co
    # initialization for top row
    for j in range(1, w):
        i_y[0][j] = -co - cr * (j - 1)
    # initialization for left-most column
    for i in range(1, h):
        i_x[i][0] = -co - cr * (i - 1)
    for i in range(1, h):
        for j in range(1, w):
            match_reward = find_match(word_1[i], word_2[j], reward=c, penalty=cm)
            # x_i aligned with y_j
            m[i][j] = max(m[i - 1][j - 1] + match_reward,
                          i_x[i - 1][j - 1] + match_reward,
                          i_y[i - 1][j - 1] + match_reward)
            # x_i aligned with a gap: open a new gap or continue one
            i_x[i][j] = max(m[i - 1][j] - co, i_x[i - 1][j] - cr)
            # y_j aligned with a gap: open a new gap or continue one
            i_y[i][j] = max(m[i][j - 1] - co, i_y[i][j - 1] - cr)
    print("m: ")
    pprint(m)
    print("i_x: ")
    pprint(i_x)
    print("i_y: ")
    pprint(i_y)
    return max(m[-1][-1], i_x[-1][-1], i_y[-1][-1])

affine_gap("AAT", "ACACT", c=1, co=4, cr=1)
m:
[[0, -inf, -inf, -inf, -inf, -inf],
[-inf, 1, -5, -4, -7, -8],
[-inf, -3, 0, -2, -5, -6],
[-inf, -6, -4, -1, -3, -4]]
i_x:
[[-4, -inf, -inf, -inf, -inf, -inf],
[-4, -inf, -inf, -inf, -inf, -inf],
[-5, -3, -9, -8, -11, -12],
[-6, -4, -4, -6, -9, -10]]
i_y:
[[-4, -4, -5, -6, -7, -8],
[-inf, -inf, -3, -4, -5, -6],
[-inf, -inf, -7, -4, -5, -6],
[-inf, -inf, -10, -8, -5, -6]]
-4
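The initialization above implies that a gap of length k costs the opening cost plus (k - 1) times the continuation cost, whereas a linear gap penalty as in Needleman-Wunsch charges every gap position the same amount. The sketch below compares the two; the helper names are hypothetical, and the default costs are the ones passed in the example call (opening cost 4, continuation cost 1).

```python
def linear_gap_cost(length: int, cg: int = 1) -> int:
    # Every gap position costs the same amount.
    return length * cg


def affine_gap_cost(length: int, co: int = 4, cr: int = 1) -> int:
    # Opening a gap costs co; each further position costs cr.
    if length == 0:
        return 0
    return co + cr * (length - 1)


print(linear_gap_cost(3))      # 3
print(affine_gap_cost(3))      # 6: one gap of length 3
print(3 * affine_gap_cost(1))  # 12: three separate gaps of length 1
```

With these costs, one long gap is much cheaper than several short ones, which is the behavior the affine gap measure is meant to reward.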
Smith-Waterman Measure¶
Initialization:
- Initialize a matrix of size (n + 1) x (m + 1), where s(a, b) is the element in the a-th row and b-th column.
- Fill the first column and row: s(i, 0) = 0, s(0, j) = 0
def smith_waterman(word1: str, word2: str, cm=1, c=1, cg=1) -> int:
    """
    cg: gap_penalty
    c: match award
    cm: mismatch penalty
    """
    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2
    h, w = len(word_1), len(word_2)
    s = [[0 for _ in range(w)] for _ in range(h)]
    # initialization for top row
    for j in range(1, w):
        s[0][j] = 0
    # initialization for left-most column
    for i in range(1, h):
        s[i][0] = 0
    for i in range(1, h):
        for j in range(1, w):
            # a mismatch is charged the mismatch penalty cm, not the gap penalty
            match_reward = find_match(word_1[i], word_2[j], reward=c, penalty=cm)
            s[i][j] = max(0,
                          s[i - 1][j] - cg,
                          s[i][j - 1] - cg,
                          s[i - 1][j - 1] + match_reward)
    pprint(s)
    # the best local alignment can end anywhere, so the score is the largest cell
    return max(max(row) for row in s)

smith_waterman(" avd", "dave")
[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 2, 1],
[0, 1, 0, 1, 1]]
2
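In Smith-Waterman the best local alignment ends at the largest cell of the matrix (the 2 in the matrix above), and the matching substrings can be recovered by tracing back from that cell until a 0 is reached. The following is a minimal sketch under those assumptions; the name `smith_waterman_align` is illustrative, and the example uses "avd" without the leading space.

```python
def smith_waterman_align(x: str, y: str, cg=1, c=1, cm=1):
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def pair_score(a, b):
        return c if a == b else -cm

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,
                          s[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]),
                          s[i - 1][j] - cg,
                          s[i][j - 1] - cg)
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts (score 0).
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + pair_score(x[i - 1], y[j - 1]):
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] - cg:
            i -= 1
        else:
            j -= 1
    return best, x[i:end_i], y[j:end_j]


print(smith_waterman_align("avd", "dave"))  # (2, 'av', 'av')
```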
Set-Based Measures¶
View strings as sets or multi-sets of tokens.
Common methods to generate tokens:
- words delimited by space
  - e.g. for the string "david smith", the tokens are "david" and "smith"
  - stem the words if necessary
  - remove stop words (e.g. the, and, of)
- q-grams, substrings of length q
  - e.g. for the string "david smith", the set of 3-grams is ##d, #da, dav, avi, ..., h##
  - the special character # handles the start and end of the string
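Both tokenization schemes can be sketched in a few lines. The version below is a rough illustration: the stop-word list is the small example from the list above, stemming is left out, and the names `word_tokens` and `qgrams` are hypothetical. For q-grams it assumes q - 1 padding characters on each side, which reproduces the 3-gram example.

```python
STOP_WORDS = {"the", "and", "of"}  # tiny illustrative list


def word_tokens(text: str, stop_words=STOP_WORDS) -> list:
    # Split on whitespace, lowercase, and drop stop words (no stemming here).
    return [w for w in text.lower().split() if w not in stop_words]


def qgrams(text: str, q: int = 3) -> list:
    # Pad with q - 1 '#' characters so the start and end are represented.
    pad = "#" * (q - 1)
    padded = f"{pad}{text}{pad}"
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


print(word_tokens("david smith"))                # ['david', 'smith']
print(word_tokens("Dept. of Computer Science"))  # ['dept.', 'computer', 'science']
grams = qgrams("david smith", q=3)
print(grams[:4], grams[-1])                      # ['##d', '#da', 'dav', 'avi'] h##
```

Returning lists keeps duplicates, so the result can be turned into either a set or a multi-set afterwards.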
Overlap Measure¶
- Let Bx = set of tokens generated for string x
- Let By = set of tokens generated for string y
- The overlap measure returns the number of common tokens: \(O(x, y) = |B_x \cap B_y|\)
- E.g., x = dave, y = dav, considering 2-grams: Bx = {#d, da, av, ve, e#}, By = {#d, da, av, v#}, so O(x, y) = 3
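def token(word: str, max_token_len=2):
    # pad with (q - 1) '#' characters on each side so that the start and end
    # of the string are represented (one '#' per side for 2-grams)
    q = max_token_len
    pad = "#" * (q - 1)
    new_word = f"{pad}{word}{pad}"
    sets = []
    for i in range(len(new_word) - q + 1):
        w = new_word[i:i + q]
        sets.append(w)
    return set(sets)

def overlap(word1, word2, max_token_len=2):
    set1 = token(word1, max_token_len)
    set2 = token(word2, max_token_len)
    # returns the shared tokens; the overlap measure O(x, y) is the size of this set
    return set1.intersection(set2)

overlap("dave", "dav")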
{'#d', 'av', 'da'}
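The overlap measure itself is the size of the shared token set. Since strings can also be viewed as multi-sets, the sketch below computes both variants; the multi-set version relies on `collections.Counter` intersection, which keeps the minimum count of each token. The function names are illustrative.

```python
from collections import Counter


def bigrams(word: str) -> list:
    # 2-grams with one '#' of padding on each side, duplicates kept.
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def overlap_size(x: str, y: str) -> int:
    # Set view: each distinct token counts once.
    return len(set(bigrams(x)) & set(bigrams(y)))


def multiset_overlap_size(x: str, y: str) -> int:
    # Multi-set view: a token shared k times counts k times.
    common = Counter(bigrams(x)) & Counter(bigrams(y))
    return sum(common.values())


print(overlap_size("dave", "dav"))           # 3
print(overlap_size("aaa", "aaaa"))           # 3: #a, aa, a#
print(multiset_overlap_size("aaa", "aaaa"))  # 4: 'aa' is shared twice
```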
Jaccard Measure¶
- Let Bx = set of tokens generated for string x
- Let By = set of tokens generated for string y
- The Jaccard measure returns the fraction of all tokens that are common to both strings: \(J(x, y) = |B_x \cap B_y| / |B_x \cup B_y|\)
E.g., x = dave, y = dav, considering 2-grams:
Bx = {#d, da, av, ve, e#},
By = {#d, da, av, v#}
J(x, y) = 3 / 6 = 0.5
def jaccard(word1, word2, max_token_len=2):
    set1 = token(word1, max_token_len)
    set2 = token(word2, max_token_len)
    print(f"Union: {set1.union(set2)}")
    print(f"Intersection: {set1.intersection(set2)}")
    return len(set1.intersection(set2)) / len(set1.union(set2))

jaccard("dave", "dav")
Union: {'#d', 'e#', 've', 'da', 'v#', 'av'}
Intersection: {'#d', 'da', 'av'}
0.5
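Because the Jaccard measure lies in [0, 1], the matching dissimilarity is commonly taken as 1 - J(x, y). The sketch below is self-contained and parameterized by q, assuming q - 1 padding characters on each side; the names `qgram_set`, `jaccard_similarity`, and `jaccard_distance` are hypothetical.

```python
def qgram_set(word: str, q: int = 2) -> set:
    pad = "#" * (q - 1)
    padded = f"{pad}{word}{pad}"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard_similarity(x: str, y: str, q: int = 2) -> float:
    bx, by = qgram_set(x, q), qgram_set(y, q)
    union = bx | by
    if not union:
        return 1.0  # assumed convention: two empty token sets are identical
    return len(bx & by) / len(union)


def jaccard_distance(x: str, y: str, q: int = 2) -> float:
    return 1 - jaccard_similarity(x, y, q)


print(jaccard_similarity("dave", "dav"))       # 0.5
print(jaccard_similarity("dave", "dav", q=3))  # 0.375
print(jaccard_distance("dave", "dav"))         # 0.5
```

With 3-grams the two names share 3 of 8 distinct tokens, so the score drops from 0.5 to 0.375: longer q-grams penalize the differing ending more heavily.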