Similarity and Dissimilarity¶

Similarity is the numerical measure of how alike two data objects are.

Similarity is a basic component of many data processing techniques, such as:

- data integration
- data mining: classification, clustering, recommendation, anomaly detection

Dissimilarity is the numerical measure of how different two data objects are.

The term distance is frequently used as a synonym for dissimilarity.

Attribute Type¶

Nominal (Categorical)
Ordinal
Interval or Ratio
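
To make the role of the attribute type concrete, here is a minimal sketch of one common textbook convention for the dissimilarity of a single pair of attribute values. The function names, the `levels` argument, and the example scale are hypothetical and not part of these notes; other conventions exist.

```python
def nominal_dissimilarity(a, b) -> int:
    """Nominal values can only be compared for equality: 0 if equal, else 1."""
    return 0 if a == b else 1


def ordinal_dissimilarity(a, b, levels) -> float:
    """Ordinal values are mapped to their rank; the rank gap is scaled to [0, 1].

    `levels` lists the allowed values from lowest to highest and is assumed
    to contain at least two entries.
    """
    n = len(levels)
    return abs(levels.index(a) - levels.index(b)) / (n - 1)


def interval_dissimilarity(a: float, b: float) -> float:
    """Interval or ratio values: plain absolute difference."""
    return abs(a - b)


scale = ["low", "medium", "high"]  # hypothetical ordinal scale
print(nominal_dissimilarity("red", "blue"))            # 1
print(ordinal_dissimilarity("low", "medium", scale))   # 0.5
print(interval_dissimilarity(3.5, 1.0))                # 2.5
```

Under this convention the nominal and ordinal values always fall in [0, 1], so a similarity can be taken as 1 minus the dissimilarity.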

String Matching:¶

Strings that refer to the same thing often look quite different:

- Typing and OCR errors: David Smith vs. Davod Smith
- Different formatting conventions: 10/8 vs. Oct 8
- Custom abbreviation, shortening, or omission: Daniel Walker Herbert Smith vs. Dan W. Smith
- Different names or nicknames: William Smith vs. Bill Smith
- Shuffling parts of strings: Dept. of Computer Science, UST vs. Computer Science Dept., UST

namea = "Dave Smith"
nameb = 'David D. Smith'

Edit Distance¶

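Let $d(i, j)$ be the edit distance between the first $i$ characters of $x$ and the first $j$ characters of $y$, with $d(i, 0) = i$ and $d(0, j) = j$:

\[d(i, j) = \min \begin{cases} d(i-1, j) + 1 \\ d(i, j-1) + 1 \\ d(i-1, j-1) + 1 & \text{if } x_i \neq y_j \\ d(i-1, j-1) & \text{if } x_i = y_j \end{cases}\]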

from pprint import pprint

def minDistance(word1: str, word2: str) -> int:
    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    min_edit_dist = [[0 for _ in range(w)] for _ in range(h)]

    # initialization for top row
    for x in range(1, w):
        min_edit_dist[0][x] = x

    # initialization for left-most column
    for y in range(1, h):
        min_edit_dist[y][0] = y

    # compute minimum edit distance with optimal substructure
    for y in range(1, h):
        for x in range(1, w):
            if word_1[y] == word_2[x]:
                # current characters match, no edit needed
                min_edit_dist[y][x] = min_edit_dist[y-1][x-1]
            else:
                # current characters mismatch: take the cheapest of
                # insertion, deletion, or replacement, plus one edit
                min_edit_dist[y][x] = min(min_edit_dist[y][x-1],
                                          min_edit_dist[y-1][x],
                                          min_edit_dist[y-1][x-1]) + 1

    pprint(min_edit_dist)
    return min_edit_dist[-1][-1]

minDistance(namea, nameb)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
 [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
 [2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [4, 3, 2, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [5, 4, 3, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [6, 5, 4, 3, 3, 3, 3, 3, 4, 5, 5, 6, 7, 8, 9],
 [7, 6, 5, 4, 4, 4, 4, 4, 4, 5, 6, 5, 6, 7, 8],
 [8, 7, 6, 5, 4, 5, 5, 5, 5, 5, 6, 6, 5, 6, 7],
 [9, 8, 7, 6, 5, 5, 6, 6, 6, 6, 6, 7, 6, 5, 6],
 [10, 9, 8, 7, 6, 6, 6, 7, 7, 7, 7, 7, 7, 6, 5]]





5
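
Edit distance is a dissimilarity, so it is often rescaled into a similarity in [0, 1]. The sketch below uses one common convention, $s(x, y) = 1 - d(x, y) / \max(|x|, |y|)$; the function names are hypothetical, and returning 1.0 for two empty strings is an assumption rather than something these notes specify. It keeps only two rows of the table instead of the full matrix.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance, keeping only the previous and current DP rows."""
    prev = list(range(len(y) + 1))          # row 0: d(0, j) = j
    for i, cx in enumerate(x, start=1):
        curr = [i]                          # column 0: d(i, 0) = i
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # copy or substitute
        prev = curr
    return prev[-1]


def edit_similarity(x: str, y: str) -> float:
    """Rescale the distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1 - edit_distance(x, y) / longest


print(edit_distance("Dave Smith", "David D. Smith"))    # 5
print(edit_similarity("Dave Smith", "David D. Smith"))  # 1 - 5/14, about 0.643
```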

Needleman-Wunsch Measure¶

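Initialize a matrix of size $(n + 1) \times (m + 1)$, where $s(a, b)$ is the element at the a-th row and b-th column.

Fill the first column and first row: $s(i,0) = -i \cdot c_g$, $s(0,j) = -j \cdot c_g$, where $c_g$ is the gap penalty.

\[s(i, j) = \max \begin{cases} s(i - 1, j) - c_g \\ s(i, j-1) - c_g \\ s(i-1, j-1) + c(x_i, y_j) \end{cases}\]

Here $c(x_i, y_j)$ is the score for aligning character $x_i$ with character $y_j$.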

from pprint import pprint

def find_match(word1: str, word2: str, cg, c, cm):
    if word1 == word2:
        return c
    elif word1 == ' ' or word2 == ' ':
        return -cg

    else:
        return -cm

def needle_man(word1: str, word2: str, cg=1, c=1, cm=1) -> int:
    """
    cg: gap_penalty
    c: match award
    cm: mismatch penalty
    """

    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    s = [ [ 0 for _ in range (w ) ] for _ in range(h ) ]


    # initialization for top row
    for j in range(1, w ):
        s[0][j] = -j * cg

    # initialization for left-most column
    for i in range(1, h ):
        s[i][0] = -i * cg



    for i in range(1, h ):
        for j in range(1, w ):
            s[i][j] = max(s[i - 1][j] - cg, s[i][j - 1] - cg, s[i - 1][j - 1] + find_match(word_1[i], word_2[j], cg, c,cm))



    pprint(s)
    return s[-1][-1]

needle_man("dva", "deeve", c=2)

[[0, -1, -2, -3, -4, -5],
 [-1, 2, 1, 0, -1, -2],
 [-2, 1, 1, 0, 2, 1],
 [-3, 0, 0, 0, 1, 1]]





1
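
The score alone does not show which characters were paired. As a sketch, the version below fills the same matrix and then walks back from the bottom-right cell to recover one optimal alignment. The function name and the parameter names `match`, `mismatch`, and `gap` are hypothetical, and the sketch omits the special handling of space characters in `find_match` above.

```python
def needleman_wunsch_align(x: str, y: str, match=2, mismatch=1, gap=1):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -i * gap
    for j in range(1, m + 1):
        s[0][j] = -j * gap

    def pair_score(i, j):
        # score for aligning x_i with y_j (1-based indices)
        return match if x[i - 1] == y[j - 1] else -mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] - gap,
                          s[i][j - 1] - gap,
                          s[i - 1][j - 1] + pair_score(i, j))

    # Traceback: at each cell, follow a move that reproduces its value.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + pair_score(i, j):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] - gap:
            top.append(x[i - 1]); bottom.append("-")   # gap in y
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])   # gap in x
            j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, ax, ay = needleman_wunsch_align("dva", "deeve")
print(score)  # 1
print(ax)     # d--va
print(ay)     # deeve
```

When several alignments share the best score, this traceback prefers the diagonal move, so it returns only one of them.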

Affine gap measure¶

Define $x = x_1 x_2 \dots x_n$ and $y = y_1 y_2 \dots y_m$, where $x_i$ and $y_j$ are the i-th and j-th characters of x and y. Three matrices are filled over the prefixes $x_1 \dots x_i$ and $y_1 \dots y_j$: $M(i, j)$ is the best score when $x_i$ is aligned with $y_j$, $I_x(i, j)$ is the best score when $x_i$ is aligned with a gap, and $I_y(i, j)$ is the best score when $y_j$ is aligned with a gap.

Initialization:

$M(0, 0) = 0$, $I_x(0, 0) = -c_o$, $I_y(0, 0) = -c_o$
$I_x(i, 0) = -c_o - c_r \cdot (i - 1)$
$I_y(0, j) = -c_o - c_r \cdot (j - 1)$
Other cells in the top row and leftmost column = $-\infty$

\[M(i, j) = \max \begin{cases} M(i - 1, j - 1) + c(x_i, y_j) \\ I_x(i - 1, j - 1) + c(x_i, y_j) \\ I_y(i - 1, j - 1) + c(x_i, y_j) \end{cases}\]

\[I_x(i, j) = \max \begin{cases} M(i - 1, j) - c_o \\ I_x(i - 1, j) - c_r \end{cases}\]

\[I_y(i, j) = \max \begin{cases} M(i, j - 1) - c_o \\ I_y(i, j - 1) - c_r \end{cases}\]

where $c_o$ is the cost of opening a gap, $c_r$ is the cost of continuing a gap, and $c(x_i, y_j)$ is the score for matching character $x_i$ with $y_j$ in the score matrix.

Score: $\max(M(n, m), I_x(n, m), I_y(n, m))$
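
The point of the affine model is that one long gap is cheaper than several short ones: a gap of length k costs $c_o + (k - 1) \cdot c_r$, which matches the initialization of the first row and column above. The small sketch below compares this with a linear gap penalty; the function names are hypothetical, and the default values simply mirror the example call used later in this section.

```python
def linear_gap_cost(k: int, cg: int = 1) -> int:
    """Linear model: every gap position costs the same."""
    return k * cg


def affine_gap_cost(k: int, co: int = 4, cr: int = 1) -> int:
    """Affine model: co for the first gap position, cr for each further one."""
    return 0 if k == 0 else co + (k - 1) * cr


# One gap of length 3 versus three separate gaps of length 1.
print(affine_gap_cost(3))                           # 4 + 2 * 1 = 6
print(sum(affine_gap_cost(1) for _ in range(3)))    # 3 * 4 = 12
print(linear_gap_cost(3), 3 * linear_gap_cost(1))   # 3 3
```

Under the linear model both layouts cost the same, so only the affine model expresses a preference for contiguous gaps.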

from math import inf
from pprint import pprint

def find_match(word1, word2, reward, penalty):
    if word1 == word2:
        return reward
    else:
        return -penalty


def affine_gap(word1: str, word2: str, co=1, cr=1,  cg=1, c=1, cm=1) -> int:
    """
    co: cost of opening a gap
    cr: cost of continuing a gap
    cg: gap penalty (not used by this function)
    c: match award
    cm: mismatch penalty
    """

    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    m = [ [ -inf for _ in range (w ) ] for _ in range(h ) ]
    i_x = [ [ -inf for _ in range (w ) ] for _ in range(h ) ]
    i_y = [ [ -inf for _ in range (w ) ] for _ in range(h ) ]

    m[0][0] = 0
    i_x[0][0] = -co
    i_y[0][0] = -co


    # initialization for top row
    for j in range(1, w ):
        i_y[0][j] = -co - cr * (j - 1)

    # initialization for left-most column
    for i in range(1, h ):
        i_x[i][0] = -co - cr * (i - 1)



    for i in range(1, h ):
        for j in range(1, w ):
            match_reward = find_match(word_1[i], word_2[j], reward=c, penalty=cm)
            m[i][j] = max(m[i-1][j-1] + match_reward, i_x[i - 1][j - 1] + match_reward, i_y[i - 1][j - 1] + match_reward)
            i_x[i][j] = max(m[i - 1][j] - co, i_x[i - 1][j] - cr)
            i_y[i][j] = max(m[i][j - 1] - co, i_y[i][j - 1] - cr)


    print("m: ")
    pprint(m)
    print("i_x: ")
    pprint(i_x)
    print("i_y: ")
    pprint(i_y)
    return max(m[-1][-1], i_x[-1][-1], i_y[-1][-1])

affine_gap("AAT", "ACACT", c=1, co=4, cr=1)

m: 
[[0, -inf, -inf, -inf, -inf, -inf],
 [-inf, 1, -5, -4, -7, -8],
 [-inf, -3, 0, -2, -5, -6],
 [-inf, -6, -4, -1, -3, -4]]
i_x: 
[[-4, -inf, -inf, -inf, -inf, -inf],
 [-4, -inf, -inf, -inf, -inf, -inf],
 [-5, -3, -9, -8, -11, -12],
 [-6, -4, -4, -6, -9, -10]]
i_y: 
[[-4, -4, -5, -6, -7, -8],
 [-inf, -inf, -3, -4, -5, -6],
 [-inf, -inf, -7, -4, -5, -6],
 [-inf, -inf, -10, -8, -5, -6]]





-4

Smith-Waterman Measure¶

Initialization:

- initialize a matrix of size $(n + 1) \times (m + 1)$, where $s(a, b)$ is the element at the a-th row and b-th column
- fill the first column and first row: $s(i, 0) = 0$, $s(0, j) = 0$

\[s(i, j) = \max \begin{cases} 0 \\ s(i - 1, j) - c_g \\ s(i, j - 1) - c_g \\ s(i - 1, j - 1) + c(x_i, y_j) \end{cases}\]

def smith_waterman(word1: str, word2: str, cm=1, c=1, cg=1) -> int:
    """
    cg: gap_penalty
    c: match award
    cm: mismatch penalty
    """

    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    s = [ [ 0 for _ in range (w ) ] for _ in range(h ) ]


    # initialization for top row
    for j in range(1, w ):
        s[0][j] = 0

    # initialization for left-most column
    for i in range(1, h ):
        s[i][0] = 0


    for i in range(1, h ):
        for j in range(1, w ):
            # a mismatch is charged cm; gaps are charged cg in the line below
            match_reward = find_match(word_1[i], word_2[j], reward=c, penalty=cm)
            s[i][j] = max(0, s[i - 1][j] - cg, s[i][j - 1] - cg, s[i - 1][j - 1] + match_reward)

    pprint(s)
    # the local-alignment score is the largest cell anywhere in the matrix
    return max(max(row) for row in s)

smith_waterman(" avd", "dave")

[[0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 2, 1],
 [0, 1, 0, 1, 1]]





2
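
In a local alignment the score is the largest value anywhere in the matrix, and the matching substrings are found by tracing back from that cell until a 0 is reached. The following is a minimal sketch of that idea; the function name and the parameter names `match`, `mismatch`, and `gap` are hypothetical, and when several cells share the best score it keeps the first one found.

```python
def smith_waterman_align(x: str, y: str, match=1, mismatch=1, gap=1):
    """Return (best_score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0

    def pair_score(i, j):
        return match if x[i - 1] == y[j - 1] else -mismatch

    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,
                          s[i - 1][j] - gap,
                          s[i][j - 1] - gap,
                          s[i - 1][j - 1] + pair_score(i, j))
            if s[i][j] > best:
                best, best_pos = s[i][j], (i, j)

    # Traceback from the best cell; a cell with value 0 ends the alignment.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + pair_score(i, j):
            top.append(x[i - 1]); bottom.append(y[j - 1])
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j] - gap:
            top.append(x[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman_align("avd", "dave"))  # (2, 'av', 'av')
```

The shared substring "av" is found even though the two strings start with different characters, which is the situation local alignment is meant for.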

Set-Based Measures¶

View strings as sets or multi-sets of tokens. Common methods to generate tokens:

words delimited by space
- e.g. for the string "david smith", the tokens are "david" and "smith"
stem the words if necessary
remove stop words (e.g. the, and, of)
q-grams, substrings of length q
- e.g. for the string "david smith", the set of 3-grams is ##d, #da, dav, avi, ..., h##
- special character # to handle the start and end of the string
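
The two tokenization schemes above can be sketched as follows. The function names, the lowercasing step, and the tiny stop-word list are assumptions made for illustration; stemming is left out. Both functions return lists so that repeated tokens are preserved; wrap the result in `set(...)` when a set is wanted.

```python
def word_tokens(text: str, stop_words=frozenset({"the", "and", "of"})):
    """Split on whitespace, lowercase, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


def qgrams(text: str, q: int = 3, pad: str = "#"):
    """All substrings of length q, after padding both ends with q - 1 pad characters."""
    padded = pad * (q - 1) + text + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


print(word_tokens("david smith"))      # ['david', 'smith']
grams = qgrams("david smith", q=3)
print(grams[:4], "...", grams[-1])     # ['##d', '#da', 'dav', 'avi'] ... h##
```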

Overlap Measure¶

Let $B_x$ = set of tokens generated for string x
Let $B_y$ = set of tokens generated for string y
The overlap measure returns the number of common tokens: $O(x, y) = |B_x \cap B_y|$
E.g., x = dave, y = dav, considering 2-grams:

\[ B_x = \{\#d, da, av, ve, e\#\} \]

\[ B_y = \{\#d, da, av, v\#\} \]

\[ O(x, y) = 3 \]

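def token(word: str, max_token_len=2):
    # pad with '#' so that the first and last characters get their own q-grams
    q = max_token_len
    pad = "#" * (q - 1)
    new_word = f"{pad}{word}{pad}"
    # slide a window of length q across the padded word
    tokens = [new_word[i:i + q] for i in range(len(new_word) - q + 1)]

    return set(tokens)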

def overlap(word1, word2, max_token_len=2):
    set1 = token(word1, max_token_len)
    set2 = token(word2, max_token_len)

    # returns the common tokens themselves; O(x, y) is the size of this set
    return set1.intersection(set2)

overlap("dave", "dav")

{'#d', 'av', 'da'}
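
Since strings can also be viewed as multi-sets of tokens, repeated tokens may be counted as many times as they occur in both strings. The sketch below contrasts the two views using `collections.Counter`, whose `&` operator keeps the minimum count of each token. The helper names are hypothetical and the tokenizer is restricted to 2-grams.

```python
from collections import Counter


def bigrams(word: str):
    """2-grams of the word padded with '#', kept as a list so repeats survive."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def overlap_size(x: str, y: str) -> int:
    """Set view: each distinct common token counts once."""
    return len(set(bigrams(x)) & set(bigrams(y)))


def multiset_overlap_size(x: str, y: str) -> int:
    """Multi-set view: a common token counts min(count in x, count in y) times."""
    common = Counter(bigrams(x)) & Counter(bigrams(y))
    return sum(common.values())


print(overlap_size("dave", "dav"), multiset_overlap_size("dave", "dav"))            # 3 3
print(overlap_size("banana", "ananas"), multiset_overlap_size("banana", "ananas"))  # 2 4
```

The two views agree when no token repeats, as in dave and dav, and differ for strings such as banana, where "an" and "na" each occur twice.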

Jaccard Measure¶

Let $B_x$ = set of tokens generated for string x
Let $B_y$ = set of tokens generated for string y
The Jaccard measure returns the number of common tokens divided by the number of distinct tokens in either set:

\[ J(x, y) = \frac{|B_x \cap B_y|}{|B_x \cup B_y|} \]

E.g., x = dave, y = dav, considering 2-grams:

\[ B_x = \{\#d, da, av, ve, e\#\} \]

\[ B_y = \{\#d, da, av, v\#\} \]

\[ J(x, y) = 3 / 6 = 0.5 \]

def jaccard(word1, word2, max_token_len=2):
    set1 = token(word1, max_token_len)
    set2 = token(word2, max_token_len)
    print(f"Union: {set1.union(set2)}")
    print(f"Intersection: {set1.intersection(set2)}")

    return len(set1.intersection(set2)) / len(set1.union(set2))

jaccard("dave", "dav")

Union: {'#d', 'e#', 've', 'da', 'v#', 'av'}
Intersection: {'#d', 'da', 'av'}





0.5
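
Set-based measures ignore token order, which helps with the "shuffled parts" case listed under String Matching. As a rough sketch, the version below applies the Jaccard measure to word tokens instead of 2-grams. The function names are hypothetical, and lowercasing, stripping punctuation, and returning 1.0 for two empty strings are assumptions for this example only.

```python
import string


def word_set(text: str) -> set:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def jaccard_words(x: str, y: str) -> float:
    bx, by = word_set(x), word_set(y)
    union = bx | by
    if not union:
        return 1.0  # assumption: two empty strings count as identical
    return len(bx & by) / len(union)


a = "Dept. of Computer Science, UST"
b = "Computer Science Dept., UST"
print(jaccard_words(a, b))  # 4 common words out of 5 distinct words: 0.8
```

Only the word "of" separates the two token sets, so the score stays high even though the word order differs.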