Similarity and Dissimilarity

Similarity is the numerical measure of how alike two data objects are.

Similarity matters because it is the basic component of many data processing techniques, such as:

  • data integration
  • data mining: classification, clustering, recommendation, anomaly detection

Dissimilarity is the numerical measure of how different two data objects are.

The term distance is frequently used as a synonym for dissimilarity.

Attribute Types

  • Nominal (Categorical)
  • Ordinal
  • Interval or Ratio
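
Each attribute type calls for a different dissimilarity. The sketch below uses the common textbook definitions, which are an assumption here since the text does not spell them out: 0/1 mismatch for nominal values, normalized rank difference for ordinal values, and absolute difference for interval or ratio values. The function names and the `levels` parameter are illustrative.

```python
def nominal_dissimilarity(a, b) -> int:
    # nominal values can only be compared for equality
    return 0 if a == b else 1


def ordinal_dissimilarity(a, b, levels) -> float:
    # map each level to its rank, then normalize the rank gap to [0, 1]
    rank = {level: i for i, level in enumerate(levels)}
    return abs(rank[a] - rank[b]) / (len(levels) - 1)


def interval_dissimilarity(a: float, b: float) -> float:
    # interval and ratio values support subtraction directly
    return abs(a - b)


print(nominal_dissimilarity("red", "blue"))                                # 1
print(ordinal_dissimilarity("low", "medium", ["low", "medium", "high"]))  # 0.5
print(interval_dissimilarity(20.5, 18.0))                                  # 2.5
```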

String Matching

Strings that refer to the same entity often look quite different:

  • Typing and OCR errors: David Smith vs. Davod Smith
  • Different formatting conventions: 10/8 vs. Oct 8
  • Custom abbreviation, shortening, or omission: Daniel Walker Herbert Smith vs. Dan W. Smith
  • Different names or nicknames: William Smith vs. Bill Smith
  • Shuffling parts of strings: Dept. of Computer Science, UST vs. Computer Science Dept., UST

namea = "Dave Smith"
nameb = 'David D. Smith'

Edit Distance

\[d(i, j) = \min \begin{cases} d(i-1, j) + 1 \\ d(i, j-1) + 1 \\ d(i-1, j-1) + 1 & \text{if } x_i \neq y_j \\ d(i-1, j-1) & \text{if } x_i = y_j \end{cases}\]
from pprint import pprint

def minDistance(word1: str, word2: str) -> int:
    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    min_edit_dist = [[0 for _ in range(w)] for _ in range(h)]

    # initialization for top row
    for x in range(1, w):
        min_edit_dist[0][x] = x

    # initialization for left-most column
    for y in range(1, h):
        min_edit_dist[y][0] = y

    # compute minimum edit distance with optimal substructure
    for y in range(1, h):
        for x in range(1, w):
            if word_1[y] == word_2[x]:
                # current characters match, no need to edit
                min_edit_dist[y][x] = min_edit_dist[y-1][x-1]
            else:
                # mismatch: take the cheapest of insertion, deletion, or replacement
                min_edit_dist[y][x] = min(min_edit_dist[y][x-1],
                                          min_edit_dist[y-1][x],
                                          min_edit_dist[y-1][x-1]) + 1

    pprint(min_edit_dist)
    return min_edit_dist[-1][-1]

minDistance(namea, nameb)
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
 [1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13],
 [2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
 [3, 2, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [4, 3, 2, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
 [5, 4, 3, 2, 2, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10],
 [6, 5, 4, 3, 3, 3, 3, 3, 4, 5, 5, 6, 7, 8, 9],
 [7, 6, 5, 4, 4, 4, 4, 4, 4, 5, 6, 5, 6, 7, 8],
 [8, 7, 6, 5, 4, 5, 5, 5, 5, 5, 6, 6, 5, 6, 7],
 [9, 8, 7, 6, 5, 5, 6, 6, 6, 6, 6, 7, 6, 5, 6],
 [10, 9, 8, 7, 6, 6, 6, 7, 7, 7, 7, 7, 7, 6, 5]]





5
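
A raw edit distance is hard to compare across string pairs of different lengths. One common way to turn it into a similarity in [0, 1] is to divide by the longer string's length. This normalization is an assumption, not something the text defines, and the names `edit_distance` and `edit_similarity` are illustrative. The sketch recomputes the distance with two rows instead of the full matrix.

```python
def edit_distance(x: str, y: str) -> int:
    # prev holds row i-1 of the DP table, curr holds row i
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete cx
                            curr[j - 1] + 1,      # insert cy
                            prev[j - 1] + cost))  # match or replace
        prev = curr
    return prev[-1]


def edit_similarity(x: str, y: str) -> float:
    # assumed normalization: 1 - d(x, y) / max(|x|, |y|)
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(x, y) / longest


print(edit_distance("Dave Smith", "David D. Smith"))    # 5
print(edit_distance("David Smith", "Davod Smith"))      # 1
print(round(edit_similarity("Dave Smith", "David D. Smith"), 3))
```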

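Needleman-Wunsch Measure

Initialize a matrix s of size (n + 1) × (m + 1), where s(a, b) is the element at the a-th row and b-th column.

Fill the first column and first row: \(s(i,0) = -i \cdot c_g\), \(s(0,j) = -j \cdot c_g\)

\[s(i, j) = \max \begin{cases} s(i-1, j) - c_g \\ s(i, j-1) - c_g \\ s(i-1, j-1) + c(x_i, y_j) \end{cases}\]

where \(c_g\) is the gap penalty and \(c(x_i, y_j)\) is the score for aligning character \(x_i\) with \(y_j\).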
from pprint import pprint

def find_match(word1: str, word2: str, cg, c, cm):
    if word1 == word2:
        return c
    elif word1 == ' ' or word2 == ' ':
        return -cg

    else:
        return -cm

def needle_man(word1: str, word2: str, cg=1, c=1, cm=1) -> int:
    """
    cg: gap_penalty
    c: match award
    cm: mismatch penalty
    """

    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    s = [ [ 0 for _ in range (w ) ] for _ in range(h ) ]


    # initialization for top row
    for j in range(1, w ):
        s[0][j] = -j * cg

    # initialization for left-most column
    for i in range(1, h ):
        s[i][0] = -i * cg



    for i in range(1, h ):
        for j in range(1, w ):
            s[i][j] = max(s[i - 1][j] - cg, s[i][j - 1] - cg, s[i - 1][j - 1] + find_match(word_1[i], word_2[j], cg, c,cm))



    pprint(s)
    return s[-1][-1]
needle_man("dva", "deeve", c=2)
[[0, -1, -2, -3, -4, -5],
 [-1, 2, 1, 0, -1, -2],
 [-2, 1, 1, 0, 2, 1],
 [-3, 0, 0, 0, 1, 1]]





1
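
The score alone does not show which characters were aligned. A traceback from the bottom-right cell recovers one optimal alignment. The sketch below is self-contained and uses the same scoring scheme as above. The function name `needleman_wunsch_align` and the `-` gap symbol are illustrative choices, and when several alignments tie, this sketch returns only one of them.

```python
def needleman_wunsch_align(x: str, y: str, c=1, cm=1, cg=1):
    n, m = len(x), len(y)

    def score(a, b):
        # match reward c, mismatch penalty cm
        return c if a == b else -cm

    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -i * cg
    for j in range(1, m + 1):
        s[0][j] = -j * cg
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] - cg,
                          s[i][j - 1] - cg,
                          s[i - 1][j - 1] + score(x[i - 1], y[j - 1]))

    # walk back from the last cell, re-checking which case produced each value
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] - cg:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return s[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch_align("dva", "deeve", c=2))  # (1, 'd--va', 'deeve')
```

Reading the two aligned strings column by column gives 2 − 1 − 1 + 2 − 1 = 1, which matches the score in the matrix above.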

Affine gap measure

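Define \(x = x_1x_2 \ldots x_n\) and \(y = y_1y_2 \ldots y_m\), where \(x_i\) and \(y_j\) are the i-th and j-th characters of x and y. Three matrices are filled in together: \(M(i, j)\) is the best score when \(x_i\) is aligned with \(y_j\), \(l_x(i, j)\) is the best score when \(x_i\) is aligned with a gap, and \(l_y(i, j)\) is the best score when \(y_j\) is aligned with a gap.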

Initialization:

  • \(M(0, 0) = 0, l_x(0,0)=-c_o, l_y(0,0)=-c_o\)
  • \(l_x(i,0)=-c_o - c_r \cdot (i - 1)\)
  • \(l_y(0, j) = -c_o - c_r \cdot (j - 1)\)
  • Other cells in top row and leftmost column = \(-\infty\)

Recurrence:

\[M(i, j) = \max \begin{cases} M(i-1, j-1) + c(x_i, y_j) \\ l_x(i-1, j-1) + c(x_i, y_j) \\ l_y(i-1, j-1) + c(x_i, y_j) \end{cases} \]
\[ l_x(i, j) = \max \begin{cases} M(i-1, j) - c_o \\ l_x(i-1, j) - c_r \end{cases} \]
\[ l_y(i, j) = \max \begin{cases} M(i, j-1) - c_o \\ l_y(i, j-1) - c_r \end{cases} \]

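where \(c_o\) is the cost of opening a gap, \(c_r\) is the cost of continuing a gap, and \(c(x_i, y_j)\) is the score for aligning character \(x_i\) with \(y_j\) in the score matrix.

Score: \(\max(M(n, m), l_x(n, m), l_y(n, m))\), which the code below computes as max(m[-1][-1], i_x[-1][-1], i_y[-1][-1]).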
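
To see why the affine scheme differs from a single per-character gap penalty, it helps to compare the total cost of a gap of length k. Following the initialization above, an affine gap costs \(c_o + (k - 1) \cdot c_r\), while a linear gap costs \(k \cdot c_g\). The helper names below are illustrative, and the default costs are the ones used in the example call in this section.

```python
def linear_gap_cost(length: int, cg: int = 1) -> int:
    # every gap character costs the same
    return length * cg


def affine_gap_cost(length: int, co: int = 4, cr: int = 1) -> int:
    # the first gap character pays the opening cost, the rest pay the continuation cost
    if length == 0:
        return 0
    return co + (length - 1) * cr


for k in range(1, 5):
    print(k, linear_gap_cost(k), affine_gap_cost(k))
# 1 1 4
# 2 2 5
# 3 3 6
# 4 4 7
```

With a large opening cost and a small continuation cost, one long gap is cheaper than several short ones, which suits cases such as an omitted middle name.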

from math import inf
from pprint import pprint

def find_match(word1, word2, reward, penalty):
    if word1 == word2:
        return reward
    else:
        return -penalty


def affine_gap(word1: str, word2: str, co=1, cr=1,  cg=1, c=1, cm=1) -> int:
    """
    co: cost of opening a gap
    cr: cost of continuing the gap
    cg: gap penalty (unused in the affine model; kept for signature compatibility)
    c: match award
    cm: mismatch penalty
    """

    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    m = [ [ -inf for _ in range (w ) ] for _ in range(h ) ]
    i_x = [ [ -inf for _ in range (w ) ] for _ in range(h ) ]
    i_y = [ [ -inf for _ in range (w ) ] for _ in range(h ) ]

    m[0][0] = 0
    i_x[0][0] = -co
    i_y[0][0] = -co


    # initialization for top row
    for j in range(1, w ):
        i_y[0][j] = -co - cr * (j - 1)

    # initialization for left-most column
    for i in range(1, h ):
        i_x[i][0] = -co - cr * (i - 1)



    for i in range(1, h ):
        for j in range(1, w ):
            match_reward = find_match(word_1[i], word_2[j], reward=c, penalty=cm)
            m[i][j] = max(m[i-1][j-1] + match_reward, i_x[i - 1][j - 1] + match_reward, i_y[i - 1][j - 1] + match_reward)
            i_x[i][j] = max(m[i - 1][j] - co, i_x[i - 1][j] - cr)
            i_y[i][j] = max(m[i][j - 1] - co, i_y[i][j - 1] - cr)


    print("m: ")
    pprint(m)
    print("i_x: ")
    pprint(i_x)
    print("i_y: ")
    pprint(i_y)
    return max(m[-1][-1], i_x[-1][-1], i_y[-1][-1])
affine_gap("AAT", "ACACT", c=1, co=4, cr=1)
m: 
[[0, -inf, -inf, -inf, -inf, -inf],
 [-inf, 1, -5, -4, -7, -8],
 [-inf, -3, 0, -2, -5, -6],
 [-inf, -6, -4, -1, -3, -4]]
i_x: 
[[-4, -inf, -inf, -inf, -inf, -inf],
 [-4, -inf, -inf, -inf, -inf, -inf],
 [-5, -3, -9, -8, -11, -12],
 [-6, -4, -4, -6, -9, -10]]
i_y: 
[[-4, -4, -5, -6, -7, -8],
 [-inf, -inf, -3, -4, -5, -6],
 [-inf, -inf, -7, -4, -5, -6],
 [-inf, -inf, -10, -8, -5, -6]]





-4

Smith-Waterman Measure

Initialization:

  • Initialize a matrix s of size (n + 1) × (m + 1), where s(a, b) is the element at the a-th row and b-th column.
  • Fill the first column and first row with zeros: \(s(i, 0) = 0\), \(s(0, j) = 0\)

\[s(i, j) = \max \begin{cases} 0 \\ s(i-1, j) - c_g \\ s(i, j-1) - c_g \\ s(i-1, j-1) + c(x_i, y_j) \end{cases}\]
def smith_waterman(word1: str, word2: str, cm=1, c=1, cg=1) -> int:
    """
    cg: gap_penalty
    c: match award
    cm: mismatch penalty
    """

    # padding one whitespace for empty string representation
    word_1 = ' ' + word1
    word_2 = ' ' + word2

    h, w = len(word_1), len(word_2)

    s = [ [ 0 for _ in range (w ) ] for _ in range(h ) ]


    # initialization for top row
    for j in range(1, w ):
        s[0][j] = 0

    # initialization for left-most column
    for i in range(1, h ):
        s[i][0] = 0


    for i in range(1, h ):
        for j in range(1, w ):
            match_reward = find_match(word_1[i], word_2[j], reward=c, penalty=cm)
            s[i][j] = max(0, s[i - 1][j] - cg, s[i][j - 1] - cg, s[i - 1][j - 1] + match_reward)

    pprint(s)
    # the local-alignment score is the largest cell, not necessarily the last one
    return max(max(row) for row in s)
smith_waterman("avd", "dave")
[[0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [0, 0, 0, 2, 1],
 [0, 1, 0, 1, 1]]

2
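
Because Smith-Waterman is a local measure, it is often useful to know which substrings produced the best score. The sketch below tracks the highest-scoring cell and traces back until it reaches a zero. The function name `smith_waterman_best` is illustrative, and when several cells tie for the maximum, this sketch reports the first one found.

```python
def smith_waterman_best(x: str, y: str, c=1, cm=1, cg=1):
    n, m = len(x), len(y)

    def score(a, b):
        # match reward c, mismatch penalty cm
        return c if a == b else -cm

    s = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(0,
                          s[i - 1][j] - cg,
                          s[i][j - 1] - cg,
                          s[i - 1][j - 1] + score(x[i - 1], y[j - 1]))
            if s[i][j] > best:
                best, best_i, best_j = s[i][j], i, j

    # trace back from the best cell until the local alignment starts (score 0)
    i, j = best_i, best_j
    while i > 0 and j > 0 and s[i][j] > 0:
        if s[i][j] == s[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - cg:
            i -= 1
        else:
            j -= 1
    return best, x[i:best_i], y[j:best_j]


print(smith_waterman_best("avd", "dave"))  # (2, 'av', 'av')
```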

Set-Based Measures

View strings as sets or multi-sets of tokens.

Common methods to generate tokens:

  • words delimited by space
    • e.g. for the string "david smith", the tokens are "david" and "smith"
  • stem the words if necessary
  • remove stop words (e.g. the, and, of)
  • q-grams, substrings of length q
    • e.g. for the string "david smith", the set of 3-grams is ##d, #da, dav, avi, ..., h##
    • the special character # handles the start and end of the string
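
The tokenization methods above can be sketched as two small helpers. The stop-word list contains only the three words named in the text, stemming is left out, and the names `word_tokens` and `qgrams` are illustrative. Padding with q − 1 copies of `#` on each side is an assumption inferred from the 3-gram example.

```python
STOP_WORDS = {"the", "and", "of"}  # only the examples from the text, not a full list


def word_tokens(text: str, stop_words=STOP_WORDS) -> list:
    # split on whitespace, lowercase, and drop stop words
    return [w for w in text.lower().split() if w not in stop_words]


def qgrams(text: str, q: int = 3, pad: str = "#") -> list:
    # pad both ends so the first and last characters appear in q different grams
    padded = pad * (q - 1) + text + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


print(word_tokens("david smith"))                     # ['david', 'smith']
print(word_tokens("department of computer science"))  # ['department', 'computer', 'science']
grams = qgrams("david smith", q=3)
print(grams[:4], grams[-1])                           # ['##d', '#da', 'dav', 'avi'] h##
```

Both helpers return lists so that repeated tokens are preserved. Wrap the result in `set(...)` for set-based measures or `collections.Counter(...)` for multi-sets.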

Overlap Measure

  • Let \(B_x\) = set of tokens generated for string x

  • Let \(B_y\) = set of tokens generated for string y

  • Returns the number of common tokens: \(O(x, y) = |B_x \cap B_y|\)

  • E.g., x = dave, y = dav, considering 2-grams:

\[B_x = \{\#d, da, av, ve, e\#\}\]
\[B_y = \{\#d, da, av, v\#\}\]
\[O(x, y) = 3\]
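def token(word: str, max_token_len=2):
    # pad with q - 1 '#' characters so the start and end of the string get their own q-grams
    pad = "#" * (max_token_len - 1)
    new_word = f"{pad}{word}{pad}"
    # slide a window of length max_token_len over the padded string
    return {new_word[i:i + max_token_len]
            for i in range(len(new_word) - max_token_len + 1)}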

def overlap(word1, word2, max_token_len=2):
    set1 = token(word1, max_token_len)
    set2 = token(word2, max_token_len)

    # O(x, y) is the size of the intersection, not the intersection itself
    return len(set1.intersection(set2))
overlap("dave", "dav")
3

Jaccard Measure

  • Let \(B_x\) = set of tokens generated for string x

  • Let \(B_y\) = set of tokens generated for string y

  • Returns the fraction of all tokens that the two strings share:

\[J(x, y) = \frac{|B_x \cap B_y|}{|B_x \cup B_y|}\]

E.g., x = dave, y = dav, considering 2-grams:

\(B_x = \{\#d, da, av, ve, e\#\}\),

\(B_y = \{\#d, da, av, v\#\}\)

\(J(x, y) = 3 / 6 = 0.5\)

def jaccard(word1, word2, max_token_len=2):
    set1 = token(word1, max_token_len)
    set2 = token(word2, max_token_len)
    print(f"Union: {set1.union(set2)}")
    print(f"Intersection: {set1.intersection(set2)}")

    return len(set1.intersection(set2)) / len(set1.union(set2))
jaccard("dave", "dav")
Union: {'#d', 'e#', 've', 'da', 'v#', 'av'}
Intersection: {'#d', 'da', 'av'}





0.5
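
Set-based measures ignore token order, which makes them a natural fit for the "shuffled parts" case from the String Matching list. The sketch below applies Jaccard to word tokens instead of 2-grams. The regex-based tokenizer, which lowercases and strips punctuation, and the names `word_set` and `jaccard_words` are assumptions made for illustration.

```python
import re


def word_set(text: str) -> set:
    # lowercase and keep only alphanumeric runs, so "Dept." and "Dept.," both give "dept"
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_words(x: str, y: str) -> float:
    bx, by = word_set(x), word_set(y)
    union = bx | by
    if not union:
        # two empty strings are treated as identical
        return 1.0
    return len(bx & by) / len(union)


a = "Dept. of Computer Science, UST"
b = "Computer Science Dept., UST"
print(jaccard_words(a, b))  # 0.8
```

The two strings share 4 of 5 distinct words, so the score is 0.8 even though the word order differs. The value 1 − J(x, y) can be used as the corresponding dissimilarity.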