Pmi

PMI¶

PMI (pointwise mutual information) is a measure of association used in information theory and statistics.

Given a list of pairs (x, y)

\[pmi(x, y) = log\frac{p(x,y)}{p(x)p(y}\]

where - \(p(x)\): probability of x - \(p(y)\): probability of y -\(p(x,y)\): joint probability

Example: p(x=0) = 0.8, p(x=1)=0.2, p(y=0)=0.25, p(y=1)=0.75

pmi(x=0;y=0) = −1
pmi(x=0;y=1) = 0.222392
pmi(x=1;y=0) = 1.584963
pmi(x=1;y=1) = -1.584963

Example notebook see: note book in class/PMI

!pip install pyspark

Collecting pyspark
[?25l  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
[K     |████████████████████████████████| 204.2MB 70kB/s 
[?25hCollecting py4j==0.10.9
[?25l  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
[K     |████████████████████████████████| 204kB 44.6MB/s 
[?25hBuilding wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=ae3121fc30af19c4ec22b0beb2c7452d103f59dc5ad06c6fa21a5b108cdbf54a
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1

import requests
from pyspark.context import SparkContext

r = requests.get('https://www.cse.ust.hk/msbd5003/data/adj_noun_pairs.txt')
open('adj_noun_pairs.txt', 'wb').write(r.content)
sc = SparkContext.getOrCreate()

# Data file at https://www.cse.ust.hk/msbd5003/data

lines = sc.textFile('adj_noun_pairs.txt')

lines.count()

lines.getNumPartitions()

lines.take(5)

['early radical',
 'french revolution',
 'pejorative way',
 'violent means',
 'positive label']

# Converting lines into word pairs. 
# Data is dirty: some lines have more than 2 words, so filter them out.
pairs = lines.map(lambda l: tuple(l.split())).filter(lambda p: len(p)==2)
pairs.cache()

PythonRDD[4] at RDD at PythonRDD.scala:53

pairs.take(5)

[('early', 'radical'),
 ('french', 'revolution'),
 ('pejorative', 'way'),
 ('violent', 'means'),
 ('positive', 'label')]

N = pairs.count()

# Compute the frequency of each pair.
# Ignore pairs that not frequent enough
pair_freqs = pairs.map(lambda p: (p,1)).reduceByKey(lambda f1, f2: f1 + f2) \
                  .filter(lambda pf: pf[1] >= 100)

pair_freqs.take(5)

[(('political', 'philosophy'), 160),
 (('human', 'society'), 154),
 (('16th', 'century'), 950),
 (('first', 'man'), 166),
 (('same', 'time'), 2744)]

# Computing the frequencies of the adjectives and the nouns
a_freqs = pairs.map(lambda p: (p[0],1)).reduceByKey(lambda x,y: x+y)
n_freqs = pairs.map(lambda p: (p[1],1)).reduceByKey(lambda x,y: x+y)

a_freqs.take(5)

[('violent', 1191),
 ('positive', 2302),
 ('self-defined', 3),
 ('political', 15935),
 ('differ', 381)]

n_freqs.count()

# Broadcasting the adjective and noun frequencies. 
#a_dict = a_freqs.collectAsMap()
#a_dict = sc.parallelize(a_dict).map(lambda x: x)
n_dict = sc.broadcast(n_freqs.collectAsMap())
a_dict = sc.broadcast(a_freqs.collectAsMap())
a_dict.value['violent']

from math import *

# Computing the PMI for a pair.
def pmi_score(pair_freq):
    w1, w2 = pair_freq[0]
    f = pair_freq[1]
    pmi = log(float(f)*N/(a_dict.value[w1]*n_dict.value[w2]), 2)
    return pmi, (w1, w2)

# Computing the PMI for all pairs.
scored_pairs = pair_freqs.map(pmi_score)

# Printing the most strongly associated pairs. 
scored_pairs.top(10)

[(14.41018838546462, ('magna', 'carta')),
 (13.071365888694997, ('polish-lithuanian', 'Commonwealth')),
 (12.990597616733414, ('nitrous', 'oxide')),
 (12.64972604311254, ('latter-day', 'Saints')),
 (12.50658937509916, ('stainless', 'steel')),
 (12.482331020687814, ('pave', 'runway')),
 (12.19140721768055, ('corporal', 'punishment')),
 (12.183248694293388, ('capital', 'punishment')),
 (12.147015483562537, ('rush', 'yard')),
 (12.109945794428935, ('globular', 'cluster'))]

Another way

n_dict = n_freqs.collectAsMap()
a_dict = a_freqs.collectAsMap()

from math import *

# Computing the PMI for a pair.
def pmi_score(pair_freq):
    w1, w2 = pair_freq[0]
    f = pair_freq[1]
    pmi = log(float(f)*N/(a_dict[w1]*n_dict[w2]), 2)
    return pmi, (w1, w2)

scored_pairs = pair_freqs.map(pmi_score)
scored_pairs.top(10)

[(14.41018838546462, ('magna', 'carta')),
 (13.071365888694997, ('polish-lithuanian', 'Commonwealth')),
 (12.990597616733414, ('nitrous', 'oxide')),
 (12.64972604311254, ('latter-day', 'Saints')),
 (12.50658937509916, ('stainless', 'steel')),
 (12.482331020687814, ('pave', 'runway')),
 (12.19140721768055, ('corporal', 'punishment')),
 (12.183248694293388, ('capital', 'punishment')),
 (12.147015483562537, ('rush', 'yard')),
 (12.109945794428935, ('globular', 'cluster'))]