Lecture 3

!pip install pyspark

Successfully installed py4j-0.10.9 pyspark-3.0.1

Example: Linear-time selection

Problem:

- Input: an array A of n numbers (unordered) and an index k
- Output: the k-th smallest number in A (counting from 0)

Algorithm

1. \(x = A[0]\)
2. Partition A into \(A[0..mid-1] < A[mid] = x < A[mid+1..n-1]\)
3. If \(mid = k\) then return \(x\)
4. If \(k < mid\) then \(A = A[0..mid-1]\); if \(k > mid\) then \(A = A[mid+1..n-1]\), \(k = k - mid - 1\)
5. Go to step 1
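The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the function name `select` is chosen here for the example, the partition is done with list comprehensions rather than in place, and elements equal to the pivot are placed in the upper part so that duplicates are handled.

```python
def select(A, k):
    """Return the k-th smallest element of A, counting from 0."""
    A = list(A)  # work on a copy so the caller's list is untouched
    if not 0 <= k < len(A):
        raise IndexError("k must satisfy 0 <= k < len(A)")
    while True:
        x = A[0]                                   # step 1: pick the pivot
        smaller = [a for a in A[1:] if a < x]      # step 2: partition around x
        larger = [a for a in A[1:] if a >= x]
        mid = len(smaller)                         # position of x after partitioning
        if k == mid:                               # step 3: the pivot is the answer
            return x
        if k < mid:                                # step 4: keep only the relevant part
            A = smaller
        else:
            A = larger
            k = k - mid - 1
        # step 5: loop back to step 1


data = [5, 2, 9, 1, 7]
print(select(data, 0), select(data, 2), select(data, 4))
```

With \(A[0]\) as the pivot, the running time is linear on average for randomly ordered input; an already sorted input is the worst case and takes quadratic time.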

Key-value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. In Python, these operations work on RDDs containing built-in Python tuples such as (1, 2). Simply create such tuples and then call your desired operation. For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:

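from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuse the running context, or start one
lines = sc.textFile("README.md")
pairs = lines.map(lambda s: (s, 1))  # one (line, 1) pair per line of text
counts = pairs.reduceByKey(lambda a, b: a + b)  # sum the 1s for each distinct line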

We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as a list of objects.
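To see what these operations compute without starting Spark, here is a plain-Python sketch of the same pipeline on a small in-memory list. The helper `reduce_by_key` and the sample lines are made up for this illustration; it only mimics the result of reduceByKey, not Spark's distributed execution.

```python
def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like reduceByKey does per key."""
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical stand-in for the lines of a text file.
lines = ["spark", "rdd", "spark", "pairs", "rdd", "spark"]

pairs = [(s, 1) for s in lines]                      # like lines.map(lambda s: (s, 1))
counts = reduce_by_key(pairs, lambda a, b: a + b)    # like pairs.reduceByKey(...)
sorted_counts = sorted(counts.items())               # like sortByKey() then collect()

print(sorted_counts)  # [('pairs', 1), ('rdd', 2), ('spark', 3)]
```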

PMI

PMI (pointwise mutual information) is a measure of association used in information theory and statistics.

Given a list of pairs (x, y)

\[pmi(x, y) = \log\frac{p(x,y)}{p(x)p(y)}\]

where
- \(p(x)\): probability of x
- \(p(y)\): probability of y
- \(p(x,y)\): joint probability of x and y

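Example (using the base-2 logarithm): p(x=0) = 0.8, p(x=1) = 0.2, p(y=0) = 0.25, p(y=1) = 0.75, with joint probabilities p(x=0,y=0) = 0.1, p(x=0,y=1) = 0.7, p(x=1,y=0) = 0.15, p(x=1,y=1) = 0.05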

pmi(x=0;y=0) = -1
pmi(x=0;y=1) = 0.222392
pmi(x=1;y=0) = 1.584963
pmi(x=1;y=1) = -1.584963
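The values above can be checked with a few lines of Python. This is a sketch under two assumptions: the logarithm is base 2, and the joint probabilities are 0.1, 0.7, 0.15 and 0.05, which are the values consistent with the marginals and PMI figures listed above. The function name `pmi_table` is chosen for this illustration.

```python
import math

# Assumed joint distribution p(x, y), consistent with the marginals above.
joint = {(0, 0): 0.10, (0, 1): 0.70, (1, 0): 0.15, (1, 1): 0.05}


def pmi_table(joint):
    """Compute base-2 PMI for every (x, y) pair with non-zero joint probability."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal p(x): sum over y
        py[y] = py.get(y, 0.0) + p   # marginal p(y): sum over x
    return {
        (x, y): math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    }


pmi = pmi_table(joint)
for (x, y), value in sorted(pmi.items()):
    print(f"pmi(x={x};y={y}) = {value:.6f}")
```

A positive PMI means the pair occurs together more often than it would if x and y were independent; a negative PMI means it occurs less often.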

For an example notebook, see the notebook in class/PMI.