Lecture 3
!pip install pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1
Example: Linear-time selection
Problem:
- Input: an unordered array A of n numbers, and an index k
- Output: the k-th smallest number in A (counting from 0)
Algorithm
- Let \(x = A[0]\) and partition A so that \(A[0..mid-1] < A[mid] = x < A[mid+1..n-1]\).
- If \(mid = k\), return \(x\).
- If \(k < mid\), set \(A = A[0..mid-1]\). If \(k > mid\), set \(A = A[mid+1..n-1]\) and \(k = k - mid - 1\).
- Repeat from the first step.
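The steps above can be sketched in plain Python as follows. This is a minimal sketch, not a tuned implementation: the function name `select` is an assumption, the partition builds new lists instead of rearranging A in place, and elements equal to the pivot are sent to the right-hand side so that duplicates are handled. With the first element as pivot, the running time is linear on average for randomly ordered input but quadratic in the worst case (for example, on already sorted input).

```python
def select(A, k):
    """Return the k-th smallest element of A, counting from 0."""
    if not 0 <= k < len(A):
        raise IndexError("k must satisfy 0 <= k < len(A)")
    A = list(A)  # work on a copy so the caller's list is untouched
    while True:
        x = A[0]  # pivot: the first element, as in the steps above
        smaller = [a for a in A[1:] if a < x]   # plays the role of A[0..mid-1]
        larger = [a for a in A[1:] if a >= x]   # plays the role of A[mid+1..n-1]
        mid = len(smaller)  # position the pivot would take after partitioning
        if k == mid:
            return x
        if k < mid:
            A = smaller
        else:
            A = larger
            k = k - mid - 1  # skip the smaller elements and the pivot itself


print(select([7, 2, 9, 4, 1], 2))  # 4
```

Each pass discards the pivot and one side of the partition, so the loop always terminates.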
Key-value Pairs
While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. In Python, these operations work on RDDs containing built-in Python tuples such as (1, 2). Simply create such tuples and then call your desired operation. For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
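from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # reuse the active context, or start one
lines = sc.textFile("README.md")
pairs = lines.map(lambda s: (s, 1))  # key each line with a count of 1
counts = pairs.reduceByKey(lambda a, b: a + b)  # sum the counts per distinct line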
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as a list of objects.
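To see what these operations compute without starting Spark, the following plain-Python sketch emulates them on a small in-memory list. It is an illustration only, not Spark's API or implementation: `reduce_by_key` is a hypothetical helper, the sample lines are made up, and nothing here is distributed or lazily evaluated.

```python
def reduce_by_key(pairs, func):
    """Combine the values of equal keys with func (emulates RDD.reduceByKey)."""
    result = {}
    for key, value in pairs:
        if key in result:
            result[key] = func(result[key], value)
        else:
            result[key] = value
    return result


# Made-up stand-in for the lines of a text file.
lines = ["spark", "rdd", "spark", "pair", "rdd", "spark"]

pairs = [(s, 1) for s in lines]                     # like lines.map(lambda s: (s, 1))
counts = reduce_by_key(pairs, lambda a, b: a + b)   # like pairs.reduceByKey(...)
sorted_counts = sorted(counts.items())              # like sortByKey() then collect()

print(sorted_counts)  # [('pair', 1), ('rdd', 2), ('spark', 3)]
```

Spark also combines values within each partition before merging across partitions, so the function passed to reduceByKey should be associative and commutative, as addition is here.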
PMI
PMI (pointwise mutual information) is a measure of association used in information theory and statistics.
Given a list of pairs (x, y), the PMI of an outcome pair is
\(pmi(x;y) = \log_2 \frac{p(x,y)}{p(x)\,p(y)}\)
where
- \(p(x)\): probability of x
- \(p(y)\): probability of y
- \(p(x,y)\): joint probability of x and y
Example: p(x=0) = 0.8, p(x=1) = 0.2, p(y=0) = 0.25, p(y=1) = 0.75
- pmi(x=0;y=0) = -1
- pmi(x=0;y=1) = 0.222392
- pmi(x=1;y=0) = 1.584963
- pmi(x=1;y=1) = -1.584963
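The example values can be checked with a few lines of Python. The joint probabilities are not stated in the text, so the table below is an assumption: it is the one joint distribution that agrees with both the listed marginals and the listed PMI values, taking the logarithm to base 2. The names `joint`, `p_x`, `p_y`, and `pmi` are chosen here for illustration.

```python
import math

# Assumed joint distribution p(x, y), inferred from the marginals and
# PMI values listed above (it is not given explicitly in the notes).
joint = {(0, 0): 0.10, (0, 1): 0.70, (1, 0): 0.15, (1, 1): 0.05}

# Marginals: sum the joint probabilities over the other variable.
p_x = {}
p_y = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p


def pmi(x, y):
    """Pointwise mutual information in bits: log2 of p(x,y) / (p(x) p(y))."""
    return math.log2(joint[(x, y)] / (p_x[x] * p_y[y]))


for x, y in sorted(joint):
    print(f"pmi(x={x};y={y}) = {pmi(x, y):.6f}")
```

A positive PMI means the pair occurs more often than it would if x and y were independent, and a negative PMI means it occurs less often.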
For an example, see the notebook in class/PMI.