MLlib¶

Spark's scalable machine learning library

ML Pipelines¶

Pipeline components:

- Transformer
- Estimator
- Parameters

A Transformer converts one DataFrame into another. It must implement a method transform().

Examples:

- Model: DataFrame[id: int, feature_vector: Vector] => DataFrame[id: int, label: string]
- Feature transformer: DataFrame[id: int, text: string] => DataFrame[id: int, feature_vector: Vector]
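The feature-transformer shape above can be sketched in plain Python. This is only a toy stand-in, not Spark's API: a list of row dicts plays the role of a DataFrame, and the class name ToyWordCountTransformer and its vocabulary argument are hypothetical names chosen for illustration.

```python
# Toy stand-in for a DataFrame: a list of row dicts. This is NOT Spark's API.
class ToyWordCountTransformer:
    """Maps rows {id, text} to rows {id, feature_vector}."""

    def __init__(self, vocabulary):
        # Hypothetical parameter: the words to count, in a fixed order.
        self.vocabulary = list(vocabulary)

    def transform(self, rows):
        out = []
        for row in rows:
            words = row["text"].lower().split()
            # One count per vocabulary word, in vocabulary order.
            vector = [words.count(word) for word in self.vocabulary]
            out.append({"id": row["id"], "feature_vector": vector})
        return out


rows = [
    {"id": 0, "text": "spark spark hadoop"},
    {"id": 1, "text": "hadoop mapreduce"},
]
features = ToyWordCountTransformer(["spark", "hadoop"]).transform(rows)
print(features)
```

The input rows are left unchanged and a new collection is returned, which mirrors the idea that transform() produces a new DataFrame rather than modifying the one it receives.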

An Estimator takes a DataFrame as input and produces a Model as output. It must implement a method fit().

Example:

- LogisticRegression is an Estimator.
- Calling fit() trains a LogisticRegressionModel, which is a Model (hence also a Transformer).
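The fit() then transform() relationship can be sketched with a pair of toy classes in plain Python. These are not Spark classes: ToyThresholdEstimator and ToyThresholdModel are hypothetical names, rows are plain dicts, and the "training" is just taking the mean of one numeric column.

```python
class ToyThresholdModel:
    """A fitted model. It has transform(), so it also acts as a transformer."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        # Label each row by comparing its feature to the learned threshold.
        return [
            {"id": r["id"],
             "label": "high" if r["feature"] > self.threshold else "low"}
            for r in rows
        ]


class ToyThresholdEstimator:
    """An estimator: fit() learns from data and returns a model."""

    def fit(self, rows):
        values = [r["feature"] for r in rows]
        # The learned state is the mean of the training feature values.
        return ToyThresholdModel(sum(values) / len(values))


training = [{"id": 0, "feature": 1.0}, {"id": 1, "feature": 3.0}]
model = ToyThresholdEstimator().fit(training)
predictions = model.transform(
    [{"id": 2, "feature": 5.0}, {"id": 3, "feature": 0.5}]
)
print(model.threshold, predictions)
```

The estimator itself holds no learned state. Everything learned during fit() lives in the returned model, which is why the model can be applied to new data on its own.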

Both Transformers and Estimators can have parameters.

Set parameters:

- Call a setter on the instance:

      lr = LogisticRegression()
      lr.setMaxIter(10)

- Pass a ParamMap to fit() or transform(). A ParamMap is a set of (parameter, value) pairs.
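How the two ways of setting parameters interact can be sketched with a toy estimator in plain Python. This is an illustration under assumptions, not Spark's implementation: ToyEstimator and ToyModel are hypothetical, the default value is arbitrary, and the map uses string keys, whereas in PySpark the keys are Param objects such as lr.maxIter. The sketch follows the documented Spark behaviour that values in a ParamMap take precedence over values set earlier through setters.

```python
class ToyModel:
    """Records the parameter values that were in effect during fit()."""

    def __init__(self, params):
        self.params = params


class ToyEstimator:
    def __init__(self):
        # Arbitrary default for the toy; not taken from any Spark class.
        self._params = {"maxIter": 100}

    def setMaxIter(self, value):
        self._params["maxIter"] = value
        return self

    def getMaxIter(self):
        return self._params["maxIter"]

    def fit(self, rows, params=None):
        # Start from the instance's values, then let the map override them
        # for this call only. The instance itself is not modified.
        effective = dict(self._params)
        if params:
            effective.update(params)
        return ToyModel(effective)


rows = [{"id": 0, "feature": 1.0}]

lr = ToyEstimator()
lr.setMaxIter(10)

model_a = lr.fit(rows)                   # uses the setter value
model_b = lr.fit(rows, {"maxIter": 20})  # the map overrides it for this call

print(model_a.params, model_b.params, lr.getMaxIter())
```

Passing a map is convenient when the same estimator instance should be fitted several times with different settings, since each call leaves the instance's own values untouched.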