MLlib

Spark's scalable machine learning library

ML Pipelines

  • Inspired by scikit-learn
  • DataFrame

Pipeline components:

  • Transformer
  • Estimator
  • Parameters

Transformers

Converts one DataFrame into another.

Must implement a transform() method.

Examples:

  • Model: DataFrame[id: int, feature_vector: Vector] => DataFrame[id: int, label: string]
  • Feature transformer: DataFrame[id: int, text: string] => DataFrame[id: int, feature_vector: Vector]
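The transformer contract can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: a list of dicts stands in for a DataFrame, and `LengthFeaturizer` is a hypothetical feature transformer invented for this example.

```python
from typing import Any, Dict, List

# Assumption: a list of dicts stands in for a Spark DataFrame.
Row = Dict[str, Any]


class LengthFeaturizer:
    """Hypothetical feature transformer: text -> [character count, word count]."""

    def transform(self, rows: List[Row]) -> List[Row]:
        # Build new rows and leave the input untouched, mirroring the
        # "one DataFrame in, another DataFrame out" contract.
        return [
            {
                "id": r["id"],
                "feature_vector": [
                    float(len(r["text"])),
                    float(len(r["text"].split())),
                ],
            }
            for r in rows
        ]


rows = [{"id": 0, "text": "spark ml"}, {"id": 1, "text": "pipelines"}]
features = LengthFeaturizer().transform(rows)
```

Here the input rows have the shape [id, text] and the output rows have the shape [id, feature_vector], matching the feature transformer example above.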

Estimators

Input: DataFrame

Output: Model

Must implement a method fit()

Example:

  • LogisticRegression is an Estimator.
  • Calling fit() trains a LogisticRegressionModel, which is a Model (and hence also a Transformer).
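The estimator/model relationship can be sketched in plain Python. This is a toy illustration and not logistic regression: `ThresholdEstimator` and `ThresholdModel` are hypothetical names, a list of dicts stands in for a DataFrame, and the "training" simply places a threshold midway between the two class means.

```python
from typing import Any, Dict, List

# Assumption: a list of dicts stands in for a Spark DataFrame.
Row = Dict[str, Any]


class ThresholdModel:
    """Hypothetical model returned by fit(); having transform() makes it a transformer."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, rows: List[Row]) -> List[Row]:
        # Label each row by comparing its feature to the learned threshold.
        return [
            {"id": r["id"], "label": "pos" if r["feature"] > self.threshold else "neg"}
            for r in rows
        ]


class ThresholdEstimator:
    """Hypothetical estimator: fit() takes data and returns a model."""

    def fit(self, rows: List[Row]) -> ThresholdModel:
        pos = [r["feature"] for r in rows if r["label"] == "pos"]
        neg = [r["feature"] for r in rows if r["label"] == "neg"]
        # Toy training rule: midpoint between the two class means.
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return ThresholdModel(threshold)


training = [
    {"id": 0, "feature": 1.0, "label": "neg"},
    {"id": 1, "feature": 2.0, "label": "neg"},
    {"id": 2, "feature": 5.0, "label": "pos"},
    {"id": 3, "feature": 6.0, "label": "pos"},
]
model = ThresholdEstimator().fit(training)
predictions = model.transform(
    [{"id": 10, "feature": 0.5}, {"id": 11, "feature": 4.0}]
)
```

The estimator itself never produces predictions. It only produces the model, and the model is then used like any other transformer.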

Parameters

Both transformers and estimators can have parameters.

Two ways to set parameters:

  • Set them on an instance: lr = LogisticRegression(); lr.setMaxIter(10)
  • Pass a ParamMap to fit() or transform(). A ParamMap is a set of (parameter, value) pairs.
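How the two approaches interact can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's API: `ToyEstimator` is hypothetical, a plain dict stands in for a ParamMap, the default values are arbitrary placeholders, and fit() returns the effective parameters in place of a trained model. It assumes the behavior described in Spark's documentation, where values in a ParamMap passed at call time take precedence over values set earlier on the instance.

```python
from typing import Any, Dict, List, Optional


class ToyEstimator:
    """Hypothetical estimator that only tracks its parameters."""

    def __init__(self) -> None:
        # Placeholder defaults chosen for this sketch.
        self._params: Dict[str, Any] = {"maxIter": 100, "regParam": 0.0}

    def setMaxIter(self, value: int) -> "ToyEstimator":
        # Setter style: mutate the instance and return it for chaining.
        self._params["maxIter"] = value
        return self

    def fit(
        self, data: List[Any], params: Optional[Dict[str, Any]] = None
    ) -> Dict[str, Any]:
        # Call-time (parameter, value) pairs win over instance values;
        # the instance's own settings are left unchanged.
        return {**self._params, **(params or {})}


lr = ToyEstimator().setMaxIter(10)
default_run = lr.fit([])
override_run = lr.fit([], {"maxIter": 30, "regParam": 0.1})
```

In this sketch, `default_run` uses the value set on the instance, while `override_run` uses the values passed at call time, and a later call without a map falls back to the instance settings again.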