MLlib¶
Spark's scalable machine learning library
ML Pipelines¶
- Inspired by scikit-learn
- Operates on DataFrames
- Pipeline components:
  - Transformer
  - Estimator
- Parameters
Transformers¶
Converts one DataFrame into another
Must implement a method transform()
Examples:
- Model: DataFrame[id: int, feature_vector: Vector] => DataFrame[id: int, label: string]
- Feature transformer: DataFrame[id: int, text: string] => DataFrame[id: int, feature_vector: Vector]
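The feature-transformer signature above can be sketched without a Spark installation. This is a toy illustration only: a list of dicts stands in for a DataFrame, and `WhitespaceFeaturizer`, `Row`, `Frame`, and `num_features` are hypothetical names, not part of pyspark.ml. It only shows the contract: transform() takes one table and returns a new one with a different schema.

```python
from typing import Dict, List

Row = Dict[str, object]   # toy stand-in for one DataFrame row
Frame = List[Row]         # toy stand-in for a DataFrame


class WhitespaceFeaturizer:
    """Hypothetical feature transformer: (id, text) -> (id, feature_vector)."""

    def __init__(self, num_features: int = 8) -> None:
        self.num_features = num_features

    def transform(self, frame: Frame) -> Frame:
        out: Frame = []
        for row in frame:
            vec = [0.0] * self.num_features
            for token in str(row["text"]).lower().split():
                # Deterministic bucket choice; the built-in hash() is salted per process.
                bucket = sum(token.encode("utf-8")) % self.num_features
                vec[bucket] += 1.0
            # Build a new row instead of mutating the input row.
            out.append({"id": row["id"], "feature_vector": vec})
        return out


docs = [{"id": 0, "text": "spark ml pipelines"}, {"id": 1, "text": "spark"}]
features = WhitespaceFeaturizer().transform(docs)
```

The input rows are left untouched and a new table is returned, which matches the "converts one DataFrame into another" description.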
Estimators¶
Input: DataFrame
Output: Model
Must implement a method fit()
Example:
- LogisticRegression is an Estimator.
- Calling fit() trains a LogisticRegressionModel, which is a Model (hence also a Transformer).
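The Estimator/Model relationship can be illustrated with a small dependency-free sketch. It is not Spark code: `CentroidEstimator` and `CentroidModel` are hypothetical toy classes, a nearest-centroid rule is swapped in for logistic regression, and a list of dicts stands in for a DataFrame. The point is only that fit() returns an object that itself exposes transform().

```python
from typing import Dict, List

Row = Dict[str, object]   # toy stand-in for one DataFrame row
Frame = List[Row]         # toy stand-in for a DataFrame


class CentroidModel:
    """Toy Model: also a transformer, because it exposes transform()."""

    def __init__(self, centroids: Dict[str, List[float]]) -> None:
        self.centroids = centroids

    def transform(self, frame: Frame) -> Frame:
        def dist2(a: List[float], b: List[float]) -> float:
            return sum((x - y) ** 2 for x, y in zip(a, b))

        out: Frame = []
        for row in frame:
            vec = row["feature_vector"]
            # Predict the label whose centroid is closest to this row.
            label = min(self.centroids, key=lambda name: dist2(vec, self.centroids[name]))
            out.append({"id": row["id"], "label": label})
        return out


class CentroidEstimator:
    """Toy Estimator: fit() takes training rows and returns a Model."""

    def fit(self, frame: Frame) -> CentroidModel:
        groups: Dict[str, List[List[float]]] = {}
        for row in frame:
            groups.setdefault(row["label"], []).append(row["feature_vector"])
        # Column-wise mean of the vectors belonging to each label.
        centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in groups.items()
        }
        return CentroidModel(centroids)


train = [
    {"id": 0, "feature_vector": [0.0, 0.0], "label": "neg"},
    {"id": 1, "feature_vector": [0.0, 1.0], "label": "neg"},
    {"id": 2, "feature_vector": [4.0, 4.0], "label": "pos"},
    {"id": 3, "feature_vector": [5.0, 4.0], "label": "pos"},
]
model = CentroidEstimator().fit(train)
predictions = model.transform([
    {"id": 10, "feature_vector": [4.5, 3.5]},
    {"id": 11, "feature_vector": [0.2, 0.4]},
])
```

Here `model.transform(...)` maps (id, feature_vector) rows to (id, label) rows, the same shape as the Model example in the Transformers section.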
Parameters¶
Both transformers and estimators can have parameters.
Two ways to set parameters:
- Call a setter on the instance: lr = LogisticRegression(), then lr.setMaxIter(10)
- Pass a ParamMap to fit() or transform(). A ParamMap is a set of (parameter, value) pairs.
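The two ways of setting parameters can be compared in a toy sketch. `ToyEstimator`, `ToyModel`, and the `stepSize` parameter are hypothetical and only imitate the style of the Spark API; a plain dict plays the role of a ParamMap. The sketch assumes that values passed at fit() time take precedence over values set earlier through setters, which is the behaviour Spark's documentation describes for ParamMaps.

```python
from typing import Dict, List, Optional


class ToyModel:
    def __init__(self, weight: float, params: Dict[str, float]) -> None:
        self.weight = weight
        self.params = params  # the parameter values actually used by fit()


class ToyEstimator:
    def __init__(self) -> None:
        # Instance-level defaults (hypothetical values for this toy).
        self._params: Dict[str, float] = {"maxIter": 100, "stepSize": 0.5}

    def setMaxIter(self, value: int) -> "ToyEstimator":
        self._params["maxIter"] = value
        return self

    def fit(self, data: List[float],
            params: Optional[Dict[str, float]] = None) -> ToyModel:
        # Values in the per-call map win over values set on the instance.
        effective = {**self._params, **(params or {})}
        target = sum(data) / len(data)
        weight = 0.0
        # Move the weight toward the mean of the data, one step per iteration.
        for _ in range(int(effective["maxIter"])):
            weight += effective["stepSize"] * (target - weight)
        return ToyModel(weight, effective)


est = ToyEstimator()
est.setMaxIter(10)                             # way 1: setter on the instance
model_a = est.fit([2.0, 4.0])
model_b = est.fit([2.0, 4.0], {"maxIter": 2})  # way 2: per-call parameter map
```

The per-call map affects only that one fit() call: `est` still carries maxIter = 10 afterwards, so the same estimator can be reused with different settings.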