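What Is EvalML?

EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions. It supports a range of supervised learning tasks, including regression, classification (both binary and multiclass), and time series analysis. The default objective is log loss for classification tasks and R-squared for regression tasks, and either can be easily customized. EvalML's time series functionality uses past values to predict future values.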

Features

EvalML can work with linear models, K_neighbors, Random_forest, SVM (support vector machine), XGBoost, Lightgbm, CATBoost, Extra_trees, Ensemble, Decision_Tree, Exponential_smoothing, ARIMA, Baseline, PROPHET, Vowpal_wabbit, and others.
When training time series models, you can compare models such as ARIMA, Baseline, PROPHET, XGBoost, Ensemble, Exponential_smoothing, and SARIMA.
AutoML also helps detect and avoid underfitting, overfitting, data imbalance, and data leakage.
Because AutoML tunes hyperparameters automatically, it can reach model performance that is hard to achieve with manual training.

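Key capabilities

EvalML's main capabilities are as follows.

Automation
Data checks
End to End
Model understanding
Domain-specific

Automation

Makes machine learning easier. EvalML provides data quality checks and validation, and spares you from training models and tuning parameters by hand.

Data checks

Before a model is built, EvalML runs data checks for problems such as anomalous or missing data and raises warnings. For example, NullDataCheck from evalml.data_checks returns the columns whose fraction of missing values reaches a threshold, as shown below.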

import numpy as np
import pandas as pd

from evalml.data_checks import NullDataCheck

X = pd.DataFrame([[1, 2, 3],
                  [0, 4, np.nan],
                  [1, 4, np.nan],
                  [9, 4, np.nan],
                  [8, 6, np.nan]])

null_check = NullDataCheck(pct_null_col_threshold=0.8, pct_null_row_threshold=0.8)
messages = null_check.validate(X)

errors = [message for message in messages if message['level'] == 'error']
warnings = [message for message in messages if message['level'] == 'warning']

for warning in warnings:
    print("Warning:", warning['message'])

for error in errors:
    print("Error:", error['message'])

Running the code above produces the following output.

Warning: Column(s) '2' are 80.0% or more null
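
To make the warning concrete, here is a minimal plain-Python sketch of the per-column calculation that such a check implies: the fraction of missing values in each column is compared with the threshold. It is an illustration only, not EvalML's implementation, and the helper name null_fraction_by_column is hypothetical.

```python
import math


def null_fraction_by_column(rows):
    """Return the fraction of missing (None or NaN) values in each column."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    fractions = []
    for col in range(n_cols):
        missing = sum(
            1
            for row in rows
            if row[col] is None
            or (isinstance(row[col], float) and math.isnan(row[col]))
        )
        fractions.append(missing / n_rows)
    return fractions


# Same values as the DataFrame in the example above
rows = [
    [1, 2, 3],
    [0, 4, math.nan],
    [1, 4, math.nan],
    [9, 4, math.nan],
    [8, 6, math.nan],
]

threshold = 0.8
fractions = null_fraction_by_column(rows)
# A column is flagged when its missing fraction is at or above the threshold
flagged = [col for col, frac in enumerate(fractions) if frac >= threshold]

print(fractions)  # [0.0, 0.0, 0.8]
print(flagged)    # [2]
```

Column 2 has 4 missing values out of 5 rows, which is exactly 80%, so it meets the 0.8 threshold and is the only column reported.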

End to End

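Builds and optimizes pipelines that include state-of-the-art preprocessing, feature extraction, feature selection, and a variety of modeling techniques.

Model understanding

Helps you understand a model's characteristics and how it behaves in the domain of your task. For example, train a pipeline as follows.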

import evalml
from evalml.pipelines import BinaryClassificationPipeline
X, y = evalml.demos.load_breast_cancer()

X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='binary',
                                                                     test_size=0.2, random_seed=0)


pipeline_binary = BinaryClassificationPipeline(component_graph = {
            "Label Encoder": ["Label Encoder", "X", "y"],
            "Imputer": ["Imputer", "X", "Label Encoder.y"],
            "Random Forest Classifier": [
                "Random Forest Classifier",
                "Imputer.x",
                "Label Encoder.y",
            ],
        })
pipeline_binary.fit(X_train, y_train)

For the resulting pipeline, the importance of each feature can be retrieved as follows.

pipeline_binary.feature_importance

The output is as follows.

	feature	importance
0	mean concave points	0.138857
1	worst perimeter	0.137780
2	worst concave points	0.117782
3	worst radius	0.100584
4	mean concavity	0.086402
5	worst area	0.072027
6	mean perimeter	0.046500
7	worst concavity	0.043408
8	mean radius	0.037664
9	mean area	0.033683
10	radius error	0.025036
11	area error	0.019324
12	worst texture	0.014754
13	worst compactness	0.014462
14	mean texture	0.013856
15	worst smoothness	0.013710
16	worst symmetry	0.011395
17	perimeter error	0.010284
18	mean compactness	0.008162
19	mean smoothness	0.008154
20	worst fractal dimension	0.007034
21	fractal dimension error	0.005502
22	compactness error	0.004953
23	smoothness error	0.004728
24	texture error	0.004384
25	symmetry error	0.004250
26	mean fractal dimension	0.004164
27	concavity error	0.004089
28	mean symmetry	0.003997
29	concave points error	0.003076

Calling pipeline_binary.graph_feature_importance() produces the same result as a bar chart. Model performance can also be examined through various other tools, such as ROC curves and confusion matrices.

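Domain-specific

EvalML lets you train and optimize a model for a specific problem, either by optimizing a domain-specific objective function or by defining your own custom objective. At the time of writing, EvalML includes two domain-specific objective functions.

As a concrete example, the objective function for a fraud detection model is built as follows.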

from evalml.objectives.binary_classification_objective import BinaryClassificationObjective
import pandas as pd

class FraudCost(BinaryClassificationObjective):
    """Score the percentage of money lost to fraud out of the total transaction amount processed"""
    name = "Fraud Cost"
    greater_is_better = False
    score_needs_proba = False
    perfect_score = 0.0

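    def __init__(self, retry_percentage=.5, interchange_fee=.02,
                 fraud_payout_percentage=1.0, amount_col='amount'):
        """Create an instance of FraudCost

          Arguments:
            retry_percentage (float): Percentage of customers who retry a transaction if it is declined. Between 0 and 1. Default: 0.5.

            interchange_fee (float): How much of each successful transaction can be collected. Between 0 and 1. Default: 0.02.

            fraud_payout_percentage (float): Percentage of fraud that cannot be collected. Between 0 and 1. Default: 1.0.

            amount_col (str): Name of the column in the data that contains the amount. Default: "amount"
        """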
        self.retry_percentage = retry_percentage
        self.interchange_fee = interchange_fee
        self.fraud_payout_percentage = fraud_payout_percentage
        self.amount_col = amount_col

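    def decision_function(self, ypred_proba, threshold=0.0, X=None):
        """Determine whether a transaction is fraudulent, given predicted probabilities, a threshold, and a dataframe with transaction amounts.
            Arguments:
                ypred_proba (pd.Series): Predicted probabilities
                X (pd.DataFrame): Dataframe containing the transaction amounts
                threshold (float): Dollar threshold used to decide whether a transaction is fraudulent

            Returns:
                pd.Series: Series of predicted fraud labels computed from X and the threshold
        """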
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)

        if not isinstance(ypred_proba, pd.Series):
            ypred_proba = pd.Series(ypred_proba)

        transformed_probs = (ypred_proba.values * X[self.amount_col])
        return transformed_probs > threshold

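    def objective_function(self, y_true, y_predicted, X):
        """Calculate the amount lost to fraud, given predictions, true values, and a dataframe with transaction amounts.
            Arguments:
                y_predicted (pd.Series): Predicted fraud labels
                y_true (pd.Series): True fraud labels
                X (pd.DataFrame): Dataframe containing the transaction amounts

            Returns:
                float: Amount lost to fraud as a fraction of the total amount processed
        """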
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)

        if not isinstance(y_predicted, pd.Series):
            y_predicted = pd.Series(y_predicted)

        if not isinstance(y_true, pd.Series):
            y_true = pd.Series(y_true)

        # Extract the transaction amounts using the amount column in the user's data
        try:
            transaction_amount = X[self.amount_col]
        except KeyError:
            raise ValueError("`{}` is not a valid column in X.".format(self.amount_col))

        # Amount paid out if the transaction is fraudulent
        fraud_cost = transaction_amount * self.fraud_payout_percentage

        # Interchange fee lost on a transaction when the customer does not retry
        interchange_cost = transaction_amount * (1 - self.retry_percentage) * self.interchange_fee

        # Cost of missed fraudulent transactions (false negatives)
        false_negatives = (y_true & ~y_predicted) * fraud_cost

        # Fees lost on legitimate transactions that were flagged (false positives)
        false_positives = (~y_true & y_predicted) * interchange_cost

        loss = false_negatives.sum() + false_positives.sum()

        loss_per_total_processed = loss / transaction_amount.sum()

        return loss_per_total_processed
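
The arithmetic in the class above can be traced by hand on a few transactions. The following is a plain-Python sketch that mirrors the same formulas without pandas or EvalML; the function names flag_transactions and fraud_cost_score and the sample numbers are made up for illustration.

```python
def flag_transactions(probabilities, amounts, threshold):
    """Flag a transaction when probability * amount exceeds a dollar threshold."""
    return [p * a > threshold for p, a in zip(probabilities, amounts)]


def fraud_cost_score(y_true, y_predicted, amounts,
                     retry_percentage=0.5, interchange_fee=0.02,
                     fraud_payout_percentage=1.0):
    """Return the money lost as a fraction of the total amount processed."""
    loss = 0.0
    for is_fraud, flagged, amount in zip(y_true, y_predicted, amounts):
        if is_fraud and not flagged:
            # False negative: the fraudulent amount is paid out
            loss += amount * fraud_payout_percentage
        elif flagged and not is_fraud:
            # False positive: the fee is lost for customers who do not retry
            loss += amount * (1 - retry_percentage) * interchange_fee
    return loss / sum(amounts)


# Decision step: the threshold is in dollars, so a low-probability but
# large transaction can be flagged while a likely-fraud small one is not.
flags = flag_transactions([0.9, 0.1], [10.0, 1000.0], threshold=50.0)
print(flags)  # [False, True]

# Scoring step: one missed fraud (100) and one wrongly flagged payment (200)
amounts = [100.0, 200.0, 300.0, 400.0]
y_true = [True, False, False, True]
y_predicted = [False, True, False, True]
score = fraud_cost_score(y_true, y_predicted, amounts)
print(round(score, 3))  # 0.102
```

Here the missed fraud costs 100 and the false positive costs 200 * 0.5 * 0.02 = 2, so the loss is 102 out of 1000 processed. Because lower is better and a perfect model loses nothing, this matches greater_is_better = False and perfect_score = 0.0 in the class.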

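When you create a custom objective function, the following elements must be defined.

name
- The name of the objective function
objective_function
- Takes the predictions, the true labels, and an optional reference to the inputs, and returns a score of the model's performance
greater_is_better
- True if a higher objective_function value represents a better solution, False otherwise
score_needs_proba
- Used only for classification objectives
- True if the objective uses predicted probabilities rather than predicted values (for example, cross-entropy for classifiers)
decision_function
- Used only for binary classification
- Takes the predicted probabilities output by the model and a binary classification threshold, and returns predicted values
perfect_score
- The score a perfect model would achieve on this objective
expected_range
- The range of values the objective is expected to output. This does not necessarily have to match the range of possible values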

Implementation

Installation

EvalML can be installed with pip or conda. With pip:

pip install evalml

With conda:

conda install -c conda-forge evalml

Add-ons can also be installed individually or all at once.

pip install evalml[complete]            # all add-ons
pip install evalml[prophet]             # time series support using Facebook's Prophet library with EvalML
pip install evalml[update_checker]      # automatic notifications of new EvalML releases

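Loading the data and splitting it into training and validation sets

import evalml
from evalml.demos import load_weather
X, y = load_weather()
# Use the same problem type as the AutoML search so the split keeps chronological order
X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(X, y, problem_type='time series regression')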

Running AutoML

from evalml.automl import AutoMLSearch

automl = AutoMLSearch(X_train, y_train, problem_type="time series regression",
                      max_batches=1,
                      problem_configuration={"gap": 0, "max_delay": 7,
                                             "forecast_horizon": 7, "time_index": "Date"},
                      allowed_model_families=["xgboost", "random_forest", "linear_model", "extra_trees",
                                              "decision_tree"]
                      )
automl.search()

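The parameters used above are summarized in the table below.

Parameter	Description
problem_type	The type of supervised learning task. For the full list, see evalml.problem_types.ProblemType.all_problem_types.
max_batches	The maximum number of batches of pipelines to search.
forecast_horizon	The number of time periods to forecast. In the example above, the next 7 days of weather are being predicted, so the value is 7.
gap	The number of time periods between the end of the training data and the start of the validation data. In the example above, "today's" data is used to predict the next 7 days of weather, so the value is 0.
max_delay	The maximum number of rows to look back from the current row when computing features. In the example above, the previous week's weather can be used to predict this week's weather.
time_index	The column of the training dataset that contains the date of each observation. This parameter is used only by some time series models.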
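
One way to picture how forecast_horizon, gap, and max_delay interact is to list the lags (how many rows back) that could be used as features for a given target row. The sketch below assumes a simplified reading in which the nearest usable row is forecast_horizon + gap rows back and the window extends max_delay rows further. The helper name candidate_lags is hypothetical, and EvalML's own featurizer may select only a subset of these lags.

```python
def candidate_lags(forecast_horizon, gap, max_delay):
    """List the lags available as features under the simplified reading above."""
    # Rows closer than forecast_horizon + gap are not yet known at prediction time
    first_lag = forecast_horizon + gap
    # max_delay extends the window further into the past
    return list(range(first_lag, first_lag + max_delay + 1))


# Values from the AutoMLSearch example: predict 7 days ahead from today's data
lags = candidate_lags(forecast_horizon=7, gap=0, max_delay=7)
print(lags)  # [7, 8, 9, 10, 11, 12, 13, 14]

# Increasing gap pushes the whole window further into the past
print(candidate_lags(forecast_horizon=7, gap=2, max_delay=7))
# [9, 10, 11, 12, 13, 14, 15, 16]
```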

You can compare the performance of the best pipeline and the baseline pipeline on the validation dataset.

Baseline pipeline

import pandas as pd
baseline = automl.get_pipeline(0)
baseline.fit(X_train, y_train)
naive_baseline_preds = baseline.predict_in_sample(X_test, y_test, objective=None,
                                                  X_train=X_train, y_train=y_train)
expected_preds = pd.concat([y_train.iloc[-7:], y_test]).shift(7).iloc[7:]
pd.testing.assert_series_equal(expected_preds, naive_baseline_preds)
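
The assertion above checks that the baseline simply repeats the value observed 7 rows earlier. The same idea can be sketched in plain Python on a toy series; the function name naive_forecast and the numbers are illustrative only.

```python
def naive_forecast(train, test, horizon):
    """Predict each test value with the value observed `horizon` steps earlier."""
    # Prepend the last `horizon` training values so early test rows have history
    history = list(train[-horizon:]) + list(test)
    # history[i] sits exactly `horizon` steps before test[i]
    return history[:len(test)]


train = [10, 11, 12, 13, 14, 15, 16, 17]
test = [20, 21, 22]

predictions = naive_forecast(train, test, horizon=7)
print(predictions)  # [11, 12, 13]
```

This mirrors the concat, shift(7), iloc[7:] sequence in the pandas code: the last 7 training values are joined to the test values, everything is moved 7 positions later, and the first 7 positions, which have no history, are dropped.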

Best pipeline

pl = automl.best_pipeline
pl.fit(X_train, y_train)
best_pipeline_score = pl.score(X_test, y_test, ['R2'], X_train, y_train)['R2']
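
To interpret the R2 score, it helps to recall its definition: one minus the ratio of the squared prediction error to the squared deviation of the targets from their mean. The following plain-Python sketch implements that standard formula on made-up numbers; it is not EvalML's own code.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
close_preds = [3.0, 5.0, 7.0, 10.0]  # off by 1 on a single row
mean_preds = [6.0, 6.0, 6.0, 6.0]    # always predicts the mean of y_true

print(round(r2_score(y_true, close_preds), 2))  # 0.95
print(round(r2_score(y_true, mean_preds), 2))   # 0.0
```

A score of 1.0 means perfect predictions, 0.0 means no better than always predicting the mean, and negative values are possible for worse models. Scoring the baseline pipeline with the same objective gives a reference point for the best pipeline's R2.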

Features