Hyperparameter Tuning with Bayesian Optimization


 

Tags: TPE, CMA-ES, grid search, random search, Bayesian optimization
 
 

 

 


Hyperparameter tuning has always been one of the more intractable problems in machine learning: the sheer number of hyperparameters and the exponentially exploding parameter space often leave you with no idea where to start. Tuning is a tedious process, and there is no guarantee of a good reward at the end. Many machine learning engineers jokingly call themselves "parameter-tuning laborers" or "alchemists"...

Hyperparameters: "Hyper-parameters are parameters that are not directly learnt within estimators."

Hyperparameter optimization can be framed as the following problem:

$$x^* = \arg\min_{x \in \mathcal{X}} f(x)$$

where $f(x)$ is the objective function that measures error, such as MSE, RMSE, or MAE (for metrics like accuracy, add a minus sign so the problem remains a minimization), and $x^*$ is the parameter combination that minimizes $f(x)$. In theory the goal is to find this $x^*$, but finding such a global optimum is practically impossible. First, $f(x)$ is a black box, so it cannot be optimized analytically. Moreover, every evaluation of $f(x)$ requires training and evaluating a model, which is expensive; for deep learning models in particular this can take a very long time. With limited compute and time, we can only hope for reasonably good local optima.
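To make the black-box nature of $f(x)$ concrete, here is a minimal sketch (assuming scikit-learn; the diabetes data set and random forest are placeholder choices) in which one evaluation of $f(x)$ is one full cross-validated training run:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

def f(params):
    """Black-box objective: one evaluation = one full cross-validated training run."""
    model = RandomForestRegressor(n_estimators=int(params["n_estimators"]),
                                  max_depth=int(params["max_depth"]),
                                  random_state=0)
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return float(np.sqrt(-neg_mse.mean()))  # cross-validated RMSE: smaller is better

print(f({"n_estimators": 100, "max_depth": 5}))  # one (expensive) evaluation of f(x)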

 

Common tuning approaches include the following:

1. Manual tuning

With manual tuning, you adjust the model's most important parameters based on experience. For lightgbm, for example, these include num_leaves, learning_rate, feature_fraction, lambda_l1, lambda_l2, min_data_in_leaf, and so on.
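As a rough illustration of what manual tuning looks like in practice, a minimal sketch assuming the lightgbm package is installed; the specific values are illustrative starting points, not recommendations:

import lightgbm as lgb

# Hand-picked values for the parameters mentioned above; adjust based on
# validation results and experience with the data set.
params = {
    "objective": "regression",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "lambda_l1": 0.1,
    "lambda_l2": 0.1,
    "min_data_in_leaf": 20,
}
model = lgb.LGBMRegressor(**params)  # tweak, fit, evaluate, repeat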

2. Grid Search and Randomized Search

Grid Search performs a brute-force sweep over the defined parameter space. It is slow, but it covers the whole search space well. Randomized Search instead samples configurations from the defined parameter space and trains on those samples. It is fast, but it may miss important points in the search space.


In practice, you do not need to sweep over all of a model's parameters when tuning. Usually only a few of them actually affect performance, and it is enough to run Randomized Search or Grid Search over just those. Check the model's documentation or the relevant literature for which parameters matter.

# Comparing randomized search and grid search for hyperparameter estimation

import numpy as np
from time import time
from scipy.stats import randint as sp_randint

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier
clf = RandomForestClassifier(n_estimators=20)

# Utility function to report best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")


# specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# use a full grid over all parameters
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run grid search
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
Output:
RandomizedSearchCV took 5.55 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.939 (std: 0.024)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 7, 'min_samples_split': 3}

Model with rank: 2
Mean validation score: 0.933 (std: 0.022)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 6, 'min_samples_split': 6}

Model with rank: 3
Mean validation score: 0.930 (std: 0.031)
Parameters: {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 6, 'min_samples_split': 6}

GridSearchCV took 16.95 seconds for 72 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.937 (std: 0.019)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_split': 2}

Model with rank: 2
Mean validation score: 0.935 (std: 0.020)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 10, 'min_samples_split': 2}

Model with rank: 3
Mean validation score: 0.930 (std: 0.029)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_split': 3}
The two searches can be wrapped into reusable helper functions:

from sklearn.model_selection import GridSearchCV

def func_grid_search(model, X_train, y_train, param_grid, cv=5, n_jobs=-1):
    """
    model: model instance
    param_grid: dict, the parameter grid to search, e.g.
    param_grid = {'n_estimators': [100, 200, 500],
                  'max_depth': [5, 8, 10, 15],
                  'max_features': [0.80, 0.90, 0.95],
                  'min_samples_split': [2, 5, 8],
                  'min_samples_leaf': [1, 3, 5]}
    """
    gs = GridSearchCV(model, param_grid, cv=cv, n_jobs=n_jobs)
    gs.fit(X_train, y_train.ravel())
    print(gs.best_params_)
    print(gs.best_score_)
    # print(gs.cv_results_)
    bst_model = gs.best_estimator_
    return bst_model
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint

def func_random_search(model, X_train, y_train, param_dist, n_iter=20, cv=5, n_jobs=-1):
    """
    # parameters for RandomizedSearchCV
    # specify parameters and distributions to sample from
    param_dist = {"max_depth": [3, 5],
                  "max_features": sp_randint(1, 11),
                  "min_samples_split": sp_randint(2, 11),
                  "min_samples_leaf": sp_randint(1, 11),
                  "bootstrap": [True, False]
                 }
    """
    rs = RandomizedSearchCV(model, param_dist, n_iter=n_iter, cv=cv, n_jobs=n_jobs)
    rs.fit(X_train, y_train.ravel())
    print(rs.best_params_)
    print(rs.best_score_)
    # print(rs.cv_results_)
    bst_model = rs.best_estimator_
    return bst_model
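
A possible usage sketch for the two helpers (the diabetes data set and random forest below are placeholders chosen only for illustration):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X_train, y_train = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=100, random_state=0)

# cheap randomized pass over a fairly wide space
best_rf = func_random_search(rf, X_train, y_train,
                             param_dist={"max_depth": [3, 5, None],
                                         "max_features": sp_randint(1, 11),
                                         "min_samples_split": sp_randint(2, 11)},
                             n_iter=10)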
    

Grid Search is too slow to be practical on its own. A common strategy is to run a coarse-grained grid over the parameter space first and then refine the grid around the best region found, as sketched below. Randomized Search, on the other hand, can often reach good results after a moderate number of iterations and deserves to be tried more.
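
A minimal sketch of the coarse-to-fine idea, reusing the func_grid_search helper and the rf, X_train, y_train placeholders from the previous sketch; the grids themselves are illustrative:

# stage 1: coarse grid over wide ranges
coarse_grid = {"max_depth": [3, 6, 12, 24],
               "min_samples_split": [2, 8, 32]}
coarse_model = func_grid_search(rf, X_train, y_train, coarse_grid)

# stage 2: finer grid centred on the best coarse values
# (suppose stage 1 picked max_depth=6, min_samples_split=8)
fine_grid = {"max_depth": [5, 6, 7, 8],
             "min_samples_split": [6, 8, 10, 12]}
fine_model = func_grid_search(rf, X_train, y_train, fine_grid)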


3. Bayesian Optimization

Grid Search and Randomized Search make the tuning process automatic, but they cannot use information from earlier trials, so they may waste many evaluations on unpromising regions of the parameter space. Bayesian optimization, in contrast, keeps track of past evaluation results and builds a probabilistic model that maps hyperparameters to a distribution over the objective score, i.e. $P(\text{score} \mid \text{hyperparameters})$, which is then used to guide the choice of the next parameter combination.

Bayesian optimization trades off exploration and exploitation more effectively, and it applies to stochastic, non-convex, and discontinuous functions. In one sentence: build a probabilistic model of the objective function, use that model to pick the most promising parameter combination, and evaluate only that combination on the true objective.

Sequential Model-Based Optimization (SMBO) is a more concrete formulation of Bayesian optimization; the two can be treated as equivalent. It generally proceeds as follows (a code sketch of the loop follows the list):

  1. Define the hyperparameter space to search.
  2. Define an objective function to evaluate and optimize.
  3. Build a surrogate model of the objective function.
  4. Define a criterion (selection/acquisition function) evaluated on the surrogate model for choosing the next hyperparameters.
  5. Collect (score, hyperparameter) samples and use them to update the surrogate model.
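
A schematic sketch of this loop under simplifying assumptions (a Gaussian-process surrogate from scikit-learn and a lower-confidence-bound acquisition maximized by random sampling; the smbo function and its arguments are hypothetical, not any particular library's API):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def smbo(f, sample_space, n_init=5, n_iter=20):
    """Minimize a black-box f over points drawn by sample_space()."""
    X = np.array([sample_space() for _ in range(n_init)])      # 1. hyperparameter space
    y = np.array([f(x) for x in X])                            # 2. objective evaluations
    for _ in range(n_iter):
        gp = GaussianProcessRegressor().fit(X, y)              # 3. surrogate model
        cand = np.array([sample_space() for _ in range(1000)])
        mu, sigma = gp.predict(cand, return_std=True)
        x_next = cand[np.argmin(mu - 1.96 * sigma)]            # 4. acquisition (lower confidence bound)
        X = np.vstack([X, x_next])                             # 5. update surrogate with new sample
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

# e.g. minimize a toy 2-d quadratic
best_x, best_y = smbo(lambda x: float((x ** 2).sum()),
                      sample_space=lambda: np.random.uniform(-3, 3, size=2))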

Bayesian optimization methods are mainly distinguished by their surrogate function. Common surrogate models are the Gaussian Process, Random Forest, and Tree Parzen Estimator (TPE). Popular frameworks include Spearmint, Hyperopt, SMAC, MOE, BayesianOptimization, and skopt; some of them are compared in the table below:

Library   | Surrogate function
----------|-----------------------------
Spearmint | Gaussian Process
Hyperopt  | Tree Parzen Estimator (TPE)
SMAC      | Random Forest

The Optunity library also bundles several hyperparameter tuning methods (among them particle swarm optimization and CMA-ES).

Below, Hyperopt is used as an example to show how Bayesian optimization is applied in practice.

An optimization problem in Hyperopt can be broken into four parts (a minimal end-to-end example follows the list):

  1. The objective function to optimize.
  2. The parameter space to search.
  3. The hyperparameter optimization algorithm, which builds the surrogate function and uses it to decide the next parameter combination to try.
  4. Trials: records the loss, parameters, and any extra information (customizable) for every trial, so the whole optimization history is kept for later analysis.
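
A minimal end-to-end Hyperopt example showing the four parts on a toy objective (the quadratic objective here is purely illustrative):

from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

# 1. objective function to minimize
def objective(params):
    loss = (params["x"] - 3) ** 2 + params["y"] ** 2
    return {"loss": loss, "status": STATUS_OK}

# 2. parameter space
space = {"x": hp.uniform("x", -10, 10),
         "y": hp.uniform("y", -10, 10)}

# 3. optimization algorithm (TPE builds and queries the surrogate)
# 4. Trials object records every evaluation
trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)
print(best)                                             # best parameter values found
print(min(t["result"]["loss"] for t in trials.trials))  # best loss seen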

Defining the search space:

from hyperopt import STATUS_OK, fmin, tpe, Trials, hp
import xgboost as xgb
import logging
import csv
import time
from timeit import default_timer as timer
import os
from functools import partial
import pandas as pd
import numpy as np

MAX_EVALS = 100  # number of optimization iterations
NFOLDS = 5       # number of folds for K-fold CV
FOLDS = None     # custom folds; takes priority over NFOLDS
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

XGB_SPACE = {
    'booster': 'gbtree',
    'random_state': 2019,
    'eval_metric': 'rmse',
    'n_jobs': -1,
    'learning_rate': 0.05,
    'subsample': hp.uniform('subsample', 0.1, 1.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.1, 1.0),
    'max_depth': hp.quniform('max_depth', 5, 30, 1),
    'gamma': hp.uniform('gamma', 0.0, 2.0),
    'min_child_weight': hp.uniform('min_child_weight', 0.0, 5.0),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 3.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 3.0)
}
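
To sanity-check a space definition, one can draw random samples from it with hyperopt.pyll.stochastic.sample; note that hp.quniform returns floats, which is why max_depth is cast to int inside the objective below:

from hyperopt.pyll.stochastic import sample

print(sample(XGB_SPACE))
# e.g. {'booster': 'gbtree', ..., 'max_depth': 17.0, 'subsample': 0.63, ...}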

Defining the objective function to optimize:

def objective_base(params,
                   train_set,
                   folds=None,
                   nfold=5,
                   writetoFile=True):
    """
    Objective function for Gradient Boosting Machine Hyperparameter Optimization
    Args:
        folds: This argument has highest priority over other data split arguments.
    Return:
    """
    # Keep track of evals
    global _ITERATION
    _ITERATION += 1
    # Make sure parameters that need to be integers are integers
    for parameter_name in [
            'num_leaves', 'max_depth', 'bagging_freq', 'min_data_in_leaf',
            'min_samples_split', 'min_samples_leaf'
    ]:
        if parameter_name in params:
            params[parameter_name] = int(params[parameter_name])
    start = timer()
    logging.info(f"{_ITERATION} ITERATION")
    logging.info(f"params:\n{params}")
    cv_dict = xgb.cv(params,
                     train_set,
                     num_boost_round=5000,
                     nfold=nfold,
                     stratified=False,
                     folds=folds,
                     early_stopping_rounds=100,
                     as_pandas=False,
                     verbose_eval=10,
                     seed=0,
                     shuffle=False)
    # Extract the min rmse; loss must be minimized
    loss = np.min(cv_dict['test-rmse-mean'])
    # Boosting round that returned the lowest cv rmse
    n_estimators = int(np.argmin(cv_dict['test-rmse-mean']) + 1)
    run_time = timer() - start
    # Write to the csv file ('a' means append)
    if writetoFile:
        random_datetime = str(int(time.time()))
        hyper_base_path = os.path.join(BASE_DIR, 'hyperopt_output')
        trial_file = os.path.join(hyper_base_path, 'trials.csv')
        trial_file_rename = os.path.join(hyper_base_path,
                                         'trials_%s.csv' % random_datetime)
        if not os.path.exists(hyper_base_path):
            os.makedirs(hyper_base_path)
            print(
                "No trial file directory <hyperopt_output> exists, will be created..."
            )
        if _ITERATION == 1:
            if os.path.exists(trial_file):
                print("Trial file exists, will be renamed...")
                os.rename(trial_file, trial_file_rename)
                assert not os.path.exists(
                    trial_file), "Trial file still exists, rename failed..."
            # File to save first results: write the headers
            of_connection = open(trial_file, 'w')
            writer = csv.writer(of_connection)
            writer.writerow(
                ['loss', 'params', 'iteration', 'estimators', 'train_time'])
            of_connection.close()
        of_connection = open(trial_file, 'a')
        writer = csv.writer(of_connection)
        writer.writerow([loss, params, _ITERATION, n_estimators, run_time])
        of_connection.close()
    # Dictionary with information for evaluation
    return {
        'loss': loss,
        'params': params,
        'iteration': _ITERATION,
        'estimators': n_estimators,
        'train_time': run_time,
        'status': STATUS_OK
    }

Defining the pre- and post-processing modules:

def build_train_set(X_train, y_train):
    isX_df = isinstance(X_train, pd.DataFrame)
    isY_sr = isinstance(y_train, pd.Series)
    isY_df = isinstance(y_train, pd.DataFrame)
    if isY_df:
        raise TypeError(
            f"y_train is df, with the shape {y_train.shape}, which is not supportable now."
        )
    if isX_df ^ isY_sr:
        raise TypeError(f"X_train and y_train have different types!")
    if isX_df:
        train_set = xgb.DMatrix(X_train.values, y_train.values)
    else:
        train_set = xgb.DMatrix(X_train, y_train)
    return train_set
 
def post_hyperopt(bayes_trials, train_set, folds=None, nfold=5):
    # get best params
    bayes_results = pd.DataFrame(bayes_trials.results)
    bayes_results = bayes_results.sort_values(by='loss')
    bayes_results.reset_index(drop=True, inplace=True)
    best_params = bayes_results.loc[0, 'params']
    # refit with a lower learning rate to get the final loss and tree count
    best_params['learning_rate'] = 0.01
    # Perform n_folds cross validation
    cv_dict = xgb.cv(best_params,
                     train_set,
                     num_boost_round=5000,
                     folds=folds,
                     nfold=nfold,
                     stratified=False,
                     shuffle=False,
                     early_stopping_rounds=100,
                     as_pandas=False,
                     verbose_eval=10,
                     seed=2019)
    # Extract the min rmse; loss must be minimized
    loss = np.min(cv_dict['test-rmse-mean'])
    # Boosting round that returned the lowest cv rmse
    n_estimators = int(np.argmin(cv_dict['test-rmse-mean']) + 1)
    best_params['n_estimators'] = n_estimators
    logging.info(f"best loss: {loss}, best n_estimators: {n_estimators}")
    logging.info(f"best params: {best_params}")
    return best_params, loss

Defining the main function:

  
def main_tuning_with_bo(X_train,
                        y_train,
                        max_evals=MAX_EVALS,
                        folds=FOLDS,
                        nfold=NFOLDS):
    # Keep track of results
    bayes_trials = Trials()
    # Global variable
    global _ITERATION
    _ITERATION = 0
    TRAIN_SET = build_train_set(X_train, y_train)
    SPACE = XGB_SPACE
    func_objective = partial(objective_base,
                             train_set=TRAIN_SET,
                             folds=folds,
                             nfold=nfold,
                             writetoFile=True)
    # Run optimization
    best = fmin(fn=func_objective,
                space=SPACE,
                algo=tpe.suggest,
                max_evals=max_evals,
                trials=bayes_trials,
                rstate=np.random.RandomState(2019))
    best_params, loss = post_hyperopt(bayes_trials,
                                      train_set=TRAIN_SET,
                                      folds=folds,
                                      nfold=nfold)
    return best_params, loss
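
A possible usage sketch of the whole pipeline, assuming the code above runs in a compatible hyperopt/xgboost environment; the diabetes data set is only a stand-in:

from sklearn.datasets import load_diabetes

X_train, y_train = load_diabetes(return_X_y=True)
best_params, loss = main_tuning_with_bo(X_train, y_train, max_evals=50)
print(best_params)  # tuned XGBoost parameters, including n_estimators
print(loss)         # cross-validated RMSE at the lower learning rate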
