Regression with Hundred Hammers

In this notebook we explain how to use the HundredHammers library to perform basic model selection and hyperparameter optimization for a regression problem.

To do this, we will use the diabetes dataset, one of the example datasets available in the scikit-learn library.

[1]:
import logging

import hundred_hammers as hh
from hundred_hammers.model_zoo import (
    DummyRegressor,
    Ridge,
    DecisionTreeRegressor,
    KNeighborsRegressor,
)

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error

First we store the data in the X (input) and y (target) variables.

[2]:
data = load_diabetes()
X = data.data
y = data.target

We are going to first train some models with their default configuration. If you don’t specify the models that you want to use, some regression models will be chosen for you.

To see which models are chosen by default, check the DEFAULT_REGRESSION_MODELS variable:

[3]:
hh.model_zoo.DEFAULT_REGRESSION_MODELS
[3]:
[('Dummy Mean', DummyRegressor(), {}),
 ('Dummy Median', DummyRegressor(strategy='median'), {}),
 ('Linear Regression', LinearRegression(), {}),
 ('Decision Tree', DecisionTreeRegressor(), {}),
 ('SVR', SVR(), {}),
 ('Linear SVR', LinearSVR(), {}),
 ('Ridge', Ridge(), {}),
 ('Passive Aggressive', PassiveAggressiveRegressor(), {}),
 ('KNN', KNeighborsRegressor(), {}),
 ('Neural Network Regressor', MLPRegressor(), {}),
 ('Gaussian Process', GaussianProcessRegressor(), {}),
 ('Random Forest', RandomForestRegressor(), {}),
 ('AdaBoost', AdaBoostRegressor(), {}),
 ('Gradient Boosting', GradientBoostingRegressor(), {})]

Notice that it is a list of tuples. Each tuple contains the name given to the regressor, an instance of the class that implements the regression model, and a grid of hyperparameters (empty for now; it is explained later).

Those are the models that we are going to use now.
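Because every entry has the same (name, estimator, grid) shape, ordinary list operations are enough to inspect or filter the list. The sketch below is only an illustration: plain strings stand in for the estimator instances so that it runs without scikit-learn, and select_by_name is a hypothetical helper, not part of the library.

```python
# Stand-in entries with the same (name, estimator, grid) shape as the list above.
# Strings replace the estimator instances so the snippet runs without scikit-learn.
entries = [
    ("Dummy Mean", "DummyRegressor()", {}),
    ("Linear Regression", "LinearRegression()", {}),
    ("Ridge", "Ridge()", {}),
    ("KNN", "KNeighborsRegressor()", {}),
]


def select_by_name(entries, wanted):
    """Keep only the tuples whose name is in `wanted`, preserving list order."""
    wanted = set(wanted)
    return [entry for entry in entries if entry[0] in wanted]


subset = select_by_name(entries, ["Ridge", "KNN"])

# Unpack each tuple into its three parts; only the name is used here.
names = [name for name, _estimator, _grid in subset]
print(names)
```

A filtered list keeps the three-element tuple shape, which is the same shape used later in this notebook when models are passed explicitly.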

Evaluation with default models

First, create the HundredHammersRegressor object.

[4]:
hh_models = hh.HundredHammersRegressor(show_progress_bar=True)

Then evaluate the models. Besides the data (the variables X and y), you can pass other parameters: optim_hyper controls whether the hyperparameters of the models are optimized, and n_grid_points controls how many values of each hyperparameter are tried during the optimization.

Since we don’t want to optimize the hyperparameters here, optim_hyper is set to False.

[5]:
# configure the logger
hh.hh_logger.setLevel(logging.WARNING)

# Evaluate the models and store the results in a variable
df_results = hh_models.evaluate(X, y, optim_hyper=False)
Evaluating models...: 100%|██████████| 14/14 [00:52<00:00,  3.77s/it]

Notice the line before the call to evaluate. It configures the logger to show only warnings (of which there should be none). In an interactive environment you would most likely want logging.INFO, since it reports on each model in “real time”.

If you want to see more detailed information, you can set the level to logging.DEBUG. It outputs a lot of information, but it might be useful if you encounter a bug.

For the purposes of this notebook, the level is kept at logging.WARNING, but you are welcome to change it if you are running this notebook locally.
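The effect of the level can be seen with the standard logging module alone. In this minimal sketch, demo_logger is a hypothetical logger created only for the illustration (it is not the library's hh_logger), and messages are captured in a string buffer so the result can be inspected.

```python
import io
import logging

# "demo_logger" is a hypothetical logger used only for this illustration.
stream = io.StringIO()
demo_logger = logging.getLogger("demo_logger")
demo_logger.propagate = False          # keep messages out of the root logger
demo_logger.handlers.clear()
demo_logger.addHandler(logging.StreamHandler(stream))

# At WARNING level, INFO messages are dropped.
demo_logger.setLevel(logging.WARNING)
demo_logger.info("hidden at WARNING level")
demo_logger.warning("shown at WARNING level")

# Lowering the level to INFO lets the same kind of message through.
demo_logger.setLevel(logging.INFO)
demo_logger.info("shown at INFO level")

captured = stream.getvalue().splitlines()
print(captured)
```

Only two of the three messages are captured, because the first info call happens while the logger is still at WARNING level.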

We can now show the results of our execution:

[6]:
df_results
[6]:
Model Avg R2 (Validation Train) Std R2 (Validation Train) Avg R2 (Validation Test) Std R2 (Validation Test) Avg R2 (Train) Std R2 (Train) Avg R2 (Test) Std R2 (Test) Avg MSE (Validation Train) ... Avg MSE (Test) Std MSE (Test) Avg MAE (Validation Train) Std MAE (Validation Train) Avg MAE (Validation Test) Std MAE (Validation Test) Avg MAE (Train) Std MAE (Train) Avg MAE (Test) Std MAE (Test)
0 Dummy Mean 0.000000 0.000000 -0.023206 0.033628 0.000000 0.000000e+00 -0.001337 0.000000e+00 6125.118931 ... 5134.783503 0.000000e+00 67.301695 1.188564 67.615074 3.943941 67.339534 1.421085e-14 59.227456 7.105427e-15
1 Dummy Median -0.027476 0.007274 -0.050684 0.071611 -0.025922 0.000000e+00 -0.045202 0.000000e+00 6293.183322 ... 5359.719101 9.094947e-13 66.517990 1.217067 66.993618 4.993507 66.566572 0.000000e+00 59.044944 0.000000e+00
2 Linear Regression 0.556522 0.015235 0.514558 0.067539 0.553925 0.000000e+00 0.332233 0.000000e+00 2715.353061 ... 3424.259334 0.000000e+00 42.404995 0.829168 43.952508 3.145022 42.593344 0.000000e+00 46.173585 0.000000e+00
3 Decision Tree 1.000000 0.000000 -0.046363 0.200661 1.000000 0.000000e+00 -0.452581 7.741044e-02 0.000000 ... 7448.729213 3.969551e+02 0.000000 0.000000 61.592362 5.524460 0.000000 0.000000e+00 72.787640 2.030020e+00
4 SVR 0.159545 0.009264 0.126855 0.062307 0.186951 2.775558e-17 0.128119 0.000000e+00 5147.222487 ... 4470.939683 0.000000e+00 59.987041 1.025494 61.080993 4.554385 58.932403 0.000000e+00 53.268617 0.000000e+00
5 Linear SVR -0.480290 0.021770 -0.499608 0.177843 -0.380897 2.104066e-03 -0.515761 2.670427e-03 9068.418326 ... 7772.710116 1.369376e+01 72.936544 1.610525 73.117766 7.733969 71.098948 3.656171e-02 67.626756 6.017854e-02
6 Ridge 0.445599 0.013815 0.419461 0.046400 0.465084 0.000000e+00 0.340980 5.551115e-17 3394.642031 ... 3379.406308 0.000000e+00 49.381703 0.756940 50.079365 3.114845 48.381085 7.105427e-15 46.566795 0.000000e+00
7 Passive Aggressive 0.506681 0.026801 0.471776 0.060789 0.517667 9.373723e-03 0.358801 5.003282e-03 3020.352370 ... 3288.020684 2.565646e+01 45.207007 1.403516 46.656725 3.233495 44.703236 6.186625e-01 45.272923 2.780478e-01
8 KNN 0.615279 0.021038 0.410247 0.083013 0.618820 0.000000e+00 0.172488 0.000000e+00 2354.986676 ... 4243.422022 0.000000e+00 37.785156 1.130082 46.627299 3.687450 37.339377 0.000000e+00 49.492135 0.000000e+00
9 Neural Network Regressor -2.925185 0.119621 -2.988113 0.350351 -2.917495 9.059604e-02 -3.644418 1.081171e-01 24040.326288 ... 23816.235196 5.544168e+02 134.976709 3.010464 135.014455 9.449213 134.932019 1.912264e+00 137.552821 1.910063e+00
10 Gaussian Process 0.995085 0.001516 -13.683252 6.935421 0.984183 1.110223e-16 -9.864145 0.000000e+00 30.042210 ... 55710.543741 7.275958e-12 2.979676 0.408915 182.608347 32.059452 5.759758 0.000000e+00 144.352684 2.842171e-14
11 Random Forest 0.925400 0.003492 0.451311 0.081209 0.924440 1.996863e-03 0.261153 1.513057e-02 456.784823 ... 3788.750941 7.758846e+01 17.203293 0.450834 46.492412 3.162357 17.225159 2.502492e-01 48.277247 5.898557e-01
12 AdaBoost 0.687707 0.017497 0.460084 0.072768 0.662306 5.843726e-03 0.279929 1.969705e-02 1911.433685 ... 3692.470967 1.010051e+02 38.272124 1.042228 46.951305 3.079916 39.829220 4.565226e-01 47.444248 5.013379e-01
13 Gradient Boosting 0.889395 0.008559 0.441646 0.077073 0.857853 1.110223e-16 0.208258 1.787313e-03 677.065400 ... 4059.994938 9.165209e+00 20.760164 0.796809 46.463974 3.126900 23.559587 3.552714e-15 49.229710 1.154214e-01

14 rows × 25 columns
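One way to read this table is to rank the models by a held-out score and to look at the difference between training and test scores. The sketch below copies the Avg R2 (Train) and Avg R2 (Test) values of five rows from the table above into a plain dictionary; the ranking and the gap are illustrative choices, not something the library computes for you.

```python
# (Avg R2 (Train), Avg R2 (Test)) copied from five rows of the table above.
r2_scores = {
    "Linear Regression": (0.553925, 0.332233),
    "Decision Tree": (1.000000, -0.452581),
    "Ridge": (0.465084, 0.340980),
    "Passive Aggressive": (0.517667, 0.358801),
    "Gaussian Process": (0.984183, -9.864145),
}

# Rank the models by test score, best first.
ranking = sorted(r2_scores, key=lambda name: r2_scores[name][1], reverse=True)

# A large train-minus-test gap suggests the model memorised the training data.
gaps = {name: train - test for name, (train, test) in r2_scores.items()}
worst_gap = max(gaps, key=gaps.get)

for name in ranking:
    train, test = r2_scores[name]
    print(f"{name:20s} train={train:7.3f} test={test:7.3f} gap={gaps[name]:7.3f}")
```

In these rows the models with near-perfect training scores (Decision Tree and Gaussian Process) are also the ones with the worst test scores.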

That’s an acceptable way of displaying the results, but tables can be hard to read, so the library also provides a couple of functions that display the same information in a more readable format.

[7]:
hh.plot_batch_results(df_results, metric_name="MSE", title="Diabetes Dataset", display=False)
../_images/examples_example_regression_17_0.png
[8]:
# Take the models at indices 1, 2 and 4 (Dummy Median, Linear Regression and SVR)
models = [model for _, model, _ in hh_models.trained_models[1:3] + hh_models.trained_models[4:5]]

# Plot the predictions
hh.plot_regression_pred(
    X,
    y,
    models=models,
    metric=mean_squared_error,
    title="Diabetes",
    y_label="Diabetes (Value)",
)
../_images/examples_example_regression_18_0.png
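Python slices exclude their end index, so the expression trained_models[1:3] + trained_models[4:5] selects the entries at indices 1, 2 and 4. The short check below reproduces the selection on a list of the first six default model names, used here as a stand-in for the trained models.

```python
# First six model names, in the same order as the default model list.
names = [
    "Dummy Mean",
    "Dummy Median",
    "Linear Regression",
    "Decision Tree",
    "SVR",
    "Linear SVR",
]

# [1:3] yields indices 1 and 2; [4:5] yields index 4 only.
picked = names[1:3] + names[4:5]
print(picked)
```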

If we need one of the trained models, we can take it from the trained_models attribute of the HundredHammersRegressor object. This attribute is a list of tuples, each containing the name of the model, the trained model, and a dictionary (empty in this run).

[9]:
hh_models.trained_models
[9]:
[('Dummy Mean', DummyRegressor(), {}),
 ('Dummy Median', DummyRegressor(strategy='median'), {}),
 ('Linear Regression', LinearRegression(), {}),
 ('Decision Tree', DecisionTreeRegressor(random_state=9), {}),
 ('SVR', SVR(), {}),
 ('Linear SVR', LinearSVR(random_state=9), {}),
 ('Ridge', Ridge(random_state=9), {}),
 ('Passive Aggressive', PassiveAggressiveRegressor(random_state=9), {}),
 ('KNN', KNeighborsRegressor(), {}),
 ('Neural Network Regressor', MLPRegressor(random_state=9), {}),
 ('Gaussian Process', GaussianProcessRegressor(random_state=9), {}),
 ('Random Forest', RandomForestRegressor(random_state=9), {}),
 ('AdaBoost', AdaBoostRegressor(random_state=9), {}),
 ('Gradient Boosting', GradientBoostingRegressor(random_state=9), {})]

Automatic optimization of hyperparameters

To choose which models are evaluated, we pass them to the HundredHammersRegressor class.

For this example, we will use four simple regression models.

[10]:
models_to_check = [
    ("Dummy", DummyRegressor(), None),
    ("Ridge", Ridge(random_state=0), None),
    ("Decision Tree", DecisionTreeRegressor(random_state=0), None),
    ("KNN", KNeighborsRegressor(), None),
]

Each model has a name and an object that implements it. The third position in the tuple holds the user-specified grid of hyperparameters; here it is None, so the grids will be generated automatically.

This only happens for models that are already configured. If you want automatic generation of hyperparameters for a model that has not been added yet, check the “example_add_model.ipynb” notebook.

We can now pass these models to the HundredHammersRegressor class.

[11]:
hh_models = hh.HundredHammersRegressor(models=models_to_check, show_progress_bar=True)

This time, since we want to optimize the hyperparameters of our models, we set optim_hyper to True.

We can also configure the size of the grid used in the GridSearch step: n_grid_points indicates how many values each hyperparameter will take. In this case, we take 8 values for each one. For categorical hyperparameters with fewer than 8 possible values, only those values are used.
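To get a feel for how quickly the search grows, the sketch below counts grid combinations under a simplifying assumption: every numeric hyperparameter contributes exactly n_grid_points values, and every categorical one contributes its own number of options when that number is smaller. The grid_size function is hypothetical and is not how the library builds its grids.

```python
from math import prod


def grid_size(n_numeric, categorical_sizes, n_grid_points):
    """Count grid combinations under the assumption described above.

    n_numeric: number of numeric hyperparameters.
    categorical_sizes: number of options of each categorical hyperparameter.
    """
    per_param = [n_grid_points] * n_numeric
    # Assumption: a categorical hyperparameter never yields more than n_grid_points values.
    per_param += [min(size, n_grid_points) for size in categorical_sizes]
    return prod(per_param)


# One numeric hyperparameter and one categorical with 2 options: 8 * 2 combinations.
small = grid_size(1, [2], 8)
# Two numeric hyperparameters and one categorical with 4 options: 8 * 8 * 4 combinations.
large = grid_size(2, [4], 8)
print(small, large)
```

The count is multiplicative, so adding one more numeric hyperparameter multiplies the number of candidate configurations by n_grid_points.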

[12]:
df_results = hh_models.evaluate(X, y, optim_hyper=True, n_grid_points=8)
Evaluating models...: 100%|██████████| 4/4 [00:01<00:00,  3.37it/s]
[13]:
df_results
[13]:
Model Avg R2 (Validation Train) Std R2 (Validation Train) Avg R2 (Validation Test) Std R2 (Validation Test) Avg R2 (Train) Std R2 (Train) Avg R2 (Test) Std R2 (Test) Avg MSE (Validation Train) ... Avg MSE (Test) Std MSE (Test) Avg MAE (Validation Train) Std MAE (Validation Train) Avg MAE (Validation Test) Std MAE (Validation Test) Avg MAE (Train) Std MAE (Train) Avg MAE (Test) Std MAE (Test)
0 Dummy 0.000000 0.000000 -0.023206 0.033628 0.000000 0.000000e+00 -0.001337 0.000000e+00 6125.118931 ... 5134.783503 0.000000e+00 67.301695 1.188564 67.615074 3.943941 67.339534 1.421085e-14 59.227456 7.105427e-15
1 Ridge 0.553573 0.015243 0.516564 0.064871 0.551601 1.110223e-16 0.333235 0.000000e+00 2733.434135 ... 3419.120423 0.000000e+00 42.664323 0.836894 43.975160 3.176279 42.767356 0.000000e+00 46.036471 0.000000e+00
2 Decision Tree 0.705049 0.025423 0.227255 0.138329 0.663131 2.495672e-03 -0.061039 1.694207e-02 1806.214920 ... 5440.928933 8.687768e+01 29.342648 1.290889 52.845684 4.971056 31.059490 3.552714e-15 57.152247 7.021638e-01
3 KNN 0.465824 0.017405 0.436415 0.061569 0.485286 5.551115e-17 0.315831 5.551115e-17 3270.483891 ... 3508.367072 4.547474e-13 48.434238 0.888525 49.248325 2.949155 47.282775 0.000000e+00 47.117041 7.105427e-15

4 rows × 25 columns
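A quick way to judge the effect of the optimization is to compare the Avg R2 (Test) column of this table with the one from the default run. The sketch below copies those values for the three models present in both tables; since each number comes from a single run, small differences should not be over-interpreted.

```python
# Avg R2 (Test) from the default run (first table) and the optimized run (table above).
default_r2 = {"Ridge": 0.340980, "Decision Tree": -0.452581, "KNN": 0.172488}
optimized_r2 = {"Ridge": 0.333235, "Decision Tree": -0.061039, "KNN": 0.315831}

# Positive change means the optimized model scored higher on the test split.
changes = {name: optimized_r2[name] - default_r2[name] for name in default_r2}
improved = [name for name, delta in changes.items() if delta > 0]

for name, delta in changes.items():
    print(f"{name:15s} {default_r2[name]:9.6f} -> {optimized_r2[name]:9.6f} ({delta:+.6f})")
```

With these values, the Decision Tree and KNN scores rise noticeably, while Ridge stays roughly where it was.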

Now that we have optimized the hyperparameters of the models, we can check which hyperparameters were chosen for each. This is done by checking the best_params attribute.

[14]:
hh_models.best_params
[14]:
[('Dummy', {'strategy': 'mean'}),
 ('Ridge', {'alpha': 0.03727593720314938}),
 ('Decision Tree', {'criterion': 'absolute_error', 'max_depth': 5}),
 ('KNN', {'metric': 'cosine', 'n_neighbors': 72})]
[15]:
hh.plot_batch_results(df_results, metric_name="MSE", title="Diabetes Dataset", display=False)
../_images/examples_example_regression_31_0.png

Optimization of hyperparameters with custom parameter grids

For this example, we will use four simple regression models with grids of hyperparameters.

These grids contain all the parameter values that the grid search optimization will use.

[16]:
models_to_check = [
    ("Dummy", DummyRegressor(), {"strategy": ["median"]}),
    ("Ridge", Ridge(random_state=0), {"alpha": [1e-4, 1e-3, 1e-2, 0.1, 1, 10]}),
    (
        "Decision Tree",
        DecisionTreeRegressor(random_state=0),
        {
            "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
            "max_depth": [1, 2, 3, 4, 5, 6, 7],
        },
    ),
    (
        "KNN",
        KNeighborsRegressor(),
        {"n_neighbors": [1, 3, 5, 7, 9, 11], "metric": ["manhattan", "euclidean"]},
    ),
]
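Assuming the grid search tries every combination of the listed values, the number of candidate configurations per model is the product of the list lengths. The sketch below enumerates the combinations of the grids defined above with itertools.product; expand is a hypothetical helper written for this illustration.

```python
from itertools import product

# The same grids as in the cell above, keyed by model name.
grids = {
    "Dummy": {"strategy": ["median"]},
    "Ridge": {"alpha": [1e-4, 1e-3, 1e-2, 0.1, 1, 10]},
    "Decision Tree": {
        "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
        "max_depth": [1, 2, 3, 4, 5, 6, 7],
    },
    "KNN": {"n_neighbors": [1, 3, 5, 7, 9, 11], "metric": ["manhattan", "euclidean"]},
}


def expand(grid):
    """Return every combination of the grid as a list of parameter dictionaries."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


counts = {name: len(expand(grid)) for name, grid in grids.items()}
print(counts)
```

The Decision Tree grid is the largest here, with 4 criteria times 7 depths.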

We can now pass these models to the HundredHammersRegressor class.

[17]:
hh_models = hh.HundredHammersRegressor(models=models_to_check, show_progress_bar=True)

Since we want to optimize the hyperparameters of our models, we set optim_hyper to True.

We don’t need to set the n_grid_points parameter, since we have already chosen which values to try in the GridSearch step.

[18]:
df_results = hh_models.evaluate(X, y, optim_hyper=True)
Evaluating models...: 100%|██████████| 4/4 [00:00<00:00,  6.63it/s]
[19]:
df_results
[19]:
Model Avg R2 (Validation Train) Std R2 (Validation Train) Avg R2 (Validation Test) Std R2 (Validation Test) Avg R2 (Train) Std R2 (Train) Avg R2 (Test) Std R2 (Test) Avg MSE (Validation Train) ... Avg MSE (Test) Std MSE (Test) Avg MAE (Validation Train) Std MAE (Validation Train) Avg MAE (Validation Test) Std MAE (Validation Test) Avg MAE (Train) Std MAE (Train) Avg MAE (Test) Std MAE (Test)
0 Dummy -0.027476 0.007274 -0.050684 0.071611 -0.025922 0.000000e+00 -0.045202 0.0 6293.183322 ... 5359.719101 9.094947e-13 66.517990 1.217067 66.993618 4.993507 66.566572 0.0 59.044944 0.000000e+00
1 Ridge 0.555347 0.015218 0.515900 0.066570 0.553033 1.110223e-16 0.329983 0.0 2722.561310 ... 3435.796416 4.547474e-13 42.494128 0.829984 43.926355 3.180674 42.664912 0.0 46.170390 0.000000e+00
2 Decision Tree 0.473673 0.020150 0.373744 0.120035 0.477605 5.551115e-17 0.020330 0.0 3222.626891 ... 5023.676966 0.000000e+00 43.848744 0.862175 47.866137 4.431692 44.014164 0.0 54.353933 7.105427e-15
3 KNN 0.554573 0.020198 0.459955 0.074056 0.548160 0.000000e+00 0.284378 0.0 2727.046694 ... 3669.657350 9.094947e-13 41.663537 1.146837 45.521591 3.630353 41.737832 0.0 46.806946 0.000000e+00

4 rows × 25 columns

Now that we have optimized the hyperparameters of the models, we can check which hyperparameters were chosen for each. This is done by checking the best_params attribute.

[20]:
hh_models.best_params
[20]:
[('Dummy', {'strategy': 'median'}),
 ('Ridge', {'alpha': 0.01}),
 ('Decision Tree', {'criterion': 'absolute_error', 'max_depth': 2}),
 ('KNN', {'metric': 'euclidean', 'n_neighbors': 11})]
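Since best_params is shown as a list of (name, parameters) pairs, it can be turned into a dictionary for lookup by model name. The sketch below copies the output above into a literal so that it runs on its own.

```python
# The output above, copied as a literal list of (name, parameters) pairs.
best_params = [
    ("Dummy", {"strategy": "median"}),
    ("Ridge", {"alpha": 0.01}),
    ("Decision Tree", {"criterion": "absolute_error", "max_depth": 2}),
    ("KNN", {"metric": "euclidean", "n_neighbors": 11}),
]

# A list of 2-tuples converts directly into a dictionary keyed by model name.
params_by_name = dict(best_params)
knn_params = params_by_name["KNN"]
print(knn_params)
```

With scikit-learn installed, such a dictionary can be unpacked into the matching estimator, for example KNeighborsRegressor(**knn_params), to rebuild a model with the selected hyperparameters.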

We can also show the plots like last time.

[21]:
hh.plot_batch_results(df_results, metric_name="MSE", title="Diabetes Dataset", display=False)
../_images/examples_example_regression_43_0.png