{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using custom models" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-09-22T19:38:05.121869200Z", "start_time": "2023-09-22T19:38:03.965758800Z" } }, "outputs": [], "source": [ "import logging\n", "from sklearn.datasets import load_diabetes\n", "from sklearn.metrics import mean_squared_error\n", "import matplotlib.pyplot as plt\n", "from pprint import pprint\n", "\n", "import hundred_hammers as hh\n", "from hundred_hammers.model_zoo import DummyRegressor, Ridge, DecisionTreeRegressor\n", "\n", "import warnings\n", "from sklearn.exceptions import ConvergenceWarning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's say you have discovered the best machine learning model ever, the \"free lunch\" everyone seems to be talking about (see the no-free-lunch theorem for machine learning for more information). \n", "\n", "If you want to compare it to other models, it should be implemented as a scikit-learn-style model that provides at least the methods ```fit```, ```predict``` and ```set_params```.\n", "\n", "We have not found such a model yet, so we will use a modified version of a Ridge model as an example." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-09-22T19:38:05.126116100Z", "start_time": "2023-09-22T19:38:05.123869300Z" } }, "outputs": [], "source": [ "class InverseRidge(Ridge):\n", "    \"\"\"\n", "    A newly defined machine learning model.\n", "    Ridge regressor with alpha defined as 1/c.\n", "    \"\"\"\n", "\n", "    def __init__(self, c=1):\n", "        # It is important to add the parameter as an attribute of the class\n", "        self.c = c\n", "\n", "        super().__init__(alpha=1 / c)\n", "\n", "    # fit and predict are inherited from Ridge; they are overridden here only to show how to implement them.\n", "\n", "    def fit(self, X, y):\n", "        # Returns the trained model.\n", "        return super().fit(X, y)\n", "\n", "    def predict(self, X):\n", "        # Returns an array.\n", "        return super().predict(X)\n", "\n", "    # BaseEstimator already provides set_params; it is overridden here so that alpha stays in sync with c.\n", "    def set_params(self, **params):\n", "        super().set_params(**params)\n", "        # Recompute alpha, since it is otherwise only derived from c in __init__.\n", "        self.alpha = 1 / self.c\n", "        # Returns the configured model.\n", "        return self" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This new example model has a single real-valued hyperparameter named \"c\", which must be greater than 0.\n", "\n", "If we want grid search to automatically generate a grid of parameters for this new model, its definition must be added to the knowledge base of this library.\n", "\n", "The hyperparameter definitions of all registered models are stored in the list ```known_hyperparams```, which holds one dictionary per model.\n", "\n", "Here are a few examples:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2023-09-22T19:38:05.130382100Z", "start_time": "2023-09-22T19:38:05.126116100Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'model': 'SGDClassifier',\n", " 'penalty': {'type': 'categorical',\n", " 'values': ['l1', 'l2', 'elasticnet', 'None']},\n", " 'alpha': {'type': 'real', 'min': 0, 'max': 100},\n", " 'l1_ratio': {'type': 'real', 
'min': 0, 'max': 1}}\n", "\n", "{'model': 'Perceptron',\n", " 'penalty': {'type': 'categorical',\n", " 'values': ['l1', 'l2', 'elasticnet', 'None']},\n", " 'alpha': {'type': 'real', 'min': 0, 'max': 100},\n", " 'l1_ratio': {'type': 'real', 'min': 0, 'max': 1}}\n", "\n", "{'model': 'PassiveAggressiveClassifier',\n", " 'C': {'type': 'real', 'min': 0, 'max': 100}}\n", "\n" ] } ], "source": [ "# Take a sample of model definitions\n", "hyper_defs = hh.known_hyperparams[11:14]\n", "\n", "# Pretty print each of them\n", "for hyper_def in hyper_defs:\n", "    pprint(hyper_def, sort_dicts=False)\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To register our model in this structure, we must provide a dictionary with the name of the class that implements the model and an entry for each of its hyperparameters, indicating whether it is a real number, an integer or a categorical value (a parameter that can take only a given set of values).\n", "\n", "We must specify the bounds of each parameter by setting a minimum and a maximum value, except for categorical parameters, for which we specify the allowed values instead." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2023-09-22T19:38:05.134306400Z", "start_time": "2023-09-22T19:38:05.130382100Z" } }, "outputs": [], "source": [ "# Define the hyperparameters of the model\n", "inverse_ridge_def = {\n", "    \"model\": \"InverseRidge\",\n", "    \"c\": {\"type\": \"real\", \"min\": 1e-10, \"max\": 100},\n", "}\n", "\n", "# Add them to the knowledge base\n", "hh.add_known_model_def(inverse_ridge_def)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can verify that it has been added to the structure." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2023-09-22T19:38:05.137973200Z", "start_time": "2023-09-22T19:38:05.134306400Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'model': 'InverseRidge',\n", " 'c': {'type': 'real', 'min': 1e-10, 'max': 100}}\n" ] } ], "source": [ "# Take the last registered model\n", "new_model_def = hh.known_hyperparams[-1]\n", "\n", "# Pretty print it\n", "pprint(new_model_def, sort_dicts=False, width=60)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have added our new model, we can use it like any other machine learning model. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2023-09-22T19:38:09.292531500Z", "start_time": "2023-09-22T19:38:05.137973200Z" } }, "outputs": [ { "data": { "text/html": [ "
|  | Model | Avg R2 (Validation Train) | Std R2 (Validation Train) | Avg R2 (Validation Test) | Std R2 (Validation Test) | Avg R2 (Train) | Std R2 (Train) | Avg R2 (Test) | Std R2 (Test) | Avg MSE (Validation Train) | ... | Avg MSE (Test) | Std MSE (Test) | Avg MAE (Validation Train) | Std MAE (Validation Train) | Avg MAE (Validation Test) | Std MAE (Validation Test) | Avg MAE (Train) | Std MAE (Train) | Avg MAE (Test) | Std MAE (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Dummy Regressor | 0.000000 | 0.000000 | -0.023206 | 0.033628 | 0.000000 | 0.000000e+00 | -0.001337 | 0.000000e+00 | 6125.118931 | ... | 5134.783503 | 0.000000e+00 | 67.301695 | 1.188564 | 67.615074 | 3.943941 | 67.339534 | 1.421085e-14 | 59.227456 | 7.105427e-15 |
| 1 | Inverse Ridge | 0.445599 | 0.013815 | 0.419461 | 0.046400 | 0.465084 | 0.000000e+00 | 0.340980 | 5.551115e-17 | 3394.642031 | ... | 3379.406308 | 0.000000e+00 | 49.381703 | 0.756940 | 50.079365 | 3.114845 | 48.381085 | 7.105427e-15 | 46.566795 | 0.000000e+00 |
| 2 | Decision Tree | 0.335350 | 0.016310 | 0.221059 | 0.096463 | 0.326202 | 5.551115e-17 | 0.002965 | 0.000000e+00 | 4070.500515 | ... | 5112.723322 | 9.094947e-13 | 52.458085 | 0.997956 | 55.908937 | 3.478008 | 52.784358 | 0.000000e+00 | 57.041851 | 0.000000e+00 |
3 rows × 25 columns
\n", "