Introduction

Google Cloud’s AutoML Tables lets you automatically build and deploy state-of-the-art machine learning models using your own structured data.

AutoML Tables now has an easier-to-use, Tables-specific Python client library, as well as a new ability to explain online prediction results. This capability, called local feature importance, gives visibility into how the features in a specific prediction request informed the resulting prediction. You can read more about explainable AI for Tables in this blog post.

The source for this post is a Jupyter notebook. In this notebook, we'll create a custom Tables model to predict the duration of London bike rentals, given information about local weather as well as information about the rental trip. We'll walk through examples of using the Tables client libraries to create a dataset, train a custom model, deploy the model, and use it to make predictions, and we'll show how you can programmatically request local feature importance information.

We recommend running this notebook using AI Platform Notebooks. It's possible to run the notebook on Colab (or locally), but you'll need to do a bit more setup. See the Appendix section of this notebook for details.

Before you begin

Follow the AutoML Tables documentation to:

Select or create a GCP project.
Make sure that billing is enabled for your project.
Enable the Cloud AutoML and Storage APIs.
(Recommended) Create an AI Platform Notebook instance and upload this notebook to it.

(See also the Quickstart guide for a getting-started walkthrough on AutoML Tables.)

Then, install the AutoML Python client libraries into your notebook environment:

!pip3 install -U google-cloud-automl

You may need to restart your notebook kernel after running the above to pick up the installation.

Enter your GCP project ID in the cell below, then run the cell.

PROJECT_ID = "<your-project-id>"

Do some imports

Next, import some libraries and set some variables.

import argparse
import os
from google.api_core.client_options import ClientOptions
from google.cloud import automl_v1beta1 as automl
import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types

REGION = 'us-central1'
# Display name of the Tables dataset we'll create.
DATASET_NAME = 'bikes_weather'
# Source BigQuery table to import.
BIGQUERY_PROJECT_ID = 'aju-dev-demos'
DATASET_ID = 'london_bikes_weather'
TABLE_ID = 'bikes_weather'
IMPORT_URI = 'bq://%s.%s.%s' % (BIGQUERY_PROJECT_ID, DATASET_ID, TABLE_ID)
print(IMPORT_URI)
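
The import URI above follows the bq://project.dataset.table pattern. As a small sketch, the helpers below build such a URI and split it back into its parts, which can help catch a mistyped ID before starting a long import. The names build_bq_uri and parse_bq_uri are our own (they are not part of the client library), and the parsing assumes that none of the three IDs contains a dot.

```python
def build_bq_uri(project_id, dataset_id, table_id):
    """Build a bq:// URI from its three parts (hypothetical helper)."""
    parts = (project_id, dataset_id, table_id)
    if not all(parts):
        raise ValueError("project, dataset and table IDs must all be non-empty")
    return "bq://{}.{}.{}".format(*parts)


def parse_bq_uri(uri):
    """Split a bq:// URI into (project, dataset, table).

    Assumes none of the IDs contains a dot.
    """
    prefix = "bq://"
    if not uri.startswith(prefix):
        raise ValueError("expected a URI starting with 'bq://': {!r}".format(uri))
    parts = uri[len(prefix):].split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected bq://project.dataset.table: {!r}".format(uri))
    return tuple(parts)


uri = build_bq_uri("aju-dev-demos", "london_bikes_weather", "bikes_weather")
print(uri)
print(parse_bq_uri(uri))
```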

Create a dataset, and import data

Next, we'll define some utility functions to create a dataset, and to import data into a dataset. The client.import_data() call returns an operation future that can be used to check for completion synchronously or asynchronously. In this case, we wait synchronously.

def create_dataset(client, dataset_display_name):
    """Create a dataset."""

    # Create a dataset with the given display name
    dataset = client.create_dataset(dataset_display_name)

    # Display the dataset information.
    print("Dataset name: {}".format(dataset.name))
    print("Dataset id: {}".format(dataset.name.split("/")[-1]))
    print("Dataset display name: {}".format(dataset.display_name))
    print("Dataset metadata:")
    print("\t{}".format(dataset.tables_dataset_metadata))
    print("Dataset example count: {}".format(dataset.example_count))
    print("Dataset create time:")
    print("\tseconds: {}".format(dataset.create_time.seconds))
    print("\tnanos: {}".format(dataset.create_time.nanos))

    return dataset

def import_data(client, dataset_display_name, path):
    """Import structured data."""
 
    response = None
    if path.startswith('bq'):
        response = client.import_data(
            dataset_display_name=dataset_display_name, bigquery_input_uri=path
        )
    else:
        # Get the multiple Google Cloud Storage URIs.
        input_uris = path.split(",")
        response = client.import_data(
            dataset_display_name=dataset_display_name,
            gcs_input_uris=input_uris
        )

    print("Processing import...")
    # synchronous check of operation status.
    print("Data imported. {}".format(response.result()))
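
To make the synchronous-versus-asynchronous distinction concrete without calling the API, here is a minimal sketch that uses Python's standard concurrent.futures as a stand-in for the operation future. It shows the same basic pattern: block on result(), poll with done(), or register a callback. The slow_import function is a placeholder, and the exact methods available on the AutoML operation object should be checked against the client library documentation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_import():
    """Stand-in for a long-running import job (not a real AutoML call)."""
    time.sleep(0.2)
    return "import finished"


callback_results = []

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_import)

    # Asynchronous style: register a callback that runs on completion.
    future.add_done_callback(lambda f: callback_results.append(f.result()))

    # Polling style: check the state without blocking.
    print("Done right after submit?", future.done())

    # Synchronous style: block until the job finishes or the timeout expires.
    outcome = future.result(timeout=5)

# Leaving the with-block waits for the worker thread, so the callback has run.
print("Blocking wait returned:", outcome)
print("Callback received:", callback_results)
```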

Next, we'll create the client object that we'll use for all our operations.

client = automl.TablesClient(project=PROJECT_ID, region=REGION)

Create the Tables dataset:

create_dataset(client, DATASET_NAME)

... and then import data from the BigQuery table into the dataset. The import command will take a while to run. Wait until it has returned before proceeding. You can also check import status in the Cloud Console.

(Note that if you run this notebook multiple times, you will get an error if you try to create multiple datasets with the same name. However, you can train multiple models against the same dataset.)

import_data(client, DATASET_NAME, IMPORT_URI)

Update the dataset schema

Now we'll define utility functions to update dataset and column information. We need these to set the dataset's target column (the field we'll train our model to predict) and to change the types of some of the columns. AutoML Tables is pretty good at inferring reasonable column types based on input, but in our case, there are some columns (like bike station IDs) that we want to treat as Categorical instead of Numeric.

def update_column_spec(client,
                       dataset_display_name,
                       column_spec_display_name,
                       type_code,
                       nullable=None):
    """Update column spec."""

    response = client.update_column_spec(
        dataset_display_name=dataset_display_name,
        column_spec_display_name=column_spec_display_name,
        type_code=type_code, nullable=nullable
    )

    # update_column_spec() returns the updated column spec directly.
    print("Column spec updated. {}".format(response))

def update_dataset(client,
                   dataset_display_name,
                   target_column_spec_name=None,
                   time_column_spec_name=None,
                   test_train_column_spec_name=None):
    """Update dataset."""

    if target_column_spec_name is not None:
        response = client.set_target_column(
            dataset_display_name=dataset_display_name,
            column_spec_display_name=target_column_spec_name
        )
        print("Target column updated. {}".format(response))
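    if time_column_spec_name is not None:
        response = client.set_time_column(
            dataset_display_name=dataset_display_name,
            column_spec_display_name=time_column_spec_name
        )
        print("Time column updated. {}".format(response))
    if test_train_column_spec_name is not None:
        response = client.set_test_train_column(
            dataset_display_name=dataset_display_name,
            column_spec_display_name=test_train_column_spec_name
        )
        print("Test/train column updated. {}".format(response))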

def list_column_specs(client,
                      dataset_display_name,
                      filter_=None):
    """List all column specs."""
    result = []

    # List the column specs in the dataset, optionally applying a filter.
    response = client.list_column_specs(
        dataset_display_name=dataset_display_name, filter_=filter_)

    print("List of column specs:")
    for column_spec in response:
        # Display the column_spec information.
        print("Column spec name: {}".format(column_spec.name))
        print("Column spec id: {}".format(column_spec.name.split("/")[-1]))
        print("Column spec display name: {}".format(column_spec.display_name))
        print("Column spec data type: {}".format(column_spec.data_type))

        result.append(column_spec)

    return result

Update the dataset to indicate that the target column is duration.

update_dataset(client, DATASET_NAME,
               target_column_spec_name='duration',
               # time_column_spec_name='ts'
               )

Now we'll update some of the column types. You can list their default specs first if you like:

list_column_specs(client, DATASET_NAME)

... and now we'll update them to the types we want:

update_column_spec(client, DATASET_NAME,
                   'end_station_id',
                   'CATEGORY')
update_column_spec(client, DATASET_NAME,
                   'start_station_id',
                   'CATEGORY')
update_column_spec(client, DATASET_NAME,
                   'loc_cross',
                   'CATEGORY')
update_column_spec(client, DATASET_NAME,
                   'bike_id',
                   'CATEGORY')
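
The four calls above differ only in the column name, so they can also be driven from a list. The sketch below shows one way to do that. apply_type_overrides is a hypothetical helper of our own, and the recording stub stands in for the real update function so that the example runs without a Tables client.

```python
def apply_type_overrides(update_fn, column_names, type_code):
    """Call update_fn(column_name, type_code) once per column.

    update_fn is any callable taking (column_name, type_code).
    Returns the list of column names that were processed.
    """
    updated = []
    for name in column_names:
        update_fn(name, type_code)
        updated.append(name)
    return updated


# Columns from this post that we want treated as Categorical.
CATEGORICAL_COLUMNS = ['end_station_id', 'start_station_id', 'loc_cross', 'bike_id']

# Recording stub: collects the calls instead of contacting the API.
calls = []
apply_type_overrides(lambda name, code: calls.append((name, code)),
                     CATEGORICAL_COLUMNS, 'CATEGORY')
print(calls)
```

In the notebook, update_fn could be something like lambda name, code: update_column_spec(client, DATASET_NAME, name, code).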

You can view the results in the Cloud Console. Note that useful stats are generated for each column. You can also run the list_column_specs() function again to see the new config.

# list_column_specs(client, DATASET_NAME)

Train a custom model on the dataset

Now we're ready to train a model on the dataset. We'll need to generate a unique name for the model, which we'll do by appending a timestamp, in case you want to run this notebook multiple times. The 1000 argument in the create_model() call sets the training budget in milli node hours, so 1000 corresponds to 1 node hour of training.

In the create_model() utility function below, we may not want to block on the result, since total job time can be multiple hours. If you want the function to block until training is complete, uncomment the last line of the function below.

import time
MODEL_NAME = 'bwmodel_' + str(int(time.time()))
print('MODEL_NAME: %s' % MODEL_NAME)

def create_model(client,
                 dataset_display_name,
                 model_display_name,
                 train_budget_milli_node_hours,
                 include_column_spec_names=None,
                 exclude_column_spec_names=None):
    """Create a model."""
 
    # Create a model with the model metadata in the region.
    response = client.create_model(
        model_display_name,
        train_budget_milli_node_hours=train_budget_milli_node_hours,
        dataset_display_name=dataset_display_name,
        include_column_spec_names=include_column_spec_names,
        exclude_column_spec_names=exclude_column_spec_names,
    )

    print("Training model...")
    print("Training operation: {}".format(response.operation))
    print("Training operation name: {}".format(response.operation.name))
    # uncomment the following to block until training is finished.
    # print("Training completed: {}".format(response.result()))

create_model(client, DATASET_NAME, MODEL_NAME, 1000)
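
Because the training budget is expressed in milli node hours, it is easy to be off by a factor of 1000. The following sketch converts a budget in node hours to the value passed to create_model(), and also shows the timestamp-based naming used for MODEL_NAME. Both helper names are our own, for illustration only.

```python
import time

MILLI_PER_NODE_HOUR = 1000


def node_hours_to_milli(node_hours):
    """Convert a budget in node hours to milli node hours."""
    if node_hours <= 0:
        raise ValueError("training budget must be positive")
    return int(round(node_hours * MILLI_PER_NODE_HOUR))


def unique_model_name(prefix, now=None):
    """Append a whole-second Unix timestamp to prefix."""
    timestamp = int(time.time() if now is None else now)
    return "{}_{}".format(prefix, timestamp)


print(node_hours_to_milli(1))       # budget used in this post
print(node_hours_to_milli(2.5))
print(unique_model_name("bwmodel"))
```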

Get the status of your training job

Edit the following call to set OP_NAME to the "training operation name" listed in the output of create_model() above.

OP_NAME = 'YOUR TRAINING OPERATION NAME'

def get_operation_status(client, operation_full_id):
    """Get operation status."""
    from google.cloud.automl_v1beta1 import types

    # Get the latest state of a long-running operation.
    op = client.auto_ml_client.transport._operations_client.get_operation(
        operation_full_id
    )
    print("Operation status: {}".format(op))

    # Decode the packed operation metadata. ParseFromString() fills msg in
    # place, so print msg itself rather than the call's return value.
    msg = types.OperationMetadata()
    msg.ParseFromString(op.metadata.value)
    print(msg)
    return op

The training job may take several hours. You can check on its status in the Cloud Console UI. You can also monitor it via the get_operation_status() call below. (Make sure you've edited the OP_NAME variable value above.) You'll see done: true in the output when it's finished.

(Note: if you lose your notebook kernel context while the training job is running, you can continue the rest of the notebook later with a new kernel; just make note of the MODEL_NAME. You can find that information in the Cloud Console as well.)

res = get_operation_status(client, OP_NAME)
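
Rather than re-running the status cell by hand, you could poll on a timer. The sketch below is a generic polling loop, not part of the client library: wait_until_done is a hypothetical helper, and check_done is any callable that returns True once the operation reports done: true. A fake status source is used here so the example runs on its own.

```python
import time


def wait_until_done(check_done, interval_seconds=60.0,
                    timeout_seconds=4 * 3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll check_done() until it returns True; return the number of polls.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = clock() + timeout_seconds
    polls = 0
    while True:
        polls += 1
        if check_done():
            return polls
        if clock() >= deadline:
            raise TimeoutError("operation not done after {} polls".format(polls))
        sleep(interval_seconds)


# Stand-in status source: reports "not done" twice, then "done".
fake_states = iter([False, False, True])
polls_needed = wait_until_done(lambda: next(fake_states),
                               interval_seconds=0.0, timeout_seconds=5.0)
print("Finished after {} polls".format(polls_needed))
```

For a multi-hour training job, an interval of a minute or more keeps the number of status requests small.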

Get information about your trained custom model

Once it has been created, you can get information about a specific model. (While the training job is still running, you'll just get a not found message.)

from google.cloud.automl_v1beta1 import enums
from google.api_core import exceptions

def get_model(client, model_display_name):
    """Get model details."""

    try:
        # Get complete detail of the model.
        model = client.get_model(model_display_name=model_display_name)
    except exceptions.NotFound:
        print("Model %s not found." % model_display_name)
        return (None, None)

    # Retrieve deployment state.
    if model.deployment_state == enums.Model.DeploymentState.DEPLOYED:
        deployment_state = "deployed"
    else:
        deployment_state = "undeployed"

    # get features of top global importance
    feat_list = [
        (column.feature_importance, column.column_display_name)
        for column in model.tables_model_metadata.tables_model_column_info
    ]
    feat_list.sort(reverse=True)
    feat_to_show = min(len(feat_list), 10)

    # Display the model information.
    print("Model name: {}".format(model.name))
    print("Model id: {}".format(model.name.split("/")[-1]))
    print("Model display name: {}".format(model.display_name))
    print("Features of top importance:")
    for feat in feat_list[:feat_to_show]:
        print(feat)
    print("Model create time:")
    print("\tseconds: {}".format(model.create_time.seconds))
    print("\tnanos: {}".format(model.create_time.nanos))
    print("Model deployment state: {}".format(deployment_state))

    return (model, feat_list)

Don't proceed with the rest of the notebook until the model has finished training and the following get_model() call returns model information rather than 'not found'.

Once the training job has finished, we can get information about the model, including information about which input features proved to be the most important globally (that is, across the full training dataset).

(model, global_feat_importance) = get_model(client, MODEL_NAME)

We can graph the global feature importance values to get a visualization of which inputs were most important in training the model. (The Cloud Console UI also displays such a graph.)

print(global_feat_importance)

import matplotlib.pyplot as plt

res = list(zip(*global_feat_importance))
x = list(res[0])
y = list(res[1])

y_pos = list(range(len(y)))
plt.barh(y_pos, x, alpha=0.5)
plt.yticks(y_pos, y)
plt.show()
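
If matplotlib isn't available, the same (importance, name) tuples can be ranked and shown as a text chart. This is a minimal sketch with made-up importance values; the real values come from get_model(), and both helper names are our own.

```python
def top_features(feat_list, n=10):
    """Return the n (importance, name) tuples with the highest importance."""
    return sorted(feat_list, reverse=True)[:n]


def text_bar_chart(feat_list, width=40):
    """Render (importance, name) tuples as lines of '#' bars."""
    if not feat_list:
        return []
    largest = max(importance for importance, _ in feat_list)
    lines = []
    for importance, name in feat_list:
        # Scale each bar relative to the largest importance value.
        length = int(round(width * importance / largest)) if largest > 0 else 0
        lines.append("{:<18} {:<{w}} {:.3f}".format(
            name, "#" * length, importance, w=width))
    return lines


# Illustrative values only, not results from the trained model.
sample = [(0.10, "temp"), (0.45, "euclidean"), (0.25, "start_station_id")]
ranked = top_features(sample, n=2)
for line in text_bar_chart(ranked):
    print(line)
```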

See your model's evaluation metrics

We can also get model evaluation information once the model is trained. The available metrics depend upon which optimization objective you used. In this example, we used the default, RMSE.

evals = client.list_model_evaluations(model_display_name=MODEL_NAME)
list(evals)[1].regression_evaluation_metrics
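
For reference, RMSE (root mean squared error) is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it, along with mean absolute error for comparison, on a few made-up durations; these numbers are illustrative and are not output from the model.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error of two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up actual and predicted rental durations.
actual = [1200, 600, 900]
predicted = [1100, 700, 900]
print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
```

Because the errors are squared before averaging, RMSE penalizes a few large misses more heavily than MAE does.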

Use your trained model to make predictions and see explanations of the results

Deploy your model and get predictions + explanations

Once your training job has finished, you can use your model to make predictions.

With online prediction, you can now request explanations of the results, in the form of local feature importance calculations on the inputs. Local feature importance gives you visibility into how the features in a specific prediction request informed the resulting prediction.

To get online predictions, we first need to deploy the model.

Note: see the documentation for other prediction options, including the ability to export your custom model and run it in a container anywhere.

def deploy_model(client, model_display_name):
    """Deploy model."""

    response = client.deploy_model(model_display_name=model_display_name)
    # synchronous check of operation status.
    print("Model deployed. {}".format(response.result()))

It will take a while to deploy the model. Wait for the deploy_model() call to finish before proceeding with the rest of the notebook cells. You can track status in the Console UI as well.

deploy_model(client, MODEL_NAME)

Once the model is deployed, you can access it via the UI or the API to make online prediction requests. These can include a request for local feature importance calculations on the inputs, a newly launched feature.

def predict(client,
            model_display_name,
            inputs,
            feature_importance=False):
    """Make a prediction."""

    # Pass the flag straight through to client.predict().
    response = client.predict(
        model_display_name=model_display_name,
        inputs=inputs,
        feature_importance=feature_importance,
    )
    print("Prediction results:")
    print(response)
    return response

inputs =  {
      "bike_id": "5373",
      "day_of_week": "3",
      "end_latitude": 51.52059681,
      "end_longitude": -0.116688468,
      "end_station_id": "68",
      "euclidean": 3589.5146210024977,
      "loc_cross": "POINT(-0.07 51.52)POINT(-0.12 51.52)",
      "max": 44.6,
      "min": 34.0,
      "prcp": 0,
      "ts": "1480407420",
      "start_latitude": 51.52388,
      "start_longitude": -0.065076,
      "start_station_id": "445",
      "temp": 38.2,
      "dewp": 28.6
    }
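
Before sending a request, it can help to compare the keys of the inputs dict with the columns the model was trained on. The sketch below does only that comparison. check_inputs is a hypothetical helper, the expected column set is copied from the example inputs in this post, and whether a missing column is actually an error depends on that column's nullability.

```python
# Column names taken from the example prediction inputs in this post.
EXPECTED_COLUMNS = {
    "bike_id", "day_of_week", "end_latitude", "end_longitude",
    "end_station_id", "euclidean", "loc_cross", "max", "min", "prcp",
    "ts", "start_latitude", "start_longitude", "start_station_id",
    "temp", "dewp",
}


def check_inputs(inputs, expected=EXPECTED_COLUMNS):
    """Return a list of human-readable differences (empty if keys match)."""
    keys = set(inputs)
    problems = []
    missing = sorted(expected - keys)
    unexpected = sorted(keys - expected)
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    if unexpected:
        problems.append("unexpected columns: " + ", ".join(unexpected))
    return problems


# A request with one column dropped and one misspelled column added.
candidate = dict.fromkeys(EXPECTED_COLUMNS - {"temp"}, 0)
candidate["tmep"] = 38.2
for problem in check_inputs(candidate):
    print(problem)
```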

Try running the prediction request first without, then with, the local feature importance calculations, to see the difference in the information that is returned. (The actual duration, the value we're predicting, is 1200.)

predict(client, MODEL_NAME, inputs, feature_importance=False)

response = predict(client, MODEL_NAME, inputs, feature_importance=True)

We can plot the local feature importance values to get a visualization of which fields were most and least important for this particular prediction.

import matplotlib.pyplot as plt

col_info = response.payload[0].tables.tables_model_column_info
x = []
y = []
for c in col_info:
    y.append(c.column_display_name)
    x.append(c.feature_importance)
y_pos = list(range(len(y)))
plt.barh(y_pos, x, alpha=0.5)
plt.yticks(y_pos, y)
plt.show()

You can see a similar graphic in the Cloud Console Tables UI when you submit an ONLINE PREDICTION and tick the "Generate feature importance" checkbox.

The local feature importance calculations are specific to a given input instance.
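
When reading local values for a single instance, it is often more useful to rank features by how strongly they moved the prediction than by their order in the response. The sketch below assumes the values are signed contributions (an assumption on our part; check the explainability documentation) and sorts them by magnitude. The numbers are invented for illustration and rank_by_magnitude is our own helper name.

```python
def rank_by_magnitude(importances):
    """Sort {feature_name: value} pairs by absolute value, largest first."""
    return sorted(importances.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Invented local feature importance values for one prediction request.
local_importance = {
    "euclidean": 310.0,
    "temp": -12.5,
    "start_station_id": -95.0,
}

for name, value in rank_by_magnitude(local_importance):
    direction = "raised" if value > 0 else "lowered"
    print("{:<18} {:>8.1f}  ({} the prediction)".format(name, value, direction))
```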

Summary

In this notebook, we showed how you can use the AutoML Tables client library to create datasets, train models, and get predictions from your trained model, and in particular, how you can get explanations of the results along with the predictions.

Appendix: running this notebook on Colab (or locally)

It's possible to run this example on Colab, but it takes a bit more setup. Do the following before you create the Tables client object or call the API.

Create a service account, give it the necessary roles (e.g., AutoML Admin), and download a JSON credentials file for the service account. Upload the credentials file to the Colab file system.

Then, edit the following to point to that file, and run the cell:

%env GOOGLE_APPLICATION_CREDENTIALS /content/your-credentials-file.json

Your Tables API calls should now be properly authenticated. If you lose the Colab runtime, you'll need to re-upload the file and set the environment variable again.

If you're running the notebook locally, point the GOOGLE_APPLICATION_CREDENTIALS environment variable to the service account credentials file before starting the notebook, e.g.:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your-credentials-file.json
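
A quick way to confirm the setup before creating the client is to check that the variable is set and points at an existing file. This is a minimal sketch using only the standard library. check_credentials_env is a hypothetical helper, and it does not verify that the file contains valid credentials.

```python
import os


def check_credentials_env(environ=None):
    """Return None if the credentials path looks usable, else a message."""
    environ = os.environ if environ is None else environ
    path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return "credentials file not found: {}".format(path)
    return None


problem = check_credentials_env()
print(problem or "Credentials file found.")
```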