Wednesday, November 18, 2020

BigQuery ML Classification Model - Predict Visitor Purchases with a Classification Model in BigQuery ML

For SQL and Model, refer: https://github.com/IamVigneshC/GCP-BigQueryML-ClassificationModel


• Use BigQuery to find public datasets

• Query and explore the ecommerce dataset

• Create a training and evaluation dataset to be used for batch prediction

• Create a classification (logistic regression) model in BQML

• Evaluate the performance of your machine learning model

• Predict and rank the probability that a visitor will make a purchase

Consider the subset of visitors who bought on their very first session and then came back and bought again. What are some of the reasons a typical ecommerce customer will browse but not buy until a later visit?

Although there is no one right answer, one popular reason is comparison shopping between different ecommerce sites before ultimately making a purchase decision. This is very common for luxury goods where significant up-front research and comparison is required by the customer before deciding (think car purchases) but also true to a lesser extent for the merchandise on this site (t-shirts, accessories, etc). In the world of online marketing, identifying and marketing to these future customers based on the characteristics of their first visit will increase conversion rates and reduce the outflow to competitor sites.

Create a Machine Learning model in BigQuery to predict whether or not a new user is likely to purchase in the future. Identifying these high-value users can help your marketing team target them with special promotions and ad campaigns to ensure a conversion while they comparison shop between visits to your ecommerce site.

The team decides to test whether these two fields are good inputs for your classification model:

• totals.bounces (whether the visitor left the website immediately)

• totals.timeOnSite (how long the visitor was on our website)

Machine learning is only as good as the training data that is fed into it. If there isn't enough information for the model to determine and learn the relationship between your input features and your label (in this case, whether the visitor bought in the future) then you will not have an accurate model. While training a model on just these two fields is a start, you will see if they're good enough to produce an accurate model.

The inputs are bounces and time_on_site; the label is will_buy_on_return_visit. bounces and time_on_site are known after a visitor's first session, but will_buy_on_return_visit is not known until later. Again, you're predicting for a subset of users who returned to your website and purchased. Since you don't know the future at prediction time, you cannot say with certainty whether a new visitor will come back and purchase. The value of building an ML model is to get the probability of a future purchase based on the data gleaned about their first session. It's often too early to tell before training and evaluating the model, but at first glance, of the top 10 sessions by time_on_site, only 1 customer returned to buy, which isn't very promising. Let's see how well the model does.
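For context, here is a minimal sketch of how the features and label can be assembled, and how the "top 10 by time_on_site" check above can be run. The exact query is in the repo linked above; the source table name `data-to-insights.ecommerce.web_analytics` is the one used in the lab and is an assumption here:

#standardSQL
-- Sketch: first-session features joined to a per-visitor label
SELECT
  * EXCEPT(fullVisitorId)
FROM (
  -- features known after the visitor's first session
  SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE totals.newVisits = 1
)
JOIN (
  -- label: 1 if the visitor transacted on a later (non-first) visit
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
USING (fullVisitorId)
ORDER BY time_on_site DESC
LIMIT 10;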

Create a BigQuery dataset to store models

Select a BQML model type and specify options

Since you are bucketing visitors into "will buy in future" or "won't buy in future", use logistic_reg in a classification model.

You cannot feed all of your available data to the model during training, since you need to save some unseen data points for model evaluation and testing. To accomplish this, add a WHERE clause condition to filter and train on only the first 9 months of session data in your 12-month dataset.
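A hedged sketch of the CREATE MODEL statement, assuming the models dataset is named ecommerce and using an illustrative 9-month date range (the exact statement and dates are in the repo):

#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['will_buy_on_return_visit']
) AS
SELECT
  * EXCEPT(fullVisitorId)
FROM (
  -- features from first sessions, restricted to the training window
  SELECT
    fullVisitorId,
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site
  FROM `data-to-insights.ecommerce.web_analytics`
  WHERE
    totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430'  -- first 9 months of the 12-month dataset (assumed range)
)
JOIN (
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
USING (fullVisitorId);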

After your model is trained, evaluate the performance of the model against new unseen evaluation data.

Evaluate classification model performance

Select your performance criteria:

For classification problems in ML, you want to minimize the False Positive Rate (predicting that the user will return and purchase when they don't) and maximize the True Positive Rate (predicting that the user will return and purchase when they do). In confusion-matrix terms, True Positive Rate = TP / (TP + FN) and False Positive Rate = FP / (FP + TN).

This relationship is visualized with a ROC (Receiver Operating Characteristic) curve like the one shown here, where you try to maximize the area under the curve or AUC:

Image of ROC

In BQML, roc_auc is simply a queryable field when evaluating your trained ML model.

Now that training is complete, you can evaluate how well the model performs with this query using ML.EVALUATE:
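A sketch of that evaluation query, assuming the same source table and an evaluation window immediately following the training months (exact dates are in the repo):

#standardSQL
SELECT
  roc_auc,
  -- rough human-readable bucketing of the score
  CASE
    WHEN roc_auc > .9 THEN 'good'
    WHEN roc_auc > .8 THEN 'fair'
    WHEN roc_auc > .7 THEN 'decent'
    WHEN roc_auc > .6 THEN 'not great'
    ELSE 'poor'
  END AS model_quality
FROM ML.EVALUATE(
  MODEL `ecommerce.classification_model`,
  (
    SELECT
      * EXCEPT(fullVisitorId)
    FROM (
      SELECT
        fullVisitorId,
        IFNULL(totals.bounces, 0) AS bounces,
        IFNULL(totals.timeOnSite, 0) AS time_on_site
      FROM `data-to-insights.ecommerce.web_analytics`
      WHERE
        totals.newVisits = 1
        AND date BETWEEN '20170501' AND '20170630'  -- held-out evaluation window (assumed range)
    )
    JOIN (
      SELECT
        fullVisitorId,
        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
          AS will_buy_on_return_visit
      FROM `data-to-insights.ecommerce.web_analytics`
      GROUP BY fullVisitorId
    )
    USING (fullVisitorId)
  )
);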

You should see the following result:

Row | roc_auc | model_quality
1 | 0.724588 | decent

After evaluating your model you get a roc_auc of 0.72, which shows the model has decent, but not great, predictive power. Since the goal is to get the area under the curve as close to 1.0 as possible, there is room for improvement.

Improve model performance with Feature Engineering

There are many more features in the dataset that may help the model better understand the relationship between a visitor's first session and the likelihood that they will purchase on a subsequent visit.

Add some new features and create a second machine learning model called classification_model_2:

• How far the visitor got in the checkout process on their first visit

• Where the visitor came from (traffic source: organic search, referring site etc..)

• Device category (mobile, tablet, desktop)

• Geographic information (country)

A key new feature that was added to the training dataset query is the maximum checkout progress each visitor reached in their session, which is recorded in the field hits.eCommerceAction.action_type.
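A condensed sketch of the second model follows; it mirrors the shape of the lab's query, adding the features listed above plus the UNNESTed hits used to compute the checkout-progress feature (treat the table names and date range as assumptions; the full query is in the repo):

#standardSQL
CREATE OR REPLACE MODEL `ecommerce.classification_model_2`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['will_buy_on_return_visit']
) AS
WITH all_visitor_stats AS (
  SELECT
    fullVisitorId,
    IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
      AS will_buy_on_return_visit
  FROM `data-to-insights.ecommerce.web_analytics`
  GROUP BY fullVisitorId
)
SELECT
  * EXCEPT(unique_session_id)
FROM (
  SELECT
    CONCAT(fullVisitorId, CAST(visitId AS STRING)) AS unique_session_id,
    will_buy_on_return_visit,
    -- how far the visitor got in the checkout process on their first visit
    MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
    -- behavior on the site
    IFNULL(totals.bounces, 0) AS bounces,
    IFNULL(totals.timeOnSite, 0) AS time_on_site,
    totals.pageviews,
    -- where the visitor came from
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    -- device and geography
    device.deviceCategory,
    IFNULL(geoNetwork.country, "") AS country
  FROM `data-to-insights.ecommerce.web_analytics`,
    UNNEST(hits) AS h
  JOIN all_visitor_stats USING (fullVisitorId)
  WHERE
    totals.newVisits = 1
    AND date BETWEEN '20160801' AND '20170430'  -- same 9-month training window (assumed range)
  GROUP BY
    unique_session_id,
    will_buy_on_return_visit,
    bounces,
    time_on_site,
    totals.pageviews,
    trafficSource.source,
    trafficSource.medium,
    channelGrouping,
    device.deviceCategory,
    country
);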

Evaluate this new model to see if there is better predictive power:
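The evaluation query has the same shape as the earlier ML.EVALUATE sketch: point it at `ecommerce.classification_model_2` and feed it the richer feature query restricted to the held-out evaluation window.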

Row | roc_auc | model_quality
1 | 0.910382 | good

With this new model you now get a roc_auc of 0.91, which is significantly better than the first model.

Now that you have a trained model, time to make some predictions.

Predict which new visitors will come back and purchase

Refer to the query (in the repo linked above) to predict which new visitors will come back and make a purchase.

The prediction query uses the improved classification model to predict the probability that a first-time visitor to the Google Merchandise Store will make a purchase in a later visit:

The predictions are made on the last 1 month (out of 12 months) of the dataset.
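A sketch of the prediction query, under the same assumptions as the sketches above (the full version is in the repo):

#standardSQL
SELECT
  *
FROM ML.PREDICT(
  MODEL `ecommerce.classification_model_2`,
  (
    WITH all_visitor_stats AS (
      SELECT
        fullVisitorId,
        IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)
          AS will_buy_on_return_visit
      FROM `data-to-insights.ecommerce.web_analytics`
      GROUP BY fullVisitorId
    )
    SELECT
      CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS unique_session_id,
      will_buy_on_return_visit,
      MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,
      IFNULL(totals.bounces, 0) AS bounces,
      IFNULL(totals.timeOnSite, 0) AS time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      IFNULL(geoNetwork.country, "") AS country
    FROM `data-to-insights.ecommerce.web_analytics`,
      UNNEST(hits) AS h
    JOIN all_visitor_stats USING (fullVisitorId)
    WHERE
      totals.newVisits = 1
      AND date BETWEEN '20170701' AND '20170801'  -- final month of the dataset (July 2017)
    GROUP BY
      unique_session_id,
      will_buy_on_return_visit,
      bounces,
      time_on_site,
      totals.pageviews,
      trafficSource.source,
      trafficSource.medium,
      channelGrouping,
      device.deviceCategory,
      country
  )
)
ORDER BY predicted_will_buy_on_return_visit DESC;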

Your model will now output the predictions it has for those July 2017 ecommerce sessions. You can see three newly added fields:

• predicted_will_buy_on_return_visit: whether the model thinks the visitor will buy later (1 = yes)

• predicted_will_buy_on_return_visit_probs.label: the binary classifier output label (yes / no)

• predicted_will_buy_on_return_visit_probs.prob: the confidence the model has in its prediction (1 = 100%)

Image of predict

Results

• Of the top 6% of first-time visitors (sorted in decreasing order of predicted probability), more than 6% make a purchase in a later visit.

• These users represent nearly 50% of all first-time visitors who make a purchase in a later visit.

• Overall, only 0.7% of first-time visitors make a purchase in a later visit.

• Targeting the top 6% of first-time visitors increases marketing ROI by 9x vs targeting them all!

Additional information

roc_auc is just one of the performance metrics available during model evaluation. Also available are accuracy, precision, and recall. Knowing which performance metric to rely on is highly dependent on what your overall objective or goal is.

AutoML Vision - Classify Images of Clouds

 

Set up AutoML Vision

AutoML Vision provides an interface for all the steps in training an image classification model and generating predictions on it. Open the navigation menu and select APIs & Services > Library. In the search bar, type "Cloud AutoML API". Click on the Cloud AutoML API result and then click ENABLE.

Create environment variables for your Project ID and Username:

export PROJECT_ID=$DEVSHELL_PROJECT_ID

export USERNAME=<USERNAME>

Create a Storage Bucket for the images you will use in testing.

gsutil mb -p $PROJECT_ID \
  -c regional \
  -l us-central1 \
  gs://$PROJECT_ID-vcm/

Open a new browser tab and navigate to the AutoML UI.

Upload training images to Google Cloud Storage

In order to train a model to classify images of clouds, you need to provide labeled training data so the model can develop an understanding of the image features associated with different types of clouds. Our model will learn to classify three different types of clouds: cirrus, cumulus, and cumulonimbus. To use AutoML Vision we need to put the training images in Google Cloud Storage.

In the GCP console, open the Navigation menu and select Storage > Browser

Create an environment variable with the name of your bucket

export BUCKET=YOUR_BUCKET_NAME

The training images are publicly available in a Cloud Storage bucket. Use the gsutil command line utility for Cloud Storage to copy the training images into your bucket:

gsutil -m cp -r gs://automl-codelab-clouds/* gs://${BUCKET}

Create a dataset

Now that your training data is in Cloud Storage, you need a way for AutoML Vision to access it. You'll create a CSV file where each row contains a URL to a training image and the associated label for that image.

Run the following command to copy the file to your Cloud Shell instance:

gsutil cp gs://automl-codelab-metadata/data.csv .

Then update the CSV with the files in your project:

sed -i -e "s/placeholder/${BUCKET}/g" ./data.csv

Navigate back to the AutoML Vision Datasets page.

At the top of the console, click + NEW DATASET.

Type clouds for the Dataset name.

Leave Single-label Classification checked.

Click CREATE DATASET to continue

Image of 1

Next, choose the location of your training images (the ones you uploaded in the previous step).

Choose Select a CSV file on Cloud Storage and enter the URL of the file you just uploaded - gs://your-project-name-vcm/data.csv. You may also use the browse function to find the CSV file. Once you see the white check mark in the green checkbox, select CONTINUE to proceed.

After you are returned to the IMPORT tab, navigate to the IMAGES tab. It will take 8 to 12 minutes while the image metadata is processed. Once complete, the images will appear by category.

Inspect images

Image of 2

To see a summary of how many images you have for each label, click on LABEL STATS. You should see the following pop-out box show up on the right side of your browser. Press DONE after reviewing the list.

Image of 3

Train your model

Start training your model! AutoML Vision handles this for you automatically, without requiring you to write any of the model code.

To train your clouds model, go to the TRAIN tab and click START TRAINING.

Enter a name for your model, or use the default auto-generated name.

Leave Cloud hosted selected and click CONTINUE.

For the next step, type the value "8" into the Set your budget box and check "Deploy model to 1 node after training." This process (auto-deploy) will make your model immediately available for predictions after training is complete.

Click START TRAINING.

The total training time includes node training time as well as infrastructure setup and teardown.

Evaluate your model

After training is complete, click on the EVALUATE tab. Here you'll see information about Precision and Recall of the model. It should resemble the following:

Image of 4

You can also adjust the Confidence threshold slider to see its impact.

Finally, scroll down to take a look at the Confusion matrix.

Image of 5

This tab provides some common machine learning metrics to evaluate your model accuracy and see where you can improve your training data. Since the focus was not on accuracy, move on to the next section about predictions.

Generate predictions

There are a few ways to generate predictions. Here you'll use the UI to upload images. You'll see how your model does classifying these two images (the first is a cirrus cloud, the second is a cumulonimbus).

First, download these images to your local machine by right-clicking on each of them (Note: You may want to assign a simple name like 'Image1' and 'Image2' to assist with uploading later):

Navigate to the TEST & USE tab in the AutoML UI:

On this page you will see that the model you just trained and deployed is listed in the "Model" pick list.

Click UPLOAD IMAGES and upload the cloud sample images you just saved to your local disk (you may select both images at the same time).

When the prediction request completes you should see something like the following:

Image of 6

The model classified each type of cloud correctly!

Creating a Data Transformation Pipeline with Cloud Dataprep

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis. We will explore the Cloud Dataprep UI to build a data transformation pipeline that runs at a scheduled interval and outputs results into BigQuery.

The dataset you'll use is an ecommerce dataset that has millions of Google Analytics session records for the Google Merchandise Store loaded into BigQuery.

• Connect BigQuery datasets to Cloud Dataprep

• Explore dataset quality with Cloud Dataprep

• Create a data transformation pipeline with Cloud Dataprep

• Schedule transformation jobs to output to BigQuery

Image of Pipe

We need BigQuery as an endpoint for dataset ingestion to the pipeline and as a destination for the output when the pipeline is completed.

Creating a BigQuery Dataset

Create a new BigQuery dataset to receive the output table of your new pipeline: in the BigQuery console, click CREATE DATASET.

• For Dataset ID, type ecommerce.

Create table

#standardSQL
CREATE OR REPLACE TABLE ecommerce.all_sessions_raw_dataprep
OPTIONS(
description="Raw data from analyst team to ingest into Cloud Dataprep"
) AS
SELECT * FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170801'; # limit to one day of data (~56k rows)


Open Cloud Dataprep

Connecting BigQuery data to Cloud Dataprep

Connect Cloud Dataprep to your BigQuery data source. On the Cloud Dataprep page:

  1. Click Create Flow in the top-right corner.

  2. In the Create Flow dialog, specify these details:

    • For Flow Name, type Ecommerce Analytics Pipeline

    • For Flow Description, type Revenue reporting table

  3. Click Create.

  4. Click Import & Add Datasets.

  5. In the data source panel, click BigQuery.

  6. Select your ecommerce dataset, then click the Create dataset icon next to the all_sessions_raw_dataprep table.

  7. Click Import & Add to Flow.

Image of ecomm

Exploring ecommerce data fields with a UI

In the right pane, click Add new Recipe.

Click Edit Recipe.

Cloud Dataprep loads a sample of your dataset into the Transformer view. This process might take a few seconds.

Image of Transformer

The grey bar under totalTransactionRevenue represents missing values for that field. This means that a lot of sessions in this sample did not generate revenue. Later, we will filter out these values so our final table only has customer transactions and associated revenue.

Find the maximum timeOnSite (in seconds), the maximum pageviews, and the maximum sessionQualityDim for the data sample:

Image of Timeonsite

• Maximum Time On Site: 5,561 seconds (or 92 minutes)

• Maximum Pageviews: 155 pages

• Maximum Session Quality Dimension: 97

A red bar indicates mismatched values. While sampling data, Cloud Dataprep attempts to automatically identify the type of each column. If you do not see a red bar for the productSKU column, then this means that Cloud Dataprep correctly identified the type for the column (i.e. the String type). If you do see a red bar, then this means that Cloud Dataprep found enough number values in its sampling to determine (incorrectly) that the type should be Integer. Cloud Dataprep also detected some non-integer values and therefore flagged those values as mismatched. In fact, the productSKU is not always an integer (for example, a correct value might be "GGOEGOCD078399"). So in this case, Cloud Dataprep incorrectly identified the column type: it should be a string, not an integer.

Image of red

Cleaning the data

Clean the data by deleting unused columns, eliminating duplicates, creating calculated fields, and filtering out unwanted rows.

Converting the productSKU column data type

To ensure that the productSKU column type is a string data type, open the menu to the right of the productSKU column, then click Change type > String.

Deleting unused columns

We will delete the itemQuantity and itemRevenue columns, as they contain only NULL values and are not useful.

Deduplicating rows

Your team has informed you that there may be duplicate session values included in the source dataset. Let's remove these with a new deduplicate step.

  1. Click the Filter rows icon in the toolbar, then click Remove duplicate rows.

  2. Click Add.

  3. Review the recipe.

Filtering out sessions without revenue

Your team has asked you to create a table of all user sessions that bought at least one item from the website. Filter out user sessions with NULL revenue.

  1. Under the totalTransactionRevenue column, click the grey Missing values bar. All rows with a missing value for totalTransactionRevenue are now highlighted in red.

  2. In the Suggestions panel, in Delete rows, click Add.

Image of filter

This step filters your dataset to only include transactions with revenue (where totalTransactionRevenue is not NULL).

Filtering sessions for PAGE views

The dataset contains sessions of different types, for example PAGE (for page views) or EVENT (for triggered events like "viewed product categories" or "added to cart"). To avoid double counting session pageviews, add a filter to only include page view related hits.

  1. In the histogram below the type column, click the bar for PAGE. All rows with the type PAGE are now highlighted in green.

  2. In the Suggestions panel, in Keep rows, click Add.

Image of filter2

Enriching the data

• visitId: an identifier for this session. This is part of the value usually stored as the utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.

As we see, visitId is not unique across all users. We will need to create a unique identifier.

Creating a new column for a unique session ID

As you discovered, the dataset has no single column for a unique visitor session. Create a unique ID for each session by concatenating the fullVisitorID and visitId fields.

  1. Click on the Merge columns icon in the toolbar.

  2. For Columns, select fullVisitorId and visitId.

  3. For Separator type a single hyphen character: -

  4. For the New column name, type unique_session_id.

  5. Click Add.

The unique_session_id is now a combination of the fullVisitorId and visitId. We will explore in a later lab whether each row in this dataset is at the unique session level (one row per user session) or something even more granular.

Creating a case statement for the ecommerce action type

As you saw earlier, values in the eCommerceAction_type column are integers that map to actual ecommerce actions performed in that session. For example, 3 = "Add to Cart" or 5 = "Check out." This mapping will not be immediately apparent to our end users so let's create a calculated field that brings in the value name.

  1. Click on the Conditions icon in the toolbar, then click Case on single column.

  2. For Column to evaluate, specify eCommerceAction_type.

  3. Next to Cases (1), click Add 8 times for a total of 9 cases.

Value to compare | New value
0 | 'Unknown'
1 | 'Click through of product lists'
2 | 'Product detail views'
3 | 'Add product(s) to cart'
4 | 'Remove product(s) from cart'
5 | 'Check out'
6 | 'Completed purchase'
7 | 'Refund of purchase'
8 | 'Checkout options'

  4. For New column name, type eCommerceAction_label. Leave the other fields at their default values.

  5. Click Add.
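For reference, here is the same mapping written as a standard SQL CASE expression. This is just an illustration of what the Dataprep step produces, not the recipe syntax itself; adjust the comparison values if eCommerceAction_type is stored as an integer rather than a string in your table:

#standardSQL
SELECT
  eCommerceAction_type,
  CASE eCommerceAction_type
    WHEN '0' THEN 'Unknown'
    WHEN '1' THEN 'Click through of product lists'
    WHEN '2' THEN 'Product detail views'
    WHEN '3' THEN 'Add product(s) to cart'
    WHEN '4' THEN 'Remove product(s) from cart'
    WHEN '5' THEN 'Check out'
    WHEN '6' THEN 'Completed purchase'
    WHEN '7' THEN 'Refund of purchase'
    WHEN '8' THEN 'Checkout options'
  END AS eCommerceAction_label
FROM ecommerce.all_sessions_raw_dataprep;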

Adjusting values in the totalTransactionRevenue column

As mentioned in the schema, the totalTransactionRevenue column contains values passed to Analytics multiplied by 10^6 (e.g., 2.40 would be given as 2400000). Now divide the contents of that column by 10^6 to recover the original values.

  1. Open the menu to the right of the totalTransactionRevenue column, then select Calculate > Custom formula.

  2. For Formula, type: DIVIDE(totalTransactionRevenue,1000000) and for New column name, type: totalTransactionRevenue1. Notice the preview for the transformation:

  3. Click Add.

  4. To convert the new totalTransactionRevenue1 column's type to a decimal data type, open the menu to the right of the totalTransactionRevenue1 column, then click Change type > Decimal.

  5. Review the full list of steps in your recipe:

Image of recipe

Running and scheduling Cloud Dataprep jobs to BigQuery

  1. Click Run Job

  2. Hover over the publishing action that was created and click Edit.

  3. Select BigQuery as a data sink in the left bar

  4. Select your existing ecommerce dataset

  5. Select Create new Table

  6. For Table Name, type revenue_reporting

  7. For options, select Truncate the table every run

Image of publish

  8. Click Update

  9. Review the settings, then click Run Job

Once your Cloud Dataprep job is completed (takes 10 - 15 minutes), refresh your BigQuery page and confirm that the output table revenue_reporting exists.

Image of pipeline2

You will know your revenue reporting table is ready when the below query successfully executes:

-- generate a report showing the most recent transactions

SELECT
date,
unique_session_id,
fullVisitorId,
totalTransactionRevenue1,
-- push the entire checkout basket into an array
ARRAY_AGG(DISTINCT v2ProductName) AS products_bought
FROM
ecommerce.revenue_reporting
-- only include products that were actually bought
WHERE productQuantity > 0
GROUP BY 1,2,3,4
ORDER BY date DESC

Image of results

Creating a scheduled pipeline job

Even if your pipeline is still running, you can schedule its execution in the next step so the job is re-run automatically on a regular basis to account for newer data.

Note: You can navigate and perform other operations while jobs are running.

  1. You will now schedule a recurrent job execution. Click the Flows icon on the left of the screen.

  2. On the right of your Ecommerce Analytics Pipeline flow click the More icon (...), then click Schedule Flow.

  3. In the Add Schedule dialog:

  4. For Frequency, select Weekly.

  5. For day of week, select Saturday and unselect Sunday.

  6. For time, enter 3:00 and select AM.

  7. Click Save.

The job is now scheduled to run every Saturday at 3 AM.

IMPORTANT: You will not be able to view your scheduled flows until you set up a scheduled output destination.

  1. In your flow, click the output node as shown below:

Image of job

  2. Under Scheduled Destinations, click Add.

  3. In the Scheduled Publishing settings page, click Add Publishing Action.

  4. Specify an output destination in BigQuery like the one you created previously.

Monitoring jobs

  1. Click the Jobs icon on the left of the screen.

  2. You will see the list of jobs; wait until your job is marked as Completed.

Image of joblist