Integrating with Amazon Machine Learning

Note: Amazon ML is no longer available to new Amazon customers

This article provides information about integrating with Amazon Machine Learning. If you are unfamiliar with machine learning, it is recommended that you read the Machine Learning Overview article for background on the technology, the different model types, and guidance on training data.

Amazon offers a wide range of services under its machine learning arm, from translation (Amazon Translate) to deep learning-enabled video cameras (AWS DeepLens). Appian can integrate with all of these services; however, this article focuses solely on the Amazon Machine Learning service through the use of the Appian AI Designer. There are also many other machine learning offerings available, including those from Google Cloud and Microsoft Azure. Appian is integration agnostic and can connect with all of them.

Amazon Machine Learning Models

Amazon Machine Learning (AML) supports three different types of ML models. The type of model that Amazon will build depends on the type of target attribute that you want to predict.

Model | Prediction Type | Performance Metric
Regression | Predicts a numeric value | Root Mean Square Error (RMSE)
Binary Classification | Predicts binary values (e.g., true or false) | Area Under the Curve (AUC)
Multiclass Classification | Predicts values that belong to a limited, predefined set of permissible values | F1 Score
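
When a model is created through the Amazon Machine Learning API rather than the console, the prediction type above maps directly to the MLModelType parameter. Below is a minimal sketch using the AWS SDK for Python (boto3); the model name and IDs are hypothetical placeholders.

    # Minimal sketch: the model type is passed as MLModelType when creating an AML model.
    # The model and data source IDs below are hypothetical placeholders.
    import boto3

    aml = boto3.client('machinelearning', region_name='us-east-1')

    # MLModelType must be 'REGRESSION', 'BINARY', or 'MULTICLASS'.
    aml.create_ml_model(
        MLModelId='ml-bank-offer-v1',                # placeholder ID
        MLModelName='Bank offer propensity model',
        MLModelType='BINARY',                        # binary classification -> evaluated with AUC
        TrainingDataSourceId='ds-bank-training-v1'   # placeholder training data source ID
    )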

Creating Amazon ML Models in Appian

The following steps outline how to create a model using the Appian AI Designer shared component. It is possible to create models directly in the AML admin console. It is also possible to interact with models in Appian that already exist or were not created using the Appian AI Designer (see the next section for more information on making predictions).

  1. Create an Amazon developer account and an Amazon S3 bucket to store the data you will use to create your model. A credit card is required and you will be charged to create models and make predictions, but costs are relatively insignificant (see AML pricing). 
  2. Download Appian AI Designer from shared components and follow the deployment instructions.
    1. Note: you will need to have Appian automatically create the database tables by manually publishing the data store after the application import.
  3. Collect the data used to create the model and format it as a CSV, where each row consists of an observation with multiple features (or attributes) and one target attribute. The more observations (rows in the CSV) you include, the better the model. Below is a sample data set for banking customers, where the first nine columns represent features that the model will use to recognize patterns and relationships, while the last column (y) is the binary target value the model will try to predict. In this case, y represents whether the banking customer decided to take an offer pitched over the phone. A sketch of uploading a CSV like this to S3 and registering it as an AML data source follows this list.
    age  job          marital   education          default  housing  contact    duration  day_of_week  y
    44   blue-collar  married   basic.4y           0        1        cellular   210       thu          0
    53   technical    married   unknown            1        0        telephone  180       fri          1
    28   management   single    university.degree  0        1        cellular   465       mon          1
    39   services     divorced  high.school        0        1        cellular   180       wed          0
  4. Navigate to: https://<your.server>/suite/sites/aml and follow the sites wizard to create a new model.
    1. On the first tab you can select the S3 bucket created earlier.
    2. If you do not plan on using Amazon's feature transformation formulas, ensure that any data manipulation has been done before formatting the data into a CSV. See Feature Transformation below for more information.
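
As referenced in step 3, below is a minimal sketch of uploading a formatted CSV to the S3 bucket and registering it as an AML data source using the AWS SDK for Python (boto3). The bucket name, file names, IDs, and schema are illustrative assumptions based on the sample banking data.

    # Sketch: upload the training CSV to S3 and register it as an AML data source.
    # Bucket names, keys, and IDs are hypothetical placeholders.
    import json
    import boto3

    s3 = boto3.client('s3')
    aml = boto3.client('machinelearning', region_name='us-east-1')

    s3.upload_file('bank_offers.csv', 'my-aml-bucket', 'training/bank_offers.csv')

    # Schema describing each attribute and the target attribute the model will predict.
    schema = {
        "version": "1.0",
        "targetAttributeName": "y",
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": "age", "attributeType": "NUMERIC"},
            {"attributeName": "job", "attributeType": "CATEGORICAL"},
            {"attributeName": "marital", "attributeType": "CATEGORICAL"},
            {"attributeName": "education", "attributeType": "CATEGORICAL"},
            {"attributeName": "default", "attributeType": "BINARY"},
            {"attributeName": "housing", "attributeType": "BINARY"},
            {"attributeName": "contact", "attributeType": "CATEGORICAL"},
            {"attributeName": "duration", "attributeType": "NUMERIC"},
            {"attributeName": "day_of_week", "attributeType": "CATEGORICAL"},
            {"attributeName": "y", "attributeType": "BINARY"}
        ]
    }

    aml.create_data_source_from_s3(
        DataSourceId='ds-bank-training-v1',
        DataSourceName='Bank offers training data',
        DataSpec={
            'DataLocationS3': 's3://my-aml-bucket/training/bank_offers.csv',
            'DataSchema': json.dumps(schema)
        },
        ComputeStatistics=True   # statistics are required when the data source is used for training
    )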

Making Predictions

Once a model is created, you can make batch predictions or individual real-time predictions. There are two main ways to make real-time predictions within Appian: you can use the shared component function AML_getRealtimePrediction, or you can use the connected system object in Appian versions 18.2 or later. The AML_getRealtimePrediction function takes in a model ID and two parallel arrays that hold attribute names and attribute values. If using this function, it is recommended to create a mapping rule that takes in a CDT and converts the CDT values into a text array to be passed into AML_getRealtimePrediction. Before creating a connected system or a rule to call the API, you can test out real-time predictions from the AML admin console or from the machine learning model record in the Appian AI Designer site. It is recommended to test out the predictions and evaluate the model (more below) before deciding to move forward with an initial model.
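
For teams that call the AML API directly rather than through the shared component or connected system, the sketch below shows an equivalent real-time prediction using the AWS SDK for Python (boto3). The model ID and record values are hypothetical; note that every attribute value is passed as a string.

    # Sketch: request a real-time prediction from an existing AML model.
    # The model ID and record values are hypothetical placeholders.
    import boto3

    aml = boto3.client('machinelearning', region_name='us-east-1')
    model_id = 'ml-bank-offer-v1'

    # A real-time endpoint must exist (and reach READY status) before predicting.
    aml.create_realtime_endpoint(MLModelId=model_id)
    endpoint_url = aml.get_ml_model(MLModelId=model_id)['EndpointInfo']['EndpointUrl']

    # Attribute values are sent as strings keyed by attribute name -- the same
    # names and values the shared component accepts as parallel arrays.
    record = {
        'age': '44', 'job': 'blue-collar', 'marital': 'married',
        'education': 'basic.4y', 'default': '0', 'housing': '1',
        'contact': 'cellular', 'duration': '210', 'day_of_week': 'thu'
    }

    response = aml.predict(MLModelId=model_id, Record=record, PredictEndpoint=endpoint_url)
    prediction = response['Prediction']
    print(prediction['predictedLabel'],   # '0' or '1' for a binary classification model
          prediction['predictedScores'])  # raw score that is compared against the score threshold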

Evaluating and Adjusting Model Performance 

Whenever a new model is created, four objects are created in the AML admin console: one training data source, one evaluation data source, one model, and one evaluation object. As discussed above, Amazon uses different metrics to quantify performance. In addition, Amazon provides a different performance visualization for each model type. To access the performance metric and visualizations, navigate to the admin console and select the evaluation object. For binary classification models, you can adjust output from the dual histogram visualization by raising or lowering the score threshold, which defaults to 0.5. For example, if you would like to automate a process by auto-approving likely true values, you may want to raise the score threshold to a value closer to 1 in order to limit false positives (raising the score threshold increases the probability needed for the model to predict a value as true). Inversely, if you would like to flag likely false values for further review, you may want to lower the score threshold in order to limit false negatives.
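
The evaluation metric and the score threshold can also be read and adjusted through the API. The sketch below uses the AWS SDK for Python (boto3) with hypothetical IDs to look up a binary model's AUC and raise its threshold above the default.

    # Sketch: read the evaluation metric and adjust a binary model's score threshold.
    # The evaluation and model IDs are hypothetical placeholders.
    import boto3

    aml = boto3.client('machinelearning', region_name='us-east-1')

    # The AUC of a binary classification model is reported on its evaluation object.
    evaluation = aml.get_evaluation(EvaluationId='ev-bank-offer-v1')
    print(evaluation['PerformanceMetrics']['Properties'].get('BinaryAUC'))

    # Raise the threshold above the 0.5 default so that only higher-scoring
    # observations are predicted as true, limiting false positives.
    aml.update_ml_model(MLModelId='ml-bank-offer-v1', ScoreThreshold=0.8)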

Another way to evaluate the model is to look at how each feature correlates to the target value. Some features have more of an impact on the predicted outcome than others, and this is quantified by Amazon (to view these values, navigate to either of the data sources in the AML admin console). It is generally a best practice to include as many relevant features as possible in your data set, but noise introduced by including too many variables with little predictive power may negatively impact your model's performance.

Best Practices

Retraining Models

  • Retraining is the process of providing new data to a model in an attempt to keep it accurate as the distribution of actual outcomes drifts over time. Like most application development, implementing a machine learning model is not a one-time activity; it is best practice to continuously monitor your model and retrain it if new observations begin to deviate from the original training data distributions.
  • In order to retrain a model in Amazon, you will need to create a completely new model with your updated data set (a sketch follows this list). Be sure to avoid hard-coding model IDs in your Appian applications so that updating your applications after retraining only requires updating a single object, such as a constant or connected system.
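
As a rough sketch of the retraining flow described above (all IDs and S3 locations are hypothetical placeholders), a new data source and a new model are created from the updated data, and only the stored model ID in Appian changes:

    # Sketch: retrain by creating a brand new data source and model from updated data.
    # IDs and S3 locations are hypothetical placeholders.
    import boto3

    aml = boto3.client('machinelearning', region_name='us-east-1')

    aml.create_data_source_from_s3(
        DataSourceId='ds-bank-training-v2',
        DataSpec={
            'DataLocationS3': 's3://my-aml-bucket/training/bank_offers_latest.csv',
            'DataSchemaLocationS3': 's3://my-aml-bucket/training/bank_offers.csv.schema'
        },
        ComputeStatistics=True
    )

    aml.create_ml_model(
        MLModelId='ml-bank-offer-v2',
        MLModelType='BINARY',
        TrainingDataSourceId='ds-bank-training-v2'
    )
    # Finally, update the single Appian constant or connected system to reference
    # 'ml-bank-offer-v2' instead of the previous model ID.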

Feature Transformation

  • A key characteristic of good training data is that it is provided in a way that is optimized for learning and generalization. The process of putting the data into this optimal format is known in the industry as feature transformation.
  • Feature transformation can be performed on all types of data (numeric, text, boolean). A simple example of feature transformation is converting all null numeric values to 0, but it can also include more complex formulas for normalizing data or uncovering non-linearity in a variable's distribution.
  • Feature transformation can take place prior to uploading data to Amazon, or you can use built-in transformation recipes within the Amazon Machine Learning console. Regardless of the method used, the process should be repeatable so that models can be recreated or retrained easily (a sketch of a pre-upload transformation follows this list).
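
As a simple illustration of transforming features before upload, the sketch below uses Python with pandas; the file names, columns, and transformation choices are assumptions based on the sample banking data, not a prescribed recipe.

    # Sketch: a repeatable, scripted feature transformation run before uploading the CSV.
    # File names and transformation choices are illustrative assumptions.
    import pandas as pd

    df = pd.read_csv('bank_offers_raw.csv')

    # Replace null numeric values with 0.
    df['duration'] = df['duration'].fillna(0)

    # Normalize a numeric feature to the 0-1 range.
    df['age'] = (df['age'] - df['age'].min()) / (df['age'].max() - df['age'].min())

    # Keeping the transformation in a script means it can be rerun for retraining.
    df.to_csv('bank_offers.csv', index=False)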

Splitting Data

  • In order to test the accuracy of ML models, a percentage of the data provided to Amazon is set aside for evaluation. By default, Amazon splits the data such that 70% is used to train the model while 30% is used to evaluate it. The split percentage can be altered when creating the model.
  • It is important to split the input data such that there is a random distribution of observations between the training and evaluation data sources. If the data for either data source is skewed towards a certain target value, the ML model could be skewed and the evaluation may not be indicative of true performance (a sketch of a 70/30 split follows this list).
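
When data sources are created through the API, the split is expressed as a DataRearrangement JSON string. The sketch below (AWS SDK for Python, with placeholder IDs and S3 locations) creates complementary training and evaluation data sources using a random 70/30 split; verify the exact rearrangement keys against the AML documentation.

    # Sketch: complementary 70/30 training and evaluation data sources built from the
    # same S3 file via DataRearrangement. IDs and S3 locations are placeholders.
    import json
    import boto3

    aml = boto3.client('machinelearning', region_name='us-east-1')

    def rearrangement(percent_begin, percent_end, complement=False):
        return json.dumps({'splitting': {
            'percentBegin': percent_begin, 'percentEnd': percent_end,
            'strategy': 'random', 'complement': complement}})

    common_spec = {
        'DataLocationS3': 's3://my-aml-bucket/training/bank_offers.csv',
        'DataSchemaLocationS3': 's3://my-aml-bucket/training/bank_offers.csv.schema'
    }

    # 70% of the randomly selected rows for training...
    aml.create_data_source_from_s3(
        DataSourceId='ds-bank-training-split',
        DataSpec=dict(common_spec, DataRearrangement=rearrangement(0, 70)),
        ComputeStatistics=True)

    # ...and the complementary 30% for evaluation.
    aml.create_data_source_from_s3(
        DataSourceId='ds-bank-evaluation-split',
        DataSpec=dict(common_spec, DataRearrangement=rearrangement(0, 70, complement=True)),
        ComputeStatistics=True)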

Shuffling Data

  • In Amazon ML, you must shuffle your training data. Shuffling mixes up the order of your data so that the SGD (stochastic gradient descent) algorithm does not encounter one type of data for too many observations in succession.
  • When creating a model via the admin console or the Appian AI Designer shared component wizard, you can indicate whether you would like Amazon to shuffle your data or whether you have already shuffled it yourself (a sketch of shuffling prior to upload follows below).
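
If you prefer to shuffle the data yourself before uploading it, a minimal sketch with Python and pandas (file names assumed) looks like this:

    # Sketch: shuffle CSV rows before upload so SGD training does not see long runs
    # of similar observations. File names are illustrative assumptions.
    import pandas as pd

    df = pd.read_csv('bank_offers.csv')
    df = df.sample(frac=1, random_state=42)   # random permutation of all rows
    df.to_csv('bank_offers_shuffled.csv', index=False)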

See Also

Websites: