Machine Learning Overview

Appian is integration-agnostic and can connect with any machine learning offering that exposes a web API. This article provides background that will help you understand machine learning technology regardless of the specific tool being used. For details about specific machine learning integrations, refer to the related articles in Appian's documentation.

What is Machine Learning?

Machine learning is a type of artificial intelligence that uses mathematical models to generate probabilistic predictions by finding patterns in historical data. Machine learning models can be thought of as black boxes that are created by processing many observations with known outcomes. These models are then able to take in one or many observations without a known outcome and produce possible outcomes and their probabilities.

There are many different uses and applications for machine learning, but this article focuses on machine learning technology that analyzes structured data (such as rows of an Excel spreadsheet or an Appian CDT) and delivers a prediction for a specific field or column in the data. The feature, value, or attribute being predicted is often referred to as the target.

Other uses for machine learning include natural language analysis and translation and the ability to decipher image contents, done using tools such as IBM's Watson and Google's AutoML Vision, respectively.

Common Model Types

There are two major categories of model types that are used for making machine learning predictions on structured data:

  1. Regression: predicts a numeric value.
  2. Classification: predicts a categorical value from a discrete, fixed number of possible categories. Classification models can be further broken down into two types:
    1. Binary classification: the model has only two prediction values to choose from (e.g., true and false).
    2. Multiclassification: the model has more than two prediction values to choose from.

Which type you use depends on the target attribute you want to predict and your overall objective in creating the model. Read the sections below to learn more about the purpose of each model type and see examples describing appropriate uses of each one.

Regression

  • Use a regression model when you want to predict for a numerical value that is not constrained to a finite or particular list of values.
  • The main metric used to determine the accuracy of a regression model is the root mean square error (RMSE). A perfect model would have an RMSE of 0. The RMSE can be read as the typical size of the difference between predicted and actual values (roughly the standard deviation of the prediction errors), so whether a given RMSE is good depends on the range of values you are trying to predict.
  • When using a regression model, be aware that the predicted value may not fall within the range of values provided in training data and might take on any positive or negative number. It is important to have a plan for how to address any values that would fall outside of acceptable ranges for your application.
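
The two points above (measuring error with RMSE and planning for out-of-range predictions) can be sketched in a few lines of Python. This is a minimal illustration, not part of any specific tool: the function names, the sample prices, and the choice to clamp values into a range are all assumptions made for the example. Your application might instead reject or flag such values.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def clamp_prediction(value, lower, upper):
    """Force a predicted value into the range the application accepts."""
    return max(lower, min(upper, value))

# Hypothetical home sale prices (in dollars) and model predictions.
actual_prices = [200_000, 350_000, 500_000]
predicted_prices = [210_000, 340_000, 520_000]

error = rmse(actual_prices, predicted_prices)

# A regression model can return a value outside any sensible range,
# such as a negative price. Clamping is one possible way to handle it.
safe_price = clamp_prediction(-15_000, lower=0, upper=2_000_000)
```

Here an RMSE of roughly 14,000 on prices in the hundreds of thousands may be acceptable, while the same RMSE on prices in the tens of thousands would not be.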

Regression models can be used to predict:

  • The sale price of a home, given information about the home's size, number of bedrooms, zip code, etc.
  • The appropriate salary for a job posting, given information about that job's difficulty and expected characteristics of qualified candidates.
  • The number of viewers who will watch the premiere of a new TV series, given information about the show's genre and cast.

Binary Classification

  • Use a binary classification model when you want to predict for a value that has only two possible outcomes.
  • A binary classification model will return a value (true or false) and a predicted score (a number between 0 and 1). By default, if the predicted score is greater than 0.5, then the predicted value will be true, but machine learning tools typically allow you to adjust the score threshold to alter the number of true and false values depending on your use case.
  • The main metric used to evaluate the performance of a binary classification model is Area Under the Curve (AUC). The AUC is a number between 0 and 1. A number closer to 1 indicates a highly accurate model. Values near 0.5 indicate that the model is no better than guessing at random. Values close to 0 indicate that the model has learned correct patterns but is using them to make inverse predictions.
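
The sketch below illustrates how a score threshold turns a predicted score into a true or false value, and one way to compute AUC. It assumes that AUC here means the area under the ROC curve, which can be computed as the fraction of (positive, negative) pairs in which the positive observation receives the higher score. The function names and sample data are hypothetical.

```python
def classify(score, threshold=0.5):
    """Turn a predicted score into a true/false prediction."""
    return score > threshold

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct ranking.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical fraud scores and the known outcome for each transaction.
scores = [0.9, 0.7, 0.4, 0.2]
labels = [True, False, True, False]

default_predictions = [classify(s) for s in scores]
# Raising the threshold produces fewer true predictions.
strict_predictions = [classify(s, threshold=0.8) for s in scores]
model_auc = auc(labels, scores)
```

Note that the AUC does not depend on the threshold: it measures how well the scores rank true cases above false ones, while the threshold only decides where to cut that ranking.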

Binary classification models can be used to predict:

  • Whether a job candidate should be given an offer of employment, given information about their qualifications and interview scores.
  • Whether a loan application should be approved or rejected, given credit details about the applicant.
  • Whether someone will sign up for a service, given their demographics.
  • Whether a bank transaction is fraudulent, given information about how much that transaction deviates from the account's typical usage patterns.

Multiclassification

  • Use a multiclass model when you want to predict for a value that can take on a single categorical value from among a list of three or more discrete, finite possibilities.
  • A multiclass model will return a list of values and their related probabilities. The value with the highest probability represents the model's best prediction. For example, if you are trying to predict which tier of support a customer service case should be routed to, a multiclass model might return: Tier 1 - 60%, Tier 2 - 13%, Tier 3 - 27%.
  • Since the target attribute's possible values are derived from training data, the model will never deliver a prediction value that did not occur in the training data.
  • The main metric used to determine the accuracy of a multiclass model is called an F1 score. The F1 score is the harmonic mean of precision and recall. The range is 0 to 1. The closer the value is to 1, the better the model.
  • Some machine learning tools set a limit on the number of possible predictable values that a multiclass model can have. This is because target attributes with hundreds or thousands of potential values can be difficult to train and have a higher likelihood of failure and poor model performance.
  • To use machine learning to make predictions from a group of possibilities that is larger than a tool's limit, consider using a series of different models. For example, imagine a car dealership that sells 75 different minivans, 50 different convertibles, and 100 different sedans. You may not be able to create one model to predict one of the 225 cars, but you could create a model to predict which type of car the customer is likely to buy (minivan, convertible, or sedan) and then one model for each type of car to predict the particular minivan, convertible, or sedan.
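
As a rough illustration of the points above, the following Python sketch picks the best prediction from a set of class probabilities and computes an F1 score. The source does not say how per-class F1 scores are combined for a multiclass model, so the unweighted ("macro") average used here is an assumption, as are the function names and sample data.

```python
def best_prediction(probabilities):
    """Return the label with the highest probability."""
    return max(probabilities, key=probabilities.get)

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(actual, predicted):
    """Average the per-class F1 over every class seen in the actual labels."""
    scores = []
    for label in sorted(set(actual)):
        true_pos = sum(1 for a, p in zip(actual, predicted)
                       if a == label and p == label)
        predicted_pos = sum(1 for p in predicted if p == label)
        actual_pos = sum(1 for a in actual if a == label)
        precision = true_pos / predicted_pos if predicted_pos else 0.0
        recall = true_pos / actual_pos
        scores.append(f1_score(precision, recall))
    return sum(scores) / len(scores)

# The support-tier example from the text above.
case_probabilities = {"Tier 1": 0.60, "Tier 2": 0.13, "Tier 3": 0.27}
routed_tier = best_prediction(case_probabilities)

# Hypothetical known outcomes and model predictions for four cases.
actual = ["Tier 1", "Tier 1", "Tier 2", "Tier 3"]
predicted = ["Tier 1", "Tier 2", "Tier 2", "Tier 3"]
overall_f1 = macro_f1(actual, predicted)
```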

Multiclass classification models can be used to predict:

  • Which category of car—sedan, truck or SUV—someone is likely to purchase, given their demographics.
  • A book's genre, given information about the book's author, length, characters, storyline, etc.

Model Types Summary

| Model | Prediction Type | Common Performance Metrics | Example |
|---|---|---|---|
| Regression | Predicts a numeric value | Root Mean Square Error (RMSE), Mean Absolute Error (MAE) | Predicting a home's sale price |
| Binary Classification | Predicts binary values (e.g., true or false) | Area Under the Curve (AUC) | Predicting whether a job candidate should be offered employment |
| Multiclass Classification | Predicts values that belong to a limited, predefined set of permissible values | F1 Score, Log Loss | Predicting a book's genre |

Training Data 

To create a model, you must supply the machine learning tool with training data that it will use to learn about associations between different attribute values and the target attribute. This training data is the means by which the model understands and recognizes patterns about the data for which you ask it to make predictions. Below is an example of a data structure that might be used for training data for a model designed to predict the sale price of a used car. In this use case, the column marked "Sale Price" would be identified to the model as the target attribute to predict for.

| Year | Make | Model | Color | Transmission | Mileage | Previous Owners | Sale Price |
|---|---|---|---|---|---|---|---|
| 1997 | Ford | Mustang | Silver | Automatic | 201,298 | 3 | 1,499 |
| 2013 | Mazda | 3 | Black | Automatic | 60,588 | 1 | 8,100 |
| 2005 | Honda | Element | Red | Automatic | 160,378 | 2 | 4,760 |
| 2009 | Toyota | Camry | Blue | Manual | 87,380 | 1 | 7,290 |

The details about how data should be ordered, formatted and uploaded to a machine learning tool for training vary depending on the specific tool being used, so refer to your tool's documentation for specific information about appropriately presenting data.
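
To make the idea of a target attribute concrete, here is a minimal Python sketch that reads the sample rows above from CSV text and separates the "Sale Price" column from the other attributes. The CSV layout, the constant names, and the helper function are assumptions for illustration only. Your tool may expect a different format entirely.

```python
import csv
import io

# The sample rows from the table above, as CSV text. Numbers are written
# without thousands separators so they parse cleanly.
TRAINING_CSV = """Year,Make,Model,Color,Transmission,Mileage,Previous Owners,Sale Price
1997,Ford,Mustang,Silver,Automatic,201298,3,1499
2013,Mazda,3,Black,Automatic,60588,1,8100
2005,Honda,Element,Red,Automatic,160378,2,4760
2009,Toyota,Camry,Blue,Manual,87380,1,7290
"""

TARGET = "Sale Price"

def split_features_and_target(csv_text, target):
    """Separate each row into its input attributes and its target value."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Remove the target column so it is not treated as an input.
        targets.append(float(row.pop(target)))
        features.append(row)
    return features, targets

features, targets = split_features_and_target(TRAINING_CSV, TARGET)
```

After the split, `features` holds one dictionary of attribute values per car and `targets` holds the matching sale prices that the model would learn to predict.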

Best Practices and Tips for Training Data

  • In general, the more training observations (i.e., rows of data) that you provide during training, the more accurate the final model will be.
  • To the greatest extent possible, provide training data that resembles the data you expect to see in production.
  • Machine learning tools typically have both minimal requirements and limits regarding the size and complexity of training data. Read your tool's documentation for more details.
  • Some tools allow you to modify the weight given to specific columns during training, or to specify a "time" column if training data values are influenced by time. Read your tool's documentation for more details.
  • Models trained with skewed or unrepresentative data can result in unwanted bias when making predictions. Google has documentation and a video regarding bias and machine learning that is helpful for learning more about this topic.
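
One simple way to spot skewed training data before training is to look at how often each target value occurs. The sketch below is only an illustration of that check: the function name, the loan data, and the 10% cutoff are hypothetical, and a balanced target column does not by itself guarantee that the data is representative.

```python
from collections import Counter

def target_distribution(target_values):
    """Return each target value's share of the training rows."""
    counts = Counter(target_values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical loan-approval training targets.
loan_outcomes = ["approved"] * 95 + ["rejected"] * 5
shares = target_distribution(loan_outcomes)

# The 10% cutoff is an arbitrary illustration, not a recommended value.
underrepresented = [value for value, share in shares.items() if share < 0.10]
```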
