Machine Learning Overview

Appian is integration agnostic and can connect with any machine learning offering that exposes a web API. This article provides general background on machine learning technology that applies regardless of the specific tool being used. For Amazon machine learning integrations, refer to the detailed articles in Appian's documentation.

What is Machine Learning?

Machine learning is a subset of artificial intelligence (AI) that focuses on algorithms and mathematical models that learn from data, generating probabilistic predictions by finding patterns in historical data. The models can be thought of as black boxes that are created by processing many observations, using supervised or semi-supervised training. Once trained, a model can take in one or many observations without a known outcome and produce possible outcomes along with their probabilities.

There are many different use cases and applications for machine learning. This article mainly focuses on machine learning technology that analyzes structured data, such as rows of an Excel spreadsheet or an Appian CDT, and delivers a prediction for a specific field or column in the data. The feature, value, or attribute that is being predicted for is often referred to as the target. Within the context of Appian, we’ll dive into the practical implementation of AI features that integrate with applications.

Other uses for machine learning include natural language analysis and translation and the ability to decipher image contents, done using tools such as IBM's Watson and Google AI respectively.

Appian ML/AI Capabilities

Appian AI Skills facilitate the integration of machine learning and AI capabilities into your application. This is done using a variety of low-code design objects, functions and smart services. Features available within Appian AI Skills include document and email classification with custom-built models, and document extraction with pre-trained models.

Classification models can be custom-built, including being trained and tested using data that will accurately reflect your use case. The Document Extraction AI skill identifies data from PDF documents, extracting and saving data into key-value pairs that can be used within the application or saved within a database.

Appian AI Skills offer pre-trained models that use built-in document extraction capabilities.

Pre-trained models in Appian are designed for general use cases and are used in documents that have similar information and labeled values (e.g. structured or semi-structured documents). Incorporating Google AI functionalities into your Appian application enables the integration of various features, including but not limited to natural language processing, translation services, cloud-based storage, and more. See Using Google AI Services for a full list of features available.

Note that starting from January 23, 2024, Appian is no longer selling Appian-provisioned Google credentials to customers. Customers have to purchase the license directly through Google and add their Google credentials to their Appian Admin console.

Appian AI Copilot is a starting point for further AI capabilities in Appian. AI Copilot uses generative AI to create a functional initial interface from the fields in your form through a simple PDF upload. AI Copilot is integrated with Azure OpenAI to enable this functionality in your application. Azure OpenAI leverages generative AI models (e.g. GPT-3, Codex, DALL-E, ChatGPT) to provide writing assistance, content generation, etc. Once the initial interface is generated from the PDF, you can customize it further according to your specific requirements.

Machine learning, particularly deep learning, is one of the fundamental components of generative AI. Like other machine learning models, generative AI models are trained on large amounts of data, which helps them identify inherent patterns. The generative AI model is fine-tuned and enhanced as more data is introduced over time. Leveraging AI with Appian allows you to automate repetitive tasks and simplify processes, streamlining development and increasing efficiency and productivity.

Common Model Types

There are two major categories of model types that are used for making machine learning predictions on structured data:

Regression: predicts a numeric value.
Classification: predicts a categorical value from a discrete, fixed number of possible categories. Classification models can be further broken down into two types:
1. Binary classification: the model has only two prediction values to choose from (e.g. true and false).
2. Multiclassification: the model has more than two prediction values to choose from.

The type you use depends on the target attribute you want to predict and your overall objective in creating the model. Read the sections below to learn more about the purpose of each model type and see examples describing appropriate uses of each one.
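As a rough illustration of how the target attribute drives this choice, the sketch below inspects a list of target values and suggests a model type. The function name and the rule it applies (numeric targets map to regression, two distinct values to binary classification, more to multiclass) are assumptions made for this example, not a feature of any particular tool. A numeric column that actually encodes categories would need to be handled by hand.

```python
def suggest_model_type(target_values):
    """Suggest a model type from the values seen in a target column.

    Hypothetical helper: a simple heuristic, not a rule from any ML tool.
    """
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("target column needs at least two distinct values")
    # bool is a subclass of int in Python, so exclude it explicitly.
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric:
        return "regression"
    if len(distinct) == 2:
        return "binary classification"
    return "multiclass classification"


print(suggest_model_type([1499, 8100, 4760]))
print(suggest_model_type([True, False, True]))
print(suggest_model_type(["sedan", "truck", "SUV"]))
```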

Regression

Regression models make predictions along a continuous range of numerical values. They have many important use cases (examples below), but can't be used in cases where binary, categorical, or non-numeric values are required without additional processing of the model’s output.
The main metric used to determine accuracy of a regression model is the root mean square error (RMSE). The RMSE measures the typical size of the differences between predicted and actual values, in the same units as the target; thus a good RMSE is relative to the range of values you are trying to predict. A perfect model would have an RMSE of 0.
When using a regression model, be aware that the predicted value may not fall within the range of values provided in training data and might take on any positive or negative number. It is important to have a plan for how to address any values that would fall outside of acceptable ranges for your application.
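The RMSE calculation and one possible way to handle out-of-range predictions can be sketched as follows. The function names, the sample prices, and the choice to clamp values into an accepted range are assumptions for illustration. Depending on your application, rejecting or flagging such predictions may be more appropriate than clamping.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def clamp(value, low, high):
    """Force a prediction into the range the application accepts."""
    return max(low, min(high, value))


# Illustrative values only: three home prices and a model's predictions.
actual_prices = [200_000, 310_000, 150_000]
predicted_prices = [210_000, 300_000, 150_000]
print(rmse(actual_prices, predicted_prices))

# A regression model may return a negative price; clamp it before use.
print(clamp(-12_500.0, 0, 1_000_000))
```

An RMSE of a few thousand is small when predicting home prices in the hundreds of thousands, but would be large for a target that ranges from 0 to 100.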

Regression models can be used to predict:

The sale price of a home, given information about the home's size, number of bedrooms, zip code, etc.
The appropriate salary for a job posting, given information about that job's difficulty and expected characteristics of qualified candidates.
The number of viewers who will watch the premiere of a new TV series, given information about the show's genre and cast.

Binary Classification

Binary classification models predict a value that has only two possible outcomes.
A binary classification model will return a value (true or false) and a predicted score (a number between 0 and 1). By default, if a predicted score is greater than 0.5, then the predicted value will be true. However, machine learning tools typically allow you to adjust the score threshold to alter the number of true and false values depending on your use case.
The main metric used to evaluate performance of a binary classification model is Area Under the Curve (AUC). The AUC is represented as a number between 0 and 1. A number closer to 1 indicates a highly accurate model. Values near 0.5 indicate the model is no better than guessing at random. Values close to 0 indicate the model has learned correct patterns, but is using them to make inverse predictions.
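The effect of the score threshold and the meaning of AUC can be sketched in a few lines. This example assumes the AUC in question is the area under the ROC curve, which equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function names and sample data are made up for illustration.

```python
def classify(score, threshold=0.5):
    """Return True when the predicted score is above the threshold."""
    return score > threshold


def auc(labels, scores):
    """ROC AUC via pairwise comparison; tied scores count as half a win."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [True, True, False, False]
scores = [0.9, 0.4, 0.6, 0.2]

# Lowering the threshold turns more predictions into True.
print([classify(s) for s in scores])
print([classify(s, threshold=0.3) for s in scores])

# Inverting the scores of a model mirrors its AUC around 0.5.
print(auc(labels, scores))
print(auc(labels, [1 - s for s in scores]))
```

The pairwise loop is quadratic in the number of observations, which is fine for a small sample but too slow for large evaluation sets.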

Binary classification models can be used to predict:

Whether a job candidate should be given an offer of employment, given information about their qualifications and interview scores.
Whether a loan application should be approved or rejected, given credit details about the applicant.
Whether someone will sign up for a service, given their demographics.
Whether a bank transaction is fraudulent, given information about how much that transaction deviates from the account's typical usage patterns.

Multiclassification

Multiclass models predict a categorical value from a list of three or more discrete, finite possibilities.
A multiclass model will return a list of values and their related probabilities. The value with the highest probability represents the model's best prediction. For example, if you are trying to predict which tier of support a customer service case should be routed to, a multiclass model might return: Tier 1 - 60%, Tier 2 - 13%, Tier 3 - 27%.
Since the target attribute's possible values are derived from training data, the model will never deliver a prediction value that did not occur in the training data.
The main metric used to determine the accuracy of a multiclass model is called an F1 score. The F1 score is the harmonic mean of precision and recall. The range is 0 to 1. The closer the value is to 1, the better the model.
Some machine learning tools set a limit on the number of possible predictable values that a multiclass model can have. This is because target attributes with hundreds or thousands of potential values can be difficult to train and have a higher likelihood of failure and poor model performance.
To make predictions from a group of possibilities that is larger than a machine learning tool's limit, consider using a series of different models. For example, to classify animal species from an image, better results can be achieved by first training a model for a more general classification (e.g. feline, canine, rodent). Additional models can then be trained to identify specific species within each general class.
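The sketch below shows how a best prediction is picked from a multiclass result, using the support-tier probabilities from the example above, and how an F1 score can be computed from precision and recall. Averaging the per-class F1 scores with equal weight (often called macro averaging) is an assumption made here. Tools differ in how they combine per-class scores, and the function names are hypothetical.

```python
def best_prediction(probabilities):
    """Return the label with the highest probability."""
    return max(probabilities, key=probabilities.get)


def f1_for_class(actual, predicted, label):
    """F1 score for one label: harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == label and p == label)
    fp = sum(1 for a, p in pairs if a != label and p == label)
    fn = sum(1 for a, p in pairs if a == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(actual, predicted):
    """Unweighted mean of the per-class F1 scores (an assumed averaging)."""
    labels = sorted(set(actual) | set(predicted))
    return sum(f1_for_class(actual, predicted, l) for l in labels) / len(labels)


routing = {"Tier 1": 0.60, "Tier 2": 0.13, "Tier 3": 0.27}
print(best_prediction(routing))  # Tier 1

actual = ["Tier 1", "Tier 1", "Tier 2", "Tier 2"]
predicted = ["Tier 1", "Tier 2", "Tier 2", "Tier 2"]
print(macro_f1(actual, predicted))
```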

Multiclass classification models can be used to predict:

Which category of car—sedan, truck or SUV—someone is likely to purchase, given their demographics.
A book's genre, given information about the book's author, length, characters, storyline, etc.

Appian AI Skills Use Case: Email Classification

The client receives thousands of emails every day for customer support. Employees manually forward these emails to appropriate departments and locations based on a review of the email description and the customer's location. This process is time-consuming and prone to human error. The client can automate this process using the Email Classification AI Skill that combines machine learning and automation. For the new model to be effective, the client must upload a "training set" consisting of a diverse set of emails which includes multiple examples for all desired email routing options. Once the model is trained and tested, the client can publish the model to make it available for use through the Classify Emails smart service.

Model Types Summary

Model	Prediction Type	Common Performance Metrics	Example
Regression	Predicts a numeric value	Root Mean Square Error (RMSE), Mean Absolute Error (MAE)	Predicting a home's sale price
Binary Classification	Predicts binary values (e.g. true or false)	Area Under the Curve (AUC)	Predicting whether a job candidate should be offered employment
Multiclass Classification	Predicts values that belong to a limited, predefined set of permissible values	F1 Score, Log Loss	Predicting a book's genre

Training Data

To create a model, you must supply the machine learning tool with training data that it will use to learn about associations between different attribute values of input data and the target attribute. The model ultimately applies the associations and patterns it found in the training data to make predictions for novel input data. There is a common adage that a model is “only as good as its training data.” If the training data is not a representative sample of the data against which it will be making predictions, the model’s performance will suffer.

Below is an example of a data structure that might be used for training data for a model designed to predict the sale price of a used car. In this use case, the column marked "Sale Price" would be identified to the model as the target attribute to predict for.

Year	Make	Model	Color	Transmission	Mileage	Previous Owners	Sale Price
1997	Ford	Mustang	Silver	Automatic	201,298	3	1,499
2013	Mazda	3	Black	Automatic	60,588	1	8,100
2005	Honda	Element	Red	Automatic	160,378	2	4,760
2009	Toyota	Camry	Blue	Manual	87,380	1	7,290

The details about how data should be ordered, formatted and uploaded to a machine learning tool for training vary depending on the specific tool being used, so refer to your tool's documentation for specific information about appropriately presenting data.
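As a minimal sketch of how the table above separates into input attributes and a target, the example below parses two of its rows from tab-separated text and splits off the "Sale Price" column. The helper names and the choice of which columns to treat as numeric are assumptions for illustration. A real tool will define its own upload format.

```python
import csv
import io

# Two rows from the used-car table, as tab-separated text.
RAW = (
    "Year\tMake\tModel\tColor\tTransmission\tMileage\tPrevious Owners\tSale Price\n"
    "1997\tFord\tMustang\tSilver\tAutomatic\t201,298\t3\t1,499\n"
    "2013\tMazda\t3\tBlack\tAutomatic\t60,588\t1\t8,100\n"
)

TARGET = "Sale Price"
# Assumed numeric columns; "Model" stays text even when it looks like a number.
NUMERIC = {"Year", "Mileage", "Previous Owners", "Sale Price"}


def load_rows(text):
    """Parse tab-separated text and convert numeric columns to int."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for raw in reader:
        row = {
            key: int(value.replace(",", "")) if key in NUMERIC else value
            for key, value in raw.items()
        }
        rows.append(row)
    return rows


def split_features_and_target(rows, target=TARGET):
    """Separate the input attributes from the attribute to predict."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


features, targets = split_features_and_target(load_rows(RAW))
print(features[0])
print(targets)
```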

Best Practices and Tips for Training Data

In general, the more training observations (i.e. rows of data) you provide during training, the more accurate the final model will be, provided the training data is diverse and balanced (e.g. observations are split equally between class A and class B) so that predictions are not biased. Google has documentation and a video on bias in machine learning that are helpful for learning more about this topic.
To the greatest extent possible, provide training data that resembles the data you expect to see in production.
Machine learning tools typically have both minimal requirements and limits regarding the size and complexity of training data. Read your tool's documentation for more details.
Some tools allow you to modify the weight given to specific columns during training, or specify a "time" column if training data values are influenced by time. Read your tool's documentation for more details.
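One way to check the balance of a classification training set before uploading it is to compute each class's share of the observations. The sketch below does this with hypothetical helper names. The 10-percentage-point tolerance is an arbitrary assumption, not a recommended value.

```python
from collections import Counter


def class_balance(targets):
    """Return each class's share of the training observations."""
    if not targets:
        raise ValueError("no training observations supplied")
    counts = Counter(targets)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_balanced(targets, tolerance=0.1):
    """True when every class is within `tolerance` of an equal share."""
    shares = class_balance(targets)
    expected = 1 / len(shares)
    return all(abs(share - expected) <= tolerance for share in shares.values())


evenly_split = ["A", "A", "B", "B"]
skewed = ["A"] * 9 + ["B"]

print(class_balance(evenly_split), is_balanced(evenly_split))
print(class_balance(skewed), is_balanced(skewed))
```

A skewed result does not always mean the data is wrong. If production data is equally skewed, the training set may still be representative, but the minority class may need more examples for the model to learn it well.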

See Also

Websites:

Best Practices for Creating Training Data
