High Availability and Disaster Recovery Configurations

The availability and resilience of an Appian production system is an important topic to consider when planning an Appian deployment. Failures that may disrupt availability must be identified early and mitigated effectively. And while some types of failures, such as those caused by a catastrophic event like a natural disaster, cannot be completely prevented, a robust architecture and sound restoration plan can limit the amount of system downtime and data loss caused by the failure.

The purpose of this page is to guide you through the configuration of an Appian production system prepared for high availability and disaster recovery by providing an understanding of the following:

  • How Appian components communicate.
  • Where Appian stores data.
  • How data can be backed up and restored.

High availability and disaster recovery are related concepts but are conceptually different. Deploying a high availability configuration for Appian ensures that if a failure occurs in any given component of the hardware or software, it will not cause the system to violate its service-level agreement (SLA) requirements. A disaster recovery plan ensures the continuity of operations with minimal delay in case of a catastrophic event.

The two concepts are not mutually exclusive. Every production system should employ disaster recovery procedures, regardless of whether it uses a high availability architecture. These concepts are described in detail in sections further below, but must be understood in the context of the Appian architecture.

Architectural Overview

An Appian production system is composed of several components that interact with and depend upon one another:

  • Java application running on an application server (Tomcat)
  • Appian Engines
  • Search Server
  • Transaction Management (Kafka, Zookeeper)
  • RDBMS (such as Oracle) used to persist both Appian application data as well as business data
  • Filesystem where the application writes application-generated content (such as user-uploaded documents, archived processes, and other files)

All of these components are discussed below.

In addition to these core components, an Appian production system may employ a firewall, hardware SSL accelerator, load balancer, web servers, or NAS/SAN servers - among other devices. A complete high availability and disaster recovery architecture must consider the failure of these components as well. Redundancy configurations and recovery procedures for these ancillary components, however, are not covered in detail here. Refer to the component vendors’ documentation for industry standards and best practices.

Appian Application

The front-end component of Appian is a standard Web Application Archive (WAR) deployment structure. The Appian WAR may be deployed to multiple application servers to meet both scalability and redundancy requirements.

See also: High Availability and Distributed Installations

The Appian WAR communicates with the Appian engines via the Appian Service Manager, which uses TCP/IP for communication. The Service Manager runs alongside the Appian engines and is responsible for distributing read and write calls among them.

Search Server

The search server component of the Appian architecture powers search, design object dependency analysis, and usage and performance metrics reporting. It is a required data persistence and reporting component of the architecture. When it comes to data redundancy, backup, and recovery, the search server should be treated as equivalent to the Appian engines or RDBMS.

Search server data is persisted on disk with other application data and should be part of the same backup procedures that handle application data for the overall system.

Transaction Management

Transaction management is handled by two services: Kafka and ZooKeeper. These components handle transaction log persistence and node leadership election for high availability (HA) and distributed environments.

Kafka's real-time synchronous transaction logs also capture all calls that result in data changes. This data is written to disk and the associated files are a critical part of any backup and recovery strategy.

Each of these services has its own OS process.

Appian Engines

Appian engines are real-time in-memory (RAM) databases, and use a .kdb file extension. At startup, engines are loaded into memory using a combination of data in these .kdb files and any transactions that need to be replayed from the Kafka logs.

There are different Appian engines that power the various parts of the Appian suite, including process design, process execution, process analytics, business rules, user authentication and authorization, and document management, among others. In a default configuration, there are 9 engines plus 3 pairs of process execution and analytics engines, resulting in 15 total Appian engines. A site can be configured with up to 32 pairs of process execution and analytics engines, resulting in up to 73 total Appian engines.

Appian engines store data in memory as well as on disk. A properly shut down Appian engine writes all of its transactions to its .kdb file. A running Appian engine is accompanied by a transaction log, which it replays in the event of an improper shutdown. A checkpoint writes the engine's current state to a new .kdb file and clears the transaction log, so the new engine file has no accompanying transaction log. As such, your backup strategy will need to include both the .kdb files and the transaction logs.
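
To make this dependency concrete, the sketch below (Python, with hypothetical directory paths that must be adjusted to match your installation) assembles a backup manifest containing both the engine .kdb files and the Kafka transaction logs, and warns when either half of the pair is missing.

    # Illustrative sketch only; the directory layout shown here is hypothetical.
    from pathlib import Path

    ENGINE_DATA_DIR = Path("/opt/appian/server")                  # assumed location of .kdb files
    KAFKA_LOG_DIR = Path("/opt/appian/services/data/kafka-logs")  # assumed transaction log location

    def build_backup_manifest():
        """Collect every file needed to restore the engines: .kdb images plus Kafka logs."""
        kdb_files = sorted(ENGINE_DATA_DIR.rglob("*.kdb"))
        kafka_files = sorted(KAFKA_LOG_DIR.rglob("*")) if KAFKA_LOG_DIR.exists() else []
        if not kdb_files:
            raise RuntimeError("No .kdb files found; the backup would not be restorable")
        if not kafka_files:
            print("Warning: no Kafka transaction logs found; "
                  "transactions since the last checkpoint would be lost")
        return [p for p in kdb_files + kafka_files if p.is_file()]

    if __name__ == "__main__":
        for path in build_backup_manifest():
            print(path)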

The Service Manager manages these database processes, handling load balancing of calls across the engines, engine status monitoring, and checkpointing.

Each database runs in its own OS process. Therefore, in a standard install, the 15 different engines result in 15 separate Appian processes running on the server. Engines can be split, however, between physical servers or replicated across servers for failover purposes.

Service Manager will run as a single OS process on each node running Appian engines or transaction management processes in your architecture.

Relational Database Management System

Appian leverages an RDBMS as a required component of its architecture. Application data, such as News entries and comments, as well as data type definitions, are stored in the RDBMS configured as the primary data source. Business data created by Appian applications and stored using the Write to Data Store Entity Smart Service is stored in the RDBMS(s) configured as secondary data sources. Business data written and accessed by Appian applications may reside in one or more schemas or RDBMS installations. As such, the redundancy, backup, and recovery of each RDBMS must be considered.

See also: Write to Data Store Entity Smart Service

The data storage, replication, backup, and recovery capabilities and procedures differ depending on the RDBMS vendor. While the specifics of each vendor’s capabilities are not discussed in this document, the high-level requirements for those capabilities used in conjunction with an Appian system are presented in the sections below.

High Availability

The application servers, transaction management, engine servers, search server, and database all have different high availability considerations.

Application High Availability Considerations

The Appian Application can be configured without a single point of failure by installing the front-end components as a cluster. This ensures the failure of any single front-end web or application server will not affect the availability of the system.

Figure 1: Application Server Failover Configuration

 

Client requests are load balanced between the web and application servers. If either the web server or application server were to fail, the other server would continue to handle all requests.

Search Server High Availability Considerations

The search server can be configured as a cluster to provide automatic data redundancy and high availability. When configured with three or more nodes, a search server cluster can lose a node and continue operating with full functionality as long as a majority of the nodes remain in the cluster. A two-node cluster provides data redundancy but does not provide the capability to automatically fail over if one of the nodes goes down. Instead, the system must be manually recovered by restarting or replacing the failed node. Alternatively, a two-node cluster can be temporarily downgraded to a single node to continue operation with full functionality but no data redundancy. For the highest availability in the case of failure of a single node, a three-node search server cluster is recommended.

See also: Search Server

Transaction Management High Availability Considerations

The transaction management components can also be configured as a cluster to provide automatic data redundancy and high availability. When configured with three nodes, a transaction management cluster can lose a node and continue operating with full functionality as long as a majority of the nodes remain in the cluster. A two-node cluster provides data redundancy but does not provide the capability to automatically fail over if one of the nodes goes down. Instead, the system must be manually recovered by restarting or replacing the failed node. Alternatively, a two-node cluster can be temporarily downgraded to a single node to continue operation with full functionality but no data redundancy. For high availability in the case of failure of a single node, a three-node transaction management cluster is required.
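
The majority rule behind these sizing recommendations can be illustrated with a short Python sketch: a cluster keeps operating automatically only while more than half of its nodes remain, which is why a three-node cluster tolerates one failure while a two-node cluster tolerates none.

    def quorum(cluster_size: int) -> int:
        """Smallest number of nodes that constitutes a majority."""
        return cluster_size // 2 + 1

    def tolerated_failures(cluster_size: int) -> int:
        """How many nodes can fail while a majority still remains."""
        return cluster_size - quorum(cluster_size)

    for size in (1, 2, 3, 5):
        print(f"{size}-node cluster: quorum = {quorum(size)}, "
              f"automatic operation survives {tolerated_failures(size)} failure(s)")
    # 2-node cluster: quorum = 2, survives 0 failures -> manual recovery needed
    # 3-node cluster: quorum = 2, survives 1 failure  -> automatic failover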

Note: Clustering of transaction management components, and therefore, Software Engine Failover, is currently not supported on Windows.

Relational Database High Availability Considerations

The relational databases should be configured in a distributed, multi-master, ACID-compliant cluster architecture with no single point of failure. Provided that the cluster is configured correctly, any single node, system, or piece of hardware can fail without the entire cluster failing.

All major relational databases support clustered configurations.

Figure 2: Database replication clustering

 

Appian Engines High Availability Considerations

Appian provides two options for ensuring high availability:

  1. Software Engine Failover (Recommended for Linux; Not supported on Windows) - This option entails configuring an Appian engine cluster in which two or more sets of Appian engines automatically replicate data, detect failures, and manage failover from the primary to a secondary engine server.
  2. Hardware Engine Failover (For Windows) - With this option, a second Appian engine server is on standby, ready to be started if monitoring reveals the primary server has failed.

Note: Software failover is currently not supported on Windows.

Software Engine Failover Configuration

Appian engines and transaction management can be configured in a cluster of three identical nodes. This allows a primary node to be elected via the native leadership election mechanism, ensuring that data writes are directed consistently to a single node.

 

The figure below shows a typical software failover configuration:

Figure 3: Leader and Replica Engine Server Architecture

For more information on configuring a clustered environment, see the High Availability and Distributed Installations documentation.

 

Hardware Engine Failover Configuration

Appian engines can be configured with a duplicate engine server to compensate for a failure of the master engine server. This ensures that if the master engine server fails, a duplicate can be brought online.

Application servers are configured to send data to whichever engine server is available at the time. This is handled automatically by the Appian Configuration Manager. If the master engine server fails, the application servers will automatically detect the availability of the duplicate server; no changes need to be made to the application servers when the master engine fails.

The Hardware Engine Failover documentation describes various ways to configure hardware engine failover depending on your environment.

The figure below shows a typical hardware failover configuration:

Figure 4: Master and Duplicate Engine Server Architecture

The solid green lines show the normal flow of communication. The dotted red lines show the flow of communication when the master engine is unavailable.

Requirements for a High Availability Engine Setup

In order to set up engine servers for high availability, the following requirements must be satisfied:

  • Duplicate hardware must be available for the cluster to provide redundancy.
    • If using virtual machines, the failover image must reside on a separate physical server.
  • A storage array (SAN or NAS) with fast disk I/O speeds must be available to all servers.
    • Appian Support can provide a benchmarking script to determine if your storage array speeds are sufficient.
  • IT monitoring must be enabled for all servers to detect failure events.
    • Failure procedures must be in place to make sure immediate steps are taken to restore the system to the normal configuration as soon as possible. Recommendations for IT monitoring can be found in the product documentation located on Appian Forum.

Appian Engines High Availability Recommendation

Appian recommends Software Engine Failover for Linux and Hardware Engine Failover for Windows.

Software engine failover is preferred due to its near-zero RTO, at the cost of additional hardware. For mission-critical systems that require high availability, software engine failover will provide the best SLA.

High Availability Testing

In order to ensure high availability will work during a real outage, the configuration should be tested periodically by simulating an outage. Verify that replication and failover procedures work correctly by disconnecting network connections or shutting down primary servers. Conduct user testing before and after the simulated outage to verify the site functions as expected.
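
A small automated smoke test can supplement this manual verification. The sketch below (Python; the site URL is a placeholder for your environment) simply checks whether the site answers requests, and can be run before the simulated outage, during failover, and after recovery.

    # Sketch only: the site URL is a placeholder for your Appian environment.
    import urllib.request

    SITE_URL = "https://appian.example.com/suite/"

    def site_is_responding(timeout_s: float = 10.0) -> bool:
        """Return True if the site answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(SITE_URL, timeout=timeout_s) as resp:
                return resp.status == 200
        except OSError:
            return False

    # Compare the result to the expected availability at each stage of the test.
    print("site responding:", site_is_responding())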

Appian does not prescribe a frequency for such testing, but consider it a best practice to conduct the test after first configuring the production servers and repeat the test at least once a year.

Disaster Recovery

Disaster recovery (DR) is the process, policies, and procedures related to preparing for recovery or continuation of the Appian system after an entire environment suffers a catastrophic failure, referred to as a Major Incident.

High availability as described above is insufficient for use as a disaster recovery strategy because it requires a collocated standby duplicate Appian engine server for hardware engine failover or a stable network with very low latency for software engine failover. Instead, Appian recommends a cold-failover disaster recovery configuration with geographically-separated copies of the full server stack.

 

Figure 5: Disaster Recovery Configuration

 

When considering the configuration of the system to enable a disaster recovery plan, the metrics listed below must be decided on by the business based on their tolerance for downtime and data loss. Once defined by the business, these metric requirements will drive both the architecture of the disaster recovery setup as well as the IT policies and procedures related to handling a Major Incident (defined as any event that causes total unavailability of the system).

  • Recovery Point Objective (RPO) - Maximum tolerable period in which data might be lost due to a Major Incident.
  • Recovery Time Objective (RTO) - Duration of time in which the system must be restored after a Major Incident.

Recovery Point Objective (RPO)

In an Appian system, the RPO is determined by how often data from the primary environment is replicated to the disaster recovery environment.

Data is defined as the database .kdb files, RDBMS data, and files on disk representing documents uploaded to the Appian document management system and file content created by the application.

It is also assumed that the Appian configurations on the application server, engine servers, and RDBMS are kept in sync between the primary and disaster recovery environments, which is a standard practice when using software configuration management tools. For easiest recovery, replication of the Appian .kdb files and transaction logs, the RDBMS data, and the files on disk should be kept in sync with one another, so that Appian engine data does not become out of phase with the application and business data stored in the database.
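
As a rough illustration of how replication frequency bounds the RPO, the sketch below (Python, with illustrative numbers rather than measurements) treats the worst-case data loss window as the replication interval plus replication lag and checks it against a target RPO.

    def worst_case_data_loss(replication_interval_min: float,
                             replication_lag_min: float = 0.0) -> float:
        """Upper bound on data lost if the primary fails just before the next replication."""
        return replication_interval_min + replication_lag_min

    def meets_rpo(replication_interval_min: float, rpo_min: float,
                  replication_lag_min: float = 0.0) -> bool:
        """True if the chosen replication schedule satisfies the target RPO."""
        return worst_case_data_loss(replication_interval_min, replication_lag_min) <= rpo_min

    # Example: replicating every 4 hours (240 min) with up to 30 min of lag
    # against a 6-hour (360 min) RPO. The values are illustrative only.
    print(meets_rpo(replication_interval_min=240, rpo_min=360, replication_lag_min=30))  # True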

Recovery Time Objective (RTO)

The minimum RTO will be defined by how quickly the application server and Appian engines can be started after a Major Incident occurs. Assuming that the system configuration is kept up-to-date and the startup procedures are well documented, starting Appian should be a very quick and easy process.

The startup speed of the Appian engines depends on how recently they were checkpointed. You can reduce startup time by checkpointing before backing up the Appian engine databases.

Disaster Recovery Configurations

A disaster recovery plan is prepared by setting up periodic backups that replicate all data from the primary site to the failover site.

In the event of a Major Incident at the primary site, the failover site is started with the most recent backup, and user requests are routed to the failover site. Users with existing sessions that were active on the primary site must log in again to restart their session on the failover site.

  • No critical data is restored in the new session.
  • Any data entered by users will be current as of the most recent backup replicated to the failover site.

There are three high-level strategies for performing the backup of the Appian components. The details of each strategy are outlined below.

Strategy 1: Freeze and Snapshot

In order to back up the Appian transactions since the last Appian checkpoint, the Kafka logs must be part of the backup strategy. However, real-time disk replication cannot be used for this component, as doing so can cause a service interruption for the running environment. Snapshot methods that temporarily freeze the file system, take the snapshot, and then resume all queued write transactions are recommended. Useful tools for implementing this type of snapshot are the fsfreeze (or xfs_freeze) command, LVM snapshots, or VMware Tools.
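
A minimal sketch of that freeze-snapshot-thaw sequence follows (Python driving fsfreeze and an LVM snapshot; the mount point and logical volume names are placeholders, and the commands require root privileges). The unfreeze call sits in a finally block so the file system is never left frozen.

    # Sketch only: volume group, logical volume, and mount point are placeholders.
    import subprocess

    MOUNT_POINT = "/appian-data"                 # file system holding the .kdb files and kafka-logs
    SNAPSHOT_SOURCE = "/dev/vg_appian/lv_data"   # LVM logical volume backing that mount point

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def take_consistent_snapshot(snapshot_name: str, size: str = "10G"):
        run(["fsfreeze", "--freeze", MOUNT_POINT])             # pause and queue writes
        try:
            run(["lvcreate", "--snapshot", "--size", size,
                 "--name", snapshot_name, SNAPSHOT_SOURCE])    # point-in-time copy
        finally:
            run(["fsfreeze", "--unfreeze", MOUNT_POINT])       # always resume writes

    if __name__ == "__main__":
        take_consistent_snapshot("appian-dr-snapshot")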

Pros & Cons

Pros:

  • Scheduled snapshots allow for simple synchronization of Appian components such as KDBs, Kafka logs, and documents.
  • Checkpointing strategy does not need to be altered to account for disaster recovery RPO.
  • Appian installation does not need to be maintained on disaster recovery servers since snapshots will contain the installation from production.

Cons:

  • Real-time disk replication technologies cannot be leveraged.
  • Recovery time is slightly extended since transaction logs need to be replayed into memory.

Requirements

In order to set up Appian in a disaster recovery environment via snapshots, the following requirements must be satisfied:

  • Snapshots of servers that contain components of your Appian system must be taken at the same time. The frequency of replication needed is determined by the RPO.
  • RDBMS data must be replicated from primary to disaster recovery environments with the same frequency of replication as the Appian data.
  • The disaster recovery environment should mirror the production environment. However, going from an HA production site to a non-HA disaster recovery site is possible.
  • If using the snapshots as the Appian install, Appian configurations that are specific to the disaster recovery environment must be available to be applied to the install in a recovery scenario.
  • The network must be able to route users to the disaster recovery environment in the case of a Major Incident. The routing capability is provided by the network components (for example, DNS) and not an Appian component. Keep this in mind when determining an RTO.
  • IT Systems Administrators in the disaster recovery environment must be trained on how to start the Appian system in the case of a Major Incident.

Procedure

  1. Since Kafka logs are replicated with this strategy, do not alter current checkpointing strategy to account for the target RPO.
    1. Compare the current checkpointing strategy to the target RTO to make sure they align. More transactions since the last checkpoint will result in a longer Appian start time.
  2. Configure snapshots of Appian servers based on the target RPO
    1. Include the Appian install, local files and shared files.
    2. Balance the desire to have a low RPO with the notion that frequent backups may have a negative performance outcome for the environment.
  3. Back up the schemas/tables in the RDBMS used by Appian using the RDBMS vendor's preferred backup method.
    1. If multiple RDBMSs and/or schemas are used for the primary and secondary data sources, they all must be backed up.
    2. If an RDBMS contains schemas used by enterprise applications other than Appian, do not back up that data with Appian unless it is acceptable to restore those applications to the same restore point as Appian.
    3. If the database is replicated in real time, restore the database to the point in time when the Appian snapshots occurred.
  4. Copy the snapshots to the backup location (a sketch for checking that the snapshots were taken at the same time appears after this list).
    1. This can mean copying the snapshots to a storage location or applying them directly to the disaster recovery environment to expedite the restoration process.
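
Because the requirements above call for snapshots of all servers to be taken at the same time, a check like the following sketch (Python, with placeholder timestamps that would normally be read from snapshot metadata) can flag a backup set whose snapshots are too far apart to restore consistently.

    from datetime import datetime

    # Illustrative snapshot timestamps per server (placeholders).
    snapshot_times = {
        "app-server-1":    datetime(2024, 1, 15, 2, 0, 5),
        "engine-server-1": datetime(2024, 1, 15, 2, 0, 9),
        "search-server-1": datetime(2024, 1, 15, 2, 0, 7),
    }

    MAX_SKEW_SECONDS = 60  # tolerance for "taken at the same time"; choose to suit your RPO

    def snapshot_skew_seconds(times: dict) -> float:
        """Spread between the earliest and latest snapshot in the backup set."""
        return (max(times.values()) - min(times.values())).total_seconds()

    skew = snapshot_skew_seconds(snapshot_times)
    print(f"Snapshot skew across servers: {skew:.0f}s")
    if skew > MAX_SKEW_SECONDS:
        raise SystemExit("Snapshots too far apart; repeat the backup before copying it offsite")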

Restoration

The restoration procedure is essentially the backup procedure in reverse. The data is restored to the recovery system, and then that system is started.

It is important that the system restoring the data has the same Appian configurations as the primary site. This approach accomplishes that by using the snapshots as the Appian install and the configure script to apply any differences needed for the disaster recovery environment. The restoration procedure must be completed as follows:

  1. Restore the RDBMS data using the RDBMS vendor's preferred mechanism.
    1. The RDBMS should be started and made available to the application once the data is restored and verified to match the time of the snapshots.
  2. Restore the snapshots on the appropriate disaster recovery servers.
    1. Verify that the KDBs, Kafka logs, and documents are present.
  3. Run the configure script to apply the disaster recovery configurations to the production snapshots.
    1. This includes, but is not limited to, custom.properties differences and database configuration differences.
    2. See: Configure Script
  4. Start Appian.
    1. See: Starting and Stopping Appian.
  5. Perform the required network procedures to direct traffic to the failover site.
    1. This depends on the method by which network traffic will be directed, but may include updating DNS records.

Strategy 2: Checkpoint & Backup

An alternative disaster recovery strategy, used in previous versions of Appian, focuses on the .kdb files rather than the Kafka transaction logs. At a high level, the checkpointing configuration is altered to meet the RPO needs of the business, and the disaster recovery backup of specific files is taken across the system once the checkpoint is complete.

Pros & Cons

Pros:

  • Snapshot technology is not required as the strategy incorporates the Appian checkpoint script and copying of specific files from the production servers to the backup servers.
  • Lowest RTO: recovery time attributed to the Appian startup duration is the minimum value since every backup is taken after an engine checkpoint.

Cons:

  • Checkpointing strategy must be altered to account for disaster recovery RPO. This may include more frequent checkpointing and using the checkpointing script rather than the configuration properties.
  • Appian installation must be installed and maintained on the disaster recovery environment. For example, if a hotfix is applied to production, it must be applied to the disaster recovery environment as well.

Requirements

In order to set up Appian in a disaster recovery environment, the following requirements must be satisfied:

  • Appian must be installed and fully configured in the disaster recovery environment.
  • Appian configurations must be kept in sync between primary and disaster recovery environments.
  • Appian data, including file system data, must be replicated from primary to disaster recovery environments. The frequency of replication needed is determined by the RPO.
  • RDBMS data must be replicated from primary to disaster recovery environments with the same frequency of replication as the Appian data.
  • The disaster recovery environment should mirror the production environment. However, going from an HA production site to a non-HA disaster recovery site is possible.
  • The network must be able to route users to the disaster recovery environment in the case of a Major Incident. The routing capability is provided by the network components (for example, DNS) and not an Appian component. Keep this in mind when determining an RTO.
  • IT Systems Administrators in the disaster recovery environment must be trained on how to start the Appian system in the case of a Major Incident.

Procedure

When backing up Appian data, follow the recommended steps below.

This procedure can be followed to replicate data to a failover site by copying the data files to a NAS/SAN device that automatically syncs with the failover site or by using standard tools to transfer the data over the network. Backups generated using this procedure may also be stored at an offsite location on disk or tape for recovery.

It is important for a proper recovery that the backed up data are consistent. Therefore, all data storage mechanisms must be backed up together in the order listed below.

  1. Prepare the Appian Engine database files for backup by running the checkpoint script.
    1. This will condense all of the transactions in the transaction log into the data image, which can be started immediately.
    2. This will allow for the minimum RTO since there will be only a small number of transactions to replay upon startup.
  2. Copy Appian data from the environment to the backup location.
    1. Proceed to this step only after the checkpointing from the previous step has completed.
    2. This includes .kdb files, documents, plugins, process models, etc.
    3. The full list of data can be found in the Appian Update Guide (including the search server section), except that this strategy should NOT include the kafka-logs or data server directories; a scripted sketch of steps 1 and 2 follows this list.
  3. Back up the schemas/tables in the RDBMS used by Appian using the RDBMS vendor's preferred backup method.
    1. If multiple RDBMSs and/or schemas are used for the primary and secondary data sources, they all must be backed up.
    2. If an RDBMS contains schemas used by enterprise applications other than Appian, do not back up that data with Appian unless it is acceptable to restore those applications to the same restore point as Appian.
    3. If the database is replicated in real time, restore the database to the point in time when the Appian backup was taken.
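
As referenced above, here is a minimal sketch of steps 1 and 2 in Python. The checkpoint script path, source directories, backup destination, and excluded directory names are assumptions to adapt to your installation; the copy deliberately skips the kafka-logs and data server directories, as this strategy requires.

    # Sketch only: script path, source directories, and destination are placeholders.
    import shutil
    import subprocess
    from pathlib import Path

    CHECKPOINT_SCRIPT = "/opt/appian/services/bin/checkpoint.sh"       # assumed location
    SOURCE_DIRS = [Path("/opt/appian/server"), Path("/opt/appian/shared-docs")]
    EXCLUDED_DIR_NAMES = ("kafka-logs", "data-server")                 # excluded per this strategy
    BACKUP_ROOT = Path("/backup/appian")

    def checkpoint_engines():
        """Step 1: condense the transaction logs into the .kdb images before copying."""
        subprocess.run([CHECKPOINT_SCRIPT], check=True)

    def copy_appian_data():
        """Step 2: copy the data files, skipping the directories this strategy excludes."""
        for src in SOURCE_DIRS:
            dest = BACKUP_ROOT / src.name
            shutil.copytree(src, dest, dirs_exist_ok=True,
                            ignore=shutil.ignore_patterns(*EXCLUDED_DIR_NAMES))

    if __name__ == "__main__":
        checkpoint_engines()
        copy_appian_data()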

Restoration

The restoration procedure is essentially the backup procedure in reverse. The data is restored to the recovery system, and then that system is started.

It is important that the system restoring the data has the same Appian configurations as the primary site. This should be set up ahead of time, and any configuration file changes applied to the primary site should be applied to the failover site at the same time.

The restoration procedure must be completed as follows:

  1. Completely shut down the disaster recovery site.
  2. Restore the RDBMS data using the RDBMS vendor's preferred mechanism.
    1. The RDBMS should be started and made available to the application once the data is restored and verified.
  3. Restore the Appian data by copying the backed up files to the corresponding locations on the recovery system.
    1. Any pre-existing files on the recovery system should be removed prior to this restoration.
    2. The full list of data can be found in the Appian Update Guide.
  4. Start Appian.
    1. See: Starting and Stopping Appian.
  5. Perform the required network procedures to direct traffic to the failover site.
    1. This depends on the method by which network traffic will be directed, but may include updating DNS records.

Strategy 3: Run High Availability Across Data Centers 

If there is low network latency between data centers (under 10 ms), then nodes of an Appian high availability system may reside in different data centers. However, implementing this setup without that latency guarantee will cause serious performance issues in the environment. This setup can serve as both the high availability and disaster recovery strategy.
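
One way to sanity-check that latency requirement is sketched below: a Python snippet that measures TCP connect round-trip times to a node in the other data center. The host, port, and threshold are placeholders, and a dedicated network measurement tool is preferable for a formal assessment.

    # Sketch only: host and port are placeholders for a node in the other data center.
    import socket
    import statistics
    import time

    REMOTE_HOST = "appian-node-dc2.example.com"
    REMOTE_PORT = 22          # any reachable TCP port gives a rough round-trip estimate
    SAMPLES = 20
    LATENCY_BUDGET_MS = 10.0  # threshold discussed above for stretching the cluster

    def measure_rtt_ms() -> float:
        """Time a TCP connection setup as a rough proxy for network round-trip time."""
        start = time.perf_counter()
        with socket.create_connection((REMOTE_HOST, REMOTE_PORT), timeout=2):
            pass
        return (time.perf_counter() - start) * 1000

    rtts = [measure_rtt_ms() for _ in range(SAMPLES)]
    print(f"median RTT {statistics.median(rtts):.1f} ms, max {max(rtts):.1f} ms")
    if statistics.median(rtts) >= LATENCY_BUDGET_MS:
        print("Latency too high to run the Appian cluster across these data centers")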

Pros & Cons

Pros:

  • Kafka logs are replicated using Kafka replication.
  • Lowest RPO: Appian data is replicated in real time.

Cons:

  • Requires reliably low network latency; inconsistency can result in performance degradation in production.
  • Additional considerations regarding shared files must be addressed.

Requirements

In order to set up Appian in a disaster recovery environment, the following requirements must be satisfied:

  • Network latency between data centers must be below 10 ms
  • Shared files must be shared across data centers
  • Shared file technology must have a real-time replication strategy to use in a failure scenario
  • RDBMS must be available across data centers
  • RDBMS must have a real-time replication strategy to use in a failure scenario
  • The installation has either 3 nodes in separate data centers or 2 nodes in one data center with the 3rd in a separate data center.
  • Appian configurations that are specific to the disaster recovery environment must be available to be applied to the remaining install node in a recovery scenario.
  • The network must be able to route users to the disaster recovery environment in the case of a Major Incident. The routing capability is provided by the network components (for example, DNS) and not an Appian component. Keep this in mind when determining an RTO.
  • IT Systems Administrators in the disaster recovery environment must be trained on how to start the Appian system in the case of a Major Incident.

Procedure

When backing up Appian data, follow the recommended steps below.

It is important for a proper recovery that the backed up data are consistent. Therefore, all data storage mechanisms must be backed up together in the order listed below.

  1. Set up Appian high availability across data centers.
    1. This will cover Kafka log replication.
  2. Set up shared files such that the files are shared across the data center nodes and have a real-time backup strategy in case a failure occurs in the shared file technology.
    1. This covers Appian KDB files, documents, and other components.
  3. Set up the RDBMS to be available across data centers and back up the schemas/tables in real time using the RDBMS vendor's preferred backup method.
    1. If multiple RDBMSs and/or schemas are used for the primary and secondary data sources, they all must be backed up.
    2. If an RDBMS contains schemas used by enterprise applications other than Appian, do not back up that data with Appian unless it is acceptable to restore those applications to the same restore point as Appian.

Restoration

The restoration procedure is essentially the backup procedure in reverse. The data is restored to the recovery system, and then that system is started.

It is important that the system restoring the data has the same Appian configurations as the primary site. This should be set up ahead of time, and any configuration file changes applied to the primary site should be applied to the failover site at the same time.

The restoration procedure must be completed as follows:

  1. Stop Appian on the node in the disaster recovery environment.
    1. See: Starting and Stopping Appian.
  2. Restore the RDBMS data using the RDBMS vendor's preferred mechanism.
    1. The RDBMS should be started and made available to the application once the data is restored and verified to match the time of the failure.
  3. Ensure the shared files needed to run Appian are available to the disaster recovery environment.
    1. This might require pointing to the backup copy of the shared files.
  4. Run the configure script to apply the disaster recovery configurations to the environment.
    1. This includes but is not limited to topology differences, custom.properties differences, and database configuration differences.
    2. See: Configure Script
  5. Start Appian.
    1. See: Starting and Stopping Appian.
  6. Perform the required network procedures to direct traffic to the failover site.
    1. This depends on the method by which network traffic will be directed, but may include updating DNS records.

Disaster Recovery Testing

A disaster recovery plan that has not been tested is not complete. Verify the disaster recovery site can be used to recover from backups by periodically using the backed-up data to restore to the failover site. After restoring the site, conduct user testing to verify the site functions as expected.

Appian does not prescribe a frequency for such testing, but consider it a best practice to conduct the test after first configuring the disaster recovery site and procedures and repeat the test at least once a year.

Disaster Recovery Recommendations

  1. Determine RPO and RTO based on business needs for the system.
  2. Choose a backup strategy based on those goals and technologies available to the organization.
  3. Test and rehearse the disaster recovery procedures.
  4. Use the backup procedures to reliably recover the system to a consistent state.

Summary of Recommendations

  1. Use a Software Engine Failover High Availability configuration to eliminate single points of failure within the same data center.
    • Ensure High Availability requirements are met.
    • Test high availability configurations work properly.
  2. Use a Disaster Recovery configuration to set up a procedure to recover from a Major Incident that causes multiple component failures or entire system failure.
    • Define requirements based on business objectives.
    • Define and test recovery procedures.