The availability and resilience of an Appian production system are important considerations when planning an Appian deployment. Failures that may disrupt availability must be identified early and mitigated effectively. While some types of failures, such as those caused by a catastrophic event like a natural disaster, cannot be completely prevented, a robust architecture and a sound restoration plan can limit the system downtime and data loss they cause.
The purpose of this page is to guide you through the configuration of an Appian production system prepared for high availability and disaster recovery by providing an understanding of the following:
High availability and disaster recovery are related but distinct concepts. Deploying a high availability configuration for Appian ensures that a failure in any given hardware or software component does not cause the system to violate its service-level agreement (SLA) requirements. A disaster recovery plan ensures the continuity of operations with minimal delay in case of a catastrophic event.
The two concepts are not mutually exclusive. Every production system should employ disaster recovery procedures, regardless of whether it uses a high availability architecture. These concepts are described in detail in sections further below, but must be understood in the context of the Appian architecture.
An Appian production system is composed of several components that interact with and depend upon one another:
All of these components are discussed below.
In addition to these core components, an Appian production system may employ a firewall, hardware SSL accelerator, load balancer, web servers, or NAS/SAN servers, among other devices. A complete high availability and disaster recovery architecture must consider the failure of these components as well. Redundancy configurations and recovery procedures for these ancillary components, however, are not covered in detail here. Refer to the component vendors’ documentation for industry standards and best practices.
The front-end component of Appian is a standard Web Application Resource (WAR) deployment structure. The Appian WAR may be deployed to multiple application servers to meet both scalability and redundancy requirements.
See also: High Availability and Distributed Installations
The Appian WAR communicates with the Appian engines via Service Manager, which uses TCP/IP for communication and is responsible for distributing read and write calls among the Appian engines.
The search server component of the Appian architecture powers search, design object dependency analysis, and usage and performance metrics reporting. It is a required data persistence and reporting component of the architecture. When it comes to data redundancy, backup, and recovery, the search server should be considered equivalent to the Appian engines or the RDBMS.
Search server data is persisted on disk with other application data and should be part of the same backup procedures that handle application data for the overall system.
Transaction management is handled by two services: Kafka and Zookeeper. These components handle transaction log persistence and node leadership election for high availability and distributed environments.
Kafka's real-time synchronous transaction logs also capture all calls that result in data changes. This data is written to disk and the associated files are a critical part of any backup and recovery strategy.
Each of these services has its own OS process.
Appian engines are real-time, in-memory (RAM) databases whose data files use a .kdb extension. At startup, the engines are loaded into memory using a combination of the data in these .kdb files and any transactions that need to be replayed from the Kafka logs.
There are different Appian engines that power the various parts of the Appian suite, including process design, process execution, process analytics, business rules, user authentication and authorization, and document management. In a default configuration, there are 9 engines plus 3 pairs of process execution and analytics engines, resulting in 15 total Appian engines. A site can be configured with up to 32 pairs of process execution and analytics engines, resulting in up to 73 total Appian engines.
Appian engines store data both in memory and on disk. A properly shut down Appian engine has written all transactions to its .kdb file. A running Appian engine is accompanied by a transaction log, which it replays in the event of an improper shutdown. A checkpoint writes the engine's current state to a new .kdb file and clears the transaction log, so the new engine file has no accompanying transaction log to replay. As such, your backup strategy must include both the .kdb files and the transaction logs.
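To make the scope of that backup concrete, the sketch below (a minimal Python example, not an Appian-provided tool) copies the engine .kdb files and the Kafka transaction logs into a single timestamped backup directory. The ENGINE_DATA_DIR and KAFKA_LOG_DIR paths are placeholders and must be replaced with the actual locations in your installation, and the copy should only be taken at a consistent point in time, such as immediately after a checkpoint or from a frozen file system snapshot (see the disaster recovery sections below).

```python
import shutil
import time
from pathlib import Path

# Placeholder locations -- adjust to your installation's actual paths.
ENGINE_DATA_DIR = Path("/usr/local/appian/ae/server")                  # engine .kdb files (assumption)
KAFKA_LOG_DIR = Path("/usr/local/appian/ae/services/data/kafka-logs")  # transaction logs (assumption)
BACKUP_ROOT = Path("/backup/appian")

def back_up_engine_data() -> Path:
    """Copy the engine .kdb files and the Kafka transaction logs into a
    timestamped backup directory. Run this only against a consistent
    point in time (e.g. after a checkpoint or a file system freeze)."""
    target = BACKUP_ROOT / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True)

    # Copy every .kdb file, preserving the directory layout under the engine data dir.
    for kdb in ENGINE_DATA_DIR.rglob("*.kdb"):
        dest = target / "engines" / kdb.relative_to(ENGINE_DATA_DIR)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(kdb, dest)

    # Copy the Kafka transaction logs alongside the .kdb files so the two stay in phase.
    shutil.copytree(KAFKA_LOG_DIR, target / "kafka-logs")
    return target

if __name__ == "__main__":
    print(f"Backup written to {back_up_engine_data()}")
```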
The Service Manager manages these database processes, handling load balancing of calls across the engines, engine status monitoring, and checkpointing.
Each engine database runs in its own OS process. Therefore, in a standard install, the 15 different engines result in 15 separate Appian processes running on the server. Engines can, however, be split between physical servers or replicated across servers for failover purposes.
Service Manager will run as a single OS process on each node running Appian engines or transaction management processes in your architecture.
Appian leverages an RDBMS as a required component of its architecture. Application data, such as News entries and comments, as well as data type definitions, are stored in the RDBMS configured as the primary data source. Business data created by Appian applications and stored using the Write to Data Store Entity Smart Service is stored in the RDBMS(s) configured as secondary data sources. Business data written and accessed by Appian applications may reside in one or more schemas or RDBMS installations. As such, the redundancy, backup, and recovery of each of these RDBMSs must be considered.
See also: Write to Data Store Entity Smart Service
The data storage, replication, backup, and recovery capabilities and procedures differ depending on the RDBMS vendor. While the specifics of each vendor’s capabilities are not discussed in this document, the high-level requirements for those capabilities used in conjunction with an Appian system are presented in the sections below.
The application servers, transaction management, engine servers, search server, and database all have different high availability considerations.
The Appian Application can be configured without a single point of failure by installing the front-end components as a cluster. This ensures that the failure of any single front-end web or application server will not affect the availability of the system.
Figure 1: Application Server Failover Configuration
Client requests are load balanced between the web and application servers. If any one of the web servers or application servers fails, the remaining servers continue to handle all requests.
The search server can be configured as a cluster to provide automatic data redundancy and high availability. When configured with three or more nodes, a search server cluster can lose a node and continue operating with full functionality as long as a majority of the nodes remain in the cluster. A two-node cluster provides data redundancy but cannot automatically fail over if one of the nodes goes down; instead, the system must be manually recovered by restarting or replacing the failed node. Alternatively, a two-node cluster can be temporarily downgraded to a single node to continue operating with full functionality but no data redundancy. For the highest availability in the case of a single node failure, a three-node search server cluster is recommended.
See also: Search Server
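The "majority of the nodes" behavior described above is a standard quorum rule, and it applies in the same way to the transaction management cluster discussed next. The short Python sketch below is purely illustrative (it is not part of Appian) and shows why a three-node cluster tolerates the loss of one node automatically while a two-node cluster does not.

```python
def has_quorum(total_nodes: int, surviving_nodes: int) -> bool:
    """A cluster keeps operating automatically only while a strict
    majority of its originally configured nodes remain available."""
    return surviving_nodes > total_nodes // 2

# Three-node cluster: losing one node still leaves a majority (2 of 3).
assert has_quorum(total_nodes=3, surviving_nodes=2)

# Two-node cluster: losing one node leaves no majority (1 of 2),
# so failover requires manual intervention.
assert not has_quorum(total_nodes=2, surviving_nodes=1)
```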
The transaction management components can also be configured as a cluster to provide automatic data redundancy and high availability. When configured with three nodes, a transaction management cluster can lose a node and continue operating with full functionality as long as a majority of the nodes remain in the cluster. A two-node cluster provides data redundancy but cannot automatically fail over if one of the nodes goes down; instead, the system must be manually recovered by restarting or replacing the failed node. Alternatively, a two-node cluster can be temporarily downgraded to a single node to continue operating with full functionality but no data redundancy. For high availability in the case of a single node failure, a three-node transaction management cluster is required.
Note: Clustering of transaction management components, and therefore, Software Engine Failover, is currently not supported on Windows.
The relational databases should be configured in a distributed, multi-master, ACID-compliant cluster architecture with no single point of failure. Provided that the cluster is configured correctly, any single node, system, or piece of hardware can fail without the entire cluster failing.
All major relational databases support clustered configurations.
Figure 2: Database replication clustering
Appian provides two options for ensuring high availability:
Note: Software failover is currently not supported on Windows.
Software Engine Failover Configuration
Appian engines and the transaction management components can be configured in a cluster of three identical nodes. This allows a primary node to be elected via the native leadership election mechanism, ensuring that data writes go consistently to a single node.
The figure below shows a typical software failover configuration:
Figure 3: Leader and Replica Engine Server Architecture
For more information on configuring a clustered environment, please see this documentation.
Hardware Engine Failover Configuration
Appian engines can be configured with a duplicate engine server to compensate for a failure of the master engine server. This ensures that if the master engine server fails, a duplicate can be brought online.
Application servers are configured to send data to whichever server is available at the time. This is handled automatically by the Appian Configuration Manager. If the master engine server fails, the application servers will automatically detect the availability of the duplicate server. No changes need to be made to the application servers when the master engine server fails.
Hardware Engine Failover describes various ways to configure hardware engine failover depending on your environment.
The figure below shows a typical hardware failover configuration:
Figure 4: Master and Duplicate Engine Server Architecture
The solid green lines show the normal flow of communication. The dotted red lines show the flow of communication when the master engine is unavailable.
Requirements for a High Availability Engine Setup
In order to set up engine servers for high availability, the following requirements must be satisfied:
Appian recommends Software Engine Failover for Linux and Hardware Engine Failover for Windows.
Software engine failover is preferred due to its near-zero RTO, at the cost of additional hardware. For mission-critical systems that require high availability, software engine failover will provide the best SLA.
In order to ensure high availability will work during a real outage, the configuration should be tested periodically by simulating an outage. Verify that replication and failover procedures work correctly by disconnecting network connections or shutting down primary servers. Conduct user testing before and after the simulated outage to verify the site functions as expected.
Appian does not prescribe a frequency for such testing, but consider it a best practice to conduct the test after first configuring the production servers and repeat the test at least once a year.
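One way to make such a test repeatable is to script a basic availability probe and run it before, during, and after the simulated outage. The sketch below is a minimal Python example; the URLs are hypothetical placeholders for your primary and failover entry points, and it only checks that each site answers HTTP requests, so it complements rather than replaces functional user testing.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints for the primary and failover entry points (assumptions).
SITES = {
    "primary": "https://appian-primary.example.com/suite/",
    "failover": "https://appian-failover.example.com/suite/",
}

def check_site(name: str, url: str, timeout: int = 10) -> bool:
    """Return True if the site answers an HTTP request successfully."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = response.status < 400
    except (urllib.error.URLError, OSError):
        ok = False
    print(f"{name}: {'UP' if ok else 'DOWN'} ({url})")
    return ok

if __name__ == "__main__":
    # Run before the simulated outage, during it, and again after recovery,
    # and compare the results with your expectations for each phase.
    for name, url in SITES.items():
        check_site(name, url)
```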
Disaster recovery (DR) comprises the processes, policies, and procedures related to preparing for recovery or continuation of the Appian system after an entire environment suffers a catastrophic failure, referred to as a Major Incident.
High availability as described above is insufficient for use as a disaster recovery strategy because it requires a collocated standby duplicate Appian engine server for hardware engine failover or a stable network with very low latency for software engine failover. Instead, Appian recommends a cold-failover disaster recovery configuration with geographically-separated copies of the full server stack.
Figure 5: Disaster Recovery Configuration
When considering the configuration of the system to enable a disaster recovery plan, the business must decide on the metrics listed below based on its tolerance for downtime and data loss. Once defined, these metric requirements will drive both the architecture of the disaster recovery setup and the IT policies and procedures for handling a Major Incident (any event that causes total unavailability of the system).
In an Appian system, the RPO is determined by how often data from the primary environment is replicated to the disaster recovery environment.
Data is defined as the database .kdb files, RDBMS data, and files on disk representing documents uploaded to the Appian document management system and file content created by the application.
It is also assumed that the Appian configurations on the application server, engine servers, and RDBMS are kept in sync between the primary and disaster recovery environments, which is a standard practice when using software configuration management tools. For the easiest recovery, replication of the Appian .kdb files and transaction logs, the RDBMS data, and the files on disk should be kept in sync with one another, so that the Appian engine data does not become out of phase with the application and business data stored in the database.
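As a worked example of how the replication schedule bounds the data loss, the sketch below compares the time elapsed since the last successful replication against an RPO target. The timestamps and the four-hour target are hypothetical values chosen purely for illustration.

```python
from datetime import datetime, timedelta

# Illustrative RPO check: the data lost in a Major Incident is the time elapsed
# since the last successful replication to the disaster recovery site.
rpo_target = timedelta(hours=4)                  # example business requirement
last_replication = datetime(2024, 1, 15, 2, 0)   # hypothetical timestamps
incident_time = datetime(2024, 1, 15, 5, 30)

data_loss_window = incident_time - last_replication
print(f"Data loss window: {data_loss_window}")   # 3:30:00 in this example
print("Within RPO target" if data_loss_window <= rpo_target else "RPO target exceeded")
```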
The minimum RTO will be defined by how quickly the application server and Appian engines can be started after a Major Incident occurs. Assuming that the system configuration is kept up-to-date and the startup procedures are well documented, starting Appian should be a very quick and easy process.
The startup speed of the Appian engines depends on how recently they were checkpointed. You can reduce startup time by checkpointing before backing up the Appian engine databases.
A disaster recovery plan is prepared by setting up periodic backups that replicate all data from the primary site to the failover site.
In the event of a Major Incident at the primary site, the failover site is started with the most recent backup, and user requests are routed to the failover site. Users with existing sessions that were active on the primary site must log in again to restart their session on the failover site.
There are three high-level strategies for performing the backup of the Appian components. The details of each strategy are outlined below.
In order to back up the Appian transactions since the last Appian checkpoint, the Kafka logs must be part of the backup strategy. However, disk replication cannot be used for this component, as doing so can cause a service interruption in the running environment. Snapshot methods that temporarily freeze the file system, take the snapshot, and then resume all queued write transactions are recommended. Useful tools for implementing this type of snapshot include the fsfreeze (or xfs_freeze) command, LVM snapshots, and VMware Tools.
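The sketch below illustrates the freeze/snapshot/thaw sequence using fsfreeze and an LVM snapshot, driven from Python. It is a simplified example that assumes a Linux host with root privileges; the mount point, volume name, and snapshot size are placeholders that must be adapted to your environment.

```python
import subprocess

# Hypothetical mount point and LVM volume for the Kafka transaction logs (assumptions).
KAFKA_MOUNT = "/usr/local/appian/ae/services/data"
LVM_VOLUME = "/dev/vg_appian/lv_kafka"
SNAPSHOT_NAME = "kafka_dr_snapshot"

def snapshot_kafka_logs() -> None:
    """Freeze the file system, take an LVM snapshot, then thaw the file system.
    Queued writes resume once the freeze is released, so the interruption is brief.
    Must run as root on a host where fsfreeze and LVM are available."""
    subprocess.run(["fsfreeze", "--freeze", KAFKA_MOUNT], check=True)
    try:
        # Create a copy-on-write snapshot; size the snapshot volume for the expected churn.
        subprocess.run(
            ["lvcreate", "--snapshot", "--size", "5G",
             "--name", SNAPSHOT_NAME, LVM_VOLUME],
            check=True,
        )
    finally:
        # Always thaw, even if the snapshot fails, so the environment keeps running.
        subprocess.run(["fsfreeze", "--unfreeze", KAFKA_MOUNT], check=True)

if __name__ == "__main__":
    snapshot_kafka_logs()
```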
Pros & Cons
Pros:
Cons:
In order to set up Appian in a disaster recovery environment via snapshots, the following requirements must be satisfied:
The restoration procedure is essentially the backup procedure in reverse. The data is restored to the recovery system, and then that system is started.
It is important that the system restoring the data has the same Appian configurations as the primary site. This approach accomplishes that by using the snapshots for the Appian installation and the configure script to apply any differences needed for the disaster recovery environment. The restoration procedure must be completed as follows:
An alternative disaster recovery strategy, used in previous versions of Appian, focuses on the .kdb files rather than the Kafka transaction logs. At a high level, the checkpointing configuration is altered to meet the RPO needs of the business, and the disaster recovery backup of specific files is taken across the system once the checkpoint is complete.
In order to set up Appian in a disaster recovery environment, the following requirements must be satisfied:
When backing up Appian data, follow the recommended steps below.
This procedure can be followed to replicate data to a failover site by copying the data files to a NAS/SAN device that automatically syncs with the failover site or by using standard tools to transfer the data over the network. Backups generated using this procedure may also be stored at an offsite location on disk or tape for recovery.
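For example, a push over the network with a standard tool such as rsync over SSH could look like the following sketch. The directories and failover host name are hypothetical placeholders; a NAS/SAN device that syncs between sites automatically achieves the same result without any scripting.

```python
import subprocess

# Hypothetical source directories and failover-site destination (assumptions).
BACKUP_DIRS = [
    "/backup/appian/engines",
    "/backup/appian/kafka-logs",
    "/backup/appian/documents",
]
FAILOVER_HOST = "dr-site.example.com"
FAILOVER_PATH = "/backup/appian"

def replicate_to_failover() -> None:
    """Push the latest backup to the failover site with rsync over SSH.
    rsync is only one example of a 'standard tool'; any mechanism that
    transfers a consistent copy of the backup works equally well."""
    for src in BACKUP_DIRS:
        subprocess.run(
            ["rsync", "--archive", "--delete", "--compress",
             src, f"{FAILOVER_HOST}:{FAILOVER_PATH}/"],
            check=True,
        )

if __name__ == "__main__":
    replicate_to_failover()
```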
It is important for a proper recovery that the backed-up data is consistent. Therefore, all data storage mechanisms must be backed up together in the order listed below.
It is important that the system restoring the data has the same Appian configurations as the primary site. This should be set up ahead of time, and any configuration file changes applied to the primary site should be applied to the failover site at the same time.
The restoration procedure must be completed as follows:
If there is low network latency between data centers (under 10 ms), then the nodes of an Appian high availability system may reside in different data centers. However, implementing this setup without that latency guarantee will cause serious performance issues in the environment. This setup can serve as both the high availability and the disaster recovery strategy.
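Before committing to a stretched deployment, measure the actual round-trip latency between the candidate data centers. The sketch below uses TCP connection setup time as a rough proxy for round-trip latency; the host name and port are placeholders, and a sustained measurement with dedicated network tooling is more reliable than this simple probe.

```python
import socket
import statistics
import time

# Hypothetical address of a node in the other data center (assumption).
REMOTE_HOST = "appian-node-dc2.example.com"
REMOTE_PORT = 22          # any reachable TCP port works for a rough round-trip estimate
SAMPLES = 10

def tcp_round_trip_ms(host: str, port: int) -> float:
    """Time one TCP connection setup as a rough proxy for network latency."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return (time.perf_counter() - start) * 1000

if __name__ == "__main__":
    samples = [tcp_round_trip_ms(REMOTE_HOST, REMOTE_PORT) for _ in range(SAMPLES)]
    median = statistics.median(samples)
    print(f"Median round trip: {median:.1f} ms")
    print("Within the 10 ms guideline" if median < 10 else "Too slow for a stretched HA setup")
```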
A disaster recovery plan that has not been tested is not complete. Verify the disaster recovery site can be used to recover from backups by periodically using the backed-up data to restore to the failover site. After restoring the site, conduct user testing to verify the site functions as expected.
Appian does not prescribe a frequency for such testing, but consider it a best practice to conduct the test after first configuring the disaster recovery site and procedures and repeat the test at least once a year.