KB-2068 How to understand checkpointing in service_manager.log

Purpose

The Appian Engines use a process called "checkpointing" to persist data stored in the in-memory engines to disk. This is triggered either manually or automatically. This article illustrates and explains the process of a single engine checkpointing, and how the associated output appears in the <APPIAN_HOME>/logs/service_manager.log.

Instructions

Note: The remainder of the document uses the following notations and concepts:

  • Appian Engines: Appian engines are real-time in-memory (RAM) databases that also persist all data in a file on disk. These database files have the extension .kdb.
  • <engine_id>: This is the name of the engine associated with the checkpoint. e.g. groups, execution00.
  • <path_to_temporary>: This is the temporary folder where files related to an in-progress checkpoint are stored. These files are stored in the <APPIAN_HOME>/services/data/temporary folder. 
  • <path_to_archived>: This is the archived folder where historical KDB files are stored. These files are stored in the <APPIAN_HOME>/services/data/archived folder. This directory can be cleaned and maintained using the cleanupArchives script.
  • <path_to_engine_gateway>: This is the directory that stores the live KDB that an engine will boot from if it is shutdown. This file should never be modified. The exact path varies from engine to engine. e.g. <APPIAN_HOME>/server/process/exec/00/gw1 is associated with the execution00 engine gateway.

Phase 1: Queuing

When a checkpoint is first requested or triggered, it is placed into a queue within the Service Manager component. Checkpoints occur serially, in the order that they are placed into the queue. In the logs, this looks like:

YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX] INFO  com.appian.komodo.engine.status.CheckpointingStateProvider - Checkpoint <checkpoint id> requested for: <engine_id>

Service Manager will also log output indicating the type of checkpoint that was requested, e.g. automatic or user triggered.

Automatic:

YYYY-MM-DD HH:MM:SS.mmm [CheckpointManager-<engine_id>-X] INFO  com.appian.komodo.engine.checkpoint.CheckpointManager - HandlingCheckpointRequestedEventAutoCheckpointRequestedEvent{createdInstant=YYYY-MM-DDTHH:MM:SS.mmmZ,engineId=<engine_id>>,
 requestId=<checkpoint id>,engineShuttingDown=false, checkpointContext=null}
YYYY-MM-DD HH:MM:SS.mmm [CheckpointManager-<engine_id>-X] INFO com.appian.komodo.engine.checkpoint.CheckpointManager - QueueingCheckpointRequestedEventAutoCheckpointRequestedEvent{createdInstant=YYYY-MM-DDTHH:MM:SS.mmmZ,engineId=<engine_id>, requestId
=<checkpoint id>,engineShuttingDown=false,checkpointContext=CheckpointContext{checkpointId=<checkpoint id>,minTxId=XXXX, KDBFileIndex=XXX,retentionMetadata=serviceManager.transaction.<engine_id>-X@XXXX,initiatingHost=<host>, executionHost=<host>}}

User Requested (using the checkpoint script):

YYYY-MM-DD HH:MM:SS.mmm [CheckpointManager-<engine_id-X] INFO com.appian.komodo.engine.checkpoint.CheckpointManager - HandlingCheckpointRequestedEventUserCheckpointRequestedEvent{createdInstant=YYYY-MM-DDTHH:MM:SS.mmmZ,engineId=<engine_id>, requestId=<checkpoint id>,engineShuttingDown=false, checkpointContext=null}
YYYY-MM-DD HH:MM:SS.mmm [CheckpointManager-<engine_id-X] INFO com.appian.komodo.engine.checkpoint.CheckpointManager - QueueingCheckpointRequestedEventUserCheckpointRequestedEvent{createdInstant=YYYY-MM-DDTHH:MM:SS.mmmZ,engineId=<engine_id>, requestId=<checkpoint id>,engineShuttingDown=false,checkpointContext=CheckpointContext{checkpointId=<checkpoint id>,minTxId=XXXX, KDBFileIndex=X,retentionMetadata=serviceManager.transaction.<engine_id>-X@X,initiatingHost=<host>, executionHost=<host>}}

As can be seen, in both of the above outputs, the CheckpointRequestedEvent is handled by putting it on the checkpointing queue, to be processed.

Phase 2: De-queuing 

Service Manager will de-queue a checkpoint request in order to process it. This is associated with:

YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX] INFO com.appian.komodo.engine.status.CheckpointingStateProvider -Checkpoint <checkpoint id> dequeued for: <engine_id>
YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX] INFO com.appian.komodo.engine.checkpoint.CheckpointWatcher - Waiting for checkpoint <checkpoint id> for <engine_id>
YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX] INFO com.appian.komodo.engine.status.CheckpointingStateProvider - Checkpoint <checkpoint id> started for <engine_id>

After this output, Service Manager will be actively checkpointing the specific engine that the request was associated with. While an engine is being checkpointed, a temporary *.writing KDB is created in the temporary folder associated with the engine undergoing a checkpoint. This is usually located at <APPIAN_HOME>/services/data/temporary/<engine_id>.

Phase 3: Checkpoint

Service Manager will complete the checkpoint operation and perform some operations on the file system:

  • The *.writing KDB from Phase 2 will be renamed to *.written.
  • The *.written KDB will be moved into the <APPIAN_HOME>/server/**/gw1 folder associated with the engine that underwent a checkpoint. This is the new live KDB.

These appear in the logs as:

YYYY-MM-DD HH:MM:SS.mmm [checkpoint-watcher-<engine_id>-XXX-X] INFO com.appian.komodo.engine.checkpoint.LocalCheckpointWatcherTask - Notified of Kimage created for checkpoint <checkpoint id> file<path_to_temporary>/temporary/<engine_id>/<engine_id>-<checkpoint id>.written at YYYY-MM-DD HH:MM:SS.mmmZ
YYYY-MM-DD HH:MM:SS.mmm[checkpoint-watcher-<engine_id>-XXX-X] INFO com.appian.komodo.engine.checkpoint.SingleGatewayCheckpointCopier - Moving KCheckpoint to next live: <path_to_temporary>/temporary/<engine_id>/<engine_id>-<checkpoint_id>.written to/<APPIAN_HOME>/server/<path_to_engine_gateway>/gw1/<KDB_name>.kdb
YYYY-MM-DD HH:MM:SS.mmm [checkpoint-watcher-<engine_id>-XXX-X] INFO com.appian.komodo.engine.checkpoint.LocalCheckpointWatcherTask - Finishedcheckpoint <checkpoint id> for <engine_id>
YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX] INFO com.appian.komodo.engine.status.CheckpointingStateProvider - Checkpoint<checkpoint id> completed for <engine_id>
YYYY-MM-DD HH:MM:SS.mmm [CheckpointManager-<engine_id>-X] INFO com.appian.komodo.engine.checkpoint.CheckpointManager - Checkpoint <checkpoint id> execution complete

Phase 4: Cleanup

Once the checkpoint is complete Service Manager will perform cleanup of the prior "live" KDB, and will also instruct the Kafka component to mark any transactions that were persisted to disk in the checkpoint as ready for cleanup. In the logs, this is associated with:

YYYY-MM-DD HH:MM:SS.mmm[checkpointArchiver-XXX-X]INFO com.appian.komodo.engine.checkpoint.SingleGatewayCheckpointCopier -Archiving kdb image from <APPIAN_HOME>/server/<path_to_engine_gateway>/gw1/*.kdb to <path_to_archived>/archived/<engine_id>/*.kdb
...
YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX]INFO  com.appian.komodo.log.KafkaLogCleaner - Updating committed offset for serviceManager.transaction.<engine_id>-X from XXXX to XXXX
YYYY-MM-DD HH:MM:SS.mmm [KomodoEventBus-XXXX] INFO com.appian.komodo.log.KafkaLogCleaner - Marking all transaction log entries for <engine_id> that are an earlier offset than XXXX for deletion.

Affected Versions

This article applies to Appian 17.3 and later. The output was captured on Appian 19.4 and may be slightly different on different versions of Appian.

Last Reviewed: January 2020

Related
Recommended