Transferring/Processing Large Data Sets (ETL)

This page explains how to design solutions for high-volume processing of large data sets or extract-transform-load (ETL) patterns, such as:

  • Running a nightly sync on up to 2 million records 
  • Performing a one-time transfer of 1 million rows from one database to another, transforming each row as it is processed
  • Allowing users to upload Excel files with up to 10,000 rows, each of which must be processed through a series of validations and database updates

Do

  • Load test at production volumes to confirm both functional and performance requirements are met
  • Throttle or queue incoming requests to reduce peak processing requirements
  • Convert real-time processing to background/asynchronous activities whenever possible

Don't

  • Pass or store large amounts of data in process variables
  • Manipulate data in long sequences of unattended nodes in series or with heavy use of MNI/looping
  • Activity chain through long sequences of script tasks and smart services
  • Run large transaction volumes during business hours (and make sure nightly batch processes complete before users start logging in the next day)

Appian Records

Appian Records makes it easy to unify data across your organization. Using Appian Records, you can converge your system data into a single point of management. Simply choose your data source (whether it be a database, web service, or Salesforce object) and follow the steps to configure your record type. From there, you can adjust your record fields, and even extend your data set by creating custom record fields that calculate and transform your data to fit your business needs.

If you need to ETL data that resides in the Appian Cloud database, setting up the Enhanced Data Pipeline is a viable approach. Data can be collected by querying the database with a read-only user, or replicated by exposing the Appian database’s binary logs to your CDC tool of choice.
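
As an illustration of the read-only query approach, the sketch below uses plain JDBC (with the MariaDB connector on the classpath) to pull changed rows from a business table exposed through the Enhanced Data Pipeline. The host name, credentials, and table are placeholders; the real connection details come from your Enhanced Data Pipeline configuration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Timestamp;

    public class PipelineExtract {
        public static void main(String[] args) throws Exception {
            // Placeholder host and credentials; use the read-only account
            // provisioned for the Enhanced Data Pipeline.
            String url = "jdbc:mariadb://pipeline-host.example.com:3306/Appian";
            try (Connection conn = DriverManager.getConnection(url, "readonly_user", "secret");
                 PreparedStatement stmt = conn.prepareStatement(
                     "SELECT id, status, updated_on FROM order_record WHERE updated_on >= ?")) {
                stmt.setTimestamp(1, Timestamp.valueOf("2024-01-01 00:00:00"));
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        // Hand each row to the downstream transform/load step.
                        System.out.printf("%d,%s,%s%n",
                            rs.getLong("id"), rs.getString("status"), rs.getTimestamp("updated_on"));
                    }
                }
            }
        }
    }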

Smart Services and Plugins

An alternative is to use Appian processes to perform processing with script tasks and out-of-the-box smart services. Typically this approach is quick to implement and doesn’t require any specialized knowledge. This could be considered a good fit when:

  • Data volumes are low and it is a lower complexity use case
    • The exact threshold depends on many factors including infrastructure, peak load, and processing requirements. Perform a load test with expected production user and data volumes to assess the solution prior to go-live.
  • It is only needed for background processing (not between activity-chained nodes)

Consider one of the other options if...

  • You need to return results in real time
  • You find yourself with a process of 10 or more unattended nodes in series, or with heavy use of MNI/looping
  • Volume or complexity will grow over time (avoid future refactoring)

Appian smart services and plug-ins are commonly used together to efficiently load and process large data sets.

Smart service plug-ins can be used within an Appian process to optimize your more complex or high volume processing steps. Plug-ins consolidate long sequences of activities or MNI/looping by delegating the processing to Java, reducing the number of Appian processes and activities.

Plug-ins used for high volume processing should operate on or move data entirely within the plug-in. They should not bring data into the Appian process, since this can lead to higher load and memory use. For this reason, the following Shared Components are not recommended for high volume processing (a sketch of the preferred in-plug-in pattern follows this list):

  • Parse CSV File Plug-in
  • Read Excel File Smart Service
  • Excel Read Cells
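
A minimal sketch of the preferred pattern is shown below: the data movement stays entirely inside Java, streaming rows from a source table, transforming each one, and batch-inserting into a target table, so that only a row count is returned to the Appian process. The table names and the transform are placeholders, and in a real plug-in this logic would live inside the smart service itself.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class InPluginTransfer {

        // Streams rows from the source, transforms them, and batch-writes them to
        // the target. Only the processed count is handed back to the process.
        static long transfer(Connection source, Connection target) throws SQLException {
            long processed = 0;
            try (PreparedStatement read = source.prepareStatement(
                     "SELECT id, amount FROM staging_order");
                 PreparedStatement write = target.prepareStatement(
                     "INSERT INTO order_fact (id, amount_cents) VALUES (?, ?)")) {
                read.setFetchSize(1000); // stream rather than loading the full result into memory
                try (ResultSet rs = read.executeQuery()) {
                    while (rs.next()) {
                        write.setLong(1, rs.getLong("id"));
                        write.setLong(2, Math.round(rs.getDouble("amount") * 100)); // example transform
                        write.addBatch();
                        if (++processed % 1000 == 0) {
                            write.executeBatch(); // flush in chunks to bound memory use
                        }
                    }
                }
                write.executeBatch();
            }
            return processed;
        }
    }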

Custom Solutions

If you can't get the functionality or performance you need from existing plug-ins, you can write your own smart service to perform custom processing. When designing a custom smart service, consider the following (a minimal skeleton follows this list):

  • Requires a Java developer
  • Pass references to the data source(s) of the processing/transformation rather than the data itself. For example, pass a data source name rather than the rows of data, or a document id instead of the text of the document.
  • Perform looping inside the plug-in rather than executing multiple instances from the parent process
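
The skeleton below sketches how these considerations translate into code. It assumes the Appian plug-in SDK's AppianSmartService base class and @Input annotation (verify the exact classes against your SDK version); the inputs are deliberately references, a data source name and a document id, rather than the data itself, and all looping happens inside run().

    import com.appiancorp.suiteapi.process.framework.AppianSmartService;
    import com.appiancorp.suiteapi.process.framework.Input;
    import com.appiancorp.suiteapi.process.framework.Required;

    // Hypothetical custom smart service: the process passes only references,
    // and the heavy lifting happens in Java rather than in process nodes.
    public class BulkTransformSmartService extends AppianSmartService {

        private String dataSourceName; // reference to the source, not the rows themselves
        private Long inputDocumentId;  // reference to the uploaded file, not its contents
        private Long rowsProcessed;    // small scalar output returned to the process

        @Override
        public void run() {
            // Resolve the data source and document by reference, then loop over
            // the rows here (e.g., with JDBC batches as in the earlier sketch)
            // instead of spawning MNI instances in the parent process.
            rowsProcessed = 0L;
            // ... processing elided ...
        }

        @Input(required = Required.ALWAYS)
        public void setDataSourceName(String dataSourceName) {
            this.dataSourceName = dataSourceName;
        }

        @Input(required = Required.ALWAYS)
        public void setInputDocumentId(Long inputDocumentId) {
            this.inputDocumentId = inputDocumentId;
        }

        public Long getRowsProcessed() {
            return rowsProcessed;
        }
    }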

Specialized solutions should be used for extreme cases of volume and complexity. In these cases an external rules engine, document generation tool, or data integration platform might provide more scalability. Examples of such tools include MuleSoft and Informatica.

When using external systems consider the following:

  • Requires working with the vendor team to configure and manage the third-party system
  • Use any of Appian's integration capabilities to connect
  • Use an integration method that supports passing data and results in bulk (avoid looping over individual calls within Appian, and ideally use Appian Records to facilitate the synchronization); see the sketch after this list
  • Test the performance of the integrated solution before using it in a real-time, user-facing scenario
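
To illustrate the bulk-call point above, the sketch below sends an entire batch of records to a third-party platform in a single request using Java's built-in HTTP client, instead of one call per record. The endpoint and JSON payload are hypothetical and would be defined by the external system.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    public class BulkPush {
        public static void main(String[] args) throws Exception {
            // Hypothetical batch payload; in practice it is built from the source rows.
            List<String> rows = List.of(
                "{\"id\":1,\"status\":\"NEW\"}",
                "{\"id\":2,\"status\":\"CLOSED\"}");
            String body = "[" + String.join(",", rows) + "]";

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://etl.example.com/api/records/bulk")) // hypothetical bulk endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))             // one request for the whole batch
                .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("Bulk load returned HTTP " + response.statusCode());
        }
    }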