An Introduction to Query Optimization

This play provides an introduction to database query optimization. 

As a general recommendation, design your applications using Appian Records with Data Sync enabled. The Data Sync feature caches your source data in Appian and provides the Appian Data Fabric framework to efficiently query data without the need of manual optimizations.

As an enterprise application used for a wide variety of tasks, Appian is often integral to an organization’s productivity.  The performance of Appian applications could directly impact employee task completion times and in turn their organization’s KPIs. The perception of Appian as a platform being performant or non-performant can also greatly influence end user satisfaction and thus further organizational adoption of a system.

One of the most common actions a typical Appian app performs is reading or writing to a database. Due to the sheer frequency of database operations, even small performance improvements of database interactions can greatly increase the perception of overall database performance by reducing the immediate response times to the user and the overall load on the server.  If your app uses a database, optimizing database interactions should be a part of your release plan.

Database and query optimization is an involved topic, and can often feel like an art, rather than a science.  Optimizing for all three of performance, memory usage and maintainability/code clarity together may simply not be possible in most cases.   There may be no single configuration that is optimal for all scenarios, and as an Appian App Developer you should evaluate the options available and weigh the benefits/trade-offs to develop the design.  This document is intended to provide an introduction to a number of different optimization techniques and rules of thumb.

Definitions 

For the purposes of this document, the following definitions apply:

  • Blob – a Binary Large Object, this is the type of database column used to store large amounts of binary data (e.g., attachments, images, documents)
  • Clause – One of the individual components of a SQL statement (e.g., a WHERE clause, an ORDER BY clause, etc.)
  • JOIN – A SQL statement that is used to combine multiple tables together based on some specific criteria.  Some examples include:
    • Inner Joins - an intersection of matching data on both tables.  If data is omitted from one or the other of the tables, the entire row is left out of the result
    • Left Outer Joins – All rows in the first specified table, combined with data from a second table, where it also exists and matches the criteria
    • Right Outer joins – All rows in the second specified table, combined with data from the first specific table, where it also exists and matches the criteria
  • Normalized vs. Denormalized –  Normalized database design involves tables that are usually structured for maximum efficiency, reducing data redundancy and improving data integrity.   With this style, each table typically contains information that isn’t repeated on other tables in the database. Denormalized tables are typically structured for maximum performance, combining frequently used data together, even if that means the data is duplicated on multiple database tables and may require more maintenance to keep synchronized.  More details can be found at https://en.wikipedia.org/wiki/Database_normalization
  • Query – A SQL statement that returns data from a database, e.g., SELECT ID, Name FROM Issues
  • Query Execution Plan – The set of steps that a database will perform to obtain the result set.   An optimal query execution plan can hugely impact the performance of a query. Also referred to as an Explain Plan or Query Plan.
  • Stored Procedure – a subroutine stored and executed on the database often accepting one or more variables and then either updating data or returning a result set
  • View – A result set of a stored query in the database.  Some databases allow updateable views, but in many cases, they are read only.  Often used with JOINs to denormalize previously normalized tables.
    • A materialized view is similar to a view, in that it is the result of a stored query, however the results of the view are “cached” in a table structure, so they can often be retrieved much quicker, at the cost of not always being 100% up to date.  They can be useful for aggregation, metrics, dashboards, charting, etc.

General Tips

Some overall details to keep in mind:

  • Refer to the best practices specified in the Developer Playbook first.
    • This document is intended to augment information provided there.
  • Become friends with your DBA!  They will be the main resource available to you for evaluating database hardware and memory considerations, and they typically have many tools and personal expertise with respect to database performance.  A partnership combining your Appian application knowledge with their database expertise could be quite beneficial in improving performance.
  • Ensure that database/query optimization is a step in the release plan of your application.  
    • Ensure that the database performance is tested via the application with a realistic number of users and data.   
    • Consider data growth over time and if possible, simulate and test how the application will behave in the future
  • Ensure that database optimization is an ongoing activity as part of normal maintenance.  DBAs typically have tools to measure poorly performing queries and other database problems, and they should ideally be involved in periodically monitoring the overall “health” of the database.

Records CDT and Table Design tips

When designing your Appian records or CDTs and the underlying tables, always keep in mind how the application, and the end users, will be interacting with them.  Some sample criteria to determine:

  • Are these objects going to be written to the database frequently?
    • Once written, how often will the objects be updated?
  • Are these objects going to be read from the database frequently?
  • Are certain combinations of objects frequently going to be read or written together?  
  • Which attributes are appropriate for which objects?
    • Are certain attributes updated and/or read much more frequently than other attributes on the same object?
    • Are certain attributes only used by a particular set of users or by a particular function?

Understanding how the application will use the data can help guide you, as an application developer, in the structure of your entities, such as deciding how Normalized vs. Denormalized your tables should be.   As an example, consider a scenario at a Courier company with a Package Shipping Request CDT representing data associated with a Package delivery, and a Customer CDT representing data about the person who requested the delivery, such as an account number, invoice address and phone number.  

  • When an Appian user opens the package Shipping Request, they want to see the Customer details, including the Name, Account Number and Address
  • An account number may never change over the lifetime of the customer
  • Although a shipping address for invoices may change periodically over time, but in the context of existing and past deliveries, an update to the address should not retroactively update those previous Package Shipping Requests
  • With these considerations, although the tendency may be towards an entirely normalized CDT structure, with package data on the package CDT and all customer data on the customer CDT, the system may actually benefit by copying the rarely updated, but frequently viewed data from the customer CDT on to the package shipping request CDT, thus avoiding the need to also query the customer CDT each time the shipping request is viewed
  • It is a general best practice of database design to be as specific as possible in your datatypes, avoiding text when it is not necessary.  e.g., If all the existing data for a ticket ID is numeric, and there aren’t any immediate plans to change this, don’t use a text column, just in case IDs might be changed from numeric to alphanumeric in the future. 
    • This best practice is beneficial for a number of reasons outside of performance, but strictly with respect to performance, character by character text comparisons often take the database engine longer to perform compared to simpler, more primitive datatypes such as an integer or boolean, especially if the keyword LIKE or pattern matching is being used on the text. A text representation may also use slightly more memory than other representations of the data, e.g, the text “true” & “false” vs a boolean/bit 1 or 0.
    • Text columns often use a different type of indexing, usually referred to as Full Text indexing, which usually can handle case differences, wildcards, plural/singular words, etc. This type of index is often updated asynchronously, and may result in slightly out of date results.
  • For some databases, MySQL especially, it is slightly more performant for columns which will be compared against each other to have the same column type
    • As an example, if employee ID or an Item status field is stored on multiple tables, ensure that they are the same datatype and length in all the locations.
  • If your application uses BLOBs, separate to the Document Management functionality provided by Appian, consider storing them in a separate table on their own, normalized from the rest of the database design. DBAs may often want to store BLOBs on their own tablespace, which may reside on a different hard disk for performance reasons. 
  • In general, there is a longstanding database design principle that wide tables, i.e., tables with many columns, are not ideal. With modern databases and hardware, this is not nearly as problematic as it has been in the past, however, keep the following in mind  
    • From a CDT and design perspective, having a wide table with a few frequently used columns and many infrequently used columns will likely be inefficient from a memory perspective. Consider “offloading” the infrequently used data to another cdt and/or table which can be loaded only when necessary. 
  • Follow the additional guidelines as documented in the Database Schema Best Practices section of the documentation:
  • As an Appian Developer, one of your primary focuses should be on the end user experience. However, reporting is also often a critical factor for application success.  It is often difficult to optimize a database for both of these scenarios. For large scale or resource intensive reporting, an Extract Transform Load (ETL) process could be used to create a separate structure or even database that is more suitable or performant for the reporting.  

Database Statistics

Most databases, including Oracle and MS SQL server, use internally computed statistics to decide on the appropriate strategy for creating the query execution plan, e.g., which order of operations to perform, which indexes to use on which table, etc.   Up to date statistics are often critical for the database to choose the most efficient query execution plan.

  • Confirm with your DBA that statistics will be updated automatically or on a schedule. As the data in the database updates over time, you will want to ensure that the statistics are updated as well.  Since computing statistics may be resource intensive, it is often scheduled to occur during off hours
  • It is especially important to update statistics after a large import or removal of data, e.g., if you are migrating from another system and batch load data, then you may want to force that statistics are updated immediately by using commands similar to the following
    • SQL Server – UPDATE STATISTICS <tbl-name>;
    • MySQL – ANALYZE TABLE <tbl_name>;
    • Oracle - exec GATHER_DATABASE_STATS;
  • After a large number of updates, inserts or deletes on MySQL, there is a command ‘OPTIMIZE <tbl_name>;’ which reorganizes the physical storage of the table and may be able to reduce storage space and improve I/O efficiency

Query Performance

When attempting to improve query performance for query entities, the general sequence of steps is as follows:

  1. Identify the query SQL, using additional database logging if appropriate.
  2. Determine the current result set. 
    1. Is the expected number of rows being returned?
  3. Examine the filters and ensure the accuracy of their logic:
    1. In Appian, look at the a!logicalExpression()s and a!queryFilter()s
    2. In SQL, examine the where or join clauses
  4. Examine the columns being retrieved and ensure they are needed.
    1. In Appian, look at the a!query(selection: <selection>)
    2. In SQL, examine the columns listed in the Select statement.
  5. When steps 1-4 are complete, retrieve a query execution plan.
  6. Run the query and record the results to measure success.
  7. If further improvements are needed:
    1. Adjust the query logic.
    2. Add or remove indices and update statistics.
  8. Repeat steps 5-7.

Keep in mind that returning the total row counts using the flag “fetchTotalCount” executes the query twice in Appian. Setting this flag to false will help improve performances.

Query/View Construction

As an Appian App developer, most of the time you will not be writing SQL directly for queries, instead you will be using a!queryEntity(), a!query() and a!queryFilter()s,  which will construct the actual SQL for you.   You should be aware of some best practices in interacting with the database to improve performance of a!queryFilters().

  • Minimize the number of round trips to the database, it is generally better to retrieve many items of data at once rather than many individual calls due to the overhead of the sql connection itself.
    • Use Appian index() and wherecontains() functions to further subdivide the data into what your application may actually need.
    • When retrieving and holding data in variables, you also want to keep the memory implications in mind of “wide” tables or CDTs stored as process variables. If the number of fields in the CDT is very large, the ideal design may differ depending on whether you want to optimize performance, memory usage or some balance between the two.
  • When designing your SQL view, avoid using computed columns in the WHERE, JOIN or ORDER BY whenever possible, you want these values being evaluated in a conditional expression to be static and indexed for maximum performance, not computed dynamically at runtime.   If the computation is computed dynamically, the computation may limit the usefulness of an index and require a value to be computed to every row before the actual selection is done, greatly increasing overhead. 
    • Once the view has been created, you also want to avoid using computed columns in the a!queryFilter() function in Appian. Also, in general only return computed columns in the a!query (selection: <selection>), when they are going to be used.
    • If there is a computation to be done, consider where the data resides and whether Appian or the database is the best location to do the computation.
    • Remember that a!queryFilter()s can be dynamically changed easily, sometimes avoiding a database computation. As a trivial example, instead of a statement such as the following to find deliveries occurring a couple days in the future:

SELECT * FROM orders WHERE deliveryDate > DATE_ADD(NOW(), INTERVAL 2 DAY);

it may be more performant for Appian to compute the date of Today()+2 itself, and send the single hardcoded date to the database.  In more complex sql statements, the impact may be even more evident.

  • In general, with all else being equal, it will be slightly faster to use WHERE= instead of WHERE NOT = and using actual column names instead of * with a SELECT
    • Note: This may increase the maintenance overhead, as for example if you are listing a set of states, and a new state is added in the workflow, you may need to update the where clause or a!queryFilter() to include the newly added state. It is up to you as an application developer to decide the best approach based on how much performance optimization outweighs maintenance overhead
  • It is possible to have multiple CDTs based on different portions of the same view (or table). E.g., CDT A is based on columns 1-20 and CDT B is based on columns 1, 5 & 21-30. If there are no constraints requiring this type of design, it is generally preferred to have separate purpose-built views.  
    • Most database engines would be optimized to ignore the “unselected” columns in the view, so there should be little performance difference between the two approaches. However, smaller purpose built views are easier to update, rename, deprecate, etc. so this pattern should be used unless there is a reason not to.   
    • It is also easier to create the CDT from the view automatically, if the view is specifically configured to only have the columns needed for your CDT.
  • When possible, minimize the use of subqueries. SQL is not a procedural or object oriented language, it is often better to make use of its core strengths, such as joins instead of subqueries.  Subqueries may sometimes result in the DB engine creating a temporary table, which limits the use of indexes. 
    • If you are using a GROUP BY clause on a large table, especially when joining with other tables, it may be more performant to use a subquery to do the GROUP BY first, before doing the rest of the joins. It may be worthwhile trying both approaches on your particular dataset to see which result is more performant. 
  • Use appropriately sized a!pagingInfo() instead of a!pagingInfo(batchSize:-1) to retrieve a subset of data, if all of the data is not needed.
  • Sorting of views will generally be done via Appian passing an a!sortInfo() within an a!pagingInfo(), so it may not be necessary to specify a particular ORDER BY clause in your view definition.
    • With MySQL specifically, if there is no ordering needed in a view, it may be slightly more performant to specify an ORDER BY NULL clause, rather than having no ORDER BY clause at all.

Query Execution Plans

The query execution plan, or explain plan, is a set of steps that a database will perform in order to obtain the result set.  Reviewing the explain plan is one of the best ways to understand how your database performs the query, broken down into ordered steps.  By examining this plan, you can determine if the database is picking the optimal indexes and joining the tables in the most efficient manner.   (Note that indexes are referred to in more detail in the next section.) In general, a query execution plan will contain:

  • An ordered list of operations
  • Relative Costs of each operation
    • Occasionally time to complete an operation (notably on Oracle)
    • Occasionally Icons/Warning Messages (notably on MS SQL Server)

Oracle, SQL Server and MySQL all include one or more tools which will provide details on the query execution plan.  Some tools are text based, some XML and some even graphical.  For the exact details on these tools, please refer to the documentation provided by your database vendor.  

Sample commands to retrieve a plan

  • Oracle – EXPLAIN PLAN FOR “your_sql_stmt”
  • MySQL – EXPLAIN “your_sql_stmt”
  • MS Sql – EXPLAIN “your_sql_stmt”
    • Also the Display Estimated Execution Plan button in SQL Server Management Studio

An example query execution plan, from the sample Oracle HR database, is as follows:

Another example, from the sample MySQL sakila database:

A full understanding of query execution plans is beyond the scope of this document and is both database vendor and version specific.   Please refer to the documentation provided by your database vendor for more specific details.

To use the query execution plan:

  • Identify the costliest steps, and try to adjust the query logic or add indexes to improve the relative costs of these operations
  • E.g., Avoid Full Table Scans and Clustered Index Scan by adding an index to the appropriate set of columns on the table involved in the scan
  • Identify incorrect order of joins, if any
  • Add Optimizer hints, if your database supports them. Hints are additional commands added to a query, giving a suggestion to the database engine on how to execute the query.  As most queries are generated for you with Appian, these would primarily be used when creating views and stored procedures.

Indexing

Database Indexes are additional structures added to a database schema, usually to facilitate data retrieval.  They are used to quickly find the results of conditional expressions, i.e., rows that match certain criteria, without having to search every row when the query is executed.  Sample SQL syntax for creating an index would be as follows:

CREATE INDEX index_nameON table_name (column1, column2, ...);

Here is a very simple example.   Let’s consider the table Inventory with the following structure

 

This table has roughly 4500 rows, consisting of the inventory to multiple stores in a central database.  In order to retrieve the inventory from a single store, we could run a SQL query like:

SELECT * FROM inventory WHERE store_id=2

In this case, the store_id column is not indexed, so the database engine will scan through the table, finding all the rows with store_id=2.   Running the explain plan on this simple SELECT when the WHERE column is not indexed results in the following (on MySQL).

 

Since we know that this store_id column will be queried regularly, we can add an index to this column to facilitate future searches.

CREATE INDEX idx_storeID ON inventory (store_id)

If we take a look at the explain plan after the table is indexed, there are some important changes.

 

 

In this simple example, we can see immediately that there is a key (or index) that is being used, and in the rows column that fewer rows are being considered by the database engine.  There is a corresponding increase in performance as well, with the query execution time on the indexed table actually being 5X faster than those of the non-indexed table.

Some additional guidance:

  • Database vendors may provide additional tools, such as an Index Tuning Wizard, to automatically suggest indexes or to allow graphical creation of indexes. The exact mechanism for creating indexes through a UI, as well as the exact underlying implementation of indexes themselves will vary between database vendors.
  • Indexing columns that are simply used in the select statement is typically not valuable, the primary use for indexes is to act as an aid when the database is evaluating conditional expressions.
    • Be sure to index columns that are used in WHERE, JOIN, GROUP BY or ORDER BY clauses, or a!queryFilter() functions.
    • If the columns that are used in these conditions cannot be indexed, and there are performance problems, consider whether the Query can be rewritten to leverage indexes (e.g., avoiding a computed column in favor of the source columns)
  • Multiple Column Indexes, also known as composite or compound indexes, are usually slightly faster than multiple individual indexes on the same set of columns.
    • When defining the order of columns in a composite index, put the columns that eliminate the most amount of data early in the list of columns. It is ideal to exclude as much data as early as possible with the index.
    • It can also be beneficial for all the columns in a multiple column group clause to be in the same composite index.
  • Although indexes should probably only be created on a small number of columns in the database, if you do not know the usage pattern of your users, it’s generally better to over index than under index. Inserts may take slightly longer, but the ratio of writes to reads will generally balance out the increased performance cost of the write.
    • The ideal would be a small number of compound indexes purposefully built for the specific queries that you are using
    • Consider real time queries, as well as reporting/dashboard data.
    • After your database has been used for some time, indexes that are rarely used can be removed as part of the maintenance process using drop index. There may be tools provided by your database vendor to determine how often an index is being used.

DROP INDEX index_name ON tbl_name;

  • Small tables in which all of the data is often read, may not particularly benefit from having an index (such as a table of picklist options)

Database Logging

  • Additional database logging can be enabled in the database. This can be very helpful for obtaining the actual SQL being run so that a query execution plan can be generated.  For further details please refer to the core Appian Documentation here.
    • If there is a need to review SQL statements produced for the Appian Cloud business database, the Cloud Database Requests log can be accessed by members of the Database Administrators group without additional logging.
  • To log the details of the Write to Data Store smart services and Query Rules, add the following loggers:
    • # Output SQL statements
    • logger.org.hibernate.SQL=DEBUG
    • # Output param values within SQL statements
    • logger.org.hibernate.type=TRACE
  • To view the calls made by the Query Database Node, add the following logger:
    • logger.com.appiancorp.process.runtime.activities.QueryRdbmsActivity=DEBUG