Asyncronous Execution/DML Batching

Many applications and environments have application nodes that end up performing large amounts of DML operations, which are performed one at a time. Most Databases are very slow at this, with analytics/MPP databases being particuarly slow. The async feature is to help improve the performance and reduce the database load when this type of activity occurs.

Basic Theory

when a DML operation is flagged with an Async execute rule, Heimdall will take the DML operation out of the current connection context, and will put it in a queue for execution on an independent connection. This queue is processed in transaction batches, which has a configurable batch size. By executing the commit across multiple DML operations, much of the overhead of the commit (the most expensive part) is shared between many DML operations, and overall DML performance improves.

There are two main categories of situations where async will help:

One connection is inserting many DML operations in rapid order
Connections are inserting DML operations over time from a large pool of connections

Heimdall can provide support for both, but there are situations where the Async batching should not be used.

Basic Configuration

In the Async rule logic, the following options are availabe:

asyncsize: The maximum length of a query (in characters) that can be made asynchronous
batchSize: The maximum batch size of queries to be executed--queries will not be delayed to fill the queue, but this sets the maxiumim size of the batch.
source: The name of the data source to forward the query to for async execute. This source must be specified in the VDB definition.
queueName: The per-source named async queue to process the query from. Defaults to the name of the data source the query is sent to.
holdUntil: When to hold the thread until
spoofedResult: The result the DML will return if the result is spoofed (only for immediate holdUntil values).

The first option, asyncsize, controls the actual size of the SQL that can be made asynchronous. In particular a DML operation may end up doing an insert of a million records as one statement. By using this size option, you can control how large the actual SQL can be for async to activate.

The second option is the size of the batches to perform on the database. This is the maximum value, not the minimum. Consider the case of this setting being set to 100, and the application is first started up. The first DML that is performed will likely be pushed to the database immediately, as only one query came in that instant. The queue has zero items at first, then one. As that one DML is executing on the server however, another four may be received from the application connections. Once the first DML is completed, the next batch will contain the next four in the queue after the first was done. Again, while those four are processing, another eight or more may come in. This process of queue buildup will continue until the batchSize is reached. At that point, even if more than that many queries are in the queue, only this many will be dispatched to the DB as a single transaction.

As a general rule the batchSize should be a reasonable number (64 is a good start) but not so high that if any one DML in the transaction fails and the transaction has to be restarted, then the burden of restarting will be too high (please review the retry logic below). Likewise, the value shouldn't be too low that the overall benefit isn't significant.

The third option is the source option. Normally, this won't be used, but this can be used in certain cluster types to route DML operations for a particular table to alternate data sources.

The queueName options is used to separate different batches of DML operations from each other. By default a single queue will exist for any given data source, based on the data source name. When an explicit queueName is specified however, the DML operations will be handled from a different queue, and will use a distinct database connection for processing. This is used in conjunction with table rules to filter DML for one table into one queue, and DML for another table into another queue. Through this, each table avoid blocking operations on each other, and can be used to improve performance of DML overall. The special queueName value of "${table}" can be used to create a queue per table, and is ideal when expecting rewrites of inserts into a multi-row insert.

Next, the holdUntil option is used to determine the actual async behavior. For each DML, it is often the case that the application is expecting the result of the DML operation, i.e. to tell it how many rows (if any) were modified. This may impact later logic in the application. In the case that the application needs this, we have two options for holdUntil, that of "result" or "commit". We can wait until the result has been sent from the DB, but no commit executed, OR we can wait for a commit to be generated. For best performance, a holdUntil value of none should be used, BUT this requires spoofing the result so that the application can continue processing immediately. When spoofing is needed, the final option of spoofedResult is used, and sets the value that will be returned.

Insert Rewrites

As part of the logic of batching, if a queue contains sequential inserts of the pattern "INSERT INTO ( ) VALUES ( )", and the table and field list are the same between multiple inserts, these inserts will be rewritten into a single insert with multiple rows. For now, ONLY inserts that specify the table and field list will be rewritten, to ensure that exactly the same data is being written in the same columns, and it will ONLY rewrite inserts that follow each other sequentially with no interruption. Only the batch size number of rows will be written at once.

Retry Logic

When a batch is created, there is always a chance that the DML will result in an exception. The way this is handled is that if an exception is generated at the time the sql is executed (but before the commit), then the sql is removed from the batch, the batch is rolled back, the queries are restored into the queue in the proper order, and a new batch is generated. A log entry will then be generated.

In the case of a holdUntil value of none, no exception in this case is returned to the application, since it already received a spoofed result. In such an environment, the logs should be maintained in case a replay is needed.

In the case of any other holdUntil value, the exception will be passed to the client.

The final case is if the exception occurs when the commit is called. This should be rare, but in the event that it happens, then it is impossible to determine which of the batch caused the issue (if it was one SQL in particular). As such, in this case, the entire batch is logged as a failure, and the connection to the DB is torn down and rebuilt, as it is safest to just start with a clean connection.

Blocking

One issue that is likely to come up with many attempts to use Async is the following sequence of events:

select from tableA where id=10; update tableA set something=newvalue where id=10; select from tableA where id=10; // expects the new value here

Without any additional logic, this would likely break the application. With the Async feature of Heimdall however, what will happen is that the update to the table will result in a lock that is associated with the table. When the final select is issued, it will block until the update commits, THEN the select will be issued against the database. This prevents this sequence from returning the incorrect value, and breaking the application. Please note: This locking is only done within a single Heimdall instance--if additional nodes are active, they will not block on the same action, as the synchoronization time to implement this would be excessive. The main objective is to insure that at least for a given connection, the order of operations are maintained.

Initializing the Async Connection

Often, the connection that the Async logic used will need to be initialized with SQL commands that make it behave differently than normal connections. As an example, with DML operations, Greenplum should in general have the MPP optimizer disabled, as it adds excessive latency that doesn't provide a benefit. In order to do this, the connection property of "initSQLAsync" can be set on the data source, with the property of the SQL desired. In the Greenplum example, it would be "set optimizer = off".