MODIFY...TO COMBINE Statement
The MODIFY...TO COMBINE statement is processed against a table and performs the following operations:
• Writes the data from all completed, cached, INSERT/UPDATE/DELETE statements against a table that resides in memory to disk and optimizes the table layout on disk.
• Performs the transactions requested in the arguments to the MODIFY...TO COMBINE statement as bulk operations.
The use of MODIFY...TO COMBINE is also the most efficient way to perform updates and deletes to a sizable percentage of the data in a table. However, depending on the volume of outstanding changes on a table and whether a table is indexed, the MODIFY...TO COMBINE statement can still be a lengthy operation.
For more information on the use of MODIFY...TO COMBINE, including practical examples, refer to the Vector SQL Language Guide.
DML and Memory Utilization
Any DML that is not a data append will be stored per table in an optimized in-memory structure called the Positional Delta Tree (PDT). This includes all DML executed against a table with a clustered index that already has data in it, except for MODIFY...TO COMBINE, which is never stored in PDTs.
Vector supports multi-version read consistency. Committed data in PDTs is efficiently merged with data read from disk to answer queries. Any data that is committed to PDTs is also written to the transaction log and persisted on disk for recovery purposes before the commit completes.
In Vector a session’s transaction starts with the first operation, even if that operation is a query, and read consistency is maintained throughout the transaction. If one session is performing read-only queries, and another session is performing DML then for the read-only session to see the changes made by the DML session:
• First commit the changes in the DML session (which completes the transaction).
• End the ongoing transaction in the read-only session by issuing a commit or rollback.
Maintaining read-consistency for a read-only session can use a lot of memory if other sessions perform DML and complete transactions. For this reason, on a system that performs lots of DML it is important to commit or roll back a session periodically, or to terminate a session and restart it. Also consider the use of the AUTOCOMMIT setting to automatically commit each statement. For more information, see the Vector SQL Language Guide.
Every Vector database reserves a portion of the server’s memory to store PDTs. The amount of memory reserved is controlled by the parameter max_global_update_memory as target maximum amount of query execution memory that can be used for PDTs. The default is 0.25.
When the amount of memory used by PDTs exceeds the maximum target amount then Vector triggers automatic data propagation against all tables with outstanding transactions. Transactions will continue to be executed during the process, but more memory may be consumed to hold updates than the target amount configured in the vectorwise.conf file.
Best practices for incremental data loads into Vector:
• Append data whenever possible.
• Use the MODIFY...TO COMBINE statement to delete or update sizable portions of data. As a guideline use MODIFY...TO COMBINE if 10% or more of the data in a table changes.
• Make sure as DML is applied to commit or roll back read-only sessions regularly to (1) see committed updated data, and (2) free up memory and data blocks required to maintain a relatively old transaction consistency view.
• Consider using an explicit call to the MODIFY...TO COMBINE statement against a table with outstanding DML at a time when the system has surplus resources available to perform the operation and free up memory in PDTs.
Constraints
Vector enforces primary, unique, and foreign key constraints using SELECT statements against the constraint column(s). Enforced constraints are validated for all operations. As a result, constraint checking can have a significant impact on resource consumption for frequent small incremental loads. Take this into consideration when designing your schema and consider whether to declare enforced constraints and validate data integrity in the data load process. Consider using not enforced constraints or dropping and recreating constraints to minimize the impact of constraint validation during data loads.
Moving Window of Data
A moving window of data is very common for fact tables in a data warehouse or data mart. For example: you want to keep the most recent 36 months of data online. Below are three approaches to achieve a moving window data loading strategy with Vector:
1. Append data directly to the fact table in Vector as it arrives--for example, add data daily or multiple times per day. This approach assumes that the fact table is not indexed.
Monthly, create a staging table by selecting the (primary key for the) rows from the fact table for the month you want to purge (CREATE TABLE...AS SELECT...) and then use the MODIFY...TO COMBINE statement to remove the data from the original table.
MODIFY basetable EXCEPT staging TO COMBINE
Then drop the staging table.
2. Append data to staging table STAGING1 rather than the base fact table as it arrives--for example, add data daily or multiple times per day. This approach assumes that staging table STAGING1 is not indexed. Create a view to union (all) the results of staging table STAGING1 and the base fact table.
Monthly, create a second staging table STAGING2 by selecting the (primary key for the) rows from the fact table for the month you want to purge (CREATE TABLE...AS SELECT…) and then use the MODIFY...TO COMBINE statement to add the data in staging table STAGING1 and remove the data from the original table which is in staging table STAGING2.
MODIFY basetable UNION staging1 EXCEPT staging2 TO COMBINE
For this operation the base table may be indexed.
Drop staging table STAGING2.
Then empty the table STAGING1:
MODIFY staging1 TO TRUNCATED
3. Use multiple tables with the same structure to represent all data for the fact table (for example one table per month--36 tables total for 36 months of data).
Create a view to UNION ALL the results of all tables and present the data for all tables as a single fact table view to an application. Add data as it arrives to the most recent month of data. If the table is indexed, then consider using MODIFY...TO COMBINE to add the data. Every month drop or empty (using the MODIFY...TO COMBINE statement) the oldest table.
Ideally, add a clause in the UNION definition that specifies WHERE date_range BETWEEN <first minute in first day of month 1> and <last minute of last day of month 1> for each view being unioned together, to provide information to the optimizer that each view contained only rows in that range. This helps to eliminate views that are not relevant.
Beware that the different approaches may result in different query execution plans that can impact performance of queries against the base table data.