RDBMS Transformation Maps
It is very common for Bulk/Batch integrations to involve one or more database servers. Database servers are excellent at storing and managing data but are not necessarily optimized for moving large quantities of data into and out of the database. The native client drivers were originally intended to work with relatively small datasets by the transactional applications that used them. However, bulk / batch integrations often deal with entire tables and sometimes entire database schemas. There are different considerations and depends on if you are reading from or writing to RDBMS endpoints with Actian DataConnect.
Source Data
If the source dataset is extremely large these are some common techniques used to reduce the overall processing efficiency:
• Split a very large table into smaller chunks based on a relevant WHERE clause and iterate through each chunk separately. This allows you to:
• Serialize processing and implement restart points by tracking of each iteration as a separate unit of work.
• Parallelize processing each result set in separate JMVs using job concurrency to distribute the load across more than one engine.
• Instead of a full table scan by connecting directly to the table, use a SQL pass-through statement to select only the columns and rows you required. Use DJX syntax to leverage the EZscript variables and macros as parameters to make queries more reusable. The keywords to wrap entire scripts as source queries are DJXBegin and DJXEnd.
• Use SQL pass-through statements referenced from file when you have standard queries. This also provides the option of having complex queries written by DBA staff that can be quickly implanted by the mapping teams.
• Use a View or have your SQL Pass-through statements execute a stored procedure as part of the source connection to offload work to the database server.
• Consider exporting a table to file and then use one of the file connectors. Overall, this can be much faster than a single map step that uses the transactional-based connection through a native client or ODBC driver.
• Use various Cursor property settings when available. Forward looking cursors are the fastest, Keyset provides the most current data.
• Transaction Isolation setting, when available, can have similar effects. The default setting can be serialized, which insures data quality but also has a slight performance cost. The fastest is Read_Uncommitted but its cost is that the data may have changed after the starting the transaction and hence, may no longer be accurate. Use it when you can afford to wait for subsequent updates (that is, you run the job frequently so that the update window is narrow).
Target Data
• Single mode database connectors are typically easier for new users or those users less familiar with SQL concepts but are not as flexible as Multimode connectors.
• Multimode connectors are best when you need to perform multiple CRUD operations for a single row of source data. Writing to multiple target tables to normalize data or load data warehouse star schemas for instance. They also enable the execution of SQL statement actions as part of an Event-Condition-Action rule.
• Most database connectors have an Array Size or a Bulk Operation property for target connections. Use these to submit optimal batch sizes to the database server. Typically, the default setting for these creates a single batch of rows that get written to the target database. This can make managing transactions easier but can tax memory and slow down processing if the dataset are too large.
• There is no one-size-fits-all scenario but you must find the optimal balance between CommitFrequency and ArraySize to fine tune performance of your inserts:
• CommitFrequency property has the default value as 0, which means the entire dataset will be committed (or rolled back) depending on the outcome of the integration. Use smaller sizes to improve performance.
• ArraySize property has the default value as 1, which means every row will be committed individually. Sending more rows per batch by increasing the size improves performance.
• Most of the RDMBS have a mass insert or bulk copy connector that will outperform the transactional connector. They bypass transaction logs. Therefore, use them when you only need to load data. For example, Oracle Direct Path, SQL Server Mass Insert, and ODBC Mass Insert.
• If you need transactional support, you can offload those SQL operations to the database server by using the mass insert or bulk copy connector to a staging table first. Then, execute a stored procedure to perform updates.
• When loading large amounts of data to a target table, it generally is slower if that table has one or more indexes. When possible, remove those indexes and then re-index after the transformation ends.
Last modified date: 12/03/2024