“Reuse” Heuristic
The query optimizer includes a heuristic that searches for common tables in different fragments of a query. For example, in the query SELECT ... FROM a, b, c WHERE a.x = b.y AND ... UNION SELECT ... FROM a, b, d WHERE a.x = b.y AND ..., the "a, b" join is shared between the two SELECTs in the UNION. In such cases, query plans are built that materialize the common portions once and cache them in memory to be reused. This “reuse” heuristic can greatly improve the performance of some queries.
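The following sketch makes the pattern concrete (the table and column names are schematic, not from a real database):
SELECT a.x, c.w FROM a, b, c WHERE a.x = b.y AND b.n = c.m
UNION
SELECT a.x, d.w FROM a, b, d WHERE a.x = b.y AND b.n = d.m;
Here the join of a and b appears in both arms of the UNION, so the optimizer can materialize that join once, cache the result, and read the cached result for the second arm.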
In rare cases, the cached fragments may overflow the available memory. The heuristic attempts to avoid overflow by using estimates of the size of the cached results, but you can also turn off the feature by using the SET NOREUSE statement. SET NOREUSE stays in effect until the end of the session or until a SET REUSE statement is issued to reinstate the default behavior.
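For example, a minimal session sketch (the query itself is elided):
SET NOREUSE;
SELECT ...;    -- planned and executed without fragment caching
SET REUSE;     -- reinstate the default behavior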
If you want the NOREUSE behavior to be the default for the server, you can set the parameter opf_reuse to OFF in config.dat. You must then use the SET REUSE statement to enable the heuristic for a given session.
Note: Use NOREUSE only in extreme cases.
The parameter opf_maxreuse in config.dat specifies the maximum estimated size of a reuse fragment (as a nonnegative number of bytes) that can be cached. If the estimate for a potential reuse fragment of a query exceeds that amount, the results of the reuse fragment are not cached and are materialized as many times as referenced in a given query. The default value for opf_maxreuse is 100 MB. If it is explicitly set to 0, fragment size is not checked and the reuse feature is always attempted.
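For example, with the default setting of 100 MB, a fragment whose estimated result size is 150 MB is not cached and is rematerialized at each of its references in the query, whereas a fragment estimated at 50 MB is materialized once and reused.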
Summary for Evaluating QEPs
The main points to check when evaluating a QEP are as follows:
• Cart-prods can be caused by disjoint queries (often the result of errors) or by queries that involve certain types of OR operations. Joins involving calculations, data type conversions, and non-equijoins can also generate cart-prods. In these circumstances, consider alternative ways of posing the query.
• An NU on the storage structure part of the orig node description is not a good sign if you believe the storage structure should have been used to restrict the number of rows being read.
• Verify that the appropriate secondary indexes are being used. Running the optimization process to generate statistics on the indexed columns allows the query optimizer to better differentiate the selectivity of the different indexes in a particular situation.
• If there is little data in a table (for example, fewer than five pages), the query optimizer can choose a scan of the table rather than use any primary or secondary index, because little is to be gained from using them.
• Check that the row estimates on the QEP nodes are accurate. If not, run the optimization process to generate statistics on the columns in question (see the example after this list).
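For example, a sketch of generating statistics on a specific column with the -r and -a flags (the database, table, and column names are illustrative):
optimizedb demodb -remp -aempno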
Specialized Statistics Processing
Optimizer statistics can be reviewed and processed by several utilities. You can:
• View and delete statistics
• Unload statistics to a text file
• Load statistics from a text file
• Copy a table and its associated statistics to another database
• Create sampled statistics
Display Optimizer Statistics
You can view statistics that have been collected. Use the following methods:
• statdump command
• Display Statistics dialog in Director or VDBA
The usual output is based on statistics generated by the optimization process, as described in Database Statistics.
Display Optimizer Statistics for Individual Tables and Columns
By default, optimizer statistics are shown for all tables and columns in the current database, but you can view statistics for specific table columns.
The following command lists statistics for the empno column of the emp table:
statdump demodb -remp -aempno
In Director or VDBA, use the Specify Tables and Specify Columns check boxes in the Display Statistics dialog.
Delete Optimizer Statistics
To delete statistics, use any of these methods:
• Use the -zdl flag with the statdump command. Use the -r and -a flags to specify the tables and columns for which to delete statistics.
This command deletes statistics from the demodb database for the empno and sex columns of the emp table and for the empno column of the task table:
statdump -zdl demodb -remp -aempno -asex -rtask -aempno
• DROP STATISTICS statement (see the sketch after this list)
• In Director, use the Display Statistics dialog, Options page.
• In VDBA, enable the Delete Statistics from Syscat check box. Using this check box with the Specify Tables and Specify Columns check boxes, you can specify the tables and columns for which to delete statistics.
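A minimal sketch of the DROP STATISTICS statement, similar in effect to deleting the emp statistics in the statdump -zdl example above (the exact clause syntax can vary by release, so verify it against the SQL Reference Guide):
DROP STATISTICS FROM emp COLUMN (empno, sex);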
Floating Point Precision in Optimizer Statistics Display
You can specify the precision with which floating point numbers are displayed in the statistics. Enable the Set Precision Level to check box and enter a value in the corresponding edit control; the value determines the number of decimal digits in the text format of the floating-point numbers.
For example, assume a table, t_float, is defined with a column named c_float of type float, and that the following statements are used to insert values (all of which are approximately 1.0):
insert into t_float values (0.99999998);
insert into t_float values (0.99999999);
insert into t_float values (1.0);
insert into t_float values (1.00000001);
insert into t_float values (1.00000002);
You can create statistics for this table using the optimization procedure described in Database Statistics.
With the default floating point precision, the standard output shows seven places after the decimal point, which is not enough to distinguish these values. For greater precision, you can enable the Set Precision Level check box and enter a larger value.
For example, specifying a precision level of 14 generates output like the following, in which there is sufficient precision to maintain a visible difference in the values:
*** statistics for database demodb version: 00850
*** table t_float rows:5 pages:3 overflow pages:0
*** column c_float of type float (length:8, scale:0, nullable)
date:2000_02_24 15:15:30 GMT unique values:5.000
repetition factor:1.000 unique flag:Y complete flag:0
domain:0 histogram cells:10 null count:0.00000000000000 value length:8
cell: 0 count:0.00000000000000 repf:0.00000000000000 value:0.99999997999999
cell: 1 count:0.20000000298023 repf:1.00000000000000 value:0.99999998000000
cell: 2 count:0.00000000000000 repf:0.00000000000000 value:0.99999998999999
cell: 3 count:0.20000000298023 repf:1.00000000000000 value:0.99999999000000
cell: 4 count:0.00000000000000 repf:0.00000000000000 value:0.99999999999999
cell: 5 count:0.20000000298023 repf:1.00000000000000 value:1.00000000000000
cell: 6 count:0.00000000000000 repf:0.00000000000000 value:1.00000000999999
cell: 7 count:0.20000000298023 repf:1.00000000000000 value:1.00000001000000
cell: 8 count:0.00000000000000 repf:0.00000000000000 value:1.00000001999999
cell: 9 count:0.20000000298023 repf:1.00000000000000 value:1.00000002000000
This can be useful when statistics are output to a text file or input from a text file. For more information, see Statistics in Text Files. When reading statistics from a text file, the optimization process assumes that all cell values are in ascending order; use the Set Precision Level option to preserve enough precision that adjacent floating point cell values remain distinct.
Statistics in Text Files
The optimization process can directly read a set of optimizer statistics from an external text file, rapidly updating the statistics for specific database tables. This can be useful when:
• Information is being moved from one database to another (for example, using copydb), and you want to copy the statistics for the tables as well.
• You know the distribution of the data in a table and want to read or input these values directly, instead of letting the optimization process generate them for you.
The actual table data is ignored. This gives you a modeling capability: the table can be empty, yet the statistics indicate the presence of data and its distribution, and the query optimizer uses those false statistics to determine a QEP. For more information, see Query Execution Plans. This gives the DBA the ability to verify the correctness of QEPs without having to load data into tables.
The text file read by the optimization process can be created in one of two ways:
• When displaying statistics, you can unload statistics that already exist in the database and use the generated file as input to the optimization process.
• You can create an input file from scratch or by editing a file created when displaying statistics.
Unload Optimizer Statistics to a Text File
You can unload optimizer statistics to a text file. The generated file is in an appropriate format so that it can be used as input to the optimization process. This allows:
• Statistics to be easily moved from one database to another
• A template text file to be created as a starting point if you are generating your own statistics
Use the statdump -o flag. This command dumps all statistics from database mydb into the file stats.out:
statdump -o stats.out mydb
You can also use Director or VDBA.
Unload Statistics for Selected Tables or Columns
To unload statistics for selected tables or columns, use the statdump -r and -a flags.
This command unloads statistics for only the arel table and col1 of the brel table:
statdump -o stats.out mydb -rarel -rbrel -acol1
In VDBA, use the Direct Output to Server File option with the Specify Tables and Specify Columns check boxes in the Display Statistics dialog. For example, if you want the stats.out file to contain statistics for the entire arel table and the col1 column of the brel table, enable the Specify Tables check box and choose only the arel and brel tables in the Specify Tables dialog; then enable the Specify Columns check box and choose only the col1 column of brel in the Specify Columns dialog.
Sample Text File Statistics
A sample output file generated using the Direct Output to Server File option of the Display Statistics dialog is shown below. This same text file can be used as input to the optimization process, as described in the next section, Load Optimizer Statistics from a Text File:
*** statistics for database demodb version: 00850
*** table brel rows:151 pages:3 overflow pages:1
*** column col1 of type integer (length:4, scale:0, nullable)
date:2000_02_24 16:04:37 GMT unique values:132.000
repetition factor:1.144 unique flag:N complete flag:0
domain:0 histogram cells:16 null count:0.0000000 value length:4
cell: 0 count:0.0000000 repf:0.0000000 value: 0
cell: 1 count:0.0728477 repf:1.3750000 value: 23
cell: 2 count:0.0728477 repf:1.8333334 value: 31
cell: 3 count:0.0728477 repf:1.3750000 value: 59
cell: 4 count:0.0728477 repf:1.1000000 value: 138
cell: 5 count:0.0728477 repf:1.0000000 value: 151
cell: 6 count:0.0728477 repf:1.0000000 value: 162
cell: 7 count:0.0728477 repf:1.0000000 value: 173
cell: 8 count:0.0662252 repf:1.2500000 value: 181
cell: 9 count:0.0662252 repf:1.1111112 value: 193
cell: 10 count:0.0662252 repf:1.2500000 value: 202
cell: 11 count:0.0662252 repf:1.0000000 value: 214
cell: 12 count:0.0662252 repf:1.0000000 value: 224
cell: 13 count:0.0662252 repf:1.0000000 value: 236
cell: 14 count:0.0662252 repf:1.2500000 value: 256
cell: 15 count:0.0264901 repf:1.0000000 value: 261
Load Optimizer Statistics from a Text File
To load optimizer statistics from a text file, use the optimizedb -i (input) flag, which loads one or more sets of values from the file.
For example, if the file arelbrel.dat contains statistics for the arel and brel tables, these can be loaded into the database using the command:
optimizedb -i arelbrel.dat mydb
In Director, use the Generate Statistics dialog, Input/Output File page, Statistics Input Filename field.
In VDBA, use the Read Statistics from Server File option in the Optimize Database dialog.
Load Statistics for Selected Tables or Columns
If the input file contains statistics for multiple tables, you can load selected tables or columns by specifying the optimizedb -r (read selected tables) and -a (read selected columns) flags.
For example, if the file arelbrel.dat contains statistics for the arel and brel tables, only the statistics for arel can be loaded into the database using the command:
optimizedb -i arelbrel.dat mydb -rarel
Only statistics for column col3 of the arel table are loaded with the following command:
optimizedb -i arelbrel.dat mydb -rarel -acol3
In VDBA, use the Read Statistics from Server File option with the Specify Tables and Specify Columns check boxes in the Optimize Database dialog.
Update Row and Page Counts
The input file for the optimization process contains information about the number of rows, as well as primary and overflow page counts in a table. However, because these values are critical to correct operation, these input values are normally disregarded when creating statistics, leaving the catalog values untouched.
You can force the values in the input file to be used when loading the statistics by specifying the -zp flag. For example:
optimizedb -i arelbrel.dat -zp mydb -rarel
In VDBA, enable the Read Row and Page check box in the Optimize Database dialog.
IMPORTANT! This option must be used with extreme care because it sets critical values.
This option can be useful for certain specialized processing, such as query modeling and performance problem debugging. Bear in mind that the row count value is modified for the table and its indexes, but the page count is modified for the table only; the index page count values remain unchanged.
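For example, a sketch of a modeling workflow assembled from commands described in this chapter (the table, file, and database names are illustrative):
statdump -o arel.dat mydb -rarel
(edit the rows: and pages: values in the header lines of arel.dat to model the intended table size)
optimizedb -i arel.dat -zp mydb -rarel
Because -zp forces the row and page counts in the file into the catalogs, the optimizer then plans queries as if the table held that much data.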
Copy a Table and Associated Statistics
You can copy a table and its associated optimizer statistics from one database to another using copydb and statistics that have been unloaded to a text file. This is usually much faster than copying only the table and rerunning the optimization process to recreate the statistics.
Note: Doing this makes sense only if the statistics are up-to-date.
First, unload the table and its statistics to text files, as described in the steps below:
1. Enter the following command to generate copy.in and copy.out scripts for the arel table:
copydb olddb arel
2. Copy the arel table out of the olddb database:
sql olddb <copy.out
3. Unload the statistics for the arel table to a text file named arel.dat:
statdump -o arel.dat olddb -rarel
(Or use Director or VDBA.)
Next, copy the table and statistics back into the new database:
4. Copy the arel table into the new database:
sql newdb <copy.in
5. Load the statistics for the arel table from the text file arel.dat created in Step 3:
optimizedb -i arel.dat newdb -rarel
(Or use Director or VDBA.)
Sampled Optimizer Statistics
The optimization process allows you to create sampled optimizer statistics on database tables. For large tables, sampled statistics are usually much faster to generate than full statistics, and if the percentage of rows to be sampled is chosen appropriately, they can be nearly as accurate.
Sampled statistics are generated by only looking at a certain percentage of the rows in a table. The percentage must be chosen so that all significant variations in the data distribution are likely to be sampled.
The sampled rows are selected by randomly generating values for the tuple identifier (tid), so tid values are required to support this functionality.
Create Sampled Statistics
To specify sampled statistics and the percentage of rows to be sampled, use the optimizedb -zs flag:
optimizedb -zsn database
where n is a real number indicating the percentage of rows to be sampled, from 1 to 100.
The following example optimizes the table bigtable, sampling 3% of the rows:
optimizedb -zs3 mydb -rbigtable
In Director, on the Generate Statistics dialog, select Generate Statistics on Subset of Data, and then specify a Precision Level.
In VDBA, in the Optimize Database dialog, select Statistics on Subset of Data, and then specify a Percentage.
When sampling, the query optimizer chooses rows randomly from the base table and inserts them into a temporary table. Rows are selected in such a way that the uniqueness of column values is preserved (conceptually, when sampling, a row can be selected no more than once). Full statistics, or minmax statistics if requested, are created on the temporary table and stored in the catalogs as statistics for the base table. The temporary table is then deleted. Be sure you have enough free disk space to store this temporary table, and that create_table permission has been granted to the user running the optimization process.
You have control over the percentage of rows that are sampled, and it is worthwhile to experiment with this percentage. If the percentage is too small to give a good sample, the resulting statistics change each time the percentage changes. As you increase the percentage, a plateau is eventually reached where the statistics come out almost the same. The smallest percentage that produces stable statistics is the most efficient choice.
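For example, a sketch of such an experiment using the commands described earlier (the database, table, and file names are illustrative):
optimizedb -zs1 mydb -rbigtable
statdump -o stats1.out mydb -rbigtable
optimizedb -zs5 mydb -rbigtable
statdump -o stats5.out mydb -rbigtable
When the histograms in successive output files are nearly identical, the smaller percentage is already adequate.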
Composite Histograms
The optimizer usually calculates costs from statistics on individual columns. However, Vector can also create and use histograms built from the concatenation of several columns. Such histograms are called composite histograms.
Composite histograms are useful in ad hoc query applications in which there are WHERE clause restrictions on varying combinations of columns. Such applications can have a variety of secondary indexes constructed on different permutations of the same columns with the goal of allowing the query optimizer to pick an index tailored to the specific combination of restrictions used in any one query.
For example, consider a table X with columns A, B, C, D, E, etc. and secondary indexes defined on (A, B, C), (B, C, D), (B, A, E). Consider a query with a WHERE clause such as “A = 25 and B = 30 and E = 99”. With histograms on the individual columns, the query optimizer finds it difficult to differentiate the cost of solving the query using the (A, B, C) index and the (B, A, E) index. This is because of the technique used to determine the combined effect of several restrictions on the same table. However, with composite histograms defined on each index, the optimizer combines the three restrictions into a single restriction on the concatenated key values, and the (B, A, E) index clearly produces the best-looking query plan.
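A minimal SQL sketch of this scenario (the table, column, and index names follow the example above and are otherwise illustrative):
CREATE TABLE x (a INTEGER, b INTEGER, c INTEGER, d INTEGER, e INTEGER);
CREATE INDEX x_abc ON x (a, b, c);
CREATE INDEX x_bcd ON x (b, c, d);
CREATE INDEX x_bae ON x (b, a, e);
SELECT * FROM x WHERE a = 25 AND b = 30 AND e = 99;
With a composite histogram on each index's key columns, the optimizer can estimate the combined selectivity of the three equality restrictions directly against the (b, a, e) concatenation.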
Composite histograms can be created on any collection of two or more columns. The columns can be, but need not be, storage structure or secondary index key columns. Any one column can participate in multiple composite histograms.
When optimizing a query, the optimizer looks for the longest composite histogram (that is, the one with the most columns) whose leading columns are all referenced by equality predicates in the query. All the columns in that histogram are then marked “statistics found,” and the process repeats with the remaining columns in the query.
Finally, all remaining columns not already covered by a composite histogram are estimated using standard single-column statistics.
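Continuing the earlier example: with equality predicates on A, B, and E, the histogram on (B, A, E) matches on all three of its leading columns, while the one on (A, B, C) matches only on its first two, so (B, A, E) is chosen. Columns A, B, and E are then marked “statistics found,” and no columns remain to be estimated from single-column statistics.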