Sampled Optimizer Statistics

The optimization process allows you to create sampled optimizer statistics on database tables. For large tables, sampled statistics are usually much faster to generate than full statistics, and if the percentage of rows to be sampled is chosen appropriately, they can be nearly as accurate.

Sampled statistics are generated by only looking at a certain percentage of the rows in a table. The percentage must be chosen so that all significant variations in the data distribution are likely to be sampled.

The sampled rows are selected by randomly generating values for the tuple identifier (tid), so tid values are required to support this functionality.

To specify sampled statistics and the percentage of rows to be sampled, use the optimizedb -zs flag:

where n is a real number indicating the percentage of rows to be sampled, from 1 to 100.

The following example optimizes the table bigtable, sampling 3% of the rows:

In Director, on the Generate Statistics dialog, select Generate Statistics on Subset of Data, and then specify a Precision Level.

In VDBA, in the Optimize Database dialog, select Statistics on Subset of Data, and then specify a Percentage.

When sampling, the query optimizer chooses rows randomly from the base table and inserts them into a temporary table. Rows are selected in such a way that uniqueness of column values are preserved (conceptually, when sampling, a row can be selected not more than once). Full statistics, or minmax if requested, are created on the temporary table and stored in the catalogs as statistics for the base table. The temporary table is deleted. Be sure you have enough free disk space to store this temporary table, and that create_table permission has been granted to the user running the optimization process.

You have control over the percentage of rows that are sampled. It is worthwhile to experiment with this percentage. When the percentages are too small for a good sampling, the statistics created change as percentage figures change. As you increase the percentage, eventually a plateau is reached where the statistics begin coming out almost the same. The smallest percentage that provides stable statistics is the most efficient number.