14. Using the Query Optimizer : Database Statistics : Types and Levels of Statistics Collected

Types and Levels of Statistics Collected
When optimizing a database, you can create several types and levels of statistics by specifying options.
First, you can specify what set of data the statistics are gathered on:
Non-sampled statistics—all rows in the selected tables are retrieved
Sampled statistics—a subset of rows from the selected tables is retrieved
Next, either of the following can be created, based on the selected data set. This division determines how much information about the distribution of data the statistics can hold:
Full statistics—a histogram for the whole range of column values is created
Minmax statistics—a histogram showing only minimum and maximum values is created
Non-Sampled and Sampled Statistics
When generating statistics for a database, by default all rows of the selected tables are used in the generation of statistics. These non-sampled statistics represent the most accurate statistics possible, because all data is considered.
When the base table is large, you may want to use sampled statistics. With a sufficient sampling, statistics created are almost identical to statistics created on the full table. The processing for sampled statistics is discussed in greater detail in Sampled Optimizer Statistics.
Note:  By default, sampled statistics are used for tables that have more than one million rows. To prevent sampling, use the –zns flag of optimizedb or the WITH NOSAMPLE option of the CREATE STATISTICS statement.
Full Statistics
When optimizing a database, full statistics are generated by default.
For example, generate full statistics for all columns in all tables in the empdata database:
optimizedb empdata
Full statistics carry the most information about data distribution (unless the data is modified significantly after statistics are collected).
The cost of their creation (in terms of system resources used), however, is the highest of all types. For each selected column the table is scanned once, and the column values are retrieved in a sorted order. Depending on the availability of indexes on the selected columns, a sort can be required, increasing the cost even further.
The process of generating such complete and accurate statistics can require some time, but there are several ways to adjust this.
Generate Full Statistics on Sample Data
You can shorten the process of creating full statistics by creating them on sampled data.
This example generates full statistics with a sampling of 0.5% rows of the emp table:
optimizedb -zs0.5 empdata -remp
This example generates full statistics with a sampling of 1% rows of the emp table:
CREATE STATISTICS FOR emp WITH SAMPLE = 1;
Minmax Statistics
Minmax statistics are “cheaper” than full statistics to create. In most cases they require only one scan of the entire table. Statistics created have information only about minimum and maximum values for a column. This can be acceptable if the distribution of values in the column is reasonably even. However, if the values of a particular column are skewed, minmax statistics can mislead the query optimizer and result in poor query plan choices.
Note:  Minmax statistics cannot be created using the CREATE STATISTICS statement.
Example: Generate Statistics with Only Minimum and Maximum Values for a Table
This example generates statistics with only minimum and maximum values for the employee table:
optimizedb -zx empdata -remployee
In Director:
1. Open the Generate Statistics dialog
2. Click Specify Tables, and then click the Tables button.
3. Select the employee table and click OK.
4. Click Options under Select a page.
5. Select Determine Min/Max values for each Column, and then click OK.
Key Column Statistics
Key column statistics create full or minmax statistics on key or indexed columns only.
These statistics are generated by specifying the -zk flag in the optimizedb command.
The effect of this command is the same as running optimizedb and specifying key and index columns for a table with the -a flag. This method saves the user some work because optimizedb determines from catalogs which columns are keys and indexed.
Note:  Key column statistics cannot be created using the CREATE STATISTICS statement.
Examples: Create Statistics on Key or Indexed Columns Only
This command generates full statistics for all key and indexed columns in the employee table of the empdata database:
optimizedb -zk empdata -remployee
In Director:
1. Open the Generate Statistic dialog.
2. Select Generate Statistics on Keys/Indexes Columns Only (the default)
3. Select Specify Tables, and then click the Tables button.
4. Select the employee table and then click OK.
5. De-select Generate Statistics on Subset of Data, and then click OK.
This command generates minmax statistics on a 1% sampling:
optimizedb -zx -zk -zs1.0 empdata -remployee
In Director:
1. Open the Generate Statistic dialog.
2. Select Generate Statistics on Keys/Indexes Columns Only (the default)
3. Select Specify Tables, and then click the Tables button.
4. Select the employee table and then click OK.
5. Select Generate Statistics on Subset of Data, set the Precision Level at 1.
6. Click Options under Select a page.
7. Select Determine the Min/Max Values for each Column, and then click OK.
All key and indexed columns in the table are processed regardless of any column designations. For example, assume that dno is a data column and kno is a key column in the employee table. The following example for generating full statistics is the same as the first example in this section, except that in addition to key and indexed columns, statistics are generated also for the dno column:
optimizedb -zk empdata -remployee -adno -akno
In Director:
1. Open the Generate Statistic dialog.
2. Select Generate Statistics on Keys/Indexes Columns Only
3. Select Specify Tables, and then click the Tables button.
4. Select the employee table and click OK.
5. Select Specify Columns, and then click the Columns button.
6. Select the dno column and kno column, and then click OK.
You do not have to select the kno column because it is a key column and we selected Generate Statistics on Keys/Index Columns Only
7. De-select Generate Statistics on Subset of Data, and then click OK.
Statistics from an Input Text File
Statistics can be read in from a text file. The input file must conform to a certain format, which is identical to that produced when you direct output to a file when displaying statistics. Display Optimizer Statistics provides more information.
The file can be edited to reflect changes in data distribution as required, before submitting the file for use during the optimization process. However, this can potentially mislead the query optimizer into generating poor query plans. Manually editing statistics must be done only if you have a full understanding of the data and how the statistics are used in Vector.
Details on creating and using text files as input when optimizing a database are provided in Statistics in Text Files.