Using the Query Optimizer
Data and Query Optimization
Vector uses a query optimizer to develop sophisticated query execution strategies. The query optimizer makes use of basic information, such as row size, number of rows, primary key fields, and defined indexes, as well as more specific data-related information, such as the amount of data duplication in a column.
The data-related information is available for use by the query optimizer only after statistics (see Database Statistics) have been generated for the database. Without knowing exactly what data is stored in your tables, the query optimizer can only guess what your data looks like.
Consider the following examples:
select * from emp where empno = 13;
select * from emp where sex = 'M';
In each query, the guess is that few rows will qualify. In the first query, this guess is probably correct, because employee numbers are usually unique. In the second query, however, this guess is probably wrong, because a company typically employs roughly as many males as females.
Why do inaccurate assumptions about your data make a performance difference? For a single-table, keyed retrieval where you specify the key, there is probably no difference at all: the key is used to retrieve your data. In a multi-table query with several restrictions, however, knowing what your data looks like helps determine the best way to execute your query. The following example shows why:
select e.name, e.dept, b.address
from emp e, dept d, bldg b
where e.dept = d.dname
and d.bldg = b.bldg
and b.state = 'CA'
and e.salary = 50000;
There are many ways of executing this query. If appropriate keys exist, the probable choice is to execute the query in one of these two ways:
• Retrieve all the employees with a salary of 50000. Join those employees to the department table, and then join the employees in valid departments to the valid buildings. The tables are processed in the following order:
emp --> dept --> bldg
• Retrieve all the buildings with a state of CA. Join the valid buildings with the department table and join the qualifying departments to the valid employees. The tables are processed in the following order:
bldg --> dept --> emp
The difference between these two possibilities is the order in which the tables are joined. Which method is preferable? Only if you know exactly how many employees make $50,000, how many buildings are in California, and how many departments are in each building can you pick the best strategy.
The best (that is, the fastest) query execution strategy can be determined only by having an idea of what your data looks like—how many rows qualify from the restriction, and how many rows join from table to table.
Query Execution Plans (QEPs) (see Query Execution Plans), generated by the query optimizer each time you perform a query, illustrate how a query is executed. By optimizing your database, you can improve the QEPs that are generated, thereby making your queries more efficient.
Database Statistics
Generating statistics for a database optimizes it, which affects the speed of query processing. More complete and accurate statistics generally result in more efficient query execution strategies, which in turn produce faster system performance.
The extent of the statistics to be generated for a database can be modified by various options, including restricting the tables and columns that are used.
Note: Deleting all the rows in a table does not delete the statistics.
Methods for Generating Statistics
You can generate database statistics using any of the following methods:
• optimizedb command
• CREATE STATISTICS statement
• WITH STATISTICS option of the COPY FROM command
• --stats option of the vwload command
• Using Director or VDBA
• Let Vector do it automatically
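For example, the command-line and SQL methods might be invoked as follows. This is a sketch: the database, table, and file names are illustrative, and the exact vwload argument order should be confirmed in the Command Reference:
optimizedb empdata -remp
CREATE STATISTICS FOR emp;
vwload --stats --table emp empdata emp.csv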
Automatic Histogram Generation
Vector automatically constructs histograms from live data. After a server is started, the histograms are added as required and kept in cache for later reuse.
The histograms are built from sample data maintained in memory by the min-max indexes. This provides accurate histograms with little server overhead.
Histograms are automatically generated only for those columns that do not already have histograms stored in the catalog.
A typical strategy is to create histograms (with optimizedb or CREATE STATISTICS) for columns whose distribution does not change, and then let Vector generate new histograms for the other, more dynamic, columns in every server cycle.
Alternatively, if you do not want to use optimizedb or CREATE STATISTICS at all, you can simply let Vector build histograms on all columns automatically.
This feature is enabled or disabled by the opf_autostats DBMS Server configuration parameter in config.dat, which is set to VECTOR (automatically generate histograms for Vector tables) by default. In addition, the opf_autostats_rebuild parameter can be set to trigger the rebuilding of such histograms. For example, opf_autostats_rebuild=0.1 means that if the table data has grown or shrunk by 10% since autostats last built the histograms, a new set of histograms is built automatically. The default value is 0.0, which prevents rebuilding.
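For example, assuming the standard ii.<host>.dbms.* resource naming used in config.dat (the host name here is illustrative), the rebuild threshold might be changed with the iisetres utility:
iisetres ii.myhost.dbms.*.opf_autostats_rebuild 0.1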
Assumptions of the Query Optimizer
If a database has not been optimized, the query optimizer assumes that:
• All exact match restrictions return 1% of the table, except where a key or index is defined to be unique, in which case one row is returned for the indexed attribute:
where emp.empno = 275
Note: To override the default of 1% for exact match qualifications, use the Configuration-By-Forms opf_exact_key parameter.
• All range qualifications (<, <=, >=, >) and like predicates, in which the first character is not a wild card, return 10% of the table for each qualification. Thus, if there are three (non-exact match) qualifications, the following amount of the table is selected:
1/10 x 1/10 x 1/10 = 1/1000
Note: To override the default of 10% for range qualifications, use the CBF opf_range_key parameter.
• All “not equals” qualifications (<>) and like predicates, in which the first character is a wild card, return 50% of the table for each qualification. The default of 50% for these qualifications can be overridden by the Configuration-By-Forms opf_non_key parameter.
• All joins are assumed to be one-to-one, based on the smaller data set. For example, when table1 with 100 rows is joined with table2 with 1000 rows, the estimated result is 100 rows.
• When there are restrictions on the joined tables, the number of resulting rows is assumed to be no less than 10% of the qualifying rows from the smaller table.
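For example, for an unoptimized table of 100,000 rows, these defaults produce the following estimates (the column names are illustrative):
where empno = 275 (no unique key): 1% = 1,000 rows
where salary > 50000: 10% = 10,000 rows
where name <> 'Smith': 50% = 50,000 rows
where salary > 50000 and age < 30: 1/10 x 1/10 = 1,000 rows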
If these assumptions are not valid for your data, you must optimize the database by generating statistics for it.
Resources Required During Optimization
Optimizing a database generally requires disk space, because system catalog tables are created.
While the optimization process is running, the locking system takes an exclusive lock for a brief period on the user table being optimized. Whenever possible, tables on which statistics are created are not locked while statistics are gathered. This means that updates to the optimized table are not disabled for extended periods. However, it is recommended that you optimize the database during off-hours.
When running optimizedb from the command line, the "-o filename" option can be used to write the statistics to an external file rather than to the system catalogs, so that no catalog locks are required. At a later, more convenient time, the statistics can be loaded into the catalogs with the "-i filename" option, using the same external file. Optimizedb -o requires read locks, and optimizedb -i requires a brief exclusive lock on the user table. For more information on the optimizedb command, see the Command Reference.
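For example, statistics for the emp table might be gathered to a file during working hours and applied to the catalogs later (the file name is illustrative):
optimizedb -o stats.txt empdata -remp
optimizedb -i stats.txt empdata -remp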
System Modification After Optimization
Because optimizing a database adds column statistics and histogram information to the system catalogs, you should run the system modification operation on your database after optimizing it.
Running system modification modifies the system tables in the database to optimize catalog access. You should do this on a database periodically to maintain peak performance.
Run System Modification
To run system modification, use any of the following methods:
• At the command line, use the sysmod command.
• In Director, use the Modify System Tables dialog.
• In VDBA, use the System Modification dialog.
Examples:
Modify the system tables in the empdata database:
sysmod empdata
Modify only the system tables affected by optimizedb:
sysmod empdata iistatistics iihistogram
Information Collected by the Optimizer
When you optimize a database, the following information is collected:
• The number of unique values in those columns selected for optimization
• A count showing what percentage of those column values are NULL
• The number of duplicate values in those columns in the whole table, on average, or whether all values are unique. This is termed the repetition factor.
• A histogram showing data distribution for each of those columns. Sample histograms and further information on their contents can be found in Histogram (Optimization Output).
The query optimizer can use this information to calculate the cost of a QEP.
Types and Levels of Statistics Collected
When optimizing a database, you can create several types and levels of statistics by specifying options.
First, you can specify what set of data the statistics are gathered on:
• Non-sampled statistics—all rows in the selected tables are retrieved
• Sampled statistics—a subset of rows from the selected tables is retrieved
Next, either of the following can be created, based on the selected data set. This division determines how much information about the distribution of data the statistics can hold:
• Full statistics—a histogram for the entire range of column values is created
• Minmax statistics—a histogram showing only minimum and maximum values is created
Non-Sampled and Sampled Statistics
When generating statistics for a database, by default all rows of the selected tables are used in the generation of statistics. These non-sampled statistics represent the most accurate statistics possible, because all data is considered.
When the base table is large, you may want to use sampled statistics. With sufficient sampling, the statistics created are almost identical to statistics created on the full table. The processing for sampled statistics is discussed in greater detail in Sampled Optimizer Statistics.
Note: By default, sampled statistics are used for tables that have more than one million rows. To prevent sampling, use the -zns flag of optimizedb or the WITH NOSAMPLE option of the CREATE STATISTICS statement.
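For example, either of the following prevents sampling when generating statistics for the emp table:
optimizedb -zns empdata -remp
CREATE STATISTICS FOR emp WITH NOSAMPLE;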
Full Statistics
When optimizing a database, full statistics are generated by default.
For example, generate full statistics for all columns in all tables in the empdata database:
optimizedb empdata
Full statistics carry the most information about data distribution (unless the data is modified significantly after statistics are collected).
The cost of their creation (in terms of system resources used), however, is the highest of all types. For each selected column the table is scanned once, and the column values are retrieved in a sorted order. Depending on the availability of indexes on the selected columns, a sort can be required, increasing the cost even further.
The process of generating such complete and accurate statistics can take some time, but there are several ways to shorten it.
Generate Full Statistics on Sample Data
You can shorten the process of creating full statistics by creating them on sampled data.
This example generates full statistics with a sampling of 0.5% of the rows of the emp table:
optimizedb -zs0.5 empdata -remp
This example generates full statistics with a sampling of 1% of the rows of the emp table:
CREATE STATISTICS FOR emp WITH SAMPLE = 1;
Minmax Statistics
Minmax statistics are “cheaper” than full statistics to create. In most cases they require only one scan of the entire table. Statistics created have information only about minimum and maximum values for a column. This can be acceptable if the distribution of values in the column is reasonably even. However, if the values of a column are skewed, minmax statistics can mislead the query optimizer and result in poor query plan choices.
Note: Minmax statistics cannot be created using the CREATE STATISTICS statement.
Example: Generate Statistics with Only Minimum and Maximum Values for a Table
This example generates statistics with only minimum and maximum values for the employee table:
optimizedb -zx empdata -remployee
In Director:
1. Open the Generate Statistics dialog.
2. Click Specify Tables, and then click the Tables button.
3. Select the employee table and click OK.
4. Click Options under Select a page.
5. Select Determine Min/Max values for each Column, and then click OK.
Key Column Statistics
Key column statistics create full or minmax statistics on key or indexed columns only.
These statistics are generated by specifying the -zk flag in the optimizedb command.
The effect of this command is the same as running optimizedb and specifying the key and indexed columns of a table with the -a flag. This method saves the user some work, because optimizedb determines from the catalogs which columns are keys or indexed.
Note: Key column statistics cannot be created using the CREATE STATISTICS statement.
Examples: Create Statistics on Key or Indexed Columns Only
This command generates full statistics for all key and indexed columns in the employee table of the empdata database:
optimizedb -zk empdata -remployee
In Director:
1. Open the Generate Statistics dialog.
2. Select Generate Statistics on Keys/Indexes Columns Only (the default).
3. Select Specify Tables, and then click the Tables button.
4. Select the employee table and then click OK.
5. De-select Generate Statistics on Subset of Data, and then click OK.
This command generates minmax statistics on a 1% sampling:
optimizedb -zx -zk -zs1.0 empdata -remployee
In Director:
1. Open the Generate Statistics dialog.
2. Select Generate Statistics on Keys/Indexes Columns Only (the default).
3. Select Specify Tables, and then click the Tables button.
4. Select the employee table and then click OK.
5. Select Generate Statistics on Subset of Data, and set the Precision Level to 1.
6. Click Options under Select a page.
7. Select Determine the Min/Max Values for each Column, and then click OK.
All key and indexed columns in the table are processed regardless of any column designations. For example, assume that dno is a data column and kno is a key column in the employee table. The following example for generating full statistics is the same as the first example in this section, except that statistics are also generated for the dno column, in addition to the key and indexed columns:
optimizedb -zk empdata -remployee -adno -akno
In Director:
1. Open the Generate Statistics dialog.
2. Select Generate Statistics on Keys/Indexes Columns Only.
3. Select Specify Tables, and then click the Tables button.
4. Select the employee table and click OK.
5. Select Specify Columns, and then click the Columns button.
6. Select the dno column and kno column, and then click OK.
You do not have to select the kno column, because it is a key column and Generate Statistics on Keys/Indexes Columns Only is selected.
7. De-select Generate Statistics on Subset of Data, and then click OK.
Statistics from an Input Text File
Statistics can be read in from a text file. The input file must conform to a specific format, identical to the format produced when you direct output to a file while displaying statistics.
Display Optimizer Statistics provides more information.
The file can be edited to reflect changes in data distribution as required before it is submitted for use during the optimization process. However, this can potentially mislead the query optimizer into generating poor query plans. Edit statistics manually only if you fully understand the data and how the statistics are used in Vector.
Details on creating and using text files as input when optimizing a database are provided in Statistics in Text Files.
Column Statistics
Collecting statistics is generally a time-consuming process because a large amount of data must be scanned. The techniques described so far, except for Key Column Statistics, collect statistics on all columns of the indicated tables.
It is not necessary, however, to choose all columns in all tables in your database when optimizing. The query optimizer uses statistics on a column only if the column is needed to restrict data, if it is specified in a join, or if it is in a GROUP BY clause (to estimate the number of groups). Therefore, we recommend limiting the creation of statistics to those columns used in a WHERE or GROUP BY clause.
The DBA or table owner usually understands the table structure and content and is able to predict how the various columns are used in queries. Thus, someone familiar with the table can identify columns that are used in the WHERE or GROUP BY clause.
Given these queries:
select name, age from emp
where dept = 'Tech Support';
select e.name, e.salary, b.address
from emp e, bldg b, dept d
where e.dept = d.dname
and d.bldg = b.bldg;
Candidate columns for optimization are:
emp table: dept
dept table: dname, bldg
bldg table: bldg
Based on their use in these sample queries, there is no reason to obtain statistics on employee name, age, salary, or building address. These columns are listed in the target list only, not the WHERE clause of the query.
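For example, assuming multiple -r and -a specifications can be combined in one invocation, as in the single-table syntax shown earlier, statistics for just these candidate columns might be gathered with a single command:
optimizedb empdata -remp -adept -rdept -adname -abldg -rbldg -abldg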
Columns used in the WHERE clause are often indexed to speed up joins and the execution of constraints. If this is the case, specify the Generate Statistics on Keys/Indexes Columns Only option to create statistics on key (that is, indexed) columns. However, it is often equally important to create statistics on non-indexed columns referenced in WHERE clauses.
Histogram (Optimization Output)
When you optimize a database, output is generated to show the statistics.
For example, if the Print Histogram option was enabled when optimizing the database, and you chose to optimize the name and sex columns of the emp table, the following output is typical:
*** statistics for database demodb version: 00850
*** table emp rows:1536 pages:50 overflow pages:49
*** column name of type varchar (length:30, scale:0, nullable)
date:2000_02_24 15:40:38 GMT unique values:16.000
repetition factor:96.000 unique flag:N complete flag:0
domain:0 histogram cells:32 null count:0.0000 value length:8
cell: 0 count:0.0000 repf:0.0000 value:Abbot \037
cell: 1 count:0.0625 repf:96.0000 value:Abbot
cell: 2 count:0.0000 repf:0.0000 value:Beirne \037
cell: 3 count:0.0625 repf:96.0000 value:Beirne
cell: 4 count:0.0000 repf:0.0000 value:Buchanam
cell: 5 count:0.0625 repf:96.0000 value:Buchanan
cell: 6 count:0.0000 repf:0.0000 value:Cooper \037
cell: 7 count:0.0625 repf:96.0000 value:Cooper
cell: 8 count:0.0000 repf:0.0000 value:Dunham \037
cell: 9 count:0.0625 repf:96.0000 value:Dunham
cell: 10 count:0.0000 repf:0.0000 value:Ganley \037
cell: 11 count:0.0625 repf:96.0000 value:Ganley
cell: 12 count:0.0000 repf:0.0000 value:Hegner \037
cell: 13 count:0.0625 repf:96.0000 value:Hegner
cell: 14 count:0.0000 repf:0.0000 value:Jackson\037
cell: 15 count:0.0625 repf:96.0000 value:Jackson
cell: 16 count:0.0000 repf:0.0000 value:Klietz \037
cell: 17 count:0.0625 repf:96.0000 value:Klietz
cell: 18 count:0.0000 repf:0.0000 value:Myers \037
cell: 19 count:0.0625 repf:96.0000 value:Myers
cell: 20 count:0.0000 repf:0.0000 value:Petersom
cell: 21 count:0.0625 repf:96.0000 value:Peterson
cell: 22 count:0.0000 repf:0.0000 value:Rumpel \037
cell: 23 count:0.0625 repf:96.0000 value:Rumpel
cell: 24 count:0.0000 repf:0.0000 value:Singer \037
cell: 25 count:0.0625 repf:96.0000 value:Singer
cell: 26 count:0.0000 repf:0.0000 value:Stec \037
cell: 27 count:0.0625 repf:96.0000 value:Stec
cell: 28 count:0.0000 repf:0.0000 value:Washings
cell: 29 count:0.0625 repf:96.0000 value:Washingt
cell: 30 count:0.0000 repf:0.0000 value:Zywicki\037
cell: 31 count:0.0625 repf:96.0000 value:Zywicki
unique chars: 14 9 11 11 9 11 6 3
char set densities: 0.5200 0.3333 0.4762 0.6667 0.0952 0.1111 0.0633 0.0238
*** statistics for database demodb version: 00850
*** table emp rows:1536 pages:50 overflow pages:49
*** column sex of type char (length:1, scale:0, nullable)
date:23-feb-2000 10:12:00 unique values:2.000
repetition factor:768.000 unique flag:N complete flag:0
domain:0 histogram cells:4 null count:0.0000000 value length:1
cell: 0 count:0.0000000 repf:0.0000000 value:E
cell: 1 count:0.0006510 repf:1.0000000 value:F
cell: 2 count:0.0000000 repf:0.0000000 value:L
cell: 3 count:0.9993489 repf:1535.0000000 value:M
unique chars: 2
char set densities: 0.1428571
The items in the histogram are as follows:
database
Database name
version
Version of the catalog from which statistics were derived. Shown only if version is 00605 or later.
table
Table currently processing
rows
Current number of rows in table as stored in the iitables catalog
pages
Number of pages (from the iitables catalog)
overflow pages
Number of overflow pages (from the iitables catalog)
column
Column currently processing
type
Column data type. The length, scale, and nullable indicators are obtained from the iicolumns catalog.
date
Time and date when statistics were created
unique values
Number of unique values found in the table
repetition factor
Average number of rows per unique value. The repetition factor times the number of unique values yields the row count.
unique flag
“Y” if unique or nearly unique, “N” if not unique
complete flag
All possible values for the column exist in this table. When this column is used in a join predicate with some other column, it tells the query optimizer that every value in the other column must be a value of this column as well. This knowledge enables the query optimizer to build more accurate query plans for the join.
domain
Not used
histogram cells
Number of histogram cells used (0 to 15,000 maximum)
null count
Proportion of column values that are NULL, expressed as a real number between 0.0 and 1.0
value length
Length of cell values
cell
For each cell, a cell number, count (proportion of rows whose values fall into this cell: between 0.0 and 1.0), average number of duplicates per unique value in the cell, and the upper bound value for the cell
unique chars
Number of unique characters per character position. Shown only for character columns.
char set densities
Relative density of the character set for each character position. Shown only for character columns.
The number of unique values in the column is calculated. The count listed for each cell is the fraction of all the values falling between the lower and upper boundaries of the cell. Statistics for the sex column show that there are no rows with values less than or equal to ‘E,’ 0.06510% of rows with values equal to ‘F,’ no rows with values in the ‘G’ to ‘L’ range, and 99.93% of the rows with values equal to ‘M.’ The cell count includes those rows whose column values are greater than the lower cell bound but less than or equal to the upper cell bound. All cell counts must add up to 1.0, representing 100% of the table rows.
Looking at the cells for the name column, you see that 6.25% of the employees’ names fall between the lower bound in cell 0, “Abbot \037”, and the upper bound in cell 1, “Abbot”:
cell: 0 count:0.0000000 repf:0.0000000 value:Abbot \037
cell: 1 count:0.0625000 repf:96.000000 value:Abbot
A restriction such as the following brings back about 6.25% of the rows in the table:
where emp.name = 'Abbot'
The character cell value \037 at the end of the string is octal for the ASCII character that is one less than the blank. Therefore, cell 0 in the name example represents the value immediately preceding ‘Abbot’ in cell 1. This indicates that the count for cell 1 includes all rows whose name column is exactly ‘Abbot.’
In addition to the count and value, each cell of a histogram also contains a repetition factor (labeled “repf” in the statistics output). This is the average number of rows per unique value for each cell, or the “per-cell” repetition factor. The query optimizer uses these values to compute more accurate estimates of the number of rows that result from a join. This is distinct from the repetition factor for the whole column displayed in the header portion of the statistics output.
Histogram Cells
A histogram can have up to 15,000 cells. The first cell of a histogram is always an “empty” cell, with count = 0.0. It serves as the lower boundary for all values in the histogram; thus, all values in the column are greater than the value in the first cell. This first cell is usually not included when discussing the number of cells, but it is included when statistics are displayed.
A histogram in which there is a separate cell for each distinct column value is known as an “exact” histogram. If there are more distinct values in the column than cells in the histogram, some sets of contiguous values must be merged into a single cell. Histograms in which some cells represent multiple column values are known as “inexact” histograms.
You can control the number of cells used for inexact histograms. If there are more distinct values than the maximum number of cells you requested for an exact histogram, you can set the number of inexact cells to that same maximum, or to another number that seems appropriate. If your data is unevenly distributed, the distribution may not be apparent when merged into an inexact histogram with the default of 100 cells; increasing the number of cells can help.
To control the maximum histogram cells, use the Max Cells “Exact” Histogram option in Director or VDBA. The maximum value accepted is 14,999. You can control the number of cells that your data is merged into if you go beyond the maximum number of unique values using the Max Cells “Inexact” Histogram option. By default, the number of cells used when merging into an inexact histogram is 100, and the maximum value is 14,999.
With the CREATE STATISTICS statement, the maximum number of histogram cells can be specified using the WITH MAXCELLS= option.
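For example, a 200-cell histogram on the name column of the emp table might be requested as follows (the (name) column list is assumed here for illustration; omitting it covers all columns):
CREATE STATISTICS FOR emp (name) WITH MAXCELLS = 200;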
For example, set the maximum number of unique histogram cells to 200, and if there are more than 200 unique values, merge the histogram into 200 cells. To do this, set both the Max Cells “Exact” Histogram and the Max Cells “Inexact” Histogram options to 200.
Set the maximum number of unique histogram cells to 100, and if there are more than 100 unique values, merge the histogram into 50 cells. To do this, set Max Cells “Exact” Histogram to 100 and Max Cells “Inexact” Histogram to 50.
When using these options, remember that the goal is to accurately reflect the distribution of your data so that there can be an accurate estimate of the resultant number of rows from queries that restrict on these columns. The query optimizer uses linear interpolation techniques to compute row estimates from an inexact histogram and the more cells it has to work with, the more accurate are the resulting estimates. The cost of building a histogram is not dependent on the number of cells it contains and is not a factor when determining how many cells to request.
Statistics and Global Temporary Tables
Because global temporary tables exist only for the duration of a session, Optimize Database cannot be used to gather statistical information about them. Without histograms, the query optimizer has no knowledge of the value distributions of the columns in a global temporary table. Vector maintains a reasonably accurate row count for global temporary tables, however, and the query optimizer can use this row count when compiling a query that accesses a global temporary table.
The row counts alone are usually enough to permit the compilation of efficient query plans for queries that reference global temporary tables, because such tables often contain relatively small data volumes. The lack of histograms on global temporary tables, however, can cause poor estimates of the number of rows that result from the application of restriction or join predicates. These poor estimates can in turn cause the generation of inefficient query plans. Inefficient query plans typically occur with large global temporary tables or with tables whose columns have skewed value distributions, which are not handled well by the default estimation algorithms of the query optimizer.
For Vector global temporary tables, the CREATE STATISTICS statement can be used to generate histograms on any or all columns when run from the same session that uses the temporary table. The statistics are kept in memory for the duration of the existence of the temporary table and are used for the optimization of any query involving the table.
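For example, a session might declare a temporary table, load it, and then build in-memory histograms on it directly. This is a sketch: the table definition is illustrative only:
DECLARE GLOBAL TEMPORARY TABLE session.emp_tmp
    (empno INTEGER, salary FLOAT)
    ON COMMIT PRESERVE ROWS WITH NORECOVERY;
INSERT INTO session.emp_tmp SELECT empno, salary FROM emp;
CREATE STATISTICS FOR session.emp_tmp;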
When to Rerun Optimization
Optimization does not necessarily need to be run whenever data is changed or added to the database. Optimization collects statistics that represent percentages of data in ranges and repetition factors. For instance, the statistics collected on employee gender show that 49% of the employees are female and 51% are male. Unless this percentage shifts dramatically, there is no need to rerun optimization on this column, even if the total number of employees changes.
You must rerun optimization if there are modifications to the database that alter the following:
• Repetition factor
• Percentage of rows returned from a range qualification (that is, your histogram information is incorrect)
For example, if you ran complete statistics on the empno column early in your company’s history, your repetition factor remains correct, because all employees still have unique employee numbers. If you used ranges of employee numbers in any way, however, your histogram information becomes less accurate as new employees are added.
If your company originally had 100 employees, 10% of the employees had employee numbers greater than 90. If the company then hired an additional 100 employees, 55% of the employees would have employee numbers greater than 90, but the original histogram information does not reflect this.
Columns that show this type of “receding end” growth and are used in range queries may periodically need to have optimization rerun on them (exact match on employee number is not affected, because the information that says all employee numbers are unique is still correct).
Even if the statistics are not up-to-date, the query results are still correct.
Note: For Vector tables, the automatic histogram generation feature is a viable alternative to frequent re-optimization of such columns. If no statistics are created on such columns, the query optimizer automatically generates a histogram from current live data when the first use of the column in a WHERE or GROUP BY clause is detected in a server cycle. This greatly simplifies the management of statistics for such columns.
Example: Before and After Optimization
If statistics are available on a column referenced in a WHERE clause, the query optimizer uses the information to choose the most efficient QEP. Understanding how this information is used can be helpful in analyzing query performance. For more information, see
Query Execution Plans.
Two QEPs showing the effect of optimization are presented here. The first is a QEP before optimizing; the second shows the same query after optimization. The query used is a join, where both the r and s tables use the B-tree storage structure:
select * from r, s
where s.a > 4000 and r.a = s.a;
QEP Before Optimization
Before obtaining statistics, the optimizer chooses a full sort-merge (FSM) join, because it assumes that 10% of each table satisfies the qualification “a > 4000,” as shown in the QEP diagram below:
QUERY PLAN 4,1, no timeout, of main query
FSM Join(a)
Heap
Pages 1 Tups 267
D15 C44
/ \
Proj-rest Proj-rest
Sorted(a) Sorted(a)
Pages 5 Tups 1235 Pages 1 Tups 267
D11 C12 D4 C3
/ /
r s
B-Tree(a) B-Tree (a)
Pages 172 Tups 12352 Pages 37 Tups 2666
QEP After Optimization
After obtaining statistics, the optimizer chooses a Key join, because only one row satisfies the qualification “a > 4000,” as shown in the QEP diagram below:
QUERY PLAN 4,1, no timeout, of main query
K Join(a)
Heap
Pages 1 Tups 1
D4 C1
/ \
Proj-rest r
Sorted(a) B-Tree(a)
Pages 1 Tups 1 Pages 172 Tups 12352
D2 C1
/
s
B-Tree(a)
Pages 37 Tups 2666
The cost of the key join is significantly less than the cost of the FSM join because the join involves far fewer rows.