Using Histograms to Help Oracle Cost-Based Optimizer Make Better Decisions

Introduction

Histograms are a feature of the cost-based optimizer (CBO) that allows the Oracle engine to determine how data is distributed within a column. They are most useful for a column that is included in the WHERE clause of SQL and the data distribution is skewed.

Example

Assume a table named PROCESS_QUEUE with one million rows including a column named PROCESSED_FLAG with five distinct values. Also assume a query similar to the following is executed:

SELECT id, serial_number

FROM process_queue
WHERE processed_flag = ‘N’;

SELECT STATEMENT Optimizer=ALL_ROWS (Cost=1087 Card=260363 Bytes=7029801)
TABLE ACCESS (FULL) OF ‘PROCESS_QUEUE’ (TABLE) (Cost=1087 Card=260363 Bytes=7029801)

Without histograms and only five distinct values, Oracle assumes an even data distribution and would most likely perform a full table scan for this query. With one million rows and five values, Oracle assumes that each value would return 200,000 rows, or 20% of the rows.

Data Skew

However, what if the data for the PROCESSED_FLAG column was skewed:

SELECT processed_flag, COUNT(1)
FROM process_queue
GROUP BY processed_flag;

PROCESSED_FLAG COUNT
——————————- ———-
P 24
Y 999345
E 30
S 568
N 33

In this case, ony 33 rows have a value of ‘N’, so there has to be a way to tell Oracle to use the index on the PROCESSED_FLAG column. That is where histograms come into use. A histogram would include data similar to above and allow Oracle to know that only 33 rows would be returned for this query.

Collecting Histograms

To collect histograms for this column, a command similar to the following could be used:

EXECUTE DBMS_STATS.GATHER_TABLE_STATS(user, ‘PROCESS_QUEUE’, method_opt => ‘for all indexed columns size 254’)

Note 1: Using histograms works best for SQL statements that use literal values. If a statement uses a bind variable, the first time the query is parsed, Oracle will peek at the value of the bind variable and choose a plan accordingly. That same plan will be used until the SQL is re-parsed. In this case, if the bind variable was ‘Y’ the first time, Oracle may perform a full table scan for this query no matter what value was passed in from then on.

The opposite may also be true. Assume a similar data distribution to above but with 100 distinct values for the PROCESSED_FLAG column. The rows that have a ‘Y’ value are still be 95% of the rows. However, if you used the criteria “WHERE processed_flag=’Y'”, without histograms Oracle may decide to use the index when a full table scan may be a better option.

Note 2: The defaults for the METHOD_OPT parameter changed between Oracle 9i and 10g. In 9i the parameter defaulted to ‘for all columns size 1’ which essentially turns off histograms. The default value in Oracle 10g is ‘for all columns size auto’ which means that Oracle will decide whether or not to collect histograms for a column. In my experience it seems that unnecessary histograms are collected and histogram data is not collected for some columns where it would be useful.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: