Monday, November 23, 2009

UNION ALL Performance and Behavior

This time I am going to talk about UNION ALL.
I am sure numerous posts were already dedicated to the UNION vs. UNION ALL issue, but just to make sure everyone knows the difference, UNION ALL returns all rows while UNION returns only distinct rows (which will require DB2 to sort the result set and remove duplicate rows before it is returned).

What I wanted to do is to check how DB2 process the UNION ALL clause, so I have created two simple tables with a single column named ID, each table has 10 rows with values between 1 and 10.

I started my tests with this simple query:

Select 'table 1'
From table1
Union ALL
Select 'table 2'
From table2

As expected, the results were composed of 20 rows, 10 rows for each table, when the 'table 1' rows appeared first.

Can you guess what the following query returns?

Select 'table 1'
From table1
Union ALL
Select 'table 2'
From table2
Fetch first 10 rows only

The result set contains only the 10 records from the first table. To get things a bit more interesting I stopped the tablespace that contains the second table (using the stop database dsn command). Running the above query again returned the same results, that is, DB2 did not try to access the second table at all.
Obviously, running the same query with UNION instead of UNION ALL returned SQLCODE -904 with reason code 00C90081, which means the tablespace is in stopped mode.

And now for the really nice part; I left the second tablespace in stop mode and executed the following query:

Select 'table 2'
From table2
Where :hv=0
Union ALL
Select 'table 1'
From table1

When :hv contains 0, then SQLCODE -904 is received, but, when :hv contains 1, we get the rows from table1 in the result set, which means, DB2 is smart enough to evaluate :hv=0 before it actually access the table (I have tried this for both integer and character data types).

Now for the unfortunate results, the query

Select 'table 2'
From table2
Where :hv>0
Union ALL
Select 'table 1'
From table1

Always tries to access both tables (and thus returns SQLCODE -904 when the first tablespace is stopped). I don't know if this behavior is intentional or not, since this query

Select 'table 2'
From table2
Where 0>0
Union ALL
Select 'table 1'
From table1

Also access both tables in order to evaluate 0>0, even though DB2 is able to tell this predicate always returns false when it parses the SQL statement.

Another interesting issue I came across which is partially relevant; when an index is defined on col1, col2 of a table with 2 columns, this query

Select *
From table
Where col1 in (1)
Order by col2

Will use the index and will not need a sort for the order by clause, but this query

Select *
From table
Where col1 in (1,1)
Order by col2

Will have a single column matching index scan, but will not be able to avoid the sort for the ORDER BY. Rewriting the above as an equal predicate sounds like a stupid thing to ask for, but since there are many tools that generate SQL on the fly, maybe it would be better to incorporate such logic into the SQL parser itself.

So what have we learned today?
1. By using FETCH FIRST X ROWS we can make sure DB2 executes the second subquery in the UNION ALL query only if it has to.
2. In order to selectively bypass some of the subqueries In a UNION ALL query, we can add predicates of the form :hv = 1 and set the variable value accordingly. This can be viewed like accessing only relevant partitions of a partitioned table by adding predicates on the partitioning key.

Till next time
Momi

No comments:

Post a Comment