Saturday, March 10, 2012

DB2 Case Statement Performance

All,

Recently someone asked me for help with a DB2 query that contained some really nasty case statements, which got me curious about case statement optimization.
I conducted several tests to determine whether case statements can be optimized in any way, and I decided to share my results.
All tests were conducted using DB2 Express 9.7.5 on my humble Lenovo T420 laptop.
Every query was executed 10 times (so the results are totals for the 10 executions).
I have also verified that all the data was in the buffer pools and no physical reads were done.
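One way to confirm that (a general check, not necessarily the exact one used for these tests) is to watch the buffer pool physical read counters before and after a run:

-- POOL_DATA_P_READS should not grow between test runs if everything is already cached
select bp_name, pool_data_l_reads, pool_data_p_reads
from table(mon_get_bufferpool(null, -2)) as t;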

To get started, I created a new database, and the following table:
create table test(col1 int);

Then I initialized that table with numbers using this type of query:
insert into test
  select row_number() over(order by t1.name) rown
  from sysibm.syscolumns t1, sysibm.syscolumns t2
  fetch first X rows only;

I started with 100K rows and went up from there.
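For example, the 100K-row run simply replaces X with 100000:

insert into test
  select row_number() over(order by t1.name) rown
  from sysibm.syscolumns t1, sysibm.syscolumns t2
  fetch first 100000 rows only;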
In the first test I checked how the number of case branches that have to be evaluated affects the CPU time, so I executed the following two queries. In the first one, only the first case branch evaluates to true; in the second one, only the last case branch evaluates to true.


  select sum(val)
  from (
  select case when col1 < 2000000  then 1
             when col1 between 2000000 and 3000000 then 2
             when col1 between 3000000 and 4000000 then 3
             when col1 between 4000000 and 5000000 then 4
             when col1 between 5000000 and 6000000 then 5
             when col1 between 6000000 and 7000000 then 6
             when col1 between 7000000 and 8000000 then 7
             when col1 between 8000000 and 9000000 then 8
             when col1 between 9000000 and 10000000 then 9
             when col1 > 10000000 then 10 end val            
  from test
  )t;


  select sum(val)
  from (
  select case
             when col1 between 2000000 and 3000000 then 2
             when col1 between 3000000 and 4000000 then 3
             when col1 between 4000000 and 5000000 then 4
             when col1 between 5000000 and 6000000 then 5
             when col1 between 6000000 and 7000000 then 6
             when col1 between 7000000 and 8000000 then 7
             when col1 between 8000000 and 9000000 then 8
             when col1 between 9000000 and 10000000 then 9
             when col1 > 10000000 then 10
             when col1 < 2000000 then 1
             end val            
  from test
  )t;


Then I used this query to get the metrics:

SELECT stmt_text, NUM_EXEC_WITH_METRICS, prep_time, total_cpu_time,
       POOL_DATA_L_READS, POOL_TEMP_DATA_L_READS,
       POOL_DATA_P_READS, POOL_TEMP_DATA_P_READS,
       POOL_DATA_WRITES, STMT_EXEC_TIME
FROM TABLE(MON_GET_PKG_CACHE_STMT('D', NULL, NULL, -2)) as T
WHERE stmt_text like '%1000000%'
ORDER BY INSERT_TIMESTAMP;
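Since MON_GET_PKG_CACHE_STMT reports cumulative metrics for each cached statement, a useful precaution between runs (not shown in the measurements above) is to empty the dynamic statement cache, so that every test starts its counters from zero:

flush package cache dynamic;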

For some reason, Data Studio strips away the comments from the statement, which makes it harder to find it in the cache.

The results were:
row count  matching branch  prep_time  total_cpu_time  pool_data_l_reads  stmt_exec_time
100k       1                44         78001           3770               93
100k       10               1          280801          3770               279
250k       1                1          202802          9400               224
250k       10               1          717604          9400               742
500k       1                1          452404          18790              478
500k       10               2          1419606         18790              1425

Obviously my test was a bit extreme, and usually we won't have all the rows match the first or last branch, but even if 80% of the rows match one or two options, we might be able to gain some savings by putting those branches first.
We can see that the worst case took roughly 3 to 3.5 times as much CPU time as the best case.

In the second test I wanted to compare using a case statement to joining to a lookup table.
Again I initialized the test table, but this time with a small number of distinct values distributed evenly:

insert into test
  select mod(rown, 10) + 1 from (
  select row_number() over(order by t1.name) rown
  from sysibm.syscolumns t1, sysibm.syscolumns t2 ) t
  fetch first 100000 rows only;
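A quick sanity check (not part of the original measurements) confirms the even distribution:

-- each of the 10 values should appear about 10,000 times
select col1, count(*) cnt
from test
group by col1
order by col1;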

I also created a lookup table that would be used to decode the values:
create table lookup (val_from int, val_to int);
insert into lookup 
select row_number() over(order by name) rown,1
from sysibm.syscolumns
fetch first 10 rows only;
create unique index lookup_ix on lookup(val_from, val_to);
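The lookup table is tiny, but to make sure the optimizer has current statistics for the join it does not hurt to collect them, for example via ADMIN_CMD (MYSCHEMA is a placeholder for whatever schema the table was created under):

call sysproc.admin_cmd('runstats on table myschema.lookup and indexes all');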


To do the comparison, I used the following two queries:

  select sum(val)
  from (
  select case when col1 = 1  then 1
             when col1 = 2  then 1
             when col1 = 3  then 1
             when col1 = 4  then 1
             when col1 = 5  then 1
             when col1 = 6  then 1
             when col1 = 7  then 1
             when col1 = 8  then 1
             when col1 = 9  then 1
             when col1 = 10  then 1 end val            
  from test
  )t;

and
  select sum(val)
  from (
  select val_to val
  from test t1 join lookup t2 on t1.col1 = t2.val_from) t;
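To confirm that the join really uses index-only access (the "ix only join" rows in the results below), the access plan can be inspected; one way, assuming the explain tables have already been created (for example from the EXPLAIN.DDL script shipped with DB2), is:

explain plan for
  select sum(val)
  from (
  select val_to val
  from test t1 join lookup t2 on t1.col1 = t2.val_from) t;

-- then format the most recent plan from the command line, for example:
--   db2exfmt -d <dbname> -1 -o plan.txt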


Here are the results I got:

row count  query           prep_time  total_cpu_time  pool_data_l_reads  stmt_exec_time
100k       10 branch case  1          187202          3770               185
100k       ix only join    35         202802          4490               180
500k       10 branch case  1          936005          18790              937
500k       ix only join    1          842407          22430              826
100k       20 branch case  2          265201          3770               268
100k       ix only join    21         187203          4490               184
500k       20 branch case  2          1388409         18790              1410
500k       ix only join    2          904807          22430              878
100k       50 branch case  24         670804          3770               646
100k       ix only join    46         156001          4490               182
500k       50 branch case  3          2854818         18790              3052
500k       ix only join    57         842405          22430              846

We can see that when dealing with a small number of rows and a relatively small case expression, we are better off with the case statement, but as the number of rows or the number of case branches grows, the join is the way to go.
We can also see that the join gives fairly consistent results:
for 100K rows, the total statement time was 180 with 10 values and 184 with 20 values (a factor of about 1.02);
for 500K rows, it went from 826 to 878 (a factor of about 1.06).

When using the case statement, on the other hand, the number of branches affects performance drastically.
For 100K rows and 10 branches the total time was 185, and for 20 branches it was 268, an increase of about 45%.
For 500K rows, moving from 10 branches to 20 increased the time by about 50%.

Besides that, using a lookup table is better in terms of code abstraction. Instead of having the list of values hard coded in your queries, they are stored in a table, which makes it easier to change those values in the future (you change a single table instead of possibly many queries).
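For example (with made-up values), remapping one code then takes a single statement against the lookup table instead of editing every query that embeds the case expression:

-- hypothetical change: value 7 should now decode to 2
update lookup set val_to = 2 where val_from = 7;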

I learned that the case statement, which is perhaps one of the most convenient tools we have in SQL, is not always the best option in terms of performance, and even if we do decide to use the case statement, we might get better performance by rearranging the branches.

Thanks
Momi

Update:
At the request of my dear friend, I added the results for a 50-way case / join. The elapsed time numbers might not perfectly align with the other tests, since I ran the previous tests while my machine was under a different load.
I have also added this chart, which displays all the results (CPU time in milliseconds).
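For reference, the 20- and 50-branch variants follow the same pattern as the 10-branch setup, only with a larger modulus and more lookup rows; roughly along these lines (a sketch, the exact statements are not shown above):

-- 50 distinct, evenly distributed values in TEST
delete from test;
insert into test
  select mod(rown, 50) + 1 from (
  select row_number() over(order by t1.name) rown
  from sysibm.syscolumns t1, sysibm.syscolumns t2 ) t
  fetch first 100000 rows only;

-- matching 50 rows in the lookup table
delete from lookup;
insert into lookup
  select row_number() over(order by name), 1
  from sysibm.syscolumns
  fetch first 50 rows only;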