Tuesday, January 12, 2010

Identity Column Insert Performance

Hi

It has been a while since my last post, but since it was the holiday season, I assumed that nobody was reading blog posts anyway... :-)

This time I want to write about Identity Column insert performance, or to be more precise, the effect of the CACHE parameter of identity columns on insert performance.

The identity column provided by DB2 is a special type of column. You can create only a single identity column in each table, and you can ask DB2 to generate the values for this column. You can specify several parameters regarding value generation; the parameter I want to focus on is CACHE.

The last value that was assigned to an identity column is kept in SYSIBM.SYSSEQUENCES in the column MAXASSIGNEDVAL.
The CACHE parameter tells DB2 if it is allowed to pre-allocate values and store them in memory. This will improve performance since DB2 will need to access the catalog less frequently. If you allow caching of n values, then the MAXASSIGNEDVAL column will be updated once in every n statements that consumes the identity value.

If you are running a Data Sharing system, and you must guarantee that the identity values are generated in the order that they are requested, you should turn off caching.

In order to check the effect of the caching parameter I have created a table that contains a single integer column which is an identity column with the generated always attribute defined. The tablespace was created with locksize any in order to avoid lock contentions. The table had no indexes created on it.

CREATE TABLE MOMI.MOMITEST
(ID INT GENERATED ALWAYS AS IDENTITY (START WITH 1 CACHE X)
)

I wrote a REXX script that performs the following statement in a loop:

INSERT INTO MOMI.MOMITEST (ID) VALUES(DEFAULT)

I executed this REXX script in 5 concurrent jobs in order to create a load on the table. Each execution inserted 250k rows with a commit every 500 rows.

Each time I picked a different cache size and tested the average CPU and elapsed time for each INSERT statement.

In order to get the statistics I used the 318 ifcid trace and got the data out of the dynamic statement cache table, using EXPLAIN STMTCACHE ALL.
To make those statistics available to you in the dynamic statement cache table, you should issue the following command
-start trace(mon) ifcid(318)

The statement that I used in order to retrieve the data from the table was:
SELECT STAT_EXEC,
               STAT_GPAG,
               STAT_SYNR,
               STAT_WRIT,
               DEC( STAT_ELAP /STAT_EXEC , 20, 10) AVG_ELPS ,
               DEC( STAT_CPU /STAT_EXEC , 20, 10) AVG_CPU ,
               DEC( (STAT_SUS_SYNIO +
               STAT_SUS_LOCK +
               STAT_SUS_SWIT +
               STAT_SUS_OTHR +
               STAT_SUS_OTHW )/STAT_EXEC , 20, 10) AVG_WAIT
FROM DSN_STATEMENT_CACHE_TABLE
WHERE STMT_TEXT LIKE 'INSERT%'
;

The following table shows the results I got: (all times are in seconds)

Task 1
Task 2
Task 3
Task 4
Task 5
Average
Theoretical Ins/S
Actual Ins / S
No Cache
CPU
0.0000663071
0.0000654882
0.0000656633
0.0000658551
0.0000660197
0.00006586668
Elapsed
0.0077742199
0.0075929553
0.0078071447
0.0077350354
0.0077541955
0.00773271016
647
312
Cache 2
CPU
0.000042866
0.0000428516
0.00004271
0.0000429092
0.000043037
0.00004287476
Elapsed
0.0035098233
0.0035261292
0.0035296884
0.0035264798
0.0035261121
0.00352364656
1419
827
Cache 20
CPU
0.0000189029
0.0000186776
0.0000189927
0.0000188641
0.0000187513
0.00001883772
Elapsed
0.0004911035
0.0004984513
0.0004985346
0.0005029077
0.0005012357
0.00049844656
10031
7000
Cache 200
CPU
0.000017257
0.0000171668
0.0000172071
0.00001717
0.000017321
0.00001722438
Elapsed
0.0001121486
0.0001027187
0.0001122547
0.0001085953
0.0001055655
0.00010825656
46187
15500
Cache 2000
CPU
0.0000178619
0.0000177593
0.0000179487
0.0000178366
0.000017819
0.0000178451
Elapsed
0.0000419019
0.000039692
0.0000413995
0.0000436566
0.0000363546
0.00004060092
123150
20500
Cache 20000
CPU
0.0000177429
0.0000177092
0.0000178616
0.0000176705
0.0000180373
0.0000178043
Elapsed
0.0000380529
0.0000352191
0.0000346786
0.0000313353
0.0000339644
0.00003465006
144300
22000
Cache 250000
CPU
0.0000179855
0.0000178565
0.0000180507
0.0000178743
0.0000182472
0.00001800284
Elapsed
0.0000373096
0.0000436134
0.0000390873
0.0000395342
0.0000415881
0.00004022652
124296
22500

I also plotted the above data into graphs in order to emphasize the improvements.
The first graph shows how the insert rate increases as we increase the number of cached values


The second graph shows how the average time required to perform a single insert operation decreases as the cache value increases




From the above results we can see that the CACHE parameter has a crucial effect on insert performance against tables with identity columns.
If for example, we look on task number 1, performing those 250k inserts took more than 200 times longer than doing the same job against a column with cache 250000 (which means DB2 catalog accessed is minimized).
I would definitely recommend everybody to use the CACHE parameter. According to my results, for CACHE = 2000 we get very good results and increasing the value beyond that does not give much more benefit, but maybe it will be different on your system. Keep in mind that once you turn on caching, if the DB2 system crashes, those cached values are lost, so it might not suite your business needs (also when using caching, one can not rely on the order in which the values are generated).

Hope you found this useful.
Till next time...

1 comments:

  1. Thanks. It's very important information for me.

    ReplyDelete