Monday, January 4, 2010

Space Reuse Affected By Commit?

Hi

In this post I would like to present to you a strange scenario involving DB2 space reuse which I was not able to reconstruct in a test system.

One day a programmer told me he added a tiny change to one of his batch processes which caused the CPU consumption to increase from 1 minute (1 hour elapsed time) to 22 minutes. The batch process receives a file with client records (usually around 30 records per client but some clients have up to 300 records). Before the change was applied, the client records were read from file upon need. After the change, the records were inserted into a DB2 table, and processing was done using that table.

The logical structure of the program was:
• Insert client records into table
• Do some processing (during which the records are read from the table)
• Delete the records from the table

This process was executed for each client, with a total of about 100K clients.
I started an Omegamon Application Trace when the batch processes executed and I noticed that each fetch from this table took about 0.00500 seconds. Since a fetch in this DB2 system usually takes 0.00001 seconds I figured that this is probably the problem.

The mentioned table resided in a tablespace and was defined in the following way (it was the only table in that tablespace):

CREATE TABLESPACE XXXXX
IN XXX
USING
STOGROUP XXX
PRIQTY 48
SECQTY 48
LOCKSIZE ANY
SEGSIZE 32
COMPRESS NO
CLOSE NO ;

CREATE TABLE XXXX.XXXX (
COL01 DECIMAL(11) NOT NULL,
COL02 DECIMAL(3) NOT NULL,
COL03 DECIMAL(3) NOT NULL,
COL04 DECIMAL(3) NOT NULL,
COL05 DECIMAL(7) NOT NULL,
COL06 DECIMAL(3) NOT NULL,
COL07 DECIMAL(3) NOT NULL,
COL08 DECIMAL(3) NOT NULL,
COL09 DECIMAL(5) NOT NULL,
COL10 DECIMAL(7) NOT NULL,
COL11 DECIMAL(3) NOT NULL,
COL12 DECIMAL(7) NOT NULL,
COL13 DECIMAL(3) NOT NULL,
COL14 DECIMAL(3) NOT NULL,
COL15 DECIMAL(15,2) NOT NULL,
COL16 DECIMAL(5,2) NOT NULL,
COL17 TIMESTAMP NOT NULL,
COL18 DECIMAL(13) NOT NULL(
IN XXX.XXX ;

CREATE TYPE 2 UNIQUE INDEX XXXX.XXXX
ON XXX.XXX
( COL01,
COL02,
COL03,
COL04,
COL05,
COL06,
COL07,
COL08,
COL09,
COL10,
COL11,
COL12,
COL13,
COL18 )
USING STOGROUP XXX
PRIQTY 48
SECQTY 48
CLUSTER
CLOSE NO ;


First thing I did is to look which SQL Statements were executed against this table. There were a total of 3 statements (each executed many time):
1) A cursor that joined this table to another tables. This table was the outer table in a nested loop join (it was accessed first) and access to the table was done using the index with 3 matching columns. The fetch from this cursor took most of the time.
2) A simple insert to that table
3) A searched delete from the table with a where clause on COL02,COL03 which was done using a none matching index scan.

Since the fetch was done using an index, and that table only contained a small number of records, I guessed that there was a problem with the data organization in the table (even though the record length of rows in the table is constant, thus this could not be an issue of overflow records). I executed a REORG against the tablespace and asked the programmer to execute the process again.

Now each fetch took on average 0.00001 seconds. Beautiful.
The problem was that as the process progressed, the average fetch time got worse and worse climbing to 0.00005 seconds, 0.00015 seconds and eventually reached about 0.00250 seconds. Looking at the SQL counts of the batch process I noticed no commits are being taken. I asked the programmer to add a commit once every 100 clients, and executed REORG again.
This solved the problem. The average fetch time was about 0.00005 seconds throughout the entire execution.
In order to verify that this was actually the reason for the improvement I asked the programmer to remove the commit statement, and when he did, the problem occurred again.
Another possible solution might be using a temporary table instead of the above table, but the programmer did not want to test this solution.

My conclusion is that somehow commit influences space reuse in segmented tablespaces. The first though that came to my mind was that in a massive delete scenario, DB2 might not reuse the space in order to enable a fast rollback, but since this program performed a searched delete, this theory is irrelevant.

I tried to duplicate this problem on a table that I have created in a test system with similar definitions but I could not get it to happen again.
I wrote a REXX script that had a loop which performed:
1) Insert records into a table (I tested 30 records and 50 records)
2) Perform select operations from the table with similar access path, that is, using a matching index scan (I tested 30 and 50 select statements)
3) Perform a searched delete from the table

The loop was executed between 100 and 5000 times without a commit, but I couldn’t get the problem to reoccur. My fetch time remained rather constant throughout the entire test.
I find it hard to believe that I could not get this problem to reoccur because of differences between dynamic and static SQL (the original program is coded using COBOL). If anyone thinks otherwise, please post a comment.


Did anyone else ever encounter such a problem?

Till next time
Momi

0 comments:

Post a Comment