As I have seen in sites where I worked, there is some confusion regarding the parameters of the REORG utility. Everybody would like to have true 24x7 systems using online REORG SHRLEVEL CHANGE, but in order to be on the safe side, they used SHRLEVEL REFERENCE instead.
When DB2 performs a REORG with SHRLEVEL CHANGE, it builds the shadow files the same way as it does with REORG SHRLEVEL REFERENCE, but, it has an additional steps in which it needs to apply the log records of the updates that took place during the REORG itself.
Obviously, at some point, DB2 has to restrict update access to the table, in order to make a final synchronization (applying the updates to the shadow files). Additionally, at some point, all access to the tablespace must be blocked in order to switch the shadow files and the actual files (the SWITCH phase).
The phase in which DB2 applies the accumulated log records to the shadow files is called the LOG APPLY phase. This phase is done in several iterations. In each iteration, DB2 applies some log records, while allowing full access to the tablespace.
At some point DB2 decides to go into the last log iteration. At this point the tablespace will be available for read only access and all the remaining log records will be applied.
In order to decide which iteration will be the last, DB2 uses the MAXRO parameter. This parameter tells DB2 how many seconds the tablespace can be in read only mode. DB2 will apply log records iteratively until it estimates the amount of log records to apply is small enough, to be applied in less than MAXRO seconds. At this point DB2 will DRAIN all write activity and go into the last log iteration phase.
This gets us to the first problem. The default value of MAXRO is 300 seconds, while the default timeout interval (IRLMRWT) is 60 seconds which means, once DB2 will get to the last log iteration, most likely some concurrent transactions will time out.
After the last iteration, DB2 must block all access to the tablespace in order to perform the switch phase. In order to set this restrictive state DB2 must DRAIN concurrent read activity (write activity was already drained before the last log apply iteration) to the tablespace.
This gets us to the second problem. Let's assume we have a long running unit of work that only reads data, running concurrently with short user transactions that update data. When the REORG utility decides to perform the last log iteration, it will drain all write activity (short transactions). Since the user transactions are short, it will be able to drain the tablespace and change it to read only access for the last iteration. At this point user transactions that try to update the tablespace are in lock wait status.
Once the last iteration is finished, DB2 will try to drain all readers and perform the SWITCH phase. Since we have a long reader, the REORG utility will not be able to drain the read activity. The number of seconds the REORG utility will wait for the DRAIN is controlled by the DRAIN_WAIT parameter. The default value of this parameter is IRLMRWT*UTIMOUT (zparms) which is 300 seconds by default (60 seconds lock wait time, 6 for utility timeout). So, during the time REORG waits for the drain, more user transactions will probably timeout.
After DRAIN_WAIT seconds the drain will fail and the REORG utility will terminate, leaving the tablespace in read only mode (leading to many resource unavailable conditions for the short user transactions). At this point we can either terminate the utility in order to open the tablespace for write activity, (but this will mean all the processing we done is lost, and all the user transactions timed out needlessly), or we can kill the long running unit of recovery and restart the utility which will finish rapidly. In both situations, we had many user transactions that timed out because of our REORG. This of course is unacceptable.
In the second part of this post, I will tell you how I used the parameters of the REORG utility to get pass those 2 problems.
To Be Continued…