DB2 - Problem description
Problem IT24028 | Status: Closed
DB2 LOG REPLAY (RECOVERY, RFWD, HADR) MIGHT HANG WITH DB2REDOM IN SQLPRGETFREEQE->WAIT AND DB2REDOWS IN SQLPRFINDQUEUE->WAIT.
Product:
DB2 FOR LUW / DB2FORLUW / B10 - DB2
Problem description:
Under rare conditions, typically when a long sequence (thousands) of single-record transactions without a commit has to be replayed, Db2 log replay might hang with all EDUs ending up in a wait state.

Log replay scenarios are:
- crash recovery
- rollforward
- HADR replication

In the case of crash recovery, "db2pd -recovery" and "list utilities" will indicate an ongoing recovery, but "completed work" will not move forward.

Stacks from the EDUs involved in recovery will show the recovery master (db2redom) in:

    sqloWaitInterrupt
    sqloWaitEDUWaitPost
    sqlprGetFreeQE
    sqlpPRecReadLog
    sqlpParallelRecovery

and all recovery workers (db2redow) in:

    sqloWaitInterrupt
    sqloWaitEDUWaitPost
    sqlprFindQueue
    sqlpPRecProcLog
    sqlpParallelRecovery
    sqleSubCoordProcessRequest

The same EDUs are involved in the remaining scenarios (rollforward and HADR).

The condition leading to the hang is very likely to cause the recovery master to grow the transaction table, which triggers a message from db2redom in db2diag.log similar to this one:

    2018-02-01-12.00.00.850000+060 I179497F539          LEVEL: Info
    PID     : 5092                 TID : 4488           PROC : db2syscs
    INSTANCE: DB2                  NODE : 000           DB   : SAMPLE
    APPHDL  : 0-7                  APPID: *LOCAL.DB2.180201115810
    AUTHID  : db2inst1             HOSTNAME: db2host
    EDUID   : 4488                 EDUNAME: db2redom (SAMPLE) 0
    FUNCTION: DB2 UDB, data protection services, sqlptintMore, probe:701
    DATA #1 : <preformatted>
    Current usable transaction entries are 14463 on log stream 0.
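To check whether a db2diag.log contains the transaction-table growth message described above, a scan along the following lines may help. This is only a sketch: the function name find_txn_table_growth is hypothetical, and it assumes the records have the layout shown in the sample entry (a timestamp line first, then the EDUNAME and FUNCTION lines, then the message). Finding the message does not prove a hang; it only shows that the condition associated with it occurred.

```shell
#!/bin/sh
# Hypothetical helper (not part of Db2): print "<timestamp> <entries>" for every
# db2redom record from sqlptintMore, probe:701 found in a db2diag.log file.
# Usage: find_txn_table_growth /path/to/db2diag.log
find_txn_table_growth() {
    awk '
        # A new record starts with a timestamp line; reset the per-record state.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-/ { ts = $1; redom = 0; probe = 0 }
        # Remember whether this record was written by the recovery master.
        /EDUNAME: db2redom/       { redom = 1 }
        # Remember whether this record comes from the function and probe of interest.
        /sqlptintMore, probe:701/ { probe = 1 }
        # Report the entry count only when both conditions hold for this record.
        redom && probe && /Current usable transaction entries are/ {
            for (i = 1; i < NF; i++) if ($i == "are") print ts, $(i + 1)
        }
    ' "$1"
}
```

Several such lines with growing entry counts during a replay whose "completed work" does not advance would be consistent with the scenario in this APAR.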
Problem Summary:
****************************************************************
* USERS AFFECTED:                                              *
* ALL                                                          *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* The complete fix for this problem first appears in DB2       *
* Version 11.1.3.3 iFix001 and all the subsequent Fix Packs.   *
****************************************************************
Local Fix:
The problem is related to the internal logic of work parallelization during recovery, which depends on the number of recovery worker EDUs (db2redow). By default, that number is calculated from the number of CPUs. If a hang like this occurs, you can try to force Db2 to use a higher number of recovery workers through DB2BPVARS:

    $ echo "PREC_NUM_AGENTS=64" > db2bpvars.cfg
    $ db2set DB2BPVARS=$(pwd)/db2bpvars.cfg

and see whether that allows recovery to complete. The setting requires an instance restart to take effect and should be cleared once the problem is fixed.
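The local fix can be wrapped in a small helper so that the configuration file ends up in a known location and the command to clear the setting is not forgotten. The sketch below is an illustration, not an IBM-supplied procedure: the function name prepare_prec_workaround and its arguments are hypothetical, and it only writes the file and prints the db2set commands for review instead of running them. The value 64 in the usage line is the example value from the local fix, not a tuned recommendation.

```shell
#!/bin/sh
# Hypothetical helper (not part of Db2): write a DB2BPVARS config file that sets
# PREC_NUM_AGENTS, then print (not run) the commands to apply and clear it.
# Usage: prepare_prec_workaround <num_agents> <config_dir>
prepare_prec_workaround() {
    agents=$1
    dir=$2
    # Reject empty or non-numeric values, and a plain 0.
    case $agents in
        ''|*[!0-9]*|0)
            echo "num_agents must be a positive integer" >&2
            return 1
            ;;
    esac
    cfg="$dir/db2bpvars.cfg"
    # Same file content as in the local fix, with a caller-chosen worker count.
    echo "PREC_NUM_AGENTS=$agents" > "$cfg" || return 1
    echo "# Apply, then restart the instance for the setting to take effect:"
    echo "db2set DB2BPVARS=$cfg"
    echo "# Clear once the problem is fixed, then restart the instance again:"
    echo "db2set DB2BPVARS="
}

# Example: prepare_prec_workaround 64 "$HOME/sqllib"   (directory is an assumption)
```

Printing the commands rather than executing them keeps the decision about when to restart the instance with the administrator, which matters when a recovery is already in progress.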
Available fix packs:
Db2 Version 11.1 Mod3 Fix Pack3 iFix001 for Linux, UNIX, and Windows
Solution
The complete fix for this problem first appears in DB2 Version 11.1.3.3 iFix001 and all the subsequent Fix Packs.
Workaround
Not known / see Local Fix
Timestamps
Date - problem reported : 12.02.2018
Date - problem closed   : 22.05.2018
Date - last modified    : 22.05.2018
Problem solved at the following versions (IBM BugInfos)
Problem solved according to the fixlist(s) of the following version(s)