DB2 - Problem description
Problem IT37302 | Status: Closed |
HADR TAKEOVER FAILED WITH SQL1770 RC 7 DUE TO EVENT MONITOR BLOCKS FORCING OFF ONLINE REORG ON PRIMARY | |
product: | |
DB2 FOR LUW / DB2FORLUW / B50 - DB2 | |
Problem description: | |
On an HADR standby database, a graceful TAKEOVER command might fail with SQL1770N reason code 7:

  $ db2 takeover hadr on db hadrdb
  SQL1770N Takeover HADR cannot complete. Reason code = "7".

This error is returned after the TAKEOVER command has been running for a significant time, typically 10 minutes. The following message can be found in db2diag.log on the standby:

  2020-07-17-01.59.58.216931-240 I466139E592 LEVEL: Error
  PID : 18218 TID : 140069126006528 PROC : db2sysc
  INSTANCE: db2inst1 NODE : 000 DB : HADRDB
  HOSTNAME: host1
  EDUID : 393 EDUNAME: db2hadrs.0.0 (HADRDB)
  FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20240
  DATA #1 : Standby has not received data from primary for 601 seconds. Check the status of the primary. Aborting TAKEOVER. hdrCurrentTime 1594965598 hdrLastLogRecvTime 1594964997 hdrGracefulTkTimeout 600

This failure occurs because the primary database is not able to complete the takeover operation. There can be different root causes. One particular cause has been identified and addressed: an online reorg operation on the primary database that has not finished blocks the primary from completing the takeover operation.
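The three counters at the end of the db2diag.log message can be checked by hand. The sketch below only reproduces that arithmetic with the values from the message. The helper name `takeover_idle_seconds` is hypothetical, and reading hdrGracefulTkTimeout as the threshold that triggers the abort is an inference from the message text, not documented behavior.

```shell
#!/bin/sh
# Sketch only: reproduces the arithmetic behind the db2diag.log message.
# takeover_idle_seconds CURRENT_TIME LAST_LOG_RECV_TIME (hypothetical helper)
# prints the number of seconds since the standby last received log data.
takeover_idle_seconds() {
  echo $(( $1 - $2 ))
}

# Values copied from the message: hdrCurrentTime, hdrLastLogRecvTime, hdrGracefulTkTimeout
idle=$(takeover_idle_seconds 1594965598 1594964997)
timeout=600

echo "idle=${idle}s timeout=${timeout}s"
# Assumption: the graceful takeover is aborted once the idle time exceeds the timeout.
if [ "$idle" -gt "$timeout" ]; then
  echo "idle time exceeds hdrGracefulTkTimeout"
fi
```

With the values from the message, the idle time is 601 seconds, one second past the 600-second timeout, which matches the "typically 10 minutes" delay before SQL1770N is returned.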
If stacks were collected on the primary database while the TAKEOVER command was blocked, the following stack from the db2reorg thread can be seen:

  0x00007F62DBFE68FF sqloWaitEDUWaitPost + 0x03bf (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62DC616FE5 _Z21sqlplWaitForLockGrantP9sqeBsuEduP8SQLP_AWBPjl + 0x02e5 (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62DC605CCE _Z13sqlplWaitOnWPP9sqeBsuEduP14SQLP_LOCK_INFOP8SQLP_LRBP15SQLP_LTRN_CHAINbbb + 0x147e (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62DC5FD56D _Z24sqlplMakeNewRequestNonSDP9sqeBsuEduP14SQLP_LOCK_INFOP11SQLP_TENTRYP8SQLP_LRBS6_P15SQLP_LTRN_CHAINbbb + 0x070d (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62DC4314C5 _Z7sqlplrqP9sqeBsuEduP14SQLP_LOCK_INFO + 0x0ee5 (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62DC436DA0 _Z19sqlplDrainOldAccessP8sqeAgentP13SQLP_LOCKNAMEmbb + 0x0990 (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62D4DB50C0 _Z20sqldOnlineTableReorgP8sqeAgenttthmittPciS1_iP9SQLP_LSN8S3_si + 0x3560 (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62D4DB1A8E _Z13sqldOLRInvokeP8sqeAgentPc + 0x00be (/home/db2inst1/sqllib/lib64/libdb2e.so.1)
  0x00007F62DA0268EA _Z26sqleIndCoordProcessRequestP8sqeAgent + 0x15aa (/home/db2inst1/sqllib/lib64/libdb2e.so.1)

The above stack identifies the reorg as the application that is blocking the completion of the TAKEOVER. The stack file also shows the lock that the reorg is waiting on:

  Waiting on lock name: 0049000F000000000000000054 SQLP_TABLE (obj={73;15})

If lock information is also collected (e.g. db2pd -lock) while the TAKEOVER is hanging, the lock holder can be identified. Even without this information, the table being reorganized is shown with ID (73;15). This information can be used to confirm that the table being reorganized is the target table of an event monitor, and that the lock is held by an active event monitor fast writer thread.
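The object ID (73;15) can also be read from the lock name itself. The sketch below assumes that the first four hex digits of a SQLP_TABLE lock name hold the table space ID and the next four hold the table ID. That layout is inferred from this one sample (0x0049 = 73, 0x000F = 15) and is not taken from Db2 documentation. The helper name `decode_table_lock` is hypothetical.

```shell
#!/bin/sh
# Sketch only. Assumption: in a SQLP_TABLE lock name, hex digits 1-4 are the
# table space ID and hex digits 5-8 are the table ID (inferred from the sample).
# decode_table_lock LOCKNAME (hypothetical helper) prints "tbspaceid;tableid".
decode_table_lock() {
  tbsp_hex=$(printf '%s' "$1" | cut -c1-4)
  tab_hex=$(printf '%s' "$1" | cut -c5-8)
  # printf %d accepts 0x-prefixed hexadecimal constants
  printf '%d;%d\n' "0x$tbsp_hex" "0x$tab_hex"
}

# Lock name copied from the stack file
decode_table_lock 0049000F000000000000000054
```

The two IDs can then be mapped to a table name on the primary with a catalog query such as `SELECT TABSCHEMA, TABNAME FROM SYSCAT.TABLES WHERE TBSPACEID = 73 AND TABLEID = 15`.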
In fact, such a reorg operation is blocked by the active event monitor and can never complete until the event monitor is deactivated. This occurs even without the TAKEOVER command. Therefore, it is recommended that users deactivate the event monitor before initiating the reorg operation. It is still undesirable for this condition to fail the TAKEOVER. The TAKEOVER should detect and deactivate the event monitor and force off the reorg operation to ensure successful completion of the HADR role switch. | |
Problem Summary: | |
****************************************************************
* USERS AFFECTED:
* all
****************************************************************
* PROBLEM DESCRIPTION:
* See Error Description
****************************************************************
* RECOMMENDATION:
* Upgrade to 11.5.6
**************************************************************** | |
Local Fix: | |
On the primary, deactivate the event monitor that writes to the table being reorganized before running the takeover on the standby. | |
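A minimal sketch of the local fix follows. The two helpers only print CLP statements for review and do not touch any database. The helper names, MYSCHEMA.EVMON_TARGET and MYEVMON are hypothetical placeholders. SYSCAT.EVENTTABLES, EVENT_MON_STATE and SET EVENT MONITOR ... STATE 0 are standard Db2 catalog and SQL features.

```shell
#!/bin/sh
# Sketch only: prints statements; nothing is executed against a database.
# list_evmons_sql SCHEMA TABLE (hypothetical helper)
# prints a query listing the event monitors that write to the given table,
# with their state (1 = active, 0 = inactive).
list_evmons_sql() {
  printf "SELECT EVMONNAME, EVENT_MON_STATE(EVMONNAME) AS STATE FROM SYSCAT.EVENTTABLES WHERE TABSCHEMA = '%s' AND TABNAME = '%s';\n" "$1" "$2"
}

# deactivate_evmon_sql NAME (hypothetical helper)
# prints the statement that deactivates one event monitor.
deactivate_evmon_sql() {
  printf 'SET EVENT MONITOR %s STATE 0;\n' "$1"
}

# Placeholder names; replace with the table found on the primary and its monitor.
list_evmons_sql MYSCHEMA EVMON_TARGET
deactivate_evmon_sql MYEVMON
```

On the primary, after `db2 connect to hadrdb`, the printed statements could be piped into `db2 -tv` and the takeover then retried on the standby. This assumes the CLP is reading semicolon-terminated statements from standard input.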
Solution | |
Workaround | |
Comment | |
Upgrade to 11.5.6 | |
Timestamps | |
Date - problem reported : | 16.06.2021 |
Date - problem closed : | 16.06.2021 |
Date - last modified : | 16.06.2021 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) |