DB2 - Problem description
Problem IT23720 | Status: Closed |
HADR: FAIL TO UNLOCK THE RESOURCE WHEN BOTH HOSTS COMING BACK AT SAME TIME CAUSES FREQUENT DISCONNECT AND BAD PERFORMANCE | |
product: | |
DB2 FOR LUW / DB2FORLUW / A50 - DB2 | |
Problem description: | |
This problem happens in a HADR system managed by TSA. It is a timing issue that only happens when both hosts go down and then come back online at the same time. When the servers come back up, TSA chooses to start the previous standby host as primary before it is in PEER state. The takeover command fails because the standby has not finished syncing with the original primary, and Db2 then fails to unlock the locked resource, so TSA cannot bring the resource online when it monitors the primary as online.

db2diag.log on standby:

2018-01-07-14.32.39.638222+480 E36383784A500 LEVEL: Event
PID : 4849770 TID : 6684 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-9 APPID: *LOCAL.db2inst1.180107063239
AUTHID : db2inst1 HOSTNAME: standby_host
EDUID : 6684 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, base sys utilities, sqeDBMgr::StartUsingLocalDatabase, probe:13
START : Received TAKEOVER HADR command.

2018-01-07-14.32.39.711013+480 I36384794A788 LEVEL: Warning
PID : 4849770 TID : 6684 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-9 APPID: *LOCAL.db2inst1.180107063239
AUTHID : db2inst1 HOSTNAME: standby_host
EDUID : 6684 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrValidateTakeoverRequest, probe:52050
MESSAGE : ZRC=0x8280001D=-2105540579=HDR_ZRC_NOT_TAKEOVER_CANDIDATE_FORCED
          "Forced takeover rejected as standby is in the wrong state or peer window has expired"
DATA #1 : HADR standby not ready for takeover.
Current HADR state: HDR_S_REM_CATCHUP_PENDING
Light scan status : Inactive

2018-01-07-14.34.15.530427+480 I36451452A637 LEVEL: Severe
PID : 4849770 TID : 6684 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-9 APPID: *LOCAL.db2inst1.180107063239
AUTHID : db2inst1 HOSTNAME: standby_host
EDUID : 6684 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrTakeoverHdrRouteIn, probe:55602
MESSAGE : Failed to unlock HADR resource group after failed HADR takeover
DATA #1 : Hexdump, 4 bytes
0x0A00000009BF4C50 : 8273 00AA

db2diag.log on primary:

2018-01-08-14.26.37.289578+480 E125415942A2205 LEVEL: Error
PID : 7340092 TID : 12338 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
HOSTNAME: primary_host
EDUID : 12338 EDUNAME: db2hadrp.0.1 (SAMPLE) 0
FUNCTION: DB2 UDB, high avail services, sqlhaWaitForResourceState, probe:16314
DATA #1 : String, 25 bytes
db2_db2inst1_db2inst1_SAMPLE-rs
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x0A000000043D5934 Size: 0 Reason: Zero-length data
DATA #3 : signed integer, 4 bytes
17
DATA #4 : signed integer, 4 bytes
1
...

2018-01-08-14.26.37.290411+480 E125418148A631 LEVEL: Error
PID : 7340092 TID : 12338 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
HOSTNAME: primary_host
EDUID : 12338 EDUNAME: db2hadrp.0.1 (SAMPLE) 0
FUNCTION: DB2 UDB, high avail services, sqlhaEnableHADRResource, probe:14174
MESSAGE : ZRC=0x87000057=-2030043049=SQLZ_RC_TIMEOUT "Action timed out"
          DIA8578C A timeout occurred while waiting on a semaphore.
DATA #1 : String, 47 bytes
Unable to verify HADR resource state as online.
DATA #2 : String, 25 bytes
db2_db2inst1_db2inst1_SAMPLE-rs

db2diag.log also shows frequent HADR_TIMEOUT and HADR disconnect / remote catch up states:

2018-01-08-14.26.07.281479+480 E89408879A649 LEVEL: Error
PID : 4849770 TID : 12338 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
HOSTNAME: standby_host
EDUID : 12338 EDUNAME: db2hadrs.0.0 (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
MESSAGE : Did not receive anything through HADR connection for the duration of HADR_TIMEOUT. Closing connection.
DATA #1 : String, 30 bytes
hdrCurrentTime/hdrLastRecvTime
DATA #2 : unsigned integer, 4 bytes
1515392767
DATA #3 : unsigned integer, 4 bytes
1515392761

2018-01-08-14.26.07.283483+480 E89409903A448 LEVEL: Event
PID : 4849770 TID : 12338 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
HOSTNAME: standby_host
EDUID : 12338 EDUNAME: db2hadrs.0.0 (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_S_DISCONN_PEER (was HDR_S_PEER), connId=13590

Performance on the primary is likely to be impacted: users might see connections delayed in Commit-Active state, blocked by the syncing attempts of the HADR EDUs. | |
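To check whether a given db2diag.log shows the same pattern, the messages quoted above can be searched for mechanically. The following is a minimal sketch in Python, not an IBM-provided tool: the function names and the `SIGNATURES` table are hypothetical, the search strings are copied from the excerpts in this description, and the record splitter simply assumes that every record begins with a timestamp of the form shown above.

```python
import re

# Message fragments copied from the db2diag.log excerpts in this APAR.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "takeover_rejected": "HDR_ZRC_NOT_TAKEOVER_CANDIDATE_FORCED",
    "unlock_failed": "Failed to unlock HADR resource group after failed HADR takeover",
    "resource_not_online": "Unable to verify HADR resource state as online",
    "hadr_timeout": "Did not receive anything through HADR connection",
}

# Assumption: a record starts with a timestamp such as
# 2018-01-07-14.32.39.638222+480 (date, time, microseconds, UTC offset in minutes).
RECORD_START = re.compile(r"\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d+")


def split_records(text):
    """Split raw db2diag.log text into records, each beginning at a timestamp."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b].strip() for a, b in zip(starts, ends)]


def count_signatures(text):
    """Count how many records contain each signature string."""
    counts = {name: 0 for name in SIGNATURES}
    for record in split_records(text):
        for name, needle in SIGNATURES.items():
            if needle in record:
                counts[name] += 1
    return counts


def matches_apar_pattern(counts):
    """True when both the rejected takeover and the failed unlock were logged."""
    return counts["takeover_rejected"] > 0 and counts["unlock_failed"] > 0
```

A caller would read the standby's db2diag.log into a string, pass it to `count_signatures`, and treat a true result from `matches_apar_pattern` only as a hint that this APAR may apply; a high `hadr_timeout` count is consistent with the frequent disconnects described above but is not specific to this problem.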
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* The users are running HADR managed by TSA.                   *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 V10.5 FP10 or later.                          *
**************************************************************** | |
Local Fix: | |
This issue can be avoided by starting the host that was the original primary first, which forces TSA to try starting the primary on the correct host and avoids the resource lock. | |
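When applying the local fix, an operator may want to confirm the HADR role and state on each host before starting the other one. The sketch below is one hedged way to do that in Python: it parses text in the `KEY = VALUE` layout printed by `db2pd -db SAMPLE -hadr` and checks the `HADR_ROLE` and `HADR_STATE` fields. The function names are hypothetical, the code only parses text that the caller has already captured, and it assumes the first occurrence of each field is the one of interest.

```python
def parse_db2pd_hadr(text):
    """Collect KEY = VALUE pairs from captured `db2pd -hadr` output.

    Only upper-case keys (such as HADR_ROLE) are kept; the first
    occurrence of each key wins.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.isupper():
            fields.setdefault(key, value.strip())
    return fields


def is_primary(fields):
    """True when the database on this host reports the PRIMARY role."""
    return fields.get("HADR_ROLE") == "PRIMARY"


def standby_ready_for_takeover(fields):
    """True only for a standby in PEER state.

    The failure described above was triggered by a takeover attempted
    while the standby was still in a remote catchup pending state.
    """
    return fields.get("HADR_ROLE") == "STANDBY" and fields.get("HADR_STATE") == "PEER"
```

With this sketch, the host that was the original primary would be started first and checked with `is_primary`; the second host would be started only afterwards. This is an illustration of the ordering in the local fix, not a replacement for the TSA automation.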
Solution | |
Workaround | |
not known / see Local fix | |
Timestamps | |
Date - problem reported : | 12.01.2018 |
Date - problem closed : | 11.07.2018 |
Date - last modified : | 11.07.2018 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) |