DB2 - Problem description
Problem IT33851 | Status: Closed |
TSA INITIATED TAKEOVER IN AN AUTOMATED HADR ENVIRONMENT MAY FAIL DUE TO PEER WINDOW HAVING EXPIRED. | |
product: | |
DB2 FOR LUW / DB2FORLUW / B50 - DB2 | |
Problem description: | |
In a TSA automated HADR environment, the standby database may not be able to take over successfully as the new primary after a failure on the old primary host, because the peer window has expired. In this case, the following db2diag.log entries will be observed:

2020-05-21-09.19.47.704063-420 I6042A435 LEVEL: Warning
PID : 13933 TID : 4395211155728 PROC : db2sysc 0
INSTANCE: seeluser NODE : 000 DB : HADRDB
HOSTNAME: svlxtorf.svl.ibm.com
EDUID : 62 EDUNAME: db2hadrs.0.0 (HADRDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20202
MESSAGE : Peer window ends. Peer window expired.

2020-05-21-09.19.47.704162-420 E6478A470 LEVEL: Event
PID : 13933 TID : 4395211155728 PROC : db2sysc 0
INSTANCE: seeluser NODE : 000 DB : HADRDB
HOSTNAME: svlxtorf.svl.ibm.com
EDUID : 62 EDUNAME: db2hadrs.0.0 (HADRDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to HDR_S_REM_CATCHUP_PENDING (was HDR_S_DISCONN_PEER), connId=4

2020-05-21-09.19.52.006067-420 I6949A392 LEVEL: Warning
PID : 27581 TID : 2199453935008 PROC : db2gcf
INSTANCE: seeluser NODE : 000
HOSTNAME: svlxtorf.svl.ibm.com
FUNCTION: DB2 Common, Generic Control Facility, gcf_start, probe:928
DATA #1 : String, 18 bytes
Current HADR state
DATA #2 : String, 6 bytes
HADRDB
DATA #3 : unsigned integer, 8 bytes
2

2020-05-21-09.19.52.019589-420 E7342A399 LEVEL: Info
PID : 27581 TID : 2199453935008 PROC : db2gcf
INSTANCE: seeluser NODE : 000
HOSTNAME: svlxtorf.svl.ibm.com
FUNCTION: DB2 UDB, high avail services, sqlhaCreateFlagRG, probe:535
MESSAGE : IBM.Test flag resource has been created
DATA #1 : String, 49 bytes
db2_HADRDB_ClusterInitiatedMove_seeluser_seeluser

2020-05-21-09.19.52.019623-420 I7742A384 LEVEL: Warning
PID : 27581 TID : 2199453935008 PROC : db2gcf
INSTANCE: seeluser NODE : 000
HOSTNAME: svlxtorf.svl.ibm.com
FUNCTION: DB2 Common, Generic Control Facility, gcf_start, probe:957
DATA #1 : String, 48 bytes
Initiating cluster driven HADR takeover request.
DATA #2 : String, 6 bytes
HADRDB

2020-05-21-09.19.52.023487-420 E8127A514 LEVEL: Event
PID : 13933 TID : 4376282261776 PROC : db2sysc 0
INSTANCE: seeluser NODE : 000 DB : HADRDB
APPHDL : 0-22 APPID: *LOCAL.seeluser.200521161952
AUTHID : SEELUSER HOSTNAME: svlxtorf.svl.ibm.com
EDUID : 80 EDUNAME: db2agent (HADRDB) 0
FUNCTION: DB2 UDB, base sys utilities, sqeDBMgr::StartUsingLocalDatabase, probe:13
START : Received TAKEOVER HADR command.

2020-05-21-09.19.52.024417-420 I8642A870 LEVEL: Warning
PID : 13933 TID : 4376282261776 PROC : db2sysc 0
INSTANCE: seeluser NODE : 000 DB : HADRDB
APPHDL : 0-22 APPID: *LOCAL.seeluser.200521161952
AUTHID : SEELUSER HOSTNAME: svlxtorf.svl.ibm.com
EDUID : 80 EDUNAME: db2agent (HADRDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrValidateTakeoverRequest, probe:52050
MESSAGE : ZRC=0x8280001D=-2105540579=HDR_ZRC_NOT_TAKEOVER_CANDIDATE_FORCED
"Forced takeover rejected as standby is in the wrong state or peer window has expired"
DATA #1 : HADR standby not ready for takeover.
Current HADR state: HDR_S_REM_CATCHUP_PENDING
Light scan status : Inactive
Peer Window End : 1590077986
Current Time : 1590077991

2020-05-21-09.19.52.024459-420 I9513A713 LEVEL: Error
PID : 13933 TID : 4376282261776 PROC : db2sysc 0
INSTANCE: seeluser NODE : 000 DB : HADRDB
APPHDL : 0-22 APPID: *LOCAL.seeluser.200521161952
AUTHID : SEELUSER HOSTNAME: svlxtorf.svl.ibm.com
EDUID : 80 EDUNAME: db2agent (HADRDB) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrRequestTakeover, probe:39999
MESSAGE : ZRC=0x8280001D=-2105540579=HDR_ZRC_NOT_TAKEOVER_CANDIDATE_FORCED
"Forced takeover rejected as standby is in the wrong state or peer window has expired"
DATA #1 : String, 36 bytes
HADR takeover pre-validation failed

This could be due to the RSCT grace period communication group setting being enabled. This setting specifies the grace period that is used when heartbeats are no longer received. Setting this value delays host failure detection, which could cause the peer window to expire before the takeover command is received on the standby. | |
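The rejection message quoted above carries two values, "Peer Window End" and "Current Time", that show how late the takeover request arrived. The following is a minimal sketch of reading them out of a db2diag.log excerpt. It assumes both values are Unix epoch seconds, which is consistent with the log timestamps in this record; the function names are hypothetical helpers and not part of any Db2 tooling.

```python
import re
from datetime import datetime, timezone

# Excerpt copied from the hdrValidateTakeoverRequest entry in this record.
TAKEOVER_MSG = """\
Current HADR state: HDR_S_REM_CATCHUP_PENDING
Light scan status : Inactive
Peer Window End : 1590077986
Current Time : 1590077991
"""


def peer_window_overrun(text):
    """Return (peer_window_end, current_time, seconds_late).

    seconds_late > 0 means the takeover request was validated after the
    peer window had already ended.
    """
    end = re.search(r"Peer Window End\s*:\s*(\d+)", text)
    now = re.search(r"Current Time\s*:\s*(\d+)", text)
    if end is None or now is None:
        raise ValueError("peer window fields not found in text")
    end_ts, now_ts = int(end.group(1)), int(now.group(1))
    return end_ts, now_ts, now_ts - end_ts


def as_utc(ts):
    """Format an epoch value as ISO 8601 UTC (assumes epoch seconds)."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    end_ts, now_ts, late = peer_window_overrun(TAKEOVER_MSG)
    print("peer window ended :", as_utc(end_ts))
    print("takeover checked  :", as_utc(now_ts))
    print("seconds late      :", late)
```

For the values in this record, the takeover request was validated 5 seconds after the peer window ended.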
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* all                                                          *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 11.5.5.0 or higher                            *
**************************************************************** | |
Local Fix: | |
Setting the RSCT grace period to 0 (disabling it) allows automation to acknowledge the host failure sooner, reducing the likelihood of encountering this error. In addition, it is recommended that HADR_PEER_WINDOW be set to at least 120 seconds in automated HADR environments.

As root, verify the communication group settings of the domain with the lscomg command. For example:

$ lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace        MediaType UseForNodeMembership
CG1  4           4      1        Yes       Yes                                     -1 (Default) 1 (IP)    1
CG2  4           4      1        Yes       Yes                                     -1 (Default) 1 (IP)    1

If "Grace" is set to anything other than 0, set it to 0 with the chcomg command for every communication group. For example:

$ chcomg -g 0 CG1
$ chcomg -g 0 CG2

Once disabled, the lscomg output should look as follows:

$ lscomg
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace        MediaType UseForNodeMembership
CG2  4           4      1        Yes       Yes                                     0 (Disabled) 1 (IP)    1
CG1  4           4      1        Yes       Yes                                     0 (Disabled) 1 (IP)    1 | |
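The check described above can be sketched as a small script that reads lscomg output and lists the chcomg commands still needed. This is only an illustration: it assumes the Grace column appears as a "<number> (<label>)" pair immediately before the MediaType pair, as in the sample rows of this record, and the function names are hypothetical. It prints the commands rather than running them.

```python
import re

# Sample rows in the format shown in this record; on a real system the
# text would come from running `lscomg` as root.
LSCOMG_SAMPLE = """\
Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership
CG1 4 4 1 Yes Yes -1 (Default) 1 (IP) 1
CG2 4 4 1 Yes Yes 0 (Disabled) 1 (IP) 1
"""

# Assumption: a data row ends with
#   <grace> (<label>) <media type> (<label>) <use for node membership>
# The header line has no parenthesised labels, so it never matches.
ROW = re.compile(
    r"^(\S+)\s.*?(-?\d+)\s+\([^)]*\)\s+\d+\s+\([^)]*\)\s+\d+\s*$"
)


def grace_settings(lscomg_output):
    """Map each communication group name to its numeric Grace value."""
    settings = {}
    for line in lscomg_output.splitlines():
        match = ROW.match(line.strip())
        if match:
            settings[match.group(1)] = int(match.group(2))
    return settings


def chcomg_commands(lscomg_output):
    """Build `chcomg -g 0 <group>` for every group whose Grace is not 0."""
    return [
        f"chcomg -g 0 {name}"
        for name, grace in grace_settings(lscomg_output).items()
        if grace != 0
    ]


if __name__ == "__main__":
    for command in chcomg_commands(LSCOMG_SAMPLE):
        print(command)
```

With the sample rows above, only CG1 is reported, because CG2 already shows 0 (Disabled).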
Solution | |
Workaround | |
not known / see Local fix | |
Timestamps | |
Date - problem reported : | 10.08.2020 |
Date - problem closed : | 20.11.2020 |
Date - last modified : | 20.11.2020 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) |