
DB2 - Problem description

Problem IT23720 Status: Closed

HADR: FAIL TO UNLOCK THE RESOURCE WHEN BOTH HOSTS COMING BACK AT SAME TIME
CAUSES FREQUENT DISCONNECT AND BAD PERFORMANCE

product:
DB2 FOR LUW / DB2FORLUW / A50 - DB2
Problem description:
This problem occurs in an HADR system managed by TSA.

It is a timing issue that only arises when both hosts go down
and then come back online at the same time. When the servers come
back up, TSA chooses to start the previous standby host as the new
primary before it has reached PEER state. The takeover command
fails because the standby has not finished syncing with the
original primary, and Db2 then fails to unlock the locked
resource, so TSA cannot bring the resource online even when it
monitors the primary as online.
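
As an illustrative aside (not part of the original APAR text), the
stuck state can be inspected with standard TSA and Db2 tooling; the
database name SAMPLE and the resource names are taken from the log
excerpts below and will differ in other environments:

  # List SA MP resource groups and pending requests; a group that Db2
  # failed to unlock is typically reported with a lock request (exact
  # wording varies by SA MP level).
  lssam

  # Show the HADR role, state, and connect status as Db2 sees them.
  db2pd -hadr -db SAMPLE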

db2diag.log on standby:

2018-01-07-14.32.39.638222+480 E36383784A500        LEVEL: Event
PID     : 4849770              TID : 6684           PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
APPHDL  : 0-9                  APPID: *LOCAL.db2inst1.180107063239
AUTHID  : db2inst1             HOSTNAME: standby_host
EDUID   : 6684                 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, base sys utilities, sqeDBMgr::StartUsingLocalDatabase, probe:13
START   : Received TAKEOVER HADR command.

2018-01-07-14.32.39.711013+480 I36384794A788        LEVEL: Warning
PID     : 4849770              TID : 6684           PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
APPHDL  : 0-9                  APPID: *LOCAL.db2inst1.180107063239
AUTHID  : db2inst1             HOSTNAME: standby_host
EDUID   : 6684                 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrValidateTakeoverRequest, probe:52050
MESSAGE : ZRC=0x8280001D=-2105540579=HDR_ZRC_NOT_TAKEOVER_CANDIDATE_FORCED
          "Forced takeover rejected as standby is in the wrong state or peer window has expired"
DATA #1 :
HADR standby not ready for takeover.
   Current HADR state: HDR_S_REM_CATCHUP_PENDING
   Light scan status : Inactive


2018-01-07-14.34.15.530427+480 I36451452A637        LEVEL: Severe
PID     : 4849770              TID : 6684           PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
APPHDL  : 0-9                  APPID: *LOCAL.db2inst1.180107063239
AUTHID  : db2inst1             HOSTNAME: standby_host
EDUID   : 6684                 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrTakeoverHdrRouteIn, probe:55602
MESSAGE : Failed to unlock HADR resource group after failed HADR takeover
DATA #1 : Hexdump, 4 bytes
0x0A00000009BF4C50 : 8273 00AA


db2diag.log on primary:

2018-01-08-14.26.37.289578+480 E125415942A2205      LEVEL: Error
PID     : 7340092              TID : 12338          PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
HOSTNAME: primary_host
EDUID   : 12338                EDUNAME: db2hadrp.0.1 (SAMPLE) 0
FUNCTION: DB2 UDB, high avail services, sqlhaWaitForResourceState, probe:16314
DATA #1 : String, 25 bytes
db2_db2inst1_db2inst1_SAMPLE-rs
DATA #2 : String, 0 bytes
Object not dumped: Address: 0x0A000000043D5934 Size: 0 Reason: Zero-length data
DATA #3 : signed integer, 4 bytes
17
DATA #4 : signed integer, 4 bytes
1
...

2018-01-08-14.26.37.290411+480 E125418148A631       LEVEL: Error
PID     : 7340092              TID : 12338          PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
HOSTNAME: primary_host
EDUID   : 12338                EDUNAME: db2hadrp.0.1 (SAMPLE) 0
FUNCTION: DB2 UDB, high avail services, sqlhaEnableHADRResource, probe:14174
MESSAGE : ZRC=0x87000057=-2030043049=SQLZ_RC_TIMEOUT "Action timed out"
          DIA8578C A timeout occurred while waiting on a semaphore.
DATA #1 : String, 47 bytes
Unable to verify HADR resource state as online.
DATA #2 : String, 25 bytes
db2_db2inst1_db2inst1_SAMPLE-rs


db2diag.log also shows frequent HADR_TIMEOUT errors and HADR
disconnect / remote catchup state changes:
2018-01-08-14.26.07.281479+480 E89408879A649        LEVEL: Error
PID     : 4849770              TID : 12338          PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
HOSTNAME: standby_host
EDUID   : 12338                EDUNAME: db2hadrs.0.0 (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe:20200
MESSAGE : Did not receive anything through HADR connection for the duration of
          HADR_TIMEOUT. Closing connection.
DATA #1 : String, 30 bytes
hdrCurrentTime/hdrLastRecvTime
DATA #2 : unsigned integer, 4 bytes
1515392767
DATA #3 : unsigned integer, 4 bytes
1515392761

2018-01-08-14.26.07.283483+480 E89409903A448        LEVEL: Event
PID     : 4849770              TID : 12338          PROC : db2sysc 0
INSTANCE: db2inst1             NODE : 000           DB   : SAMPLE
HOSTNAME: standby_host
EDUID   : 12338                EDUNAME: db2hadrs.0.0 (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE  : HADR state set to HDR_S_DISCONN_PEER (was HDR_S_PEER), connId=13590
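
To locate records like the ones above in a live system, the db2diag
tool's filter option can be used; a minimal sketch, assuming the
message texts quoted in this APAR (adjust the patterns to your own
log):

  # Case-insensitive search of db2diag.log for the failed-unlock and
  # timeout messages shown above.
  db2diag -gi "message:=Failed to unlock HADR resource"
  db2diag -gi "message:=HADR_TIMEOUT"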

Performance on the primary is also impacted: users might see
connections delayed in Commit-Active state, blocked by the syncing
attempts of the HADR EDUs.
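
As a hedged illustration of how the impact shows up (database name
SAMPLE assumed), blocked applications can be listed from the CLP;
sessions stuck on commit appear with a Commit Active status:

  # Applications delayed by the HADR syncing attempts show up here
  # with status "Commit Active"; db2pd -hadr (see above) shows the
  # accompanying disconnect / remote catchup churn.
  db2 list applications for database SAMPLE show detail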
Problem Summary:
****************************************************************
* USERS AFFECTED:                                              *
* The users are running HADR managed by TSA.                   *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 V10.5 FP10 or later.                          *
****************************************************************
Local Fix:
This issue can be avoided by starting the host that was the
original primary first, thereby forcing TSA to start the primary
on the correct host and avoiding the resource lock.
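
A minimal sketch of that startup order, assuming an RSCT peer domain
and the resource/instance names from the log excerpts above (all
names are examples, and the commented rgreq step should be verified
against your SA MP documentation before use):

  # 1. Start the host that was the original primary first and let TSA
  #    bring Db2/HADR online there (run on the original primary host).
  db2start

  # 2. Confirm the peer domain sees the node and the HADR resource
  #    group is online on that host before starting the other server.
  lsrpnode
  lssam

  # 3. Only then start the former standby host; it rejoins as standby.

  # If a resource group was already left locked by a failed takeover,
  # SA MP's rgreq can remove the lock (example group name; verify the
  # command and group name for your environment first):
  # rgreq -o unlock db2_db2inst1_db2inst1_SAMPLE-rg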
Solution
Workaround
not known / see Local fix
Timestamps
Date  - problem reported    : 12.01.2018
Date  - problem closed      : 11.07.2018
Date  - last modified       : 11.07.2018
Problem solved at the following versions (IBM BugInfos)
Problem solved according to the fixlist(s) of the following version(s)