DB2 - Problem description
| Problem IC66646 | Status: Closed |
HADR PRIMARY REINTEGRATION WILL FAIL WITH PRIMARY/STANDBY MISMATCH AFTER THE PAIR REACHES PEER STATE | |
| product: | |
DB2 FOR LUW / DB2FORLUW / 970 - DB2 | |
| Problem description: | |
The problem can be seen after a takeover by force is issued and
either
a) the old primary is deactivated and brought up as a standby
or
b) the old primary is killed and is first brought up as a primary
instead of as a standby (which will fail), and is then
reintegrated as a standby.
In both cases, reintegrating the old primary as a standby causes
a primary/standby LSN mismatch. The reason is that when the old
primary is deactivated, or is first brought up as a primary
(which eventually fails due to timeout), its last/current log
file is truncated and minbufflsn, lowtranlsn and the remote
catchup start LSN are moved to the start of the next file. The
same log file is NOT truncated on the new primary, which
continues to write log records to it. When the old primary is
reintegrated as a standby, and no log writes have been done on
the new primary up to that point, a peer connection is
established between the primary and the standby.
After peer state is established, when the new primary writes
some log records and sends them to the standby, the result is a
primary/standby LSN mismatch on the standby server, which brings
down the standby server. The error message "SQL1768N unable
to start HADR. Reason code='7' " will be given.
You may see the following log entries in the db2diag.log file.
2010-02-10-10.36.47.166177-360 E121063953A371 LEVEL: Event
PID : 172306 TID : 7969 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 7969 EDUNAME: db2hadrs (sample) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrSetHdrState, probe:10000
CHANGE : HADR state set to S-Peer (was S-NearlyPeer)
2010-02-10-10.36.51.574186-360 I121079812A498 LEVEL: Error
PID : 172306 TID : 7969 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 7969 EDUNAME: db2hadrs (sample) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrAddDataBlock, probe:40012
MESSAGE : Primary/standby mismatch. RCUStartLSN 0000000224D4000C not on record boundary. RCU first page bytecount 4080, firstindex 16, pagelsn 0002230BCFFB.
2010-02-10-10.36.51.574321-360 I121080311A438 LEVEL: Severe
PID : 172306 TID : 7969 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000
EDUID : 7969 EDUNAME: db2hadrs (sample) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrAddDataBlock, probe:40012
RETCODE : ZRC=0x87800145=-2021654203=HDR_ZRC_VALIDATION_REJECT
"HADR shuts down due to validation rejection" | |
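To check whether a standby went down with this signature, the db2diag.log text can be scanned for the two markers shown in the entries above. The following is a minimal sketch, not an IBM tool: the function names `split_records` and `find_reintegration_mismatch` are hypothetical, and it assumes each log record begins with a timestamp in the form seen in the excerpt.

```python
import re

# Markers copied from the db2diag.log excerpt above.
MISMATCH_TEXT = "Primary/standby mismatch"
REJECT_CODE = "HDR_ZRC_VALIDATION_REJECT"

# Assumption: every record starts with a timestamp such as
# 2010-02-10-10.36.51.574186-360 at the beginning of a line.
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3}", re.MULTILINE
)


def split_records(text):
    """Split raw db2diag.log text into one string per record."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


def find_reintegration_mismatch(text):
    """Return timestamps of records that carry either marker."""
    hits = []
    for record in split_records(text):
        if MISMATCH_TEXT in record or REJECT_CODE in record:
            hits.append(RECORD_START.match(record).group(0))
    return hits
```

Calling `find_reintegration_mismatch(open(path).read())` on the standby's db2diag.log (with `path` supplied by the caller) returns the timestamps of any matching records. A match only suggests this APAR; the takeover and reintegration history described above still has to fit.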
| Local Fix: | |
Back up the new primary database, restore it on the standby machine, and enable HADR to bring it up as a standby. If the system is in an HA (TSA) environment, applying the fix for APAR IC65836 may avoid hitting this APAR. | |
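The local fix above can be laid out as an ordered command sequence. The sketch below only builds the DB2 command line strings and does not run them. The helper name `workaround_commands`, the database name, and the backup directory are hypothetical, and the use of an ONLINE backup is an assumption so that the new primary can stay active.

```python
def workaround_commands(db_name, backup_dir):
    """Return (where to run, command) pairs for re-seeding the old primary.

    Assumption: the backup image in backup_dir is made available on the
    standby machine before the restore step.
    """
    return [
        # Take a fresh backup of the database that is now the primary.
        ("new primary", f"db2 backup database {db_name} online to {backup_dir}"),
        # Replace the mismatched copy on the machine that will be the standby.
        ("standby machine", f"db2 restore database {db_name} from {backup_dir}"),
        # Bring the restored copy up in the standby role.
        ("standby machine", f"db2 start hadr on database {db_name} as standby"),
    ]


if __name__ == "__main__":
    for where, command in workaround_commands("SAMPLE", "/backups"):
        print(f"[{where}] {command}")
```

The HADR configuration parameters on the restored database may need to be checked before the final step, since the image comes from the other machine.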
| available fix packs: | |
DB2 Version 9.7 Fix Pack 3 for Linux, UNIX, and Windows | |
| Solution | |
This issue was first fixed in DB2 Version 9.7 Fix Pack 3. | |
| Timestamps | |
Date - problem reported : 25.02.2010
Date - problem closed : 23.09.2010
Date - last modified : 23.09.2010 | |
| Problem solved at the following versions (IBM BugInfos) | |
9.7.FP3 | |
| Problem solved according to the fixlist(s) of the following version(s) | |
9.7.0.3 | |