DB2 - Problem description
| Problem IC99381 | Status: Closed |
Database hang during forced shutdown or HADR takeover | |
| product: | |
DB2 FOR LUW / DB2FORLUW / A10 - DB2 | |
| Problem description: | |
Due to a defect in the page cleaning code path, a page cleaner
might still hold page latches while the database is being
forced down. Other EDUs waiting for these page latches are not
interrupted properly. As a result, the database hangs and
cannot shut down. The problem might
happen in the following scenarios:
1. An error causing the database to be marked bad, thus
resulting in a forced database shutdown.
2. An HADR takeover by force, where the primary will hang as a
result.
If the problem happens during a forced HADR takeover, the
primary will hang, although the standby will be able to take over
properly. However, the primary will not be able to enter the
standby role or perform any further takeover.
A sample call stack of an EDU waiting for a page latch (actual
stacks may vary; the important frame is
"sqlbVerifyAndLatchPage"):
SQLO_SLATCH_CAS64::getConflictComplex
SQLO_SLATCH_CAS64::getConflict
sqlo_latch_ns::get
sqloSXULatch::get
sqloSXUltch_notrack
sqloSXUltch_track_page
sqlbGetAndMonitorPageLatch
sqlbVerifyAndLatchPage
sqlbFindPageInBPOrSim
sqlbfix
sqlbFixPage
sqlifix
sqliaddk
sqldUpdateIndexes
sqldRowUpdate
sqlriupd
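The check described above (look for "sqlbVerifyAndLatchPage" in an EDU's stack) can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the stack has already been captured as plain text with one frame per line, and the helper name is_page_latch_waiter is hypothetical, not part of any DB2 tooling.

```python
# Frame named in the problem description as the important piece.
LATCH_FRAME = "sqlbVerifyAndLatchPage"


def is_page_latch_waiter(stack_text: str) -> bool:
    """Return True if any frame of the stack mentions sqlbVerifyAndLatchPage.

    Assumes one frame per line; blank lines are ignored.
    """
    frames = [line.strip() for line in stack_text.splitlines() if line.strip()]
    return any(LATCH_FRAME in frame for frame in frames)


# Shortened version of the sample stack from this problem description.
sample_stack = """\
SQLO_SLATCH_CAS64::getConflictComplex
sqloSXULatch::get
sqlbGetAndMonitorPageLatch
sqlbVerifyAndLatchPage
sqlbFindPageInBPOrSim
sqlbfix
"""

print(is_page_latch_waiter(sample_stack))
```

A match only shows that the EDU is waiting for a page latch. On its own it does not confirm this defect; the db2diag.log messages also have to be present.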
The following excerpt from db2diag.log indicates that the database
was forced during an HADR takeover and a page cleaner was terminated
forcibly. If the database starts to hang after encountering
similar messages and there are EDUs waiting for page latches,
the problem has been reproduced.
2013-11-26-04.24.48.924238-300 I143262171A574 LEVEL: Severe
PID : 16187616 TID : 49367 KTID : 74383395
PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8575 APPID: *N0.DB2.131126101026
HOSTNAME: myhostname
EDUID : 49367 EDUNAME: db2agent (SAMPLE) 0
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrPoisonLocalMember, probe:41180
DATA #1 : <preformatted>
HADR marking logs bad; database should shut down to avoid split
brain; standby is taking over.
...
2013-11-26-04.24.49.194218-300 E143268528A535 LEVEL: Error
PID : 16187616 TID : 42427 KTID : 68288521
PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
HOSTNAME: myhostname
EDUID : 42427 EDUNAME: db2pclnr (SAMPLE) 0
FUNCTION: DB2 UDB, data protection services, sqlpflog, probe:480
MESSAGE : ZRC=0x870F0151=-2029059759=SQLO_WP_TERM
"The waitpost area has been terminated"
...
2013-11-26-04.24.49.195988-300 E143269064A686 LEVEL: Severe
PID : 16187616 TID : 42427 KTID : 68288521
PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
HOSTNAME: myhostname
EDUID : 42427 EDUNAME: db2pclnr (SAMPLE) 0
FUNCTION: DB2 UDB, buffer pool services, sqlbgbWAR, probe:5933
MESSAGE : ZRC=0x870F0151=-2029059759=SQLO_WP_TERM
"The waitpost area has been terminated"
...
2013-11-26-04.24.49.437225-300 I143296129A1261 LEVEL: Info
PID : 16187616 TID : 47824 KTID : 71827519
PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : SAMPLE
APPHDL : 0-8577 APPID: *N0.DB2.131126101028
HOSTNAME: myhostname
EDUID : 47824 EDUNAME: db2agent (SAMPLE ) 0
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::ForceDBShutdown, probe:15056
MESSAGE : Regular agent EDU doing ForceDBShutdown. Force DB shutdown agent ID | |
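As a rough aid for spotting the message pattern in the db2diag.log excerpt, the following Python sketch splits a log into records and checks for two things: a forced shutdown (hdrPoisonLocalMember or ForceDBShutdown) and a page cleaner (db2pclnr) reporting SQLO_WP_TERM. The record-start pattern is inferred from the timestamps in the excerpt, and the function names are hypothetical. Treat it as a sketch under those assumptions rather than a supported diagnostic.

```python
import re

# A record is assumed to start with a timestamp such as
# 2013-11-26-04.24.48.924238-300 (pattern inferred from the excerpt).
RECORD_START = re.compile(
    r"^\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3}\s", re.MULTILINE
)


def split_records(log_text):
    """Split db2diag.log text into records at each timestamp line."""
    starts = [m.start() for m in RECORD_START.finditer(log_text)]
    ends = starts[1:] + [len(log_text)]
    return [log_text[s:e] for s, e in zip(starts, ends)]


def has_hang_signature(log_text):
    """Return True if the log shows both a forced shutdown and a
    page cleaner terminated with SQLO_WP_TERM."""
    forced = False
    cleaner_terminated = False
    for record in split_records(log_text):
        if "hdrPoisonLocalMember" in record or "ForceDBShutdown" in record:
            forced = True
        if "db2pclnr" in record and "SQLO_WP_TERM" in record:
            cleaner_terminated = True
    return forced and cleaner_terminated


# Abbreviated records taken from the excerpt in this problem description.
sample_log = (
    "2013-11-26-04.24.48.924238-300 I143262171A574 LEVEL: Severe\n"
    "EDUID : 49367 EDUNAME: db2agent (SAMPLE) 0\n"
    "FUNCTION: DB2 UDB, High Availability Disaster Recovery, "
    "hdrPoisonLocalMember, probe:41180\n"
    "2013-11-26-04.24.49.194218-300 E143268528A535 LEVEL: Error\n"
    "EDUID : 42427 EDUNAME: db2pclnr (SAMPLE) 0\n"
    "MESSAGE : ZRC=0x870F0151=-2029059759=SQLO_WP_TERM\n"
)

print(has_hang_signature(sample_log))
```

Matching messages alone do not prove the defect. As stated in the description, the database must also hang afterwards with EDUs waiting for page latches.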
| Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* See Error Description                                        *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to DB2 10.1 for Linux, UNIX, and Windows Fix Pack 4  *
**************************************************************** | |
| Local Fix: | |
Kill and restart the hanging database | |
| available fix packs: | |
DB2 Version 10.1 Fix Pack 4 for Linux, UNIX, and Windows | |
| Solution | |
Problem first fixed in DB2 10.1 for Linux, UNIX, and Windows Fix Pack 4 | |
| Workaround | |
not known / see Local fix | |
| Timestamps | |
Date - problem reported : 14.02.2014
Date - problem closed : 16.06.2014
Date - last modified : 16.06.2014 |
| Problem solved at the following versions (IBM BugInfos) | |
| Problem solved according to the fixlist(s) of the following version(s) | |
| 10.1.0.4 |