DB2 - Problem description

Problem IT35720 Status: Closed

PURESCALE MAY HANG WHEN THERE ARE 2 OR MORE CONCURRENT NODE FAILURES AND
ONE OF THE NODE FAILURES CAUSES A DATABASE DEACTIVATION.

product:
DB2 FOR LUW / DB2FORLUW / B50 - DB2
Problem description:
pureScale may hang when there are 2 concurrent node failures and
a database deactivation driven by one of those failures. The
deactivation tries to force system applications such as the db2
periodic daemon, but this force interrupt is suppressed (as
expected) because it has a lower interrupt priority than a node
failure and occurs at the same time as the node failure
recovery. However, no interrupt is then sent for the second node
failure, so the periodic daemon remains blocked indefinitely
waiting for a reply from one of the failed nodes, while the
deactivate subagent blocks forever waiting for the forced
periodic daemon to shut down. The timing needed to hit this (at
least in my repro) is very tight: the second node failure has to
occur while the first is being handled (in
nodeFailureRecovery()) and after the failed nodes have been
queried in this function.
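
The ordering described above can be illustrated with a small toy
model. This is only a sketch of the reasoning, not Db2 code: the
class ToyMember, the function simulate() and the node numbers 0
and 2 are hypothetical names chosen for illustration, and the
model assumes that recovery interrupts only the waits on nodes
that were already known to have failed when the failed nodes
were queried.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMember:
    """Toy stand-in for the member running the periodic daemon (not Db2 code)."""
    # Nodes from which the periodic daemon is still awaiting an RPC reply.
    pending_replies: set = field(default_factory=set)

    def recover(self, failed_at_query: set) -> None:
        # Assumption: recovery interrupts only the waits on nodes that were
        # in the failed-node list at the time it was queried.
        self.pending_replies -= failed_at_query


def simulate(second_failure_after_query: bool) -> bool:
    """Return True if the toy deactivation hangs."""
    member = ToyMember(pending_replies={0, 2})  # daemon awaits nodes 0 and 2
    failed = {0}                                # first node failure
    if not second_failure_after_query:
        failed.add(2)                           # second failure seen before the query
    snapshot = set(failed)                      # the "query failed nodes" step
    failed.add(2)                               # second failure (no-op if already seen)
    # The force interrupt issued by the deactivation is suppressed at this
    # point (lower priority than node failure), so it changes nothing here.
    member.recover(snapshot)
    # The daemon can shut down only once it waits on no node; the deactivating
    # subagent in turn waits for that shutdown.
    return bool(member.pending_replies)


if __name__ == "__main__":
    print("late second failure hangs:", simulate(True))
    print("early second failure hangs:", simulate(False))
```

In this model the hang appears only when the second failure
lands after the snapshot is taken, which matches the narrow
timing window described in the text.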

The diag log will have messages indicating that node recovery
was completed for 2 or more members close to the same time.

For example:

2020-12-02-14.06.32.476438+480 E210897765E384        LEVEL: Info
PID     : 36218                TID : 46913088907008  PROC :
db2sysc 1
...
EDUID   : 22                   EDUNAME: db2pdbc 1
FUNCTION: DB2 UDB, base sys utilities, sqleExecuteNodeRecovery,
probe:200
DATA #1 : String, 34 bytes
Node recovery completed for node 0

and

2020-12-02-14.06.32.476438+480 E210897765E384        LEVEL: Info
PID     : 36218                TID : 46913088907008  PROC :
db2sysc 1
...
EDUID   : 22                   EDUNAME: db2pdbc 1
FUNCTION: DB2 UDB, base sys utilities, sqleExecuteNodeRecovery,
probe:200
DATA #1 : String, 34 bytes
Node recovery completed for node 2
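
A check for this log pattern could be scripted roughly as
follows. This is a minimal sketch, assuming the record layout
shown in the examples above (a timestamp at the start of the
first line of each record, and the message text on a later
line). The function names and the 5 second window are
assumptions, since the description only says "close to the same
time". The numeric suffix after the timestamp (+480 in the
examples) is ignored, on the assumption that all records in one
log use the same offset.

```python
import re
from datetime import datetime, timedelta

# Timestamp at the start of a record, e.g. 2020-12-02-14.06.32.476438+480
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})")
DONE_RE = re.compile(r"Node recovery completed for node (\d+)")


def recovery_completions(lines):
    """Yield (timestamp, node) for each 'Node recovery completed' message."""
    current_ts = None
    for line in lines:
        m = TS_RE.match(line)
        if m:
            current_ts = datetime.strptime(m.group(1), "%Y-%m-%d-%H.%M.%S.%f")
            continue
        m = DONE_RE.search(line)
        if m and current_ts is not None:
            yield current_ts, int(m.group(1))


def concurrent_recoveries(lines, window=timedelta(seconds=5)):
    """Return (node_a, node_b, ts_a, ts_b) for distinct nodes recovered within window."""
    events = sorted(recovery_completions(lines))
    hits = []
    for i, (ts_a, node_a) in enumerate(events):
        for ts_b, node_b in events[i + 1:]:
            if ts_b - ts_a > window:
                break  # events are sorted, so later ones are further away
            if node_a != node_b:
                hits.append((node_a, node_b, ts_a, ts_b))
    return hits
```

The function accepts any iterable of lines, so an open file
handle on the diag log can be passed directly. A non-empty
result only shows that the log pattern is present; it does not
by itself prove that the hang occurred.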

db2pd -agents shows only system applications (for example
db2periodic) and one other agent which is driving a database
deactivation. For example, this output shows only the
db2periodic daemon and one other agent:

0x00002AC0E64F7680 78951    [001-13415] 52450      0
Coord    Inst-Active 0                   db2perio 0          0
NotSet SAMPLE*N1.DB2.200708193047
Thu Jul  9 03:30:45
0x00002AB6860BAF00 64778    [000-64778] 34251      0
SubAgent Inst-Active 0                   db2jcc_a 0          0
NotSet SAMPLE 10.134.83.81.64901.201111085015
n/a

The call stack of the system agent(s) will show them blocked
waiting to receive an RPC reply. For example, the db2periodic
daemon may be blocked in a call stack that looks like this:

0x00002AAAAE42C97D _ZN11sqkfChannel13WaitRecvReadyEii + 0x02fd
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE429C28
_ZN11sqkfChannel13ReceiveBufferEPP10sqkfBufferi + 0x0678
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE404897
_ZN18sqkdBdsBufferTable12getNextReplyEP8SQLKD_CB + 0x0077
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE404420
_ZN18sqkdBdsBufferTable13getNextBufferEPP10sqkfBufferP8SQLKD_CB
+ 0x0a00
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE3F8671 address: 0x00002AAAAE3F8671 ; dladdress:
0x00002AAAAAEEA000 ; offset in lib: 0x000000000350E671 ;
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE3F82E1 address: 0x00002AAAAE3F82E1 ; dladdress:
0x00002AAAAAEEA000 ; offset in lib: 0x000000000350E2E1 ;
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE3F3AC0 address: 0x00002AAAAE3F3AC0 ; dladdress:
0x00002AAAAAEEA000 ; offset in lib: 0x0000000003509AC0 ;
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE3F4BDF
_Z17sqlkdReceiveReplyP23SQLKD_RQST_REPLY_FORMAT + 0x04cf
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAF27C907
_Z11sqlrkrpc_nlP8sqlrr_cbiiiPKsP15SQLR_RPCMESSAGEP13SQLO_MEM_POO
LP18SQLR_RPC_REPLY_HDRPbPlmP17SQLR_WLM_BDSREPLY + 0x1827
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAF27A7E2
_Z12sqlrkrpc_allP8sqlrr_cbiP15SQLR_RPCMESSAGEP13SQLO_MEM_POOLPP1
8SQLR_RPC_REPLY_HDRimP17SQLR_WLM_BDSREPLY + 0x1262
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE1382EA sqleRPCSync + 0x039a
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE1312D1
_Z16sqlePeriodicMainP16sqeLocalDatabaseP8sqeAgent + 0x10e1
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE07741F _Z26sqleIndCoordProcessRequestP8sqeAgent +
0x180f

The call stack of the other agent shows that it is performing a
database deactivation and is blocked waiting on the completion
of system applications:

0x00002AAAAEFD9435 sqloWaitEDUWaitPost + 0x02a5
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE11DAB8
_ZN16sqeLocalDatabase13TermDbConnectEP8sqeAgentP5sqlcai + 0x2388
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE0ADF1B
_ZN14sqeApplication12AppStopUsingEP8sqeAgenthP5sqlca + 0x0c3b
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAB04AF0DC _Z24sqleSubAgentNodeRecoveryP8sqeAgent +
0x00bc
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE06C36E address: 0x00002AAAAE06C36E ; dladdress:
0x00002AAAAAEEA000 ; offset in lib: 0x000000000318236E ;
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE06A4BC _Z21sqleProcessSubRequestP8sqeAgent + 0x02ec
                (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
0x00002AAAAE086162 _ZN8sqeAgent6RunEDUEv + 0x04c2

Other agents attempting to connect to the database will be
blocked in StartUsingLocalDatabase, looping and waiting for
database deactivation to complete. For example:

0x00002AAAAE0FCDAB
_ZN8sqeDBMgr23StartUsingLocalDatabaseEP8SQLE_BWAP8sqeAgentRccP8s
qlo_gmtPb + 0x0e7b
0x00002AAAAE0A1F2F
_ZN14sqeApplication13AppStartUsingEP8SQLE_BWAP8sqeAgentccP5sqlca
Pc + 0x043f
0x00002AAAAE0A123A
_Z22sqleSubAgentStartUsingP8sqeAgentP16SQLE_CLIENT_INFO + 0x038a

0x00002AAAAE0B2353
_ZN14sqeApplication22AppSecondaryStartUsingEP8sqeAgentP16SQLE_CL
IENT_INFOP5sqlca + 0x0923
0x00002AAAAE08CFC7 _ZN8sqeAgent12initSubAgentEPi + 0x1f57
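
The three stack signatures above could be matched with a simple
substring check. This is a heuristic sketch only: the marker
symbols are taken from the example stacks in this description,
while the labels and the function names classify_stack() and
matches_hang_pattern() are hypothetical. Only short symbol
fragments are used, so the check still works when long mangled
names are wrapped across lines, as they are in the examples.

```python
PERIODIC = "periodic daemon blocked on RPC reply"
DEACTIVATE = "subagent blocked in database deactivation"
CONNECT = "agent waiting for deactivation to complete"

# Marker symbols copied from the example stacks; all must appear in a stack.
SIGNATURES = {
    PERIODIC: ("sqlePeriodicMain", "sqleRPCSync", "WaitRecvReady"),
    DEACTIVATE: ("sqleSubAgentNodeRecovery", "TermDbConnect", "sqloWaitEDUWaitPost"),
    CONNECT: ("StartUsingLocalDatabase",),
}


def classify_stack(stack_text: str):
    """Return the first label whose markers all occur in stack_text, else None."""
    for label, markers in SIGNATURES.items():
        if all(marker in stack_text for marker in markers):
            return label
    return None


def matches_hang_pattern(stacks) -> bool:
    """True if both the blocked periodic daemon and the blocked deactivation appear."""
    labels = {classify_stack(text) for text in stacks}
    return {PERIODIC, DEACTIVATE} <= labels
```

Each element passed to matches_hang_pattern() is the text of one
agent's stack. Agents blocked in StartUsingLocalDatabase are
classified but not required, since they are a consequence of the
hang rather than its cause.
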
Problem Summary:
****************************************************************
* USERS AFFECTED:                                              *
* pureScale                                                    *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 v11.5.7 or later.                             *
****************************************************************
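
The recommendation can be turned into a simple version check. A
minimal sketch, assuming the installed level is available as a
dotted version string such as "11.5.7.0"; parse_version() and
has_fix() are hypothetical helper names. The check only encodes
"v11.5.7 or later" and says nothing about special builds or
fixes delivered on other release streams.

```python
import re

FIRST_FIXED = (11, 5, 7)  # first level containing the fix, per this APAR


def parse_version(text: str) -> tuple:
    """Extract (version, release, modification) from a string like 'v11.5.7.0'."""
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"no dotted version found in {text!r}")
    return tuple(int(part) for part in m.groups())


def has_fix(version_text: str) -> bool:
    """True if the given level is v11.5.7 or later."""
    return parse_version(version_text) >= FIRST_FIXED
```

For example, has_fix("v11.5.6.0") is False and
has_fix("v11.5.7.0") is True.
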
Local Fix:
Solution
Workaround
Comment
The problem was first fixed in Db2 v11.5.7.
Timestamps
Date  - problem reported    : 27.01.2021
Date  - problem closed      : 03.11.2021
Date  - last modified       : 03.11.2021