DB2 - Problem description
Problem IT35720 | Status: Closed
PURESCALE MAY HANG WHEN THERE ARE 2 OR MORE CONCURRENT NODE FAILURES AND ONE OF THE NODE FAILURES CAUSES A DATABASE DEACTIVATION.
Product:
DB2 FOR LUW / DB2FORLUW / B50 - DB2
Problem description:
pureScale may hang when two or more concurrent node failures occur and one of them drives a database deactivation. The database deactivation tries to force system applications such as db2periodic, but this force interrupt is suppressed (as expected) because it has a lower interrupt priority than node failure and arrives at the same time as node failure recovery. However, no interrupt is then sent for the second node failure. The periodic daemon remains blocked indefinitely, waiting for a reply from one of the failed nodes, while the deactivate subagent blocks forever waiting for the forced periodic daemon to shut down.

The timing needed to hit this is very tight (at least in the original reproduction): the second node failure has to occur while the first is being handled in nodeFailureRecovery(), and after that function has queried the failed nodes.

The diag log will have messages indicating that node recovery was completed for 2 or more members close to the same time. For example:

  2020-12-02-14.06.32.476438+480 E210897765E384 LEVEL: Info
  PID : 36218 TID : 46913088907008 PROC : db2sysc 1
  ...
  EDUID : 22 EDUNAME: db2pdbc 1
  FUNCTION: DB2 UDB, base sys utilities, sqleExecuteNodeRecovery, probe:200
  DATA #1 : String, 34 bytes
  Node recovery completed for node 0

and

  2020-12-02-14.06.32.476438+480 E210897765E384 LEVEL: Info
  PID : 36218 TID : 46913088907008 PROC : db2sysc 1
  ...
  EDUID : 22 EDUNAME: db2pdbc 1
  FUNCTION: DB2 UDB, base sys utilities, sqleExecuteNodeRecovery, probe:200
  DATA #1 : String, 34 bytes
  Node recovery completed for node 2

db2pd -agents shows only system applications (for example db2periodic) and one other agent which is driving a database deactivation.
For example, this output shows only the db2periodic daemon and one other agent:

  0x00002AC0E64F7680 78951 [001-13415] 52450 0 Coord Inst-Active 0 db2perio 0 0 NotSet SAMPLE*N1.DB2.200708193047 Thu Jul 9 03:30:45
  0x00002AB6860BAF00 64778 [000-64778] 34251 0 SubAgent Inst-Active 0 db2jcc_a 0 0 NotSet SAMPLE 10.134.83.81.64901.201111085015 n/a

The call stack of the system agent(s) will be blocked waiting to receive an RPC reply. For example, the db2periodic daemon may be blocked in a call stack that looks like this:

  0x00002AAAAE42C97D _ZN11sqkfChannel13WaitRecvReadyEii + 0x02fd (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE429C28 _ZN11sqkfChannel13ReceiveBufferEPP10sqkfBufferi + 0x0678 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE404897 _ZN18sqkdBdsBufferTable12getNextReplyEP8SQLKD_CB + 0x0077 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE404420 _ZN18sqkdBdsBufferTable13getNextBufferEPP10sqkfBufferP8SQLKD_CB + 0x0a00 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE3F8671 address: 0x00002AAAAE3F8671 ; dladdress: 0x00002AAAAAEEA000 ; offset in lib: 0x000000000350E671 ; (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE3F82E1 address: 0x00002AAAAE3F82E1 ; dladdress: 0x00002AAAAAEEA000 ; offset in lib: 0x000000000350E2E1 ; (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE3F3AC0 address: 0x00002AAAAE3F3AC0 ; dladdress: 0x00002AAAAAEEA000 ; offset in lib: 0x0000000003509AC0 ; (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE3F4BDF _Z17sqlkdReceiveReplyP23SQLKD_RQST_REPLY_FORMAT + 0x04cf (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAF27C907 _Z11sqlrkrpc_nlP8sqlrr_cbiiiPKsP15SQLR_RPCMESSAGEP13SQLO_MEM_POOLP18SQLR_RPC_REPLY_HDRPbPlmP17SQLR_WLM_BDSREPLY + 0x1827 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAF27A7E2 _Z12sqlrkrpc_allP8sqlrr_cbiP15SQLR_RPCMESSAGEP13SQLO_MEM_POOLPP18SQLR_RPC_REPLY_HDRimP17SQLR_WLM_BDSREPLY + 0x1262 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE1382EA sqleRPCSync + 0x039a (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE1312D1 _Z16sqlePeriodicMainP16sqeLocalDatabaseP8sqeAgent + 0x10e1 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE07741F _Z26sqleIndCoordProcessRequestP8sqeAgent + 0x180f

The call stack of the other agent shows that it is performing a database deactivation and is blocked waiting on the completion of system applications:

  0x00002AAAAEFD9435 sqloWaitEDUWaitPost + 0x02a5 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE11DAB8 _ZN16sqeLocalDatabase13TermDbConnectEP8sqeAgentP5sqlcai + 0x2388 (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE0ADF1B _ZN14sqeApplication12AppStopUsingEP8sqeAgenthP5sqlca + 0x0c3b (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAB04AF0DC _Z24sqleSubAgentNodeRecoveryP8sqeAgent + 0x00bc (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE06C36E address: 0x00002AAAAE06C36E ; dladdress: 0x00002AAAAAEEA000 ; offset in lib: 0x000000000318236E ; (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE06A4BC _Z21sqleProcessSubRequestP8sqeAgent + 0x02ec (/home/db2sdin1/sqllib/lib64/libdb2e.so.1)
  0x00002AAAAE086162 _ZN8sqeAgent6RunEDUEv + 0x04c2

Other agents attempting to connect to the database will block in StartUsingLocalDatabase, looping and waiting for database deactivation to complete. For example:

  0x00002AAAAE0FCDAB _ZN8sqeDBMgr23StartUsingLocalDatabaseEP8SQLE_BWAP8sqeAgentRccP8sqlo_gmtPb + 0x0e7b
  0x00002AAAAE0A1F2F _ZN14sqeApplication13AppStartUsingEP8SQLE_BWAP8sqeAgentccP5sqlcaPc + 0x043f
  0x00002AAAAE0A123A _Z22sqleSubAgentStartUsingP8sqeAgentP16SQLE_CLIENT_INFO + 0x038a
  0x00002AAAAE0B2353 _ZN14sqeApplication22AppSecondaryStartUsingEP8sqeAgentP16SQLE_CLIENT_INFOP5sqlca + 0x0923
  0x00002AAAAE08CFC7 _ZN8sqeAgent12initSubAgentEPi + 0x1f57
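
The diag log symptom described above (node recovery completing for two or more members at nearly the same time) can be checked with a short script. The following is only a minimal sketch and not an IBM tool. It assumes the diag log has one record header per line starting with the timestamp format shown in the example, and that the message text reads "Node recovery completed for node N". The function names and the 5-second window are hypothetical choices for illustration, and the time zone suffix (for example +480) is ignored.

```python
import re
from datetime import datetime, timedelta

# Record header timestamp as shown in the example, e.g. 2020-12-02-14.06.32.476438+480
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})")
# Message text taken from the example diag log entries.
MSG_RE = re.compile(r"Node recovery completed for node\s+(\d+)")


def recovery_events(lines):
    """Yield (timestamp, node) for each 'Node recovery completed' message."""
    current_ts = None
    for line in lines:
        ts_match = TS_RE.match(line)  # a new record starts with a timestamp
        if ts_match:
            current_ts = datetime.strptime(
                ts_match.group(1), "%Y-%m-%d-%H.%M.%S.%f"
            )
        msg_match = MSG_RE.search(line)
        if msg_match and current_ts is not None:
            yield current_ts, int(msg_match.group(1))


def concurrent_recoveries(lines, window_seconds=5.0):
    """Return (node_a, node_b, ts_a, ts_b) for recoveries of different nodes
    that completed within window_seconds of each other.
    The window size is an assumption, not a documented threshold."""
    events = sorted(recovery_events(lines))
    window = timedelta(seconds=window_seconds)
    hits = []
    for i, (ts_a, node_a) in enumerate(events):
        for ts_b, node_b in events[i + 1:]:
            if ts_b - ts_a > window:
                break  # events are sorted, so later ones are even further away
            if node_a != node_b:
                hits.append((node_a, node_b, ts_a, ts_b))
    return hits


# Usage (the path is a placeholder for the real diag log location):
#   with open("db2diag.log", errors="replace") as f:
#       for node_a, node_b, ts_a, ts_b in concurrent_recoveries(f):
#           print(f"nodes {node_a} and {node_b} recovered at {ts_a} / {ts_b}")
```

A match only indicates that the timing precondition was present. The db2pd -agents output and the call stacks shown above are still needed to confirm the hang.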
Problem Summary:
****************************************************************
* USERS AFFECTED:
* purescale
****************************************************************
* PROBLEM DESCRIPTION:
* See Error Description
****************************************************************
* RECOMMENDATION:
* Upgrade to Db2 v11.5.7 or later.
****************************************************************
Local Fix:
Solution
Workaround
Comment
The problem was first fixed in Db2 v11.5.7.
Timestamps
Date - problem reported : 27.01.2021
Date - problem closed : 03.11.2021
Date - last modified : 03.11.2021
Problem solved at the following versions (IBM BugInfos)
Problem solved according to the fixlist(s) of the following version(s)