DB2 - Problem description
Problem IT40847 | Status: Closed |
ENDLESS ITERATION OF DB2 CLEANUP AND KILL PROCESSES MAKE DB2 PURESCALE CLUSTER HANG | |
product: | |
DB2 FOR LUW / DB2FORLUW / B50 - DB2 | |
Problem description: | |
On AIX operating systems, a process can become stuck in the "EXITING" state in the kernel. In this state, it cannot be killed with a kill signal. If the db2sysc process cannot be terminated by the SIGKILL signal, the db2rocm CLEANUP and KILL processes are interrupted by the SIGALRM signal (time expired). In this situation, the TSA CLEANUP task is reissued repeatedly until the system is rebooted, and the member is not started on another host in restart light mode. In the meantime, all applications become stuck waiting for database objects that would otherwise be cleaned up by member crash recovery during restart light.

In this situation, messages similar to the following are logged in db2diag.log:

2019-05-05-20.00.56.369398+540 I58987522A827 LEVEL: Event
PID : 19136798 TID : 1 PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1 NODE : 000
HOSTNAME: member00
EDUID : 1 EDUNAME: db2rocm 0 [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
-11337922
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x090000000E0D5FE0 sqlossig + 0xA0
[1] 0x00000001000203C0 sqlhaKillProcesses__FP18SQLHA_PROCESS_INFOUlbT2T3 + 0x8E0
[2] 0x00000001000144DC sqlhaDB2KillNode + 0xE3C
[3] 0x000000010000C120 rocmDB2Cleanup + 0x10A0
[4] 0x0000000100004080 main + 0x1820
[5] 0x00000001000002F8 __start + 0x70

2019-05-05-20.03.26.369026+540 I58998026A1507 LEVEL: Warning
PID : 19136798 TID : 1 PROC : db2rocm 0 [db2inst1]
INSTANCE: db2inst1 NODE : 000
HOSTNAME: member00
EDUID : 1 EDUNAME: db2rocm 0 [db2inst1]
FUNCTION: DB2 UDB, high avail services, rocmSignalsForTimeoutOffline, probe:411
MESSAGE : Received signal during CLEANUP - exiting with return code 12.
DATA #1 : String, 7 bytes
SIGALRM
DATA #2 : ROCM Action, PD_TYPE_ROCM_ACTION, 2103568 bytes
action->version: 1
action->actor->actorType: DB2
action->actor->actorID: 0
action->actor->instName: db2inst1
action->actor->hostname: NOT_POPULATED
action->actor->options: NONE
action->command: CLEANUP
DATA #3 : PGRP File Contents, PD_TYPE_SQLO_PGRP_FILE_CONTENTS, 3224 bytes
pgrpFile->iPgrpFileVersion : 2225
pgrpFile->iPgrpId : 11337922
pgrpFile->iWdogPgrpId : 12517570
pgrpFile->iSubPgrpId : NOT_INITIALIZED
pgrpFile->iIndex : 0
pgrpFile->iNumber : 0
pgrpFile->iMonitorOverride : 0
pgrpFile->crashCounter : 0
pgrpFile->firstCrashTimeSeconds : 1970-01-01 09:00:00.000000
pgrpFile->monitorTimeoutCounter : 0
pgrpFile->firstMonitorTimeoutSeconds : 1970-01-01 09:00:00.000000
pgrpFile->lastMonitorTimeoutSeconds : 1970-01-01 09:00:00.000000
pgrpFile->hostname : member00
pgrpFile->iNumHCAs : 0
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x0000000100006EB4 rocmSignalsForTimeoutOffline + 0xAF4
[1] 0x0000000000000000 ?unknown + 0x0

2019-05-05-20.03.26.623617+540 I59000696A890 LEVEL: Event
PID : 46924020 TID : 1 PROC : db2rocme 0 [db2inst1]
INSTANCE: db2inst1 NODE : 000
HOSTNAME: member00
EDUID : 1 EDUNAME: db2rocme 0 [db2inst1]
FUNCTION: DB2 UDB, oper system services, sqlossig, probe:10
MESSAGE : Sending SIGKILL to the following process id
DATA #1 : signed integer, 4 bytes
-11337922
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x090000000E0D5FE0 sqlossig + 0xA0
[1] 0x00000001001002C0 sqlhaKillProcesses__FP18SQLHA_PROCESS_INFOUlbT2T3 + 0x8E0
[2] 0x00000001000FC6CC sqlhaDB2KillNode + 0xE4C
[3] 0x000000010000FAD8 rocmDB2Notify + 0x2F8
[4] 0x000000010010322C rocmCommandRetryUntilFailure + 0x162C
[5] 0x0000000100003F00 main + 0x16A0
[6] 0x00000001000002F8 __start + 0x70

2019-05-05-20.03.56.620065+540 I59003951A1646 LEVEL: Warning
PID : 46924020 TID : 1 PROC : db2rocme 0 [db2inst1]
INSTANCE: db2inst1 NODE : 000
HOSTNAME: member00
EDUID : 1 EDUNAME: db2rocme 0 [db2inst1]
FUNCTION: DB2 UDB, high avail services, rocmSignalsForTimeoutOffline, probe:426
MESSAGE : Received signal during KILL event - exiting with return code 13.
DATA #1 : String, 7 bytes
SIGALRM
DATA #2 : ROCM Action, PD_TYPE_ROCM_ACTION, 2103568 bytes
action->version: 1
action->actor->actorType: DB2
action->actor->actorID: 0
action->actor->instName: db2inst1
action->actor->hostname: NOT_POPULATED
action->actor->options: NONE
action->command: NOTIFY
action->notification->version: 1111
action->notification->eventType: KILL
action->notification->actor->actorType: DB2
action->notification->actor->actorID: 0
action->notification->actor->instName: db2inst1
action->notification->actor->hostname: member01
action->notification->actor->options: NONE
action->notification->sequenceNumber: 214 (0x00000000000000d6)
action->notification->eventWhitelistFlags: NONE
action->notification->bNotifSent: false
action->notification->retryNum: 0
action->notification->eventWhitelistFlagsToChange: 0
action->notification->options: FORCE
DATA #3 : PGRP File Contents, PD_TYPE_SQLO_PGRP_FILE_CONTENTS, 3224 bytes
Object not dumped: Address: 0x0000000000000000 Size: 3224 Reason: Address is NULL
CALLSTCK: (Static functions may not be resolved correctly, as they are resolved to the nearest symbol)
[0] 0x00000001000084EC rocmSignalsForTimeoutOffline + 0xA2C
[1] 0x0000000000000000 ?unknown + 0x0
...

If one of the event recorder files is formatted with the db2fdump command, the following message indicates that the process is stuck in the EXITING state:

7445 Event sequence number: 0 Time: 2019-05-05-13.03.26.350648433
sqlhaVerifyProcessExists (3.115.49.0.748)
PID: TID: EDUID: APPHDL:
Data1 (PD_TYPE_SQLHA_ER_PDINFO,80) SQLHA Event Recorder header data (struct sqlhaErPdInfo):
m_pTimeStamp: N/A
m_LogDestination: 0
m_PdFlags: 1
m_FunctionId: 462946353 (sqlhaVerifyProcessExists)
m_ErrorCode: 0 = 0
m_Probe: 748
m_Level: 4
Data2 (PD_TYPE_MESSAGE,46) Message String: Process is in EXITING state - returning ONLINE
Data3 (PD_TYPE_PROCESS_ID,4) Process ID: 11337922
Data4 (PD_TYPE_STRING,9) String: db2sysc 0
Data5 (PD_TYPE_UINT,8) unsigned integer: 0
Data6 (PD_TYPE_MESSAGE,39) Message String: Setting ROCM_ACTION_FLAGS_DUMP_HA_EVENT | |
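The log signature described above can be checked for mechanically. The following is a minimal sketch, not an IBM-provided tool: it assumes the db2diag.log text (and, separately, the db2fdump output) has already been read into a string, and the function names `split_records`, `summarize`, and `has_exiting_marker` are hypothetical helpers introduced only for illustration. The message fragments it searches for are copied from the excerpts in this problem description.

```python
import re

# Message fragments copied verbatim from the excerpts in this APAR.
KILL_MSG = "Sending SIGKILL to the following process id"
EXITING_MSG = "Process is in EXITING state - returning ONLINE"
TIMEOUT_RE = re.compile(
    r"Received signal during (CLEANUP|KILL event) - exiting with return code \d+"
)
# A db2diag.log record begins with a timestamp such as
# 2019-05-05-20.00.56.369398+540 (assumed format, taken from the excerpt).
RECORD_START_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6}[+-]\d{3}"
)


def split_records(text):
    """Split db2diag.log text into records, one per leading timestamp."""
    starts = [m.start() for m in RECORD_START_RE.finditer(text)]
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]


def summarize(text):
    """Count SIGKILL attempts and SIGALRM timeouts (per phase) in the log."""
    summary = {"sigkill": 0, "timeouts": {}}
    for record in split_records(text):
        if KILL_MSG in record:
            summary["sigkill"] += 1
        match = TIMEOUT_RE.search(record)
        if match and "SIGALRM" in record:
            phase = match.group(1).split()[0]  # "CLEANUP" or "KILL"
            summary["timeouts"][phase] = summary["timeouts"].get(phase, 0) + 1
    return summary


def has_exiting_marker(fdump_text):
    """Return True if formatted event recorder output reports EXITING state."""
    return EXITING_MSG in fdump_text
```

Repeated SIGKILL records followed by SIGALRM timeouts for both the CLEANUP and KILL phases, together with the EXITING marker in the event recorder output, would be consistent with the condition described here; the counts alone do not prove it.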
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* AIX pureScale user                                           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 11.5m4fp0 or higher                           *
**************************************************************** | |
Local Fix: | |
Reboot the host on which the unkillable processes remain and whose db2diag.log contains messages like those shown above. | |
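Before rebooting, it may help to confirm that the process really did survive SIGKILL. The sketch below uses the standard POSIX technique of sending signal 0, which performs only an existence and permission check. It is an illustration under assumptions, not part of the fix: `process_exists` is a hypothetical helper, and it assumes that a process stuck in the EXITING state still appears in the kernel process table (as the "returning ONLINE" event recorder message suggests). It also cannot tell an exiting process from an ordinary running or zombie one.

```python
import os


def process_exists(pid):
    """Return True if a process with this PID is still known to the kernel."""
    try:
        os.kill(pid, 0)  # signal 0: nothing is delivered, only checks are done
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

For example, `process_exists(11337922)` could be run on the affected host with the db2sysc PID taken from the log; a result of True some time after the SIGKILL records is consistent with the hang described here.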
Solution | |
Workaround | |
****************************************************************
* USERS AFFECTED:                                              *
* AIX pureScale user                                           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Upgrade to Db2 11.5m4fp0 or higher                           *
**************************************************************** | |
Comment | |
First fixed in Db2 11.5m4fp0 | |
Timestamps | |
Date - problem reported : | 05.05.2022 |
Date - problem closed : | 19.05.2022 |
Date - last modified : | 19.05.2022 |
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) |