DB2 - Problem description
Problem IT22123 | Status: Closed |
PURESCALE MEMBER HITS AN FODC AND RESTARTS AFTER PRIMARY CF FAILOVER | |
product: | |
DB2 FOR LUW / DB2FORLUW / A50 - DB2 | |
Problem description: | |
## Problem description
A DB2 pureScale member hits an FODC and restarts after a CF failover.

## Diagnostic information
During a CF failover scenario, a standby CF takes over the primary role. A member receives a RECONSTRUCT notification:

2017-08-18-17.30.48.391876+480 I2841217A1267 LEVEL: Event
PID : 5112136 TID : 1 PROC : db2rocme 1 [db2inst1]
INSTANCE: db2inst1 NODE : 001
HOSTNAME: hosta
EDUID : 1 EDUNAME: db2rocme 1 [db2inst1]
FUNCTION: DB2 UDB, high avail services, rocmCommandRetryUntilFailure, probe:109
MESSAGE : Sending ROCM notification.
DATA #1 : ROCM Notification, PD_TYPE_ROCM_NOTIFICATION, 368 bytes
notification->version: 1111
notification->eventType: RECONSTRUCT
notification->actor->actorType: PRIMARY
notification->actor->actorID: 900
notification->actor->underlyingActorID: 129

During reconstruction, the member is supposed to close its connections to the restarted CF. In this case, however, the reconstruction times out after 100 seconds:

2017-08-18-17.32.28.026664+480 E2844960A2834 LEVEL: Severe
PID : 5112136 TID : 1 PROC : db2rocme 1 [db2inst1]
INSTANCE: db2inst1 NODE : 001
HOSTNAME: hostA
EDUID : 1 EDUNAME: db2rocme 1 [db2inst1]
FUNCTION: DB2 UDB, high avail services, rocmCommandRetryUntilFailure, probe:709
MESSAGE : ZRC=0x82000199=-2113928807=SQLZ_RC_HA_NOTIFICATION_RETRY
"The given whitelist transition is valid, but can not be completed at the present time."
DATA #1 : Codepath, 8 bytes
1:2:3:4:20:32:47:48:57:58
DATA #2 : ROCM Actor, PD_TYPE_ROCM_ACTOR, 304 bytes
actor->actorType: DB2
actor->actorID: 1
actor->instName: db2inst1
actor->hostname: NOT_POPULATED
actor->options: NONE
DATA #3 : Database Partition Number, PD_TYPE_NODE, 2 bytes
0
DATA #4 : signed integer, 4 bytes
0
DATA #5 : ROCM Notification, PD_TYPE_ROCM_NOTIFICATION, 368 bytes
notification->version: 1111
notification->eventType: RECONSTRUCT
notification->actor->actorType: PRIMARY
notification->actor->actorID: 900

As a result, DB2 member 1 hits an FODC.
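The roughly 100-second gap between the two db2diag.log entries can be checked by subtracting their timestamps. The following is a minimal Python sketch, not an IBM tool. It assumes that the trailing `+480` is the UTC offset in minutes, and the helper name `parse_db2diag_timestamp` is hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Timestamp as it appears at the start of a db2diag.log record,
# e.g. 2017-08-18-17.30.48.391876+480 (offset assumed to be minutes from UTC).
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}-\d{2}\.\d{2}\.\d{2}\.\d{6})([+-]\d+)$")


def parse_db2diag_timestamp(text):
    """Convert a db2diag-style timestamp into a timezone-aware datetime."""
    match = _TS.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized timestamp: {text!r}")
    naive = datetime.strptime(match.group(1), "%Y-%m-%d-%H.%M.%S.%f")
    offset = timedelta(minutes=int(match.group(2)))
    return naive.replace(tzinfo=timezone(offset))


# Timestamps copied from the two records in this report.
start = parse_db2diag_timestamp("2017-08-18-17.30.48.391876+480")  # RECONSTRUCT sent
end = parse_db2diag_timestamp("2017-08-18-17.32.28.026664+480")    # retry failure logged
elapsed = (end - start).total_seconds()
print(f"elapsed between notification and failure: {elapsed:.3f} s")
```

An elapsed time close to 100 seconds between the probe:109 and probe:709 records is consistent with the reconstruction timeout described in this report.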
Inside the FODC directory, the member's connection pool state is dumped. To identify this issue, check whether Active List Size is greater than 0, which means that a connection was not cleaned up.

db2pd.1.cfpool.txt:

Shared connection pool information for CF[128]
...
Active List Size: 1
...
CF Pool Entries (Legend: U A S D S W R R = Used, Async, Sent, Dedicated, SLS, WAR, RLS, RAR):
----------------------------
#   CF Netname  Mbr DevName  List  Use Cnt    Flag  U A S D S W R R  PID       EDU     Mem Address         Function             Probe
--- ----------  -----------  ----  ---------  ----  - - - - - - - -  --------  ------  ------------------  -------------------  -----
0   neta        hca3         3     129387159  0001  Y N N N N N N N  15663144  363125  0x0a00020696dc0080  sqlbgbCastout        2299
1   netb        hca0         -1    139269382  0000  N N N N N N N N  15663144  574572  0x0a00020696a60080  sqlpLLMSetLockState  340
2   netc        hca1         -1    140648402  0000  N N N N N N N N  15663144  814362  0x0a00020696e80080  sqlpLLMSetLockState  340
3   netd        hca0         -1    153163535  0000  N N N N N N N N  15663144  701532  0x0a000204fb510080  sqlpLLMSetLockState  340

From the above output, the EDU holding the connection is 363125, inside function sqlbgbCastout. The following is the stack of that EDU:

-------Frame------ ------Function + Offset------
0x0900000000112F14 thread_wait + 0x94
0x090000000C91D930 sqloWaitEDUWaitPost + 0x254
0x090000000D4FFB30 SAL_WaitForPrimaryArrival__14@82@SAL_CA_KEYFCUiCUl + 0x290
0x090000000D4390C0 SAL_ReadSA__13SAL_SA_HANDLEFCP10PsPageNameCPUlCUlCUiT3 + 0x19A8
0x090000000C1B9968 SAL_RefreshGlobalCastoutDrainState__14SAL_GBP_HANDLEFv + 0x19C
0x090000000C1BA7D8 SAL_ReadForCastoutMult__14SAL_GBP_HANDLEFR35SAL_CASTOUT_CLASS_AND_PROGRESS_INFOCP18SAL_CASTOUT_COOKIECUlCPP9SQLB_PAGEP13PsCastoutNameCPCUlRC10PsPageNameCUsCPUiN29CUiT3 + 0xD68
0x090000000C1B8DEC sqlbgbCastout__FP12SQLB_CLNR_CBb + 0x2978
0x090000000CCD3A04 sqlbClnrGatherPages__FP12SQLB_CLNR_CB + 0x24
0x090000000CCD26EC sqlbClnrEntryPoint__FP12sqbPgClnrEdu + 0x904
0x090000000BB8558C RunEDU__12sqbPgClnrEduFv + 0x40
0x090000000C4F7DA8 EDUDriver__9sqzEDUObjFv + 0x3F4
0x090000000C1892D4 sqloEDUEntry + 0x3A0

The member is automatically restarted after the FODC and returns to a normal state. No further action is required from the DBA.
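The check described in this report (Active List Size greater than 0, plus a pool entry whose Used flag is Y) can be scripted against the dumped cfpool file. The following is a minimal Python sketch under stated assumptions: the column order is inferred from the single dump shown in this report, the function name `find_held_connections` and the dictionary keys are hypothetical, and the sample text is a shortened copy of that dump rather than output from another system.

```python
import re

# Row layout assumed from the dump in this report:
# index, CF netname, device name, list, use count, flag,
# eight Y/N flags (U A S D S W R R), PID, EDU, address, function, probe.
_ROW = re.compile(
    r"(?<!\S)(\d+)\s+(\S+)\s+(\S+)\s+(-?\d+)\s+(\d+)\s+(\d{4})\s+"
    r"((?:[YN]\s+){8})(\d+)\s+(\d+)\s+(0x[0-9a-fA-F]+)\s+(\S+)\s+(\d+)"
)
_ACTIVE = re.compile(r"Active List Size:\s*(\d+)")


def find_held_connections(text):
    """Return (active_list_size, entries whose Used flag is Y)."""
    active = _ACTIVE.search(text)
    active_size = int(active.group(1)) if active else None
    held = []
    for m in _ROW.finditer(text):
        flags = m.group(7).split()
        if flags[0] == "Y":  # first flag in the legend is U = Used
            held.append({
                "index": int(m.group(1)),
                "netname": m.group(2),
                "edu": int(m.group(9)),
                "function": m.group(11),
                "probe": int(m.group(12)),
            })
    return active_size, held


# Shortened sample copied from db2pd.1.cfpool.txt in this report.
SAMPLE = """Shared connection pool information for CF[128]
Active List Size: 1
0 neta hca3 3 129387159 0001 Y N N N N N N N 15663144 363125 0x0a00020696dc0080 sqlbgbCastout 2299
1 netb hca0 -1 139269382 0000 N N N N N N N N 15663144 574572 0x0a00020696a60080 sqlpLLMSetLockState 340
"""

active_size, held = find_held_connections(SAMPLE)
if active_size and held:
    for entry in held:
        print(f"EDU {entry['edu']} still holds a connection in "
              f"{entry['function']} (probe {entry['probe']})")
```

The EDU number reported this way identifies which stack in the FODC directory to inspect. The regular expression scans the whole text, so it works whether or not the rows are still on separate lines.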
Problem Summary: | |
****************************************************************
* USERS AFFECTED:                                              *
* db2 pureScale                                                *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Update to next release if it contains the fix.               *
****************************************************************
Local Fix: | |
Solution | |
Workaround | |
not known / see Local fix | |
Timestamps | |
Date - problem reported : 23.08.2017
Date - problem closed : 18.07.2018
Date - last modified : 18.07.2018
Problem solved at the following versions (IBM BugInfos) | |
Problem solved according to the fixlist(s) of the following version(s) |