DB2 - Problem description
| Problem IT05398 | Status: Closed |
RSCT WILL NOT NOTIFY DB2 THAT THE PORT IS DOWN WHEN WE MOVE THAT PORT TO ANOTHER DIFFERENT VLAN | |
| product: | |
DB2 FOR LUW / DB2FORLUW / A50 - DB2 | |
| Problem description: | |
Scenario :
All ports of the servers belong to the same VLAN (e.g.
VLAN10). If the RoCE0 port of one member (e.g. member0) is moved
to a different VLAN (e.g. VLAN11), then after about 5 minutes
database connections hang on the remaining members (e.g. member1
and member2), while member0 continues to work normally.
EDUs on member1 and member2 are waiting for
000E0000000000000000000076 SQLP_VALLOCK. The holder is member0.
This causes the hang on member1 and member2.
In the db2diag.log file for member0, db2CFConnPoolMgr 0
repeatedly logs sqleCaCeConnect, probe:720 and
sqleSingleCaCreateNewConnec, probe:2135 while trying to connect
to the PRIMARY CF from device hba0. The entries report that
PsConnect failed, and that RSCT detected the port state as
online but an error was encountered:
2014-10-17-10.20.53.827735+480 I422919A2148 LEVEL:
Severe
PID : 16580738 TID : 24461 PROC :
db2sysc 0
INSTANCE: instance NODE : 000
HOSTNAME: host
EDUID : 24461 EDUNAME: db2CFConnPoolMgr 0
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for
CF, SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:720
MESSAGE : CA RC= 2148073473
DATA #1 : String, 17 bytes
PsConnect failed.
DATA #2 : PsToken_t, PD_TYPE_SD_PSTOKEN, 152 bytes
Eye Catcher = CATOKEN
CF Server Info :
- Unique Sequence Number = 187 (0xbb)
- Port Number = 56001
- Node Identifier = 1
- Instance Identifier = 0
- Netname = netname-ib0
Local Member Info :
- uDAPL Device = ib0
Transport Type = UDAPL (0x1)
Cmd Connection Use Types = NORMAL (0x0)
DATA #3 : SAL CF Server Name, PD_TYPE_SAL_CF_SERVER_NAME, 13
bytes
host
DATA #4 : SAL Member Device Name,
PD_TYPE_SAL_MEMBER_DEVICE_NAME, 4 bytes
ib0
DATA #5 : CF Retry Position, PD_TYPE_SAL_RETRY_COUNTER, 8 bytes
10
DATA #6 : unsigned integer, 8 bytes
1
CALLSTCK: (Static functions may not be resolved correctly, as
they are resolved to the nearest symbol)
[0] 0x09000000063B9D84
sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLE
FCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1
+ 0x42C
[1] 0x09000000063B9E04
sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLE
FCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1
+ 0x4AC
[2] 0x0900000006339C7C
sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLE
FCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1
+ 0xB70
[3] 0x090000000502CB64
sqleSingleCaGrowPool__21SQLE_SINGLE_CA_HANDLEFCUlT1C17SAL_ADAPTE
R_INDEX + 0x6CC
[4] 0x0900000007AD9654 sqleCFConnPoolMgrEntry__FPUcUi + 0x5C8
[5] 0x0900000007ACEC90 sqleCFConnPoolMgrEntry__FPUcUi + 0x1B4
[6] 0x0900000007ACE678 sqleCFConnPoolMgrEntry__FPUcUi + 0x110
[7] 0x090000000644F9F0 sqloEDUEntry + 0x4B8
[8] 0x0900000000782E10 _pthread_body + 0xF0
[9] 0xFFFFFFFFFFFFFFFC ?unknown + 0xFFFFFFFF
2014-10-17-10.20.53.830449+480 I425068A1808 LEVEL:
Warning
PID : 16580738 TID : 24461 PROC :
db2sysc 0
INSTANCE: instance NODE : 000
HOSTNAME: host
EDUID : 24461 EDUNAME: db2CFConnPoolMgr 0
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for
CF, SQLE_SINGLE_CA_HANDLE::sqleSingleCaCreateNewConnec,
probe:2135
MESSAGE : Port state detected by RSCT to be online, but
encountered error
establishing a uDAPL connection. Netname, m_whichCa,
numOfflineAdapters, numConsecutiveFailures, CF node
num,
numConnections, bInitialConnections
DATA #1 : SAL CF Server Name, PD_TYPE_SAL_CF_SERVER_NAME, 13
bytes
host
DATA #2 : SAL Member Device Name,
PD_TYPE_SAL_MEMBER_DEVICE_NAME, 4 bytes
ib0
DATA #3 : SAL CF Index, PD_TYPE_SAL_CF_INDEX, 8 bytes
2
DATA #4 : unsigned integer, 8 bytes
1
DATA #5 : unsigned integer, 8 bytes
0
DATA #6 : SAL CF Node Number, PD_TYPE_SAL_CF_NODE_NUM, 2 bytes
129
DATA #7 : unsigned integer, 8 bytes
1
DATA #8 : Boolean, 8 bytes
false
DATA #9 : Codepath, 8 bytes
6:14:16
CALLSTCK: (Static functions may not be resolved correctly, as
they are resolved to the nearest symbol)
[0] 0x090000000633AAA8
sqleSingleCaCreateNewConnectionsForPool__21SQLE_SINGLE_CA_HANDLE
FCUlR12sqzDataChainXT18SQLE_CA_CONN_ENTRYT16sqzChainNodeBaseXT1
+ 0x199C
[1] 0x090000000502CB64
sqleSingleCaGrowPool__21SQLE_SINGLE_CA_HANDLEFCUlT1C17SAL_ADAPTE
R_INDEX + 0x6CC
[2] 0x0900000007AD9654 sqleCFConnPoolMgrEntry__FPUcUi + 0x5C8
[3] 0x0900000007ACEC90 sqleCFConnPoolMgrEntry__FPUcUi + 0x1B4
[4] 0x0900000007ACE678 sqleCFConnPoolMgrEntry__FPUcUi + 0x110
[5] 0x090000000644F9F0 sqloEDUEntry + 0x4B8
[6] 0x0900000000782E10 _pthread_body + 0xF0
[7] 0xFFFFFFFFFFFFFFFC ?unknown + 0xFFFFFFFF
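Because the symptom is the same pair of probes repeating, it can help to tally how often each function/probe pair appears in the log. The following is a minimal sketch, not an IBM tool: the helper name `count_probes` and the sample text are illustrative assumptions, and the pattern only relies on the "function, probe:NNN" shape visible in the entries above (including the case where the line wraps before "probe:").

```python
import re
from collections import Counter

# Matches "Class::function, probe:NNN", tolerating a line wrap before "probe:".
PROBE_RE = re.compile(r"(\w+(?:::\w+)*),\s*probe:(\d+)")

def count_probes(diag_text):
    """Return a Counter keyed by (function, probe number)."""
    return Counter((fn, int(probe)) for fn, probe in PROBE_RE.findall(diag_text))

# Hypothetical excerpt shaped like the FUNCTION lines in the entries above.
sample = """\
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for
CF, SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:720
MESSAGE : CA RC= 2148073473
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for
CF, SQLE_SINGLE_CA_HANDLE::sqleSingleCaCreateNewConnec,
probe:2135
FUNCTION: DB2 UDB, Shared Data Structure Abstraction Layer for
CF, SQLE_CA_CONN_ENTRY_DATA::sqleCaCeConnect, probe:720
"""

counts = count_probes(sample)
for (fn, probe), n in counts.most_common():
    print(f"{n:4d}  {fn}, probe:{probe}")
```

To use it on a real file, read the db2diag.log contents into a string and pass that string to `count_probes`; a steadily growing count for probe:720 and probe:2135 matches the behavior described here.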
Indeed, the ibstat output shows the port state as "UP":
----------------------------------------------------------------
ETHERNET PORT 1 INFORMATION (roce0)
----------------------------------------------------------------
Link State: UP
Link Speed: 10G XFI
Link MTU: 9600
Hardware Address: f4:52:14:cf:4a:da
GIDS (up to 3 GIDs):
GID0 :00:00:00:00:00:00:00:00:00:00:f4:52:14:cf:4a:da
GID1 :00:00:00:00:00:00:00:00:00:00:ff:ff:0a:de:01:65
GID2 :00:00:00:00:00:00:00:00:00:00:00:00:00:00:00:00
All the EDUs kept trying to reconnect to the CF using hba0
and did not try to use hba1.
DB2 relies on RSCT to detect network adapter status, so if
the link state of the port is UP, RSCT treats the port as UP and
notifies DB2 that the port is "UP". In this case, because
of the VLAN isolation, the port should have been reported as
INACTIVE, and the expected behavior is for all EDUs to
reconnect to the CF using hba1.
This scenario was not covered in lab testing and was not
considered in the original design, which led to the current
problem. | |
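The expected behavior described above (stop retrying an adapter whose link is UP but which cannot reach the CF, and move to the next adapter) can be illustrated with a small sketch. This is not DB2's or RSCT's implementation and not the actual fix: the function `pick_adapter`, its parameters, and the `max_failures` threshold are all hypothetical names chosen for illustration.

```python
def pick_adapter(adapters, link_up, try_connect, max_failures=3):
    """Return the first adapter that is link-UP and actually connects.

    adapters: ordered adapter names, e.g. ["hba0", "hba1"]
    link_up: callable(name) -> bool, the reported link state
    try_connect: callable(name) -> bool, True when a connection succeeds
    max_failures: consecutive failures tolerated per adapter (assumed value)
    """
    for name in adapters:
        if not link_up(name):
            continue  # reported down: skip without trying
        for _ in range(max_failures):
            if try_connect(name):
                return name
        # Link is UP but unusable (e.g. isolated in another VLAN):
        # fall through to the next adapter instead of retrying forever.
    return None

# Simulated scenario: hba0 reports UP but cannot reach the CF; hba1 works.
attempts = []

def fake_connect(name):
    attempts.append(name)
    return name == "hba1"

chosen = pick_adapter(["hba0", "hba1"], lambda name: True, fake_connect)
print(chosen)
```

The point of the sketch is that link state alone is treated as a necessary condition, not a sufficient one: repeated connection failures on an adapter that reports UP trigger the move to the alternate adapter.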
| Problem Summary: | |
USERS AFFECTED: Members hang
PROBLEM DESCRIPTION: See Error Description
RECOMMENDATION: Upgrade to V10.5fp7 | |
| Local Fix: | |
| Solution | |
| Workaround | |
not known / see Local fix | |
| Timestamps | |
Date - problem reported : | 06.11.2014 |
Date - problem closed : | 07.01.2016 |
Date - last modified : | 07.01.2016 |
| Problem solved at the following versions (IBM BugInfos) | |
| Problem solved according to the fixlist(s) of the following version(s) | |
| 10.5.0.7 |