2016-12-24

interconnect fragmentation kills the cluster

On a particular two-node Oracle cluster (12.1) we faced instances failing at random. Service Requests with Oracle were opened, with limited results, as the failures were quite random and we could not link them to any trigger.
As it looked like a communication problem between the two nodes, the network team checked the switches involved - without any findings.
Crashing instances were bad enough, but it got worse the day one node rebooted (according to the cluster's alert.log and cssd.log due to network heartbeat issues) and afterwards the cluster stack did not start anymore.

2016-12-12 03:35:34.203 [CLSECHO(54825)]CRS-10001: 12-Dec-16 03:35 AFD-9204: AFD device driver installed or loaded status: 'false' 
2016-12-12 09:17:25.698 [OSYSMOND(1247)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 1247 
2016-12-12 09:17:25.699 [CSSDAGENT(1248)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 1248 
2016-12-12 09:17:25.854 [OCSSD(1264)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 1264 
2016-12-12 09:17:26.899 [OCSSD(1264)]CRS-1713: CSSD daemon is started in hub mode 
2016-12-12 09:17:32.220 [OCSSD(1264)]CRS-1707: Lease acquisition for node yyy2 number 2 completed 
2016-12-12 09:17:33.280 [OCSSD(1264)]CRS-1605: CSSD voting file is online: ORCL:ASM_OCR_VOTE_1; details in /xxx/app/grid/diag/crs/yyy1/crs/trace/ocssd.trc. 
2016-12-12 09:17:33.289 [OCSSD(1264)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1. 
2016-12-12 09:27:25.925 [CSSDAGENT(1248)]CRS-5818: Aborted command 'start' for resource 'ora.cssd'. Details at (:CRSAGF00113:) {0:0:22951} in /xxx/app/grid/diag/crs/yyy2/crs/trace/ohasd_cssdagent_root.trc. 
2016-12-12 09:27:25.925 [OCSSD(1264)]CRS-1656: The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc 
2016-12-12 09:27:25.926 [OCSSD(1264)]CRS-1603: CSSD on node yyy2 shutdown by user. 
Mon Dec 12 09:27:30 2016 
Errors in file /xxx/app/grid/diag/crs/yyy2/crs/trace/ocssd.trc (incident=857): 
CRS-8503 [] [] [] [] [] [] [] [] [] [] [] [] 
Incident details in: /xxx/app/grid/diag/crs/yyy/crs/incident/incdir_857/ocssd_i857.trc 

The CSS trace is filled with messages reporting no connectivity to node 1:
2016-12-12 09:27:20.375584 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:20.585624 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:20.633907 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:20.633912 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:20.656587 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120596, LATS 232008644, lastSeqNo 11120595, uniqueness 1476197219, timestamp 1481534839/2789302712 
2016-12-12 09:27:20.868210 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:21.375702 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:21.585813 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:21.634038 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:21.634046 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 
2016-12-12 09:27:21.657538 : CSSD:3124418304: clssnmvDHBValidateNCopy: node 1, yyy1, has a disk HB, but no network HB, DHB has rcfg 371663236, wrtcnt, 11120597, LATS 232009644, lastSeqNo 11120596, uniqueness 1476197219, timestamp 1481534840/2789303712 
2016-12-12 09:27:21.868336 : CSSD:3119687424: clssnmSendingThread: Connection pending for node yyy1, number 1, flags 0x00000002 
2016-12-12 09:27:22.375830 : CSSD:3154114304: clssscWaitOnEventValue: after CmInfo State val 3, eval 1 waited 1000 with cvtimewait status 4294967186 
2016-12-12 09:27:22.586063 :GIPCHALO:3141216000: gipchaLowerSendEstablish: sending establish message for node '0x7f7f900a37e0 { host 'yyy1', haName '480e-0dfa-bf94-bbda', srcLuid c33a92f9-675f2c44, dstLuid 00000000-00000000 numInf 1, sentRegister 0, localMonitor 0, baseStream 0x7f7f9009b110 type gipchaNodeType12001 (20), nodeIncarnation 9ec9e8e8-682809fa incarnation 2 flags 0x102804}' 
2016-12-12 09:27:22.634195 : CSSD:3635484416: clsssc_CLSFAInit_CB: System not ready for CLSFA initialization 
2016-12-12 09:27:22.634203 : CSSD:3635484416: clsssc_CLSFAInit_CB: clsfa fencing not ready yet 

After even more investigation on the network, another SR was filed.

Thanks to the previous SRs, oswatcher was already installed, and there we found the important information in the netstat section:

zzz ***Fri Dec 9 14:54:54 GMT 2016 
Ip: 
13943376329 total packets received 
129843 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11934989273 incoming packets delivered 
11631767391 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498052793 reassemblies required 
494739589 packets reassembled ok 
353229 packet reassembles failed 
411073325 fragments received ok 
2109526776 fragments created 

and after 2 minutes:
zzz ***Fri Dec 9 14:56:55 GMT 2016 
Ip: 
13943469180 total packets received 
129849 with invalid addresses 
0 forwarded 
0 incoming packets discarded 
11935067348 incoming packets delivered 
11631828206 requests sent out 
2 outgoing packets dropped 
148375 fragments dropped after timeout 
2498069258 reassemblies required 
494741345 packets reassembled ok 
359542 packet reassembles failed 
411073565 fragments received ok 
2109528513 fragments created 

The important part is the difference between the two samples: 359542 - 353229 = 6313 packet reassembles failed within two minutes, compared to 2498069258 - 2498052793 = 16465 reassemblies required.
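To keep an eye on this rate over time, the same counter that oswatcher records can be sampled directly from netstat -s - a rough sketch using the two-minute interval from above (the line wording matches the output shown here; it may differ slightly on other systems):

# watch the growth of "packet reassembles failed" every two minutes
prev=$(netstat -s | awk '/packet reassembles failed/ {print $1}')
while sleep 120; do
  cur=$(netstat -s | awk '/packet reassembles failed/ {print $1}')
  echo "$(date '+%F %T')  reassembly failures in the last 2 min: $((cur - prev))"
  prev=$cur
done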

This led to two My Oracle Support notes which describe both of our symptoms (instance failures and cluster stack failure):

RHEL 6.6: IPC Send timeout/node eviction etc with high packet reassembles failure (Doc ID 2008933.1)

and

The CRSD is Intermediate State and Not Joining to the Cluster (Doc ID 2168576.1)



Fragmentation happens when the sender wants to send more data than fits into a single packet; the receiver then has to reassemble the fragments. In this cluster the interconnect MTU is 1500 - with an 8 kB database block, for example, a single interconnect message already arrives as roughly six fragments - and in our two-minute sample 16465 datagrams needed to be reassembled, of which 6313 failed. There are some Linux kernel parameters which control the buffer the kernel uses to reassemble fragmented datagrams.
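Before changing anything, the current limits and the raw counters behind the netstat output can be checked directly on the node - a minimal sketch (ipfrag_time is the related timeout behind the "fragments dropped after timeout" line above):

# current size limits of the kernel's fragment reassembly buffer
sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_low_thresh net.ipv4.ipfrag_time
# the raw counters: ReasmReqds / ReasmOKs / ReasmFails in the Ip line
grep '^Ip:' /proc/net/snmp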

The solution for our system was to increase 2 parameters:

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640 

These can be changed on the running system via
/proc/sys/net/ipv4/ipfrag_low_thresh
/proc/sys/net/ipv4/ipfrag_high_thresh
and made persistent in sysctl.conf.
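Applied on a running node this looks roughly like the following (run as root; writing to the /proc files above is equivalent to sysctl -w):

# set the new reassembly buffer limits at runtime
sysctl -w net.ipv4.ipfrag_high_thresh=16777216
sysctl -w net.ipv4.ipfrag_low_thresh=15728640

# make them persistent across reboots and reload sysctl.conf
cat >> /etc/sysctl.conf <<EOF
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
EOF
sysctl -p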

Unfortunately these parameters were not mentioned in any of the prerequisite check scripts I found.

With all this knowledge we identified an important difference to our other clusters: this one is the only one running with an interconnect MTU of 1500 - so many more fragmented packets needed care here.

After the issue itself was solved, I wondered whether the same information can be found on a vanilla 12.1 CRS installation
(vanilla in comparison to our setup, where oswatcher was only installed because of the earlier SRs).
Yes, our beloved -MGMTDB holds the information already! It is covered in the documentation as well (Troubleshooting Oracle Clusterware), and in the output of oclumon dumpnodeview I can see
IPReasFail - Number of failures detected by the IPv4 reassembly algorithm
Node: yyy1 Clock: '16-11-26 06.55.27 Etc/GMT' SerialNo:443 
NICS: 
bond0 netrr: 159.021 netwr: 181.688 neteff: 340.709 nicerrors: 0 pktsin: 412 pktsout: 358 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 412 innonunicast: 0 type: PUBLIC 
lo netrr: 37.722 netwr: 37.722 neteff: 75.443 nicerrors: 0 pktsin: 95 pktsout: 95 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 95 innonunicast: 0 type: PUBLIC 
bond1 netrr: 2350.313 netwr: 42989.510 neteff: 45339.823 nicerrors: 0 pktsin: 1927 pktsout: 31345 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 1927 innonunicast: 0 type: PRIVATE 
PROTOCOL ERRORS: 
IPHdrErr: 0 IPAddrErr: 102203 IPUnkProto: 0 IPReasFail: 59886 IPFragFail: 0 TCPFailedConn: 12598 TCPEstRst: 335559 TCPRetraSeg: 67276584 UDPUnkPort: 40134 UDPRcvErr: 0 

Unfortunately the format is kind of clumsy - I will need to dig into its tables for a better output, especially for quick but powerful reports during problems.
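Until a proper report exists, a quick filter over the plain text output is already good enough to follow the counter over time - a rough sketch, assuming the node name yyy1 and the documented dumpnodeview options -n and -last:

oclumon dumpnodeview -n yyy1 -last "01:00:00" |
  awk '/^Node:/      { clock = $4 " " $5 }    # timestamp of the sample
       /IPReasFail:/ { for (i = 1; i <= NF; i++)
                         if ($i == "IPReasFail:") print clock, $(i+1) }'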


During my research I discovered it is not an Oracle-only problem; others are affected as well (and provide a great description).