ESXi 5.5 Hosts randomly lose network connectivity


I recently ran into an issue where my ESXi 5.5 hosts started randomly dropping off the network, and the only way to get them back was to reboot them. Going through the logs, below are my findings on what happened and why.
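
All of the entries below come from the host's vmkernel log. If you want to pull the bnx2x-related lines out yourself (assuming the default log location on an ESXi 5.5 host), something like this from the host shell will do it:

grep bnx2x /var/log/vmkernel.log

The first thing that jumps out is the driver asserting: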

2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_attn_int_deasserted3:4816(vmnic0)]MC assert!
2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_mc_assert:937(vmnic0)]XSTORM_ASSERT_LIST_INDEX 0x2
2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_mc_assert:951(vmnic0)]XSTORM_ASSERT_INDEX 0x0 = 0x00020000 0x00010017 0x05aa05b4 0x00010053
2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_mc_assert:965(vmnic0)]Chip Revision: everest3, FW Version: 7_10_51
2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_attn_int_deasserted3:4822(vmnic0)]driver assert
2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_panic_dump:1140(vmnic0)]

—————– begin crash dump —————–

2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_panic_dump:1150(vmnic0)]def_idx(0xfbd2) def_att_idx(0xa) attn_state(0x1) spq_prod_idx(0xcf) next_stats_cnt(0xdd33)
2015-06-12T16:23:54.454Z cpu17:33604)<3>bnx2x: [bnx2x_panic_dump:1155(vmnic0)]DSB: attn bits(0x0) ack(0x1) id(0x0) idx(0xa)

This is a low-level driver crash triggered by an MC assert, with no other hypervisor problems at the time. The bnx2x card then begins a crash dump and resets itself. The crash dump contains data from the adapter, but it appears to be useful only to Broadcom/QLogic. At the end of the crash dump you can see the card get reset.

2015-06-12T16:23:54.455Z cpu17:33604)<5>[bnx2x_self_test_log:151(vmnic0)]WARNING MISC: pcie_rst_b was asserted without perst assertion.Value is 0x1
2015-06-12T16:23:54.455Z cpu17:33604)<5>[bnx2x_self_test_log:151(vmnic0)]WARNING TSEM: interrupt 0 is active.Value is 0x10010000 <5>[bnx2x_self_test_log:151(vmnic0)]WARNING CSEM: interrupt 0 is active.Value is 0x10010000
2015-06-12T16:23:54.455Z cpu17:33604)<5>[bnx2x_self_test_log:151(vmnic0)]WARNING USEM: interrupt 0 is active.Value is 0x10000000 <5>[bnx2x_self_test_log:151(vmnic0)]WARNING XSEM: interrupt 0 is active.Value is 0x10010000
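
Since the dump itself is really only decipherable by Broadcom/QLogic, the most useful things to capture for a support case are the NIC model and the driver/firmware versions referenced above. A quick way to pull those on the host (exact output fields vary a bit by driver version) is:

esxcli network nic get -n vmnic0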

The crash dump then finishes, followed by an error about a failed RSS configuration (receive side scaling, an offloading/performance feature of the driver):

2015-06-12T16:23:54.455Z cpu17:33604)<3>bnx2x: [bnx2x_panic_dump:1410(vmnic0)]

—————– end crash dump —————–

2015-06-12T16:23:54.455Z cpu17:33604)<5>bnx2x: [bnx2x_attn_int_deasserted:5653(vmnic0)]about to mask 0xfffffffe at IGU addr 0x442d10
2015-06-12T16:23:54.455Z cpu17:33604)<5>bnx2x: [bnx2x_attn_int_deasserted:5666(vmnic0)]aeu_mask 116 newly deasserted 1
2015-06-12T16:23:54.455Z cpu17:33604)<5>bnx2x: [bnx2x_attn_int_deasserted:5668(vmnic0)]new mask 117
2015-06-12T16:23:54.455Z cpu17:33604)<5>bnx2x: [bnx2x_attn_int_deasserted:5673(vmnic0)]attn_state 1
2015-06-12T16:23:54.455Z cpu17:33604)<5>bnx2x: [bnx2x_attn_int_deasserted:5675(vmnic0)]new state 0 <3>[bnx2x_esx_config_rss_pf:299(vmnic0)]Failed to config rss
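
As an aside, RSS, NetQueue and the various offloads on this card are all controlled through bnx2x module parameters. If you want to see which knobs your particular driver build exposes (the names differ between driver versions), you can list them on the host:

esxcli system module parameters list -m bnx2x

This becomes relevant again for the workaround at the end of this post.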

After the crash dump, the state of the card is not exactly clear, as the driver did not report a state. A few seconds later, ESXi's netdev watchdog service, which is responsible for monitoring the health of the network adapters, determined that vmnic0 was not in a good state. You can clearly see that the adapter is unresponsive via the bnx2x driver:

2015-06-12T16:24:00.445Z cpu17:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3863: vmnic0 : scheduler(0x410beb3dd410)/device(0x410a95eec010) 0/1 lock up [stopped=0]:
2015-06-12T16:24:00.445Z cpu17:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3874: detected at 2594566175 while last xmit at 2594560455 and 86934 bytes in flight [window 86460 bytes] and last enqueued/
2015-06-12T16:24:00.445Z cpu17:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3890: vmnic0: packets completion seems stuck, issuing reset

The watchdog service did as designed and issued a reset of the network adapter in an attempt to wake it from whatever state it was in. A few seconds later, you can see the watchdog service still reporting vmnic0 as unresponsive and attempting to reset it several more times:

2015-06-12T16:24:10.460Z cpu17:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3863: vmnic0 : scheduler(0x410beb3dd410)/device(0x410a95eec010) 0/1 lock up [stopped=0]:
2015-06-12T16:24:10.460Z cpu17:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3874: detected at 2594576191 while last xmit at 2594570455 and 86934 bytes in flight [window 86460 bytes] and last enqueued/
2015-06-12T16:24:10.460Z cpu17:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3890: vmnic0: packets completion seems stuck, issuing reset
2015-06-12T16:24:40.500Z cpu16:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3863: vmnic0 : scheduler(0x410beb3dd410)/device(0x410a95eec010) 0/1 lock up [stopped=0]:
2015-06-12T16:24:40.500Z cpu16:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3874: detected at 2594606230 while last xmit at 2594600455 and 86934 bytes in flight [window 86460 bytes] and last enqueued/
2015-06-12T16:24:40.500Z cpu16:33835)WARNING: netsched: NetSchedMClkWatchdogSysWorld:3890: vmnic0: packets completion seems stuck, issuing reset
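
If you still have a session on the host while this is going on (via the DCUI or an already-open SSH session, since the network itself may be gone), a quick sanity check of what the hypervisor thinks about its uplinks and their link state is:

esxcli network nic list

In my case nothing short of rebooting the host ever brought vmnic0 back to a healthy state.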

The interesting part is that it gets no response at all from the bnx2x driver when the card reset is performed. Normally we'd expect to see the card re-initialize. A few seconds later, we get a slew of internal failures from the bnx2x driver indicating timeouts and failures to enable features and RX queues (the -5 and -16 return codes look like the usual Linux -EIO and -EBUSY errnos bubbling up from the driver):

2015-06-12T16:24:43.436Z cpu17:33617)<3>[bnx2x_clean_tx_queue:1518(vmnic0)]timeout waiting for queue[0]: txdata->tx_pkt_prod(59502) != txdata->tx_pkt_cons(59387)
2015-06-12T16:24:43.478Z cpu17:33617)<3>bnx2x: [bnx2x_del_all_macs:9218(vmnic0)]Failed to delete MACs: -5
2015-06-12T16:24:43.478Z cpu17:33617)<3>bnx2x: [bnx2x_chip_cleanup:10047(vmnic0)]Failed to schedule DEL commands for UC MACs list: -5
2015-06-12T16:24:43.500Z cpu16:33617)<3>bnx2x: [bnx2x_set_q_rx_mode:6754(vmnic0)]Set rx_mode 0 failed
2015-06-12T16:24:43.500Z cpu16:33617)<3>[bnx2x_netq_remove_rx_filter:2245(vmnic0)]NetQ could not remove RX filter, rx mode failed: queue:3 filter id:0
2015-06-12T16:24:43.522Z cpu16:33617)<3>[bnx2x_netq_free_rx_queue_rss:708(vmnic0)]Failed to clear RSS ind tbl
2015-06-12T16:24:43.545Z cpu27:33617)<3>[bnx2x_queue_chk_transition:5524(vmnic0)]Blocking transition since pending was 80
2015-06-12T16:24:43.545Z cpu27:33617)<3>[bnx2x_queue_state_change:4658(vmnic0)]check transition returned an error. rc -16
2015-06-12T16:24:43.545Z cpu27:33617)<3>bnx2x: [bnx2x_setup_queue:9532(vmnic0)]Queue(7) INIT failed
2015-06-12T16:24:43.545Z cpu27:33617)<3>[bnx2x_esx_setup_queue:607(vmnic0)]Queue 7 setup failed[0xffffffea]
2015-06-12T16:24:43.545Z cpu27:33617)<3>[bnx2x_netq_free_rx_queue_single:681(vmnic0)]Could not restart queue:7 as Tx
2015-06-12T16:24:43.567Z cpu16:33617)<3>[bnx2x_netq_free_rx_queue_rss:708(vmnic0)]Failed to clear RSS ind tbl
2015-06-12T16:24:43.588Z cpu17:33617)<3>[bnx2x_queue_chk_transition:5524(vmnic0)]Blocking transition since pending was 80
2015-06-12T16:24:43.588Z cpu17:33617)<3>[bnx2x_queue_state_change:4658(vmnic0)]check transition returned an error. rc -16
2015-06-12T16:24:43.588Z cpu17:33617)<3>bnx2x: [bnx2x_setup_queue:9532(vmnic0)]Queue(6) INIT failed
2015-06-12T16:24:43.588Z cpu17:33617)<3>[bnx2x_esx_setup_queue:607(vmnic0)]Queue 6 setup failed[0xffffffea]
2015-06-12T16:24:43.588Z cpu17:33617)<3>[bnx2x_netq_free_rx_queue_single:681(vmnic0)]Could not restart queue:6 as Tx
2015-06-12T16:24:43.610Z cpu18:33617)<3>[bnx2x_netq_free_rx_queue_rss:708(vmnic0)]Failed to clear RSS ind tbl
2015-06-12T16:24:43.633Z cpu18:33617)<3>[bnx2x_queue_chk_transition:5524(vmnic0)]Blocking transition since pending was 80
2015-06-12T16:24:43.633Z cpu18:33617)<3>[bnx2x_queue_state_change:4658(vmnic0)]check transition returned an error. rc -16
2015-06-12T16:24:43.633Z cpu18:33617)<3>bnx2x: [bnx2x_setup_queue:9532(vmnic0)]Queue(5) INIT failed
2015-06-12T16:24:43.633Z cpu18:33617)<3>[bnx2x_esx_setup_queue:607(vmnic0)]Queue 5 setup failed[0xffffffea]
2015-06-12T16:24:43.633Z cpu18:33617)<3>[bnx2x_netq_free_rx_queue_single:681(vmnic0)]Could not restart queue:5 as Tx

This continues until the card eventually tries to reset itself:

2015-06-12T16:24:44.379Z cpu28:33617)<5>bnx2x: [bnx2x_init_hw_port:8120(vmnic0)]starting port init port 0
2015-06-12T16:24:44.381Z cpu28:33617)<5>bnx2x: [bnx2x_init_hw_func:8579(vmnic0)]starting func init func 0

This time it flat out fails to initialize and the hypervisor sees it go link-down:

2015-06-12T16:24:44.567Z cpu17:33617)<3>[bnx2x_nic_load:3478(vmnic0)]Function start failed!
2015-06-12T16:24:44.755Z cpu16:33616)netsched: NetSchedMClkNotify:2334: vmnic0: link down notification

ESXi then flushes any queued packets that may have been intended for vmnic0, which is normal behavior:

2015-06-12T16:24:44.755Z cpu16:33616)netsched: NetSchedMClkDoFlushQueue:3225: vmnic0: dropping 14 packets from queue netsched.pools.persist.default
2015-06-12T16:24:44.755Z cpu16:33616)netsched: NetSchedMClkDoFlushQueue:3225: vmnic0: dropping 100 packets from queue netsched.pools.persist.mgmt
2015-06-12T16:24:44.755Z cpu16:33616)netsched: NetSchedMClkDoFlushQueue:3225: vmnic0: dropping 8 packets from queue netsched.pools.vm.50331739
2015-06-12T16:24:44.755Z cpu16:33616)netsched: NetSchedMClkDoFlushQueue:3225: vmnic0: dropping 48 packets from queue netsched.pools.vm.50331741
2015-06-12T16:24:44.755Z cpu16:33616)netsched: NetSchedMClkDoFlushQueue:3225: vmnic0: dropping 11 packets from queue netsched.pools.vm.50331746

The card then tries one last time to initialize but fails:

2015-06-12T16:24:45.333Z cpu18:33620)<6>bnx2x 0000:01:00.0: vmnic0: using MSI-X IRQs: sp 16 fp[0] 28 … fp[7] 35
2015-06-12T16:24:45.376Z cpu16:33620)<3>[bnx2x_nic_load:3478(vmnic0)]Function start failed!

After this the link remains down and we hear nothing further from either the bnx2x driver or vmnic0. Based on this log analysis, the findings are:

  1. The bnx2x adapter crashed due to an assert failure in the bnx2x driver code.
  2. The adapter failed to respond for almost a full minute before going link-down.  This caused an outage for that entire period, as ESXi would not have initiated a failover action until the link actually went down (see the note on failover detection below).
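
On the second point: the default teaming policy fails over based on link status, and the link did not go down until the very end of this sequence. You can check how failure detection is configured on a standard vSwitch (vSwitch0 here is just an example name) with:

esxcli network vswitch standard policy failover get --vswitch-name=vSwitch0

If failure detection is set to link status only, ESXi will not move traffic to another uplink while a NIC is wedged but still reporting link-up, which is exactly what happened here.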

After much digging I finally found the root cause of the issue: VMware KB 2114957.  The issue has been identified as a problem with the bnx2x driver, and there is currently no updated driver, only a workaround.  It consists of disabling TSO inside the guest OS of each VM.  Alternatively, if there are a large number of VMs, you can disable VXLAN offloading at the host level:

esxcfg-module -s "enable_vxlan_ofld=0" bnx2x

Then reboot the host for the change to take effect. 
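
A couple of related notes based on my setup (adjust the names to match yours): the same parameter can be set through esxcli instead of esxcfg-module, and after the reboot you can confirm that it stuck:

esxcli system module parameters set -m bnx2x -p "enable_vxlan_ofld=0"
esxcli system module parameters list -m bnx2x | grep enable_vxlan_ofld

For the per-VM workaround, disabling TSO inside a Linux guest is typically done with ethtool (eth0 here is just an example interface name); Windows guests expose the equivalent setting in the NIC's advanced properties:

ethtool -K eth0 tso off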
