hairpin中文翻译为发卡。bridge不允许包从收到包的端口发出,比如bridge从一个端口收到一个广播报文后,会将其广播到所有其他端口。bridge的某个端口打开hairpin mode后允许从这个端口收到的包仍然从这个端口发出。这个特性用于NAT场景下,比如docker的nat网络,一个容器访问其自身映射到主机的端口时,包到达bridge设备后走到ip协议栈,经过iptables规则的dnat转换后发现又需要从bridge的收包端口发出,需要开启端口的hairpin mode。See https://wiki.mikrotik.com/wiki/Hairpin_NAT
我们在使用Kubernetes的时候遇到了一个大流量的问题,集群计算节点偶然流量突增,机器ping不通。
网络运营的同事反馈说从交换机上发现这些服务器组播包很多
- 对其进行了抓包分析和系统各模块梳理排查,但是只能截取到VRRP组播报文。
- 分析不出来,限流组播报文!
ebtables -A INPUT --pkttype-type multicast --limit 1000/s -j ACCEPT, ebtables -A INPUT --pkttype-type multicast -j DROP
- 发现Ebtables drop规则计数偶现增长 => 写脚本抓取计数增长时刻的报文并告警 => 禁止IPv6 DAD报文 抓取的样例报文有大量ICMP6/ARP报文 => 内核发出的IPv6 dad报文=>关闭容器网卡IPv6
后续发现关闭ipv6功能后,问题依然复现了。是否方向错了,不是dad报文导致的?
正没有头绪的时候,网平DC运营组同事提示是否开启了hairpin mode。在主机上查看确实物理网卡开启了hairpin mode。经查发现是K8s老版本的代码bug,本意是开启容器veth device的hairpin mode,支持同主机容器的NAT访问,新版本已经修复,commit参见0cfd09。从代码上分析如果在创建容器的时候并发删除容器,拉起pod的线程有可能正在尝试进入容器的网络空间给veth设备配置hairpin mode,而如果容器进程已经停止了,所以都没进入容器的网络空间,代码中就会尝试给所有的网卡都配置hairpin mode,就给主机的所有网卡都配置了hairpin mode。
那开启Hairpin mode为啥会造成这么多流量?我们FloatingIP使用bridge拓扑下,eth1与bridge桥接,eth1如果收到一个组播/广播报文,bridge会将这个报文广播(默认收到组播报文会flood到所有口mcast_flood可以关闭)到所有端口,也就是回头又会从eth1出去,然后交换机收到报文后又会广播,被也开启hairpin的主机收到后同样会广播回交换机和其他主机,这样就形成了大量的流量。
我们用bridge设备和veth设备做个简单的测试就可以验证这个问题,创建两个网桥和一对veth设备,将veth设备的两端分别连接到两个网桥上,up所有网卡,将其中一个网桥配上一个ip地址后,用arping从配ip地址的网桥上发送一个arp宣告的广播报文,在另一个网桥上抓包,可以收到一条arp报文,但是当我们将veth设备都开启hairpin mode后,再发一个arp宣告,再次抓包,我们会收到无穷无尽的arp包。
附测试的操作过程
ip link add br0 type bridge
ip link add br1 type bridge
ip link add dev v0 type veth peer name v1
ip link set v0 master br0
ip link set v1 master br1
ip link set br0 up
ip link set br1 up
ip link set v0 up
ip link set v1 up
ip ad add 192.168.0.1/24 dev br0
[root@10 vagrant]# arping -c 1 -A -I br0 192.168.0.1
ARPING 192.168.0.1 from 192.168.0.1 br0
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
[root@10 vagrant]# tcpdump -vvnneSs 0 -i v1
tcpdump: WARNING: v1: no IPv4 address assigned
tcpdump: listening on v1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:52:20.866169 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
bridge link set dev v0 hairpin on
bridge link set dev v1 hairpin on
[root@10 vagrant]# bridge -d link show
15: v1 state UP @v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br1 state forwarding priority 32 cost 2
hairpin on guard off root_block off fastleave off
16: v0 state UP @v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br0 state forwarding priority 32 cost 2
hairpin on guard off root_block off fastleave off
[root@10 vagrant]# arping -c 1 -A -I br0 192.168.0.1
ARPING 192.168.0.1 from 192.168.0.1 br0
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
[root@10 vagrant]# tcpdump -vvnneSs 0 -i v1
tcpdump: WARNING: v1: no IPv4 address assigned
tcpdump: listening on v1, link-type EN10MB (Ethernet), capture size 65535 bytes
15:53:53.586450 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
15:53:53.586479 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
15:53:53.586489 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
15:53:53.586491 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
15:53:53.586494 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
15:53:53.586495 66:83:79:e3:a5:6c > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 66:83:79:e3:a5:6c, length 28
...