Merbridge now supports Ambient Mesh, no need to worry about CNI compatibility!

This blog describes how Merbridge supports Ambient Mesh.

In the blog Deep Dive into Ambient Mesh - Traffic Path, we analyzed how Ambient Mesh forwards the ingress and egress traffic of Pods to ztunnel. The forwarding is implemented with iptables, TPROXY, and routing tables. Compared to the sidecar mode, the traffic datapath is relatively long and the mechanism is complicated. Moreover, it uses routing marks, which may cause unexpected behavior with some CNIs, for example a CNI running in bridge mode. These issues severely limit the applicable scope of Ambient Mesh.

The main purpose of Merbridge is to replace iptables with eBPF to accelerate applications running in a service mesh. Ambient Mesh is a new mode of Istio, so Merbridge needs to support it. iptables is a powerful tool for blocking unwanted traffic, allowing desired traffic, and redirecting packets to specific addresses and ports, but it also has weaknesses. First, iptables matches rules linearly, and when several applications use it at the same time, conflicts may arise and make some features unavailable. Second, although it is flexible, it cannot be programmed as freely as eBPF. Therefore, replacing iptables with eBPF is a good way to implement traffic interception for Ambient Mesh.

Objectives

As mentioned in Deep Dive into Ambient Mesh - Traffic Path, we set two objectives:

  • Outgoing traffic from pods in Ambient Mesh should be intercepted and redirected to port 15001 of ztunnel.
  • Traffic sent from host applications to pods in Ambient Mesh should be redirected to port 15006 of ztunnel.

Since istioin and other network interface cards (NICs) are completely designed to adapt to the native Ambient Mesh, we don’t need to make any changes.

Pain points analysis

Ambient Mesh operates differently from sidecars. According to Istio’s official definition, adding a Pod to Ambient Mesh does not require restarting the Pod, and no sidecar-related process runs in the Pod. This means:

  1. Merbridge used the CNI mode to let the eBPF program get the current Pod IP for policy decisions, which is incompatible with the ambient mesh. The reason is that a Pod is not restarted when it joins or leaves the ambient mesh, so the CNI plug-in is never called.
  2. In the sidecar mode, the only change needed in the eBPF connect hook is to rewrite the destination address to 127.0.0.1:15001, but in the ambient mesh the destination IP must be replaced with the IP of ztunnel.

In addition, no sidecar-related process exists in a Pod running in the ambient mesh, so the legacy method of checking whether a port such as 15006 is listening in the current Pod no longer applies. A new scheme is needed to detect the environment in which a process is running.

Therefore, based on the above analysis, it is required to redesign the entire interception scheme so that Merbridge can support the ambient mesh.

In summary, we need to implement the following features:

  • Redesign a scheme for judging whether a Pod is running in the ambient mesh
  • Use eBPF to perceive current Pod IP regardless of CNIs
  • Enable eBPF programs to know the ztunnel IP on the current node

Solution

In version 0.7.2, the cgroup id is used to improve the performance of the connect program. Usually, each container in a Pod has its own cgroup id, which can be obtained through the bpf_get_current_cgroup_id function in the BPF program. The connect program can be sped up by writing IP information into a dedicated cgroup_info_map.

The legacy CNI mode stores Pod-related information by listening on a special port in the network namespace, which is not possible in the ambient mesh. Here the cgroup id becomes useful: if the cgroup id can be associated with the Pod IP, the current Pod IP can be obtained in eBPF.

Since the CNI can no longer be relied on, we need to change how Pod status information is obtained. For this reason, we detect the creation and deletion of local Pods by watching process creation and exit. We created a new tool to watch process changes on a host: the process-watcher project.

The cgroup id and IP information are read from the process ID and written to the cgroup_info_map:

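// Build the map entry for this cgroup: mesh membership, Pod IP, and mode flags.
tcg := cgroupInfo{
	ID:            cgroupInode,
	IsInMesh:      in,
	CgroupIp:      *(*[4]uint32)(_ip),
	Flags:         flag,
	DetectedFlags: cgrinfo.DetectedFlags | AMBIENT_MESH_MODE_FLAG | ZTUNNEL_FLAG,
}
// Write the entry into cgroup_info_map, keyed by the cgroup inode (the cgroup id).
return ebpfs.GetCgroupInfoMap().Update(&cgroupInode, &tcg, ebpf.UpdateAny)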
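
As a rough illustration of the first half of this step, the sketch below shows one way a watcher could locate the cgroup of a process on a cgroup v2 host: it parses the `0::<path>` entry that the kernel exposes in /proc/<pid>/cgroup and maps it to a directory under the cgroup mount point. The function names and the /sys/fs/cgroup mount point are assumptions made for illustration, not Merbridge's actual code. On cgroup v2, the inode number of that directory is commonly the value that bpf_get_current_cgroup_id reports, which would be consistent with the cgroupInode key used above.

```go
package main

import (
	"bufio"
	"fmt"
	"path"
	"strings"
)

// cgroupMount is an assumed cgroup v2 mount point (hypothetical default).
const cgroupMount = "/sys/fs/cgroup"

// cgroupV2Path extracts the cgroup v2 path from the text of /proc/<pid>/cgroup.
// The unified (v2) hierarchy is the entry of the form "0::<path>".
func cgroupV2Path(procCgroup string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(procCgroup))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), true
		}
	}
	return "", false
}

// cgroupDir maps a cgroup v2 path to its directory under the mount point.
// A watcher could stat this directory to read its inode number.
func cgroupDir(cgPath string) string {
	return path.Join(cgroupMount, cgPath)
}

func main() {
	// Sample content; a real watcher would read /proc/<pid>/cgroup instead.
	sample := "0::/kubepods.slice/pod1234.slice/cri-abcd.scope\n"
	if p, ok := cgroupV2Path(sample); ok {
		fmt.Println(cgroupDir(p))
	}
}
```

Only the string handling is shown here so that the sketch stays portable; reading the inode and resolving the Pod IP are left out.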

Then get the current cgroup-related information in eBPF:

__u64 cgroup_id = bpf_get_current_cgroup_id();
void *info = bpf_map_lookup_elem(&cgroup_info_map, &cgroup_id);

Now we know whether the current container has ambient mode enabled and whether it is in the mesh.

Second, for the ztunnel IP, Istio adds NICs and binds fixed IPs to them. This scheme carries a risk of conflicts, and the original addresses may be lost in some cases (such as SNAT). Merbridge therefore gives up this scheme: the control plane obtains the ztunnel IP directly and writes it into a map, and the eBPF program reads it from there (this is also faster).

static inline __u32 *get_ztunnel_ip()
{
    __u32 ztunnel_ip_key = ZTUNNEL_KEY;
    return (__u32 *)bpf_map_lookup_elem(&settings, &ztunnel_ip_key);
}

Then use the connect program to rewrite the destination address:

ctx->user_ip4 = ztunnel_ip[3];
ctx->user_port = bpf_htons(OUT_REDIRECT_PORT);

By associating the cgroup id with the Pod IP, the eBPF program can obtain the Pod IP of the current process and enforce policies. Traffic from Pods in the ambient mesh is then forwarded to ztunnel, which makes Merbridge compatible with the ambient mesh.
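
To make the decision flow easier to follow, here is a userspace model in Go of what the connect hook does with the two maps. It is a sketch under stated assumptions, not the eBPF program itself: the struct fields, the rewriteConnect name, and the rule that ztunnel's own connections are left untouched are hypothetical simplifications; only port 15001 and the idea of keying by cgroup id come from the text above.

```go
package main

import "fmt"

// outRedirectPort is the ztunnel outbound port named in the article.
const outRedirectPort = 15001

// cgroupInfo mirrors only the fields needed here; the layout is hypothetical.
type cgroupInfo struct {
	InAmbient bool   // the cgroup belongs to a Pod in the ambient mesh
	IsZtunnel bool   // the cgroup belongs to ztunnel itself (assumed to be skipped)
	PodIP     string // Pod IP associated with the cgroup
}

// dest is a simplified connect() destination.
type dest struct {
	IP   string
	Port uint16
}

// rewriteConnect models the hook: look up the caller's cgroup id, and if the
// caller is an ambient Pod (and not ztunnel), point the connection at ztunnel.
func rewriteConnect(cgroups map[uint64]cgroupInfo, ztunnelIP string, cgroupID uint64, d dest) dest {
	info, ok := cgroups[cgroupID]
	if !ok || !info.InAmbient || info.IsZtunnel || ztunnelIP == "" {
		return d // unknown cgroup, not in the mesh, or no ztunnel known: leave as is
	}
	return dest{IP: ztunnelIP, Port: outRedirectPort}
}

func main() {
	cgroups := map[uint64]cgroupInfo{
		42: {InAmbient: true, PodIP: "10.244.2.4"},
		7:  {InAmbient: true, IsZtunnel: true, PodIP: "10.244.2.3"},
	}
	orig := dest{IP: "10.96.0.10", Port: 9080}
	fmt.Println(rewriteConnect(cgroups, "10.244.2.3", 42, orig)) // ambient Pod
	fmt.Println(rewriteConnect(cgroups, "10.244.2.3", 7, orig))  // ztunnel itself
	fmt.Println(rewriteConnect(cgroups, "10.244.2.3", 99, orig)) // unknown cgroup
}
```

In the real program the lookups are bpf_map_lookup_elem calls and the rewrite is the assignment to ctx->user_ip4 and ctx->user_port shown earlier.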

This will be a capability that is adaptable to all CNIs and can avoid the problem that the native ambient mesh cannot work well in most CNI modes.

Usage and feedback

The ambient mesh is still in its early stage and Merbridge’s support for the ambient mode is preliminary. Some problems have not been fully resolved, so the code that supports the ambient mode has not been merged into the main branch. If you want to try Merbridge instead of iptables for traffic interception in the ambient mesh, perform the following steps (the ambient mesh must be installed in advance):

  1. Disable Istio CNI (pass --set components.cni.enabled=false during installation, or delete the Istio CNI DaemonSet with kubectl -n istio-system delete ds istio-cni).
  2. Remove the init container of ztunnel (because it initializes iptables rules and NICs, which is not required for Merbridge).
  3. Install Merbridge by running kubectl apply -f https://github.com/merbridge/merbridge/raw/ambient/deploy/all-in-one.yaml

After Merbridge is ready, you can use all capabilities of the ambient mesh.

Notes:

  1. The ambient mode is not currently supported under Kind (we plan to support it in the future).
  2. The host kernel version must be 5.7 or later.
  3. cgroup v2 must be enabled.
  4. This mode is also compatible with sidecars.
  5. The debug mode is enabled by default in the ambient mesh, which has some impact on performance.

For more details, see the source code.

If you have any questions, please reach out to us on Slack or join the WeChat group to chat.

Merbridge helps Kuma reduce network latency by 12%

Recently, Kuma announced a major release of v2.0 with several new major features. A notable feature is that Kuma is using eBPF to improve the traffic flow.

Kuma 2.0 release preview

Based on the official release notes and blogs, Kuma implements the eBPF capabilities by integrating with Merbridge.

Performance comparison between eBPF and iptables for Kuma 2.0

A quote from Kuma 2.0 release blog:

We are utilizing the Merbridge OSS project within our eBPF capabilities and are very excited that we have been able to contribute back to that library and become co-maintainers. We look forward to working more with the Merbridge team as we continue to explore different areas to include eBPF functionality in Kuma.

As an open source project, we are very excited to see Merbridge bring such capabilities to Kuma. This case shows that traffic latency can be reduced without any extra overhead if you use Merbridge in a service mesh.

Since June this year, Kuma developers have been working on integrating with Merbridge, trying to get the eBPF-based acceleration capabilities.

Thanks to the clear architecture of Merbridge, Kuma was adapted to Merbridge smoothly within days. A big thanks to the Kuma community for contributing such an important compatibility capability to Merbridge, which helps both communities grow together!

So far, Merbridge supports popular service mesh products such as Istio, Linkerd2, and Kuma, and has a clear plan to add support for IPv4/IPv6 dual-stack, ambient mesh, and earlier kernel versions. It is exciting to see Merbridge being used more widely. We really hope the project can help you adopt eBPF technologies in your own projects. We look forward to receiving more comments and to having more developers get involved.

Deep Dive into Ambient Mesh - Traffic Path

This blog analyzes the traffic path on data plane in the ambient mesh.

Ambient Mesh has been released for a while, and some online articles have talked much about its usage and architecture. This blog will dive into the traffic path on data plane in the ambient mesh to help you fully understand the implementations of the ambient data plane.

Before starting, read through Introducing Ambient Mesh carefully to learn the basics of the ambient mesh.

For your convenience, the test environment can be deployed by following Get Started with Istio Ambient Mesh.

Start from the moment you make a request

In order to explore the traffic path, we first analyze the scenario where two services access each other in the ambient mesh (only for L4 mode on different nodes).

After the ambient mesh is enabled in the default namespace, all services in it gain mesh governance capabilities.

Our analysis starts from this command: kubectl exec deploy/sleep -- curl -s http://productpage:9080/ | head -n1

In the sidecar mode, Istio intercepts traffic through iptables. When you run the curl command in a sleep pod, iptables forwards the traffic to port 15001 of the sidecar. However, in the ambient mesh, no sidecar exists in a pod, and the pod does not need to be restarted to join the mesh. How do we make sure the request is processed by ztunnel?

Egress traffic interception

To learn details about intercepting the egress traffic, let’s check the control plane components:

kebe@pc $ kubectl -n istio-system get po
NAME                                   READY   STATUS    RESTARTS   AGE
istio-cni-node-5rh5z                   1/1     Running   0          20h
istio-cni-node-qsvsz                   1/1     Running   0          20h
istio-cni-node-wdffp                   1/1     Running   0          20h
istio-ingressgateway-5cfcb57bd-kx9hx   1/1     Running   0          20h
istiod-6b84499b75-ncmn7                1/1     Running   0          20h
ztunnel-nptf6                          1/1     Running   0          20h
ztunnel-vxv4b                          1/1     Running   0          20h
ztunnel-xkz4s                          1/1     Running   0          20h

In the sidecar mode, istio-cni is mainly a CNI plug-in that avoids the permission leakage caused by using the istio-init container to set up iptables rules. However, in the ambient mesh, istio-cni becomes a required component. Sidecars are theoretically not needed, so why is the istio-cni component required?

Let’s check the logs:

kebe@pc $ kubectl -n istio-system logs istio-cni-node-qsvsz
...
2022-10-12T07:34:33.224957Z	info	ambient	Adding route for reviews-v1-6494d87c7b-zrpks/default: [table 100 10.244.1.4/32 via 192.168.126.2 dev istioin src 10.244.1.1]
2022-10-12T07:34:33.226054Z	info	ambient	Adding pod 'reviews-v2-79857b95b-m4q2g/default' (0ff78312-3a13-4a02-b39d-644bfb91e861) to ipset
2022-10-12T07:34:33.228305Z	info	ambient	Adding route for reviews-v2-79857b95b-m4q2g/default: [table 100 10.244.1.5/32 via 192.168.126.2 dev istioin src 10.244.1.1]
2022-10-12T07:34:33.229967Z	info	ambient	Adding pod 'reviews-v3-75f494fccb-92nq5/default' (e41edf7c-a347-45cb-a144-97492faa77bf) to ipset
2022-10-12T07:34:33.232236Z	info	ambient	Adding route for reviews-v3-75f494fccb-92nq5/default: [table 100 10.244.1.6/32 via 192.168.126.2 dev istioin src 10.244.1.1]

As shown in the above output, for a pod in the ambient mesh, istio-cni performs the following actions:

  1. Add the pod to ipset
  2. Add a routing rule to table 100 (for its usage see below)

You can view the ipset contents on the node (a kind cluster is used here, so you need to use docker exec to enter the host first):

kebe@pc $ docker exec -it ambient-worker2 bash
root@ambient-worker2:/# ipset list
Name: ztunnel-pods-ips
Type: hash:ip
Revision: 0
Header: family inet hashsize 1024 maxelem 65536
Size in memory: 520
References: 1
Number of entries: 5
Members:
10.244.1.5
10.244.1.7
10.244.1.8
10.244.1.4
10.244.1.6

An ipset exists on the node where this pod is running, and it holds a number of pod IPs. Compare them with the pods in the cluster:

kebe@pc $ kubectl get po -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP           NODE              NOMINATED NODE   READINESS GATES
details-v1-76778d6644-wn4d2       1/1     Running   0          20h   10.244.1.9   ambient-worker2   <none>           <none>
notsleep-6d6c8669b5-pngxg         1/1     Running   0          20h   10.244.2.5   ambient-worker    <none>           <none>
productpage-v1-7c548b785b-w9zl6   1/1     Running   0          20h   10.244.1.7   ambient-worker2   <none>           <none>
ratings-v1-85c74b6cb4-57m52       1/1     Running   0          20h   10.244.1.8   ambient-worker2   <none>           <none>
reviews-v1-6494d87c7b-zrpks       1/1     Running   0          20h   10.244.1.4   ambient-worker2   <none>           <none>
reviews-v2-79857b95b-m4q2g        1/1     Running   0          20h   10.244.1.5   ambient-worker2   <none>           <none>
reviews-v3-75f494fccb-92nq5       1/1     Running   0          20h   10.244.1.6   ambient-worker2   <none>           <none>
sleep-7b85956664-z6qh7            1/1     Running   0          20h   10.244.2.4   ambient-worker    <none>           <none>

Therefore, this ipset holds a list of all PodIPs in the ambient mesh on the current node.

Where can this ipset be used?

Let’s take a look at the iptables rules and you can find:

root@ambient-worker2:/# iptables-save
*mangle
...
-A POSTROUTING -j ztunnel-POSTROUTING
...
-A ztunnel-PREROUTING -p tcp -m set --match-set ztunnel-pods-ips src -j MARK --set-xmark 0x100/0x100

This shows that when a pod in the ambient mesh on a node (that is, a pod whose IP is in the ztunnel-pods-ips ipset) initiates a request, its connection is marked with 0x100/0x100.

Such marks are usually related to routing. Let’s check the routing rules:

root@ambient-worker2:/# ip rule
0: from all lookup local
100: from all fwmark 0x200/0x200 goto 32766
101: from all fwmark 0x100/0x100 lookup 101
102: from all fwmark 0x40/0x40 lookup 102
103: from all lookup 100
32766: from all lookup main
32767: from all lookup default

The traffic marked with 0x100/0x100 goes via the routing table 101. Let’s check the routing table:

root@ambient-worker2:/# ip r show table 101
default via 192.168.127.2 dev istioout
10.244.1.2 dev veth5db63c11 scope link

It can be clearly seen that the default gateway has been replaced with 192.168.127.2 via the istioout NIC (network interface card).
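
The mark arithmetic behind these listings can be sketched in a few lines of Go. setXmark follows the documented meaning of iptables `--set-xmark value/mask` (clear the mask bits, then XOR in the value), and fwmarkMatches follows the `fwmark value/mask` selector of ip rule. The rule table is copied from the `ip rule` output above, but the type and function names are illustrative, and the sketch deliberately ignores the unmarked rules and the fall-through that happens when a table has no matching route.

```go
package main

import "fmt"

// setXmark models iptables "--set-xmark value/mask":
// zero the bits selected by mask, then XOR value into the mark.
func setXmark(mark, value, mask uint32) uint32 {
	return (mark &^ mask) ^ value
}

// fwmarkMatches models the "fwmark value/mask" selector of ip rule.
func fwmarkMatches(mark, value, mask uint32) bool {
	return mark&mask == value
}

// markRule is one fwmark-based policy routing rule.
type markRule struct {
	Priority    int
	Value, Mask uint32
	Action      string
}

// markRules lists only the fwmark rules from the ip rule output.
var markRules = []markRule{
	{100, 0x200, 0x200, "goto 32766"},
	{101, 0x100, 0x100, "lookup 101"},
	{102, 0x40, 0x40, "lookup 102"},
}

// firstMarkRule returns the action of the first fwmark rule, in priority
// order, that matches the packet mark.
func firstMarkRule(mark uint32) (string, bool) {
	for _, r := range markRules {
		if fwmarkMatches(mark, r.Value, r.Mask) {
			return r.Action, true
		}
	}
	return "", false
}

func main() {
	// A packet from an ambient pod gets marked by the ztunnel-PREROUTING rule.
	mark := setXmark(0, 0x100, 0x100)
	action, ok := firstMarkRule(mark)
	fmt.Printf("mark=%#x action=%q matched=%v\n", mark, action, ok)
}
```

Because both the set and the match use a mask, other bits of the mark can be used by other components without disturbing this rule.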

192.168.127.2 is not a NodeIP, a PodIP, or a ClusterIP, and the istioout NIC does not exist by default, so where does this IP come from? Since the traffic ultimately needs to go to ztunnel, the ztunnel configuration is a good place to look for the answer.

kebe@pc $ kubectl -n istio-system get po ztunnel-vxv4b -o yaml
apiVersion: v1
kind: Pod
metadata:
  ...
  name: ztunnel-vxv4b
  namespace: istio-system
  ...
spec:
  ...
  initContainers:
  - command:
      ...
      OUTBOUND_TUN=istioout
      ...
      OUTBOUND_TUN_IP=192.168.127.1
      ZTUNNEL_OUTBOUND_TUN_IP=192.168.127.2

      ip link add name p$INBOUND_TUN type geneve id 1000 remote $HOST_IP
      ip addr add $ZTUNNEL_INBOUND_TUN_IP/$TUN_PREFIX dev p$INBOUND_TUN

      ip link add name p$OUTBOUND_TUN type geneve id 1001 remote $HOST_IP
      ip addr add $ZTUNNEL_OUTBOUND_TUN_IP/$TUN_PREFIX dev p$OUTBOUND_TUN

      ip link set p$INBOUND_TUN up
      ip link set p$OUTBOUND_TUN up
      ...

As shown above, ztunnel is responsible for creating the istioout NIC. Now go to the node to check the NIC:

root@ambient-worker2:/# ip a
11: istioout: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether 0a:ea:4e:e0:8d:26 brd ff:ff:ff:ff:ff:ff
    inet 192.168.127.1/30 brd 192.168.127.3 scope global istioout
       valid_lft forever preferred_lft forever

Where is the gateway IP of 192.168.127.2? It is allocated in ztunnel.

kebe@pc $ kubectl -n istio-system exec -it ztunnel-nptf6 -- ip a
Defaulted container "istio-proxy" out of: istio-proxy, istio-init (init)
2: eth0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 46:8a:46:72:1d:3b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.2.3/24 brd 10.244.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::448a:46ff:fe72:1d3b/64 scope link
       valid_lft forever preferred_lft forever
4: pistioout: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether c2:d0:18:20:3b:97 brd ff:ff:ff:ff:ff:ff
    inet 192.168.127.2/30 scope global pistioout
       valid_lft forever preferred_lft forever
    inet6 fe80::c0d0:18ff:fe20:3b97/64 scope link
       valid_lft forever preferred_lft forever

You can now see that the traffic reaches ztunnel, but nothing else has been done to it at this point; it is simply routed to ztunnel. How does Envoy in ztunnel process the traffic?

ztunnel is also configured with many iptables rules. Let’s check the specific rules inside ztunnel.

kebe@pc $ kubectl -n istio-system exec -it ztunnel-nptf6 -- iptables-save
Defaulted container "istio-proxy" out of: istio-proxy, istio-init (init)
...
*mangle
-A PREROUTING -i pistioout -p tcp -j TPROXY --on-port 15001 --on-ip 127.0.0.1 --tproxy-mark 0x400/0xfff
...
COMMIT

When traffic enters ztunnel, TPROXY transfers it to port 15001, the port on which Envoy actually listens to process pod egress traffic. TPROXY itself is well covered by other references, so this blog does not describe it further.

So when a pod is running in the ambient mesh, its egress traffic path is as follows:

  1. A process in the pod initiates traffic.
  2. The traffic flows through the node network and is marked by iptables on the node.
  3. The routing table forwards the traffic to the ztunnel pod on the current node.
  4. When the traffic reaches ztunnel, iptables applies TPROXY (transparent proxy) and sends the traffic to port 15001 of Envoy in the ztunnel pod.

It is now clear that the processing of pod egress traffic in the ambient mesh is relatively complex. The path is also relatively long, unlike the sidecar mode, in which traffic forwarding is completed directly inside the pod.

Ingress traffic interception

With the above experience, it is easy to see that traffic interception in the ambient mesh mainly relies on MARK routing + TPROXY, and the ingress traffic should be handled similarly.

Let’s analyze it in the simplest way. When a process on a node, or a program on another host, accesses a pod on the current node, the traffic goes through the host’s routing table. Let’s check the route used when traffic is sent to productpage-v1-7c548b785b-w9zl6 (10.244.1.7):

root@ambient-worker2:/# ip r get 10.244.1.7
10.244.1.7 via 192.168.126.2 dev istioin table 100 src 10.244.1.1 uid 0
    cache

When accessing 10.244.1.7, the traffic will be routed to 192.168.126.2, and this rule is added by istio-cni.

Similarly 192.168.126.2 belongs to ztunnel:

kebe@pc $ kubectl -n istio-system exec -it ztunnel-nptf6 -- ip a
Defaulted container "istio-proxy" out of: istio-proxy, istio-init (init)
2: eth0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 46:8a:46:72:1d:3b brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.244.2.3/24 brd 10.244.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::448a:46ff:fe72:1d3b/64 scope link
       valid_lft forever preferred_lft forever
3: pistioin: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 7e:b2:e6:f9:a4:92 brd ff:ff:ff:ff:ff:ff
    inet 192.168.126.2/30 scope global pistioin
       valid_lft forever preferred_lft forever
    inet6 fe80::7cb2:e6ff:fef9:a492/64 scope link
       valid_lft forever preferred_lft forever

By using the same analysis method, let’s check the iptables rules:

kebe@pc $ kubectl -n istio-system exec -it ztunnel-nptf6 -- iptables-save
...
-A PREROUTING -i pistioin -p tcp -m tcp --dport 15008 -j TPROXY --on-port 15008 --on-ip 127.0.0.1 --tproxy-mark 0x400/0xfff
-A PREROUTING -i pistioin -p tcp -j TPROXY --on-port 15006 --on-ip 127.0.0.1 --tproxy-mark 0x400/0xfff
...

If you access a pod directly via PodIP + pod port, the traffic is forwarded to port 15006 of ztunnel, which is the port that handles ingress traffic in Istio.

As for the traffic whose destination port is port 15008, this is the port used by ztunnel for L4 traffic tunneling. This blog will not explain this further.
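
The three TPROXY rules seen inside ztunnel (one for pistioout, two for pistioin) amount to a small decision table. The Go sketch below restates that table for TCP traffic; the function name is illustrative, and the rules are evaluated in the same order as in the iptables-save output, so the more specific 15008 match wins on pistioin.

```go
package main

import "fmt"

// tproxyPort models the TPROXY rules in the ztunnel pod for TCP traffic:
// it returns the local port the traffic is handed to, and false when no
// rule applies to the given interface.
func tproxyPort(iface string, dport uint16) (uint16, bool) {
	switch {
	case iface == "pistioout":
		return 15001, true // pod egress traffic
	case iface == "pistioin" && dport == 15008:
		return 15008, true // tunneled traffic keeps its port
	case iface == "pistioin":
		return 15006, true // plain ingress traffic
	}
	return 0, false
}

func main() {
	fmt.Println(tproxyPort("pistioout", 9080))
	fmt.Println(tproxyPort("pistioin", 15008))
	fmt.Println(tproxyPort("pistioin", 9080))
	fmt.Println(tproxyPort("eth0", 9080))
}
```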

Handle traffic for Envoy itself

In the sidecar mode, Envoy and user containers run in the same network namespace. For the traffic from user containers, you need to intercept all the traffic to guarantee complete control of the traffic. However, is it also required in the ambient mesh?

The answer is no, because Envoy has been isolated from other pods. The traffic sent by Envoy does not require special handling. In other words, you only need to handle ingress traffic for ztunnel, so the rules in ztunnel are relatively simple.

Wrapping up

This blog mainly analyzed the scheme for Pod traffic interception in the ambient mesh. It did not cover how L7 traffic is handled or the specific principles of the ztunnel implementation. The next plan is to analyze the detailed traffic paths in ztunnel and the waypoint proxy.

Merbridge CNI Mode

This blog explains how CNI works in Merbridge.

The CNI mode is designed to better adapt to service mesh functions. Before this mode existed, Merbridge was limited to certain scenarios. The biggest problem was that it could not honor the sidecar annotations on containers injected by Istio, so Merbridge could not exclude traffic for certain ports and IP ranges. Furthermore, Merbridge could only handle requests originating inside the pod, which means external traffic sent to the pod was not handled.

Therefore, we have implemented the Merbridge CNI to address these issues.

Why CNI mode is needed

First, Merbridge previously had a small control plane, which watched pod resources and wrote the current node IP into the local_pod_ips map for use by connect. However, since the connect program only works at the host kernel layer, it does not know which pod’s traffic is being processed. Thus, configurations like excludeOutboundPorts cannot be handled. To honor the injected sidecar annotation excludeOutboundPorts, the eBPF program needs to know which Pod’s request is currently being processed.

To this end, we have designed a method to cooperate with the CNI, through which you can get the current Pod IP to validate special configurations for the Pod.

Second, in early versions of Merbridge, only connect processed requests, on the host, which was fine for intra-node pod communication. However, it becomes problematic when traffic flows between different nodes. With the previous logic, cross-node traffic is not modified, so it ends up being handled by iptables.

Here, we turned to an XDP program for processing the inbound traffic. The XDP program needs to be attached to a network card, which also requires the CNI.

How does CNI work

This section will explore how CNI works and how to use CNI to solve the issues mentioned above.

How to use CNI to let eBPF have the current Pod IP

When a pod is created, we write the Pod IP into the map mark_pod_ips_map through CNI, where the key is a random value and the value is the Pod IP. Then, we listen on a special port 39807 in the NetNS of the current Pod, and write the key to the mark of this port’s socket using setsockopt.

In eBPF, we get the recorded mark information of port 39807 through bpf_sk_lookup_tcp, and use it to get the current Pod IP (also the current NetNS) from mark_pod_ips_map.

With the current Pod IP, we can determine the path to route traffic (such as excludeOutboundPorts) according to the configuration of this Pod.
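
The two-step lookup (socket mark first, then mark_pod_ips_map) can be modeled in userspace as follows. This is a sketch under stated assumptions: the podRegistry type and its methods are hypothetical, plain Go maps stand in for the socket mark on port 39807 and for the eBPF map, and a counter replaces the random key so that the example is deterministic.

```go
package main

import "fmt"

// podRegistry stands in for two pieces of kernel state.
type podRegistry struct {
	marks      map[string]uint32 // NetNS name -> mark on the port-39807 listener
	markPodIPs map[uint32]string // models mark_pod_ips_map: key -> Pod IP
	next       uint32            // counter used in place of a random key
}

func newPodRegistry() *podRegistry {
	return &podRegistry{
		marks:      map[string]uint32{},
		markPodIPs: map[uint32]string{},
	}
}

// addPod models the CNI side at pod creation: store key -> Pod IP, then
// record the key as the mark of the listener socket in the pod's NetNS.
func (r *podRegistry) addPod(netns, podIP string) {
	r.next++
	r.markPodIPs[r.next] = podIP
	r.marks[netns] = r.next
}

// currentPodIP models the eBPF side: read the mark of the listener in the
// current NetNS, then resolve it to a Pod IP through the map.
func (r *podRegistry) currentPodIP(netns string) (string, bool) {
	key, ok := r.marks[netns]
	if !ok {
		return "", false // no listener: not a pod managed by the CNI mode
	}
	ip, ok := r.markPodIPs[key]
	return ip, ok
}

func main() {
	reg := newPodRegistry()
	reg.addPod("netns-sleep", "10.244.2.4")
	reg.addPod("netns-productpage", "10.244.1.7")
	fmt.Println(reg.currentPodIP("netns-productpage"))
	fmt.Println(reg.currentPodIP("netns-host"))
}
```

The indirection through a key exists because a socket mark is a small integer, so the Pod IP itself has to live in a map.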

In addition, we also optimized the quadruple conflicts by using bpf_bind to bind the source IP and using 127.0.0.1 as the destination IP, which also prepares for future support of IPv6.

How to handle ingress traffic

In order to handle inbound traffic, we introduced the XDP program, which works on the network card and can modify the original data packets.

We use the XDP program to modify the destination port as 15006 when the traffic reaches the Pod, so as to complete traffic forwarding.

At the same time, considering that the host may access the Pod directly, and in order to reduce the scope of influence, we chose to attach the XDP program to the Pod’s network card. This requires the CNI to perform additional operations when creating Pods.
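
What "modify the destination port" involves at the packet level can be illustrated in Go on a raw TCP header. The sketch rewrites the port field and patches the checksum incrementally in the style of RFC 1624, since changing a header field invalidates the checksum. The function names and the sample header are illustrative assumptions; the sample checksum covers only the 20-byte header (no pseudo-header or payload), which is enough to demonstrate the update. It is not the XDP program itself.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// onesComplementSum folds a byte slice into a 16-bit one's-complement sum.
func onesComplementSum(b []byte) uint16 {
	var sum uint32
	for i := 0; i+1 < len(b); i += 2 {
		sum += uint32(binary.BigEndian.Uint16(b[i:]))
	}
	if len(b)%2 == 1 {
		sum += uint32(b[len(b)-1]) << 8
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return uint16(sum)
}

// destPort reads the TCP destination port (header bytes 2-3).
func destPort(tcp []byte) uint16 {
	return binary.BigEndian.Uint16(tcp[2:4])
}

// rewriteDestPort sets a new destination port and updates the checksum
// (bytes 16-17) incrementally: HC' = ~(~HC + ~m + m').
func rewriteDestPort(tcp []byte, newPort uint16) {
	old := destPort(tcp)
	hc := binary.BigEndian.Uint16(tcp[16:18])
	sum := uint32(^hc) + uint32(^old) + uint32(newPort)
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	binary.BigEndian.PutUint16(tcp[2:4], newPort)
	binary.BigEndian.PutUint16(tcp[16:18], ^uint16(sum))
}

// sampleTCPHeader builds a 20-byte header whose checksum covers the header only.
func sampleTCPHeader() []byte {
	h := make([]byte, 20)
	binary.BigEndian.PutUint16(h[0:2], 40000)    // source port
	binary.BigEndian.PutUint16(h[2:4], 9080)     // destination port
	binary.BigEndian.PutUint16(h[12:14], 0x5002) // data offset 5, SYN flag
	binary.BigEndian.PutUint16(h[14:16], 0x7210) // window
	binary.BigEndian.PutUint16(h[16:18], ^onesComplementSum(h))
	return h
}

func main() {
	h := sampleTCPHeader()
	rewriteDestPort(h, 15006)
	// A valid checksum makes the one's-complement sum of the data equal 0xffff.
	fmt.Println(destPort(h), onesComplementSum(h) == 0xffff)
}
```

The need to keep the checksum consistent after such a rewrite is one reason checksum offload settings on the host matter for this mode.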

How to use CNI mode?

CNI mode is disabled by default. You need to enable it manually with the following command.

curl -sSL https://raw.githubusercontent.com/merbridge/merbridge/main/deploy/all-in-one.yaml | sed 's/--cni-mode=false/--cni-mode=true/g' | kubectl apply -f -

Notes

CNI mode is in beta

The CNI mode is a new feature that may not be perfect. We welcome your feedback and suggestions to help improve Merbridge.

If you are running benchmark tests with tools like Istio perf benchmark, it is suggested to enable the CNI mode. Otherwise the test results will be inaccurate.

Check whether the host can enable the hardware-checksum capability

To ensure that the CNI mode works properly, the hardware-checksum capability is disabled by default, which may affect network performance. It is recommended to check whether you can enable this capability on the host before enabling the CNI mode. If you can, we suggest setting --hardware-checksum=true for best performance.

Test method: run ethtool -k <network card> | grep tx-checksum-ipv4. If the output shows on, the capability is enabled.

Merbridge and Cilium


Cilium is a great open source software that provides a lot of networking capabilities for cloud native applications based on eBPF, with a lot of great designs. Among others, Cilium designed a set of sockmap-based redir capabilities to help accelerate network communications, which inspired us and is the basis for Merbridge to provide network acceleration. It is a really great design.

Merbridge leverages the great foundation that Cilium has provided, along with some targeted adaptations we’ve made in the Service Mesh, to make it easier to apply eBPF technology to Service Mesh.

Our development team has learned a lot of eBPF theory, practical methods, and testing methods from Cilium’s detailed documentation and our frequent exchanges with the Cilium technical team. All of these together help make Merbridge possible.

Thanks again to the Cilium project and community, and to Cilium for these great designs.

Livestream with Solo.io

On March 29, 2022, Solo.io and Merbridge co-hosted a livestream.

In this livestream, we discussed a lot of Merbridge-related issues, including a live demo that will help you get a quick overview of Merbridge’s features and usage.

Also, the PPT is available here for download.

If you are interested, see:

Merbridge - Accelerate your mesh with eBPF

Merbridge - Accelerate your mesh with eBPF

Replacing iptables rules with eBPF allows transporting data directly from inbound sockets to outbound sockets, shortening the datapath between sidecars and services.

Introduction

The secret of Istio’s abilities in traffic management, security, observability and policy is all in the Envoy proxy. Istio uses Envoy as the “sidecar” to intercept service traffic, with the kernel’s netfilter packet filter functionality configured by iptables.

There are shortcomings in using iptables to perform this interception. Since netfilter is a highly versatile tool for filtering packets, several routing rules and data filtering processes are applied before a packet reaches the destination socket. For example, from the network layer to the transport layer, netfilter processes the packet several times with predefined rules such as pre_routing and post_routing. When the packet becomes a TCP or UDP packet and is forwarded to user space, additional steps like packet validation, protocol policy processing, and destination socket lookup are performed. When a sidecar is configured to intercept traffic, the original data path can become very long, since duplicated steps are performed several times.

Over the past two years, eBPF has become a trending technology, and many projects based on eBPF have been released to the community. Tools like Cilium and Pixie show great use cases for eBPF in observability and network packet processing. With eBPF’s sockops and redir capabilities, data packets can be processed efficiently by directly being transported from an inbound socket to an outbound socket. In an Istio mesh, it is possible to use eBPF to replace iptables rules, and accelerate the data plane by shortening the data path.

We have created an open source project called Merbridge, and by applying the following command to your Istio-managed cluster, you can use eBPF to achieve such network acceleration.

kubectl apply -f https://raw.githubusercontent.com/merbridge/merbridge/main/deploy/all-in-one.yaml

Attention: Merbridge uses eBPF functions which require a Linux kernel version ≥ 5.7.

With Merbridge, the packet datapath can be shortened directly from one socket to another destination socket, and here’s how it works.

Using eBPF sockops for performance optimization

A network connection is essentially socket communication. eBPF provides a function, bpf_msg_redirect_hash, to directly forward the packets sent by the application from the inbound socket to the outbound socket. In the eBPF program that calls this function, developers can implement any logic to decide the packet destination. Thanks to this characteristic, the datapath of packets can be noticeably optimized in the kernel.

The sock_map is the crucial piece in recording information for packet forwarding. When a packet arrives, an existing socket is selected from the sock_map to forward the packet. As a result, we need to save all the socket information for packets to make the transportation process function properly. When there are new socket operations — like a new socket being created — the sock_ops function is executed. The socket metadata is obtained and stored in the sock_map to be used when processing packets. The common key type in the sock_map is a “quadruple” of source and destination addresses and ports. With the key and the rules stored in the map, the destination socket will be found when a new packet arrives.
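
The bookkeeping described here can be modeled with an ordinary Go map keyed by the quadruple. This is a userspace sketch, not eBPF: the sockMap type, its methods, and the string socket labels are hypothetical, and it assumes that the peer of a local connection is found by swapping the source and destination halves of the key.

```go
package main

import "fmt"

// tuple is the "quadruple" key: source and destination address and port.
type tuple struct {
	SrcIP   string
	SrcPort uint16
	DstIP   string
	DstPort uint16
}

// reverse swaps the endpoints; the other end of a connection sees the
// same quadruple this way round.
func (t tuple) reverse() tuple {
	return tuple{SrcIP: t.DstIP, SrcPort: t.DstPort, DstIP: t.SrcIP, DstPort: t.SrcPort}
}

// sockMap is a stand-in for an eBPF sock_map; values are socket labels.
type sockMap map[tuple]string

// onSockOps models the sock_ops callback: record a newly established socket.
func (m sockMap) onSockOps(t tuple, sock string) {
	m[t] = sock
}

// redirect models the lookup made before redirecting a message: find the
// socket at the other end of the sender's connection.
func (m sockMap) redirect(sender tuple) (string, bool) {
	s, ok := m[sender.reverse()]
	return s, ok
}

func main() {
	m := sockMap{}
	app := tuple{"127.0.0.1", 40000, "127.0.0.1", 15001}
	m.onSockOps(app, "app-socket")             // active side of the connection
	m.onSockOps(app.reverse(), "envoy-socket") // passive (accepted) side
	fmt.Println(m.redirect(app))
	fmt.Println(m.redirect(tuple{"10.0.0.1", 1234, "10.0.0.2", 80}))
}
```

If the lookup fails, the message simply continues along the regular network stack, so the map only needs entries for connections that can be short-circuited.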

The Merbridge approach

Let’s introduce the detailed design and implementation principles of Merbridge step by step, with a real scenario.

Istio sidecar traffic interception based on iptables

Istio Sidecar Traffic Interception Based on iptables

When external traffic hits your application’s ports, it will be intercepted by a PREROUTING rule in iptables, forwarded to port 15006 of the sidecar container, and handed over to Envoy for processing. This is shown as steps 1-4 in the red path in the above diagram.

Envoy processes the traffic using the policies issued by the Istio control plane. If allowed, the traffic will be sent to the actual container port of the application container.

When the application tries to access other services, it will be intercepted by an OUTPUT rule in iptables, and then be forwarded to port 15001 of the sidecar container, where Envoy is listening. This is steps 9-12 in the red path, similar to inbound traffic processing.

Traffic to the application port must first be forwarded to the sidecar and then sent from the sidecar port to the container port, which adds overhead. Moreover, the versatility of iptables comes at a cost: the filtering rules it applies inevitably add delay to the whole datapath, so its performance is not always ideal. Although iptables is the common way to do packet filtering, in the Envoy proxy case the longer datapath amplifies the bottleneck of the packet filtering process in the kernel.

If we use sockops to directly connect the sidecar’s socket to the application’s socket, the traffic will not need to go through iptables rules, and thus performance can be improved.

Processing outbound traffic

As mentioned above, we would like to use eBPF’s sockops to bypass iptables and accelerate network requests. At the same time, we do not want to modify any part of Istio, so that Merbridge stays fully compatible with the community version. As a result, we need to simulate in eBPF what iptables does.

Traffic redirection in iptables utilizes its DNAT function. When trying to simulate the capabilities of iptables using eBPF, there are two main things we need to do:

  1. Modify the destination address when the connection is initiated, so that traffic is sent to the new address.
  2. Enable Envoy to obtain the original destination address, so that it can identify the traffic.

For the first part, we can use eBPF’s connect program, which modifies the user_ip4 (or user_ip6) and user_port fields of the connection.

For the second part, we need to understand the concept of ORIGINAL_DST (the SO_ORIGINAL_DST socket option), which belongs to the netfilter module in the kernel.

When an application (including Envoy) receives a connection, it calls the getsockopt function with the SO_ORIGINAL_DST option to obtain the original destination. If the connection went through the iptables DNAT process, iptables has recorded the “original IP + port” value for the current socket. Thus, the application can get the original destination address of the connection.
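For reference, this is roughly what the getsockopt call with SO_ORIGINAL_DST looks like from an application’s point of view for IPv4. It is a minimal Python sketch rather than Envoy’s implementation: the function names are hypothetical, and the constant 80 is the value of SO_ORIGINAL_DST from linux/netfilter_ipv4.h, defined here by hand because Python’s socket module does not export it.

```python
import socket
import struct

# Value from <linux/netfilter_ipv4.h>; not exported by Python's socket module.
SO_ORIGINAL_DST = 80

def parse_sockaddr_in(raw):
    """Decode a 16-byte struct sockaddr_in into (ip, port).

    Layout: 2-byte family, 2-byte port (network order), 4-byte address, 8 bytes padding.
    """
    port, packed_ip = struct.unpack_from("!2xH4s", raw)
    return socket.inet_ntoa(packed_ip), port

def original_dst(conn):
    """Ask the kernel for the pre-DNAT destination of an accepted IPv4 connection."""
    raw = conn.getsockopt(socket.SOL_IP, SO_ORIGINAL_DST, 16)
    return parse_sockaddr_in(raw)
```

original_dst only succeeds on Linux for a connection that was actually redirected; otherwise the kernel returns an error, which Python raises as OSError.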

We have to intercept this call with an eBPF getsockopt program. (bpf_setsockopt is not used here because it does not currently support the SO_ORIGINAL_DST optname.)

Referring to the figure below, when an application initiates a request, it will go through the following steps:

  1. When the application initiates a connection, the connect program modifies the destination address to 127.x.y.z:15001 and uses cookie_original_dst to save the original destination address.
  2. In the sockops program, the current socket information and the quadruple are saved in sock_pair_map. At the same time, the same quadruple and its corresponding original destination address are written to pair_original_dst. (The cookie is not used here because it cannot be obtained in the getsockopt program.)
  3. After Envoy receives the connection, it calls the getsockopt function to read the destination address of the current connection. The getsockopt program extracts the original destination address from pair_original_dst according to the quadruple and returns it. At this point the connection is fully established.
  4. In the data transport step, the redir program reads the socket information from sock_pair_map according to the quadruple and forwards the data directly through bpf_msg_redirect_hash to speed up the request.

Processing Outbound Traffic

Why do we set the destination address to 127.x.y.z instead of 127.0.0.1? When several pods exist on the same node, connections to 127.0.0.1 could produce conflicting quadruples, and using distinct 127.x.y.z addresses gracefully avoids the conflict. (Pod IPs are different from one another, so quadruples containing them never conflict.)
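The outbound steps and the role of 127.x.y.z can be modeled in user space as follows. This is a hedged sketch in Python, not Merbridge’s eBPF code: the map names cookie_original_dst and pair_original_dst come from the text, while the OutboundModel class, the counter-based allocation starting at 127.128.0.1, and the reuse of the client-side quadruple in the getsockopt step are assumptions made for illustration.

```python
import ipaddress
from typing import Dict, Optional, Tuple

Addr = Tuple[str, int]
Quad = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)

ENVOY_OUTBOUND_PORT = 15001

class OutboundModel:
    def __init__(self) -> None:
        self.cookie_original_dst: Dict[int, Addr] = {}
        self.pair_original_dst: Dict[Quad, Addr] = {}
        self._next = 1  # hypothetical allocator state

    def _alloc_loopback(self) -> str:
        # Assumption: hand out a fresh 127.x.y.z per connection so that
        # quadruples from different pods cannot collide.
        ip = ipaddress.IPv4Address("127.128.0.0") + self._next
        self._next += 1
        return str(ip)

    def on_connect(self, cookie: int, dst: Addr) -> Addr:
        # Step 1: save the real destination, then point the socket at Envoy.
        self.cookie_original_dst[cookie] = dst
        return (self._alloc_loopback(), ENVOY_OUTBOUND_PORT)

    def on_sock_ops(self, cookie: int, quad: Quad) -> None:
        # Step 2: re-key the original destination by quadruple, because the
        # cookie is not available in the getsockopt step.
        original = self.cookie_original_dst.get(cookie)
        if original is not None:
            self.pair_original_dst[quad] = original

    def on_getsockopt(self, quad: Quad) -> Optional[Addr]:
        # Step 3: return the original destination for this connection.
        return self.pair_original_dst.get(quad)
```

In this model, two pods that both connect from the same loopback source address and port still get different quadruples, because each connection is given its own 127.x.y.z destination.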

Inbound traffic processing

Inbound traffic is processed in much the same way as outbound traffic. The only difference is that the destination port is changed to 15006.

Note that, unlike iptables, eBPF cannot take effect in only a specified namespace, so the change is global. This means that for a pod that is not managed by Istio, or for an external IP address, serious problems will occur, such as the connection not being established at all.

As a result, we designed a tiny control plane (deployed as a DaemonSet) that watches all pods, much as the kubelet watches the pods on its node, and writes the IP addresses of pods that have a sidecar injected to the local_pod_ips map.

When processing inbound traffic, if the destination address is not in the map, we will not do anything to the traffic.

The other steps are the same as for outbound traffic.
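The inbound decision can be sketched in user space like this. It is an illustrative Python model only: local_pod_ips is the map named in the text, while the Pod record, the injected_pod_ips helper, and the redirect_inbound function are hypothetical names standing in for the control plane and the eBPF program.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

Addr = Tuple[str, int]
ENVOY_INBOUND_PORT = 15006

@dataclass
class Pod:
    """Hypothetical view of a pod as seen by the node-local control plane."""
    ip: str
    has_sidecar: bool

def injected_pod_ips(pods: Iterable[Pod]) -> Set[str]:
    # Models the control plane: keep only pods that have a sidecar injected.
    return {pod.ip for pod in pods if pod.has_sidecar}

def redirect_inbound(dst: Addr, local_pod_ips: Set[str]) -> Addr:
    # Destinations outside the map (unmanaged pods, external IPs) are untouched.
    ip, _port = dst
    if ip not in local_pod_ips:
        return dst
    # Managed pods: keep the IP, send the connection to the sidecar's inbound port.
    return (ip, ENVOY_INBOUND_PORT)
```

Because the check is a set lookup, traffic to pods without a sidecar and to external addresses passes through unchanged, which is the behavior the map exists to guarantee.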

Processing Inbound Traffic

Same-node acceleration

Theoretically, acceleration between Envoy sidecars on the same node can be achieved directly through inbound traffic processing. However, Envoy will raise an error when accessing the application of the current pod in this scenario.

In Istio, Envoy accesses the application using the current pod IP and port number. In this scenario, the pod IP is also in the local_pod_ips map, so the traffic is redirected to the pod IP on port 15006 again, because that is the same address the inbound traffic came from. Redirecting to the same inbound address causes an infinite loop.

Here comes the question: are there any ways to get the IP address in the current namespace with eBPF? The answer is yes!

We have designed a feedback mechanism: when Envoy tries to establish the connection, we redirect it to port 15006. In the sockops step, however, we check whether the source IP and the destination IP are the same. If they are, the request was redirected by mistake, and we discard the connection in the sockops process. At the same time, the current process ID and IP information are written to the process_ip map, which lets eBPF keep track of the correspondence between processes and IPs.

When the next request is sent, this procedure does not need to be repeated. We check the process_ip map directly to see whether the destination address is the same as the current IP address.

Envoy will retry when the request fails, and this retry process will only occur once, meaning subsequent requests will be accelerated.
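The feedback mechanism can be modeled in user space as follows. This is a sketch under stated assumptions, not Merbridge’s code: process_ip and local_pod_ips are the maps named in the text, while the SameNodeModel class, its method names, and the choice to apply the source-equals-destination check only to connections on port 15006 are hypothetical simplifications.

```python
from typing import Dict, Iterable, Tuple

Addr = Tuple[str, int]
ENVOY_INBOUND_PORT = 15006

class SameNodeModel:
    def __init__(self, local_pod_ips: Iterable[str]) -> None:
        self.local_pod_ips = set(local_pod_ips)
        self.process_ip: Dict[int, str] = {}  # process ID -> IP of its own pod

    def on_connect(self, pid: int, dst: Addr) -> Addr:
        ip, _port = dst
        if ip not in self.local_pod_ips:
            return dst
        if self.process_ip.get(pid) == ip:
            # Known case: this process is calling the application in its own pod.
            return dst
        return (ip, ENVOY_INBOUND_PORT)

    def on_sock_ops(self, pid: int, src_ip: str, dst_ip: str, dst_port: int) -> bool:
        """Return False when the connection must be discarded."""
        if dst_port == ENVOY_INBOUND_PORT and src_ip == dst_ip:
            # The redirect was a mistake: remember which IP this process owns.
            self.process_ip[pid] = src_ip
            return False
        return True
```

In this model the first attempt from Envoy to its own pod is discarded and recorded, and the retry goes straight to the application port, which matches the single-retry behavior described above.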

Same-node acceleration

Connection relationship

Before applying eBPF using Merbridge, the data path between pods looks like this:

iptables’s data path

Diagram From: Accelerating Envoy and Istio with Cilium and the Linux Kernel

After applying Merbridge, the outbound traffic will skip many filter steps to improve the performance:

eBPF’s data path

Diagram From: Accelerating Envoy and Istio with Cilium and the Linux Kernel

If two pods are on the same machine, the connection can even be faster:

eBPF’s data path on the same machine

Diagram From: Accelerating Envoy and Istio with Cilium and the Linux Kernel

Performance results

The tests below were run during our development and have not yet been validated in production use cases.

Let’s see the effect on overall latency using eBPF instead of iptables (lower is better):

Latency vs Client Connections Graph

We can also see overall QPS after using eBPF (higher is better):

QPS vs Client Connections Graph

Test results are generated with wrk.

Summary

We have introduced the core ideas of Merbridge in this post. By replacing iptables with eBPF, data transport can be accelerated in a mesh scenario. At the same time, Istio is not changed at all. This means that if you no longer want to use eBPF, you can just delete the DaemonSet, and the datapath will revert to the traditional iptables-based routing without any problems.

Merbridge is a completely independent open source project. It is still at an early stage, and we look forward to more users and developers getting involved. We would greatly appreciate it if you tried this new technology to accelerate your mesh and gave us some feedback!

Merbridge Project: https://github.com/merbridge/merbridge

See also