
IPv6 connectivity issue: a journey into radvd and socket options

Posted at — Apr 2, 2024

Context

Shadow is a cloud computing service specialized in gaming. To provide isolated machines to users, it relies heavily on virtual machines.

 
Each virtual machine has dual-stack connectivity, with both IPv4 and IPv6. With IPv4, the network configuration is pushed via DHCP, which provides both the addressing and the default route information. With IPv6, we rely on the Router Advertisement mechanism.

 
Recently, we changed the networking stack on the hypervisor: whereas the virtual interfaces for each VM used to be created beforehand and kept persistent, we moved to transient virtual interfaces, created and removed each time a virtual machine is started and stopped.

 
After a few weeks we noticed that virtual machines could randomly lose their IPv6 connectivity to the Internet.
While disabling IPv6 connectivity restored reliable access to the VM, we dug in to find the cause of this defect.

Analysis of the IPv6 connectivity in the VM

After a short analysis we noticed that the Windows routing table no longer had an IPv6 default route, only link-local ones.

> route print
[...]
IPv6 Route Table
Active Routes:
If	Metric	Network Destination		Gateway
1	331	::1/128				On-link
4	271	fd12:0:0:a01::a/128		On-link
4	271	fe80::/64			On-link
4	271	fe80::d756:522a:7e8b:339c/128	On-link
1	331	ff00::/8			On-link
4	271	ff00::/8			On-link
Persistent Routes: None

A correct configuration would be:

> route print
[...]
IPv6 Route Table
Active Routes:
If	Metric	Network Destination		Gateway
13 	271 	::/0				fe80::fc42:c6ff:fe7e:6388
1 	331 	::1/128				On-link
13 	271 	fd12:0:0:a08::a/128		On-link
13 	271	fe80::/64			On-link
13	271 	fe80::6b86:ba8a:cc33:f514/128	On-link
1	331	ff00::/8			On-link
13 	271	ff00::/8			On-link
Persistent Routes: None

Without that default route, the VM does not know where to send its IPv6 packets, despite having an IPv6 address.

 
Once the VM has initialized its network interface, it sends an RS (Router Solicitation) to discover the available IPv6 routers. This is radvd’s time to shine, as it is in charge of answering this RS with an RA (Router Advertisement). The VM then knows that this router will handle its traffic (and adds the ::/0 route via fe80::fc42:c6ff:fe7e:6388).
These mechanisms are part of the Neighbor Discovery Protocol.
 
This Router Advertisement is never sent when radvd hits a specific bug… and tcpdump made it pretty clear that ours had gone missing.
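On the wire this is easy to check: Router Solicitations are ICMPv6 type 133 and Router Advertisements type 134. A capture along these lines (offset 40 assumes no extension headers, which holds for NDP traffic; the interface name is from our setup) shows the solicitations going unanswered:

# RS = ICMPv6 type 133, RA = ICMPv6 type 134
$ tcpdump -i tap1 'icmp6 and (ip6[40] == 133 or ip6[40] == 134)'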

Understanding the missing Router Advertisement

To receive Router Solicitations and send Router Advertisements, radvd configures its socket to join the all-routers multicast group on the virtual machine’s network interface. This is done using the setsockopt syscall.
setsockopt is a syscall that sets options on a socket.
 
For example, one can use this syscall to force the reuse of an address (with SO_REUSEADDR) or to change the size of the buffer allocated for receiving data on this socket (with SO_RCVBUF). In our case, it is used to join a multicast group (with IPV6_ADD_MEMBERSHIP).
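As an illustration of the general shape of such a call, here is how one would bump the receive buffer (a generic sketch, not radvd code):

#include <sys/socket.h>

/* ask the kernel for a 1 MiB receive buffer on this socket */
int set_rcvbuf(int sock)
{
	int size = 1 << 20;
	return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));
}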

Using strace we were able to trace these calls, their options, and their return values.
Below is a curated output of strace attached to the radvd process 1.

$ strace -p <radvd PID>
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap2")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap4")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap3")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap1")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]

 
Its arguments in this context are:
 

  • 3: the target socket file descriptor, that will always be the same
  • SOL_IPV6: IPv6 protocol
  • IPV6_ADD_MEMBERSHIP: States that we want to join a multicast group
  • {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap1")}: structure that contains options about the multicast group we aim to join
    • inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr): converts the ff02::2 IP address from a string to a binary sequence
    • ipv6mr_interface=if_nametoindex("tap1"): target interface index, here tap1
  • 20: the structure’s length
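Put together in C, the same join looks roughly like this (a minimal sketch mirroring the strace line above, not radvd’s actual code; note that userspace usually spells the level IPPROTO_IPV6, which strace renders as SOL_IPV6):

#include <arpa/inet.h>   /* inet_pton */
#include <net/if.h>      /* if_nametoindex */
#include <netinet/in.h>  /* struct ipv6_mreq, IPV6_ADD_MEMBERSHIP */
#include <string.h>
#include <sys/socket.h>

int join_allrouters(int sock, const char *ifname)
{
	struct ipv6_mreq mreq;

	memset(&mreq, 0, sizeof(mreq));
	/* ff02::2 is the link-local all-routers multicast group */
	inet_pton(AF_INET6, "ff02::2", &mreq.ipv6mr_multiaddr);
	mreq.ipv6mr_interface = if_nametoindex(ifname);

	/* sizeof(mreq) is the 20 seen as the last strace argument */
	return setsockopt(sock, IPPROTO_IPV6, IPV6_ADD_MEMBERSHIP,
			  &mreq, sizeof(mreq));
}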

 
We basically try to join an IPv6 multicast group using tap1. This, however, results in an ENOMEM error. That means we ran out of memory, but why? The server’s memory is absolutely not a concern; there’s even plenty of it free!
 
So we had to analyze things in depth to determine exactly where this ENOMEM was being returned.

Digging a little deeper

We used ftrace, which can track the execution of the various kernel-side functions that compose the setsockopt syscall. ftrace, or function tracer, is an internal tracer in the Linux kernel that allows analysis, debugging, and examination of kernel activities via the tracefs filesystem.
 

$ cd /sys/kernel/tracing
# Request ftrace to trace only the radvd process
$ echo <radvd PID> > set_ftrace_pid
# Trace only function calls
$ echo function_graph > current_tracer

To reproduce the issue, we create a network interface in tap mode and launch a VM attempting to boot over the network, which will trigger a Router Advertisement from radvd (and thus this setsockopt() call):

$ ip tuntap add mode tap tap1
$ ip link set tap1 up
$ ip addr add fd12:0:0:a08::1/64 dev tap1
$ qemu-system-x86_64 -boot n -net nic -net tap,ifname=tap1,script=no,downscript=no -nographic

We now stop ftrace:

$ echo nop > /sys/kernel/tracing/current_tracer

The trace below has been shortened for the sake of clarity:

$ cat /sys/kernel/tracing/trace
 60) ! 304.905 us  |  } /* syscall_trace_enter.constprop.0 */
 60)               |  __x64_sys_setsockopt() {
 60)               |    __sys_setsockopt() {
 60)               |      sockfd_lookup_light() {
 60)               |        __fdget() {
 60)   0.376 us    |          __fget_light();
 60)   0.995 us    |        }
 60)   1.719 us    |      }
[..]
 60)               |      sock_common_setsockopt() {
 60)               |        rawv6_setsockopt() {
 60)               |          ipv6_setsockopt() {
 60)               |            do_ipv6_setsockopt() {
 60)               |              rtnl_lock() {
 60)               |                mutex_lock() {
 60)   0.308 us    |                  __cond_resched();
 60)   0.866 us    |                }
 60)   1.440 us    |              }
 60)               |              sockopt_lock_sock() {
 60)   0.285 us    |                __cond_resched();
 60)               |                _raw_spin_lock_bh() {
 60)   0.321 us    |                  preempt_count_add();
 60)   0.998 us    |                }
 60)               |                _raw_spin_unlock_bh() {
 60)               |                  __local_bh_enable_ip() {
 60)   0.278 us    |                    preempt_count_sub();
 60)   0.860 us    |                  }
 60)   1.400 us    |                }
 60)   4.289 us    |              }
 60)               |              ipv6_sock_mc_join() {
 60)               |                __ipv6_sock_mc_join() {
 60)               |                  rtnl_is_locked() {
 60)   0.286 us    |                    mutex_is_locked();
 60)   0.864 us    |                  }
 60)   0.375 us    |                  sock_kmalloc();
 60)   6.348 us    |                }
 60)   7.243 us    |              }
 60)               |              sockopt_release_sock() {
 60)               |                release_sock() {
 60)               |                  _raw_spin_lock_bh() {
 60)   0.318 us    |                    preempt_count_add();
 60)   1.033 us    |                  }
 60)               |                  _raw_spin_unlock_bh() {
 60)               |                    __local_bh_enable_ip() {
 60)   0.307 us    |                      preempt_count_sub();
 60)   0.878 us    |                    }
 60)   1.407 us    |                  }
 60)   3.493 us    |                }
 60)   4.086 us    |              }
[..]
 60) + 22.281 us   |            }
 60) + 23.195 us   |          }
 60) + 23.953 us   |        }
 60) + 24.821 us   |      }
 60)   0.302 us    |      kfree();
 60) + 32.012 us   |    }
 60) + 32.633 us   |  }

Making use of the setsockopt syscall involves calling:

  • sock_common_setsockopt
  • rawv6_setsockopt
  • do_ipv6_setsockopt
  • sockopt_lock_sock
  • ipv6_sock_mc_join
  • __ipv6_sock_mc_join
  • sock_kmalloc
  • sockopt_release_sock

 
The ipv6_sock_mc_join (“ipv6 socket multicast join”) function caught our attention here, since its role is to add the interface to the multicast group. It also calls sock_kmalloc.

(ipv6_sock_mc_join is a mere wrapper around __ipv6_sock_mc_join that just adds an argument.)

static int __ipv6_sock_mc_join(struct sock *sk, int ifindex,
			       const struct in6_addr *addr, unsigned int mode)
{
	struct net_device *dev = NULL;
	struct ipv6_mc_socklist *mc_lst;
	struct ipv6_pinfo *np = inet6_sk(sk);
	struct net *net = sock_net(sk);
	int err;

	ASSERT_RTNL();

	if (!ipv6_addr_is_multicast(addr))
		return -EINVAL;

	for_each_pmc_socklock(np, sk, mc_lst) {
		if ((ifindex == 0 || mc_lst->ifindex == ifindex) &&
		    ipv6_addr_equal(&mc_lst->addr, addr))
			return -EADDRINUSE;
	}
[...]

The struct ipv6_mc_socklist structure represents a multicast group member, as seen earlier in the trace:
{inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap1")}

  • addr: multicast group address
  • ifindex: index of the interface we want to join the multicast group with
  • sfmode: set to MCAST_EXCLUDE (the result of using the ipv6_sock_mc_join wrapper; that’s the mode argument of __ipv6_sock_mc_join)
  • rcu: Read-Copy-Update, a mechanism ensuring reads of valid data when data is read much more frequently than it is written, concurrently (see: https://www.kernel.org/doc/html/next/RCU/whatisRCU.html)

struct ipv6_mc_socklist {
	struct in6_addr		addr;
	int			ifindex;
	unsigned int		sfmode;		/* MCAST_{INCLUDE,EXCLUDE} */
	struct ipv6_mc_socklist __rcu *next;
	struct ip6_sf_socklist	__rcu *sflist;
	struct rcu_head		rcu;
};

This struct is 56 bytes long on a 6.1 kernel.

$ gdb /usr/lib/debug/lib/modules/6.1.0-18-amd64/vmlinux
GNU gdb (Debian 13.1-3) 13.1
(gdb) print sizeof(struct ipv6_mc_socklist)
$1 = 56
(gdb)
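pahole, from the dwarves package, can report the same layout and size directly from the debug symbols:

$ pahole -C ipv6_mc_socklist /usr/lib/debug/lib/modules/6.1.0-18-amd64/vmlinux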

After a few checks, the sock_kmalloc function is then called with the following arguments:

[__ipv6_sock_mc_join continued]
	mc_lst = sock_kmalloc(sk, sizeof(struct ipv6_mc_socklist), GFP_KERNEL);

	if (!mc_lst)
		return -ENOMEM; /* <- here we go */

[...]

sock_kmalloc seems to return NULL, since what we get back from ipv6_sock_mc_join is ENOMEM and no other function in this code path was executed (see the ftrace output above).

/*
 * Allocate a memory block from the socket's option memory buffer.
 */
void *sock_kmalloc(struct sock *sk, int size, gfp_t priority)
{
	int optmem_max = READ_ONCE(sock_net(sk)->core.sysctl_optmem_max);

	if ((unsigned int)size <= optmem_max &&
	    atomic_read(&sk->sk_omem_alloc) + size < optmem_max) {
		void *mem;
		/* First do the add, to avoid the race if kmalloc
		 * might sleep.
		 */
		atomic_add(size, &sk->sk_omem_alloc);
		mem = kmalloc(size, priority);
		if (mem)
			return mem;
		atomic_sub(size, &sk->sk_omem_alloc);
	}
	return NULL;
}

sk_omem_alloc (socket option/other memory alloc) is an atomic_t (a 32-bit integer) holding the grand total of all the memory allocations made on this socket for the option buffer (see atomic_add(size, &sk->sk_omem_alloc)).
 
optmem_max is the net.core.optmem_max sysctl value: the maximum amount of memory that can be allocated for a socket’s option buffer. It is set to 20480 on our hypervisors.

$ sysctl net.core.optmem_max
net.core.optmem_max = 20480

Basically, there is a maximum size for the option buffer of a socket, set by the net.core.optmem_max sysctl. The struct ipv6_mc_socklist structure that stores a multicast group membership is 56 bytes long. In our setup, it takes 365 group members to entirely fill the option buffer of that socket, using 20440 bytes. The 366th member results in an ENOMEM, as 20496 bytes would be required to store this new member alongside the existing ones.
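A quick sanity check of the arithmetic:

# how many 56-byte memberships fit under optmem_max = 20480
$ echo $((20480 / 56))
365
# the 366th one would bring the total to
$ echo $((365 * 56 + 56))
20496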
 
In theory, this buffer should never be entirely filled; however, radvd does not remove any interface from the multicast group when that interface is removed (tap VM interfaces are dynamically created and removed on the fly as VMs boot and shut down). The more users connect to the server, the less free space remains in the socket’s option buffer. This can only be fixed by restarting the service (we can also increase the value of net.core.optmem_max, but the issue will come back if the value is not high enough; it is a very temporary fix).
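This is easy to reproduce without radvd: the hypothetical program below (our own sketch, not radvd code) joins a distinct multicast group per iteration on a single socket, each join allocating one 56-byte struct ipv6_mc_socklist, until sock_kmalloc gives up:

#include <arpa/inet.h>
#include <errno.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	int sock = socket(AF_INET6, SOCK_DGRAM, 0);
	struct ipv6_mreq mreq;
	char group[INET6_ADDRSTRLEN];

	if (sock < 0) {
		perror("socket");
		return 1;
	}
	memset(&mreq, 0, sizeof(mreq));
	mreq.ipv6mr_interface = if_nametoindex("tap1"); /* adjust */

	for (int i = 1; i <= 1000; i++) {
		/* a different group each time, so each join allocates
		 * a new struct ipv6_mc_socklist on this socket */
		snprintf(group, sizeof(group), "ff02::1:%x", i);
		inet_pton(AF_INET6, group, &mreq.ipv6mr_multiaddr);
		if (setsockopt(sock, IPPROTO_IPV6, IPV6_ADD_MEMBERSHIP,
			       &mreq, sizeof(mreq)) < 0) {
			/* with optmem_max = 20480 this should print join #366 */
			printf("join #%d failed: %s\n", i, strerror(errno));
			return 1;
		}
	}
	return 0;
}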
 

The fix

We saw earlier that we can add members to an IPv6 multicast group using IPV6_ADD_MEMBERSHIP. Shouldn’t we be able to remove them as well? There is a setsockopt option named IPV6_DROP_MEMBERSHIP whose goal is to remove an interface from a multicast group.
That is precisely what the cleanup_iface function in radvd’s code does:

int cleanup_iface(int sock, struct Interface *iface)
{
	/* leave the allrouters multicast group */
	cleanup_allrouters_membership(sock, iface);
	return 0;
}

This calls cleanup_allrouters_membership:

int cleanup_allrouters_membership(int sock, struct Interface *iface)
{
	struct ipv6_mreq mreq;

	memset(&mreq, 0, sizeof(mreq));
	mreq.ipv6mr_interface = iface->props.if_index;

	/* ipv6-allrouters: ff02::2 */
	mreq.ipv6mr_multiaddr.s6_addr32[0] = htonl(0xFF020000);
	mreq.ipv6mr_multiaddr.s6_addr32[3] = htonl(0x2);
	setsockopt(sock, SOL_IPV6, IPV6_DROP_MEMBERSHIP, &mreq, sizeof(mreq));
	return 0;
}

cleanup_iface only gets called at the end of the program’s execution.
The fix is to call cleanup_iface for each interface deletion event:

	if (nh->nlmsg_type == RTM_DELLINK) {
		dlog(LOG_INFO, 4, "netlink: %s removed, cleaning up", iface->props.name);
		cleanup_iface(icmp_sock, iface);
	}

A patch has been submitted to radvd.

Thanks to @xdbob for the patch, Claire for the network debugging and @cedricmaunoury for the help during the bug search!


  1. We didn’t have any logs because radvd logs to stdout by default and then daemonizes itself, closing those file descriptors. ↩︎