IPv6 connectivity issue: a journey into radvd and socket options
Context
Shadow is a cloud computing service specialized in gaming. To provide isolated machines to users, it relies heavily on virtual machines.
Each virtual machine has dual-stack connectivity with both IPv4 and IPv6. Over IPv4, the network configuration is pushed via DHCP, which provides both addressing and default-route information. Over IPv6, we rely on the Router Advertisement mechanism.
Recently, we changed our networking stack on the hypervisor: whereas the virtual interfaces for each VM used to be created beforehand and kept persistent, we moved to transient virtual interfaces, created and removed each time a virtual machine is started or stopped.
After a few weeks we noticed that virtual machines could randomly lose their IPv6 connectivity to the Internet.
Disabling IPv6 restored reliable access to the affected VMs, but we dug in to find the root cause of this defect.
Analysis of the IPv6 connectivity in the VM
After a short analysis we noticed that the Windows routing table no longer had an IPv6 default route, only link-local and on-link routes.
> route print
[...]
IPv6 Route Table
Active Routes:
If Metric Network Destination Gateway
1 331 ::1/128 On-link
4 271 fd12:0:0:a01::a/128 On-link
4 271 fe80::/64 On-link
4 271 fe80::d756:522a:7e8b:339c/128 On-link
1 331 ff00::/8 On-link
4 271 ff00::/8 On-link
Persistent Routes: None
A correct configuration would be:
> route print
[...]
IPv6 Route Table
Active Routes:
If Metric Network Destination Gateway
13 271 ::/0 fe80::fc42:c6ff:fe7e:6388
1 331 ::1/128 On-link
13 271 fd12:0:0:a08::a/128 On-link
13 271 fe80::/64 On-link
13 271 fe80::6b86:ba8a:cc33:f514/128 On-link
1 331 ff00::/8 On-link
13 271 ff00::/8 On-link
Persistent Routes: None
Despite having an IPv6 address, the VM does not know where to send its IPv6 packets.
Once the VM has initialized its network interface, it sends an RS (Router Solicitation) to discover the available IPv6 routers. This is radvd's time to shine, as it is in charge of answering this RS with an RA (Router Advertisement). The VM then knows that this router will handle its traffic, and adds the ::/0 route via fe80::fc42:c6ff:fe7e:6388.
These mechanisms are part of the Neighbor Discovery Protocol.
This Router Advertisement is never sent when radvd encounters a specific bug… and tcpdump made it pretty clear that the Router Advertisement had indeed gone missing.
Understanding the missing Router Advertisement
To receive Router Solicitations and send Router Advertisements, radvd configures its socket to join the all-routers multicast group on the virtual machine's network interface. This is done with the setsockopt syscall.
setsockopt is a syscall that sets options on a socket. For example, it can be used to force the reuse of an address (with SO_REUSEADDR) or to change the size of the buffer allocated for receiving data on the socket (with SO_RCVBUF). In our case, it is used to join a multicast group (with IPV6_ADD_MEMBERSHIP).
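For illustration, here is a minimal sketch of such a join, not radvd's actual code: a socket joins the all-routers group ff02::2 on an interface named tap1 (the interface name is an assumption for the example).
/* Minimal sketch of joining the IPv6 all-routers multicast group.
 * Assumption: an interface named "tap1" exists and is up.
 * radvd uses a raw ICMPv6 socket; a UDP socket is enough to demonstrate. */
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET6, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    struct ipv6_mreq mreq;
    memset(&mreq, 0, sizeof(mreq));
    inet_pton(AF_INET6, "ff02::2", &mreq.ipv6mr_multiaddr); /* group address */
    mreq.ipv6mr_interface = if_nametoindex("tap1");         /* interface index */

    /* IPPROTO_IPV6 is what strace displays as SOL_IPV6; IPV6_ADD_MEMBERSHIP
     * is the Linux name of POSIX's IPV6_JOIN_GROUP. */
    if (setsockopt(sock, IPPROTO_IPV6, IPV6_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("setsockopt(IPV6_ADD_MEMBERSHIP)");
        return 1;
    }
    puts("joined ff02::2 on tap1");
    return 0;
}
Note that sizeof(struct ipv6_mreq) is 20 bytes (a 16-byte group address plus a 4-byte interface index), which matches the length argument visible in the strace output below.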
Using strace we were able to trace these calls, their options, and their return values. Below is a curated output of strace attached to the radvd process [1].
$ strace -p <radvd PID>
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap2")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap4")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap3")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
setsockopt(3, SOL_IPV6, IPV6_ADD_MEMBERSHIP, {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap1")}, 20) = -1 ENOMEM (Cannot allocate memory)
[...]
Its arguments in this context are:
- 3: the target socket file descriptor, which will always be the same
- SOL_IPV6: the IPv6 protocol level
- IPV6_ADD_MEMBERSHIP: states that we want to join a multicast group
- {inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap1")}: the structure that describes the multicast group we aim to join
  - inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr): converts the ff02::2 IP address from a string to a binary sequence
  - ipv6mr_interface=if_nametoindex("tap1"): the target interface index, here tap1
- 20: the structure's length
We are simply trying to join an IPv6 multicast group through tap1, yet this results in an ENOMEM error. That means we are running out of memory, but why? The server's memory is absolutely not a concern; plenty of it is free! So we had to analyze things in depth to determine exactly where this ENOMEM is being returned.
Digging a little deeper
We used ftrace, a tool which can track the execution of the various kernel-side functions that make up the setsockopt syscall. ftrace, or function tracer, is a tracer built into the Linux kernel that allows analysis, debugging, and examination of kernel activity via the tracefs filesystem.
$ cd /sys/kernel/tracing
# Ask ftrace to trace only the radvd process
$ echo <radvd PID> > set_ftrace_pid
# Use the function_graph tracer
$ echo function_graph > current_tracer
To reproduce the issue, we create a network interface in tap mode and launch a VM that attempts to boot over the network, which makes radvd send a Router Advertisement (and thus triggers the setsockopt() call we are tracing):
$ ip tuntap add mode tap tap1
$ ip link set tap1 up
$ ip addr add fd12:0:0:a08::1/64 dev tap1
$ QEMU="qemu-system-x86_64 -boot n -net nic -net tap,ifname=tap1,script=no,downscript=no -nographic"
$ $QEMU
We now stop ftrace:
$ echo nop > /sys/kernel/tracing/current_tracer
The trace below has been trimmed for the sake of clarity.
$ cat /sys/kernel/tracing/trace
60) ! 304.905 us | } /* syscall_trace_enter.constprop.0 */
60) | __x64_sys_setsockopt() {
60) | __sys_setsockopt() {
60) | sockfd_lookup_light() {
60) | __fdget() {
60) 0.376 us | __fget_light();
60) 0.995 us | }
60) 1.719 us | }
[..]
60) | sock_common_setsockopt() {
60) | rawv6_setsockopt() {
60) | ipv6_setsockopt() {
60) | do_ipv6_setsockopt() {
60) | rtnl_lock() {
60) | mutex_lock() {
60) 0.308 us | __cond_resched();
60) 0.866 us | }
60) 1.440 us | }
60) | sockopt_lock_sock() {
60) 0.285 us | __cond_resched();
60) | _raw_spin_lock_bh() {
60) 0.321 us | preempt_count_add();
60) 0.998 us | }
60) | _raw_spin_unlock_bh() {
60) | __local_bh_enable_ip() {
60) 0.278 us | preempt_count_sub();
60) 0.860 us | }
60) 1.400 us | }
60) 4.289 us | }
60) | ipv6_sock_mc_join() {
60) | __ipv6_sock_mc_join() {
60) | rtnl_is_locked() {
60) 0.286 us | mutex_is_locked();
60) 0.864 us | }
60) 0.375 us | sock_kmalloc();
60) 6.348 us | }
60) 7.243 us | }
60) | sockopt_release_sock() {
60) | release_sock() {
60) | _raw_spin_lock_bh() {
60) 0.318 us | preempt_count_add();
60) 1.033 us | }
60) | _raw_spin_unlock_bh() {
60) | __local_bh_enable_ip() {
60) 0.307 us | preempt_count_sub();
60) 0.878 us | }
60) 1.407 us | }
60) 3.493 us | }
60) 4.086 us | }
[..]
60) + 22.281 us | }
60) + 23.195 us | }
60) + 23.953 us | }
60) + 24.821 us | }
60) 0.302 us | kfree();
60) + 32.012 us | }
60) + 32.633 us | }
Making use of the setsockopt syscall involves calling:
- sock_common_setsockopt
- rawv6_setsockopt
- do_ipv6_setsockopt
- sockopt_lock_sock
- ipv6_sock_mc_join
- __ipv6_sock_mc_join
- sock_kmalloc
- sockopt_release_sock
The ipv6_sock_mc_join ("IPv6 socket multicast join") function caught our attention here, since its role is to add the interface to the multicast group. It also calls sock_kmalloc.
(ipv6_sock_mc_join is a mere wrapper around __ipv6_sock_mc_join that just adds an argument.)
static int __ipv6_sock_mc_join(struct sock *sk, int ifindex,
                               const struct in6_addr *addr, unsigned int mode)
{
    struct net_device *dev = NULL;
    struct ipv6_mc_socklist *mc_lst;
    struct ipv6_pinfo *np = inet6_sk(sk);
    struct net *net = sock_net(sk);
    int err;

    ASSERT_RTNL();

    if (!ipv6_addr_is_multicast(addr))
        return -EINVAL;

    for_each_pmc_socklock(np, sk, mc_lst) {
        if ((ifindex == 0 || mc_lst->ifindex == ifindex) &&
            ipv6_addr_equal(&mc_lst->addr, addr))
            return -EADDRINUSE;
    }
[...]
The struct ipv6_mc_socklist structure represents a multicast group member, as seen earlier in the trace:
{inet_pton(AF_INET6, "ff02::2", &ipv6mr_multiaddr), ipv6mr_interface=if_nametoindex("tap1")}
- addr: the multicast group address
- ifindex: the index of the interface with which we join the multicast group
- sfmode: set to MCAST_EXCLUDE (a consequence of going through the ipv6_sock_mc_join wrapper; this is the mode argument of __ipv6_sock_mc_join)
- rcu: Read-Copy-Update, a mechanism ensuring that valid data is read in contexts where data is read much more frequently than it is written, concurrently (see: https://www.kernel.org/doc/html/next/RCU/whatisRCU.html)
struct ipv6_mc_socklist {
    struct in6_addr addr;
    int ifindex;
    unsigned int sfmode;  /* MCAST_{INCLUDE,EXCLUDE} */
    struct ipv6_mc_socklist __rcu *next;
    struct ip6_sf_socklist __rcu *sflist;
    struct rcu_head rcu;
};
This struct is 56 bytes long on a 6.1 kernel, as gdb confirms against the kernel's debug symbols:
$ gdb /usr/lib/debug/lib/modules/6.1.0-18-amd64/vmlinux
GNU gdb (Debian 13.1-3) 13.1
(gdb) print sizeof(struct ipv6_mc_socklist)
$1 = 56
(gdb)
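For intuition about where these 56 bytes come from, the layout can be rebuilt in userspace with a replica of the struct. This is only a sketch, assuming a 64-bit build where struct rcu_head consists of two pointers; the replica types below are ours, not the kernel's.
/* Userspace replica of struct ipv6_mc_socklist, to account for its 56 bytes
 * on x86_64. Assumption: struct rcu_head is two pointers on this build. */
#include <netinet/in.h>
#include <stdio.h>

struct rcu_head_replica {        /* 16 bytes */
    void *next;
    void *func;
};

struct ipv6_mc_socklist_replica {
    struct in6_addr addr;        /* 16 bytes: multicast group address */
    int ifindex;                 /*  4 bytes */
    unsigned int sfmode;         /*  4 bytes: MCAST_{INCLUDE,EXCLUDE} */
    void *next;                  /*  8 bytes: stands in for the __rcu pointer */
    void *sflist;                /*  8 bytes: stands in for the __rcu pointer */
    struct rcu_head_replica rcu; /* 16 bytes */
};

int main(void)
{
    /* 16 + 4 + 4 + 8 + 8 + 16 = 56, with no padding needed on x86_64 */
    printf("%zu\n", sizeof(struct ipv6_mc_socklist_replica));
    return 0;
}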
After a few checks, the sock_kmalloc function is called with the following arguments:
- sk: the current socket structure
- sizeof(struct ipv6_mc_socklist): the length of the structure being added
- GFP_KERNEL: a Get Free Pages flag, which controls the memory allocator's behavior (see: https://www.kernel.org/doc/html/next/core-api/memory-allocation.html)
[__ipv6_sock_mc_join continued]
    mc_lst = sock_kmalloc(sk, sizeof(struct ipv6_mc_socklist), GFP_KERNEL);
    if (!mc_lst)
        return -ENOMEM; /* <- here we go */
[...]
sock_kmalloc appears to return NULL, since what we get back from ipv6_sock_mc_join is an ENOMEM and, as the ftrace output above shows, no other function in the code was executed.
/*
 * Allocate a memory block from the socket's option memory buffer.
 */
void *sock_kmalloc(struct sock *sk, int size, gfp_t priority)
{
    int optmem_max = READ_ONCE(sock_net(sk)->core.sysctl_optmem_max);

    if ((unsigned int)size <= optmem_max &&
        atomic_read(&sk->sk_omem_alloc) + size < optmem_max) {
        void *mem;
        /* First do the add, to avoid the race if kmalloc
         * might sleep.
         */
        atomic_add(size, &sk->sk_omem_alloc);
        mem = kmalloc(size, priority);
        if (mem)
            return mem;
        atomic_sub(size, &sk->sk_omem_alloc);
    }
    return NULL;
}
sk_omem_alloc (socket option/other memory alloc) is an atomic_t (a 32-bit integer) that holds the grand total of all memory allocations made on this socket for the option buffer (see the atomic_add(size, &sk->sk_omem_alloc) call).
optmem_max is the net.core.optmem_max sysctl value: the maximum amount of memory that can be allocated for a socket's option buffer. It is set to 20480 on our hypervisors.
$ sysctl net.core.optmem_max
net.core.optmem_max = 20480
In short, there is a maximum size for the option buffer of a socket, set by the net.core.optmem_max sysctl. The struct ipv6_mc_socklist structure that stores a multicast group member is 56 bytes long. In our setup, it therefore takes 365 group members to fill the option buffer of that socket, using 20440 bytes. The 366th member results in an ENOMEM, as 20496 bytes would be needed to store the new member alongside the existing ones.
In theory, this buffer should never fill up entirely; in practice, however, radvd never leaves the multicast group when an interface is removed (tap VM interfaces are dynamically created and removed on the fly as VMs boot and shut down). The more VMs start on a given hypervisor, the less space remains in the option buffer. This can only be fixed by restarting the service (increasing net.core.optmem_max would also work, but the issue comes back if the value is not high enough, making it a very temporary fix).
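To convince ourselves, the exhaustion is easy to reproduce on a single socket: join distinct multicast groups in a loop without ever leaving them, until the join fails once sk_omem_alloc reaches net.core.optmem_max. This is a sketch under our assumptions; the interface name and the transient group addresses are illustrative.
/* Sketch of a reproducer (not radvd code): fill a socket's option memory
 * by joining a distinct multicast group per iteration and never leaving,
 * until setsockopt() fails with ENOMEM.
 * Assumption: "eth0" is an up, IPv6-enabled interface; adjust as needed. */
#include <arpa/inet.h>
#include <errno.h>
#include <net/if.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    int sock = socket(AF_INET6, SOCK_DGRAM, 0);
    if (sock < 0) {
        perror("socket");
        return 1;
    }

    unsigned int ifindex = if_nametoindex("eth0");
    for (int i = 1; ; i++) {
        struct ipv6_mreq mreq;
        char group[40];

        memset(&mreq, 0, sizeof(mreq));
        snprintf(group, sizeof(group), "ff02::4242:%x", i); /* distinct group per join */
        inet_pton(AF_INET6, group, &mreq.ipv6mr_multiaddr);
        mreq.ipv6mr_interface = ifindex;

        if (setsockopt(sock, IPPROTO_IPV6, IPV6_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
            printf("join #%d failed: %s\n", i, strerror(errno));
            return 0;
        }
    }
}
With the default optmem_max of 20480 and 56 bytes per membership, the failure should occur on the 366th join, matching the arithmetic above.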
The fix
We saw earlier that we can add members to an IPv6 multicast group with IPV6_ADD_MEMBERSHIP. Shouldn't we remove them as well? There is a setsockopt option named IPV6_DROP_MEMBERSHIP whose goal is precisely to remove an interface from a multicast group.
That is exactly what the cleanup_iface function in radvd's code does:
int cleanup_iface(int sock, struct Interface *iface)
{
    /* leave the allrouters multicast group */
    cleanup_allrouters_membership(sock, iface);

    return 0;
}
Where cleanup_allrouters_membership is:
int cleanup_allrouters_membership(int sock, struct Interface *iface)
{
    struct ipv6_mreq mreq;

    memset(&mreq, 0, sizeof(mreq));
    mreq.ipv6mr_interface = iface->props.if_index;

    /* ipv6-allrouters: ff02::2 */
    mreq.ipv6mr_multiaddr.s6_addr32[0] = htonl(0xFF020000);
    mreq.ipv6mr_multiaddr.s6_addr32[3] = htonl(0x2);

    setsockopt(sock, SOL_IPV6, IPV6_DROP_MEMBERSHIP, &mreq, sizeof(mreq));

    return 0;
}
cleanup_iface only gets called at the end of the program's execution.
The fix is to call cleanup_iface on each interface deletion event:
if (nh->nlmsg_type == RTM_DELLINK) {
    dlog(LOG_INFO, 4, "netlink: %s removed, cleaning up", iface->props.name);
    cleanup_iface(icmp_sock, iface);
}
A patch has been submitted to radvd.
Thanks to @xdbob for the patch, Claire for the network debugging and @cedricmaunoury for the help during the bug search!
[1] We did not have any logs because radvd logs to stdout by default and then daemonizes itself, closing those file descriptors.