Jul 23, 2020

[GN] Part 2: Receiving

Now that we know how fast we can send packets, it is time to find out how fast we can receive them on the other side of the network. As in the last post, I will start by sending everything locally so that I can test it easily. We will then combine both applications and let them send to each other on different hosts.

Sadly not my infrastructure. Photo by Thomas Jensen on Unsplash.

Starting off easy

Let’s start single-threaded as we did in the last post: one thread that receives from one socket, sums one received bit (so the compiler won’t simply optimize away the whole program) and discards the packet.
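To make this concrete, here is a minimal sketch of such a receive loop using only the standard library. It is not the benchmark code from this series: the function name `receive_and_discard`, the buffer size and the loopback setup in `main` are assumptions made for illustration, and the sketch sums the first byte of every datagram.

```rust
use std::io;
use std::net::UdpSocket;
use std::time::Duration;

/// Receives `count` datagrams and returns the wrapping sum of each
/// datagram's first byte, so the loop has an observable result.
fn receive_and_discard(socket: &UdpSocket, count: usize) -> io::Result<u64> {
    let mut buf = [0u8; 1500];
    let mut sum = 0u64;
    for _ in 0..count {
        let (len, _sender) = socket.recv_from(&mut buf)?;
        if len > 0 {
            sum = sum.wrapping_add(u64::from(buf[0]));
        }
        // The rest of the datagram is simply dropped.
    }
    Ok(sum)
}

fn main() -> io::Result<()> {
    // Port 0 lets the OS pick a free port; a real receiver would use a fixed one.
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    // Fail with an error instead of hanging forever if a datagram gets lost.
    receiver.set_read_timeout(Some(Duration::from_secs(2)))?;
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    let target = receiver.local_addr()?;

    for i in 0..10u8 {
        sender.send_to(&[i, 0, 0, 0], target)?;
    }
    let sum = receive_and_discard(&receiver, 10)?;
    println!("sum of first bytes: {}", sum);
    Ok(())
}
```

Returning the sum from the function is what keeps the compiler from removing the loop body.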

This works surprisingly well: a single thread can receive and discard about 1.8 million packets per second from localhost on my machine. However, we need to distribute receiving onto multiple threads if we want to be able to efficiently distribute parsing and crypto across multiple threads as well. I further suppose that when using actual hardware, it will come in handy to have several threads splitting their time between waiting for the device and calculating stuff on the CPU.

One way to have multiple threads share the load would be to bind each of them to a different socket address. This might work quite well, but it complicates client-side logic, since the clients need to reconnect to different sockets when we change the number of threads.

Fortunately, the SO_REUSEPORT option from the last article also works for receiving messages. Before we go over how to make use of it, let me first show you what to avoid: contending on a single socket, as I did in the tokio post.

What did I expect?

In retrospect, I seriously wonder what I was thinking. Tokio distributes work across multiple tasks that might be executed in parallel, or at least concurrently, on a thread pool. This is great for connection-oriented protocols like TCP: when you have 1000 connections, they can easily be mapped to 8 threads that way, only costing resources when there is work available.

But this is not necessarily a good idea for UDP. The UDP protocol was designed for connection-less communication. If we want to implement connection semantics on top of it, we have to do so in our own code, after receiving messages. It is therefore utter nonsense to receive from one socket and distribute the receiving itself to a thread pool. An exception would be if we cannot offload processing the received message to another thread for whatever reason.

After thinking about the approach in the tokio article, I am still surprised that it did in fact work so well.

Reusing the port

Now, let’s make use of SO_REUSEPORT. First, the same note as in the send article: this won’t allow our code to run on Windows or on Linux systems with ancient kernels. So what does SO_REUSEPORT actually mean for receiving datagrams?

Let me quickly cite the docs here:

For UDP sockets, the use of this option can provide better distribution of incoming datagrams to multiple processes (or threads) as compared to the traditional technique of having multiple processes compete to receive datagrams on the same socket.

First thing to note: the threads are not competing for the incoming data and therefore are not churning on some lock. But how is this achieved? According to the first result on Google:

Incoming connections and datagrams are distributed to the server sockets using a hash based on the 4-tuple of the connection

Originally, I planned to dive into the net/ipv4/udp.c sources of the kernel here but I got scared off pretty quickly: I completely forgot that goto still exists. Perhaps one day I will work my way through the netcode. For now let’s just assume that the hashing works as one would expect.
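To get a feeling for what hash-based distribution means, here is a toy model. It is explicitly not the kernel's algorithm: `pick_socket`, the use of `DefaultHasher`, the destination address and the client ports are all assumptions for illustration. It only shows the principle that a fixed address tuple always lands on the same socket, and that a handful of clients will rarely spread evenly.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::net::SocketAddr;

/// Picks one of `socket_count` receivers for a datagram by hashing its
/// source and destination address (IP and port each, i.e. the 4-tuple).
/// Illustration only: the kernel uses its own hash, not `DefaultHasher`.
fn pick_socket(src: SocketAddr, dst: SocketAddr, socket_count: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    src.hash(&mut hasher);
    dst.hash(&mut hasher);
    (hasher.finish() % socket_count as u64) as usize
}

fn main() {
    // Hypothetical server address shared by all receiver sockets.
    let dst: SocketAddr = "127.0.0.1:4000".parse().unwrap();
    let mut clients_per_socket = [0usize; 6];

    // Twelve hypothetical clients that differ only in their source port.
    for port in 50000..50012u16 {
        let src = SocketAddr::from(([127, 0, 0, 1], port));
        clients_per_socket[pick_socket(src, dst, 6)] += 1;
    }
    println!("clients per socket: {:?}", clients_per_socket);
}
```

Since the tuple of a client does not change, all of its datagrams end up at the same receiver, which is what lets each thread own its share of clients without any locking between threads.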

For now, the plan is this: using SO_REUSEPORT, we have several threads that each receive from their own share of connected hosts. This means that a thread with no actively sending client assigned would poll its socket and block until there is something new. However, polling might be slower than being notified. I therefore chose to use epoll for this purpose: our code is woken up whenever something interesting happens.

How do we use epoll in Rust? One could use the syscall directly, but that sounds quite daunting. Luckily there is mio, a crate that abstracts all the nasty low-level things away and lets us directly register interests for which we want to be notified once they are fulfilled.

Mio

Mio uses its own mio::net::UdpSocket, which does not support setting the reuseport flag upon creation. Luckily, mio can wrap std sockets in its own types. So we can create a socket2 socket, set the reuseport flag, convert it to a std::net::UdpSocket and convert that one to a mio::net::UdpSocket.

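// Snippet, assuming socket2 0.3 and mio 0.7; `?` needs an enclosing function returning io::Result.
let s2_socket = Socket::new(...)?;
s2_socket.set_reuse_port(true)?;
// mio leaves it to the caller to put the socket into non-blocking mode.
s2_socket.set_nonblocking(true)?;
s2_socket.bind(&addr.into())?;
let std_socket = s2_socket.into_udp_socket();
let mio_socket = UdpSocket::from_std(std_socket);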

Using mio, we have a problem with measuring our performance. With the std sockets, I could simply let them receive a million packets, return from the call, measure the execution duration and resume receiving. In mio, however, things work quite a bit differently. We first register an interest with a Poll struct. That way, mio knows whether we want to be notified when the socket is readable, writable or both. We then poll on the Poll struct (not the socket) and get a list of events via poll.poll(&mut events,...). We then have to process the events and hopefully find our token among them, stating that the socket is readable. Now the tricky part begins: we have to receive from the socket until it returns an error of kind io::ErrorKind::WouldBlock. If we stop receiving before that, the socket might not fire the readable event again and we might never be able to read from it again.
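The drain-until-WouldBlock part can be sketched without mio at all, since a non-blocking std socket reports the same error kind. The following is only an illustration under that assumption: `drain` is a hypothetical helper, and the short sleep in `main` stands in for the readable event that mio would deliver.

```rust
use std::io::{self, ErrorKind};
use std::net::UdpSocket;
use std::thread;
use std::time::Duration;

/// Reads from a non-blocking socket until the kernel reports `WouldBlock`
/// and returns how many datagrams were drained.
fn drain(socket: &UdpSocket, buf: &mut [u8]) -> io::Result<usize> {
    let mut received = 0;
    loop {
        match socket.recv_from(buf) {
            Ok(_) => received += 1,
            // Nothing left to read: safe to go back to waiting for the next event.
            Err(e) if e.kind() == ErrorKind::WouldBlock => return Ok(received),
            // Interrupted by a signal: just try again.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}

fn main() -> io::Result<()> {
    let receiver = UdpSocket::bind("127.0.0.1:0")?;
    receiver.set_nonblocking(true)?;
    let sender = UdpSocket::bind("127.0.0.1:0")?;
    let target = receiver.local_addr()?;

    for _ in 0..5 {
        sender.send_to(b"ping", target)?;
    }
    // Give loopback delivery a moment; with mio, the readable event replaces this sleep.
    thread::sleep(Duration::from_millis(50));

    let mut buf = [0u8; 1500];
    println!("drained {} datagrams", drain(&receiver, &mut buf)?);
    Ok(())
}
```

In the mio version, a loop like `drain` would run once for every readable event that carries the token of our socket.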

One way to work around this is to re-register the interest. However, since I only want that code to measure things and not to win any beauty competitions, the receive loop of each thread measures the receive rates individually. This has another side-benefit: Having the rates of all threads, we can observe how incoming datagrams are distributed across the receiving threads.
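Measuring inside the receive loop could look roughly like the sketch below. The `RateMeter` type, its window size and the `packets_per_second` helper are made up for this illustration and are not taken from the benchmark code; the loop in `main` merely simulates received packets.

```rust
use std::time::{Duration, Instant};

/// Converts a packet count and the time it took into packets per second.
fn packets_per_second(packets: u64, elapsed: Duration) -> f64 {
    packets as f64 / elapsed.as_secs_f64()
}

/// Counts packets and reports a rate every `window` packets.
struct RateMeter {
    window: u64,
    count: u64,
    started: Instant,
}

impl RateMeter {
    fn new(window: u64) -> Self {
        RateMeter { window, count: 0, started: Instant::now() }
    }

    /// Call once per received packet. Returns the rate of the window
    /// that just completed, otherwise `None`.
    fn tick(&mut self) -> Option<f64> {
        self.count += 1;
        if self.count < self.window {
            return None;
        }
        let pps = packets_per_second(self.count, self.started.elapsed());
        self.count = 0;
        self.started = Instant::now();
        Some(pps)
    }
}

fn main() {
    let mut meter = RateMeter::new(1_000_000);
    // Stand-in for the receive loop: every iteration represents one received packet.
    for _ in 0..3_000_000u32 {
        if let Some(pps) = meter.tick() {
            println!("{:.0} pps", pps);
        }
    }
}
```

Each receiver thread would own its own meter, so no synchronization is needed and the per-thread numbers can be compared afterwards.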

The following table shows how well (or rather how badly) the 12 clients are distributed across the 6 receivers.

thread      average pps
0           583666.7
1           186166.7
2           583972.2
3           384625.0
4           186250.0
5           384708.3

In a real-world scenario, we would most likely have more than a hundred clients, so the differences would even out. Nonetheless, this observation is important: if we planned to deploy some traffic aggregators in front of the actual game server to decrease the packet rate on the game server by combining multiple clients in one packet, we would observe the same inequality.

Summing up the averages in R reveals a combined packet rate of 2.3 Mpps at the receiver when sending with 2.4 Mpps, so there is some packet loss, but it is not yet the majority:

> x <- read.csv("whatever.csv")
> a <- aggregate(x, list(x$thread), mean)
> sum(a$pps)
[1] 2309389

When sending without limits in the sender, the receiver can reach about 2.9 Mpps. However, the receivers with fewer associated senders will not live up to their full potential. A ‘saturated’ receiver handles about 1 Mpps on my machine.

Real Hardware

Now that we know what the software is capable of when running on its own, it is time to introduce a few problems: we need to send the traffic over real networks. Let me start with a bunch of hardware I have sitting at home, collecting dust.

My own hardware

We will repeat the same test as in the tokio article, and I expect to see the whole thing capped again at 500 Kpps. In a second step, I will directly connect the two machines without any network hardware in between.

It turns out the assumption was correct: the packet rate was again capped at about 500 Kpps on the sending side. This is important, since we could probably increase the receive rate by sending from multiple devices. However, since I wanted to test without a switch in between, I directly connected the hosts with a CAT7 cable and tried again, observing the same packet rate.

Time to level up the hardware a bit: let’s make use of those fancy clouds.

Someone else’s hardware

To dampen your expectations in advance, this one was disappointing. At DigitalOcean I ordered the fattest machines they would give me (32 CPUs, loads of RAM) and had them connected to a VPC. According to iperf, they are interconnected by a 1 Gbps link. I therefore suspected we would observe results comparable to my home network, or hopefully faster; these are state-of-the-art servers after all. Well, it turns out virtualization is a problem: the benchmarks are capped at 101000 pps, with a deviation of only 500 pps. This strongly indicates that the machines are rate-limited.

Renting a bit of metal

Luckily, one can rent bare-metal servers with hourly billing. Since I had a bit of demo credit left, I ordered two servers at packet.com for an hour to play with. Although slightly overpowered (48 CPUs and 64 GB RAM), they will hopefully do their job well.

Both servers have two Intel X710 10 Gbit/s NICs, which are bonded in their default configuration.

So how much did those machines achieve? I ran 16 receivers and 32 senders with a packet size of 128 bytes. According to the previously used calculation method, adding up the average rates, we total 8 Mpps. However, this calculation method has its problems: it is only a rough estimate that assumes all threads run at their average speed the whole time. Since this is not the case, let’s look at a few more details of the benchmark data.

Thread 12 performed exceptionally well, showcasing the best-case scenario:

> filter(x, th==12)
   run th     pps
1    1 12 1037000
2    2 12 1046000
3    3 12 1046000
4    4 12 1059000
5    5 12 1054000
6    6 12 1050000
7    7 12 1054000
8    8 12 1059000
9    9 12 1059000
10  10 12 1054000
11  11 12 1059000
12  12 12 1054000
13  13 12 1054000

Whereas thread 3 did not have enough work to do:

> filter(x, th==3)
  run th    pps
1   0  3 132000
2   1  3 145000

To summarize this catastrophic distribution, I want to show you the best boxplot I ever made:

Using 16 receiver threads instead of only 8 made a huge difference. On average we processed 576 Kpps per thread, but we might be able to process one million packets per second per thread; at least we did with thread 12. If it turns out we need to increase the total throughput, we could try to tune the kernel, the hardware or something along those lines, but I doubt that these rates will be insufficient for now.

Next steps

The next step is to develop some pseudo-reliable protocol on top of UDP and then find out how fast we can run it while compensating for packet loss. However, I first want to ensure that the other components of the networking stack are capable of keeping up with the speed of the socket, in particular deserialization and crypto.