Monday, January 5, 2015

Fixing The Internet, Part 2 - Bufferbloat

If you ask the average user, the quality of an internet connection is measured in bandwidth. Period. Not surprising, really, as this is what the industry has communicated. If your internet is slow, you just need to upgrade your bandwidth – at a premium, of course.

But there is another metric that is just as important to the quality of a connection, and that is latency. Latency (also called lag or ping time) is the time it takes for a tiny packet of data to travel through the network. It is critical in real-time use cases such as telephony, video conferencing, online gaming and ordinary browsing. In theory, latency should just be a function of how far away the content is, i.e. how many nodes the packet needs to travel through. In reality, it is a function of how buffers have been configured and managed in the network.
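
As a rough illustration of the metric, latency can be approximated by timing a small round trip. This minimal Python sketch times a TCP handshake to a server; the host name is a placeholder, not part of our test setup:

```python
import socket
import time

def tcp_rtt(host: str, port: int = 80) -> float:
    """Approximate round-trip time by timing a TCP handshake."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=5):
        pass  # the connect itself completes one round trip
    return (time.monotonic() - start) * 1000  # milliseconds

# "example.com" is a placeholder; substitute any reachable host.
print(f"RTT: {tcp_rtt('example.com'):.1f} ms")
```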

Buffers are queues that have been configured to make connections more robust. Sometimes a node gets busy; if there were no buffers, packets would be lost. So, to maximize bandwidth, an ISP might increase buffer sizes to make sure that all packets are delivered.

For video streaming, this is great: it doesn't matter if the stream is delayed a few seconds, so better to buffer up and make sure everything comes through. The problem is that the buffers are mostly managed as a single FIFO queue. Nothing is prioritized, and everything gets stuck in the same queue.
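
To see why a deep FIFO buffer hurts, note that a packet arriving at a full buffer has to wait for everything ahead of it to drain at the link rate. A minimal sketch of that arithmetic (the buffer and link sizes below are illustrative, not measured values):

```python
def queueing_delay_ms(buffer_bytes: int, link_mbps: float) -> float:
    """Worst-case time for a packet to drain through a full FIFO buffer."""
    return buffer_bytes * 8 / (link_mbps * 1e6) * 1000

# A hypothetical 1 MB buffer in front of a 25 Mbps link:
print(queueing_delay_ms(1_000_000, 25))  # ~320 ms added to every packet
```

A ping that would normally take 20 ms now takes over 300 ms, simply because it sits behind the bulk traffic in the same queue.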

Our tests show that ISPs and networks have very different buffer policies and management algorithms. Some manage OK, some get into trouble when uploading, but most run into problems on downloads – simply because the ISP has capped the download capacity of the connection.

What happens is that when you request content from an internet server, the server will try to send it to you as quickly as it can. It doesn't know about the cap. So when the ISP caps the connection, some of the content has to wait in a buffer – and everything else has to wait as well.

[Figure: Latency while streaming YouTube]
The diagram above shows a typical result. When traffic is unmanaged, latency becomes a function of the download rate. Latency of 300 ms or more has a significant impact on the quality of telephony, gaming and browsing.

In the later part of this experiment, we applied a traffic shaping algorithm on the router that constrained the bandwidth for the device to 20 Mbps (or 80% of the capacity of the 25 Mbps fiber connection in this example). As the result shows, the latency is almost eliminated.
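
The core idea behind such a shaper is a token bucket: the router only forwards bytes as fast as tokens accumulate, so the queue builds up at the router (where it can be kept short) instead of in the ISP's deep, unmanaged buffer. A minimal sketch of the concept, assuming a simple software shaper rather than our actual router implementation:

```python
import time

class TokenBucket:
    """Shape traffic to rate_bps, allowing at most `burst` bytes of slack."""
    def __init__(self, rate_bps: float, burst: int):
        self.rate = rate_bps / 8   # bytes per second
        self.burst = burst         # must exceed the largest packet size
        self.tokens = float(burst)
        self.last = time.monotonic()

    def wait_to_send(self, nbytes: int) -> None:
        """Block until enough tokens have accumulated to send nbytes."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# Shape to 20 Mbps (80% of a 25 Mbps link), with 32 KB of burst slack.
shaper = TokenBucket(rate_bps=20e6, burst=32_000)
```

Shaping slightly below the ISP's cap keeps the ISP's buffer empty, which is why the latency drops.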

What is really interesting here is that the YouTube video only needed 5 Mbps, or 20% of the download capacity. So why did we still see latency issues? YouTube and others use a dynamic streaming technique called MPEG-DASH. The client keeps only a limited buffer on the device, maxing out the download for a few seconds and then waiting until playback catches up. So even though the video only needs a fraction of the available bandwidth, it makes the internet stutter for everyone else sharing the connection!
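
The on/off pattern is easy to see with a bit of arithmetic. Suppose the client tops up a 10-second playback buffer at the full 25 Mbps line rate (the buffer depth here is an assumption for illustration, not a measured DASH parameter):

```python
video_rate_mbps = 5    # average rate the stream needs
line_rate_mbps = 25    # rate each burst actually runs at
buffer_seconds = 10    # assumed client-side playback buffer

# Fetching 10 s of 5 Mbps video at 25 Mbps takes only 2 s...
burst_s = buffer_seconds * video_rate_mbps / line_rate_mbps
idle_s = buffer_seconds - burst_s
print(f"burst {burst_s:.0f} s at line rate, then idle {idle_s:.0f} s")
```

During each burst the link is fully saturated and the ISP's buffer fills, so every other flow on the connection sees the full bufferbloat latency, even though the average utilization is only 20%.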

We automatically test bandwidth and buffering on our routers on a daily basis. This lets us learn the optimal traffic shaping policies and dynamically configure the router accordingly. Through this we have been able to remove 70–80% of the latency issues we have observed so far. We will share more on this in a follow-up post.
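
The essence of such a test is to measure latency while the link is saturated – sometimes called a "ping under load" test. A minimal sketch of the idea (the host and download URL are placeholders, and this is a simplification, not our production test):

```python
import socket
import threading
import time
import urllib.request

def rtt_ms(host: str, port: int = 80) -> float:
    """Approximate round-trip time by timing a TCP handshake."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=5):
        pass
    return (time.monotonic() - start) * 1000

def saturate(url: str) -> None:
    """Pull a large file to fill the downstream buffer."""
    with urllib.request.urlopen(url) as resp:
        while resp.read(65536):
            pass

# Baseline latency, then latency while a bulk download is running.
idle = rtt_ms("example.com")
threading.Thread(target=saturate,
                 args=("http://example.com/bigfile",),  # placeholder URL
                 daemon=True).start()
time.sleep(2)  # let the download ramp up and the buffer fill
loaded = rtt_ms("example.com")
print(f"idle: {idle:.0f} ms, under load: {loaded:.0f} ms")
```

If the loaded figure is much larger than the idle one, the connection is suffering from bufferbloat, and the shaping rate can be tuned down until the gap closes.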

Thanks to Dave Taht for adding insights to this post.