Scaling Kubernetes to 2,500 nodes

We’ve beenrunning⁠(opens in a new window)Kubernetes⁠(opens in a new window)for deep learning research for over two years. While our largest-scale workloads manage bare cloud VMs directly, Kubernetes provides a fast iteration cycle, reasonable scalability, and a lack of boilerplate which makes it ideal for most of our experiments. We now operate several Kubernetes clusters (some in the cloud and some on physical hardware), the largest of which we’ve pushed to over 2,500 nodes. This cluster runs in Azure on a combination of D15v2 and NC24 VMs.

On the path to this scale, many system components caused breakages, including etcd, the Kube masters, Docker image pulls, network, KubeDNS, and even our machines’ ARP caches. We felt it’d be helpful to share the specific issues we ran into, and how we solved them.

etcd

After passing 500 nodes in our cluster, our researchers started reporting regular timeouts from thekubectl⁠(opens in a new window)command line tool. We tried adding more Kube masters (VMs runningkube-apiserver⁠(opens in a new window)). This seemed to solve the problem temporarily, but once we passed 10 replicas we knew we were treating symptoms and not the cause (by comparison,GKE⁠(opens in a new window)uses a single 32-core VM for 500 nodes).

This made us strongly suspect ouretcd⁠(opens in a new window)cluster, which is the central store of state for the Kube masters. Looking inDatadog⁠(opens in a new window), we saw write latency spiking to hundreds of milliseconds on the DS15v2 machines running our etcd replicas, despite each machine using a P30 SSD capable of 5,000 IOPS.

Benchmarking performance withfio⁠(opens in a new window), we saw etcd was only able to use about 10% of the available IOPS because the write latency was 2ms and etcd does sequential I/O, making it latency-bound.

We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!

Our cluster ran well until we passed about 1,000 nodes, at which point we once again saw high commit latency from etcd. This time, we noticed the kube-apiservers were reading more than 500MB/s from etcd. We set upPrometheus⁠(opens in a new window)to monitor the apiservers, and also set the--audit-log-pathand--audit-log-maxbackupflags to enabled more logging on the apiserver. This surfaced a number of slow queries and excessive calls to the LIST API for Events.

The root cause: the default setting forFluentd⁠(opens in a new window)’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, thisissue⁠(opens in a new window)which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:

Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances. To do this, we just set the--etcd-servers-overridesflag to something like this:--etcd-servers-overrides=/events#https://0.example.com:2381;https://1.example.com:2381;https://2.example.com:2381

Another post-1,000 nodes failure was to hit etcd’s hard storage limit (by default 2GB), which causes it to stop accepting writes. This triggered a cascading failure: all our Kube nodes failed their health checks, and ourautoscaler⁠(opens in a new window)decided it thus needed to terminate all the workers. We’ve increased the max etcd size with the--quota-backend-bytesflag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.

Kube masters

We colocate the kube-apiserver,kube-controller-manager⁠(opens in a new window), andkube-scheduler⁠(opens in a new window)processes on the same machines. Forhigh availability⁠(opens in a new window), we always have at least 2 masters, and set the--apiserver-countflag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).

We use Kubernetes mainly as a batch scheduling system and rely on ourautoscaler⁠(opens in a new window)to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly. The default kube-scheduler policy is to spread out load evenly among nodes, but we want the opposite so that unused nodes can be terminated and also so that largepods⁠(opens in a new window)can be scheduled quickly. So we switched to the following policy:

Plain Text

1{2"kind" : "Policy",3"apiVersion" : "v1",4"predicates" : [5 {"name" : "GeneralPredicates"},6 {"name" : "MatchInterPodAffinity"},7 {"name" : "NoDiskConflict"},8 {"name" : "NoVolumeZoneConflict"},9 {"name" : "PodToleratesNodeTaints"}10 ],11"priorities" : [12 {"name" : "MostRequestedPriority", "weight" : 1},13 {"name" : "InterPodAffinityPriority", "weight" : 2}14 ]15}

We useKubeDNS⁠(opens in a new window)extensively for service discovery, but soon after rolling out the new scheduling policy it started having reliability issues. We found that the failures were only happening on certain pods of KubeDNS. With the new scheduling policy some machines ended up running 10+ copies of KubeDNS, creating hotspots, and we had exceeded the ~200QPS that’s allowed from each Azure VM for external domains lookups.

We fixed this by adding ananti-affinity rule⁠(opens in a new window)to our KubeDNS pods:

1affinity:2 podAntiAffinity:3 requiredDuringSchedulingIgnoredDuringExecution:4 - weight: 1005 labelSelector:6 matchExpressions:7 - key: k8s-app8 operator: In9 values:10 - kube-dns11 topologyKey: kubernetes.io/hostname

Docker image pulls

OurDota⁠(opens in a new window)project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting inPending⁠(opens in a new window)for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while — but this was true for other containers as well. Digging in, we found thatkubelet⁠(opens in a new window)has a--serialize-image-pullsflag which defaults totrue, meaning the Dota image pull blocked all other images. Changing tofalserequired switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.

Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message:rpc error: code = 2 desc = net/http: request canceled. The kubelet and Docker logs also contained messages indicating that the image pull had been canceled, due to a lack of progress. We tracked the root to large images taking too long to pull/extract, or times when we had a long backlog of images to pull. To address this, we set kubelet’s--image-pull-progress-deadlineflag to 30 minutes, and set the Docker daemon’smax-concurrent-downloadsoption to 10. (The second option didn’t speed up extraction of large images, but allowed the queue of images to pull in parallel.)

Our last Docker pull issue was due to the Google Container Registry. By default, kubelet pulls a special image fromgcr.io(controlled by the--pod-infra-container-imageflag) which is used when starting any new container. If that pull fails for any reason, like exceeding yourquota⁠(opens in a new window), that node won’t be able to launch any containers. Because our nodes go through a NAT to reachgcr.iorather than having their own public IP, it’s quite likely that we’ll hit this per-IP quota limit. To fix this, we simply preloaded that Docker image in the machine image for our Kubernetes workers by usingdocker image save -o /opt/preloaded_docker_images.taranddocker image load -i /opt/preloaded_docker_images.tar. To improve performance, we do the same for a whitelist of common OpenAI-internal images like the Dota image.

Networking

As our experiments grow larger, they also become increasingly complex distributed systems which rely heavily on the network for their operation. When we first started running distributed experiments, it became immediately obvious that our networking wasn’t configured well. Directly between machines we got 10-15Gbit/s of throughput, but our Kube pods using Flannel were maxing out at ~2Gbit/s. Machine Zone’spublic benchmarks⁠(opens in a new window)show similar numbers, meaning the issue wasn’t likely to just be bad config, but instead something inherent to our environment. (By contrast, Flannel does not add this overhead on our physical machines.)

To work around this, users can add two different settings to disable Flannel for their pod:hostNetwork: trueanddnsPolicy: ClusterFirstWithHostNet. (Though read thewarnings⁠(opens in a new window)in the Kubernetes documentation before doing this.)

ARP cache

Despite our DNS tuning, we still saw intermittent issues with DNS resolution. One day an engineer reported thatnc -vto their Redis server was taking over 30 seconds to print that the connection was established. We tracked the issue to the kernel’s ARP stack. Initial investigation of the Redis pod’s host showed something seriously wrong with the network: communication on any port was hanging for multiple seconds, and no DNS names could be resolved via the localdnsmasq⁠(opens in a new window)daemon, withdig⁠(opens in a new window))just printing a cryptic failure message:socket.c:1915: internal_send: 127.0.0.1#53: Invalid argument. Thedmesg⁠(opens in a new window)log was more informative:neighbor table overflow!which meant that the ARP cache had run out of space. ARP is used for mapping a network address such as an IPv4 address, to a physical address, such as a MAC address. Fortunately, this was easy to fix by setting a few options in/etc/sysctl.conf:

1net.ipv4.neigh.default.gc_thresh1 = 800002net.ipv4.neigh.default.gc_thresh2 = 900003net.ipv4.neigh.default.gc_thresh3 = 100000

It’s common to tune this setting in HPC clusters, and is particularly relevant in Kubernetes clusters since every pod has its own IP address which consumes space in the ARP cache.

Our Kubernetes clusters have been incident-free for about 3 months now, and we’re planning to scale to even larger clusters in 2018. We recently upgraded to version 1.8.4, and are excited to see that it now officially supports 5,000. If you’re interested in building large scale compute clusters,we’re hiring⁠(opens in a new window)!

1net.ipv4.neigh.default.gc_thresh1 = 800002net.ipv4.neigh.default.gc_thresh2 = 900003net.ipv4.neigh.default.gc_thresh3 = 100000

It’s common to tune this setting in HPC clusters, and is particularly relevant in Kubernetes clusters since every pod has its own IP address which consumes space in the ARP cache.

Author

Christopher Berner

View all

Techniques for training large neural networks Publication Jun 9, 2022

Introducing Triton: Open-source GPU programming for neural networks Release Jul 28, 2021

Scaling Kubernetes to 7,500 nodes Conclusion Jan 25, 2021

Techniques for training large neural networks Publication Jun 9, 2022

Introducing Triton: Open-source GPU programming for neural networks Release Jul 28, 2021

Scaling Kubernetes to 7,500 nodes Conclusion Jan 25, 2021

Scaling Kubernetes to 2,500 nodes

To work around this, users can add two different settings to disable Flannel for their pod:hostNetwork: trueanddnsPolicy: ClusterFirstWithHostNet. (Though read thewarnings⁠(opens in a new window)in the Kubernetes documentation before doing this.)

1net.ipv4.neigh.default.gc_thresh1 = 800002net.ipv4.neigh.default.gc_thresh2 = 900003net.ipv4.neigh.default.gc_thresh3 = 100000

View all

Introducing Triton: Open-source GPU programming for neural networks Release Jul 28, 2021

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Using AI to help physicians diagnose rare genetic diseases affecting children

Just a moment...

Comments

Scaling Kubernetes to 2,500 nodes

To work around this, users can add two different settings to disable Flannel for their pod:hostNetwork: trueanddnsPolicy: ClusterFirstWithHostNet. (Though read thewarnings⁠(opens in a new window)in the Kubernetes documentation before doing this.)

1net.ipv4.neigh.default.gc_thresh1 = 800002net.ipv4.neigh.default.gc_thresh2 = 900003net.ipv4.neigh.default.gc_thresh3 = 100000

View all

Introducing Triton: Open-source GPU programming for neural networks Release Jul 28, 2021

More from ChatGPT

New usage analytics and updated spend controls for enterprises

Improving health intelligence in ChatGPT

Using AI to help physicians diagnose rare genetic diseases affecting children

Just a moment...

Comments

The Next Input keeps optional media off until you say yes.