The 'Prime' Pod Question: Kubernetes Scheduling Demystified

I recently went through a flurry of interviews trying to hire a strong Platform Engineer for my team. Since our platform is built on Kubernetes, a solid grasp of its fundamentals isn't optional; it's table stakes.

Here is my favorite question.

"Imagine a multi-tenant Kubernetes cluster running many different workloads. What is the best way to configure requests and limits for your workload so that this workload is never OOM-killed?"

To answer this correctly, you need to understand how Kubernetes thinks about resources. Kubernetes makes resource management decisions across two distinct phases:

Scheduling -- Requests help the scheduler decide where a Pod can run.
Enforcement -- Limits and QoS determine what happens when things go wrong, such as memory pressure on a node.

Requests

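Resource requests are what the Kubernetes scheduler uses to place a Pod. If a Pod requests 3 GiB of memory, the scheduler will only place it on a node whose allocatable memory, minus the requests of the Pods already assigned there, still has at least 3 GiB left. If no node qualifies, the Pod stays Pending.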

However, if the node has additional free resources available, the container is allowed to use more than its requested amount. Requests drive placement; they do not cap usage.
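
To make the placement arithmetic concrete, here is a minimal sketch of a Pod that sets only requests. The Pod name, the image, and the node figures in the comments are all made up for illustration.

```yaml
# Hypothetical Pod: name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: requests-only-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          memory: 3Gi   # needs a node with at least 3Gi of unrequested allocatable memory
          cpu: 500m
# Illustrative arithmetic (assumed numbers):
#   node allocatable memory:           8Gi
#   requests of Pods already on node:  6Gi
#   left for scheduling:               2Gi  -> this Pod does not fit on that node
# The scheduler compares requests, not actual usage, so a node that is
# mostly idle can still be "full" from the scheduler's point of view.
```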

Caveats to Keep in Mind

Pod without limits: If a Pod specifies requests but no limits, it can use any available resources on the node beyond its request - not a great idea for production.
Pod with limits but no requests: Kubernetes automatically sets the request equal to the limit, unless an admission-time mechanism such as a LimitRange has already supplied a default request. This ensures the scheduler can place the Pod safely.
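
The two caveats can be sketched side by side. Names and images are placeholders, and the second case assumes the namespace has no LimitRange that would inject a different default request.

```yaml
# Case 1: requests but no limits (hypothetical Pod)
apiVersion: v1
kind: Pod
metadata:
  name: no-limits-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          memory: 1Gi
          cpu: 250m
        # No limits: the container may grow past 1Gi / 250m
        # for as long as the node has spare capacity.
---
# Case 2: limits but no requests (hypothetical Pod)
apiVersion: v1
kind: Pod
metadata:
  name: no-requests-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        limits:
          memory: 2Gi
          cpu: "1"
        # Requests omitted: Kubernetes copies the limits into the requests,
        # so the stored Pod ends up requesting memory: 2Gi and cpu: "1".
```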

Limits

Resource limits are enforced at runtime by the Linux kernel through cgroups, which the kubelet and the container runtime configure. Unlike requests, which mainly affect scheduling, limits define hard boundaries on how much of a resource a container can consume.

CPU Limits

Exceeding a CPU limit does not kill the container.
Instead, the container is throttled using CFS quotas, which slows it down.
This allows other containers on the node to get CPU time fairly.
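
As a rough illustration of how a CPU limit turns into a CFS quota, consider the sketch below. The Pod name and image are hypothetical, and the numbers in the comments assume the default CFS period of 100ms.

```yaml
# Hypothetical Pod: name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-limit-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests:
          cpu: 250m
        limits:
          cpu: 500m   # half a core
# With a 100ms CFS period, a 500m limit works out to a quota of
# 50ms of CPU time per period. Once the container has burned its 50ms,
# its threads wait for the next period: slower, but still alive.
```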

Memory Limits

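Exceeding a memory limit results in an OOMKill -- the kernel's OOM killer kills a process in the container, which usually terminates the container immediately.
Memory is incompressible: it cannot be throttled, only capped.
Under node memory pressure, the kubelet also evicts Pods, starting with those using more than they requested; in practice this tracks their QoS class.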
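
When a container is OOM-killed, the Pod's status records it. The excerpt below is an illustrative sketch of the relevant fields as they appear in the output of kubectl get pod <name> -o yaml; the container name and restart count are made up.

```yaml
# Illustrative status excerpt, not output from a real cluster.
status:
  containerStatuses:
    - name: app              # hypothetical container name
      restartCount: 1        # illustrative value
      lastState:
        terminated:
          reason: OOMKilled  # killed by the kernel OOM killer
          exitCode: 137      # 128 + 9 (SIGKILL)
```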

This asymmetry -- throttling CPU vs killing containers that consume too much memory -- is critical to understand when planning mission-critical workloads like prime.

QoS Classes

Kubernetes assigns every Pod a QoS class based on the resource requests and limits of its component containers. QoS classes are derived, not configured.

Kubernetes uses QoS to decide which Pods are evicted first when the node is under resource pressure.

Guaranteed

Every container in the Pod has both requests and limits specified for cpu and memory, and requests == limits.
Strongest guarantees.
Last to be evicted.

Burstable

The Pod does not meet the Guaranteed criteria, but at least one container has a memory or CPU request or limit set.
Typically evicted after BestEffort Pods.

BestEffort

None of the containers have any requests or limits.
First to be evicted.
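
Here is a minimal sketch of one Pod per class, so you can see what tips a Pod from one class into the next. All names and images are placeholders.

```yaml
# Guaranteed: every container sets cpu and memory, and requests == limits
apiVersion: v1
kind: Pod
metadata:
  name: qos-guaranteed-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests: {cpu: 500m, memory: 1Gi}
        limits: {cpu: 500m, memory: 1Gi}
---
# Burstable: something is set, but not enough for Guaranteed
apiVersion: v1
kind: Pod
metadata:
  name: qos-burstable-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
      resources:
        requests: {memory: 512Mi}
        limits: {memory: 1Gi}
---
# BestEffort: no requests or limits on any container
apiVersion: v1
kind: Pod
metadata:
  name: qos-besteffort-demo
spec:
  containers:
    - name: app
      image: registry.example.com/app:1.0   # placeholder image
```

Once a Pod is created, kubectl get pod <name> -o jsonpath='{.status.qosClass}' reports the class Kubernetes derived for it.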

An Ideal Answer

For a mission-critical, memory-intensive workload like prime, the correct approach is to use Guaranteed QoS:

resources:
  requests:
    cpu: 4
    memory: 6Gi
  limits:
    cpu: 4
    memory: 6Gi

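Because requests equal limits, the scheduler accounts for the full 6Gi on the node, so the workload never depends on overcommitted memory, and the Pod is last in line for eviction. One thing Guaranteed QoS does not do is protect the container from its own limit: if it tries to use more than 6Gi it is still OOM-killed, so the limit has to sit above the workload's measured peak usage.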
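
For context, here is a sketch of how that fragment might sit inside a full Pod manifest. The image names and the log-shipper sidecar are hypothetical; the sidecar is there only to show that every container in the Pod needs equal requests and limits, otherwise the whole Pod drops to Burstable.

```yaml
# Hypothetical manifest: images and the sidecar are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: prime
spec:
  containers:
    - name: prime
      image: registry.example.com/prime:1.0         # placeholder image
      resources:
        requests:
          cpu: "4"
          memory: 6Gi
        limits:
          cpu: "4"
          memory: 6Gi
    - name: log-shipper                              # hypothetical sidecar
      image: registry.example.com/log-shipper:1.0    # placeholder image
      resources:
        # The sidecar must also have requests == limits for cpu and memory;
        # leaving these out would make the Pod Burstable.
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 100m
          memory: 128Mi
```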

So there you go. If you ever come across a similar question, you know how to impress.