Kube-scheduler Source Code Analysis: The Scheduling Queue

Posted 2024-01-28 Updated 2024-11-26

By Ray Lyu

23~29 min read

The source code analyzed is v1.27.

Sub-queues

The scheduler's SchedulingQueue is implemented by the PriorityQueue struct, which contains three sub-queues.

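ActiveQ (heap): holds pods that are ready to be scheduled; the scheduling loop pops pods from here;
BackOffQ (heap): holds pods whose scheduling attempt failed; each pod here is assigned a backoff time and may leave only after waiting that long;
UnschedulableQ (map): holds pods that failed scheduling and were judged "cannot be scheduled successfully"; a pod leaves only when a specific event occurs in the cluster or when it has reached the maximum time it may stay blocked in this sub-queue.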

type PriorityQueue struct {
    // ...
    
    // activeQ is heap structure that scheduler actively looks at to find pods to
    // schedule. Head of heap is the highest priority pod.
    activeQ *heap.Heap
    // podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
    // are popped from this heap before the scheduler looks at activeQ
    podBackoffQ *heap.Heap
    // unschedulablePods holds pods that have been tried and determined unschedulable.
    unschedulablePods *UnschedulablePods
    
    // ...
}

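BackOffQ lowers the retry frequency of pods that have already failed scheduling: any pod that fails must wait out a backoff period. It is a penalty mechanism: "you just had your turn, give the newcomers a chance." UnschedulableQ avoids pointless scheduling attempts: "retrying won't help, so don't waste everyone's time." It works like a penalty box: the scheduler temporarily locks away the pods it considers impossible to place, which improves overall scheduling efficiency.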

According to [1], pods move among the three sub-queues as shown in the figure below:

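With the figure in mind, here is a general example that takes a pod through all three sub-queues. Consider a pod with a hard NodeAffinity constraint. Suppose it is popped from ActiveQ and scheduling fails because the NodeAffinity plugin rejects it. Unless the state (labels) of some existing node in the cluster changes, retrying this pod is pointless, so it should first go to UnschedulableQ. Only when a node state change event arrives is it moved to BackOffQ; once its backoff time has elapsed it enters ActiveQ, ready to be scheduled again.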
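The walk-through above can be read as a small state machine. The sketch below models only the three transitions of that example. The `queue` and `signal` types and the `next` function are hypothetical names introduced for illustration, and the model assumes the pod is still backing off when the node event arrives.

```go
package main

import "fmt"

// queue names the three sub-queues; the type is illustrative only.
type queue string

const (
	activeQ        queue = "ActiveQ"
	backoffQ       queue = "BackOffQ"
	unschedulableQ queue = "UnschedulableQ"
)

// signal is a hypothetical label for what happens to the pod next.
type signal string

const (
	failedNoClusterChange signal = "scheduling failed, no cluster change seen"
	nodeLabelUpdated      signal = "node label updated"
	backoffExpired        signal = "backoff expired"
)

// next returns the sub-queue the pod moves to. Combinations that are not
// part of the example leave the pod where it is.
func next(q queue, s signal) queue {
	switch {
	case q == activeQ && s == failedNoClusterChange:
		return unschedulableQ
	case q == unschedulableQ && s == nodeLabelUpdated:
		return backoffQ // assumes the pod's backoff has not expired yet
	case q == backoffQ && s == backoffExpired:
		return activeQ
	}
	return q
}

func main() {
	q := activeQ
	for _, s := range []signal{failedNoClusterChange, nodeLabelUpdated, backoffExpired} {
		q = next(q, s)
		fmt.Println(s, "->", q)
	}
}
```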

The UnschedulableQ penalty box

Since UnschedulableQ is a penalty box, how does the scheduler decide whether a pod should be locked in, and when it can be let out?

Let us first introduce the SchedulingCycle concept. A SchedulingCycle is a "scheduling round", and PriorityQueue maintains the two variables at the core of this mechanism:

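schedulingCycle: the current scheduling cycle of the PriorityQueue; it is incremented by one each time the PriorityQueue pops a pod;
moveRequestCycle: the scheduling cycle the PriorityQueue was in when it received the most recent move request. A move request asks for specific pods to be moved out of UnschedulableQ, so issuing one can be read as "the cluster state has changed"; details follow later.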

type PriorityQueue struct {
    // ...
    
    // schedulingCycle represents sequence number of scheduling cycle and is incremented
    // when a pod is popped.
    schedulingCycle int64
    // moveRequestCycle caches the sequence number of scheduling cycle when we
    // received a move request. Unschedulable pods in and before this scheduling
    // cycle will be put back to activeQueue if we were trying to schedule them
    // when we received move request.
    moveRequestCycle int64 
    
    // ...
}
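To make the interplay of the two counters concrete, here is a minimal sketch. `toyQueue` and its methods are hypothetical and keep only the counters; they are not the real PriorityQueue, which also handles locking and the actual pod movement.

```go
package main

import "fmt"

// toyQueue keeps only the two counters discussed above.
type toyQueue struct {
	schedulingCycle  int64
	moveRequestCycle int64
}

// pop models taking a pod out for scheduling: the cycle advances by one.
func (q *toyQueue) pop() {
	q.schedulingCycle++
}

// moveRequest models a move request: it records the cycle in which it arrived.
func (q *toyQueue) moveRequest() {
	q.moveRequestCycle = q.schedulingCycle
}

func main() {
	q := &toyQueue{}
	q.pop()         // cycle 1
	q.pop()         // cycle 2
	q.moveRequest() // recorded while the queue is in cycle 2
	q.pop()         // cycle 3
	fmt.Println("schedulingCycle:", q.schedulingCycle, "moveRequestCycle:", q.moveRequestCycle)
}
```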

Entering the penalty box

Scheduling/binding failure

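When scheduling or asynchronous binding of a pod fails, the scheduler calls AddUnschedulableIfNotPresent, which puts the pod into UnschedulableQ unless the cluster state changed while the pod was being scheduled/bound. As noted above, a move request signals that the cluster state has changed, so when a pod returns to the queue the scheduler needs to know whether a move request was issued in the meantime. Concretely, if moveRequestCycle is greater than or equal to the pod's podSchedulingCycle when the pod returns to the queue, the scheduler concludes that a move request was issued during that pod's scheduling/binding.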

func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // ...

    scheduleResult, assumedPodInfo, status := sched.schedulingCycle(schedulingCycleCtx, state, fwk, podInfo, start, podsToActivate)
    if !status.IsSuccess() {
        // The failure handler ends up calling AddUnschedulableIfNotPresent.
        sched.FailureHandler(schedulingCycleCtx, fwk, assumedPodInfo, status, scheduleResult.nominatingInfo, start)
        return
    }

    go func() {
        // ...

        // Asynchronous binding.
        status := sched.bindingCycle(bindingCycleCtx, state, fwk, scheduleResult, assumedPodInfo, start, podsToActivate)
        if !status.IsSuccess() {
            // The error handler ends up calling AddUnschedulableIfNotPresent.
            sched.handleBindingCycleError(bindingCycleCtx, state, fwk, assumedPodInfo, start, scheduleResult, status)
        }
    }()
}

func (p *PriorityQueue) AddUnschedulableIfNotPresent(pInfo *framework.QueuedPodInfo, podSchedulingCycle int64) error {
    // ...
    
    if p.moveRequestCycle >= podSchedulingCycle {
        if err := p.podBackoffQ.Add(pInfo); err != nil {
            return fmt.Errorf("error adding pod %v to the backoff queue: %v", klog.KObj(pod), err)
        }
        //...
    } else {
        p.unschedulablePods.addOrUpdate(pInfo)
        // ...
    }

    p.addNominatedPodUnlocked(pInfo.PodInfo, nil)
    return nil
}

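There is a subtle point here. The "equal" case is easy to understand: the cluster state changed while the pod was being scheduled. But when is it "greater than"? The key is that binding is asynchronous. Suppose a pod is attempted in cycle m, then goes through asynchronous binding and fails, and the current scheduling cycle read when the binding failure is handled and the pod is put back is m+4. The pod's scheduling and binding then span cycles [m, m+4]. In which range must the latest move request fall to count as a fresh state change? The implementation uses [m+4, infinity). This is fairly conservative, because move requests issued in [m, m+3] are no longer taken into account. Perhaps the reasoning is that asynchronous binding usually takes less time than scheduling, so the binding may well complete in cycle m+1, in which case [m+1, infinity) is reasonable; whether this range is a good choice still depends on how well it works in practice.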
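The [m, m+4] example can be checked with a few lines of code. The sketch below isolates the comparison from AddUnschedulableIfNotPresent in a hypothetical helper, `requeueTarget`, and evaluates it for move requests arriving in each cycle of the interval. The value chosen for m is arbitrary.

```go
package main

import "fmt"

// requeueTarget mirrors the comparison shown above: a move request at or
// after the cycle handed in by the caller sends the pod to BackOffQ,
// anything earlier sends it to UnschedulableQ.
func requeueTarget(moveRequestCycle, podSchedulingCycle int64) string {
	if moveRequestCycle >= podSchedulingCycle {
		return "BackOffQ"
	}
	return "UnschedulableQ"
}

func main() {
	const m = 10          // hypothetical cycle in which the pod was popped
	const requeue = m + 4 // cycle read when the failed binding is handled
	for move := int64(m); move <= requeue; move++ {
		fmt.Printf("moveRequestCycle=m+%d -> %s\n", move-m, requeueTarget(move, requeue))
	}
}
```

Only the last iteration, where the move request falls in cycle m+4, selects BackOffQ; requests in m through m+3 are ignored, which is the conservative behavior described above.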

ActiveQ pre-enqueue check failure

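In addToActiveQ, a pod that does not pass the PreEnqueue plugins registered with the PriorityQueue is put straight into UnschedulableQ. This is a pre-check performed on entry to ActiveQ. In the v1.27 tree the interface is implemented by the SchedulingGates plugin (pod scheduling readiness); since that path is unrelated to scheduling failures, it is not the focus here.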

Leaving the penalty box

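Pods judged to have seen no cluster state change during scheduling and binding are put into UnschedulableQ. They leave it later either in response to specific events or through the periodic queue flush. The core function for leaving UnschedulableQ is movePodsToActiveOrBackoffQueue, and the exit mechanism is implied by its callers.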

func (p *PriorityQueue) movePodsToActiveOrBackoffQueue(podInfoList []*framework.QueuedPodInfo, event framework.ClusterEvent) {
    // ...
    for _, pInfo := range podInfoList {
        // If the event doesn't help making the Pod schedulable, continue.
        // Note: we don't run the check if pInfo.UnschedulablePlugins is nil, which denotes
        // either there is some abnormal error, or scheduling the pod failed by plugins other than PreFilter, Filter and Permit.
        // In that case, it's desired to move it anyways.
        if len(pInfo.UnschedulablePlugins) != 0 && !p.podMatchesEvent(pInfo, event) {
            continue
        }
        pod := pInfo.Pod
        if p.isPodBackingoff(pInfo) {
            // Add to BackOffQ.
            // ...
        } else {
            gated := pInfo.Gated
            if added, _ := p.addToActiveQ(pInfo); added {
                // Added to ActiveQ.
                // ...
            }
        }
    }
    p.moveRequestCycle = p.schedulingCycle
    // ...
}

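There are two ways out of the penalty box: the periodic flush and event triggers. The former is a safety net that guarantees a pod gets out once it has stayed longer than a certain time; the latter uses events on API objects in the cluster to decide which pods can be moved out. According to the check in movePodsToActiveOrBackoffQueue, a pod with recorded UnschedulablePlugins is moved only if it matches the current event, whereas a pod with no recorded UnschedulablePlugins is moved regardless of the event. UnschedulablePlugins holds the names of the plugins that rejected the pod when scheduling failed; podMatchesEvent() iterates over the PriorityQueue's clusterEventMap field to decide whether the current event is one that a failing plugin cares about.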

func (p *PriorityQueue) podMatchesEvent(podInfo *framework.QueuedPodInfo, clusterEvent framework.ClusterEvent) bool {
    if clusterEvent.IsWildCard() {
        return true
    }

    for evt, nameSet := range p.clusterEventMap {
        // Firstly verify if the two ClusterEvents match:
        // - either the registered event from plugin side is a WildCardEvent,
        // - or the two events have identical Resource fields and *compatible* ActionType.
        //   Note the ActionTypes don't need to be *identical*. We check if the ANDed value
        //   is zero or not. In this way, it's easy to tell Update&Delete is not compatible,
        //   but Update&All is.
        evtMatch := evt.IsWildCard() ||
            (evt.Resource == clusterEvent.Resource && evt.ActionType&clusterEvent.ActionType != 0)

        // Secondly verify the plugin name matches.
        // Note that if it doesn't match, we shouldn't continue to search.
        if evtMatch && intersect(nameSet, podInfo.UnschedulablePlugins) {
            return true
        }
    }

    return false
}
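The "compatible ActionType" check in the comment above is a bitmask intersection. A minimal sketch follows. The `actionType` type and its bit values are hypothetical and only modelled on framework.ActionType; the point is the AND test, not the concrete constants.

```go
package main

import "fmt"

// actionType is a hypothetical bitmask; each action occupies one bit.
type actionType int64

const (
	add actionType = 1 << iota
	del
	update
	all = add | del | update
)

// compatible reports whether the registered and incoming action sets share
// at least one bit, which is the ANDed-value test described above.
func compatible(registered, incoming actionType) bool {
	return registered&incoming != 0
}

func main() {
	fmt.Println("Update vs Delete:", compatible(update, del))
	fmt.Println("Update vs All:", compatible(update, all))
	fmt.Println("Add|Update vs Update:", compatible(add|update, update))
}
```

Because a plugin can register several actions in one value (for example Add | Update), a single AND is enough to test an incoming event against all of them.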

clusterEventMap maps each event to a set of plugin names, indicating which plugins are interested in that event. Tracing its initialization shows that the field is built from each plugin's EventsToRegister() method. Take the TaintToleration plugin as an example: it declares interest in Node Add and Update events.

func (pl *TaintToleration) EventsToRegister() []framework.ClusterEvent {
    return []framework.ClusterEvent{
        {Resource: framework.Node, ActionType: framework.Add | framework.Update},
    }
}
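How per-plugin declarations turn into an event-to-plugins map can be sketched as a simple inversion. Everything below is a simplified, hypothetical model: `clusterEvent` uses plain strings instead of the framework types, `plugin` exposes only the two methods needed, and `examplePlugin` is invented for the illustration.

```go
package main

import "fmt"

// clusterEvent is a trimmed-down stand-in for framework.ClusterEvent.
type clusterEvent struct {
	resource string
	action   string
}

// plugin exposes only what this sketch needs.
type plugin interface {
	Name() string
	EventsToRegister() []clusterEvent
}

type taintToleration struct{}

func (taintToleration) Name() string { return "TaintToleration" }
func (taintToleration) EventsToRegister() []clusterEvent {
	return []clusterEvent{{resource: "Node", action: "Add|Update"}}
}

// examplePlugin is a made-up plugin interested in two events.
type examplePlugin struct{}

func (examplePlugin) Name() string { return "ExamplePlugin" }
func (examplePlugin) EventsToRegister() []clusterEvent {
	return []clusterEvent{
		{resource: "Node", action: "Add|Update"},
		{resource: "Pod", action: "Delete"},
	}
}

// buildEventMap inverts the declarations into event -> set of plugin names.
func buildEventMap(plugins []plugin) map[clusterEvent]map[string]struct{} {
	m := make(map[clusterEvent]map[string]struct{})
	for _, pl := range plugins {
		for _, evt := range pl.EventsToRegister() {
			if m[evt] == nil {
				m[evt] = make(map[string]struct{})
			}
			m[evt][pl.Name()] = struct{}{}
		}
	}
	return m
}

func main() {
	m := buildEventMap([]plugin{taintToleration{}, examplePlugin{}})
	for evt, names := range m {
		fmt.Println(evt.resource, evt.action, "watched by", len(names), "plugin(s)")
	}
}
```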

Follow-up optimization: QueueingHints

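The current version already has a complete penalty-box mechanism and can use the event type to decide whether a pod in UnschedulableQ should respond and leave. However, the decision on whether a failed pod should enter UnschedulableQ in the first place is coarse. As long as the cluster state has been updated (moveRequestCycle >= podSchedulingCycle), the pod that just failed does not enter UnschedulableQ. When the cluster state changes very frequently, failed pods almost never enter UnschedulableQ, so the whole penalty-box mechanism is hardly used and overall scheduling efficiency does not improve.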

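To address this, the community introduced the SchedulerQueueingHints feature in v1.28 [2]. It can be switched on or off through a feature gate, and the moveRequestCycle variable is to be removed once the feature is stable. With this optimization, a pod being scheduled knows exactly which events occurred during its scheduling attempt; on failure it checks those events one by one to see whether any of them is relevant to it, instead of relying on coarse-grained information such as moveRequestCycle.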
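The idea can be sketched in a few lines. This is a conceptual model only, not the API of the feature: the `interest` type, the event strings, and `shouldSkipUnschedulableQ` are all hypothetical. It assumes the queue remembers which events arrived while a pod was in flight and, on failure, checks them against the plugins that rejected the pod.

```go
package main

import "fmt"

// interest maps a plugin name to the set of events it cares about.
type interest map[string]map[string]bool

// shouldSkipUnschedulableQ reports whether any event observed while the pod
// was in flight is relevant to one of the plugins that rejected it. If so,
// a retry may succeed and the pod should not be parked.
func shouldSkipUnschedulableQ(in interest, failedPlugins, inFlightEvents []string) bool {
	for _, pl := range failedPlugins {
		for _, evt := range inFlightEvents {
			if in[pl][evt] {
				return true
			}
		}
	}
	return false
}

func main() {
	in := interest{
		"NodeAffinity": {"Node/Update": true, "Node/Add": true},
	}
	// An unrelated event arrived during scheduling: the pod is still parked.
	fmt.Println(shouldSkipUnschedulableQ(in, []string{"NodeAffinity"}, []string{"PersistentVolume/Add"}))
	// A relevant event arrived: the pod skips UnschedulableQ.
	fmt.Println(shouldSkipUnschedulableQ(in, []string{"NodeAffinity"}, []string{"Node/Update"}))
}
```

Compared with the moveRequestCycle check, an unrelated event no longer keeps a failed pod out of UnschedulableQ.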

In v1.28 the feature still had some memory leak issues, which may have been fixed in the latest releases [3].

References

Kubernetes

License: CC BY 4.0