[RFC] sched/eevdf: sched feature to dismiss lag on wakeup

Wed Mar 20 19:12:18 AEDT 2024

On 3/20/24 07:04, Tobias Huschle wrote:
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
>> On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <huschle at linux.ibm.com> wrote:
>>>
>>> On 2024-03-18 15:45, Luis Machado wrote:
>>>> On 3/14/24 13:45, Tobias Huschle wrote:
>>>>> On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
>>>>>> On 2/28/24 16:10, Tobias Huschle wrote:
>>>>>>>
>>>>>>> Questions:
>>>>>>> 1. The kworker getting its negative lag occurs in the following scenario:
>>>>>>>    - kworker and a cgroup are supposed to execute on the same CPU
>>>>>>>    - one task within the cgroup is executing and wakes up the kworker
>>>>>>>    - kworker with 0 lag, gets picked immediately and finishes its
>>>>>>>      execution within ~5000ns
>>>>>>>    - on dequeue, kworker gets assigned a negative lag
>>>>>>>    Is this expected behavior? With this short execution time, I would
>>>>>>>    expect the kworker to be fine.
>>>>>>
>>>>>> That strikes me as a bit odd as well. Have you been able to determine
>>>>>> how a negative lag is assigned to the kworker after such a short runtime?
>>>>>>
>>>>>
>>>>> I did some more trace reading though and found something.
>>>>>
>>>>> What I observed if everything runs regularly:
>>>>> - vhost and kworker run alternating on the same CPU
>>>>> - if the kworker is done, it leaves the runqueue
>>>>> - vhost wakes up the kworker if it needs it
>>>>> --> this means:
>>>>>   - vhost starts alone on an otherwise empty runqueue
>>>>>   - it seems like it never gets dequeued
>>>>>     (unless another unrelated task joins or migration hits)
>>>>>   - if vhost wakes up the kworker, the kworker gets selected
>>>>>   - vhost runtime > kworker runtime
>>>>>     --> kworker gets positive lag and gets selected immediately next time
>>>>>
>>>>> What happens if it does go wrong:
>>>>> From what I gather, there seem to be occasions where the vhost either
>>>>> executes surprisingly quickly, or the kworker surprisingly slowly. If
>>>>> these outliers reach critical values, it can happen that
>>>>>    vhost runtime < kworker runtime
>>>>> which now causes the kworker to get the negative lag.
>>>>>
>>>>> In this case it seems that the vhost is very fast in waking up
>>>>> the kworker. And coincidentally, the kworker takes more time than
>>>>> usual to finish. We speak of 4-digit to low 5-digit nanoseconds.
>>>>>
>>>>> So, for these outliers, the scheduler extrapolates that the kworker
>>>>> out-consumes the vhost and should be slowed down, although in the
>>>>> majority of other cases this does not happen.
>>>>
>>>> Thanks for providing the above details, Tobias. It does seem like EEVDF
>>>> is strict about the eligibility checks and making tasks wait when their
>>>> lags are negative, even if just a little bit as in the case of the
>>>> kworker.
>>>>
>>>> There was a patch to disable the eligibility checks
>>>> (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/),
>>>> which would make EEVDF more like EVDF, though the deadline comparison
>>>> would probably still favor the vhost task instead of the kworker with
>>>> the negative lag.
>>>>
>>>> I'm not sure if you tried it, but I thought I'd mention it.
>>>
>>> Haven't seen that one yet. Unfortunately, it does not help to ignore the
>>> eligibility.
>>>
>>> I'm inclined to rather propose a documentation change, which
>>> describes that tasks should not rely on woken up tasks being scheduled
>>> immediately.
>>
>> Where do you see such an assumption? Even before eevdf, there was
>> nothing that ensured such behavior. When using CFS (legacy or eevdf)
>> tasks, you can't know if the newly woken task will run first or not.
>>
> 
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving them a slight boost, right? For the scenario
> that I observed, that boost was enough to make sure that the woken up task
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.

It seems that way, as EEVDF will do eligibility and deadline checks before scheduling a task, so
a task would have to satisfy both of those checks.
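
To make the two checks concrete, here is a minimal toy model in Python. It is only a sketch and not the kernel code: the Entity class, weighted_lag() and pick_eevdf() are hypothetical names made up for illustration, and all numbers are arbitrary. It assumes an entity is eligible when its vruntime is not ahead of the load-weighted average vruntime of the runqueue (i.e. its lag is not negative), and that the earliest virtual deadline wins among eligible entities.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical stand-in for a scheduling entity (not the kernel struct)."""
    name: str
    weight: int     # load weight, 1024 for a nice-0 task
    vruntime: int   # virtual runtime consumed so far
    deadline: int   # virtual deadline of the current request


def weighted_lag(entity, runqueue):
    """Lag scaled by the total weight: sum(w_j * v_j) - v_i * sum(w_j).

    The sign matches (average vruntime - entity vruntime) and avoids
    a division, so it stays in integer arithmetic.
    """
    total_weight = sum(e.weight for e in runqueue)
    weighted_sum = sum(e.weight * e.vruntime for e in runqueue)
    return weighted_sum - entity.vruntime * total_weight


def pick_eevdf(runqueue):
    """Eligibility check first, deadline check second."""
    eligible = [e for e in runqueue if weighted_lag(e, runqueue) >= 0]
    # The entity with the smallest vruntime is never ahead of the average,
    # so the eligible list is non-empty for a non-empty runqueue.
    return min(eligible, key=lambda e: e.deadline)
```

With a vhost at vruntime 1000 / deadline 4000 and a kworker at vruntime 1010 / deadline 2000 (equal weights), the kworker has the earlier deadline but a slightly negative lag, so pick_eevdf() returns the vhost. If the kworker's vruntime is 1000 instead, both are eligible and the kworker is picked on its deadline.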

I think we have some special treatment for when a task initially joins the competition, in which
case we halve its slice. But I don't think there is any special treatment for woken tasks
anymore.
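
As a rough illustration of that initial-placement treatment, the sketch below computes a virtual deadline from a requested slice and halves the slice only for the initial enqueue. This is a simplified Python model, not the kernel's place_entity(): the function names, the `initial` flag and the 3 ms slice are assumptions for the example, and to my understanding the real behaviour is gated by a sched feature rather than being unconditional.

```python
NICE_0_WEIGHT = 1024  # load weight of a nice-0 task


def virtual_slice(slice_ns, weight):
    """Scale a wall-clock slice into virtual time for the given weight."""
    return slice_ns * NICE_0_WEIGHT // weight


def place_deadline(vruntime, slice_ns, weight, initial):
    """Return the virtual deadline assigned on enqueue (toy model).

    `initial` marks a task joining the competition for the first time;
    in that case the virtual slice is halved. A plain wakeup gets the
    full slice and no other boost in this model.
    """
    vslice = virtual_slice(slice_ns, weight)
    if initial:
        vslice //= 2
    return vruntime + vslice
```

For a nice-0 task placed at vruntime 0 with a hypothetical 3,000,000 ns slice, place_deadline() gives 1,500,000 for the initial enqueue and 3,000,000 for a regular wakeup, so only the new task gets the earlier deadline.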

There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) that tries to reduce the number of
wakeup preemptions under some conditions when the RUN_TO_PARITY feature is enabled.