path: root/kernel
Age  Commit message  Author
2017-02-09  Merge "sched: remove sched_new_task_windows tunable"  (Linux Build Service Account)
2017-02-08  Merge "sched: fix bug in auto adjustment of group upmigrate/downmigrate"  (Linux Build Service Account)
2017-02-08  Merge "Use after free from pid_nr_ns()"  (Linux Build Service Account)
2017-02-08  sched: fix bug in auto adjustment of group upmigrate/downmigrate  (Pavankumar Kondeti)
sched_group_upmigrate tunable can accept values greater than 100%. Don't limit it to 100% while doing the auto adjustment. Change-Id: I3d1c1e84f2f4dec688235feb1536b9261a3e808b Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-02-08  sched: remove sched_new_task_windows tunable  (Pavankumar Kondeti)
The sched_new_task_windows tunable is set to 5 in the scheduler and it is not changed from user space. Remove this unused tunable. Change-Id: I771e12b44876efe75ce87a90e4e9d69c22168b64 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-02-07  Merge "sched: fix argument type in update_task_burst()"  (Linux Build Service Account)
2017-02-07  Merge "sysctl: define upper limit for sched_freq_reporting_policy"  (Linux Build Service Account)
2017-02-03  Merge "sched: Remove sched_enable_hmp flag"  (Linux Build Service Account)
2017-02-03  sysctl: define upper limit for sched_freq_reporting_policy  (Pavankumar Kondeti)
Setting the sched_freq_reporting_policy tunable to an unsupported value results in a warning from the scheduler, and the previous policy setting is also lost. Now that sched_freq_reporting_policy can no longer be set to an incorrect value, remove the WARN_ON_ONCE from the scheduler.
Change-Id: I58d7e5dfefb7d11d2309bc05a1dd66acdc11b766
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
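For readers unfamiliar with how such a limit is usually enforced: the common pattern in kernel/sysctl.c is to point the table entry at proc_dointvec_minmax() and supply the bounds through .extra1/.extra2, after which out-of-range writes fail with -EINVAL before they ever reach the scheduler. A minimal sketch of that pattern, assuming illustrative bound values and an int-backed tunable named sysctl_sched_freq_reporting_policy (names and limits are assumptions, not the exact content of this patch):

    #include <linux/sysctl.h>

    static int sysctl_sched_freq_reporting_policy;  /* illustrative backing variable */
    static int policy_min;                          /* assumed lower bound: 0 */
    static int policy_max = 2;                      /* assumed upper bound    */

    static struct ctl_table sched_policy_table[] = {
        {
            .procname     = "sched_freq_reporting_policy",
            .data         = &sysctl_sched_freq_reporting_policy,
            .maxlen       = sizeof(int),
            .mode         = 0644,
            /* proc_dointvec_minmax() rejects writes outside [extra1, extra2] */
            .proc_handler = proc_dointvec_minmax,
            .extra1       = &policy_min,
            .extra2       = &policy_max,
        },
        { }
    };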
2017-02-02  sched: Remove sched_enable_hmp flag  (Olav Haugan)
Clean up the code and make it more maintainable by removing dependency on the sched_enable_hmp flag. We do not support HMP scheduler without recompiling. Enabling the HMP scheduler is done through enabling the CONFIG_SCHED_HMP config. Change-Id: I246c1b1889f8dcbc8f0a0805077c0ce5d4f083b0 Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2017-02-02  sysctl: disallow setting sched_time_avg_ms to 0  (Pavankumar Kondeti)
The sched average period cannot be 0, so disallow setting it to 0; otherwise CPUs get stuck in sched_avg_update().
Change-Id: Ib9fcc5b35dface09d848ba7a737dc4ac0f05d8ee
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-02-02  sched: fix argument type in update_task_burst()  (Pavankumar Kondeti)
The runtime argument of update_task_burst() should be of type u64, not int. Fix this to avoid a potential overflow.
Change-Id: I33757b7b42f142138c1a099bb8be18c2a3bed331
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
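To see why the type matters: runtimes in this path are nanosecond-scale, and a signed 32-bit int holds only about 2.1 seconds worth of nanoseconds (2^31 - 1 ns), so longer bursts wrap to garbage. A standalone illustration of the truncation (not code from the patch):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t runtime_ns = 3ULL * 1000 * 1000 * 1000;  /* 3 s expressed in ns */
        int32_t as_int = (int32_t)runtime_ns;             /* what an int parameter receives */

        /* The u64 keeps the real value; the int wraps to a negative number
         * on typical two's-complement systems. */
        printf("u64: %llu ns, int: %d ns\n",
               (unsigned long long)runtime_ns, (int)as_int);
        return 0;
    }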
2017-02-01  sched: maintain group busy time counters in runqueue  (Pavankumar Kondeti)
There is no advantage in tracking busy time counters per related thread group. We need busy time across all groups for either a CPU or a frequency domain, so maintain the group busy time counters in the runqueue itself. When the CPU window is rolled over, the group busy counters are rolled over as well. This eliminates the overhead of maintaining each group's window_start individually. As related thread groups are now preallocated, this patch saves 40 * nr_cpu_ids * (nr_grp - 1) bytes of memory.
Change-Id: Ieaaccea483b377f54ea1761e6939ee23a78a5e9c
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-30  Merge "sched: set LBF_IGNORE_PREFERRED_CLUSTER_TASKS correctly"  (Linux Build Service Account)
2017-01-30  Merge "sysctl: enable strict writes"  (Linux Build Service Account)
2017-01-27  sched: set LBF_IGNORE_PREFERRED_CLUSTER_TASKS correctly  (Pavankumar Kondeti)
The LBF_IGNORE_PREFERRED_CLUSTER_TASKS flag needs to be set for all types of inter-cluster load balancing. Currently it is set only when a higher capacity CPU is pulling tasks from a lower capacity CPU. This can result in the migration of grouped tasks from a higher capacity cluster to a lower capacity cluster.
Change-Id: Ib0476c5c85781804798ef49268e1b193859ff5ef
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-23  Use after free from pid_nr_ns()  (Oleg Nesterov)
A use-after-free is reported because the group leader task has already been freed while other tasks still hold its address in their task->group_leader pointer.
pid_nr_ns+0x10/0x38
cgroup_pidlist_start+0x144/0x400
cgroup_seqfile_start+0x1c/0x24
kernfs_seq_start+0x54/0x90
seq_read+0x15c/0x3a8
kernfs_fop_read+0x38/0x160
__vfs_read+0x28/0xc8
vfs_read+0x84/0xfc
Change-Id: Ib6b3fc75bf0d24a04455bf81d54900c21c434958
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2017-01-23  genirq: Add IRQ_AFFINITY_MANAGED flag  (Runmin Wang)
Add the IRQ_AFFINITY_MANAGED flag and related kernel APIs so that a kernel driver can mark an irq in such a way that user space affinity changes are ignored. Affinity settings made from kernel space are not affected.
Change-Id: Ib2d5ea651263bff4317562af69079ad950c9e71e
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
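The commit message does not show the new APIs themselves. As a rough idea of how a driver might apply such a status flag, the generic irq_set_status_flags() helper is the usual entry point; whether this patch exposes exactly this call for IRQ_AFFINITY_MANAGED is an assumption, and the function below is purely hypothetical:

    #include <linux/interrupt.h>
    #include <linux/irq.h>

    static int example_setup_managed_irq(unsigned int irq, irq_handler_t handler, void *dev)
    {
        int ret = request_irq(irq, handler, 0, "example", dev);

        if (ret)
            return ret;

        /*
         * Mark the interrupt so that affinity writes from user space
         * (e.g. /proc/irq/<n>/smp_affinity) are ignored, while the kernel
         * can still move the interrupt as it sees fit.
         */
        irq_set_status_flags(irq, IRQ_AFFINITY_MANAGED);
        return 0;
    }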
2017-01-23  genirq: Introduce IRQD_AFFINITY_MANAGED flag  (Thomas Gleixner)
Interrupts marked with this flag are excluded from user space interrupt affinity changes. Contrary to the IRQ_NO_BALANCING flag, the kernel internal affinity mechanism is not blocked. This flag will be used for multi-queue device interrupts.
Change-Id: I204c49bb1c8ce87fbcd163119093163b120bfe83
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: linux-block@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Cc: axboe@fb.com
Cc: agordeev@redhat.com
Link: http://lkml.kernel.org/r/1467621574-8277-3-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 9c2555835bb3d34dfac52a0be943dcc4bedd650f
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[runminw@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
2017-01-20  sched: Update capacity and load scale factor for all clusters at boot  (Syed Rameez Mustafa)
Cluster capacities should reflect differences in efficiency of different clusters even in the absence of cpufreq. Currently capacity is updated only when cpufreq policy notifier is received. Therefore placement is suboptimal when cpufreq is turned off. Fix this by updating capacities and load scaling factors during cluster detection. Change-Id: I47f63c1e374bbfd247a4302525afb37d55334bad Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2017-01-19  Merge "sched: kill sync_cpu maintenance"  (Linux Build Service Account)
2017-01-18  Merge "sched: hmp: Remove the global sysctl_sched_enable_colocation tunable"  (Linux Build Service Account)
2017-01-18  Merge "tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results"  (Linux Build Service Account)
2017-01-19  sched: kill sync_cpu maintenance  (Pavankumar Kondeti)
We assume the boot CPU is the sync CPU and initialize its window_start to sched_ktime_clock(). As windows are synchronized across all CPUs, the secondary CPUs' window_start values are initialized from the sync CPU's window_start. A CPU's window_start is never reset, so this synchronization happens only once for a given CPU. Given this fact, there is no need to reassign the sync_cpu role to another CPU when the boot CPU goes offline. Remove this unnecessary maintenance of sync_cpu and use any online CPU's window_start as the reference.
Change-Id: I169a8e80573c6dbcb1edeab0659c07c17102f4c9
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-18  sched: hmp: Remove the global sysctl_sched_enable_colocation tunable  (Vikram Mulukutla)
Colocation in HMP includes a tunable that turns on or off the feature globally across all colocation groups. Supporting this tunable correctly would result in complexity that would outweigh any foreseeable benefits. For example, disabling the feature globally would involve deleting all colocation groups one by one while ensuring no placement decisions are made during the process. Remove the tunable. Adding or removing a task from a colocation group is still possible and so we're not losing functionality. Change-Id: I4cb8bcdbee98d3bdd168baacbac345eca9ea8879 Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-01-18  sched: hmp: Ensure that best_cluster() never returns NULL  (Vikram Mulukutla)
There are certain conditions under which group_will_fit() may return 0 for all clusters in the system, especially under changing thermal conditions. This may result in crashes such as this one:

CPU 0                          |  CPU 1
====================================================================
select_best_cpu()              |
 -> env.rtg = rtgA             |
    rtgA.pref_cluster=C_big    |
                               | set_pref_cluster() for rtgA
                               |  -> best_cluster()
                               |     C_little doesn't fit
                               |
                               | IRQ: thermal mitigation
                               | C_big capacity now less
                               | than C_little capacity
                               |
                               |  -> best_cluster() continues
                               |     C_big doesn't fit
                               |     set_pref_cluster() sets
                               |     rtgA.pref_cluster = NULL
select_least_power_cluster()   |
 -> cluster_first_cpu()        |
    -> BUG()                   |

Adding lock protection around accesses to the group's preferred cluster would be expensive and defeat the point of using RCU to protect access to the related_thread_group structure. Therefore, ensure that best_cluster() can never return NULL. In the worst case, we'll select the wrong cluster for a related_thread_group's demand, but this should be fixed in the next tick or wakeup etc. Locking would still have led to the momentary wrong decision, with the additional expense! Also, don't set the preferred cluster to NULL when colocation is disabled.
Change-Id: Id3f514b149add9b3ed33d104fa6a9bd57bec27e2
Signed-off-by: Vikram Mulukutla <markivx@codeaurora.org>
2017-01-18  tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results  (Pavankumar Kondeti)
The 's' flag is supposed to indicate that a softirq is running. This can be detected by testing the preempt_count with SOFTIRQ_OFFSET. The current code tests the preempt_count with SOFTIRQ_MASK, which would be true even when softirqs are disabled but not serving a softirq.
Link: http://lkml.kernel.org/r/1481300417-3564-1-git-send-email-pkondeti@codeaurora.org
Change-Id: I084531ce806e0f7d42a38be0a7ad45977c43d158
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Git-commit: c59f29cb144a6a0dfac16ede9dc8eafc02dc56ca
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
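The distinction comes from the preempt_count bit layout: local_bh_disable() raises the softirq field by SOFTIRQ_DISABLE_OFFSET, whereas actually serving a softirq adds SOFTIRQ_OFFSET, so only the latter bit identifies "currently in a softirq". A simplified sketch of the two tests (mirroring the in_softirq()/in_serving_softirq() helpers):

    #include <linux/preempt.h>
    #include <linux/types.h>

    static bool softirqs_disabled_or_serving(void)
    {
        /* True whenever bottom halves are disabled OR a softirq is running. */
        return preempt_count() & SOFTIRQ_MASK;
    }

    static bool serving_softirq(void)
    {
        /* True only while softirq handlers are actually executing. */
        return preempt_count() & SOFTIRQ_OFFSET;
    }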
2017-01-17  Merge "workqueue: fix possible livelock with concurrent mod_delayed_work()"  (Linux Build Service Account)
2017-01-16  Merge "sched: Initialize variables"  (Linux Build Service Account)
2017-01-16  Merge "sched: Fix compilation errors when CFS_BANDWIDTH && !SCHED_HMP"  (Linux Build Service Account)
2017-01-16  Merge "perf: don't leave group_entry on sibling list (use-after-free)"  (Linux Build Service Account)
2017-01-14  Merge "sched: fix a bug in handling top task table rollover"  (Linux Build Service Account)
2017-01-13  sched: Initialize variables  (Olav Haugan)
Initialize variable at definition to avoid compiler warning when compiling with CONFIG_OPTIMIZE_FOR_SIZE=n. Change-Id: Ibd201877b2274c70ced9d7240d0e527bc77402f3 Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2017-01-13  Merge "perf: protect group_leader from races that cause ctx double-free"  (Linux Build Service Account)
2017-01-12  sysctl: enable strict writes  (Kees Cook)
SYSCTL_WRITES_WARN was added in commit f4aacea2f5d1 ("sysctl: allow for strict write position handling"), and released in v3.16 in August of 2014. Since then I can find only 1 instance of non-zero offset writing[1], and it was fixed immediately in CRIU[2]. As such, it appears safe to flip this to the strict state now.
[1] https://www.google.com/search?q="when%20file%20position%20was%20not%200"
[2] http://lists.openvz.org/pipermail/criu/2015-April/019819.html
Change-Id: Ibf8d46fa34fa9fd4df3527dc4dfc3e3d31b2f7e0
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-commit: 41662f5cc55335807d39404371cfcbb1909304c4
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
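Mechanically the change is small: kernel/sysctl.c keeps the policy in an int and this patch flips its initial value. A sketch of the constants and the flipped default, assuming the macro names introduced by f4aacea2f5d1 (reproduced from memory, not verbatim from the patch):

    /* Write-position handling modes for /proc/sys writes. */
    #define SYSCTL_WRITES_LEGACY   -1   /* pre-3.16 behaviour: ignore file position  */
    #define SYSCTL_WRITES_WARN      0   /* warn about writes at a non-zero offset    */
    #define SYSCTL_WRITES_STRICT    1   /* reject value writes at a non-zero offset  */

    /* Previously initialised to SYSCTL_WRITES_WARN. */
    static int sysctl_writes_strict = SYSCTL_WRITES_STRICT;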
2017-01-12  sched: Fix compilation errors when CFS_BANDWIDTH && !SCHED_HMP  (Pavankumar Kondeti)
There are a few compiler errors and warnings when the CFS_BANDWIDTH config is enabled but SCHED_HMP is not.
Change-Id: Idaf4a7364564b6faf56df2eb3a1a74eeb242d57e
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-12  sched: fix compiler errors with !SCHED_HMP  (Pavankumar Kondeti)
Functions related to the HMP scheduler boost feature are referenced in the SMP load balancer. Add nop versions of these functions to fix the compiler errors with !SCHED_HMP.
Change-Id: I1cbcf67f728c2cbc7c0f47e8eaf1f4165649dce8
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-11  workqueue: fix possible livelock with concurrent mod_delayed_work()  (Pavankumar Kondeti)
When mod_delayed_work() is executed concurrently, there is a potential livelock scenario due to pool->lock contention. Let's say both CPU#0 and CPU#4 call mod_delayed_work() on the same work item with 0 delay on a bound workqueue, and the work item has previously run on CPU#4. CPU#0 wins the work item PENDING bit race and proceeds to queueing. As this work has previously run on CPU#4, it tries to acquire the corresponding pool->lock to check whether it is still running there. In the meantime, CPU#4 loops in try_to_grab_pending(), waiting for the work item to be linked with a pwq so that it can steal it from pwq->pool->worklist. CPU#4 essentially acquires and releases the pool->lock in a busy loop, and CPU#0 may never get this lock.

CPU#0:
  blk_run_queue_async()
    queue_unplugged()
      blk_run_queue_async()
        mod_delayed_work_on()
        --> try_to_grab_pending() returns 0,
            indicating the PENDING bit is set now
        __queue_delayed_work()
          __queue_work()
          --> waiting for CPU#4's pool->lock

CPU#4:
  mod_delayed_work_on()
    try_to_grab_pending() {
        acquire pool->lock()
        release pool->lock()
    }

Change-Id: I9aeab111f55a19478a9d045c8e3576bce3b7a7c5
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-10  perf: don't leave group_entry on sibling list (use-after-free)  (John Dias)
When perf_group_detach is called on a group leader, it should empty its sibling list. Otherwise, when a sibling is later deallocated, list_del_event() removes the sibling's group_entry from its current list, which can be the now-deallocated group leader's sibling list (use-after-free bug).
Bug: 32402548
Change-Id: I99f6bc97c8518df1cb0035814368012ba72ab1f1
Signed-off-by: John Dias <joaodias@google.com>
Git-repo: https://android.googlesource.com/kernel/msm
Git-commit: 6b6cfb2362f09553b46b3b7e5684b16b6e53e373
Signed-off-by: Dennis Cagle <d-cagle@codeaurora.org>
2017-01-10  sched: Convert the global wake_up_idle flag to a per cluster flag  (Syed Rameez Mustafa)
Since clusters can vary significantly in their power and performance characteristics, there may be a need for different CPU selection policies based on which cluster a task is being placed on. For example, the placement policy can be more aggressive in using idle CPUs on clusters that are power efficient and less aggressive on clusters that are geared towards performance. Add support for a per-cluster wake_up_idle flag to allow greater flexibility in placement policies.
Change-Id: I18cd3d907cd965db03a13f4655870dc10c07acfe
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
2017-01-09  Merge "sched: fix stale predicted load in trace_sched_get_busy()"  (Linux Build Service Account)
2017-01-07  sched: fix a bug in handling top task table rollover  (Pavankumar Kondeti)
When frequency aggregation is enabled, there is a possibility of rolling over the top task table multiple times in a single window. For example:
- utra() is called with PUT_PREV_TASK for task 'A', which does not belong to any related thread grp. Let's say a window rollover happens; the rq counters and the top task table are rolled over.
- utra() is called with PICK_NEXT_TASK/TASK_WAKE for task 'B', which belongs to a related thread grp. Let's say this happens before the grp's cpu_time->window_start is in sync with rq->window_start. In this case, the grp's cpu_time counters are rolled over and the top task table is rolled over again.
Roll over the top task table in the context of the current running task to fix this.
Change-Id: Iea3075e0ea460a9279a01ba42725890c46edd713
Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-07  sched: fix stale predicted load in trace_sched_get_busy()  (Pavankumar Kondeti)
When early detection notification is pending, we skip calculating predicted load. Initialize it to 0 so that stale value does not get printed in trace_sched_get_busy(). Change-Id: I36287c0081f6c12191235104666172b7cae2a583 Signed-off-by: Pavankumar Kondeti <pkondeti@codeaurora.org>
2017-01-05  timers: Fix documentation for schedule_timeout() and similar  (Douglas Anderson)
The documentation for schedule_timeout(), schedule_hrtimeout(), and schedule_hrtimeout_range() all claim that the routines couldn't possibly return early if the task state was TASK_UNINTERRUPTIBLE. This is simply not true, since wake_up_process() will cause those routines to exit early.

We cannot make schedule_[hr]timeout() loop until the timeout expires if the task state is uninterruptible because we have users which rely on the existing and designed behaviour. Make the documentation match the (correct) implementation.

schedule_hrtimeout() returns -EINTR even when an uninterruptible task was woken up. This might look strange, but making the return code depend on the state is too much of an effort as it would affect all the call sites. There is no value in doing so, but we spell it out clearly in the documentation.
Change-Id: I3e9bf91e7d285abcac134e32b02131b999d79f40
Suggested-by: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: huangtao@rock-chips.com
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-2-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 4b7e9cf9c84b09adc428e0433cd376b91f9c52a7
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
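As the corrected documentation implies, a caller that must not wake early has to loop itself, resubmitting the remaining timeout, which is what msleep() already does. A hedged sketch of that pattern (close to the in-tree msleep() implementation, reproduced from memory):

    #include <linux/jiffies.h>
    #include <linux/sched.h>

    /* Sleep for at least @msecs even if wake_up_process() pokes the task early. */
    static void sleep_at_least_msecs(unsigned int msecs)
    {
        unsigned long timeout = msecs_to_jiffies(msecs) + 1;

        /* schedule_timeout_uninterruptible() returns the unexpired jiffies. */
        while (timeout)
            timeout = schedule_timeout_uninterruptible(timeout);
    }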
2017-01-05  timers: Fix usleep_range() in the context of wake_up_process()  (Douglas Anderson)
Users of usleep_range() expect that it will _never_ return in less time than the minimum passed parameter. However, nothing in the code ensures this, when the sleeping task is woken by wake_up_process() or any other mechanism which can wake a task from uninterruptible state. Neither usleep_range() nor schedule_hrtimeout_range*() have any protection against wakeups. schedule_hrtimeout_range*() is designed this way despite the fact that the API documentation does not mention it.

msleep() already has code to handle this case since it will loop as long as there was still time left. usleep_range() has no such loop, add it.

Presumably this problem was not detected before because usleep_range() is only used in a few places and the function is mostly used in contexts which are not exposed to wakeups of any form. An effort was made to look for users relying on the old behavior by looking for usleep_range() in the same file as wake_up_process(). No problems were found by this search, though it is conceivable that someone could have put the sleep and wakeup in two different files. An effort was made to ask several upstream maintainers if they were aware of people relying on wake_up_process() to wake up usleep_range(). No maintainers were aware of that but they were aware of many people relying on usleep_range() never returning before the minimum.
Change-Id: Ia403f0dc9cac711c8a4b6fcc4cf0094ad1358ed7
Reported-by: Tao Huang <huangtao@rock-chips.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
Cc: heiko@sntech.de
Cc: broonie@kernel.org
Cc: briannorris@chromium.org
Cc: Andreas Mohr <andi@lisas.de>
Cc: linux-rockchip@lists.infradead.org
Cc: tony.xie@rock-chips.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: djkurtz@chromium.org
Cc: linux@roeck-us.net
Cc: tskd08@gmail.com
Link: http://lkml.kernel.org/r/1477065531-30342-1-git-send-email-dianders@chromium.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Git-commit: 6c5e9059692567740a4ee51530dffe51a4b9584d
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
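The shape of the fix follows directly from that reasoning: keep re-arming the hrtimer-based sleep until it really expires. The sketch below is a reconstruction of the upstream change (Git-commit 6c5e9059, above) from memory, so minor details such as the exact type of the delta argument may differ in this tree:

    #include <linux/hrtimer.h>
    #include <linux/ktime.h>
    #include <linux/sched.h>

    static void __sched do_usleep_range(unsigned long min, unsigned long max)
    {
        ktime_t exp = ktime_add_us(ktime_get(), min);
        u64 delta = (u64)(max - min) * NSEC_PER_USEC;

        for (;;) {
            __set_current_state(TASK_UNINTERRUPTIBLE);
            /* Do not return before the requested sleep time has elapsed. */
            if (!schedule_hrtimeout_range(&exp, delta, HRTIMER_MODE_ABS))
                break;
        }
    }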
2017-01-05  timers: Plug locking race vs. timer migration  (Thomas Gleixner)
Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the timer flags. As a consequence the compiler is allowed to reload the flags between the initial check for TIMER_MIGRATING and the following timer base computation and the spin lock of the base. While this has not been observed (yet), we need to make sure that it never happens.
Change-Id: I577327e02ab77b6de951ac2aa936cb5d5a4f477a
Fixes: 0eeda71bc30d ("timer: Replace timer base by a cpu index")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Git-commit: b831275a3553c32091222ac619cfddd73a5553fb
Git-repo: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
[runminw@codeaurora.org: resolve trivial merge conflicts]
Signed-off-by: Runmin Wang <runminw@codeaurora.org>
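The underlying pattern is "snapshot the flags once, act on the snapshot, then re-validate under the lock"; without READ_ONCE() the compiler may legally re-read timer->flags at each use and mix values from before and after a concurrent migration. An illustrative sketch of that pattern in general form (not the exact lock_timer_base() code from this kernel):

    #include <linux/compiler.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct guarded {
        spinlock_t lock;
        u32 flags;              /* updated concurrently by other CPUs */
    };

    static bool lock_if_stable(struct guarded *g, unsigned long *irqflags, u32 busy_bit)
    {
        u32 snap = READ_ONCE(g->flags); /* one load; no compiler re-reads */

        if (snap & busy_bit)
            return false;

        spin_lock_irqsave(&g->lock, *irqflags);
        if (g->flags == snap)
            return true;            /* nothing changed since the snapshot */
        spin_unlock_irqrestore(&g->lock, *irqflags);
        return false;
    }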
2017-01-05  Merge "sched: Delete heavy task heuristics in prediction code"  (Linux Build Service Account)
2017-01-05  Merge "sched: Fix new task accounting bug in transfer_busy_time()"  (Linux Build Service Account)
2017-01-04  sched: Delete heavy task heuristics in prediction code  (Rohit Gupta)
Heavy task prediction code needs further tuning to avoid any negative power impact. Delete the code for now instead of adding tunables to avoid inefficiencies in the scheduler path. Change-Id: I71e3b37a5c99e24bc5be93cc825d7e171e8ff7ce Signed-off-by: Rohit Gupta <rohgup@codeaurora.org>
2017-01-03  sched: Fix new task accounting bug in transfer_busy_time()  (Syed Rameez Mustafa)
In transfer_busy_time(), the new_task flag is set based on the active window count prior to the call to update_task_ravg(). update_task_ravg(), however, can then increment the active window count, and consequently the new_task flag above becomes stale. This in turn leads to inaccurate accounting whereby update_task_ravg() does accounting based on the fact that the task is not new, whereas transfer_busy_time() then continues to do further accounting assuming that the task is new. The accounting discrepancies are sometimes caught by some of the scheduler BUGs. Fix the described problem by moving the is_new_task() check after the call to update_task_ravg(). Also add two missing BUGs that would catch the problem sooner rather than later.
Change-Id: I8dc4822e97cc03ebf2ca1ee2de95eb4e5851f459
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>