summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2011-05-07cgroup,rcu: convert call_rcu(free_css_set_rcu) to kfree_rcu()Lai Jiangshan
The rcu callback free_css_set_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_css_set_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07rcu: permit rcu_read_unlock() to be called while holding runqueue locksPaul E. McKenney
Avoid calling into the scheduler while holding core RCU locks. This allows rcu_read_unlock() to be called while holding the runqueue locks, but only as long as there was no chance of the RCU read-side critical section having been preempted. (Otherwise, if RCU priority boosting is enabled, rcu_read_unlock() might call into the scheduler in order to unboost itself, which might allows self-deadlock on the runqueue locks within the scheduler.) Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: provide rcu_virt_note_context_switch() function.Gleb Natapov
Provide rcu_virt_note_context_switch() for vitalization use to note quiescent state during guest entry. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: get rid of signed overflow in check_cpu_stall()Paul E. McKenney
Signed integer overflow is undefined by the C standard, so move calculations to unsigned. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: optimize rcutinyEric Dumazet
rcu_sched_qs() currently calls local_irq_save()/local_irq_restore() up to three times. Remove irq masking from rcu_qsctr_help() / invoke_rcu_kthread() and do it once in rcu_sched_qs() / rcu_bh_qs() This generates smaller code as well. text data bss dec hex filename 2314 156 24 2494 9be kernel/rcutiny.old.o 2250 156 24 2430 97e kernel/rcutiny.new.o Fix an outdated comment for rcu_qsctr_help() Move invoke_rcu_kthread() definition before its use. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: prevent call_rcu() from diving into rcu core if irqs disabledPaul E. McKenney
This commit marks a first step towards making call_rcu() have real-time behavior. If irqs are disabled, don't dive into the RCU core. Later on, this new early exit will wake up the per-CPU kthread, which first must be modified to handle the cases involving callback storms. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: further lower priority in rcu_yield()Paul E. McKenney
Although rcu_yield() dropped from real-time to normal priority, there is always the possibility that the competing tasks have been niced. So nice to 19 in rcu_yield() to help ensure that other tasks have a better chance of running. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: introduce kfree_rcu()Lai Jiangshan
Many rcu callbacks functions just call kfree() on the base structure. These functions are trivial, but their size adds up, and furthermore when they are used in a kernel module, that module must invoke the high-latency rcu_barrier() function at module-unload time. The kfree_rcu() function introduced by this commit addresses this issue. Rather than encoding a function address in the embedded rcu_head structure, kfree_rcu() instead encodes the offset of the rcu_head structure within the base structure. Because the functions are not allowed in the low-order 4096 bytes of kernel virtual memory, offsets up to 4095 bytes can be accommodated. If the offset is larger than 4095 bytes, a compile-time error will be generated in __kfree_rcu(). If this error is triggered, you can either fall back to use of call_rcu() or rearrange the structure to position the rcu_head structure into the first 4096 bytes. Note that the allowable offset might decrease in the future, for example, to allow something like kmem_cache_free_rcu(). The new kfree_rcu() function can replace code as follows: call_rcu(&p->rcu, simple_kfree_callback); where "simple_kfree_callback()" might be defined as follows: void simple_kfree_callback(struct rcu_head *p) { struct foo *q = container_of(p, struct foo, rcu); kfree(q); } with the following: kfree_rcu(&p->rcu, rcu); Note that the "rcu" is the name of a field in the structure being freed. The reason for using this rather than passing in a pointer to the base structure is that the above approach allows better type checking. This commit is based on earlier work by Lai Jiangshan and Manfred Spraul: Lai's V1 patch: http://lkml.org/lkml/2008/9/18/1 Manfred's patch: http://lkml.org/lkml/2009/1/2/115 Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: David Howells <dhowells@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: fix spellingPaul E. McKenney
The "preemptible" spelling is preferable. May as well fix it. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: call __rcu_read_unlock() in exit_rcu for tree RCULai Jiangshan
Using __rcu_read_lock() in place of rcu_read_lock() leaves any debug state as it really should be, namely with the lock still held. Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Converge TINY_RCU expedited and normal boostingPaul E. McKenney
This applies a trick from TREE_RCU boosting to TINY_RCU, eliminating code and adding comments. The key point is that it is possible for the booster thread itself to work out whether there is a normal or expedited boost required based solely on local information. There is therefore no need for boost initiation to know or care what type of boosting is required. In addition, when boosting is complete for a given grace period, then by definition there cannot be any more boosting for that grace period. This allows eliminating yet more state and statistics. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: remove useless ->boosted_this_gp fieldPaul E. McKenney
The ->boosted_this_gp field is a holdover from an earlier design that was to carry out multiple boost operations in parallel. It is not required by the current design, which boosts one task at a time. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: code cleanups in TINY_RCU priority boosting.Paul E. McKenney
Extraneous semicolon, bad comment, and fold INIT_LIST_HEAD() into list_del() to get list_del_init(). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Switch to this_cpu() primitivesPaul E. McKenney
This removes a couple of lines from invoke_rcu_cpu_kthread(), improving readability. Reported-by: Christoph Lameter <cl@linux.com> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Use WARN_ON_ONCE for DEBUG_OBJECTS_RCU_HEAD warningsPaul E. McKenney
Avoid additional multiple-warning confusion in memory-corruption scenarios. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: mark rcutorture boosting callback as being on-stackPaul E. McKenney
The CONFIG_DEBUG_OBJECTS_RCU_HEAD facility requires that on-stack RCU callbacks be flagged explicitly to debug-objects using the init_rcu_head_on_stack() and destroy_rcu_head_on_stack() functions. This commit applies those functions to the rcutorture code that tests RCU priority boosting. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Enable DEBUG_OBJECTS_RCU_HEAD from !PREEMPTMathieu Desnoyers
The prohibition of DEBUG_OBJECTS_RCU_HEAD from !PREEMPT was due to the fixup actions. So just produce a warning from !PREEMPT. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Add forward-progress diagnostic for per-CPU kthreadsPaul E. McKenney
Increment a per-CPU counter on each pass through rcu_cpu_kthread()'s service loop, and add it to the rcudata trace output. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: add grace-period age and more kthread state to tracingPaul E. McKenney
This commit adds the age in jiffies of the current grace period along with the duration in jiffies of the longest grace period since boot to the rcu/rcugp debugfs file. It also adds an additional "O" state to kthread tracing to differentiate between the kthread waiting due to having nothing to do on the one hand and waiting due to being on the wrong CPU on the other hand. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: fix tracing bug thinko on boost-balk attributionPaul E. McKenney
The rcu_initiate_boost_trace() function mis-attributed refusals to initiate RCU priority boosting that were in fact due to its not yet being time to boost. This patch fixes the faulty comparison. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: make rcutorture version numbers available through debugfsPaul E. McKenney
It is not possible to accurately correlate rcutorture output with that of debugfs. This patch therefore adds a debugfs file that prints out the rcutorture version number, permitting easy correlation. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: add tracing for RCU's kthread run states.Paul E. McKenney
Add tracing to help debugging situations when RCU's kthreads are not running but are supposed to be. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: add callback-queue information to rcudata outputPaul E. McKenney
This commit adds an indication of the state of the callback queue using a string of four characters following the "ql=" integer queue length. The first character is "N" if there are callbacks that have been queued that are not yet ready to be handled by the next grace period, or "." otherwise. The second character is "R" if there are callbacks queued that are ready to be handled by the next grace period, or "." otherwise. The third character is "W" if there are callbacks waiting for the current grace period, or "." otherwise. Finally, the fourth character is "D" if there are callbacks that have been handled by a prior grace period and are waiting to be invoked, or ".". Note that callbacks that are in the process of being invoked are not shown. These callbacks would have been removed from the rcu_data structure's list by rcu_do_batch() prior to being executed. (These callbacks are also not reflected in the "ql=" total, FWIW.) Also, document the new callback-queue trace information. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Add boosting to TREE_PREEMPT_RCU tracingPaul E. McKenney
Includes total number of tasks boosted, number boosted on behalf of each of normal and expedited grace periods, and statistics on attempts to initiate boosting that failed for various reasons. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: eliminate unused boosting statisticsPaul E. McKenney
The n_rcu_torture_boost_allocerror and n_rcu_torture_boost_afferror statistics are not actually incremented anymore, so eliminate them. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: avoid hammering sched with yet another bound RT kthreadPaul E. McKenney
The scheduler does not appear to take kindly to having multiple real-time threads bound to a CPU that is going offline. So this commit is a temporary hack-around to avoid that happening. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: put per-CPU kthread at non-RT priority during CPU hotplug operationsPaul E. McKenney
If you are doing CPU hotplug operations, it is best not to have CPU-bound realtime tasks running CPU-bound on the outgoing CPU. So this commit makes per-CPU kthreads run at non-realtime priority during that time. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Force per-rcu_node kthreads off of the outgoing CPUPaul E. McKenney
The scheduler has had some heartburn in the past when too many real-time kthreads were affinitied to the outgoing CPU. So, this commit lightens the load by forcing the per-rcu_node and the boost kthreads off of the outgoing CPU. Note that RCU's per-CPU kthread remains on the outgoing CPU until the bitter end, as it must in order to preserve correctness. Also avoid disabling hardirqs across calls to set_cpus_allowed_ptr(), given that this function can block. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-05-05rcu: priority boosting for TREE_PREEMPT_RCUPaul E. McKenney
Add priority boosting for TREE_PREEMPT_RCU, similar to that for TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST kernel parameter. The priority to which to boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter (defaulting to real-time priority 1) and the time to wait before boosting the readers who are blocking a given grace period is controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: move TREE_RCU from softirq to kthreadPaul E. McKenney
If RCU priority boosting is to be meaningful, callback invocation must be boosted in addition to preempted RCU readers. Otherwise, in presence of CPU real-time threads, the grace period ends, but the callbacks don't get invoked. If the callbacks don't get invoked, the associated memory doesn't get freed, so the system is still subject to OOM. But it is not reasonable to priority-boost RCU_SOFTIRQ, so this commit moves the callback invocations to a kthread, which can be boosted easily. Also add comments and properly synchronized all accesses to rcu_cpu_kthread_task, as suggested by Lai Jiangshan. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: merge TREE_PREEPT_RCU blocked_tasks[] listsPaul E. McKenney
Combine the current TREE_PREEMPT_RCU ->blocked_tasks[] lists in the rcu_node structure into a single ->blkd_tasks list with ->gp_tasks and ->exp_tasks tail pointers. This is in preparation for RCU priority boosting, which will add a third dimension to the combinatorial explosion in the ->blocked_tasks[] case, but simply a third pointer in the new ->blkd_tasks case. Also update documentation to reflect blocked_tasks[] merge Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Decrease memory-barrier usage based on semi-formal proofPaul E. McKenney
Commit d09b62d fixed grace-period synchronization, but left some smp_mb() invocations in rcu_process_callbacks() that are no longer needed, but sheer paranoia prevented them from being removed. This commit removes them and provides a proof of correctness in their absence. It also adds a memory barrier to rcu_report_qs_rsp() immediately before the update to rsp->completed in order to handle the theoretical possibility that the compiler or CPU might move massive quantities of code into a lock-based critical section. This also proves that the sheer paranoia was not entirely unjustified, at least from a theoretical point of view. In addition, the old dyntick-idle synchronization depended on the fact that grace periods were many milliseconds in duration, so that it could be assumed that no dyntick-idle CPU could reorder a memory reference across an entire grace period. Unfortunately for this design, the addition of expedited grace periods breaks this assumption, which has the unfortunate side-effect of requiring atomic operations in the functions that track dyntick-idle state for RCU. (There is some hope that the algorithms used in user-level RCU might be applied here, but some work is required to handle the NMIs that user-space applications can happily ignore. For the short term, better safe than sorry.) This proof assumes that neither compiler nor CPU will allow a lock acquisition and release to be reordered, as doing so can result in deadlock. The proof is as follows: 1. A given CPU declares a quiescent state under the protection of its leaf rcu_node's lock. 2. If there is more than one level of rcu_node hierarchy, the last CPU to declare a quiescent state will also acquire the ->lock of the next rcu_node up in the hierarchy, but only after releasing the lower level's lock. The acquisition of this lock clearly cannot occur prior to the acquisition of the leaf node's lock. 3. Step 2 repeats until we reach the root rcu_node structure. Please note again that only one lock is held at a time through this process. The acquisition of the root rcu_node's ->lock must occur after the release of that of the leaf rcu_node. 4. At this point, we set the ->completed field in the rcu_state structure in rcu_report_qs_rsp(). However, if the rcu_node hierarchy contains only one rcu_node, then in theory the code preceding the quiescent state could leak into the critical section. We therefore precede the update of ->completed with a memory barrier. All CPUs will therefore agree that any updates preceding any report of a quiescent state will have happened before the update of ->completed. 5. Regardless of whether a new grace period is needed, rcu_start_gp() will propagate the new value of ->completed to all of the leaf rcu_node structures, under the protection of each rcu_node's ->lock. If a new grace period is needed immediately, this propagation will occur in the same critical section that ->completed was set in, but courtesy of the memory barrier in #4 above, is still seen to follow any pre-quiescent-state activity. 6. When a given CPU invokes __rcu_process_gp_end(), it becomes aware of the end of the old grace period and therefore makes any RCU callbacks that were waiting on that grace period eligible for invocation. If this CPU is the same one that detected the end of the grace period, and if there is but a single rcu_node in the hierarchy, we will still be in the single critical section. In this case, the memory barrier in step #4 guarantees that all callbacks will be seen to execute after each CPU's quiescent state. On the other hand, if this is a different CPU, it will acquire the leaf rcu_node's ->lock, and will again be serialized after each CPU's quiescent state for the old grace period. On the strength of this proof, this commit therefore removes the memory barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp(). The effect is to reduce the number of memory barriers by one and to reduce the frequency of execution from about once per scheduling tick per CPU to once per grace period. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-05rcu: Remove conditional compilation for RCU CPU stall warningsPaul E. McKenney
The RCU CPU stall warnings can now be controlled using the rcu_cpu_stall_suppress boot-time parameter or via the same parameter from sysfs. There is therefore no longer any reason to have kernel config parameters for this feature. This commit therefore removes the RCU_CPU_STALL_DETECTOR and RCU_CPU_STALL_DETECTOR_RUNNABLE kernel config parameters. The RCU_CPU_STALL_TIMEOUT parameter remains to allow the timeout to be tuned and the RCU_CPU_STALL_VERBOSE parameter remains to allow task-stall information to be suppressed if desired. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-02genirq: Fix typo CONFIG_GENIRC_IRQ_SHOW_LEVELGeert Uytterhoeven
commit ab7798ffcf98b11a9525cf65bacdae3fd58d357f ("genirq: Expand generic show_interrupts()") added the Kconfig option GENERIC_IRQ_SHOW_LEVEL to accomodate PowerPC, but this doesn't actually enable the functionality due to a typo in the #ifdef check. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Linux/PPC Development <linuxppc-dev@lists.ozlabs.org> Link: http://lkml.kernel.org/r/%3Calpine.DEB.2.00.1104302251370.19068%40ayla.of.borg%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-30Merge branch 'fixes-2.6.39' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq * 'fixes-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix deadlock in worker_maybe_bind_and_lock() workqueue: Document debugging tricks Fix up trivial spelling conflict in kernel/workqueue.c
2011-04-29Merge branch 'perf-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf, x86, nmi: Move LVT un-masking into irq handlers perf events, x86: Work around the Nehalem AAJ80 erratum perf, x86: Fix BTS condition ftrace: Build without frame pointers on Microblaze
2011-04-29Merge branch 'timer-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table statically rtc: max8925: Call dev_set_drvdata before rtc_device_register
2011-04-29workqueue: fix deadlock in worker_maybe_bind_and_lock()Tejun Heo
If a rescuer and stop_machine() bringing down a CPU race with each other, they may deadlock on non-preemptive kernel. The CPU won't accept a new task, so the rescuer can't migrate to the target CPU, while stop_machine() can't proceed because the rescuer is holding one of the CPU retrying migration. GCWQ_DISASSOCIATED is never cleared and worker_maybe_bind_and_lock() retries indefinitely. This problem can be reproduced semi reliably while the system is entering suspend. http://thread.gmane.org/gmane.linux.kernel/1122051 A lot of kudos to Thilo-Alexander for reporting this tricky issue and painstaking testing. stable: This affects all kernels with cmwq, so all kernels since and including v2.6.36 need this fix. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Thilo-Alexander Ginkel <thilo@ginkel.com> Tested-by: Thilo-Alexander Ginkel <thilo@ginkel.com> Cc: stable@kernel.org
2011-04-29hrtimer: Initialize CLOCK_ID to HRTIMER_BASE table staticallyThomas Gleixner
Sedat and Bruno reported RCU stalls which turned out to be caused by the following; sched_init() calls init_rt_bandwidth() which calls hrtimer_init() _BEFORE_ hrtimers_init() is called. While not entirely correct this worked because hrtimer_init() only accessed statically initialized data (hrtimer_bases.clock_base[CLOCK_MONOTONIC]) Commit e06383db9 (hrtimers: extend hrtimer base code to handle more then 2 clockids) added an indirection to the hrtimer_bases.clock_base lookup to avoid gap handling in the hot path. The table which is used for the translataion from CLOCK_ID to HRTIMER_BASE index is initialized at runtime in hrtimers_init(). So the early call of the scheduler code translates CLOCK_MONOTONIC to HRTIMER_BASE_REALTIME. Thus the rt_bandwith timer ends up on CLOCK_REALTIME. If the timer is armed and the wall clock time is set (e.g. ntpdate in the early boot process - which also gives the problem deterministic behaviour i.e. magic recovery after N hours), then the timer ends up with an expiry time far into the future. That breaks the RT throttler mechanism as rt runtime is accumulated and never cleared, so the rt throttler detects a false cpu hog condition and blocks all RT tasks until the timer finally expires. That in turn stalls the RCU thread of TINYRCU which leads to an huge amount of RCU callbacks piling up. Make the translation table statically initialized, so we are back to the status of <= 2.6.39. Reported-and-tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Bruno Prémont <bonbons@linux-vserver.org> Cc: John stultz <johnstul@us.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/%3Calpine.LFD.2.02.1104282353140.3005%40ionos%3E Reviewed-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-28kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdogHillf Danton
In corner cases where softlockup watchdog is not setup successfully, the relevant nmi perf event for hardlockup watchdog could be disabled, then the status of the underlying hardware remains unchanged. Also, if the kthread doesn't start then the hrtimer won't run and the hardlockup detector will falsely fire. Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-27Merge branch 'tip/perf/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-2.6-trace into perf/urgent
2011-04-23Merge branch 'pm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM: Add missing syscore_suspend() and syscore_resume() calls PM: Fix error code paths executed after failing syscore_suspend()
2011-04-21ftrace: Build without frame pointers on MicroblazeMichal Simek
Microblaze doesn't need/support FRAME_POINTERS in order to have a working function tracer. The patch remove Kconfig warning. Warning log: warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP && FUNCTION_TRACER && KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN || MN10300) || ARCH_WANT_FRAME_POINTERS) Signed-off-by: Michal Simek <monstr@monstr.eu> Link: http://lkml.kernel.org/r/1301908812-8119-2-git-send-email-monstr@monstr.eu CC: Frederic Weisbecker <fweisbec@gmail.com> CC: Ingo Molnar <mingo@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-04-20PM: Add missing syscore_suspend() and syscore_resume() callsRafael J. Wysocki
Device suspend/resume infrastructure is used not only by the suspend and hibernate code in kernel/power, but also by APM, Xen and the kexec jump feature. However, commit 40dc166cb5dddbd36aa4ad11c03915ea (PM / Core: Introduce struct syscore_ops for core subsystems PM) failed to add syscore_suspend() and syscore_resume() calls to that code, which generally leads to breakage when the features in question are used. To fix this problem, add the missing syscore_suspend() and syscore_resume() calls to arch/x86/kernel/apm_32.c, kernel/kexec.c and drivers/xen/manage.c. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
2011-04-19Merge branch 'timer-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: RTC: rtc-omap: Fix a leak of the IRQ during init failure posix clocks: Replace mutex with reader/writer semaphore
2011-04-18PM: Fix error code paths executed after failing syscore_suspend()Rafael J. Wysocki
If syscore_suspend() fails in suspend_enter(), create_image() or resume_target_kernel(), it is necessary to call sysdev_resume(), because sysdev_suspend() has been called already and succeeded and we are going to abort the transition. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-04-18next_pidmap: fix overflow conditionLinus Torvalds
next_pidmap() just quietly accepted whatever 'last' pid that was passed in, which is not all that safe when one of the users is /proc. Admittedly the proc code should do some sanity checking on the range (and that will be the next commit), but that doesn't mean that the helper functions should just do that pidmap pointer arithmetic without checking the range of its arguments. So clamp 'last' to PID_MAX_LIMIT. The fact that we then do "last+1" doesn't really matter, the for-loop does check against the end of the pidmap array properly (it's only the actual pointer arithmetic overflow case we need to worry about, and going one bit beyond isn't going to overflow). [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ] Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com> Analyzed-by: Robert Święcki <robert@swiecki.net> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-04-18posix clocks: Replace mutex with reader/writer semaphoreRichard Cochran
A dynamic posix clock is protected from asynchronous removal by a mutex. However, using a mutex has the unwanted effect that a long running clock operation in one process will unnecessarily block other processes. For example, one process might call read() to get an external time stamp coming in at one pulse per second. A second process calling clock_gettime would have to wait for almost a whole second. This patch fixes the issue by using a reader/writer semaphore instead of a mutex. Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/%3C20110330132421.GA31771%40riccoc20.at.omicron.at%3E Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2011-04-16Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: block: make unplug timer trace event correspond to the schedule() unplug block: let io_schedule() flush the plug inline
2011-04-16Merge branches 'core-fixes-for-linus', 'perf-fixes-for-linus', ↵Linus Torvalds
'sched-fixes-for-linus', 'timer-fixes-for-linus' and 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Set FLAGS_HAS_TIMEOUT during futex_wait restart setup * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() perf: Fix a build error with some GCC versions * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Fix erroneous all_pinned logic sched: Fix sched-domain avg_load calculation * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: RTC: rtc-mrst: follow on to the change of rtc_device_register() RTC: add missing "return 0" in new alarm func for rtc-bfin.c RTC: Fix s3c compile error due to missing s3c_rtc_setpie RTC: Fix early irqs caused by calling rtc_set_alarm too early * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, amd: Disable GartTlbWlkErr when BIOS forgets it x86, NUMA: Fix fakenuma boot failure x86/mrst: Fix boot crash caused by incorrect pin to irq mapping x86/ce4100: Add reg property to bridges