summaryrefslogtreecommitdiff
path: root/arch/powerpc
AgeCommit message (Collapse)Author
2015-03-23powerpc/book3s: Fix the MCE code to use CONFIG_KVM_BOOK3S_64_HANDLERMahesh Salgaonkar
commit id 2ba9f0d has changed CONFIG_KVM_BOOK3S_64_HV to tristate to allow HV/PR bits to be built as modules. But the MCE code still depends on CONFIG_KVM_BOOK3S_64_HV which is wrong. When user selects CONFIG_KVM_BOOK3S_64_HV=m to build HV/PR bits as a separate module the relevant MCE code gets excluded. This patch fixes the MCE code to use CONFIG_KVM_BOOK3S_64_HANDLER. This makes sure that the relevant MCE code is included when HV/PR bits are built as a separate modules. Fixes: 2ba9f0d88750 ("kvm: powerpc: book3s: Support building HV and PR KVM as module") Cc: stable@vger.kernel.org # v3.14+ Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-20powerpc/pseries: Little endian fixes for post mobility device tree updateTyrel Datwyler
We currently use the device tree update code in the kernel after resuming from a suspend operation to re-sync the kernels view of the device tree with that of the hypervisor. The code as it stands is not endian safe as it relies on parsing buffers returned by RTAS calls that thusly contains data in big endian format. This patch annotates variables and structure members with __be types as well as performing necessary byte swaps to cpu endian for data that needs to be parsed. Signed-off-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com> Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com> Cc: Cyril Bur <cyrilbur@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-20powerpc: Add PVR for POWER8NVL processorBenjamin Herrenschmidt
There's a new variant of POWER8 coming called "POWER8 with NVLink". The core is identical to POWER8 but unfortunately they strapped it with a different PVR, so we need to add an explicit entry for it. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-20powerpc/powernv: Fixes for hypervisor doorbell handlingPaul Mackerras
Since we can now use hypervisor doorbells for host IPIs, this makes sure we clear the host IPI flag when taking a doorbell interrupt, and clears any pending doorbell IPI in pnv_smp_cpu_kill_self() (as we already do for IPIs sent via the XICS interrupt controller). Otherwise if there did happen to be a leftover pending doorbell interrupt for an offline CPU thread for any reason, it would prevent that thread from going into a power-saving mode; it would instead keep waking up because of the interrupt. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-04powerpc/iommu: Remove IOMMU device references via bus notifierNishanth Aravamudan
After d905c5df9aef ("PPC: POWERNV: move iommu_add_device earlier"), the refcnt on the kobject backing the IOMMU group for a PCI device is elevated by each call to pci_dma_dev_setup_pSeriesLP() (via set_iommu_table_base_and_group). When we go to dlpar a multi-function PCI device out: iommu_reconfig_notifier -> iommu_free_table -> iommu_group_put BUG_ON(tbl->it_group) We trip this BUG_ON, because there are still references on the table, so it is not freed. Fix this by moving the powernv bus notifier to common code and calling it for both powernv and pseries. Fixes: d905c5df9aef ("PPC: POWERNV: move iommu_add_device earlier") Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-04powerpc/smp: Wait until secondaries are active & onlineMichael Ellerman
Anton has a busy ppc64le KVM box where guests sometimes hit the infamous "kernel BUG at kernel/smpboot.c:134!" issue during boot: BUG_ON(td->cpu != smp_processor_id()); Basically a per CPU hotplug thread scheduled on the wrong CPU. The oops output confirms it: CPU: 0 Comm: watchdog/130 The problem is that we aren't ensuring the CPU active bit is set for the secondary before allowing the master to continue on. The master unparks the secondary CPU's kthreads and the scheduler looks for a CPU to run on. It calls select_task_rq() and realises the suggested CPU is not in the cpus_allowed mask. It then ends up in select_fallback_rq(), and since the active bit isnt't set we choose some other CPU to run on. This seems to have been introduced by 6acbfb96976f "sched: Fix hotplug vs. set_cpus_allowed_ptr()", which changed from setting active before online to setting active after online. However that was in turn fixing a bug where other code assumed an active CPU was also online, so we can't just revert that fix. The simplest fix is just to spin waiting for both active & online to be set. We already have a barrier prior to set_cpu_online() (which also sets active), to ensure all other setup is completed before online & active are set. Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-23powerpc: Re-enable dynticksPaul Clarke
Implement arch_irq_work_has_interrupt() for powerpc Commit 9b01f5bf3 introduced a dependency on "IRQ work self-IPIs" for full dynamic ticks to be enabled, by expecting architectures to implement a suitable arch_irq_work_has_interrupt() routine. Several arches have implemented this routine, including x86 (3010279f) and arm (09f6edd4), but powerpc was omitted. This patch implements this routine for powerpc. The symptom, at boot (on powerpc systems) with "nohz_full=<CPU list>" is displayed: NO_HZ: Can't run full dynticks because arch doesn't support irq work self-IPIs after this patch: NO_HZ: Full dynticks CPUs: <CPU list>. Tested against 3.19. powerpc implements "IRQ work self-IPIs" by setting the decrementer to 1 in arch_irq_work_raise(), which causes a decrementer exception on the next timebase tick. We then handle the work in __timer_interrupt(). CC: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Paul A. Clarke <pc@us.ibm.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> [mpe: Flesh out change log, fix ws & include guards, remove include of processor.h] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-21Merge tag 'clk-for-linus-3.20' of ↵Linus Torvalds
git://git.linaro.org/people/mike.turquette/linux Pull clock framework updates from Mike Turquette: "The clock framework changes contain the usual driver additions, enhancements and fixes mostly for ARM32, ARM64, MIPS and Power-based devices. Additionally the framework core underwent a bit of surgery with two major changes: - The boundary between the clock core and clock providers (e.g clock drivers) is now more well defined with dedicated provider helper functions. struct clk no longer maps 1:1 with the hardware clock but is a true per-user cookie which helps us tracker users of hardware clocks and debug bad behavior. - The addition of rate constraints for clocks. Rate ranges are now supported which are analogous to the voltage ranges in the regulator framework. Unfortunately these changes to the core created some breakeage. We think we fixed it all up but for this reason there are lots of last minute commits trying to undo the damage" * tag 'clk-for-linus-3.20' of git://git.linaro.org/people/mike.turquette/linux: (113 commits) clk: Only recalculate the rate if needed Revert "clk: mxs: Fix invalid 32-bit access to frac registers" clk: qoriq: Add support for the platform PLL powerpc/corenet: Enable CLK_QORIQ clk: Replace explicit clk assignment with __clk_hw_set_clk clk: Add __clk_hw_set_clk helper function clk: Don't dereference parent clock if is NULL MIPS: Alchemy: Remove bogus args from alchemy_clk_fgcs_detr clkdev: Always allocate a struct clk and call __clk_get() w/ CCF clk: shmobile: div6: Avoid division by zero in .round_rate() clk: mxs: Fix invalid 32-bit access to frac registers clk: omap: compile legacy omap3 clocks conditionally clkdev: Export clk_register_clkdev clk: Add rate constraints to clocks clk: remove clk-private.h pci: xgene: do not use clk-private.h arm: omap2+ remove dead clock code clk: Make clk API return per-user struct clk instances clk: tegra: Define PLLD_DSI and remove dsia(b)_mux clk: tegra: Add support for the Tegra132 CAR IP block ...
2015-02-19Merge branch 'kbuild' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek: - several cleanups in kbuild - serialize multiple *config targets so that 'make defconfig kvmconfig' works - The cc-ifversion macro got support for an else-branch * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: kbuild,gcov: simplify kernel/gcov/Makefile more kbuild: allow cc-ifversion to have the argument for false condition kbuild,gcov: simplify kernel/gcov/Makefile kbuild,gcov: remove unnecessary workaround kbuild: do not add $(call ...) to invoke cc-version or cc-fullversion kbuild: fix cc-ifversion macro kbuild: drop $(version_h) from MRPROPER_FILES kbuild: use mixed-targets when two or more config targets are given kbuild: remove redundant line from bounds.h/asm-offsets.h kbuild: merge bounds.h and asm-offsets.h rules kbuild: Drop support for clean-rule
2015-02-18powerpc/corenet: Enable CLK_QORIQEmil Medve
Change-Id: I1a80ad7b9f6854791bd270b746f93a91439155a6 Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com> Acked-by: Tang Yuantian <Yuantian.Tang@freescale.com> Signed-off-by: Michael Turquette <mturquette@linaro.org>
2015-02-17kexec: add IND_FLAGS macroGeoff Levand
Add a new kexec preprocessor macro IND_FLAGS, which is the bitwise OR of all the possible kexec IND_ kimage_entry indirection flags. Having this macro allows for simplified code in the prosessing of the kexec kimage_entry items. Also, remove the local powerpc definition and use the generic one. Signed-off-by: Geoff Levand <geoff@infradead.org> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Maximilian Attems <max@stro.at> Cc: Michal Marek <mmarek@suse.cz> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16powerpc: drop _PAGE_FILE and pte_file()-related helpersKirill A. Shutemov
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-15Merge tag 'char-misc-3.20-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char / misc patches from Greg KH: "Here's the big char/misc driver update for 3.20-rc1. Lots of little things in here, all described in the changelog. Nothing major or unusual, except maybe the binder selinux stuff, which was all acked by the proper selinux people and they thought it best to come through this tree. All of this has been in linux-next with no reported issues for a while" * tag 'char-misc-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (90 commits) coresight: fix function etm_writel_cp14() parameter order coresight-etm: remove check for unknown Kconfig macro coresight: fixing CPU hwid lookup in device tree coresight: remove the unnecessary function coresight_is_bit_set() coresight: fix the debug AMBA bus name coresight: remove the extra spaces coresight: fix the link between orphan connection and newly added device coresight: remove the unnecessary replicator property coresight: fix the replicator subtype value pdfdocs: Fix 'make pdfdocs' failure for 'uio-howto.tmpl' mcb: Fix error path of mcb_pci_probe virtio/console: verify device has config space ti-st: clean up data types (fix harmless memory corruption) mei: me: release hw from reset only during the reset flow mei: mask interrupt set bit on clean reset bit extcon: max77693: Constify struct regmap_config extcon: adc-jack: Release IIO channel on driver remove extcon: Remove duplicated include from extcon-class.c Drivers: hv: vmbus: hv_process_timer_expiration() can be static Drivers: hv: vmbus: serialize Offer and Rescind offer ...
2015-02-14Merge tag 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux Pull ACCESS_ONCE() rule tightening from Christian Borntraeger: "Tighten rules for ACCESS_ONCE This series tightens the rules for ACCESS_ONCE to only work on scalar types. It also contains the necessary fixups as indicated by build bots of linux-next. Now everything is in place to prevent new non-scalar users of ACCESS_ONCE and we can continue to convert code to READ_ONCE/WRITE_ONCE" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux: kernel: Fix sparse warning for ACCESS_ONCE next: sh: Fix compile error kernel: tighten rules for ACCESS ONCE mm/gup: Replace ACCESS_ONCE with READ_ONCE x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
2015-02-13powerpc: use %*pb[l] to print bitmaps including cpumasks and nodemasksTejun Heo
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. * Spurious if (len > 1) test dropped from shared_cpu_map_show(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM update from Paolo Bonzini: "Fairly small update, but there are some interesting new features. Common: Optional support for adding a small amount of polling on each HLT instruction executed in the guest (or equivalent for other architectures). This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This also has to be enabled manually for now, but the plan is to auto-tune this in the future. ARM/ARM64: The highlights are support for GICv3 emulation and dirty page tracking s390: Several optimizations and bugfixes. Also a first: a feature exposed by KVM (UUID and long guest name in /proc/sysinfo) before it is available in IBM's hypervisor! :) MIPS: Bugfixes. x86: Support for PML (page modification logging, a new feature in Broadwell Xeons that speeds up dirty page tracking), nested virtualization improvements (nested APICv---a nice optimization), usual round of emulation fixes. There is also a new option to reduce latency of the TSC deadline timer in the guest; this needs to be tuned manually. Some commits are common between this pull and Catalin's; I see you have already included his tree. Powerpc: Nothing yet. The KVM/PPC changes will come in through the PPC maintainers, because I haven't received them yet and I might end up being offline for some part of next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: ia64: drop kvm.h from installed user headers KVM: x86: fix build with !CONFIG_SMP KVM: x86: emulate: correct page fault error code for NoWrite instructions KVM: Disable compat ioctl for s390 KVM: s390: add cpu model support KVM: s390: use facilities and cpu_id per KVM KVM: s390/CPACF: Choose crypto control block format s390/kernel: Update /proc/sysinfo file with Extended Name and UUID KVM: s390: reenable LPP facility KVM: s390: floating irqs: fix user triggerable endless loop kvm: add halt_poll_ns module parameter kvm: remove KVM_MMIO_SIZE KVM: MIPS: Don't leak FPU/DSP to guest KVM: MIPS: Disable HTW while in guest KVM: nVMX: Enable nested posted interrupt processing KVM: nVMX: Enable nested virtual interrupt delivery KVM: nVMX: Enable nested apic register virtualization KVM: nVMX: Make nested control MSRs per-cpu KVM: nVMX: Enable nested virtualize x2apic mode KVM: nVMX: Prepare for using hardware MSR bitmap ...
2015-02-12Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge third set of updates from Andrew Morton: - the rest of MM [ This includes getting rid of the numa hinting bits, in favor of just generic protnone logic. Yay. - Linus ] - core kernel - procfs - some of lib/ (lots of lib/ material this time) * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (104 commits) lib/lcm.c: replace include lib/percpu_ida.c: remove redundant includes lib/strncpy_from_user.c: replace module.h include lib/stmp_device.c: replace module.h include lib/sort.c: move include inside #if 0 lib/show_mem.c: remove redundant include lib/radix-tree.c: change to simpler include lib/plist.c: remove redundant include lib/nlattr.c: remove redundant include lib/kobject_uevent.c: remove redundant include lib/llist.c: remove redundant include lib/md5.c: simplify include lib/list_sort.c: rearrange includes lib/genalloc.c: remove redundant include lib/idr.c: remove redundant include lib/halfmd4.c: simplify includes lib/dynamic_queue_limits.c: simplify includes lib/sort.c: use simpler includes lib/interval_tree.c: simplify includes hexdump: make it return number of bytes placed in buffer ...
2015-02-12powerpc: add running_clock for powerpc to prevent spurious softlockup warningsCyril Bur
On POWER8 virtualised kernels the VTB register can be read to have a view of time that only increases while the guest is running. This will prevent guests from seeing time jump if a guest is paused for significant amounts of time. On POWER7 and below virtualised kernels stolen time is subtracted from local_clock as a best effort approximation. This will not eliminate spurious warnings in the case of a suspended guest but may reduce the occurance in the case of softlockups due to host over commit. Bare metal kernels should avoid reading the VTB as KVM does not restore sane values when not executing, the approxmation is fine as host kernels won't observe any stolen time. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Cyril Bur <cyrilbur@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrew Jones <drjones@redhat.com> Acked-by: Don Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: chai wen <chaiw.fnst@cn.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ben Zhang <benzh@chromium.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12all arches, signal: move restart_block to struct task_structAndy Lutomirski
If an attacker can cause a controlled kernel stack overflow, overwriting the restart block is a very juicy exploit target. This is because the restart_block is held in the same memory allocation as the kernel stack. Moving the restart block to struct task_struct prevents this exploit by making the restart_block harder to locate. Note that there are other fields in thread_info that are also easy targets, at least on some architectures. It's also a decent simplification, since the restart code is more or less identical on all architectures. [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack] Signed-off-by: Andy Lutomirski <luto@amacapital.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: David Miller <davem@davemloft.net> Acked-by: Richard Weinberger <richard@nod.at> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Steven Miao <realmz6@gmail.com> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Chen Liqin <liqin.linux@gmail.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: remove remaining references to NUMA hinting bits and helpersMel Gorman
This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting update parallel fault ------------------------ -------------- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting update parallel migration ------------------------ ------------------ change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: Mel Gorman <mgorman@suse.de> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mark Brown <broonie@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12ppc64: add paranoid warnings for unexpected DSISR_PROTFAULTMel Gorman
ppc64 should not be depending on DSISR_PROTFAULT and it's unexpected if they are triggered. This patch adds warnings just in case they are being accidentally depended upon. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: convert p[te|md]_numa users to p[te|md]_protnone_numaMel Gorman
Convert existing users of pte_numa and friends to the new helper. Note that the kernel is broken after this patch is applied until the other page table modifiers are also altered. This patch layout is to make review easier. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12mm: add p[te|md] protnone helpers for use by NUMA balancingMel Gorman
This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12Merge branch 'for-3.20/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block driver changes from Jens Axboe: "This contains: - The 4k/partition fixes for brd from Boaz/Matthew. - A few xen front/back block fixes from David Vrabel and Roger Pau Monne. - Floppy changes from Takashi, cleaning the device file creation. - Switching libata to use the new blk-mq tagging policy, removing code (and a suboptimal implementation) from libata. This will throw you a merge conflict, since a bug in the original libata tagging code was fixed since this code was branched. Trivial. From Shaohua. - Conversion of loop to blk-mq, from Ming Lei. - Cleanup of the io_schedule() handling in bsg from Peter Zijlstra. He claims it improves on unreadable code, which will cost him a beer. - Maintainer update or NDB, now handled by Markus Pargmann. - NVMe: - Optimization from me that avoids a kmalloc/kfree per IO for smaller (<= 8KB) IO. This cuts about 1% of high IOPS CPU overhead. - Removal of (now) dead RCU code, a relic from before NVMe was converted to blk-mq" * 'for-3.20/drivers' of git://git.kernel.dk/linux-block: xen-blkback: default to X86_32 ABI on x86 xen-blkfront: fix accounting of reqs when migrating xen-blkback,xen-blkfront: add myself as maintainer block: Simplify bsg complete all floppy: Avoid manual call of device_create_file() NVMe: avoid kmalloc/kfree for smaller IO MAINTAINERS: Update NBD maintainer libata: make sata_sil24 use fifo tag allocator libata: move sas ata tag allocation to libata-scsi.c libata: use blk taging NVMe: within nvme_free_queues(), delete RCU sychro/deferred free null_blk: suppress invalid partition info brd: Request from fdisk 4k alignment brd: Fix all partitions BUGs axonram: Fix bug in direct_access loop: add blk-mq.h include block: loop: don't handle REQ_FUA explicitly block: loop: introduce lo_discard() and lo_req_flush() block: loop: say goodby to bio block: loop: improve performance via blk-mq
2015-02-12Merge branch 'for-3.20/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "This contains: - A series from Christoph that cleans up and refactors various parts of the REQ_BLOCK_PC handling. Contributions in that series from Dongsu Park and Kent Overstreet as well. - CFQ: - A bug fix for cfq for realtime IO scheduling from Jeff Moyer. - A stable patch fixing a potential crash in CFQ in OOM situations. From Konstantin Khlebnikov. - blk-mq: - Add support for tag allocation policies, from Shaohua. This is a prep patch enabling libata (and other SCSI parts) to use the blk-mq tagging, instead of rolling their own. - Various little tweaks from Keith and Mike, in preparation for DM blk-mq support. - Minor little fixes or tweaks from me. - A double free error fix from Tony Battersby. - The partition 4k issue fixes from Matthew and Boaz. - Add support for zero+unprovision for blkdev_issue_zeroout() from Martin" * 'for-3.20/core' of git://git.kernel.dk/linux-block: (27 commits) block: remove unused function blk_bio_map_sg block: handle the null_mapped flag correctly in blk_rq_map_user_iov blk-mq: fix double-free in error path block: prevent request-to-request merging with gaps if not allowed blk-mq: make blk_mq_run_queues() static dm: fix multipath regression due to initializing wrong request cfq-iosched: handle failure of cfq group allocation block: Quiesce zeroout wrapper block: rewrite and split __bio_copy_iov() block: merge __bio_map_user_iov into bio_map_user_iov block: merge __bio_map_kern into bio_map_kern block: pass iov_iter to the BLOCK_PC mapping functions block: add a helper to free bio bounce buffer pages block: use blk_rq_map_user_iov to implement blk_rq_map_user block: simplify bio_map_kern block: mark blk-mq devices as stackable block: keep established cmd_flags when cloning into a blk-mq request block: add blk-mq support to blk_insert_cloned_request() block: require blk_rq_prep_clone() be given an initialized clone request blk-mq: add tag allocation policy ...
2015-02-12Merge tag 'iommu-updates-v3.20' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU updates from Joerg Roedel: "This time with: - Generic page-table framework for ARM IOMMUs using the LPAE page-table format, ARM-SMMU and Renesas IPMMU make use of it already. - Break out the IO virtual address allocator from the Intel IOMMU so that it can be used by other DMA-API implementations too. The first user will be the ARM64 common DMA-API implementation for IOMMUs - Device tree support for Renesas IPMMU - Various fixes and cleanups all over the place" * tag 'iommu-updates-v3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (36 commits) iommu/amd: Convert non-returned local variable to boolean when relevant iommu: Update my email address iommu/amd: Use wait_event in put_pasid_state_wait iommu/amd: Fix amd_iommu_free_device() iommu/arm-smmu: Avoid build warning iommu/fsl: Various cleanups iommu/fsl: Use %pa to print phys_addr_t iommu/omap: Print phys_addr_t using %pa iommu: Make more drivers depend on COMPILE_TEST iommu/ipmmu-vmsa: Fix IOMMU lookup when multiple IOMMUs are registered iommu: Disable on !MMU builds iommu/fsl: Remove unused fsl_of_pamu_ids[] iommu/fsl: Fix section mismatch iommu/ipmmu-vmsa: Use the ARM LPAE page table allocator iommu: Fix trace_map() to report original iova and original size iommu/arm-smmu: add support for iova_to_phys through ATS1PR iopoll: Introduce memory-mapped IO polling macros iommu/arm-smmu: don't touch the secure STLBIALL register iommu/arm-smmu: make use of generic LPAE allocator iommu: io-pgtable-arm: add non-secure quirk ...
2015-02-11Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge second set of updates from Andrew Morton: "More of MM" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits) mm/nommu.c: fix arithmetic overflow in __vm_enough_memory() mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() vmstat: Reduce time interval to stat update on idle cpu mm/page_owner.c: remove unnecessary stack_trace field Documentation/filesystems/proc.txt: describe /proc/<pid>/map_files mm: incorporate read-only pages into transparent huge pages vmstat: do not use deferrable delayed work for vmstat_update mm: more aggressive page stealing for UNMOVABLE allocations mm: always steal split buddies in fallback allocations mm: when stealing freepages, also take pages created by splitting buddy page mincore: apply page table walker on do_mincore() mm: /proc/pid/clear_refs: avoid split_huge_page() mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) mempolicy: apply page table walker on queue_pages_range() arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() memcg: cleanup preparation for page table walk numa_maps: remove numa_maps->vma numa_maps: fix typo in gather_hugetbl_stats pagemap: use walk->vma instead of calling find_vma() clear_refs: remove clear_refs_private->vma and introduce clear_refs_test_walk() ...
2015-02-11Merge tag 'powerpc-3.20-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux Pull powerpc updates from Michael Ellerman: - Update of all defconfigs - Addition of a bunch of config options to modernise our defconfigs - Some PS3 updates from Geoff - Optimised memcmp for 64 bit from Anton - Fix for kprobes that allows 'perf probe' to work from Naveen - Several cxl updates from Ian & Ryan - Expanded support for the '24x7' PMU from Cody & Sukadev - Freescale updates from Scott: "Highlights include 8xx optimizations, some more work on datapath device tree content, e300 machine check support, t1040 corenet error reporting, and various cleanups and fixes" * tag 'powerpc-3.20-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (102 commits) cxl: Add missing return statement after handling AFU errror cxl: Fail AFU initialisation if an invalid configuration record is found cxl: Export optional AFU configuration record in sysfs powerpc/mm: Warn on flushing tlb page in kernel context powerpc/powernv: Add OPAL soft-poweroff routine powerpc/perf/hv-24x7: Document sysfs event description entries powerpc/perf/hv-gpci: add the remaining gpci requests powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotated powerpc/perf/hv-24x7: parse catalog and populate sysfs with events perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helper perf: add PMU_EVENT_ATTR_STRING() helper perf: provide sysfs_show for struct perf_pmu_events_attr powerpc/kernel: Avoid initializing device-tree pointer twice powerpc: Remove old compile time disabled syscall tracing code powerpc/kernel: Make syscall_exit a local label cxl: Fix device_node reference counting powerpc/mm: bail out early when flushing TLB page powerpc: defconfigs: add MTD_SPI_NOR (new dependency for M25P80) perf/powerpc: reset event hw state when adding it to the PMU powerpc/qe: Use strlcpy() ...
2015-02-11arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma()Naoya Horiguchi
We don't have to use mm_walk->private to pass vma to the callback function because of mm_walk->vma. And walk_page_vma() is useful if we walk over a single vma. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm: make FIRST_USER_ADDRESS unsigned long on all archsKirill A. Shutemov
LKP has triggered a compiler warning after my recent patch "mm: account pmd page tables to the process": mm/mmap.c: In function 'exit_mmap': >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default] The code: > 2857 WARN_ON(mm_nr_pmds(mm) > 2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT); In this, on tile, we have FIRST_USER_ADDRESS defined as 0. round_up() has the same type -- int. PUD_SHIFT. I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned long. On every arch for consistency. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11mm/hugetlb: reduce arch dependent code around follow_huge_*Naoya Horiguchi
Currently we have many duplicates in definitions around follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove the m. The basic idea is to put the default implementation for these functions in mm/hugetlb.c as weak symbols (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement arch-specific code only when the arch needs it. For follow_huge_addr(), only powerpc and ia64 have their own implementation, and in all other architectures this function just returns ERR_PTR(-EINVAL). So this patch sets returning ERR_PTR(-EINVAL) as default. As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to always return 0 in your architecture (like in ia64 or sparc,) it's never called (the callsite is optimized away) no matter how implemented it is. So in such architectures, we don't need arch-specific implementation. In some architecture (like mips, s390 and tile,) their current arch-specific follow_huge_(pmd|pud)() are effectively identical with the common code, so this patch lets these architecture use the common code. One exception is metag, where pmd_huge() could return non-zero but it expects follow_huge_pmd() to always return NULL. This means that we need arch-specific implementation which returns NULL. This behavior looks strange to me (because non-zero pmd_huge() implies that the architecture supports PMD-based hugepage, so follow_huge_pmd() can/should return some relevant value,) but that's beyond this cleanup patch, so let's keep it. Justification of non-trivial changes: - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE is true when follow_huge_pmd() can be called (note that pmd_huge() has the same check and always returns 0 for !MACHINE_HAS_HPAGE.) - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common code. This patch forces these archs use PMD_MASK, but it's OK because they are identical in both archs. In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20. In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but PTE_ORDER is always 0, so these are identical. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10Merge tag 'pci-v3.20-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI changes from Bjorn Helgaas: "Enumeration - Move domain assignment from arm64 to generic code (Lorenzo Pieralisi) - ARM: Remove artificial dependency on pci_sys_data domain (Lorenzo Pieralisi) - ARM: Move to generic PCI domains (Lorenzo Pieralisi) - Generate uppercase hex for modalias var in uevent (Ricardo Ribalda Delgado) - Add and use generic config accessors on ARM, PowerPC (Rob Herring) Resource management - Free resources on failure in of_pci_get_host_bridge_resources() (Lorenzo Pieralisi) - Fix infinite loop with ROM image of size 0 (Michel Dänzer) PCI device hotplug - Handle surprise add even if surprise removal isn't supported (Bjorn Helgaas) Virtualization - Mark AMD/ATI VGA devices that don't reset on D3hot->D0 transition (Alex Williamson) - Add DMA alias quirk for Adaptec 3405 (Alex Williamson) - Add Wellsburg (X99) to Intel PCH root port ACS quirk (Alex Williamson) - Add ACS quirk for Emulex NICs (Vasundhara Volam) MSI - Fail MSI-X mappings if there's no space assigned to MSI-X BAR (Yijing Wang) Freescale Layerscape host bridge driver - Fix platform_no_drv_owner.cocci warnings (Julia Lawall) NVIDIA Tegra host bridge driver - Remove unnecessary tegra_pcie_fixup_bridge() (Lucas Stach) Renesas R-Car host bridge driver - Fix error handling of irq_of_parse_and_map() (Dmitry Torokhov) TI Keystone host bridge driver - Fix error handling of irq_of_parse_and_map() (Dmitry Torokhov) - Fix misspelling of current function in debug output (Julia Lawall) Xilinx AXI host bridge driver - Fix harmless format string warning (Arnd Bergmann) Miscellaneous - Use standard parsing functions for ASPM sysfs setters (Chris J Arges) - Add pci_device_to_OF_node() stub for !CONFIG_OF (Kevin Hao) - Delete unnecessary NULL pointer checks (Markus Elfring) - Add and use defines for PCIe Max_Read_Request_Size (Rafał Miłecki) - Include clk.h instead of clk-private.h (Stephen Boyd)" * tag 'pci-v3.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (48 commits) PCI: Add pci_device_to_OF_node() stub for !CONFIG_OF PCI: xilinx: Convert to use generic config accessors PCI: xgene: Convert to use generic config accessors PCI: tegra: Convert to use generic config accessors PCI: rcar: Convert to use generic config accessors PCI: generic: Convert to use generic config accessors powerpc/powermac: Convert PCI to use generic config accessors powerpc/fsl_pci: Convert PCI to use generic config accessors ARM: ks8695: Convert PCI to use generic config accessors ARM: sa1100: Convert PCI to use generic config accessors ARM: integrator: Convert PCI to use generic config accessors PCI: versatile: Add DT-based ARM Versatile PB PCIe host driver ARM: dts: versatile: add PCI controller binding of/pci: Free resources on failure in of_pci_get_host_bridge_resources() PCI: versatile: Add DT docs for ARM Versatile PB PCIe driver PCI: Fail MSI-X mappings if there's no space assigned to MSI-X BAR r8169: use PCI define for Max_Read_Request_Size [SCSI] esas2r: use PCI define for Max_Read_Request_Size tile: use PCI define for Max_Read_Request_Size rapidio/tsi721: use PCI define for Max_Read_Request_Size ...
2015-02-09Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main RCU changes in this cycle are: - Documentation updates. - Miscellaneous fixes. - Preemptible-RCU fixes, including fixing an old bug in the interaction of RCU priority boosting and CPU hotplug. - SRCU updates. - RCU CPU stall-warning updates. - RCU torture-test updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits) rcu: Initialize tiny RCU stall-warning timeouts at boot rcu: Fix RCU CPU stall detection in tiny implementation rcu: Add GP-kthread-starvation checks to CPU stall warnings rcu: Make cond_resched_rcu_qs() apply to normal RCU flavors rcu: Optionally run grace-period kthreads at real-time priority ksoftirqd: Use new cond_resched_rcu_qs() function ksoftirqd: Enable IRQs and call cond_resched() before poking RCU rcutorture: Add more diagnostics in rcu_barrier() test failure case torture: Flag console.log file to prevent holdovers from earlier runs torture: Add "-enable-kvm -soundhw pcspk" to qemu command line rcutorture: Handle different mpstat versions rcutorture: Check from beginning to end of grace period rcu: Remove redundant rcu_batches_completed() declaration rcutorture: Drop rcu_torture_completed() and friends rcu: Provide rcu_batches_completed_sched() for TINY_RCU rcutorture: Use unsigned for Reader Batch computations rcutorture: Make build-output parsing correctly flag RCU's warnings rcu: Make _batches_completed() functions return unsigned long rcutorture: Issue warnings on close calls due to Reader Batch blows documentation: Fix smp typo in memory-barriers.txt ...
2015-02-06kvm: add halt_poll_ns module parameterPaolo Bonzini
This patch introduces a new module parameter for the KVM module; when it is present, KVM attempts a bit of polling on every HLT before scheduling itself out via kvm_vcpu_block. This parameter helps a lot for latency-bound workloads---in particular I tested it with O_DSYNC writes with a battery-backed disk in the host. In this case, writes are fast (because the data doesn't have to go all the way to the platters) but they cannot be merged by either the host or the guest. KVM's performance here is usually around 30% of bare metal, or 50% if you use cache=directsync or cache=writethrough (these parameters avoid that the guest sends pointless flush requests, and at the same time they are not slow because of the battery-backed cache). The bad performance happens because on every halt the host CPU decides to halt itself too. When the interrupt comes, the vCPU thread is then migrated to a new physical CPU, and in general the latency is horrible because the vCPU thread has to be scheduled back in. With this patch performance reaches 60-65% of bare metal and, more important, 99% of what you get if you use idle=poll in the guest. This means that the tunable gets rid of this particular bottleneck, and more work can be done to improve performance in the kernel or QEMU. Of course there is some price to pay; every time an otherwise idle vCPUs is interrupted by an interrupt, it will poll unnecessarily and thus impose a little load on the host. The above results were obtained with a mostly random value of the parameter (500000), and the load was around 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU. The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune the parameter. It counts how many HLT instructions received an interrupt during the polling period; each successful poll avoids that Linux schedules the VCPU thread out and back in, and may also avoid a likely trip to C1 and back for the physical CPU. While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second. Of these halts, almost all are failed polls. During the benchmark, instead, basically all halts end within the polling period, except a more or less constant stream of 50 per second coming from vCPUs that are not running the benchmark. The wasted time is thus very low. Things may be slightly different for Windows VMs, which have a ~10 ms timer tick. The effect is also visible on Marcelo's recently-introduced latency test for the TSC deadline timer. Though of course a non-RT kernel has awful latency bounds, the latency of the timer is around 8000-10000 clock cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC deadline timer, thus, the effect is both a smaller average latency and a smaller variance. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-05mm/debug_pagealloc: fix build failure on ppc and some other archsJoonsoo Kim
Kim Phillips reported following build failure. LD init/built-in.o mm/built-in.o: In function `free_pages_prepare': mm/page_alloc.c:770: undefined reference to `.kernel_map_pages' mm/built-in.o: In function `prep_new_page': mm/page_alloc.c:933: undefined reference to `.kernel_map_pages' mm/built-in.o: In function `map_pages': mm/compaction.c:61: undefined reference to `.kernel_map_pages' make: *** [vmlinux] Error 1 Reason for this problem is that commit 031bc5743f15 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") forgot to remove the old declaration of kernel_map_pages() for some architectures. This patch removes them to fix build failure. Reported-by: Kim Phillips <kim.phillips@freescale.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-04Merge branches 'arm/renesas', 'arm/smmu', 'arm/omap', 'ppc/pamu', 'x86/amd' ↵Joerg Roedel
and 'core' into next Conflicts: drivers/iommu/Kconfig drivers/iommu/Makefile
2015-02-04powerpc/mm: Warn on flushing tlb page in kernel contextArseny Solokha
Function __flush_tlb_page() must only be called for user contexts, so put in extra hardening to warn on calling it for kernel context. Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-04powerpc/powernv: Add OPAL soft-poweroff routineJoel Stanley
Register a notifier for a OPAL message indicating that the machine should prepare itself for a graceful power off. OPAL will tell us if the power off is a reboot or shutdown, but for now we perform the same orderly_poweroff action. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-04Merge branch 'next' of ↵Michael Ellerman
git://git.kernel.org/pub/scm/linux/kernel/git/scottwood/linux into next Freescale updates from Scott: "Highlights include 8xx optimizations, some more work on datapath device tree content, e300 machine check support, t1040 corenet error reporting, and various cleanups and fixes."
2015-02-03iommu/fsl: Various cleanupsEmil Medve
Currently a PAMU driver patch is very likely to receive some checkpatch complaints about the code in the context of the patch. This patch is an attempt to fix most of that and make the driver more readable Also fixed a subset of the sparse and coccinelle reported issues. Signed-off-by: Emil Medve <Emilian.Medve@Freescale.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
2015-02-02Merge branch 'clk-next' into v3.19-rc7Michael Turquette
2015-02-02powerpc/perf/hv-gpci: add the remaining gpci requestsCody P Schafer
Add the remaining gpci requests that contain counters suitable for use by perf. Omit those that don't contain any counters (but note their ommision). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02powerpc/perf/{hv-gpci, hv-common}: generate requests with counters annotatedCody P Schafer
This adds (in req-gen/) a framework for defining gpci counter requests. It uses macro magic similar to ftrace. Also convert the existing hv-gpci request structures and enum values to use the new framework (and adjust old users of the structs and enum values to cope with changes in naming). In exchange for this macro disaster, we get autogenerated event listing for GPCI in sysfs, build time field offset checking, and zero duplication of information about GPCI requests. Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02powerpc/perf/hv-24x7: parse catalog and populate sysfs with eventsCody P Schafer
Retrieves and parses the 24x7 catalog on POWER systems that supply it (right now, only POWER 8). Events are exposed via sysfs in the standard fashion, and are all parameterized. $ cd /sys/bus/event_source/devices/hv_24x7/events $ cat HPM_CS_FROM_L4_LDATA__PHYS_CORE domain=0x2,offset=0xd58,core=?,lpar=0x0 $ cat HPM_TLBIE__VCPU_HOME_CHIP domain=0x4,offset=0x358,vcpu=?,lpar=? where user is required to specify values for the fields with '?' (like core, vcpu, lpar above), when specifying the event with the perf tool. Catalog is (at the moment) only parsed on boot. It needs re-parsing when a some hypervisor events occur. At that point we'll also need to prevent old events from continuing to function (counter that is passed in via spare space in the config values?). Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02perf: define EVENT_DEFINE_RANGE_FORMAT_LITE helpersukadev@linux.vnet.ibm.com
Define a lite version of the EVENT_DEFINE_RANGE_FORMAT() that avoids defining helper functions for the bit-field ranges. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02powerpc/kernel: Avoid initializing device-tree pointer twiceGavin Shan
As commit 50ba08f3 ("of/fdt: Don't clear initial_boot_params if fdt_check_header() fails") does, the device-tree pointer "initial_boot_params" is initialized by early_init_dt_verify(), which is called by early_init_devtree(). So we needn't explicitly initialize that again in early_init_devtree(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02powerpc: Remove old compile time disabled syscall tracing codeMichael Ellerman
We have code to do syscall tracing which is disabled at compile time by default. It's not been touched since the dawn of time (ie. v2.6.12). There are now better ways to do syscall tracing, ie. using the raw_syscall, or syscall tracepoints. For the specific case of tracing syscalls at boot on a system that doesn't get to userspace, you can boot with: trace_event=syscalls tp_printk=on Which will trace syscalls from boot, and echo all output to the console. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02powerpc/kernel: Make syscall_exit a local labelMichael Ellerman
Currently when we back trace something that is in a syscall we see something like this: [c000000000000000] [c000000000000000] SyS_read+0x6c/0x110 [c000000000000000] [c000000000000000] syscall_exit+0x0/0x98 Although it's entirely correct, seeing syscall_exit at the bottom can be confusing - we were exiting from a syscall and then called SyS_read() ? If we instead change syscall_exit to be a local label we get something more intuitive: [c0000001fa46fde0] [c00000000026719c] SyS_read+0x6c/0x110 [c0000001fa46fe30] [c000000000009264] system_call+0x38/0xd0 ie. we were handling a system call, and it was SyS_read(). Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-02-02cxl: Fix device_node reference countingRyan Grimm
When unbinding and rebinding the driver on a system with a card in PHB0, this error condition is reached after a few attempts: ERROR: Bad of_node_put() on /pciex@3fffe40000000 CPU: 0 PID: 3040 Comm: bash Not tainted 3.18.0-rc3-12545-g3627ffe #152 Call Trace: [c000000721acb5c0] [c00000000086ef94] .dump_stack+0x84/0xb0 (unreliable) [c000000721acb640] [c00000000073a0a8] .of_node_release+0xd8/0xe0 [c000000721acb6d0] [c00000000044bc44] .kobject_release+0x74/0xe0 [c000000721acb760] [c0000000007394fc] .of_node_put+0x1c/0x30 [c000000721acb7d0] [c000000000545cd8] .cxl_probe+0x1a98/0x1d50 [c000000721acb900] [c0000000004845a0] .local_pci_probe+0x40/0xc0 [c000000721acb980] [c000000000484998] .pci_device_probe+0x128/0x170 [c000000721acba30] [c00000000052400c] .driver_probe_device+0xac/0x2a0 [c000000721acbad0] [c000000000522468] .bind_store+0x108/0x160 [c000000721acbb70] [c000000000521448] .drv_attr_store+0x38/0x60 [c000000721acbbe0] [c000000000293840] .sysfs_kf_write+0x60/0xa0 [c000000721acbc50] [c000000000292500] .kernfs_fop_write+0x140/0x1d0 [c000000721acbcf0] [c000000000208648] .vfs_write+0xd8/0x260 [c000000721acbd90] [c000000000208b18] .SyS_write+0x58/0x100 [c000000721acbe30] [c000000000009258] syscall_exit+0x0/0x98 We are missing a call to of_node_get(). pnv_pci_to_phb_node() should call of_node_get() otherwise np's reference count isn't incremented and it might go away. Rename pnv_pci_to_phb_node() to pnv_pci_get_phb_node() so it's clear it calls of_node_get(). Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com> Acked-by: Ian Munsie <imunsie@au1.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-01-30powerpc/mm: bail out early when flushing TLB pageArseny Solokha
MMU_NO_CONTEXT is conditionally defined as 0 or (unsigned int)-1. However, in __flush_tlb_page() a corresponding variable is only tested for open coded 0, which can cause NULL pointer dereference if `mm' argument was legitimately passed as such. Bail out early in case the first argument is NULL, thus eliminate confusion between different values of MMU_NO_CONTEXT and avoid disabling and then re-enabling preemption unnecessarily. Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru> Signed-off-by: Scott Wood <scottwood@freescale.com>