Age         Commit message                                          Author
2015-10-03  ravb: Add support for r8a7795 SoC  (Kazuya Mizuguchi)

This patch supports the r8a7795 SoC by:

- Using two interrupts:
  + One for E-MAC
  + One for everything else
  + Both can be handled by the existing common interrupt handler,
    which affords a simpler update to support the new SoC. In future
    some consideration may be given to implementing multiple
    interrupt handlers.
- Limiting the phy speed to 100Mbit/s for the new SoC; at this time
  it is not clear how this restriction may be lifted, but I hope it
  will be possible as more information comes to light.

Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
[horms: reworked]
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  ravb: Document binding for r8a7795 SoC  (Kazuya Mizuguchi)

This patch updates the ravb binding to support the r8a7795 SoC by:

- Adding a compat string for the new hardware
- Adding 25 named interrupts to the binding for the new SoC; older
  SoCs continue to use a single multiplexed interrupt

The example is also updated to reflect the r8a7795, as this is the
more complex case.

Based on work by Kazuya Mizuguchi and others.

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  ravb: Provide dev parameter to DMA API  (Kazuya Mizuguchi)

This patch is in preparation for using this driver on arm64, where
the implementation of __dma_alloc_coherent fails if a device
parameter is not provided.

Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Masaru Nagai <masaru.nagai.vx@renesas.com>
[horms: squashed into a single patch]
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  phylib: Add phy_set_max_speed helper  (Simon Horman)

Add a helper to allow ethernet drivers to limit the speed of the phy
they are attached to. This mainly involves factoring out the business
end of of_set_phy_supported() and exporting a new symbol.

This code seems to be open coded in several places, in several
different variants.

It is envisaged that this will be used in situations where setting
the "max-speed" property in DT is not appropriate, e.g. because the
maximum speed is not a property of the phy hardware.

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

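As an illustration, a MAC driver's phy init path might use the new
helper roughly as follows (a minimal sketch; the function and its
calling context are assumptions, not part of this patch):

    #include <linux/phy.h>

    /* sketch: cap an attached phy at 100Mbit/s, e.g. because the
     * MAC, not the phy, imposes the limit
     */
    static int example_limit_phy(struct phy_device *phydev)
    {
        int err;

        /* mask out all supported modes above 100Mbit/s */
        err = phy_set_max_speed(phydev, SPEED_100);
        if (err)
            return err;

        return 0;
    }
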
2015-10-03  Merge branch 'bpf-updates'  (David S. Miller)

Daniel Borkmann says:

====================
BPF updates

Some minor updates to {cls,act}_bpf to retrieve routing realms and to
make skb->priority writable. Thanks!

v1 -> v2:
- Dropped the preclassify patch from the series for now, as the rest
  is pretty much independent of it
- Rest unchanged, only rebased; already-posted Acked-by's kept
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  sched, bpf: make skb->priority writable  (Daniel Borkmann)

{cls,act}_bpf can now set skb->priority from an eBPF program based on
various criteria, so that, for example, classful qdiscs like multiq
can update the skb's priority at enqueue time and push it further
down into subsequent qdiscs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

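For illustration, a cls_bpf program can now contain a direct store to
the priority field (a sketch in tc-style restricted C; the mark value,
priority value and section name are illustrative assumptions):

    #include <linux/bpf.h>

    __attribute__((section("classifier"), used))
    int prio_cls(struct __sk_buff *skb)
    {
        /* illustrative policy: bump already-marked packets into a
         * higher band before a classful qdisc picks one
         */
        if (skb->mark == 0x2a)      /* mark value is an assumption */
            skb->priority = 7;      /* this ctx write is now allowed */

        return -1; /* cls_bpf: fall back to the filter's default classid */
    }
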
2015-10-03  sched, bpf: add helper for retrieving routing realms  (Daniel Borkmann)

Using routing realms as part of the classifier is quite useful; a
realm can be viewed as a tag for one or multiple routing entries
(think of an analogy to the net_cls cgroup for processes), set by
user space routing daemons or via iproute2 as an indicator for
traffic classifiers and later on processed in the eBPF program.

Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that
possibility, but in case people know what they are doing, it can be
used from there as well (e.g. via devs that must keep dsts by design
anyway).

If a realm is set, the handler returns the non-zero realm. User space
can set the full 32-bit realm for the dst.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

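A sketch of how a cls_bpf program could consume the new helper (the
stub declaration follows the usual BPF-sample pattern; mapping the
realm straight to a classid is an illustrative choice):

    #include <linux/bpf.h>

    /* helper stub; BPF_FUNC_get_route_realm is the id this patch adds */
    static unsigned int (*bpf_get_route_realm)(void *ctx) =
            (void *) BPF_FUNC_get_route_realm;

    __attribute__((section("classifier"), used))
    int realm_cls(struct __sk_buff *skb)
    {
        unsigned int realm = bpf_get_route_realm(skb);

        /* non-zero only if a realm was set on the dst entry */
        if (realm)
            return realm; /* e.g. use the realm as the classid */

        return -1;
    }
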
2015-10-03  ebpf: migrate bpf_prog's flags to bitfield  (Daniel Borkmann)

As we need to add further flags to the bpf_prog structure, let's
migrate both bools to a bitfield representation. The size of the base
structure (excluding insns) remains unchanged at 40 bytes.

Also add tags for kmemcheck, so that it doesn't throw false
positives. Even if gcc were to generate suboptimal code, these flags
are not accessed in performance-critical paths.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

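The shape of the change is roughly the following (member layout
abridged and comments illustrative; the kmemcheck annotations are the
"tags" the description mentions):

    struct bpf_prog {
            u16 pages;              /* number of allocated pages */
            kmemcheck_bitfield_begin(meta);
            u16 jited:1,            /* was the filter JITed? */
                gpl_compatible:1;   /* is the filter GPL compatible? */
            kmemcheck_bitfield_end(meta);
            /* ... insns and remaining members elided ... */
    };
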
2015-10-03  Merge branch 'switchdev-obj'  (David S. Miller)

Jiri Pirko says:

====================
switchdev: bring back switchdev_obj

The second version of this patch has grown into a patchset. Basically
this patchset brings back the object structure, which disappeared
with Vivien's recent patchset. It also makes a few naming changes to
get things in line, and puts the object id back into the object
structure.

Thanks to Scott and Vivien for review and suggestions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  switchdev: push object ID back to object structure  (Jiri Pirko)

Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  switchdev: bring back switchdev_obj and use it as a generic object param  (Jiri Pirko)

Replace "void *obj" with a generic structure, and introduce a couple
of helpers along the way.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

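A sketch of the resulting pattern: a small generic header struct is
embedded in each specific object, and a container_of() helper recovers
the specific type (member layout abridged from the series; treat it as
an illustration rather than the exact final code):

    struct switchdev_obj {
            enum switchdev_obj_id id;
    };

    struct switchdev_obj_port_vlan {
            struct switchdev_obj obj;   /* generic header */
            u16 flags;
            u16 vid_begin;
            u16 vid_end;
    };

    /* go from the generic param back to the specific object */
    #define SWITCHDEV_OBJ_PORT_VLAN(obj) \
            container_of(obj, struct switchdev_obj_port_vlan, obj)
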
2015-10-03  switchdev: rename switchdev_obj_fdb to switchdev_obj_port_fdb  (Jiri Pirko)

Make the struct name consistent with the object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  switchdev: rename switchdev_obj_vlan to switchdev_obj_port_vlan  (Jiri Pirko)

Make the struct name consistent with the object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  switchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*  (Jiri Pirko)

To be aligned with obj.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  switchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*  (Jiri Pirko)

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  Merge branch 'tcp-lockless-listener'  (David S. Miller)

Eric Dumazet says:

====================
tcp/dccp: lockless listener

TCP listener refactoring: this is becoming interesting!

This patch series takes the steps to use the normal TCP/DCCP ehash
table to store SYN_RECV requests, instead of the private per-listener
hash table we had until now.

SYNACK skbs are now attached to their syn_recv request socket, so
that we no longer heavily modify the listener's sk_wmem_alloc.

The listener lock is no longer held in the fast path, including in
SYNCOOKIE mode.

During my tests, my server was able to process 3,500,000 SYN packets
per second on one listener and still had cpu cycles available. That
is about 2 to 3 orders of magnitude more than what we had with older
kernels.

This effort started two years ago and I am pleased to reach
expectations.

We'll probably extend SO_REUSEPORT to add proper cpu/numa affinities,
so that heavy-duty TCP servers can get proper siloing thanks to
multi-queue NICs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: do not lock listener to process SYN packets  (Eric Dumazet)

Everything should now be ready to finally allow SYN packet processing
without holding the listener lock.

Tested: 3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.

The next bottleneck is the refcount taken on the listener, which
could be avoided if we removed the SLAB_DESTROY_BY_RCU strict
semantic for listeners and used regular RCU.

    13.18%  [kernel]  [k] __inet_lookup_listener
     9.61%  [kernel]  [k] tcp_conn_request
     8.16%  [kernel]  [k] sha_transform
     5.30%  [kernel]  [k] inet_reqsk_alloc
     4.22%  [kernel]  [k] sock_put
     3.74%  [kernel]  [k] tcp_make_synack
     2.88%  [kernel]  [k] ipt_do_table
     2.56%  [kernel]  [k] memcpy_erms
     2.53%  [kernel]  [k] sock_wfree
     2.40%  [kernel]  [k] tcp_v4_rcv
     2.08%  [kernel]  [k] fib_table_lookup
     1.84%  [kernel]  [k] tcp_openreq_init_rwin

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp/dccp: add a reschedule point in inet_csk_listen_stop()  (Eric Dumazet)

If a listener with thousands of children in its accept queue is
dismantled, it can take a while to close all of them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: remove max_qlen_log  (Eric Dumazet)

This control variable was set at the first listen(fd, backlog) call,
but not updated if the application later tried to increase or
decrease the backlog. It made sense at the time the listener had a
non-resizable hash table.

Also, rounding to powers of two was not very friendly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp/dccp: remove struct listen_sock  (Eric Dumazet)

It is enough to check the listener sk_state; no need for an extra
condition.

max_qlen_log can be moved into struct request_sock_queue.

We can remove syn_wait_lock and the alignment it enforced.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: attach SYNACK messages to request sockets instead of listener  (Eric Dumazet)

If a listen backlog is very big (to avoid syncookies), then the
listener sk->sk_wmem_alloc is the main source of false sharing, as we
need to touch it twice per SYNACK retransmit and TX completion.

(One SYN packet takes the listener lock once, but up to 6 SYNACKs are
generated.)

By attaching the skb to the request socket, we remove this source of
contention.

Tested:

    listen(fd, 10485760); // single listener (no SO_REUSEPORT)
    16 RX/TX queue NIC

Sustained a SYNFLOOD attack of ~320,000 SYN per second, sending
~1,400,000 SYNACK per second. Perf profiles now show the listener
spinlock being the next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  ipv6: remove obsolete inet6 functions  (Eric Dumazet)

inet6_csk_search_req() and inet6_csk_reqsk_queue_hash_add() no longer
exist.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp/dccp: shrink struct listen_sock  (Eric Dumazet)

We no longer use hash_rnd, nr_table_entries and syn_table[].

For a listener with a backlog of 10 million sockets, this saves
80 MBytes of vmalloced memory.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp/dccp: install syn_recv requests into ehash table  (Eric Dumazet)

In this patch, we insert request sockets into the regular TCP/DCCP
ehash table (where ESTABLISHED and TIMEWAIT sockets are) instead of
using the per-listener hash table.

ACK packets find SYN_RECV pseudo sockets without having to find and
lock the listener.

In nominal conditions, this halves the pressure on the listener lock.

Note that this will allow for SO_REUSEPORT refinements, so that we
can select a listener using cpu/numa affinities instead of the prior
'consistent hash', since only SYN packets will apply this selection
logic.

We will shrink listen_sock in the following patch to ease code
review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument  (Eric Dumazet)

This is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: get_openreq[46]() changes  (Eric Dumazet)

When request sockets are no longer in a per-listener hash table but
in the regular TCP ehash, we need to access the listener uid through
req->rsk_listener.

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: remove BUG_ON() in tcp_check_req()  (Eric Dumazet)

Once the listener is lockless, its sk_state can change at any time.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: cleanup tcp_v[46]_inbound_md5_hash()  (Eric Dumazet)

We'll soon have to call tcp_v[46]_inbound_md5_hash() twice.

Also add a const attribute to the socket, as it might be the unlocked
listener for SYN packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp/dccp: init sk_prot and call sk_node_init() in reqsk_alloc()  (Eric Dumazet)

We plan to use generic functions to insert request sockets into the
ehash table.

sk_prot needs to be set (to retrieve sk_prot->h.hashinfo), and
sk_node needs to be cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: call sk_mark_napi_id() on the child, not the listener  (Eric Dumazet)

This fixes a typo: we want to store the NAPI id on the child socket.
Presumably nobody really uses busy polling on short-lived flows.

Fixes: 3d97379a67486 ("tcp: move sk_mark_napi_id() at the right place")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

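The shape of the fix is essentially a one-liner (a sketch; the exact
call site in the TCP receive path is abridged, and the variable names
are illustrative):

    -   sk_mark_napi_id(sk, skb);    /* sk is the listener */
    +   sk_mark_napi_id(child, skb); /* child actually serves the flow */
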
2015-10-03  tcp: move synflood_warned into struct request_sock_queue  (Eric Dumazet)

The long-term plan is to remove struct listen_sock when its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-03  tcp: move qlen/young out of struct listen_sock  (Eric Dumazet)

qlen_inc & young_inc were protected by the listener lock, while
qlen_dec & young_dec were atomic fields. Everything needs to be
atomic for the upcoming lockless listener.

Also move qlen/young into request_sock_queue, as we'll get rid of
struct listen_sock eventually.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

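Roughly, the counters become plain atomics in request_sock_queue, so
both the inc and dec paths are lock-free (a sketch; member layout
abridged):

    struct request_sock_queue {
            /* ... */
            atomic_t qlen;
            atomic_t young;
            /* ... */
    };

    static inline void reqsk_queue_added(struct request_sock_queue *queue)
    {
            atomic_inc(&queue->young);
            atomic_inc(&queue->qlen);
    }
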
2015-10-03  tcp: add a spinlock to protect struct request_sock_queue  (Eric Dumazet)

struct request_sock_queue fields are currently protected by the
listener 'lock' (not a real spinlock).

We need to add a private spinlock instead, so that softirq handlers
creating children do not have to worry about the backlog notion that
the listener 'lock' carries.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-02  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)

Conflicts:
    net/dsa/slave.c

net/dsa/slave.c simply had overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>

2015-10-02  Merge tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc  (Linus Torvalds)

Pull MMC fixes from Ulf Hansson:
 "Here are some mmc fixes intended for v4.3 rc4:

  MMC core:
   - Allow users of mmc_of_parse() to succeed when CONFIG_GPIOLIB is
     unset
   - Prevent infinite loop of re-tuning for CRC-errors for CMD19 and
     CMD21

  MMC host:
   - pxamci: Fix issues with card detect
   - sunxi: Fix clk-delay settings"

* tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc:
  mmc: core: fix dead loop of mmc_retune
  mmc: pxamci: fix card detect with slot-gpio API
  mmc: sunxi: Fix clk-delay settings
  mmc: core: Don't return an error for CD/WP GPIOs when GPIOLIB is unset

2015-10-02  Merge git://git.infradead.org/intel-iommu  (Linus Torvalds)

Pull IOVA fixes from David Woodhouse:
 "The main fix here is the first one, fixing the over-allocation of
  size-aligned requests. The other patches simply make the existing
  IOVA code available to users other than the Intel VT-d driver, with
  no functional change.

  I concede the latter really *should* have been submitted during the
  merge window, but since it's basically risk-free and people are
  waiting to build on top of it and it's my fault I didn't get it in,
  I (and they) would be grateful if you'd take it"

* git://git.infradead.org/intel-iommu:
  iommu: Make the iova library a module
  iommu: iova: Export symbols
  iommu: iova: Move iova cache management to the iova library
  iommu/iova: Avoid over-allocating when size-aligned

2015-10-01  Merge branch 'akpm' (patches from Andrew)  (Linus Torvalds)

Merge misc fixes from Andrew Morton:
 "12 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  dmapool: fix overflow condition in pool_find_page()
  thermal: avoid division by zero in power allocator
  memcg: remove pcp_counter_lock
  kprobes: use _do_fork() in samples to make them work again
  drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE
  memcg: make mem_cgroup_read_stat() unsigned
  memcg: fix dirty page migration
  dax: fix NULL pointer in __dax_pmd_fault()
  mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault
  mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)
  userfaultfd: remove kernel header include from uapi header
  arch/x86/include/asm/efi.h: fix build failure

2015-10-01  Merge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm  (Linus Torvalds)

Pull power management and ACPI fixes from Rafael Wysocki:
 "These are fixes mostly, for a few changes made in this cycle (the
  intel_idle driver, the OPP library, the ACPI EC driver, turbostat)
  and for some issues that have just been discovered (ACPI PCI IRQ
  management, PCI power management documentation, turbostat), with a
  couple of cleanups on top of them.

  Specifics:

   - intel_idle driver fixup for the recently added Skylake chips
     support (Len Brown).

   - Operating Performance Points (OPP) library fix related to the
     recently added support for new DT bindings and a fix for a typo
     in a comment (Viresh Kumar, Stephen Boyd).

   - ACPI EC driver fix for a recently introduced memory leak in an
     error code path (Lv Zheng).

   - ACPI PCI IRQ management fix for the issue where an ISA IRQ is
     shared with a PCI device which requires it to be configured in a
     different way and may cause an interrupt storm to happen as a
     result, with an extra ACPI SCI IRQ handling simplification on
     top of it (Jiang Liu).

   - Update of the PCI power management documentation that became
     outdated and started to actively confuse the readers to make it
     actually reflect the code (Rafael J Wysocki).

   - turbostat fixes including an IVB Xeon regression fix (related to
     the --debug command line option), a Skylake adjustment for the
     TSC running at a frequency that doesn't match the base one
     exactly, and a Knights Landing quirk to account for the fact
     that it only updates APERF and MPERF every 1024 clock cycles,
     plus bumping up the turbostat version number (Len Brown, Hubert
     Chrzaniuk)"

* tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tools/power turbostat: update version number
  tools/power turbostat: SKL: Adjust for TSC difference from base frequency
  tools/power turbostat: KNL workaround for %Busy and Avg_MHz
  tools/power turbostat: IVB Xeon: fix --debug regression
  ACPI / PCI: Remove duplicated penalty on SCI IRQ
  ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ
  ACPI / EC: Fix a memory leak issue in acpi_ec_query()
  PM / OPP: Fix typo modifcation -> modification
  PCI / PM: Update runtime PM documentation for PCI devices
  PM / OPP: of_property_count_u32_elems() can return errors
  intel_idle: Skylake Client Support - updated

2015-10-01  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (Linus Torvalds)

Pull networking fixes from David Miller:

 1) Fix regression in SKB partial checksum handling, from Pravin B
    Shelar.

 2) Fix VLAN inside of VXLAN handling in i40e driver, from Jesse
    Brandeburg.

 3) Cure softlockups during accept() in SCTP, from Karl Heiss.

 4) MSG_PEEK should return multiple SKBs worth of data in AF_UNIX,
    from Aaron Conole.

 5) IPV6 erroneously ignores the output interface specifier in the
    lookup key for route lookups, fix from David Ahern.

 6) In the Marvell DSA driver, forward unknown frames to the CPU
    port, from Andrew Lunn.

 7) Missing flow flag initializations in some code paths, from David
    Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: Initialize flow flags in input path
  net: dsa: fix preparation of a port STP update
  testptp: Silence compiler warnings on ppc64
  net/mlx4: Handle return codes in mlx4_qp_attach_common
  dsa: mv88e6xxx: Enable forwarding for unknown to the CPU port
  skbuff: Fix skb checksum partial check.
  net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set
  net sysfs: Print link speed as signed integer
  bna: fix error handling
  af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
  af_unix: Convert the unix_sk macro to an inline function for type safety
  net: sctp: Don't use 64 kilobyte lookup table for four elements
  l2tp: protect tunnel->del_work by ref_count
  net/ibm/emac: bump version numbers for correct work with ethtool
  sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
  sctp: Whitespace fix
  i40e/i40evf: check for stopped admin queue
  i40e: fix VLAN inside VXLAN
  r8169: fix handling rtl_readphy result
  net: hisilicon: fix handling platform_get_irq result

2015-10-01  dmapool: fix overflow condition in pool_find_page()  (Robin Murphy)

If a DMA pool lies at the very top of the dma_addr_t range (as may
happen with an IOMMU involved), the calculated end address of the
pool wraps around to zero, and page lookup always fails.

Tweak the relevant calculation to be overflow-proof.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Sakari Ailus <sakari.ailus@iki.fi>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

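The overflow-proof form subtracts before comparing instead of adding
to a base that may already sit at the top of dma_addr_t. A sketch of
the comparison in pool_find_page() (surrounding code abridged):

    list_for_each_entry(page, &pool->page_list, page_list) {
            if (dma < page->dma)
                    continue;
    -       if (dma < (page->dma + pool->allocation))   /* can wrap to 0 */
    +       if ((dma - page->dma) < pool->allocation)   /* cannot wrap */
                    return page;
    }
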
2015-10-01  thermal: avoid division by zero in power allocator  (Andrea Arcangeli)

During boot I get a divide-by-zero Oops; the regression started in
v4.3-rc3.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Javi Merino <javi.merino@arm.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2015-10-01  memcg: remove pcp_counter_lock  (Greg Thelen)

Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}()
iterate possible cpus instead of online") removed the last use of the
per-memcg pcp_counter_lock but forgot to remove the variable.

Kill the vestigial variable.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2015-10-01  kprobes: use _do_fork() in samples to make them work again  (Petr Mladek)

Commit 3033f14ab78c ("clone: support passing tls argument via C
rather than pt_regs magic") introduced _do_fork(), which allows
passing the @tls parameter. The old do_fork() is now defined only for
architectures that are not ready to use this way and do not define
HAVE_COPY_THREAD_TLS.

Let's use _do_fork() in the kprobe examples to make them work again
on all architectures.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thiago Macieira <thiago.macieira@intel.com>
Cc: Jiri Kosina <jkosina@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

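In practice the samples only need their probe point renamed, e.g. in
samples/kprobes/kprobe_example.c (a sketch of the change):

    static struct kprobe kp = {
    -       .symbol_name    = "do_fork",
    +       .symbol_name    = "_do_fork",
    };
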
2015-10-01  drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE  (Andrew Morton)

It uses bitrev8(), so it must ensure that lib/bitrev.o gets included
in vmlinux.

Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: yalin wang <yalin.wang2010@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2015-10-01  memcg: make mem_cgroup_read_stat() unsigned  (Greg Thelen)

mem_cgroup_read_stat() returns a page count by summing per cpu page
counters. The summing is racy wrt. updates, so a transient negative
sum is possible. Callers don't want negative values:

- mem_cgroup_wb_stats() doesn't want negative nr_dirty or
  nr_writeback. This could confuse dirty throttling.
- oom reports and memory.stat shouldn't show confusing negative
  usage.
- tree_usage() already avoids negatives.

Avoid returning negative page counts from mem_cgroup_read_stat() and
convert it to unsigned.

[akpm@linux-foundation.org: fix old typo while we're in there]
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

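The fix boils down to summing into a signed accumulator and clamping
before returning (a sketch based on the description above; the exact
per-cpu layout is an assumption):

    static unsigned long mem_cgroup_read_stat(struct mem_cgroup *memcg,
                                              enum mem_cgroup_stat_index idx)
    {
            long val = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    val += per_cpu(memcg->stat->count[idx], cpu);
            /*
             * Summing races with updates, so val may be negative.
             * Avoid exposing transient negative values to callers.
             */
            if (val < 0)
                    val = 0;
            return val;
    }
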
2015-10-01  memcg: fix dirty page migration  (Greg Thelen)

The problem starts with a file-backed dirty page which is charged to
a memcg. Then page migration is used to move oldpage to newpage.

Migration:
- copies the oldpage's data to newpage
- clears oldpage.PG_dirty
- sets newpage.PG_dirty
- uncharges oldpage from memcg
- charges newpage to memcg

Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count. However, because newpage is not yet charged, setting
newpage.PG_dirty does not increment the memcg's dirty page count.

After migration completes, newpage.PG_dirty is eventually cleared,
often in account_page_cleaned(). At this time newpage is charged to a
memcg, so the memcg's dirty page count is decremented, which causes
underflow because the count was not previously incremented by
migration. This underflow causes balance_dirty_pages() to see a very
large unsigned number of dirty memcg pages, which leads to aggressive
throttling of buffered writes by processes in non-root memcgs.

This issue:
- can harm performance of non-root memcg buffered writes.
- can report too small (even negative) values in
  memory.stat[(total_)dirty] counters of all memcgs, including the
  root.

To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks,
introduce page_memcg() and set_page_memcg() helpers.

Test:
0) setup and enter limited memcg
   mkdir /sys/fs/cgroup/test
   echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
   echo $$ > /sys/fs/cgroup/test/cgroup.procs

1) buffered writes baseline
   dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
   sync
   grep ^dirty /sys/fs/cgroup/test/memory.stat

2) buffered writes with compaction antagonist to induce migration
   yes 1 > /proc/sys/vm/compact_memory &
   rm -rf /data/tmp/foo
   dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
   kill %
   sync
   grep ^dirty /sys/fs/cgroup/test/memory.stat

3) buffered writes without antagonist, should match baseline
   rm -rf /data/tmp/foo
   dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
   sync
   grep ^dirty /sys/fs/cgroup/test/memory.stat

   (speed, dirty residue)
               unpatched                      patched
   1) 841 MB/s, 0 dirty pages         886 MB/s, 0 dirty pages
   2) 611 MB/s, -33427456 dirty pages 793 MB/s, 0 dirty pages
   3) 114 MB/s, -33427456 dirty pages 891 MB/s, 0 dirty pages

Notice that unpatched baseline performance (1) fell after migration
(3): 841 -> 114 MB/s. In the patched kernel, post-migration
performance matches baseline.

Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

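A sketch of the helpers and of how the migration copy path would use
them to move the charge before PG_dirty is transferred (the call
placement inside migrate_page_copy() is abridged):

    static inline struct mem_cgroup *page_memcg(struct page *page)
    {
            return page->mem_cgroup;
    }

    static inline void set_page_memcg(struct page *page,
                                      struct mem_cgroup *memcg)
    {
            page->mem_cgroup = memcg;
    }

    /* in migrate_page_copy(): transfer the charge first so dirty
     * accounting stays balanced when PG_dirty moves to newpage
     */
    set_page_memcg(newpage, page_memcg(page));
    set_page_memcg(page, NULL);
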
2015-10-01  dax: fix NULL pointer in __dax_pmd_fault()  (Ross Zwisler)

Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range()
for DAX") moved some code in __dax_pmd_fault() that was responsible
for zeroing newly allocated PMD pages. The new location didn't
properly set up 'kaddr', so when run, this code resulted in a NULL
pointer BUG.

Fix this by getting the correct 'kaddr' via bdev_direct_access().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2015-10-01  mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault  (Mel Gorman)

SunDong reported the following on
https://bugzilla.kernel.org/show_bug.cgi?id=103841

	I think I found a linux bug and have constructed test cases
	for it. I can stably reproduce the problem on the fedora22
	(4.0.4) kernel, arch x86_64. I construct a huge page mapping;
	when the parent and child process access the same huge page
	area via MAP_SHARED and MAP_PRIVATE mappings, a huge page
	copy-on-write can fail, after which the child's corresponding
	mmap area is unmapped. But because the child's mmap area has
	the VM_MAYSHARE attribute, the child process munmapping this
	area can trigger the VM_BUG_ON in set_vma_resv_flags()
	(vma->vm_flags & VM_MAYSHARE).

There were a number of problems with the report (e.g. it's hugetlbfs
that triggers this, not transparent huge pages) but it was
fundamentally correct, in that a VM_BUG_ON in set_vma_resv_flags()
can be triggered that looks like this:

	vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
	next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
	prot 8000000000000027 anon_vma (null) vm_ops ffffffff8182a7a0
	pgoff 0 file ffff88106bdb9800 private_data (null)
	flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
	------------
	kernel BUG at mm/hugetlb.c:462!
	SMP
	Modules linked in: xt_pkttype xt_LOG xt_limit [..]
	CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
	Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
	set_vma_resv_flags+0x2d/0x30

The VM_BUG_ON is correct because private and shared mappings have
different reservation accounting, but the warning clearly shows that
the VMA is shared.

When a private COW fails to allocate a new page, then only the
process that created the VMA gets the page -- all the children unmap
the page. If the children access that data in the future then they
get killed.

The problem is that the same file is mapped shared and private.
During the COW, the allocation fails, the VMAs are traversed to unmap
the other private pages, but a shared VMA is found and the bug is
triggered. This patch identifies such VMAs and skips them.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: SunDong <sund_sky@126.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

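A sketch of the skip in the unmap walk (loosely modelled on
unmap_ref_private() in mm/hugetlb.c; surrounding code abridged):

    vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
            if (iter_vma == vma)
                    continue;
            /*
             * Shared VMAs have their own reserves and do not affect
             * MAP_PRIVATE accounting, but may map the same page, so
             * they must not be unmapped to satisfy a private COW.
             */
            if (unlikely(iter_vma->vm_flags & VM_MAYSHARE))
                    continue;
            unmap_hugepage_range(iter_vma, address,
                                 address + huge_page_size(h), page);
    }
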
2015-10-01  mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)  (Joonsoo Kim)

Commit description is copied from the original post of this bug:
http://comments.gmane.org/gmane.linux.kernel.mm/135349

Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
larger cache size than the size that index INDEX_NODE maps to. In
kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.

However, sometimes we can't get the output we expect via
kmalloc_size(INDEX_NODE + 1), causing a BUG().

The mapping table in the latest kernel is like:

	index = { 0,  1,   2,  3,  4,  5,  6, n }
	size  = { 0, 96, 192,  8, 16, 32, 64, 2^n }

The mapping table before 3.10 is like this:

	index = {  0,  1,  2,   3,   4,   5,   6, n }
	size  = { 32, 64, 96, 128, 192, 256, 512, 2^(n+3) }

The problem on my mips64 machine is as follows:

(1) When configured with DEBUG_SLAB && DEBUG_PAGEALLOC &&
    DEBUG_LOCK_ALLOC && DEBUG_SPINLOCK, sizeof(struct
    kmem_cache_node) will be "150", and the macro INDEX_NODE turns
    out to be "2":

	#define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))

(2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.

(3) Then "if (size >= kmalloc_size(INDEX_NODE + 1)" will lead to
    "size = PAGE_SIZE".

(4) Then the "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied
    and "flags |= CFLGS_OFF_SLAB" will be covered.

(5) The "if (flags & CFLGS_OFF_SLAB)" test will be satisfied and will
    go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", and
    the result here may be NULL while the kernel is booting up.

(6) Finally, "BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes
    the BUG info as the following shows (may be only mips64 has this
    problem):

This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and
removes the BUG by adding a 'size >= 256' check to guarantee that all
necessary small sized slabs are initialized, regardless of the order
of slab sizes in the mapping table.

Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Liuhailong <liu.hailong6@zte.com.cn>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2015-10-01  userfaultfd: remove kernel header include from uapi header  (Andre Przywara)

As include/uapi/linux/userfaultfd.h is a user-visible header file, it
should not include kernel-exclusive header files.

Because of this, trying to build the userfaultfd test program from
the selftests directory fails, since the header contains a reference
to linux/compiler.h. As it turns out, that header is not really
needed there, so we can simply remove it to fix the issue.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>