summaryrefslogtreecommitdiff
path: root/mm/swapfile.c
AgeCommit message (Collapse)Author
2017-02-28Merge tag 'lsk-v4.4-16.12-android' into branch 'msm-4.4'Runmin Wang
* remotes/origin/tmp-2f0de51: Linux 4.4.38 esp6: Fix integrity verification when ESN are used esp4: Fix integrity verification when ESN are used ipv4: Set skb->protocol properly for local output ipv6: Set skb->protocol properly for local output Don't feed anything but regular iovec's to blk_rq_map_user_iov constify iov_iter_count() and iter_is_iovec() sparc64: fix compile warning section mismatch in find_node() sparc64: Fix find_node warning if numa node cannot be found sparc32: Fix inverted invalid_frame_pointer checks on sigreturns net: ping: check minimum size on ICMP header length net: avoid signed overflows for SO_{SND|RCV}BUFFORCE geneve: avoid use-after-free of skb->data sh_eth: remove unchecked interrupts for RZ/A1 net: bcmgenet: Utilize correct struct device for all DMA operations packet: fix race condition in packet_set_ring net/dccp: fix use-after-free in dccp_invalid_packet netlink: Do not schedule work from sk_destruct netlink: Call cb->done from a worker thread net/sched: pedit: make sure that offset is valid net, sched: respect rcu grace period on cls destruction net: dsa: bcm_sf2: Ensure we re-negotiate EEE during after link change l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind() rtnetlink: fix FDB size computation af_unix: conditionally use freezable blocking calls in read net: sky2: Fix shutdown crash ip6_tunnel: disable caching when the traffic class is inherited net: check dead netns for peernet2id_alloc() virtio-net: add a missing synchronize_net() Linux 4.4.37 arm64: suspend: Reconfigure PSTATE after resume from idle arm64: mm: Set PSTATE.PAN from the cpu_enable_pan() call arm64: cpufeature: Schedule enable() calls instead of calling them via IPI pwm: Fix device reference leak mwifiex: printk() overflow with 32-byte SSIDs PCI: Set Read Completion Boundary to 128 iff Root Port supports it (_HPX) PCI: Export pcie_find_root_port rcu: Fix soft lockup for rcu_nocb_kthread ALSA: pcm : Call kill_fasync() in stream lock x86/traps: Ignore high word of regs->cs in early_fixup_exception() kasan: update kasan_global for gcc 7 zram: fix unbalanced idr management at hot removal ARC: Don't use "+l" inline asm constraint Linux 4.4.36 scsi: mpt3sas: Unblock device after controller reset flow_dissect: call init_default_flow_dissectors() earlier mei: fix return value on disconnection mei: me: fix place for kaby point device ids. mei: me: disable driver on SPT SPS firmware drm/radeon: Ensure vblank interrupt is enabled on DPMS transition to on mpi: Fix NULL ptr dereference in mpi_powm() [ver #3] parisc: Also flush data TLB in flush_icache_page_asm parisc: Fix race in pci-dma.c parisc: Fix races in parisc_setup_cache_timing() NFSv4.x: hide array-bounds warning apparmor: fix change_hat not finding hat after policy replacement cfg80211: limit scan results cache size tile: avoid using clocksource_cyc2ns with absolute cycle count scsi: mpt3sas: Fix secure erase premature termination Fix USB CB/CBI storage devices with CONFIG_VMAP_STACK=y USB: serial: ftdi_sio: add support for TI CC3200 LaunchPad USB: serial: cp210x: add ID for the Zone DPMX usb: chipidea: move the lock initialization to core file KVM: x86: check for pic and ioapic presence before use KVM: x86: drop error recovery in em_jmp_far and em_ret_far iommu/vt-d: Fix IOMMU lookup for SR-IOV Virtual Functions iommu/vt-d: Fix PASID table allocation sched: tune: Fix lacking spinlock initialization UPSTREAM: trace: Update documentation for mono, mono_raw and boot clock UPSTREAM: trace: Add an option for boot clock as trace clock UPSTREAM: timekeeping: Add a fast and NMI safe boot clock ANDROID: goldfish_pipe: fix allmodconfig build ANDROID: goldfish: goldfish_pipe: fix locking errors ANDROID: video: goldfishfb: fix platform_no_drv_owner.cocci warnings ANDROID: goldfish_pipe: fix call_kern.cocci warnings arm64: rename ranchu defconfig to ranchu64 ANDROID: arch: x86: disable pic for Android toolchain ANDROID: goldfish_pipe: An implementation of more parallel pipe ANDROID: goldfish_pipe: bugfixes and performance improvements. ANDROID: goldfish: Add goldfish sync driver ANDROID: goldfish: add ranchu defconfigs ANDROID: goldfish_audio: Clear audio read buffer status after each read ANDROID: goldfish_events: no extra EV_SYN; register goldfish ANDROID: goldfish_fb: Set pixclock = 0 ANDROID: goldfish: Enable ACPI-based enumeration for goldfish audio ANDROID: goldfish: Enable ACPI-based enumeration for goldfish framebuffer ANDROID: video: goldfishfb: add devicetree bindings BACKPORT: staging: goldfish: audio: fix compiliation on arm BACKPORT: Input: goldfish_events - enable ACPI-based enumeration for goldfish events BACKPORT: goldfish: Enable ACPI-based enumeration for goldfish battery BACKPORT: drivers: tty: goldfish: Add device tree bindings BACKPORT: tty: goldfish: support platform_device with id -1 BACKPORT: Input: goldfish_events - add devicetree bindings BACKPORT: power: goldfish_battery: add devicetree bindings BACKPORT: staging: goldfish: audio: add devicetree bindings ANDROID: usb: gadget: function: cleanup: Add blank line after declaration cpufreq: sched: Fix kernel crash on accessing sysfs file usb: gadget: f_mtp: simplify ptp NULL pointer check cgroup: replace unified-hierarchy.txt with a proper cgroup v2 documentation cgroup: rename Documentation/cgroups/ to Documentation/cgroup-legacy/ cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type writeback: initialize inode members that track writeback history mm: page_alloc: generalize the dirty balance reserve block: fix module reference leak on put_disk() call for cgroups throttle Linux 4.4.35 netfilter: nft_dynset: fix element timeout for HZ != 1000 IB/cm: Mark stale CM id's whenever the mad agent was unregistered IB/uverbs: Fix leak of XRC target QPs IB/core: Avoid unsigned int overflow in sg_alloc_table IB/mlx5: Fix fatal error dispatching IB/mlx5: Use cache line size to select CQE stride IB/mlx4: Fix create CQ error flow IB/mlx4: Check gid_index return value PM / sleep: don't suspend parent when async child suspend_{noirq, late} fails PM / sleep: fix device reference leak in test_suspend uwb: fix device reference leaks mfd: core: Fix device reference leak in mfd_clone_cell iwlwifi: pcie: fix SPLC structure parsing rtc: omap: Fix selecting external osc clk: mmp: mmp2: fix return value check in mmp2_clk_init() clk: mmp: pxa168: fix return value check in pxa168_clk_init() clk: mmp: pxa910: fix return value check in pxa910_clk_init() drm/amdgpu: Attach exclusive fence to prime exported bo's. (v5) crypto: caam - do not register AES-XTS mode on LP units ext4: sanity check the block and cluster size at mount time kbuild: Steal gcc's pie from the very beginning x86/kexec: add -fno-PIE scripts/has-stack-protector: add -fno-PIE kbuild: add -fno-PIE i2c: mux: fix up dependencies can: bcm: fix warning in bcm_connect/proc_register mfd: intel-lpss: Do not put device in reset state on suspend fuse: fix fuse_write_end() if zero bytes were copied KVM: Disable irq while unregistering user notifier KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr x86/cpu/AMD: Fix cpu_llc_id for AMD Fam17h systems Linux 4.4.34 sparc64: Delete now unused user copy fixup functions. sparc64: Delete now unused user copy assembler helpers. sparc64: Convert U3copy_{from,to}_user to accurate exception reporting. sparc64: Convert NG2copy_{from,to}_user to accurate exception reporting. sparc64: Convert NGcopy_{from,to}_user to accurate exception reporting. sparc64: Convert NG4copy_{from,to}_user to accurate exception reporting. sparc64: Convert U1copy_{from,to}_user to accurate exception reporting. sparc64: Convert GENcopy_{from,to}_user to accurate exception reporting. sparc64: Convert copy_in_user to accurate exception reporting. sparc64: Prepare to move to more saner user copy exception handling. sparc64: Delete __ret_efault. sparc64: Handle extremely large kernel TLB range flushes more gracefully. sparc64: Fix illegal relative branches in hypervisor patched TLB cross-call code. sparc64: Fix instruction count in comment for __hypervisor_flush_tlb_pending. sparc64: Fix illegal relative branches in hypervisor patched TLB code. sparc64: Handle extremely large kernel TSB range flushes sanely. sparc: Handle negative offsets in arch_jump_label_transform sparc64 mm: Fix base TSB sizing when hugetlb pages are used sparc: serial: sunhv: fix a double lock bug sparc: Don't leak context bits into thread->fault_address tty: Prevent ldisc drivers from re-using stale tty fields tcp: take care of truncations done by sk_filter() ipv4: use new_gw for redirect neigh lookup net: __skb_flow_dissect() must cap its return value sock: fix sendmmsg for partial sendmsg fib_trie: Correct /proc/net/route off by one error sctp: assign assoc_id earlier in __sctp_connect ipv6: dccp: add missing bind_conflict to dccp_ipv6_mapped ipv6: dccp: fix out of bound access in dccp_v6_err() dccp: fix out of bound access in dccp_v4_err() dccp: do not send reset to already closed sockets tcp: fix potential memory corruption ip6_tunnel: Clear IP6CB in ip6tunnel_xmit() bgmac: stop clearing DMA receive control register right after it is set net: mangle zero checksum in skb_checksum_help() net: clear sk_err_soft in sk_clone_lock() dctcp: avoid bogus doubling of cwnd after loss ARM: 8485/1: cpuidle: remove cpu parameter from the cpuidle_ops suspend hook Linux 4.4.33 netfilter: fix namespace handling in nf_log_proc_dostring btrfs: qgroup: Prevent qgroup->reserved from going subzero mmc: mxs: Initialize the spinlock prior to using it ASoC: sun4i-codec: return error code instead of NULL when create_card fails ACPI / APEI: Fix incorrect return value of ghes_proc() i40e: fix call of ndo_dflt_bridge_getlink() hwrng: core - Don't use a stack buffer in add_early_randomness() lib/genalloc.c: start search from start of chunk mei: bus: fix received data size check in NFC fixup iommu/vt-d: Fix dead-locks in disable_dmar_iommu() path iommu/amd: Free domain id when free a domain of struct dma_ops_domain tty/serial: at91: fix hardware handshake on Atmel platforms dmaengine: at_xdmac: fix spurious flag status for mem2mem transfers drm/i915: Respect alternate_ddc_pin for all DDI ports KVM: MIPS: Precalculate MMIO load resume PC scsi: mpt3sas: Fix for block device of raid exists even after deleting raid disk scsi: qla2xxx: Fix scsi scan hang triggered if adapter fails during init iio: orientation: hid-sensor-rotation: Add PM function (fix non working driver) iio: hid-sensors: Increase the precision of scale to fix wrong reading interpretation. clk: qoriq: Don't allow CPU clocks higher than starting value toshiba-wmi: Fix loading the driver on non Toshiba laptops drbd: Fix kernel_sendmsg() usage - potential NULL deref usb: gadget: u_ether: remove interrupt throttling USB: cdc-acm: fix TIOCMIWAIT staging: nvec: remove managed resource from PS2 driver Revert "staging: nvec: ps2: change serio type to passthrough" drivers: staging: nvec: remove bogus reset command for PS/2 interface staging: iio: ad5933: avoid uninitialized variable in error case pinctrl: cherryview: Prevent possible interrupt storm on resume pinctrl: cherryview: Serialize register access in suspend/resume ARC: timer: rtc: implement read loop in "C" vs. inline asm s390/hypfs: Use get_free_page() instead of kmalloc to ensure page alignment coredump: fix unfreezable coredumping task swapfile: fix memory corruption via malformed swapfile dib0700: fix nec repeat handling ASoC: cs4270: fix DAPM stream name mismatch ALSA: info: Limit the proc text input size ALSA: info: Return error for invalid read/write arm64: Enable KPROBES/HIBERNATION/CORESIGHT in defconfig arm64: kvm: allows kvm cpu hotplug arm64: KVM: Register CPU notifiers when the kernel runs at HYP arm64: KVM: Skip HYP setup when already running in HYP arm64: hyp/kvm: Make hyp-stub reject kvm_call_hyp() arm64: hyp/kvm: Make hyp-stub extensible arm64: kvm: Move lr save/restore from do_el2_call into EL1 arm64: kvm: deal with kernel symbols outside of linear mapping arm64: introduce KIMAGE_VADDR as the virtual base of the kernel region ANDROID: video: adf: Avoid directly referencing user pointers ANDROID: usb: gadget: audio_source: fix comparison of distinct pointer types android: binder: support for file-descriptor arrays. android: binder: support for scatter-gather. android: binder: add extra size to allocator. android: binder: refactor binder_transact() android: binder: support multiple /dev instances. android: binder: deal with contexts in debugfs. android: binder: support multiple context managers. android: binder: split flat_binder_object. disable aio support in recommended configuration Linux 4.4.32 scsi: megaraid_sas: fix macro MEGASAS_IS_LOGICAL to avoid regression drm/radeon: fix DP mode validation drm/radeon/dp: add back special handling for NUTMEG drm/amdgpu: fix DP mode validation drm/amdgpu/dp: add back special handling for NUTMEG KVM: MIPS: Drop other CPU ASIDs on guest MMU changes Revert KVM: MIPS: Drop other CPU ASIDs on guest MMU changes of: silence warnings due to max() usage packet: on direct_xmit, limit tso and csum to supported devices sctp: validate chunk len before actually using it net sched filters: fix notification of filter delete with proper handle udp: fix IP_CHECKSUM handling net: sctp, forbid negative length ipv4: use the right lock for ping_group_range ipv4: disable BH in set_ping_group_range() net: add recursion limit to GRO rtnetlink: Add rtnexthop offload flag to compare mask bridge: multicast: restore perm router ports on multicast enable net: pktgen: remove rcu locking in pktgen_change_name() ipv6: correctly add local routes when lo goes up ip6_tunnel: fix ip6_tnl_lookup ipv6: tcp: restore IP6CB for pktoptions skbs netlink: do not enter direct reclaim from netlink_dump() packet: call fanout_release, while UNREGISTERING a netdev net: Add netdev all_adj_list refcnt propagation to fix panic net/sched: act_vlan: Push skb->data to mac_header prior calling skb_vlan_*() functions net: pktgen: fix pkt_size net: fec: set mac address unconditionally tg3: Avoid NULL pointer dereference in tg3_io_error_detected() ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route ip6_gre: fix flowi6_proto value in ip6gre_xmit_other() tcp: fix a compile error in DBGUNDO() tcp: fix wrong checksum calculation on MTU probing net: avoid sk_forward_alloc overflows tcp: fix overflow in __tcp_retransmit_skb() arm64/kvm: fix build issue on kvm debug arm64: ptdump: Indicate whether memory should be faulting arm64: Add support for ARCH_SUPPORTS_DEBUG_PAGEALLOC arm64: Drop alloc function from create_mapping arm64: allow vmalloc regions to be set with set_memory_* arm64: kernel: implement ACPI parking protocol arm64: mm: create new fine-grained mappings at boot arm64: ensure _stext and _etext are page-aligned arm64: mm: allow passing a pgdir to alloc_init_* arm64: mm: allocate pagetables anywhere arm64: mm: use fixmap when creating page tables arm64: mm: add functions to walk tables in fixmap arm64: mm: add __{pud,pgd}_populate arm64: mm: avoid redundant __pa(__va(x)) Linux 4.4.31 HID: usbhid: add ATEN CS962 to list of quirky devices ubi: fastmap: Fix add_vol() return value test in ubi_attach_fastmap() kvm: x86: Check memopp before dereference (CVE-2016-8630) tty: vt, fix bogus division in csi_J usb: dwc3: Fix size used in dma_free_coherent() pwm: Unexport children before chip removal UBI: fastmap: scrub PEB when bitflips are detected in a free PEB EC header Disable "frame-address" warning smc91x: avoid self-comparison warning cgroup: avoid false positive gcc-6 warning drm/exynos: fix error handling in exynos_drm_subdrv_open mm/cma: silence warnings due to max() usage ARM: 8584/1: floppy: avoid gcc-6 warning powerpc/ptrace: Fix out of bounds array access warning x86/xen: fix upper bound of pmd loop in xen_cleanhighmap() perf build: Fix traceevent plugins build race drm/dp/mst: Check peer device type before attempting EDID read drm/radeon: drop register readback in cayman_cp_int_cntl_setup drm/radeon/si_dpm: workaround for SI kickers drm/radeon/si_dpm: Limit clocks on HD86xx part Revert "drm/radeon: fix DP link training issue with second 4K monitor" mmc: dw_mmc-pltfm: fix the potential NULL pointer dereference scsi: arcmsr: Send SYNCHRONIZE_CACHE command to firmware scsi: scsi_debug: Fix memory leak if LBP enabled and module is unloaded scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices mac80211: discard multicast and 4-addr A-MSDUs firewire: net: fix fragmented datagram_size off-by-one firewire: net: guard against rx buffer overflows Input: i8042 - add XMG C504 to keyboard reset table dm mirror: fix read error on recovery after default leg failure virtio: console: Unlock vqs while freeing buffers virtio_ring: Make interrupt suppression spec compliant parisc: Ensure consistent state when switching to kernel stack at syscall entry ovl: fsync after copy-up KVM: MIPS: Make ERET handle ERL before EXL KVM: x86: fix wbinvd_dirty_mask use-after-free dm: free io_barrier after blk_cleanup_queue call USB: serial: cp210x: fix tiocmget error handling tty: limit terminal size to 4M chars xhci: add restart quirk for Intel Wildcatpoint PCH hv: do not lose pending heartbeat vmbus packets vt: clear selection before resizing Fix potential infoleak in older kernels GenWQE: Fix bad page access during abort of resource allocation usb: increase ohci watchdog delay to 275 msec xhci: use default USB_RESUME_TIMEOUT when resuming ports. USB: serial: ftdi_sio: add support for Infineon TriBoard TC2X7 USB: serial: fix potential NULL-dereference at probe usb: gadget: function: u_ether: don't starve tx request queue mei: txe: don't clean an unprocessed interrupt cause. ubifs: Fix regression in ubifs_readdir() ubifs: Abort readdir upon error btrfs: fix races on root_log_ctx lists ANDROID: binder: Clear binder and cookie when setting handle in flat binder struct ANDROID: binder: Add strong ref checks ALSA: hda - Fix headset mic detection problem for two Dell laptops ALSA: hda - Adding a new group of pin cfg into ALC295 pin quirk table ALSA: hda - allow 40 bit DMA mask for NVidia devices ALSA: hda - Raise AZX_DCAPS_RIRB_DELAY handling into top drivers ALSA: hda - Merge RIRB_PRE_DELAY into CTX_WORKAROUND caps ALSA: usb-audio: Add quirk for Syntek STK1160 KEYS: Fix short sprintf buffer in /proc/keys show function mm: memcontrol: do not recurse in direct reclaim mm/list_lru.c: avoid error-path NULL pointer deref libxfs: clean up _calc_dquots_per_chunk h8300: fix syscall restarting drm/dp/mst: Clear port->pdt when tearing down the i2c adapter i2c: core: fix NULL pointer dereference under race condition i2c: xgene: Avoid dma_buffer overrun arm64:cpufeature ARM64_NCAPS is the indicator of last feature arm64: hibernate: Refuse to hibernate if the boot cpu is offline PM / sleep: Add support for read-only sysfs attributes arm64: kernel: Add support for hibernate/suspend-to-disk arm64: mm: add functions to walk page tables by PA arm64: mm: move pte_* macros PM / Hibernate: Call flush_icache_range() on pages restored in-place arm64: Add new asm macro copy_page arm64: Promote KERNEL_START/KERNEL_END definitions to a header file arm64: kernel: Include _AC definition in page.h arm64: Change cpu_resume() to enable mmu early then access sleep_sp by va arm64: kernel: Rework finisher callback out of __cpu_suspend_enter() arm64: Cleanup SCTLR flags arm64: Fold proc-macros.S into assembler.h arm/arm64: KVM: Add hook for C-based stage2 init arm/arm64: KVM: Detect vGIC presence at runtime arm64: KVM: Add support for 16-bit VMID arm: KVM: Make kvm_arm.h friendly to assembly code arm/arm64: KVM: Remove unreferenced S2_PGD_ORDER arm64: KVM: debug: Remove spurious inline attributes ARM: KVM: Cleanup exception injection arm64: KVM: Remove weak attributes arm64: KVM: Cleanup asm-offset.c arm64: KVM: Turn system register numbers to an enum arm64: KVM: VHE: Patch out use of HVC arm64: Add ARM64_HAS_VIRT_HOST_EXTN feature arm/arm64: Add new is_kernel_in_hyp_mode predicate arm64: KVM: Move away from the assembly version of the world switch arm64: KVM: Map the kernel RO section into HYP arm64: KVM: Add compatibility aliases arm64: KVM: Implement vgic-v3 save/restore arm64: KVM: Add panic handling arm64: KVM: HYP mode entry points arm64: KVM: Implement TLB handling arm64: KVM: Implement fpsimd save/restore arm64: KVM: Implement the core world switch arm64: KVM: Add patchable function selector arm64: KVM: Implement guest entry arm64: KVM: Implement debug save/restore arm64: KVM: Implement 32bit system register save/restore arm64: KVM: Implement system register save/restore arm64: KVM: Implement timer save/restore arm64: KVM: Implement vgic-v2 save/restore arm64: KVM: Add a HYP-specific header file KVM: arm/arm64: vgic-v3: Make the LR indexing macro public arm64: Add macros to read/write system registers Linux 4.4.30 Revert "fix minor infoleak in get_user_ex()" Revert "x86/mm: Expand the exception table logic to allow new handling options" Linux 4.4.29 ARM: pxa: pxa_cplds: fix interrupt handling powerpc/nvram: Fix an incorrect partition merge mpt3sas: Don't spam logs if logging level is 0 perf symbols: Fixup symbol sizes before picking best ones perf symbols: Check symbol_conf.allow_aliases for kallsyms loading too perf hists browser: Fix event group display clk: divider: Fix clk_divider_round_rate() to use clk_readl() clk: qoriq: fix a register offset error s390/con3270: fix insufficient space padding s390/con3270: fix use of uninitialised data s390/cio: fix accidental interrupt enabling during resume x86/mm: Expand the exception table logic to allow new handling options dmaengine: ipu: remove bogus NO_IRQ reference power: bq24257: Fix use of uninitialized pointer bq->charger staging: r8188eu: Fix scheduling while atomic splat ASoC: dapm: Fix kcontrol creation for output driver widget ASoC: dapm: Fix value setting for _ENUM_DOUBLE MUX's second channel ASoC: dapm: Fix possible uninitialized variable in snd_soc_dapm_get_volsw() ASoC: topology: Fix error return code in soc_tplg_dapm_widget_create() hwrng: omap - Only fail if pm_runtime_get_sync returns < 0 crypto: arm/ghash-ce - add missing async import/export crypto: gcm - Fix IV buffer size in crypto_gcm_setkey mwifiex: correct aid value during tdls setup spi: spi-fsl-dspi: Drop extra spi_master_put in device remove function ARM: clk-imx35: fix name for ckil clk uio: fix dmem_region_start computation genirq/generic_chip: Add irq_unmap callback perf stat: Fix interval output values powerpc/eeh: Null check uses of eeh_pe_bus_get tunnels: Remove encapsulation offloads on decap. tunnels: Don't apply GRO to multiple layers of encapsulation. ipip: Properly mark ipip GRO packets as encapsulated. posix_acl: Clear SGID bit when setting file permissions brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap() mm/hugetlb: fix memory offline with hugepage size > memory block size drm/i915: Unalias obj->phys_handle and obj->userptr drm/i915: Account for TSEG size when determining 865G stolen base Revert "drm/i915: Check live status before reading edid" drm/i915/gen9: fix the WaWmMemoryReadLatency implementation xenbus: don't look up transaction IDs for ordinary writes drm/vmwgfx: Limit the user-space command buffer size drm/radeon: change vblank_time's calculation method to reduce computational error. drm/radeon/si/dpm: fix phase shedding setup drm/radeon: narrow asic_init for virtualization drm/amdgpu: change vblank_time's calculation method to reduce computational error. drm/amdgpu/dce11: add missing drm_mode_config_cleanup call drm/amdgpu/dce11: disable hpd on local panels drm/amdgpu/dce8: disable hpd on local panels drm/amdgpu/dce10: disable hpd on local panels drm/amdgpu: fix IB alignment for UVD drm/prime: Pass the right module owner through to dma_buf_export() Linux 4.4.28 target: Don't override EXTENDED_COPY xcopy_pt_cmd SCSI status code target: Make EXTENDED_COPY 0xe4 failure return COPY TARGET DEVICE NOT REACHABLE target: Re-add missing SCF_ACK_KREF assignment in v4.1.y ubifs: Fix xattr_names length in exit paths jbd2: fix incorrect unlock on j_list_lock ext4: do not advertise encryption support when disabled mmc: rtsx_usb_sdmmc: Handle runtime PM while changing the led mmc: rtsx_usb_sdmmc: Avoid keeping the device runtime resumed when unused mmc: core: Annotate cmd_hdr as __le32 powerpc/mm: Prevent unlikely crash in copro_calculate_slb() ceph: fix error handling in ceph_read_iter arm64: kernel: Init MDCR_EL2 even in the absence of a PMU arm64: percpu: rewrite ll/sc loops in assembly memstick: rtsx_usb_ms: Manage runtime PM when accessing the device memstick: rtsx_usb_ms: Runtime resume the device when polling for cards isofs: Do not return EACCES for unknown filesystems irqchip/gic-v3-its: Fix entry size mask for GITS_BASER s390/mm: fix gmap tlb flush issues Using BUG_ON() as an assert() is _never_ acceptable mm: filemap: fix mapping->nrpages double accounting in fuse mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page() acpi, nfit: check for the correct event code in notifications net/mlx4_core: Allow resetting VF admin mac to zero bnx2x: Prevent false warning for lack of FC NPIV PKCS#7: Don't require SpcSpOpusInfo in Authenticode pkcs7 signatures hpsa: correct skipping masked peripherals sd: Fix rw_max for devices that report an optimal xfer size irqchip/gicv3: Handle loop timeout proper kvm: x86: memset whole irq_eoi x86/e820: Don't merge consecutive E820_PRAM ranges blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL Fix regression which breaks DFS mounting Cleanup missing frees on some ioctls Do not send SMB3 SET_INFO request if nothing is changing SMB3: GUIDs should be constructed as random but valid uuids Set previous session id correctly on SMB3 reconnect Display number of credits available Clarify locking of cifs file and tcon structures and make more granular fs/cifs: keep guid when assigning fid to fileinfo cifs: Limit the overall credit acquired fs/super.c: fix race between freeze_super() and thaw_super() arc: don't leak bits of kernel stack into coredump lightnvm: ensure that nvm_dev_ops can be used without CONFIG_NVM ipc/sem.c: fix complex_count vs. simple op race mm: filemap: don't plant shadow entries without radix tree node metag: Only define atomic_dec_if_positive conditionally scsi: Fix use-after-free NFSv4.2: Fix a reference leak in nfs42_proc_layoutstats_generic NFSv4: Open state recovery must account for file permission changes NFSv4: nfs4_copy_delegation_stateid() must fail if the delegation is invalid NFSv4: Don't report revoked delegations as valid in nfs_have_delegation() sunrpc: fix write space race causing stalls Input: elantech - add Fujitsu Lifebook E556 to force crc_enabled Input: elantech - force needed quirks on Fujitsu H760 Input: i8042 - skip selftest on ASUS laptops lib: add "on"/"off" support to kstrtobool lib: update single-char callers of strtobool() lib: move strtobool() to kstrtobool() MIPS: ptrace: Fix regs_return_value for kernel context MIPS: Fix -mabi=64 build of vdso.lds ALSA: hda - Fix a failure of micmute led when having multi adcs cx231xx: fix GPIOs for Pixelview SBTVD hybrid cx231xx: don't return error on success mb86a20s: fix demod settings mb86a20s: fix the locking logic ovl: copy_up_xattr(): use strnlen ovl: Fix info leak in ovl_lookup_temp() fbdev/efifb: Fix 16 color palette entry calculation scsi: zfcp: spin_lock_irqsave() is not nestable zfcp: trace full payload of all SAN records (req,resp,iels) zfcp: fix payload trace length for SAN request&response zfcp: fix D_ID field with actual value on tracing SAN responses zfcp: restore tracing of handle for port and LUN with HBA records zfcp: trace on request for open and close of WKA port zfcp: restore: Dont use 0 to indicate invalid LUN in rec trace zfcp: retain trace level for SCSI and HBA FSF response records zfcp: close window with unblocked rport during rport gone zfcp: fix ELS/GS request&response length for hardware data router zfcp: fix fc_host port_type with NPIV ubi: Deal with interrupted erasures in WL powerpc/pseries: Fix stack corruption in htpe code powerpc/64: Fix incorrect return value from __copy_tofrom_user powerpc/powernv: Use CPU-endian PEST in pnv_pci_dump_p7ioc_diag_data() powerpc/powernv: Use CPU-endian hub diag-data type in pnv_eeh_get_and_dump_hub_diag() powerpc/powernv: Pass CPU-endian PE number to opal_pci_eeh_freeze_clear() powerpc/vdso64: Use double word compare on pointers dm crypt: fix crash on exit dm mpath: check if path's request_queue is dying in activate_path() dm: return correct error code in dm_resume()'s retry loop dm: mark request_queue dead before destroying the DM device perf intel-pt: Fix MTC timestamp calculation for large MTC periods perf intel-pt: Fix estimated timestamps for cycle-accurate mode perf intel-pt: Fix snapshot overlap detection decoder errors pstore/ram: Use memcpy_fromio() to save old buffer pstore/ram: Use memcpy_toio instead of memcpy pstore/core: drop cmpxchg based updates pstore/ramoops: fixup driver removal parisc: Increase initial kernel mapping size parisc: Fix kernel memory layout regarding position of __gp parisc: Increase KERNEL_INITIAL_SIZE for 32-bit SMP kernels cpufreq: intel_pstate: Fix unsafe HWP MSR access platform: don't return 0 from platform_get_irq[_byname]() on error PCI: Mark Atheros AR9580 to avoid bus reset mmc: sdhci: cast unsigned int to unsigned long long to avoid unexpeted error mmc: block: don't use CMD23 with very old MMC cards rtlwifi: Fix missing country code for Great Britain PM / devfreq: event: remove duplicate devfreq_event_get_drvdata() clk: imx6: initialize GPU clocks regulator: tps65910: Work around silicon erratum SWCZ010 mei: me: add kaby point device ids gpio: mpc8xxx: Correct irq handler function cgroup: Change from CAP_SYS_NICE to CAP_SYS_RESOURCE for cgroup migration permissions UPSTREAM: cpu/hotplug: Handle unbalanced hotplug enable/disable UPSTREAM: arm64: kaslr: fix breakage with CONFIG_MODVERSIONS=y UPSTREAM: arm64: kaslr: keep modules close to the kernel when DYNAMIC_FTRACE=y cgroup: Remove leftover instances of allow_attach BACKPORT: lib: harden strncpy_from_user CHROMIUM: cgroups: relax permissions on moving tasks between cgroups CHROMIUM: remove Android's cgroup generic permissions checks Linux 4.4.27 cfq: fix starvation of asynchronous writes vfs: move permission checking into notify_change() for utimes(NULL) dlm: free workqueues after the connections crypto: vmx - Fix memory corruption caused by p8_ghash crypto: ghash-generic - move common definitions to a new header file ext4: release bh in make_indexed_dir ext4: allow DAX writeback for hole punch ext4: fix memory leak in ext4_insert_range() ext4: reinforce check of i_dtime when clearing high fields of uid and gid ext4: enforce online defrag restriction for encrypted files scsi: ibmvfc: Fix I/O hang when port is not mapped scsi: arcmsr: Simplify user_len checking scsi: arcmsr: Buffer overflow in arcmsr_iop_message_xfer() async_pq_val: fix DMA memory leak reiserfs: switch to generic_{get,set,remove}xattr() reiserfs: Unlock superblock before calling reiserfs_quota_on_mount() ASoC: Intel: Atom: add a missing star in a memcpy call brcmfmac: fix memory leak in brcmf_fill_bss_param i40e: avoid NULL pointer dereference and recursive errors on early PCI error fuse: fix killing s[ug]id in setattr fuse: invalidate dir dentry after chmod fuse: listxattr: verify xattr list drivers: base: dma-mapping: page align the size when unmap_kernel_range btrfs: assign error values to the correct bio structs serial: 8250_dw: Check the data->pclk when get apb_pclk arm64: Use PoU cache instr for I/D coherency arm64: mm: add code to safely replace TTBR1_EL1 arm64: mm: place __cpu_setup in .text arm64: add function to install the idmap arm64: unmap idmap earlier arm64: unify idmap removal arm64: mm: place empty_zero_page in bss arm64: head.S: use memset to clear BSS arm64: mm: specialise pagetable allocators arm64: mm: remove pointless PAGE_MASKing asm-generic: Fix local variable shadow in __set_fixmap_offset arm64: mm: fold alternatives into .init ARM: 8511/1: ARM64: kernel: PSCI: move PSCI idle management code to drivers/firmware ARM: 8481/2: drivers: psci: replace psci firmware calls ARM: 8480/2: arm64: add implementation for arm-smccc ARM: 8479/2: add implementation for arm-smccc ARM: 8478/2: arm/arm64: add arm-smccc ARM: 8510/1: rework ARM_CPU_SUSPEND dependencies ARM: 8458/1: bL_switcher: add GIC dependency Linux 4.4.26 mm: remove gup_flags FOLL_WRITE games from __get_user_pages() x86/build: Build compressed x86 kernels as PIE arm64: Remove stack duplicating code from jprobes arm64: kprobes: Add KASAN instrumentation around stack accesses arm64: kprobes: Cleanup jprobe_return arm64: kprobes: Fix overflow when saving stack arm64: kprobes: WARN if attempting to step with PSTATE.D=1 kprobes: Add arm64 case in kprobe example module arm64: Add kernel return probes support (kretprobes) arm64: Add trampoline code for kretprobes arm64: kprobes instruction simulation support arm64: Treat all entry code as non-kprobe-able arm64: Blacklist non-kprobe-able symbol arm64: Kprobes with single stepping support arm64: add conditional instruction simulation support arm64: Add more test functions to insn.c arm64: Add HAVE_REGS_AND_STACK_ACCESS_API feature Linux 4.4.25 tpm_crb: fix crb_req_canceled behavior tpm: fix a race condition in tpm2_unseal_trusted() ima: use file_dentry() ARM: cpuidle: Fix error return code ARM: dts: MSM8064 remove flags from SPMI/MPP IRQs ARM: dts: mvebu: armada-390: add missing compatibility string and bracket x86/dumpstack: Fix x86_32 kernel_stack_pointer() previous stack access x86/irq: Prevent force migration of irqs which are not in the vector domain x86/boot: Fix kdump, cleanup aborted E820_PRAM max_pfn manipulation KVM: PPC: BookE: Fix a sanity check KVM: MIPS: Drop other CPU ASIDs on guest MMU changes KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register mfd: wm8350-i2c: Make sure the i2c regmap functions are compiled mfd: 88pm80x: Double shifting bug in suspend/resume mfd: atmel-hlcdc: Do not sleep in atomic context mfd: rtsx_usb: Avoid setting ucr->current_sg.status ALSA: usb-line6: use the same declaration as definition in header for MIDI manufacturer ID ALSA: usb-audio: Extend DragonFly dB scale quirk to cover other variants ALSA: ali5451: Fix out-of-bound position reporting timekeeping: Fix __ktime_get_fast_ns() regression time: Add cycles to nanoseconds translation mm: Fix build for hardened usercopy ANDROID: binder: Clear binder and cookie when setting handle in flat binder struct ANDROID: binder: Add strong ref checks UPSTREAM: staging/android/ion : fix a race condition in the ion driver ANDROID: android-base: CONFIG_HARDENED_USERCOPY=y UPSTREAM: fs/proc/kcore.c: Add bounce buffer for ktext data UPSTREAM: fs/proc/kcore.c: Make bounce buffer global for read BACKPORT: arm64: Correctly bounds check virt_addr_valid Fix a build breakage in IO latency hist code. UPSTREAM: efi: include asm/early_ioremap.h not asm/efi.h to get early_memremap UPSTREAM: ia64: split off early_ioremap() declarations into asm/early_ioremap.h FROMLIST: arm64: Enable CONFIG_ARM64_SW_TTBR0_PAN FROMLIST: arm64: xen: Enable user access before a privcmd hvc call FROMLIST: arm64: Handle faults caused by inadvertent user access with PAN enabled FROMLIST: arm64: Disable TTBR0_EL1 during normal kernel execution FROMLIST: arm64: Introduce uaccess_{disable,enable} functionality based on TTBR0_EL1 FROMLIST: arm64: Factor out TTBR0_EL1 post-update workaround into a specific asm macro FROMLIST: arm64: Factor out PAN enabling/disabling into separate uaccess_* macros UPSTREAM: arm64: Handle el1 synchronous instruction aborts cleanly UPSTREAM: arm64: include alternative handling in dcache_by_line_op UPSTREAM: arm64: fix "dc cvau" cache operation on errata-affected core UPSTREAM: Revert "arm64: alternatives: add enable parameter to conditional asm macros" UPSTREAM: arm64: Add new asm macro copy_page UPSTREAM: arm64: kill ESR_LNX_EXEC UPSTREAM: arm64: add macro to extract ESR_ELx.EC UPSTREAM: arm64: mm: mark fault_info table const UPSTREAM: arm64: fix dump_instr when PAN and UAO are in use BACKPORT: arm64: Fold proc-macros.S into assembler.h UPSTREAM: arm64: choose memstart_addr based on minimum sparsemem section alignment UPSTREAM: arm64/mm: ensure memstart_addr remains sufficiently aligned UPSTREAM: arm64/kernel: fix incorrect EL0 check in inv_entry macro UPSTREAM: arm64: Add macros to read/write system registers UPSTREAM: arm64/efi: refactor EFI init and runtime code for reuse by 32-bit ARM UPSTREAM: arm64/efi: split off EFI init and runtime code for reuse by 32-bit ARM UPSTREAM: arm64/efi: mark UEFI reserved regions as MEMBLOCK_NOMAP BACKPORT: arm64: only consider memblocks with NOMAP cleared for linear mapping UPSTREAM: mm/memblock: add MEMBLOCK_NOMAP attribute to memblock memory table ANDROID: dm: android-verity: Remove fec_header location constraint BACKPORT: audit: consistently record PIDs with task_tgid_nr() android-base.cfg: Enable kernel ASLR UPSTREAM: vmlinux.lds.h: allow arch specific handling of ro_after_init data section UPSTREAM: arm64: spinlock: fix spin_unlock_wait for LSE atomics UPSTREAM: arm64: avoid TLB conflict with CONFIG_RANDOMIZE_BASE UPSTREAM: arm64: Only select ARM64_MODULE_PLTS if MODULES=y sched: Add Kconfig option DEFAULT_USE_ENERGY_AWARE to set ENERGY_AWARE feature flag sched/fair: remove printk while schedule is in progress ANDROID: fs: FS tracepoints to track IO. sched/walt: Drop arch-specific timer access ANDROID: fiq_debugger: Pass task parameter to unwind_frame() eas/sched/fair: Fixing comments in find_best_target. input: keyreset: switch to orderly_reboot UPSTREAM: tun: fix transmit timestamp support UPSTREAM: arch/arm/include/asm/pgtable-3level.h: add pmd_mkclean for THP net: inet: diag: expose the socket mark to privileged processes. net: diag: make udp_diag_destroy work for mapped addresses. net: diag: support SOCK_DESTROY for UDP sockets net: diag: allow socket bytecode filters to match socket marks net: diag: slightly refactor the inet_diag_bc_audit error checks. net: diag: Add support to filter on device index UPSTREAM: brcmfmac: avoid potential stack overflow in brcmf_cfg80211_start_ap() Linux 4.4.24 ALSA: hda - Add the top speaker pin config for HP Spectre x360 ALSA: hda - Fix headset mic detection problem for several Dell laptops ACPICA: acpi_get_sleep_type_data: Reduce warnings ALSA: hda - Adding one more ALC255 pin definition for headset problem Revert "usbtmc: convert to devm_kzalloc" USB: serial: cp210x: Add ID for a Juniper console Staging: fbtft: Fix bug in fbtft-core usb: misc: legousbtower: Fix NULL pointer deference USB: serial: cp210x: fix hardware flow-control disable dm log writes: fix bug with too large bios clk: xgene: Add missing parenthesis when clearing divider value aio: mark AIO pseudo-fs noexec batman-adv: remove unused callback from batadv_algo_ops struct IB/mlx4: Use correct subnet-prefix in QP1 mads under SR-IOV IB/mlx4: Fix code indentation in QP1 MAD flow IB/mlx4: Fix incorrect MC join state bit-masking on SR-IOV IB/ipoib: Don't allow MC joins during light MC flush IB/core: Fix use after free in send_leave function IB/ipoib: Fix memory corruption in ipoib cm mode connect flow KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write dmaengine: at_xdmac: fix to pass correct device identity to free_irq() kernel/fork: fix CLONE_CHILD_CLEARTID regression in nscd ASoC: omap-mcpdm: Fix irq resource handling sysctl: handle error writing UINT_MAX to u32 fields powerpc/prom: Fix sub-processor option passed to ibm, client-architecture-support brcmsmac: Initialize power in brcms_c_stf_ss_algo_channel_get() brcmsmac: Free packet if dma_mapping_error() fails in dma_rxfill brcmfmac: Fix glob_skb leak in brcmf_sdiod_recv_chain ASoC: Intel: Skylake: Fix error return code in skl_probe() pNFS/flexfiles: Fix layoutcommit after a commit to DS pNFS/files: Fix layoutcommit after a commit to DS NFS: Don't drop CB requests with invalid principals svc: Avoid garbage replies when pc_func() returns rpc_drop_reply dmaengine: at_xdmac: fix debug string fnic: pci_dma_mapping_error() doesn't return an error code avr32: off by one in at32_init_pio() ath9k: Fix programming of minCCA power threshold gspca: avoid unused variable warnings em28xx-i2c: rt_mutex_trylock() returns zero on failure NFC: fdp: Detect errors from fdp_nci_create_conn() iwlmvm: mvm: set correct state in smart-fifo configuration tile: Define AT_VECTOR_SIZE_ARCH for ARCH_DLINFO pstore: drop file opened reference count blk-mq: actually hook up defer list when running requests hwrng: omap - Fix assumption that runtime_get_sync will always succeed ARM: sa1111: fix pcmcia suspend/resume ARM: shmobile: fix regulator quirk for Gen2 ARM: sa1100: clear reset status prior to reboot ARM: sa1100: fix 3.6864MHz clock ARM: sa1100: register clocks early ARM: sun5i: Fix typo in trip point temperature regulator: qcom_smd: Fix voltage ranges for pm8x41 regulator: qcom_spmi: Update mvs1/mvs2 switches on pm8941 regulator: qcom_spmi: Add support for get_mode/set_mode on switches regulator: qcom_spmi: Add support for S4 supply on pm8941 tpm: fix byte-order for the value read by tpm2_get_tpm_pt printk: fix parsing of "brl=" option MIPS: uprobes: fix use of uninitialised variable MIPS: Malta: Fix IOCU disable switch read for MIPS64 MIPS: fix uretprobe implementation MIPS: uprobes: remove incorrect set_orig_insn arm64: debug: avoid resetting stepping state machine when TIF_SINGLESTEP ARM: 8618/1: decompressor: reset ttbcr fields to use TTBR0 on ARMv7 irqchip/gicv3: Silence noisy DEBUG_PER_CPU_MAPS warning gpio: sa1100: fix irq probing for ucb1x00 usb: gadget: fsl_qe_udc: signedness bug in qe_get_frame() ceph: fix race during filling readdir cache iwlwifi: mvm: don't use ret when not initialised iwlwifi: pcie: fix access to scratch buffer spi: sh-msiof: Avoid invalid clock generator parameters hwmon: (adt7411) set bit 3 in CFG1 register nvmem: Declare nvmem_cell_read() consistently ipvs: fix bind to link-local mcast IPv6 address in backup tools/vm/slabinfo: fix an unintentional printf mmc: pxamci: fix potential oops drivers/perf: arm_pmu: Fix leak in error path pinctrl: Flag strict is a field in struct pinmux_ops pinctrl: uniphier: fix .pin_dbg_show() callback i40e: avoid null pointer dereference perf/core: Fix pmu::filter_match for SW-led groups iwlwifi: mvm: fix a few firmware capability checks usb: musb: fix DMA for host mode usb: musb: Fix DMA desired mode for Mentor DMA engine ARM: 8617/1: dma: fix dma_max_pfn() ARM: 8616/1: dt: Respect property size when parsing CPUs drm/radeon/si/dpm: add workaround for for Jet parts drm/nouveau/fifo/nv04: avoid ramht race against cookie insertion x86/boot: Initialize FPU and X86_FEATURE_ALWAYS even if we don't have CPUID x86/init: Fix cr4_init_shadow() on CR4-less machines can: dev: fix deadlock reported after bus-off mm,ksm: fix endless looping in allocating memory when ksm enable mtd: nand: davinci: Reinitialize the HW ECC engine in 4bit hwctl cpuset: handle race between CPU hotplug and cpuset_hotplug_work usercopy: fold builtin_const check into inline function Linux 4.4.23 hostfs: Freeing an ERR_PTR in hostfs_fill_sb_common() qxl: check for kmap failures power: supply: max17042_battery: fix model download bug. power_supply: tps65217-charger: fix missing platform_set_drvdata() PM / hibernate: Fix rtree_next_node() to avoid walking off list ends PM / hibernate: Restore processor state before using per-CPU variables MIPS: paravirt: Fix undefined reference to smp_bootstrap MIPS: Add a missing ".set pop" in an early commit MIPS: Avoid a BUG warning during prctl(PR_SET_FP_MODE, ...) MIPS: Remove compact branch policy Kconfig entries MIPS: vDSO: Fix Malta EVA mapping to vDSO page structs MIPS: SMP: Fix possibility of deadlock when bringing CPUs online MIPS: Fix pre-r6 emulation FPU initialisation i2c: qup: skip qup_i2c_suspend if the device is already runtime suspended i2c-eg20t: fix race between i2c init and interrupt enable btrfs: ensure that file descriptor used with subvol ioctls is a dir nl80211: validate number of probe response CSA counters can: flexcan: fix resume function mm: delete unnecessary and unsafe init_tlb_ubc() tracing: Move mutex to protect against resetting of seq data fix memory leaks in tracing_buffers_splice_read() power: reset: hisi-reboot: Unmap region obtained by of_iomap mtd: pmcmsp-flash: Allocating too much in init_msp_flash() mtd: maps: sa1100-flash: potential NULL dereference fix fault_in_multipages_...() on architectures with no-op access_ok() fanotify: fix list corruption in fanotify_get_response() fsnotify: add a way to stop queueing events on group shutdown xfs: prevent dropping ioend completions during buftarg wait autofs: use dentry flags to block walks during expire autofs races pwm: Mark all devices as "might sleep" bridge: re-introduce 'fix parsing of MLDv2 reports' net: smc91x: fix SMC accesses Revert "phy: IRQ cannot be shared" net: dsa: bcm_sf2: Fix race condition while unmasking interrupts net/mlx5: Added missing check of msg length in verifying its signature tipc: fix NULL pointer dereference in shutdown() net/irda: handle iriap_register_lsap() allocation failure vti: flush x-netns xfrm cache when vti interface is removed af_unix: split 'u->readlock' into two: 'iolock' and 'bindlock' Revert "af_unix: Fix splice-bind deadlock" bonding: Fix bonding crash megaraid: fix null pointer check in megasas_detach_one(). nouveau: fix nv40_perfctr_next() cleanup regression Staging: iio: adc: fix indent on break statement iwlegacy: avoid warning about missing braces ath9k: fix misleading indentation am437x-vfpe: fix typo in vpfe_get_app_input_index Add braces to avoid "ambiguous ‘else’" compiler warnings net: caif: fix misleading indentation Makefile: Mute warning for __builtin_return_address(>0) for tracing only Disable "frame-address" warning Disable "maybe-uninitialized" warning globally gcov: disable -Wmaybe-uninitialized warning Kbuild: disable 'maybe-uninitialized' warning for CONFIG_PROFILE_ALL_BRANCHES kbuild: forbid kernel directory to contain spaces and colons tools: Support relative directory path for 'O=' Makefile: revert "Makefile: Document ability to make file.lst and file.S" partially kbuild: Do not run modules_install and install in paralel ocfs2: fix start offset to ocfs2_zero_range_for_truncate() ocfs2/dlm: fix race between convert and migration crypto: echainiv - Replace chaining with multiplication crypto: skcipher - Fix blkcipher walk OOM crash crypto: arm/aes-ctr - fix NULL dereference in tail processing crypto: arm64/aes-ctr - fix NULL dereference in tail processing tcp: properly scale window in tcp_v[46]_reqsk_send_ack() tcp: fix use after free in tcp_xmit_retransmit_queue() tcp: cwnd does not increase in TCP YeAH ipv6: release dst in ping_v6_sendmsg ipv4: panic in leaf_walk_rcu due to stale node pointer reiserfs: fix "new_insert_key may be used uninitialized ..." Fix build warning in kernel/cpuset.c include/linux/kernel.h: change abs() macro so it uses consistent return type Linux 4.4.22 openrisc: fix the fix of copy_from_user() avr32: fix 'undefined reference to `___copy_from_user' ia64: copy_from_user() should zero the destination on access_ok() failure genirq/msi: Fix broken debug output ppc32: fix copy_from_user() sparc32: fix copy_from_user() mn10300: copy_from_user() should zero on access_ok() failure... nios2: copy_from_user() should zero the tail of destination openrisc: fix copy_from_user() parisc: fix copy_from_user() metag: copy_from_user() should zero the destination on access_ok() failure alpha: fix copy_from_user() asm-generic: make copy_from_user() zero the destination properly mips: copy_from_user() must zero the destination on access_ok() failure hexagon: fix strncpy_from_user() error return sh: fix copy_from_user() score: fix copy_from_user() and friends blackfin: fix copy_from_user() cris: buggered copy_from_user/copy_to_user/clear_user frv: fix clear_user() asm-generic: make get_user() clear the destination on errors ARC: uaccess: get_user to zero out dest in cause of fault s390: get_user() should zero on failure score: fix __get_user/get_user nios2: fix __get_user() sh64: failing __get_user() should zero m32r: fix __get_user() mn10300: failing __get_user() and get_user() should zero fix minor infoleak in get_user_ex() microblaze: fix copy_from_user() avr32: fix copy_from_user() microblaze: fix __get_user() fix iov_iter_fault_in_readable() irqchip/atmel-aic: Fix potential deadlock in ->xlate() genirq: Provide irq_gc_{lock_irqsave,unlock_irqrestore}() helpers drm: Only use compat ioctl for addfb2 on X86/IA64 drm: atmel-hlcdc: Fix vertical scaling net: simplify napi_synchronize() to avoid warnings kconfig: tinyconfig: provide whole choice blocks to avoid warnings soc: qcom/spm: shut up uninitialized variable warning pinctrl: at91-pio4: use %pr format string for resource mmc: dw_mmc: use resource_size_t to store physical address drm/i915: Avoid pointer arithmetic in calculating plane surface offset mpssd: fix buffer overflow warning gma500: remove annoying deprecation warning ipv6: addrconf: fix dev refcont leak when DAD failed sched/core: Fix a race between try_to_wake_up() and a woken up task Revert "wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel" ath9k: fix using sta->drv_priv before initializing it md-cluster: make md-cluster also can work when compiled into kernel xhci: fix null pointer dereference in stop command timeout function fuse: direct-io: don't dirty ITER_BVEC pages Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns crypto: cryptd - initialize child shash_desc on import arm64: spinlocks: implement smp_mb__before_spinlock() as smp_mb() pinctrl: sunxi: fix uart1 CTS/RTS pins at PG on A23/A33 pinctrl: pistachio: fix mfio pll_lock pinmux dm crypt: fix error with too large bios dm log writes: move IO accounting earlier to fix error path dm log writes: fix check of kthread_run() return value bus: arm-ccn: Fix XP watchpoint settings bitmask bus: arm-ccn: Do not attempt to configure XPs for cycle counter bus: arm-ccn: Fix PMU handling of MN ARM: dts: STiH407-family: Provide interconnect clock for consumption in ST SDHCI ARM: dts: overo: fix gpmc nand on boards with ethernet ARM: dts: overo: fix gpmc nand cs0 range ARM: dts: imx6qdl: Fix SPDIF regression ARM: OMAP3: hwmod data: Add sysc information for DSI ARM: kirkwood: ib62x0: fix size of u-boot environment partition ARM: imx6: add missing BM_CLPCR_BYPASS_PMIC_READY setting for imx6sx ARM: imx6: add missing BM_CLPCR_BYP_MMDC_CH0_LPM_HS setting for imx6ul ARM: AM43XX: hwmod: Fix RSTST register offset for pruss cpuset: make sure new tasks conform to the current config of the cpuset net: thunderx: Fix OOPs with ethtool --register-dump USB: change bInterval default to 10 ms ARM: dts: STiH410: Handle interconnect clock required by EHCI/OHCI (USB) usb: chipidea: udc: fix NULL ptr dereference in isr_setup_status_phase usb: renesas_usbhs: fix clearing the {BRDY,BEMP}STS condition USB: serial: simple: add support for another Infineon flashloader serial: 8250: added acces i/o products quad and octal serial cards serial: 8250_mid: fix divide error bug if baud rate is 0 iio: ensure ret is initialized to zero before entering do loop iio:core: fix IIO_VAL_FRACTIONAL sign handling iio: accel: kxsd9: Fix scaling bug iio: fix pressure data output unit in hid-sensor-attributes iio: accel: bmc150: reset chip at init time iio: adc: at91: unbreak channel adc channel 3 iio: ad799x: Fix buffered capture for ad7991/ad7995/ad7999 iio: adc: ti_am335x_adc: Increase timeout value waiting for ADC sample iio: adc: ti_am335x_adc: Protect FIFO1 from concurrent access iio: adc: rockchip_saradc: reset saradc controller before programming it iio: proximity: as3935: set up buffer timestamps for non-zero values iio: accel: kxsd9: Fix raw read return kvm-arm: Unmap shadow pagetables properly x86/AMD: Apply erratum 665 on machines without a BIOS fix x86/paravirt: Do not trace _paravirt_ident_*() functions ARC: mm: fix build breakage with STRICT_MM_TYPECHECKS IB/uverbs: Fix race between uverbs_close and remove_one dm flakey: fix reads to be issued if drop_writes configured audit: fix exe_file access in audit_exe_compare mm: introduce get_task_exe_file kexec: fix double-free when failing to relocate the purgatory NFSv4.1: Fix the CREATE_SESSION slot number accounting pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised nfsd: Close race between nfsd4_release_lockowner and nfsd4_lock NFSv4.x: Fix a refcount leak in nfs_callback_up_net pNFS: The client must not do I/O to the DS if it's lease has expired kernfs: don't depend on d_find_any_alias() when generating notifications powerpc/mm: Don't alias user region to other regions below PAGE_OFFSET powerpc/powernv : Drop reference added by kset_find_obj() powerpc/tm: do not use r13 for tabort_syscall tipc: move linearization of buffers to generic code lightnvm: put bio before return fscrypto: require write access to mount to set encryption policy Revert "KVM: x86: fix missed hardware breakpoints" MIPS: KVM: Check for pfn noslot case clocksource/drivers/sun4i: Clear interrupts after stopping timer in probe function fscrypto: add authorization check for setting encryption policy ext4: use __GFP_NOFAIL in ext4_free_blocks() Conflicts: arch/arm/kernel/devtree.c arch/arm64/Kconfig arch/arm64/kernel/arm64ksyms.c arch/arm64/kernel/psci.c arch/arm64/mm/fault.c drivers/android/binder.c drivers/usb/host/xhci-hub.c fs/ext4/readpage.c include/linux/mmc/core.h include/linux/mmzone.h mm/memcontrol.c net/core/filter.c net/netlink/af_netlink.c net/netlink/af_netlink.h Change-Id: I99fe7a0914e83e284b11b33185b71448a8999d1f Signed-off-by: Runmin Wang <runminw@codeaurora.org> Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
2016-11-18swapfile: fix memory corruption via malformed swapfileJann Horn
commit dd111be69114cc867f8e826284559bfbc1c40e37 upstream. When root activates a swap partition whose header has the wrong endianness, nr_badpages elements of badpages are swabbed before nr_badpages has been checked, leading to a buffer overrun of up to 8GB. This normally is not a security issue because it can only be exploited by root (more specifically, a process with CAP_SYS_ADMIN or the ability to modify a swap file/partition), and such a process can already e.g. modify swapped-out memory of any other userspace process on the system. Link: http://lkml.kernel.org/r/1477949533-2509-1-git-send-email-jann@thejh.net Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-23mm: swap: swap ratio supportVinayak Menon
Add support to receive a static ratio from userspace to divide the swap pages between ZRAM and disk based swap devices. The existing infrastructure allows to keep same priority for multiple swap devices, which results in round robin distribution of pages. With this patch, the ratio can be defined. CRs-fixed: 968416 Change-Id: I54f54489db84cabb206569dd62d61a8a7a898991 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-23mm: swap: fix swapcache usage for fast swap devicesVinayak Menon
1) The swap readahead algorithm need not be applied for fast swap devices like zram. 2) Code to set SWP_FAST is placed incorrectly, resulting in the flag not being set. Fix these to reduce the swapcache usage. Change-Id: I23d9af5819f4b25f90f14a12657fa19ed401fb2a Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-22mm: swap: don't delay swap free for fast swap devicesVinayak Menon
There are couple of issues with swapcache usage when ZRAM is used as swap device. 1) Kernel does a swap readahead which can be around 6 to 8 pages depending on total ram, which is not required for zram since accesses are fast. 2) Kernel delays the freeing up of swapcache expecting a later hit, which again is useless in the case of zram. 3) This is not related to swapcache, but zram usage itself. As mentioned in (2) kernel delays freeing of swapcache, but along with that it delays zram compressed page free also. i.e. there can be 2 copies, though one is compressed. This patch addresses these issues using two new flags QUEUE_FLAG_FAST and SWP_FAST, to indicate that accesses to the device will be fast and cheap, and instructs the swap layer to free up swap space agressively, and not to do read ahead. Change-Id: I5d2d5176a5f9420300bb2f843f6ecbdb25ea80e4 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2015-09-08mm: /proc/pid/smaps:: show proportional swap share of the mappingMinchan Kim
We want to know per-process workingset size for smart memory management on userland and we use swap(ex, zram) heavily to maximize memory efficiency so workingset includes swap as well as RSS. On such system, if there are lots of shared anonymous pages, it's really hard to figure out exactly how many each process consumes memory(ie, rss + wap) if the system has lots of shared anonymous memory(e.g, android). This patch introduces SwapPss field on /proc/<pid>/smaps so we can get more exact workingset size per process. Bongkyu tested it. Result is below. 1. 50M used swap SwapTotal: 461976 kB SwapFree: 411192 kB $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}'; 48236 $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}'; 141184 2. 240M used swap SwapTotal: 461976 kB SwapFree: 216808 kB $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}'; 230315 $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}'; 1387744 [akpm@linux-foundation.org: simplify kunmap_atomic() call] Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Bongkyu Kim <bongkyu.kim@lge.com> Tested-by: Bongkyu Kim <bongkyu.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-21mm: fix potential data race in SyS_swaponHugh Dickins
While running KernelThreadSanitizer (ktsan) on upstream kernel with trinity, we got a few reports from SyS_swapon, here is one of them: Read of size 8 by thread T307 (K7621): [< inlined >] SyS_swapon+0x3c0/0x1850 SYSC_swapon mm/swapfile.c:2395 [<ffffffff812242c0>] SyS_swapon+0x3c0/0x1850 mm/swapfile.c:2345 [<ffffffff81e97c8a>] ia32_do_call+0x1b/0x25 Looks like the swap_lock should be taken when iterating through the swap_info array on lines 2392 - 2401: q->swap_file may be reset to NULL by another thread before it is dereferenced for f_mapping. But why is that iteration needed at all? Doesn't the claim_swapfile() which follows do all that is needed to check for a duplicate entry - FMODE_EXCL on a bdev, testing IS_SWAPFILE under i_mutex on a regfile? Well, not quite: bd_may_claim() allows the same "holder" to claim the bdev again, so we do need to use a different holder than "sys_swapon"; and we should not replace appropriate -EBUSY by inappropriate -EINVAL. Index i was reused in a cpu loop further down: renamed cpu there. Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-23vfs: add seq_file_path() helperMiklos Szeredi
Turn seq_path(..., &file->f_path, ...); into seq_file_path(..., file, ...); Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15mm: remove rest of ACCESS_ONCE() usagesJason Low
We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/ tree since it doesn't work reliably on non-scalar types. This patch removes the rest of the usages of ACCESS_ONCE, and use the new READ_ONCE API for the read accesses. This makes things cleaner, instead of using separate/multiple sets of APIs. Signed-off-by: Jason Low <jason.low2@hp.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10mm: page_cgroup: rename file to mm/swap_cgroup.cJohannes Weiner
Now that the external page_cgroup data structure and its lookup is gone, the only code remaining in there is swap slot accounting. Rename it and move the conditional compilation into mm/Makefile. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Tejun Heo <tj@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: memcontrol: rewrite uncharge APIJohannes Weiner
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable] [vdavydov@parallels.com: fix flags definition] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Tested-by: Jet Chen <jet.chen@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Tested-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08mm: memcontrol: rewrite charge APIJohannes Weiner
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [hughd@google.com: fix shmem_unuse] [hughd@google.com: Add comments on the private use of -EAGAIN] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of ↵Chen Yucong
scan_swap_map() Via commit ebc2a1a69111 ("swap: make cluster allocation per-cpu"), we can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already gone to si->cluster_info scan_swap_map_try_ssd_cluster() route. So that the "last_in_cluster < scan_base" loop in the body of scan_swap_map() has already become a dead code snippet, and it should have been deleted. This patch is to delete the redundant loop as Hugh and Shaohua suggested. [hughd@google.com: fix comment, simplify code] Signed-off-by: Chen Yucong <slaoub@gmail.com> Cc: Shaohua Li <shli@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04swap: change swap_list_head to plist, add swap_avail_headDan Streetman
Originally get_swap_page() started iterating through the singly-linked list of swap_info_structs using swap_list.next or highest_priority_index, which both were intended to point to the highest priority active swap target that was not full. The first patch in this series changed the singly-linked list to a doubly-linked list, and removed the logic to start at the highest priority non-full entry; it starts scanning at the highest priority entry each time, even if the entry is full. Replace the manually ordered swap_list_head with a plist, swap_active_head. Add a new plist, swap_avail_head. The original swap_active_head plist contains all active swap_info_structs, as before, while the new swap_avail_head plist contains only swap_info_structs that are active and available, i.e. not full. Add a new spinlock, swap_avail_lock, to protect the swap_avail_head list. Mel Gorman suggested using plists since they internally handle ordering the list entries based on priority, which is exactly what swap was doing manually. All the ordering code is now removed, and swap_info_struct entries and simply added to their corresponding plist and automatically ordered correctly. Using a new plist for available swap_info_structs simplifies and optimizes get_swap_page(), which no longer has to iterate over full swap_info_structs. Using a new spinlock for swap_avail_head plist allows each swap_info_struct to add or remove themselves from the plist when they become full or not-full; previously they could not do so because the swap_info_struct->lock is held when they change from full<->not-full, and the swap_lock protecting the main swap_active_head must be ordered before any swap_info_struct->lock. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Cc: Weijie Yang <weijieut@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-04swap: change swap_info singly-linked list to list_headDan Streetman
The logic controlling the singly-linked list of swap_info_struct entries for all active, i.e. swapon'ed, swap targets is rather complex, because: - it stores the entries in priority order - there is a pointer to the highest priority entry - there is a pointer to the highest priority not-full entry - there is a highest_priority_index variable set outside the swap_lock - swap entries of equal priority should be used equally this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181 where different priority swap targets are incorrectly used equally. That bug probably could be solved with the existing singly-linked lists, but I think it would only add more complexity to the already difficult to understand get_swap_page() swap_list iteration logic. The first patch changes from a singly-linked list to a doubly-linked list using list_heads; the highest_priority_index and related code are removed and get_swap_page() starts each iteration at the highest priority swap_info entry, even if it's full. While this does introduce unnecessary list iteration (i.e. Schlemiel the painter's algorithm) in the case where one or more of the highest priority entries are full, the iteration and manipulation code is much simpler and behaves correctly re: the above bug; and the fourth patch removes the unnecessary iteration. The second patch adds some minor plist helper functions; nothing new really, just functions to match existing regular list functions. These are used by the next two patches. The third patch adds plist_requeue(), which is used by get_swap_page() in the next patch - it performs the requeueing of same-priority entries (which moves the entry to the end of its priority in the plist), so that all equal-priority swap_info_structs get used equally. The fourth patch converts the main list into a plist, and adds a new plist that contains only swap_info entries that are both active and not full. As Mel suggested using plists allows removing all the ordering code from swap - plists handle ordering automatically. The list naming is also clarified now that there are two lists, with the original list changed from swap_list_head to swap_active_head and the new list named swap_avail_head. A new spinlock is also added for the new list, so swap_info entries can be added or removed from the new list immediately as they become full or not full. This patch (of 4): Replace the singly-linked list tracking active, i.e. swapon'ed, swap_info_struct entries with a doubly-linked list using struct list_heads. Simplify the logic iterating and manipulating the list of entries, especially get_swap_page(), by using standard list_head functions, and removing the highest priority iteration logic. The change fixes the bug: https://lkml.org/lkml/2014/2/13/181 in which different priority swap entries after the highest priority entry are incorrectly used equally in pairs. The swap behavior is now as advertised, i.e. different priority swap entries are used in order, and equal priority swap targets are used concurrently. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com> Cc: Weijie Yang <weijieut@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-06mm/swap: fix race on swap_info reuse between swapoff and swaponWeijie Yang
swapoff clear swap_info's SWP_USED flag prematurely and free its resources after that. A concurrent swapon will reuse this swap_info while its previous resources are not cleared completely. These late freed resources are: - p->percpu_cluster - swap_cgroup_ctrl[type] - block_device setting - inode->i_flags &= ~S_SWAPFILE This patch clears the SWP_USED flag after all its resources are freed, so that swapon can reuse this swap_info by alloc_swap_info() safely. [akpm@linux-foundation.org: tidy up code comment] Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm/swapfile.c: do not skip lowest_bit in scan_swap_map() scan loopJamie Liu
In the second half of scan_swap_map()'s scan loop, offset is set to si->lowest_bit and then incremented before entering the loop for the first time, causing si->swap_map[si->lowest_bit] to be skipped. Signed-off-by: Jamie Liu <jamieliu@google.com> Cc: Shaohua Li <shli@fusionio.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGESasha Levin
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13frontswap: enable call to invalidate area on swapoffKrzysztof Kozlowski
During swapoff the frontswap_map was NULL-ified before calling frontswap_invalidate_area(). However the frontswap_invalidate_area() exits early if frontswap_map is NULL. Invalidate was never called during swapoff. This patch moves frontswap_map_set() in swapoff just after calling frontswap_invalidate_area() so outside of locks (swap_lock and swap_info_struct->lock). This shouldn't be a problem as during swapon the frontswap_map_set() is called also outside of any locks. Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm/swapfile.c: fix comment typosSeth Jennings
Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16swap: fix set_blocksize race during swapon/swapoffKrzysztof Kozlowski
Fix race between swapoff and swapon. Swapoff used old_block_size from swap_info outside of swapon_mutex so it could be overwritten by concurrent swapon. The race has visible effect only if more than one swap block device exists with different block sizes (e.g. /dev/sda1 with block size 4096 and /dev/sdb1 with 512). In such case it leads to setting the blocksize of swapped off device with wrong blocksize. The bug can be triggered with multiple concurrent swapoff and swapon: 0. Swap for some device is on. 1. swapoff: First the swapoff is called on this device and "struct swap_info_struct *p" is assigned. This is done under swap_lock however this lock is released for the call try_to_unuse(). 2. swapon: After the assignment above (and before acquiring swapon_mutex & swap_lock by swapoff) the swapon is called on the same device. The p->old_block_size is assigned to the value of block_size the device. This block size should be the same as previous but sometimes it is not. The swapon ends successfully. 3. swapoff: Swapoff resumes, grabs the locks and mutex and continues to disable this swap device. Now it sets the block size to value taken from swap_info which was overwritten by swapon in 2. Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Reported-by: Weijie Yang <weijie.yang.kh@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: make cluster allocation per-cpuShaohua Li
swap cluster allocation is to get better request merge to improve performance. But the cluster is shared globally, if multiple tasks are doing swap, this will cause interleave disk access. While multiple tasks swap is quite common, for example, each numa node has a kswapd thread doing swap and multiple threads/processes doing direct page reclaim. ioscheduler can't help too much here, because tasks don't send swapout IO down to block layer in the meantime. Block layer does merge some IOs, but a lot not, depending on how many tasks are doing swapout concurrently. In practice, I've seen a lot of small size IO in swapout workloads. We makes the cluster allocation per-cpu here. The interleave disk access issue goes away. All tasks swapout to their own cluster, so swapout will become sequential, which can be easily merged to big size IO. If one CPU can't get its per-cpu cluster (for example, there is no free cluster anymore in the swap), it will fallback to scan swap_map. The CPU can still continue swap. We don't need recycle free swap entries of other CPUs. In my test (swap to a 2-disk raid0 partition), this improves around 10% swapout throughput, and request size is increased significantly. How does this impact swap readahead is uncertain though. On one side, page reclaim always isolates and swaps several adjancent pages, this will make page reclaim write the pages sequentially and benefit readahead. On the other side, several CPU write pages interleave means the pages don't live _sequentially_ but relatively _near_. In the per-cpu allocation case, if adjancent pages are written by different cpus, they will live relatively _far_. So how this impacts swap readahead depends on how many pages page reclaim isolates and swaps one time. If the number is big, this patch will benefit swap readahead. Of course, this is about sequential access pattern. The patch has no impact for random access pattern, because the new cluster allocation algorithm is just for SSD. Alternative solution is organizing swap layout to be per-mm instead of this per-cpu approach. In the per-mm layout, we allocate a disk range for each mm, so pages of one mm live in swap disk adjacently. per-mm layout has potential issues of lock contention if multiple reclaimers are swap pages from one mm. For a sequential workload, per-mm layout is better to implement swap readahead, because pages from the mm are adjacent in disk. But per-cpu layout isn't very bad in this workload, as page reclaim always isolates and swaps several pages one time, such pages will still live in disk sequentially and readahead can utilize this. For a random workload, per-mm layout isn't beneficial of request merge, because it's quite possible pages from different mm are swapout in the meantime and IO can't be merged in per-mm layout. while with per-cpu layout we can merge requests from any mm. Considering random workload is more popular in workloads with swap (and per-cpu approach isn't too bad for sequential workload too), I'm choosing per-cpu layout. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: fix races exposed by swap discardShaohua Li
The previous patch can expose races, according to Hugh: swapoff was sometimes failing with "Cannot allocate memory", coming from try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing on a free entry temporarily SWAP_MAP_BAD while being discarded. We should use ACCESS_ONCE() there, and whenever accessing swap_map locklessly; but rather than peppering it throughout try_to_unuse(), just declare *swap_map with volatile. try_to_unuse() is accustomed to *swap_map going down racily, but not necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to prevent that transition once SWP_WRITEOK is switched off, when it's a waste of time to issue discards anyway (swapon can do a whole discard). Another issue is: In swapin_readahead(), read_swap_cache_async() can read a bad swap entry, because we don't check if readahead swap entry is bad. This doesn't break anything but such swapin page is wasteful and can only be freed at page reclaim. We should avoid read such swap entry. And in discard, we mark swap entry SWAP_MAP_BAD and then switch it to normal when discard is finished. If readahead reads such swap entry, we have the same issue, so we much check if swap entry is bad too. Thanks Hugh to inspire swapin_readahead could use bad swap entry. [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard'] Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: make swap discard asyncShaohua Li
swap can do cluster discard for SSD, which is good, but there are some problems here: 1. swap do the discard just before page reclaim gets a swap entry and writes the disk sectors. This is useless for high end SSD, because an overwrite to a sector implies a discard to original sector too. A discard + overwrite == overwrite. 2. the purpose of doing discard is to improve SSD firmware garbage collection. Idealy we should send discard as early as possible, so firmware can do something smart. Sending discard just after swap entry is freed is considered early compared to sending discard before write. Of course, if workload is already bound to gc speed, sending discard earlier or later doesn't make 3. block discard is a sync API, which will delay scan_swap_map() significantly. 4. Write and discard command can be executed parallel in PCIe SSD. Making swap discard async can make execution more efficiently. This patch makes swap discard async and moves discard to where swap entry is freed. Discard and write have no dependence now, so above issues can be avoided. Idealy we should do discard for any freed sectors, but some SSD discard is very slow. This patch still does discard for a whole cluster. My test does a several round of 'mmap, write, unmap', which will trigger a lot of swap discard. In a fusionio card, with this patch, the test runtime is reduced to 18% of the time without it, so around 5.5x faster. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: change block allocation algorithm for SSDShaohua Li
I'm using a fast SSD to do swap. scan_swap_map() sometimes uses up to 20~30% CPU time (when cluster is hard to find, the CPU time can be up to 80%), which becomes a bottleneck. scan_swap_map() scans a byte array to search a 256 page cluster, which is very slow. Here I introduced a simple algorithm to search cluster. Since we only care about 256 pages cluster, we can just use a counter to track if a cluster is free. Every 256 pages use one int to store the counter. If the counter of a cluster is 0, the cluster is free. All free clusters will be added to a list, so searching cluster is very efficient. With this, scap_swap_map() overhead disappears. This might help low end SD card swap too. Because if the cluster is aligned, SD firmware can do flash erase more efficiently. We only enable the algorithm for SSD. Hard disk swap isn't fast enough and has downside with the algorithm which might introduce regression (see below). The patch slightly changes which cluster is choosen. It always adds free cluster to list tail. This can help wear leveling for low end SSD too. And if no cluster found, the scan_swap_map() will do search from the end of last cluster. So if no cluster found, the scan_swap_map() will do search from the end of last free cluster, which is random. For SSD, this isn't a problem at all. Another downside is the cluster must be aligned to 256 pages, which will reduce the chance to find a cluster. I would expect this isn't a big problem for SSD because of the non-seek penality. (And this is the reason I only enable the algorithm for SSD). Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kyungmin Park <kmpark@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm/swapfile.c: convert to pr_foo()Andrew Morton
A few 80-col gymnastics were cleaned up as a result. Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11swap: warn when a swap area overflows the maximum sizeRaymond Jennings
It is possible to swapon a swap area that is too big for the pte width to handle. Presently this failure happens silently. Instead, emit a diagnostic to warn the user. Testing results, root prompt commands and kernel log messages: # lvresize /dev/system/swap --size 16G # mkswap /dev/system/swap # swapon /dev/system/swap Jul 7 04:27:22 warfang kernel: Adding 16777212k swap on /dev/mapper/system-swap. Priority:-1 extents:1 across:16777212k # lvresize /dev/system/swap --size 64G # mkswap /dev/system/swap # swapon /dev/system/swap Jul 7 04:27:22 warfang kernel: Truncating oversized swap area, only using 33554432k out of 67108860k Jul 7 04:27:22 warfang kernel: Adding 33554428k swap on /dev/mapper/system-swap. Priority:-1 extents:1 across:33554428k [akpm@linux-foundation.org: fix warning] Signed-off-by: Raymond Jennings <shentino@gmail.com> Acked-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13mm: save soft-dirty bits on swapped pagesCyrill Gorcunov
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-03swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGESRafael Aquini
Considering the use cases where the swap device supports discard: a) and can do it quickly; b) but it's slow to do in small granularities (or concurrent with other I/O); c) but the implementation is so horrendous that you don't even want to send one down; And assuming that the sysadmin considers it useful to send the discards down at all, we would (probably) want the following solutions: i. do the fine-grained discards for freed swap pages, if device is capable of doing so optimally; ii. do single-time (batched) swap area discards, either at swapon or via something like fstrim (not implemented yet); iii. allow doing both single-time and fine-grained discards; or iv. turn it off completely (default behavior) As implemented today, one can only enable/disable discards for swap, but one cannot select, for instance, solution (ii) on a swap device like (b) even though the single-time discard is regarded to be interesting, or necessary to the workload because it would imply (1), and the device is not capable of performing it optimally. This patch addresses the scenario depicted above by introducing a way to ensure the (probably) wanted solutions (i, ii, iii and iv) can be flexibly flagged through swapon(8) to allow a sysadmin to select the best suitable swap discard policy accordingly to system constraints. This patch introduces SWAP_FLAG_DISCARD_PAGES and SWAP_FLAG_DISCARD_ONCE new flags to allow more flexibe swap discard policies being flagged through swapon(8). The default behavior is to keep both single-time, or batched, area discards (SWAP_FLAG_DISCARD_ONCE) and fine-grained discards for page-clusters (SWAP_FLAG_DISCARD_PAGES) enabled, in order to keep consistentcy with older kernel behavior, as well as maintain compatibility with older swapon(8). However, through the new introduced flags the best suitable discard policy can be selected accordingly to any given swap device constraint. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Karel Zak <kzak@redhat.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-12frontswap: fix incorrect zeroing and allocation size for frontswap_mapAkinobu Mita
The bitmap accessed by bitops must have enough size to hold the required numbers of bits rounded up to a multiple of BITS_PER_LONG. And the bitmap must not be zeroed by memset() if the number of bits cleared is not a multiple of BITS_PER_LONG. This fixes incorrect zeroing and allocation size for frontswap_map. The incorrect zeroing part doesn't cause any problem because frontswap_map is freed just after zeroing. But the wrongly calculated allocation size may cause the problem. For 32bit systems, the allocation size of frontswap_map is about twice as large as required size. For 64bit systems, the allocation size is smaller than requeired if the number of bits is not a multiple of BITS_PER_LONG. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-30frontswap: get rid of swap_lock dependencyMinchan Kim
Frontswap initialization routine depends on swap_lock, which want to be atomic about frontswap's first appearance. IOW, frontswap is not present and will fail all calls OR frontswap is fully functional but if new swap_info_struct isn't registered by enable_swap_info, swap subsystem doesn't start I/O so there is no race between init procedure and page I/O working on frontswap. So let's remove unnecessary swap_lock dependency. Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Minchan Kim <minchan@kernel.org> [v1: Rebased on my branch, reworked to work with backends loading late] [v2: Added a check for !map] [v3: Made the invalidate path follow the init path] [v4: Address comments by Wanpeng Li <liwanp@linux.vnet.ibm.com>] Signed-off-by: Konrad Rzeszutek Wilk <konrad@darnok.org> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-04-29mm/: rename random32() to prandom_u32()Akinobu Mita
Use preferable function name which implies using a pseudo-random number generator. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle *and* ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...
2013-02-23mm,ksm: swapoff might need to copyHugh Dickins
Before establishing that KSM page migration was the cause of my WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in many respects is equivalent to faulting in a page. In fact I've never caught that as the cause: but in theory it does at least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to avoid bringing a KSM page back in when it's not supposed to be. I intended to copy how it's done in do_swap_page(), but have a strong aversion to how "swapcache" ends up being used there: rework it with "page != swapcache". Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23swap: add per-partition lock for swapfileShaohua Li
swap_lock is heavily contended when I test swap to 3 fast SSD (even slightly slower than swap to 2 such SSD). The main contention comes from swap_info_get(). This patch tries to fix the gap with adding a new per-partition lock. Global data like nr_swapfiles, total_swap_pages, least_priority and swap_list are still protected by swap_lock. nr_swap_pages is an atomic now, it can be changed without swap_lock. In theory, it's possible get_swap_page() finds no swap pages but actually there are free swap pages. But sounds not a big problem. Accessing partition specific data (like scan_swap_map and so on) is only protected by swap_info_struct.lock. Changing swap_info_struct.flags need hold swap_lock and swap_info_struct.lock, because scan_scan_map() will check it. read the flags is ok with either the locks hold. If both swap_lock and swap_info_struct.lock must be hold, we always hold the former first to avoid deadlock. swap_entry_free() can change swap_list. To delete that code, we add a new highest_priority_index. Whenever get_swap_page() is called, we check it. If it's valid, we use it. It's a pity get_swap_page() still holds swap_lock(). But in practice, swap_lock() isn't heavily contended in my test with this patch (or I can say there are other much more heavier bottlenecks like TLB flush). And BTW, looks get_swap_page() doesn't really need the lock. We never free swap_info[] and we check SWAP_WRITEOK flag. The only risk without the lock is we could swapout to some low priority swap, but we can quickly recover after several rounds of swap, so sounds not a big deal to me. But I'd prefer to fix this if it's a real problem. "swap: make each swap partition have one address_space" improved the swapout speed from 1.7G/s to 2G/s. This patch further improves the speed to 2.3G/s, so around 15% improvement. It's a multi-process test, so TLB flush isn't the biggest bottleneck before the patches. [arnd@arndb.de: fix it for nommu] [hughd@google.com: add missing unlock] [minchan@kernel.org: get rid of lockdep whinge on sys_swapon] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23swap: make each swap partition have one address_spaceShaohua Li
When I use several fast SSD to do swap, swapper_space.tree_lock is heavily contended. This makes each swap partition have one address_space to reduce the lock contention. There is an array of address_space for swap. The swap entry type is the index to the array. In my test with 3 SSD, this increases the swapout throughput 20%. [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache] Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-22new helper: file_inode(file)Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-11mm, oom: fix race when specifying a thread as the oom originDavid Rientjes
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm, oom: change type of oom_score_adj to shortDavid Rientjes
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change. The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch. Signed-off-by: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: do not call frontswap_init() during swapoffCesar Eduardo Barros
The call to frontswap_init() was added within enable_swap_info(), which was called not only during sys_swapon, but also to reinsert the swap_info into the swap_list in case of failure of try_to_unuse() within sys_swapoff. This means that frontswap_init() might be called more than once for the same swap area. While as far as I could see no frontswap implementation has any problem with it (and in fact, all the ones I found ignore the parameter passed to frontswap_init), this could change in the future. To prevent future problems, move the call to frontswap_init() to outside the code shared between sys_swapon and sys_swapoff. Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-11mm: refactor reinsert of swap_info in sys_swapoff()Cesar Eduardo Barros
The block within sys_swapoff() which re-inserts the swap_info into the swap_list in case of failure of try_to_unuse() reads a few values outside the swap_lock. While this is safe at that point, it is subtle code. Simplify the code by moving the reading of these values to a separate function, refactoring it a bit so they are read from within the swap_lock. This is easier to understand, and matches better the way it worked before I unified the insertion of the swap_info from both sys_swapon and sys_swapoff. This change should make no functional difference. The only real change is moving the read of two or three structure fields to within the lock (frontswap_map_get() is nothing more than a read of p->frontswap_map). Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-11-16swapfile: fix name leak in swapoffXiaotian Feng
There's a name leak introduced by commit 91a27b2a7567 ("vfs: define struct filename and have getname() return it"). Add the missing putname. [akpm@linux-foundation.org: cleanup] Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-12vfs: make path_openat take a struct filename pointerJeff Layton
...and fix up the callers. For do_file_open_root, just declare a struct filename on the stack and fill out the .name field. For do_filp_open, make it also take a struct filename pointer, and fix up its callers to call it appropriately. For filp_open, add a variant that takes a struct filename pointer and turn filp_open into a wrapper around it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12vfs: define struct filename and have getname() return itJeff Layton
getname() is intended to copy pathname strings from userspace into a kernel buffer. The result is just a string in kernel space. It would however be quite helpful to be able to attach some ancillary info to the string. For instance, we could attach some audit-related info to reduce the amount of audit-related processing needed. When auditing is enabled, we could also call getname() on the string more than once and not need to recopy it from userspace. This patchset converts the getname()/putname() interfaces to return a struct instead of a string. For now, the struct just tracks the string in kernel space and the original userland pointer for it. Later, we'll add other information to the struct as it becomes convenient. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31mm: swapfile: clean up unuse_pte race handlingJohannes Weiner
The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when the function would continue to reestablish the page even after mem_cgroup_try_charge_swapin() failed. After 85d9fc8 "memcg: fix refcnt handling at swapoff", the condition is always true when this code is reached. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31swapfile: avoid dereferencing bd_disk during swap_entry_free for network storageMel Gorman
Commit b3a27d ("swap: Add swap slot free callback to block_device_operations") dereferences p->bdev->bd_disk but this is a NULL dereference if using swap-over-NFS. This patch checks SWP_BLKDEV on the swap_info_struct before dereferencing. With reference to this callback, Christoph Hellwig stated "Please just remove the callback entirely. It has no user outside the staging tree and was added clearly against the rules for that staging tree". This would also be my preference but there was not an obvious way of keeping zram in staging/ happy. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: swap: implement generic handler for swap_activateMel Gorman
The version of swap_activate introduced is sufficient for swap-over-NFS but would not provide enough information to implement a generic handler. This patch shuffles things slightly to ensure the same information is available for aops->swap_activate() as is available to the core. No functionality change. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: add support for a filesystem to activate swap files and use direct_IO ↵Mel Gorman
for writing swap pages Currently swapfiles are managed entirely by the core VM by using ->bmap to allocate space and write to the blocks directly. This effectively ensures that the underlying blocks are allocated and avoids the need for the swap subsystem to locate what physical blocks store offsets within a file. If the swap subsystem is to use the filesystem information to locate the blocks, it is critical that information such as block groups, block bitmaps and the block descriptor table that map the swap file were resident in memory. This patch adds address_space_operations that the VM can call when activating or deactivating swap backed by a file. int swap_activate(struct file *); int swap_deactivate(struct file *); The ->swap_activate() method is used to communicate to the file that the VM relies on it, and the address_space should take adequate measures such as reserving space in the underlying device, reserving memory for mempools and pinning information such as the block descriptor table in memory. The ->swap_deactivate() method is called on sys_swapoff() if ->swap_activate() returned success. After a successful swapfile ->swap_activate, the swapfile is marked SWP_FILE and swapper_space.a_ops will proxy to sis->swap_file->f_mappings->a_ops using ->direct_io to write swapcache pages and ->readpage to read. It is perfectly possible that direct_IO be used to read the swap pages but it is an unnecessary complication. Similarly, it is possible that ->writepage be used instead of direct_io to write the pages but filesystem developers have stated that calling writepage from the VM is undesirable for a variety of reasons and using direct_IO opens up the possibility of writing back batches of swap pages in the future. [a.p.zijlstra@chello.nl: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: methods for teaching filesystems about PG_swapcache pagesMel Gorman
In order to teach filesystems to handle swap cache pages, three new page functions are introduced: pgoff_t page_file_index(struct page *); loff_t page_file_offset(struct page *); struct address_space *page_file_mapping(struct page *); page_file_index() - gives the offset of this page in the file in PAGE_CACHE_SIZE blocks. Like page->index is for mapped pages, this function also gives the correct index for PG_swapcache pages. page_file_offset() - uses page_file_index(), so that it will give the expected result, even for PG_swapcache pages. page_file_mapping() - gives the mapping backing the actual page; that is for swap cache pages it will give swap_file->f_mapping. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-15swap: fix shmem swapping when more than 8 areasHugh Dickins
Minchan Kim reports that when a system has many swap areas, and tmpfs swaps out to the ninth or more, shmem_getpage_gfp()'s attempts to read back the page cannot locate it, and the read fails with -ENOMEM. Whoops. Yes, I blindly followed read_swap_header()'s pte_to_swp_entry( swp_entry_to_pte()) technique for determining maximum usable swap offset, without stopping to realize that that actually depends upon the pte swap encoding shifting swap offset to the higher bits and truncating it there. Whereas our radix_tree swap encoding leaves offset in the lower bits: it's swap "type" (that is, index of swap area) that was truncated. Fix it by reducing the SWP_TYPE_SHIFT() in swapops.h, and removing the broken radix_to_swp_entry(swp_to_radix_entry()) from read_swap_header(). This does not reduce the usable size of a swap area any further, it leaves it as claimed when making the original commit: no change from 3.0 on x86_64, nor on i386 without PAE; but 3.0's 512GB is reduced to 128GB per swapfile on i386 with PAE. It's not a change I would have risked five years ago, but with x86_64 supported for ten years, I believe it's appropriate now. Hmm, and what if some architecture implements its swap pte with offset encoded below type? That would equally break the maximum usable swap offset check. Happily, they all follow the same tradition of encoding offset above type, but I'll prepare a check on that for next. Reported-and-Reviewed-and-Tested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org [3.1, 3.2, 3.3, 3.4] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>