summaryrefslogtreecommitdiff
path: root/mm/mlock.c
AgeCommit message (Collapse)Author
2016-03-23Merge remote-tracking branch 'lsk-44/linux-linaro-lsk-v4.4' into 44rc2David Keitel
* lsk-44/linux-linaro-lsk-v4.4: Linux 4.4.3 modules: fix modparam async_probe request module: wrapper for symbol name. itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper prctl: take mmap sem for writing to protect against others xfs: log mount failures don't wait for buffers to be released Revert "xfs: clear PF_NOFREEZE for xfsaild kthread" xfs: inode recovery readahead can race with inode buffer creation libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct ovl: setattr: check permissions before copy-up ovl: root: copy attr ovl: check dentry positiveness in ovl_cleanup_whiteouts() ovl: use a minimal buffer in ovl_copy_xattr ovl: allow zero size xattr futex: Drop refcount if requeue_pi() acquired the rtmutex devm_memremap_release(): fix memremap'd addr handling ipc/shm: handle removed segments gracefully in shm_mmap() intel_scu_ipcutil: underflow in scu_reg_access() mm,thp: khugepaged: call pte flush at the time of collapse dump_stack: avoid potential deadlocks radix-tree: fix oops after radix_tree_iter_retry drivers/hwspinlock: fix race between radix tree insertion and lookup radix-tree: fix race in gang lookup MAINTAINERS: return arch/sh to maintained state, with new maintainers memcg: only free spare array when readers are done numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390 fs/hugetlbfs/inode.c: fix bugs in hugetlb_vmtruncate_list() scripts/bloat-o-meter: fix python3 syntax error dma-debug: switch check from _text to _stext m32r: fix m32104ut_defconfig build fail xhci: Fix list corruption in urb dequeue at host removal Revert "xhci: don't finish a TD if we get a short-transfer event mid TD" iommu/vt-d: Clear PPR bit to ensure we get more page request interrupts iommu/vt-d: Fix 64-bit accesses to 32-bit DMAR_GSTS_REG iommu/vt-d: Fix mm refcounting to hold mm_count not mm_users iommu/amd: Correct the wrong setting of alias DTE in do_attach iommu/vt-d: Don't skip PCI devices when disabling IOTLB Input: vmmouse - fix absolute device registration string_helpers: fix precision loss for some inputs Input: i8042 - add Fujitsu Lifebook U745 to the nomux list Input: elantech - mark protocols v2 and v3 as semi-mt mm: fix regression in remap_file_pages() emulation mm: replace vma_lock_anon_vma with anon_vma_lock_read/write mm: fix mlock accouting libnvdimm: fix namespace object confusion in is_uuid_busy() mm: soft-offline: check return value in second __get_any_page() call perf kvm record/report: 'unprocessable sample' error while recording/reporting guest data KVM: PPC: Fix ONE_REG AltiVec support KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8 KVM: arm/arm64: Fix reference to uninitialised VGIC arm64: dma-mapping: fix handling of devices registered before arch_initcall ARM: OMAP2+: Fix ppa_zero_params and ppa_por_params for rodata ARM: OMAP2+: Fix save_secure_ram_context for rodata ARM: OMAP2+: Fix l2dis_3630 for rodata ARM: OMAP2+: Fix l2_inv_api_params for rodata ARM: OMAP2+: Fix wait_dll_lock_timed for rodata ARM: dts: at91: sama5d4ek: add phy address and IRQ for macb0 ARM: dts: at91: sama5d4 xplained: fix phy0 IRQ type ARM: dts: at91: sama5d4: fix instance id of DBGU ARM: dts: at91: sama5d4 xplained: properly mux phy interrupt ARM: dts: omap5-board-common: enable rtc and charging of backup battery ARM: dts: Fix omap5 PMIC control lines for RTC writes ARM: dts: Fix wl12xx missing clocks that cause hangs ARM: nomadik: fix up SD/MMC DT settings ARM: 8517/1: ICST: avoid arithmetic overflow in icst_hz() ARM: 8519/1: ICST: try other dividends than 1 arm64: mm: avoid calling apply_to_page_range on empty range ARM: mvebu: remove duplicated regulator definition in Armada 388 GP powerpc/ioda: Set "read" permission when "write" is set powerpc/powernv: Fix stale PE primary bus powerpc/eeh: Fix stale cached primary bus powerpc/eeh: Fix PE location code SUNRPC: Fixup socket wait for memory udf: Check output buffer length when converting name to CS0 udf: Prevent buffer overrun with multi-byte characters udf: limit the maximum number of indirect extents in a row pNFS/flexfiles: Fix an XDR encoding bug in layoutreturn nfs: Fix race in __update_open_stateid() pNFS/flexfiles: Fix an Oopsable typo in ff_mirror_match_fh() NFS: Fix attribute cache revalidation cifs: fix erroneous return value cifs_dbg() outputs an uninitialized buffer in cifs_readdir() cifs: fix race between call_async() and reconnect() cifs: Ratelimit kernel log messages iio: inkern: fix a NULL dereference on error iio: pressure: mpl115: fix temperature offset sign iio: light: acpi-als: Report data as processed iio: dac: mcp4725: set iio name property in sysfs iio: add IIO_TRIGGER dependency to STK8BA50 iio: add HAS_IOMEM dependency to VF610_ADC iio-light: Use a signed return type for ltr501_match_samp_freq() iio:adc:ti_am335x_adc Fix buffered mode by identifying as software buffer. iio: adis_buffer: Fix out-of-bounds memory access scsi: fix soft lockup in scsi_remove_target() on module removal SCSI: Add Marvell Console to VPD blacklist scsi_dh_rdac: always retry MODE SELECT on command lock violation drivers/scsi/sg.c: mark VMA as VM_IO to prevent migration SCSI: fix crashes in sd and sr runtime PM iscsi-target: Fix potential dead-lock during node acl delete scsi: add Synology to 1024 sector blacklist klist: fix starting point removed bug in klist iterators tracepoints: Do not trace when cpu is offline tracing: Fix freak link error caused by branch tracer perf tools: tracepoint_error() can receive e=NULL, robustify it tools lib traceevent: Fix output of %llu for 64 bit values read on 32 bit machines ptrace: use fsuid, fsgid, effective creds for fs access checks Btrfs: fix direct IO requests not reporting IO error to user space Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl Btrfs: fix page reading in extent_same ioctl leading to csum errors Btrfs: fix invalid page accesses in extent_same (dedup) ioctl btrfs: properly set the termination value of ctx->pos in readdir Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" Btrfs: fix fitrim discarding device area reserved for boot loader's use btrfs: handle invalid num_stripes in sys_array ext4: don't read blocks from disk after extents being swapped ext4: fix potential integer overflow ext4: fix scheduling in atomic on group checksum failure serial: omap: Prevent DoS using unprivileged ioctl(TIOCSRS485) serial: 8250_pci: Add Intel Broadwell ports tty: Add support for PCIe WCH382 2S multi-IO card pty: make sure super_block is still valid in final /dev/tty close pty: fix possible use after free of tty->driver_data staging/speakup: Use tty_ldisc_ref() for paste kworker phy: twl4030-usb: Fix unbalanced pm_runtime_enable on module reload phy: twl4030-usb: Relase usb phy on unload ALSA: seq: Fix double port list deletion ALSA: seq: Fix leak of pool buffer at concurrent writes ALSA: pcm: Fix rwsem deadlock for non-atomic PCM stream ALSA: hda - Cancel probe work instead of flush at remove x86/mm: Fix vmalloc_fault() to handle large pages properly x86/uaccess/64: Handle the caching of 4-byte nocache copies properly in __copy_user_nocache() x86/uaccess/64: Make the __copy_user_nocache() assembly code more readable x86/mm/pat: Avoid truncation when converting cpa->numpages to address x86/mm: Fix types used in pgprot cacheability flags translations Linux 4.4.2 HID: multitouch: fix input mode switching on some Elan panels mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress zsmalloc: fix migrate_zspage-zs_free race condition zram: don't call idr_remove() from zram_remove() zram: try vmalloc() after kmalloc() zram/zcomp: use GFP_NOIO to allocate streams rtlwifi: rtl8821ae: Fix 5G failure when EEPROM is incorrectly encoded rtlwifi: rtl8821ae: Fix errors in parameter initialization crypto: marvell/cesa - fix test in mv_cesa_dev_dma_init() crypto: atmel-sha - remove calls of clk_prepare() from atomic contexts crypto: atmel-sha - fix atmel_sha_remove() crypto: algif_skcipher - Do not set MAY_BACKLOG on the async path crypto: algif_skcipher - Do not dereference ctx without socket lock crypto: algif_skcipher - Do not assume that req is unchanged crypto: user - lock crypto_alg_list on alg dump EVM: Use crypto_memneq() for digest comparisons crypto: algif_hash - wait for crypto_ahash_init() to complete crypto: shash - Fix has_key setting crypto: chacha20-ssse3 - Align stack pointer to 64 bytes crypto: caam - make write transactions bufferable on PPC platforms crypto: algif_skcipher - sendmsg SG marking is off by one crypto: algif_skcipher - Load TX SG list after waiting crypto: crc32c - Fix crc32c soft dependency crypto: algif_skcipher - Fix race condition in skcipher_check_key crypto: algif_hash - Fix race condition in hash_check_key crypto: af_alg - Forbid bind(2) when nokey child sockets are present crypto: algif_skcipher - Remove custom release parent function crypto: algif_hash - Remove custom release parent function crypto: af_alg - Allow af_af_alg_release_parent to be called on nokey path ahci: Intel DNV device IDs SATA libata: disable forced PORTS_IMPL for >= AHCI 1.3 crypto: algif_skcipher - Add key check exception for cipher_null crypto: skcipher - Add crypto_skcipher_has_setkey crypto: algif_hash - Require setkey before accept(2) crypto: hash - Add crypto_ahash_has_setkey crypto: algif_skcipher - Add nokey compatibility path crypto: af_alg - Add nokey compatibility path crypto: af_alg - Fix socket double-free when accept fails crypto: af_alg - Disallow bind/setkey/... after accept(2) crypto: algif_skcipher - Require setkey before accept(2) sched: Fix crash in sched_init_numa() ext4 crypto: add missing locking for keyring_key access iommu/io-pgtable-arm: Ensure we free the final level on teardown tty: Fix unsafe ldisc reference via ioctl(TIOCGETD) tty: Retry failed reopen if tty teardown in-progress tty: Wait interruptibly for tty lock on reopen n_tty: Fix unsafe reference to "other" ldisc usb: xhci: apply XHCI_PME_STUCK_QUIRK to Intel Broxton-M platforms usb: xhci: handle both SSIC ports in PME stuck quirk usb: phy: msm: fix error handling in probe. usb: cdc-acm: send zero packet for intel 7260 modem usb: cdc-acm: handle unlinked urb in acm read callback USB: option: fix Cinterion AHxx enumeration USB: serial: option: Adding support for Telit LE922 USB: cp210x: add ID for IAI USB to RS485 adaptor USB: serial: ftdi_sio: add support for Yaesu SCU-18 cable usb: hub: do not clear BOS field during reset device USB: visor: fix null-deref at probe USB: serial: visor: fix crash on detecting device without write_urbs ASoC: rt5645: fix the shift bit of IN1 boost saa7134-alsa: Only frees registered sound cards ALSA: dummy: Implement timer backend switching more safely ALSA: hda - Fix bad dereference of jack object ALSA: hda - Fix speaker output from VAIO AiO machines Revert "ALSA: hda - Fix noise on Gigabyte Z170X mobo" ALSA: hda - Fix static checker warning in patch_hdmi.c ALSA: hda - Add fixup for Mac Mini 7,1 model ALSA: timer: Fix race between stop and interrupt ALSA: timer: Fix wrong instance passed to slave callbacks ALSA: timer: Fix race at concurrent reads ALSA: timer: Fix link corruption due to double start or stop ALSA: timer: Fix leftover link at closing ALSA: timer: Code cleanup ALSA: seq: Fix lockdep warnings due to double mutex locks ALSA: seq: Fix race at closing in virmidi driver ALSA: seq: Fix yet another races among ALSA timer accesses ASoC: dpcm: fix the BE state on hw_free ALSA: pcm: Fix potential deadlock in OSS emulation ALSA: hda/realtek - Support Dell headset mode for ALC225 ALSA: hda/realtek - Support headset mode for ALC225 ALSA: hda/realtek - New codec support of ALC225 ALSA: rawmidi: Fix race at copying & updating the position ALSA: rawmidi: Remove kernel WARNING for NULL user-space buffer check ALSA: rawmidi: Make snd_rawmidi_transmit() race-free ALSA: seq: Degrade the error message for too many opens ALSA: seq: Fix incorrect sanity check at snd_seq_oss_synth_cleanup() ALSA: dummy: Disable switching timer backend via sysfs ALSA: compress: Disable GET_CODEC_CAPS ioctl for some architectures ALSA: hda - disable dynamic clock gating on Broxton before reset ALSA: Add missing dependency on CONFIG_SND_TIMER ALSA: bebob: Use a signed return type for get_formation_index ALSA: usb-audio: avoid freeing umidi object twice ALSA: usb-audio: Add native DSD support for PS Audio NuWave DAC ALSA: usb-audio: Fix OPPO HA-1 vendor ID ALSA: usb-audio: Add quirk for Microsoft LifeCam HD-6000 ALSA: usb-audio: Fix TEAC UD-501/UD-503/NT-503 usb delay hrtimer: Handle remaining time proper for TIME_LOW_RES md/raid: only permit hot-add of compatible integrity profiles media: i2c: Don't export ir-kbd-i2c module alias parisc: Fix __ARCH_SI_PREAMBLE_SIZE parisc: Protect huge page pte changes with spinlocks printk: do cond_resched() between lines while outputting to consoles tracing/stacktrace: Show entire trace if passed in function not found tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs() PCI: Fix minimum allocation address overwrite PCI: host: Mark PCIe/PCI (MSI) IRQ cascade handlers as IRQF_NO_THREAD mtd: nand: assign reasonable default name for NAND drivers wlcore/wl12xx: spi: fix NULL pointer dereference (Oops) wlcore/wl12xx: spi: fix oops on firmware load ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup ocfs2/dlm: ignore cleaning the migration mle that is inuse ALSA: hda - Implement loopback control switch for Realtek and other codecs block: fix bio splitting on max sectors base/platform: Fix platform drivers with no probe callback HID: usbhid: fix recursive deadlock ocfs2: NFS hangs in __ocfs2_cluster_lock due to race with ocfs2_unblock_lock block: split bios to max possible length NFSv4.1/pnfs: Fixup an lo->plh_block_lgets imbalance in layoutreturn crypto: sun4i-ss - add missing statesize Linux 4.4.1 arm64: kernel: fix architected PMU registers unconditional access arm64: kernel: enforce pmuserenr_el0 initialization and restore arm64: mm: ensure that the zero page is visible to the page table walker arm64: Clear out any singlestep state on a ptrace detach operation powerpc/module: Handle R_PPC64_ENTRY relocations scripts/recordmcount.pl: support data in text section on powerpc powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered powerpc: Make value-returning atomics fully ordered powerpc/tm: Check for already reclaimed tasks batman-adv: Drop immediate orig_node free function batman-adv: Drop immediate batadv_hard_iface free function batman-adv: Drop immediate neigh_ifinfo free function batman-adv: Drop immediate batadv_neigh_node free function batman-adv: Drop immediate batadv_orig_ifinfo free function batman-adv: Avoid recursive call_rcu for batadv_nc_node batman-adv: Avoid recursive call_rcu for batadv_bla_claim team: Replace rcu_read_lock with a mutex in team_vlan_rx_kill_vid net/mlx5_core: Fix trimming down IRQ number bridge: fix lockdep addr_list_lock false positive splat ipv6: update skb->csum when CE mark is propagated net: bpf: reject invalid shifts phonet: properly unshare skbs in phonet_rcv() dwc_eth_qos: Fix dma address for multi-fragment skbs bonding: Prevent IPv6 link local address on enslaved devices net: preserve IP control block during GSO segmentation udp: disallow UFO for sockets with SO_NO_CHECK option net: pktgen: fix null ptr deref in skb allocation sched,cls_flower: set key address type when present tcp_yeah: don't set ssthresh below 2 ipv6: tcp: add rcu locking in tcp_v6_send_synack() net: sctp: prevent writes to cookie_hmac_alg from accessing invalid memory vxlan: fix test which detect duplicate vxlan iface unix: properly account for FDs passed over unix sockets xhci: refuse loading if nousb is used usb: core: lpm: fix usb3_hardware_lpm sysfs node USB: cp210x: add ID for ELV Marble Sound Board 1 rtlwifi: fix memory leak for USB device ASoC: compress: Fix compress device direction check ASoC: wm5110: Fix PGA clear when disabling DRE ALSA: timer: Handle disconnection more safely ALSA: hda - Flush the pending probe work at remove ALSA: hda - Fix missing module loading with model=generic option ALSA: hda - Fix bass pin fixup for ASUS N550JX ALSA: control: Avoid kernel warnings from tlv ioctl with numid 0 ALSA: hrtimer: Fix stall by hrtimer_cancel() ALSA: pcm: Fix snd_pcm_hw_params struct copy in compat mode ALSA: seq: Fix snd_seq_call_port_info_ioctl in compat mode ALSA: hda - Add fixup for Dell Latitidue E6540 ALSA: timer: Fix double unlink of active_list ALSA: timer: Fix race among timer ioctls ALSA: hda - fix the headset mic detection problem for a Dell laptop ALSA: timer: Harden slave timer list handling ALSA: usb-audio: Fix mixer ctl regression of Native Instrument devices ALSA: hda - Fix white noise on Dell Latitude E5550 ALSA: seq: Fix race at timer setup and close ALSA: usb-audio: Avoid calling usb_autopm_put_interface() at disconnect ALSA: seq: Fix missing NULL check at remove_events ioctl ALSA: hda - Fixup inverted internal mic for Lenovo E50-80 ALSA: usb: Add native DSD support for Oppo HA-1 x86/mm: Improve switch_mm() barrier comments x86/mm: Add barriers and document switch_mm()-vs-flush synchronization x86/boot: Double BOOT_HEAP_SIZE to 64KB x86/reboot/quirks: Add iMac10,1 to pci_reboot_dmi_table[] kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL KVM: x86: correctly print #AC in traces KVM: x86: expose MSR_TSC_AUX to userspace x86/xen: don't reset vcpu_info on a cancelled suspend KEYS: Fix keyring ref leak in join_session_keyring() Conflicts: arch/arm64/kernel/perf_event.c drivers/scsi/sd.c sound/core/compress_offload.c Change-Id: I9f77fe42aaae249c24cd6e170202110ab1426878 Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
2016-02-25mm: fix mlock accoutingKirill A. Shutemov
commit 7162a1e87b3e380133dadc7909081bb70d0a7041 upstream. Tetsuo Handa reported underflow of NR_MLOCK on munlock. Testcase: #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #define BASE ((void *)0x400000000000) #define SIZE (1UL << 21) int main(int argc, char *argv[]) { void *addr; system("grep Mlocked /proc/meminfo"); addr = mmap(BASE, SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED | MAP_FIXED, -1, 0); if (addr == MAP_FAILED) printf("mmap() failed\n"), exit(1); munmap(addr, SIZE); system("grep Mlocked /proc/meminfo"); return 0; } It happens on munlock_vma_page() due to unfortunate choice of nr_pages data type: __mod_zone_page_state(zone, NR_MLOCK, -nr_pages); For unsigned int nr_pages, implicitly casted to long in __mod_zone_page_state(), it becomes something around UINT_MAX. munlock_vma_page() usually called for THP as small pages go though pagevec. Let's make nr_pages signed int. Similar fixes in 6cdb18ad98a4 ("mm/vmstat: fix overflow in mod_zone_page_state()") used `long' type, but `int' here is OK for a count of the number of sub-pages in a huge page. Fixes: ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Michel Lespinasse <walken@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-16mm: add a field to store names for private anonymous memoryColin Cross
Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time. Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
2015-11-05mm: mlock: add mlock flags to enable VM_LOCKONFAULT usageEric B Munson
The previous patch introduced a flag that specified pages in a VMA should be placed on the unevictable LRU, but they should not be made present when the area is created. This patch adds the ability to set this state via the new mlock system calls. We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall. MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED. MCL_ONFAULT should be used as a modifier to the two other mlockall flags. When used with MCL_CURRENT, all current mappings will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be marked with VM_LOCKED | VM_LOCKONFAULT. Prior to this patch, mlockall() will unconditionally clear the mm->def_flags any time it is called without MCL_FUTURE. This behavior is maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and new VMAs will be unlocked. This remains true with or without MCL_ONFAULT in either mlockall() invocation. munlock() will unconditionally clear both vma flags. munlockall() unconditionally clears for VMA flags on all VMAs and in the mm->def_flags field. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm: introduce VM_LOCKONFAULTEric B Munson
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking. For the example of a large file, this is the usage pattern for a large statical language model (probably applies to other statical or graphical models as well). For the security example, any application transacting in data that cannot be swapped out (credit card data, medical records, etc). This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in. Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm: mlock: add new mlock system callEric B Munson
With the refactored mlock code, introduce a new system call for mlock. The new call will allow the user to specify what lock states are being added. mlock2 is trivial at the moment, but a follow on patch will add a new mlock state making it useful. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm: mlock: refactor mlock, munlock, and munlockall codeEric B Munson
mlock() allows a user to control page out of program memory, but this comes at the cost of faulting in the entire mapping when it is allocated. For large mappings where the entire area is not necessary this is not ideal. Instead of forcing all locked pages to be present when they are allocated, this set creates a middle ground. Pages are marked to be placed on the unevictable LRU (locked) when they are first used, but they are not faulted in by the mlock call. This series introduces a new mlock() system call that takes a flags argument along with the start address and size. This flags argument gives the caller the ability to request memory be locked in the traditional way, or to be locked after the page is faulted in. A new MCL flag is added to mirror the lock on fault behavior from mlock() in mlockall(). There are two main use cases that this set covers. The first is the security focussed mlock case. A buffer is needed that cannot be written to swap. The maximum size is known, but on average the memory used is significantly less than this maximum. With lock on fault, the buffer is guaranteed to never be paged out without consuming the maximum size every time such a buffer is created. The second use case is focussed on performance. Portions of a large file are needed and we want to keep the used portions in memory once accessed. This is the case for large graphical models where the path through the graph is not known until run time. The entire graph is unlikely to be used in a given invocation, but once a node has been used it needs to stay resident for further processing. Given these constraints we have a number of options. We can potentially waste a large amount of memory by mlocking the entire region (this can also cause a significant stall at startup as the entire file is read in). We can mlock every page as we access them without tracking if the page is already resident but this introduces large overhead for each access. The third option is mapping the entire region with PROT_NONE and using a signal handler for SIGSEGV to mprotect(PROT_READ) and mlock() the needed page. Doing this page at a time adds a significant performance penalty. Batching can be used to mitigate this overhead, but in order to safely avoid trying to mprotect pages outside of the mapping, the boundaries of each mapping to be used in this way must be tracked and available to the signal handler. This is precisely what the mm system in the kernel should already be doing. For mlock(MLOCK_ONFAULT) the user is charged against RLIMIT_MEMLOCK as if mlock(MLOCK_LOCKED) or mmap(MAP_LOCKED) was used, so when the VMA is created not when the pages are faulted in. For mlockall(MCL_ONFAULT) the user is charged as if MCL_FUTURE was used. This decision was made to keep the accounting checks out of the page fault path. To illustrate the benefit of this set I wrote a test program that mmaps a 5 GB file filled with random data and then makes 15,000,000 accesses to random addresses in that mapping. The test program was run 20 times for each setup. Results are reported for two program portions, setup and execution. The setup phase is calling mmap and optionally mlock on the entire region. For most experiments this is trivial, but it highlights the cost of faulting in the entire region. Results are averages across the 20 runs in milliseconds. mmap with mlock(MLOCK_LOCKED) on entire range: Setup avg: 8228.666 Processing avg: 8274.257 mmap with mlock(MLOCK_LOCKED) before each access: Setup avg: 0.113 Processing avg: 90993.552 mmap with PROT_NONE and signal handler and batch size of 1 page: With the default value in max_map_count, this gets ENOMEM as I attempt to change the permissions, after upping the sysctl significantly I get: Setup avg: 0.058 Processing avg: 69488.073 mmap with PROT_NONE and signal handler and batch size of 8 pages: Setup avg: 0.068 Processing avg: 38204.116 mmap with PROT_NONE and signal handler and batch size of 16 pages: Setup avg: 0.044 Processing avg: 29671.180 mmap with mlock(MLOCK_ONFAULT) on entire range: Setup avg: 0.189 Processing avg: 17904.899 The signal handler in the batch cases faulted in memory in two steps to avoid having to know the start and end of the faulting mapping. The first step covers the page that caused the fault as we know that it will be possible to lock. The second step speculatively tries to mlock and mprotect the batch size - 1 pages that follow. There may be a clever way to avoid this without having the program track each mapping to be covered by this handeler in a globally accessible structure, but I could not find it. It should be noted that with a large enough batch size this two step fault handler can still cause the program to crash if it reaches far beyond the end of the mapping. These results show that if the developer knows that a majority of the mapping will be used, it is better to try and fault it in at once, otherwise mlock(MLOCK_ONFAULT) is significantly faster. The performance cost of these patches are minimal on the two benchmarks I have tested (stream and kernbench). The following are the average values across 20 runs of stream and 10 runs of kernbench after a warmup run whose results were discarded. Avg throughput in MB/s from stream using 1000000 element arrays Test 4.2-rc1 4.2-rc1+lock-on-fault Copy: 10,566.5 10,421 Scale: 10,685 10,503.5 Add: 12,044.1 11,814.2 Triad: 12,064.8 11,846.3 Kernbench optimal load 4.2-rc1 4.2-rc1+lock-on-fault Elapsed Time 78.453 78.991 User Time 64.2395 65.2355 System Time 9.7335 9.7085 Context Switches 22211.5 22412.1 Sleeps 14965.3 14956.1 This patch (of 6): Extending the mlock system call is very difficult because it currently does not take a flags argument. A later patch in this set will extend mlock to support a middle ground between pages that are locked and faulted in immediately and unlocked pages. To pave the way for the new system call, the code needs some reorganization so that all the actual entry point handles is checking input and translating to VMA flags. Signed-off-by: Eric B Munson <emunson@akamai.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mlock: use offset_in_page macroAlexander Kuleshov
linux/mm.h provides offset_in_page() macro. Let's use already predefined macro instead of (addr & ~PAGE_MASK). Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05mm/mlock.c: reorganize mlockall() return values and remove goto-out labelAlexey Klimov
In mlockall syscall wrapper after out-label for goto code just doing return. Remove goto out statements and return error values directly. Also instead of rewriting ret variable before every if-check move returns to 'error'-like path under if-check. Objdump asm listing showed me reducing by few asm lines. Object file size descreased from 220592 bytes to 220528 bytes for me (for aarch64). Signed-off-by: Alexey Klimov <klimov.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctxAndrea Arcangeli
vma->vm_userfaultfd_ctx is yet another vma parameter that vma_merge must be aware about so that we can merge vmas back like they were originally before arming the userfaultfd on some memory range. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Cc: zhang.zhanghailiang@huawei.com Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hugh Dickins <hughd@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: move mm_populate()-related code to mm/gup.cKirill A. Shutemov
It's odd that we have populate_vma_page_range() and __mm_populate() in mm/mlock.c. It's implementation of generic memory population and mlocking is one of possible side effect, if VM_LOCKED is set. __get_user_pages() is core of the implementation. Let's move the code into mm/gup.c. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: move gup() -> posix mlock() error conversion out of __mm_populateKirill A. Shutemov
This is praparation to moving mm_populate()-related code out of mm/mlock.c. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: rename __mlock_vma_pages_range() to populate_vma_page_range()Kirill A. Shutemov
__mlock_vma_pages_range() doesn't necessarily mlock pages. It depends on vma flags. The same codepath is used for MAP_POPULATE. Let's rename __mlock_vma_pages_range() to populate_vma_page_range(). This patch also drops mlock_vma_pages_range() references from documentation. It has gone in cea10a19b797 ("mm: directly use __mlock_vma_pages_range() in find_extend_vma()"). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-14mm: rename FOLL_MLOCK to FOLL_POPULATEKirill A. Shutemov
After commit a1fde08c74e9 ("VM: skip the stack guard page lookup in get_user_pages only for mlock") FOLL_MLOCK has lost its original meaning: we don't necessarily mlock the page if the flags is set -- we also take VM_LOCKED into consideration. Since we use the same codepath for __mm_populate(), let's rename FOLL_MLOCK to FOLL_POPULATE. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12mm: reorder can_do_mlock to fix audit denialJeff Vander Stoep
A userspace call to mmap(MAP_LOCKED) may result in the successful locking of memory while also producing a confusing audit log denial. can_do_mlock checks capable and rlimit. If either of these return positive can_do_mlock returns true. The capable check leads to an LSM hook used by apparmour and selinux which produce the audit denial. Reordering so rlimit is checked first eliminates the denial on success, only recording a denial when the lock is unsuccessful as a result of the denial. Signed-off-by: Jeff Vander Stoep <jeffv@google.com> Acked-by: Nick Kralevich <nnk@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-13Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes in this cycle were: - changes related to No-CBs CPUs and NO_HZ_FULL - RCU-tasks implementation - torture-test updates - miscellaneous fixes - locktorture updates - RCU documentation updates" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits) workqueue: Use cond_resched_rcu_qs macro workqueue: Add quiescent state between work items locktorture: Cleanup header usage locktorture: Cannot hold read and write lock locktorture: Fix __acquire annotation for spinlock irq locktorture: Support rwlocks rcu: Eliminate deadlock between CPU hotplug and expedited grace periods locktorture: Document boot/module parameters rcutorture: Rename rcutorture_runnable parameter locktorture: Add test scenario for rwsem_lock locktorture: Add test scenario for mutex_lock locktorture: Make torture scripting account for new _runnable name locktorture: Introduce torture context locktorture: Support rwsems locktorture: Add infrastructure for torturing read locks torture: Address race in module cleanup locktorture: Make statistics generic locktorture: Teach about lock debugging locktorture: Support mutexes locktorture: Add documentation ...
2014-10-09mm: use VM_BUG_ON_MM where possibleSasha Levin
Dump the contents of the relevant struct_mm when we hit the bug condition. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09mm: convert a few VM_BUG_ON callers to VM_BUG_ON_VMASasha Levin
Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract more information when they trigger. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michel Lespinasse <walken@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-07rcu: Provide cond_resched_rcu_qs() to force quiescent states in long loopsPaul E. McKenney
RCU-tasks requires the occasional voluntary context switch from CPU-bound in-kernel tasks. In some cases, this requires instrumenting cond_resched(). However, there is some reluctance to countenance unconditionally instrumenting cond_resched() (see http://lwn.net/Articles/603252/), so this commit creates a separate cond_resched_rcu_qs() that may be used in place of cond_resched() in locations prone to long-duration in-kernel looping. This commit currently instruments only RCU-tasks. Future possibilities include also instrumenting RCU, RCU-bh, and RCU-sched in order to reduce IPI usage. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-08-06mm: describe mmap_sem rules for __lock_page_or_retry() and callersPaul Cassella
Add a comment describing the circumstances in which __lock_page_or_retry() will or will not release the mmap_sem when returning 0. Add comments to lock_page_or_retry()'s callers (filemap_fault(), do_swap_page()) noting the impact on VM_FAULT_RETRY returns. Add comments on up the call tree, particularly replacing the false "We return with mmap_sem still held" comments. Signed-off-by: Paul Cassella <cassella@cray.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07mm: try_to_unmap_cluster() should lock_page() before mlockingVlastimil Babka
A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing with trinity. The call site try_to_unmap_cluster() does not lock the pages other than its check_page parameter (which is already locked). The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat unclear, but apparently it serializes against page migration, which could otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the page would be eventually encountered again, but NR_MLOCK accounting would become distorted nevertheless. This patch adds a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that effect. The call site try_to_unmap_cluster() is fixed so that for page != check_page, trylock_page() is attempted (to avoid possible deadlocks as we already have check_page locked) and mlock_vma_page() is performed only upon success. If the page lock cannot be obtained, the page is left without PG_mlocked, which is again not a problem in the whole unevictable memory design. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Bob Liu <bob.liu@oracle.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGESasha Levin
Most of the VM_BUG_ON assertions are performed on a page. Usually, when one of these assertions fails we'll get a BUG_ON with a call stack and the registers. I've recently noticed based on the requests to add a small piece of code that dumps the page to various VM_BUG_ON sites that the page dump is quite useful to people debugging issues in mm. This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON. [akpm@linux-foundation.org: fix up includes] Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23mm: munlock: fix potential race with THP page splitVlastimil Babka
Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages") munlock skips tail pages of a munlocked THP page. There is some attempt to prevent bad consequences of racing with a THP page split, but code inspection indicates that there are two problems that may lead to a non-fatal, yet wrong outcome. First, __split_huge_page_refcount() copies flags including PageMlocked from the head page to the tail pages. Clearing PageMlocked by munlock_vma_page() in the middle of this operation might result in part of tail pages left with PageMlocked flag. As the head page still appears to be a THP page until all tail pages are processed, munlock_vma_page() might think it munlocked the whole THP page and skip all the former tail pages. Before ff6a6da60, those pages would be cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK would still become undercounted (related the next point). Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after the PageMlocked is cleared. The accounting might also become inconsistent due to race with __split_huge_page_refcount() - undercount when HUGE_PMD_NR is subtracted, but some tail pages are left with PageMlocked set and counted again (only possible before ff6a6da60) - overcount when hpage_nr_pages() sees a normal page (split has already finished), but the parallel split has meanwhile cleared PageMlocked from additional tail pages This patch prevents both problems via extending the scope of lru_lock in munlock_vma_page(). This is convenient because: - __split_huge_page_refcount() takes lru_lock for its whole operation - munlock_vma_page() typically takes lru_lock anyway for page isolation As this becomes a second function where page isolation is done with lru_lock already held, factor this out to a new __munlock_isolate_lru_page() function and clean up the code around. [akpm@linux-foundation.org: avoid a coding-style ugly] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-21mm/mlock: prepare params outside critical regionDavidlohr Bueso
All mlock related syscalls prepare lock limits, lengths and start parameters with the mmap_sem held. Move this logic outside of the critical region. For the case of mlock, continue incrementing the amount already locked by mm->locked_vm with the rwsem taken. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Michel Lespinasse <walken@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02mm: munlock: fix deadlock in __munlock_pagevec()Vlastimil Babka
Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and munlock+putback using pagevec" introduced __munlock_pagevec() to speed up munlock by holding lru_lock over multiple isolated pages. Pages that fail to be isolated are put_page()d immediately, also within the lock. This can lead to deadlock when __munlock_pagevec() becomes the holder of the last page pin and put_page() leads to __page_cache_release() which also locks lru_lock. The deadlock has been observed by Sasha Levin using trinity. This patch avoids the deadlock by deferring put_page() operations until lru_lock is released. Another pagevec (which is also used by later phases of the function is reused to gather the pages for put_page() operation. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-02mm: munlock: fix a bug where THP tail page is encounteredVlastimil Babka
Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages") munlock skips tail pages of a munlocked THP page. However, when the head page already has PageMlocked unset, it will not skip the tail pages. Commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and munlock+putback using pagevec") has added a PageTransHuge() check which contains VM_BUG_ON(PageTail(page)). Sasha Levin found this triggered using trinity, on the first tail page of a THP page without PageMlocked flag. This patch fixes the issue by skipping tail pages also in the case when PageMlocked flag is unset. There is still a possibility of race with THP page split between clearing PageMlocked and determining how many pages to skip. The race might result in former tail pages not being skipped, which is however no longer a bug, as during the skip the PageTail flags are cleared. However this race also affects correctness of NR_MLOCK accounting, which is to be fixed in a separate patch. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-30mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configurationVlastimil Babka
The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627 ("mm: munlock: manual pte walk in fast path instead of follow_page_mask()") uses pmd_addr_end() for restricting its operation within current page table. This is insufficient on architectures/configurations where pmd is folded and pmd_addr_end() just returns the end of the full range to be walked. In this case, it allows pte++ to walk off the end of a page table resulting in unpredictable behaviour. This patch fixes the function by using pgd_addr_end() and pud_addr_end() before pmd_addr_end(), which will yield correct page table boundary on all configurations. This is similar to what existing page walkers do when walking each level of the page table. Additionaly, the patch clarifies a comment for get_locked_pte() call in the function. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24mm: Place preemption point in do_mlockall() loopPaul E. McKenney
There is a loop in do_mlockall() that lacks a preemption point, which means that the following can happen on non-preemptible builds of the kernel. Dave Jones reports: "My fuzz tester keeps hitting this. Every instance shows the non-irq stack came in from mlockall. I'm only seeing this on one box, but that has more ram (8gb) than my other machines, which might explain it. INFO: rcu_preempt self-detected stall on CPU { 3} (t=6500 jiffies g=470344 c=470343 q=0) sending NMI to all CPUs: NMI backtrace for cpu 3 CPU: 3 PID: 29664 Comm: trinity-child2 Not tainted 3.11.0-rc1+ #32 Call Trace: lru_add_drain_all+0x15/0x20 SyS_mlockall+0xa5/0x1a0 tracesys+0xdd/0xe2" This commit addresses this problem by inserting the required preemption point. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: manual pte walk in fast path instead of follow_page_mask()Vlastimil Babka
Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each individual struct page. This entails repeated full page table translations and page table lock taken for each page separately. This patch avoids the costly follow_page_mask() where possible, by iterating over ptes within single pmd under single page table lock. The first pte is obtained by get_locked_pte() for non-THP page acquired by the initial follow_page_mask(). The rest of the on-stack pagevec for munlock is filled up using pte_walk as long as pte_present() and vm_normal_page() are sufficient to obtain the struct page. After this patch, a 14% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: remove redundant get_page/put_page pair on the fast pathVlastimil Babka
The performance of the fast path in munlock_vma_range() can be further improved by avoiding atomic ops of a redundant get_page()/put_page() pair. When calling get_page() during page isolation, we already have the pin from follow_page_mask(). This pin will be then returned by __pagevec_lru_add(), after which we do not reference the pages anymore. After this patch, an 8% speedup was measured for munlocking a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: bypass per-cpu pvec for putback_lru_pageVlastimil Babka
After introducing batching by pagevecs into munlock_vma_range(), we can further improve performance by bypassing the copying into per-cpu pagevec and the get_page/put_page pair associated with that. Instead we perform LRU putback directly from our pagevec. However, this is possible only for single-mapped pages that are evictable after munlock. Unevictable pages require rechecking after putting on the unevictable list, so for those we fallback to putback_lru_page(), hich handles that. After this patch, a 13% speedup was measured for munlocking a 56GB large memory area with THP disabled. [akpm@linux-foundation.org:clarify comment] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: batch NR_MLOCK zone state updatesVlastimil Babka
Depending on previous batch which introduced batched isolation in munlock_vma_range(), we can batch also the updates of NR_MLOCK page stats. After the whole pagevec is processed for page isolation, the stats are updated only once with the number of successful isolations. There were however no measurable perfomance gains. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: batch non-THP page isolation and munlock+putback using pagevecVlastimil Babka
Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop, which results in repeated taking and releasing of the lru_lock spinlock for isolating pages one by one. This patch batches the munlock operations using an on-stack pagevec, so that isolation is done under single lru_lock. For THP pages, the old behavior is preserved as they might be split while putting them into the pagevec. After this patch, a 9% speedup was measured for munlocking a 56GB large memory area with THP disabled. A new function __munlock_pagevec() is introduced that takes a pagevec and: 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats can be also updated using the variant which assumes disabled interrupts. 2) It finishes the munlock and lru putback on all pages under their lock_page. Note that previously, lock_page covered also the PageMlocked clearing and page isolation, but it is not needed for those operations. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: munlock: remove unnecessary call to lru_add_drain()Vlastimil Babka
In munlock_vma_range(), lru_add_drain() is currently called in a loop before each munlock_vma_page() call. This is suboptimal for performance when munlocking many pages. The benefits of per-cpu pagevec for batching the LRU putback are removed since the pagevec only holds at most one page from the previous loop's iteration. The lru_add_drain() call also does not serve any purposes for correctness - it does not even drain pagavecs of all cpu's. The munlock code already expects and handles situations where a page cannot be isolated from the LRU (e.g. because it is on some per-cpu pagevec). The history of the (not commented) call also suggest that it appears there as an oversight rather than intentionally. Before commit ff6a6da6 ("mm: accelerate munlock() treatment of THP pages") the call happened only once upon entering the function. The commit has moved the call into the while loope. So while the other changes in the commit improved munlock performance for THP pages, it introduced the abovementioned suboptimal per-cpu pagevec usage. Further in history, before commit 408e82b7 ("mm: munlock use follow_page"), munlock_vma_pages_range() was just a wrapper around __mlock_vma_pages_range which performed both mlock and munlock depending on a flag. However, before ba470de4 ("mmap: handle mlocked pages during map, remap, unmap") the function handled only mlock, not munlock. The lru_add_drain call thus comes from the implementation in commit b291f000 ("mlock: mlocked pages are unevictable" and was intended only for mlocking, not munlocking. The original intention of draining the LRU pagevec at mlock time was to ensure the pages were on the LRU before the lock operation so that they could be placed on the unevictable list immediately. There is very little motivation to do the same in the munlock path this, particularly for every single page. This patch therefore removes the call completely. After removing the call, a 10% speedup was measured for munlock() of a 56GB large memory area with THP disabled. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Jörn Engel <joern@logfs.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-28Revert "mm: introduce VM_POPULATE flag to better deal with racy userspace ↵Michel Lespinasse
programs" This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to better deal with racy userspace programs"). VM_POPULATE only has any effect when userspace plays racy games with vmas by trying to unmap and remap memory regions that mmap or mlock are operating on. Also, the only effect of VM_POPULATE when userspace plays such games is that it avoids populating new memory regions that get remapped into the address range that was being operated on by the original mmap or mlock calls. Let's remove VM_POPULATE as there isn't any strong argument to mandate a new vm_flag. Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27mm: accelerate munlock() treatment of THP pagesMichel Lespinasse
munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE at a time. When munlocking THP pages (or the huge zero page), this resulted in taking the mm->page_table_lock 512 times in a row. We can do better by making use of the page_mask returned by follow_page_mask (for the huge zero page case), or the size of the page munlock_vma_page() operated on (for the true THP page case). Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: use long type for page counts in mm_populate() and get_user_pages()Michel Lespinasse
Use long type for page counts in mm_populate() so as to avoid integer overflow when running the following test code: int main(void) { void *p = mmap(NULL, 0x100000000000, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0); printf("p: %p\n", p); mlockall(MCL_CURRENT); printf("done\n"); return 0; } Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm/mlock.c: document scary-looking stack expansion mlock chainJohannes Weiner
The fact that mlock calls get_user_pages, and get_user_pages might call mlock when expanding a stack looks like a potential recursion. However, mlock makes sure the requested range is already contained within a vma, so no stack expansion will actually happen from mlock. Should this ever change: the stack expansion mlocks only the newly expanded range and so will not result in recursive expansion. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: introduce VM_POPULATE flag to better deal with racy userspace programsMichel Lespinasse
The vm_populate() code populates user mappings without constantly holding the mmap_sem. This makes it susceptible to racy userspace programs: the user mappings may change while vm_populate() is running, and in this case vm_populate() may end up populating the new mapping instead of the old one. In order to reduce the possibility of userspace getting surprised by this behavior, this change introduces the VM_POPULATE vma flag which gets set on vmas we want vm_populate() to work on. This way vm_populate() may still end up populating the new mapping after such a race, but only if the new mapping is also one that the user has requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: directly use __mlock_vma_pages_range() in find_extend_vma()Michel Lespinasse
In find_extend_vma(), we don't need mlock_vma_pages_range() to verify the vma type - we know we're working with a stack. So, we can call directly into __mlock_vma_pages_range(), and remove the last make_pages_present() call site. Note that we don't use mm_populate() here, so we can't release the mmap_sem while allocating new stack pages. This is deemed acceptable, because the stack vmas grow by a bounded number of pages at a time, and these are anon pages so we don't have to read from disk to populate them. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23mm: introduce mm_populate() for populating new vmasMichel Lespinasse
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags (or with MCL_FUTURE in effect), we want to populate the pages within the newly created vmas. This may take a while as we may have to read pages from disk, so ideally we want to do this outside of the write-locked mmap_sem region. This change introduces mm_populate(), which is used to defer populating such mappings until after the mmap_sem write lock has been released. This is implemented as a generalization of the former do_mlock_pages(), which accomplished the same task but was using during mlock() / mlockall(). Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-12mm: don't overwrite mm->def_flags in do_mlockall()Gerald Schaefer
With commit 8e72033f2a48 ("thp: make MADV_HUGEPAGE check for mm->def_flags") the VM_NOHUGEPAGE flag may be set on s390 in mm->def_flags for certain processes, to prevent future thp mappings. This would be overwritten by do_mlockall(), which sets it back to 0 with an optional VM_LOCKED flag set. To fix this, instead of overwriting mm->def_flags in do_mlockall(), only the VM_LOCKED flag should be set or cleared. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Reported-by: Vivek Goyal <vgoyal@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm, thp: fix mlock statisticsDavid Rientjes
NR_MLOCK is only accounted in single page units: there's no logic to handle transparent hugepages. This patch checks the appropriate number of pages to adjust the statistics by so that the correct amount of memory is reflected. Currently: $ grep Mlocked /proc/meminfo Mlocked: 19636 kB #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 29844 kB munlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 19636 kB And with this patch: $ grep Mlock /proc/meminfo Mlocked: 19636 kB mlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 4213664 kB munlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 19636 kB Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Hugh Dickens <hughd@google.com> Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: use clear_page_mlock() in page_remove_rmap()Hugh Dickins
We had thought that pages could no longer get freed while still marked as mlocked; but Johannes Weiner posted this program to demonstrate that truncating an mlocked private file mapping containing COWed pages is still mishandled: #include <sys/types.h> #include <sys/mman.h> #include <sys/stat.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <stdio.h> int main(void) { char *map; int fd; system("grep mlockfreed /proc/vmstat"); fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR); unlink("chigurh"); ftruncate(fd, 4096); map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0); map[0] = 11; mlock(map, sizeof(fd)); ftruncate(fd, 0); close(fd); munlock(map, sizeof(fd)); munmap(map, 4096); system("grep mlockfreed /proc/vmstat"); return 0; } The anon COWed pages are not caught by truncation's clear_page_mlock() of the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to look out for them there in page_remove_rmap(). Indeed, why should truncation or invalidation be doing the clear_page_mlock() when removing from pagecache? mlock is a property of mapping in userspace, not a property of pagecache: an mlocked unmapped page is nonsensical. Reported-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09mm: kill vma flag VM_RESERVED and mm->reserved_vm counterKonstantin Khlebnikov
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: | effect | alternative flags -+------------------------+--------------------------------------------- 1| account as reserved_vm | VM_IO 2| skip in core dump | VM_IO, VM_DONTDUMP 3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-06vm: avoid using find_vma_prev() unnecessarilyLinus Torvalds
Several users of "find_vma_prev()" were not in fact interested in the previous vma if there was no primary vma to be found either. And in those cases, we're much better off just using the regular "find_vma()", and then "prev" can be looked up by just checking vma->vm_prev. The find_vma_prev() semantics are fairly subtle (see Mikulas' recent commit 83cd904d271b: "mm: fix find_vma_prev"), and the whole "return prev by reference" means that it generates worse code too. Thus this "let's avoid using this inconvenient and clearly too subtle interface when we don't really have to" patch. Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-10-31mm: munlock use mapcount to avoid terrible overheadHugh Dickins
A process spent 30 minutes exiting, just munlocking the pages of a large anonymous area that had been alternately mprotected into page-sized vmas: for every single page there's an anon_vma walk through all the other little vmas to find the right one. A general fix to that would be a lot more complicated (use prio_tree on anon_vma?), but there's one very simple thing we can do to speed up the common case: if a page to be munlocked is mapped only once, then it is our vma that it is mapped into, and there's no need whatever to walk through all the others. Okay, there is a very remote race in munlock_vma_pages_range(), if between its follow_page() and lock_page(), another process were to munlock the same page, then page reclaim remove it from our vma, then another process mlock it again. We would find it with page_mapcount 1, yet it's still mlocked in another process. But never mind, that's much less likely than the down_read_trylock() failure which munlocking already tolerates (in try_to_unmap_one()): in due course page reclaim will discover and move the page to unevictable instead. [akpm@linux-foundation.org: add comment] Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: do not drain pagevecs for mlockall(MCL_FUTURE)Christoph Lameter
MCL_FUTURE does not move pages between lru list and draining the LRU per cpu pagevecs is a nasty activity. Avoid doing it unecessarily. Signed-off-by: Christoph Lameter <cl@gentwo.org> Cc: David Rientjes <rientjes@google.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31mm: Map most files to use export.h instead of module.hPaul Gortmaker
The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>