path: root/mm/slab.c
Age         Commit message                                          Author
2019-04-30  Merge 4.4.179 into android-4.4  (Greg Kroah-Hartman)
Changes in 4.4.179
    arm64: debug: Don't propagate UNKNOWN FAR into si_code for debug signals
    arm64: debug: Ensure debug handlers check triggering exception level
    ext4: cleanup bh release code in ext4_ind_remove_space()
    lib/int_sqrt: optimize initial value compute
    tty/serial: atmel: Add is_half_duplex helper
    mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified
    i2c: core-smbus: prevent stack corruption on read I2C_BLOCK_DATA
    Bluetooth: Fix decrementing reference count twice in releasing socket
    tty/serial: atmel: RS485 HD w/DMA: enable RX after TX is stopped
    CIFS: fix POSIX lock leak and invalid ptr deref
    h8300: use cc-cross-prefix instead of hardcoding h8300-unknown-linux-
    tracing: kdb: Fix ftdump to not sleep
    gpio: gpio-omap: fix level interrupt idling
    sysctl: handle overflow for file-max
    enic: fix build warning without CONFIG_CPUMASK_OFFSTACK
    mm/cma.c: cma_declare_contiguous: correct err handling
    mm/page_ext.c: fix an imbalance with kmemleak
    mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
    mm/slab.c: kmemleak no scan alien caches
    ocfs2: fix a panic problem caused by o2cb_ctl
    f2fs: do not use mutex lock in atomic context
    fs/file.c: initialize init_files.resize_wait
    cifs: use correct format characters
    dm thin: add sanity checks to thin-pool and external snapshot creation
    cifs: Fix NULL pointer dereference of devname
    fs: fix guard_bio_eod to check for real EOD errors
    tools lib traceevent: Fix buffer overflow in arg_eval
    usb: chipidea: Grab the (legacy) USB PHY by phandle first
    scsi: core: replace GFP_ATOMIC with GFP_KERNEL in scsi_scan.c
    coresight: etm4x: Add support to enable ETMv4.2
    ARM: 8840/1: use a raw_spinlock_t in unwind
    mmc: omap: fix the maximum timeout setting
    e1000e: Fix -Wformat-truncation warnings
    IB/mlx4: Increase the timeout for CM cache
    scsi: megaraid_sas: return error when create DMA pool failed
    perf test: Fix failure of 'evsel-tp-sched' test on s390
    SoC: imx-sgtl5000: add missing put_device()
    media: sh_veu: Correct return type for mem2mem buffer helpers
    media: s5p-jpeg: Correct return type for mem2mem buffer helpers
    media: s5p-g2d: Correct return type for mem2mem buffer helpers
    media: mx2_emmaprp: Correct return type for mem2mem buffer helpers
    leds: lp55xx: fix null deref on firmware load failure
    kprobes: Prohibit probing on bsearch()
    ARM: 8833/1: Ensure that NEON code always compiles with Clang
    ALSA: PCM: check if ops are defined before suspending PCM
    bcache: fix input overflow to cache set sysfs file io_error_halflife
    bcache: fix input overflow to sequential_cutoff
    bcache: improve sysfs_strtoul_clamp()
    fbdev: fbmem: fix memory access if logo is bigger than the screen
    cdrom: Fix race condition in cdrom_sysctl_register
    ASoC: fsl-asoc-card: fix object reference leaks in fsl_asoc_card_probe
    soc: qcom: gsbi: Fix error handling in gsbi_probe()
    mt7601u: bump supported EEPROM version
    ARM: avoid Cortex-A9 livelock on tight dmb loops
    tty: increase the default flip buffer limit to 2*640K
    media: mt9m111: set initial frame size other than 0x0
    hwrng: virtio - Avoid repeated init of completion
    soc/tegra: fuse: Fix illegal free of IO base address
    hpet: Fix missing '=' character in the __setup() code of hpet_mmap_enable
    dmaengine: imx-dma: fix warning comparison of distinct pointer types
    netfilter: physdev: relax br_netfilter dependency
    media: s5p-jpeg: Check for fmt_ver_flag when doing fmt enumeration
    regulator: act8865: Fix act8600_sudcdc_voltage_ranges setting
    wlcore: Fix memory leak in case wl12xx_fetch_firmware failure
    x86/build: Mark per-CPU symbols as absolute explicitly for LLD
    dmaengine: tegra: avoid overflow of byte tracking
    drm/dp/mst: Configure no_stop_bit correctly for remote i2c xfers
    binfmt_elf: switch to new creds when switching to new mm
    kbuild: clang: choose GCC_TOOLCHAIN_DIR not on LD
    x86/build: Specify elf_i386 linker emulation explicitly for i386 objects
    x86: vdso: Use $LD instead of $CC to link
    x86/vdso: Drop implicit common-page-size linker flag
    lib/string.c: implement a basic bcmp
    tty: mark Siemens R3964 line discipline as BROKEN
    tty: ldisc: add sysctl to prevent autoloading of ldiscs
    ipv6: Fix dangling pointer when ipv6 fragment
    ipv6: sit: reset ip header pointer in ipip6_rcv
    net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().
    openvswitch: fix flow actions reallocation
    qmi_wwan: add Olicard 600
    sctp: initialize _pad of sockaddr_in before copying to user memory
    tcp: Ensure DCTCP reacts to losses
    netns: provide pure entropy for net_hash_mix()
    net: ethtool: not call vzalloc for zero sized memory request
    ip6_tunnel: Match to ARPHRD_TUNNEL6 for dev type
    ALSA: seq: Fix OOB-reads from strlcpy
    include/linux/bitrev.h: fix constant bitrev
    ASoC: fsl_esai: fix channel swap issue when stream starts
    block: do not leak memory in bio_copy_user_iov()
    genirq: Respect IRQCHIP_SKIP_SET_WAKE in irq_chip_set_wake_parent()
    ARM: dts: at91: Fix typo in ISC_D0 on PC9
    arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
    xen: Prevent buffer overflow in privcmd ioctl
    sched/fair: Do not re-read ->h_load_next during hierarchical load calculation
    xtensa: fix return_address
    PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
    perf/core: Restore mmap record type correctly
    ext4: add missing brelse() in add_new_gdb_meta_bg()
    ext4: report real fs size after failed resize
    ALSA: echoaudio: add a check for ioremap_nocache
    ALSA: sb8: add a check for request_region
    IB/mlx4: Fix race condition between catas error reset and aliasguid flows
    mmc: davinci: remove extraneous __init annotation
    ALSA: opl3: fix mismatch between snd_opl3_drum_switch definition and declaration
    thermal/int340x_thermal: Add additional UUIDs
    thermal/int340x_thermal: fix mode setting
    tools/power turbostat: return the exit status of a command
    perf top: Fix error handling in cmd_top()
    perf evsel: Free evsel->counts in perf_evsel__exit()
    perf tests: Fix a memory leak of cpu_map object in the openat_syscall_event_on_all_cpus test
    perf tests: Fix a memory leak in test__perf_evsel__tp_sched_test()
    x86/hpet: Prevent potential NULL pointer dereference
    x86/cpu/cyrix: Use correct macros for Cyrix calls on Geode processors
    iommu/vt-d: Check capability before disabling protected memory
    x86/hw_breakpoints: Make default case in hw_breakpoint_arch_parse() return an error
    fix incorrect error code mapping for OBJECTID_NOT_FOUND
    ext4: prohibit fstrim in norecovery mode
    rsi: improve kernel thread handling to fix kernel panic
    9p: do not trust pdu content for stat item size
    9p locks: add mount option for lock retry interval
    f2fs: fix to do sanity check with current segment number
    serial: uartps: console_setup() can't be placed to init section
    ARM: samsung: Limit SAMSUNG_PM_CHECK config option to non-Exynos platforms
    ACPI / SBS: Fix GPE storm on recent MacBookPro's
    cifs: fallback to older infolevels on findfirst queryinfo retry
    crypto: sha256/arm - fix crash bug in Thumb2 build
    crypto: sha512/arm - fix crash bug in Thumb2 build
    iommu/dmar: Fix buffer overflow during PCI bus notification
    ARM: 8839/1: kprobe: make patch_lock a raw_spinlock_t
    appletalk: Fix use-after-free in atalk_proc_exit
    lib/div64.c: off by one in shift
    include/linux/swap.h: use offsetof() instead of custom __swapoffset macro
    tpm/tpm_crb: Avoid unaligned reads in crb_recv()
    ovl: fix uid/gid when creating over whiteout
    appletalk: Fix compile regression
    bonding: fix event handling for stacked bonds
    net: atm: Fix potential Spectre v1 vulnerabilities
    net: bridge: multicast: use rcu to access port list from br_multicast_start_querier
    net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recv
    tcp: tcp_grow_window() needs to respect tcp_space()
    ipv4: recompile ip options in ipv4_link_failure
    ipv4: ensure rcu_read_lock() in ipv4_link_failure()
    crypto: crypto4xx - properly set IV after de- and encrypt
    modpost: file2alias: go back to simple devtable lookup
    modpost: file2alias: check prototype of handler
    tpm/tpm_i2c_atmel: Return -E2BIG when the transfer is incomplete
    KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU
    iio/gyro/bmg160: Use millidegrees for temperature scale
    iio: ad_sigma_delta: select channel when reading register
    iio: adc: at91: disable adc channel interrupt in timeout case
    io: accel: kxcjk1013: restore the range after resume.
    staging: comedi: vmk80xx: Fix use of uninitialized semaphore
    staging: comedi: vmk80xx: Fix possible double-free of ->usb_rx_buf
    staging: comedi: ni_usb6501: Fix use of uninitialized mutex
    staging: comedi: ni_usb6501: Fix possible double-free of ->usb_rx_buf
    ALSA: core: Fix card races between register and disconnect
    crypto: x86/poly1305 - fix overflow during partial reduction
    arm64: futex: Restore oldval initialization to work around buggy compilers
    x86/kprobes: Verify stack frame on kretprobe
    kprobes: Mark ftrace mcount handler functions nokprobe
    kprobes: Fix error check when reusing optimized probes
    mac80211: do not call driver wake_tx_queue op during reconfig
    Revert "kbuild: use -Oz instead of -Os when using clang"
    sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
    device_cgroup: fix RCU imbalance in error case
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    ALSA: info: Fix racy addition/deletion of nodes
    Revert "locking/lockdep: Add debug_locks check in __lock_downgrade()"
    kernel/sysctl.c: fix out-of-bounds access when setting file-max
Linux 4.4.179

Change-Id: Ib81a248d73ba7504649be93bd6882b290e548882
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-04-27  mm/slab.c: kmemleak no scan alien caches  (Qian Cai)
[ Upstream commit 92d1d07daad65c300c7d0b68bbef8867e9895d54 ]

Kmemleak throws endless warnings during boot due to this sequence in __alloc_alien_cache():

    alc = kmalloc_node(memsize, gfp, node);
    init_arraycache(&alc->ac, entries, batch);
        kmemleak_no_scan(ac);

Kmemleak does not track the array cache (alc->ac) but the alien cache (alc) instead, so let it track the latter by lifting kmemleak_no_scan() out of init_arraycache().

There is another place that calls init_arraycache(), but alloc_kmem_cache_cpus() uses the percpu allocation, which will never be considered a leak.

  kmemleak: Found object by alias at 0xffff8007b9aa7e38
  CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
  Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    lookup_object+0x84/0xac
    find_and_get_object+0x84/0xe4
    kmemleak_no_scan+0x74/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18
  kmemleak: Object 0xffff8007b9aa7e00 (size 256):
  kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
  kmemleak:   min_count = 1
  kmemleak:   count = 0
  kmemleak:   flags = 0x1
  kmemleak:   checksum = 0
  kmemleak:   backtrace:
    kmemleak_alloc+0x84/0xb8
    kmem_cache_alloc_node_trace+0x31c/0x3a0
    __kmalloc_node+0x58/0x78
    setup_kmem_cache_node+0x26c/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
  kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
  CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
  Call trace:
    dump_backtrace+0x0/0x168
    show_stack+0x24/0x30
    dump_stack+0x88/0xb0
    kmemleak_no_scan+0x90/0xf4
    setup_kmem_cache_node+0x2b4/0x35c
    __do_tune_cpucache+0x250/0x2d4
    do_tune_cpucache+0x4c/0xe4
    enable_cpucache+0xc8/0x110
    setup_cpu_cache+0x40/0x1b8
    __kmem_cache_create+0x240/0x358
    create_cache+0xc0/0x198
    kmem_cache_create_usercopy+0x158/0x20c
    kmem_cache_create+0x50/0x64
    fsnotify_init+0x58/0x6c
    do_one_initcall+0x194/0x388
    kernel_init_freeable+0x668/0x688
    kernel_init+0x18/0x124
    ret_from_fork+0x10/0x18

Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
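A condensed sketch of the shape of the fix (struct details and error paths elided; not the verbatim diff):

    static struct alien_cache *__alloc_alien_cache(int node, int entries,
                                                   int batch, gfp_t gfp)
    {
            size_t memsize = sizeof(void *) * entries + sizeof(struct alien_cache);
            struct alien_cache *alc;

            alc = kmalloc_node(memsize, gfp, node);
            if (alc) {
                    kmemleak_no_scan(alc);  /* lifted out of init_arraycache() */
                    init_arraycache(&alc->ac, entries, batch);
                    spin_lock_init(&alc->lock);
            }
            return alc;
    }

The key point is that kmemleak must be told about the pointer it actually tracks, the alien_cache allocation itself, not the array_cache embedded inside it.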
2019-01-16  Merge 4.4.171 into android-4.4  (Greg Kroah-Hartman)
Changes in 4.4.171
    ALSA: hda/realtek - Disable headset Mic VREF for headset mode of ALC225
    btrfs: cleanup, stop casting for extent_map->lookup everywhere
    btrfs: Enhance chunk validation check
    Btrfs: add validadtion checks for chunk loading
    Btrfs: check inconsistence between chunk and block group
    Btrfs: fix em leak in find_first_block_group
    Btrfs: detect corruption when non-root leaf has zero item
    Btrfs: check btree node's nritems
    Btrfs: fix BUG_ON in btrfs_mark_buffer_dirty
    Btrfs: memset to avoid stale content in btree node block
    Btrfs: improve check_node to avoid reading corrupted nodes
    Btrfs: kill BUG_ON in run_delayed_tree_ref
    Btrfs: memset to avoid stale content in btree leaf
    Btrfs: fix emptiness check for dirtied extent buffers at check_leaf()
    btrfs: struct-funcs, constify readers
    btrfs: Refactor check_leaf function for later expansion
    btrfs: Check if item pointer overlaps with the item itself
    btrfs: Add sanity check for EXTENT_DATA when reading out leaf
    btrfs: Add checker for EXTENT_CSUM
    btrfs: Move leaf and node validation checker to tree-checker.c
    btrfs: tree-checker: Enhance btrfs_check_node output
    btrfs: tree-checker: Fix false panic for sanity test
    btrfs: tree-checker: Add checker for dir item
    btrfs: tree-checker: use %zu format string for size_t
    btrfs: tree-check: reduce stack consumption in check_dir_item
    btrfs: tree-checker: Verify block_group_item
    btrfs: tree-checker: Detect invalid and empty essential trees
    btrfs: validate type when reading a chunk
    btrfs: Check that each block group has corresponding chunk at mount time
    btrfs: Verify that every chunk has corresponding block group at mount time
    btrfs: tree-checker: Check level for leaves and nodes
    btrfs: tree-checker: Fix misleading group system information
    CIFS: Do not hide EINTR after sending network packets
    cifs: Fix potential OOB access of lock element array
    usb: cdc-acm: send ZLP for Telit 3G Intel based modems
    USB: storage: don't insert sane sense for SPC3+ when bad sense specified
    USB: storage: add quirk for SMI SM3350
    USB: Add USB_QUIRK_DELAY_CTRL_MSG quirk for Corsair K70 RGB
    slab: alien caches must not be initialized if the allocation of the alien cache failed
    PCI: altera: Fix altera_pcie_link_is_up()
    PCI: altera: Reorder read/write functions
    PCI: altera: Check link status before retrain link
    PCI: altera: Poll for link up status after retraining the link
    PCI: altera: Poll for link training status after retraining the link
    PCI: altera: Rework config accessors for use without a struct pci_bus
    PCI: altera: Move retrain from fixup to altera_pcie_host_init()
    ACPI: power: Skip duplicate power resource references in _PRx
    i2c: dev: prevent adapter retries and timeout being set as minus value
    crypto: cts - fix crash on short inputs
    ext4: fix a potential fiemap/page fault deadlock w/ inline_data
    sunrpc: use-after-free in svc_process_common()
Linux 4.4.171

Change-Id: If59c94897d4f135b24d45772a7db116503695ba7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2019-01-16  slab: alien caches must not be initialized if the allocation of the alien cache failed  (Christoph Lameter)

commit 09c2e76ed734a1d36470d257a778aaba28e86531 upstream.

Callers of __alloc_alien() check for NULL. We must do the same check in __alloc_alien_cache to avoid NULL pointer dereferences on allocation failures.

Link: http://lkml.kernel.org/r/010001680f42f192-82b4e12e-1565-4ee0-ae1f-1e98974906aa-000000@email.amazonses.com
Fixes: 49dfc304ba241 ("slab: use the lock on alien_cache, instead of the lock on array_cache")
Fixes: c8522a3a5832b ("Slab: introduce alloc_alien")
Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-by: syzbot+d6ed4ec679652b4fd4e4@syzkaller.appspotmail.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
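A condensed sketch of the guarded path (surrounding code elided):

    alc = kmalloc_node(memsize, gfp, node);
    if (alc) {                      /* previously initialised unconditionally */
            init_arraycache(&alc->ac, entries, batch);
            spin_lock_init(&alc->lock);
    }
    return alc;

Returning NULL here is fine because the caller already handles it; the bug was only that the initialisation ran before the check.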
2018-04-24  Merge 4.4.129 into android-4.4  (Greg Kroah-Hartman)
Changes in 4.4.129
    media: v4l2-compat-ioctl32: don't oops on overlay
    parisc: Fix out of array access in match_pci_device()
    perf intel-pt: Fix overlap detection to identify consecutive buffers correctly
    perf intel-pt: Fix sync_switch
    perf intel-pt: Fix error recovery from missing TIP packet
    perf intel-pt: Fix timestamp following overflow
    radeon: hide pointless #warning when compile testing
    Revert "perf tests: Decompress kernel module before objdump"
    block/loop: fix deadlock after loop_set_status
    s390/qdio: don't retry EQBS after CCQ 96
    s390/qdio: don't merge ERROR output buffers
    s390/ipl: ensure loadparm valid flag is set
    getname_kernel() needs to make sure that ->name != ->iname in long case
    rtl8187: Fix NULL pointer dereference in priv->conf_mutex
    hwmon: (ina2xx) Fix access to uninitialized mutex
    cdc_ether: flag the Cinterion AHS8 modem by gemalto as WWAN
    slip: Check if rstate is initialized before uncompressing
    lan78xx: Correctly indicate invalid OTP
    x86/hweight: Get rid of the special calling convention
    x86/hweight: Don't clobber %rdi
    tty: make n_tty_read() always abort if hangup is in progress
    ubifs: Check ubifs_wbuf_sync() return code
    ubi: fastmap: Don't flush fastmap work on detach
    ubi: Fix error for write access
    ubi: Reject MLC NAND
    fs/reiserfs/journal.c: add missing resierfs_warning() arg
    resource: fix integer overflow at reallocation
    ipc/shm: fix use-after-free of shm file via remap_file_pages()
    mm, slab: reschedule cache_reap() on the same CPU
    usb: musb: gadget: misplaced out of bounds check
    ARM: dts: at91: at91sam9g25: fix mux-mask pinctrl property
    ARM: dts: at91: sama5d4: fix pinctrl compatible string
    xen-netfront: Fix hang on device removal
    regmap: Fix reversed bounds check in regmap_raw_write()
    ACPI / video: Add quirk to force acpi-video backlight on Samsung 670Z5E
    ACPI / hotplug / PCI: Check presence of slot itself in get_slot_status()
    USB:fix USB3 devices behind USB3 hubs not resuming at hibernate thaw
    usb: dwc3: pci: Properly cleanup resource
    HID: i2c-hid: fix size check and type usage
    powerpc/powernv: Handle unknown OPAL errors in opal_nvram_write()
    powerpc/64: Fix smp_wmb barrier definition use use lwsync consistently
    powerpc/powernv: define a standard delay for OPAL_BUSY type retry loops
    powerpc/powernv: Fix OPAL NVRAM driver OPAL_BUSY loops
    HID: Fix hid_report_len usage
    HID: core: Fix size as type u32
    ASoC: ssm2602: Replace reg_default_raw with reg_default
    thunderbolt: Resume control channel after hibernation image is created
    random: use a tighter cap in credit_entropy_bits_safe()
    jbd2: if the journal is aborted then don't allow update of the log tail
    ext4: don't update checksum of new initialized bitmaps
    ext4: fail ext4_iget for root directory if unallocated
    RDMA/ucma: Don't allow setting RDMA_OPTION_IB_PATH without an RDMA device
    ALSA: pcm: Fix UAF at PCM release via PCM timer access
    IB/srp: Fix srp_abort()
    IB/srp: Fix completion vector assignment algorithm
    dmaengine: at_xdmac: fix rare residue corruption
    um: Use POSIX ucontext_t instead of struct ucontext
    iommu/vt-d: Fix a potential memory leak
    mmc: jz4740: Fix race condition in IRQ mask update
    clk: mvebu: armada-38x: add support for 1866MHz variants
    clk: mvebu: armada-38x: add support for missing clocks
    clk: bcm2835: De-assert/assert PLL reset signal when appropriate
    thermal: imx: Fix race condition in imx_thermal_probe()
    watchdog: f71808e_wdt: Fix WD_EN register read
    ALSA: oss: consolidate kmalloc/memset 0 call to kzalloc
    ALSA: pcm: Use ERESTARTSYS instead of EINTR in OSS emulation
    ALSA: pcm: Avoid potential races between OSS ioctls and read/write
    ALSA: pcm: Return -EBUSY for OSS ioctls changing busy streams
    ALSA: pcm: Fix mutex unbalance in OSS emulation ioctls
    ALSA: pcm: Fix endless loop for XRUN recovery in OSS emulation
    vfio-pci: Virtualize PCIe & AF FLR
    vfio/pci: Virtualize Maximum Payload Size
    vfio/pci: Virtualize Maximum Read Request Size
    ext4: don't allow r/w mounts if metadata blocks overlap the superblock
    drm/radeon: Fix PCIe lane width calculation
    ext4: fix crashes in dioread_nolock mode
    ext4: fix deadlock between inline_data and ext4_expand_extra_isize_ea()
    ALSA: line6: Use correct endpoint type for midi output
    ALSA: rawmidi: Fix missing input substream checks in compat ioctls
    ALSA: hda - New VIA controller suppor no-snoop path
    HID: hidraw: Fix crash on HIDIOCGFEATURE with a destroyed device
    MIPS: uaccess: Add micromips clobbers to bzero invocation
    MIPS: memset.S: EVA & fault support for small_memset
    MIPS: memset.S: Fix return of __clear_user from Lpartial_fixup
    MIPS: memset.S: Fix clobber of v1 in last_fixup
    powerpc/eeh: Fix enabling bridge MMIO windows
    powerpc/lib: Fix off-by-one in alternate feature patching
    jffs2_kill_sb(): deal with failed allocations
    hypfs_kill_super(): deal with failed allocations
    rpc_pipefs: fix double-dput()
    Don't leak MNT_INTERNAL away from internal mounts
    autofs: mount point create should honour passed in mode
    mm: allow GFP_{FS,IO} for page_cache_read page cache allocation
    mm/filemap.c: fix NULL pointer in page_cache_tree_insert()
    ext4: bugfix for mmaped pages in mpage_release_unused_pages()
    fanotify: fix logic of events on child
    writeback: safer lock nesting
Linux 4.4.129

Change-Id: I8806d2cc92fe512f27a349e8f630ced0cac9a8d7
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
2018-04-24  mm, slab: reschedule cache_reap() on the same CPU  (Vlastimil Babka)
commit a9f2a846f0503e7d729f552e3ccfe2279010fe94 upstream.

cache_reap() is initially scheduled in start_cpu_timer() via schedule_delayed_work_on(). But then the next iterations are scheduled via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND. Thus, since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs"), there is no guarantee the future iterations will run on the originally intended CPU, although it is still preferred. I was able to demonstrate this with /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also happen due to migrating timers in nohz context. As a result, some CPUs would be calling cache_reap() more frequently and others never.

This patch uses schedule_delayed_work_on() with the current CPU when scheduling the next iteration.

Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Pekka Enberg <penberg@kernel.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
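The change, shown in context as a simplified sketch of the re-arming tail of cache_reap() (reaping logic elided):

    static void cache_reap(struct work_struct *w)
    {
            struct delayed_work *work = to_delayed_work(w);

            /* ... reap this CPU's array caches and the shared cache ... */

            /* Re-arm pinned to the CPU we are running on, instead of
             * schedule_delayed_work(), which uses WORK_CPU_UNBOUND. */
            schedule_delayed_work_on(smp_processor_id(), work,
                                     round_jiffies_relative(REAPTIMEOUT_AC));
    }

Pinning each iteration to the current CPU keeps the per-CPU reap work evenly distributed even when the unbound-workqueue mask or nohz timer migration would otherwise move it.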
2017-12-14  BACKPORT: mm: coalesce split strings  (Joe Perches)
Kernel style prefers a single string over split strings when the string is 'user-visible'.

Miscellanea:
 - Add a missing newline
 - Realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>  [percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from 756a025f00091918d9d09ca3229defb160b409c0)
Change-Id: I377fb1542980c15d2f306924656227ad17b02b5e
Signed-off-by: Paul Lawrence <paullawrence@google.com>
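A small illustration of the rule, using a hypothetical warning message (not a specific site from the patch):

    /* Split string: grep for "duplicate cache" will never find it. */
    pr_err("kmem_cache_create: duplicate"
           " cache %s\n", name);

    /* Coalesced: one user-visible string, even if it exceeds 80 columns. */
    pr_err("kmem_cache_create: duplicate cache %s\n", name);

The point is greppability: users paste the printed message into a search, so the literal must exist in one piece in the source.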
2017-12-14  BACKPORT: mm/kasan: get rid of ->state in struct kasan_alloc_meta  (Andrey Ryabinin)
The state of an object is currently tracked in two places: shadow memory and the ->state field in struct kasan_alloc_meta. We can get rid of the latter, which saves us a little bit of memory. It also allows us to move the free stack into struct kasan_alloc_meta without increasing memory consumption, so we will always know when the object was last freed. This may be useful for long-delayed use-after-free bugs.

As a side effect this fixes the following UBSAN warning:

  UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
  member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
  which requires 8 byte alignment

Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from b3cbd9bf77cd1888114dbee1653e79aa23fd4068)
Change-Id: Iaa4959a78ffd2e49f9060099df1fb32483df3085
Signed-off-by: Paul Lawrence <paullawrence@google.com>
2017-12-14  UPSTREAM: mm: kasan: initial memory quarantine implementation  (Alexander Potapenko)
Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors.

When the object is freed, its state changes from KASAN_STATE_ALLOC to KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine instead of being returned to the allocator, therefore every subsequent access to that object triggers a KASAN error, and the error handler is able to say where the object has been allocated and deallocated. When it's time for the object to leave quarantine, its state becomes KASAN_STATE_FREE and it's returned to the allocator. From now on the allocator may reuse it for another allocation. Before that happens, it's still possible to detect a use-after-free on that object (it retains the allocation/deallocation stacks).

When the allocator reuses this object, the shadow is unpoisoned and the old allocation/deallocation stacks are wiped. Therefore a use of this object, even an incorrect one, won't trigger an ASan warning. Without the quarantine, it's not guaranteed that the objects aren't reused immediately; that's why the probability of catching a use-after-free is lower than with the quarantine in place.

Freed objects are first added to per-cpu quarantine queues. When a cache is destroyed or memory shrinking is requested, the objects are moved into the global quarantine queue. Whenever a kmalloc call allows memory reclaiming, the oldest objects are popped out of the global queue until the total size of objects in quarantine is less than 3/4 of the maximum quarantine size (which is a fraction of installed physical memory).

As long as an object remains in the quarantine, KASAN is able to report accesses to it, so the chance of reporting a use-after-free is increased. Once the object leaves quarantine, the allocator may reuse it, in which case the object is unpoisoned and KASAN can't detect incorrect accesses to it.

Right now quarantine support is only enabled for the SLAB allocator. Unification of KASAN features in SLAB and SLUB will be done later.

This patch is based on the "mm: kasan: quarantine" patch originally prepared by Dmitry Chernenkov. A number of improvements have been suggested by Andrey Ryabinin.

[glider@google.com: v9]
Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from 55834c59098d0c5a97b0f3247e55832b67facdcf)
Change-Id: Ib808d72a40f2e5137961d93dad540e85f8bbd2c4
Signed-off-by: Paul Lawrence <paullawrence@google.com>
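A toy model of the core idea, not the kernel's actual data structures or locking: freed objects park in a bounded FIFO and are only truly released on eviction, so stale pointers keep hitting poisoned memory in the meantime.

    #include <stddef.h>

    #define QUARANTINE_SLOTS 1024   /* toy capacity; kernel sizes by RAM */

    static void *ring[QUARANTINE_SLOTS];
    static size_t head, count;

    /* Called instead of freeing immediately: park obj and, once the
     * queue is full, hand back the oldest entry for real freeing. */
    static void *quarantine_put(void *obj)
    {
            void *evict = NULL;

            if (count == QUARANTINE_SLOTS) {
                    evict = ring[head];
                    head = (head + 1) % QUARANTINE_SLOTS;
                    count--;
            }
            ring[(head + count) % QUARANTINE_SLOTS] = obj;
            count++;
            return evict;   /* caller really frees this one, if non-NULL */
    }

The real implementation layers per-cpu queues under a global one and drains on reclaim-capable allocations, but the detection property is the same: the longer an object sits parked, the longer stale accesses to it remain catchable.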
2017-12-14  BACKPORT: mm, kasan: add GFP flags to KASAN API  (Alexander Potapenko)
Add GFP flags to KASAN hooks for future patches to use. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 64145065 (cherry-picked from 505f5dcb1c419e55a9621a01f83eb5745d8d7398) Change-Id: I7c5539f59e6969e484a6ff4f104dce2390669cfd Signed-off-by: Paul Lawrence <paullawrence@google.com>
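The shape of the interface change, sketched from the upstream patch (one representative hook; the same extra parameter was added across the KASAN API):

    /* Before: the hook cannot tell what the allocation context permits. */
    /* void kasan_kmalloc(struct kmem_cache *s, const void *object,
                          size_t size); */

    /* After: gfp flags are threaded through, so follow-up patches can,
     * for example, drain the quarantine only from sleepable contexts. */
    void kasan_kmalloc(struct kmem_cache *s, const void *object, size_t size,
                       gfp_t flags);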
2017-12-14  UPSTREAM: mm, kasan: SLAB support  (Alexander Potapenko)
Add KASAN hooks to SLAB allocator. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 64145065 (cherry-picked from 7ed2f9e663854db313f177a511145630e398b402) Change-Id: I131fdafc1c27a25732475f5bbd1653b66954e1b7 Signed-off-by: Paul Lawrence <paullawrence@google.com>
2017-12-14  UPSTREAM: mm/slab: align cache size first before determination of OFF_SLAB candidate  (Joonsoo Kim)

Finding a suitable OFF_SLAB candidate depends on the aligned cache size rather than the original size. The same reasoning applies to the debug_pagealloc candidate. So this patch moves the alignment fixup up to the proper position. From that point on, size is aligned, so we can remove some later alignment fixups.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from 832a15d209cd260180407bde1af18965b21623f3)
Change-Id: I8338d647da4a6eb6402c6fe4e2402b7db45ea5a5
Signed-off-by: Paul Lawrence <paullawrence@google.com>
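A condensed view of the ordering after this patch, inside __kmem_cache_create(); the off-slab condition is shown as it stands in this tree and the exact flag checks vary across the neighbouring patches:

    /* Align first ... */
    size = ALIGN(size, cachep->align);
    if (FREELIST_BYTE_INDEX && size < SLAB_OBJ_MIN_SIZE)
            size = ALIGN(SLAB_OBJ_MIN_SIZE, cachep->align);

    /* ... then judge off-slab candidacy on the aligned size. */
    if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init &&
        !(flags & SLAB_NOLEAKTRACE))
            flags |= CFLGS_OFF_SLAB;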
2017-12-14  UPSTREAM: mm/slab: use more appropriate condition check for debug_pagealloc  (Joonsoo Kim)
debug_pagealloc debugging is related to SLAB_POISON flag rather than FORCED_DEBUG option, although FORCED_DEBUG option will enable SLAB_POISON. Fix it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 64145065 (cherry-picked from 40323278b557a5909bbecfa181c91a3af7afbbe3) Change-Id: I4a7dbd40b6a2eb777439f271fa600b201e6db1d3 Signed-off-by: Paul Lawrence <paullawrence@google.com>
2017-12-14  UPSTREAM: mm/slab: factor out debugging initialization in cache_init_objs()  (Joonsoo Kim)
cache_init_objs() will be changed in a following patch, and its current form doesn't fit that change well. So, before doing that, this patch separates out the debugging initialization. This causes two loop iterations when debugging is enabled, but this overhead is light compared with the debug feature itself, so the effect should not be visible. This patch will greatly simplify the changes to cache_init_objs() in the following patch.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from 10b2e9e8e808bd30e1f4018a36366d07b0abd12f)
Change-Id: I9904974a674f17fc7d57daf0fe351742db67c006
Signed-off-by: Paul Lawrence <paullawrence@google.com>
2017-12-14  UPSTREAM: mm/slab: remove object status buffer for DEBUG_SLAB_LEAK  (Joonsoo Kim)
Now, we don't use object status buffer in any setup. Remove it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 64145065 (cherry-picked from 249247b6f8ee362189a2f2bf598a14ff6c95fb4c) Change-Id: I60b1fb030da5d44aff2627f57dffd9dd7436e511 Signed-off-by: Paul Lawrence <paullawrence@google.com>
2017-12-14  UPSTREAM: mm/slab: alternative implementation for DEBUG_SLAB_LEAK  (Joonsoo Kim)
DEBUG_SLAB_LEAK is a debug option. Its current implementation requires a status buffer, so we need more memory to use it, and it makes the kmem_cache initialization step more complex.

To remove this extra memory usage and to simplify the initialization step, this patch implements the feature another way. When a user requests slab object owner information, we mark that collection has started. Then all free objects in the caches are flushed to their corresponding slab pages. Now we can distinguish all freed objects, and therefore all allocated objects, too. After collecting slab object owner information on the allocated objects, the mark is checked to confirm that no free happened during the processing. If true, we can be sure the information is correct, and it is returned to the user.

Although this way is rather complex, it has the two important benefits mentioned above, so I think it is worth the change. There is one drawback: it takes more time to get slab object owner information. But it is just a debug option, so it doesn't matter at all.

To help review, this patch implements the new way only. The following patch will remove the now-useless code.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from d31676dfde257cb2b3e52d4e657d8ad2251e4d49)
Change-Id: I204ea0dd5553577d17c93f32f0d5a797ba0304af
Signed-off-by: Paul Lawrence <paullawrence@google.com>
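The retry handshake at the heart of this, condensed into a sketch (function names as in the upstream patch; the page walk is elided):

    /* Any concurrent free marks the cache dirty; the reader retries
     * until a full pass completes with no free racing against it. */
    do {
            set_store_user_clean(cachep);
            drain_cpu_caches(cachep);  /* flush free objects to slab pages */

            /* walk every slab page, record the allocation caller of
             * each still-allocated object */
    } while (!is_store_user_clean(cachep));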
2017-12-14  UPSTREAM: mm/slab: clean up DEBUG_PAGEALLOC processing code  (Joonsoo Kim)
Currently, open code for checking DEBUG_PAGEALLOC cache is spread to some sites. It makes code unreadable and hard to change. This patch cleans up this code. The following patch will change the criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too. [akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Bug: 64145065 (cherry-picked from 40b44137971c2e5865a78f9f7de274449983ccb5) Change-Id: I784df4a54b62f77a22f8fa70990387fdf968219f Signed-off-by: Paul Lawrence <paullawrence@google.com>
2017-12-14  UPSTREAM: mm/slab: activate debug_pagealloc in SLAB when it is actually enabled  (Joonsoo Kim)
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Bug: 64145065
(cherry-picked from a307ebd468e0b97c203f5a99a56a6017e4d1991a)
Change-Id: I4407dbf0a5a22ab6f9335219bafc6f28023635a0
Signed-off-by: Paul Lawrence <paullawrence@google.com>
2016-09-06  UPSTREAM: mm: SLAB hardened usercopy support  (Kees Cook)
Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the SLAB allocator to catch any copies that may span objects.

Based on code from PaX and grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Change-Id: Ib910a71fdc2ab808e1a45b6d33e9bae1681a1f4a
(cherry picked from commit 04385fc5e8fffed84425d909a783c0f0c587d847)
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
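The SLAB-side check, lightly condensed from the upstream patch: a copy is allowed only if it falls entirely inside the object the pointer points into.

    const char *__check_heap_object(const void *ptr, unsigned long n,
                                    struct page *page)
    {
            struct kmem_cache *cachep = page->slab_cache;
            unsigned int objnr = obj_to_index(cachep, page, (void *)ptr);
            unsigned long offset;

            BUG_ON(objnr >= cachep->num);

            /* Offset of ptr within its object (past any debug red zone). */
            offset = ptr - index_to_obj(cachep, page, objnr) - obj_offset(cachep);

            if (offset <= cachep->object_size &&
                n <= cachep->object_size - offset)
                    return NULL;            /* copy is safely contained */

            return cachep->name;            /* named in the usercopy report */
    }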
2015-11-22  slab/slub: adjust kmem_cache_alloc_bulk API  (Jesper Dangaard Brouer)
Adjust kmem_cache_alloc_bulk API before we have any real users. Adjust API to return type 'int' instead of previously type 'bool'. This is done to allow future extension of the bulk alloc API. A future extension could be to allow SLUB to stop at a page boundary, when specified by a flag, and then return the number of objects. The advantage of this approach, would make it easier to make bulk alloc run without local IRQs disabled. With an approach of cmpxchg "stealing" the entire c->freelist or page->freelist. To avoid overshooting we would stop processing at a slab-page boundary. Else we always end up returning some objects at the cost of another cmpxchg. To keep compatible with future users of this API linking against an older kernel when using the new flag, we need to return the number of allocated objects with this API change. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
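The signature change, per the commit:

    /* Before: all-or-nothing.
     * bool kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
     *                            size_t nr, void **p); */

    /* After: the count of allocated objects comes back, so a future
     * flag can legitimately return fewer than nr without breaking
     * callers built against this kernel. */
    int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                              size_t nr, void **p);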
2015-11-06  slab, slub: use page->rcu_head instead of page->lru plus cast  (Kirill A. Shutemov)
We have properly typed page->rcu_head, no need to cast page->lru. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
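A condensed before/after of the SLAB free path (kmem_rcu_free is the existing RCU callback in mm/slab.c):

    /* Before: reinterpret the lru list head as an RCU head. */
    head = (void *)&page->lru;
    call_rcu(head, kmem_rcu_free);

    /* After: the properly typed union member is used directly. */
    call_rcu(&page->rcu_head, kmem_rcu_free);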
2015-11-06  mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd  (Mel Gorman)

__GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access to one of two watermarks lower than "min", which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT, leading to a situation where an optimistic allocation with a fallback option can access atomic reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic, cannot sleep, and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim.

This patch then converts a number of sites:

 o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag.

 o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used, as there is a one-entry mempool to guarantee progress.

 o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically can now trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations.

 o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT and were depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases, as other activity will wake kswapd.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
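The redefinition and the recommended helper, condensed from the patch (the __force casts in the real gfp.h headers are omitted here):

    #define __GFP_WAIT (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

    /* Callers asking "may I sleep here?" should use this instead of
     * testing __GFP_WAIT, which now also matches kswapd-only callers. */
    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
            return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }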
2015-11-05  memcg: unify slab and other kmem pages charging  (Vladimir Davydov)
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and uncharging kmem pages to memcg, but currently they are not used for charging slab pages (i.e. they are only used for charging pages allocated with alloc_kmem_pages). The only reason why the slab subsystem uses special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it needs to charge to the memcg of the kmem cache, while memcg_charge_kmem charges to the memcg that the current task belongs to.

To remove this diversity, this patch adds an extra argument to __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is not NULL, the function tries to charge to the memcg it points to, otherwise it charges to the current context. Next, it makes the slab subsystem use this function to charge slab pages.

Since the memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since __memcg_kmem_charge stores a pointer to the memcg in the page struct, we don't need memcg_uncharge_slab anymore and can use free_kmem_pages. Besides, one can now detect which memcg a slab page belongs to by reading /proc/kpagecgroup.

Note, this patch switches slab to a charge-after-alloc design. Since this design is already used for all other memcg charges, it should not make any difference.

[hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05  mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE  (Catalin Marinas)

On systems with a KMALLOC_MIN_SIZE of 128 (arm64, some mips and powerpc configurations defining ARCH_DMA_MINALIGN to 128), the first kmalloc_caches[] entry to be initialised after slab_early_init = 0 is "kmalloc-128" with index 7. Depending on the debug kernel configuration, sizeof(struct kmem_cache) can be larger than 128, resulting in an INDEX_NODE of 8.

Commit 8fc9cf420b36 ("slab: make more slab management structure off the slab") enables off-slab management objects for sizes starting with PAGE_SIZE >> 5 (128 bytes for a 4KB page configuration), and the creation of the "kmalloc-128" cache would try to place the management objects off-slab. However, since KMALLOC_MIN_SIZE is already 128 and freelist_size == 32 in __kmem_cache_create(), kmalloc_slab(freelist_size) returns NULL (kmalloc_caches[7] not populated yet). This triggers the following bug on arm64:

  kernel BUG at /work/Linux/linux-2.6-aarch64/mm/slab.c:2283!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 4.3.0-rc4+ #540
  Hardware name: Juno (DT)
  PC is at __kmem_cache_create+0x21c/0x280
  LR is at __kmem_cache_create+0x210/0x280
  [...]
  Call trace:
    __kmem_cache_create+0x21c/0x280
    create_boot_cache+0x48/0x80
    create_kmalloc_cache+0x50/0x88
    create_kmalloc_caches+0x4c/0xf4
    kmem_cache_init+0x100/0x118
    start_kernel+0x214/0x33c

This patch introduces an OFF_SLAB_MIN_SIZE definition to avoid off-slab management objects for sizes equal to or smaller than KMALLOC_MIN_SIZE.

Fixes: 8fc9cf420b36 ("slab: make more slab management structure off the slab")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org> [3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
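The new threshold and its use, condensed from the upstream patch:

    /* Never place freelist management off-slab for sizes the bootstrap
     * kmalloc caches cannot yet serve. */
    #define OFF_SLAB_MIN_SIZE (max_t(size_t, PAGE_SIZE >> 5, KMALLOC_MIN_SIZE + 1))

    /* in __kmem_cache_create(): */
    if (size >= OFF_SLAB_MIN_SIZE && !slab_early_init &&
        !(flags & SLAB_NOLEAKTRACE))
            flags |= CFLGS_OFF_SLAB;

With KMALLOC_MIN_SIZE = 128, "kmalloc-128" itself now stays on-slab, so its creation never recurses into a kmalloc cache that has not been set up yet.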
2015-10-01  mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)  (Joonsoo Kim)
Commit description is copied from the original post of this bug:

  http://comments.gmane.org/gmane.linux.kernel.mm/135349

Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next larger cache size than the size that index INDEX_NODE maps to. In kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size. However, sometimes we can't get the output we expect via kmalloc_size(INDEX_NODE + 1), causing a BUG().

The mapping table in the latest kernel is:

  index = {0, 1,  2,   3, 4,  5,  6,  n}
  size  = {0, 96, 192, 8, 16, 32, 64, 2^n}

The mapping table before 3.10 was:

  index = {0,  1,  2,  3,   4,   5,   6,   n}
  size  = {32, 64, 96, 128, 192, 256, 512, 2^(n+3)}

The problem on my mips64 machine is as follows:

(1) When configured with DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC && DEBUG_SPINLOCK, sizeof(struct kmem_cache_node) is 150 and the macro INDEX_NODE turns out to be 2:
      #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))
(2) Then the result of kmalloc_size(INDEX_NODE + 1) is 8.
(3) Then "if(size >= kmalloc_size(INDEX_NODE + 1)" will lead to "size = PAGE_SIZE".
(4) Then the "if ((size >= (PAGE_SIZE >> 3))" test will be satisfied and "flags |= CFLGS_OFF_SLAB" will be covered.
(5) The "if (flags & CFLGS_OFF_SLAB)" test will be satisfied and we go to "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", whose result here may be NULL during kernel bootup.
(6) Finally, "BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" causes the BUG info as described (maybe only mips64 has this problem).

This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes the BUG by adding a 'size >= 256' check to guarantee that all necessary small sized slabs are initialized regardless of the sequence of slab sizes in the mapping table.

Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Liuhailong <liu.hailong6@zte.com.cn>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08  mm: rename alloc_pages_exact_node() to __alloc_pages_node()  (Vlastimil Babka)
alloc_pages_exact_node() was introduced in commit 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't fallback to current node for nid == NUMA_NO_NODE. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node and fails otherwise. In truth, the node is only preferred, unless __GFP_THISNODE is passed among the gfp flags. The misleading name has lead to mistakes in the past, see for example commits 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). Another issue with the name is that there's a family of alloc_pages_exact*() functions where 'exact' means exact size (instead of page order), which leads to more confusion. To prevent further mistakes, this patch effectively renames alloc_pages_exact_node() to __alloc_pages_node() to better convey that it's an optimized variant of alloc_pages_node() not intended for general usage. Both functions get described in comments. It has been also considered to really provide a convenience function for allocations restricted to a node, but the major opinion seems to be that __GFP_THISNODE already provides that functionality and we shouldn't duplicate the API needlessly. The number of users would be small anyway. Existing callers of alloc_pages_exact_node() are simply converted to call __alloc_pages_node(), with the exception of sba_alloc_coherent() which open-codes the check for NUMA_NO_NODE, so it is converted to use alloc_pages_node() instead. This means it no longer performs some VM_BUG_ON checks, and since the current check for nid in alloc_pages_node() uses a 'nid < 0' comparison (which includes NUMA_NO_NODE), it may hide wrong values which would be previously exposed. Both differences will be rectified by the next patch. To sum up, this patch makes no functional changes, except temporarily hiding potentially buggy callers. Restricting the checks in alloc_pages_node() is left for the next patch which can in turn expose more existing buggy callers. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Robin Holt <robinmholt@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
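An illustrative pair of call sites after the rename (nid, gfp_mask and order stand in for a caller's real values):

    /* Double underscore signals the optimized variant: the caller has
     * already validated nid and wants no NUMA_NO_NODE fallback. */
    page = __alloc_pages_node(nid, gfp_mask, order);

    /* General-purpose callers keep the forgiving wrapper, which accepts
     * NUMA_NO_NODE and merely prefers the given node. */
    page = alloc_pages_node(numa_mem_id(), gfp_mask, order);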
2015-09-04  slab: infrastructure for bulk object allocation and freeing  (Christoph Lameter)
Add the basic infrastructure for alloc/free operations on pointer arrays. It includes a generic function in the common slab code that is used in this infrastructure patch to create the unoptimized functionality for slab bulk operations. Allocators can then provide optimized allocation functions for situations in which large numbers of objects are needed. These optimization may avoid taking locks repeatedly and bypass metadata creation if all objects in slab pages can be used to provide the objects required. Allocators can extend the skeletons provided and add their own code to the bulk alloc and free functions. They can keep the generic allocation and freeing and just fall back to those if optimizations would not work (like for example when debugging is on). Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
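The generic fallback, condensed from mm/slab_common.c: loop over the ordinary per-object entry points and unwind on failure. It is shown with the int return it gained in the 2015-11-22 API adjustment listed above.

    int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags,
                                size_t nr, void **p)
    {
            size_t i;

            for (i = 0; i < nr; i++) {
                    void *x = kmem_cache_alloc(s, flags);

                    if (!x) {
                            __kmem_cache_free_bulk(s, i, p);
                            return 0;
                    }
                    p[i] = x;
            }
            return i;
    }

Allocators override this skeleton only when they can do better, e.g. by filling the array from one slab page under a single lock, and fall back to it when debugging is enabled.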
2015-08-21  mm: make page pfmemalloc check more robust  (Michal Hocko)
Commit c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb") added checks for page->pfmemalloc to __skb_fill_page_desc():

  if (page->pfmemalloc && !page->mapping)
          skb->pfmemalloc = true;

It assumes page->mapping == NULL implies that page->pfmemalloc can be trusted. However, __delete_from_page_cache() can set page->mapping to NULL and leave the page->index value alone. Due to being in a union, a non-zero page->index will be interpreted as true page->pfmemalloc. So the assumption is invalid if the networking code can see such a page. And it seems it can.

We have encountered this with an NFS-over-loopback setup where such a page is attached to a new skbuf. There is no copying going on in this case, so the page confuses __skb_fill_page_desc, which interprets the index as the pfmemalloc flag, and the network stack drops packets that have been allocated using the reserves unless they are to be queued on sockets handling the swapping, which is the case here; that leads to hangs when the NFS client waits for a response from the server which has been dropped and thus never arrives.

The struct page is already heavily packed, so rather than finding another hole to put it in, let's do a trick instead. We can reuse the index again but define it to an impossible value (-1UL). This is the page index, so it should never see a value that large. Replace all direct users of page->pfmemalloc with page_is_pfmemalloc, which will hide this nastiness from unspoiled eyes. The information will get lost if somebody wants to use page->index, obviously, but that was the case before and the original code expected that the information should be persisted somewhere else if that is really needed (e.g. what SLAB and SLUB do).

[akpm@linux-foundation.org: fix blooper in slub]
Fixes: c48a11c7ad26 ("netvm: propagate page->pfmemalloc to skb")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Debugged-by: Vlastimil Babka <vbabka@suse.com>
Debugged-by: Jiri Bohac <jbohac@suse.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
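The replacement accessor, condensed from the patch's include/linux/mm.h hunk:

    static inline bool page_is_pfmemalloc(struct page *page)
    {
            /*
             * Page index cannot be this large, so this must be
             * a pfmemalloc page.
             */
            return page->index == -1UL;
    }

Encoding the flag as an impossible index value costs no extra space in the already packed struct page, at the price that anyone who repurposes page->index loses the information, which the commit argues was already the contract.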
2015-06-24slab: correct size_index table before replacing the bootstrap kmem_cache_nodeDaniel Sanders
This patch moves the initialization of the size_index table slightly earlier so that the first few kmem_cache_node's can be safely allocated when KMALLOC_MIN_SIZE is large. There are currently two ways to generate indices into kmalloc_caches (via kmalloc_index() and via the size_index table in slab_common.c) and on some arches (possibly only MIPS) they potentially disagree with each other until create_kmalloc_caches() has been called. It seems that the intention is that the size_index table is a fast equivalent to kmalloc_index() and that create_kmalloc_caches() patches the table to return the correct value for the cases where kmalloc_index()'s if-statements apply. The failing sequence was:

    * kmalloc_caches contains NULL elements
    * kmem_cache_init initialises the element that 'struct kmem_cache_node'
      will be allocated to. For 32-bit MIPS, this is a 56-byte struct and
      kmalloc_index returns KMALLOC_SHIFT_LOW (7).
    * init_list is called, which calls kmalloc_node to allocate a
      'struct kmem_cache_node'.
    * kmalloc_slab selects the kmalloc_caches element using
      size_index[size_index_elem(size)]. For MIPS, size is 56, and the
      expression returns 6.
    * This element of kmalloc_caches is NULL and allocation fails.
    * If it had not already failed, it would have called
      create_kmalloc_caches() at this point, which would have changed
      size_index[size_index_elem(size)] to 7.

I don't believe the bug to be LLVM specific, but GCC doesn't normally encounter the problem. I haven't been able to identify exactly what GCC is doing better (probably inlining), but it seems that GCC is managing to optimize to the point that it eliminates the problematic allocations. This theory is supported by the fact that GCC can be made to fail in the same way by changing inline, __inline, __inline__, and __always_inline in include/linux/compiler-gcc.h such that they don't actually inline things. Signed-off-by: Daniel Sanders <daniel.sanders@imgtec.com> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
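A hedged sketch of the two index paths that must agree once create_kmalloc_caches() has patched the table; the shape follows my reading of mm/slab_common.c and details may differ on this kernel:

    struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
    {
            unsigned int index;

            if (size <= 192) {
                    if (!size)
                            return ZERO_SIZE_PTR;
                    /* fast table path; must match kmalloc_index() */
                    index = size_index[size_index_elem(size)];
            } else {
                    /* computed path, equivalent to kmalloc_index() */
                    index = fls(size - 1);
            }
            return kmalloc_caches[index];
    }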
2015-04-14mm: remove GFP_THISNODEDavid Rientjes
NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE. GFP_THISNODE is a secret combination of gfp bits that have different behavior than expected. It is a combination of __GFP_THISNODE, __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator slowpath to fail without trying reclaim even though it may be used in combination with __GFP_WAIT. An example of the problem this creates: commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE that really just wanted __GFP_THISNODE. The problem doesn't end there, however, because even that was a no-op for alloc_misplaced_dst_page(), which also sets __GFP_NORETRY and __GFP_NOWARN, and migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT are set in GFP_TRANSHUGE. Converting GFP_THISNODE to __GFP_THISNODE is a no-op in these cases since the page allocator special-cases __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN. It's time to just remove GFP_THISNODE entirely. We leave __GFP_THISNODE to restrict an allocation to a local node, but remove GFP_THISNODE and its obscurity. Instead, we require that a caller clear __GFP_WAIT if it wants to avoid reclaim. This allows the aforementioned functions to actually reclaim as they should. It also enables any future callers that want to do __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim. The rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT. Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so it is unchanged. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Jarno Rajahalme <jrajahalme@nicira.com> Cc: Li Zefan <lizefan@huawei.com> Cc: Greg Thelen <gthelen@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
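The resulting rule, as a short illustrative sketch; the flag combination below is chosen for the example, not taken from the patch:

    /* node-local allocation that must not reclaim: state it explicitly
     * instead of relying on the removed GFP_THISNODE special case */
    gfp_t gfp = (GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN)
                    & ~__GFP_WAIT;
    struct page *page = alloc_pages_node(nid, gfp, 0);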
2015-02-12slub: make dead caches discard free slabs immediatelyVladimir Davydov
To speed up further allocations, SLUB may store empty slabs in per cpu/node partial lists instead of freeing them immediately. This prevents per-memcg cache destruction, because kmem caches created for a memory cgroup are only destroyed after the last page charged to the cgroup is freed. To fix this issue, this patch resurrects the approach first proposed in [1]. It forbids SLUB from caching empty slabs after the memory cgroup that the cache belongs to was destroyed. It is achieved by setting kmem_cache's cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so that it drops frozen empty slabs immediately if cpu_partial == 0. The runtime overhead is minimal. Of all the hot functions, we only touch the relatively cold put_cpu_partial(): we make it call unfreeze_partials() after freezing a slab that belongs to an offline memory cgroup. Since slab freezing exists to avoid moving slabs from/to a partial list on free/alloc, and there can't be allocations from dead caches, it shouldn't cause any overhead. We do have to disable preemption for put_cpu_partial() to achieve that, though. The original patch was accepted well and even merged to the mm tree. However, I decided to withdraw it due to changes happening to the memcg core at that time. I had an idea of introducing per-memcg shrinkers for kmem caches, but now, as memcg has finally settled down, I do not see it as an option, because a SLUB shrinker would be too costly to call since SLUB does not keep free slabs on a separate list. Besides, we currently do not even call per-memcg shrinkers for offline memcgs. Overall, it would introduce much more complexity to both SLUB and memcg than this small patch. As for SLAB, there's no problem with it, because it shrinks per-cpu/node caches periodically. Thanks to list_lru reparenting, we no longer keep entries for offline cgroups in per-memcg arrays (such as memcg_cache_params->memcg_caches), so we do not have to bother if a per-memcg cache will be shrunk a bit later than it could be. [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650 Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
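A rough sketch of the put_cpu_partial() tuning described above; the real function is more involved (this elides the cmpxchg loop that maintains the per-cpu partial list):

    static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
    {
            /* dead cache: cpu_partial was set to 0, so drop frozen empty
             * slabs immediately instead of caching them */
            if (unlikely(!s->cpu_partial)) {
                    unsigned long flags;

                    local_irq_save(flags);
                    unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
                    local_irq_restore(flags);
                    return;
            }
            /* live cache: stash the frozen slab on the per-cpu partial list
             * (elided) */
    }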
2015-02-12slab: link memcg caches of the same kind into a listVladimir Davydov
Sometimes, we need to iterate over all memcg copies of a particular root kmem cache. Currently, we use the memcg_cache_params->memcg_caches array for that, because it contains all existing memcg caches. However, it's bad practice to keep all caches, including those that belong to offline cgroups, in this array, because it would then grow without bound. I'm going to wipe away dead caches from it to save space. To still be able to perform iterations over all memcg caches of the same kind, let us link them into a list. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13slab: fix cpuset check in fallback_allocVladimir Davydov
fallback_alloc is called on kmalloc if the preferred node doesn't have free or partial slabs and there are no pages on the node's free list (GFP_THISNODE allocations fail). Before invoking the reclaimer, it tries to locate a free or partial slab on other allowed nodes' lists. While iterating over the preferred node's zonelist, it skips those zones for which the hardwall cpuset check returns false. That means that for a task bound to a specific node using cpusets, fallback_alloc will always ignore free slabs on other nodes and go directly to the reclaimer, which, however, may allocate from other nodes if cpuset.mem_hardwall is unset (default). As a result, lists of free slabs may grow without bound on other nodes, which is bad, because inactive slabs are only evicted by cache_reap at a very slow rate and cannot be dropped forcefully. To reproduce the issue, run a process that will walk over a directory tree with lots of files inside a cpuset bound to a node that constantly experiences memory pressure. Look at num_slabs vs active_slabs growth as reported by /proc/slabinfo. To avoid this, we should use the softwall cpuset check in fallback_alloc. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Zefan Li <lizefan@huawei.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
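A hedged sketch of the corrected check inside fallback_alloc()'s zonelist walk; the surrounding loop body follows my reading of mm/slab.c, and the exact gfp flags passed down are an assumption:

    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
            nid = zone_to_nid(zone);

            /* softwall check (no __GFP_HARDWALL forced in): free slabs on
             * other allowed nodes are now considered before reclaiming */
            if (cpuset_zone_allowed(zone, flags) &&
                get_node(cache, nid) &&
                get_node(cache, nid)->free_objects) {
                    obj = ____cache_alloc_node(cache,
                                               flags | __GFP_THISNODE, nid);
                    if (obj)
                            break;
            }
    }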
2014-12-13memcg: fix possible use-after-free in memcg_kmem_get_cache()Vladimir Davydov
Suppose task @t that belongs to a memory cgroup @memcg is going to allocate an object from a kmem cache @c. The copy of @c corresponding to @memcg, @mc, is empty. Then if kmem_cache_alloc races with the memory cgroup destruction, we can access the memory cgroup's copy of the cache after it was destroyed:

    CPU0                                CPU1
    ----                                ----
    [ current=@t
      @mc->memcg_params->nr_pages=0 ]

    kmem_cache_alloc(@c):
      call memcg_kmem_get_cache(@c);
      proceed to allocation from @mc:
        alloc a page for @mc:
          ...
                                        move @t from @memcg
                                        destroy @memcg:
                                          mem_cgroup_css_offline(@memcg):
                                            memcg_unregister_all_caches(@memcg):
                                              kmem_cache_destroy(@mc)

        add page to @mc

We could fix this issue by taking a reference to a per-memcg cache, but that would require adding a per-cpu reference counter to per-memcg caches, which would look cumbersome. Instead, let's take a reference to a memory cgroup, which already has a per-cpu reference counter, in the beginning of kmem_cache_alloc to be dropped in the end, and move per-memcg cache destruction from css offline to css free. As a side effect, per-memcg caches will be destroyed not one by one, but all at once when the last page accounted to the memory cgroup is freed. This doesn't sound like a high price for code readability, though. Note, this patch does add some overhead to the kmem_cache_alloc hot path, but it is pretty negligible - it's just a function call plus a per-cpu counter decrement, which is comparable to what we already have in memcg_kmem_get_cache. Besides, it's only relevant if there are memory cgroups with kmem accounting enabled. I don't think we can find a way to handle this race w/o it, because alloc_page called from kmem_cache_alloc may sleep, so we can't flush all pending kmallocs w/o reference counting. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
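A conceptual sketch of the lifetime fix; the put-side helper name is illustrative, not taken from the patch:

    void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
    {
            void *objp;

            /* pins the owning memcg via its per-cpu css refcount, so the
             * per-memcg cache cannot be destroyed under us */
            cachep = memcg_kmem_get_cache(cachep, flags);
            objp = slab_alloc(cachep, flags, _RET_IP_);
            /* reference dropped only after the allocation has completed */
            memcg_kmem_put_cache(cachep);
            return objp;
    }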
2014-12-11Merge branch 'for-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup update from Tejun Heo:
 "cpuset got simplified a bit. cgroup core got a fix on unified
  hierarchy and grew some effective css related interfaces which will
  be used for blkio support for writeback IO traffic which is currently
  being worked on"

* 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: implement cgroup_get_e_css()
  cgroup: add cgroup_subsys->css_e_css_changed()
  cgroup: add cgroup_subsys->css_released()
  cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
  cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
  cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
  cpuset: lock vs unlock typo
  cpuset: simplify cpuset_node_allowed API
  cpuset: convert callback_mutex to a spinlock
2014-12-10slab: improve checking for invalid gfp_flagsAndrew Morton
The code goes BUG, but doesn't tell us which bits were unexpectedly set. Print that out. Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
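A minimal sketch of the improved check, assuming the GFP_SLAB_BUG_MASK convention used in mm/slab.c; the exact format string is a guess:

    if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
            pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
            BUG();  /* still BUG, but now we know which bits were set */
    }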
2014-12-10slab: print slabinfo header in seq showVladimir Davydov
Currently we print the slabinfo header in the seq start method, which makes it unusable for showing leaks, so we have leaks_show, which does practically the same as s_show except it doesn't show the header. However, we can print the header in the seq show method - we only need to check if the current element is the first on the list. This will allow us to use the same set of seq iterators for both leaks and slabinfo reporting, which is nice. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
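A sketch of the show-side header emission, assuming the iterator walks the global slab_caches list:

    static int slab_show(struct seq_file *m, void *p)
    {
            struct kmem_cache *s = list_entry(p, struct kmem_cache, list);

            /* first element of the list: print the header exactly once */
            if (p == slab_caches.next)
                    print_slabinfo_header(m);
            cache_show(s, m);
            return 0;
    }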
2014-12-10mm: slab/slub: coding style: whitespaces and tabs mixtureLQYMGT
Some code in mm/slab.c and mm/slub.c uses whitespace for indentation. Clean it up. Signed-off-by: LQYMGT <lqymgt@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03slab: fix nodeid bounds check for non-contiguous node IDsPaul Mackerras
The bounds check for nodeid in ____cache_alloc_node gives false positives on machines where the node IDs are not contiguous, leading to a panic at boot time. For example, on a POWER8 machine the node IDs are typically 0, 1, 16 and 17. This means that num_online_nodes() returns 4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON triggers, like this:

    kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079!
    Call Trace:
      .____cache_alloc_node+0x5c/0x270 (unreliable)
      .kmem_cache_alloc_node_trace+0xdc/0x360
      .init_list+0x3c/0x128
      .kmem_cache_init+0x1dc/0x258
      .start_kernel+0x2a0/0x568
      start_here_common+0x20/0xa8

To fix this, we instead compare the nodeid with MAX_NUMNODES, and additionally make sure it isn't negative (since nodeid is an int). The check is there mainly to protect the array dereference in the get_node() call in the next line, and the array being dereferenced is of size MAX_NUMNODES. If the nodeid is in range but invalid (for example if the node is off-line), the BUG_ON in the next line will catch that. Fixes: 14e50c6a9bc2 ("mm: slab: Verify the nodeid passed to ____cache_alloc_node") Signed-off-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
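The corrected check, sketched:

    /* nodeid is an int: reject negatives as well as anything at or past
     * the MAX_NUMNODES-sized array bound */
    VM_BUG_ON(nodeid < 0 || nodeid >= MAX_NUMNODES);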
2014-10-27cpuset: simplify cpuset_node_allowed APIVladimir Davydov
The current cpuset API for checking if a zone/node is allowed to allocate from looks rather awkward. We have hardwall and softwall versions of cpuset_node_allowed, with the softwall version doing literally the same as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags. If it isn't, the softwall version may check the given node against the enclosing hardwall cpuset, which it needs to take the callback lock to do. Such a distinction was introduced by commit 02a0e53d8227 ("cpuset: rework cpuset_zone_allowed api"). Before, we had only one version, with the __GFP_HARDWALL flag determining its behavior. The purpose of the commit was to avoid sleep-in-atomic bugs when someone would mistakenly call the function without the __GFP_HARDWALL flag for an atomic allocation. The suffixes introduced were intended to make the callers think before using the function. However, since the callback lock was converted from a mutex to a spinlock by the previous patch, the softwall check function cannot sleep, and these precautions are no longer necessary. So let's simplify the API back to the single check. Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
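A sketch of the simplified convention from a caller's point of view; the call sites below are illustrative:

    /* one function again: __GFP_HARDWALL in the mask selects the strict,
     * current-cpuset-only check */
    if (!cpuset_node_allowed(node, gfp_mask | __GFP_HARDWALL))
            goto fail;              /* hardwall semantics */

    /* without the flag, the nearest hardwall ancestor cpuset is consulted */
    if (!cpuset_node_allowed(node, gfp_mask))
            continue;               /* softwall semantics */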
2014-10-14mm/slab: fix unaligned access on sparc64Joonsoo Kim
Commit bf0dea23a9c0 ("mm/slab: use percpu allocator for cpu cache") changed the allocation method for the cpu cache array from the slab allocator to the percpu allocator. An alignment must be provided to get aligned memory from the percpu allocator, but that commit mistakenly set the alignment to 0. So, the percpu allocator returns unaligned memory addresses. It doesn't cause any problem on x86, which permits unaligned access, but it does cause a problem on sparc64, which needs a strong guarantee of alignment. The following bug report is from David Miller:

    I'm getting tons of the following on sparc64:
    [603965.383447] Kernel unaligned access at TPC[546b58] free_block+0x98/0x1a0
    [603965.396987] Kernel unaligned access at TPC[546b60] free_block+0xa0/0x1a0
    ...
    [603970.554394] log_unaligned: 333 callbacks suppressed
    ...

This patch provides a proper alignment parameter when allocating the cpu cache to fix this unaligned memory access problem on sparc64. Reported-by: David Miller <davem@davemloft.net> Tested-by: David Miller <davem@davemloft.net> Tested-by: Meelis Roos <mroos@linux.ee> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
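The essence of the fix, sketched as before/after; the variable naming is assumed:

    /* before: alignment of 0 lets the percpu allocator return unaligned
     * addresses, which sparc64 traps on */
    cpu_cache = __alloc_percpu(size, 0);

    /* after: request pointer-size alignment explicitly */
    cpu_cache = __alloc_percpu(size, sizeof(void *));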
2014-10-09mm/slab.c: use __seq_open_private() instead of seq_open()Rob Jones
Using __seq_open_private() removes boilerplate code from slabstats_open(). The resultant code is shorter and easier to follow. This patch does not change any functionality. Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
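A simplified sketch of the resulting open routine, per the helper's contract (it returns the private buffer, or NULL on allocation failure); any extra initialization of that buffer is elided here:

    static int slabstats_open(struct inode *inode, struct file *file)
    {
            unsigned long *n;

            /* seq_open() + private buffer allocation + wiring, in one call */
            n = __seq_open_private(file, &slabstats_op, PAGE_SIZE);
            return n ? 0 : -ENOMEM;
    }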
2014-10-09mm/slab: use percpu allocator for cpu cacheJoonsoo Kim
Because of a chicken-and-egg problem, the initialization of SLAB is really complicated. We need to allocate the cpu cache through SLAB to make the kmem_cache work, but before the initialization of kmem_cache, allocation through SLAB is impossible. On the other hand, SLUB does initialization in a simpler way. It uses the percpu allocator to allocate the cpu cache, so there is no chicken-and-egg problem. So, this patch tries to use the percpu allocator in SLAB. This simplifies the initialization step in SLAB so that we can maintain the SLAB code more easily. In my testing there is no performance difference. This implementation relies on the percpu allocator. Because the percpu allocator uses vmalloc address space, vmalloc address space could be exhausted by this change on a many-CPU system with a *32-bit* kernel. This implementation can cover 1024 cpus in the worst case by the following calculation:

    Worst:  1024 cpus * 4 bytes for pointer * 300 kmem_caches *
            120 objects per cpu_cache = 140 MB
    Normal: 1024 cpus * 4 bytes for pointer * 150 kmem_caches (slab merge) *
            80 objects per cpu_cache = 46 MB

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
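A hedged sketch of the percpu-based cpu cache setup this describes; helper names follow my reading of the patch, and the alignment argument is shown with the pointer-size value from the sparc64 fix above:

    static struct array_cache __percpu *
    alloc_kmem_cache_cpus(struct kmem_cache *cachep, int entries, int batchcount)
    {
            size_t size = sizeof(void *) * entries + sizeof(struct array_cache);
            struct array_cache __percpu *cpu_cache;
            int cpu;

            /* the percpu allocator, not SLAB itself, backs the cpu cache,
             * which breaks the bootstrap dependency */
            cpu_cache = __alloc_percpu(size, sizeof(void *));
            if (!cpu_cache)
                    return NULL;

            for_each_possible_cpu(cpu)
                    init_arraycache(per_cpu_ptr(cpu_cache, cpu),
                                    entries, batchcount);

            return cpu_cache;
    }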
2014-10-09mm/slab: support slab mergeJoonsoo Kim
Slab merging is a good feature for reducing fragmentation. If a newly created slab cache has a similar size and properties to an existing one, this feature reuses the existing cache rather than creating a new one. As a result, objects are packed into fewer slabs, so fragmentation is reduced. Below is the result of my testing:

    * After boot, sleep 20; cat /proc/meminfo | grep Slab
    <Before> Slab: 25136 kB
    <After>  Slab: 24364 kB

We can save 3% of the memory used by slab. To support this feature in SLAB, we need to implement SLAB-specific kmem_cache_flags() and __kmem_cache_alias(), because SLUB implements some SLUB-specific processing related to the debug flags and object size changes in these functions. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
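A sketch of the SLAB-side alias hook; find_mergeable() lives in the common slab code, and the refcount handling here is an assumption:

    struct kmem_cache *
    __kmem_cache_alias(const char *name, size_t size, size_t align,
                       unsigned long flags, void (*ctor)(void *))
    {
            struct kmem_cache *cachep;

            cachep = find_mergeable(size, align, flags, name, ctor);
            if (cachep)
                    cachep->refcount++;     /* reuse instead of creating anew */

            return cachep;
    }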
2014-10-09mm/slab: factor out unlikely part of cache_free_alien()Joonsoo Kim
cache_free_alien() is a rarely used function, called only on a node mismatch. But it is defined with the inline attribute, so it is inlined into __cache_free(), which is the core free function of the slab allocator. It uselessly makes the kmem_cache_free()/kfree() functions large. What we really need to inline is just the node-match check, so this patch factors the other parts of cache_free_alien() out to reduce the code size of kmem_cache_free()/kfree():

    <Before>
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    <After>
    nm -S mm/slab.o | grep -e "T kfree" -e "T kmem_cache_free"
    0000000000001110 00000000000001b5 T kfree
    0000000000000750 0000000000000181 T kmem_cache_free

You can see the slightly reduced text sizes: 0x228->0x1b5, 0x216->0x181. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
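A sketch of the split: only the node-match test stays inline, with the alien handling moved out of line (the out-of-line helper name follows the double-underscore convention described):

    static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
    {
            int page_node = page_to_nid(virt_to_page(objp));
            int node = numa_mem_id();

            /* common case: freed on its home node, nothing alien to do */
            if (likely(node == page_node))
                    return 0;

            return __cache_free_alien(cachep, objp, node, page_node);
    }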
2014-10-09mm/slab: noinline __ac_put_obj()Joonsoo Kim
Our intention for __ac_put_obj() is that it doesn't affect anything if sk_memalloc_socks() is disabled. But because __ac_put_obj() is so small, the compiler inlines it into ac_put_obj(), which affects the code size of the free path. This patch adds the noinline keyword to __ac_put_obj() so that it doesn't disrupt the normal free path at all:

    <Before>
    nm -S slab-orig.o | grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
    0000000000001e80 00000000000002f5 t cache_alloc_refill
    0000000000001230 0000000000000258 T kfree
    0000000000000690 000000000000024c T kmem_cache_free

    <After>
    nm -S slab-patched.o | grep -e "t cache_alloc_refill" -e "T kfree" -e "T kmem_cache_free"
    0000000000001e00 00000000000002e5 t cache_alloc_refill
    00000000000011e0 0000000000000228 T kfree
    0000000000000670 0000000000000216 T kmem_cache_free

    cache_alloc_refill: 0x2f5->0x2e5
    kfree:              0x258->0x228
    kmem_cache_free:    0x24c->0x216

The code size of each function is slightly reduced. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
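A sketch of the resulting call-site shape; the fast-path body is per my reading of mm/slab.c:

    static noinline void __ac_put_obj(struct kmem_cache *cachep,
                                      struct array_cache *ac, void *objp);

    static inline void ac_put_obj(struct kmem_cache *cachep,
                                  struct array_cache *ac, void *objp)
    {
            /* pfmemalloc bookkeeping stays out of line now, so the common
             * free path remains a plain store */
            if (unlikely(sk_memalloc_socks()))
                    __ac_put_obj(cachep, ac, objp);

            ac->entry[ac->avail++] = objp;
    }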
2014-10-09mm/slab: move cache_flusharray() out of unlikely.text sectionJoonsoo Kim
Now, due to a likely keyword, the compiled code of cache_flusharray() ends up in the unlikely.text section. Although flushing is an uncommon case compared to freeing into the cpu cache, it is a more common case than free_block(). But free_block() is in the normal text section. This patch fixes this odd situation by removing the likely keyword. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
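The change in spirit, sketched around __cache_free()'s tail; the surrounding code is per my reading of mm/slab.c:

    if (ac->avail < ac->limit) {            /* was: likely(ac->avail < ac->limit) */
            STATS_INC_FREEHIT(cachep);
    } else {
            STATS_INC_FREEMISS(cachep);
            cache_flusharray(cachep, ac);   /* no longer relegated to the
                                             * unlikely text section */
    }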
2014-10-09mm/sl[ao]b: always track caller in kmalloc_(node_)track_caller()Joonsoo Kim
Now, we track the caller only if tracing or slab debugging is enabled. If they are disabled, we could save the overhead of passing one argument by calling __kmalloc(_node)() directly. But I think that saving would be marginal. Furthermore, the default slab allocator, SLUB, doesn't use this technique, so I think it's okay to change this situation. After this change, we can turn CONFIG_DEBUG_SLAB on/off without a full kernel build and remove some complicated '#if' definitions. It looks more beneficial to me. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
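The wrappers after the change, sketched in their unconditional macro form (previously this shape was guarded by tracing/CONFIG_DEBUG_SLAB '#if' blocks):

    #define kmalloc_track_caller(size, flags) \
            __kmalloc_track_caller(size, flags, _RET_IP_)

    #define kmalloc_node_track_caller(size, flags, node) \
            __kmalloc_node_track_caller(size, flags, node, _RET_IP_)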
2014-09-27Merge branch 'for-3.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "This is quite late but these need to be backported anyway.

  This is the fix for a long-standing cpuset bug which existed from
  2009. cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the
  task's memory allocation behavior according to the settings of the
  cpuset it belongs to; unfortunately, when those flags have to be
  changed, cpuset did so directly even while the target task is running,
  which is obviously racy as task->flags may be modified by the task
  itself at any time. This obscure bug manifested as corrupt
  PF_USED_MATH flag leading to a weird crash.

  The bug is fixed by moving the flag to task->atomic_flags. The first
  two are preparatory ones to help defining atomic_flags accessors and
  the third one is the actual fix"

* 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags
  sched: add macros to define bitops for task atomic flags
  sched: fix confusing PFA_NO_NEW_PRIVS constant
2014-09-26mm, slab: initialize object alignment on cache creationDavid Rientjes
Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the "ralign" automatic variable in __kmem_cache_create() may be used uninitialized. The proper alignment defaults to BYTES_PER_WORD and can be overridden by SLAB_RED_ZONE or the alignment specified by the caller. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031 Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Andrei Elovikov <a.elovikov@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
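The essence of the fix, sketched:

    /* give ralign a defined default before SLAB_RED_ZONE or the
     * caller-specified alignment can raise it */
    size_t ralign = BYTES_PER_WORD;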