path: root/mm
2016-03-22  mm: slub: call kasan_alloc_pages before freeing pages in slub  [Se Wang (Patrick) Oh]
KASan marks slub objects as redzone and free, and the bitmasks for that region are not cleared until the pages are freed. When CONFIG_PAGE_POISONING is enabled, the pages still carry these special bitmasks, so a KASan report arises during page poisoning. So mark the pages as alloc status before poisoning the pages.

==================================================================
BUG: KASan: use after free in memset+0x24/0x44 at addr ffffffc0bb628000
Write of size 4096 by task kworker/u8:0/6
page:ffffffbacc51d900 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x4000000000000000()
page dumped because: kasan: bad access detected
Call trace:
[<ffffffc00008c010>] dump_backtrace+0x0/0x250
[<ffffffc00008c270>] show_stack+0x10/0x1c
[<ffffffc001b6f9e4>] dump_stack+0x74/0xfc
[<ffffffc0002debf4>] kasan_report_error+0x2b0/0x408
[<ffffffc0002dee28>] kasan_report+0x34/0x40
[<ffffffc0002de240>] __asan_storeN+0x15c/0x168
[<ffffffc0002de47c>] memset+0x20/0x44
[<ffffffc0002d77bc>] kernel_map_pages+0x2e8/0x384
[<ffffffc000266458>] free_pages_prepare+0x340/0x3a0
[<ffffffc0002694cc>] __free_pages_ok+0x20/0x12c
[<ffffffc00026a698>] __free_pages+0x34/0x44
[<ffffffc00026ab3c>] __free_kmem_pages+0x8/0x14
[<ffffffc0002dc3fc>] kfree+0x114/0x254
[<ffffffc000b05748>] devres_free+0x48/0x5c
[<ffffffc000b05824>] devres_destroy+0x10/0x28
[<ffffffc000b05958>] devm_kfree+0x1c/0x3c
Memory state around the buggy address:
 ffffffc0bb627f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffffffc0bb627f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffffffc0bb628000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                   ^
 ffffffc0bb628080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffffffc0bb628100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
BUG: KASan: use after free in memset+0x24/0x44 at addr ffffffc0bb2fe000
Write of size 4096 by task swapper/0/1
page:ffffffbacc4fdec0 count:0 mapcount:0 mapping: (null) index:0xffffffc0bb2fe6a0
flags: 0x4000000000000000()
page dumped because: kasan: bad access detected
Call trace:
[<ffffffc00008c010>] dump_backtrace+0x0/0x250
[<ffffffc00008c270>] show_stack+0x10/0x1c
[<ffffffc001b6f9e4>] dump_stack+0x74/0xfc
[<ffffffc0002debf4>] kasan_report_error+0x2b0/0x408
[<ffffffc0002dee28>] kasan_report+0x34/0x40
[<ffffffc0002de240>] __asan_storeN+0x15c/0x168
[<ffffffc0002de47c>] memset+0x20/0x44
[<ffffffc0002d77bc>] kernel_map_pages+0x2e8/0x384
[<ffffffc000266458>] free_pages_prepare+0x340/0x3a0
[<ffffffc0002694cc>] __free_pages_ok+0x20/0x12c
[<ffffffc00026a698>] __free_pages+0x34/0x44
[<ffffffc0002d9c98>] __free_slab+0x15c/0x178
[<ffffffc0002d9d14>] discard_slab+0x60/0x6c
[<ffffffc0002dc034>] __slab_free+0x320/0x340
[<ffffffc0002dc224>] kmem_cache_free+0x1d0/0x25c
[<ffffffc0003bb608>] kernfs_put+0x2a0/0x3d8
Memory state around the buggy address:
 ffffffc0bb2fdf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffffffc0bb2fdf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffffffc0bb2fe000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
                   ^
 ffffffc0bb2fe080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffffffc0bb2fe100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================

Change-Id: Id963b9439685f94a022dcdd60b59aaf126610387
Signed-off-by: Se Wang (Patrick) Oh <sewango@codeaurora.org>
[satyap: trivial merge conflict resolution]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
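A minimal sketch of the ordering idea (not the actual CAF hunk; function placement and names are illustrative, and it assumes the 3.18-era kasan_alloc_pages(struct page *, unsigned int order) hook): the slab's pages are re-marked as allocated in the shadow before they reach the page allocator, so the poisoning memset is no longer flagged.

    /* Hedged sketch, not the exact patch: only the ordering matters --
     * clear the redzone/free shadow state before the poisoning memset runs. */
    static void __free_slab(struct kmem_cache *s, struct page *page)
    {
            int order = compound_order(page);

            /* ... existing teardown of slab metadata ... */

            kasan_alloc_pages(page, order); /* shadow: back to "allocated" */
            __free_pages(page, order);      /* CONFIG_PAGE_POISONING memsets here */
    }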
2016-03-22  UBSAN: run-time undefined behavior sanity checker  [Andrey Ryabinin]
UBSAN uses compile-time instrumentation to catch undefined behavior (UB). The compiler inserts code that performs certain kinds of checks before operations that could cause UB. If a check fails (i.e. UB is detected), a __ubsan_handle_* function is called to print an error message. So most of the work is done by the compiler; this patch just implements the ubsan handlers that print the errors.

GCC has had this capability since 4.9.x [1] (see the -fsanitize=undefined option and its suboptions). GCC 5.x has more checkers implemented [2]. Article [3] has a bit more details about UBSAN in GCC.

[1] - https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gcc/Debugging-Options.html
[2] - https://gcc.gnu.org/onlinedocs/gcc/Debugging-Options.html
[3] - http://developerblog.redhat.com/2014/10/16/gcc-undefined-behavior-sanitizer-ubsan/

Issues which UBSAN has found thus far are:

Found bugs:
* out-of-bounds access - 97840cb67ff5 ("netfilter: nfnetlink: fix insufficient validation in nfnetlink_bind")

undefined shifts:
* d48458d4a768 ("jbd2: use a better hash function for the revoke table")
* 10632008b9e1 ("clockevents: Prevent shift out of bounds")
* 'x << -1' shift in ext4 - http://lkml.kernel.org/r/<5444EF21.8020501@samsung.com>
* undefined rol32(0) - http://lkml.kernel.org/r/<1449198241-20654-1-git-send-email-sasha.levin@oracle.com>
* undefined dirty_ratelimit calculation - http://lkml.kernel.org/r/<566594E2.3050306@odin.com>
* undefined rounddown_pow_of_two(0) - http://lkml.kernel.org/r/<1449156616-11474-1-git-send-email-sasha.levin@oracle.com>
* [WONTFIX] undefined shift in __bpf_prog_run - http://lkml.kernel.org/r/<CACT4Y+ZxoR3UjLgcNdUm4fECLMx2VdtfrENMtRRCdgHB2n0bJA@mail.gmail.com>
  WONTFIX here because it should be fixed in the bpf program, not in the kernel.

signed overflows:
* 32a8df4e0b33f ("sched: Fix odd values in effective_load() calculations")
* mul overflow in ntp - http://lkml.kernel.org/r/<1449175608-1146-1-git-send-email-sasha.levin@oracle.com>
* incorrect conversion into rtc_time in rtc_time64_to_tm() - http://lkml.kernel.org/r/<1449187944-11730-1-git-send-email-sasha.levin@oracle.com>
* unvalidated timespec in io_getevents() - http://lkml.kernel.org/r/<CACT4Y+bBxVYLQ6LtOKrKtnLthqLHcw-BMp3aqP3mjdAvr9FULQ@mail.gmail.com>
* [NOTABUG] signed overflow in ktime_add_safe() - http://lkml.kernel.org/r/<CACT4Y+aJ4muRnWxsUe1CMnA6P8nooO33kwG-c8YZg=0Xc8rJqw@mail.gmail.com>

[akpm@linux-foundation.org: fix unused local warning]
[akpm@linux-foundation.org: fix __int128 build woes]
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yury Gribov <y.gribov@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Git-repo: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/
Git-commit: c6d308534aef6c99904bf5862066360ae067abc4
[tsoni@codeaurora.org: trivial merge conflict resolution]
CRs-Fixed: 969533
Change-Id: I048b9936b1120e0d375b7932c59de78d8ef8f411
Signed-off-by: Trilok Soni <tsoni@codeaurora.org>
[satyap@codeaurora.org: trivial merge conflict resolution]
Signed-off-by: Satya Durga Srinivasu Prabhala <satyap@codeaurora.org>
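For illustration, a small userspace program (not part of the patch) containing two of the UB classes listed above; building it with a UBSAN-capable GCC, e.g. gcc -fsanitize=undefined ubsan-demo.c, makes the compiler-inserted __ubsan_handle_* calls report both at runtime:

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
            volatile int shift = 32;     /* >= width of int: undefined shift */
            volatile int big = INT_MAX;

            printf("%d\n", 1 << shift);  /* UBSAN: shift exponent 32 is too large */
            printf("%d\n", big + 1);     /* UBSAN: signed integer overflow */
            return 0;
    }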
2016-03-22  mm: Update is_vmalloc_addr to account for vmalloc savings  [Susheel Khiani]
is_vmalloc_addr currently assumes that all vmalloc addresses exist between VMALLOC_START and VMALLOC_END. This may not be the case when interleaving vmalloc and lowmem. Update the is_vmalloc_addr to properly check for this. Correspondingly we need to ensure that VMALLOC_TOTAL accounts for all the vmalloc regions when CONFIG_ENABLE_VMALLOC_SAVING is enabled. Change-Id: I5def3d6ae1a4de59ea36f095b8c73649a37b1f36 Signed-off-by: Susheel Khiani <skhiani@codeaurora.org>
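For context, a hedged reconstruction of the stock generic helper: it assumes a single contiguous [VMALLOC_START, VMALLOC_END) window, which is exactly the assumption that breaks once lowmem and vmalloc regions are interleaved and the helper instead has to consult the actual vmalloc regions.

    /* Pre-change form of the check (reconstruction, mm-era circa 3.18). */
    static inline int is_vmalloc_addr(const void *x)
    {
    #ifdef CONFIG_MMU
            unsigned long addr = (unsigned long)x;

            return addr >= VMALLOC_START && addr < VMALLOC_END;
    #else
            return 0;
    #endif
    }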
2016-03-22  msm: Allow lowmem to be non contiguous and mixed  [Susheel Khiani]
Currently on 32 bit systems, virtual space above PAGE_OFFSET is reserved for direct mapped lowmem and part of the virtual address space is reserved for vmalloc. We want to optimize so as to have as much direct mapped memory as possible, since there is a penalty for mapping/unmapping highmem. Now, we may have an image that is expected to have a lifetime of the entire system and is reserved in a physical region that would be part of direct mapped lowmem. The physical memory which is thus reserved is never used by Linux. This means that even though the system is not actually accessing the virtual memory corresponding to the reserved physical memory, we are still losing that portion of direct mapped lowmem space. So by allowing lowmem to be non-contiguous we can give this unused virtual address space of the reserved region back for use in vmalloc. Change-Id: I980b3dfafac71884dcdcb8cd2e4a6363cde5746a Signed-off-by: Susheel Khiani <skhiani@codeaurora.org>
2016-03-22  msm: Increase the kernel virtual area to include lowmem  [Susheel Khiani]
Even though lowmem is accounted for in vmalloc space, allocation comes only from the region bounded by VMALLOC_START and VMALLOC_END. The kernel virtual area can now allocate from any unmapped region starting from PAGE_OFFSET. Change-Id: I291b9eb443d3f7445fd979bd7b09e9241ff22ba3 Signed-off-by: Neeti Desai <neetid@codeaurora.org> Signed-off-by: Susheel Khiani <skhiani@codeaurora.org>
2016-03-22  mm: showmem: make the notifiers atomic  [Vinayak Menon]
There are places in the kernel, like the lowmemorykiller, which invoke show_mem_call_notifiers from an atomic context. So move from a blocking notifier to an atomic one. At present the notifier callbacks do not call sleeping functions, but care must be taken that this does not happen in the future either. Change-Id: I9668e67463ab8a6a60be55dbc86b88f45be8b041 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
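A minimal sketch of the atomic variant, assuming the chain and function names used by these patches (show_mem_call_notifiers is referenced above; the rest is illustrative):

    #include <linux/notifier.h>

    static ATOMIC_NOTIFIER_HEAD(show_mem_notifier);

    int show_mem_notifier_register(struct notifier_block *nb)
    {
            /* safe to call the chain from atomic context, unlike the blocking variant */
            return atomic_notifier_chain_register(&show_mem_notifier, nb);
    }

    void show_mem_call_notifiers(void)
    {
            atomic_notifier_call_chain(&show_mem_notifier, 0, NULL);
    }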
2016-03-22  mm: page-writeback: fix page state calculation in throttle_vm_writeout  [Vinayak Menon]
It was found that a number of tasks were blocked in the reclaim path (throttle_vm_writeout) for seconds, because of vmstat_diff not being synced in time. Fix that by adding a new function global_page_state_snapshot. Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Change-Id: Iec167635ad724a55c27bdbd49eb8686e7857216c
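A hedged sketch of what such a snapshot helper plausibly does, modeled on the existing zone_page_state_snapshot(): fold the not-yet-flushed per-CPU deltas into the global counter instead of waiting for vmstat_update to sync them. The real global_page_state_snapshot() added by this patch may differ in detail.

    /* Sketch only -- modeled on zone_page_state_snapshot(). */
    static unsigned long global_page_state_snapshot(enum zone_stat_item item)
    {
            long x = atomic_long_read(&vm_stat[item]);
            struct zone *zone;
            int cpu;

            for_each_populated_zone(zone)
                    for_each_online_cpu(cpu)
                            x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

            return x < 0 ? 0 : x;
    }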
2016-03-22  mm: compaction: fix the page state calculation in too_many_isolated  [Vinayak Menon]
Commit "mm: vmscan: fix the page state calculation in too_many_isolated" fixed an issue where a number of tasks were blocked in the reclaim path for seconds, because of vmstat_diff not being synced in time. A similar problem can happen in isolate_migratepages_block, where a similar calculation is performed. This patch fixes that. Change-Id: Ie74f108ef770da688017b515fe37faea6f384589 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-22  mm: vmpressure: account allocstalls only on higher pressures  [Vinayak Menon]
At present any vmpressure value is scaled up if the pages are reclaimed through direct reclaim. This can result in false vmpressure values. Consider a case where a device is booted up and most of the memory is occupied by file pages. kswapd will make sure that the high watermark is maintained. Now when a sudden, huge allocation request comes in, the system will definitely have to get into direct reclaim. The vmpressure values can be very low, but because of the allocstall accounting logic even these low values will be scaled to values nearing 100. This can result in unnecessary LMK kills, for example. So define a tunable threshold for vmpressure above which allocstalls will be accounted. CRs-fixed: 893699 Change-Id: Idd7c6724264ac89f1f68f2e9d70a32390ffca3e5 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-22  mm: swap: don't delay swap free for fast swap devices  [Vinayak Menon]
There are a couple of issues with swapcache usage when ZRAM is used as the swap device:

1) The kernel does swap readahead, which can be around 6 to 8 pages depending on total RAM; this is not required for zram since accesses are fast.
2) The kernel delays the freeing of swapcache expecting a later hit, which again is useless in the case of zram.
3) This is not related to swapcache but to zram usage itself: as mentioned in (2), the kernel delays freeing of swapcache, but along with that it also delays freeing the zram compressed page, i.e. there can be two copies, though one is compressed.

This patch addresses these issues using two new flags, QUEUE_FLAG_FAST and SWP_FAST, to indicate that accesses to the device will be fast and cheap, and instructs the swap layer to free up swap space aggressively and not to do readahead. Change-Id: I5d2d5176a5f9420300bb2f843f6ecbdb25ea80e4 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-22  mm: vmpressure: scale pressure based on reclaim context  [Vinayak Menon]
The existing calculation of vmpressure takes into account only the ratio of reclaimed to scanned pages, but not the time spent or the difficulty in reclaiming those pages. For example, when there are quite a number of file pages in the system, an allocation request can be satisfied by reclaiming the file pages alone. If such a reclaim is successful, the vmpressure value will remain low irrespective of the time spent by the reclaim code to free up the file pages. With a feature like lowmemorykiller, killing a task can be faster than reclaiming the file pages alone. So if the vmpressure values reflect the reclaim difficulty level, clients can make a decision based on that, for example, to kill a task early. This patch monitors the number of pages scanned in the direct reclaim path and scales the vmpressure level according to that. Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Change-Id: I6e643d29a9a1aa0814309253a8b690ad86ec0b13
2016-03-22  mm: vmpressure: allow in-kernel clients to subscribe for events  [Vinayak Menon]
Currently, vmpressure is tied to memcg and its events are available only to userspace clients. This patch removes the dependency on CONFIG_MEMCG and adds a mechanism for in-kernel clients to subscribe for vmpressure events (in fact raw vmpressure values are delivered instead of vmpressure levels, to provide clients more flexibility to take actions on custom pressure levels which are not currently defined by vmpressure module). Change-Id: I38010f166546e8d7f12f5f355b5dbfd6ba04d587 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
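A sketch of what an in-kernel subscriber could look like; the registration function name and callback signature are illustrative assumptions, not the exact API added by the patch:

    #include <linux/notifier.h>

    /* hypothetical registration hook -- the real name may differ */
    extern int vmpressure_notifier_register(struct notifier_block *nb);

    static int my_vmpressure_event(struct notifier_block *nb,
                                   unsigned long pressure, void *data)
    {
            /* a raw 0..100 value is delivered, so the client picks its own threshold */
            if (pressure >= 90)
                    pr_info("severe memory pressure: %lu\n", pressure);
            return NOTIFY_OK;
    }

    static struct notifier_block my_nb = {
            .notifier_call = my_vmpressure_event,
    };

    /* in the client's init code: vmpressure_notifier_register(&my_nb); */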
2016-03-22  mm/Kconfig: support forcing allocators to return ZONE_DMA memory  [Liam Mark]
Add a new config item, CONFIG_FORCE_ALLOC_FROM_DMA_ZONE, which can be used to optionally force certain allocators to always return memory from ZONE_DMA. This option helps ensure that clients who require ZONE_DMA memory are always using ZONE_DMA memory. Change-Id: Id2d36214307789f27aa775c2bef2dab5047c4ff0 Signed-off-by: Liam Mark <lmark@codeaurora.org>
2016-03-22  kmemleak: Make kmemleak_stack_scan optional using config  [Vignesh Radhakrishnan]
Currently we have kmemleak_stack_scan enabled by default. This can hog the CPU with preemption disabled for a long time, starving other tasks. Make this optional at compile time, since if required we can always write to the sysfs entry and enable this option. Change-Id: Ie30447861c942337c7ff25ac269b6025a527e8eb Signed-off-by: Vignesh Radhakrishnan <vigneshr@codeaurora.org> Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
2016-03-22  mm: switch KASan hook calling order in page alloc/free path  [Se Wang (Patrick) Oh]
When CONFIG_PAGE_POISONING is enabled, the pages are poisoned after the free page state is set in the KASan shadow memory, and KASan reports a read-after-free warning. The same thing happens in the allocation path. So change the order of calling the KASan alloc/free API so that page poisoning happens while the pages are in alloc status in the KASan shadow memory. The following KASan report is included for reference.

==================================================================
BUG: KASan: use after free in memset+0x24/0x44 at addr ffffffc000000000
Write of size 4096 by task swapper/0
page:ffffffbac5000000 count:0 mapcount:0 mapping: (null) index:0x0
flags: 0x0()
page dumped because: kasan: bad access detected
CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.0-g5a4a5d5-07242-g6938a8b-dirty #1
Hardware name: Qualcomm Technologies, Inc. MSM 8996 v2 + PMI8994 MTP (DT)
Call trace:
[<ffffffc000089ea4>] dump_backtrace+0x0/0x1c4
[<ffffffc00008a078>] show_stack+0x10/0x1c
[<ffffffc0010ecfd8>] dump_stack+0x74/0xc8
[<ffffffc00020faec>] kasan_report_error+0x2b0/0x408
[<ffffffc00020fd20>] kasan_report+0x34/0x40
[<ffffffc00020f138>] __asan_storeN+0x15c/0x168
[<ffffffc00020f374>] memset+0x20/0x44
[<ffffffc0002086e0>] kernel_map_pages+0x238/0x2a8
[<ffffffc0001ba738>] free_pages_prepare+0x21c/0x25c
[<ffffffc0001bc7e4>] __free_pages_ok+0x20/0xf0
[<ffffffc0001bd3bc>] __free_pages+0x34/0x44
[<ffffffc0001bd5d8>] __free_pages_bootmem+0xf4/0x110
[<ffffffc001ca9050>] free_all_bootmem+0x160/0x1f4
[<ffffffc001c97b30>] mem_init+0x70/0x1ec
[<ffffffc001c909f8>] start_kernel+0x2b8/0x4e4
[<ffffffc001c987dc>] kasan_early_init+0x154/0x160

Change-Id: Idbd3dc629be57ed55a383b069a735ae3ee7b9f05
Signed-off-by: Se Wang (Patrick) Oh <sewango@codeaurora.org>
2016-03-22  mm: change initial readahead window size calculation  [Lee Susman]
Change the logic which determines the initial readahead window size such that for small requests (one page) the initial window size will be x4 the size of the original request, regardless of the VM_MAX_READAHEAD value. This prevents a rapid ramp-up that could otherwise be caused by increasing VM_MAX_READAHEAD. Change-Id: I93d59c515d7e6c6d62348790980ff7bd4f434997 Signed-off-by: Lee Susman <lsusman@codeaurora.org>
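For reference, a hedged reconstruction of the stock window sizing in mm/readahead.c: the initial window is derived from the maximum, so a large VM_MAX_READAHEAD inflates even a one-page miss. The change described here pins the small-request case to 4x the request size instead.

    /* Reconstruction of the pre-change heuristic (may differ in detail). */
    static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
    {
            unsigned long newsize = roundup_pow_of_two(size);

            if (newsize <= max / 32)
                    newsize = newsize * 4;   /* small request: window scales with max/32 cutoff */
            else if (newsize <= max / 4)
                    newsize = newsize * 2;
            else
                    newsize = max;

            return newsize;
    }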
2016-03-22  mm: split_free_page ignore memory watermarks for CMA  [Liam Mark]
Memory watermarks were sometimes preventing CMA allocations in low memory. Change-Id: I550ec987cbd6bc6dadd72b4a764df20cd0758479 Signed-off-by: Liam Mark <lmark@codeaurora.org>
2016-03-22  mm: Don't put CMA pages on per cpu lists  [Laura Abbott]
CMA allocations rely on being able to migrate pages out quickly to fulfill the allocations. Most use cases for movable allocations meet this requirement. File system allocations may take an unacceptably long time to migrate, which creates delays for CMA. Prevent CMA pages from ending up on the per-cpu lists to avoid code paths grabbing CMA pages on the fast path. CMA pages can still be allocated as a fallback under tight memory pressure. CRs-Fixed: 452508 Change-Id: I79a28f697275a2a1870caabae53c8ea345b4b47d Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
2016-03-22  mm: Add is_cma_pageblock definition  [Laura Abbott]
Bring back the is_cma_pageblock definition for determining if a page is CMA or not. Change-Id: I39fd546e22e240b752244832c79514f109c8e84b Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
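A short sketch of what such a helper amounts to, assuming it keys off the pageblock migratetype (illustrative, not the exact definition being brought back):

    /* hedged sketch: a page is "CMA" when its pageblock is MIGRATE_CMA */
    static inline bool is_cma_pageblock(struct page *page)
    {
    #ifdef CONFIG_CMA
            return get_pageblock_migratetype(page) == MIGRATE_CMA;
    #else
            return false;
    #endif
    }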
2016-03-22  mm: vmscan: support setting of kswapd cpu affinity  [Liam Mark]
Allow the kswapd cpu affinity to be configured. There can be power benefits on certain targets when limiting kswapd to run only on certain cores. CRs-fixed: 752344 Change-Id: I8a83337ff313a7e0324361140398226a09f8be0f Signed-off-by: Liam Mark <lmark@codeaurora.org> [imaund@codeaurora.org: Resolved trivial context conflicts.] Signed-off-by: Ian Maund <imaund@codeaurora.org>
2016-03-22  mm: vmscan: lock page on swap error in pageout  [Vinayak Menon]
A workaround was added earlier to move a page to the active list if swapping to devices like zram fails. But this can result in try_to_free_swap being called from shrink_page_list without a properly locked page. Lock the page when we indicate to activate a page in pageout(). Add a check to ensure that the error is on swap, and clear the error flag before moving the page to the active list. CRs-fixed: 760049 Change-Id: I77a8bbd6ed13efdec943298fe9448412feeac176 Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
2016-03-22  mm: vmscan: support complete shrinker reclaim  [Liam Mark]
Ensure that shrinkers are given the option to completely drop their caches even when their caches are smaller than the batch size. This change helps improve memory headroom by ensuring that under significant memory pressure shrinkers can drop all of their caches. This change only attempts to more aggressively call the shrinkers during background memory reclaim, in order to avoid hurting the performance of direct memory reclaim. Change-Id: I8dbc29c054add639e4810e36fd2c8a063e5c52f3 Signed-off-by: Liam Mark <lmark@codeaurora.org>
2016-03-22  mm: slub: panic for object and slab errors  [David Keitel]
If the SLUB_DEBUG_PANIC_ON Kconfig option is selected, also panic for object and slab errors to allow capturing relevant debug data. Change-Id: Idc582ef48d3c0d866fa89cf8660ff0a5402f7e15 Signed-off-by: David Keitel <dkeitel@codeaurora.org>
2016-03-22  defconfig: 8994: enable CONFIG_DEBUG_SLUB_PANIC_ON  [David Keitel]
Add the DEBUG_SLUB_PANIC_ON option to KCONFIG preventing the existing defconfig option from being overwritten by make config. This will induce a panic if slab debug catches corruptions within the padding of a given object. The intention here is to induce collection of data immediately after the corruption is detected with the goal to catch the possible source of the corruption. Change-Id: Ide0102d0761022c643a761989360ae5c853870a8 Signed-off-by: David Keitel <dkeitel@codeaurora.org> [imaund@codeaurora.org: Resolved trivial merge conflicts.] Signed-off-by: Ian Maund <imaund@codeaurora.org> [lmark@codeaurora.org: ensure change does not create arch/arm64/configs/msm8994_defconfig file] Signed-off-by: Liam Mark <lmark@codeaurora.org>
2016-03-22  KSM: Start KSM by default  [Abhimanyu Garg]
Start running KSM by default at device bootup. Change-Id: I7926c529ea42675f4279bffaf149a0cf1080d61b Signed-off-by: Abhimanyu Garg <agarg@codeaurora.org>
2016-03-22  mm, oom: make dump_tasks public  [Liam Mark]
Allow other functions to dump the list of tasks. Useful when debugging memory leaks. Change-Id: I76c33a118a9765b4c2276e8c76de36399c78dbf6 Signed-off-by: Liam Mark <lmark@codeaurora.org>
2016-03-22  ksm: Add showmem notifier  [Laura Abbott]
KSM is yet another framework which may obfuscate some memory problems. Use the showmem notifier to show how KSM is being used to give some insight into potential issues or non-issues. Change-Id: If82405dc33f212d085e6847f7c511fd4d0a32a10 Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
2016-03-22  mm: slub: Panic instead of restoring corrupted bytes  [David Keitel]
Resiliency of slub was added for production systems in an attempt to restore corruptions and allow production environments to continue to run. In debug setups, this may not be desirable. Thus, rather than attempting to restore corrupted bytes in poisoned zones, panic to attempt to catch more context of what was going on in the system at the time. Add the CONFIG_SLUB_DEBUG_PANIC_ON defconfig option to allow debug builds to turn on this panic option. Change-Id: I01763e8eea40a4544e9b7e48c4e4d40840b6c82d Signed-off-by: David Keitel <dkeitel@codeaurora.org>
2016-03-22  ksm: Provide support to use deferred timers for scanner thread  [Chintan Pandya]
The KSM thread that scans pages is scheduled on a definite timeout. That wakes the CPU up from idle state and hence may affect power consumption. Provide optional support for using a deferred timer, which suits low-power use cases. To enable deferred timers:

  $ echo 1 > /sys/kernel/mm/ksm/deferred_timer

Change-Id: I07fe199f97fe1f72f9a9e1b0b757a3ac533719e8
Signed-off-by: Chintan Pandya <cpandya@codeaurora.org>
2016-03-22  mm: vmscan: support equal reclaim for anon and file pages  [Liam Mark]
When performing memory reclaim, support treating anonymous and file-backed pages equally. Swapping anonymous pages out to memory can be efficient enough to justify treating anonymous and file-backed pages equally. CRs-Fixed: 648984 Change-Id: I6315b8557020d1e27a34225bb9cefbef1fb43266 Signed-off-by: Liam Mark <lmark@codeaurora.org>
2016-03-22  mm: vmscan: Move pages that fail swapout to LRU active list  [Olav Haugan]
Move pages that fail swapout to the LRU active list to reduce pressure on swap device when swapping out is already failing. This helps when using a pseudo swap device such as zram which starts failing when memory is low. Change-Id: Ib136cd0a744378aa93d837a24b9143ee818c80b3 Signed-off-by: Olav Haugan <ohaugan@codeaurora.org>
2016-03-22  mm: Remove __init annotations from free_bootmem_late  [Laura Abbott]
free_bootmem_late is currently set up to only be used in init functions. Some clients need to use this function past initcalls. The functions themselves have no restrictions on being used later other than the __init annotations, so remove the annotations. Change-Id: I7c7e15cf2780a8843ebb4610da5b633c9abb0b3d Signed-off-by: Laura Abbott <lauraa@codeaurora.org> [abhimany@codeaurora.org: resolve minor conflict and remove __init from nobootmem.c] Signed-off-by: Abhimanyu Kapur <abhimany@codeaurora.org>
2016-03-22  mm: Mark free pages as read only  [Laura Abbott]
Drivers have a tendency to scribble on everything, including free pages. Make life easier by marking free pages as read only when on the buddy list and re-marking as read/write when allocating. Change-Id: I978ed2921394919917307b9c99217fdc22f82c59 Signed-off-by: Laura Abbott <lauraa@codeaurora.org> (cherry picked from commit 752f5aecb0511c4d661dce2538c723675c1e6449)
2016-03-22  mm: Add notifier framework for showing memory  [Laura Abbott]
There are many drivers in the kernel which can hold on to lots of memory. It can be useful to dump out all those drivers at key points in the kernel. Introduce a notifier framework for dumping this information. When the notifiers are called, drivers can dump out the state of any memory they may be using. Change-Id: Ifb2946964bf5d072552dd56d8d6dfdd794af6d84 Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
2016-03-22  memblock: Add memblock_overlaps_memory()  [Stephen Boyd]
Add a new function, memblock_overlaps_memory(), to check if a region overlaps with a memory bank. This will be used by peripheral loader code to detect when kernel memory would be overwritten. Change-Id: I851f8f416a0f36e85c0e19536b5209f7d4bd431c Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> (cherry picked from commit cc2753448d9f2adf48295f935a7eee36023ba8d3) Signed-off-by: Josh Cartwright <joshc@codeaurora.org>
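A hedged sketch of the overlap test (the real helper may reuse memblock's internal region-overlap helper instead of open-coding the loop):

    #include <linux/memblock.h>

    /* true if [base, base + size) intersects any registered memory region */
    bool memblock_overlaps_memory(phys_addr_t base, phys_addr_t size)
    {
            unsigned long i;

            for (i = 0; i < memblock.memory.cnt; i++) {
                    struct memblock_region *rgn = &memblock.memory.regions[i];

                    if (base < rgn->base + rgn->size && base + size > rgn->base)
                            return true;
            }

            return false;
    }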
2016-02-16  FROMLIST: mm: mmap: Add new /proc tunable for mmap_base ASLR.  [dcashman]
(cherry picked from commit https://lkml.org/lkml/2015/12/21/337) ASLR only uses as few as 8 bits to generate the random offset for the mmap base address on 32 bit architectures. This value was chosen to prevent a poorly chosen value from dividing the address space in such a way as to prevent large allocations. This may not be an issue on all platforms. Allow the specification of a minimum number of bits so that platforms desiring greater ASLR protection may determine where to place the trade-off. Bug: 24047224 Signed-off-by: Daniel Cashman <dcashman@android.com> Signed-off-by: Daniel Cashman <dcashman@google.com> Change-Id: Ibf9ed3d4390e9686f5cc34f605d509a20d40e6c2
2016-02-16  mm: add a field to store names for private anonymous memory  [Colin Cross]
Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage.

This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged.

Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it.

The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>".

The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fields that are only used on file-backed mappings, so it does not increase memory usage.

Includes fix from Jed Davis <jld@mozilla.com> for typo in prctl_set_vma_anon_name, which could attempt to set the name across two vmas at the same time due to a typo, which might corrupt the vma list. Fix it to use tmp instead of end to limit the name setting to a single vma at a time.

Change-Id: I9aa7b6b5ef536cd780599ba4e2fba8ceebe8b59f
Signed-off-by: Dmitry Shmidt <dimitrysh@google.com>
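A small userspace example of the interface described above. The prctl arguments follow the commit text; the numeric fallback values for the constants are taken from the Android patch and should be treated as assumptions if your headers do not define them.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_VMA
    #define PR_SET_VMA            0x53564d41   /* assumption: value used by the Android patch */
    #define PR_SET_VMA_ANON_NAME  0
    #endif

    int main(void)
    {
            size_t len = 4096 * 4;
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            /* The string lives in the process and is read back lazily, so it
             * must stay valid for the lifetime of the mapping. */
            static const char name[] = "my-heap";
            if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                      (unsigned long)p, len, (unsigned long)name))
                    perror("prctl(PR_SET_VMA)");

            /* On a patched kernel, /proc/self/maps now shows [anon:my-heap]. */
            return 0;
    }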
2016-02-16  add extra free kbytes tunable  [Rik van Riel]
Add a userspace visible knob to tell the VM to keep an extra amount of memory free, by increasing the gap between each zone's min and low watermarks. This is useful for realtime applications that call system calls and have a bound on the number of allocations that happen in any short time period. In this application, extra_free_kbytes would be left at an amount equal to or larger than the maximum number of allocations that happen in any burst. It may also be useful to reduce the memory use of virtual machines (temporarily?), in a way that does not cause memory fragmentation like ballooning does. [ccross] Revived for use on old kernels where no other solution exists. The tunable will be removed on kernels that do better at avoiding direct reclaim. Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Colin Cross <ccross@android.com>
2016-02-16  mm: vmscan: Add a debug file for shrinkers  [Rebecca Schultz Zavin]
This patch adds a debugfs file called "shrinker". When read, this calls all the shrinkers in the system with nr_to_scan set to zero and prints the result. These results are the number of objects the shrinkers have available and can thus be used as an indication of the total memory that would be available to the system if a shrink occurred. Change-Id: Ied0ee7caff3d2fc1cb4bb839aaafee81b5b0b143 Signed-off-by: Rebecca Schultz Zavin <rebecca@android.com>
2016-02-16  UPSTREAM: memcg: Only free spare array when readers are done  [Martijn Coenen]
A spare array holding mem cgroup threshold events is kept around to make sure we can always safely deregister an event and have an array to store the new set of events in. In the scenario where we're going from 1 to 0 registered events, the pointer to the primary array containing 1 event is copied to the spare slot, and then the spare slot is freed because no events are left. However, it is freed before calling synchronize_rcu(), which means readers may still be accessing threshold->primary after it is freed. Fixed by only freeing after synchronize_rcu(). Signed-off-by: Martijn Coenen <maco@google.com>
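The fix follows the standard RCU publish/retire ordering; a hedged, simplified illustration of the bug class (not the exact memcontrol.c hunk):

    /* An RCU-published array may still be traversed by readers, so it must
     * only be freed after a grace period has elapsed. */
    rcu_assign_pointer(thresholds->primary, NULL);  /* unpublish: zero events left */
    synchronize_rcu();                              /* wait for in-flight readers */
    kfree(old_array);                               /* was: freed before synchronize_rcu() */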
2016-02-16  cgroup: refactor allow_attach handler for 4.4  [Amit Pundir]
Refactor *allow_attach() handler to align it with the changes from mainline commit 1f7dd3e5a6e4 "cgroup: fix handling of multi-destination migration from subtree_control enabling". Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16  cgroup: memcg: pass correct argument to subsys_cgroup_allow_attach  [Amit Pundir]
Pass the correct argument to subsys_cgroup_allow_attach(), which expects a 'struct cgroup_subsys_state *' argument, but we pass 'struct cgroup *' instead, which doesn't seem right. This fixes the following 'incompatible pointer type' compiler warning:
----------
  CC      mm/memcontrol.o
mm/memcontrol.c: In function ‘mem_cgroup_allow_attach’:
mm/memcontrol.c:5052:2: warning: passing argument 1 of ‘subsys_cgroup_allow_attach’ from incompatible pointer type [enabled by default]
In file included from include/linux/memcontrol.h:22:0,
                 from mm/memcontrol.c:29:
include/linux/cgroup.h:953:5: note: expected ‘struct cgroup_subsys_state *’ but argument is of type ‘struct cgroup *’
----------
Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
2016-02-16  memcg: add permission check  [Rom Lemarchand]
Use the 'allow_attach' handler for the 'mem' cgroup to allow non-root processes to add arbitrary processes to a 'mem' cgroup if it has the CAP_SYS_NICE capability set. Bug: 18260435 Change-Id: If7d37bf90c1544024c4db53351adba6a64966250 Signed-off-by: Rom Lemarchand <romlem@android.com>
2016-01-11  ashmem: Add shmem_set_file to mm/shmem.c  [John Stultz]
NOT FOR STAGING This patch re-adds the original shmem_set_file to mm/shmem.c and converts ashmem.c back to using it. CC: Brian Swetland <swetland@google.com> CC: Colin Cross <ccross@android.com> CC: Arve Hjønnevåg <arve@android.com> CC: Dima Zavin <dima@android.com> CC: Robert Love <rlove@google.com> CC: Greg KH <greg@kroah.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2016-01-08  vmstat: allocate vmstat_wq before it is used  [Michal Hocko]
kernel test robot has reported the following crash:

  BUG: unable to handle kernel NULL pointer dereference at 00000100
  IP: [<c1074df6>] __queue_work+0x26/0x390
  *pdpt = 0000000000000000 *pde = f000ff53f000ff53 *pde = f000ff53f000ff53
  Oops: 0000 [#1] PREEMPT PREEMPT SMP SMP
  CPU: 0 PID: 24 Comm: kworker/0:1 Not tainted 4.4.0-rc4-00139-g373ccbe #1
  Workqueue: events vmstat_shepherd
  task: cb684600 ti: cb7ba000 task.ti: cb7ba000
  EIP: 0060:[<c1074df6>] EFLAGS: 00010046 CPU: 0
  EIP is at __queue_work+0x26/0x390
  EAX: 00000046 EBX: cbb37800 ECX: cbb37800 EDX: 00000000
  ESI: 00000000 EDI: 00000000 EBP: cb7bbe68 ESP: cb7bbe38
  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
  CR0: 8005003b CR2: 00000100 CR3: 01fd5000 CR4: 000006b0
  Stack:
  Call Trace:
    __queue_delayed_work+0xa1/0x160
    queue_delayed_work_on+0x36/0x60
    vmstat_shepherd+0xad/0xf0
    process_one_work+0x1aa/0x4c0
    worker_thread+0x41/0x440
    kthread+0xb0/0xd0
    ret_from_kernel_thread+0x21/0x40

The reason is that start_shepherd_timer schedules the shepherd work item which uses vmstat_wq (vmstat_shepherd) before setup_vmstat allocates that workqueue, so if the further initialization takes more than HZ we might end up scheduling on a NULL vmstat_wq. This is really unlikely but not impossible.

Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress")
Reported-by: kernel test robot <ying.huang@linux.intel.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: stable@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-29  mm/vmstat: fix overflow in mod_zone_page_state()  [Heiko Carstens]
mod_zone_page_state() takes a "delta" integer argument. delta contains the number of pages that should be added or subtracted from a struct zone's vm_stat field. If a zone is larger than 8TB this will cause overflows. E.g. for a zone with a size slightly larger than 8TB the line

  mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);

in mm/page_alloc.c:free_area_init_core() will result in a negative result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB contains 0x8xxxxxxx pages which will be sign extended to a negative value. Fix this by changing the delta argument to long type.

This could fix an early boot problem seen on s390, where we have a 9TB system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is completely empty, so it tries to reclaim pages in an endless loop. This was seen on a heavily patched 3.10 kernel. One possible explanation seems to be the overflows caused by mod_zone_page_state(). Unfortunately I did not have the chance to verify that this patch actually fixes the problem, since I don't have access to the system right now. However the overflow problem does exist anyway. Given the description that a system with slightly less than 8TB does work, this seems to be a candidate for the observed problem.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
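The arithmetic behind the limit: a zone slightly over 8TB holds 2^43 / 2^12 = 2^31 4KB pages, which no longer fits in a signed 32-bit delta. The interface change is essentially:

    /* before: delta overflows for zones with >= 2^31 pages */
    void mod_zone_page_state(struct zone *zone, enum zone_stat_item item, int delta);

    /* after: a long delta holds the full managed_pages count on 64-bit */
    void mod_zone_page_state(struct zone *zone, enum zone_stat_item item, long delta);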
2015-12-29  mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()  [Andrew Banman]
test_pages_in_a_zone() does not account for the possibility of missing sections in the given pfn range. pfn_valid_within always returns 1 when CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing sections to pass the test, leading to a kernel oops. Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check for missing sections before proceeding into the zone-check code. This also prevents a crash from offlining memory devices with missing sections. Despite this, it may be a good idea to keep the related patch '[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with missing sections' because missing sections in a memory block may lead to other problems not covered by the scope of this fix. Signed-off-by: Andrew Banman <abanman@sgi.com> Acked-by: Alex Thorlton <athorlton@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
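A hedged sketch of the added guard (not the verbatim hunk): walk the range section by section and skip sections that are not present before running the per-pfn zone checks.

    for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
            /* whole section missing: nothing to validate here */
            if (!present_section_nr(pfn_to_section_nr(pfn)))
                    continue;

            /* ... existing pfn_valid_within()/zone checks for this section ... */
    }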
2015-12-29  mm: memcontrol: fix possible memcg leak due to interrupted reclaim  [Vladimir Davydov]
Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break() once enough pages have been reclaimed, in which case, in contrast to a full round-trip over a cgroup sub-tree, the current position stored in mem_cgroup_reclaim_iter of the target cgroup does not get invalidated and so is left holding the reference to the last scanned cgroup. If the target cgroup does not get scanned again (we might have just reclaimed the last page or all processes might exit and free their memory voluntarily), we will leak it, because there is nobody to put the reference held by the iterator.

The problem is easy to reproduce by running the following command sequence in a loop:

  mkdir /sys/fs/cgroup/memory/test
  echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
  echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
  memhog 150M
  echo $$ > /sys/fs/cgroup/memory/cgroup.procs
  rmdir test

The cgroups generated by it will never get freed.

This patch fixes this issue by making mem_cgroup_iter avoid taking reference to the current position. In order not to hit use-after-free bug while running reclaim in parallel with cgroup deletion, we make use of ->css_released cgroup callback to clear references to the dying cgroup in all reclaim iterators that might refer to it. This callback is called right before scheduling rcu work which will free css, so if we access iter->position from rcu read section, we might be sure it won't go away under us.

[hannes@cmpxchg.org: clean up css ref handling]
Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-18  mm/zswap: change incorrect strncmp use to strcmp  [Dan Streetman]
Change the use of strncmp in zswap_pool_find_get() to strcmp. The use of strncmp is no longer correct, now that zswap_zpool_type is not an array; sizeof() will return the size of a pointer, which isn't the right length to compare. We don't need to use strncmp anyway, because the existing params and the passed in params are all guaranteed to be null terminated, so strcmp should be used. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reported-by: Weijie Yang <weijie.yang@samsung.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
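A self-contained illustration of the bug class (not the zswap code itself): when the length argument comes from sizeof applied to a pointer, strncmp compares only the first few bytes and can report a false match.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            const char *configured = "zsmalloc_one";   /* a pointer, not a char array */
            const char *requested  = "zsmalloc_two";

            /* sizeof(configured) is the pointer size (8 on 64-bit), so only
             * "zsmalloc" is compared and the strings wrongly "match" (prints 0). */
            printf("strncmp: %d\n", strncmp(requested, configured, sizeof(configured)));

            /* both strings are NUL-terminated, so plain strcmp is the right tool */
            printf("strcmp:  %d\n", strcmp(requested, configured));
            return 0;
    }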
2015-12-12  mm/oom_kill.c: avoid attempting to kill init sharing same memory  [Chen Jie]
It's possible that an oom killed victim shares an ->mm with the init process and thus oom_kill_process() would end up trying to kill init as well. This has been shown in practice:

  Out of memory: Kill process 9134 (init) score 3 or sacrifice child
  Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
  Kill process 1 (init) sharing same memory
  ...
  Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

And this will result in a kernel panic. If a process is forked by init and selected for oom kill while still sharing init_mm, then it's likely this system is in a recoverable state. However, it's better not to try to kill init and allow the machine to panic due to unkillable processes.

[rientjes@google.com: rewrote changelog]
[akpm@linux-foundation.org: fix inverted test, per Ben]
Signed-off-by: Chen Jie <chenjie6@huawei.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
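A hedged sketch of the guard this implies in oom_kill_process()'s loop over processes sharing the victim's mm; the exact placement and surrounding checks in the 4.4-era code may differ.

    /* Sketch: when killing other users of the victim's mm, never pick init. */
    for_each_process(p) {
            if (p->mm != mm || same_thread_group(p, victim))
                    continue;
            if (p->flags & PF_KTHREAD)
                    continue;
            if (is_global_init(p))          /* the added check: skip init */
                    continue;

            do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
    }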