Age | Commit message (Collapse) | Author |
|
lock_memory_hotplug
We don't need to do register_memory_resource() under
lock_memory_hotplug() since it has its own lock and doesn't make any
callbacks.
Also register_memory_resource return NULL on failure so we don't have
anything to cleanup at this point.
The reason for this rfc is I was doing some experiments with hotplugging
of memory on some of our larger systems. While it seems to work, it can
be quite slow. With some preliminary digging I found that
lock_memory_hotplug is clearly ripe for breakup.
It could be broken up per nid or something but it also covers the
online_page_callback. The online_page_callback shouldn't be very hard
to break out.
Also there is the issue of various structures(wmarks come to mind) that
are only updated under the lock_memory_hotplug that would need to be
dealt with.
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Hedi <hedi@sgi.com>
Cc: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
get_allocated_memblock_reserved_regions_info() should work if it is
compiled in. Extended the ifdef around
get_allocated_memblock_memory_regions_info() to include
get_allocated_memblock_reserved_regions_info() as well. Similar changes
in nobootmem.c/free_low_memory_core_early() where the two functions are
called.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
Cc: qiuxishi <qiuxishi@huawei.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Daeseok Youn <daeseok.youn@gmail.com>
Cc: Jiang Liu <liuj97@gmail.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Santosh Shilimkar <santosh.shilimkar@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
once with nid=0, but currently it is not true: if node 0 is not set in
the nodemask or if it is not online, we will not call such shrinkers at
all. As a result some slabs will be left untouched under some
circumstances. Let us fix it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When reclaiming kmem, we currently don't scan slabs that have less than
batch_size objects (see shrink_slab_node()):
while (total_scan >= batch_size) {
shrinkctl->nr_to_scan = batch_size;
shrinker->scan_objects(shrinker, shrinkctl);
total_scan -= batch_size;
}
If there are only a few shrinkers available, such a behavior won't cause
any problems, because the batch_size is usually small, but if we have a
lot of slab shrinkers, which is perfectly possible since FS shrinkers
are now per-superblock, we can end up with hundreds of megabytes of
practically unreclaimable kmem objects. For instance, mounting a
thousand of ext2 FS images with a hundred of files in each and iterating
over all the files using du(1) will result in about 200 Mb of FS caches
that cannot be dropped even with the aid of the vm.drop_caches sysctl!
This problem was initially pointed out by Glauber Costa [*]. Glauber
proposed to fix it by making the shrink_slab() always take at least one
pass, to put it simply, turning the scan loop above to a do{}while()
loop. However, this proposal was rejected, because it could result in
more aggressive and frequent slab shrinking even under low memory
pressure when total_scan is naturally very small.
This patch is a slightly modified version of Glauber's approach.
Similarly to Glauber's patch, it makes shrink_slab() scan less than
batch_size objects, but only if the total number of objects we want to
scan (total_scan) is greater than the total number of objects available
(max_pass). Since total_scan is biased as half max_pass if the current
delta change is small:
if (delta < max_pass / 4)
total_scan = min(total_scan, max_pass / 2);
this is only possible if we are scanning at high prio. That said, this
patch shouldn't change the vmscan behaviour if the memory pressure is
low, but if we are tight on memory, we will do our best by trying to
reclaim all available objects, which sounds reasonable.
[*] http://www.spinics.net/lists/cgroups/msg06913.html
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 7851a45cd3f6 ("mm: numa: Copy cpupid on page migration") copiess
over the cpupid at page migration time. It is unnecessary to set it
again in migrate_misplaced_transhuge_page().
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Two cleanups:
1. remove redundant codes for hugetlb pages.
2. end = pmd_addr_end(addr, end) restricts [addr, end) within PMD_SIZE,
this may increase do_mincore() calls, remove it.
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: qiuxishi <qiuxishi@huawei.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Compiling a C file which includes genalloc.h but without
spinlock_types.h being included before, we will see the compile error
below.
include/linux/genalloc.h:54:2: error: unknown type name `spinlock_t'
Include spinlock_types.h from genalloc.h to fix the problem.
Signed-off-by: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
traceevent regex
When irq, preempt and lockdep fields are printed (field 3 in the example
below) in the trace output, the script fails.
An example entry:
kswapd0-610 [000] ...1 158.112152: mm_vmscan_kswapd_wake: nid=0 order=0
Signed-off-by: Vinayak Menon <vinayakm.list@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If echo -1 > /proc/vm/sys/min_free_kbytes, the system will hang. Changing
proc_dointvec() to proc_dointvec_minmax() in the
min_free_kbytes_sysctl_handler() can prevent this to happen.
mhocko said:
: You can still do echo $BIG_VALUE > /proc/vm/sys/min_free_kbytes and make
: your machine unusable but I agree that proc_dointvec_minmax is more
: suitable here as we already have:
:
: .proc_handler = min_free_kbytes_sysctl_handler,
: .extra1 = &zero,
:
: It used to work properly but then 6fce56ec91b5 ("sysctl: Remove references
: to ctl_name and strategy from the generic sysctl table") has removed
: sysctl_intvec strategy and so extra1 is ignored.
Signed-off-by: Han Pingtian <hanpt@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Commit 11c731e81bb0 ("mm/mempolicy: fix !vma in new_vma_page()") has
removed BUG_ON(!vma) from new_vma_page which is partially correct
because page_address_in_vma will return EFAULT for non-linear mappings
and at least shared shmem might be mapped this way.
The patch also tried to prevent NULL ptr for hugetlb pages which is not
correct AFAICS because hugetlb pages cannot be mapped as VM_NONLINEAR
and other conditions in page_address_in_vma seem to be legit and catch
real bugs.
This patch restores BUG_ON for PageHuge to catch potential issues when
the to-be-migrated page is not setup properly.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
After thp split in hwpoison_user_mappings(), we hold page lock on the
raw error page only between try_to_unmap, hence we are in danger of race
condition.
I found in the RHEL7 MCE-relay testing that we have "bad page" error
when a memory error happens on a thp tail page used by qemu-kvm:
Triggering MCE exception on CPU 10
mce: [Hardware Error]: Machine check events logged
MCE exception done on CPU 10
MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption
MCE 0x38c535: dirty LRU page recovery: Recovered
qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000]
BUG: Bad page state in process qemu-kvm pfn:38c400
page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00
page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked)
Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ...
CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1
Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011
Call Trace:
dump_stack+0x19/0x1b
bad_page.part.59+0xcf/0xe8
free_pages_prepare+0x148/0x160
free_hot_cold_page+0x31/0x140
free_hot_cold_page_list+0x46/0xa0
release_pages+0x1c1/0x200
free_pages_and_swap_cache+0xad/0xd0
tlb_flush_mmu.part.46+0x4c/0x90
tlb_finish_mmu+0x55/0x60
exit_mmap+0xcb/0x170
mmput+0x67/0xf0
vhost_dev_cleanup+0x231/0x260 [vhost_net]
vhost_net_release+0x3f/0x90 [vhost_net]
__fput+0xe9/0x270
____fput+0xe/0x10
task_work_run+0xc4/0xe0
do_exit+0x2bb/0xa40
do_group_exit+0x3f/0xa0
get_signal_to_deliver+0x1d0/0x6e0
do_signal+0x48/0x5e0
do_notify_resume+0x71/0xc0
retint_signal+0x48/0x8c
The reason of this bug is that a page fault happens before unlocking the
head page at the end of memory_failure(). This strange page fault is
trying to access to address 0x20 and I'm not sure why qemu-kvm does
this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the
way we catch the bad page bug/warning because we try to free a locked
page (which was the former head page.)
To fix this, this patch suggests to shift page lock from head page to
tail page just after thp split. SIGSEGV still happens, but it affects
only error affected VMs, not a whole system.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org> [3.9+] # a3e0f9e47d5ef "mm/memory-failure.c: transfer page count from head page to tail page after split thp"
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Add a working sysctl to enable/disable automatic numa memory balancing
at runtime.
This allows us to track down performance problems with this feature and
is generally a good idea.
This was possible earlier through debugfs, but only with special
debugging options set. Also fix the boot message.
[akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When calling free_all_bootmem() the free areas under memblock's control
are released to the buddy allocator. Additionally the reserved list is
freed if it was reallocated by memblock. The same should apply for the
memory list.
Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When memblock_reserve() fails because memblock.reserved.regions cannot
be resized, the caller (e.g. alloc_bootmem()) is not informed of the
failed allocation. Therefore alloc_bootmem() silently returns the same
pointer again and again.
This patch adds a check for the return value of memblock_reserve() in
__alloc_memory_core().
Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently we take both the memcg_create_mutex and the set_limit_mutex
when we enable kmem accounting for a memory cgroup, which makes kmem
activation events serialize with both memcg creations and other memcg
limit updates (memory.limit, memory.memsw.limit). However, there is no
point in such strict synchronization rules there.
First, the set_limit_mutex was introduced to keep the memory.limit and
memory.memsw.limit values in sync. Since memory.kmem.limit can be set
independently of them, it is better to introduce a separate mutex to
synchronize against concurrent kmem limit updates.
Second, we take the memcg_create_mutex in order to make sure all
children of this memcg will be kmem-active as well. For achieving that,
it is enough to hold this mutex only while checking if
memcg_has_children() though. This guarantees that if a child is added
after we checked that the memcg has no children, the newly added cgroup
will see its parent kmem-active (of course if the latter succeeded), and
call kmem activation for itself.
This patch simplifies the locking rules of memcg_update_kmem_limit()
according to these considerations.
[vdavydov@parallels.com: fix unintialized var warning]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently we have two state bits in mem_cgroup::kmem_account_flags
regarding kmem accounting activation, ACTIVATED and ACTIVE. We start
kmem accounting only if both flags are set (memcg_can_account_kmem()),
plus throughout the code there are several places where we check only
the ACTIVE flag, but we never check the ACTIVATED flag alone. These
flags are both set from memcg_update_kmem_limit() under the
set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
they never get cleared. That said checking if both flags are set is
equivalent to checking only for the ACTIVE flag, and since there is no
ACTIVATED flag checks, we can safely remove the ACTIVATED flag, and
nothing will change.
Let's try to understand what was the reason for introducing these flags.
The purpose of the ACTIVE flag is clear - it states that kmem should be
accounting to the cgroup. The only requirement for it is that it should
be set after we have fully initialized kmem accounting bits for the
cgroup and patched all static branches relating to kmem accounting.
Since we always check if static branch is enabled before actually
considering if we should account (otherwise we wouldn't benefit from
static branching), this guarantees us that we won't skip a commit or
uncharge after a charge due to an unpatched static branch.
Now let's move on to the ACTIVATED bit. As I proved in the beginning of
this message, it is absolutely useless, and removing it will change
nothing. So what was the reason introducing it?
The ACTIVATED flag was introduced by commit a8964b9b84f9 ("memcg: use
static branches when code not in use") in order to guarantee that
static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
for each memory cgroup when its kmem accounting was activated. The
point was that at that time the memcg_update_kmem_limit() function's
work-flow looked like this:
bool must_inc_static_branch = false;
cgroup_lock();
mutex_lock(&set_limit_mutex);
if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
/* The kmem limit is set for the first time */
ret = res_counter_set_limit(&memcg->kmem, val);
memcg_kmem_set_activated(memcg);
must_inc_static_branch = true;
} else
ret = res_counter_set_limit(&memcg->kmem, val);
mutex_unlock(&set_limit_mutex);
cgroup_unlock();
if (must_inc_static_branch) {
/* We can't do this under cgroup_lock */
static_key_slow_inc(&memcg_kmem_enabled_key);
memcg_kmem_set_active(memcg);
}
So that without the ACTIVATED flag we could race with other threads
trying to set the limit and increment the static branching ref-counter
more than once. Today we call the whole memcg_update_kmem_limit()
function under the set_limit_mutex and this race is impossible.
As now we understand why the ACTIVATED bit was introduced and why we
don't need it now, and know that removing it will change nothing anyway,
let's get rid of it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We relocate root cache's memcg_params whenever we need to grow the
memcg_caches array to accommodate all kmem-active memory cgroups.
Currently on relocation we free the old version immediately, which can
lead to use-after-free, because the memcg_caches array is accessed
lock-free (see cache_from_memcg_idx()). This patch fixes this by making
memcg_params RCU-protected for root caches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
There is no point in flooding logs with warnings or especially crashing
the system if we fail to create a cache for a memcg. In this case we
will be accounting the memcg allocation to the root cgroup until we
succeed to create its own cache, but it isn't that critical.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
kmem_cache_dup() is only called from memcg_create_kmem_cache(). The
latter, in fact, does nothing besides this, so let's fold
kmem_cache_dup() into memcg_create_kmem_cache().
This patch also makes the memcg_cache_mutex private to
memcg_create_kmem_cache(), because it is not used anywhere else.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We obtain a per-memcg cache from a root kmem_cache by dereferencing an
entry of the root cache's memcg_params::memcg_caches array. If we find
no cache for a memcg there on allocation, we initiate the memcg cache
creation (see memcg_kmem_get_cache()). The cache creation proceeds
asynchronously in memcg_create_kmem_cache() in order to avoid lock
clashes, so there can be several threads trying to create the same
kmem_cache concurrently, but only one of them may succeed. However, due
to a race in the code, it is not always true. The point is that the
memcg_caches array can be relocated when we activate kmem accounting for
a memcg (see memcg_update_all_caches(), memcg_update_cache_size()). If
memcg_update_cache_size() and memcg_create_kmem_cache() proceed
concurrently as described below, we can leak a kmem_cache.
Asume two threads schedule creation of the same kmem_cache. One of them
successfully creates it. Another one should fail then, but if
memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
follows, it won't:
memcg_create_kmem_cache() memcg_update_cache_size()
(called w/o mutexes held) (called with slab_mutex,
set_limit_mutex held)
------------------------- -------------------------
mutex_lock(&memcg_cache_mutex)
s->memcg_params=kzalloc(...)
new_cachep=cache_from_memcg_idx(cachep,idx)
// new_cachep==NULL => proceed to creation
s->memcg_params->memcg_caches[i]
=cur_params->memcg_caches[i]
// kmem_cache_create_memcg takes slab_mutex
// so we will hang around until
// memcg_update_cache_size finishes, but
// nothing will prevent it from succeeding so
// memcg_caches[idx] will be overwritten in
// memcg_register_cache!
new_cachep = kmem_cache_create_memcg(...)
mutex_unlock(&memcg_cache_mutex)
Let's fix this by moving the check for existence of the memcg cache to
kmem_cache_create_memcg() to be called under the slab_mutex and make it
return NULL if so.
A similar race is possible when destroying a memcg cache (see
kmem_cache_destroy()). Since memcg_unregister_cache(), which clears the
pointer in the memcg_caches array, is called w/o protection, we can race
with memcg_update_cache_size() and omit clearing the pointer. Therefore
memcg_unregister_cache() should be moved before we release the
slab_mutex.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
All caches of the same memory cgroup are linked in the memcg_slab_caches
list via kmem_cache::memcg_params::list. This list is traversed, for
example, when we read memory.kmem.slabinfo.
Since the list actually consists of memcg_cache_params objects, we have
to convert an element of the list to a kmem_cache object using
memcg_params_to_cache(), which obtains the pointer to the cache from the
memcg_params::memcg_caches array of the corresponding root cache. That
said the pointer to a kmem_cache in its parent's memcg_params must be
initialized before adding the cache to the list, and cleared only after
it has been unlinked. Currently it is vice-versa, which can result in a
NULL ptr dereference while traversing the memcg_slab_caches list. This
patch restores the correct order.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Each root kmem_cache has pointers to per-memcg caches stored in its
memcg_params::memcg_caches array. Whenever we want to allocate a slab
for a memcg, we access this array to get per-memcg cache to allocate
from (see memcg_kmem_get_cache()). The access must be lock-free for
performance reasons, so we should use barriers to assert the kmem_cache
is up-to-date.
First, we should place a write barrier immediately before setting the
pointer to it in the memcg_caches array in order to make sure nobody
will see a partially initialized object. Second, we should issue a read
barrier before dereferencing the pointer to conform to the write
barrier.
However, currently the barrier usage looks rather strange. We have a
write barrier *after* setting the pointer and a read barrier *before*
reading the pointer, which is incorrect. This patch fixes this.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently, we have rather a messy function set relating to per-memcg
kmem cache initialization/destruction.
Per-memcg caches are created in memcg_create_kmem_cache(). This
function calls kmem_cache_create_memcg() to allocate and initialize a
kmem cache and then "registers" the new cache in the
memcg_params::memcg_caches array of the parent cache.
During its work-flow, kmem_cache_create_memcg() executes the following
memcg-related functions:
- memcg_alloc_cache_params(), to initialize memcg_params of the newly
created cache;
- memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
list.
On the other hand, kmem_cache_destroy() called on a cache destruction
only calls memcg_release_cache(), which does all the work: it cleans the
reference to the cache in its parent's memcg_params::memcg_caches,
removes the cache from the memcg_slab_caches list, and frees
memcg_params.
Such an inconsistency between destruction and initialization paths make
the code difficult to read, so let's clean this up a bit.
This patch moves all the code relating to registration of per-memcg
caches (adding to memcg list, setting the pointer to a cache from its
parent) to the newly created memcg_register_cache() and
memcg_unregister_cache() functions making the initialization and
destruction paths look symmetrical.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We do not free the cache's memcg_params if __kmem_cache_create fails.
Fix this.
Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
because it actually does not register the cache anywhere, but simply
initialize kmem_cache::memcg_params.
[akpm@linux-foundation.org: fix build]
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Currently kmem_cache_create_memcg() backoffs on failure inside
conditionals, without using gotos. This results in the rollback code
duplication, which makes the function look cumbersome even though on
error we should only free the allocated cache. Since in the next patch
I am going to add yet another rollback function call on error path
there, let's employ labels instead of conditionals for undoing any
changes on failure to keep things clean.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and
the registers.
I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is
quite useful to people debugging issues in mm.
This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual
BUG_ON.
[akpm@linux-foundation.org: fix up includes]
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
know that a specified page is thp or not. But sometimes it's not enough
and we fail to detect thp when the thp is on pagevec. This happens only
for a few seconds after LRU list operations, but it makes it difficult
to control our applications depending on this flag.
So this patch adds another check PageAnon to detect thps on pagevec. It
might not give the future extensibility for thp pagecache, but it's OK
at least for now.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The vmalloc was introduced by 33327948782b ("memcgroup: use vmalloc for
mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
defining the per-node array in the mem_cgroup structure so that the
structure could be huge even if the system had the only NUMA node.
The situation was significantly improved by commit 45cf7ebd5a03 ("memcg:
reduce the size of struct memcg 244-fold"), which made the size of the
mem_cgroup structure calculated dynamically depending on the real number
of NUMA nodes installed on the system (nr_node_ids), so now there is no
point in using vmalloc here: the structure is allocated rarely and on
most systems its size is about 1K.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP
pages") munlock skips tail pages of a munlocked THP page. There is some
attempt to prevent bad consequences of racing with a THP page split, but
code inspection indicates that there are two problems that may lead to a
non-fatal, yet wrong outcome.
First, __split_huge_page_refcount() copies flags including PageMlocked
from the head page to the tail pages. Clearing PageMlocked by
munlock_vma_page() in the middle of this operation might result in part
of tail pages left with PageMlocked flag. As the head page still
appears to be a THP page until all tail pages are processed,
munlock_vma_page() might think it munlocked the whole THP page and skip
all the former tail pages. Before ff6a6da60, those pages would be
cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
would still become undercounted (related the next point).
Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after
the PageMlocked is cleared. The accounting might also become
inconsistent due to race with __split_huge_page_refcount()
- undercount when HUGE_PMD_NR is subtracted, but some tail pages are
left with PageMlocked set and counted again (only possible before
ff6a6da60)
- overcount when hpage_nr_pages() sees a normal page (split has already
finished), but the parallel split has meanwhile cleared PageMlocked from
additional tail pages
This patch prevents both problems via extending the scope of lru_lock in
munlock_vma_page(). This is convenient because:
- __split_huge_page_refcount() takes lru_lock for its whole operation
- munlock_vma_page() typically takes lru_lock anyway for page isolation
As this becomes a second function where page isolation is done with
lru_lock already held, factor this out to a new
__munlock_isolate_lru_page() function and clean up the code around.
[akpm@linux-foundation.org: avoid a coding-style ugly]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether ->index or ->mapping is required to be NULL.
This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places. It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.
This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page
flag combination which was the problem.
[akpm@linux-foundation.org: switch to pr_alert]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The "compressor" and "enabled" params are currently hidden, this changes
them to read-only, so userspace can tell if zswap is enabled or not and
see what compressor is in use.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Vladimir Murzin <murzin.v@gmail.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Documentation/vm/locking is a blast from the past. In the entire git
history, it has had precisely Three modifications. Two of those look to
be pure renames, and the third was from 2005.
The doc contains such gems as:
> The page_table_lock is grabbed while holding the
> kernel_lock spinning monitor.
> Page stealers hold kernel_lock to protect against a bunch of
> races.
Or this which talks about mmap_sem:
> 4. The exception to this rule is expand_stack, which just
> takes the read lock and the page_table_lock, this is ok
> because it doesn't really modify fields anybody relies on.
expand_stack() doesn't take any locks any more directly, and the
mmap_sem acquisition was long ago moved up in to the page fault code
itself.
It could be argued that we need to rewrite this, but it is dangerous to
leave it as-is. It will confuse more people than it helps.
Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Sort the exception table at build-time rather than during boot.
Microblaze is the same case as AARCH64 that's why EM_MICROBLAZE
conditional check was added to allow cross-compilation on machines which
are not running the latest libc-dev.
Inspired by AARCH64 commit adace89562c7 ("arm64: extable: sort the
exception table at build time").
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: David Daney <david.daney@cavium.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
drivers/staging/comedi/drivers/das6402.c: In function 'intr_handler':
drivers/staging/comedi/drivers/das6402.c:164:3: error: implicit declaration of function 'outw_p' [-Werror=implicit-function-declaration]
drivers/staging/speakup/speakup_dtlk.c: In function 'synth_probe':
drivers/staging/speakup/speakup_dtlk.c:362:2: error: implicit declaration of function 'inw_p' [-Werror=implicit-function-declaration]
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF & jbd fixes from Jan Kara:
"A cleanup of JBD log messages and UDF fix of a lockdep warning"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Fix lockdep warning from udf_symlink()
jbd: Revise KERN_EMERG error messages
|
|
The associative array code creates unnecessary and potentially
problematic global variable 'status'. Remove it since never used.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse update from Miklos Szeredi:
"This contains a fix for a potential use-after-module-unload bug
noticed by Al and caching improvements for read-only fuse filesystems
by Andrew Gallagher"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: support clients that don't implement 'open'
fuse: don't invalidate attrs when not using atime
fuse: fix SetPageUptodate() condition in STORE
fuse: fix pipe_buf_operations
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"In this round, a couple of sysfs entries were introduced to tune the
f2fs at runtime.
In addition, f2fs starts to support inline_data and improves the
read/write performance in some workloads by refactoring bio-related
flows.
This patch-set includes the following major enhancement patches.
- support inline_data
- refactor bio operations such as merge operations and rw type
assignment
- enhance the direct IO path
- enhance bio operations
- truncate a node page when it becomes obsolete
- add sysfs entries: small_discards, max_victim_search, and
in-place-update
- add a sysfs entry to control max_victim_search
The other bug fixes are as follows.
- fix a bug in truncate_partial_nodes
- avoid warnings during sparse and build process
- fix error handling flows
- fix potential bit overflows
And, there are a bunch of cleanups"
* tag 'for-f2fs-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (95 commits)
f2fs: drop obsolete node page when it is truncated
f2fs: introduce NODE_MAPPING for code consistency
f2fs: remove the orphan block page array
f2fs: add help function META_MAPPING
f2fs: move a branch for code redability
f2fs: call mark_inode_dirty to flush dirty pages
f2fs: clean checkpatch warnings
f2fs: missing REQ_META and REQ_PRIO when sync_meta_pages(META_FLUSH)
f2fs: avoid f2fs_balance_fs call during pageout
f2fs: add delimiter to seperate name and value in debug phrase
f2fs: use spinlock rather than mutex for better speed
f2fs: move alloc new orphan node out of lock protection region
f2fs: move grabing orphan pages out of protection region
f2fs: remove the needless parameter of f2fs_wait_on_page_writeback
f2fs: update documents and a MAINTAINERS entry
f2fs: add a sysfs entry to control max_victim_search
f2fs: improve write performance under frequent fsync calls
f2fs: avoid to read inline data except first page
f2fs: avoid to left uninitialized data in page when read inline data
f2fs: fix truncate_partial_nodes bug
...
|
|
Pull xfs update from Ben Myers:
"This is primarily bug fixes, many of which you already have. New
stuff includes a series to decouple the in-memory and on-disk log
format, helpers in the area of inode clusters, and i_version handling.
We decided to try to use more topic branches this release, so there
are some merge commits in there on account of that. I'm afraid I
didn't do a good job of putting meaningful comments in the first
couple of merges. Sorry about that. I think I have the hang of it
now.
For 3.14-rc1 there are fixes in the areas of remote attributes,
discard, growfs, memory leaks in recovery, directory v2, quotas, the
MAINTAINERS file, allocation alignment, extent list locking, and in
xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg,
removing unused macros, quotas, setattr, and freeing of inode
clusters. The in-memory and on-disk log format have been decoupled, a
common helper to calculate the number of blocks in an inode cluster
has been added, and handling of i_version has been pulled into the
filesystems that use it.
- cleanup in xfs_setsize_buftarg
- removal of remaining unused flags for vop toss/flush/flushinval
- fix for memory corruption in xfs_attrlist_by_handle
- fix for out-of-date comment in xfs_trans_dqlockedjoin
- fix for discard if range length is less than one block
- fix for overrun of agfl buffer using growfs on v4 superblock
filesystems
- pull i_version handling out into the filesystems that use it
- don't leak recovery items on error
- fix for memory leak in xfs_dir2_node_removename
- several cleanups for quotas
- fix bad assertion in xfs_qm_vop_create_dqattach
- cleanup for xfs_setattr_mode, and add xfs_setattr_time
- fix quota assert in xfs_setattr_nonsize
- fix an infinite loop when turning off group/project quota before
user quota
- fix for temporary buffer allocation failure in xfs_dir2_block_to_sf
with large directory block sizes
- fix Dave's email address in MAINTAINERS
- cleanup calculation of freed inode cluster blocks
- fix alignment of initial file allocations to match filesystem
geometry
- decouple in-memory and on-disk log format
- introduce a common helper to calculate the number of filesystem
blocks in an inode cluster
- fixes for extent list locking
- fix for off-by-one in xfs_attr3_rmt_verify
- fix for missing destroy_work_on_stack in xfs_bmapi_allocate"
* tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits)
xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
xfs: fix off-by-one error in xfs_attr3_rmt_verify
xfs: assert that we hold the ilock for extent map access
xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int
xfs: use xfs_ilock_attr_map_shared in xfs_attr_get
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate
xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp
xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes
xfs: reinstate the ilock in xfs_readdir
xfs: add xfs_ilock_attr_map_shared
xfs: rename xfs_ilock_map_shared
xfs: remove xfs_iunlock_map_shared
xfs: no need to lock the inode in xfs_find_handle
xfs: use xfs_icluster_size_fsb in xfs_imap
xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster
xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init
xfs: use xfs_icluster_size_fsb in xfs_bulkstat
xfs: introduce a common helper xfs_icluster_size_fsb
xfs: get rid of XFS_IALLOC_BLOCKS macros
xfs: get rid of XFS_INODE_CLUSTER_SIZE macros
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell.
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: Add missing newline in printk call.
module: fix coding style
export: declare ksymtab symbols
module.h: Remove unnecessary semicolon
params: improve standard definitions
Add Documentation/module-signing.txt file
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull virtio update from Rusty Russell:
"A few simple fixes. Quiet cycle"
* tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
drivers: virtio: Mark function virtballoon_migratepage() as static in virtio_balloon.c
virtio-scsi: Fix hotcpu_notifier use-after-free with virtscsi_freeze
virtio: pci: remove unnecessary pci_set_drvdata()
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull Xen updates from Konrad Rzeszutek Wilk:
"Two major features that Xen community is excited about:
The first is event channel scalability by David Vrabel - we switch
over from an two-level per-cpu bitmap of events (IRQs) - to an FIFO
queue with priorities. This lets us be able to handle more events,
have lower latency, and better scalability. Good stuff.
The other is PVH by Mukesh Rathor. In short, PV is a mode where the
kernel lets the hypervisor program page-tables, segments, etc. With
EPT/NPT capabilities in current processors, the overhead of doing this
in an HVM (Hardware Virtual Machine) container is much lower than the
hypervisor doing it for us.
In short we let a PV guest run without doing page-table, segment,
syscall, etc updates through the hypervisor - instead it is all done
within the guest container. It is a "hybrid" PV - hence the 'PVH'
name - a PV guest within an HVM container.
The major benefits are less code to deal with - for example we only
use one function from the the pv_mmu_ops (which has 39 function
calls); faster performance for syscall (no context switches into the
hypervisor); less traps on various operations; etc.
It is still being baked - the ABI is not yet set in stone. But it is
pretty awesome and we are excited about it.
Lastly, there are some changes to ARM code - you should get a simple
conflict which has been resolved in #linux-next.
In short, this pull has awesome features.
Features:
- FIFO event channels. Key advantages: support for over 100,000
events (2^17), 16 different event priorities, improved fairness in
event latency through the use of FIFOs.
- Xen PVH support. "It’s a fully PV kernel mode, running with
paravirtualized disk and network, paravirtualized interrupts and
timers, no emulated devices of any kind (and thus no qemu), no BIOS
or legacy boot — but instead of requiring PV MMU, it uses the HVM
hardware extensions to virtualize the pagetables, as well as system
calls and other privileged operations." (from "The
Paravirtualization Spectrum, Part 2: From poles to a spectrum")
Bug-fixes:
- Fixes in balloon driver (refactor and make it work under ARM)
- Allow xenfb to be used in HVM guests.
- Allow xen_platform_pci=0 to work properly.
- Refactors in event channels"
* tag 'stable/for-linus-3.14-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (52 commits)
xen/pvh: Set X86_CR0_WP and others in CR0 (v2)
MAINTAINERS: add git repository for Xen
xen/pvh: Use 'depend' instead of 'select'.
xen: delete new instances of __cpuinit usage
xen/fb: allow xenfb initialization for hvm guests
xen/evtchn_fifo: fix error return code in evtchn_fifo_setup()
xen-platform: fix error return code in platform_pci_init()
xen/pvh: remove duplicated include from enlighten.c
xen/pvh: Fix compile issues with xen_pvh_domain()
xen: Use dev_is_pci() to check whether it is pci device
xen/grant-table: Force to use v1 of grants.
xen/pvh: Support ParaVirtualized Hardware extensions (v3).
xen/pvh: Piggyback on PVHVM XenBus.
xen/pvh: Piggyback on PVHVM for grant driver (v4)
xen/grant: Implement an grant frame array struct (v3).
xen/grant-table: Refactor gnttab_init
xen/grants: Remove gnttab_max_grant_frames dependency on gnttab_init.
xen/pvh: Piggyback on PVHVM for event channels (v2)
xen/pvh: Update E820 to work with PVH (v2)
xen/pvh: Secondary VCPU bringup (non-bootup CPUs)
...
|
|
Pull KVM updates from Paolo Bonzini:
"First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place. The most
interesting part is the ARM guys' virtualized interrupt controller
overhaul, which lets userspace get/set the state and thus enables
migration of ARM VMs"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
kvm: make KVM_MMU_AUDIT help text more readable
KVM: s390: Fix memory access error detection
KVM: nVMX: Update guest activity state field on L2 exits
KVM: nVMX: Fix nested_run_pending on activity state HLT
KVM: nVMX: Clean up handling of VMX-related MSRs
KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
KVM: nVMX: Leave VMX mode on clearing of feature control MSR
KVM: VMX: Fix DR6 update on #DB exception
KVM: SVM: Fix reading of DR6
KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
add support for Hyper-V reference time counter
KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
KVM: x86: fix tsc catchup issue with tsc scaling
KVM: x86: limit PIT timer frequency
KVM: x86: handle invalid root_hpa everywhere
kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
kvm: vfio: silence GCC warning
KVM: ARM: Remove duplicate include
arm/arm64: KVM: relax the requirements of VMA alignment for THP
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
"Usual rocket science stuff from trivial.git"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
neighbour.h: fix comment
sched: Fix warning on make htmldocs caused by wait.h
slab: struct kmem_cache is protected by slab_mutex
doc: Fix typo in USB Gadget Documentation
of/Kconfig: Spelling s/one/once/
mkregtable: Fix sscanf handling
lp5523, lp8501: comment improvements
thermal: rcar: comment spelling
treewide: fix comments and printk msgs
IXP4xx: remove '1 &&' from a condition check in ixp4xx_restart()
Documentation: update /proc/uptime field description
Documentation: Fix size parameter for snprintf
arm: fix comment header and macro name
asm-generic: uaccess: Spelling s/a ny/any/
mtd: onenand: fix comment header
doc: driver-model/platform.txt: fix a typo
drivers: fix typo in DEVTMPFS_MOUNT Kconfig help text
doc: Fix typo (acces_process_vm -> access_process_vm)
treewide: Fix typos in printk
drivers/gpu/drm/qxl/Kconfig: reformat the help text
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:
- quite some work on hid-sony driver in order to have DualShock 4
device properly supported, from Frank Praznik
- fixed support for suspending I2C conntected devices, from Mika
Westerberg
- regression fix for 0xff05 usage on Microsoft Ergonomy, from Jiri
Kosina
- support for Synaptics HD touchscreen, from AceLan Kao
- workaround for USB 3.0 problem for logitech-dj connected devices,
from Benjamin Tisssoires
- support for Logitech Dual Action pads, from Vitaly Katraew
- quite a few other assorted fixes and device ID additions
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (33 commits)
HID: sony: Use colors for the Dualshock 4 LED names
HID: sony: Add annotated HID descriptor for the Dualshock 4
HID: sony: Cache the output report for the Dualshock 4
HID: sony: Map gyroscopes and accelerometers to axes
HID: sony: Fix spacing in the device definitions.
HID: sony: Use standard output reports instead of raw reports to send data to the Dualshock 4.
HID: sony: Use separate identifiers for USB and Bluetooth connected Dualshock 4 controllers.
HID: hid-holtek-mouse: add new a070 mouse
HID: hid-sensor-hub: Fix buggy report descriptors
HID: logitech-dj: Fix USB 3.0 issue
HID: sony: Rename worker function
HID: sony: Add LED controls for the Dualshock 4
HID: sony: Add force-feedback support for the Dualshock 4
HID: hidraw: make comment more accurate and nicer
HID: sony: fix error return code
HID: input: fix input sysfs path for hid devices
HID: debug: add labels for some new buttons
HID: remove SIS entries from hid_have_special_driver[]
HID: microsoft: no fallthrough in MS ergonomy 0xff05 usage
HID: add support for SiS multitouch panel in the touch monitor LG 23ET83V
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device-mapper changes from Mike Snitzer:
"A lot of attention was paid to improving the thin-provisioning
target's handling of metadata operation failures and running out of
space. A new 'error_if_no_space' feature was added to allow users to
error IOs rather than queue them when either the data or metadata
space is exhausted.
Additional fixes/features include:
- a few fixes to properly support thin metadata device resizing
- a solution for reliably waiting for a DM device's embedded kobject
to be released before destroying the device
- old dm-snapshot is updated to use the dm-bufio interface to take
advantage of readahead capabilities that improve snapshot
activation
- new dm-cache target tunables to control how quickly data is
promoted to the cache (fast) device
- improved write efficiency of cluster mirror target by combining
userspace flush and mark requests"
* tag 'dm-3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits)
dm log userspace: allow mark requests to piggyback on flush requests
dm space map metadata: fix bug in resizing of thin metadata
dm cache: add policy name to status output
dm thin: fix pool feature parsing
dm sysfs: fix a module unload race
dm snapshot: use dm-bufio prefetch
dm snapshot: use dm-bufio
dm snapshot: prepare for switch to using dm-bufio
dm snapshot: use GFP_KERNEL when initializing exceptions
dm cache: add block sizes and total cache blocks to status output
dm btree: add dm_btree_find_lowest_key
dm space map metadata: fix extending the space map
dm space map common: make sure new space is used during extend
dm: wait until embedded kobject is released before destroying a device
dm: remove pointless kobject comparison in dm_get_from_kobject
dm snapshot: call destroy_work_on_stack() to pair with INIT_WORK_ONSTACK()
dm cache policy mq: introduce three promotion threshold tunables
dm cache policy mq: use list_del_init instead of list_del + INIT_LIST_HEAD
dm thin: fix set_pool_mode exposed pool operation races
dm thin: eliminate the no_free_space flag
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This patch set is a lot of driver updates for qla4xxx, bfa, hpsa,
qla2xxx. It also removes the aic7xxx_old driver (which has been
deprecated for nearly a decade) and adds support for deadlines in
error handling"
* tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (75 commits)
[SCSI] hpsa: allow SCSI mid layer to handle unit attention
[SCSI] hpsa: do not require board "not ready" status after hard reset
[SCSI] hpsa: enable unit attention reporting
[SCSI] hpsa: rename scsi prefetch field
[SCSI] hpsa: use workqueue instead of kernel thread for lockup detection
[SCSI] ipr: increase dump size in ipr driver
[SCSI] mac_scsi: Fix crash on out of memory
[SCSI] st: fix enlarge_buffer
[SCSI] qla1280: Annotate timer on stack so object debug does not complain
[SCSI] qla4xxx: Update driver version to 5.04.00-k3
[SCSI] qla4xxx: Recreate chap data list during get chap operation
[SCSI] qla4xxx: Add support for ISCSI_PARAM_LOCAL_IPADDR sysfs attr
[SCSI] libiscsi: Add local_ipaddr parameter in iscsi_conn struct
[SCSI] scsi_transport_iscsi: Export ISCSI_PARAM_LOCAL_IPADDR attr for iscsi_connection
[SCSI] qla4xxx: Add host statistics support
[SCSI] scsi_transport_iscsi: Add host statistics support
[SCSI] qla4xxx: Added support for Diagnostics MBOX command
[SCSI] bfa: Driver version upgrade to 3.2.23.0
[SCSI] bfa: change FC_ELS_TOV to 20sec
[SCSI] bfa: Observed auto D-port mode instead of manual
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI updates from Bjorn Helgaas:
"PCI changes for the v3.14 merge window:
Resource management
- Change pci_bus_region addresses to dma_addr_t (Bjorn Helgaas)
- Support 64-bit AGP BARs (Bjorn Helgaas, Yinghai Lu)
- Add pci_bus_address() to get bus address of a BAR (Bjorn Helgaas)
- Use pci_resource_start() for CPU address of AGP BARs (Bjorn Helgaas)
- Enforce bus address limits in resource allocation (Yinghai Lu)
- Allocate 64-bit BARs above 4G when possible (Yinghai Lu)
- Convert pcibios_resource_to_bus() to take pci_bus, not pci_dev (Yinghai Lu)
PCI device hotplug
- Major rescan/remove locking update (Rafael J. Wysocki)
- Make ioapic builtin only (not modular) (Yinghai Lu)
- Fix release/free issues (Yinghai Lu)
- Clean up pciehp (Bjorn Helgaas)
- Announce pciehp slot info during enumeration (Bjorn Helgaas)
MSI
- Add pci_msi_vec_count(), pci_msix_vec_count() (Alexander Gordeev)
- Add pci_enable_msi_range(), pci_enable_msix_range() (Alexander Gordeev)
- Deprecate "tri-state" interfaces: fail/success/fail+info (Alexander Gordeev)
- Export MSI mode using attributes, not kobjects (Greg Kroah-Hartman)
- Drop "irq" param from *_restore_msi_irqs() (DuanZhenzhong)
SR-IOV
- Clear NumVFs when disabling SR-IOV in sriov_init() (ethan.zhao)
Virtualization
- Add support for save/restore of extended capabilities (Alex Williamson)
- Add Virtual Channel to save/restore support (Alex Williamson)
- Never treat a VF as a multifunction device (Alex Williamson)
- Add pci_try_reset_function(), et al (Alex Williamson)
AER
- Ignore non-PCIe error sources (Betty Dall)
- Support ACPI HEST error sources for domains other than 0 (Betty Dall)
- Consolidate HEST error source parsers (Bjorn Helgaas)
- Add a TLP header print helper (Borislav Petkov)
Freescale i.MX6
- Remove unnecessary code (Fabio Estevam)
- Make reset-gpio optional (Marek Vasut)
- Report "link up" only after link training completes (Marek Vasut)
- Start link in Gen1 before negotiating for Gen2 mode (Marek Vasut)
- Fix PCIe startup code (Richard Zhu)
Marvell MVEBU
- Remove duplicate of_clk_get_by_name() call (Andrew Lunn)
- Drop writes to bridge Secondary Status register (Jason Gunthorpe)
- Obey bridge PCI_COMMAND_MEM and PCI_COMMAND_IO bits (Jason Gunthorpe)
- Support a bridge with no IO port window (Jason Gunthorpe)
- Use max_t() instead of max(resource_size_t,) (Jingoo Han)
- Remove redundant of_match_ptr (Sachin Kamat)
- Call pci_ioremap_io() at startup instead of dynamically (Thomas Petazzoni)
NVIDIA Tegra
- Disable Gen2 for Tegra20 and Tegra30 (Eric Brower)
Renesas R-Car
- Add runtime PM support (Valentine Barshak)
- Fix rcar_pci_probe() return value check (Wei Yongjun)
Synopsys DesignWare
- Fix crash in dw_msi_teardown_irq() (Bjørn Erik Nilsen)
- Remove redundant call to pci_write_config_word() (Bjørn Erik Nilsen)
- Fix missing MSI IRQs (Harro Haan)
- Add dw_pcie prefix before cfg_read/write (Pratyush Anand)
- Fix I/O transfers by using CPU (not realio) address (Pratyush Anand)
- Whitespace cleanup (Jingoo Han)
EISA
- Call put_device() if device_register() fails (Levente Kurusa)
- Revert EISA initialization breakage ((Bjorn Helgaas)
Miscellaneous
- Remove unused code, including PCIe 3.0 interfaces (Stephen Hemminger)
- Prevent bus conflicts while checking for bridge apertures (Bjorn Helgaas)
- Stop clearing bridge Secondary Status when setting up I/O aperture (Bjorn Helgaas)
- Use dev_is_pci() to identify PCI devices (Yijing Wang)
- Deprecate DEFINE_PCI_DEVICE_TABLE (Joe Perches)
- Update documentation 00-INDEX (Erik Ekman)"
* tag 'pci-v3.14-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (119 commits)
Revert "EISA: Initialize device before its resources"
Revert "EISA: Log device resources in dmesg"
vfio-pci: Use pci "try" reset interface
PCI: Check parent kobject in pci_destroy_dev()
xen/pcifront: Use global PCI rescan-remove locking
powerpc/eeh: Use global PCI rescan-remove locking
PCI: Fix pci_check_and_unmask_intx() comment typos
PCI: Add pci_try_reset_function(), pci_try_reset_slot(), pci_try_reset_bus()
MPT / PCI: Use pci_stop_and_remove_bus_device_locked()
platform / x86: Use global PCI rescan-remove locking
PCI: hotplug: Use global PCI rescan-remove locking
pcmcia: Use global PCI rescan-remove locking
ACPI / hotplug / PCI: Use global PCI rescan-remove locking
ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug
PCI: Add global pci_lock_rescan_remove()
PCI: Cleanup pci.h whitespace
PCI: Reorder so actual code comes before stubs
PCI/AER: Support ACPI HEST AER error sources for PCI domains other than 0
ACPICA: Add helper macros to extract bus/segment numbers from HEST table.
PCI: Make local functions static
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This pull request has a new feature to ftrace, namely the trace event
triggers by Tom Zanussi. A trigger is a way to enable an action when
an event is hit. The actions are:
o trace on/off - enable or disable tracing
o snapshot - save the current trace buffer in the snapshot
o stacktrace - dump the current stack trace to the ringbuffer
o enable/disable events - enable or disable another event
Namhyung Kim added updates to the tracing uprobes code. Having the
uprobes add support for fetch methods.
The rest are various bug fixes with the new code, and minor ones for
the old code"
* tag 'trace-3.14' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (38 commits)
tracing: Fix buggered tee(2) on tracing_pipe
tracing: Have trace buffer point back to trace_array
ftrace: Fix synchronization location disabling and freeing ftrace_ops
ftrace: Have function graph only trace based on global_ops filters
ftrace: Synchronize setting function_trace_op with ftrace_trace_function
tracing: Show available event triggers when no trigger is set
tracing: Consolidate event trigger code
tracing: Fix counter for traceon/off event triggers
tracing: Remove double-underscore naming in syscall trigger invocations
tracing/kprobes: Add trace event trigger invocations
tracing/probes: Fix build break on !CONFIG_KPROBE_EVENT
tracing/uprobes: Add @+file_offset fetch method
uprobes: Allocate ->utask before handler_chain() for tracing handlers
tracing/uprobes: Add support for full argument access methods
tracing/uprobes: Fetch args before reserving a ring buffer
tracing/uprobes: Pass 'is_return' to traceprobe_parse_probe_arg()
tracing/probes: Implement 'memory' fetch method for uprobes
tracing/probes: Add fetch{,_size} member into deref fetch method
tracing/probes: Move 'symbol' fetch method to kprobes
tracing/probes: Implement 'stack' fetch method for uprobes
...
|
|
If a node page is trucated, we'd better drop the page in the node_inode's page
cache for better memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|