path: root/mm
Age | Commit message | Author
2010-05-25 | mempolicy: restructure rebinding-mempolicy functions | Miao Xie
Nick Piggin reported that the allocator may see an empty nodemask when changing cpuset's mems[1]. It happens only on kernels that do not do atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).

But I found that there is also a problem on kernels that can do atomic nodemask_t stores. The problem is that the allocator can't find a node to allocate a page from when changing cpuset's mems, even though there is a lot of free memory. The reason is like this:

(mpol: mempolicy)
    task1                     task1's mpol    task2
    alloc page                1
      alloc on node0? NO      1
                              1               change mems from 1 to 0
                              1               rebind task1's mpol
                              0-1               set new bits
                              0                 clear disallowed bits
      alloc on node1? NO      0
      ...
    can't alloc page
      goto oom

I can use the attached program to reproduce it with the following steps:

    # mkdir /dev/cpuset
    # mount -t cpuset cpuset /dev/cpuset
    # mkdir /dev/cpuset/1
    # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
    # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
    # echo $$ > /dev/cpuset/1/tasks
    # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
        <nr_tasks> = max(nr_cpus - 1, 1)
    # killall -s SIGUSR1 cpuset_mem_hog
    # ./change_mems.sh

Several hours later, oom will happen though there is a lot of free memory.

This patchset fixes this problem by expanding the nodes range first (set newly allowed bits) and shrinking it lazily (clear newly disallowed bits). So we use a variable to tell the write-side task that the read-side task is reading the nodemask, and the write-side task clears newly disallowed nodes after the read-side task ends the current memory allocation.

This patch:

In order to fix the "no node to alloc memory" problem, when we want to update mempolicy and mems_allowed, we expand the set of nodes first (set all the newly allowed nodes) and shrink the set of nodes lazily (clear disallowed nodes). But the mempolicy's rebind functions may break the expansion. So we restructure the mempolicy's rebind functions and split the rebind work into two steps, just like the update of cpuset's mems:

The 1st step: expand the set of the mempolicy's nodes.
The 2nd step: shrink the set of the mempolicy's nodes.

The two-step form is used when there is no real lock to protect the mempolicy on the read side. Otherwise we can do the rebind work at once.

In order to implement it, we define

    enum mpol_rebind_step {
        MPOL_REBIND_ONCE,
        MPOL_REBIND_STEP1,
        MPOL_REBIND_STEP2,
        MPOL_REBIND_NSTEP,
    };

If the mempolicy needn't be updated in two steps, we can pass MPOL_REBIND_ONCE to the rebind functions. Or we can pass MPOL_REBIND_STEP1 to do the first step of the rebind work and pass MPOL_REBIND_STEP2 to do the second step.

Besides that, there may be a long time between these two steps, and we have to release the lock that protects mempolicy and mems_allowed in between. When we take the lock again, we must check whether the current mempolicy is under rebinding (the first step has been done) or not, because the task may allocate a new mempolicy while we don't hold the lock. So we define the following flag to identify it:

    #define MPOL_F_REBINDING (1 << 2)

The new functions will be used in the next patch.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
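The expand-then-shrink idea can be sketched in userspace C. This is a minimal illustration, not the kernel's rebind code: the `nodemask` typedef and the `rebind_nodes()` helper are hypothetical, and the sketch ignores relative and static node flags. Only the enum comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per NUMA node. */
typedef uint64_t nodemask;

/* Steps as listed in the commit message. */
enum mpol_rebind_step {
    MPOL_REBIND_ONCE,   /* readers are locked out: switch in one go */
    MPOL_REBIND_STEP1,  /* expand: set the newly allowed nodes */
    MPOL_REBIND_STEP2,  /* shrink: clear the newly disallowed nodes */
    MPOL_REBIND_NSTEP,
};

/* Illustrative helper (an assumption, not a kernel function). */
static nodemask rebind_nodes(nodemask cur, nodemask new_allowed,
                             enum mpol_rebind_step step)
{
    switch (step) {
    case MPOL_REBIND_STEP1:
        /* Union of old and new: a lockless reader never sees an empty mask. */
        return cur | new_allowed;
    case MPOL_REBIND_STEP2:
        /* Drop everything that is no longer allowed. */
        return cur & new_allowed;
    case MPOL_REBIND_ONCE:
        return new_allowed;
    default:
        return cur;
    }
}
```

Moving a policy from node 1 to node 0 this way passes through the mask {0,1} rather than through a state in which neither the node being tried nor the remaining node is usable.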
2010-05-25 | mempolicy: factor mpol_shared_policy_init() return paths | Lee Schermerhorn
Factor out duplicate put/frees in mpol_shared_policy_init() to a common return path. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | mempolicy: rename policy_types and cleanup initialization | Lee Schermerhorn
Rename 'policy_types[]' to 'policy_modes[]' to better match the array contents. Use designated initializer syntax for policy_modes[]. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
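For readers unfamiliar with designated initializers, here is a small sketch of the pattern. The enum values, the strings, and the `parse_mode()` helper are illustrative assumptions, not copied from mm/mempolicy.c.

```c
#include <string.h>

/* Illustrative mode numbers; the kernel defines its own in mempolicy.h. */
enum { MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, MPOL_INTERLEAVE, MPOL_MAX };

/* Each name is tied to its index explicitly, so reordering the enum
 * cannot silently mismatch names and modes. */
static const char * const policy_modes[] = {
    [MPOL_DEFAULT]    = "default",
    [MPOL_PREFERRED]  = "prefer",
    [MPOL_BIND]       = "bind",
    [MPOL_INTERLEAVE] = "interleave",
};

/* Hypothetical lookup: returns the matching mode, or -1 if none matches.
 * The mode itself serves as the loop variable. */
static int parse_mode(const char *str)
{
    int mode;

    for (mode = 0; mode < MPOL_MAX; mode++)
        if (strcmp(str, policy_modes[mode]) == 0)
            return mode;
    return -1;
}
```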
2010-05-25 | mempolicy: lose unnecessary loop variable in mpol_parse_str() | Lee Schermerhorn
We don't really need the extra variable 'i' in mpol_parse_str(). Its only use is as the loop variable, after which it is assigned to 'mode'. Just use mode, and lose the 'uninitialized_var()' macro. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | mempolicy: don't call mpol_set_nodemask() when no_context | Lee Schermerhorn
No need to call mpol_set_nodemask() when we have no context for the mempolicy. This can occur when we're parsing a tmpfs 'mpol' mount option. Just save the raw nodemask in the mempolicy's w.user_nodemask member for use when a tmpfs/shmem file is created. mpol_shared_policy_init() will "contextualize" the policy for the new file based on the creating task's context. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | mempolicy: remove redundant check | Bob Liu
Lee's patch "mempolicy: use MPOL_PREFERRED for system-wide default policy" has made the MPOL_DEFAULT only used in the memory policy APIs. So, no need to check in __mpol_equal also. Also get rid of mpol_match_intent() and move its logic directly into __mpol_equal(). Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | mempolicy: remove case MPOL_INTERLEAVE from policy_zonelist() | Bob Liu
In policy_zonelist() mode MPOL_INTERLEAVE shouldn't happen, so fall through to BUG() instead of break to return. I also fixed the comment. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | mempolicy: remove redundant code | Bob Liu
1. In function is_valid_nodemask(), variable k will be initialized to 0 in the following loop, so it needn't be initialized to policy_zone anymore. 2. (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES) is already defined as MPOL_MODE_FLAGS in mempolicy.h. Signed-off-by: Bob Liu <lliubbo@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | mm: remove return value of putback_lru_pages() | Minchan Kim
putback_lru_page() can never fail, so counting "the number of pages put back" is pointless. In addition, callers of this function don't use the return value. Let's remove the unnecessary code. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | shmem: remove redundant code | Huang Shijie
prep_new_page() will call set_page_private(page, 0) to initialise the page, so the code is redundant. Signed-off-by: Huang Shijie <shijie8@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | sparsemem: on no vmemmap path put mem_map on node high too | Yinghai Lu
We need to put mem_map high when virtual memmap is not used.

before this patch free mem pfn range on first node:
[ 0.000000] 19 - 1f
[ 0.000000] 28 40 - 80 95
[ 0.000000] 702 740 - 1000 1000
[ 0.000000] 347c - 347e
[ 0.000000] 34e7 3500 - 3b80 3b8b
[ 0.000000] 73b8b 73bc0 - 73c00 73c00
[ 0.000000] 73ddd - 73e00
[ 0.000000] 73fdd - 74000
[ 0.000000] 741dd - 74200
[ 0.000000] 743dd - 74400
[ 0.000000] 745dd - 74600
[ 0.000000] 747dd - 74800
[ 0.000000] 749dd - 74a00
[ 0.000000] 74bdd - 74c00
[ 0.000000] 74ddd - 74e00
[ 0.000000] 74fdd - 75000
[ 0.000000] 751dd - 75200
[ 0.000000] 753dd - 75400
[ 0.000000] 755dd - 75600
[ 0.000000] 757dd - 75800
[ 0.000000] 759dd - 75a00
[ 0.000000] 79bdd 79c00 - 7d540 7d550
[ 0.000000] 7f745 - 7f750
[ 0.000000] 10000b 100040 - 2080000 2080000

so only 79c00 - 7d540 are major free block under 4g...

after this patch, we will get
[ 0.000000] 19 - 1f
[ 0.000000] 28 40 - 80 95
[ 0.000000] 702 740 - 1000 1000
[ 0.000000] 347c - 347e
[ 0.000000] 34e7 3500 - 3600 3600
[ 0.000000] 37dd - 3800
[ 0.000000] 39dd - 3a00
[ 0.000000] 3bdd - 3c00
[ 0.000000] 3ddd - 3e00
[ 0.000000] 3fdd - 4000
[ 0.000000] 41dd - 4200
[ 0.000000] 43dd - 4400
[ 0.000000] 45dd - 4600
[ 0.000000] 47dd - 4800
[ 0.000000] 49dd - 4a00
[ 0.000000] 4bdd - 4c00
[ 0.000000] 4ddd - 4e00
[ 0.000000] 4fdd - 5000
[ 0.000000] 51dd - 5200
[ 0.000000] 53dd - 5400
[ 0.000000] 95dd 9600 - 7d540 7d550
[ 0.000000] 7f745 - 7f750
[ 0.000000] 17000b 170040 - 2080000 2080000

we will have 9600 - 7d540 for major free block...

sparse-vmemmap path already used __alloc_bootmem_node_high()

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 | page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists | Corrado Zoccolo
In order to reduce fragmentation, this patch classifies freed pages in two groups according to their probability of being part of a high order merge. Pages belonging to a compound whose next-highest buddy is free are more likely to be part of a high order merge in the near future, so they will be added at the tail of the freelist. The remaining pages are put at the front of the freelist.

In this way, the pages that are more likely to cause a big merge are kept free longer. Consequently there is a tendency to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.

This heuristic was tested on three machines, x86, x86-64 and ppc64 with 3GB of RAM in each machine. The tests were kernbench, netperf, sysbench and STREAM for performance and a high-order stress test for huge page allocations.

KernBench X86
    Elapsed mean    374.77 ( 0.00%)    375.10 (-0.09%)
    User    mean    649.53 ( 0.00%)    650.44 (-0.14%)
    System  mean     54.75 ( 0.00%)     54.18 ( 1.05%)
    CPU     mean    187.75 ( 0.00%)    187.25 ( 0.27%)

KernBench X86-64
    Elapsed mean     94.45 ( 0.00%)     94.01 ( 0.47%)
    User    mean    323.27 ( 0.00%)    322.66 ( 0.19%)
    System  mean     36.71 ( 0.00%)     36.50 ( 0.57%)
    CPU     mean    380.75 ( 0.00%)    381.75 (-0.26%)

KernBench PPC64
    Elapsed mean    173.45 ( 0.00%)    173.74 (-0.17%)
    User    mean    587.99 ( 0.00%)    587.95 ( 0.01%)
    System  mean     60.60 ( 0.00%)     60.57 ( 0.05%)
    CPU     mean    373.50 ( 0.00%)    372.75 ( 0.20%)

Nothing notable for kernbench.
NetPerf UDP X86
       64      42.68 ( 0.00%)       42.77 ( 0.21%)
      128      85.62 ( 0.00%)       85.32 (-0.35%)
      256     170.01 ( 0.00%)      168.76 (-0.74%)
     1024     655.68 ( 0.00%)      652.33 (-0.51%)
     2048    1262.39 ( 0.00%)     1248.61 (-1.10%)
     3312    1958.41 ( 0.00%)     1944.61 (-0.71%)
     4096    2345.63 ( 0.00%)     2318.83 (-1.16%)
     8192    4132.90 ( 0.00%)     4089.50 (-1.06%)
    16384    6770.88 ( 0.00%)     6642.05 (-1.94%)*

NetPerf UDP X86-64
       64     148.82 ( 0.00%)      154.92 ( 3.94%)
      128     298.96 ( 0.00%)      312.95 ( 4.47%)
      256     583.67 ( 0.00%)      626.39 ( 6.82%)
     1024    2293.18 ( 0.00%)     2371.10 ( 3.29%)
     2048    4274.16 ( 0.00%)     4396.83 ( 2.79%)
     3312    6356.94 ( 0.00%)     6571.35 ( 3.26%)
     4096    7422.68 ( 0.00%)     7635.42 ( 2.79%)*
     8192   12114.81 ( 0.00%)*   12346.88 ( 1.88%)
    16384   17022.28 ( 0.00%)*   17033.19 ( 0.06%)*   1.64%   2.73%

NetPerf UDP PPC64
       64      49.98 ( 0.00%)       50.25 ( 0.54%)
      128      98.66 ( 0.00%)      100.95 ( 2.27%)
      256     197.33 ( 0.00%)      191.03 (-3.30%)
     1024     761.98 ( 0.00%)      785.07 ( 2.94%)
     2048    1493.50 ( 0.00%)     1510.85 ( 1.15%)
     3312    2303.95 ( 0.00%)     2271.72 (-1.42%)
     4096    2774.56 ( 0.00%)     2773.06 (-0.05%)
     8192    4918.31 ( 0.00%)     4793.59 (-2.60%)
    16384    7497.98 ( 0.00%)     7749.52 ( 3.25%)

The tests are run to have confidence limits within 1%. Results marked with a * were not confident although in this case, it's only outside by small amounts. Even with some results that were not confident, the netperf UDP results were generally positive.
NetPerf TCP X86
       64     652.25 ( 0.00%)*     648.12 (-0.64%)*   23.80%   22.82%
      128    1229.98 ( 0.00%)*    1220.56 (-0.77%)*   21.03%   18.90%
      256    2105.88 ( 0.00%)     1872.03 (-12.49%)*   1.00%   16.46%
     1024    3476.46 ( 0.00%)*    3548.28 ( 2.02%)*   13.37%   11.39%
     2048    4023.44 ( 0.00%)*    4231.45 ( 4.92%)*    9.76%   12.48%
     3312    4348.88 ( 0.00%)*    4396.96 ( 1.09%)*    6.49%    8.75%
     4096    4726.56 ( 0.00%)*    4877.71 ( 3.10%)*    9.85%    8.50%
     8192    4732.28 ( 0.00%)*    5777.77 (18.10%)*    9.13%   13.04%
    16384    5543.05 ( 0.00%)*    5906.24 ( 6.15%)*    7.73%    8.68%

NETPERF TCP X86-64
    netperf-tcp-vanilla-netperf netperf-tcp
              tcp-vanilla          pgalloc-delay
       64    1895.87 ( 0.00%)*    1775.07 (-6.81%)*    5.79%    4.78%
      128    3571.03 ( 0.00%)*    3342.20 (-6.85%)*    3.68%    6.06%
      256    5097.21 ( 0.00%)*    4859.43 (-4.89%)*    3.02%    2.10%
     1024    8919.10 ( 0.00%)*    8892.49 (-0.30%)*    5.89%    6.55%
     2048   10255.46 ( 0.00%)*   10449.39 ( 1.86%)*    7.08%    7.44%
     3312   10839.90 ( 0.00%)*   10740.15 (-0.93%)*    6.87%    7.33%
     4096   10814.84 ( 0.00%)*   10766.97 (-0.44%)*    6.86%    8.18%
     8192   11606.89 ( 0.00%)*   11189.28 (-3.73%)*    7.49%    5.55%
    16384   12554.88 ( 0.00%)*   12361.22 (-1.57%)*    7.36%    6.49%

NETPERF TCP PPC64
    netperf-tcp-vanilla-netperf netperf-tcp
              tcp-vanilla          pgalloc-delay
       64     594.17 ( 0.00%)      596.04 ( 0.31%)*    1.00%    2.29%
      128    1064.87 ( 0.00%)*    1074.77 ( 0.92%)*    1.30%    1.40%
      256    1852.46 ( 0.00%)*    1856.95 ( 0.24%)     1.25%    1.00%
     1024    3839.46 ( 0.00%)*    3813.05 (-0.69%)     1.02%    1.00%
     2048    4885.04 ( 0.00%)*    4881.97 (-0.06%)*    1.15%    1.04%
     3312    5506.90 ( 0.00%)     5459.72 (-0.86%)
     4096    6449.19 ( 0.00%)     6345.46 (-1.63%)
     8192    7501.17 ( 0.00%)     7508.79 ( 0.10%)
    16384    9618.65 ( 0.00%)     9490.10 (-1.35%)

There was a distinct lack of confidence in the X86* figures so I included what the deviation was where the results were not confident. Many of the results, whether gains or losses, were within the standard deviation so no solid conclusion can be reached on performance impact. Looking at the figures, only the X86-64 ones look suspicious with a few losses that were outside the noise.
However, the results were so unstable that without knowing why they vary so much, a solid conclusion cannot be reached.

SYSBENCH X86
         sysbench-vanilla     pgalloc-delay
     1     7722.85 ( 0.00%)    7756.79 ( 0.44%)
     2    14901.11 ( 0.00%)   13683.44 (-8.90%)
     3    15171.71 ( 0.00%)   14888.25 (-1.90%)
     4    14966.98 ( 0.00%)   15029.67 ( 0.42%)
     5    14370.47 ( 0.00%)   14865.00 ( 3.33%)
     6    14870.33 ( 0.00%)   14845.57 (-0.17%)
     7    14429.45 ( 0.00%)   14520.85 ( 0.63%)
     8    14354.35 ( 0.00%)   14362.31 ( 0.06%)

SYSBENCH X86-64
     1    17448.70 ( 0.00%)   17484.41 ( 0.20%)
     2    34276.39 ( 0.00%)   34251.00 (-0.07%)
     3    50805.25 ( 0.00%)   50854.80 ( 0.10%)
     4    66667.10 ( 0.00%)   66174.69 (-0.74%)
     5    66003.91 ( 0.00%)   65685.25 (-0.49%)
     6    64981.90 ( 0.00%)   65125.60 ( 0.22%)
     7    64933.16 ( 0.00%)   64379.23 (-0.86%)
     8    63353.30 ( 0.00%)   63281.22 (-0.11%)
     9    63511.84 ( 0.00%)   63570.37 ( 0.09%)
    10    62708.27 ( 0.00%)   63166.25 ( 0.73%)
    11    62092.81 ( 0.00%)   61787.75 (-0.49%)
    12    61330.11 ( 0.00%)   61036.34 (-0.48%)
    13    61438.37 ( 0.00%)   61994.47 ( 0.90%)
    14    62304.48 ( 0.00%)   62064.90 (-0.39%)
    15    63296.48 ( 0.00%)   62875.16 (-0.67%)
    16    63951.76 ( 0.00%)   63769.09 (-0.29%)

SYSBENCH PPC64
    -sysbench-pgalloc-delay-sysbench
         sysbench-vanilla     pgalloc-delay
     1     7645.08 ( 0.00%)    7467.43 (-2.38%)
     2    14856.67 ( 0.00%)   14558.73 (-2.05%)
     3    21952.31 ( 0.00%)   21683.64 (-1.24%)
     4    27946.09 ( 0.00%)   28623.29 ( 2.37%)
     5    28045.11 ( 0.00%)   28143.69 ( 0.35%)
     6    27477.10 ( 0.00%)   27337.45 (-0.51%)
     7    26489.17 ( 0.00%)   26590.06 ( 0.38%)
     8    26642.91 ( 0.00%)   25274.33 (-5.41%)
     9    25137.27 ( 0.00%)   24810.06 (-1.32%)
    10    24451.99 ( 0.00%)   24275.85 (-0.73%)
    11    23262.20 ( 0.00%)   23674.88 ( 1.74%)
    12    24234.81 ( 0.00%)   23640.89 (-2.51%)
    13    24577.75 ( 0.00%)   24433.50 (-0.59%)
    14    25640.19 ( 0.00%)   25116.52 (-2.08%)
    15    26188.84 ( 0.00%)   26181.36 (-0.03%)
    16    26782.37 ( 0.00%)   26255.99 (-2.00%)

Again, there is little to conclude here. While there are a few losses, the results vary by +/- 8% in some cases.
They are the results of most concern as there are some large losses, but they are also within the variance typically seen between kernel releases.

The STREAM results varied so little and are so verbose that I didn't include them here.

The final test stressed how many huge pages can be allocated. The absolute number of huge pages allocated is the same with or without the patch. However, the "unusability free space index", which is a measure of external fragmentation, was slightly lower (lower is better) throughout the lifetime of the system. I also measured the latency of how long it took to successfully allocate a huge page. The latency was slightly lower and, on X86 and PPC64, more huge pages were allocated almost immediately from the free lists. The improvement is slight but there.

[mel@csn.ul.ie: Tested, reworked for less branches]
[czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Corrado Zoccolo <czoccolo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
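The placement heuristic described in this commit (tail of the free list when the next-highest buddy is already free) can be sketched with plain index arithmetic. This is a toy userspace model under stated assumptions: the `free_order[]` bookkeeping array, `NPAGES`, and all function names are hypothetical and do not mirror the kernel's `struct page` handling.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 16

/* Hypothetical bookkeeping: free_order[i] is the order of a free block
 * that starts at page index i, or -1 if no free block starts there. */
static int free_order[NPAGES];

static void reset_free_map(void)
{
    for (size_t i = 0; i < NPAGES; i++)
        free_order[i] = -1;
}

/* Index of the buddy of the block starting at idx, at the given order. */
static size_t buddy_index(size_t idx, unsigned order)
{
    return idx ^ ((size_t)1 << order);
}

/* Start of the block that idx and its buddy would form when merged. */
static size_t combined_index(size_t idx, unsigned order)
{
    return idx & ~((size_t)1 << order);
}

/* A block whose own buddy is busy cannot merge now. If the buddy of the
 * would-be merged block is already free at order + 1, a larger merge is
 * likely soon, so the block is better kept at the tail of the free list. */
static bool add_to_tail(size_t idx, unsigned order)
{
    size_t higher = combined_index(idx, order);
    size_t higher_buddy = buddy_index(higher, order + 1);

    if (higher_buddy >= NPAGES)
        return false;
    return free_order[higher_buddy] == (int)(order + 1);
}
```

With only an order-1 block free at index 2, freeing page 0 or page 1 at order 0 qualifies for the tail, while freeing page 4 does not.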
2010-05-25 | tmpfs: insert tmpfs cache pages to inactive list at first | KOSAKI Motohiro
Shaohua Li reported that parallel file copy on tmpfs can lead to the OOM killer. This is a regression caused by commit 9ff473b9a7 ("vmscan: evict streaming IO first"). Wow, it is a 2 year old patch! Currently, tmpfs file cache is inserted into the active list at first. This means that the insertion doesn't only increase the number of pages in the anon LRU, but it also reduces the anon scanning ratio. Therefore, vmscan will get totally confused. It scans almost only the file LRU even though the system has plenty of unused tmpfs pages. Historically, lru_cache_add_active_anon() was used for two reasons: 1) to prioritize shmem pages over regular file cache; 2) to avoid reclaim priority inversion of used-once pages. But we've lost both motivations because (1) now we have separate anon and file LRU lists, so inserting into the active list doesn't help such prioritization; (2) in the past, one pte access bit would cause page activation, so inserting into the inactive list with the pte access bit set meant higher priority than inserting into the active list. That priority inversion could lead to unintended LRU churn, but it was already solved by commit 645747462 (vmscan: detect mapped file pages used only once). (Thanks Hannes, you are great!) Thus, now we can use lru_cache_add_anon() instead. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-22 | Merge branches 'slab/align', 'slab/cleanups', 'slab/fixes', 'slab/memhotadd' and 'slub/fixes' into slab-for-linus | Pekka Enberg
2010-05-22 | slub: Use alloc_pages_exact_node() for page allocation | Minchan Kim
The alloc_slab_page() in SLUB uses alloc_pages() if node is '-1'. This means that node validity check in alloc_pages_node is unnecessary and we can use alloc_pages_exact_node() to avoid comparison and branch as commit 6484eb3e2a81807722 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") did for the page allocator. Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-22 | slub: __kmalloc_node_track_caller should trace kmalloc_large_node case | Xiaotian Feng
commit 94b528d (kmemtrace: SLUB hooks for caller-tracking functions) missed tracing kmalloc_large_node in __kmalloc_node_track_caller. We should trace it same as __kmalloc_node. Acked-by: David Rientjes <rientjes@google.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Vegard Nossum <vegard.nossum@gmail.com> Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-22 | slub: Potential stack overflow | Eric Dumazet
I discovered that we can overflow stack if CONFIG_SLUB_DEBUG=y and use slabs with many objects, since list_slab_objects() and process_slab() use DECLARE_BITMAP(map, page->objects). With 65535 bits, we use 8192 bytes of stack ... Switch these allocations to dynamic allocations. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
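The 8192-byte figure follows from rounding the bit count up to whole longs. The sketch below reproduces that arithmetic in userspace and shows a heap-allocated bitmap as the alternative; the macros are local re-definitions modelled on the kernel's, and `alloc_bitmap()` is a hypothetical helper using `calloc()` rather than a kernel allocator.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Local re-definitions for illustration. */
#define BITS_PER_LONG     (sizeof(long) * CHAR_BIT)
#define BITS_TO_LONGS(n)  (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Bytes an on-stack bitmap of nbits would occupy. */
static size_t bitmap_bytes(size_t nbits)
{
    return BITS_TO_LONGS(nbits) * sizeof(long);
}

/* Hypothetical heap-based replacement: zeroed bitmap, caller frees it. */
static unsigned long *alloc_bitmap(size_t nbits)
{
    return calloc(BITS_TO_LONGS(nbits), sizeof(unsigned long));
}
```

For 65535 bits the result is 8192 bytes whether `long` is 32 or 64 bits wide, which is a large share of a typical kernel stack.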
2010-05-21 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 | Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
  fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
  get rid of home-grown mutex in cris eeprom.c
  switch ecryptfs_write() to struct inode *, kill on-stack fake files
  switch ecryptfs_get_locked_page() to struct inode *
  simplify access to ecryptfs inodes in ->readpage() and friends
  AFS: Don't put struct file on the stack
  Ban ecryptfs over ecryptfs
  logfs: replace inode uid,gid,mode initialization with helper function
  ufs: replace inode uid,gid,mode initialization with helper function
  udf: replace inode uid,gid,mode init with helper
  ubifs: replace inode uid,gid,mode initialization with helper function
  sysv: replace inode uid,gid,mode initialization with helper function
  reiserfs: replace inode uid,gid,mode initialization with helper function
  ramfs: replace inode uid,gid,mode initialization with helper function
  omfs: replace inode uid,gid,mode initialization with helper function
  bfs: replace inode uid,gid,mode initialization with helper function
  ocfs2: replace inode uid,gid,mode initialization with helper function
  nilfs2: replace inode uid,gid,mode initialization with helper function
  minix: replace inode uid,gid,mode init with helper
  ext4: replace inode uid,gid,mode init with helper
  ...

Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21 | ramfs: replace inode uid,gid,mode initialization with helper function | Dmitry Monakhov
- it seems that ramfs_get_inode is only used locally, so make it static. [AV: the hell it is; it's used by shmem, so shmem needed conversion too and no, that function can't be made static] Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 | sanitize vfs_fsync calling conventions | Christoph Hellwig
Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsync_range. The next step will be removing the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 | fs: xattr_handler table should be const | Stephen Hemminger
The entries in the xattr handler table should be immutable (i.e. const) like other operation tables. Later patches convert common filesystems. Unconverted filesystems will still work, but will generate a compiler warning. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 | Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6 | Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6: (577 commits)
  Staging: ramzswap: Handler for swap slot free callback
  swap: Add swap slot free callback to block_device_operations
  swap: Add flag to identify block swap devices
  Staging: vt6655: use ETH_FRAME_LEN macro instead of custom one
  Staging: vt6655: use ETH_DATA_LEN macro instead of custom one
  Staging: vt6655: use ETH_FCS_LEN macro instead of custom one
  Staging: vt6656: use ETH_HLEN macro instead of custom one
  Staging: comedi: quatech_daqp_cs.c Replace eos semaphore with a completion.
  Staging: dt3155v4l: remove private memory allocator
  Staging: crystalhd: Remove typedefs from driver
  Staging: winbond: Fix for pointer name format issue in mds.c
  Staging: vt6656: removed custom UCHAR/USHORT/UINT/ULONG/ULONGLONG typedefs
  Staging: vt6656: removed custom CHAR/SHORT/INT/LONG typedefs
  Staging: comedi: Altered the way printk is used in 8255.c
  staging: iio: adis16350 and similar IMU driver
  Staging: iio: max1363 Fix two bugs in single_channel_from_ring
  Staging: iio: adis16220 extract bin_attribute structures from state
  Staging: iio: adis16220 vibration sensor driver
  Staging: comedi: Kconfig dependancy fixes
  Staging: comedi: fix up build error from last Kconfig changes
  ...
2010-05-21 | Merge staging-next tree into Linus's latest version | Greg Kroah-Hartman
Conflicts:
  drivers/staging/arlan/arlan-main.c
  drivers/staging/comedi/drivers/cb_das16_cs.c
  drivers/staging/cx25821/cx25821-alsa.c
  drivers/staging/dt3155/dt3155_drv.c
  drivers/staging/hv/hv.c
  drivers/staging/netwave/netwave_cs.c
  drivers/staging/wavelan/wavelan.c
  drivers/staging/wavelan/wavelan_cs.c
  drivers/staging/wlags49_h2/wl_cs.c

This required a bit of hand merging due to the conflicts that happened in the later .34-rc releases, as well as some staging driver changes coming in through other trees (v4l and pcmcia).

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-21 | Merge branch 'master' into for-2.6.35 | Jens Axboe
Conflicts: fs/ext3/fsync.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 | writeback: fix mixed up arguments to bdi_start_writeback() | Jens Axboe
The laptop mode timer had the nr_pages and sb_locked arguments mixed up. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 | writeback: fix problem with !CONFIG_BLOCK compilation | Jens Axboe
When CONFIG_BLOCK isn't enabled: mm/page-writeback.c: In function 'laptop_mode_timer_fn': mm/page-writeback.c:708: error: dereferencing pointer to incomplete type mm/page-writeback.c:709: error: dereferencing pointer to incomplete type Fix this by essentially eliminating the laptop sync handlers when CONFIG_BLOCK isn't set, as most are only used from the block layer code. The exception is laptop_sync_completion() which is used from sys_sync(), make that an empty declaration in that case. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
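The "empty declaration" approach mentioned here is the usual config-stub pattern. Below is a minimal userspace imitation, assuming a hypothetical caller `sync_everything()` and a counter that exists only for demonstration; the real handlers live in mm/page-writeback.c and look different.

```c
/* Build with -DCONFIG_BLOCK to compile in the real handler. */
#ifdef CONFIG_BLOCK
static int laptop_sync_calls;  /* demonstration-only counter */

static void laptop_sync_completion(void)
{
    laptop_sync_calls++;
}
#else
/* Stub: callers compile unchanged and the call costs nothing. */
static inline void laptop_sync_completion(void) { }
#endif

/* Hypothetical caller standing in for the sys_sync() path. */
static int sync_everything(void)
{
    /* ... filesystems would be flushed here ... */
    laptop_sync_completion();
    return 0;
}
```

The benefit of the stub is that call sites need no `#ifdef` of their own.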
2010-05-21 | writeback: fixups for !dirty_writeback_centisecs | Jens Axboe
Commit 69b62d01 fixed up most of the places where we would enter busy schedule() spins when disabling the periodic background writeback. This fixes up the sb timer so that it doesn't get hammered on with the delay disabled, and ensures that it gets rearmed if needed when /proc/sys/vm/dirty_writeback_centisecs gets modified. bdi_forker_task() also needs to check for !dirty_writeback_centisecs and use schedule() appropriately, fix that up too. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-20 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial | Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (44 commits)
  vlynq: make whole Kconfig-menu dependant on architecture
  add descriptive comment for TIF_MEMDIE task flag declaration.
  EEPROM: max6875: Header file cleanup
  EEPROM: 93cx6: Header file cleanup
  EEPROM: Header file cleanup
  agp: use NULL instead of 0 when pointer is needed
  rtc-v3020: make bitfield unsigned
  PCI: make bitfield unsigned
  jbd2: use NULL instead of 0 when pointer is needed
  cciss: fix shadows sparse warning
  doc: inode uses a mutex instead of a semaphore.
  uml: i386: Avoid redefinition of NR_syscalls
  fix "seperate" typos in comments
  cocbalt_lcdfb: correct sections
  doc: Change urls for sparse
  Powerpc: wii: Fix typo in comment
  i2o: cleanup some exit paths
  Documentation/: it's -> its where appropriate
  UML: Fix compiler warning due to missing task_struct declaration
  UML: add kernel.h include to signal.c
  ...
2010-05-20 | Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu | Linus Torvalds
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  ia64: add sparse annotation to __ia64_per_cpu_var()
  percpu: implement kernel memory based chunk allocation
  percpu: move vmalloc based chunk management into percpu-vm.c
  percpu: misc preparations for nommu support
  percpu: reorganize chunk creation and destruction
  percpu: factor out pcpu_addr_in_first/reserved_chunk() and update per_cpu_ptr_to_phys()
2010-05-19 | mm: Move ARCH_SLAB_MINALIGN and ARCH_KMALLOC_MINALIGN to <linux/slub_def.h> | David Woodhouse
Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-19 | mm: Move ARCH_SLAB_MINALIGN and ARCH_KMALLOC_MINALIGN to <linux/slob_def.h> | David Woodhouse
Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-19 | mm: Move ARCH_SLAB_MINALIGN and ARCH_KMALLOC_MINALIGN to <linux/slab_def.h> | David Woodhouse
Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-18 | swap: Add swap slot free callback to block_device_operations | Nitin Gupta
This callback is required when RAM based devices are used as swap disks. One such device is ramzswap which is used as compressed in-memory swap disk. For such devices, we need a callback as soon as a swap slot is no longer used to allow freeing memory allocated for this slot. Without this callback, stale data can quickly accumulate in memory defeating the whole purpose of such devices. Signed-off-by: Nitin Gupta <ngupta@vflare.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Nigel Cunningham <nigel@tuxonice.net> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
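The shape of such an optional callback can be sketched as follows. All type and function names here (`blkdev`, `blkdev_ops`, `ram_slot_free`, `swap_entry_free`) are hypothetical userspace stand-ins; only the idea of an optional "slot freed" hook in the device's operations table comes from the commit message.

```c
#include <stddef.h>

struct blkdev;  /* toy device type, defined below */

/* Operations table with an optional hook: called as soon as the swap
 * slot at `offset` is no longer in use. */
struct blkdev_ops {
    void (*swap_slot_free_notify)(struct blkdev *dev, unsigned long offset);
};

struct blkdev {
    const struct blkdev_ops *ops;
    unsigned char used[8];  /* per-slot "holds data" flag of a toy RAM disk */
};

/* RAM-backed device: drop the stored data for the slot immediately. */
static void ram_slot_free(struct blkdev *dev, unsigned long offset)
{
    if (offset < sizeof(dev->used))
        dev->used[offset] = 0;
}

static const struct blkdev_ops ram_ops = {
    .swap_slot_free_notify = ram_slot_free,
};

/* Swap-side caller: devices without the hook are simply skipped. */
static void swap_entry_free(struct blkdev *dev, unsigned long offset)
{
    if (dev->ops && dev->ops->swap_slot_free_notify)
        dev->ops->swap_slot_free_notify(dev, offset);
}
```

Making the hook optional means ordinary disk-backed swap devices are unaffected.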
2010-05-18swap: Add flag to identify block swap devicesNitin Gupta
Added SWP_BLKDEV flag to distinguish block and regular file backed swap devices. We could also check if a swap is entire block device, rather than a file, by: S_ISBLK(swap_info_struct->swap_file->f_mapping->host->i_mode) but, I think, simply checking this flag is more convenient. Signed-off-by: Nitin Gupta <ngupta@vflare.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Nigel Cunningham <nigel@tuxonice.net> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-18Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits) perf tools: Add mode to build without newt support perf symbols: symbol inconsistency message should be done only at verbose=1 perf tui: Add explicit -lslang option perf options: Type check all the remaining OPT_ variants perf options: Type check OPT_BOOLEAN and fix the offenders perf options: Check v type in OPT_U?INTEGER perf options: Introduce OPT_UINTEGER perf tui: Add workaround for slang < 2.1.4 perf record: Fix bug mismatch with -c option definition perf options: Introduce OPT_U64 perf tui: Add help window to show key associations perf tui: Make <- exit menus too perf newt: Add single key shortcuts for zoom into DSO and threads perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed perf newt: Fix the 'A'/'a' shortcut for annotate perf newt: Make <- exit the ui_browser x86, perf: P4 PMU - fix counters management logic perf newt: Make <- zoom out filters perf report: Report number of events, not samples perf hist: Clarify events_stats fields usage ... Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
2010-05-17writeback: fix WB_SYNC_NONE writeback from umountJens Axboe
When umount calls sync_filesystem(), we first do a WB_SYNC_NONE writeback to kick off writeback of pending dirty inodes, then follow that up with a WB_SYNC_ALL to wait for it. Since umount already holds the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all writeback happens as WB_SYNC_ALL. This can greatly slow down umount, since WB_SYNC_ALL writeback is a data integrity operation and thus a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems it's a lot slower. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-11memcg: fix css_is_ancestor() RCU lockingKAMEZAWA Hiroyuki
Some callers (in memcontrol.c) call css_is_ancestor() without rcu_read_lock(). Because css_is_ancestor() has to access RCU protected data, it should be under rcu_read_lock(). This makes css_is_ancestor() itself do safe access to the RCU protected area. (At least, "root" can have refcnt==0 if it's not an ancestor of "child". So, we need rcu_read_lock().) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11memcg: fix css_id() RCU locking for realKAMEZAWA Hiroyuki
Commit ad4ba375373937817404fd92239ef4cadbded23b ("memcg: css_id() must be called under rcu_read_lock()") modifies memcontrol.c to fix an RCU check message. But Andrew Morton pointed out that the fix doesn't seem sane and was just hiding lockdep messages. This patch does the proper thing. Checking again, all the places that commit fixed, which call css_id() without rcu_read_lock(), were intentional: all callers of css_id() hold a reference count on it. So, it's not necessary to be under rcu_read_lock(). Considering again, we can use rcu_dereference_check() for css_id(). We know css->id is valid if css->refcnt > 0. (css->id never changes and is freed after css->refcnt goes to 0.) This patch makes use of rcu_dereference_check() in css_id/depth and removes the unnecessary rcu_read_lock() added by the commit. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11rmap: remove anon_vma check in page_address_in_vma()Naoya Horiguchi
Currently page_address_in_vma() compares vma->anon_vma and page_anon_vma(page) for parameter check, but in 2.6.34 a vma can have multiple anon_vmas with anon_vma_chain, so current check does not work. (For anonymous page shared by multiple processes, some verified (page,vma) pairs return -EFAULT wrongly.) We can go to checking all anon_vmas in the "same_vma" chain, but it needs to meet lock requirement. Instead, we can remove anon_vma check safely because page_address_in_vma() assumes that page and vma are already checked to belong to the identical process. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11hugetlbfs: kill applications that use MAP_NORESERVE with SIGBUS instead of ↵Mel Gorman
OOM-killer Ordinarily, application using hugetlbfs will create mappings with reserves. For shared mappings, these pages are reserved before mmap() returns success and for private mappings, the caller process is guaranteed and a child process that cannot get the pages gets killed with sigbus. An application that uses MAP_NORESERVE gets no reservations and mmap() will always succeed at the risk the page will not be available at fault time. This might be used for example on very large sparse mappings where the developer is confident the necessary huge pages exist to satisfy all faults even though the whole mapping cannot be backed by huge pages. Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the fault handler which proceeds to trigger the OOM-killer. This is unhelpful. Even without hugetlbfs mounted, a user using mmap() can trivially trigger the OOM-killer because VM_FAULT_OOM is returned (will provide example program if desired - it's a whopping 24 lines long). It could be considered a DOS available to an unprivileged user. This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE where huge pages were not available with SIGBUS instead of triggering the OOM killer. This change affects hugetlb_cow() as well. I feel there is a failure case in there, but I didn't create one. It would need a fairly specific target in terms of the faulting application and the hugepage pool size. The hugetlb_no_page() path is much easier to hit but both might as well be closed. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-07Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: create rcu_my_thread_group_empty() wrapper memcg: css_id() must be called under rcu_read_lock() cgroup: Check task_lock in task_subsys_state() sched: Fix an RCU warning in print_task() cgroup: Fix an RCU warning in alloc_css_id() cgroup: Fix an RCU warning in cgroup_path() KEYS: Fix an RCU warning in the reading of user keys KEYS: Fix an RCU warning
2010-05-07Merge branch 'perf/urgent' into perf/coreIngo Molnar
Merge reason: Resolve patch dependency Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-05slub: Fix bad boundary check in init_kmem_cache_nodes()Zhang, Yanmin
Function init_kmem_cache_nodes is incorrect when checking upper limitation of kmalloc_caches. The breakage was introduced by commit 91efd773c74bb26b5409c85ad755d536448e229c ("dma kmalloc handling fixes"). Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-04memcg: css_id() must be called under rcu_read_lock()Paul E. McKenney
This patch fixes task_in_mem_cgroup(), mem_cgroup_uncharge_swapcache(), mem_cgroup_move_swap_account(), and is_target_pte_for_mc() to protect calls to css_id(). An additional RCU lockdep splat was reported for memcg_oom_wake_function(), however, this function is not yet in mainline as of 2.6.34-rc5. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2010-05-01percpu: implement kernel memory based chunk allocationTejun Heo
Implement an alternate percpu chunk management based on kernel memory for nommu SMP architectures. Instead of mapping into vmalloc area, chunks are allocated as contiguous kernel memory using alloc_pages(). As such, percpu allocator on nommu will have the following restrictions. * It can't fill chunks on-demand page-by-page. It has to allocate each chunk fully upfront. * It can't support sparse chunk for NUMA configurations. SMP w/o mmu is crazy enough. Let's hope no one does NUMA w/o mmu. :-P * If chunk size isn't power-of-two multiple of PAGE_SIZE, the unaligned amount will be wasted on each chunk. So, archs which use this had better align chunk size. For instructions on how to use this, read the comment on top of mm/percpu-km.c. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Graff Yang <graff.yang@gmail.com> Cc: Sonic Zhang <sonic.adi@gmail.com>
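The third restriction above is simple arithmetic: alloc_pages() hands out 2^order contiguous pages, so a chunk whose size is not a power-of-two multiple of the page size leaves the remainder unused. A minimal userspace sketch, assuming a 4 KiB page; `SKETCH_PAGE_SIZE`, `sketch_order_for()` and `sketch_wasted_bytes()` are hypothetical names for illustration, not kernel symbols:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for this sketch; real kernels use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/* Smallest order such that (SKETCH_PAGE_SIZE << order) >= size. */
unsigned int sketch_order_for(size_t size)
{
    unsigned int order = 0;

    while ((SKETCH_PAGE_SIZE << order) < size)
        order++;
    return order;
}

/*
 * Bytes left unused when a chunk of `size` bytes is backed by a single
 * block of 2^order contiguous pages.
 */
size_t sketch_wasted_bytes(size_t size)
{
    assert(size > 0);
    return (SKETCH_PAGE_SIZE << sketch_order_for(size)) - size;
}
```

Under these assumptions a chunk of three pages needs an order-2 block (four pages), so one page per chunk goes unused, while a two-page chunk fits an order-1 block exactly.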
2010-05-01percpu: move vmalloc based chunk management into percpu-vm.cTejun Heo
Separate out and move chunk management (creation/destruction and [de]population) code into percpu-vm.c which is included by percpu.c and compiled together. The interface for chunk management is defined as follows. * pcpu_populate_chunk - populate the specified range of a chunk * pcpu_depopulate_chunk - depopulate the specified range of a chunk * pcpu_create_chunk - create a new chunk * pcpu_destroy_chunk - destroy a chunk, always preceded by full depop * pcpu_addr_to_page - translate address to physical address * pcpu_verify_alloc_info - check alloc_info is acceptable during init Other than wrapping vmalloc_to_page() inside pcpu_addr_to_page() and dummy pcpu_verify_alloc_info() implementation, this patch only moves code around. This separation is to allow alternate chunk management implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Graff Yang <graff.yang@gmail.com> Cc: Sonic Zhang <sonic.adi@gmail.com>
2010-05-01percpu: misc preparations for nommu supportTejun Heo
Make the following misc preparations for percpu nommu support. * Remove references to vmalloc in common comments as nommu percpu won't use it. * Rename chunk->vms to chunk->data and make it void *. Its use is determined by chunk management implementation. * Relocate utility functions and add __maybe_unused to functions which might not be used by different chunk management implementations. This patch doesn't cause any functional change. This is to allow alternate chunk management implementation for percpu nommu support. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Graff Yang <graff.yang@gmail.com> Cc: Sonic Zhang <sonic.adi@gmail.com>
2010-05-01percpu: reorganize chunk creation and destructionTejun Heo
Reorganize alloc/free_pcpu_chunk() such that chunk struct alloc/free live in pcpu_alloc/free_chunk() and the rest in pcpu_create/destroy_chunk(). While at it, add missing error handling for chunk->map allocation failure. This is to allow alternate chunk management implementation for percpu nommu support. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Graff Yang <graff.yang@gmail.com> Cc: Sonic Zhang <sonic.adi@gmail.com>
2010-05-01percpu: factor out pcpu_addr_in_first/reserved_chunk() and update ↵Tejun Heo
per_cpu_ptr_to_phys() Factor out pcpu_addr_in_first/reserved_chunk() from pcpu_chunk_addr_search() and use it to update per_cpu_ptr_to_phys() such that it handles first chunk differently from the rest. This patch doesn't cause any functional change and is to prepare for percpu nommu support. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: David Howells <dhowells@redhat.com> Cc: Graff Yang <graff.yang@gmail.com> Cc: Sonic Zhang <sonic.adi@gmail.com>
2010-04-30Merge commit 'v2.6.34-rc6' into perf/coreIngo Molnar
Merge reason: update to the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>