Age  Commit message  Author  Files  Lines
2012-05-29kernel: cgroup: push rcu read locking from css_is_ancestor() to callsiteJohannes Weiner2-15/+19
Library functions should not grab locks when the callsites can do it, even if the lock nests like the rcu read-side lock does. Push the rcu_read_lock() from css_is_ancestor() to its single user, mem_cgroup_same_or_subtree() in preparation for another user that may already hold the rcu read-side lock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
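The locking pattern described here can be illustrated outside the kernel. What follows is a minimal userspace sketch, not the kernel code: the counter behind mock_read_lock()/mock_read_unlock() merely stands in for rcu_read_lock()/rcu_read_unlock(), and struct node, is_ancestor() and same_or_subtree() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Mock read-side lock: a nesting counter standing in for the RCU read lock. */
static int read_lock_depth;

static void mock_read_lock(void)
{
    read_lock_depth++;
}

static void mock_read_unlock(void)
{
    assert(read_lock_depth > 0);
    read_lock_depth--;
}

/* Hypothetical hierarchy node with a parent pointer. */
struct node {
    struct node *parent;
};

/*
 * Library helper: takes no lock itself. The caller must already be inside
 * a read-side critical section; the assert documents and checks that.
 */
static int is_ancestor(const struct node *child, const struct node *root)
{
    assert(read_lock_depth > 0);
    for (const struct node *n = child; n != NULL; n = n->parent) {
        if (n == root)
            return 1;
    }
    return 0;
}

/* Callsite: owns the locking around the helper. */
static int same_or_subtree(const struct node *root, const struct node *candidate)
{
    int ret;

    if (root == candidate)
        return 1;
    mock_read_lock();
    ret = is_ancestor(candidate, root);
    mock_read_unlock();
    return ret;
}
```

A second caller that already holds the read-side lock could call is_ancestor() directly without nesting another lock/unlock pair, which is the situation the commit prepares for.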
2012-05-29mm: do_migrate_pages(): rename argumentsAndrew Morton2-14/+13
s/from_nodes/from/ and s/to_nodes/to/. The "_nodes" suffix is redundant: it duplicates the argument's type. Done in a fit of irritation over 80-col issues :( Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <mkosaki@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: do_migrate_pages() calls migrate_to_node() even if task is already on a correct nodeLarry Woodman1-0/+20
While running an application that moves tasks from one cpuset to another I noticed that it takes much longer and moves many more pages than expected. The reason for this is that do_migrate_pages() does its best to preserve the relative node differential from the first node of the cpuset because the application may have been written with that in mind. If memory was interleaved on the nodes of the source cpuset by an application, do_migrate_pages() will try its best to maintain that interleaving on the nodes of the destination cpuset. This means copying the memory from all source nodes to the destination nodes even if the source and destination nodes overlap. This is a problem for userspace NUMA placement tools. The amount of time spent doing extra memory moves cancels out some of the NUMA performance improvements. Furthermore, if the number of source and destination nodes differs, the previous interleaving layout cannot be maintained anyway. This patch changes do_migrate_pages() to only preserve the relative layout inside the program if the number of NUMA nodes in the source and destination mask are the same. If the number is different, we do a much more efficient migration by not touching memory that is in an allowed node. This preserves the old behaviour for programs that want it, while allowing a userspace NUMA placement tool to use the new, faster migration. This improves performance in our tests by up to a factor of 7.
Without this change, when migrating tasks from a cpuset containing nodes 0-7 to a cpuset containing nodes 3-4, we migrate from ALL the nodes even if they are in both the source and destination nodesets:

    Migrating 7 to 4
    Migrating 6 to 3
    Migrating 5 to 4
    Migrating 4 to 3
    Migrating 1 to 4
    Migrating 3 to 4
    Migrating 0 to 3
    Migrating 2 to 3

With this change we only migrate from nodes that are not in the destination nodesets:

    Migrating 7 to 4
    Migrating 6 to 3
    Migrating 5 to 4
    Migrating 2 to 3
    Migrating 1 to 4
    Migrating 0 to 3

Yet if we move from a cpuset containing nodes 2,3,4 to a cpuset containing 3,4,5 we still do move everything so that we preserve the desired NUMA offsets:

    Migrating 4 to 5
    Migrating 3 to 4
    Migrating 2 to 3

As far as performance is concerned this simple patch improves the time it takes to move 14, 20 and 26 large tasks from a cpuset containing nodes 0-7 to a cpuset containing nodes 1 & 3 by up to a factor of 7. Here are the timings with and without the patch:

BEFORE PATCH -- Move times: 59, 140, 651 seconds
============

Moving 14 tasks from nodes (0-7) to nodes (1,3)
numad(8780) do_migrate_pages (mm=0xffff88081d414400 from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x7 dest=0x3 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x6 dest=0x1 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x5 dest=0x3 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x4 dest=0x1 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x2 dest=0x1 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x1 dest=0x3 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d414400 source=0x0 dest=0x1 flags=0x4)
(Above moves repeated for each of the 14 tasks...)
PID 8890 moved to node(s) 1,3 in 59.2 seconds

Moving 20 tasks from nodes (0-7) to nodes (1,4-5)
numad(8780) do_migrate_pages (mm=0xffff88081d88c700 from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x7 dest=0x4 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x6 dest=0x1 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x3 dest=0x1 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x2 dest=0x5 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x1 dest=0x4 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d88c700 source=0x0 dest=0x1 flags=0x4)
(Above moves repeated for each of the 20 tasks...)
PID 8962 moved to node(s) 1,4-5 in 139.88 seconds

Moving 26 tasks from nodes (0-7) to nodes (1-3,5)
numad(8780) do_migrate_pages (mm=0xffff88081d5bc740 from_nodes=0xffff880818c81d28 to_nodes=0xffff880818c81ce8 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x7 dest=0x5 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x6 dest=0x3 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x5 dest=0x2 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x3 dest=0x5 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x2 dest=0x3 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x1 dest=0x2 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x0 dest=0x1 flags=0x4)
numad(8780) migrate_to_node (mm=0xffff88081d5bc740 source=0x4 dest=0x1 flags=0x4)
(Above moves repeated for each of the 26 tasks...)
PID 9058 moved to node(s) 1-3,5 in 651.45 seconds

AFTER PATCH -- Move times: 42, 56, 93 seconds
===========

Moving 14 tasks from nodes (0-7) to nodes (5,7)
numad(33209) do_migrate_pages (mm=0xffff88101d5ff140 from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x6 dest=0x5 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x4 dest=0x5 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x3 dest=0x7 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x2 dest=0x5 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x1 dest=0x7 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d5ff140 source=0x0 dest=0x5 flags=0x4)
(Above moves repeated for each of the 14 tasks...)
PID 33221 moved to node(s) 5,7 in 41.67 seconds

Moving 20 tasks from nodes (0-7) to nodes (1,3,5)
numad(33209) do_migrate_pages (mm=0xffff88101d6c37c0 from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x7 dest=0x3 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x6 dest=0x1 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x4 dest=0x3 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x2 dest=0x5 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d6c37c0 source=0x0 dest=0x1 flags=0x4)
(Above moves repeated for each of the 20 tasks...)
PID 33289 moved to node(s) 1,3,5 in 56.3 seconds

Moving 26 tasks from nodes (0-7) to nodes (1,3,5,7)
numad(33209) do_migrate_pages (mm=0xffff88101d924400 from_nodes=0xffff88101e7b5d28 to_nodes=0xffff88101e7b5ce8 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x6 dest=0x5 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x4 dest=0x1 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x2 dest=0x5 flags=0x4)
numad(33209) migrate_to_node (mm=0xffff88101d924400 source=0x0 dest=0x1 flags=0x4)
(Above moves repeated for each of the 26 tasks...)
PID 33372 moved to node(s) 1,3,5,7 in 92.67 seconds

[akpm@linux-foundation.org: clean up comment layout] Signed-off-by: Larry Woodman <lwoodman@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
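The node selection described in this commit message can be modelled in userspace. The sketch below is an approximation under stated assumptions, not the kernel implementation: node masks are plain 32-bit integers, remap_node() assumes that a node keeps its relative position modulo the size of the destination set, and all names (struct move, plan_migration(), skip_allowed) are hypothetical. The skip_allowed flag toggles the behaviour the patch adds, namely skipping source nodes that are already in the destination set when the two sets differ in size.

```c
#include <stdint.h>

#define MAX_NODES 32

struct move {
    int source;
    int dest;
};

static uint32_t node_bit(int node)
{
    return UINT32_C(1) << node;
}

/* Number of nodes set in a mask. */
static int mask_weight(uint32_t mask)
{
    int n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Index of the n-th (0-based) set bit in mask, or -1 if there is none. */
static int mask_nth(uint32_t mask, int n)
{
    for (int node = 0; node < MAX_NODES; node++) {
        if ((mask & node_bit(node)) && n-- == 0)
            return node;
    }
    return -1;
}

/* Assumed remapping: keep the relative position, modulo the size of 'to'. */
static int remap_node(int s, uint32_t from, uint32_t to)
{
    if (!(from & node_bit(s)) || to == 0)
        return s;
    return mask_nth(to, mask_weight(from & (node_bit(s) - 1)) % mask_weight(to));
}

/*
 * Fill 'out' (room for MAX_NODES entries) with the planned moves, in order,
 * and return how many there are.
 */
static int plan_migration(uint32_t from, uint32_t to, int skip_allowed,
                          struct move *out)
{
    uint32_t tmp = from;
    int sizes_differ = mask_weight(from) != mask_weight(to);
    int count = 0;

    while (tmp != 0) {
        int source = -1;
        int dest = 0;

        for (int s = 0; s < MAX_NODES; s++) {
            if (!(tmp & node_bit(s)))
                continue;
            /* The new rule: leave memory alone if it is on an allowed node. */
            if (skip_allowed && sizes_differ && (to & node_bit(s)))
                continue;
            int d = remap_node(s, from, to);
            if (s == d)
                continue;
            source = s;
            dest = d;
            /* Prefer a destination that is not a pending source. */
            if (!(tmp & node_bit(dest)))
                break;
        }
        if (source == -1)
            break;
        tmp &= ~node_bit(source);
        out[count].source = source;
        out[count].dest = dest;
        count++;
    }
    return count;
}
```

With from = 0xff (nodes 0-7) and to = 0x18 (nodes 3-4), this model produces eight moves when skip_allowed is 0 and six when it is 1, in the same order as the "Migrating" lists quoted in the message. With nodes 2,3,4 moving to 3,4,5 it still plans all three moves.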
2012-05-29thp, memcg: split hugepage for memcg oom on cowDavid Rientjes2-3/+18
On COW, a new hugepage is allocated and charged to the memcg. If the system is oom or the charge to the memcg fails, however, the fault handler will return VM_FAULT_OOM which results in an oom kill. Instead, it's possible to fall back to splitting the hugepage so that the COW results only in an order-0 page being allocated and charged to the memcg, which has a higher likelihood of succeeding. This is expensive because the hugepage must be split in the page fault handler, but it is much better than unnecessarily oom killing a process. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/vmstat.c: remove debug fs entries on failure of file creation and made extfrag_debug_root dentry localSasikantha babu1-3/+7
Remove debug fs files and directory on failure. Since no one is using "extfrag_debug_root" dentry outside of extfrag_debug_init(), make it local to the function. Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/fork: fix overflow in vma length when copying mmap on cloneSiddhesh Poyarekar1-1/+2
The vma length in dup_mmap is calculated and stored in an unsigned int, which is insufficient and hence overflows for very large maps (beyond 16TB). The following program demonstrates this:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define GIG (1024 * 1024 * 1024L)
    #define EXTENT 16393

    int main(void)
    {
        int i, r;
        void *m;

        for (i = 0; i < EXTENT; i++) {
            /* Map 1GB at a time until the mappings exceed 16TB in total. */
            m = mmap(NULL, (size_t) GIG, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (m == MAP_FAILED)
                perror("MMAP failed");
            else
                printf("%d : MMAP returned %p\n", i, m);

            r = fork();
            if (r == 0) {
                printf("%d: succeeded\n", i);
                return 0;
            } else if (r < 0)
                printf("FORK Failed: %d\n", r);
            else if (r > 0)
                wait(NULL);
        }
        return 0;
    }

Increase the storage size of the result to unsigned long, which is sufficient for storing the difference between addresses. Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
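The arithmetic behind the 16TB limit can be checked in isolation. This is a small sketch assuming 4 KiB pages (PAGE_SHIFT of 12) and a length counted in pages; vma_pages_32() and vma_pages_64() are hypothetical helpers, with uint32_t modelling a 32-bit unsigned int and uint64_t modelling a 64-bit unsigned long.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Page count kept in 32 bits: wraps once the range reaches 2^32 pages. */
static uint32_t vma_pages_32(uint64_t vm_start, uint64_t vm_end)
{
    return (uint32_t)((vm_end - vm_start) >> PAGE_SHIFT);
}

/* Page count kept in 64 bits: wide enough for any address difference. */
static uint64_t vma_pages_64(uint64_t vm_start, uint64_t vm_end)
{
    return (vm_end - vm_start) >> PAGE_SHIFT;
}
```

For a range of 16 TiB plus one page, 2^44 bytes is exactly 2^32 pages, so the 32-bit count wraps to 1 while the 64-bit count is 2^32 + 1.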
2012-05-29mm/mmap.c: find_vma(): remove unnecessary if(mm) checkRajman Mekaco1-26/+27
The "if (mm)" check is not required in find_vma, as the kernel code calls find_vma only when it is absolutely sure that the mm_struct arg to it is non-NULL. Remove the if (mm) check and add a WARN_ONCE(!mm) for now. This will serve the purpose of mandating that the execution context (user-mode/kernel-mode) be known before find_vma is called. Also fixed 2 checkpatch.pl errors in the declaration of the rb_node and vma_tmp local variables. I was browsing through the internet and read a discussion at https://lkml.org/lkml/2012/3/27/342 which discusses removal of the validation check within find_vma. Since no-one responded, I decided to send this patch with Andrew's suggestions. [akpm@linux-foundation.org: add remove-me comment] Signed-off-by: Rajman Mekaco <rajman.mekaco@gmail.com> Cc: Kautuk Consul <consul.kautuk@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: use kcalloc() instead of kzalloc() to allocate arrayThomas Meyer1-2/+2
The advantage of kcalloc() is that it prevents integer overflows which could result from multiplying the number of elements by the element size, and it is also a bit nicer to read. The semantic patch that makes this change is available in https://lkml.org/lkml/2011/11/25/107 Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
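The overflow check that an array allocator performs can be sketched in userspace. alloc_zeroed_array() below is a hypothetical stand-in built on malloc(), written only to show the idea; it is not the kernel's kcalloc().

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of 'size' bytes each.
 * Returns NULL if n * size would overflow size_t, instead of silently
 * allocating a buffer that is too small.
 */
static void *alloc_zeroed_array(size_t n, size_t size)
{
    void *p;

    if (size != 0 && n > SIZE_MAX / size)
        return NULL;

    p = malloc(n * size);
    if (p != NULL)
        memset(p, 0, n * size);
    return p;
}
```

An open-coded multiplication passed to a single-size allocator has no such guard: a wrapped product yields a short buffer, and later indexing runs past its end.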
2012-05-29mm: fix off-by-one bug in print_nodes_state()Ryota Ozaki1-5/+3
/sys/devices/system/node/{online,possible} outputs a garbage byte because print_nodes_state() returns content size + 1. To fix the bug, the patch changes the use of cpuset_sprintf_cpulist to follow the use at other places, which is clearer and safer. This bug was introduced in v2.6.24 (commit bde631a51876: "mm: add node states sysfs class attributeS"). Signed-off-by: Ryota Ozaki <ozaki.ryota@gmail.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
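The class of bug described here is easy to reproduce in miniature: a show-style function must return exactly the number of content bytes it wrote, and returning one more exposes whatever byte follows. The sketch below is a userspace illustration with a hypothetical show_list() helper, not the code from the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write 'list' followed by a newline into buf and return the number of
 * content bytes written, not counting the terminating NUL.
 */
static int show_list(char *buf, size_t size, const char *list)
{
    int n;

    if (size < 2)
        return 0;

    /* Limit the text to size - 2 bytes so '\n' and NUL always fit. */
    n = snprintf(buf, size - 1, "%s", list);
    if (n < 0)
        return 0;
    if ((size_t)n > size - 2)
        n = (int)(size - 2); /* the text was truncated */

    buf[n++] = '\n';
    buf[n] = '\0';
    return n;
}
```

The return value always equals strlen(buf), so a reader that copies that many bytes never sees the NUL or anything past it.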
2012-05-29mm: vmscan: remove reclaim_mode_tMel Gorman2-52/+24
There is little motivation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC and lumpy reclaim have been removed. This patch gets rid of reclaim_mode_t as well and improves the documentation about what reclaim/compaction is and when it is triggered. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: do not stall on writeback during memory compactionMel Gorman2-83/+14
This patch stops reclaim/compaction from entering sync reclaim, as this was only intended for lumpy reclaim and its use by reclaim/compaction was an oversight. Page migration has its own logic for stalling on writeback pages if necessary and memory compaction is already using it. Waiting on page writeback is bad for a number of reasons but the primary one is that waiting on writeback to a slow device like USB can take a considerable length of time. Page reclaim instead uses wait_iff_congested() to throttle if too many dirty pages are being scanned. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: remove lumpy reclaimMel Gorman2-151/+19
This series removes lumpy reclaim and some stalling logic that was unintentionally being used by memory compaction. The end result is that stalling on dirty pages during page reclaim now depends on wait_iff_congested().

Four kernels were compared

    3.3.0     vanilla
    3.4.0-rc2 vanilla
    3.4.0-rc2 lumpyremove-v2 is patch one from this series
    3.4.0-rc2 nosync-v2r3 is the full series

Removing lumpy reclaim saves almost 900 bytes of text whereas the full series removes 1200 bytes.

       text    data     bss      dec    hex filename
    6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla
    6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2
    6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2

There are behaviour changes in the series and so tests were run with monitoring of ftrace events. This disrupts results so the performance results are distorted but the new behaviour should be clearer.

fs-mark running in a threaded configuration showed little of interest as it did not push reclaim aggressively

    FS-Mark Multi Threaded
                         3.3.0-vanilla           rc2-vanilla          lumpyremove-v2r3           nosync-v2r3
    Files/s  min          3.20 ( 0.00%)         3.20 ( 0.00%)            3.20 ( 0.00%)         3.20 ( 0.00%)
    Files/s  mean         3.20 ( 0.00%)         3.20 ( 0.00%)            3.20 ( 0.00%)         3.20 ( 0.00%)
    Files/s  stddev       0.00 ( 0.00%)         0.00 ( 0.00%)            0.00 ( 0.00%)         0.00 ( 0.00%)
    Files/s  max          3.20 ( 0.00%)         3.20 ( 0.00%)            3.20 ( 0.00%)         3.20 ( 0.00%)
    Overhead min     508667.00 ( 0.00%)    521350.00 (-2.49%)       544292.00 (-7.00%)    547168.00 (-7.57%)
    Overhead mean    551185.00 ( 0.00%)    652690.73 (-18.42%)      991208.40 (-79.83%)   570130.53 (-3.44%)
    Overhead stddev   18200.69 ( 0.00%)    331958.29 (-1723.88%)   1579579.43 (-8578.68%)   9576.81 (47.38%)
    Overhead max     576775.00 ( 0.00%)   1846634.00 (-220.17%)    6901055.00 (-1096.49%) 585675.00 (-1.54%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)            309.90    300.95    307.33    298.95
    User+Sys Time Running Test (seconds)       319.32    309.67    315.69    307.51
    Total Elapsed Time (seconds)              1187.85   1193.09   1191.98   1193.73

    MMTests Statistics: vmstat
    Page Ins                        80532       82212       81420       79480
    Page Outs                   111434984   111456240   111437376   111582628
    Swap Ins                            0           0           0           0
    Swap Outs                           0           0           0           0
    Direct pages scanned            44881       27889       27453       34843
    Kswapd pages scanned         25841428    25860774    25861233    25843212
    Kswapd pages reclaimed       25841393    25860741    25861199    25843179
    Direct pages reclaimed          44881       27889       27453       34843
    Kswapd efficiency                 99%         99%         99%         99%
    Kswapd velocity             21754.791   21675.460   21696.029   21649.127
    Direct efficiency                100%        100%        100%        100%
    Direct velocity                37.783      23.375      23.031      29.188
    Percentage direct scans            0%          0%          0%          0%

ftrace showed that there was no stalling on writeback or pages submitted for IO from reclaim context.

postmark was similar and while it was more interesting, it also did not push reclaim heavily.

    POSTMARK
                                          3.3.0-vanilla       rc2-vanilla     lumpyremove-v2r3       nosync-v2r3
    Transactions per second:              16.00 ( 0.00%)     20.00 (25.00%)     18.00 (12.50%)     17.00 ( 6.25%)
    Data megabytes read per second:       18.80 ( 0.00%)     24.27 (29.10%)     22.26 (18.40%)     20.54 ( 9.26%)
    Data megabytes written per second:    35.83 ( 0.00%)     46.25 (29.08%)     42.42 (18.39%)     39.14 ( 9.24%)
    Files created alone per second:       28.00 ( 0.00%)     38.00 (35.71%)     34.00 (21.43%)     30.00 ( 7.14%)
    Files create/transact per second:      8.00 ( 0.00%)     10.00 (25.00%)      9.00 (12.50%)      8.00 ( 0.00%)
    Files deleted alone per second:      556.00 ( 0.00%)   1224.00 (120.14%)  3062.00 (450.72%)  6124.00 (1001.44%)
    Files delete/transact per second:      8.00 ( 0.00%)     10.00 (25.00%)      9.00 (12.50%)      8.00 ( 0.00%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)            113.34    107.99    109.73    108.72
    User+Sys Time Running Test (seconds)       145.51    139.81    143.32    143.55
    Total Elapsed Time (seconds)              1159.16    899.23    980.17   1062.27

    MMTests Statistics: vmstat
    Page Ins                     13710192    13729032    13727944    13760136
    Page Outs                    43071140    42987228    42733684    42931624
    Swap Ins                            0           0           0           0
    Swap Outs                           0           0           0           0
    Direct pages scanned                0           0           0           0
    Kswapd pages scanned          9941613     9937443     9939085     9929154
    Kswapd pages reclaimed        9940926     9936751     9938397     9928465
    Direct pages reclaimed              0           0           0           0
    Kswapd efficiency                 99%         99%         99%         99%
    Kswapd velocity              8576.567   11051.058   10140.164    9347.109
    Direct efficiency                100%        100%        100%        100%
    Direct velocity                 0.000       0.000       0.000       0.000

It looks here like the full series regresses performance but as ftrace showed no usage of wait_iff_congested() or sync reclaim I am assuming it's a disruption due to monitoring. Other data such as memory usage, page IO, swap IO all looked similar.

Running a benchmark with a plain DD showed nothing very interesting. The full series stalled in wait_iff_congested() slightly less but stall times on vanilla kernels were marginal.

Running a benchmark that hammered on file-backed mappings showed stalls due to congestion but not in sync writebacks

    MICRO
                                          3.3.0-vanilla  rc2-vanilla  lumpyremove-v2r3  nosync-v2r3
    MMTests Statistics: duration
    Sys Time Running Test (seconds)            308.13    294.50    298.75    299.53
    User+Sys Time Running Test (seconds)       330.45    316.28    318.93    320.79
    Total Elapsed Time (seconds)              1814.90   1833.88   1821.14   1832.91

    MMTests Statistics: vmstat
    Page Ins                       108712      120708       97224      110344
    Page Outs                   155514576   156017404   155813676   156193256
    Swap Ins                            0           0           0           0
    Swap Outs                           0           0           0           0
    Direct pages scanned          2599253     1550480     2512822     2414760
    Kswapd pages scanned         69742364    71150694    68839041    69692533
    Kswapd pages reclaimed       34824488    34773341    34796602    34799396
    Direct pages reclaimed          53693       94750       61792       75205
    Kswapd efficiency                 49%         48%         50%         49%
    Kswapd velocity             38427.662   38797.901   37799.972   38022.889
    Direct efficiency                  2%          6%          2%          3%
    Direct velocity              1432.174     845.464    1379.807    1317.446
    Percentage direct scans            3%          2%          3%          3%
    Page writes by reclaim              0           0           0           0
    Page writes file                    0           0           0           0
    Page writes anon                    0           0           0           0
    Page reclaim immediate              0           0           0        1218
    Page rescued immediate              0           0           0           0
    Slabs scanned                   15360       16384       13312       16384
    Direct inode steals                 0           0           0           0
    Kswapd inode steals              4340        4327        1630        4323

    FTrace Reclaim Statistics: congestion_wait
    Direct number congest waited                0          0          0          0
    Direct time   congest waited              0ms        0ms        0ms        0ms
    Direct full   congest waited                0          0          0          0
    Direct number conditional waited          900        870        754        789
    Direct time   conditional waited          0ms        0ms        0ms       20ms
    Direct full   conditional waited            0          0          0          0
    KSwapd number congest waited             2106       2308       2116       1915
    KSwapd time   congest waited         139924ms   157832ms   125652ms   132516ms
    KSwapd full   congest waited             1346       1530       1202       1278
    KSwapd number conditional waited        12922      16320      10943      14670
    KSwapd time   conditional waited          0ms        0ms        0ms        0ms
    KSwapd full   conditional waited            0          0          0          0

Reclaim statistics are not radically changed. The stall times in kswapd are massive but it is clear that it is due to calls to congestion_wait() and that is almost certainly the call in balance_pgdat(). Otherwise stalls due to dirty pages are non-existent.

I ran a benchmark that stressed high-order allocation. This is a very artificial load but was used in the past to evaluate lumpy reclaim and compaction. Generally I look at allocation success rates and latency figures.

    STRESS-HIGHALLOC
                     3.3.0-vanilla       rc2-vanilla     lumpyremove-v2r3       nosync-v2r3
    Pass 1           81.00 ( 0.00%)    28.00 (-53.00%)    24.00 (-57.00%)    28.00 (-53.00%)
    Pass 2           82.00 ( 0.00%)    39.00 (-43.00%)    38.00 (-44.00%)    43.00 (-39.00%)
    while Rested     88.00 ( 0.00%)    87.00 (-1.00%)     88.00 ( 0.00%)     88.00 ( 0.00%)

    MMTests Statistics: duration
    Sys Time Running Test (seconds)            740.93    681.42    685.14    684.87
    User+Sys Time Running Test (seconds)      2922.65   3269.52   3281.35   3279.44
    Total Elapsed Time (seconds)              1161.73   1152.49   1159.55   1161.44

    MMTests Statistics: vmstat
    Page Ins                      4486020     2807256     2855944     2876244
    Page Outs                     7261600     7973688     7975320     7986120
    Swap Ins                        31694           0           0           0
    Swap Outs                       98179           0           0           0
    Direct pages scanned            53494       57731       34406      113015
    Kswapd pages scanned          6271173     1287481     1278174     1219095
    Kswapd pages reclaimed        2029240     1281025     1260708     1201583
    Direct pages reclaimed           1468       14564       16649       92456
    Kswapd efficiency                 32%         99%         98%         98%
    Kswapd velocity              5398.133    1117.130    1102.302    1049.641
    Direct efficiency                  2%         25%         48%         81%
    Direct velocity                46.047      50.092      29.672      97.306
    Percentage direct scans            0%          4%          2%          8%
    Page writes by reclaim        1616049           0           0           0
    Page writes file              1517870           0           0           0
    Page writes anon                98179           0           0           0
    Page reclaim immediate         103778       27339        9796       17831
    Page rescued immediate              0           0           0           0
    Slabs scanned                 1096704      986112      980992      998400
    Direct inode steals               223      215040      216736      247881
    Kswapd inode steals            175331       61548       68444       63066
    Kswapd skipped wait             21991           0           1           0
    THP fault alloc                     1         135         125         134
    THP collapse alloc                393         311         228         236
    THP splits                         25          13           7           8
    THP fault fallback                  0           0           0           0
    THP collapse fail                   3           5           7           7
    Compaction stalls                 865        1270        1422        1518
    Compaction success                370         401         353         383
    Compaction failures               495         869        1069        1135
    Compaction pages moved         870155     3828868     4036106     4423626
    Compaction move failure         26429       23865       29742       27514

Success rates are completely hosed for 3.4-rc2 which is almost certainly due to commit fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled"). I expected this would happen for kswapd and impair allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much of a difference: 80% less scanning, 37% less reclaim by kswapd

In comparison, reclaim/compaction is not aggressive and gives up easily which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be much more aggressive about reclaim/compaction than THP allocations are. The stress test above is allocating like neither THP nor hugetlbfs but is much closer to THP.

Mainline is now impaired in terms of high order allocation under heavy load although I do not know to what degree as I did not test with __GFP_REPEAT. Keep this in mind for bugs related to hugepage pool resizing, THP allocation and high order atomic allocation failures from network devices.

In terms of congestion throttling, I see the following for this test

    FTrace Reclaim Statistics: congestion_wait
    Direct number congest waited                3          0          0          0
    Direct time   congest waited              0ms        0ms        0ms        0ms
    Direct full   congest waited                0          0          0          0
    Direct number conditional waited          957        512       1081       1075
    Direct time   conditional waited          0ms        0ms        0ms        0ms
    Direct full   conditional waited            0          0          0          0
    KSwapd number congest waited               36          4          3          5
    KSwapd time   congest waited           3148ms      400ms      300ms      500ms
    KSwapd full   congest waited               30          4          3          5
    KSwapd number conditional waited        88514        197        332        542
    KSwapd time   conditional waited       4980ms        0ms        0ms        0ms
    KSwapd full   conditional waited           49          0          0          0

The "conditional waited" times are the most interesting as this is directly impacted by the number of dirty pages encountered during scan. As lumpy reclaim is no longer scanning contiguous ranges, it is finding fewer dirty pages. This brings wait times from about 5 seconds to 0. kswapd itself is still calling congestion_wait() so it'll still stall but it's a lot less.

In terms of the type of IO we were doing, I see this

    FTrace Reclaim Statistics: mm_vmscan_writepage
    Direct writes anon  sync                    0          0          0          0
    Direct writes anon  async                   0          0          0          0
    Direct writes file  sync                    0          0          0          0
    Direct writes file  async                   0          0          0          0
    Direct writes mixed sync                    0          0          0          0
    Direct writes mixed async                   0          0          0          0
    KSwapd writes anon  sync                    0          0          0          0
    KSwapd writes anon  async               91682          0          0          0
    KSwapd writes file  sync                    0          0          0          0
    KSwapd writes file  async              822629          0          0          0
    KSwapd writes mixed sync                    0          0          0          0
    KSwapd writes mixed async                   0          0          0          0

In 3.2, kswapd was doing a bunch of async writes of pages but reclaim/compaction was never reaching a point where it was doing sync IO. This does not guarantee that reclaim/compaction was not calling wait_on_page_writeback() but I would consider it unlikely. It indicates that merging patches 2 and 3 to stop reclaim/compaction calling wait_on_page_writeback() should be safe.

This patch:

Lumpy reclaim had a purpose but in the mind of some, it was to kick the system so hard it thrashed. For others the purpose was to complicate vmscan.c. Over time it was given softer shoes and a nicer attitude but memory compaction needs to step up and replace it so this patch sends lumpy reclaim to the farm.

The tracepoint format changes for isolating LRU pages with this patch applied. Furthermore reclaim/compaction can no longer queue dirty pages in pageout() if the underlying BDI is congested. Lumpy reclaim used this logic and reclaim/compaction was using it in error.

Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
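The derived "efficiency" and "velocity" rows in the vmstat tables of this message can be reproduced from the raw counters. The sketch below rests on an assumption inferred from the figures themselves, not on the MMTests source: efficiency appears to be reclaimed/scanned as a truncated integer percentage (reported as 100% when nothing was scanned), and velocity appears to be pages scanned per second of total elapsed time. The function names are hypothetical.

```c
#include <stdint.h>

/* Reclaimed pages as a percentage of scanned pages, truncated. */
static unsigned int efficiency_percent(uint64_t reclaimed, uint64_t scanned)
{
    if (scanned == 0)
        return 100; /* the tables show 100% when no pages were scanned */
    return (unsigned int)(reclaimed * 100 / scanned);
}

/* Pages scanned per second of elapsed wall-clock time. */
static double scan_velocity(uint64_t scanned, double elapsed_seconds)
{
    return (double)scanned / elapsed_seconds;
}
```

For the STRESS-HIGHALLOC run on 3.3.0-vanilla, 2029240 reclaimed out of 6271173 scanned gives 32%, and 6271173 pages over 1161.73 seconds gives roughly 5398.13 pages per second, matching the table. The same counters account for the quoted deltas: 1287481 / 6271173 is about 0.205, hence roughly 80% less scanning, and 1281025 / 2029240 is about 0.631, hence roughly 37% less reclaim.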
2012-05-29mm: remove swap token codeRik van Riel10-307/+2
The swap token code no longer fits in with the current VM model. It does not play well with cgroups or the better NUMA placement code in development, since we have only one swap token globally. It also has the potential to mess with scalability of the system, by increasing the number of non-reclaimable pages on the active and inactive anon LRU lists. Last but not least, the swap token code has been broken for a year without complaints, as reported by Konstantin Khlebnikov. This suggests we no longer have much use for it. The days of sub-1G memory systems with heavy use of swap are over. If we ever need thrashing reducing code in the future, we will have to implement something that does scale. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Acked-by: Bob Picco <bpicco@meloft.net> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm, thp: allow fallback when pte_alloc_one() fails for huge pmdDavid Rientjes1-5/+8
The transparent hugepages feature is careful to not invoke the oom killer when a hugepage cannot be allocated. pte_alloc_one() failing in __do_huge_pmd_anonymous_page(), however, currently results in VM_FAULT_OOM which invokes the pagefault oom killer to kill a memory-hogging task. This is unnecessary since it's possible to drop the reference to the hugepage and fall back to allocating a small page. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm, thp: remove unnecessary ret variableDavid Rientjes1-2/+1
The "ret" variable is unnecessary in __do_huge_pmd_anonymous_page(), so remove it. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/hugetlb.c: use long vars instead of int in region_count()Wang Sheng-Hui1-2/+2
The arguments f & t and fields from & to of struct file_region are defined as long. So use long instead of int to type the temp vars. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/mempolicy.c: use enum value MPOL_REBIND_ONCE in mpol_rebind_policy()Wang Sheng-Hui1-1/+1
We have enum definition in mempolicy.h: MPOL_REBIND_ONCE. It should replace the magic number 0 for step comparison in function mpol_rebind_policy. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm/memory_failure: let the compiler add the function nameBorislav Petkov1-4/+4
These things tend to get out of sync with time so let the compiler automatically enter the current function name using __func__. No functional change. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Acked-by: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
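A short illustration of the idiom, using standard C99 `__func__` in userspace. The macro and function names here are hypothetical and unrelated to the memory-failure code itself.

```c
#include <stdio.h>
#include <string.h>

/* Prefix a message with the name of the enclosing function. */
#define FORMAT_WITH_FUNC(buf, len, msg) \
    snprintf((buf), (len), "%s: %s", __func__, (msg))

static int report_unmap_failure(char *buf, size_t len)
{
    /* No hard-coded function name to go stale if this function is renamed. */
    return FORMAT_WITH_FUNC(buf, len, "page still mapped");
}
```

Because the macro expands inside report_unmap_failure(), `__func__` evaluates to that function's name, so the prefix follows any rename automatically.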
2012-05-29mm: fix NULL ptr deref when walking hugepagesSasha Levin1-1/+1
A missing validation of the value returned by find_vma() could cause a NULL ptr dereference when walking the pagetable. This is triggerable from usermode by a simple user by trying to read a page info out of /proc/pid/pagemap which doesn't exist. Introduced by commit 025c5b2451e4 ("thp: optimize away unnecessary page table locking"). Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
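The class of bug can be sketched as follows. This is a simplified userspace model, not the pagemap code: struct range, find_range() and addr_is_huge() are hypothetical, and the lookup only imitates the documented behaviour of find_vma() in that it may return NULL when nothing lies above the address.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a VMA: a half-open address range and one flag. */
struct range {
	unsigned long start, end;
	int is_huge;
};

/* Hypothetical lookup: return the first range (table sorted by address)
 * whose end lies above addr, or NULL when no such range exists. */
static const struct range *find_range(const struct range *tab, size_t n,
				      unsigned long addr)
{
	assert(tab != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (addr < tab[i].end)
			return &tab[i];
	return NULL;
}

/* Validate the lookup result before dereferencing it. */
static int addr_is_huge(const struct range *tab, size_t n, unsigned long addr)
{
	const struct range *r = find_range(tab, n, addr);

	return r && addr >= r->start && r->is_huge;
}
```

Without the "r &&" test, a query for an address above every range would dereference NULL, which is the failure mode the commit describes.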
2012-05-29cris: select GENERIC_ATOMIC64Cong Wang1-0/+1
Cris doesn't implement atomic64 operations either, so it should select GENERIC_ATOMIC64. Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com> Cc: Mikael Starvik <starvik@axis.com> Jesper Nilsson <jesper.nilsson@axis.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29pagemap.h: fix warning about possibly used before init varPaul Gortmaker1-4/+4
Commit f56f821feb7b ("mm: extend prefault helpers to fault in more than PAGE_SIZE") added in the new functions: fault_in_multipages_writeable() and fault_in_multipages_readable(). However, we currently see: include/linux/pagemap.h:492: warning: 'ret' may be used uninitialized in this function include/linux/pagemap.h:492: note: 'ret' was declared here Unlike a lot of gcc nags, this one appears somewhat legit. i.e. passing in an invalid negative value of "size" does make it look like all the conditionals in there would be bypassed and the uninitialized value would be returned. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
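The shape of the problem can be reproduced with a small stand-in. The function below is hypothetical and only mirrors the control flow described above: a loop that is skipped entirely for a negative size, followed by a return of the loop's status variable. Initialising that variable is the kind of fix the warning asks for.

```c
#include <assert.h>

/* Hypothetical helper: walk 'size' status codes and return the first
 * non-zero one, or 0 if all of them are zero. */
static int check_chunks(const int *status, int size)
{
	int ret = 0;	/* defined result even when the loop never runs */

	for (int i = 0; i < size; i++) {
		ret = status[i];
		if (ret != 0)
			return ret;
	}
	/* Without the initialiser, a negative size would reach this
	 * return with 'ret' never having been assigned. */
	return ret;
}
```

With the initialiser in place, check_chunks(status, -1) returns 0 instead of an indeterminate value.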
2012-05-29Merge tag 'mfd-3.5-1' of ↵Linus Torvalds94-1750/+6799
git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6 Pull MFD changes from Samuel Ortiz: "Besides the usual cleanups, this one brings: * Support for 5 new chipsets: Intel's ICH LPC and SCH Centerton, ST-E's STAX211, Samsung's MAX77693 and TI's LM3533. * Device tree support for the twl6040, tps65910, da9502 and ab8500 drivers. * Fairly big tps65910, ab8500 and db8500 updates. * i2c support for mc13xxx. * Our regular update for the wm8xxx driver from Mark." Fix up various conflicts with other trees, largely due to ab5500 removal etc. * tag 'mfd-3.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (106 commits) mfd: Fix build break of max77693 by adding REGMAP_I2C option mfd: Fix twl6040 build failure mfd: Fix max77693 build failure mfd: ab8500-core should depend on MFD_DB8500_PRCMU gpio: tps65910: dt: process gpio specific device node info mfd: Remove the parsing of dt info for tps65910 gpio mfd: Save device node parsed platform data for tps65910 sub devices mfd: Add r_select to lm3533 platform data gpio: Add Intel Centerton support to gpio-sch mfd: Emulate active low IRQs as well as active high IRQs for wm831x mfd: Mark two lm3533 zone registers as volatile mfd: Fix return type of lm533 attribute is_visible mfd: Enable Device Tree support in the ab8500-pwm driver mfd: Enable Device Tree support in the ab8500-sysctrl driver mfd: Add support for Device Tree to twl6040 mfd: Register the twl6040 child for the ASoC codec unconditionally mfd: Allocate twl6040 IRQ numbers dynamically mfd: twl6040 code cleanup in interrupt initialization part mfd: Enable ab8500-gpadc driver for Device Tree mfd: Prevent unassigned pointer from being used in ab8500-gpadc driver ...
2012-05-29Merge tag 'nfs-for-3.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds48-2881/+3960
Pull NFS client updates from Trond Myklebust: "New features include: - Rewrite the O_DIRECT code so that it can share the same coalescing and pNFS functionality as the page cache code. - Allow the server to provide hints as to when we should use pNFS, and when it is more efficient to read and write through the metadata server. - NFS cache consistency updates: * Use the ctime to emulate a change attribute for NFSv2/v3 so that all NFS versions can share the same cache management code. * New cache management code will only look at the change attribute and size attribute when deciding whether or not our cached data is still valid or not. * Don't request NFSv4 post-op attributes on writes in cases such as O_DIRECT, where we don't care about data cache consistency, or when we have a write delegation, and know that our cache is still consistent. * Don't request NFSv4 post-op attributes on operations such as COMMIT, where there are no expected metadata updates. * Don't request NFSv4 directory post-op attributes in cases where the operations themselves already return change attribute updates: i.e. operations such as OPEN, CREATE, REMOVE, LINK and RENAME. - Speed up 'ls' and friends by using READDIR rather than READDIRPLUS if we detect no attempts to lookup filenames. - Improve the code sharing between NFSv2/v3 and v4 mounts - NFSv4.1 state management efficiency improvements - More patches in preparation for NFSv4/v4.1 migration functionality." 
Fix trivial conflict in fs/nfs/nfs4proc.c that was due to the dcache qstr name initialization changes (that made the length/hash a 64-bit union) * tag 'nfs-for-3.5-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (146 commits) NFSv4: Add debugging printks to state manager NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIO NFSv4: update_changeattr does not need to set NFS_INO_REVAL_PAGECACHE NFSv4.1: nfs4_reset_session should use nfs4_handle_reclaim_lease_error NFSv4.1: Handle other occurrences of NFS4ERR_CONN_NOT_BOUND_TO_SESSION NFSv4.1: Handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION in the state manager NFSv4.1: Handle errors in nfs4_bind_conn_to_session NFSv4.1: nfs4_bind_conn_to_session should drain the session NFSv4.1: Don't clobber the seqid if exchange_id returns a confirmed clientid NFSv4.1: Add DESTROY_CLIENTID NFSv4.1: Ensure we use the correct credentials for bind_conn_to_session NFSv4.1: Ensure we use the correct credentials for session create/destroy NFSv4.1: Move NFSPROC4_CLNT_BIND_CONN_TO_SESSION to the end of the operations NFSv4.1: Handle NFS4ERR_SEQ_MISORDERED when confirming the lease NFSv4: When purging the lease, we must clear NFS4CLNT_LEASE_CONFIRM NFSv4: Clean up the error handling for nfs4_reclaim_lease NFSv4.1: Exchange ID must use GFP_NOFS allocation mode nfs41: Use BIND_CONN_TO_SESSION for CB_PATH_DOWN* nfs4.1: add BIND_CONN_TO_SESSION operation NFSv4.1 test the mdsthreshold hint parameters ...
2012-05-29tty: fix ldisc lock inversion traceAlan Cox1-16/+25
This is caused by tty_release using tty_lock_pair to lock both sides of the pty/tty pair, and then tty_ldisc_release dropping and relocking one side only. We can drop both fine, so drop both to avoid any lock ordering concerns. Rework the release path to fix the new locking model. Signed-off-by: Alan Cox <alan@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29pty: Fix lock inversionAlan Cox1-2/+0
The ptmx_open path takes the tty and devpts locks in the wrong order because tty_init_dev locks and returns a locked tty. As far as I can tell this is actually safe anyway because the tty being returned is new so nobody can get a reference to lock it at this point. However we don't even need the devpts lock at this point, it's only held as a byproduct of the way the locks were pushed down. Signed-off-by: Alan Cox <alan@linux.intel.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-28NFSv4: Add debugging printks to state managerTrond Myklebust1-0/+33
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-28NFSv4: Map NFS4ERR_SHARE_DENIED into an EACCES error instead of EIOTrond Myklebust1-0/+2
If a file OPEN is denied due to a share lock, the resulting NFS4ERR_SHARE_DENIED is currently mapped to the default EIO. This patch adds a more appropriate mapping, and brings Linux into line with what Solaris 10 does. See https://bugzilla.kernel.org/show_bug.cgi?id=43286 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
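A minimal sketch of such a mapping, with loud assumptions: the function is hypothetical and is not the kernel's error-mapping routine, it takes a positive protocol status for simplicity, and the numeric value 10015 for NFS4ERR_SHARE_DENIED is quoted from memory of the NFSv4 protocol definition and should be checked against the specification.

```c
#include <assert.h>
#include <errno.h>

/* Assumed protocol value; verify against the NFSv4 specification. */
enum { EX_NFS4ERR_SHARE_DENIED = 10015 };

/* Hypothetical mapper from a protocol status to a negative errno. */
static int map_status_to_errno(int status)
{
	switch (status) {
	case 0:
		return 0;
	case EX_NFS4ERR_SHARE_DENIED:
		return -EACCES;	/* share reservation conflict: access denied */
	default:
		return -EIO;	/* unrecognised statuses keep the generic default */
	}
}
```

The design point is that only the one status gains a dedicated case; everything else still falls through to the EIO default described above.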
2012-05-28Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds4-2/+230
Pull exofs updates from Boaz Harrosh: "Just a couple of patches. The first is a BUG fix destined for stable which missed the 3.4-rc7 Kernel. The second is just a fixture addition so exofs is able to be better exported as a cluster file system via pNFS." * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Add SYSFS info for autologin/pNFS export exofs: Fix CRASH on very early IO errors.
2012-05-28Merge branch 'doc' of ↵Linus Torvalds3-7/+4
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull documentation updates from Jiri Kosina: "I am currently relaying documentation patches through 'doc' branch of trivial tree, until Rob, the new documentation maintainer, has established a proper tree." * 'doc' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: doc: ext3: update documentation with barrier=1 default Documentation/initrd.txt: Change the location of util-linux Documentation/SubmittingPatches: suggested the use of scripts/get_maintainer.pl Documentation/kernel-parameters: remove autotest and mcatest
2012-05-28Merge branch 'misc' of ↵Linus Torvalds3-1/+101
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull misc kbuild changes from Michal Marek: "The non-critical part of kbuild for 3.5 includes - two new coccinelle checks - fix for make deb-pkg to include generated headers in arch/*/include I have more make-deb-pkg fixes in the backlog, but these will likely have to wait for 3.6." * 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: builddeb: include autogenerated header files scripts/coccinelle: sizeof of pointer scripts/coccinelle: address test is always true
2012-05-28Merge branch 'kconfig' of ↵Linus Torvalds3-19/+32
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kconfig changes from Michal Marek: - Error handling for make KCONFIG_ALLCONFIG=<...> all*config plus a fix for a bug that was exposed by this - Fix for the script/config utility. * 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: scripts/config: properly report and set string options kbuild: all{no,yes,mod,def,rand}config only read files when instructed to. kconfig: Add error handling to KCONFIG_ALLCONFIG
2012-05-28Merge branch 'kbuild' of ↵Linus Torvalds5-227/+259
git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild Pull kbuild updates from Michal Marek. Fixed up nontrivial merge conflict in Makefile as per Stephen Rothwell and linux-next (and trivial arch/sparc/Makefile changes due to removed sparc32 logic). * 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: mips: Fix KBUILD_CPPFLAGS definition kbuild: fix ia64 link kbuild: document KBUILD_LDS, KBUILD_VMLINUX_{INIT,MAIN} and LDFLAGS_vmlinux kbuild: link of vmlinux moved to a script kbuild: refactor final link of sparc32 kbuild: drop unused KBUILD_VMLINUX_OBJS from top-level Makefile kbuild: Makefile: remove unnecessary check for m68knommu ARCH
2012-05-28Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds56-220/+319
Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally
2012-05-28Merge branch 'next' of git://git.monstr.eu/linux-2.6-microblazeLinus Torvalds5-11/+39
Pull microblaze changes from Michal Simek. * 'next' of git://git.monstr.eu/linux-2.6-microblaze: microblaze: Setup correct pointer to TLS area microblaze: Add TLS support to sys_clone microblaze: ftrace: Pass the first calling instruction for dynamic ftrace microblaze: Port OOM changes to do_page_fault microblaze: Do not select GENERIC_GPIO by default
2012-05-28NFSv4: update_changeattr does not need to set NFS_INO_REVAL_PAGECACHETrond Myklebust1-1/+1
We're already invalidating the data cache, and setting the new change attribute. Since directories don't care about the i_size field, there is no need to be forcing any extra revalidation of the page cache. We do keep the NFS_INO_INVALID_ATTR flag, in order to force an attribute cache revalidation on stat() calls since we do not update the mtime and ctime fields. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-27openrisc: use generic strnlen_user() functionJonas Bonn3-75/+3
The generic version is both easier to support and more correct. Signed-off-by: Jonas Bonn <jonas@southpole.se> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-27powerpc: Use the new generic strncpy_from_user() and strnlen_user()Paul Mackerras5-83/+48
This is much the same as for SPARC except that we can do the find_zero() function more efficiently using the count-leading-zeroes instructions. Tested on 32-bit and 64-bit PowerPC. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-27lib: Fix generic strnlen_user for 32-bit big-endian machinesPaul Mackerras1-1/+1
The aligned_byte_mask() definition is wrong for 32-bit big-endian machines: the "7-(n)" part of the definition assumes a long is 8 bytes. This fixes it by using BITS_PER_LONG - 8 instead of 8*7. Tested on 32-bit and 64-bit PowerPC. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
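The arithmetic can be checked with a userspace model of a 32-bit big-endian word. The helper names and the EX_BITS_PER_WORD constant are hypothetical; the two shift expressions are the ones described in the commit message. The old count 8*(7-n) is at least 32 for every n from 0 to 3, which is out of range for a 32-bit shift, so the sketch only computes it as a number and never shifts by it.

```c
#include <assert.h>
#include <stdint.h>

#define EX_BITS_PER_WORD 32	/* models a machine where long is 4 bytes */

/* Shift count of the old definition; only valid for 8-byte words. */
static int old_shift(int n)
{
	return 8 * (7 - n);
}

/* Shift count after the fix: BITS_PER_LONG - 8 replaces 8*7. */
static int new_shift(int n)
{
	return EX_BITS_PER_WORD - 8 - 8 * n;
}

/* Big-endian mask covering the first n bytes (in memory order) of a
 * 32-bit word, for 0 <= n <= 3. */
static uint32_t be_byte_mask32(int n)
{
	assert(n >= 0 && n <= 3);
	return ~UINT32_C(0xff) << new_shift(n);
}
```

For n = 1 the fixed expression shifts by 16 and selects the most significant byte, which is the first byte in memory on a big-endian machine.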
2012-05-27NFSv4.1: nfs4_reset_session should use nfs4_handle_reclaim_lease_errorTrond Myklebust1-1/+1
The results from a call to nfs4_proc_create_session() should always be fed into nfs4_handle_reclaim_lease_error, so that we can handle errors such as NFS4ERR_SEQ_MISORDERED correctly. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-27NFSv4.1: Handle other occurrences of NFS4ERR_CONN_NOT_BOUND_TO_SESSIONTrond Myklebust4-9/+15
Let nfs4_schedule_session_recovery() handle the details of choosing between resetting the session, and other session related recovery. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-27NFSv4.1: Handle NFS4ERR_CONN_NOT_BOUND_TO_SESSION in the state managerTrond Myklebust1-1/+3
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-27NFSv4.1: Handle errors in nfs4_bind_conn_to_sessionTrond Myklebust1-1/+12
Ensure that we handle NFS4ERR_DELAY errors separately, and then let nfs4_recovery_handle_error() handle all other cases. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-27NFSv4.1: nfs4_bind_conn_to_session should drain the sessionTrond Myklebust1-0/+2
Drain the session in order to avoid races with other RPC calls that end up setting the NFS4CLNT_BIND_CONN_TO_SESSION flag. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-05-26Merge branch 'generic-string-functions'Linus Torvalds23-490/+259
This makes <asm/word-at-a-time.h> actually live up to its promise of allowing architectures to help tune the string functions that do their work a word at a time. David had already taken the x86 strncpy_from_user() function, modified it to work on sparc, and then done the extra work to make it generically useful. This then expands on that work by making x86 use that generic version, completing the circle. But more importantly, it fixes up the word-at-a-time interfaces so that it's now easy to also support things like strnlen_user(), and pretty much most random string functions. David reports that it all works fine on sparc, and Jonas Bonn reported that an earlier version of this worked on OpenRISC too. It's pretty easy for architectures to add support for this and just replace their private versions with the generic code. * generic-string-functions: sparc: use the new generic strnlen_user() function x86: use the new generic strnlen_user() function lib: add generic strnlen_user() function word-at-a-time: make the interfaces truly generic x86: use generic strncpy_from_user routine
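For readers unfamiliar with the approach, here is a userspace sketch of doing string work a word at a time. It uses the classic zero-byte bit trick rather than the kernel's <asm/word-at-a-time.h> interfaces, and every name in it is hypothetical. It assumes the caller guarantees that max bytes starting at s are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_ONES  UINT64_C(0x0101010101010101)
#define EX_HIGHS UINT64_C(0x8080808080808080)

/* Non-zero iff at least one of the eight bytes in v is zero. */
static int word_has_zero(uint64_t v)
{
	return ((v - EX_ONES) & ~v & EX_HIGHS) != 0;
}

/* Bounded string length: skip whole words that contain no NUL, then
 * finish byte by byte. Returns max if no NUL is found within max bytes. */
static size_t wordwise_strnlen(const char *s, size_t max)
{
	size_t i = 0;

	assert(s != NULL);
	while (max - i >= sizeof(uint64_t)) {
		uint64_t w;

		memcpy(&w, s + i, sizeof(w));	/* safe for unaligned s */
		if (word_has_zero(w))
			break;
		i += sizeof(w);
	}
	while (i < max && s[i] != '\0')
		i++;
	return i;
}
```

Finishing with a byte loop keeps the sketch independent of endianness; locating the zero byte directly inside the word is the part that an architecture-specific helper can speed up.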
2012-05-26builddeb: include autogenerated header filesLekensteyn1-1/+1
After 303395ac3bf3e2cb488435537d416bc840438fcb, some headers are autogenerated. Include these autogenerated headers (mainly unistd_32_ia32.h) in out-of-tree builds to allow DKMS modules to be built successfully. Signed-off-by: Peter Lekensteyn <lekensteyn@gmail.com> Signed-off-by: Michal Marek <mmarek@suse.cz>
2012-05-26Merge branch 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linuxLinus Torvalds37-498/+483
Pull i2c-embedded changes from Wolfram Sang: "Major changes: - lots of devicetree additions for existing drivers. I tried hard to make sure the bindings are proper. In more complicated cases, I requested acks from people having more experience with them than me. That took a bit of extra time and also some time went into discussions with developers about what bindings are and what not. I have the feeling that the workflow with bindings should be improved to scale better. I will spend some more thought on this... - i2c-muxes are successfully used meanwhile, so we dropped EXPERIMENTAL for them and renamed the drivers to a standard pattern to match the rest of the subsystem. They can also be used with devicetree now. - ixp2000 was removed since the whole platform goes away. - cleanups (strlcpy instead of strcpy, NULL instead of 0) - The rest is typical driver fixes I assume. All patches have been in linux-next at least since v3.4-rc6." Fixed up trivial conflict in arch/arm/mach-lpc32xx/common.c due to the same patch already having come in through the arm/soc trees, with additional patches on top of it. * 'i2c-embedded/for-next' of git://git.pengutronix.de/git/wsa/linux: (35 commits) i2c: davinci: Free requested IRQ in remove i2c: ocores: register OF i2c devices i2c: tegra: notify transfer-complete after clearing status. I2C: xiic: Add OF binding support i2c: Rename last mux driver to standard pattern i2c: tegra: fix 10bit address configuration i2c: muxes: rename first set of drivers to a standard pattern of/i2c: implement of_find_i2c_adapter_by_node i2c: implement i2c_verify_adapter i2c-s3c2410: Add HDMIPHY quirk for S3C2440 i2c-s3c2410: Rework device type handling i2c: muxes are not EXPERIMENTAL anymore i2c/of: Automatically populate i2c mux busses from device tree data. 
i2c: Add a struct device * parameter to i2c_add_mux_adapter() of/i2c: call i2c_verify_client from of_find_i2c_device_by_node i2c: designware: Add clk_{un}prepare() support i2c: designware: add PM support i2c: ixp2000: remove driver i2c: pnx: add device tree support i2c: imx: don't use strcpy but strlcpy ...
2012-05-26Merge tag 'cleanup-initcall' of ↵Linus Torvalds178-106/+614
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull sweeping late_initcall cleanup for arm-soc from Olof Johansson: "This is a patch series from Shawn Guo that moves from individual late_initcalls() to using a member in the machine structure to invoke a platform's late initcalls. This cleanup is a step in the move towards multiplatform kernels since it would reduce the need to check for compatible platforms in each and every initcall." Fix up trivial conflicts in arch/arm/mach-{exynos/mach-universal_c210.c, imx/mach-cpuimx51.c, omap2/board-generic.c} due to changes nearby (and, in the case of cpuimx51.c the board support being deleted) * tag 'cleanup-initcall' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: ARM: ux500: use machine specific hook for late init ARM: tegra: use machine specific hook for late init ARM: shmobile: use machine specific hook for late init ARM: sa1100: use machine specific hook for late init ARM: s3c64xx: use machine specific hook for late init ARM: prima2: use machine specific hook for late init ARM: pnx4008: use machine specific hook for late init ARM: omap2: use machine specific hook for late init ARM: omap1: use machine specific hook for late init ARM: msm: use machine specific hook for late init ARM: imx: use machine specific hook for late init ARM: exynos: use machine specific hook for late init ARM: ep93xx: use machine specific hook for late init ARM: davinci: use machine specific hook for late init ARM: provide a late_initcall hook for platform initialization
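The structural idea can be sketched in plain C. Everything below is hypothetical: the struct layout, the init_late field name and the helper functions are illustrative stand-ins, not the ARM machine descriptor. The point is that one generic entry point invokes the hook of the machine that actually booted, so individual platforms no longer need their own compatibility checks inside separate initcalls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, pared-down machine descriptor. */
struct example_machine {
	const char *name;
	void (*init_late)(void);	/* optional per-platform hook */
};

static int example_late_calls;	/* counts hook invocations for the demo */

static void board_a_late(void)
{
	example_late_calls++;
}

/* Single generic late-init entry point: runs only the hook that belongs
 * to the machine passed in, and tolerates machines without one. */
static void run_late_init(const struct example_machine *m)
{
	assert(m != NULL);
	if (m->init_late)
		m->init_late();
}
```

A machine without late work simply leaves the member NULL, which the entry point treats as "nothing to do".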
2012-05-26Merge tag 'soc2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-socLinus Torvalds81-303/+9648
Pull arm-soc: soc specific changes (part 2) from Olof Johansson: "This adds support for the spear13xx platform, which has first been under review a long time ago and finally been completed after generic spear work has gone into the clock, dt and pinctrl branches. Also a number of updates for the samsung socs are part of this branch." Fix up trivial conflicts in drivers/gpio/gpio-samsung.c that look much worse than they are: the exynos5 init code was refactored in commit fd454997d687 ("gpio: samsung: refactor gpiolib init for exynos4/5"), and then commit f10590c9836c ("ARM: EXYNOS: add GPC4 bank instance") added a new gpio chip define and did tiny updates to the init code. So the conflict diff looks like hell, but it's actually a fairly simple change. * tag 'soc2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (34 commits) ARM: exynos: fix building with CONFIG_OF disabled ARM: EXYNOS: Add AUXDATA for i2c controllers ARM: dts: Update device tree source files for EXYNOS5250 ARM: EXYNOS: Add device tree support for interrupt combiner ARM: EXYNOS: Add irq_domain support for interrupt combiner ARM: EXYNOS: Remove a new bus_type instance for EXYNOS5 ARM: EXYNOS: update irqs for EXYNOS5250 SoC ARM: EXYNOS: Add pre-divider and fout mux clocks for bpll and mpll ARM: EXYNOS: add GPC4 bank instance ARM: EXYNOS: Redefine IRQ_MCT_L0,1 definition ARM: EXYNOS: Modify the GIC physical address for static io-mapping ARM: EXYNOS: Add watchdog timer clock instance pinctrl: SPEAr1310: Fix pin numbers for clcd_high_res SPEAr: Update MAINTAINERS and Documentation SPEAr13xx: Add defconfig SPEAr13xx: Add compilation support SPEAr13xx: Add dts and dtsi files pinctrl: Add SPEAr13xx pinctrl drivers pinctrl: SPEAr: Create macro for declaring GPIO PINS SPEAr13xx: Add common clock framework support ...
2012-05-26Merge tag 'dt2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-socLinus Torvalds63-987/+2972
Pull arm-soc device tree conversions (part 2) from Olof Johansson: "These continue the device tree work from part 1, this set is for the tegra, mxs and imx platforms, all of which have dependencies on clock or pinctrl changes submitted earlier." Fix up trivial conflicts due to nearby changes in drivers/{gpio/gpio,i2c/busses/i2c}-mxs.c * tag 'dt2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (73 commits) ARM: dt: tegra: invert status=disable vs status=okay ARM: dt: tegra: consistent basic property ordering ARM: dt: tegra: sort nodes based on bus order ARM: dt: tegra: remove duplicate device_type property ARM: dt: tegra: consistenly use lower-case for hex constants ARM: dt: tegra: format regs properties consistently ARM: dt: tegra: gpio comment cleanup ARM: dt: tegra: remove unnecessary unit addresses ARM: dt: tegra: whitespace cleanup ARM: dt: tegra cardhu: fix typo in SDHCI node name ARM: dt: tegra: cardhu: register core regulator tps62361 ARM: dt: tegra30.dtsi: Add SMMU node ARM: dt: tegra20.dtsi: Add GART node ARM: dt: tegra30.dtsi: Add Memory Controller(MC) nodes ARM: dt: tegra20.dtsi: Add Memory Controller(MC) nodes ARM: dt: tegra: Add device tree support for AHB ARM: dts: enable audio support for imx28-evk ARM: dts: enable i2c device for imx28-evk i2c: mxs: add device tree probe support ARM: dts: enable mmc for imx28-evk ...
2012-05-26Merge tag 'stmp-dev' of ↵Linus Torvalds6-7/+108
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc Pull arm-soc stmp-dev library code from Olof Johansson: "A number of devices are using a common register layout, this adds support code for it in lib/stmp_device.c so we do not need to duplicate it in each driver." Fix up trivial conflicts in drivers/i2c/busses/i2c-mxs.c and lib/Makefile * tag 'stmp-dev' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: i2c: mxs: use global reset function lib: add support for stmp-style devices