summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2010-03-06mm: add comment on swap_duplicate's error codeHugh Dickins1-1/+5
swap_duplicate()'s loop appears to miss out on returning the error code from __swap_duplicate(), except when that's -ENOMEM. In fact this is intentional: prior to -ENOMEM for swap_count_continuation, swap_duplicate() was void (and the case only occurs when copy_one_pte() hits a corrupt pte). But that's surprising behaviour, which certainly deserves a comment. Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06nommu: get_user_pages(): pin last page on non-page-aligned startSteven J. Magnani1-2/+2
The noMMU version of get_user_pages() fails to pin the last page when the start address isn't page-aligned. The patch fixes this in a way that makes find_extend_vma() congruent to its MMU cousin. Signed-off-by: Steven J. Magnani <steve@digidescorp.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06vmscan: detect mapped file pages used only onceJohannes Weiner2-13/+35
The VM currently assumes that an inactive, mapped and referenced file page is in use and promotes it to the active list. However, every mapped file page starts out like this and thus a problem arises when workloads create a stream of such pages that are used only for a short time. By flooding the active list with those pages, the VM quickly gets into trouble finding eligible reclaim canditates. The result is long allocation latencies and eviction of the wrong pages. This patch reuses the PG_referenced page flag (used for unmapped file pages) to implement a usage detection that scales with the speed of LRU list cycling (i.e. memory pressure). If the scanner encounters those pages, the flag is set and the page cycled again on the inactive list. Only if it returns with another page table reference it is activated. Otherwise it is reclaimed as 'not recently used cache'. This effectively changes the minimum lifetime of a used-once mapped file page from a full memory cycle to an inactive list cycle, which allows it to occur in linear streams without affecting the stable working set of the system. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06vmscan: drop page_mapping_inuse()Johannes Weiner1-23/+2
page_mapping_inuse() is a historic predicate function for pages that are about to be reclaimed or deactivated. According to it, a page is in use when it is mapped into page tables OR part of swap cache OR backing an mmapped file. This function is used in combination with page_referenced(), which checks for young bits in ptes and the page descriptor itself for the PG_referenced bit. Thus, checking for unmapped swap cache pages is meaningless as PG_referenced is not set for anonymous pages and unmapped pages do not have young ptes. The test makes no difference. Protecting file pages that are not by themselves mapped but are part of a mapped file is also a historic leftover for short-lived things like the exec() code in libc. However, the VM now does reference accounting and activation of pages at unmap time and thus the special treatment on reclaim is obsolete. This patch drops page_mapping_inuse() and switches the two callsites to use page_mapped() directly. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06vmscan: factor out page reference checksJohannes Weiner1-13/+43
The used-once mapped file page detection patchset. It is meant to help workloads with large amounts of shortly used file mappings, like rtorrent hashing a file or git when dealing with loose objects (git gc on a bigger site?). Right now, the VM activates referenced mapped file pages on first encounter on the inactive list and it takes a full memory cycle to reclaim them again. When those pages dominate memory, the system no longer has a meaningful notion of 'working set' and is required to give up the active list to make reclaim progress. Obviously, this results in rather bad scanning latencies and the wrong pages being reclaimed. This patch makes the VM be more careful about activating mapped file pages in the first place. The minimum granted lifetime without another memory access becomes an inactive list cycle instead of the full memory cycle, which is more natural given the mentioned loads. This test resembles a hashing rtorrent process. Sequentially, 32MB chunks of a file are mapped into memory, hashed (sha1) and unmapped again. While this happens, every 5 seconds a process is launched and its execution time taken: python2.4 -c 'import pydoc' old: max=2.31s mean=1.26s (0.34) new: max=1.25s mean=0.32s (0.32) find /etc -type f old: max=2.52s mean=1.44s (0.43) new: max=1.92s mean=0.12s (0.17) vim -c ':quit' old: max=6.14s mean=4.03s (0.49) new: max=3.48s mean=2.41s (0.25) mplayer --help old: max=8.08s mean=5.74s (1.02) new: max=3.79s mean=1.32s (0.81) overall hash time (stdev): old: time=1192.30 (12.85) thruput=25.78mb/s (0.27) new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%) I also tested kernbench with regular IO streaming in the background to see whether the delayed activation of frequently used mapped file pages had a negative impact on performance in the presence of pressure on the inactive list. The patch made no significant difference in timing, neither for kernbench nor for the streaming IO throughput. The first patch submission raised concerns about the cost of the extra faults for actually activated pages on machines that have no hardware support for young page table entries. I created an artificial worst case scenario on an ARM machine with around 300MHz and 64MB of memory to figure out the dimensions involved. The test would mmap a file of 20MB, then 1. touch all its pages to fault them in 2. force one full scan cycle on the inactive file LRU -- old: mapping pages activated -- new: mapping pages inactive 3. touch the mapping pages again -- old and new: fault exceptions to set the young bits 4. force another full scan cycle on the inactive file LRU 5. touch the mapping pages one last time -- new: fault exceptions to set the young bits The test showed an overall increase of 6% in time over 100 iterations of the above (old: ~212sec, new: ~225sec). 13 secs total overhead / (100 * 5k pages), ignoring the execution time of the test itself, makes for about 25us overhead for every page that gets actually activated. Note: 1. File mapping the size of one third of main memory, _completely_ in active use across memory pressure - i.e., most pages referenced within one LRU cycle. This should be rare to non-existant, especially on such embedded setups. 2. Many huge activation batches. Those batches only occur when the working set fluctuates. If it changes completely between every full LRU cycle, you have problematic reclaim overhead anyway. 3. Access of activated pages at maximum speed: sequential loads from every single page without doing anything in between. In reality, the extra faults will get distributed between actual operations on the data. So even if a workload manages to get the VM into the situation of activating a third of memory in one go on such a setup, it will take 2.2 seconds instead 2.1 without the patch. Comparing the numbers (and my user-experience over several months), I think this change is an overall improvement to the VM. Patch 1 is only refactoring to break up that ugly compound conditional in shrink_page_list() and make it easy to document and add new checks in a readable fashion. Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not strictly related to #3, but it was in the original submission and is a net simplification, so I kept it. Patch 3 implements used-once detection of mapped file pages. This patch: Moving the big conditional into its own predicate function makes the code a bit easier to read and allows for better commenting on the checks one-by-one. This is just cleaning up, no semantics should have been changed. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: suppress pfn range output for zones without pagesDavid Rientjes1-2/+6
free_area_init_nodes() emits pfn ranges for all zones on the system. There may be no pages on a higher zone, however, due to memory limitations or the use of the mem= kernel parameter. For example: Zone PFN ranges: DMA 0x00000001 -> 0x00001000 DMA32 0x00001000 -> 0x00100000 Normal 0x00100000 -> 0x00100000 The implementation copies the previous zone's highest pfn, if any, as the next zone's lowest pfn. If its highest pfn is then greater than the amount of addressable memory, the upper memory limit is used instead. Thus, both the lowest and highest possible pfn for higher zones without memory may be the same. The pfn range for zones without memory is now shown as "empty" instead. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/pm: force GFP_NOIO during suspend/hibernation and resumeRafael J. Wysocki1-0/+25
There are quite a few GFP_KERNEL memory allocations made during suspend/hibernation and resume that may cause the system to hang, because the I/O operations they depend on cannot be completed due to the underlying devices being suspended. Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in gfp_allowed_mask before suspend/hibernation and restoring the original values of these bits in gfp_allowed_mask durig the subsequent resume. [akpm@linux-foundation.org: fix CONFIG_PM=n linkage] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/swapfile.c: fix swapon size off-by-oneHugh Dickins1-13/+18
There's an off-by-one disagreement between mkswap and swapon about the meaning of swap_header last_page: mkswap (in all versions I've looked at: util-linux-ng and BusyBox and old util-linux; probably as far back as 1999) consistently means the offset (in page units) of the last page of the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3) strangely takes it to mean the size (in page units) of the swap area. This disagreement is the safe way round; but it's worrying people, and loses us one page of swap. The fix is not just to add one to nr_good_pages: we need to get maxpages (the size of the swap_map array) right before that; and though that is an unsigned long, be careful not to overflow the unsigned int p->max which later holds it (probably why header uses __u32 last_page instead of size). Why did we subtract one from the maximum swp_offset to calculate maxpages? Though it was probably me who made that change in 2.4.10, I don't get it: and now we should be adding one (without risk of overflow in this case). Fix the handling of swap_header badpages: it could have overrun the swap_map when very large swap area used on a more limited architecture. Remove pre-initializations of swap_header, nr_good_pages and maxpages: those date from when sys_swapon was supporting other versions of header. Reported-by: Nitin Gupta <ngupta@vflare.org> Reported-by: Jarkko Lavinen <jarkko.lavinen@nokia.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: remove VM_LOCK_RMAP codeRik van Riel2-27/+0
When a VMA is in an inconsistent state during setup or teardown, the worst that can happen is that the rmap code will not be able to find the page. The mapping is in the process of being torn down (PTEs just got invalidated by munmap), or set up (no PTEs have been instantiated yet). It is also impossible for the rmap code to follow a pointer to an already freed VMA, because the rmap code holds the anon_vma->lock, which the VMA teardown code needs to take before the VMA is removed from the anon_vma chain. Hence, we should not need the VM_LOCK_RMAP locking at all. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06rmap: move exclusively owned pages to own anon_vma in do_wp_page()Rik van Riel2-0/+31
When the parent process breaks the COW on a page, both the original which is mapped at child and the new page which is mapped parent end up in that same anon_vma. Generally this won't be a problem, but for some workloads it could preserve the O(N) rmap scanning complexity. A simple fix is to ensure that, when a page which is mapped child gets reused in do_wp_page, because we already are the exclusive owner, the page gets moved to our own exclusive child's anon_vma. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06rmap: remove obsolete check from __page_check_anon_rmap()Rik van Riel1-3/+0
When an anonymous page is inherited from a parent process, the vma->anon_vma can differ from the page anon_vma. This can trip up __page_check_anon_rmap, which is indirectly called from do_swap_page(). Remove that obsolete check to prevent an oops. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: change anon_vma linking to fix multi-process server scalability issueRik van Riel7-76/+248
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/memcontrol.c: fix "integer as NULL pointer" sparse warningThiago Farina1-1/+1
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer Signed-off-by: Thiago Farina <tfransosi@gmail.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOMWu Fengguang2-1/+15
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM. POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance: a 16K read will be carried out in 4 _sync_ 1-page reads. In other places, ra_pages==0 means - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs - some IO error happened where multi-page read IO won't help or should be avoided. POSIX_FADV_RANDOM actually want a different semantics: to disable the *heuristic* readahead algorithm, and to use a dumb one which faithfully submit read IO for whatever application requests. So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM. Note that the random hint is not likely to help random reads performance noticeably. And it may be too permissive on huge request size (its IO size is not limited by read_ahead_kb). In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall (NFS read) performance of the application increased by 313%! Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: <stable@kernel.org> [2.6.33.x] Cc: <qbarnes+nfs@yahoo-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/migrate.c: kill anon local variable from migrate_page_copyKOSAKI Motohiro1-4/+0
commit 01b1ae63c2 ("memcg: simple migration handling") removed mem_cgroup_uncharge_cache_page() call from migrate_page_copy. Local variable `anon' is now unused. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/mempolicy.c: fix indentation of the comments of do_migrate_pagesKOSAKI Motohiro1-30/+30
Currently, do_migrate_pages() have very long comment and this is not indent properly. I often misunderstand it is function starting commnents and confused it. this patch fixes it. note: this patch doesn't break 80 column rule. I guess original author intended this indentaion, but an accident corrupted it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06memory-hotplug: create /sys/firmware/memmap entry for new memoryakpm@linux-foundation.org1-0/+4
A memmap is a directory in sysfs which includes 3 text files: start, end and type. For example: start: 0x100000 end: 0x7e7b1cff type: System RAM Interface firmware_map_add was not called explicitly. Remove it and add function firmware_map_add_hotplug as hotplug interface of memmap. Each memory entry has a memmap in sysfs, When we hot-add new memory, sysfs does not export memmap entry for it. We add a call in function add_memory to function firmware_map_add_hotplug. Add a new function add_sysfs_fw_map_entry() to create memmap entry, it will be called when initialize memmap and hot-add memory. [akpm@linux-foundation.org: un-kernedoc a no longer kerneldoc comment] Signed-off-by: Shaohui Zheng <shaohui.zheng@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: fix mbind vma merge problemKOSAKI Motohiro1-13/+39
Strangely, current mbind() doesn't merge vma with neighbor vma although it's possible. Unfortunately, many vma can reduce performance... This patch fixes it. reproduced program ---------------------------------------------------------------- #include <numaif.h> #include <numa.h> #include <sys/mman.h> #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <string.h> static unsigned long pagesize; int main(int argc, char** argv) { void* addr; int ch; int node; struct bitmask *nmask = numa_allocate_nodemask(); int err; int node_set = 0; char buf[128]; while ((ch = getopt(argc, argv, "n:")) != -1){ switch (ch){ case 'n': node = strtol(optarg, NULL, 0); numa_bitmask_setbit(nmask, node); node_set = 1; break; default: ; } } argc -= optind; argv += optind; if (!node_set) numa_bitmask_setbit(nmask, 0); pagesize = getpagesize(); addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, 0, 0); if (addr == MAP_FAILED) perror("mmap "), exit(1); fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr); /* make page populate */ memset(addr, 0, pagesize*3); /* first mbind */ err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp, nmask->size, MPOL_MF_MOVE_ALL); if (err) error("mbind1 "); /* second mbind */ err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0); if (err) error("mbind2 "); sprintf(buf, "cat /proc/%d/maps", getpid()); system(buf); return 0; } ---------------------------------------------------------------- result without this patch addr = 0x7fe26ef09000 [snip] 7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0 7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0 7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0 7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0 => 0x7fe26ef09000-0x7fe26ef0c000 have three vmas. result with this patch addr = 0x7fc9ebc76000 [snip] 7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0 7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack] => 0x7fc9ebc76000-0x7fc9ebc7a000 have only one vma. [minchan.kim@gmail.com: fix file offset passed to vma_merge()] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: restore zone->all_unreclaimable to independence wordKOSAKI Motohiro3-17/+13
commit e815af95 ("change all_unreclaimable zone member to flags") changed all_unreclaimable member to bit flag. But it had an undesireble side effect. free_one_page() is one of most hot path in linux kernel and increasing atomic ops in it can reduce kernel performance a bit. Thus, this patch revert such commit partially. at least all_unreclaimable shouldn't share memory word with other zone flags. [akpm@linux-foundation.org: fix patch interaction] Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Huang Shijie <shijie8@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: remove free_hot_page()Li Hong2-8/+4
free_hot_page() is just a wrapper around free_hot_cold_page() with parameter 'cold = 0'. After adding a clear comment for free_hot_cold_page(), it is reasonable to remove a level of call. [akpm@linux-foundation.org: fix build] Signed-off-by: Li Hong <lihong.hi@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Americo Wang <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/page_alloc.c: adjust a call site to trace_mm_page_free_directLi Hong1-1/+1
Move a call of trace_mm_page_free_direct() from free_hot_page() to free_hot_cold_page(). It is clearer and close to kmemcheck_free_shadow(), as it is done in function __free_pages_ok(). Signed-off-by: Li Hong <lihong.hi@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm/page_alloc.c: remove duplicate call to trace_mm_page_free_directLi Hong1-1/+1
trace_mm_page_free_direct() is called in function __free_pages(). But it is called again in free_hot_page() if order == 0 and produce duplicate records in trace file for mm_page_free_direct event. As below: K-PID CPU# TIMESTAMP FUNCTION gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0 This patch removes the first call and adds a call to trace_mm_page_free_direct() in __free_pages_ok(). Signed-off-by: Li Hong <lihong.hi@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm, lockdep: annotate reclaim context to zone reclaim tooKOSAKI Motohiro1-0/+2
Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim context annotation. But it didn't annotate zone reclaim. This patch do it. The point is, commit cf40bd16fd annotate __alloc_pages_direct_reclaim but zone-reclaim doesn't use __alloc_pages_direct_reclaim. current call graph is __alloc_pages_nodemask get_page_from_freelist zone_reclaim() __alloc_pages_slowpath __alloc_pages_direct_reclaim try_to_free_pages Actually, if zone_reclaim_mode=1, VM never call __alloc_pages_direct_reclaim in usual VM pressure. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06vmscan: get_scan_ratio() cleanupKOSAKI Motohiro1-9/+14
The get_scan_ratio() should have all scan-ratio related calculations. Thus, this patch move some calculation into get_scan_ratio. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06vmscan: check high watermark after shrink zoneMinchan Kim1-10/+12
Kswapd checks that zone has sufficient pages free via zone_watermark_ok(). If any zone doesn't have enough pages, we set all_zones_ok to zero. !all_zone_ok makes kswapd retry rather than sleeping. I think the watermark check before shrink_zone() is pointless. Only after kswapd has tried to shrink the zone is the check meaningful. Move the check to after the call to shrink_zone(). [akpm@linux-foundation.org: fix comment, layout] Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: use rlimit helpersJiri Slaby4-14/+15
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in 3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: mlock_vma_pages_range() only return success or failureKOSAKI Motohiro1-2/+2
Currently, mlock_vma_pages_range() only return len or 0. then current error handling of mmap_region() is meaningless complex. This patch makes simplify and makes consist with brk() code. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: mlock_vma_pages_range() never return negative valueKOSAKI Motohiro1-9/+2
Currently, mlock_vma_pages_range() never return negative value. Then, we can remove some worthless error check. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: count swap usageKAMEZAWA Hiroyuki3-4/+14
A frequent questions from users about memory management is what numbers of swap ents are user for processes. And this information will give some hints to oom-killer. Besides we can count the number of swapents per a process by scanning /proc/<pid>/smaps, this is very slow and not good for usual process information handler which works like 'ps' or 'top'. (ps or top is now enough slow..) This patch adds a counter of swapents to mm_counter and update is at each swap events. Information is exported via /proc/<pid>/status file as [kamezawa@bluextal memory]$ cat /proc/self/status Name: cat State: R (running) Tgid: 2910 Pid: 2910 PPid: 2823 TracerPid: 0 Uid: 500 500 500 500 Gid: 500 500 500 500 FDSize: 256 Groups: 500 VmPeak: 82696 kB VmSize: 82696 kB VmLck: 0 kB VmHWM: 432 kB VmRSS: 432 kB VmData: 172 kB VmStk: 84 kB VmExe: 48 kB VmLib: 1568 kB VmPTE: 40 kB VmSwap: 0 kB <=============== this. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: avoid false sharing of mm_counterKAMEZAWA Hiroyuki1-8/+86
Considering the nature of per mm stats, it's the shared object among threads and can be a cache-miss point in the page fault path. This patch adds per-thread cache for mm_counter. RSS value will be counted into a struct in task_struct and synchronized with mm's one at events. Now, in this patch, the event is the number of calls to handle_mm_fault. Per-thread value is added to mm at each 64 calls. rough estimation with small benchmark on parallel thread (2threads) shows [before] 4.5 cache-miss/faults [after] 4.0 cache-miss/faults Anyway, the most contended object is mmap_sem if the number of threads grows. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: clean up mm_counterKAMEZAWA Hiroyuki6-32/+44
Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05Merge branch 'slab-for-linus' of ↵Linus Torvalds3-243/+125
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: SLUB: Fix per-cpu merge conflict failslab: add ability to filter slab caches slab: fix regression in touched logic dma kmalloc handling fixes slub: remove impossible condition slab: initialize unused alien cache entry as NULL at alloc_alien_cache(). SLUB: Make slub statistics use this_cpu_inc SLUB: this_cpu: Remove slub kmem_cache fields SLUB: Get rid of dynamic DMA kmalloc cache allocation SLUB: Use this_cpu operations in slub
2010-03-04Merge branch 'for-linus' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) init: Open /dev/console from rootfs mqueue: fix typo "failues" -> "failures" mqueue: only set error codes if they are really necessary mqueue: simplify do_open() error handling mqueue: apply mathematics distributivity on mq_bytes calculation mqueue: remove unneeded info->messages initialization mqueue: fix mq_open() file descriptor leak on user-space processes fix race in d_splice_alias() set S_DEAD on unlink() and non-directory rename() victims vfs: add NOFOLLOW flag to umount(2) get rid of ->mnt_parent in tomoyo/realpath hppfs can use existing proc_mnt, no need for do_kern_mount() in there Mirror MS_KERNMOUNT in ->mnt_flags get rid of useless vfsmount_lock use in put_mnt_ns() Take vfsmount_lock to fs/internal.h get rid of insanity with namespace roots in tomoyo take check for new events in namespace (guts of mounts_poll()) to namespace.c Don't mess with generic_permission() under ->d_lock in hpfs sanitize const/signedness for udf nilfs: sanitize const/signedness in dealing with ->d_name.name ... Fix up fairly trivial (famous last words...) conflicts in drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-04SLUB: Fix per-cpu merge conflictStephen Rothwell1-1/+1
The slab tree adds a percpu variable usage case (commit 9dfc6e68bfe6ee452efb1a4e9ca26a9007f2b864 "SLUB: Use this_cpu operations in slub"), but the percpu tree removes the prefixing of percpu variables (commit dd17c8f72993f9461e9c19250e3f155d6d99df22 "percpu: remove per_cpu__ prefix"), thus causing the following compilation error: CC mm/slub.o mm/slub.c: In function ‘alloc_kmem_cache_cpus’: mm/slub.c:2078: error: implicit declaration of function ‘per_cpu_var’ mm/slub.c:2078: warning: assignment makes pointer from integer without a cast make[1]: *** [mm/slub.o] Error 1 Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-03-04Merge branches 'slab/cleanups', 'slab/failslab', 'slab/fixes' and ↵Pekka Enberg3-243/+125
'slub/percpu' into slab-for-linus
2010-03-03kill unused invalidate_inode_pages helperChristoph Hellwig1-1/+1
No one is calling this anymore as everyone has switched to invalidate_mapping_pages long time ago. Also update a few references to it in comments. nfs has two more, but I can't easily figure what they are actually referring to, so I left them as-is. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03Merge branch 'x86-bootmem-for-linus' of ↵Linus Torvalds5-24/+508
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) early_res: Need to save the allocation name in drop_range_partial() sparsemem: Fix compilation on PowerPC early_res: Add free_early_partial() x86: Fix non-bootmem compilation on PowerPC core: Move early_res from arch/x86 to kernel/ x86: Add find_fw_memmap_area Move round_up/down to kernel.h x86: Make 32bit support NO_BOOTMEM early_res: Enhance check_and_double_early_res x86: Move back find_e820_area to e820.c x86: Add find_early_area_size x86: Separate early_res related code from e820.c x86: Move bios page reserve early to head32/64.c sparsemem: Put mem map for one node together. sparsemem: Put usemap for one node together x86: Make 64 bit use early_res instead of bootmem before slab x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA x86: Make early_node_mem get mem > 4 GB if possible x86: Dynamically increase early_res array size x86: Introduce max_early_res and early_res_count ...
2010-03-03Merge branch 'for-linus' of ↵Linus Torvalds3-155/+98
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: add __percpu sparse annotations to what's left percpu: add __percpu sparse annotations to fs percpu: add __percpu sparse annotations to core kernel subsystems local_t: Remove leftover local.h this_cpu: Remove pageset_notifier this_cpu: Page allocator conversion percpu, x86: Generic inc / dec percpu instructions local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c module: Use this_cpu_xx to dynamically allocate counters local_t: Remove cpu_local_xx macros percpu: refactor the code in pcpu_[de]populate_chunk() percpu: remove compile warnings caused by __verify_pcpu_ptr() percpu: make accessors check for percpu pointer in sparse percpu: add __percpu for sparse. percpu: make access macros universal percpu: remove per_cpu__ prefix.
2010-03-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6Linus Torvalds1-0/+3
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits) virtio_net: remove forgotten assignment be2net: fix tx completion polling sis190: fix cable detect via link status poll net: fix protocol sk_buff field bridge: Fix build error when IGMP_SNOOPING is not enabled bnx2x: Tx barriers and locks scm: Only support SCM_RIGHTS on unix domain sockets. vhost-net: restart tx poll on sk_sndbuf full vhost: fix get_user_pages_fast error handling vhost: initialize log eventfd context pointer vhost: logging thinko fix wireless: convert to use netdev_for_each_mc_addr ethtool: do not set some flags, if others failed ipoib: returned back addrlen check for mc addresses netlink: Adding inode field to /proc/net/netlink axnet_cs: add new id bridge: Make IGMP snooping depend upon BRIDGE. bridge: Add multicast count/interval sysfs entries bridge: Add hash elasticity/max sysfs entries bridge: Add multicast_snooping sysfs toggle ... Trivial conflicts in Documentation/feature-removal-schedule.txt
2010-03-01sparsemem: Fix compilation on PowerPCYinghai Lu1-4/+7
Stephen reported: build (powerpc ppc64_defconfig) produced these warnings: mm/sparse.c: In function 'sparse_init': mm/sparse.c:488: warning: unused variable 'map_count' mm/sparse.c:484: warning: unused variable 'size2' mm/sparse.c:481: warning: unused variable 'map_map' mm/sparse.c: At top level: mm/sparse.c:442: warning: 'sparse_early_mem_maps_alloc_node' defined but not used Introduced by commit 9bdac914240759457175ac0d6529a37d2820bc4d ("sparsemem: Put mem map for one node together"). Conditionalize the bits appropriately based on the setting of CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Tested-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4B895682.1080706@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-03-01Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds3-10/+10
* 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm: (100 commits) ARM: Eliminate decompressor -Dstatic= PIC hack ARM: 5958/1: ARM: U300: fix inverted clk round rate ARM: 5956/1: misplaced parentheses ARM: 5955/1: ep93xx: move timer defines into core.c and document ARM: 5954/1: ep93xx: move gpio interrupt support to gpio.c ARM: 5953/1: ep93xx: fix broken build of clock.c ARM: 5952/1: ARM: MM: Add ARM_L1_CACHE_SHIFT_6 for handle inside each ARCH Kconfig ARM: 5949/1: NUC900 add gpio virtual memory map ARM: 5948/1: Enable timer0 to time4 clock support for nuc910 ARM: 5940/2: ARM: MMCI: remove custom DBG macro and printk ARM: make_coherent(): fix problems with highpte, part 2 MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself ARM: 5945/1: ep93xx: include correct irq.h in core.c ARM: 5933/1: amba-pl011: support hardware flow control ARM: 5930/1: Add PKMAP area description to memory.txt. ARM: 5929/1: Add checks to detect overlap of memory regions. ARM: 5928/1: Change type of VMALLOC_END to unsigned long. ARM: 5927/1: Make delimiters of DMA area globally visibly. ARM: 5926/1: Add "Virtual kernel memory..." printout. ARM: 5920/1: OMAP4: Enable L2 Cache ... Fix up trivial conflict in arch/arm/mach-mx25/clock.c
2010-02-28Merge branch 'master' of /home/davem/src/GIT/linux-2.6/David S. Miller3-22/+18
Conflicts: drivers/firmware/iscsi_ibft.c
2010-02-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6Linus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: (187 commits) sh: remove dead LED code for migo-r and ms7724se sh: ecovec build fix for CONFIG_I2C=n sh: ecovec r-standby support sh: ms7724se r-standby support sh: SH-Mobile R-standby register save/restore clocksource: Fix up a registration/IRQ race in the sh drivers. sh: ms7724: modify scan_timing for KEYSC sh: ms7724: Add sh_sir support sh: mach-ecovec24: Add sh_sir support sh: wire up SET/GET_UNALIGN_CTL. sh: allow alignment fault mode to be configured at kernel boot. sh: sh7724: Update FSI/SPU2 clock sh: always enable sh7724 vpu_clk and set to 166MHz on Ecovec sh: add sh7724 kick callback to clk_div4_table sh: introduce struct clk_div4_table sh: clock-cpg div4 set_rate() shift fix sh: Turn on speculative return for SH7785 and SH7786 sh: Merge legacy and dynamic PMB modes. sh: Use uncached I/O helpers in PMB setup. sh: Provide uncached I/O helpers. ...
2010-02-26failslab: add ability to filter slab cachesDmitry Monakhov3-6/+43
This patch allow to inject faults only for specific slabs. In order to preserve default behavior cache filter is off by default (all caches are faulty). One may define specific set of slabs like this: # mark skbuff_head_cache as faulty echo 1 > /sys/kernel/slab/skbuff_head_cache/failslab # Turn on cache filter (off by default) echo 1 > /sys/kernel/debug/failslab/cache-filter # Turn on fault injection echo 1 > /sys/kernel/debug/failslab/times echo 1 > /sys/kernel/debug/failslab/probability Acked-by: David Rientjes <rientjes@google.com> Acked-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-02-26early_res: Add free_early_partial()Yinghai Lu1-3/+0
To free partial areas in pcpu_setup... Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> LKML-Reference: <4B85E245.5030001@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-25Merge branches 'clks' and 'pnx' into develRussell King1-3/+1
2010-02-22memcg: fix oom killing a child process in an other cgroupKAMEZAWA Hiroyuki1-0/+2
Presently the oom-killer is memcg aware and it finds the worst process from processes under memcg(s) in oom. Then, it kills victim's child first. It may kill a child in another cgroup and may not be any help for recovery. And it will break the assumption users have. This patch fixes it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-22x86: Fix non-bootmem compilation on PowerPCYinghai Lu1-0/+2
These build errors on some non-x86 platforms (PowerPC for example): mm/page_alloc.c: In function '__alloc_memory_core_early': mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area' mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check' The function is only needed on CONFIG_NO_BOOTMEM. Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Mel Gorman <mel@csn.ul.ie> LKML-Reference: <4B747239.4070907@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-21mm: Make copy_from_user() in migrate.c statically predictableH. Peter Anvin1-21/+15
x86-32 has had a static test for copy_on_user() overflow for a while. This test currently fails in mm/migrate.c resulting in an allyesconfig/allmodconfig build failure on x86-32: In function ‘copy_from_user’, inlined from ‘do_pages_stat’ at /home/hpa/kernel/git/mm/migrate.c:1012: /home/hpa/kernel/git/arch/x86/include/asm/uaccess_32.h:212: error: call to ‘copy_from_user_overflow’ declared Make the logic more explicit and therefore easier for gcc to understand. v2: rewrite the loop entirely using a more normal structure for a chunked-data loop (Linus Torvalds) Reported-by: Len Brown <lenb@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Reviewed-and-Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Arjan van de Ven <arjan@linux.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-02-20MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itselfRussell King3-10/+10
On VIVT ARM, when we have multiple shared mappings of the same file in the same MM, we need to ensure that we have coherency across all copies. We do this via make_coherent() by making the pages uncacheable. This used to work fine, until we allowed highmem with highpte - we now have a page table which is mapped as required, and is not available for modification via update_mmu_cache(). Ralf Beache suggested getting rid of the PTE value passed to update_mmu_cache(): On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables to construct a pointer to the pte again. Passing a pte_t * is much more elegant. Maybe we might even replace the pte argument with the pte_t? Ben Herrenschmidt would also like the pte pointer for PowerPC: Passing the ptep in there is exactly what I want. I want that -instead- of the PTE value, because I have issue on some ppc cases, for I$/D$ coherency, where set_pte_at() may decide to mask out the _PAGE_EXEC. So, pass in the mapped page table pointer into update_mmu_cache(), and remove the PTE value, updating all implementations and call sites to suit. Includes a fix from Stephen Rothwell: sparc: fix fallout from update_mmu_cache API change Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>