summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)AuthorFilesLines
2014-06-30mm: vmscan: do not throttle based on pfmemalloc reserves if node has no ↵Mel Gorman1-6/+37
ZONE_NORMAL commit 675becce15f320337499bc1a9356260409a5ba29 upstream. throttle_direct_reclaim() is meant to trigger during swap-over-network during which the min watermark is treated as a pfmemalloc reserve. It throttes on the first node in the zonelist but this is flawed. The user-visible impact is that a process running on CPU whose local memory node has no ZONE_NORMAL will stall for prolonged periods of time, possibly indefintely. This is due to throttle_direct_reclaim thinking the pfmemalloc reserves are depleted when in fact they don't exist on that node. On a NUMA machine running a 32-bit kernel (I know) allocation requests from CPUs on node 1 would detect no pfmemalloc reserves and the process gets throttled. This patch adjusts throttling of direct reclaim to throttle based on the first node in the zonelist that has a usable ZONE_NORMAL or lower zone. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16mm/compaction: make isolate_freepages start at pageblock boundaryVlastimil Babka1-10/+12
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream. The compaction freepage scanner implementation in isolate_freepages() starts by taking the current cc->free_pfn value as the first pfn. In a for loop, it scans from this first pfn to the end of the pageblock, and then subtracts pageblock_nr_pages from the first pfn to obtain the first pfn for the next for loop iteration. This means that when cc->free_pfn starts at offset X rather than being aligned on pageblock boundary, the scanner will start at offset X in all scanned pageblock, ignoring potentially many free pages. Currently this can happen when a) zone's end pfn is not pageblock aligned, or b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE enabled and a hole spanning the beginning of a pageblock This patch fixes the problem by aligning the initial pfn in isolate_freepages() to pageblock boundary. This also permits replacing the end-of-pageblock alignment within the for loop with a simple pageblock_nr_pages increment. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16mm: compaction: detect when scanners meet in isolate_freepagesVlastimil Babka1-4/+15
commit 7ed695e069c3cbea5e1fd08f84a04536da91f584 upstream. Compaction of a zone is finished when the migrate scanner (which begins at the zone's lowest pfn) meets the free page scanner (which begins at the zone's highest pfn). This is detected in compact_zone() and in the case of direct compaction, the compact_blockskip_flush flag is set so that kswapd later resets the cached scanner pfn's, and a new compaction may again start at the zone's borders. The meeting of the scanners can happen during either scanner's activity. However, it may currently fail to be detected when it occurs in the free page scanner, due to two problems. First, isolate_freepages() keeps free_pfn at the highest block where it isolated pages from, for the purposes of not missing the pages that are returned back to allocator when migration fails. Second, failing to isolate enough free pages due to scanners meeting results in -ENOMEM being returned by migrate_pages(), which makes compact_zone() bail out immediately without calling compact_finished() that would detect scanners meeting. This failure to detect scanners meeting might result in repeated attempts at compaction of a zone that keep starting from the cached pfn's close to the meeting point, and quickly failing through the -ENOMEM path, without the cached pfns being reset, over and over. This has been observed (through additional tracepoints) in the third phase of the mmtests stress-highalloc benchmark, where the allocator runs on an otherwise idle system. The problem was observed in the DMA32 zone, which was used as a fallback to the preferred Normal zone, but on the 4GB system it was actually the largest zone. The problem is even amplified for such fallback zone - the deferred compaction logic, which could (after being fixed by a previous patch) reset the cached scanner pfn's, is only applied to the preferred zone and not for the fallbacks. The problem in the third phase of the benchmark was further amplified by commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy") which resulted in a non-deterministic regression of the allocation success rate from ~85% to ~65%. This occurs in about half of benchmark runs, making bisection problematic. It is unlikely that the commit itself is buggy, but it should put more pressure on the DMA32 zone during phases 1 and 2, which may leave it more fragmented in phase 3 and expose the bugs that this patch fixes. The fix is to make scanners meeting in isolate_freepage() stay that way, and to check in compact_zone() for scanners meeting when migrate_pages() returns -ENOMEM. The result is that compact_finished() also detects scanners meeting and sets the compact_blockskip_flush flag to make kswapd reset the scanner pfn's. The results in stress-highalloc benchmark show that the "regression" by commit 81c0a2bb515f in phase 3 no longer occurs, and phase 1 and 2 allocation success rates are also significantly improved. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-16mm: compaction: reset cached scanner pfn's before reading themVlastimil Babka1-8/+8
commit d3132e4b83e6bd383c74d716f7281d7c3136089c upstream. Compaction caches pfn's for its migrate and free scanners to avoid scanning the whole zone each time. In compact_zone(), the cached values are read to set up initial values for the scanners. There are several situations when these cached pfn's are reset to the first and last pfn of the zone, respectively. One of these situations is when a compaction has been deferred for a zone and is now being restarted during a direct compaction, which is also done in compact_zone(). However, compact_zone() currently reads the cached pfn's *before* resetting them. This means the reset doesn't affect the compaction that performs it, and with good chance also subsequent compactions, as update_pageblock_skip() is likely to be called and update the cached pfn's to those being processed. Another chance for a successful reset is when a direct compaction detects that migration and free scanners meet (which has its own problems addressed by another patch) and sets update_pageblock_skip flag which kswapd uses to do the reset because it goes to sleep. This is clearly a bug that results in non-deterministic behavior, so this patch moves the cached pfn reset to be performed *before* the values are read. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11mm: rmap: fix use-after-free in __put_anon_vmaAndrey Ryabinin1-2/+1
commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream. While working address sanitizer for kernel I've discovered use-after-free bug in __put_anon_vma. For the last anon_vma, anon_vma->root freed before child anon_vma. Later in anon_vma_free(anon_vma) we are referencing to already freed anon_vma->root to check rwsem. This fixes it by freeing the child anon_vma before freeing anon_vma->root. Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-11mm/memory-failure.c: fix memory leak by race between poison and unpoisonNaoya Horiguchi1-0/+2
commit 3e030ecc0fc7de10fd0da10c1c19939872a31717 upstream. When a memory error happens on an in-use page or (free and in-use) hugepage, the victim page is isolated with its refcount set to one. When you try to unpoison it later, unpoison_memory() calls put_page() for it twice in order to bring the page back to free page pool (buddy or free hugepage list). However, if another memory error occurs on the page which we are unpoisoning, memory_failure() returns without releasing the refcount which was incremented in the same call at first, which results in memory leak and unconsistent num_poisoned_pages statistics. This patch fixes it. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07percpu: make pcpu_alloc_chunk() use pcpu_mem_free() instead of kfree()Jianyu Zhan1-1/+1
commit 5a838c3b60e3a36ade764cf7751b8f17d7c9c2da upstream. pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) + BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long) It hardly could be ever bigger than PAGE_SIZE even for large-scale machine, but for consistency with its couterpart pcpu_mem_zalloc(), use pcpu_mem_free() instead. Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use pcpu_mem_free() instead of kfree()") addressed this problem, but missed this one. tj: commit message updated Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07hwpoison, hugetlb: lock_page/unlock_page does not match for handling a free ↵Chen Yucong1-7/+8
hugepage commit b985194c8c0a130ed155b71662e39f7eaea4876f upstream. For handling a free hugepage in memory failure, the race will happen if another thread hwpoisoned this hugepage concurrently. So we need to check PageHWPoison instead of !PageHWPoison. If hwpoison_filter(p) returns true or a race happens, then we need to unlock_page(hpage). Signed-off-by: Chen Yucong <slaoub@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07mm, thp: close race between mremap() and split_huge_page()Kirill A. Shutemov1-1/+8
commit dd18dbc2d42af75fffa60c77e0f02220bc329829 upstream. It's critical for split_huge_page() (and migration) to catch and freeze all PMDs on rmap walk. It gets tricky if there's concurrent fork() or mremap() since usually we copy/move page table entries on dup_mm() or move_page_tables() without rmap lock taken. To get it work we rely on rmap walk order to not miss any entry. We expect to see destination VMA after source one to work correctly. But after switching rmap implementation to interval tree it's not always possible to preserve expected walk order. It works fine for dup_mm() since new VMA has the same vma_start_pgoff() / vma_last_pgoff() and explicitly insert dst VMA after src one with vma_interval_tree_insert_after(). But on move_vma() destination VMA can be merged into adjacent one and as result shifted left in interval tree. Fortunately, we can detect the situation and prevent race with rmap walk by moving page table entries under rmap lock. See commit 38a76013ad80. Problem is that we miss the lock when we move transhuge PMD. Most likely this bug caused the crash[1]. [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473 Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: Michel Lespinasse <walken@google.com> Cc: Dave Jones <davej@redhat.com> Cc: David Miller <davem@davemloft.net> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-07mm: make fixup_user_fault() check the vma access rights tooLinus Torvalds1-0/+5
commit 1b17844b29ae042576bea588164f2f1e9590a8bc upstream. fixup_user_fault() is used by the futex code when the direct user access fails, and the futex code wants it to either map in the page in a usable form or return an error. It relied on handle_mm_fault() to map the page, and correctly checked the error return from that, but while that does map the page, it doesn't actually guarantee that the page will be mapped with sufficient permissions to be then accessed. So do the appropriate tests of the vma access rights by hand. [ Side note: arguably handle_mm_fault() could just do that itself, but we have traditionally done it in the caller, because some callers - notably get_user_pages() - have been able to access pages even when they are mapped with PROT_NONE. Maybe we should re-visit that design decision, but in the meantime this is the minimal patch. ] Found by Dave Jones running his trinity tool. Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-30mm/hugetlb.c: add cond_resched_lock() in return_unused_surplus_pages()Mizuma, Masayoshi1-0/+1
commit 7848a4bf51b34f41fcc9bd77e837126d99ae84e3 upstream. soft lockup in freeing gigantic hugepage fixed in commit 55f67141a892 "mm: hugetlb: fix softlockup when a large number of hugepages are freed." can happen in return_unused_surplus_pages(), so let's fix it. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06mm: hugetlb: fix softlockup when a large number of hugepages are freed.Mizuma, Masayoshi1-0/+1
commit 55f67141a8927b2be3e51840da37b8a2320143ed upstream. When I decrease the value of nr_hugepage in procfs a lot, softlockup happens. It is because there is no chance of context switch during this process. On the other hand, when I allocate a large number of hugepages, there is some chance of context switch. Hence softlockup doesn't happen during this process. So it's necessary to add the context switch in the freeing process as same as allocating process to avoid softlockup. When I freed 12 TB hugapages with kernel-2.6.32-358.el6, the freeing process occupied a CPU over 150 seconds and following softlockup message appeared twice or more. $ echo 6000000 > /proc/sys/vm/nr_hugepages $ cat /proc/sys/vm/nr_hugepages 6000000 $ grep ^Huge /proc/meminfo HugePages_Total: 6000000 HugePages_Free: 6000000 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB $ echo 0 > /proc/sys/vm/nr_hugepages BUG: soft lockup - CPU#16 stuck for 67s! [sh:12883] ... Pid: 12883, comm: sh Not tainted 2.6.32-358.el6.x86_64 #1 Call Trace: free_pool_huge_page+0xb8/0xd0 set_max_huge_pages+0x128/0x190 hugetlb_sysctl_handler_common+0x113/0x140 hugetlb_sysctl_handler+0x1e/0x20 proc_sys_call_handler+0x97/0xd0 proc_sys_write+0x14/0x20 vfs_write+0xb8/0x1a0 sys_write+0x51/0x90 __audit_syscall_exit+0x265/0x290 system_call_fastpath+0x16/0x1b I have not confirmed this problem with upstream kernels because I am not able to prepare the machine equipped with 12TB memory now. However I confirmed that the amount of decreasing hugepages was directly proportional to the amount of required time. I measured required times on a smaller machine. It showed 130-145 hugepages decreased in a millisecond. Amount of decreasing Required time Decreasing rate hugepages (msec) (pages/msec) ------------------------------------------------------------ 10,000 pages == 20GB 70 - 74 135-142 30,000 pages == 60GB 208 - 229 131-144 It means decrement of 6TB hugepages will trigger softlockup with the default threshold 20sec, in this decreasing rate. Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-06mm: try_to_unmap_cluster() should lock_page() before mlockingVlastimil Babka2-2/+14
commit 57e68e9cd65b4b8eb4045a1e0d0746458502554c upstream. A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing with trinity. The call site try_to_unmap_cluster() does not lock the pages other than its check_page parameter (which is already locked). The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat unclear, but apparently it serializes against page migration, which could otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the page would be eventually encountered again, but NR_MLOCK accounting would become distorted nevertheless. This patch adds a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that effect. The call site try_to_unmap_cluster() is fixed so that for page != check_page, trylock_page() is attempted (to avoid possible deadlocks as we already have check_page locked) and mlock_vma_page() is performed only upon success. If the page lock cannot be obtained, the page is left without PG_mlocked, which is again not a problem in the whole unevictable memory design. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Bob Liu <bob.liu@oracle.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26bdi: avoid oops on device removalJan Kara1-4/+9
commit 5acda9d12dcf1ad0d9a5a2a7c646de3472fa7555 upstream. After commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue") when device is removed while we are writing to it we crash in bdi_writeback_workfn() -> set_worker_desc() because bdi->dev is NULL. This can happen because even though bdi_unregister() cancels all pending flushing work, nothing really prevents new ones from being queued from balance_dirty_pages() or other places. Fix the problem by clearing BDI_registered bit in bdi_unregister() and checking it before scheduling of any flushing work. Fixes: 839a8e8660b6777e7fe4e80af1a048aebe2b5977 Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-26backing_dev: fix hung task on syncDerek Basehore1-1/+4
commit 6ca738d60c563d5c6cf6253ee4b8e76fa77b2b9e upstream. bdi_wakeup_thread_delayed() used the mod_delayed_work() function to schedule work to writeback dirty inodes. The problem with this is that it can delay work that is scheduled for immediate execution, such as the work from sync_inodes_sb(). This can happen since mod_delayed_work() can now steal work from a work_queue. This fixes the problem by using queue_delayed_work() instead. This is a regression caused by commit 839a8e8660b6 ("writeback: replace custom worker pool implementation with unbound workqueue"). The reason that this causes a problem is that laptop-mode will change the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default. In the case that bdi_wakeup_thread_delayed() races with sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung task. Even if dirty_writeback_centisecs is not long enough to cause a hung task, we still don't want to delay sync for that long. We fix the problem by using queue_delayed_work() when we want to schedule writeback sometime in future. This function doesn't change the timer if it is already armed. For the same reason, we also change bdi_writeback_workfn() to immediately queue the work again in the case that the work_list is not empty. The same problem can happen if the sync work is run on the rescue worker. [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()] Signed-off-by: Derek Basehore <dbasehore@chromium.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zento.linux.org.uk> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Derek Basehore <dbasehore@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: Benson Leung <bleung@chromium.org> Cc: Sonny Rao <sonnyrao@chromium.org> Cc: Luigi Semenzato <semenzato@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-04-03mm: close PageTail raceDavid Rientjes4-5/+7
commit 668f9abbd4334e6c29fa8acd71635c4f9101caa7 upstream. Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23memcg: reparent charges of children before processing parentFilipe Brandenburger1-0/+14
commit 4fb1a86fb5e4209a7d4426d4e586c58e9edc74ac upstream. Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Further testing showed that an ordered workqueue for cgroup_destroy_wq is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched stage on the way can mess up the order before reaching the workqueue. Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on all its children (and grandchildren, in the correct order) to have their charges reparented first. [The version for 3.10.34 (or perhaps now 3.10.35) is this below. Yes, more differences, and the old mem_cgroup_reparent_charges line is intentionally left in for 3.10 whereas it was removed for 3.12+: that's because the css/cgroup iterator changed in between, it used not to supply the root of the subtree, but nowadays it does - Hugh] Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-23mm/compaction: break out of loop on !PageBuddy in isolate_freepages_blockLaura Abbott1-7/+13
commit 2af120bc040c5ebcda156df6be6a66610ab6957f upstream. We received several reports of bad page state when freeing CMA pages previously allocated with alloc_contig_range: BUG: Bad page state in process Binder_A pfn:63202 page:d21130b0 count:0 mapcount:1 mapping: (null) index:0x7dfbf page flags: 0x40080068(uptodate|lru|active|swapbacked) Based on the page state, it looks like the page was still in use. The page flags do not make sense for the use case though. Further debugging showed that despite alloc_contig_range returning success, at least one page in the range still remained in the buddy allocator. There is an issue with isolate_freepages_block. In strict mode (which CMA uses), if any pages in the range cannot be isolated, isolate_freepages_block should return failure 0. The current check keeps track of the total number of isolated pages and compares against the size of the range: if (strict && nr_strict_required > total_isolated) total_isolated = 0; After taking the zone lock, if one of the pages in the range is not in the buddy allocator, we continue through the loop and do not increment total_isolated. If in the last iteration of the loop we isolate more than one page (e.g. last page needed is a higher order page), the check for total_isolated may pass and we fail to detect that a page was skipped. The fix is to bail out if the loop immediately if we are in strict mode. There's no benfit to continuing anyway since we need all pages to be isolated. Additionally, drop the error checking based on nr_strict_required and just check the pfn ranges. This matches with what isolate_freepages_range does. Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-03-06memcg: fix endless loop caused by mem_cgroup_iterMichal Hocko1-2/+2
commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream. Hugh has reported an endless loop when the hardlimit reclaim sees the same group all the time. This might happen when the reclaim races with the memcg removal. shrink_zone [rmdir root] mem_cgroup_iter(root, NULL, reclaim) // prev = NULL rcu_read_lock() mem_cgroup_iter_load last_visited = iter->last_visited // gets root || NULL css_tryget(last_visited) // failed last_visited = NULL [1] memcg = root = __mem_cgroup_iter_next(root, NULL) mem_cgroup_iter_update iter->last_visited = root; reclaim->generation = iter->generation mem_cgroup_iter(root, root, reclaim) // prev = root rcu_read_lock mem_cgroup_iter_load last_visited = iter->last_visited // gets root css_tryget(last_visited) // failed [1] The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax memcg iter caching") which has replaced unconditional css_get/css_put by css_tryget/css_put for the cached iterator. This patch fixes the issue by skipping css_tryget on the root of the tree walk in mem_cgroup_iter_load and symmetrically doesn't release it in mem_cgroup_iter_update. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-22mm/memory-failure.c: move refcount only in !MF_COUNT_INCREASEDNaoya Horiguchi1-2/+4
commit 8d547ff4ac5927245e0833ac18528f939da0ee0e upstream. mce-test detected a test failure when injecting error to a thp tail page. This is because we take page refcount of the tail page in madvise_hwpoison() while the fix in commit a3e0f9e47d5e ("mm/memory-failure.c: transfer page count from head page to tail page after split thp") assumes that we always take refcount on the head page. When a real memory error happens we take refcount on the head page where memory_failure() is called without MF_COUNT_INCREASED set, so it seems to me that testing memory error on thp tail page using madvise makes little sense. This patch cancels moving refcount in !MF_COUNT_INCREASED for valid testing. [akpm@linux-foundation.org: s/&&/&/] Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm: fix process accidentally killed by mce because of huge page migrationXishi Qiu1-2/+9
Based on c8721bbbdd36382de51cd6b7a56322e0acca2414 upstream, but only the bugfix portion pulled out. Hi Naoya or Greg, We found a bug in 3.10.x. The problem is that we accidentally have a hwpoisoned hugepage in free hugepage list. It could happend in the the following scenario: process A process B migrate_huge_page put_page (old hugepage) linked to free hugepage list hugetlb_fault hugetlb_no_page alloc_huge_page dequeue_huge_page_vma dequeue_huge_page_node (steal hwpoisoned hugepage) set_page_hwpoison_huge_page dequeue_hwpoisoned_huge_page (fail to dequeue) I tested this bug, one process keeps allocating huge page, and I use sysfs interface to soft offline a huge page, then received: "MCE: Killing UCP:2717 due to hardware memory corruption fault at 8200034" Upstream kernel is free from this bug because of these two commits: f15bdfa802bfa5eb6b4b5a241b97ec9fa1204a35 mm/memory-failure.c: fix memory leak in successful soft offlining c8721bbbdd36382de51cd6b7a56322e0acca2414 mm: memory-hotplug: enable memory hotplug to handle hugepage The first one, although the problem is about memory leak, this patch moves unset_migratetype_isolate(), which is important to avoid the race. The latter is not a bug fix and it's too big, so I rewrite a small one. The following patch can fix this bug.(please apply f15bdfa802bf first) Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm/memory-failure.c: fix memory leak in successful soft offliningNaoya Horiguchi1-4/+18
commit f15bdfa802bfa5eb6b4b5a241b97ec9fa1204a35 upstream. After a successful page migration by soft offlining, the source page is not properly freed and it's never reusable even if we unpoison it afterward. This is caused by the race between freeing page and setting PG_hwpoison. In successful soft offlining, the source page is put (and the refcount becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked to pagevec and actual freeing back to buddy is delayed. So if PG_hwpoison is set for the page before freeing, the freeing does not functions as expected (in such case freeing aborts in free_pages_prepare() check.) This patch tries to make sure to free the source page before setting PG_hwpoison on it. To avoid reallocating, the page keeps MIGRATE_ISOLATE until after setting PG_hwpoison. This patch also removes obsolete comments about "keeping elevated refcount" because what they say is not true. Unlike memory_failure(), soft_offline_page() uses no special page isolation code, and the soft-offlined pages have no elevated. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-20mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of ↵KOSAKI Motohiro1-2/+3
spin_lock_irq() commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream. During aio stress test, we observed the following lockdep warning. This mean AIO+numa_balancing is currently deadlockable. The problem is, aio_migratepage disable interrupt, but __set_page_dirty_nobuffers unintentionally enable it again. Generally, all helper function should use spin_lock_irqsave() instead of spin_lock_irq() because they don't know caller at all. other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&ctx->completion_lock)->rlock); <Interrupt> lock(&(&ctx->completion_lock)->rlock); *** DEADLOCK *** dump_stack+0x19/0x1b print_usage_bug+0x1f7/0x208 mark_lock+0x21d/0x2a0 mark_held_locks+0xb9/0x140 trace_hardirqs_on_caller+0x105/0x1d0 trace_hardirqs_on+0xd/0x10 _raw_spin_unlock_irq+0x2c/0x50 __set_page_dirty_nobuffers+0x8c/0xf0 migrate_page_copy+0x434/0x540 aio_migratepage+0xb1/0x140 move_to_new_page+0x7d/0x230 migrate_pages+0x5e5/0x700 migrate_misplaced_page+0xbc/0xf0 do_numa_page+0x102/0x190 handle_pte_fault+0x241/0x970 handle_mm_fault+0x265/0x370 __do_page_fault+0x172/0x5a0 do_page_fault+0x1a/0x70 page_fault+0x28/0x30 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm, oom: base root bonus on current usageDavid Rientjes1-1/+1
commit 778c14affaf94a9e4953179d3e13a544ccce7707 upstream. A 3% of system memory bonus is sometimes too excessive in comparison to other processes. With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM killer tries to avoid killing privileged tasks by subtracting 3% of overall memory (system or cgroup) from their per-task consumption. But as a result, all root tasks that consume less than 3% of overall memory are considered equal, and so it only takes 33+ privileged tasks pushing the system out of memory for the OOM killer to do something stupid and kill dhclient or other root-owned processes. For example, on a 32G machine it can't tell the difference between the 1M agetty and the 10G fork bomb member. The changelog describes this 3% boost as the equivalent to the global overcommit limit being 3% higher for privileged tasks, but this is not the same as discounting 3% of overall memory from _every privileged task individually_ during OOM selection. Replace the 3% of system memory bonus with a 3% of current memory usage bonus. By giving root tasks a bonus that is proportional to their actual size, they remain comparable even when relatively small. In the example above, the OOM killer will discount the 1M agetty's 256 badness points down to 179, and the 10G fork bomb's 262144 points down to 183500 points and make the right choice, instead of discounting both to 0 and killing agetty because it's first in the task list. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13slub: Fix calculation of cpu slabsLi Zefan1-1/+7
commit 8afb1474db4701d1ab80cd8251137a3260e6913e upstream. /sys/kernel/slab/:t-0000048 # cat cpu_slabs 231 N0=16 N1=215 /sys/kernel/slab/:t-0000048 # cat slabs 145 N0=36 N1=109 See, the number of slabs is smaller than that of cpu slabs. The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700 ("slub: per cpu cache for partial pages"). We should use page->pages instead of page->pobjects when calculating the number of cpu partial slabs. This also fixes the mapping of slabs and nodes. As there's no variable storing the number of total/active objects in cpu partial slabs, and we don't have user interfaces requiring those statistics, I just add WARN_ON for those cases. Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/page-writeback.c: do not count anon pages as dirtyable memoryJohannes Weiner2-37/+18
commit a1c3bfb2f67ef766de03f1f56bdfff9c8595ab14 upstream. The VM is currently heavily tuned to avoid swapping. Whether that is good or bad is a separate discussion, but as long as the VM won't swap to make room for dirty cache, we can not consider anonymous pages when calculating the amount of dirtyable memory, the baseline to which dirty_background_ratio and dirty_ratio are applied. A simple workload that occupies a significant size (40+%, depending on memory layout, storage speeds etc.) of memory with anon/tmpfs pages and uses the remainder for a streaming writer demonstrates this problem. In that case, the actual cache pages are a small fraction of what is considered dirtyable overall, which results in an relatively large portion of the cache pages to be dirtied. As kswapd starts rotating these, random tasks enter direct reclaim and stall on IO. Only consider free pages and file pages dirtyable. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/page-writeback.c: fix dirty_balance_reserve subtraction from dirtyable memoryJohannes Weiner1-31/+24
commit a804552b9a15c931cfc2a92a2e0aed1add8b580a upstream. Tejun reported stuttering and latency spikes on a system where random tasks would enter direct reclaim and get stuck on dirty pages. Around 50% of memory was occupied by tmpfs backed by an SSD, and another disk (rotating) was reading and writing at max speed to shrink a partition. : The problem was pretty ridiculous. It's a 8gig machine w/ one ssd and 10k : rpm harddrive and I could reliably reproduce constant stuttering every : several seconds for as long as buffered IO was going on on the hard drive : either with tmpfs occupying somewhere above 4gig or a test program which : allocates about the same amount of anon memory. Although swap usage was : zero, turning off swap also made the problem go away too. : : The trigger conditions seem quite plausible - high anon memory usage w/ : heavy buffered IO and swap configured - and it's highly likely that this : is happening in the wild too. (this can happen with copying large files : to usb sticks too, right?) This patch (of 2): The dirty_balance_reserve is an approximation of the fraction of free pages that the page allocator does not make available for page cache allocations. As a result, it has to be taken into account when calculating the amount of "dirtyable memory", the baseline to which dirty_background_ratio and dirty_ratio are applied. However, currently the reserve is subtracted from the sum of free and reclaimable pages, which is non-sensical and leads to erroneous results when the system is dominated by unreclaimable pages and the dirty_balance_reserve is bigger than free+reclaimable. In that case, at least the already allocated cache should be considered dirtyable. Fix the calculation by subtracting the reserve from the amount of free pages, then adding the reclaimable pages on top. [akpm@linux-foundation.org: fix CONFIG_HIGHMEM build] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/memory-failure.c: shift page lock from head page to tail page after thp splitNaoya Horiguchi1-10/+11
commit 54b9dd14d09f24927285359a227aa363ce46089e upstream. After thp split in hwpoison_user_mappings(), we hold page lock on the raw error page only between try_to_unmap, hence we are in danger of race condition. I found in the RHEL7 MCE-relay testing that we have "bad page" error when a memory error happens on a thp tail page used by qemu-kvm: Triggering MCE exception on CPU 10 mce: [Hardware Error]: Machine check events logged MCE exception done on CPU 10 MCE 0x38c535: Killing qemu-kvm:8418 due to hardware memory corruption MCE 0x38c535: dirty LRU page recovery: Recovered qemu-kvm[8418]: segfault at 20 ip 00007ffb0f0f229a sp 00007fffd6bc5240 error 4 in qemu-kvm[7ffb0ef14000+420000] BUG: Bad page state in process qemu-kvm pfn:38c400 page:ffffea000e310000 count:0 mapcount:0 mapping: (null) index:0x7ffae3c00 page flags: 0x2fffff0008001d(locked|referenced|uptodate|dirty|swapbacked) Modules linked in: hwpoison_inject mce_inject vhost_net macvtap macvlan ... CPU: 0 PID: 8418 Comm: qemu-kvm Tainted: G M -------------- 3.10.0-54.0.1.el7.mce_test_fixed.x86_64 #1 Hardware name: NEC NEC Express5800/R120b-1 [N8100-1719F]/MS-91E7-001, BIOS 4.6.3C19 02/10/2011 Call Trace: dump_stack+0x19/0x1b bad_page.part.59+0xcf/0xe8 free_pages_prepare+0x148/0x160 free_hot_cold_page+0x31/0x140 free_hot_cold_page_list+0x46/0xa0 release_pages+0x1c1/0x200 free_pages_and_swap_cache+0xad/0xd0 tlb_flush_mmu.part.46+0x4c/0x90 tlb_finish_mmu+0x55/0x60 exit_mmap+0xcb/0x170 mmput+0x67/0xf0 vhost_dev_cleanup+0x231/0x260 [vhost_net] vhost_net_release+0x3f/0x90 [vhost_net] __fput+0xe9/0x270 ____fput+0xe/0x10 task_work_run+0xc4/0xe0 do_exit+0x2bb/0xa40 do_group_exit+0x3f/0xa0 get_signal_to_deliver+0x1d0/0x6e0 do_signal+0x48/0x5e0 do_notify_resume+0x71/0xc0 retint_signal+0x48/0x8c The reason of this bug is that a page fault happens before unlocking the head page at the end of memory_failure(). This strange page fault is trying to access to address 0x20 and I'm not sure why qemu-kvm does this, but anyway as a result the SIGSEGV makes qemu-kvm exit and on the way we catch the bad page bug/warning because we try to free a locked page (which was the former head page.) To fix this, this patch suggests to shift page lock from head page to tail page just after thp split. SIGSEGV still happens, but it affects only error affected VMs, not a whole system. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06mm/mempolicy.c: fix mempolicy printing in numa_mapsDavid Rientjes1-1/+1
commit 8790c71a18e5d2d93532ae250bcf5eddbba729cd upstream. As a result of commit 5606e3877ad8 ("mm: numa: Migrate on reference policy"), /proc/<pid>/numa_maps prints the mempolicy for any <pid> as "prefer:N" for the local node, N, of the process reading the file. This should only be printed when the mempolicy of <pid> is MPOL_PREFERRED for node N. If the process is actually only using the default mempolicy for local node allocation, make sure "default" is printed as expected. Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Robert Lippert <rlippert@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06mm: hugetlbfs: fix hugetlbfs optimizationAndrea Arcangeli2-60/+100
commit 27c73ae759774e63313c1fbfeb17ba076cea64c5 upstream. Commit 7cb2ef56e6a8 ("mm: fix aio performance regression for database caused by THP") can cause dereference of a dangling pointer if split_huge_page runs during PageHuge() if there are updates to the tail_page->private field. Also it is repeating compound_head twice for hugetlbfs and it is running compound_head+compound_trans_head for THP when a single one is needed in both cases. The new code within the PageSlab() check doesn't need to verify that the THP page size is never bigger than the smallest hugetlbfs page size, to avoid memory corruption. A longstanding theoretical race condition was found while fixing the above (see the change right after the skip_unlock label, that is relevant for the compound_lock path too). By re-establishing the _mapcount tail refcounting for all compound pages, this also fixes the below problem: echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages BUG: Bad page state in process bash pfn:59a01 page:ffffea000139b038 count:0 mapcount:10 mapping: (null) index:0x0 page flags: 0x1c00000000008000(tail) Modules linked in: CPU: 6 PID: 2018 Comm: bash Not tainted 3.12.0+ #25 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x55/0x76 bad_page+0xd5/0x130 free_pages_prepare+0x213/0x280 __free_pages+0x36/0x80 update_and_free_page+0xc1/0xd0 free_pool_huge_page+0xc2/0xe0 set_max_huge_pages.part.58+0x14c/0x220 nr_hugepages_store_common.isra.60+0xd0/0xf0 nr_hugepages_store+0x13/0x20 kobj_attr_store+0xf/0x20 sysfs_write_file+0x189/0x1e0 vfs_write+0xc5/0x1f0 SyS_write+0x55/0xb0 system_call_fastpath+0x16/0x1b Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Guillaume Morin <guillaume@morinfr.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25thp: fix copy_page_rep GPF by testing is_huge_zero_pmd once onlyHugh Dickins1-3/+3
commit eecc1e426d681351a6026a7d3e7d225f38955b6c upstream. We see General Protection Fault on RSI in copy_page_rep: that RSI is what you get from a NULL struct page pointer. RIP: 0010:[<ffffffff81154955>] [<ffffffff81154955>] copy_page_rep+0x5/0x10 RSP: 0000:ffff880136e15c00 EFLAGS: 00010286 RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200 RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000 RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001 R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0 Call Trace: copy_user_huge_page+0x93/0xab do_huge_pmd_wp_page+0x710/0x815 handle_mm_fault+0x15d8/0x1d70 __do_page_fault+0x14d/0x840 do_page_fault+0x2f/0x90 page_fault+0x22/0x30 do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but since shrink_huge_zero_page() can free the huge_zero_page, and we have no hold of our own on it here (except where the fourth test holds page_table_lock and has checked pmd_same), it's possible for it to answer yes the first time, but no to the second or third test. Change all those last three to tests for NULL page. (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin in https://lkml.org/lkml/2013/3/29/103. I believe that one is due to the source page being split, and a tail page freed, while copy is in progress; and not a problem without DEBUG_PAGEALLOC, since the pmd_same check will prevent a miscopy from being made visible.) Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page") Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-25mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfullyJianguo Wu1-4/+10
commit a49ecbcd7b0d5a1cda7d60e03df402dd0ef76ac8 upstream. After a successful hugetlb page migration by soft offline, the source page will either be freed into hugepage_freelists or buddy(over-commit page). If page is in buddy, page_hstate(page) will be NULL. It will hit a NULL pointer dereference in dequeue_hwpoisoned_huge_page(). BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: [<ffffffff81163761>] dequeue_hwpoisoned_huge_page+0x131/0x1d0 PGD c23762067 PUD c24be2067 PMD 0 Oops: 0000 [#1] SMP So check PageHuge(page) after call migrate_pages() successfully. [wujg: backport to 3.10: - adjust context] Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09memcg: fix memcg_size() calculationVladimir Davydov1-1/+1
commit 695c60830764945cf61a2cc623eb1392d137223e upstream. The mem_cgroup structure contains nr_node_ids pointers to mem_cgroup_per_node objects, not the objects themselves. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/memory-failure.c: transfer page count from head page to tail page after ↵Naoya Horiguchi1-0/+10
split thp commit a3e0f9e47d5ef7858a26cc12d90ad5146e802d47 upstream. Memory failures on thp tail pages cause kernel panic like below: mce: [Hardware Error]: Machine check events logged MCE exception done on CPU 7 BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 IP: [<ffffffff811b7cd1>] dequeue_hwpoisoned_huge_page+0x131/0x1e0 PGD bae42067 PUD ba47d067 PMD 0 Oops: 0000 [#1] SMP ... CPU: 7 PID: 128 Comm: kworker/7:2 Tainted: G M O 3.13.0-rc4-131217-1558-00003-g83b7df08e462 #25 ... Call Trace: me_huge_page+0x3e/0x50 memory_failure+0x4bb/0xc20 mce_process_work+0x3e/0x70 process_one_work+0x171/0x420 worker_thread+0x11b/0x3a0 ? manage_workers.isra.25+0x2b0/0x2b0 kthread+0xe4/0x100 ? kthread_create_on_node+0x190/0x190 ret_from_fork+0x7c/0xb0 ? kthread_create_on_node+0x190/0x190 ... RIP dequeue_hwpoisoned_huge_page+0x131/0x1e0 CR2: 0000000000000058 The reasoning of this problem is shown below: - when we have a memory error on a thp tail page, the memory error handler grabs a refcount of the head page to keep the thp under us. - Before unmapping the error page from processes, we split the thp, where page refcounts of both of head/tail pages don't change. - Then we call try_to_unmap() over the error page (which was a tail page before). We didn't pin the error page to handle the memory error, this error page is freed and removed from LRU list. - We never have the error page on LRU list, so the first page state check returns "unknown page," then we move to the second check with the saved page flag. - The saved page flag have PG_tail set, so the second page state check returns "hugepage." - We call me_huge_page() for freed error page, then we hit the above panic. The root cause is that we didn't move refcount from the head page to the tail page after split thp. So this patch suggests to do this. This panic was introduced by commit 524fca1e73 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages"). Note that we did have the same refcount problem before this commit, but it was just ignored because we had only first page state check which returned "unknown page." The commit changed the refcount problem from "doesn't work" to "kernel panic." Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix use-after-free in sys_remap_file_pagesRik van Riel1-3/+5
commit 4eb919825e6c3c7fb3630d5621f6d11e98a18b3a upstream. remap_file_pages calls mmap_region, which may merge the VMA with other existing VMAs, and free "vma". This can lead to a use-after-free bug. Avoid the bug by remembering vm_flags before calling mmap_region, and not trying to dereference vma later. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Kees Cook <keescook@chromium.org> Cc: Michel Lespinasse <walken@google.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/hugetlb: check for pte NULL pointer in __page_check_address()Jianguo Wu1-0/+4
commit 98398c32f6687ee1e1f3ae084effb4b75adb0747 upstream. In __page_check_address(), if address's pud is not present, huge_pte_offset() will return NULL, we should check the return value. Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: qiuxishi <qiuxishi@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm/compaction: respect ignore_skip_hint in update_pageblock_skipJoonsoo Kim1-0/+4
commit 6815bf3f233e0b10c99a758497d5d236063b010b upstream. update_pageblock_skip() only fits to compaction which tries to isolate by pageblock unit. If isolate_migratepages_range() is called by CMA, it try to isolate regardless of pageblock unit and it don't reference get_pageblock_skip() by ignore_skip_hint. We should also respect it on update_pageblock_skip() to prevent from setting the wrong information. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: fix TLB flush race between migration, and change_protection_rangeRik van Riel3-2/+12
commit 20841405940e7be0617612d521e206e4b6b325db upstream. There are a few subtle races, between change_protection_range (used by mprotect and change_prot_numa) on one side, and NUMA page migration and compaction on the other side. The basic race is that there is a time window between when the PTE gets made non-present (PROT_NONE or NUMA), and the TLB is flushed. During that time, a CPU may continue writing to the page. This is fine most of the time, however compaction or the NUMA migration code may come in, and migrate the page away. When that happens, the CPU may continue writing, through the cached translation, to what is no longer the current memory location of the process. This only affects x86, which has a somewhat optimistic pte_accessible. All other architectures appear to be safe, and will either always flush, or flush whenever there is a valid mapping, even with no permissions (SPARC). The basic race looks like this: CPU A CPU B CPU C load TLB entry make entry PTE/PMD_NUMA fault on entry read/write old page start migrating page change PTE/PMD to new page read/write old page [*] flush TLB reload TLB from new entry read/write new page lose data [*] the old page may belong to a new user at this point! The obvious fix is to flush remote TLB entries, by making sure that pte_accessible aware of the fact that PROT_NONE and PROT_NUMA memory may still be accessible if there is a TLB flush pending for the mm. This should fix both NUMA migration and compaction. [mgorman@suse.de: fix build] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: numa: avoid unnecessary work on the failure pathMel Gorman1-1/+3
commit eb4489f69f224356193364dc2762aa009738ca7f upstream. If a PMD changes during a THP migration then migration aborts but the failure path is doing more work than is necessary. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: numa: ensure anon_vma is locked to prevent parallel THP splitsMel Gorman1-0/+7
commit c3a489cac38d43ea6dc4ac240473b44b46deecf7 upstream. The anon_vma lock prevents parallel THP splits and any associated complexity that arises when handling splits during THP migration. This patch checks if the lock was successfully acquired and bails from THP migration if it failed for any reason. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-09mm: clear pmd_numa before invalidatingMel Gorman1-0/+3
commit 67f87463d3a3362424efcbe8b40e4772fd34fc61 upstream. On x86, PMD entries are similar to _PAGE_PROTNONE protection and are handled as NUMA hinting faults. The following two page table protection bits are what defines them _PAGE_NUMA:set _PAGE_PRESENT:clear A PMD is considered present if any of the _PAGE_PRESENT, _PAGE_PROTNONE, _PAGE_PSE or _PAGE_NUMA bits are set. If pmdp_invalidate encounters a pmd_numa, it clears the present bit leaving _PAGE_NUMA which will be considered not present by the CPU but present by pmd_present. The existing caller of pmdp_invalidate should handle it but it's an inconsistent state for a PMD. This patch keeps the state consistent when calling pmdp_invalidate. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-08mm: numa: return the number of base pages altered by protection changesMel Gorman2-1/+7
commit 72403b4a0fbdf433c1fe0127e49864658f6f6468 upstream. Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one PTE update") was added to account for the number of PTE updates when marking pages prot_numa. task_numa_work was using the old return value to track how much address space had been updated. Altering the return value causes the scanner to do more work than it is configured or documented to in a single unit of work. This patch reverts that commit and accounts for the number of THP updates separately in vmstat. It is up to the administrator to interpret the pair of values correctly. This is a straight-forward operation and likely to only be of interest when actively debugging NUMA balancing problems. The impact of this patch is that the NUMA PTE scanner will scan slower when THP is enabled and workloads may converge slower as a result. On the flip size system CPU usage should be lower than recent tests reported. This is an illustrative example of a short single JVM specjbb test specjbb 3.12.0 3.12.0 vanilla acctupdates TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%) TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%) TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%) TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%) TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%) TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%) TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%) TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%) 3.12.0 3.12.0 vanillaacctupdates User 5169.64 5184.14 System 100.45 80.02 Elapsed 252.75 251.85 Performance is similar but note the reduction in system CPU time. While this showed a performance gain, it will not be universal but at least it'll be behaving as documented. The vmstats are obviously different but here is an obvious interpretation of them from mmtests. 3.12.0 3.12.0 vanillaacctupdates NUMA page range updates 1408326 11043064 NUMA huge PMD updates 0 21040 NUMA PTE updates 1408326 291624 "NUMA page range updates" == nr_pte_updates and is the value returned to the NUMA pte scanner. NUMA huge PMD updates were the number of THP updates which in combination can be used to calculate how many ptes were updated from userspace. Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Alex Thorlton <athorlton@sgi.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-04mm: ensure get_unmapped_area() returns higher address than mmap_min_addrAkira Takeuchi1-5/+5
commit 2afc745f3e3079ab16c826be4860da2529054dd2 upstream. This patch fixes the problem that get_unmapped_area() can return illegal address and result in failing mmap(2) etc. In case that the address higher than PAGE_SIZE is set to /proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be returned by get_unmapped_area(), even if you do not pass any virtual address hint (i.e. the second argument). This is because the current get_unmapped_area() code does not take into account mmap_min_addr. This leads to two actual problems as follows: 1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO, although any illegal parameter is not passed. 2. The bottom-up search path after the top-down search might not work in arch_get_unmapped_area_topdown(). Note: The first and third chunk of my patch, which changes "len" check, are for more precise check using mmap_min_addr, and not for solving the above problem. [How to reproduce] --- test.c ------------------------------------------------- #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #include <sys/errno.h> int main(int argc, char *argv[]) { void *ret = NULL, *last_map; size_t pagesize = sysconf(_SC_PAGESIZE); do { last_map = ret; ret = mmap(0, pagesize, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // printf("ret=%p\n", ret); } while (ret != MAP_FAILED); if (errno != ENOMEM) { printf("ERR: unexpected errno: %d (last map=%p)\n", errno, last_map); } return 0; } --------------------------------------------------------------- $ gcc -m32 -o test test.c $ sudo sysctl -w vm.mmap_min_addr=65536 vm.mmap_min_addr = 65536 $ ./test (run as non-priviledge user) ERR: unexpected errno: 1 (last map=0x10000) Signed-off-by: Akira Takeuchi <takeuchi.akr@jp.panasonic.com> Signed-off-by: Kiyoshi Owada <owada.kiyoshi@jp.panasonic.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-29slub: Handle NULL parameter in kmem_cache_flagsChristoph Lameter1-2/+2
commit c6f58d9b362b45c52afebe4342c9137d0dabe47f upstream. Andreas Herrmann writes: When I've used slub_debug kernel option (e.g. "slub_debug=,skbuff_fclone_cache" or similar) on a debug session I've seen a panic like: Highbank #setenv bootargs console=ttyAMA0 root=/dev/sda2 kgdboc.kgdboc=ttyAMA0,115200 slub_debug=,kmalloc-4096 earlyprintk=ttyAMA0 ... Unable to handle kernel NULL pointer dereference at virtual address 00000000 pgd = c0004000 [00000000] *pgd=00000000 Internal error: Oops: 5 [#1] SMP ARM Modules linked in: CPU: 0 PID: 0 Comm: swapper Tainted: G W 3.12.0-00048-gbe408cd #314 task: c0898360 ti: c088a000 task.ti: c088a000 PC is at strncmp+0x1c/0x84 LR is at kmem_cache_flags.isra.46.part.47+0x44/0x60 pc : [<c02c6da0>] lr : [<c0110a3c>] psr: 200001d3 sp : c088bea8 ip : c088beb8 fp : c088beb4 r10: 00000000 r9 : 413fc090 r8 : 00000001 r7 : 00000000 r6 : c2984a08 r5 : c0966e78 r4 : 00000000 r3 : 0000006b r2 : 0000000c r1 : 00000000 r0 : c2984a08 Flags: nzCv IRQs off FIQs off Mode SVC_32 ISA ARM Segment kernel Control: 10c5387d Table: 0000404a DAC: 00000015 Process swapper (pid: 0, stack limit = 0xc088a248) Stack: (0xc088bea8 to 0xc088c000) bea0: c088bed4 c088beb8 c0110a3c c02c6d90 c0966e78 00000040 bec0: ef001f00 00000040 c088bf14 c088bed8 c0112070 c0110a04 00000005 c010fac8 bee0: c088bf5c c088bef0 c010fac8 ef001f00 00000040 00000000 00000040 00000001 bf00: 413fc090 00000000 c088bf34 c088bf18 c0839190 c0112040 00000000 ef001f00 bf20: 00000000 00000000 c088bf54 c088bf38 c0839200 c083914c 00000006 c0961c4c bf40: c0961c28 00000000 c088bf7c c088bf58 c08392ac c08391c0 c08a2ed8 c0966e78 bf60: c086b874 c08a3f50 c0961c28 00000001 c088bfb4 c088bf80 c083b258 c0839248 bf80: 2f800000 0f000000 c08935b4 ffffffff c08cd400 ffffffff c08cd400 c0868408 bfa0: c29849c0 00000000 c088bff4 c088bfb8 c0824974 c083b1e4 ffffffff ffffffff bfc0: c08245c0 00000000 00000000 c0868408 00000000 10c5387d c0892bcc c0868404 bfe0: c0899440 0000406a 00000000 c088bff8 00008074 c0824824 00000000 00000000 [<c02c6da0>] (strncmp+0x1c/0x84) from [<c0110a3c>] (kmem_cache_flags.isra.46.part.47+0x44/0x60) [<c0110a3c>] (kmem_cache_flags.isra.46.part.47+0x44/0x60) from [<c0112070>] (__kmem_cache_create+0x3c/0x410) [<c0112070>] (__kmem_cache_create+0x3c/0x410) from [<c0839190>] (create_boot_cache+0x50/0x74) [<c0839190>] (create_boot_cache+0x50/0x74) from [<c0839200>] (create_kmalloc_cache+0x4c/0x88) [<c0839200>] (create_kmalloc_cache+0x4c/0x88) from [<c08392ac>] (create_kmalloc_caches+0x70/0x114) [<c08392ac>] (create_kmalloc_caches+0x70/0x114) from [<c083b258>] (kmem_cache_init+0x80/0xe0) [<c083b258>] (kmem_cache_init+0x80/0xe0) from [<c0824974>] (start_kernel+0x15c/0x318) [<c0824974>] (start_kernel+0x15c/0x318) from [<00008074>] (0x8074) Code: e3520000 01a00002 089da800 e5d03000 (e5d1c000) ---[ end trace 1b75b31a2719ed1d ]--- Kernel panic - not syncing: Fatal exception Problem is that slub_debug option is not parsed before create_boot_cache is called. Solve this by changing slub_debug to early_param. Kernels 3.11, 3.10 are also affected. I am not sure about older kernels. Christoph Lameter explains: kmem_cache_flags may be called with NULL parameter during early boot. Skip the test in that case. Reported-by: Andreas Herrmann <andreas.herrmann@calxeda.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13mm/vmalloc.c: fix an overflow bug in alloc_vmap_area()Zhang Yanfei1-3/+3
commit bcb615a81b1765864c71c50afb56631e7a1e5283 upstream. When searching a vmap area in the vmalloc space, we use (addr + size - 1) to check if the value is less than addr, which is an overflow. But we assign (addr + size) to vmap_area->va_end. So if we come across the below case: (addr + size - 1) : not overflow (addr + size) : overflow we will assign an overflow value (e.g 0) to vmap_area->va_end, And this will trigger BUG in __insert_vmap_area, causing system panic. So using (addr + size) to check the overflow should be the correct behaviour, not (addr + size - 1). Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reported-by: Ghennadi Procopciuc <unix140@gmail.com> Tested-by: Daniel Baluta <dbaluta@ixiacom.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Anatoly Muliarski <x86ever@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13mm/pagewalk.c: fix walk_page_range() access of wrong PTEsChen LinX1-1/+1
commit 3017f079efd6af199b0852b5c425364513db460e upstream. When walk_page_range walk a memory map's page tables, it'll skip VM_PFNMAP area, then variable 'next' will to assign to vma->vm_end, it maybe larger than 'end'. In next loop, 'addr' will be larger than 'next'. Then in /proc/XXXX/pagemap file reading procedure, the 'addr' will growing forever in pagemap_pte_range, pte_to_pagemap_entry will access the wrong pte. BUG: Bad page map in process procrank pte:8437526f pmd:785de067 addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping: (null) index:9108d CPU: 1 PID: 4974 Comm: procrank Tainted: G B W O 3.10.1+ #1 Call Trace: dump_stack+0x16/0x18 print_bad_pte+0x114/0x1b0 vm_normal_page+0x56/0x60 pagemap_pte_range+0x17a/0x1d0 walk_page_range+0x19e/0x2c0 pagemap_read+0x16e/0x200 vfs_read+0x84/0x150 SyS_read+0x4a/0x80 syscall_call+0x7/0xb Signed-off-by: Liu ShuoX <shuox.liu@intel.com> Signed-off-by: Chen LinX <linx.z.chen@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13mm: Account for a THP NUMA hinting update as one PTE updateMel Gorman1-1/+1
commit 0255d491848032f6c601b6410c3b8ebded3a37b1 upstream. A THP PMD update is accounted for as 512 pages updated in vmstat. This is large difference when estimating the cost of automatic NUMA balancing and can be misleading when comparing results that had collapsed versus split THP. This patch addresses the accounting issue. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-10-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13mm: Close races between THP migration and PMD numa clearingMel Gorman2-26/+26
commit 3f926ab945b60a5824369d21add7710622a2eac0 upstream. THP migration uses the page lock to guard against parallel allocations but there are cases like this still open Task A Task B --------------------- --------------------- do_huge_pmd_numa_page do_huge_pmd_numa_page lock_page mpol_misplaced == -1 unlock_page goto clear_pmdnuma lock_page mpol_misplaced == 2 migrate_misplaced_transhuge pmd = pmd_mknonnuma set_pmd_at During hours of testing, one crashed with weird errors and while I have no direct evidence, I suspect something like the race above happened. This patch extends the page lock to being held until the pmd_numa is cleared to prevent migration starting in parallel while the pmd_numa is being cleared. It also flushes the old pmd entry and orders pagetable insertion before rmap insertion. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-9-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13mm: numa: Sanitize task_numa_fault() callsitesMel Gorman2-44/+34
commit c61109e34f60f6e85bb43c5a1cd51c0e3db40847 upstream. There are three callers of task_numa_fault(): - do_huge_pmd_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_numa_page(): Accounts against the current node, not the node where the page resides, unless we migrated, in which case it accounts against the node we migrated to. - do_pmd_numa_page(): Accounts not at all when the page isn't migrated, otherwise accounts against the node we migrated towards. This seems wrong to me; all three sites should have the same sementaics, furthermore we should accounts against where the page really is, we already know where the task is. So modify all three sites to always account; we did after all receive the fault; and always account to where the page is after migration, regardless of success. They all still differ on when they clear the PTE/PMD; ideally that would get sorted too. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-8-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-13mm: Prevent parallel splits during THP migrationMel Gorman1-14/+30
commit 587fe586f44a48f9691001ba6c45b86c8e4ba21f upstream. THP migrations are serialised by the page lock but on its own that does not prevent THP splits. If the page is split during THP migration then the pmd_same checks will prevent page table corruption but the unlock page and other fix-ups potentially will cause corruption. This patch takes the anon_vma lock to prevent parallel splits during migration. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-7-git-send-email-mgorman@suse.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>