Age | Commit message (Collapse) | Author | Files | Lines |
|
commit 70ef57e6c22c3323dce179b7d0d433c479266612 upstream.
We had a report about strange OOM killer strikes on a PPC machine
although there was a lot of swap free and a tons of anonymous memory
which could be swapped out. In the end it turned out that the OOM was a
side effect of zone reclaim which wasn't unmapping and swapping out and
so the system was pushed to the OOM. Although this sounds like a bug
somewhere in the kswapd vs. zone reclaim vs. direct reclaim
interaction numactl on the said hardware suggests that the zone reclaim
should not have been set in the first place:
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus:
node 2 size: 7168 MB
node 2 free: 6019 MB
node distances:
node 0 2
0: 10 40
2: 40 10
So all the CPUs are associated with Node0 which doesn't have any memory
while Node2 contains all the available memory. Node distances cause an
automatic zone_reclaim_mode enabling.
Zone reclaim is intended to keep the allocations local but this doesn't
make any sense on the memoryless nodes. So let's exclude such nodes for
init_zone_allows_reclaim which evaluates zone reclaim behavior and
suitable reclaim_nodes.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit abc40bd2eeb77eb7c2effcaf63154aad929a1d5f upstream.
This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
NUMA type from the pmd to the pte"). If a huge page is being split due
a protection change and the tail will be in a PROT_NONE vma then NUMA
hinting PTEs are temporarily created in the protected VMA.
VM_RW|VM_PROTNONE
|-----------------|
^
split here
In the specific case above, it should get fixed up by change_pte_range()
but there is a window of opportunity for weirdness to happen. Similarly,
if a huge page is shrunk and split during a protection update but before
pmd_numa is cleared then a pte_numa can be left behind.
Instead of adding complexity trying to deal with the case, this patch
will not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
will not be triggered which is marginal in comparison to the complexity
in dealing with the corner cases during THP split.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f8303c2582b889351e261ff18c4d8eb197a77db2 upstream.
In __split_huge_page_map(), the check for page_mapcount(page) is
invariant within the for loop. Because of the fact that the macro is
implemented using atomic_read(), the redundant check cannot be optimized
away by the compiler leading to unnecessary read to the page structure.
This patch moves the invariant bug check out of the loop so that it will
be done only once. On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the
patch respectively. The performance gain is about 1%.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 457c1b27ed56ec472d202731b12417bff023594a upstream.
Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
related to the fact that hugetlbfs is properly not correctly setting
itself up in this state?:
Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....
In KVM guests on Power, in a guest not backed by hugepages, we see the
following:
AnonHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 64 kB
HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init(). Extract the check to a helper function, and use it in a
few relevant places.
This does make hugetlbfs not supported (not registered at all) in this
environment. I believe this is fine, as there are no valid hugepages
and that won't change at runtime.
[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined]
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d3cb8bf6081b8b7a2dabb1264fe968fd870fa595 upstream.
A migration entry is marked as write if pte_write was true at the time the
entry was created. The VMA protections are not double checked when migration
entries are being removed as mprotect marks write-migration-entries as
read. It means that potentially we take a spurious fault to mark PTEs write
again but it's straight-forward. However, there is a race between write
migrations being marked read and migrations finishing. This potentially
allows a PTE to be write that should have been read. Close this race by
double checking the VMA permissions using maybe_mkwrite when migration
completes.
[torvalds@linux-foundation.org: use maybe_mkwrite]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dbab31aa2ceec2d201966fa0b552f151310ba5f4 upstream.
This fixes the same bug as b43790eedd31 ("mm: softdirty: don't forget to
save file map softdiry bit on unmap") and 9aed8614af5a ("mm/memory.c:
don't forget to set softdirty on file mapped fault") where the return
value of pte_*mksoft_dirty was being ignored.
To be sure that no other pte/pmd "mk" function return values were being
ignored, I annotated the functions in arch/x86/include/asm/pgtable.h
with __must_check and rebuilt.
The userspace effect of this bug is that the softdirty mark might be
lost if a file mapped pte get zapped.
Signed-off-by: Peter Feiner <pfeiner@google.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d4a5fca592b9ab52b90bb261a90af3c8f53be011 upstream.
Since commit 4590685546a3 ("mm/sl[aou]b: Common alignment code"), the
"ralign" automatic variable in __kmem_cache_create() may be used as
uninitialized.
The proper alignment defaults to BYTES_PER_WORD and can be overridden by
SLAB_RED_ZONE or the alignment specified by the caller.
This fixes https://bugzilla.kernel.org/show_bug.cgi?id=85031
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrei Elovikov <a.elovikov@gmail.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 849f5169097e1ba35b90ac9df76b5bb6f9c0aabd upstream.
If pcpu_map_pages() fails midway, it unmaps the already mapped pages.
Currently, it doesn't flush tlb after the partial unmapping. This may
be okay in most cases as the established mapping hasn't been used at
that point but it can go wrong and when it goes wrong it'd be
extremely difficult to track down.
Flush tlb after the partial unmapping.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f0d279654dea22b7a6ad34b9334aee80cda62cde upstream.
When pcpu_alloc_pages() fails midway, pcpu_free_pages() is invoked to
free what has already been allocated. The invocation is across the
whole requested range and pcpu_free_pages() will try to free all
non-NULL pages; unfortunately, this is incorrect as
pcpu_get_pages_and_bitmap(), unlike what its comment suggests, doesn't
clear the pages array and thus the array may have entries from the
previous invocations making the partial failure path free incorrect
pages.
Fix it by open-coding the partial freeing of the already allocated
pages.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3189eddbcafcc4d827f7f19facbeddec4424eba8 upstream.
Currently, only SMP system free the percpu allocation info.
Uniprocessor system should free it too. For example, one x86 UML
virtual machine with 256MB memory, UML kernel wastes one page memory.
Signed-off-by: Honggang Li <enjoymindful@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b928095b0a7cff7fb9fcf4c706348ceb8ab2c295 upstream.
If overwriting an empty directory with rename, then need to drop the extra
nlink.
Test prog:
#include <stdio.h>
#include <fcntl.h>
#include <err.h>
#include <sys/stat.h>
int main(void)
{
const char *test_dir1 = "test-dir1";
const char *test_dir2 = "test-dir2";
int res;
int fd;
struct stat statbuf;
res = mkdir(test_dir1, 0777);
if (res == -1)
err(1, "mkdir(\"%s\")", test_dir1);
res = mkdir(test_dir2, 0777);
if (res == -1)
err(1, "mkdir(\"%s\")", test_dir2);
fd = open(test_dir2, O_RDONLY);
if (fd == -1)
err(1, "open(\"%s\")", test_dir2);
res = rename(test_dir1, test_dir2);
if (res == -1)
err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);
res = fstat(fd, &statbuf);
if (res == -1)
err(1, "fstat(%i)", fd);
if (statbuf.st_nlink != 0) {
fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
return 1;
}
return 0;
}
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0cfb8f0c3e21e36d4a6e472e4c419d58ba848698 upstream.
In memblock_find_in_range_node(), we defined ret as int. But it should
be phys_addr_t because it is used to store the return value from
__memblock_find_range_bottom_up().
The bug has not been triggered because when allocating low memory near
the kernel end, the "int ret" won't turn out to be negative. When we
started to allocate memory on other nodes, and the "int ret" could be
minus. Then the kernel will panic.
A simple way to reproduce this: comment out the following code in
numa_init(),
memblock_set_bottom_up(false);
and the kernel won't boot.
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Tested-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4449a51a7c281602d3a385044ab928322a122a02 upstream.
Aleksei hit the soft lockup during reading /proc/PID/smaps. David
investigated the problem and suggested the right fix.
while_each_thread() is racy and should die, this patch updates
vm_is_stack().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Aleksei Besogonov <alex.besogonov@gmail.com>
Tested-by: Aleksei Besogonov <alex.besogonov@gmail.com>
Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2bcf2e92c3918ce62ab4e934256e47e9a16d19c3 upstream.
Paul Furtado has reported the following GPF:
general protection fault: 0000 [#1] SMP
Modules linked in: ipv6 dm_mod xen_netfront coretemp hwmon x86_pkg_temp_thermal crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr ext4 jbd2 mbcache raid0 xen_blkfront
CPU: 3 PID: 3062 Comm: java Not tainted 3.16.0-rc5 #1
task: ffff8801cfe8f170 ti: ffff8801d2ec4000 task.ti: ffff8801d2ec4000
RIP: e030:mem_cgroup_oom_synchronize+0x140/0x240
RSP: e02b:ffff8801d2ec7d48 EFLAGS: 00010283
RAX: 0000000000000001 RBX: ffff88009d633800 RCX: 000000000000000e
RDX: fffffffffffffffe RSI: ffff88009d630200 RDI: ffff88009d630200
RBP: ffff8801d2ec7da8 R08: 0000000000000012 R09: 00000000fffffffe
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88009d633800
R13: ffff8801d2ec7d48 R14: dead000000100100 R15: ffff88009d633a30
FS: 00007f1748bb4700(0000) GS:ffff8801def80000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f4110300308 CR3: 00000000c05f7000 CR4: 0000000000002660
Call Trace:
pagefault_out_of_memory+0x18/0x90
mm_fault_error+0xa9/0x1a0
__do_page_fault+0x478/0x4c0
do_page_fault+0x2c/0x40
page_fault+0x28/0x30
Code: 44 00 00 48 89 df e8 40 ca ff ff 48 85 c0 49 89 c4 74 35 4c 8b b0 30 02 00 00 4c 8d b8 30 02 00 00 4d 39 fe 74 1b 0f 1f 44 00 00 <49> 8b 7e 10 be 01 00 00 00 e8 42 d2 04 00 4d 8b 36 4d 39 fe 75
RIP mem_cgroup_oom_synchronize+0x140/0x240
Commit fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and
wakeup") has moved mem_cgroup_oom_notify outside of memcg_oom_lock
assuming it is protected by the hierarchical OOM-lock.
Although this is true for the notification part the protection doesn't
cover unregistration of event which can happen in parallel now so
mem_cgroup_oom_notify can see already unlinked and/or freed
mem_cgroup_eventfd_list.
Fix this by using memcg_oom_lock also in mem_cgroup_oom_notify.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=80881
Fixes: fb2a6fc56be6 (mm: memcg: rework and document OOM waiting and wakeup)
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Paul Furtado <paulfurtado91@gmail.com>
Tested-by: Paul Furtado <paulfurtado91@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b104a35d32025ca740539db2808aa3385d0f30eb upstream.
The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
should be set in allocflags. ALLOC_CPUSET controls if a page allocation
should be restricted only to the set of allowed cpuset mems.
Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
the fault path from using memory compaction or direct reclaim. Thus, it
is unfairly able to allocate outside of its cpuset mems restriction as a
side-effect.
This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
truly GFP_ATOMIC by verifying it is also not a thp allocation.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Alex Thorlton <athorlton@sgi.com>
Tested-by: Alex Thorlton <athorlton@sgi.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hedi Berriche <hedi@sgi.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f6789593d5cea42a4ecb1cbeab6a23ade5ebbba7 upstream.
Under memory pressure, it is possible for dirty_thresh, calculated by
global_dirty_limits() in balance_dirty_pages(), to equal zero. Then, if
strictlimit is true, bdi_dirty_limits() tries to resolve the proportion:
bdi_bg_thresh : bdi_thresh = background_thresh : dirty_thresh
by dividing by zero.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 0253d634e0803a8376a0d88efee0bf523d8673f9 upstream.
Commit 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
migration/hwpoisoned entry") changed the order of
huge_ptep_set_wrprotect() and huge_ptep_get(), which leads to breakage
in some workloads like hugepage-backed heap allocation via libhugetlbfs.
This patch fixes it.
The test program for the problem is shown below:
$ cat heap.c
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#define HPS 0x200000
int main() {
int i;
char *p = malloc(HPS);
memset(p, '1', HPS);
for (i = 0; i < 5; i++) {
if (!fork()) {
memset(p, '2', HPS);
p = malloc(HPS);
memset(p, '3', HPS);
free(p);
return 0;
}
}
sleep(1);
free(p);
return 0;
}
$ export HUGETLB_MORECORE=yes ; export HUGETLB_NO_PREFAULT= ; hugectl --heap ./heap
Fixes 4a705fef9862 ("hugetlb: fix copy_hugetlb_page_range() to handle
migration/hwpoisoned entry"), so is applicable to -stable kernels which
include it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Guillaume Morin <guillaume@morinfr.org>
Suggested-by: Guillaume Morin <guillaume@morinfr.org>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 694617474e33b8603fc76e090ed7d09376514b1a upstream.
The patch 3e374919b314f20e2a04f641ebc1093d758f66a4 is supposed to fix the
problem where kmem_cache_create incorrectly reports duplicate cache name
and fails. The problem is described in the header of that patch.
However, the patch doesn't really fix the problem because of these
reasons:
* the logic to test for debugging is reversed. It was intended to perform
the check only if slub debugging is enabled (which implies that caches
with the same parameters are not merged). Therefore, there should be
#if !defined(CONFIG_SLUB) || defined(CONFIG_SLUB_DEBUG_ON)
The current code has the condition reversed and performs the test if
debugging is disabled.
* slub debugging may be enabled or disabled based on kernel command line,
CONFIG_SLUB_DEBUG_ON is just the default settings. Therefore the test
based on definition of CONFIG_SLUB_DEBUG_ON is unreliable.
This patch fixes the problem by removing the test
"!defined(CONFIG_SLUB_DEBUG_ON)". Therefore, duplicate names are never
checked if the SLUB allocator is used.
Note to stable kernel maintainers: when backporint this patch, please
backport also the patch 3e374919b314f20e2a04f641ebc1093d758f66a4.
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b738d764652dc5aab1c8939f637112981fce9e0e upstream.
shrink_inactive_list() used to wait 0.1s to avoid congestion when all
the pages that were isolated from the inactive list were dirty but not
under active writeback. That makes no real sense, and apparently causes
major interactivity issues under some loads since 3.11.
The ostensible reason for it was to wait for kswapd to start writing
pages, but that seems questionable as well, since the congestion wait
code seems to trigger for kswapd itself as well. Also, the logic behind
delaying anything when we haven't actually started writeback is not
clear - it only delays actually starting that writeback.
We'll still trigger the congestion waiting if
(a) the process is kswapd, and we hit pages flagged for immediate
reclaim
(b) the process is not kswapd, and the zone backing dev writeback is
actually congested.
This probably needs to be revisited, but as it is this fixes a reported
regression.
Reported-by: Felipe Contreras <felipe.contreras@gmail.com>
Pinpointed-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[mhocko@suse.cz: backport to 3.12 stable tree]
Fixes: e2be15f6c3ee ('mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered')
Reported-by: Felipe Contreras <felipe.contreras@gmail.com>
Pinpointed-by: Hillf Danton <dhillf@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b1a366500bd537b50c3aad26dc7df083ec03a448 upstream.
shmem_fault() is the actual culprit in trinity's hole-punch starvation,
and the most significant cause of such problems: since a page faulted is
one that then appears page_mapped(), needing unmap_mapping_range() and
i_mmap_mutex to be unmapped again.
But it is not the only way in which a page can be brought into a hole in
the radix_tree while that hole is being punched; and Vlastimil's testing
implies that if enough other processors are busy filling in the hole,
then shmem_undo_range() can be kept from completing indefinitely.
shmem_file_splice_read() is the main other user of SGP_CACHE, which can
instantiate shmem pagecache pages in the read-only case (without holding
i_mutex, so perhaps concurrently with a hole-punch). Probably it's
silly not to use SGP_READ already (using the ZERO_PAGE for holes): which
ought to be safe, but might bring surprises - not a change to be rushed.
shmem_read_mapping_page_gfp() is an internal interface used by
drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
called internally by the kernel (perhaps for a stacking filesystem,
which might rely on holes to be reserved): it's unclear whether it could
be provoked to keep hole-punch busy or not.
We could apply the same umbrella as now used in shmem_fault() to
shmem_file_splice_read() and the others; but it looks ugly, and use over
a range raises questions - should it actually be per page? can these get
starved themselves?
The origin of this part of the problem is my v3.1 commit d0823576bf4b
("mm: pincer in truncate_inode_pages_range"), once it was duplicated
into shmem.c. It seemed like a nice idea at the time, to ensure
(barring RCU lookup fuzziness) that there's an instant when the entire
hole is empty; but the indefinitely repeated scans to ensure that make
it vulnerable.
Revert that "enhancement" to hole-punch from shmem_undo_range(), but
retain the unproblematic rescanning when it's truncating; add a couple
of comments there.
Remove the "indices[0] >= end" test: that is now handled satisfactorily
by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
to be worth avoiding here.
But if we do not always loop indefinitely, we do need to handle the case
of swap swizzled back to page before shmem_free_swap() gets it: add a
retry for that case, as suggested by Konstantin Khlebnikov; and for the
case of page swizzled back to swap, as suggested by Johannes Weiner.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8e205f779d1443a94b5ae81aa359cb535dd3021e upstream.
Commit f00cdc6df7d7 ("shmem: fix faulting into a hole while it's
punched") was buggy: Sasha sent a lockdep report to remind us that
grabbing i_mutex in the fault path is a no-no (write syscall may already
hold i_mutex while faulting user buffer).
We tried a completely different approach (see following patch) but that
proved inadequate: good enough for a rational workload, but not good
enough against trinity - which forks off so many mappings of the object
that contention on i_mmap_mutex while hole-puncher holds i_mutex builds
into serious starvation when concurrent faults force the puncher to fall
back to single-page unmap_mapping_range() searches of the i_mmap tree.
So return to the original umbrella approach, but keep away from i_mutex
this time. We really don't want to bloat every shmem inode with a new
mutex or completion, just to protect this unlikely case from trinity.
So extend the original with wait_queue_head on stack at the hole-punch
end, and wait_queue item on the stack at the fault end.
This involves further use of i_lock to guard against the races: lockdep
has been happy so far, and I see fs/inode.c:unlock_new_inode() holds
i_lock around wake_up_bit(), which is comparable to what we do here.
i_lock is more convenient, but we could switch to shmem's info->lock.
This issue has been tagged with CVE-2014-4171, which will require commit
f00cdc6df7d7 and this and the following patch to be backported: we
suggest to 3.1+, though in fact the trinity forkbomb effect might go
back as far as 2.6.16, when madvise(,,MADV_REMOVE) came in - or might
not, since much has changed, with i_mmap_mutex a spinlock before 3.0.
Anyone running trinity on 3.0 and earlier? I don't think we need care.
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lukas Czerner <lczerner@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f00cdc6df7d7cfcabb5b740911e6788cb0802bdb upstream.
Trinity finds that mmap access to a hole while it's punched from shmem
can prevent the madvise(MADV_REMOVE) or fallocate(FALLOC_FL_PUNCH_HOLE)
from completing, until the reader chooses to stop; with the puncher's
hold on i_mutex locking out all other writers until it can complete.
It appears that the tmpfs fault path is too light in comparison with its
hole-punching path, lacking an i_data_sem to obstruct it; but we don't
want to slow down the common case.
Extend shmem_fallocate()'s existing range notification mechanism, so
shmem_fault() can refrain from faulting pages into the hole while it's
punched, waiting instead on i_mutex (when safe to sleep; or repeatedly
faulting when not).
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 391acf970d21219a2a5446282d3b20eace0c0d7a upstream.
When runing with the kernel(3.15-rc7+), the follow bug occurs:
[ 9969.258987] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:586
[ 9969.359906] in_atomic(): 1, irqs_disabled(): 0, pid: 160655, name: python
[ 9969.441175] INFO: lockdep is turned off.
[ 9969.488184] CPU: 26 PID: 160655 Comm: python Tainted: G A 3.15.0-rc7+ #85
[ 9969.581032] Hardware name: FUJITSU-SV PRIMEQUEST 1800E/SB, BIOS PRIMEQUEST 1000 Series BIOS Version 1.39 11/16/2012
[ 9969.706052] ffffffff81a20e60 ffff8803e941fbd0 ffffffff8162f523 ffff8803e941fd18
[ 9969.795323] ffff8803e941fbe0 ffffffff8109995a ffff8803e941fc58 ffffffff81633e6c
[ 9969.884710] ffffffff811ba5dc ffff880405c6b480 ffff88041fdd90a0 0000000000002000
[ 9969.974071] Call Trace:
[ 9970.003403] [<ffffffff8162f523>] dump_stack+0x4d/0x66
[ 9970.065074] [<ffffffff8109995a>] __might_sleep+0xfa/0x130
[ 9970.130743] [<ffffffff81633e6c>] mutex_lock_nested+0x3c/0x4f0
[ 9970.200638] [<ffffffff811ba5dc>] ? kmem_cache_alloc+0x1bc/0x210
[ 9970.272610] [<ffffffff81105807>] cpuset_mems_allowed+0x27/0x140
[ 9970.344584] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.409282] [<ffffffff811b1385>] __mpol_dup+0xe5/0x150
[ 9970.471897] [<ffffffff811b1303>] ? __mpol_dup+0x63/0x150
[ 9970.536585] [<ffffffff81068c86>] ? copy_process.part.23+0x606/0x1d40
[ 9970.613763] [<ffffffff810bf28d>] ? trace_hardirqs_on+0xd/0x10
[ 9970.683660] [<ffffffff810ddddf>] ? monotonic_to_bootbased+0x2f/0x50
[ 9970.759795] [<ffffffff81068cf0>] copy_process.part.23+0x670/0x1d40
[ 9970.834885] [<ffffffff8106a598>] do_fork+0xd8/0x380
[ 9970.894375] [<ffffffff81110e4c>] ? __audit_syscall_entry+0x9c/0xf0
[ 9970.969470] [<ffffffff8106a8c6>] SyS_clone+0x16/0x20
[ 9971.030011] [<ffffffff81642009>] stub_clone+0x69/0x90
[ 9971.091573] [<ffffffff81641c29>] ? system_call_fastpath+0x16/0x1b
The cause is that cpuset_mems_allowed() try to take
mutex_lock(&callback_mutex) under the rcu_read_lock(which was hold in
__mpol_dup()). And in cpuset_mems_allowed(), the access to cpuset is
under rcu_read_lock, so in __mpol_dup, we can reduce the rcu_read_lock
protection region to protect the access to cpuset only in
current_cpuset_is_being_rebound(). So that we can avoid this bug.
This patch is a temporary solution that just addresses the bug
mentioned above, can not fix the long-standing issue about cpuset.mems
rebinding on fork():
"When the forker's task_struct is duplicated (which includes
->mems_allowed) and it races with an update to cpuset_being_rebound
in update_tasks_nodemask() then the task's mems_allowed doesn't get
updated. And the child task's mems_allowed can be wrong if the
cpuset's nodemask changes before the child has been added to the
cgroup's tasklist."
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 042d27acb64924a0e8a43e972485913a32407beb upstream.
This patch affects only architectures where the stack grows upwards
(currently parisc and metag only). On those do not hardcode the maximum
initial stack size to 1GB for 32-bit processes, but make it configurable
via a config option.
The main problem with the hardcoded stack size is, that we have two
memory regions which grow upwards: stack and heap. To keep most of the
memory available for heap in a flexmap memory layout, it makes no sense
to hard allocate up to 1GB of the memory for stack which can't be used
as heap then.
This patch makes the stack size for 32-bit processes configurable and
uses 80MB as default value which has been in use during the last few
years on parisc and which hasn't showed any problems yet.
Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: linux-parisc@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Cc: John David Anglin <dave.anglin@bell.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d05f0cdcbe6388723f1900c549b4850360545201 upstream.
In v2.6.34 commit 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
introduced vma merging to mbind(), but it should have also changed the
convention of passing start vma from queue_pages_range() (formerly
check_range()) to new_vma_page(): vma merging may have already freed
that structure, resulting in BUG at mm/mempolicy.c:1738 and probably
worse crashes.
Fixes: 9d8cebd4bcd7 ("mm: fix mbind vma merge problem")
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 03787301420376ae41fbaf4267f4a6253d152ac5 upstream.
Commit b1cb0982bdd6 ("change the management method of free objects of
the slab") introduced a bug on slab leak detector
('/proc/slab_allocators'). This detector works like as following
decription.
1. traverse all objects on all the slabs.
2. determine whether it is active or not.
3. if active, print who allocate this object.
but that commit changed the way how to manage free objects, so the logic
determining whether it is active or not is also changed. In before, we
regard object in cpu caches as inactive one, but, with this commit, we
mistakenly regard object in cpu caches as active one.
This intoduces kernel oops if DEBUG_PAGEALLOC is enabled. If
DEBUG_PAGEALLOC is enabled, kernel_map_pages() is used to detect who
corrupt free memory in the slab. It unmaps page table mapping if object
is free and map it if object is active. When slab leak detector check
object in cpu caches, it mistakenly think this object active so try to
access object memory to retrieve caller of allocation. At this point,
page table mapping to this object doesn't exist, so oops occurs.
Following is oops message reported from Dave.
It blew up when something tried to read /proc/slab_allocators
(Just cat it, and you should see the oops below)
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in:
[snip...]
CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
RIP: 0010:[<ffffffffaa1a8f4a>] [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180
RSP: 0018:ffff880076925de0 EFLAGS: 00010002
RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
FS: 00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
Call Trace:
leaks_show+0xce/0x240
seq_read+0x28e/0x490
proc_reg_read+0x3d/0x80
vfs_read+0x9b/0x160
SyS_read+0x58/0xb0
tracesys+0xd4/0xd9
Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
RIP handle_slab+0x8a/0x180
To fix the problem, I introduce an object status buffer on each slab.
With this, we can track object status precisely, so slab leak detector
would not access active object and no kernel oops would occur. Memory
overhead caused by this fix is only imposed to CONFIG_DEBUG_SLAB_LEAK
which is mainly used for debugging, so memory overhead isn't big
problem.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 107437febd495a50e2cd09c81bbaa84d30e57b07 upstream.
Changing PTEs and PMDs to pte_numa & pmd_numa is done with the
mmap_sem held for reading, which means a pmd can be instantiated
and turned into a numa one while __handle_mm_fault() is examining
the value of old_pmd.
If that happens, __handle_mm_fault() should just return and let
the page fault retry, instead of throwing an oops. This is
handled by the test for pmd_trans_huge(*pmd) below.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Sunil Pandey <sunil.k.pandey@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-mm@kvack.org
Cc: lwoodman@redhat.com
Cc: dave.hansen@intel.com
Link: http://lkml.kernel.org/r/20140429153615.2d72098e@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Patrick McLean <chutzpah@gentoo.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5 upstream.
With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
the following is triggered at early boot:
SMP: Total of 8 processors activated.
devtmpfs: initialized
Unable to handle kernel NULL pointer dereference at virtual address 00000008
pgd = fffffe0000050000
[00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
PC is at __list_add+0x10/0xd4
LR is at free_one_page+0x270/0x638
...
Call trace:
__list_add+0x10/0xd4
free_one_page+0x26c/0x638
__free_pages_ok.part.52+0x84/0xbc
__free_pages+0x74/0xbc
init_cma_reserved_pageblock+0xe8/0x104
cma_init_reserved_areas+0x190/0x1e4
do_one_initcall+0xc4/0x154
kernel_init_freeable+0x204/0x2a8
kernel_init+0xc/0xd4
This happens because init_cma_reserved_pageblock() calls
__free_one_page() with pageblock_order as page order but it is bigger
than MAX_ORDER. This in turn causes accesses past zone->free_list[].
Fix the problem by changing init_cma_reserved_pageblock() such that it
splits pageblock into individual MAX_ORDER pages if pageblock is bigger
than a MAX_ORDER page.
In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
architectures expect for ia64, powerpc and tile at the moment, the
âpageblock_order > MAX_ORDERâ condition will be optimised out since both
sides of the operator are constants. In cases where pageblock size is
variable, the performance degradation should not be significant anyway
since init_cma_reserved_pageblock() is called only at boot time at most
MAX_CMA_AREAS times which by default is eight.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Mark Salter <msalter@redhat.com>
Tested-by: Christopher Covington <cov@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7cd2b0a34ab8e4db971920eef8982f985441adfb upstream.
Oleg reports a division by zero error on zero-length write() to the
percpu_pagelist_fraction sysctl:
divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
RSP: 0018:ffff8800d87a3e78 EFLAGS: 00010246
RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
FS: 00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
Call Trace:
proc_sys_call_handler+0xb3/0xc0
proc_sys_write+0x14/0x20
vfs_write+0xba/0x1e0
SyS_write+0x46/0xb0
tracesys+0xe1/0xe6
However, if the percpu_pagelist_fraction sysctl is set by the user, it
is also impossible to restore it to the kernel default since the user
cannot write 0 to the sysctl.
This patch allows the user to write 0 to restore the default behavior.
It still requires a fraction equal to or larger than 8, however, as
stated by the documentation for sanity. If a value in the range [1, 7]
is written, the sysctl will return EINVAL.
This successfully solves the divide by zero issue at the same time.
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4a705fef986231a3e7a6b1a6d3c37025f021f49f upstream.
There's a race between fork() and hugepage migration, as a result we try
to "dereference" a swap entry as a normal pte, causing kernel panic.
The cause of the problem is that copy_hugetlb_page_range() can't handle
"swap entry" family (migration entry and hwpoisoned entry) so let's fix
it.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1c8349a17137b93f0a83f276c764a6df1b9a116e upstream.
When we perform a data integrity sync we tag all the dirty pages with
PAGECACHE_TAG_TOWRITE at start of ext4_da_writepages. Later we check
for this tag in write_cache_pages_da and creates a struct
mpage_da_data containing contiguously indexed pages tagged with this
tag and sync these pages with a call to mpage_da_map_and_submit. This
process is done in while loop until all the PAGECACHE_TAG_TOWRITE
pages are synced. We also do journal start and stop in each iteration.
journal_stop could initiate journal commit which would call
ext4_writepage which in turn will call ext4_bio_write_page even for
delayed OR unwritten buffers. When ext4_bio_write_page is called for
such buffers, even though it does not sync them but it clears the
PAGECACHE_TAG_TOWRITE of the corresponding page and hence these pages
are also not synced by the currently running data integrity sync. We
will end up with dirty pages although sync is completed.
This could cause a potential data loss when the sync call is followed
by a truncate_pagecache call, which is exactly the case in
collapse_range. (It will cause generic/127 failure in xfstests)
To avoid this issue, we can use set_page_writeback_keepwrite instead of
set_page_writeback, which doesn't clear TOWRITE tag.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 71abdc15adf8c702a1dd535f8e30df50758848d2 upstream.
When kswapd exits, it can end up taking locks that were previously held
by allocating tasks while they waited for reclaim. Lockdep currently
warns about this:
On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> (&sig->group_rwsem){+++++?}, at: exit_signals+0x24/0x130
> {RECLAIM_FS-ON-W} state was registered at:
> mark_held_locks+0xb9/0x140
> lockdep_trace_alloc+0x7a/0xe0
> kmem_cache_alloc_trace+0x37/0x240
> flex_array_alloc+0x99/0x1a0
> cgroup_attach_task+0x63/0x430
> attach_task_by_pid+0x210/0x280
> cgroup_procs_write+0x16/0x20
> cgroup_file_write+0x120/0x2c0
> vfs_write+0xc0/0x1f0
> SyS_write+0x4c/0xa0
> tracesys+0xdd/0xe2
> irq event stamp: 49
> hardirqs last enabled at (49): _raw_spin_unlock_irqrestore+0x36/0x70
> hardirqs last disabled at (48): _raw_spin_lock_irqsave+0x2b/0xa0
> softirqs last enabled at (0): copy_process.part.24+0x627/0x15f0
> softirqs last disabled at (0): (null)
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&sig->group_rwsem);
> <Interrupt>
> lock(&sig->group_rwsem);
>
> *** DEADLOCK ***
>
> no locks held by kswapd2/1151.
>
> stack backtrace:
> CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> Call Trace:
> dump_stack+0x19/0x1b
> print_usage_bug+0x1f7/0x208
> mark_lock+0x21d/0x2a0
> __lock_acquire+0x52a/0xb60
> lock_acquire+0xa2/0x140
> down_read+0x51/0xa0
> exit_signals+0x24/0x130
> do_exit+0xb5/0xa50
> kthread+0xdb/0x100
> ret_from_fork+0x7c/0xb0
This is because the kswapd thread is still marked as a reclaimer at the
time of exit. But because it is exiting, nobody is actually waiting on
it to make reclaim progress anymore, and it's nothing but a regular
thread at this point. Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7f39dda9d86fb4f4f17af0de170decf125726f8c upstream.
Trinity reports BUG:
sleeping function called from invalid context at kernel/locking/rwsem.c:47
in_atomic(): 0, irqs_disabled(): 0, pid: 5787, name: trinity-c27
__might_sleep < down_write < __put_anon_vma < page_get_anon_vma <
migrate_pages < compact_zone < compact_zone_order < try_to_compact_pages ..
Right, since conversion to mutex then rwsem, we should not put_anon_vma()
from inside an rcu_read_lock()ed section: fix the two places that did so.
And add might_sleep() to anon_vma_free(), as suggested by Peter Zijlstra.
Fixes: 88c22088bf23 ("mm: optimize page_lock_anon_vma() fast-path")
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
SIGBUS(BUS_MCEERR_AO)
commit 3ba08129e38437561df44c36b7ea9081185d5333 upstream.
Currently memory error handler handles action optional errors in the
deferred manner by default. And if a recovery aware application wants
to handle it immediately, it can do it by setting PF_MCE_EARLY flag.
However, such signal can be sent only to the main thread, so it's
problematic if the application wants to have a dedicated thread to
handler such signals.
So this patch adds dedicated thread support to memory error handler. We
have PF_MCE_EARLY flags for each thread separately, so with this patch
AO signal is sent to the thread with PF_MCE_EARLY flag set, not the main
thread. If you want to implement a dedicated thread, you call prctl()
to set PF_MCE_EARLY on the thread.
Memory error handler collects processes to be killed, so this patch lets
it check PF_MCE_EARLY flag on each thread in the collecting routines.
No behavioral change for all non-early kill cases.
Tony said:
: The old behavior was crazy - someone with a multithreaded process might
: well expect that if they call prctl(PF_MCE_EARLY) in just one thread, then
: that thread would see the SIGBUS with si_code = BUS_MCEERR_A0 - even if
: that thread wasn't the main thread for the process.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Kamil Iskra <iskra@mcs.anl.gov>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
MF_ACTION_REQUIRED
commit 74614de17db6fb472370c426d4f934d8d616edf2 upstream.
When Linux sees an "action optional" machine check (where h/w has reported
an error that is not in the current execution path) we generally do not
want to signal a process, since most processes do not have a SIGBUS
handler - we'd just prematurely terminate the process for a problem that
they might never actually see.
task_early_kill() decides whether to consider a process - and it checks
whether this specific process has been marked for early signals with
"prctl", or if the system administrator has requested early signals for
all processes using /proc/sys/vm/memory_failure_early_kill.
But for MF_ACTION_REQUIRED case we must not defer. The error is in the
execution path of the current thread so we must send the SIGBUS
immediatley.
Fix by passing a flag argument through collect_procs*() to
task_early_kill() so it knows whether we can defer or must take action.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit a70ffcac741d31a406c1d2b832ae43d658e7e1cf upstream.
When a thread in a multi-threaded application hits a machine check because
of an uncorrectable error in memory - we want to send the SIGBUS with
si.si_code = BUS_MCEERR_AR to that thread. Currently we fail to do that
if the active thread is not the primary thread in the process.
collect_procs() just finds primary threads and this test:
if ((flags & MF_ACTION_REQUIRED) && t == current) {
will see that the thread we found isn't the current thread and so send a
si.si_code = BUS_MCEERR_AO to the primary (and nothing to the active
thread at this time).
We can fix this by checking whether "current" shares the same mm with the
process that collect_procs() said owned the page. If so, we send the
SIGBUS to current (with code BUS_MCEERR_AR).
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Otto Bruggeman <otto.g.bruggeman@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chen Gong <gong.chen@linux.jf.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e58469bafd0524e848c3733bc3918d854595e20f upstream.
The test_bit operations in get/set pageblock flags are expensive. This
patch reads the bitmap on a word basis and use shifts and masks to isolate
the bits of interest. Similarly masks are used to set a local copy of the
bitmap and then use cmpxchg to update the bitmap if there have been no
other changes made in parallel.
In a test running dd onto tmpfs the overhead of the pageblock-related
functions went from 1.27% in profiles to 0.5%.
In addition to the performance benefits, this patch closes races that are
possible between:
a) get_ and set_pageblock_migratetype(), where get_pageblock_migratetype()
reads part of the bits before and other part of the bits after
set_pageblock_migratetype() has updated them.
b) set_pageblock_migratetype() and set_pageblock_skip(), where the non-atomic
read-modify-update set bit operation in set_pageblock_skip() will cause
lost updates to some bits changed in the set_pageblock_migratetype().
Joonsoo Kim first reported the case a) via code inspection. Vlastimil
Babka's testing with a debug patch showed that either a) or b) occurs
roughly once per mmtests' stress-highalloc benchmark (although not
necessarily in the same pageblock). Furthermore during development of
unrelated compaction patches, it was observed that frequent calls to
{start,undo}_isolate_page_range() the race occurs several thousands of
times and has resulted in NULL pointer dereferences in move_freepages()
and free_one_page() in places where free_list[migratetype] is
manipulated by e.g. list_move(). Further debugging confirmed that
migratetype had invalid value of 6, causing out of bounds access to the
free_list array.
That confirmed that the race exist, although it may be extremely rare,
and currently only fatal where page isolation is performed due to
memory hot remove. Races on pageblocks being updated by
set_pageblock_migratetype(), where both old and new migratetype are
lower MIGRATE_RESERVE, currently cannot result in an invalid value
being observed, although theoretically they may still lead to
unexpected creation or destruction of MIGRATE_RESERVE pageblocks.
Furthermore, things could get suddenly worse when memory isolation is
used more, or when new migratetypes are added.
After this patch, the race has no longer been observed in testing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
ZONE_NORMAL
commit 675becce15f320337499bc1a9356260409a5ba29 upstream.
throttle_direct_reclaim() is meant to trigger during swap-over-network
during which the min watermark is treated as a pfmemalloc reserve. It
throttes on the first node in the zonelist but this is flawed.
The user-visible impact is that a process running on CPU whose local
memory node has no ZONE_NORMAL will stall for prolonged periods of time,
possibly indefintely. This is due to throttle_direct_reclaim thinking the
pfmemalloc reserves are depleted when in fact they don't exist on that
node.
On a NUMA machine running a 32-bit kernel (I know) allocation requests
from CPUs on node 1 would detect no pfmemalloc reserves and the process
gets throttled. This patch adjusts throttling of direct reclaim to
throttle based on the first node in the zonelist that has a usable
ZONE_NORMAL or lower zone.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c177c81e09e517bbf75b67762cdab1b83aba6976 upstream.
Currently hugepage migration is available for all archs which support
pmd-level hugepage, but testing is done only for x86_64 and there're
bugs for other archs. So to avoid breaking such archs, this patch
limits the availability strictly to x86_64 until developers of other
archs get interested in enabling this feature.
Simply disabling hugepage migration on non-x86_64 archs is not enough to
fix the reported problem where sys_move_pages() hits the BUG_ON() in
follow_page(FOLL_GET), so let's fix this by checking if hugepage
migration is supported in vma_migratable().
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 624483f3ea82598ab0f62f1bdb9177f531ab1892 upstream.
While working address sanitizer for kernel I've discovered
use-after-free bug in __put_anon_vma.
For the last anon_vma, anon_vma->root freed before child anon_vma.
Later in anon_vma_free(anon_vma) we are referencing to already freed
anon_vma->root to check rwsem.
This fixes it by freeing the child anon_vma before freeing
anon_vma->root.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d4c54919ed86302094c0ca7d48a8cbd4ee753e92 upstream.
The age table walker doesn't check non-present hugetlb entry in common
path, so hugetlb_entry() callbacks must check it. The reason for this
behavior is that some callers want to handle it in its own way.
[ I think that reason is bogus, btw - it should just do what the regular
code does, which is to call the "pte_hole()" function for such hugetlb
entries - Linus]
However, some callers don't check it now, which causes unpredictable
result, for example when we have a race between migrating hugepage and
reading /proc/pid/numa_maps. This patch fixes it by adding !pte_present
checks on buggy callbacks.
This bug exists for years and got visible by introducing hugepage
migration.
ChangeLog v2:
- fix if condition (check !pte_present() instead of pte_present())
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Backported to 3.15. Signed-off-by: Josh Boyer <jwboyer@fedoraproject.org> ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
low""
This reverts commit 623762517e2370be3b3f95f4fe08d6c063a49b06.
Ben rightly points out that commit 0bf1457f0cfc, which is what this
original commit was reverting, never ended up in 3.14-stable, but was
only for 3.15.
So revert this patch as we now have the same check twice in a row, which
is pretty pointless. Although the comments were "prettier"...
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3e030ecc0fc7de10fd0da10c1c19939872a31717 upstream.
When a memory error happens on an in-use page or (free and in-use)
hugepage, the victim page is isolated with its refcount set to one.
When you try to unpoison it later, unpoison_memory() calls put_page()
for it twice in order to bring the page back to free page pool (buddy or
free hugepage list). However, if another memory error occurs on the
page which we are unpoisoning, memory_failure() returns without
releasing the refcount which was incremented in the same call at first,
which results in memory leak and unconsistent num_poisoned_pages
statistics. This patch fixes it.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5a838c3b60e3a36ade764cf7751b8f17d7c9c2da upstream.
pcpu_chunk_struct_size = sizeof(struct pcpu_chunk) +
BITS_TO_LONGS(pcpu_unit_pages) * sizeof(unsigned long)
It hardly could be ever bigger than PAGE_SIZE even for large-scale machine,
but for consistency with its couterpart pcpu_mem_zalloc(),
use pcpu_mem_free() instead.
Commit b4916cb17c26 ("percpu: make pcpu_free_chunk() use
pcpu_mem_free() instead of kfree()") addressed this problem, but
missed this one.
tj: commit message updated
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 099a19d91ca4 ("percpu: allow limited allocation before slab is online)
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 623762517e2370be3b3f95f4fe08d6c063a49b06 upstream.
This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
just because free+file is low") because it introduced a regression in
mostly-anonymous workloads, where reclaim would become ineffective and
trap every allocating task in direct reclaim.
The problem is that there is a runaway feedback loop in the scan balance
between file and anon, where the balance tips heavily towards a tiny
thrashing file LRU and anonymous pages are no longer being looked at.
The commit in question removed the safe guard that would detect such
situations and respond with forced anonymous reclaim.
This commit was part of a series to fix premature swapping in loads with
relatively little cache, and while it made a small difference, the cure
is obviously worse than the disease. Revert it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 49e068f0b73dd042c186ffa9b420a9943e90389a upstream.
The compaction freepage scanner implementation in isolate_freepages()
starts by taking the current cc->free_pfn value as the first pfn. In a
for loop, it scans from this first pfn to the end of the pageblock, and
then subtracts pageblock_nr_pages from the first pfn to obtain the first
pfn for the next for loop iteration.
This means that when cc->free_pfn starts at offset X rather than being
aligned on pageblock boundary, the scanner will start at offset X in all
scanned pageblock, ignoring potentially many free pages. Currently this
can happen when
a) zone's end pfn is not pageblock aligned, or
b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
enabled and a hole spanning the beginning of a pageblock
This patch fixes the problem by aligning the initial pfn in
isolate_freepages() to pageblock boundary. This also permits replacing
the end-of-pageblock alignment within the for loop with a simple
pageblock_nr_pages increment.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Heesub Shin <heesub.shin@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Dongjun Shin <d.j.shin@samsung.com>
Cc: Sunghwan Yun <sunghwan.yun@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d5c9fde3dae750889168807038243ff36431d276 upstream.
It is possible for "limit - setpoint + 1" to equal zero, after getting
truncated to a 32 bit variable, and resulting in a divide by zero error.
Using the fully 64 bit divide functions avoids this problem. It also
will cause pos_ratio_polynom() to return the correct value when
(setpoint - limit) exceeds 2^32.
Also uninline pos_ratio_polynom, at Andrew's request.
Signed-off-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
hugepage
commit b985194c8c0a130ed155b71662e39f7eaea4876f upstream.
For handling a free hugepage in memory failure, the race will happen if
another thread hwpoisoned this hugepage concurrently. So we need to
check PageHWPoison instead of !PageHWPoison.
If hwpoison_filter(p) returns true or a race happens, then we need to
unlock_page(hpage).
Signed-off-by: Chen Yucong <slaoub@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dd18dbc2d42af75fffa60c77e0f02220bc329829 upstream.
It's critical for split_huge_page() (and migration) to catch and freeze
all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
mremap() since usually we copy/move page table entries on dup_mm() or
move_page_tables() without rmap lock taken. To get it work we rely on
rmap walk order to not miss any entry. We expect to see destination VMA
after source one to work correctly.
But after switching rmap implementation to interval tree it's not always
possible to preserve expected walk order.
It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
/ vma_last_pgoff() and explicitly insert dst VMA after src one with
vma_interval_tree_insert_after().
But on move_vma() destination VMA can be merged into adjacent one and as
result shifted left in interval tree. Fortunately, we can detect the
situation and prevent race with rmap walk by moving page table entries
under rmap lock. See commit 38a76013ad80.
Problem is that we miss the lock when we move transhuge PMD. Most
likely this bug caused the crash[1].
[1] http://thread.gmane.org/gmane.linux.kernel.mm/96473
Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: David Miller <davem@davemloft.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1b17844b29ae042576bea588164f2f1e9590a8bc upstream.
fixup_user_fault() is used by the futex code when the direct user access
fails, and the futex code wants it to either map in the page in a usable
form or return an error. It relied on handle_mm_fault() to map the
page, and correctly checked the error return from that, but while that
does map the page, it doesn't actually guarantee that the page will be
mapped with sufficient permissions to be then accessed.
So do the appropriate tests of the vma access rights by hand.
[ Side note: arguably handle_mm_fault() could just do that itself, but
we have traditionally done it in the caller, because some callers -
notably get_user_pages() - have been able to access pages even when
they are mapped with PROT_NONE. Maybe we should re-visit that design
decision, but in the meantime this is the minimal patch. ]
Found by Dave Jones running his trinity tool.
Reported-by: Dave Jones <davej@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|