From 418589663d6011de9006425b6c5721e1544fb47a Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:32:12 -0700 Subject: page allocator: use allocation flags as an index to the zone watermark ALLOC_WMARK_MIN, ALLOC_WMARK_LOW and ALLOC_WMARK_HIGH determin whether pages_min, pages_low or pages_high is used as the zone watermark when allocating the pages. Two branches in the allocator hotpath determine which watermark to use. This patch uses the flags as an array index into a watermark array that is indexed with WMARK_* defines accessed via helpers. All call sites that use zone->pages_* are updated to use the helpers for accessing the values and the array offsets for setting. Signed-off-by: Mel Gorman Reviewed-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Pekka Enberg Cc: Peter Zijlstra Cc: Nick Piggin Cc: Dave Hansen Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 11 ++++++----- Documentation/vm/balance | 18 +++++++++--------- 2 files changed, 15 insertions(+), 14 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 6fab2dcbb4d..0ea5adbc5b1 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -233,8 +233,8 @@ These protections are added to score to judge whether this zone should be used for page allocation or should be reclaimed. In this example, if normal pages (index=2) are required to this DMA zone and -pages_high is used for watermark, the kernel judges this zone should not be -used because pages_free(1355) is smaller than watermark + protection[2] +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] (4 + 2004 = 2008). If this protection value is 0, this zone would be used for normal page requirement. If requirement is DMA zone(index=0), protection[0] (=0) is used. @@ -280,9 +280,10 @@ The default value is 65536. min_free_kbytes: This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a pages_min -value for each lowmem zone in the system. Each lowmem zone gets -a number of reserved free pages based proportionally on its size. +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. Some minimal amount of memory is needed to satisfy PF_MEMALLOC allocations; if you set this to lower than 1024KB, your system will diff --git a/Documentation/vm/balance b/Documentation/vm/balance index bd3d31bc491..c46e68cf934 100644 --- a/Documentation/vm/balance +++ b/Documentation/vm/balance @@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would alleviate memory pressure on any zone in the page's node that has fallen below its watermark. -pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are -per-zone fields, used to determine when a zone needs to be balanced. When -the number of pages falls below pages_min, the hysteric field low_on_memory -gets set. This stays set till the number of free pages becomes pages_high. -When low_on_memory is set, page allocation requests will try to free some -pages in the zone (providing GFP_WAIT is set in the request). Orthogonal -to this, is the decision to poke kswapd to free some zone pages. 
That -decision is not hysteresis based, and is done when the number of free -pages is below pages_low; in which case zone_wake_kswapd is also set. +watemark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These +are per-zone fields, used to determine when a zone needs to be balanced. When +the number of pages falls below watermark[WMARK_MIN], the hysteric field +low_on_memory gets set. This stays set till the number of free pages becomes +watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will +try to free some pages in the zone (providing GFP_WAIT is set in the request). +Orthogonal to this, is the decision to poke kswapd to free some zone pages. +That decision is not hysteresis based, and is done when the number of free +pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set. (Good) Ideas that I have heard: -- cgit v1.2.3 From c9ba78e226057a1c2f19671383c496df187c02b5 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 16 Jun 2009 15:32:25 -0700 Subject: pagemap: document clarifications Some bit ranges were inclusive and some not. Fix them to be consistently inclusive. Signed-off-by: Wu Fengguang Cc: KOSAKI Motohiro Cc: Andi Kleen Cc: Matt Mackall Cc: Alexey Dobriyan Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/pagemap.txt | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index ce72c0fe617..1f1e69f72fc 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -12,9 +12,9 @@ There are three components to pagemap: value for each virtual page, containing the following data (from fs/proc/task_mmu.c, above pagemap_read): - * Bits 0-55 page frame number (PFN) if present + * Bits 0-54 page frame number (PFN) if present * Bits 0-4 swap type if swapped - * Bits 5-55 swap offset if swapped + * Bits 5-54 swap offset if swapped * Bits 55-60 page shift (page size = 1< Date: Tue, 16 Jun 2009 15:32:26 -0700 Subject: pagemap: document 9 more exported page flags Also add short descriptions for all of the 20 exported page flags. Signed-off-by: Wu Fengguang Cc: KOSAKI Motohiro Cc: Andi Kleen Cc: Matt Mackall Cc: Alexey Dobriyan Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/pagemap.txt | 62 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 62 insertions(+) (limited to 'Documentation') diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt index 1f1e69f72fc..600a304a828 100644 --- a/Documentation/vm/pagemap.txt +++ b/Documentation/vm/pagemap.txt @@ -49,6 +49,68 @@ There are three components to pagemap: 8. WRITEBACK 9. RECLAIM 10. BUDDY + 11. MMAP + 12. ANON + 13. SWAPCACHE + 14. SWAPBACKED + 15. COMPOUND_HEAD + 16. COMPOUND_TAIL + 16. HUGE + 18. UNEVICTABLE + 20. NOPAGE + +Short descriptions to the page flags: + + 0. LOCKED + page is being locked for exclusive access, eg. by undergoing read/write IO + + 7. SLAB + page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator + When compound page is used, SLUB/SLQB will only set this flag on the head + page; SLOB will not flag it at all. + +10. BUDDY + a free memory block managed by the buddy system allocator + The buddy system organizes free memory in blocks of various orders. + An order N block has 2^N physically contiguous pages, with the BUDDY flag + set for and _only_ for the first page. + +15. COMPOUND_HEAD +16. 
COMPOUND_TAIL + A compound page with order N consists of 2^N physically contiguous pages. + A compound page with order 2 takes the form of "HTTT", where H donates its + head page and T donates its tail page(s). The major consumers of compound + pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc. + memory allocators and various device drivers. However in this interface, + only huge/giga pages are made visible to end users. +17. HUGE + this is an integral part of a HugeTLB page + +20. NOPAGE + no page frame exists at the requested address + + [IO related page flags] + 1. ERROR IO error occurred + 3. UPTODATE page has up-to-date data + ie. for file backed page: (in-memory data revision >= on-disk one) + 4. DIRTY page has been written to, hence contains new data + ie. for file backed page: (in-memory data revision > on-disk one) + 8. WRITEBACK page is being synced to disk + + [LRU related page flags] + 5. LRU page is in one of the LRU lists + 6. ACTIVE page is in the active LRU list +18. UNEVICTABLE page is in the unevictable (non-)LRU list + It is somehow pinned and not a candidate for LRU page reclaims, + eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments + 2. REFERENCED page has been referenced since last LRU list enqueue/requeue + 9. RECLAIM page will be reclaimed soon after its pageout IO completed +11. MMAP a memory mapped page +12. ANON a memory mapped page that is not part of a file +13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry +14. SWAPBACKED page is backed by swap/RAM + +The page-types tool in this directory can be used to query the above flags. Using pagemap to do something useful: -- cgit v1.2.3 From 35efa5e993a7a00a50b87d2b7725c3eafc80b083 Mon Sep 17 00:00:00 2001 From: Wu Fengguang Date: Tue, 16 Jun 2009 15:32:27 -0700 Subject: pagemap: add page-types tool Add page-types, a handy tool for querying page flags. 
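For readers unfamiliar with the interface the tool sits on: /proc/kpageflags exports one
64-bit flag word per page frame, KPF_BYTES (8) bytes apiece, so the flags of a given PFN
live at byte offset pfn * 8. Below is a minimal illustrative reader, separate from the
patch itself (root privileges are normally required to open the file):

    /* kpf-peek.c: print the raw kernel page flags of one PFN.
     * Illustrative only -- the page-types tool below does the same thing
     * with batching, range walking, bit filtering and a summary table.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            unsigned long pfn = argc > 1 ? strtoul(argv[1], NULL, 0) : 0;
            uint64_t flags;
            int fd = open("/proc/kpageflags", O_RDONLY);

            if (fd < 0 ||
                pread(fd, &flags, sizeof(flags), pfn * 8) !=
                                            (ssize_t)sizeof(flags)) {
                    perror("/proc/kpageflags");
                    return 1;
            }
            printf("pfn %lu flags 0x%016llx\n", pfn,
                   (unsigned long long)flags);
            close(fd);
            return 0;
    }

page-types builds on exactly this read path (see walk_pfn() in the source below).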
It will expand some of the overloaded flags: PG_slob_free = PG_private PG_slub_frozen = PG_active PG_slub_debug = PG_error PG_readahead = PG_reclaim and mask out obscure flags except in -raw mode: PG_reserved PG_mlocked PG_mappedtodisk PG_private PG_private_2 PG_owner_priv_1 PG_arch_1 PG_uncached PG_compound* for non hugeTLB pages [akpm@linux-foundation.org: fix warning] Signed-off-by: Wu Fengguang Cc: KOSAKI Motohiro Cc: Andi Kleen Cc: Matt Mackall Cc: Alexey Dobriyan Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/vm/Makefile | 2 +- Documentation/vm/page-types.c | 698 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 699 insertions(+), 1 deletion(-) create mode 100644 Documentation/vm/page-types.c (limited to 'Documentation') diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile index 6f562f778b2..27479d43a9b 100644 --- a/Documentation/vm/Makefile +++ b/Documentation/vm/Makefile @@ -2,7 +2,7 @@ obj- := dummy.o # List of programs to build -hostprogs-y := slabinfo +hostprogs-y := slabinfo slqbinfo page-types # Tell kbuild to always build the programs always := $(hostprogs-y) diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c new file mode 100644 index 00000000000..0833f44ba16 --- /dev/null +++ b/Documentation/vm/page-types.c @@ -0,0 +1,698 @@ +/* + * page-types: Tool for querying page flags + * + * Copyright (C) 2009 Intel corporation + * Copyright (C) 2009 Wu Fengguang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +/* + * kernel page flags + */ + +#define KPF_BYTES 8 +#define PROC_KPAGEFLAGS "/proc/kpageflags" + +/* copied from kpageflags_read() */ +#define KPF_LOCKED 0 +#define KPF_ERROR 1 +#define KPF_REFERENCED 2 +#define KPF_UPTODATE 3 +#define KPF_DIRTY 4 +#define KPF_LRU 5 +#define KPF_ACTIVE 6 +#define KPF_SLAB 7 +#define KPF_WRITEBACK 8 +#define KPF_RECLAIM 9 +#define KPF_BUDDY 10 + +/* [11-20] new additions in 2.6.31 */ +#define KPF_MMAP 11 +#define KPF_ANON 12 +#define KPF_SWAPCACHE 13 +#define KPF_SWAPBACKED 14 +#define KPF_COMPOUND_HEAD 15 +#define KPF_COMPOUND_TAIL 16 +#define KPF_HUGE 17 +#define KPF_UNEVICTABLE 18 +#define KPF_NOPAGE 20 + +/* [32-] kernel hacking assistances */ +#define KPF_RESERVED 32 +#define KPF_MLOCKED 33 +#define KPF_MAPPEDTODISK 34 +#define KPF_PRIVATE 35 +#define KPF_PRIVATE_2 36 +#define KPF_OWNER_PRIVATE 37 +#define KPF_ARCH 38 +#define KPF_UNCACHED 39 + +/* [48-] take some arbitrary free slots for expanding overloaded flags + * not part of kernel API + */ +#define KPF_READAHEAD 48 +#define KPF_SLOB_FREE 49 +#define KPF_SLUB_FROZEN 50 +#define KPF_SLUB_DEBUG 51 + +#define KPF_ALL_BITS ((uint64_t)~0ULL) +#define KPF_HACKERS_BITS (0xffffULL << 32) +#define KPF_OVERLOADED_BITS (0xffffULL << 48) +#define BIT(name) (1ULL << KPF_##name) +#define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL)) + +static char *page_flag_names[] = { + [KPF_LOCKED] = "L:locked", + [KPF_ERROR] = "E:error", + [KPF_REFERENCED] = "R:referenced", + [KPF_UPTODATE] = "U:uptodate", + [KPF_DIRTY] = "D:dirty", + [KPF_LRU] = "l:lru", + [KPF_ACTIVE] = "A:active", + [KPF_SLAB] = "S:slab", + [KPF_WRITEBACK] = "W:writeback", + [KPF_RECLAIM] = "I:reclaim", + [KPF_BUDDY] = "B:buddy", + + [KPF_MMAP] = "M:mmap", + [KPF_ANON] = "a:anonymous", + [KPF_SWAPCACHE] = "s:swapcache", + [KPF_SWAPBACKED] = "b:swapbacked", + [KPF_COMPOUND_HEAD] = "H:compound_head", + [KPF_COMPOUND_TAIL] = "T:compound_tail", + [KPF_HUGE] = "G:huge", 
+ [KPF_UNEVICTABLE] = "u:unevictable", + [KPF_NOPAGE] = "n:nopage", + + [KPF_RESERVED] = "r:reserved", + [KPF_MLOCKED] = "m:mlocked", + [KPF_MAPPEDTODISK] = "d:mappedtodisk", + [KPF_PRIVATE] = "P:private", + [KPF_PRIVATE_2] = "p:private_2", + [KPF_OWNER_PRIVATE] = "O:owner_private", + [KPF_ARCH] = "h:arch", + [KPF_UNCACHED] = "c:uncached", + + [KPF_READAHEAD] = "I:readahead", + [KPF_SLOB_FREE] = "P:slob_free", + [KPF_SLUB_FROZEN] = "A:slub_frozen", + [KPF_SLUB_DEBUG] = "E:slub_debug", +}; + + +/* + * data structures + */ + +static int opt_raw; /* for kernel developers */ +static int opt_list; /* list pages (in ranges) */ +static int opt_no_summary; /* don't show summary */ +static pid_t opt_pid; /* process to walk */ + +#define MAX_ADDR_RANGES 1024 +static int nr_addr_ranges; +static unsigned long opt_offset[MAX_ADDR_RANGES]; +static unsigned long opt_size[MAX_ADDR_RANGES]; + +#define MAX_BIT_FILTERS 64 +static int nr_bit_filters; +static uint64_t opt_mask[MAX_BIT_FILTERS]; +static uint64_t opt_bits[MAX_BIT_FILTERS]; + +static int page_size; + +#define PAGES_BATCH (64 << 10) /* 64k pages */ +static int kpageflags_fd; +static uint64_t kpageflags_buf[KPF_BYTES * PAGES_BATCH]; + +#define HASH_SHIFT 13 +#define HASH_SIZE (1 << HASH_SHIFT) +#define HASH_MASK (HASH_SIZE - 1) +#define HASH_KEY(flags) (flags & HASH_MASK) + +static unsigned long total_pages; +static unsigned long nr_pages[HASH_SIZE]; +static uint64_t page_flags[HASH_SIZE]; + + +/* + * helper functions + */ + +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) + +#define min_t(type, x, y) ({ \ + type __min1 = (x); \ + type __min2 = (y); \ + __min1 < __min2 ? __min1 : __min2; }) + +unsigned long pages2mb(unsigned long pages) +{ + return (pages * page_size) >> 20; +} + +void fatal(const char *x, ...) +{ + va_list ap; + + va_start(ap, x); + vfprintf(stderr, x, ap); + va_end(ap); + exit(EXIT_FAILURE); +} + + +/* + * page flag names + */ + +char *page_flag_name(uint64_t flags) +{ + static char buf[65]; + int present; + int i, j; + + for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) { + present = (flags >> i) & 1; + if (!page_flag_names[i]) { + if (present) + fatal("unkown flag bit %d\n", i); + continue; + } + buf[j++] = present ? 
page_flag_names[i][0] : '_'; + } + + return buf; +} + +char *page_flag_longname(uint64_t flags) +{ + static char buf[1024]; + int i, n; + + for (i = 0, n = 0; i < ARRAY_SIZE(page_flag_names); i++) { + if (!page_flag_names[i]) + continue; + if ((flags >> i) & 1) + n += snprintf(buf + n, sizeof(buf) - n, "%s,", + page_flag_names[i] + 2); + } + if (n) + n--; + buf[n] = '\0'; + + return buf; +} + + +/* + * page list and summary + */ + +void show_page_range(unsigned long offset, uint64_t flags) +{ + static uint64_t flags0; + static unsigned long index; + static unsigned long count; + + if (flags == flags0 && offset == index + count) { + count++; + return; + } + + if (count) + printf("%lu\t%lu\t%s\n", + index, count, page_flag_name(flags0)); + + flags0 = flags; + index = offset; + count = 1; +} + +void show_page(unsigned long offset, uint64_t flags) +{ + printf("%lu\t%s\n", offset, page_flag_name(flags)); +} + +void show_summary(void) +{ + int i; + + printf(" flags\tpage-count MB" + " symbolic-flags\t\t\tlong-symbolic-flags\n"); + + for (i = 0; i < ARRAY_SIZE(nr_pages); i++) { + if (nr_pages[i]) + printf("0x%016llx\t%10lu %8lu %s\t%s\n", + (unsigned long long)page_flags[i], + nr_pages[i], + pages2mb(nr_pages[i]), + page_flag_name(page_flags[i]), + page_flag_longname(page_flags[i])); + } + + printf(" total\t%10lu %8lu\n", + total_pages, pages2mb(total_pages)); +} + + +/* + * page flag filters + */ + +int bit_mask_ok(uint64_t flags) +{ + int i; + + for (i = 0; i < nr_bit_filters; i++) { + if (opt_bits[i] == KPF_ALL_BITS) { + if ((flags & opt_mask[i]) == 0) + return 0; + } else { + if ((flags & opt_mask[i]) != opt_bits[i]) + return 0; + } + } + + return 1; +} + +uint64_t expand_overloaded_flags(uint64_t flags) +{ + /* SLOB/SLUB overload several page flags */ + if (flags & BIT(SLAB)) { + if (flags & BIT(PRIVATE)) + flags ^= BIT(PRIVATE) | BIT(SLOB_FREE); + if (flags & BIT(ACTIVE)) + flags ^= BIT(ACTIVE) | BIT(SLUB_FROZEN); + if (flags & BIT(ERROR)) + flags ^= BIT(ERROR) | BIT(SLUB_DEBUG); + } + + /* PG_reclaim is overloaded as PG_readahead in the read path */ + if ((flags & (BIT(RECLAIM) | BIT(WRITEBACK))) == BIT(RECLAIM)) + flags ^= BIT(RECLAIM) | BIT(READAHEAD); + + return flags; +} + +uint64_t well_known_flags(uint64_t flags) +{ + /* hide flags intended only for kernel hacker */ + flags &= ~KPF_HACKERS_BITS; + + /* hide non-hugeTLB compound pages */ + if ((flags & BITS_COMPOUND) && !(flags & BIT(HUGE))) + flags &= ~BITS_COMPOUND; + + return flags; +} + + +/* + * page frame walker + */ + +int hash_slot(uint64_t flags) +{ + int k = HASH_KEY(flags); + int i; + + /* Explicitly reserve slot 0 for flags 0: the following logic + * cannot distinguish an unoccupied slot from slot (flags==0). 
+ */ + if (flags == 0) + return 0; + + /* search through the remaining (HASH_SIZE-1) slots */ + for (i = 1; i < ARRAY_SIZE(page_flags); i++, k++) { + if (!k || k >= ARRAY_SIZE(page_flags)) + k = 1; + if (page_flags[k] == 0) { + page_flags[k] = flags; + return k; + } + if (page_flags[k] == flags) + return k; + } + + fatal("hash table full: bump up HASH_SHIFT?\n"); + exit(EXIT_FAILURE); +} + +void add_page(unsigned long offset, uint64_t flags) +{ + flags = expand_overloaded_flags(flags); + + if (!opt_raw) + flags = well_known_flags(flags); + + if (!bit_mask_ok(flags)) + return; + + if (opt_list == 1) + show_page_range(offset, flags); + else if (opt_list == 2) + show_page(offset, flags); + + nr_pages[hash_slot(flags)]++; + total_pages++; +} + +void walk_pfn(unsigned long index, unsigned long count) +{ + unsigned long batch; + unsigned long n; + unsigned long i; + + if (index > ULONG_MAX / KPF_BYTES) + fatal("index overflow: %lu\n", index); + + lseek(kpageflags_fd, index * KPF_BYTES, SEEK_SET); + + while (count) { + batch = min_t(unsigned long, count, PAGES_BATCH); + n = read(kpageflags_fd, kpageflags_buf, batch * KPF_BYTES); + if (n == 0) + break; + if (n < 0) { + perror(PROC_KPAGEFLAGS); + exit(EXIT_FAILURE); + } + + if (n % KPF_BYTES != 0) + fatal("partial read: %lu bytes\n", n); + n = n / KPF_BYTES; + + for (i = 0; i < n; i++) + add_page(index + i, kpageflags_buf[i]); + + index += batch; + count -= batch; + } +} + +void walk_addr_ranges(void) +{ + int i; + + kpageflags_fd = open(PROC_KPAGEFLAGS, O_RDONLY); + if (kpageflags_fd < 0) { + perror(PROC_KPAGEFLAGS); + exit(EXIT_FAILURE); + } + + if (!nr_addr_ranges) + walk_pfn(0, ULONG_MAX); + + for (i = 0; i < nr_addr_ranges; i++) + walk_pfn(opt_offset[i], opt_size[i]); + + close(kpageflags_fd); +} + + +/* + * user interface + */ + +const char *page_flag_type(uint64_t flag) +{ + if (flag & KPF_HACKERS_BITS) + return "(r)"; + if (flag & KPF_OVERLOADED_BITS) + return "(o)"; + return " "; +} + +void usage(void) +{ + int i, j; + + printf( +"page-types [options]\n" +" -r|--raw Raw mode, for kernel developers\n" +" -a|--addr addr-spec Walk a range of pages\n" +" -b|--bits bits-spec Walk pages with specified bits\n" +#if 0 /* planned features */ +" -p|--pid pid Walk process address space\n" +" -f|--file filename Walk file address space\n" +#endif +" -l|--list Show page details in ranges\n" +" -L|--list-each Show page details one by one\n" +" -N|--no-summary Don't show summay info\n" +" -h|--help Show this usage message\n" +"addr-spec:\n" +" N one page at offset N (unit: pages)\n" +" N+M pages range from N to N+M-1\n" +" N,M pages range from N to M-1\n" +" N, pages range from N to end\n" +" ,M pages range from 0 to M\n" +"bits-spec:\n" +" bit1,bit2 (flags & (bit1|bit2)) != 0\n" +" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n" +" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n" +" =bit1,bit2 flags == (bit1|bit2)\n" +"bit-names:\n" + ); + + for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) { + if (!page_flag_names[i]) + continue; + printf("%16s%s", page_flag_names[i] + 2, + page_flag_type(1ULL << i)); + if (++j > 3) { + j = 0; + putchar('\n'); + } + } + printf("\n " + "(r) raw mode bits (o) overloaded bits\n"); +} + +unsigned long long parse_number(const char *str) +{ + unsigned long long n; + + n = strtoll(str, NULL, 0); + + if (n == 0 && str[0] != '0') + fatal("invalid name or number: %s\n", str); + + return n; +} + +void parse_pid(const char *str) +{ + opt_pid = parse_number(str); +} + +void parse_file(const char *name) +{ +} + +void 
add_addr_range(unsigned long offset, unsigned long size) +{ + if (nr_addr_ranges >= MAX_ADDR_RANGES) + fatal("too much addr ranges\n"); + + opt_offset[nr_addr_ranges] = offset; + opt_size[nr_addr_ranges] = size; + nr_addr_ranges++; +} + +void parse_addr_range(const char *optarg) +{ + unsigned long offset; + unsigned long size; + char *p; + + p = strchr(optarg, ','); + if (!p) + p = strchr(optarg, '+'); + + if (p == optarg) { + offset = 0; + size = parse_number(p + 1); + } else if (p) { + offset = parse_number(optarg); + if (p[1] == '\0') + size = ULONG_MAX; + else { + size = parse_number(p + 1); + if (*p == ',') { + if (size < offset) + fatal("invalid range: %lu,%lu\n", + offset, size); + size -= offset; + } + } + } else { + offset = parse_number(optarg); + size = 1; + } + + add_addr_range(offset, size); +} + +void add_bits_filter(uint64_t mask, uint64_t bits) +{ + if (nr_bit_filters >= MAX_BIT_FILTERS) + fatal("too much bit filters\n"); + + opt_mask[nr_bit_filters] = mask; + opt_bits[nr_bit_filters] = bits; + nr_bit_filters++; +} + +uint64_t parse_flag_name(const char *str, int len) +{ + int i; + + if (!*str || !len) + return 0; + + if (len <= 8 && !strncmp(str, "compound", len)) + return BITS_COMPOUND; + + for (i = 0; i < ARRAY_SIZE(page_flag_names); i++) { + if (!page_flag_names[i]) + continue; + if (!strncmp(str, page_flag_names[i] + 2, len)) + return 1ULL << i; + } + + return parse_number(str); +} + +uint64_t parse_flag_names(const char *str, int all) +{ + const char *p = str; + uint64_t flags = 0; + + while (1) { + if (*p == ',' || *p == '=' || *p == '\0') { + if ((*str != '~') || (*str == '~' && all && *++str)) + flags |= parse_flag_name(str, p - str); + if (*p != ',') + break; + str = p + 1; + } + p++; + } + + return flags; +} + +void parse_bits_mask(const char *optarg) +{ + uint64_t mask; + uint64_t bits; + const char *p; + + p = strchr(optarg, '='); + if (p == optarg) { + mask = KPF_ALL_BITS; + bits = parse_flag_names(p + 1, 0); + } else if (p) { + mask = parse_flag_names(optarg, 0); + bits = parse_flag_names(p + 1, 0); + } else if (strchr(optarg, '~')) { + mask = parse_flag_names(optarg, 1); + bits = parse_flag_names(optarg, 0); + } else { + mask = parse_flag_names(optarg, 0); + bits = KPF_ALL_BITS; + } + + add_bits_filter(mask, bits); +} + + +struct option opts[] = { + { "raw" , 0, NULL, 'r' }, + { "pid" , 1, NULL, 'p' }, + { "file" , 1, NULL, 'f' }, + { "addr" , 1, NULL, 'a' }, + { "bits" , 1, NULL, 'b' }, + { "list" , 0, NULL, 'l' }, + { "list-each" , 0, NULL, 'L' }, + { "no-summary", 0, NULL, 'N' }, + { "help" , 0, NULL, 'h' }, + { NULL , 0, NULL, 0 } +}; + +int main(int argc, char *argv[]) +{ + int c; + + page_size = getpagesize(); + + while ((c = getopt_long(argc, argv, + "rp:f:a:b:lLNh", opts, NULL)) != -1) { + switch (c) { + case 'r': + opt_raw = 1; + break; + case 'p': + parse_pid(optarg); + break; + case 'f': + parse_file(optarg); + break; + case 'a': + parse_addr_range(optarg); + break; + case 'b': + parse_bits_mask(optarg); + break; + case 'l': + opt_list = 1; + break; + case 'L': + opt_list = 2; + break; + case 'N': + opt_no_summary = 1; + break; + case 'h': + usage(); + exit(0); + default: + usage(); + exit(1); + } + } + + if (opt_list == 1) + printf("offset\tcount\tflags\n"); + if (opt_list == 2) + printf("offset\tflags\n"); + + walk_addr_ranges(); + + if (opt_list == 1) + show_page_range(0, 0); /* drain the buffer */ + + if (opt_no_summary) + return 0; + + if (opt_list) + printf("\n\n"); + + show_summary(); + + return 0; +} -- cgit v1.2.3 From 
2ff05b2b4eac2e63d345fc731ea151a060247f53 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 16 Jun 2009 15:32:56 -0700 Subject: oom: move oom_adj value from task_struct to mm_struct The per-task oom_adj value is a characteristic of its mm more than the task itself since it's not possible to oom kill any thread that shares the mm. If a task were to be killed while attached to an mm that could not be freed because another thread were set to OOM_DISABLE, it would have needlessly been terminated since there is no potential for future memory freeing. This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct mm_struct. This requires task_lock() on a task to check its oom_adj value to protect against exec, but it's already necessary to take the lock when dereferencing the mm to find the total VM size for the badness heuristic. This fixes a livelock if the oom killer chooses a task and another thread sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs because oom_kill_task() repeatedly returns 1 and refuses to kill the chosen task while select_bad_process() will repeatedly choose the same task during the next retry. Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in oom_kill_task() to check for threads sharing the same memory will be removed in the next patch in this series where it will no longer be necessary. Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since these threads are immune from oom killing already. They simply report an oom_adj value of OOM_DISABLE. Cc: Nick Piggin Cc: Rik van Riel Cc: Mel Gorman Signed-off-by: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/filesystems/proc.txt | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index cd8717a3627..ebff3c10a07 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1003,11 +1003,13 @@ CHAPTER 3: PER-PROCESS PARAMETERS 3.1 /proc//oom_adj - Adjust the oom-killer score ------------------------------------------------------ -This file can be used to adjust the score used to select which processes -should be killed in an out-of-memory situation. Giving it a high score will -increase the likelihood of this process being killed by the oom-killer. Valid -values are in the range -16 to +15, plus the special value -17, which disables -oom-killing altogether for this process. +This file can be used to adjust the score used to select which processes should +be killed in an out-of-memory situation. The oom_adj value is a characteristic +of the task's mm, so all threads that share an mm with pid will have the same +oom_adj value. A high value will increase the likelihood of this process being +killed by the oom-killer. Valid values are in the range -16 to +15 as +explained below and a special value of -17, which disables oom-killing +altogether for threads sharing pid's mm. The process to be killed in an out-of-memory situation is selected among all others based on its badness score. This value equals the original memory size of the process @@ -1021,6 +1023,9 @@ the parent's score if they do not share the same memory. Thus forking servers are the prime candidates to be killed. Having only one 'hungry' child will make parent less preferable than the child. 
+/proc//oom_adj cannot be changed for kthreads since they are immune from +oom-killing already. + /proc//oom_score shows process' current badness score. The following heuristics are then applied: -- cgit v1.2.3 From 90afa5de6f3fa89a733861e843377302479fcf7e Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 16 Jun 2009 15:33:20 -0700 Subject: vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim A bug was brought to my attention against a distro kernel but it affects mainline and I believe problems like this have been reported in various guises on the mailing lists although I don't have specific examples at the moment. The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists are uselessly scanned frequencly making the CPU spin at near 100%. This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with expectations which were noticed during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim(). Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall as it uselessly scanned the LRU constantly. Currently, zone_reclaim is assuming zone_reclaim_mode is 1 and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided. Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable. Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporarily resolution and a bug reported because the scan-avoidance heuristic is still broken. This patch: On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manfiest as high CPU usage as the LRU list is scanned uselessly. Historically, once enabled it was depending on NR_FILE_PAGES which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed such as swapcache and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure. This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. 
If RECLAIM_SWAP is set in the reclaim_mode it will either consider NR_FILE_PAGES as potential candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED are not. [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages] [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate] Signed-off-by: Mel Gorman Reviewed-by: Rik van Riel Acked-by: Christoph Lameter Cc: KOSAKI Motohiro Cc: Wu Fengguang Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/sysctl/vm.txt | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 0ea5adbc5b1..c4de6359d44 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -315,10 +315,14 @@ min_unmapped_ratio: This is available only on NUMA kernels. -A percentage of the total pages in each zone. Zone reclaim will only -occur if more than this percentage of pages are file backed and unmapped. -This is to insure that a minimal amount of local pages is still available for -file I/O even if the node is overallocated. +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. + +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. The default is 1 percent. -- cgit v1.2.3 From b8d9a86590fb334d28c5905a4c419ece7d08e37d Mon Sep 17 00:00:00 2001 From: Jaswinder Singh Rajput Date: Tue, 16 Jun 2009 15:33:46 -0700 Subject: Documentation/accounting/getdelays.c intialize the variable before using it Fix compilation warning: Documentation/accounting/getdelays.c: In function `main': Documentation/accounting/getdelays.c:249: warning: `cmd_type' may be used uninitialized in this function This is in fact a false positive. Signed-off-by: Jaswinder Singh Rajput Acked-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/accounting/getdelays.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/accounting/getdelays.c b/Documentation/accounting/getdelays.c index 7ea231172c8..aa73e72fd79 100644 --- a/Documentation/accounting/getdelays.c +++ b/Documentation/accounting/getdelays.c @@ -246,7 +246,8 @@ void print_ioacct(struct taskstats *t) int main(int argc, char *argv[]) { - int c, rc, rep_len, aggr_len, len2, cmd_type; + int c, rc, rep_len, aggr_len, len2; + int cmd_type = TASKSTATS_CMD_ATTR_UNSPEC; __u16 id; __u32 mypid; -- cgit v1.2.3 From 4764e280dc7dde1534161e148d38dbd792a2b8ab Mon Sep 17 00:00:00 2001 From: "Figo.zhang" Date: Tue, 16 Jun 2009 15:33:51 -0700 Subject: Documentation/atomic_ops.txt: fix sample code list_add() lost a parameter in sample code. 
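The corrected helper takes the list head explicitly, so a caller would look roughly like
the hypothetical snippet below (obj_list, obj_list_lock and obj_insert are illustrative
names, not taken from atomic_ops.txt):

    /* Illustrative caller of the fixed obj_list_add() */
    static LIST_HEAD(obj_list);
    static DEFINE_SPINLOCK(obj_list_lock);

    static void obj_insert(struct obj *obj)
    {
            spin_lock(&obj_list_lock);
            obj_list_add(obj, &obj_list);
            spin_unlock(&obj_list_lock);
    }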
Signed-off-by: Figo.zhang Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/atomic_ops.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'Documentation') diff --git a/Documentation/atomic_ops.txt b/Documentation/atomic_ops.txt index 4ef24501045..396bec3b74e 100644 --- a/Documentation/atomic_ops.txt +++ b/Documentation/atomic_ops.txt @@ -229,10 +229,10 @@ kernel. It is the use of atomic counters to implement reference counting, and it works such that once the counter falls to zero it can be guaranteed that no other entity can be accessing the object: -static void obj_list_add(struct obj *obj) +static void obj_list_add(struct obj *obj, struct list_head *head) { obj->active = 1; - list_add(&obj->list); + list_add(&obj->list, head); } static void obj_list_del(struct obj *obj) -- cgit v1.2.3 From f324edc85e5c1137e49e3b36a58cf436ab5b1fb3 Mon Sep 17 00:00:00 2001 From: Daniel Mack Date: Tue, 16 Jun 2009 15:33:52 -0700 Subject: console: make blank timeout value a boot option The console blank timer is currently hardcoded to 10*60 seconds which might be annoying on systems with no input devices attached to wake up the console again. Especially during development, disabling the screen saver can be handy - for example when debugging the root fs mount mechanism or other scenarios where no userspace program could be started to do that at runtime from userspace. This patch defines a core_param for the variable in charge which allows users to entirely disable the blank feature at boot time by setting it 0. The value can still be overwritten at runtime using the standard ioctl call - this just allows to conditionally change the default. Signed-off-by: Daniel Mack Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/kernel-parameters.txt | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'Documentation') diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index ad380063077..5578248c18a 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -546,6 +546,10 @@ and is between 256 and 4096 characters. It is defined in the file console=brl,ttyS0 For now, only VisioBraille is supported. + consoleblank= [KNL] The console blank (screen saver) timeout in + seconds. Defaults to 10*60 = 10mins. A value of 0 + disables the blank timer. + coredump_filter= [KNL] Change the default value for /proc//coredump_filter. -- cgit v1.2.3 From 2d9d2fdfae4cf7fda90178a9daf0f8f750043ae8 Mon Sep 17 00:00:00 2001 From: Paul Menzel Date: Tue, 16 Jun 2009 15:34:21 -0700 Subject: Documentation/fb/vesafb.txt: fix typo Signed-off-by: Paul Menzel Cc: Gerd Knorr Cc: Nico Schmoigl Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- Documentation/fb/vesafb.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/fb/vesafb.txt b/Documentation/fb/vesafb.txt index ee277dd204b..950d5a658cb 100644 --- a/Documentation/fb/vesafb.txt +++ b/Documentation/fb/vesafb.txt @@ -95,7 +95,7 @@ There is no way to change the vesafb video mode and/or timings after booting linux. If you are not happy with the 60 Hz refresh rate, you have these options: - * configure and load the DOS-Tools for your the graphics board (if + * configure and load the DOS-Tools for the graphics board (if available) and boot linux with loadlin. * use a native driver (matroxfb/atyfb) instead if vesafb. If none is available, write a new one! -- cgit v1.2.3
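For the consoleblank patch above, the core_param() mechanism its commit message mentions
looks roughly like the sketch below; the variable name, default and permission bits are
assumptions for illustration, not taken from the patch:

    /* illustrative core_param() hook for the console blank timeout */
    static int blankinterval = 10 * 60;     /* seconds; 0 disables blanking */
    core_param(consoleblank, blankinterval, int, 0444);

With that in place, booting with consoleblank=0 keeps the console from ever blanking,
while e.g. consoleblank=300 blanks it after five minutes; as the patch notes, the value
can still be changed at runtime through the usual console interfaces.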