platform/kernel/linux-3.10 - Domain: System / Kernel;

Age	Commit message (Collapse)	Author	Files	Lines
2013-01-11	mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT	Michal Hocko	1	-0/+9
	commit 53a59fc67f97374758e63a9c785891ec62324c81 upstream. Since commit e303297e6c3a ("mm: extended batches for generic mmu_gather") we are batching pages to be freed until either tlb_next_batch cannot allocate a new batch or we are done. This works just fine most of the time but we can get in troubles with non-preemptible kernel (CONFIG_PREEMPT_NONE or CONFIG_PREEMPT_VOLUNTARY) on large machines where too aggressive batching might lead to soft lockups during process exit path (exit_mmap) because there are no scheduling points down the free_pages_and_swap_cache path and so the freeing can take long enough to trigger the soft lockup. The lockup is harmless except when the system is setup to panic on softlockup which is not that unusual. The simplest way to work around this issue is to limit the maximum number of batches in a single mmu_gather. 10k of collected pages should be safe to prevent from soft lockups (we would have 2ms for one) even if they are all freed without an explicit scheduling point. This patch doesn't add any new explicit scheduling points because it relies on zap_pmd_range during page tables zapping which calls cond_resched per PMD. The following lockup has been reported for 3.0 kernel with a huge process (in order of hundreds gigs but I do know any more details). BUG: soft lockup - CPU#56 stuck for 22s! [kernel:31053] Modules linked in: af_packet nfs lockd fscache auth_rpcgss nfs_acl sunrpc mptctl mptbase autofs4 binfmt_misc dm_round_robin dm_multipath bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf microcode fuse loop osst sg sd_mod crc_t10dif st qla2xxx scsi_transport_fc scsi_tgt netxen_nic i7core_edac iTCO_wdt joydev e1000e serio_raw pcspkr edac_core iTCO_vendor_support acpi_power_meter rtc_cmos hpwdt hpilo button container usbhid hid dm_mirror dm_region_hash dm_log linear uhci_hcd ehci_hcd usbcore usb_common scsi_dh_emc scsi_dh_alua scsi_dh_hp_sw scsi_dh_rdac scsi_dh dm_snapshot pcnet32 mii edd dm_mod raid1 ext3 mbcache jbd fan thermal processor thermal_sys hwmon cciss scsi_mod Supported: Yes CPU 56 Pid: 31053, comm: kernel Not tainted 3.0.31-0.9-default #1 HP ProLiant DL580 G7 RIP: 0010: _raw_spin_unlock_irqrestore+0x8/0x10 RSP: 0018:ffff883ec1037af0 EFLAGS: 00000206 RAX: 0000000000000e00 RBX: ffffea01a0817e28 RCX: ffff88803ffd9e80 RDX: 0000000000000200 RSI: 0000000000000206 RDI: 0000000000000206 RBP: 0000000000000002 R08: 0000000000000001 R09: ffff887ec724a400 R10: 0000000000000000 R11: dead000000200200 R12: ffffffff8144c26e R13: 0000000000000030 R14: 0000000000000297 R15: 000000000000000e FS: 00007ed834282700(0000) GS:ffff88c03f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000068b240 CR3: 0000003ec13c5000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process kernel (pid: 31053, threadinfo ffff883ec1036000, task ffff883ebd5d4100) Call Trace: release_pages+0xc5/0x260 free_pages_and_swap_cache+0x9d/0xc0 tlb_flush_mmu+0x5c/0x80 tlb_finish_mmu+0xe/0x50 exit_mmap+0xbd/0x120 mmput+0x49/0x120 exit_mm+0x122/0x160 do_exit+0x17a/0x430 do_group_exit+0x3d/0xb0 get_signal_to_deliver+0x247/0x480 do_signal+0x71/0x1b0 do_notify_resume+0x98/0xb0 int_signal+0x12/0x17 DWARF2 unwinder stuck at int_signal+0x12/0x17 Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02	mutex: Place lock in contended state after fastpath_lock failure	Will Deacon	1	-2/+9
	commit 0bce9c46bf3b15f485d82d7e81dabed6ebcc24b1 upstream. ARM recently moved to asm-generic/mutex-xchg.h for its mutex implementation after the previous implementation was found to be missing some crucial memory barriers. However, this has revealed some problems running hackbench on SMP platforms due to the way in which the MUTEX_SPIN_ON_OWNER code operates. The symptoms are that a bunch of hackbench tasks are left waiting on an unlocked mutex and therefore never get woken up to claim it. This boils down to the following sequence of events: Task A Task B Task C Lock value 0 1 1 lock() 0 2 lock() 0 3 spin(A) 0 4 unlock() 1 5 lock() 0 6 cmpxchg(1,0) 0 7 contended() -1 8 lock() 0 9 spin(C) 0 10 unlock() 1 11 cmpxchg(1,0) 0 12 unlock() 1 At this point, the lock is unlocked, but Task B is in an uninterruptible sleep with nobody to wake it up. This patch fixes the problem by ensuring we put the lock into the contended state if we fail to acquire it on the fastpath, ensuring that any blocked waiters are woken up when the mutex is released. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Ingo Molnar <mingo@elte.hu> Reviewed-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-6e9lrw2avczr0617fzl5vqb8@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16	thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE	Andrea Arcangeli	1	-0/+10
	commit e4eed03fd06578571c01d4f1478c874bb432c815 upstream. In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under Xen. So instead of dealing only with "consistent" pmdvals in pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals where the low 32bit and high 32bit could be inconsistent (to avoid having to use cmpxchg8b). The only guarantee we get from pmd_read_atomic is that if the low part of the pmd was found null, the high part will be null too (so the pmd will be considered unstable). And if the low part of the pmd is found "stable" later, then it means the whole pmd was read atomically (because after a pmd is stable, neither MADV_DONTNEED nor page faults can alter it anymore, and we read the high part after the low part). In the 32bit PAE x86 case, it is enough to read the low part of the pmdval atomically to declare the pmd as "stable" and that's true for THP and no THP, furthermore in the THP case we also have a barrier() that will prevent any inconsistent pmdvals to be cached by a later re-read of the *pmd. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Jonathan Nieder <jrnieder@gmail.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Tested-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-16	mm: pmd_read_atomic: fix 32bit PAE pmd walk vs pmd_populate SMP race condition	Andrea Arcangeli	1	-2/+20
	commit 26c191788f18129af0eb32a358cdaea0c7479626 upstream. When holding the mmap_sem for reading, pmd_offset_map_lock should only run on a pmd_t that has been read atomically from the pmdp pointer, otherwise we may read only half of it leading to this crash. PID: 11679 TASK: f06e8000 CPU: 3 COMMAND: "do_race_2_panic" #0 [f06a9dd8] crash_kexec at c049b5ec #1 [f06a9e2c] oops_end at c083d1c2 #2 [f06a9e40] no_context at c0433ded #3 [f06a9e64] bad_area_nosemaphore at c043401a #4 [f06a9e6c] __do_page_fault at c0434493 #5 [f06a9eec] do_page_fault at c083eb45 #6 [f06a9f04] error_code (via page_fault) at c083c5d5 EAX: 01fb470c EBX: fff35000 ECX: 00000003 EDX: 00000100 EBP: 00000000 DS: 007b ESI: 9e201000 ES: 007b EDI: 01fb4700 GS: 00e0 CS: 0060 EIP: c083bc14 ERR: ffffffff EFLAGS: 00010246 #7 [f06a9f38] _spin_lock at c083bc14 #8 [f06a9f44] sys_mincore at c0507b7d #9 [f06a9fb0] system_call at c083becd start len EAX: ffffffda EBX: 9e200000 ECX: 00001000 EDX: 6228537f DS: 007b ESI: 00000000 ES: 007b EDI: 003d0f00 SS: 007b ESP: 62285354 EBP: 62285388 GS: 0033 CS: 0073 EIP: 00291416 ERR: 000000da EFLAGS: 00000286 This should be a longstanding bug affecting x86 32bit PAE without THP. Only archs with 64bit large pmd_t and 32bit unsigned long should be affected. With THP enabled the barrier() in pmd_none_or_trans_huge_or_clear_bad() would partly hide the bug when the pmd transition from none to stable, by forcing a re-read of the pmd in pmd_offset_map_lock, but when THP is enabled a new set of problem arises by the fact could then transition freely in any of the none, pmd_trans_huge or pmd_trans_stable states. So making the barrier in pmd_none_or_trans_huge_or_clear_bad() unconditional isn't good idea and it would be a flakey solution. This should be fully fixed by introducing a pmd_read_atomic that reads the pmd in order with THP disabled, or by reading the pmd atomically with cmpxchg8b with THP enabled. Luckily this new race condition only triggers in the places that must already be covered by pmd_none_or_trans_huge_or_clear_bad() so the fix is localized there but this bug is not related to THP. NOTE: this can trigger on x86 32bit systems with PAE enabled with more than 4G of ram, otherwise the high part of the pmd will never risk to be truncated because it would be zero at all times, in turn so hiding the SMP race. This bug was discovered and fully debugged by Ulrich, quote: ---- [..] pmd_none_or_trans_huge_or_clear_bad() loads the content of edx and eax. 496 static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t pmd) 497 { 498 /* depend on compiler for an atomic pmd read / 499 pmd_t pmdval = pmd; // edi = pmd pointer 0xc0507a74 <sys_mincore+548>: mov 0x8(%esp),%edi ... // edx = PTE page table high address 0xc0507a84 <sys_mincore+564>: mov 0x4(%edi),%edx ... // eax = PTE page table low address 0xc0507a8e <sys_mincore+574>: mov (%edi),%eax [..] Please note that the PMD is not read atomically. These are two "mov" instructions where the high order bits of the PMD entry are fetched first. Hence, the above machine code is prone to the following race. - The PMD entry {high\|low} is 0x0000000000000000. The "mov" at 0xc0507a84 loads 0x00000000 into edx. - A page fault (on another CPU) sneaks in between the two "mov" instructions and instantiates the PMD. - The PMD entry {high\|low} is now 0x00000003fda38067. The "mov" at 0xc0507a8e loads 0xfda38067 into eax. ---- Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-30	asm-generic: Use __BITS_PER_LONG in statfs.h	H. Peter Anvin	1	-1/+1
	<asm-generic/statfs.h> is exported to userspace, so using BITS_PER_LONG is invalid. We need to use __BITS_PER_LONG instead. This is kernel bugzilla 43165. Reported-by: H.J. Lu <hjl.tools@gmail.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/1335465916-16965-1-git-send-email-hpa@linux.intel.com Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: <stable@vger.kernel.org>
2012-04-23	asm-generic: Allow overriding clock_t and add attributes to siginfo_t	H. Peter Anvin	1	-3/+11
	For the particular issue of x32, which shares code with i386 in the handling of compat_siginfo_t, the use of a 64-bit clock_t bumps the sigchld structure out of alignment, which triggers a messy cascade of padding. This was already handled on the kernel compat side, but it needs handling on the user space side, which uses the generic header. To make that possible: 1. Allow __kernel_clock_t to be overridden in struct siginfo; 2. Allow there to be attributes added to struct siginfo. Reported-by: H.J. Lu <hjl.rools@gmail.com> Cc: Bruce J. Beare <bruce.j.beare@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/CAMe9rOqF6Kh6-NK7oP0Fpzkd4SBAWU%2BG53hwBbSD4iA2UzyxuA@mail.gmail.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-04-02	asm-generic: add linux/types.h to cmpxchg.h	Paul Gortmaker	1	-0/+1
	Builds of the openrisc or1ksim_defconfig show the following: In file included from arch/openrisc/include/generated/asm/cmpxchg.h:1:0, from include/asm-generic/atomic.h:18, from arch/openrisc/include/generated/asm/atomic.h:1, from include/linux/atomic.h:4, from include/linux/dcache.h:4, from fs/notify/fsnotify.c:19: include/asm-generic/cmpxchg.h: In function '__xchg': include/asm-generic/cmpxchg.h:34:20: error: expected ')' before 'u8' include/asm-generic/cmpxchg.h:34:20: warning: type defaults to 'int' in type name and many more lines of similar errors. It seems specific to the or32 because most other platforms have an arch specific component that would have already included types.h ahead of time, but the o32 does not. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jonas Bonn <jonas@southpole.se> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: David Howells <dhowells@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-29	Merge branch 'x86-x32-for-linus' of ↵	Linus Torvalds	1	-87/+22
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x32 support for x86-64 from Ingo Molnar: "This tree introduces the X32 binary format and execution mode for x86: 32-bit data space binaries using 64-bit instructions and 64-bit kernel syscalls. This allows applications whose working set fits into a 32 bits address space to make use of 64-bit instructions while using a 32-bit address space with shorter pointers, more compressed data structures, etc." Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c} * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) x32: Fix alignment fail in struct compat_siginfo x32: Fix stupid ia32/x32 inversion in the siginfo format x32: Add ptrace for x32 x32: Switch to a 64-bit clock_t x32: Provide separate is_ia32_task() and is_x32_task() predicates x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls x86/x32: Fix the binutils auto-detect x32: Warn and disable rather than error if binutils too old x32: Only clear TIF_X32 flag once x32: Make sure TS_COMPAT is cleared for x32 tasks fs: Remove missed ->fds_bits from cessation use of fd_set structs internally fs: Fix close_on_exec pointer in alloc_fdtable x32: Drop non-__vdso weak symbols from the x32 VDSO x32: Fix coding style violations in the x32 VDSO code x32: Add x32 VDSO support x32: Allow x32 to be configured x32: If configured, add x32 system calls to system call tables x32: Handle process creation x32: Signal-related system calls x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h> ...
2012-03-29	Merge git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile	Linus Torvalds	1	-1/+1
	Pull arch/tile (really asm-generic) update from Chris Metcalf: "These are a couple of asm-generic changes that apply to tile." * git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile: compat: use sys_sendfile64() implementation for sendfile syscall [PATCH v3] ipc: provide generic compat versions of IPC syscalls
2012-03-28	Merge tag 'split-asm_system_h-for-linus-20120328' of ↵	Linus Torvalds	7	-149/+183
	git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system Pull "Disintegrate and delete asm/system.h" from David Howells: "Here are a bunch of patches to disintegrate asm/system.h into a set of separate bits to relieve the problem of circular inclusion dependencies. I've built all the working defconfigs from all the arches that I can and made sure that they don't break. The reason for these patches is that I recently encountered a circular dependency problem that came about when I produced some patches to optimise get_order() by rewriting it to use ilog2(). This uses bitops - and on the SH arch asm/bitops.h drags in asm-generic/get_order.h by a circuituous route involving asm/system.h. The main difficulty seems to be asm/system.h. It holds a number of low level bits with no/few dependencies that are commonly used (eg. memory barriers) and a number of bits with more dependencies that aren't used in many places (eg. switch_to()). These patches break asm/system.h up into the following core pieces: (1) asm/barrier.h Move memory barriers here. This already done for MIPS and Alpha. (2) asm/switch_to.h Move switch_to() and related stuff here. (3) asm/exec.h Move arch_align_stack() here. Other process execution related bits could perhaps go here from asm/processor.h. (4) asm/cmpxchg.h Move xchg() and cmpxchg() here as they're full word atomic ops and frequently used by atomic_xchg() and atomic_cmpxchg(). (5) asm/bug.h Move die() and related bits. (6) asm/auxvec.h Move AT_VECTOR_SIZE_ARCH here. Other arch headers are created as needed on a per-arch basis." Fixed up some conflicts from other header file cleanups and moving code around that has happened in the meantime, so David's testing is somewhat weakened by that. We'll find out anything that got broken and fix it.. * tag 'split-asm_system_h-for-linus-20120328' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-asm_system: (38 commits) Delete all instances of asm/system.h Remove all #inclusions of asm/system.h Add #includes needed to permit the removal of asm/system.h Move all declarations of free_initmem() to linux/mm.h Disintegrate asm/system.h for OpenRISC Split arch_align_stack() out from asm-generic/system.h Split the switch_to() wrapper out of asm-generic/system.h Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h Create asm-generic/barrier.h Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h Disintegrate asm/system.h for Xtensa Disintegrate asm/system.h for Unicore32 [based on ver #3, changed by gxt] Disintegrate asm/system.h for Tile Disintegrate asm/system.h for Sparc Disintegrate asm/system.h for SH Disintegrate asm/system.h for Score Disintegrate asm/system.h for S390 Disintegrate asm/system.h for PowerPC Disintegrate asm/system.h for PA-RISC Disintegrate asm/system.h for MN10300 ...
2012-03-28	Merge git://github.com/rustyrussell/linux	Linus Torvalds	1	-21/+14
	Pull module and param updates from Rusty Russell: "I'm getting married next week, and then honeymoon until 6th May. I'll be offline from next week, except to post the compulsory pictures if Alex shaves her head..." I'm sure Rusty can take time off from his honeymoon if something comes up. And here's the explanation about head shaving: http://baldalex.org/ in case you wondered and wanted to support another insane caper or Rusty's involving shaving. What is it with Rusty and shaving, anyway? * git://github.com/rustyrussell/linux: module: Remove module size limit module: move __module_get and try_module_get() out of line. params: <level>_initcall-like kernel parameters module_param: remove support for bool parameters which are really int. module: add kernel param to force disable module load
2012-03-28	Merge tag 'gpio-for-linus' of git://git.secretlab.ca/git/linux-2.6	Linus Torvalds	1	-2/+2
	Pull GPIO changes for v3.4 from Grant Likely: "Primarily gpio device driver changes with some minor side effects under arch/arm and arch/x86. Also includes a few core changes such as explicitly supporting (electrical) open source and open drain outputs and some help for parsing gpio devicetree properties." Fix up context conflict due to Laxman Dewangan adding sleep control for the tps65910 driver separately for gpio's and regulators. * tag 'gpio-for-linus' of git://git.secretlab.ca/git/linux-2.6: (34 commits) gpio/ep93xx: Remove unused inline function and useless pr_err message gpio/sodaville: Mark broken due to core irqdomain migration gpio/omap: fix redundant decoding of gpio offset gpio/omap: fix incorrect update to context.irqenable1 gpio/omap: fix incorrect context restore logic in omap_gpio_runtime_* gpio/omap: fix missing dataout context save in _set_gpio_dataout_reg gpio/omap: fix _set_gpio_irqenable implementation gpio/omap: fix trigger type to unsigned gpio/omap: fix wakeup_en register update in _set_gpio_wakeup() gpio: tegra: tegra_gpio_config shouldn't be __init gpio/davinci: fix enabling unbanked GPIO IRQs gpio/davinci: fix oops on unbanked gpio irq request gpio/omap: Fix section warning for omap_mpuio_alloc_gc() ARM: tegra: export tegra_gpio_{en,dis}able gpio/gpio-stmpe: Fix the value returned by _get_value routine Documentation/gpio.txt: Explain expected pinctrl interaction GPIO: LPC32xx: Add output reading to GPO P3 GPIO: LPC32xx: Fix missing bit selection mask gpio/omap: fix wakeups on level-triggered GPIOs gpio/omap: Fix IRQ handling for SPARSE_IRQ ...
2012-03-28	Delete all instances of asm/system.h	David Howells	1	-5/+0
	Delete all instances of asm/system.h as they should be redundant by this point. Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28	Remove all #inclusions of asm/system.h	David Howells	1	-1/+0
	Remove all #inclusions of asm/system.h preparatory to splitting and killing it. Performed with the following command: perl -p -i -e 's!^#\sinclude\s<asm/system[.]h>.\n!!' `grep -Irl '^#\sinclude\s<asm/system[.]h>' ` Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28	Add #includes needed to permit the removal of asm/system.h	David Howells	1	-1/+1
	asm/system.h is a cause of circular dependency problems because it contains commonly used primitive stuff like barrier definitions and uncommonly used stuff like switch_to() that might require MMU definitions. asm/system.h has been disintegrated by this point on all arches into the following common segments: (1) asm/barrier.h Moved memory barrier definitions here. (2) asm/cmpxchg.h Moved xchg() and cmpxchg() here. #included in asm/atomic.h. (3) asm/bug.h Moved die() and similar here. (4) asm/exec.h Moved arch_align_stack() here. (5) asm/elf.h Moved AT_VECTOR_SIZE_ARCH here. (6) asm/switch_to.h Moved switch_to() here. Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-28	Split arch_align_stack() out from asm-generic/system.h	David Howells	2	-20/+21
	Split arch_align_stack() out from asm-generic/system.h into its own header of asm-generic/exec.h as part of the asm/system.h disintegration. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2012-03-28	Split the switch_to() wrapper out of asm-generic/system.h	David Howells	2	-16/+31
	Split the switch_to() wrapper out of asm-generic/system.h into its own asm-generic/system.h as part of the asm/system.h disintegration. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2012-03-28	Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h	David Howells	2	-77/+80
	Move the asm-generic/system.h xchg() implementation to asm-generic/cmpxchg.h to simplify disintegration of asm/system.h. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2012-03-28	Create asm-generic/barrier.h	David Howells	2	-33/+51
	Create asm-generic/barrier.h and move the barrier definitions from asm-generic/system.h to it. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2012-03-28	Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h	David Howells	3	-1/+4
	Make asm-generic/cmpxchg.h #include asm-generic/cmpxchg-local.h as all arch files that #include the former also #include the latter. See: grep -rl asm-generic/cmpxchg-local[.]h arch/ \| sort > b grep -rl asm-generic/cmpxchg[.]h arch/ \| sort > a comm a b This simplifies the disintegration of asm-generic/system.h for arches that don't have their own. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de>
2012-03-27	compat: use sys_sendfile64() implementation for sendfile syscall	Chris Metcalf	1	-1/+1
	<asm-generic/unistd.h> was set up to use sys_sendfile() for the 32-bit compat API instead of sys_sendfile64(), but in fact the right thing to do is to use sys_sendfile64() in all cases. The 32-bit sendfile64() API in glibc uses the sendfile64 syscall, so it has to be capable of doing full 64-bit operations. But the sys_sendfile() kernel implementation has a MAX_NON_LFS test in it which explicitly limits the offset to 2^32. So, we need to use the sys_sendfile64() implementation in the kernel for this case. Cc: <stable@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
2012-03-26	params: <level>_initcall-like kernel parameters	Pawel Moll	1	-21/+14
	This patch adds a set of macros that can be used to declare kernel parameters to be parsed _before_ initcalls at a chosen level are executed. We rename the now-unused "flags" field of struct kernel_param as the level. It's signed, for when we use this for early params as well, in future. Linker macro collating init calls had to be modified in order to add additional symbols between levels that are later used by the init code to split the calls into blocks. Signed-off-by: Pawel Moll <pawel.moll@arm.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2012-03-24	Merge tag 'bug-for-3.4' of ↵	Linus Torvalds	3	-0/+4
	git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull <linux/bug.h> cleanup from Paul Gortmaker: "The changes shown here are to unify linux's BUG support under the one <linux/bug.h> file. Due to historical reasons, we have some BUG code in bug.h and some in kernel.h -- i.e. the support for BUILD_BUG in linux/kernel.h predates the addition of linux/bug.h, but old code in kernel.h wasn't moved to bug.h at that time. As a band-aid, kernel.h was including <asm/bug.h> to pseudo link them. This has caused confusion[1] and general yuck/WTF[2] reactions. Here is an example that violates the principle of least surprise: CC lib/string.o lib/string.c: In function 'strlcat': lib/string.c:225:2: error: implicit declaration of function 'BUILD_BUG_ON' make[2]: *** [lib/string.o] Error 1 $ $ grep linux/bug.h lib/string.c #include <linux/bug.h> $ We've included <linux/bug.h> for the BUG infrastructure and yet we still get a compile fail! [We've not kernel.h for BUILD_BUG_ON.] Ugh - very confusing for someone who is new to kernel development. With the above in mind, the goals of this changeset are: 1) find and fix any include/.h files that were relying on the implicit presence of BUG code. 2) find and fix any C files that were consuming kernel.h and hence relying on implicitly getting some/all BUG code. 3) Move the BUG related code living in kernel.h to <linux/bug.h> 4) remove the asm/bug.h from kernel.h to finally break the chain. During development, the order was more like 3-4, build-test, 1-2. But to ensure that git history for bisect doesn't get needless build failures introduced, the commits have been reorderd to fix the problem areas in advance. [1] https://lkml.org/lkml/2012/1/3/90 [2] https://lkml.org/lkml/2012/1/17/414" Fix up conflicts (new radeon file, reiserfs header cleanups) as per Paul and linux-next. tag 'bug-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: kernel.h: doesn't explicitly use bug.h, so don't include it. bug: consolidate BUILD_BUG_ON with other bug code BUG: headers with BUG/BUG_ON etc. need linux/bug.h bug.h: add include of it to various implicit C users lib: fix implicit users of kernel.h for TAINT_WARN spinlock: macroize assert_spin_locked to avoid bug.h dependency x86: relocate get/set debugreg fcns to include/asm/debugreg.
2012-03-23	Merge branch 'akpm' (Andrew's patch-bomb)	Linus Torvalds	3	-3/+8
	Merge second batch of patches from Andrew Morton: - various misc things - core kernel changes to prctl, exit, exec, init, etc. - kernel/watchdog.c updates - get_maintainer - MAINTAINERS - the backlight driver queue - core bitops code cleanups - the led driver queue - some core prio_tree work - checkpatch udpates - largeish crc32 update - a new poll() feature for the v4l guys - the rtc driver queue - fatfs - ptrace - signals - kmod/usermodehelper updates - coredump - procfs updates * emailed from Andrew Morton <akpm@linux-foundation.org>: (141 commits) seq_file: add seq_set_overflow(), seq_overflow() proc-ns: use d_set_d_op() API to set dentry ops in proc_ns_instantiate(). procfs: speed up /proc/pid/stat, statm procfs: add num_to_str() to speed up /proc/stat proc: speed up /proc/stat handling fs/proc/kcore.c: make get_sparsemem_vmemmap_info() static coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP coredump: remove VM_ALWAYSDUMP flag kmod: make __request_module() killable kmod: introduce call_modprobe() helper usermodehelper: ____call_usermodehelper() doesn't need do_exit() usermodehelper: kill umh_wait, renumber UMH_* constants usermodehelper: implement UMH_KILLABLE usermodehelper: introduce umh_complete(sub_info) usermodehelper: use UMH_WAIT_PROC consistently signal: zap_pid_ns_processes: s/SEND_SIG_NOINFO/SEND_SIG_FORCED/ signal: oom_kill_task: use SEND_SIG_FORCED instead of force_sig() signal: cosmetic, s/from_ancestor_ns/force/ in prepare_signal() paths signal: give SEND_SIG_FORCED more power to beat SIGNAL_UNKILLABLE Hexagon: use set_current_blocked() and block_sigmask() ...
2012-03-23	coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP	Jason Baron	1	-0/+4
	Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag: MADV_DONTDUMP, which can be set by applications to specifically request memory regions which should not dump core. The specific application I have in mind is qemu: we can add a flag there that wouldn't dump all of guest memory when qemu dumps core. This flag might also be useful for security sensitive apps that want to absolutely make sure that parts of memory are not dumped. To clear the flag use: MADV_DODUMP. [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland] [akpm@linux-foundation.org: fix up the architectures which broke] Signed-off-by: Jason Baron <jbaron@redhat.com> Acked-by: Roland McGrath <roland@hack.frob.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Avi Kivity <avi@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23	consolidate WARN_...ONCE() static variables	Jan Beulich	2	-3/+4
	Due to the alignment of following variables, these typically consume more than just the single byte that 'bool' requires, and as there are a few hundred instances, the cache pollution (not so much the waste of memory) sums up. Put these variables into their own section, outside of any half way frequently used memory range. Do the same also to the __warned variable of rcu_lockdep_assert(). (Don't, however, include the ones used by printk_once() and alike, as they can potentially be hot.) Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23	Merge branch 'linux-next' of ↵	Linus Torvalds	2	-24/+6
	git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci Pull PCI changes (including maintainer change) from Jesse Barnes: "This pull has some good cleanups from Bjorn and Yinghai, as well as some more code from Yinghai to better handle resource re-allocation when enabled. There's also a new initcall_debug feature from Arjan which will print out quirk timing information to help identify slow quirks for fixing or refinement (Yinghai sent in a few patches to do just that once the new debug code landed). Beyond that, I'm handing off PCI maintainership to Bjorn Helgaas. He's been a core PCI and Linux contributor for some time now, and has kindly volunteered to take over. I just don't feel I have the time for PCI review and work that it deserves lately (I've taken on some other projects), and haven't been as responsive lately as I'd like, so I approached Bjorn asking if he'd like to manage things. He's going to give it a try, and I'm confident he'll do at least as well as I have in keeping the tree managed, patches flowing, and keeping things stable." Fix up some fairly trivial conflicts due to other cleanups (mips device resource fixup cleanups clashing with list handling cleanup, ppc iseries removal clashing with pci_probe_only cleanup etc) * 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci: (112 commits) PCI: Bjorn gets PCI hotplug too PCI: hand PCI maintenance over to Bjorn Helgaas unicore32/PCI: move <asm-generic/pci-bridge.h> include to asm/pci.h sparc/PCI: convert devtree and arch-probed bus addresses to resource powerpc/PCI: allow reallocation on PA Semi powerpc/PCI: convert devtree bus addresses to resource powerpc/PCI: compute I/O space bus-to-resource offset consistently arm/PCI: don't export pci_flags PCI: fix bridge I/O window bus-to-resource conversion x86/PCI: add spinlock held check to 'pcibios_fwaddrmap_lookup()' PCI / PCIe: Introduce command line option to disable ARI PCI: make acpihp use __pci_remove_bus_device instead PCI: export __pci_remove_bus_device PCI: Rename pci_remove_behind_bridge to pci_stop_and_remove_behind_bridge PCI: Rename pci_remove_bus_device to pci_stop_and_remove_bus_device PCI: print out PCI device info along with duration PCI: Move "pci reassigndev resource alignment" out of quirks.c PCI: Use class for quirk for usb host controller fixup PCI: Use class for quirk for ti816x class fixup PCI: Use class for quirk for intel e100 interrupt fixup ...
2012-03-22	Merge branch 'x86-asm-for-linus' of ↵	Linus Torvalds	1	-8/+45
	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/asm changes from Ingo Molnar * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Include probe_roms.h in probe_roms.c x86/32: Print control and debug registers for kerenel context x86: Tighten dependencies of CPU_SUP_*_32 x86/numa: Improve internode cache alignment x86: Fix the NMI nesting comments x86-64: Improve insn scheduling in SAVE_ARGS_IRQ x86-64: Fix CFI annotations for NMI nesting code bitops: Add missing parentheses to new get_order macro bitops: Optimise get_order() bitops: Adjust the comment on get_order() to describe the size==0 case x86/spinlocks: Eliminate TICKET_MASK x86-64: Handle byte-wise tail copying in memcpy() without a loop x86-64: Fix memcpy() to support sizes of 4Gb and above x86-64: Fix memset() to support sizes of 4Gb and above x86-64: Slightly shorten copy_page()
2012-03-21	mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode	Andrea Arcangeli	1	-0/+61
	In some cases it may happen that pmd_none_or_clear_bad() is called with the mmap_sem hold in read mode. In those cases the huge page faults can allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a false positive from pmd_bad() that will not like to see a pmd materializing as trans huge. It's not khugepaged causing the problem, khugepaged holds the mmap_sem in write mode (and all those sites must hold the mmap_sem in read mode to prevent pagetables to go away from under them, during code review it seems vm86 mode on 32bit kernels requires that too unless it's restricted to 1 thread per process or UP builds). The race is only with the huge pagefaults that can convert a pmd_none() into a pmd_trans_huge(). Effectively all these pmd_none_or_clear_bad() sites running with mmap_sem in read mode are somewhat speculative with the page faults, and the result is always undefined when they run simultaneously. This is probably why it wasn't common to run into this. For example if the madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page fault, the hugepage will not be zapped, if the page fault runs first it will be zapped. Altering pmd_bad() not to error out if it finds hugepmds won't be enough to fix this, because zap_pmd_range would then proceed to call zap_pte_range (which would be incorrect if the pmd become a pmd_trans_huge()). The simplest way to fix this is to read the pmd in the local stack (regardless of what we read, no need of actual CPU barriers, only compiler barrier needed), and be sure it is not changing under the code that computes its value. Even if the real pmd is changing under the value we hold on the stack, we don't care. If we actually end up in zap_pte_range it means the pmd was not none already and it was not huge, and it can't become huge from under us (khugepaged locking explained above). All we need is to enforce that there is no way anymore that in a code path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad can run into a hugepmd. The overhead of a barrier() is just a compiler tweak and should not be measurable (I only added it for THP builds). I don't exclude different compiler versions may have prevented the race too by caching the value of pmd on the stack (that hasn't been verified, but it wouldn't be impossible considering pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines and there's no external function called in between pmd_trans_huge and pmd_none_or_clear_bad). if (pmd_trans_huge(pmd)) { if (next-addr != HPAGE_PMD_SIZE) { VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem)); split_huge_page_pmd(vma->vm_mm, pmd); } else if (zap_huge_pmd(tlb, vma, pmd, addr)) continue; /* fall through / } if (pmd_none_or_clear_bad(pmd)) Because this race condition could be exercised without special privileges this was reported in CVE-2012-1179. The race was identified and fully explained by Ulrich who debugged it. I'm quoting his accurate explanation below, for reference. ====== start quote ======= mapcount 0 page_mapcount 1 kernel BUG at mm/huge_memory.c:1384! At some point prior to the panic, a "bad pmd ..." message similar to the following is logged on the console: mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7). The "bad pmd ..." message is logged by pmd_clear_bad() before it clears the page's PMD table entry. 143 void pmd_clear_bad(pmd_t pmd) 144 { -> 145 pmd_ERROR(pmd); 146 pmd_clear(pmd); 147 } After the PMD table entry has been cleared, there is an inconsistency between the actual number of PMD table entries that are mapping the page and the page's map count (_mapcount field in struct page). When the page is subsequently reclaimed, __split_huge_page() detects this inconsistency. 1381 if (mapcount != page_mapcount(page)) 1382 printk(KERN_ERR "mapcount %d page_mapcount %d\n", 1383 mapcount, page_mapcount(page)); -> 1384 BUG_ON(mapcount != page_mapcount(page)); The root cause of the problem is a race of two threads in a multithreaded process. Thread B incurs a page fault on a virtual address that has never been accessed (PMD entry is zero) while Thread A is executing an madvise() system call on a virtual address within the same 2 MB (huge page) range. virtual address space .---------------------. \| \| \| \| .-\|---------------------\| \| \| \| \| \| \|<-- B(fault) \| \| \| 2 MB \| \|/////////////////////\|-. huge < \|/////////////////////\| > A(range) page \| \|/////////////////////\|-' \| \| \| \| \| \| '-\|---------------------\| \| \| \| \| '---------------------' - Thread A is executing an madvise(..., MADV_DONTNEED) system call on the virtual address range "A(range)" shown in the picture. sys_madvise // Acquire the semaphore in shared mode. down_read(&current->mm->mmap_sem) ... madvise_vma switch (behavior) case MADV_DONTNEED: madvise_dontneed zap_page_range unmap_vmas unmap_page_range zap_pud_range zap_pmd_range // // Assume that this huge page has never been accessed. // I.e. content of the PMD entry is zero (not mapped). // if (pmd_trans_huge(pmd)) { // We don't get here due to the above assumption. } // // Assume that Thread B incurred a page fault and .---------> // sneaks in here as shown below. \| // \| if (pmd_none_or_clear_bad(pmd)) \| { \| if (unlikely(pmd_bad(pmd))) \| pmd_clear_bad \| { \| pmd_ERROR \| // Log "bad pmd ..." message here. \| pmd_clear \| // Clear the page's PMD entry. \| // Thread B incremented the map count \| // in page_add_new_anon_rmap(), but \| // now the page is no longer mapped \| // by a PMD entry (-> inconsistency). \| } \| } \| v - Thread B is handling a page fault on virtual address "B(fault)" shown in the picture. ... do_page_fault __do_page_fault // Acquire the semaphore in shared mode. down_read_trylock(&mm->mmap_sem) ... handle_mm_fault if (pmd_none(pmd) && transparent_hugepage_enabled(vma)) // We get here due to the above assumption (PMD entry is zero). do_huge_pmd_anonymous_page alloc_hugepage_vma // Allocate a new transparent huge page here. ... __do_huge_pmd_anonymous_page ... spin_lock(&mm->page_table_lock) ... page_add_new_anon_rmap // Here we increment the page's map count (starts at -1). atomic_set(&page->_mapcount, 0) set_pmd_at // Here we set the page's PMD entry which will be cleared // when Thread A calls pmd_clear_bad(). ... spin_unlock(&mm->page_table_lock) The mmap_sem does not prevent the race because both threads are acquiring it in shared mode (down_read). Thread B holds the page_table_lock while the page's map count and PMD table entry are updated. However, Thread A does not synchronize on that lock. ====== end quote ======= [akpm@linux-foundation.org: checkpatch fixes] Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Jones <davej@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Cc: Mark Salter <msalter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-12	Merge tag 'v3.3-rc7' into gpio/next	Grant Likely	5	-2/+60
	Linux 3.3-rc7. Merged into the gpio branch to pick up gpio bugfixes already in mainline before queueing up move v3.4 patches
2012-03-05	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller	2	-2/+2
	Conflicts: drivers/net/vmxnet3/vmxnet3_drv.c Small vmxnet3 conflict with header size bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-04	BUG: headers with BUG/BUG_ON etc. need linux/bug.h	Paul Gortmaker	3	-0/+4
	If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any other BUG variant in a static inline (i.e. not in a #define) then that header really should be including <linux/bug.h> and not just expecting it to be implicitly present. We can make this change risk-free, since if the files using these headers didn't have exposure to linux/bug.h already, they would have been causing compile failures/warnings. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-03-02	gpio: constify the data parameter to gpiochip_find()	Grant Likely	1	-2/+2
	Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2012-02-28	Merge branch 'linus' into x86/asm	Ingo Molnar	14	-61/+174
	Sync up the latest NMI fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27	[PARISC] fix compile break caused by iomap: make IOPORT/PCI mapping ↵	James Bottomley	2	-2/+2
	functions conditional The problem in commit fea80311a939a746533a6d7e7c3183729d6a3faf Author: Randy Dunlap <rdunlap@xenotime.net> Date: Sun Jul 24 11:39:14 2011 -0700 iomap: make IOPORT/PCI mapping functions conditional is that if your architecture supplies pci_iomap/pci_iounmap, it expects always to supply them. Adding empty body defitions in the !CONFIG_PCI case, which is what this patch does, breaks the parisc compile because the functions become doubly defined. It took us a while to spot this, because we don't actually build !CONFIG_PCI very often (only if someone is brave enough to test the snake/asp machines). Since the note in the commit log says this is to fix a CONFIG_GENERIC_IOMAP issue (which it does because CONFIG_GENERIC_IOMAP supplies pci_iounmap only if CONFIG_PCI is set), there should actually have been a condition upon this. This should make sure no other architecture's !CONFIG_PCI compile breaks in the same way as parisc. The fix had to be updated to take account of the GENERIC_PCI_IOMAP separation. Reported-by: Rolf Eike Beer <eike@sf-mail.de> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2012-02-26	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	David S. Miller	3	-0/+58
	Conflicts: drivers/net/ethernet/sfc/rx.c Overlapping changes in drivers/net/ethernet/sfc/rx.c, one to change the rx_buf->is_page boolean into a set of u16 flags, and another to adjust how ->ip_summed is initialized. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-24	epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()	Oleg Nesterov	1	-0/+2
	This patch is intentionally incomplete to simplify the review. It ignores ep_unregister_pollwait() which plays with the same wqh. See the next change. epoll assumes that the EPOLL_CTL_ADD'ed file controls everything f_op->poll() needs. In particular it assumes that the wait queue can't go away until eventpoll_release(). This is not true in case of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand which is not connected to the file. This patch adds the special event, POLLFREE, currently only for epoll. It expects that init_poll_funcptr()'ed hook should do the necessary cleanup. Perhaps it should be defined as EPOLLFREE in eventpoll. __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if ->signalfd_wqh is not empty, we add the new signalfd_cleanup() helper. ep_poll_callback(POLLFREE) simply does list_del_init(task_list). This make this poll entry inconsistent, but we don't care. If you share epoll fd which contains our sigfd with another process you should blame yourself. signalfd is "really special". I simply do not know how we can define the "right" semantics if it used with epoll. The main problem is, epoll calls signalfd_poll() once to establish the connection with the wait queue, after that signalfd_poll(NULL) returns the different/inconsistent results depending on who does EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd has nothing to do with the file, it works with the current thread. In short: this patch is the hack which tries to fix the symptoms. It also assumes that nobody can take tasklist_lock under epoll locks, this seems to be true. Note: - we do not have wake_up_all_poll() but wake_up_poll() is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE. - signalfd_cleanup() uses POLLHUP along with POLLFREE, we need a couple of simple changes in eventpoll.c to make sure it can't be "lost". Reported-by: Maxime Bizon <mbizon@freebox.fr> Cc: <stable@kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-24	bitops: Add missing parentheses to new get_order macro	Joerg Roedel	1	-2/+2
	The new get_order macro introcuded in commit d66acc39c7cee323733c8503b9de1821a56dff7e does not use parentheses around all uses of the parameter n. This causes new compile warnings, for example in the amd_iommu_init.c function: drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses] drivers/iommu/amd_iommu_init.c:561:6: warning: suggest parentheses around comparison in operand of ‘&’ [-Wparentheses] Fix those warnings by adding the missing parentheses. Reported-by: Ingo Molnar <mingo@elte.hu> Cc: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Link: http://lkml.kernel.org/r/1330088295-28732-1-git-send-email-joerg.roedel@amd.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-24	net: Add framework to allow sending packets with customized CRC.	Ben Greear	1	-0/+4
	This is useful for testing RX handling of frames with bad CRCs. Requires driver support to actually put the packet on the wire properly. Signed-off-by: Ben Greear <greearb@candelatech.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-02-23	PCI: collapse pcibios_resource_to_bus	Bjorn Helgaas	1	-2/+0
	Everybody uses the generic pcibios_resource_to_bus() supplied by the core now, so remove the ARCH_HAS_GENERIC_PCI_OFFSETS used during conversion. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-02-23	PCI: add generic pcibios_resource_to_bus()	Bjorn Helgaas	1	-23/+1
	This replaces the generic versions of pcibios_resource_to_bus() and pcibios_bus_to_resource() in asm-generic/pci.h with versions that use pci_resource_to_bus() and pci_bus_to_resource(). The replacements are equivalent except that they can apply host bridge window offsets when the arch has supplied them by using pci_add_resource_offset(). Each arch can convert to using pci_add_resource_offset() individually by removing its device resource fixups from pcibios_fixup_bus() and supplying ARCH_HAS_GENERIC_PCI_OFFSETS. ARCH_HAS_GENERIC_PCI_OFFSETS can be removed after all have converted. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-02-23	PCI: add pci_clear_flags()	Bjorn Helgaas	1	-0/+6
	Add a pci_clear_flags() for cases when we statically initialize pci_flags, then decide to clear things out later. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
2012-02-21	asm-generic: architecture independent readq/writeq for 32bit environment	Hitoshi Mitake	2	-0/+56
	This provides unified readq()/writeq() helper functions for 32-bit drivers. For some cases, readq/writeq without atomicity is harmful, and order of io access has to be specified explicitly. So in this patch, new two header files which contain non-atomic readq/writeq are added. - <asm-generic/io-64-nonatomic-lo-hi.h> provides non-atomic readq/ writeq with the order of lower address -> higher address - <asm-generic/io-64-nonatomic-hi-lo.h> provides non-atomic readq/ writeq with reversed order This allows us to remove some readq()s that were added drivers when the default non-atomic ones were removed in commit dbee8a0affd5 ("x86: remove 32-bit versions of readq()/writeq()") The drivers which need readq/writeq but can do with the non-atomic ones must add the line: #include <asm-generic/io-64-nonatomic-lo-hi.h> /* or hi-lo.h */ But this will be nop in 64-bit environments, and no other #ifdefs are required. So I believe that this patch can solve the problem of 1. driver-specific readq/writeq 2. atomicity and order of io access This patch is tested with building allyesconfig and allmodconfig as ARCH=x86 and ARCH=i386 on top of tip/master. Cc: Kashyap Desai <Kashyap.Desai@lsi.com> Cc: Len Brown <lenb@kernel.org> Cc: Ravi Anand <ravi.anand@qlogic.com> Cc: Vikas Chaudhary <vikas.chaudhary@qlogic.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Jason Uhlenkott <juhlenko@akamai.com> Cc: James Bottomley <James.Bottomley@parallels.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Roland Dreier <roland@purestorage.com> Cc: James Bottomley <jbottomley@parallels.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Hitoshi Mitake <h.mitake@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-21	sock: Introduce the SO_PEEK_OFF sock option	Pavel Emelyanov	1	-0/+1
	This one specifies where to start MSG_PEEK-ing queue data from. When set to negative value means that MSG_PEEK works as ususally -- peeks from the head of the queue always. When some bytes are peeked from queue and the peeking offset is non negative it is moved forward so that the next peek will return next portion of data. When non-peeking recvmsg occurs and the peeking offset is non negative is is moved backward so that the next peek will still peek the proper data (i.e. the one that would have been picked if there were no non peeking recv in between). The offset is set using per-proto opteration to let the protocol handle the locking issues and to check whether the peeking offset feature is supported by the protocol the socket belongs to. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-20	bitops: Optimise get_order()	David Howells	1	-12/+28
	Optimise get_order() to use bit scanning instructions if such exist rather than a loop. Also, make it possible to use get_order() in static initialisations too by building it on top of ilog2() in the constant parameter case. This has been tested for i386 and x86_64 using the following userspace program, and for FRV by making appropriate substitutions for fls() and fls64(). It will abort if the case for get_order() deviates from the original except for the order of 0, for which get_order() produces an undefined result. This program tests both dynamic and static parameters. #include <stdlib.h> #include <stdio.h> #ifdef __x86_64__ #define BITS_PER_LONG 64 #else #define BITS_PER_LONG 32 #endif #define PAGE_SHIFT 12 typedef unsigned long long __u64, u64; typedef unsigned int __u32, u32; #define noinline __attribute__((noinline)) static inline int fls(int x) { int bitpos = -1; asm("bsrl %1,%0" : "+r" (bitpos) : "rm" (x)); return bitpos + 1; } static __always_inline int fls64(__u64 x) { #if BITS_PER_LONG == 64 long bitpos = -1; asm("bsrq %1,%0" : "+r" (bitpos) : "rm" (x)); return bitpos + 1; #else __u32 h = x >> 32, l = x; int bitpos = -1; asm("bsrl %1,%0 \n" "subl %2,%0 \n" "bsrl %3,%0 \n" : "+r" (bitpos) : "rm" (l), "i"(32), "rm" (h)); return bitpos + 33; #endif } static inline __attribute__((const)) int __ilog2_u32(u32 n) { return fls(n) - 1; } static inline __attribute__((const)) int __ilog2_u64(u64 n) { return fls64(n) - 1; } extern __attribute__((const, noreturn)) int ____ilog2_NaN(void); #define ilog2(n) \ ( \ __builtin_constant_p(n) ? ( \ (n) < 1 ? ____ilog2_NaN() : \ (n) & (1ULL << 63) ? 63 : \ (n) & (1ULL << 62) ? 62 : \ (n) & (1ULL << 61) ? 61 : \ (n) & (1ULL << 60) ? 60 : \ (n) & (1ULL << 59) ? 59 : \ (n) & (1ULL << 58) ? 58 : \ (n) & (1ULL << 57) ? 57 : \ (n) & (1ULL << 56) ? 56 : \ (n) & (1ULL << 55) ? 55 : \ (n) & (1ULL << 54) ? 54 : \ (n) & (1ULL << 53) ? 53 : \ (n) & (1ULL << 52) ? 52 : \ (n) & (1ULL << 51) ? 51 : \ (n) & (1ULL << 50) ? 50 : \ (n) & (1ULL << 49) ? 49 : \ (n) & (1ULL << 48) ? 48 : \ (n) & (1ULL << 47) ? 47 : \ (n) & (1ULL << 46) ? 46 : \ (n) & (1ULL << 45) ? 45 : \ (n) & (1ULL << 44) ? 44 : \ (n) & (1ULL << 43) ? 43 : \ (n) & (1ULL << 42) ? 42 : \ (n) & (1ULL << 41) ? 41 : \ (n) & (1ULL << 40) ? 40 : \ (n) & (1ULL << 39) ? 39 : \ (n) & (1ULL << 38) ? 38 : \ (n) & (1ULL << 37) ? 37 : \ (n) & (1ULL << 36) ? 36 : \ (n) & (1ULL << 35) ? 35 : \ (n) & (1ULL << 34) ? 34 : \ (n) & (1ULL << 33) ? 33 : \ (n) & (1ULL << 32) ? 32 : \ (n) & (1ULL << 31) ? 31 : \ (n) & (1ULL << 30) ? 30 : \ (n) & (1ULL << 29) ? 29 : \ (n) & (1ULL << 28) ? 28 : \ (n) & (1ULL << 27) ? 27 : \ (n) & (1ULL << 26) ? 26 : \ (n) & (1ULL << 25) ? 25 : \ (n) & (1ULL << 24) ? 24 : \ (n) & (1ULL << 23) ? 23 : \ (n) & (1ULL << 22) ? 22 : \ (n) & (1ULL << 21) ? 21 : \ (n) & (1ULL << 20) ? 20 : \ (n) & (1ULL << 19) ? 19 : \ (n) & (1ULL << 18) ? 18 : \ (n) & (1ULL << 17) ? 17 : \ (n) & (1ULL << 16) ? 16 : \ (n) & (1ULL << 15) ? 15 : \ (n) & (1ULL << 14) ? 14 : \ (n) & (1ULL << 13) ? 13 : \ (n) & (1ULL << 12) ? 12 : \ (n) & (1ULL << 11) ? 11 : \ (n) & (1ULL << 10) ? 10 : \ (n) & (1ULL << 9) ? 9 : \ (n) & (1ULL << 8) ? 8 : \ (n) & (1ULL << 7) ? 7 : \ (n) & (1ULL << 6) ? 6 : \ (n) & (1ULL << 5) ? 5 : \ (n) & (1ULL << 4) ? 4 : \ (n) & (1ULL << 3) ? 3 : \ (n) & (1ULL << 2) ? 2 : \ (n) & (1ULL << 1) ? 1 : \ (n) & (1ULL << 0) ? 0 : \ ____ilog2_NaN() \ ) : \ (sizeof(n) <= 4) ? \ __ilog2_u32(n) : \ __ilog2_u64(n) \ ) static noinline __attribute__((const)) int old_get_order(unsigned long size) { int order; size = (size - 1) >> (PAGE_SHIFT - 1); order = -1; do { size >>= 1; order++; } while (size); return order; } static noinline __attribute__((const)) int __get_order(unsigned long size) { int order; size--; size >>= PAGE_SHIFT; #if BITS_PER_LONG == 32 order = fls(size); #else order = fls64(size); #endif return order; } #define get_order(n) \ ( \ __builtin_constant_p(n) ? ( \ (n == 0UL) ? BITS_PER_LONG - PAGE_SHIFT : \ ((n < (1UL << PAGE_SHIFT)) ? 0 : \ ilog2((n) - 1) - PAGE_SHIFT + 1) \ ) : \ __get_order(n) \ ) #define order(N) \ { (1UL << N) - 1, get_order((1UL << N) - 1) }, \ { (1UL << N), get_order((1UL << N)) }, \ { (1UL << N) + 1, get_order((1UL << N) + 1) } struct order { unsigned long n, order; }; static const struct order order_table[] = { order(0), order(1), order(2), order(3), order(4), order(5), order(6), order(7), order(8), order(9), order(10), order(11), order(12), order(13), order(14), order(15), order(16), order(17), order(18), order(19), order(20), order(21), order(22), order(23), order(24), order(25), order(26), order(27), order(28), order(29), order(30), order(31), #if BITS_PER_LONG == 64 order(32), order(33), order(34), order(35), #endif { 0x2929 } }; void check(int loop, unsigned long n) { unsigned long old, new; printf("[%2d]: %09lx \| ", loop, n); old = old_get_order(n); new = get_order(n); printf("%3ld, %3ld\n", old, new); if (n != 0 && old != new) abort(); } int main(int argc, char *argv) { const struct order p; unsigned long n; int loop; for (loop = 0; loop <= BITS_PER_LONG - 1; loop++) { n = 1UL << loop; check(loop, n - 1); check(loop, n); check(loop, n + 1); } for (p = order_table; p->n != 0x2929; p++) { unsigned long old, new; old = old_get_order(p->n); new = p->order; printf("%09lx\t%3ld, %3ld\n", p->n, old, new); if (p->n != 0 && old != new) abort(); } return 0; } Disassembling the x86_64 version of the above code shows: 0000000000400510 <old_get_order>: 400510: 48 83 ef 01 sub $0x1,%rdi 400514: b8 ff ff ff ff mov $0xffffffff,%eax 400519: 48 c1 ef 0b shr $0xb,%rdi 40051d: 0f 1f 00 nopl (%rax) 400520: 83 c0 01 add $0x1,%eax 400523: 48 d1 ef shr %rdi 400526: 75 f8 jne 400520 <old_get_order+0x10> 400528: f3 c3 repz retq 40052a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) 0000000000400530 <__get_order>: 400530: 48 83 ef 01 sub $0x1,%rdi 400534: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax 40053b: 48 c1 ef 0c shr $0xc,%rdi 40053f: 48 0f bd c7 bsr %rdi,%rax 400543: 83 c0 01 add $0x1,%eax 400546: c3 retq 400547: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 40054e: 00 00 As can be seen, the new __get_order() function is simpler than the old_get_order() function. Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20120220223928.16199.29548.stgit@warthog.procyon.org.uk Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20	bitops: Adjust the comment on get_order() to describe the size==0 case	David Howells	1	-1/+22
	Adjust the comment on get_order() to note that the result of passing a size of 0 results in an undefined value. Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20120220223917.16199.9416.stgit@warthog.procyon.org.uk Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-20	posix_types: Introduce __kernel_[u]long_t	H. Peter Anvin	1	-9/+14
	Introduce __kernel_[u]long_t, which allows an ABI to override all defaults of type [unsigned] long. This enables x32 and potentially other 32-bit userspace on 64-bit kernel ABIs. Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-14	posix_types: Remove fd_set macros	H. Peter Anvin	1	-72/+0
	<asm/posix_types.h> includes a set of macros that operate on file descriptors. Way long ago those were exported to user space, but nowadays they are #ifdef __KERNEL__. However, they are nothing but standard (nonatomic) bit operations, and we already have optimized versions of bit operations in the kernel. We can't include <linux/bitops.h> in <asm/posix_types.h> but we can move the definitions to <linux/time.h> and define them there in terms of standard kernel bitops. [ v2: folds the following fixes in: a) Stray space in __FD_SET(), reported by Andrew Morton b) #include <linux/string.h> needed for memset(), reported by Tony Luck ] Signed-off-by: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/1328677745-20121-22-git-send-email-hpa@zytor.com Cc: Arnd Bergmann <arnd@arndb.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2012-02-14	posix_types: Make it possible to override __kernel_fsid_t	H. Peter Anvin	1	-4/+6
	__kernel_fsid_t has members of type "long" on at least one architecture (MIPS32), so make it possible to override the definition. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/1328677745-20121-3-git-send-email-hpa@zytor.com Cc: Arnd Bergmann <arnd@arndb.de>
2012-02-14	posix_types: Make __kernel_[ug]id32_t default to unsigned int	H. Peter Anvin	1	-2/+2
	All ports use unsigned int for __kernel_[ug]id32_t, but not all ports use unsigned int for __kernel_[ug]id_t. Thus, change the default for the "32" types so ports don't need to override them. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Link: http://lkml.kernel.org/r/1328677745-20121-2-git-send-email-hpa@zytor.com Cc: Arnd Bergmann <arnd@arndb.de>