summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)AuthorFilesLines
2018-05-22xfs, dax: introduce xfs_break_dax_layouts()Dan Williams1-10/+51
xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans for busy / pinned dax pages and waits for those pages to go idle before any potential extent unmap operation. dax_layout_busy_page() handles synchronizing against new page-busy events (get_user_pages). It invalidates all mappings to trigger the get_user_pages slow path which will eventually block on the xfs inode lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a busy page it returns it for xfs to wait for the page-idle event that will fire when the page reference count reaches 1 (recall ZONE_DEVICE pages are idle at count 1, see generic_dax_pagefree()). While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not deadlock the process that might be trying to elevate the page count of more pages before arranging for any of them to go idle. I.e. the typical case of submitting I/O is that iov_iter_get_pages() elevates the reference count of all pages in the I/O before starting I/O on the first page. The process of elevating the reference count of all pages involved in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL. Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is held while sleeping. We need this to prevent starvation of the truncate path as continuous submission of direct-I/O could starve the truncate path indefinitely if the lock is dropped. Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22xfs: prepare xfs_break_layouts() for another layout typeDan Williams6-15/+53
When xfs is operating as the back-end of a pNFS block server, it prevents collisions between local and remote operations by requiring a lease to be held for remotely accessed blocks. Local filesystem operations break those leases before writing or mutating the extent map of the file. A similar mechanism is needed to prevent operations on pinned dax mappings, like device-DMA, from colliding with extent unmap operations. BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of layout breaking. Layouts are broken in the BREAK_WRITE case to ensure that layout-holders do not collide with local writes. Additionally, layouts are broken in the BREAK_UNMAP case to make sure the layout-holder has a consistent view of the file's extent map. While BREAK_WRITE breaks can be satisfied be recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL. After this refactoring xfs_break_layouts() becomes the entry point for coordinating both types of breaks. Finally, xfs_break_leased_layouts() becomes just the BREAK_WRITE handler. Note that the unlock tracking is needed in a follow on change. That will coordinate retrying either break handler until both successfully test for a lease break while maintaining the lock state. Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Reported-by: Dave Chinner <david@fromorbit.com> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCLDan Williams4-12/+11
In preparation for adding coordination between extent unmap operations and busy dax-pages, update xfs_break_layouts() to permit it to be called with the mmap lock held. This lock scheme will be required for coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE) pages mapped into the file's address space). Breaking dax layouts will be added to xfs_break_layouts() in a future patch, for now this preps the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to xfs_break_layouts(). Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22mm, fs, dax: handle layout changes to pinned dax mappingsDan Williams1-0/+97
Background: get_user_pages() in the filesystem pins file backed memory pages for access by devices performing dma. However, it only pins the memory pages not the page-to-file offset association. If a file is truncated the pages are mapped out of the file and dma may continue indefinitely into a page that is owned by a device driver. This breaks coherency of the file vs dma, but the assumption is that if userspace wants the file-space truncated it does not matter what data is inbound from the device, it is not relevant anymore. The only expectation is that dma can safely continue while the filesystem reallocates the block(s). Problem: This expectation that dma can safely continue while the filesystem changes the block map is broken by dax. With dax the target dma page *is* the filesystem block. The model of leaving the page pinned for dma, but truncating the file block out of the file, means that the filesytem is free to reallocate a block under active dma to another file and now the expected data-incoherency situation has turned into active data-corruption. Solution: Defer all filesystem operations (fallocate(), truncate()) on a dax mode file while any page/block in the file is under active dma. This solution assumes that dma is transient. Cases where dma operations are known to not be transient, like RDMA, have been explicitly disabled via commits like 5f1d43de5416 "IB/core: disable memory registration of filesystem-dax vmas". The dax_layout_busy_page() routine is called by filesystems with a lock held against mm faults (i_mmap_lock) to find pinned / busy dax pages. The process of looking up a busy page invalidates all mappings to trigger any subsequent get_user_pages() to block on i_mmap_lock. The filesystem continues to call dax_layout_busy_page() until it finally returns no more active pages. This approach assumes that the page pinning is transient, if that assumption is violated the system would have likely hung from the uncompleted I/O. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Darrick J. Wong" <darrick.wong@oracle.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-05-22mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPSDan Williams1-0/+1
In preparation for fixing dax-dma-vs-unmap issues, filesystems need to be able to rely on the fact that they will get wakeups on dev_pagemap page-idle events. Introduce MEMORY_DEVICE_FS_DAX and generic_dax_page_free() as common indicator / infrastructure for dax filesytems to require. With this change there are no users of the MEMORY_DEVICE_HOST designation, so remove it. The HMM sub-system extended dev_pagemap to arrange a callback when a dev_pagemap managed page is freed. Since a dev_pagemap page is free / idle when its reference count is 1 it requires an additional branch to check the page-type at put_page() time. Given put_page() is a hot-path we do not want to incur that check if HMM is not in use, so a static branch is used to avoid that overhead when not necessary. Now, the FS_DAX implementation wants to reuse this mechanism for receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific static-key into a generic mechanism that either HMM or FS_DAX code paths can enable. For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support, care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure. However, we still need to support FS_DAX in the FS_DAX_LIMITED case implemented by the s390/dcssblk driver. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Thomas Meyer <thomas@m3y3r.de> Reported-by: Dave Jiang <dave.jiang@intel.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-04-22Merge tag '4.17-rc1-SMB3-CIFS' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds5-9/+16
Pull cifs fixes from Steve French: "Various SMB3/CIFS fixes. There are three more security related fixes in progress that are not included in this set but they are still being tested and reviewed, so sending this unrelated set of smaller fixes now" * tag '4.17-rc1-SMB3-CIFS' of git://git.samba.org/sfrench/cifs-2.6: CIFS: fix typo in cifs_dbg cifs: do not allow creating sockets except with SMB1 posix exensions cifs: smbd: Dump SMB packet when configured cifs: smbd: Check for iov length on sending the last iov fs: cifs: Adding new return type vm_fault_t cifs: smb2ops: Fix NULL check in smb2_query_symlink
2018-04-22Merge tag 'for-4.17-rc1-tag' of ↵Linus Torvalds13-47/+199
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "This contains a few fixups to the qgroup patches that were merged this dev cycle, unaligned access fix, blockgroup removal corner case fix and a small debugging output tweak" * tag 'for-4.17-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: print-tree: debugging output enhancement btrfs: Fix race condition between delayed refs and blockgroup removal btrfs: fix unaligned access in readdir btrfs: Fix wrong btrfs_delalloc_release_extents parameter btrfs: delayed-inode: Remove wrong qgroup meta reservation calls btrfs: qgroup: Use independent and accurate per inode qgroup rsv btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT
2018-04-20fs, elf: don't complain MAP_FIXED_NOREPLACE unless -EEXIST errorTetsuo Handa1-4/+4
Commit 4ed28639519c ("fs, elf: drop MAP_FIXED usage from elf_map") is printing spurious messages under memory pressure due to map_addr == -ENOMEM. 9794 (a.out): Uhuuh, elf segment at 00007f2e34738000(fffffffffffffff4) requested but the memory is mapped already 14104 (a.out): Uhuuh, elf segment at 00007f34fd76c000(fffffffffffffff4) requested but the memory is mapped already 16843 (a.out): Uhuuh, elf segment at 00007f930ecc7000(fffffffffffffff4) requested but the memory is mapped already Complain only if -EEXIST, and use %px for printing the address. Link: http://lkml.kernel.org/r/201804182307.FAC17665.SFMOFJVFtHOLOQ@I-love.SAKURA.ne.jp Fixes: 4ed28639519c7bad ("fs, elf: drop MAP_FIXED usage from elf_map") is Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrei Vagin <avagin@openvz.org> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Kees Cook <keescook@chromium.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Joel Stanley <joel@jms.id.au> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20proc: fix /proc/loadavg regressionAlexey Dobriyan1-1/+1
Commit 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API") changed last field of /proc/loadavg (last pid allocated) to be off by one: # unshare -p -f --mount-proc cat /proc/loadavg 0.00 0.00 0.00 1/60 2 <=== It should be 1 after first fork into pid namespace. This is formally a regression but given how useless this field is I don't think anyone is affected. Bug was found by /proc testsuite! Link: http://lkml.kernel.org/r/20180413175408.GA27246@avx2 Fixes: 95846ecf9dac508 ("pid: replace pid bitmap implementation with IDR API") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Gargi Sharma <gs051095@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20proc: revalidate kernel thread inodes to root:rootAlexey Dobriyan1-0/+6
task_dump_owner() has the following code: mm = task->mm; if (mm) { if (get_dumpable(mm) != SUID_DUMP_USER) { uid = ... } } Check for ->mm is buggy -- kernel thread might be borrowing mm and inode will go to some random uid:gid pair. Link: http://lkml.kernel.org/r/20180412220109.GA20978@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20autofs: mount point create should honour passed in modeIan Kent1-1/+1
The autofs file system mkdir inode operation blindly sets the created directory mode to S_IFDIR | 0555, ingoring the passed in mode, which can cause selinux dac_override denials. But the function also checks if the caller is the daemon (as no-one else should be able to do anything here) so there's no point in not honouring the passed in mode, allowing the daemon to set appropriate mode when required. Link: http://lkml.kernel.org/r/152361593601.8051.14014139124905996173.stgit@pluto.themaw.net Signed-off-by: Ian Kent <raven@themaw.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20writeback: safer lock nestingGreg Thelen1-3/+4
lock_page_memcg()/unlock_page_memcg() use spin_lock_irqsave/restore() if the page's memcg is undergoing move accounting, which occurs when a process leaves its memcg for a new one that has memory.move_charge_at_immigrate set. unlocked_inode_to_wb_begin,end() use spin_lock_irq/spin_unlock_irq() if the given inode is switching writeback domains. Switches occur when enough writes are issued from a new domain. This existing pattern is thus suspicious: lock_page_memcg(page); unlocked_inode_to_wb_begin(inode, &locked); ... unlocked_inode_to_wb_end(inode, locked); unlock_page_memcg(page); If both inode switch and process memcg migration are both in-flight then unlocked_inode_to_wb_end() will unconditionally enable interrupts while still holding the lock_page_memcg() irq spinlock. This suggests the possibility of deadlock if an interrupt occurs before unlock_page_memcg(). truncate __cancel_dirty_page lock_page_memcg unlocked_inode_to_wb_begin unlocked_inode_to_wb_end <interrupts mistakenly enabled> <interrupt> end_page_writeback test_clear_page_writeback lock_page_memcg <deadlock> unlock_page_memcg Due to configuration limitations this deadlock is not currently possible because we don't mix cgroup writeback (a cgroupv2 feature) and memory.move_charge_at_immigrate (a cgroupv1 feature). If the kernel is hacked to always claim inode switching and memcg moving_account, then this script triggers lockup in less than a minute: cd /mnt/cgroup/memory mkdir a b echo 1 > a/memory.move_charge_at_immigrate echo 1 > b/memory.move_charge_at_immigrate ( echo $BASHPID > a/cgroup.procs while true; do dd if=/dev/zero of=/mnt/big bs=1M count=256 done ) & while true; do sync done & sleep 1h & SLEEP=$! while true; do echo $SLEEP > a/cgroup.procs echo $SLEEP > b/cgroup.procs done The deadlock does not seem possible, so it's debatable if there's any reason to modify the kernel. I suggest we should to prevent future surprises. And Wang Long said "this deadlock occurs three times in our environment", so there's more reason to apply this, even to stable. Stable 4.4 has minor conflicts applying this patch. For a clean 4.4 patch see "[PATCH for-4.4] writeback: safer lock nesting" https://lkml.org/lkml/2018/4/11/146 Wang Long said "this deadlock occurs three times in our environment" [gthelen@google.com: v4] Link: http://lkml.kernel.org/r/20180411084653.254724-1-gthelen@google.com [akpm@linux-foundation.org: comment tweaks, struct initialization simplification] Change-Id: Ibb773e8045852978f6207074491d262f1b3fb613 Link: http://lkml.kernel.org/r/20180410005908.167976-1-gthelen@google.com Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates") Signed-off-by: Greg Thelen <gthelen@google.com> Reported-by: Wang Long <wanglong19@meituan.com> Acked-by: Wang Long <wanglong19@meituan.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: <stable@vger.kernel.org> [v4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20mm, pagemap: fix swap offset value for PMD migration entryHuang Ying1-1/+5
The swap offset reported by /proc/<pid>/pagemap may be not correct for PMD migration entries. If addr passed into pagemap_pmd_range() isn't aligned with PMD start address, the swap offset reported doesn't reflect this. And in the loop to report information of each sub-page, the swap offset isn't increased accordingly as that for PFN. This may happen after opening /proc/<pid>/pagemap and seeking to a page whose address doesn't align with a PMD start address. I have verified this with a simple test program. BTW: migration swap entries have PFN information, do we need to restrict whether to show them? [akpm@linux-foundation.org: fix typo, per Huang, Ying] Link: http://lkml.kernel.org/r/20180408033737.10897-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrei Vagin <avagin@openvz.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Jerome Glisse" <jglisse@redhat.com> Cc: Daniel Colascione <dancol@google.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20CIFS: fix typo in cifs_dbgAurelien Aptel1-1/+1
Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Reported-by: Long Li <longli@microsoft.com>
2018-04-20cifs: do not allow creating sockets except with SMB1 posix exensionsSteve French1-4/+5
RHBZ: 1453123 Since at least the 3.10 kernel and likely a lot earlier we have not been able to create unix domain sockets in a cifs share when mounted using the SFU mount option (except when mounted with the cifs unix extensions to Samba e.g.) Trying to create a socket, for example using the af_unix command from xfstests will cause : BUG: unable to handle kernel NULL pointer dereference at 00000000 00000040 Since no one uses or depends on being able to create unix domains sockets on a cifs share the easiest fix to stop this vulnerability is to simply not allow creation of any other special files than char or block devices when sfu is used. Added update to Ronnie's patch to handle a tcon link leak, and to address a buf leak noticed by Gustavo and Colin. Acked-by: Gustavo A. R. Silva <gustavo@embeddedor.com> CC: Colin Ian King <colin.king@canonical.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Cc: stable@vger.kernel.org
2018-04-20cifs: smbd: Dump SMB packet when configuredLong Li1-1/+5
When sending through SMB Direct, also dump the packet in SMB send path. Also fixed a typo in debug message. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-04-20btrfs: print-tree: debugging output enhancementQu Wenruo2-11/+16
This patch enhances the following things: - tree block header * add generation and owner output for node and leaf - node pointer generation output - allow btrfs_print_tree() to not follow nodes * just like btrfs-progs Please note that, although function btrfs_print_tree() is not called by anyone right now, it's still a pretty useful function to debug kernel. So that function is still kept for later use. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-20btrfs: Fix race condition between delayed refs and blockgroup removalNikolay Borisov3-10/+26
When the delayed refs for a head are all run, eventually cleanup_ref_head is called which (in case of deletion) obtains a reference for the relevant btrfs_space_info struct by querying the bg for the range. This is problematic because when the last extent of a bg is deleted a race window emerges between removal of that bg and the subsequent invocation of cleanup_ref_head. This can result in cache being null and either a null pointer dereference or assertion failure. task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000 RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs] RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292 RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8 RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001 R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0 R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0 FS: 00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs] btrfs_run_delayed_refs+0x68/0x250 [btrfs] btrfs_should_end_transaction+0x42/0x60 [btrfs] btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs] btrfs_evict_inode+0x4c6/0x5c0 [btrfs] evict+0xc6/0x190 do_unlinkat+0x19c/0x300 do_syscall_64+0x74/0x140 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x7fbf589c57a7 To fix this, introduce a new flag "is_system" to head_ref structs, which is populated at insertion time. This allows to decouple the querying for the spaceinfo from querying the possibly deleted bg. Fixes: d7eae3403f46 ("Btrfs: rework delayed ref total_bytes_pinned accounting") CC: stable@vger.kernel.org # 4.14+ Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-20vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversionDavid Howells1-1/+1
In do_mount() when the MS_* flags are being converted to MNT_* flags, MS_RDONLY got accidentally convered to SB_RDONLY. Undo this change. Fixes: e462ec50cb5f ("VFS: Differentiate mount flags (MS_*) from internal superblock flags") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20afs: Fix server record deletionDavid Howells1-1/+8
AFS server records get removed from the net->fs_servers tree when they're deleted, but not from the net->fs_addresses{4,6} lists, which can lead to an oops in afs_find_server() when a server record has been removed, for instance during rmmod. Fix this by deleting the record from the by-address lists before posting it for RCU destruction. The reason this hasn't been noticed before is that the fileserver keeps probing the local cache manager, thereby keeping the service record alive, so the oops would only happen when a fileserver eventually gets bored and stops pinging or if the module gets rmmod'd and a call comes in from the fileserver during the window between the server records being destroyed and the socket being closed. The oops looks something like: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c ... Workqueue: kafsd afs_process_async_call [kafs] RIP: 0010:afs_find_server+0x271/0x36f [kafs] ... Call Trace: afs_deliver_cb_init_call_back_state3+0x1f2/0x21f [kafs] afs_deliver_to_call+0x1ee/0x5e8 [kafs] afs_process_async_call+0x5b/0xd0 [kafs] process_one_work+0x2c2/0x504 worker_thread+0x1d4/0x2ac kthread+0x11f/0x127 ret_from_fork+0x24/0x30 Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20Merge branch 'for-linus' of ↵Linus Torvalds4-7/+12
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fixes from Al Viro: "Assorted fixes. Some of that is only a matter with fault injection (broken handling of small allocation failure in various mount-related places), but the last one is a root-triggerable stack overflow, and combined with userns it gets really nasty ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Don't leak MNT_INTERNAL away from internal mounts mm,vmscan: Allow preallocating memory for register_shrinker(). rpc_pipefs: fix double-dput() orangefs_kill_sb(): deal with allocation failures jffs2_kill_sb(): deal with failed allocations hypfs_kill_super(): deal with failed allocations
2018-04-20Merge tag 'ecryptfs-4.17-rc2-fixes' of ↵Linus Torvalds4-21/+46
git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs Pull eCryptfs fixes from Tyler Hicks: "Minor cleanups and a bug fix to completely ignore unencrypted filenames in the lower filesystem when filename encryption is enabled at the eCryptfs layer" * tag 'ecryptfs-4.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs: eCryptfs: don't pass up plaintext names when using filename encryption ecryptfs: fix spelling mistake: "cadidate" -> "candidate" ecryptfs: lookup: Don't check if mount_crypt_stat is NULL
2018-04-20Merge tag 'for_v4.17-rc2' of ↵Linus Torvalds7-39/+54
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs - isofs memory leak fix - two fsnotify fixes of event mask handling - udf fix of UTF-16 handling - couple other smaller cleanups * tag 'for_v4.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix leak of UTF-16 surrogates into encoded strings fs: ext2: Adding new return type vm_fault_t isofs: fix potential memory leak in mount option parsing MAINTAINERS: add an entry for FSNOTIFY infrastructure fsnotify: fix typo in a comment about mark->g_list fsnotify: fix ignore mask logic in send_to_group() isofs compress: Remove VLA usage fs: quota: Replace GFP_ATOMIC with GFP_KERNEL in dquot_init fanotify: fix logic of events on child
2018-04-19Don't leak MNT_INTERNAL away from internal mountsAl Viro1-1/+2
We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for their copies. As it is, creating a deep stack of bindings of /proc/*/ns/* somewhere in a new namespace and exiting yields a stack overflow. Cc: stable@kernel.org Reported-by: Alexander Aring <aring@mojatatu.com> Bisected-by: Kirill Tkhai <ktkhai@virtuozzo.com> Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com> Tested-by: Alexander Aring <aring@mojatatu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-18cifs: smbd: Check for iov length on sending the last iovLong Li1-0/+2
When sending the last iov that breaks into smaller buffers to fit the transfer size, it's necessary to check if this is the last iov. If this is the latest iov, stop and proceed to send pages. Signed-off-by: Long Li <longli@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-04-19btrfs: fix unaligned access in readdirDavid Sterba1-8/+12
The last update to readdir introduced a temporary buffer to store the emitted readdir data, but as there are file names of variable length, there's a lot of unaligned access. This was observed on a sparc64 machine: Kernel unaligned access at TPC[102f3080] btrfs_real_readdir+0x51c/0x718 [btrfs] Fixes: 23b5ec74943 ("btrfs: fix readdir deadlock with pagefault") CC: stable@vger.kernel.org # 4.14+ Reported-and-tested-by: René Rebe <rene@exactcode.com> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18btrfs: Fix wrong btrfs_delalloc_release_extents parameterQu Wenruo1-1/+1
Commit 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") merged into mainline is not the latest version submitted to mail list in Dec 2017. It has a fatal wrong @qgroup_free parameter, which results increasing qgroup metadata pertrans reserved space, and causing a lot of early EDQUOT. Fix it by applying the correct diff on top of current branch. Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18btrfs: delayed-inode: Remove wrong qgroup meta reservation callsQu Wenruo1-4/+16
Commit 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item") merged into mainline was not latest version submitted to the mail list in Dec 2017. Which lacks the following fixes: 1) Remove btrfs_qgroup_convert_reserved_meta() call in btrfs_delayed_item_release_metadata() 2) Remove btrfs_qgroup_reserve_meta_prealloc() call in btrfs_delayed_inode_reserve_metadata() Those fixes will resolve unexpected EDQUOT problems. Fixes: 4f5427ccce5d ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18btrfs: qgroup: Use independent and accurate per inode qgroup rsvQu Wenruo2-11/+65
Unlike reservation calculation used in inode rsv for metadata, qgroup doesn't really need to care about things like csum size or extent usage for the whole tree COW. Qgroups care more about net change of the extent usage. That's to say, if we're going to insert one file extent, it will mostly find its place in COWed tree block, leaving no change in extent usage. Or causing a leaf split, resulting in one new net extent and increasing qgroup number by nodesize. Or in an even more rare case, increase the tree level, increasing qgroup number by 2 * nodesize. So here instead of using the complicated calculation for extent allocator, which cares more about accuracy and no error, qgroup doesn't need that over-estimated reservation. This patch will maintain 2 new members in btrfs_block_rsv structure for qgroup, using much smaller calculation for qgroup rsv, reducing false EDQUOT. Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com>
2018-04-18btrfs: qgroup: Commit transaction in advance to reduce early EDQUOTQu Wenruo5-2/+63
Unlike previous method that tries to commit transaction inside qgroup_reserve(), this time we will try to commit transaction using fs_info->transaction_kthread to avoid nested transaction and no need to worry about locking context. Since it's an asynchronous function call and we won't wait for transaction commit, unlike previous method, we must call it before we hit the qgroup limit. So this patch will use the ratio and size of qgroup meta_pertrans reservation as indicator to check if we should trigger a transaction commit. (meta_prealloc won't be cleaned in transaction committ, it's useless anyway) Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-18udf: Fix leak of UTF-16 surrogates into encoded stringsJan Kara1-0/+6
OSTA UDF specification does not mention whether the CS0 charset in case of two bytes per character encoding should be treated in UTF-16 or UCS-2. The sample code in the standard does not treat UTF-16 surrogates in any special way but on systems such as Windows which work in UTF-16 internally, filenames would be treated as being in UTF-16 effectively. In Linux it is more difficult to handle characters outside of Base Multilingual plane (beyond 0xffff) as NLS framework works with 2-byte characters only. Just make sure we don't leak UTF-16 surrogates into the resulting string when loading names from the filesystem for now. CC: stable@vger.kernel.org # >= v4.6 Reported-by: Mingye Wang <arthur200126@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-17fs: cifs: Adding new return type vm_fault_tSouptick Joarder1-1/+1
Use new return type vm_fault_t for page_mkwrite handler. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2018-04-17cifs: smb2ops: Fix NULL check in smb2_query_symlinkGustavo A. R. Silva1-2/+2
The current code null checks variable err_buf, which is always null when it is checked, hence utf16_path is free'd and the function returns -ENOENT everytime it is called, making it impossible for the execution path to reach the following code: err_buf = err_iov.iov_base; Fix this by null checking err_iov.iov_base instead of err_buf. Also, notice that err_buf no longer needs to be initialized to NULL. Addresses-Coverity-ID: 1467876 ("Logically dead code") Fixes: 2d636199e400 ("cifs: Change SMB2_open to return an iov for the error parameter") Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-04-16eCryptfs: don't pass up plaintext names when using filename encryptionTyler Hicks2-18/+44
Both ecryptfs_filldir() and ecryptfs_readlink_lower() use ecryptfs_decode_and_decrypt_filename() to translate lower filenames to upper filenames. The function correctly passes up lower filenames, unchanged, when filename encryption isn't in use. However, it was also passing up lower filenames when the filename wasn't encrypted or when decryption failed. Since 88ae4ab9802e, eCryptfs refuses to lookup lower plaintext names when filename encryption is enabled so this resulted in a situation where userspace would see lower plaintext filenames in calls to getdents(2) but then not be able to lookup those filenames. An example of this can be seen when enabling filename encryption on an eCryptfs mount at the root directory of an Ext4 filesystem: $ ls -1i /lower 12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE-- 11 lost+found $ ls -1i /upper ls: cannot access '/upper/lost+found': No such file or directory ? lost+found 12 test With this change, the lower lost+found dentry is ignored: $ ls -1i /lower 12 ECRYPTFS_FNEK_ENCRYPTED.FWYZD8TcW.5FV-TKTEYOHsheiHX9a-w.NURCCYIMjI8pn5BDB9-h3fXwrE-- 11 lost+found $ ls -1i /upper 12 test Additionally, some potentially noisy error/info messages in the related code paths are turned into debug messages so that the logs can't be easily filled. Fixes: 88ae4ab9802e ("ecryptfs_lookup(): try either only encrypted or plaintext name") Reported-by: Guenter Roeck <linux@roeck-us.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
2018-04-16fs: ext2: Adding new return type vm_fault_tSouptick Joarder1-2/+2
Use new return type vm_fault_t for page_mkwrite, pfn_mkwrite and fault handler. Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-16isofs: fix potential memory leak in mount option parsingChengguang Xu1-0/+3
When specifying string type mount option (e.g., iocharset) several times in a mount, current option parsing may cause memory leak. Hence, call kfree for previous one in this case. Meanwhile, check memory allocation result for it. Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-16ceph: always update atime/mtime/ctime for new inodeYan, Zheng1-3/+7
For new inode, atime/mtime/ctime are uninitialized. Don't compare against them. Cc: stable@kernel.org Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-04-16mm,vmscan: Allow preallocating memory for register_shrinker().Tetsuo Handa1-5/+4
syzbot is catching so many bugs triggered by commit 9ee332d99e4d5a97 ("sget(): handle failures of register_shrinker()"). That commit expected that calling kill_sb() from deactivate_locked_super() without successful fill_super() is safe, but the reality was different; some callers assign attributes which are needed for kill_sb() after sget() succeeds. For example, [1] is a report where sb->s_mode (which seems to be either FMODE_READ | FMODE_EXCL | FMODE_WRITE or FMODE_READ | FMODE_EXCL) is not assigned unless sget() succeeds. But it does not worth complicate sget() so that register_shrinker() failure path can safely call kill_block_super() via kill_sb(). Making alloc_super() fail if memory allocation for register_shrinker() failed is much simpler. Let's avoid calling deactivate_locked_super() from sget_userns() by preallocating memory for the shrinker and making register_shrinker() in sget_userns() never fail. [1] https://syzkaller.appspot.com/bug?id=588996a25a2587be2e3a54e8646728fb9cae44e7 Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: syzbot <syzbot+5a170e19c963a2e0df79@syzkaller.appspotmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15orangefs_kill_sb(): deal with allocation failuresAl Viro1-0/+5
orangefs_fill_sb() might've failed to allocate ORANGEFS_SB(s); don't oops in that case. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15jffs2_kill_sb(): deal with failed allocationsAl Viro1-1/+1
jffs2_fill_super() might fail to allocate jffs2_sb_info; jffs2_kill_sb() must survive that. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-04-15Merge tag 'for-4.17-part2-tag' of ↵Linus Torvalds94-1234/+277
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs updates from David Sterba: "We have queued a few more fixes (error handling, log replay, softlockup) and the rest is SPDX updates that touche almost all files so the diffstat is long" * tag 'for-4.17-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: Only check first key for committed tree blocks btrfs: add SPDX header to Kconfig btrfs: replace GPL boilerplate by SPDX -- sources btrfs: replace GPL boilerplate by SPDX -- headers Btrfs: fix loss of prealloc extents past i_size after fsync log replay Btrfs: clean up resources during umount after trans is aborted btrfs: Fix possible softlock on single core machines Btrfs: bail out on error during replay_dir_deletes Btrfs: fix NULL pointer dereference in log_dir_items
2018-04-15Merge tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds13-71/+240
Pull cifs fixes from Steve French: "SMB3 fixes, a few for stable, and some important cleanup work from Ronnie of the smb3 transport code" * tag '4.17-rc1SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: change validate_buf to validate_iov cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data() cifs: Change SMB2_open to return an iov for the error parameter cifs: add resp_buf_size to the mid_q_entry structure smb3.11: replace a 4 with server->vals->header_preamble_size cifs: replace a 4 with server->vals->header_preamble_size cifs: add pdu_size to the TCP_Server_Info structure SMB311: Improve checking of negotiate security contexts SMB3: Fix length checking of SMB3.11 negotiate request CIFS: add ONCE flag for cifs_dbg type cifs: Use ULL suffix for 64-bit constant SMB3: Log at least once if tree connect fails during reconnect cifs: smb2pdu: Fix potential NULL pointer dereference
2018-04-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds1-1/+22
Merge yet more updates from Andrew Morton: - various hotfixes - kexec_file updates and feature work * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (27 commits) kernel/kexec_file.c: move purgatories sha256 to common code kernel/kexec_file.c: allow archs to set purgatory load address kernel/kexec_file.c: remove mis-use of sh_offset field during purgatory load kernel/kexec_file.c: remove unneeded variables in kexec_purgatory_setup_sechdrs kernel/kexec_file.c: remove unneeded for-loop in kexec_purgatory_setup_sechdrs kernel/kexec_file.c: split up __kexec_load_puragory kernel/kexec_file.c: use read-only sections in arch_kexec_apply_relocations* kernel/kexec_file.c: search symbols in read-only kexec_purgatory kernel/kexec_file.c: make purgatory_info->ehdr const kernel/kexec_file.c: remove checks in kexec_purgatory_load include/linux/kexec.h: silence compile warnings kexec_file, x86: move re-factored code to generic side x86: kexec_file: clean up prepare_elf64_headers() x86: kexec_file: lift CRASH_MAX_RANGES limit on crash_mem buffer x86: kexec_file: remove X86_64 dependency from prepare_elf64_headers() x86: kexec_file: purge system-ram walking from prepare_elf64_headers() kexec_file,x86,powerpc: factor out kexec_file_ops functions kexec_file: make use of purgatory optional proc: revalidate misc dentries mm, slab: reschedule cache_reap() on the same CPU ...
2018-04-13proc: revalidate misc dentriesAlexey Dobriyan1-1/+22
If module removes proc directory while another process pins it by chdir'ing to it, then subsequent recreation of proc entry and all entries down the tree will not be visible to any process until pinning process unchdir from directory and unpins everything. Steps to reproduce: proc_mkdir("aaa", NULL); proc_create("aaa/bbb", ...); chdir("/proc/aaa"); remove_proc_entry("aaa/bbb", NULL); remove_proc_entry("aaa", NULL); proc_mkdir("aaa", NULL); # inaccessible because "aaa" dentry still points # to the original "aaa". proc_create("aaa/bbb", ...); Fix is to implement ->d_revalidate and ->d_delete. Link: http://lkml.kernel.org/r/20180312201938.GA4871@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-13Merge branch 'overlayfs-linus' of ↵Linus Torvalds11-166/+477
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs updates from Miklos Szeredi: "In addition to bug fixes and cleanups there are two new features from Amir: - Consistent inode number support for the case when layers are not all on the same filesystem (feature is dubbed "xino"). - Optimize overlayfs file handle decoding. This one touches the exportfs interface to allow detecting the disconnected directory case" * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: update documentation w.r.t "xino" feature ovl: add support for "xino" mount and config options ovl: consistent d_ino for non-samefs with xino ovl: consistent i_ino for non-samefs with xino ovl: constant st_ino for non-samefs with xino ovl: allocate anon bdev per unique lower fs ovl: factor out ovl_map_dev_ino() helper ovl: cleanup ovl_update_time() ovl: add WARN_ON() for non-dir redirect cases ovl: cleanup setting OVL_INDEX ovl: set d->is_dir and d->opaque for last path element ovl: Do not check for redirect if this is last layer ovl: lookup in inode cache first when decoding lower file handle ovl: do not try to reconnect a disconnected origin dentry ovl: disambiguate ovl_encode_fh() ovl: set lower layer st_dev only if setting lower st_ino ovl: fix lookup with middle layer opaque dir and absolute path redirects ovl: Set d->last properly during lookup ovl: set i_ino to the value of st_ino for NFS export
2018-04-13btrfs: Only check first key for committed tree blocksQu Wenruo1-0/+8
When looping btrfs/074 with many cpus (>= 8), it's possible to trigger kernel warning due to first key verification: [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210 [ 4239.523830] Modules linked in: [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210 [ 4239.527101] Call Trace: [ 4239.527251] read_tree_block+0x42/0x70 [ 4239.527434] read_node_slot+0xd2/0x110 [ 4239.527632] push_leaf_right+0xad/0x1b0 [ 4239.527809] split_leaf+0x4ea/0x700 [ 4239.527988] ? leaf_space_used+0xbc/0xe0 [ 4239.528192] ? btrfs_set_lock_blocking_rw+0x99/0xb0 [ 4239.528416] btrfs_search_slot+0x8cc/0xa40 [ 4239.528605] btrfs_insert_empty_items+0x71/0xc0 [ 4239.528798] __btrfs_run_delayed_refs+0xa98/0x1680 [ 4239.529013] btrfs_run_delayed_refs+0x10b/0x1b0 [ 4239.529205] btrfs_commit_transaction+0x33/0xaf0 [ 4239.529445] ? start_transaction+0xa8/0x4f0 [ 4239.529630] btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0 [ 4239.529833] btrfs_check_data_free_space+0x54/0xa0 [ 4239.530045] btrfs_delalloc_reserve_space+0x25/0x70 [ 4239.531907] btrfs_direct_IO+0x233/0x3d0 [ 4239.532098] generic_file_direct_write+0xcb/0x170 [ 4239.532296] btrfs_file_write_iter+0x2bb/0x5f4 [ 4239.532491] aio_write+0xe2/0x180 [ 4239.532669] ? lock_acquire+0xac/0x1e0 [ 4239.532839] ? __might_fault+0x3e/0x90 [ 4239.533032] do_io_submit+0x594/0x860 [ 4239.533223] ? do_io_submit+0x594/0x860 [ 4239.533398] SyS_io_submit+0x10/0x20 [ 4239.533560] ? SyS_io_submit+0x10/0x20 [ 4239.533729] do_syscall_64+0x75/0x1d0 [ 4239.533979] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 4239.534182] RIP: 0033:0x7f8519741697 The problem here is, at btree_read_extent_buffer_pages() we don't have acquired read/write lock on that extent buffer, only basic info like level/bytenr is reliable. So race condition leads to such false alert. However in current call site, it's impossible to acquire proper lock without race window. To fix the problem, we only verify first key for committed tree blocks (whose generation is no larger than fs_info->last_trans_committed), so the content of such tree blocks will not change and there is no need to get read/write lock. Reported-by: Nikolay Borisov <nborisov@suse.com> Fixes: 581c1760415c ("btrfs: Validate child tree block's level and first key") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-13fsnotify: fix ignore mask logic in send_to_group()Amir Goldstein1-14/+11
The ignore mask logic in send_to_group() does not match the logic in fanotify_should_send_event(). In the latter, a vfsmount mark ignore mask precedes an inode mark mask and in the former, it does not. That difference may cause events to be sent to fanotify backend for no reason. Fix the logic in send_to_group() to match that of fanotify_should_send_event(). Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2018-04-12cifs: change validate_buf to validate_iovRonnie Sahlberg1-18/+21
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-04-12cifs: remove rfc1002 hardcoded constants from cifs_discard_remaining_data()Ronnie Sahlberg1-2/+3
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-04-12cifs: Change SMB2_open to return an iov for the error parameterRonnie Sahlberg3-9/+13
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>