Age | Commit message (Collapse) | Author | Files | Lines |
|
This patch fixes the following deadlock bug during the recovery.
INFO: task mount:1322 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount D ffffffff81125870 0 1322 1266 0x00000000
ffff8801207e39d8 0000000000000046 ffff88012ab1dee0 0000000000000046
ffff8801207e3a08 ffff880115903f40 ffff8801207e3fd8 ffff8801207e3fd8
ffff8801207e3fd8 ffff880115903f40 ffff8801207e39d8 ffff88012fc94520
Call Trace:
[<ffffffff81125870>] ? __lock_page+0x70/0x70
[<ffffffff816a92d9>] schedule+0x29/0x70
[<ffffffff816a93af>] io_schedule+0x8f/0xd0
[<ffffffff8112587e>] sleep_on_page+0xe/0x20
[<ffffffff816a649a>] __wait_on_bit_lock+0x5a/0xc0
[<ffffffff81125867>] __lock_page+0x67/0x70
[<ffffffff8106c7b0>] ? autoremove_wake_function+0x40/0x40
[<ffffffff81126857>] find_lock_page+0x67/0x80
[<ffffffff8112698f>] find_or_create_page+0x3f/0xb0
[<ffffffffa03901a8>] ? sync_inode_page+0xa8/0xd0 [f2fs]
[<ffffffffa038fdf7>] get_node_page+0x67/0x180 [f2fs]
[<ffffffffa039818b>] recover_fsync_data+0xacb/0xff0 [f2fs]
[<ffffffff816aaa1e>] ? _raw_spin_unlock+0x3e/0x40
[<ffffffffa0389634>] f2fs_fill_super+0x7d4/0x850 [f2fs]
[<ffffffff81184cf9>] mount_bdev+0x1c9/0x210
[<ffffffffa0388e60>] ? validate_superblock+0x180/0x180 [f2fs]
[<ffffffffa0387635>] f2fs_mount+0x15/0x20 [f2fs]
[<ffffffff81185a13>] mount_fs+0x43/0x1b0
[<ffffffff81145ba0>] ? __alloc_percpu+0x10/0x20
[<ffffffff811a0796>] vfs_kern_mount+0x76/0x120
[<ffffffff811a2cb7>] do_mount+0x237/0xa10
[<ffffffff81140b9b>] ? strndup_user+0x5b/0x80
[<ffffffff811a3520>] SyS_mount+0x90/0xe0
[<ffffffff816b3502>] system_call_fastpath+0x16/0x1b
The bug is triggered when check_index_in_prev_nodes tries to get the direct
node page by calling get_node_page.
At this point, if the direct node page is already locked by get_dnode_of_data,
its caller, we got a deadlock condition.
This patch adds additional condition check for the reuse of locked direct node
pages prior to the get_node_page call.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
While an orphan inode has zero link_count, f2fs_gc is able to select the inode
for foreground gc.
- f2fs_gc
- do_garbage_collect
- gc_data_segment
: f2fs_iget is failed
: get_valid_blocks() != 0, so that retry
--> here we got the infinite loop.
This patch resolved this issue.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Introduce a simple macro function for readability.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch tries to avoid the following deadlock condition of which the reclaim
path can trigger f2fs_balance_fs again.
=================================
[ INFO: inconsistent lock state ]
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/41 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&sbi->gc_mutex){+.+.?.}, at: f2fs_balance_fs+0xe6/0x100 [f2fs]
{RECLAIM_FS-ON-W} state was registered at:
[<ffffffff810aa5a9>] mark_held_locks+0xb9/0x140
[<ffffffff810aae85>] lockdep_trace_alloc+0x85/0xf0
[<ffffffff8113ab2c>] __alloc_pages_nodemask+0x7c/0x9b0
[<ffffffff81175aa8>] alloc_pages_current+0xb8/0x180
[<ffffffff811319cf>] __page_cache_alloc+0xaf/0xd0
[<ffffffff8113225c>] find_or_create_page+0x4c/0xb0
[<ffffffffa021359e>] find_data_page+0x14e/0x210 [f2fs]
[<ffffffffa021161b>] f2fs_gc+0x9eb/0xd90 [f2fs]
[<ffffffffa0218fae>] f2fs_balance_fs+0xee/0x100 [f2fs]
[<ffffffffa020848c>] f2fs_setattr+0x6c/0x200 [f2fs]
[<ffffffff811ae51b>] notify_change+0x1db/0x3a0
[<ffffffff8118fbd0>] do_truncate+0x60/0xa0
[<ffffffff8118fd95>] vfs_truncate+0x185/0x1b0
[<ffffffff8118fe1c>] do_sys_truncate+0x5c/0xa0
[<ffffffff8118ffee>] SyS_truncate+0xe/0x10
[<ffffffff816e2b42>] system_call_fastpath+0x16/0x1b
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If we met an error during the dentry recovery, we should not conduct checkpoint.
Otherwise, some errorneous dentry blocks overwrites the existing blocks that
contain the remaining recovery information.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If we got an error after lock_page, we should unlock it before exit.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
The allocated page used by the recovery is not on HIGHMEM, so that we don't
need to use kmap/kunmap.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Few things can be changed in the default mkwrite function
1) Make file_update_time at the start before acquiring any lock
2) the condition page_offset(page) >= i_size_read(inode) should be
changed to page_offset(page) > i_size_read
3) Move wait_on_page_writeback.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Amit Sahrawat <a.sahrawat@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
We can do this, since now we use a global mutex, f2fs_stat_mutex to protect its
list operations.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
[Jaegeuk Kim: add description]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Code cleanup without behavior changed.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Majianpeng reported a lockdep splat for f2fs. It turns out mutex_lock_all()
acquires an array of locks (in global/local lock style).
Any such operation is always serialized using cp_mutex, therefore there is no
fs_lock[] lock-order issue; tell lockdep about this using the
mutex_lock_nest_lock() primitive.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch adds some trivial debugging messages in the recovery process.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
I found a bug when testing power-off-recovery as follows.
[Bug Scenario]
1. create a file
2. fsync the file
3. reboot w/o any sync
4. try to recover the file
- found its fsync mark
- found its dentry mark
: try to recover its dentry
- get its file name
- get its parent inode number
: here we got zero value
The reason why we get the wrong parent inode number is that we didn't
synchronize the inode page with its newly created inode information perfectly.
Especially, previous f2fs stores fi->i_pino and writes it to the cached
node page in a wrong order, which incurs the zero-valued i_pino during the
recovery.
So, this patch modifies the creation flow to fix the synchronization order of
inode page with its inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This patch is for passing a locked node page to get_dnode_of_data.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
If get_dnode_of_data gets a locked node page, let's skip redundant
get_node_page calls.
This is for the futher enhancement.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
This por_doing check is totally not related to the recovery process.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
During the dentry recovery routine, recover_inode() triggers __f2fs_add_link
with its directory inode.
In the following scenario, a bug is captured.
1. dir = f2fs_iget(pino)
2. __f2fs_add_link(dir, name)
3. iput(dir)
-> f2fs_evict_inode() faces with BUG_ON(atomic_read(fi->dirty_dents))
Kernel BUG at ffffffffa01c0676 [verbose debug info unavailable]
[<ffffffffa01c0676>] f2fs_evict_inode+0x276/0x300 [f2fs]
Call Trace:
[<ffffffff8118ea00>] evict+0xb0/0x1b0
[<ffffffff8118f1c5>] iput+0x105/0x190
[<ffffffffa01d2dac>] recover_fsync_data+0x3bc/0x1070 [f2fs]
[<ffffffff81692e8a>] ? io_schedule+0xaa/0xd0
[<ffffffff81690acb>] ? __wait_on_bit_lock+0x7b/0xc0
[<ffffffff8111a0e7>] ? __lock_page+0x67/0x70
[<ffffffff81165e21>] ? kmem_cache_alloc+0x31/0x140
[<ffffffff8118a502>] ? __d_instantiate+0x92/0xf0
[<ffffffff812a949b>] ? security_d_instantiate+0x1b/0x30
[<ffffffff8118a5b4>] ? d_instantiate+0x54/0x70
This means that we should flush all the dentry pages between iget and iput().
But, during the recovery routine, it is unallowed due to consistency, so we
have to wait the whole recovery process.
And then, write_checkpoint flushes all the dirty dentry blocks, and nicely we
can put the stale dir inodes from the dirty_dir_inode_list.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
The reason of using sbi->por_doing is to alleviate data writes during the
recovery.
The find_fsync_dnodes() produces some dirty dentry pages, so we should
cover it too with sbi->por_doing.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
We don't need to assign a value redundantly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
In get_lock_data_page, if there is a data race between get_dnode_of_data for
node and grab_cache_page for data, f2fs is able to face with the following
BUG_ON(dn.data_blkaddr == NEW_ADDR).
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/data.c:251!
[<ffffffffa044966c>] get_lock_data_page+0x1ec/0x210 [f2fs]
Call Trace:
[<ffffffffa043b089>] f2fs_readdir+0x89/0x210 [f2fs]
[<ffffffff811a0920>] ? fillonedir+0x100/0x100
[<ffffffff811a0920>] ? fillonedir+0x100/0x100
[<ffffffff811a07f8>] vfs_readdir+0xb8/0xe0
[<ffffffff811a0b4f>] sys_getdents+0x8f/0x110
[<ffffffff816d7999>] system_call_fastpath+0x16/0x1b
This bug is able to be occurred when the block address of the data block is
changed after f2fs_put_dnode().
In order to avoid that, this patch fixes the lock order of node and data
blocks in which the node block lock is covered by the data block lock.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
Currently f2fs recovers the dentry of fsynced files.
When power-off-recovery is conducted, this newly recovered inode should increase
node block count as well as inode block count.
This patch resolves this inconsistency that results in:
1. create a file
2. write data
3. fsync
4. reboot without sync
5. mount and recover the file
6. node block count is 1 and inode block count is 2
: fall into the inconsistent state
7. unlink the file
: trigger the following BUG_ON
------------[ cut here ]------------
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/f2fs.h:716!
Call Trace:
[<ffffffffa0344100>] ? get_node_page+0x50/0x1a0 [f2fs]
[<ffffffffa0344bfc>] remove_inode_page+0x8c/0x100 [f2fs]
[<ffffffffa03380f0>] ? f2fs_evict_inode+0x180/0x2d0 [f2fs]
[<ffffffffa033812e>] f2fs_evict_inode+0x1be/0x2d0 [f2fs]
[<ffffffff811c7a67>] evict+0xa7/0x1a0
[<ffffffff811c82b5>] iput+0x105/0x190
[<ffffffff811c2b30>] d_kill+0xe0/0x120
[<ffffffff811c2c57>] dput+0xe7/0x1e0
[<ffffffff811acc3d>] __fput+0x19d/0x2d0
[<ffffffff811acd7e>] ____fput+0xe/0x10
[<ffffffff81070645>] task_work_run+0xb5/0xe0
[<ffffffff81002941>] do_notify_resume+0x71/0xb0
[<ffffffff8175f14a>] int_signal+0x12/0x17
Reported-and-Tested-by: Chris Fries <C.Fries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
|
|
When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratible for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these these types of failures make CMA significantly
less reliable, especially under high filesystem usage.
Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indicates to
prevent dropping.
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
|
|
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
|
|
commit 1362f4ea20fa63688ba6026e586d9746ff13a846 upstream.
Currently last dqput() can race with dquot_scan_active() causing it to
call callback for an already deactivated dquot. The race is as follows:
CPU1 CPU2
dqput()
spin_lock(&dq_list_lock);
if (atomic_read(&dquot->dq_count) > 1) {
- not taken
if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
spin_unlock(&dq_list_lock);
->release_dquot(dquot);
if (atomic_read(&dquot->dq_count) > 1)
- not taken
dquot_scan_active()
spin_lock(&dq_list_lock);
if (!test_bit(DQ_ACTIVE_B, &dquot->dq_flags))
- not taken
atomic_inc(&dquot->dq_count);
spin_unlock(&dq_list_lock);
- proceeds to release dquot
ret = fn(dquot, priv);
- called for inactive dquot
Fix the problem by making sure possible ->release_dquot() is finished by
the time we call the callback and new calls to it will notice reference
dquot_scan_active() has taken and bail out.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dff6efc326a4d5f305797d4a6bba14f374fdd633 upstream.
Currently notify_change directly updates i_version for size updates,
which not only is counter to how all other fields are updated through
struct iattr, but also breaks XFS, which need inode updates to happen
under its own lock, and synchronized to the structure that gets written
to the log.
Remove the update in the common code, and it to btrfs and ext4,
XFS already does a proper updaste internally and currently gets a
double update with the existing code.
IMHO this is 3.13 and -stable material and should go in through the XFS
tree.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2365c4eaf077c48574ab6f143960048fc0f31518 upstream.
SMB3 servers can respond with MaxTransactSize of more than 4M
that can cause a memory allocation error returned from kmalloc
in a lock codepath. Also the client doesn't support multicredit
requests now and allows buffer sizes of 65536 bytes only. Set
MaxTransactSize to this maximum supported value.
Signed-off-by: Pavel Shilovsky <piastry@etersoft.ru>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5d81de8e8667da7135d3a32a964087c0faf5483f upstream.
It's possible for userland to pass down an iovec via writev() that has a
bogus user pointer in it. If that happens and we're doing an uncached
write, then we can end up getting less bytes than we expect from the
call to iov_iter_copy_from_user. This is CVE-2014-0069
cifs_iovec_write isn't set up to handle that situation however. It'll
blindly keep chugging through the page array and not filling those pages
with anything useful. Worse yet, we'll later end up with a negative
number in wdata->tailsz, which will confuse the sending routines and
cause an oops at the very least.
Fix this by having the copy phase of cifs_iovec_write stop copying data
in this situation and send the last write as a short one. At the same
time, we want to avoid sending a zero-length write to the server, so
break out of the loop and set rc to -EFAULT if that happens. This also
allows us to handle the case where no address in the iovec is valid.
[Note: Marking this for stable on v3.4+ kernels, but kernels as old as
v2.6.38 may have a similar problem and may need similar fix]
Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 19ea80603715d473600cd993b9987bc97d042e02 upstream.
If the i_crtime field is not present in the inode, don't leave the
field uninitialized.
Fixes: ef7f38359 ("ext4: Add nanosecond timestamps")
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Tested-by: Vegard Nossum <vegard.nossum@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3d2660d0c9c2f296837078c189b68a47f6b2e3b5 upstream.
The set_flexbg_block_bitmap() function assumed that the number of
blocks in a blockgroup was sb->blocksize * 8, which is normally true,
but not always! Use EXT4_BLOCKS_PER_GROUP(sb) instead, to fix block
bitmap corruption after:
mke2fs -t ext4 -g 3072 -i 4096 /dev/vdd 1G
mount -t ext4 /dev/vdd /vdd
resize2fs /dev/vdd 8G
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reported-by: Jon Bernard <jbernard@tuxion.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit b93c95353413041a8cebad915a8109619f66bcc6 upstream.
If a file system has a large number of inodes per block group, all of
the metadata blocks in a flex_bg may be larger than what can fit in a
single block group. Unfortunately, ext4_alloc_group_tables() in
resize.c was never tested to see if it would handle this case
correctly, and there were a large number of bugs which caused the
following sequence to result in a BUG_ON:
kernel bug at fs/ext4/resize.c:409!
...
call trace:
[<ffffffff81256768>] ext4_flex_group_add+0x1448/0x1830
[<ffffffff81257de2>] ext4_resize_fs+0x7b2/0xe80
[<ffffffff8123ac50>] ext4_ioctl+0xbf0/0xf00
[<ffffffff811c111d>] do_vfs_ioctl+0x2dd/0x4b0
[<ffffffff811b9df2>] ? final_putname+0x22/0x50
[<ffffffff811c1371>] sys_ioctl+0x81/0xa0
[<ffffffff81676aa9>] system_call_fastpath+0x16/0x1b
code: c8 4c 89 df e8 41 96 f8 ff 44 89 e8 49 01 c4 44 29 6d d4 0
rip [<ffffffff81254fa1>] set_flexbg_block_bitmap+0x171/0x180
This can be reproduced with the following command sequence:
mke2fs -t ext4 -i 4096 /dev/vdd 1G
mount -t ext4 /dev/vdd /vdd
resize2fs /dev/vdd 8G
To fix this, we need to make sure the right thing happens when a block
group's inode table straddles two block groups, which means the
following bugs had to be fixed:
1) Not clearing the BLOCK_UNINIT flag in the second block group in
ext4_alloc_group_tables --- the was proximate cause of the BUG_ON.
2) Incorrectly determining how many block groups contained contiguous
free blocks in ext4_alloc_group_tables().
3) Incorrectly setting the start of the next block range to be marked
in use after a discontinuity in setup_new_flex_group_blocks().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 23301410972330c0ae9a8afc379ba2005e249cc6 upstream.
If an ext4 file system is created by some tool other than mke2fs
(perhaps by someone who has a pathalogical fear of the GPL) that
doesn't set one or the other of the EXT2_FLAGS_{UN}SIGNED_HASH flags,
and that file system is then mounted read-only, don't try to modify
the s_flags field. Otherwise, if dm_verity is in use, the superblock
will change, causing an dm_verity failure.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 30d29b119ef01776e0a301444ab24defe8d8bef3 upstream.
In swap_inode_boot_loader() we forgot to release ->i_mutex and resume
unlocked dio for inode and inode_bl if there is an error starting the
journal handle. This commit fixes this issue.
Reported-by: Ahmed Tamrawi <ahmedtamrawi@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dr. Tilmann Bubeck <t.bubeck@reinform.de>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 087787959ce851d7bbb19f10f6e9241b7f85a3ca upstream.
Commit 9f060e2231ca changed the way we handle allocations for the
integrity vectors. When the vectors are inline there is no associated
slab and consequently bvec_nr_vecs() returns 0. Ensure that we check
against BIP_INLINE_VECS in that case.
Reported-by: David Milburn <dmilburn@redhat.com>
Tested-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2ec197db1a56c9269d75e965f14c344b58b2a4f6 upstream.
If an NFS client attempts to get a lock (using NLM) and the lock is
not available, the server will remember the request and when the lock
becomes available it will send a GRANT request to the client to
provide the lock.
If the client already held an adjacent lock, the GRANT callback will
report the union of the existing and new locks, which can confuse the
client.
This happens because __posix_lock_file (called by vfs_lock_file)
updates the passed-in file_lock structure when adjacent or
over-lapping locks are found.
To avoid this problem we take a copy of the two fields that can
be changed (fl_start and fl_end) before the call and restore them
afterwards.
An alternate would be to allocate a 'struct file_lock', initialise it,
use locks_copy_lock() to take a copy, then locks_release_private()
after the vfs_lock_file() call. But that is a lot more work.
Reported-by: Olaf Kirch <okir@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
--
v1 had a couple of issues (large on-stack struct and didn't really work properly).
This version is much better tested.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|
commit 83e3bc23ef9ce7c03b7b4e5d3d790246ea59db3e upstream.
The get/set ACL xattr support for CIFS ACLs attempts to send old
cifs dialect protocol requests even when mounted with SMB2 or later
dialects. Sending cifs requests on an smb2 session causes problems -
the server drops the session due to the illegal request.
This patch makes CIFS ACL operations protocol specific to fix that.
Attempting to query/set CIFS ACLs for SMB2 will now return
EOPNOTSUPP (until we add worker routines for sending query
ACL requests via SMB2) instead of sending invalid (cifs)
requests.
A separate followon patch will be needed to fix cifs_acl_to_fattr
(which takes a cifs specific u16 fid so can't be abstracted
to work with SMB2 until that is changed) and will be needed
to fix mount problems when "cifsacl" is specified on mount
with e.g. vers=2.1
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d979f3b0a1f0b5499ab85e68cdf02b56852918b6 upstream.
Changeset 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 added protocol
operations for get/setxattr to avoid calling cifs operations
on smb2/smb3 mounts for xattr operations and this changeset
adds the calls to cifs specific protocol operations for xattrs
(in order to reenable cifs support for xattrs which was
temporarily disabled by the previous changeset. We do not
have SMB2/SMB3 worker function for setting xattrs yet so
this only enables it for cifs.
CCing stable since without these two small changsets (its
small coreq 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 is
also needed) calling getfattr/setfattr on smb2/smb3 mounts
causes problems.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Shirish Pargaonkar <spargaonkar@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 666753c3ef8fc88b0ddd5be4865d0aa66428ac35 upstream.
When mounting with smb2 (or smb2.1 or smb3) we need to check to make
sure that attempts to query or set extended attributes do not
attempt to send the request with the older cifs protocol instead
(eventually we also need to add the support in SMB2
to query/set extended attributes but this patch prevents us from
using the wrong protocol for extended attribute operations).
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 upstream.
Recently due to a spike in connections per second memcached on 3
separate boxes triggered the OOM killer from accept. At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1. The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.
I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.
There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available. Using the pigeon
hole principle it is easy to show that it requires 1 page more than 50%
of the pages being free to guarantee an order 1 (8KiB) allocation will
succeed, 1 page more than 75% of the pages being free to guarantee an
order 2 (16KiB) allocation will succeed and 1 page more than 87.5% of
the pages being free to guarantee an order 3 allocate will succeed.
A server churning memory with a lot of small requests and replies like
memcached is a common case that if anything can will skew the odds
against large pages being available.
Therefore let's not give external applications a practical way to kill
linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem. Unless I am misreading the code and by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes signification). We have already tried everything
reasonable to allocate a page and the only thing left to do is wait. So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Cong Wang <cwang@twopensource.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 227d53b397a32a7614667b3ecaf1d89902fb6c12 upstream.
To use spin_{un}lock_irq is dangerous if caller disabled interrupt.
During aio buffer migration, we have a possibility to see the following
call stack.
aio_migratepage [disable interrupt]
migrate_page_copy
clear_page_dirty_for_io
set_page_dirty
__set_page_dirty_buffers
__set_page_dirty
spin_lock_irq
This mean, current aio migration is a deadlockable. spin_lock_irqsave
is a safer alternative and we should use it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Rientjes rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8101c8dbf6243ba517aab58d69bf1bc37d8b7b9c upstream.
It's just broken and it's taking a lot of effort to fix it, so for now just
disable it so people can defrag in peace. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ed7e5423014ad89720fcf315c0b73f2c5d0c7bd2 upstream.
An NFS4ERR_RECALLCONFLICT is returned by server from a GET_LAYOUT
only when a Server Sent a RECALL do to that GET_LAYOUT, or
the RECALL and GET_LAYOUT crossed on the wire.
In any way this means we want to wait at most until in-flight IO
is finished and the RECALL can be satisfied.
So a proper wait here is more like 1/10 of a second, not 15 seconds
like we have now. In case of a server bug we delay exponentially
longer on each retry.
Current code totally craps out performance of very large files on
most pnfs-objects layouts, because of how the map changes when the
file has grown into the next raid group.
[Stable: This will patch back to 3.9. If there are earlier still
maintained trees, please tell me I'll send a patch]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit abad2fa5ba67725a3f9c376c8cfe76fbe94a3041 upstream.
If clp is new (cl_count = 1) and it matches another client in
nfs4_discover_server_trunking, the nfs_put_client will free clp before
->cl_preserve_clid is set.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 64590daa9e0dfb3aad89e3ab9230683b76211d5b upstream.
Both nfs41_walk_client_list and nfs40_walk_client_list expect the
'status' variable to be set to the value -NFS4ERR_STALE_CLIENTID
if the loop fails to find a match.
The problem is that the 'pos->cl_cons_state > NFS_CS_READY' changes
the value of 'status', and sets it either to the value '0' (which
indicates success), or to the value EINTR.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 78b19bae0813bd6f921ca58490196abd101297bd upstream.
Don't check for -NFS4ERR_NOTSUPP, it's already been mapped to -ENOTSUPP
by nfs4_stat_to_errno.
This allows the client to mount v4.1 servers that don't support
SECINFO_NO_NAME by falling back to the "guess and check" method of
nfs4_find_root_sec.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c7848f69ec4a8c03732cde5c949bd2aa711a9f4b upstream.
decode_op_hdr() cannot distinguish between an XDR decoding error and
the perfectly valid errorcode NFS4ERR_IO. This is normally not a
problem, but for the particular case of OPEN, we need to be able
to increment the NFSv4 open sequence id when the server returns
a valid response.
Reported-by: J Bruce Fields <bfields@fieldses.org>
Link: http://lkml.kernel.org/r/20131204210356.GA19452@fieldses.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit aad560b7f63b495f48a7232fd086c5913a676e6f upstream.
At IO preparation we calculate the max pages at each device and
allocate a BIO per device of that size. The calculation was wrong
on some unaligned corner cases offset/length combination and would
make prepare return with -ENOMEM. This would be bad for pnfs-objects
that would in that case IO through MDS. And fatal for exofs were it
would fail writes with EIO.
Fix it by doing the proper math, that will work in all cases. (I
ran a test with all possible offset/length combinations this time
round).
Also when reading we do not need to allocate for the parity units
since we jump over them.
Also lower the max_io_length to take into account the parity pages
so not to allocate BIOs bigger than PAGE_SIZE
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d8d14bd09cddbaf0168d61af638455a26bd027ff upstream.
Commit d5dc77bfeeab ("consolidate compat lookup_dcookie()") coverted all
architectures to the new compat_sys_lookup_dcookie() syscall.
The "len" paramater of the new compat syscall must have the type
compat_size_t in order to enforce zero extension for architectures where
the ABI requires that the caller of a function performed zero and/or
sign extension to 64 bit of all parameters.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit dfd948e32af2e7b28bcd7a490c0a30d4b8df2a36 upstream.
We got a report that the pwritev syscall does not work correctly in
compat mode on s390.
It turned out that with commit 72ec35163f9f ("switch compat readv/writev
variants to COMPAT_SYSCALL_DEFINE") we lost the zero extension of a
couple of syscall parameters because the some parameter types haven't
been converted from unsigned long to compat_ulong_t.
This is needed for architectures where the ABI requires that the caller
of a function performed zero and/or sign extension to 64 bit of all
parameters.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 592f6b842f64e416c7598a1b97c649b34241e22d upstream.
Commit 91c2e0bcae72 ("unify compat fanotify_mark(2), switch to
COMPAT_SYSCALL_DEFINE") added a new unified compat fanotify_mark syscall
to be used by all architectures.
Unfortunately the unified version merges the split mask parameter in a
wrong way: the lower and higher word got swapped.
This was discovered with glibc's tst-fanotify test case.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Andreas Krebbel <krebbel@linux.vnet.ibm.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 28a625cbc2a14f17b83e47ef907b2658576a32aa upstream.
Having this struct in module memory could Oops when if the module is
unloaded while the buffer still persists in a pipe.
Since sock_pipe_buf_ops is essentially the same as fuse_dev_pipe_buf_steal
merge them into nosteal_pipe_buf_ops (this is the same as
default_pipe_buf_ops except stealing the page from the buffer is not
allowed).
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|