summaryrefslogtreecommitdiff
path: root/fs/direct-io.c
AgeCommit message (Collapse)AuthorFilesLines
2013-03-04fs: Fix possible use-after-free with AIOJan Kara1-1/+1
commit 54c807e71d5ac59dee56c685f2b66e27cd54c475 upstream. Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. Acked-by: Jeff Moyer <jmoyer@redhat.com> CC: Christoph Hellwig <hch@infradead.org> CC: Jens Axboe <axboe@kernel.dk> CC: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-23Restore direct_io / truncate locking APIAnton Altaparmakov1-2/+2
With kernel 3.1, Christoph removed i_alloc_sem and replaced it with calls (namely inode_dio_wait() and inode_dio_done()) which are EXPORT_SYMBOL_GPL() thus they cannot be used by non-GPL file systems and further inode_dio_wait() was pushed from notify_change() into the file system ->setattr() method but no non-GPL file system can make this call. That means non-GPL file systems cannot exist any more unless they do not use any VFS functionality related to reading/writing as far as I can tell or at least as long as they want to implement direct i/o. Both Linus and Al (and others) have said on LKML that this breakage of the VFS API should not have happened and that the change was simply missed as it was not documented in the change logs of the patches that did those changes. This patch changes the two function exports in question to be EXPORT_SYMBOL() thus restoring the VFS API as it used to be - accessible for all modules. Christoph, who introduced the two functions and exported them GPL-only is CC-ed on this patch to give him the opportunity to object to the symbols being changed in this manner if he did indeed intend them to be GPL-only and does not want them to become available to all modules. Signed-off-by: Anton Altaparmakov <anton@tuxera.com> CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12dio: optimize cache misses in the submission pathAndi Kleen1-9/+37
Some investigation of a transaction processing workload showed that a major consumer of cycles in __blockdev_direct_IO is the cache miss while accessing the block size. This is because it has to walk the chain from block_dev to gendisk to queue. The block size is needed early on to check alignment and sizes. It's only done if the check for the inode block size fails. But the costly block device state is unconditionally fetched. - Reorganize the code to only fetch block dev state when actually needed. Then do a prefetch on the block dev early on in the direct IO path. This is worth it, because there is substantial code run before we actually touch the block dev now. - I also added some unlikelies to make it clear the compiler that block device fetch code is not normally executed. This gave a small, but measurable improvement on a large database benchmark (about 0.3%) [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: using prefetch requires including prefetch.h] Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-12fs/direct-io.c: calculate fs_count correctly in get_more_blocks()Tao Ma1-7/+4
In get_more_blocks(), we use dio_count to calcuate fs_count and do some tricky things to increase fs_count if dio_count isn't aligned. But actually it still has some corner cases that can't be coverd. See the following example: dio_write foo -s 1024 -w 4096 (direct write 4096 bytes at offset 1024). The same goes if the offset isn't aligned to fs_blocksize. In this case, the old calculation counts fs_count to be 1, but actually we will write into 2 different blocks (if fs_blocksize=4096). The old code just works, since it will call get_block twice (and may have to allocate and create extents twice for filesystems like ext4). So we'd better call get_block just once with the proper fs_count. Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-28direct-io: merge direct_io_walker into __blockdev_direct_IOAndi Kleen1-139/+132
This doesn't change anything for the compiler, but hch thought it would make the code clearer. I moved the reference counting into its own little inline. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28direct-io: inline the complete submission pathAndi Kleen1-15/+21
Add inlines to all the submission path functions. While this increases code size it also gives gcc a lot of optimization opportunities in this critical hotpath. In particular -- together with some other changes -- this allows gcc to get rid of the unnecessary clearing of sdio at the beginning and optimize the messy parameter passing. Any non inlining of a function which takes a sdio parameter would break this optimization because they cannot be done if the address of a structure is taken. Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING and CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off. This gives about 2.2% improvement on a large database benchmark with a high IOPS rate. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28direct-io: separate map_bh from dioAndi Kleen1-29/+37
Only a single b_private field in the map_bh buffer head is needed after the submission path. Move map_bh separately to avoid storing this information in the long term slab. This avoids the weird 104 byte hole in struct dio_submit which also needed to be memseted early. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28direct-io: use a slab cache for struct dioAndi Kleen1-5/+14
A direct slab call is slightly faster than kmalloc and can be better cached per CPU. It also avoids rounding to the next kmalloc slab. In addition this enforces cache line alignment for struct dio to avoid any false sharing. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28direct-io: rearrange fields in dio/dio_submit to avoid holesAndi Kleen1-7/+6
Fix most problems reported by pahole. There is still a weird 104 byte hole after map_bh. I'm not sure what causes this. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28direct-io: fix a wrong commentAndi Kleen1-1/+1
There's nothing on the stack, even before my changes. Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-10-28direct-io: separate fields only used in the submission path from struct dioAndi Kleen1-188/+201
This large, but largely mechanic, patch moves all fields in struct dio that are only used in the submission path into a separate on stack data structure. This has the advantage that the memory is very likely cache hot, which is not guaranteed for memory fresh out of kmalloc. This also gives gcc more optimization potential because it can easier determine that there are no external aliases for these variables. The sdio initialization is a initialization now instead of memset. This allows gcc to break sdio into individual fields and optimize away unnecessary zeroing (after all the functions are inlined) Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2011-07-26atomic: use <linux/atomic.h>Arun Sharma1-1/+1
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-20fs: move inode_dio_done to the end_io handlerChristoph Hellwig1-3/+4
For filesystems that delay their end_io processing we should keep our i_dio_count until the the processing is done. Enable this by moving the inode_dio_done call to the end_io handler if one exist. Note that the actual move to the workqueue for ext4 and XFS is not done in this patch yet, but left to the filesystem maintainers. At least for XFS it's not needed yet either as XFS has an internal equivalent to i_dio_count. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fs: always maintain i_dio_countChristoph Hellwig1-12/+13
Maintain i_dio_count for all filesystems, not just those using DIO_LOCKING. This these filesystems to also protect truncate against direct I/O requests by using common code. Right now the only non-DIO_LOCKING filesystem that appears to do so is XFS, which uses an opencoded variant of the i_dio_count scheme. Behaviour doesn't change for filesystems never calling inode_dio_wait. For ext4 behaviour changes when using the dioread_nonlock option, which previously was missing any protection between truncate and direct I/O reads. For ocfs2 that handcrafted i_dio_count manipulations are replaced with the common code now enable. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fs: kill i_alloc_semChristoph Hellwig1-14/+51
i_alloc_sem is a rather special rw_semaphore. It's the last one that may be released by a non-owner, and it's write side is always mirrored by real exclusion. It's intended use it to wait for all pending direct I/O requests to finish before starting a truncate. Replace it with a hand-grown construct: - exclusion for truncates is already guaranteed by i_mutex, so it can simply fall way - the reader side is replaced by an i_dio_count member in struct inode that counts the number of pending direct I/O requests. Truncate can't proceed as long as it's non-zero - when i_dio_count reaches non-zero we wake up a pending truncate using wake_up_bit on a new bit in i_flags - new references to i_dio_count can't appear while we are waiting for it to read zero because the direct I/O count always needs i_mutex (or an equivalent like XFS's i_iolock) for starting a new operation. This scheme is much simpler, and saves the space of a spinlock_t and a struct list_head in struct inode (typically 160 bits on a non-debug 64-bit system). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fs: simplify handling of zero sized reads in __blockdev_direct_IOChristoph Hellwig1-2/+5
Reject zero sized reads as soon as we know our I/O length, and don't borther with locks or allocations that might have to be cleaned up otherwise. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-24Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds1-5/+2
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}
2011-03-10block: kill off REQ_UNPLUGJens Axboe1-1/+1
With the plugging now being explicitly controlled by the submitter, callers need not pass down unplugging hints to the block layer. If they want to unplug, it's because they manually plugged on their own - in which case, they should just unplug at will. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-03-10block: remove per-queue pluggingJens Axboe1-4/+1
Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-02-15Merge branch 'master' into for-nextJiri Kosina1-3/+7
2011-01-20fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bioDavid Dillow1-3/+7
When using devices that support max_segments > BIO_MAX_PAGES (256), direct IO tries to allocate a bio with more pages than allowed, which leads to an oops in dio_bio_alloc(). Clamp the request to the supported maximum, and change dio_bio_alloc() to reflect that bio_alloc() will always return a bio when called with __GFP_WAIT and a valid number of vectors. [akpm@linux-foundation.org: remove redundant BUG_ON()] Signed-off-by: David Dillow <dillowda@ornl.gov> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-19dio: fix typos in commentsNamhyung Kim1-3/+3
Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Jiri Kosina <trivial@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-10-26fs/direct-io.c: fix truncation error in dio_complete() returnEdward Shishkin1-1/+1
Fix up truncation (ssize_t->int). This only matters with >2G reads/writes, which the kernel doesn't permit. Signed-off-by: Edward Shishkin <edward@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-09-09O_DIRECT: fix the splitting up of contiguous I/OJeff Moyer1-2/+2
commit c2c6ca4 (direct-io: do not merge logically non-contiguous requests) introduced a bug whereby all O_DIRECT I/Os were submitted a page at a time to the block layer. The problem is that the code expected dio->block_in_file to correspond to the current page in the dio. In fact, it corresponds to the previous page submitted via submit_page_section. This was purely an oversight, as the dio->cur_page_fs_offset field was introduced for just this purpose. This patch simply uses the correct variable when calculating whether there is a mismatch between contiguous logical blocks and contiguous physical blocks (as described in the comments). I also switched the if conditional following this check to an else if, to ensure that we never call dio_bio_submit twice for the same dio (in theory, this should not happen, anyway). I've tested this by running blktrace and verifying that a 64KB I/O was submitted as a single I/O. I also ran the patched kernel through xfstests' aio tests using xfs, ext4 (with 1k and 4k block sizes) and btrfs and verified that there were no regressions as compared to an unpatched kernel. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Josef Bacik <jbacik@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <chris.mason@oracle.com> Cc: <stable@kernel.org> [2.6.35.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09sort out blockdev_direct_IO variantsChristoph Hellwig1-54/+20
Move the call to vmtruncate to get rid of accessive blocks to the callers in prepearation of the new truncate calling sequence. This was only done for DIO_LOCKING filesystems, so the __blockdev_direct_IO_newtrunc variant was not needed anyway. Get rid of blockdev_direct_IO_no_locking and its _newtrunc variant while at it as just opencoding the two additional paramters is shorted than the name suffix. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-07-26direct-io: move aio_complete into ->end_ioChristoph Hellwig1-12/+14
Filesystems with unwritten extent support must not complete an AIO request until the transaction to convert the extent has been commited. That means the aio_complete calls needs to be moved into the ->end_io callback so that the filesystem can control when to call it exactly. This makes a bit of a mess out of dio_complete and the ->end_io callback prototype even more complicated. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-05-27fs: introduce new truncate sequencenpiggin@suse.de1-21/+40
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than setattr > vmtruncate > truncate, have filesystems call their truncate sequence from ->setattr if filesystem specific operations are required. vmtruncate is deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced previously should be used. simple_setattr is introduced for simple in-ram filesystems to implement the new truncate sequence. Eventually all filesystems should be converted to implement a setattr, and the default code in notify_change should go away. simple_setsize is also introduced to perform just the ATTR_SIZE portion of simple_setattr (ie. changing i_size and trimming pagecache). To implement the new truncate sequence: - filesystem specific manipulations (eg freeing blocks) must be done in the setattr method rather than ->truncate. - vmtruncate can not be used by core code to trim blocks past i_size in the event of write failure after allocation, so this must be performed in the fs code. - convert usage of helpers block_write_begin, nobh_write_begin, cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed variants. These avoid calling vmtruncate to trim blocks (see previous). - inode_setattr should not be used. generic_setattr is a new function to be used to copy simple attributes into the generic inode. - make use of the better opportunity to handle errors with the new sequence. Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-25direct-io: do not merge logically non-contiguous requestsJosef Bacik1-2/+18
Btrfs cannot handle having logically non-contiguous requests submitted. For example if you have Logical: [0-4095][HOLE][8192-12287] Physical: [0-4095] [4096-8191] Normally the DIO code would put these into the same BIO's. The problem is we need to know exactly what offset is associated with what BIO so we can do our checksumming and unlocking properly, so putting them in the same BIO doesn't work. So add another check where we submit the current BIO if the physical blocks are not contigous OR the logical blocks are not contiguous. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25direct-io: add a hook for the fs to provide its own submit_bio functionJosef Bacik1-5/+37
Because BTRFS can do RAID and such, we need our own submit hook so we can setup the bio's in the correct fashion, and handle checksum errors properly. So there are a few changes here 1) The submit_io hook. This is straightforward, just call this instead of submit_bio. 2) Allow the fs to return -ENOTBLK for reads. Usually this has only worked for writes, since writes can fallback onto buffered IO. But BTRFS needs the option of falling back on buffered IO if it encounters a compressed extent, since we need to read the entire extent in and decompress it. So if we get -ENOTBLK back from get_block we'll return back and fallback on buffered just like the write case. I've tested these changes with fsx and everything seems to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-12-17dio: fix use-after-freeAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-12-16direct-io: cleanup blockdev_direct_IO lockingChristoph Hellwig1-77/+52
Currently the locking in blockdev_direct_IO is a mess, we have three different locking types and very confusing checks for some of them. The most complicated one is DIO_OWN_LOCKING for reads, which happens to not actually be used. This patch gets rid of the DIO_OWN_LOCKING - as mentioned above the read case is unused anyway, and the write side is almost identical to DIO_NO_LOCKING. The difference is that DIO_NO_LOCKING always sets the create argument for the get_blocks callback to zero, but we can easily move that to the actual get_blocks callbacks. There are four users of the DIO_NO_LOCKING mode: gfs already ignores the create argument and thus is fine with the new version, ocfs2 only errors out if create were ever set, and we can remove this dead code now, the block device code only ever uses create for an error message if we are fully beyond the device which can never happen, and last but not least XFS will need the new behavour for writes. Now we can replace the lock_type variable with a flags one, where no flag means the DIO_NO_LOCKING behaviour and DIO_LOCKING is kept as the first flag. Separate out the check for not allowing to fill holes into a separate flag, although for now both flags always get set at the same time. Also revamp the documentation of the locking scheme to actually make sense. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alex Elder <aelder@sgi.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-16dio: don't zero out the pages array inside struct dioJeff Moyer1-13/+25
Intel reported a performance regression caused by the following commit: commit 848c4dd5153c7a0de55470ce99a8e13a63b4703f Author: Zach Brown <zach.brown@oracle.com> Date: Mon Aug 20 17:12:01 2007 -0700 dio: zero struct dio with kzalloc instead of manually This patch uses kzalloc to zero all of struct dio rather than manually trying to track which fields we rely on being zero. It passed aio+dio stress testing and some bug regression testing on ext3. This patch was introduced by Linus in the conversation that lead up to Badari's minimal fix to manually zero .map_bh.b_state in commit: 6a648fa72161d1f6468dabd96c5d3c0db04f598a It makes the code a bit smaller. Maybe a couple fewer cachelines to load, if we're lucky: text data bss dec hex filename 3285925 568506 1304616 5159047 4eb887 vmlinux 3285797 568506 1304616 5158919 4eb807 vmlinux.patched I was unable to measure a stable difference in the number of cpu cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads at ~340MB/s. So the resulting intent of the patch isn't a performance gain but to avoid exposing ourselves to the risk of finding another field like .map_bh.b_state where we rely on zeroing but don't enforce it in the code. Zach surmised that zeroing out the page array was what caused most of the problem, and suggested the approach taken in the attached patch for resolving the issue. Intel re-tested with this patch and saw a 0.6% performance gain (the original regression was 0.5%). [akpm@linux-foundation.org: add comment] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-11-26Fix regression in direct writes performance due to WRITE_ODIRECT flag removalVivek Goyal1-1/+1
There seems to be a regression in direct write path due to following commit in for-2.6.33 branch of block tree. commit 1af60fbd759d31f565552fea315c2033947cfbe6 Author: Jeff Moyer <jmoyer@redhat.com> Date: Fri Oct 2 18:56:53 2009 -0400 block: get rid of the WRITE_ODIRECT flag Marking direct writes as WRITE_SYNC_PLUG instead of WRITE_ODIRECT, sets the NOIDLE flag in bio and hence in request. This tells CFQ to not expect more request from the queue and not idle on it (despite the fact that queue's think time is less and it is not seeky). So direct writers lose big time when competing with sequential readers. Using fio, I have run one direct writer and two sequential readers and following are the results with 2.6.32-rc7 kernel and with for-2.6.33 branch. Test ==== 1 direct writer and 2 sequential reader running simultaneously. [global] directory=/mnt/sdc/fio/ runtime=10 [seqwrite] rw=write size=4G direct=1 [seqread] rw=read size=2G numjobs=2 2.6.32-rc7 ========== direct writes: aggrb=2,968KB/s readers : aggrb=101MB/s for-2.6.33 branch ================= direct write: aggrb=19KB/s readers aggrb=137MB/s This patch brings back the WRITE_ODIRECT flag, with the difference that we don't set the BIO_RW_UNPLUG flag so that device is not unplugged after submission of request and an explicit unplug from submitter is required. That way we fix the jeff's issue of not enough merging taking place in aio path as well as make sure direct writes get their fair share. After the fix ============= for-2.6.33 + fix ---------------- direct writes: aggrb=2,728KB/s reads: aggrb=103MB/s Thanks Vivek Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-28aio: implement request batchingJeff Moyer1-4/+4
Hi, Some workloads issue batches of small I/O, and the performance is poor due to the call to blk_run_address_space for every single iocb. Nathan Roberts pointed this out, and suggested that by deferring this call until all I/Os in the iocb array are submitted to the block layer, we can realize some impressive performance gains (up to 30% for sequential 4k reads in batches of 16). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-28block: get rid of the WRITE_ODIRECT flagJeff Moyer1-1/+1
Hi, The WRITE_ODIRECT flag is only used in one place, and that code path happens to also call blk_run_address_space. The introduction of this flag, then, could result in the device being unplugged twice for every I/O. Further, with the batching changes in the next patch, we don't want an O_DIRECT write to imply a queue unplug. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-22block: Do away with the notion of hardsect_sizeMartin K. Petersen1-1/+1
Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-15dio: Remove code handling bio_alloc failure with __GFP_WAITNikanth Karthikesan1-2/+0
Remove code handling bio_alloc failure with __GFP_WAIT. GFP_KERNEL implies __GFP_WAIT. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-04-06block: Add flag for telling the IO schedulers NOT to anticipate more IOJens Axboe1-1/+1
By default, CFQ will anticipate more IO from a given io context if the previously completed IO was sync. This used to be fine, since the only sync IO was reads and O_DIRECT writes. But with more "normal" sync writes being used now, we don't want to anticipate for those. Add a bio/request flag that informs the IO scheduler that this is a sync request that we should not idle for. Introduce WRITE_ODIRECT specifically for O_DIRECT writes, and make sure that the other sync writes set this flag. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06fs: truncate blocks outside i_size after O_DIRECT write errorDmitri Monakhov1-0/+13
In case of error extending write may have instantiated a few blocks outside i_size. We need to trim these blocks. We have to do it *regardless* to blocksize. At least ext2, ext3 and reiserfs interpret (i_size < biggest block) condition as error. Fsck will complain about wrong i_size. Then fsck will fix the error by changing i_size according to the biggest block. This is bad because this blocks contain garbage from previous write attempt. And result in data corruption. ####TESTCASE_BEGIN $touch /mnt/test/BIG_FILE ## at this moment /mnt/test/BIG_FILE size and blocks equal to zero open("/mnt/test/BIG_FILE", O_WRONLY|O_CREAT|O_DIRECT, 0666) = 3 write(3, "aaaaaaaaaaaa"..., 104857600) = -1 ENOSPC (No space left on device) ## size and block sould't be changed because write op failed. $stat /mnt/test/BIG_FILE File: `/mnt/test/BIG_FILE' Size: 0 Blocks: 110896 IO Block: 1024 regular empty file <<<<<<<<^^^^^^^^^^^^^^^^^^^^^^^^^^^^^file size is less than biggest block idx Device: fe07h/65031d Inode: 14 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2007-01-24 20:03:38.000000000 +0300 Modify: 2007-01-24 20:03:38.000000000 +0300 Change: 2007-01-24 20:03:39.000000000 +0300 #fsck.ext3 -f /dev/VG/test e2fsck 1.39 (29-May-2006) Pass 1: Checking inodes, blocks, and sizes Inode 14, i_size is 0, should be 56556544. Fix<y>? yes Pass 2: Checking directory structure .... #####TESTCASE_ENDdiff --git a/fs/direct-io.c b/fs/direct-io.c index af0558d..4e88bea 100644 [akpm@linux-foundation.org: use i_size_read()] Signed-off-by: Dmitri Monakhov <dmonakhov@openvz.org> Cc: Zach Brown <zach.brown@oracle.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-16Remove Andrew Morton's old email accountsFrancois Cami1-2/+2
People can use the real name an an index into MAINTAINERS to find the current email address. Signed-off-by: Francois Cami <francois.cami@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26dio: use get_user_pages_fastNick Piggin1-8/+2
Use get_user_pages_fast in the common/generic block and fs direct IO paths. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: Jens Axboe <jens.axboe@oracle.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05Pagecache zeroing: zero_user_segment, zero_user_segments and zero_userChristoph Lameter1-2/+2
Simplify page cache zeroing of segments of pages through 3 functions zero_user_segments(page, start1, end1, start2, end2) Zeros two segments of the page. It takes the position where to start and end the zeroing which avoids length calculations and makes code clearer. zero_user_segment(page, start, end) Same for a single segment. zero_user(page, start, length) Length variant for the case where we know the length. We remove the zero_user_page macro. Issues: 1. Its a macro. Inline functions are preferable. 2. The KM_USER0 macro is only defined for HIGHMEM. Having to treat this special case everywhere makes the code needlessly complex. The parameter for zeroing is always KM_USER0 except in one single case that we open code. Avoiding KM_USER0 makes a lot of code not having to be dealing with the special casing for HIGHMEM anymore. Dealing with kmap is only necessary for HIGHMEM configurations. In those configurations we use KM_USER0 like we do for a series of other functions defined in highmem.h. Since KM_USER0 is depends on HIGHMEM the existing zero_user_page function could not be a macro. zero_user_* functions introduced here can be be inline because that constant is not used when these functions are called. Also extract the flushing of the caches to be outside of the kmap. [akpm@linux-foundation.org: fix nfs and ntfs build] [akpm@linux-foundation.org: fix ntfs build some more] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: <linux-ext4@vger.kernel.org> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: David Chinner <dgc@sgi.com> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Steven French <sfrench@us.ibm.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16remove ZERO_PAGENick Piggin1-2/+2
The commit b5810039a54e5babf428e9a1e89fc1940fabff11 contains the note A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. And indeed this cacheline bouncing has shown up on large SGI systems. There was a situation where an Altix system was essentially livelocked tearing down ZERO_PAGE pagetables when an HPC app aborted during startup. This situation can be avoided in userspace, but it does highlight the potential scalability problem with refcounting ZERO_PAGE, and corner cases where it can really hurt (we don't want the system to livelock!). There are several broad ways to fix this problem: 1. add back some special casing to avoid refcounting ZERO_PAGE 2. per-node or per-cpu ZERO_PAGES 3. remove the ZERO_PAGE completely I will argue for 3. The others should also fix the problem, but they result in more complex code than does 3, with little or no real benefit that I can see. Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used). As a sanity check -- mesuring on my desktop system, there are never many mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not increase much without it. When running a make -j4 kernel compile on my dual core system, there are about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000 ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second is torn down without being COWed). So removing ZERO_PAGE will save 1,000 page faults per second when running kbuild, while keeping it only saves less than 1 page clearing operation per second. 1 page clear is cheaper than a thousand faults, presumably, so there isn't an obvious loss. Neither the logical argument nor these basic tests give a guarantee of no regressions. However, this is a reasonable opportunity to try to remove the ZERO_PAGE from the pagefault path. If it is found to cause regressions, we can reintroduce it and just avoid refcounting it. The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see much use to them except on benchmarks. All other users of ZERO_PAGE are converted just to use ZERO_PAGE(0) for simplicity. We can look at replacing them all and maybe ripping out ZERO_PAGE completely when we are more satisfied with this solution. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus "snif" Torvalds <torvalds@linux-foundation.org>
2007-10-10Drop 'size' argument from bio_endio and bi_end_ioNeilBrown1-11/+2
As bi_end_io is only called once when the reqeust is complete, the 'size' argument is now redundant. Remove it. Now there is no need for bio_endio to subtract the size completed from bi_size. So don't do that either. While we are at it, change bi_end_io to return void. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-08-20dio: zero struct dio with kzalloc instead of manuallyZach Brown1-17/+1
This patch uses kzalloc to zero all of struct dio rather than manually trying to track which fields we rely on being zero. It passed aio+dio stress testing and some bug regression testing on ext3. This patch was introduced by Linus in the conversation that lead up to Badari's minimal fix to manually zero .map_bh.b_state in commit: 6a648fa72161d1f6468dabd96c5d3c0db04f598a It makes the code a bit smaller. Maybe a couple fewer cachelines to load, if we're lucky: text data bss dec hex filename 3285925 568506 1304616 5159047 4eb887 vmlinux 3285797 568506 1304616 5158919 4eb807 vmlinux.patched I was unable to measure a stable difference in the number of cpu cycles spent in blockdev_direct_IO() when pushing aio+dio 256K reads at ~340MB/s. So the resulting intent of the patch isn't a performance gain but to avoid exposing ourselves to the risk of finding another field like .map_bh.b_state where we rely on zeroing but don't enforce it in the code. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-11direct-io: fix error-path crashesBadari Pulavarty1-0/+1
Need to initialize map_bh.b_state to zero. Otherwise, in case of a faulty user-buffer its possible to go into dio_zero_block() and submit a page by mistake - since it checks for buffer_new(). http://marc.info/?l=linux-kernel&m=118551339032528&w=2 akpm: Linus had a (better) patch to just do a kzalloc() in there, but it got lost. Probably this version is better for -stable anwyay. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Joe Jin <joe.jin@oracle.com> Acked-by: Zach Brown <zach.brown@oracle.com> Cc: gurudas pai <gurudas.pai@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-03dio: remove bogus refcounting BUG_ONZach Brown1-1/+1
Badari Pulavarty reported a case of this BUG_ON is triggering during testing. It's completely bogus and should be removed. It's trying to notice if we left references to the dio hanging around in the sync case. They should have been dropped as IO completed while this path was in dio_await_completion(). This condition will also be checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few lines lower. So to start this BUG_ON() is redundant. More fatally, it's dereferencing dio-> after having dropped its reference. It's only safe to dereference the dio after releasing the lock if the final reference was just dropped. Another CPU might free the dio in bio completion and reuse the memory after this path drops the dio lock but before the BUG_ON() is evaluated. This patch passed aio+dio regression unit tests and aio-stress on ext3. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds1-1/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits) sound: convert "sound" subdirectory to UTF-8 MAINTAINERS: Add cxacru website/mailing list include files: convert "include" subdirectory to UTF-8 general: convert "kernel" subdirectory to UTF-8 documentation: convert the Documentation directory to UTF-8 Convert the toplevel files CREDITS and MAINTAINERS to UTF-8. remove broken URLs from net drivers' output Magic number prefix consistency change to Documentation/magic-number.txt trivial: s/i_sem /i_mutex/ fix file specification in comments drivers/base/platform.c: fix small typo in doc misc doc and kconfig typos Remove obsolete fat_cvf help text Fix occurrences of "the the " Fix minor typoes in kernel/module.c Kconfig: Remove reference to external mqueue library Kconfig: A couple of grammatical fixes in arch/i386/Kconfig Correct comments in genrtc.c to refer to correct /proc file. Fix more "deprecated" spellos. Fix "deprecated" typoes. ... Fix trivial comment conflict in kernel/relay.c.
2007-05-09fs: convert core functions to zero_user_pageNate Diller1-6/+2
It's very common for file systems to need to zero part or all of a page, the simplist way is just to use kmap_atomic() and memset(). There's actually a library function in include/linux/highmem.h that does exactly that, but it's confusingly named memclear_highpage_flush(), which is descriptive of *how* it does the work rather than what the *purpose* is. So this patchset renames the function to zero_user_page(), and calls it from the various places that currently open code it. This first patch introduces the new function call, and converts all the core kernel callsites, both the open-coded ones and the old memclear_highpage_flush() ones. Following this patch is a series of conversions for each file system individually, per AKPM, and finally a patch deprecating the old call. The diffstat below shows the entire patchset. [akpm@linux-foundation.org: fix a few things] Signed-off-by: Nate Diller <nate.diller@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09Fix misspellings collected by members of KJ list.Robert P. J. Day1-1/+1
Fix the misspellings of "propogate", "writting" and (oh, the shame :-) "kenrel" in the source tree. Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Adrian Bunk <bunk@stusta.de>