summaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)AuthorFilesLines
2011-07-26fail_make_request: cleanup should_fail_requestAkinobu Mita1-14/+12
This changes should_fail_request() to more usable wrapper function of should_fail(). It can avoid putting #ifdef CONFIG_FAIL_MAKE_REQUEST in the middle of a function. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-26block: fix warning with calling smp_processor_id() in preemptible sectionJens Axboe1-1/+1
After commit 5757a6d7 introduced an unsafe calling of smp_processor_id(), with preempt debuggin turned on we spew a lot of: BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514 caller is __make_request+0x1b8/0x308 [<c0019f44>] (unwind_backtrace+0x0/0xe8) from [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) [<c024b4cc>] (debug_smp_processor_id+0xbc/0xf0) from [<c0223d14>] (__make_request+0x1b8/0x308) [<c0223d14>] (__make_request+0x1b8/0x308) from [<c02215ac>] (generic_make_request+0x4dc/0x558) [<c02215ac>] (generic_make_request+0x4dc/0x558) from [<c022173c>] (submit_bio+0x114/0x138) [<c022173c>] (submit_bio+0x114/0x138) from [<c011f504>] (submit_bh+0x148/0x16c) [<c011f504>] (submit_bh+0x148/0x16c) from [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) [<c0121ed8>] (__sync_dirty_buffer+0x88/0xd8) from [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) [<c01aff78>] (journal_commit_transaction+0x1198/0x1688) from [<c01b4034>] (kjournald+0xb4/0x224) [<c01b4034>] (kjournald+0xb4/0x224) from [<c0069ea0>] (kthread+0x8c/0x94) [<c0069ea0>] (kthread+0x8c/0x94) from [<c00137f8>] (kernel_thread_exit+0x0/0x8) Fix this by just using raw_smp_processor_id(), it's just a hint after all. There's no pinning of the CPU or accessing per-cpu structures involved. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-25Merge branch 'for-3.1/drivers' of git://git.kernel.dk/linux-blockLinus Torvalds1-9/+9
* 'for-3.1/drivers' of git://git.kernel.dk/linux-block: cciss: do not attempt to read from a write-only register xen/blkback: Add module alias for autoloading xen/blkback: Don't let in-flight requests defer pending ones. bsg: fix address space warning from sparse bsg: remove unnecessary conditional expressions bsg: fix bsg_poll() to return POLLOUT properly
2011-07-25Merge branch 'for-3.1/core' of git://git.kernel.dk/linux-blockLinus Torvalds11-148/+145
* 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits) block: strict rq_affinity backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu block: fix patch import error in max_discard_sectors check block: reorder request_queue to remove 64 bit alignment padding CFQ: add think time check for group CFQ: add think time check for service tree CFQ: move think time check variables to a separate struct fixlet: Remove fs_excl from struct task. cfq: Remove special treatment for metadata rqs. block: document blk_plug list access block: avoid building too big plug list compat_ioctl: fix make headers_check regression block: eliminate potential for infinite loop in blkdev_issue_discard compat_ioctl: fix warning caused by qemu block: flush MEDIA_CHANGE from drivers on close(2) blk-throttle: Make total_nr_queued unsigned block: Add __attribute__((format(printf...) and fix fallout fs/partitions/check.c: make local symbols static block:remove some spare spaces in genhd.c block:fix the comment error in blkdev.h ...
2011-07-23block: strict rq_affinityDan Williams3-12/+18
Some systems benefit from completions always being steered to the strict requester cpu rather than the looser "per-socket" steering that blk_cpu_to_group() attempts by default. This is because the first CPU in the group mask ends up being completely overloaded with work, while the others (including the original submitter) has power left to spare. Allow the strict mode to be set by writing '2' to the sysfs control file. This is identical to the scheme used for the nomerges file, where '2' is a more aggressive setting than just being turned on. echo 2 > /sys/block/<bdev>/queue/rq_affinity Cc: Christoph Hellwig <hch@infradead.org> Cc: Roland Dreier <roland@purestorage.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23block: fix patch import error in max_discard_sectors checkJens Axboe1-1/+1
A '!' snuck in before the unlikely, rendering it useless. Reported-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6Linus Torvalds2-0/+10
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits) [SCSI] fix crash in scsi_dispatch_cmd() [SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise [SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu [SCSI] bfa: Update the driver version to 3.0.2.1 [SCSI] bfa: Driver and BSG enhancements. [SCSI] bfa: Added support to query PHY. [SCSI] bfa: Added HBA diagnostics support. [SCSI] bfa: Added support for flash configuration [SCSI] bfa: Added support to obtain SFP info. [SCSI] bfa: Added support for CEE info and stats query. [SCSI] bfa: Extend BSG interface. [SCSI] bfa: FCS bug fixes. [SCSI] bfa: DMA memory allocation enhancement. [SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support. [SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes. [SCSI] bfa: Added Fabric Assigned Address(FAA) support [SCSI] bfa: IOC bug fixes. [SCSI] bfa: Enable ASIC block configuration and query. [SCSI] bnx2i: Updated copyright and bump version [SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported ... Fix up some trivial conflicts in: - drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}: Crazy broadcom version number conflicts - drivers/target/tcm_fc/tfc_cmd.c Just trivial cleanups done on adjacent lines
2011-07-21[SCSI] fix crash in scsi_dispatch_cmd()James Bottomley2-0/+10
USB surprise removal of sr is triggering an oops in scsi_dispatch_command(). What seems to be happening is that USB is hanging on to a queue reference until the last close of the upper device, so the crash is caused by surprise remove of a mounted CD followed by attempted unmount. The problem is that USB doesn't issue its final commands as part of the SCSI teardown path, but on last close when the block queue is long gone. The long term fix is probably to make sr do the teardown in the same way as sd (so remove all the lower bits on ejection, but keep the upper disk alive until last close of user space). However, the current oops can be simply fixed by not allowing any commands to be sent to a dead queue. Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2011-07-20block,rcu: Convert call_rcu(disk_free_ptbl_rcu_cb) to kfree_rcu()Lai Jiangshan1-9/+1
The rcu callback disk_free_ptbl_rcu_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(disk_free_ptbl_rcu_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-07-12CFQ: add think time check for groupShaohua Li1-2/+17
Currently when the last queue of a group has no request, we don't expire the queue to hope request from the group comes soon, so the group doesn't miss its share. But if the think time is big, the assumption isn't correct and we just waste bandwidth. In such case, we don't do idle. [global] runtime=30 direct=1 [test1] cgroup=test1 cgroup_weight=1000 rw=randread ioengine=libaio size=500m runtime=30 directory=/mnt filename=file1 thinktime=9000 [test2] cgroup=test2 cgroup_weight=1000 rw=randread ioengine=libaio size=500m runtime=30 directory=/mnt filename=file2 patched base test1 64k 39k test2 548k 540k total 604k 578k group1 gets much better throughput because it waits less time. To check if the patch changes behavior of queue without think time. I also tried to give test1 2ms think time or no think time. The test result is stable. The thoughput doesn't change with/without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12CFQ: add think time check for service treeShaohua Li1-4/+30
Currently when the last queue of a service tree has no request, we don't expire the queue to hope request from the service tree comes soon, so the service tree doesn't miss its share. But if the think time is big, the assumption isn't correct and we just waste bandwidth. In such case, we don't do idle. [global] runtime=10 direct=1 [test1] rw=randread ioengine=libaio size=500m directory=/mnt filename=file1 thinktime=9000 [test2] rw=read ioengine=libaio size=1G directory=/mnt filename=file2 patched base test1 41k/s 33k/s test2 15868k/s 15789k/s total 15902k/s 15817k/s A slightly better To check if the patch changes behavior of queue without think time. I also tried to give test1 2ms think time or no think time. The test has variation even without the patch, but the average throughput doesn't change with/without the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12CFQ: move think time check variables to a separate structShaohua Li1-16/+24
Move the variables to do think time check to a sepatate struct. This is to prepare adding think time check for service tree and group. No functional change. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-12fixlet: Remove fs_excl from struct task.Justin TerAvest1-27/+1
fs_excl is a poor man's priority inheritance for filesystems to hint to the block layer that an operation is important. It was never clearly specified, not widely adopted, and will not prevent starvation in many cases (like across cgroups). fs_excl was introduced with the time sliced CFQ IO scheduler, to indicate when a process held FS exclusive resources and thus needed a boost. It doesn't cover all file systems, and it was never fully complete. Lets kill it. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-10cfq: Remove special treatment for metadata rqs.Justin TerAvest1-18/+0
There is no consistency among filesystems from what bios (or requests) are marked as being metadata. It's interesting to expose this in traces, but we shouldn't schedule the requests differently based on whether or not they're marked as being metadata. Signed-off-by: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-08block: avoid building too big plug listShaohua Li1-0/+5
When I test fio script with big I/O depth, I found the total throughput drops compared to some relative small I/O depth. The reason is the thread accumulates big requests in its plug list and causes some delays (surely this depends on CPU speed). I thought we'd better have a threshold for requests. When a threshold reaches, this means there is no request merge and queue lock contention isn't severe when pushing per-task requests to queue, so the main advantages of blk plug don't exist. We can force a plug list flush in this case. With this, my test throughput actually increases and almost equals to small I/O depth. Another side effect is irq off time decreases in blk_flush_plug_list() for big I/O depth. The BLK_MAX_REQUEST_COUNT is choosen arbitarily, but 16 is efficiently to reduce lock contention to me. But I'm open here, 32 is ok in my test too. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-06block: eliminate potential for infinite loop in blkdev_issue_discardMike Snitzer1-1/+4
Due to the recently identified overflow in read_capacity_16() it was possible for max_discard_sectors to be zero but still have discards enabled on the associated device's queue. Eliminate the possibility for blkdev_issue_discard to infinitely loop. Interestingly this issue wasn't identified until a device, whose discard_granularity was 0 due to read_capacity_16 overflow, was consumed by blk_stack_limits() to construct limits for a higher-level DM multipath device. The multipath device's resulting limits never had the discard limits stacked because blk_stack_limits() will only do so if the bottom device's discard_granularity != 0. This resulted in the multipath device's limits.max_discard_sectors being 0. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01compat_ioctl: fix warning caused by qemuJohannes Stezenbach1-14/+0
On Linux x86_64 host with 32bit userspace, running qemu or even just "qemu-img create -f qcow2 some.img 1G" causes a kernel warning: ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img ioctl 00005326 is CDROM_DRIVE_STATUS, ioctl 801c0204 is FDGETPRM. The warning appears because the Linux compat-ioctl handler for these ioctls only applies to block devices, while qemu also uses the ioctls on plain files. Signed-off-by: Johannes Stezenbach <js@sig21.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01block: flush MEDIA_CHANGE from drivers on close(2)Tejun Heo1-10/+12
Currently, only open(2) is defined as the 'clearing' point. It has two roles - first, it's an acknowledgement from userland indicating that the event has been received and kernel can clear pending states and proceed to generate more events. Secondly, it's passed on to device drivers as a hint indicating that a synchronization point has been reached and it might want to take a deeper look at the device. The latter currently is only used by sr which uses two different mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY to discover events, where the former is lighter weight and safe to be used repeatedly but may not provide full coverage. Among other things, GET_EVENT can't detect media removal while TUR can. This patch makes close(2) - blkdev_put() - indicate clearing hint for MEDIA_CHANGE to drivers. disk_check_events() is renamed to disk_flush_events() and updated to take @mask for events to flush which is or'd to ev->clearing and will be passed to the driver on the next ->check_events() invocation. This change makes sr generate MEDIA_CHANGE when media is ejected from userland - e.g. with eject(1). Note: Given the current usage, it seems @clearing hint is needlessly complex. disk_clear_events() can simply clear all events and the hint can be boolean @flush. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-07-01Merge branch 'for-linus' into for-3.1/coreJens Axboe3-45/+57
Conflicts: block/blk-throttle.c block/cfq-iosched.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27cfq-iosched: make code consistentShaohua Li1-1/+2
ioc->ioc_data is rcu protectd, so uses correct API to access it. This doesn't change any behavior, but just make code consistent. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: stable@kernel.org # after ab4bd22d Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-27cfq-iosched: fix a rcu warningShaohua Li1-1/+4
I got a rcu warnning at boot. the ioc->ioc_data is rcu_deferenced, but doesn't hold rcu_read_lock. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: stable@kernel.org # after ab4bd22d Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20bsg: fix address space warning from sparseNamhyung Kim1-6/+6
copy_from/to_user() and blk_rq_map_user() want __user pointer. This patch fixes following warnings from sparse: CHECK block/bsg.c block/bsg.c:185:38: warning: incorrect type in argument 2 (different address spaces) block/bsg.c:185:38: expected void const [noderef] <asn:1>*from block/bsg.c:185:38: got void *<noident> block/bsg.c:295:58: warning: incorrect type in argument 4 (different address spaces) block/bsg.c:295:58: expected void [noderef] <asn:1>*<noident> block/bsg.c:295:58: got void *[assigned] dxferp block/bsg.c:311:52: warning: incorrect type in argument 4 (different address spaces) block/bsg.c:311:52: expected void [noderef] <asn:1>*<noident> block/bsg.c:311:52: got void *[assigned] dxferp block/bsg.c:448:37: warning: incorrect type in argument 1 (different address spaces) block/bsg.c:448:37: expected void [noderef] <asn:1>*dst block/bsg.c:448:37: got void *<noident> Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20bsg: remove unnecessary conditional expressionsNamhyung Kim1-2/+2
Second condition in OR always implies first condition is false thus bytes_read in the second is not needed. The same goes to bytes_written. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-20bsg: fix bsg_poll() to return POLLOUT properlyNamhyung Kim1-1/+1
POLLOUT should be returned only if bd->queued_cmds < bd->max_queue so that bsg_alloc_command() can proceed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Acked-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13blk-throttle: Make total_nr_queued unsignedJoe Perches1-4/+4
The total of two unsigned values should also be unsigned. Update throtl_log output to unsigned. Update total_nr_queued test to non-zero to be the same as the other total_nr_queued tests. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13block: Add __attribute__((format(printf...) and fix falloutJoe Perches2-7/+8
Use the compiler to verify format strings and arguments. Fix fallout. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13block:remove some spare spaces in genhd.cWanlong Gao1-3/+3
Remove the end-of-line spaces in genhd.c. Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-13block: Add __attribute__((format(printf...) and fix falloutJoe Perches2-7/+8
Use the compiler to verify format strings and arguments. Fix fallout. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: make disk_block_events() properly wait for work cancellationTejun Heo1-0/+10
disk_block_events() should guarantee that the event work is not in flight on return and once blocked it shouldn't issue further cancellations. Because there was no synchronization between the first blocker doing cancel_delayed_work_sync() and the following blockers, the following blockers could finish before cancellation was complete, which broke both guarantees - event work could be in flight and cancellation could happen after return. This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 Fix it by adding an outer mutex which protects both block count manipulation and work cancellation. -v2: Use outer mutex instead of bit waitqueue per Linus. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: remove non-syncing __disk_block_events() and fold it into ↵Tejun Heo1-31/+24
disk_block_events() After the previous update to disk_check_events(), nobody is using non-syncing __disk_block_events(). Remove @sync and, as this makes __disk_block_events() virtually identical to disk_block_events(), remove the underscore prefixed version. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-09block: don't use non-syncing event blocking in disk_check_events()Tejun Heo1-3/+11
This patch is part of fix for triggering of WARN_ON_ONCE() in disk_clear_events() reported in bug#34662. https://bugzilla.kernel.org/show_bug.cgi?id=34662 disk_clear_events() blocks events, schedules and flushes the event work. It expects the work to have started execution on schedule and finished on return from flush. WARN_ON_ONCE() triggers if the event work hasn't executed as expected. This problem happens because __disk_block_events() fails to guarantee that the event work item is not in flight on return from the function in race-free manner. The problem is two-fold and this patch addresses one of them. When __disk_block_events() is called with @sync == %false, it bumps event block count, calls cancel_delayed_work() and return. This makes it impossible to guarantee that event polling is not in flight on return from syncing __disk_block_events() - if the first blocker was non-syncing, polling could still be in progress and later syncing ones would assume that the first blocker already canceled it. Making __disk_block_events() cancel_sync regardless of block count isn't feasible either as it may race with forced event checking in disk_clear_events(). As disk_check_events() is the only user of non-syncing __disk_block_events(), updating it to directly cancel and schedule event work is the easiest way to solve the issue. Note that there's another bug in __disk_block_events() and this patch doesn't fix the issue completely. Later patch will fix the other bug. Signed-off-by: Tejun Heo <tj@kernel.org> Tested-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com> Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Meelis Roos <mroos@linux.ee> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06block: rename the return of two functionsPaul Bolle1-20/+20
If we rename the return of alloc_io_context() and get_io_context() from "ret" to "ioc" the code get's (a bit) more readable and (a lot) more grepable. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06CFQ: make two functions staticPaul Bolle1-3/+3
Correctly suggested by sparse. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06cfq-iosched: fix locking around ioc->ioc_data assignmentJens Axboe1-1/+4
Since we are modifying this RCU pointer, we need to hold the lock protecting it around it. This fixes a potential reuse and double free of a cfq io_context structure. The bug has been in CFQ for a long time, it hit very few people but those it did hit seemed to see it a lot. Tracked in RH bugzilla here: https://bugzilla.redhat.com/show_bug.cgi?id=577968 Credit goes to Paul Bolle for figuring out that the issue was around the one-hit ioc->ioc_data cache. Thanks to his hard work the issue is now fixed. Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-06cfq-iosched: fix locking around ioc->ioc_data assignmentJens Axboe1-1/+4
Since we are modifying this RCU pointer, we need to hold the lock protecting it around it. This fixes a potential reuse and double free of a cfq io_context structure. The bug has been in CFQ for a long time, it hit very few people but those it did hit seemed to see it a lot. Tracked in RH bugzilla here: https://bugzilla.redhat.com/show_bug.cgi?id=577968 Credit goes to Paul Bolle for figuring out that the issue was around the one-hit ioc->ioc_data cache. Thanks to his hard work the issue is now fixed. Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02iosched: prevent aliased requests from starving other I/OJeff Moyer3-15/+5
Hi, Jens, If you recall, I posted an RFC patch for this back in July of last year: http://lkml.org/lkml/2010/7/13/279 The basic problem is that a process can issue a never-ending stream of async direct I/Os to the same sector on a device, thus starving out other I/O in the system (due to the way the alias handling works in both cfq and deadline). The solution I proposed back then was to start dispatching from the fifo after a certain number of aliases had been dispatched. Vivek asked why we had to treat aliases differently at all, and I never had a good answer. So, I put together a simple patch which allows aliases to be added to the rb tree (it adds them to the right, though that doesn't matter as the order isn't guaranteed anyway). I think this is the preferred solution, as it doesn't break up time slices in CFQ or batches in deadline. I've tested it, and it does solve the starvation issue. Let me know what you think. Cheers, Jeff Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02block: Use hlist_entry() for io_context.cic_list.firstPaul Bolle1-2/+2
list_entry() and hlist_entry() are both simply aliases for container_of(), but since io_context.cic_list.first is an hlist_node one should at least use the correct alias. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-06-02cfq-iosched: Remove bogus check in queue_fail pathPaul Bolle1-3/+0
queue_fail can only be reached if cic is NULL, so its check for cic must be bogus. Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-31CFQ: Fix typo and remove unnecessary semicolonKyungmin Park1-4/+4
Fix comment typo and remove unnecessary semicolon at macro Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-27Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2-4/+3
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: loop: export module parameters block: export blk_{get,put}_queue() block: remove unused variable in bio_attempt_front_merge() block: always allocate genhd->ev if check_events is implemented brd: export module parameters brd: fix comment on initial device creation brd: handle on-demand devices correctly brd: limit 'max_part' module param to DISK_MAX_PARTS brd: get rid of unused members from struct brd_device block: fix oops on !disk->queue and sysfs discard alignment display
2011-05-27block: export blk_{get,put}_queue()Jens Axboe1-0/+2
We need them in SCSI to fix a bug, but currently they are not exported to modules. Export them. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26cgroups: add per-thread subsystem callbacksBen Blum1-12/+6
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26block: remove unused variable in bio_attempt_front_merge()Luca Tettamanti1-3/+0
sector is never read inside the function. Signed-off-by: Luca Tettamanti <kronos.it@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-26block: always allocate genhd->ev if check_events is implementedTejun Heo1-1/+1
9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block drivers which have inadequate ->check_events(). Combined with earlier change 7c88a168da (block: don't propagate unlisted DISK_EVENTs to userland), this enables using ->check_events() for internal processing while avoiding enabling in-kernel block event polling which can lead to infinite event loop. Unfortunately, this made many drivers including floppy without any bit set in disk->events and ->async_events in which case disk_add_events() simply skipped allocation of disk->ev, which disables whole event handling. As ->check_events() is still used during open processing for revalidation, this can lead to open failure. This patch always allocates disk->ev if ->check_events is implemented. In the long term, it would make sense to simply include the event structure inline into genhd as it's now used by virtually all block devices. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ondrej Zary <linux@rainbow-software.org> Reported-by: Alex Villacis Lasso <avillaci@ceibo.fiec.espol.edu.ec> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24cfq-iosched: free cic_index if cfqd allocation failsNamhyung Kim1-1/+5
When struct cfq_data allocation fails, cic_index need to be freed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24cfq-iosched: remove unused 'group_changed' in cfq_service_tree_add()Namhyung Kim1-2/+1
The 'group_changed' variable is initialized to 0 and never changed, so checking the variable is meaningless. It is a leftover from 0bbfeb832042 ("cfq-iosched: Always provide group iosolation."). Let's get rid of it. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Justin TerAvest <teravest@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24cfq-iosched: reduce bit operations in cfq_choose_req()Namhyung Kim1-9/+5
Reduce the number of bit operations in cfq_choose_req() on average (and worst) cases. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-24cfq-iosched: algebraic simplification in cfq_prio_to_maxrq()Namhyung Kim1-1/+1
Simplify the calculation in cfq_prio_to_maxrq(), plus replace CFQ_PRIO_LISTS to IOPRIO_BE_NR since they are the same and IOPRIO_BE_NR looks more reasonable in this context IMHO. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23blk-cgroup: Initialize ioc->cgroup_changed at ioc creation timeVivek Goyal1-0/+3
If we don't explicitly initialize it to zero, CFQ might think that cgroup of ioc has changed and it generates lots of unnecessary calls to call_for_each_cic(changed_cgroup). Fix it. cfq_get_io_context() cfq_ioc_set_cgroup() call_for_each_cic(ioc, changed_cgroup) Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-05-23block: call elv_bio_merged() when mergedVivek Goyal1-0/+2
Commit 73c101011926 ("block: initial patch for on-stack per-task plugging") removed calls to elv_bio_merged() when @bio merged with @req. Re-add them. This in turn will update merged stats in associated group. That should be safe as long as request has got reference to the blkio_group. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Cc: Divyesh Shah <dpshah@google.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>